An overview of the Tesseract OCR (optical character recognition) engine, and its possible enhancement for use in Wales in a pre-competitive research stage
Prepared by the
Language Technologies Unit (Canolfan Bedwyr), Bangor University
April 2008
This document was prepared as part of the SALT Cymru project, funded by the
Welsh Assembly Government under the Knowledge Exploitation Fund’s
Knowledge Exchange Programme, reference HE 06 KEP 1002
What is OCR technology?
OCR technology allows the conversion of scanned images of printed text or symbols (such as a page from a book) into text or information that can be understood or edited using a computer program. The most familiar example is scanning a paper document into a computer, where it can then be edited in popular word processors such as Microsoft Word. However, there are many other uses for OCR technology. It can serve as a component of larger systems that require recognition capability, such as number plate recognition systems, or as a tool for creating resources for SALT development from print-based texts.
Availability
General Availability
Commercial OCR technologies, of which the OCR engine is the core component, are widely available. These commercial engines are highly developed and offer considerable accuracy when working with texts in major languages. With English text, for example, the top commercial engines have an accuracy of over 98%. Some companies specializing in OCR technologies offer software development kits (SDKs), which allow software developers to license the OCR technology for use in their own systems.
Language Availability
As previously mentioned, the accuracy of commercial OCR for major languages is very high. This accuracy is achieved by combining language-independent algorithms, which identify the likely value of each character, with language-specific information, such as wordlists, that improves the results of these algorithms.
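The combination described above can be illustrated with a minimal sketch in Python. It does not reflect how any particular commercial engine or Tesseract is implemented. The candidate letters, their confidence scores, the two-word wordlist, the function name best_reading and the bonus parameter are all hypothetical and chosen only for illustration. The sketch assumes that a language-independent classifier returns a few candidate letters with confidence scores for each character position, and that a wordlist is then used to favour readings that form known words.

```python
from itertools import product

# Hypothetical classifier output for a five-letter word image:
# one list per character position, each holding (letter, confidence) pairs.
CANDIDATES = [
    [("c", 0.8), ("e", 0.2)],
    [("v", 0.55), ("y", 0.45)],   # the classifier slightly prefers "v" here
    [("m", 0.9), ("n", 0.1)],
    [("r", 0.9), ("n", 0.1)],
    [("u", 0.6), ("n", 0.4)],
]

# Hypothetical language-specific wordlist (a real one would be far larger).
WORDLIST = {"cymru", "cymraeg"}


def best_reading(candidates, wordlist=None, bonus=2.0):
    """Return the highest-scoring string over all candidate combinations.

    The score is the product of the per-character confidences. If a wordlist
    is supplied, readings found in it have their score multiplied by `bonus`
    (an assumed weighting, not a value taken from any real engine).
    """
    best_word, best_score = None, -1.0
    for combo in product(*candidates):
        word = "".join(letter for letter, _ in combo)
        score = 1.0
        for _, confidence in combo:
            score *= confidence
        if wordlist is not None and word in wordlist:
            score *= bonus
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Character evidence alone versus character evidence plus the wordlist.
print(best_reading(CANDIDATES))
print(best_reading(CANDIDATES, WORDLIST))
```

With these assumed scores, the character evidence alone selects a string that is not a word, while the wordlist shifts the choice to a known Welsh word. This is the sense in which language-specific information improves the result of a language-independent algorithm. Real engines use much larger wordlists and more efficient search than the exhaustive enumeration shown here.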