Optical Character Recognition

Optical Character Recognition technology is still an area where extensive research is required. Here's some information on what it is, the problems involved, and the strides technology has taken towards its perfection.
Optical Character Recognition (OCR) is a method of making printed, typewritten, or handwritten data readable by a computer. The intention of OCR is to store the data in a digital format, where it can be edited on a machine and, most importantly, searched by keyword. The process generally involves having a machine decipher the data, converting it into a machine-readable format, and then storing it.

How is it Done?

The first step is to scan and process the document. Then, a layer of OCR text (the recognized symbols) is added behind each page image in the scanned document. To help ensure that the characters are recognized properly, more than one image filter may be applied in combination.

With the filters in place, individual characters are identified using a dictionary of known patterns that is built into the software. Each scanned pattern is matched against the pre-existing patterns in the dictionary to find out which character it stands for. It is then converted into readable text. This text is what is visible to the user, and it is the result of the OCR.
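
The dictionary lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular OCR product works: the 3x3 bitmaps in TEMPLATES and the names match_score and recognize are hypothetical, and simple pixel-agreement counting stands in for whatever comparison a real engine uses.

```python
# Hypothetical 3x3 glyph bitmaps ("1" = ink, "0" = paper).
# A real OCR dictionary would hold far larger and more numerous patterns.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
    "O": ("111", "101", "111"),
}


def match_score(glyph, template):
    """Count the pixels on which the glyph and the template agree."""
    return sum(
        g == t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )


def recognize(glyph):
    """Return the best-matching character and its agreement score."""
    best = max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))
    return best, match_score(glyph, TEMPLATES[best])


# An "L" whose bottom-right pixel was lost in scanning.
noisy_l = ("100", "100", "110")
print(recognize(noisy_l))  # ('L', 8): 8 of 9 pixels agree with the "L" template
```

Because the closest pattern wins even when it is not an exact match, a slightly damaged character can still be recognized, while a badly damaged one may be matched to the wrong entry.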

If the document is badly smudged, high-end technologies such as multi-light image capture might be employed. This is also helpful when the document has shadows on it near the page folds.

Problems in OCR

The benefits of OCR are clear, but there is still a lot of progress to be made in the field. It is not a perfect science yet, and scanned documents commonly come out with errors. There are many reasons why perfection is proving to be elusive:
  • People have hugely different styles of writing. On top of that, people do not write with the same speed, compactness, or density of ink. Usually, no shared pattern can be discerned between the writing styles of two different individuals, which makes it very difficult for any software to recognize common patterns. Today, OCR works much better for discrete handwriting than for cursive writing. The more connected the strokes, the harder the handwriting is for the software to identify.
  • OCR works well only if the letters are clearly discernible. This depends on many things, from the color and cleanliness of the paper the text is printed on to the age of the paper. It is very difficult to identify the symbols on dirty, smudged paper.
  • Another problem might be the unevenness of the paper on which the text to be recognized appears. The paper might be creased, or, if it is a page of a book, it will be very difficult to identify the letters near the center of the book, where the inward slope of the page can create shadows.
  • The major failing so far is the lack of a common language in which all forms of OCR can recognize the patterns in the text. Most methods involve the use of several coded symbols to achieve character recognition. Whatever success has been achieved so far is due to the establishment of these symbolic patterns.
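
The shadow problem described above is often tackled by comparing each pixel with its own neighborhood rather than with one page-wide cutoff. The sketch below illustrates that idea on a single row of grayscale values; the sample scanline, the cutoff of 128, the window radius, and the offset are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def global_threshold(row, cutoff=128):
    """Mark a pixel as ink if it is darker than one fixed cutoff."""
    return [i for i, value in enumerate(row) if value < cutoff]


def local_threshold(row, radius=2, offset=25):
    """Mark a pixel as ink if it is clearly darker than its neighborhood mean."""
    ink = []
    for i, value in enumerate(row):
        window = row[max(0, i - radius): i + radius + 1]
        if value < sum(window) / len(window) - offset:
            ink.append(i)
    return ink


# One row of pixels (0 = black, 255 = white). The paper darkens steadily
# toward the right, as it might near a book's spine; the real ink strokes
# are at positions 2 and 9.
scanline = [220, 210, 100, 190, 180, 170, 160, 150, 140, 30, 120, 110]

print(global_threshold(scanline))  # [2, 9, 10, 11]: shadowed paper mistaken for ink
print(local_threshold(scanline))   # [2, 9]: only the real strokes
```

With a single cutoff, the shadowed paper at the right edge is dark enough to be mistaken for ink; the neighborhood comparison ignores the gradual darkening and keeps only the pixels that stand out locally.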

Where is OCR Heading Today?

As mentioned before, OCR has not yet achieved perfection, and users should be prepared for errors. That is why OCR is typically followed by a human review.

Since OCR is applied to vastly different kinds of material, its success varies widely from field to field.
  • In Text Identification: Among written scripts, recognition of Latin script has been honed to near perfection. There is only a 1% error rate in Latin recognition, as Latin letters are simpler (with fewer strokes, curves, and lines) than those of other scripts used globally. Scripts such as Chinese are very difficult for OCR. Printed text is better recognized than handwritten text.
  • In Music Identification: The music industry has attempted to remove the staff lines from sheet music to make it suitable for OCR. This has met with a fair degree of success. However, it is very difficult to recognize handwritten music. Photoscore Ultimate 5 from Neuratron is said to be the only software application in the world that does it, and even its output is far from perfect.
  • In Magnetic Ink Identification: Magnetic ink character recognition (MICR) is very important in banks, where checks need to be processed. Special fonts such as E-13B and CMC-7 are used for this process. This kind of identification reproduces the original text with a high degree of accuracy.
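
The source does not say how the 1% error rate quoted above for Latin script was measured. One common way to express such a figure is the character error rate: the number of single-character insertions, deletions, and substitutions needed to turn the OCR output into a trusted reference transcription, divided by the length of the reference. The sketch below assumes that definition; the function names and the sample strings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


reference = "optical character recognition"
ocr_output = "optica1 character recogniton"  # one substitution, one dropped letter

rate = character_error_rate(reference, ocr_output)
print(f"{rate:.1%}")  # 6.9% (2 errors in 29 characters)
```

A human reviewer correcting the output is, in effect, supplying the reference text that this calculation needs.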
Another area where OCR is very important is direct hand-input data, such as text written with a stylus on a palmtop. Today, many companies have refined this technology, but a lot still depends on how uniformly the person can write. Training may be required first for the operating system to learn the person's writing style, and the writer might then have to adjust certain aspects of their writing for the OS to understand it. This technique is known as Intelligent Character Recognition (ICR) and is widely used nowadays.