What is meant by Optical Character Recognition

Optical Character Recognition (OCR) refers to the process by which electronic devices (e.g., scanners or digital cameras) examine characters printed on paper, determine their shapes by detecting patterns of dark and light, and then translate those shapes into computer text using character recognition methods. In other words, for printed characters, the text in a paper document is converted by optical means into a black-and-white dot-matrix image file, which recognition software then converts into text format for further editing and processing in word-processing software. How to eliminate errors, or how to use auxiliary information to improve recognition accuracy, is the most important topic in OCR; the term ICR (Intelligent Character Recognition) arose for this reason. The main indicators for measuring the performance of an OCR system are: rejection rate, false recognition rate, recognition speed, user interface friendliness, product stability, ease of use and feasibility.
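
As a rough illustration of the first two indicators, the sketch below computes a rejection rate and a false recognition rate from character counts. The function name `ocr_rates` and the choice of dividing both counts by the total number of characters are assumptions made here; some evaluations divide misrecognized characters by accepted characters instead.

```python
def ocr_rates(total, rejected, misrecognized):
    """Return (rejection_rate, false_recognition_rate) as fractions of all characters.

    Assumption: both rates use the total character count as the denominator.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return rejected / total, misrecognized / total


# Hypothetical counts for a 1000-character test page.
rejection, false_rate = ocr_rates(total=1000, rejected=20, misrecognized=15)
print(f"rejection rate: {rejection:.1%}, false recognition rate: {false_rate:.1%}")
```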

Working principle:

The purpose of an OCR system is simple: to convert an image so that the graphics in it are kept as images, while the data in tables and the text in the image are all turned into computer text. This reduces the storage needed for image data, lets the recognized text be reused and analyzed, and of course saves the labor and time of keyboard input.

From image to result output, the process goes through image input, image preprocessing, text feature extraction and comparison recognition; finally, manual correction fixes the misrecognized text, and the result is output.

Image Input

The material to be processed by OCR has to be transferred to the computer through optical instruments such as image scanners, fax machines or any photographic equipment. As technology advances, scanners and other input devices have become more refined, thinner, smaller and higher in quality, which helps OCR considerably: higher scanner resolution makes images clearer, and faster scanning improves the efficiency of OCR processing.

Image preprocessing: Image preprocessing is the module in an OCR system that has the most problems to solve. The image must first be separated into picture, table and text regions; even the layout direction of the article, its outline and its body text must be distinguished, and the size and font of the text can also be determined so that they match the original document.

Applying the following preprocessing steps to the image to be recognized can reduce the difficulty of the feature extraction algorithm and improve recognition accuracy.

Binarization: Because a color image contains too much information, the image needs to be binarized before the printed characters in it are recognized, so that it contains only black foreground information and white background information. This improves the efficiency and accuracy of recognition.
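
One common way to choose the black/white cutoff is Otsu's method, which picks the gray level that best separates dark and light pixels. The sketch below is a minimal pure-Python version under some assumptions: the image has already been reduced to 8-bit grayscale (a list of rows of integers from 0 to 255), dark pixels are the text, and the names `otsu_threshold`, `binarize` and `page` are illustrative only.

```python
def otsu_threshold(gray):
    """Return the gray level t that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w_dark = 0    # number of pixels with level <= t
    sum_dark = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue  # one class is empty, so this t does not split the image
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        score = w_dark * w_light * (mean_dark - mean_light) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def binarize(gray):
    """Map pixels at or below the threshold to 1 (black foreground), others to 0."""
    t = otsu_threshold(gray)
    return [[1 if p <= t else 0 for p in row] for row in gray]


# Hypothetical 3x4 grayscale patch: a dark stroke on a light background.
page = [
    [250, 248, 30, 251],
    [249, 25, 28, 250],
    [252, 247, 35, 249],
]
print(binarize(page))
```

A single global threshold like this works best on evenly lit scans; unevenly lit photographs usually need a locally adaptive threshold instead.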

Image noise reduction: Because the quality of the image to be recognized is limited by the input device, the environment and the print quality of the document, the image must be denoised according to the characteristics of its noise before the printed characters in it are recognized. This improves the accuracy of recognition.
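
The text does not name a specific denoising method. As one example, a 3x3 median filter removes isolated specks ("salt-and-pepper" noise) from a binarized image. The sketch below assumes the image is a non-empty list of equal-length rows of 0/1 values; `median_filter_3x3` is an illustrative name.

```python
def median_filter_3x3(img):
    """Replace each pixel by the median of its 3x3 neighborhood (clipped at borders)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Collect the neighbors that fall inside the image.
            window = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ]
            window.sort()
            row.append(window[len(window) // 2])
        out.append(row)
    return out


# A single stray black pixel on a white background is removed.
speck = [[0] * 5 for _ in range(5)]
speck[2][2] = 1
print(median_filter_3x3(speck))
```

A median filter also rounds off thin strokes and sharp corners, so the window size is a trade-off between removing noise and preserving character detail.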

Tilt correction: Because scanning and photographing involve manual operation, the image fed into the computer is almost always tilted to some degree. Before the printed characters in the image are recognized, the orientation of the image must be detected and corrected.
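
One widely used way to detect the tilt is the projection-profile method: try a set of candidate angles, and keep the one for which the foreground pixels pile up in the fewest rows. The sketch below assumes the input is a list of (x, y) coordinates of black pixels and that the tilt lies within the candidate range; `estimate_skew`, `deskew` and the one-degree step are illustrative choices, not something the text prescribes.

```python
import math


def estimate_skew(points, candidates=range(-10, 11)):
    """Return the candidate angle (degrees) whose de-rotation gives the sharpest row profile."""
    best_angle, best_score = 0, -1
    for angle in candidates:
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        profile = {}
        for x, y in points:
            # Row coordinate of the point after rotating the page by -angle.
            row = round(-x * sin_t + y * cos_t)
            profile[row] = profile.get(row, 0) + 1
        # The sum of squares is large when pixels concentrate in a few rows.
        score = sum(n * n for n in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def deskew(points, angle):
    """Rotate foreground coordinates by -angle degrees to level the text lines."""
    theta = math.radians(angle)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    return [(x * cos_t + y * sin_t, -x * sin_t + y * cos_t) for x, y in points]


# Synthetic test page: five text lines, 20 units apart, tilted by 3 degrees.
slope = math.tan(math.radians(3))
tilted = [(x, 20 * k + x * slope) for k in range(5) for x in range(200)]
angle = estimate_skew(tilted)
leveled = deskew(tilted, angle)
print(angle)
```

The angle here is measured in the coordinate system of the input points; with image coordinates, where y grows downward, the visual direction of the rotation is mirrored.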

Text feature extraction: In terms of recognition rate alone, feature extraction can be said to be the core of OCR. Which features are used and how they are extracted directly affect the quality of recognition, which is why research reports on feature extraction were especially numerous in the early days of OCR research. Features can be said to be the bargaining chips of recognition, and they fall roughly into two categories. The first is statistical features, such as the black/white pixel ratio in the text region: when the text region is divided into several zones, the black/white ratios of the zones together form a numerical vector in space, and basic mathematical theory is sufficient for the comparison. The second is structural features: for example, after the text image is thinned into lines, the number and positions of stroke endpoints and intersections are obtained, or stroke segments are used as features, and special comparison methods are applied. Most online handwriting input software on the market is based on this structural approach.
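
The zone-based statistical feature described above can be sketched as follows. The example assumes a binarized character image given as rows of 0/1 values, and it uses the fraction of black pixels in each zone rather than a literal black-to-white ratio, so that an all-black zone does not cause a division by zero. The name `zone_features` and the 2x2 grid are illustrative.

```python
def zone_features(glyph, zones=2):
    """Split a binary glyph into zones x zones cells and return the black fraction of each."""
    h, w = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones):
        y0, y1 = zy * h // zones, (zy + 1) * h // zones
        for zx in range(zones):
            x0, x1 = zx * w // zones, (zx + 1) * w // zones
            cell = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    return features


# A 4x4 letter "L": the zones are read left to right, top to bottom.
letter_l = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
print(zone_features(letter_l))  # [0.5, 0.0, 0.75, 0.5]
```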

Comparison database: Once the features of the input text have been computed, whether statistical or structural, there must be a comparison database (feature database) to compare them against. The database should contain every character in the set to be recognized, with feature groups obtained by the same feature extraction method used for the input text.

Comparison recognition

This is the module that can make the fullest use of mathematical theory. Depending on the characteristics of the features, different mathematical distance functions are chosen. The better-known comparison methods include Euclidean-space comparison, relaxation comparison (Relaxation), dynamic programming comparison (Dynamic Programming, DP), database construction and comparison based on neural networks, and HMM (Hidden Markov Model). To make recognition results more stable, so-called expert systems (Expert System) have also been proposed, which exploit the complementary nature of different feature comparison methods so that the confidence in the recognition result is especially high.
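
Of the methods listed, Euclidean-space comparison is the simplest to sketch: the input feature vector is compared with every entry in the feature database, and the nearest one wins. The example below also rejects the input when even the nearest entry is too far away, which is one way a rejected character can arise. The `templates` values, the `recognize` name and the `reject_above` cutoff are all made up for illustration.

```python
import math


def recognize(features, templates, reject_above=0.5):
    """Return (label, distance) of the nearest template, or (None, distance) if too far."""
    best_label, best_dist = None, math.inf
    for label, reference in templates.items():
        dist = math.dist(features, reference)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    if best_dist > reject_above:
        return None, best_dist  # rejected: no template is close enough
    return best_label, best_dist


# Hypothetical feature database: four zone features per character.
templates = {
    "L": [0.5, 0.0, 0.75, 0.5],
    "T": [0.75, 0.75, 0.5, 0.5],
    "I": [0.5, 0.5, 0.5, 0.5],
}
print(recognize([0.5, 0.1, 0.7, 0.5], templates))
print(recognize([0.0, 0.0, 0.0, 0.0], templates))
```

Scanning every template is linear in the size of the character set, which is acceptable for a small alphabet but usually calls for a coarse pre-classification step on large character sets.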

Word post-processing: Because the recognition rate of OCR cannot reach 100%, and in order to strengthen the correctness and confidence value of the comparison, functions that remove errors or even help correct them have become a necessary module in an OCR system. Word post-processing is one example: it takes the recognized characters and their similar candidate characters after comparison, and uses the recognized text before and after them to find the most plausible word, thereby making corrections.

Word database: A word database created for word post-processing.
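
A minimal sketch of lexicon-based post-processing, under stated assumptions: the recognizer returns, for each character position, a short list of (character, confidence) candidates, and the word database is a plain set of valid words. The function `correct_word`, the confidence values and the tiny lexicon are hypothetical.

```python
from itertools import product


def correct_word(candidates, lexicon):
    """Pick the lexicon word with the highest combined candidate confidence.

    candidates: one list per character position of (char, confidence) pairs.
    Falls back to the top-ranked raw reading if no combination is in the lexicon.
    """
    best_word, best_score = None, -1.0
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        if word not in lexicon:
            continue
        score = 1.0
        for _, conf in combo:
            score *= conf
        if score > best_score:
            best_word, best_score = word, score
    if best_word is None:
        best_word = "".join(max(pos, key=lambda c: c[1])[0] for pos in candidates)
    return best_word


# The raw top-ranked reading is "c1ear"; the lexicon recovers "clear".
candidates = [
    [("c", 0.9), ("e", 0.1)],
    [("1", 0.6), ("l", 0.4)],
    [("e", 0.9)],
    [("a", 0.8), ("o", 0.2)],
    [("r", 0.9)],
]
lexicon = {"clear", "clean", "floor"}
print(correct_word(candidates, lexicon))
```

Trying every combination grows exponentially with word length, so this brute-force search only suits short words with a few candidates per position; it also looks at one word at a time and ignores the surrounding sentence.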

Manual Correction

This is the last hurdle of OCR. Up to this point, the user may only have held a mouse and followed the flow of the software's design, or simply watched. Here, however, the user may need to spend real attention and time correcting, or even hunting for, places where OCR may have made errors. In good OCR software, besides a stable image processing and recognition core, the workflow and functions of manual correction also affect the efficiency of OCR processing. Therefore, the side-by-side display of the text image and the recognized text, the placement of this information on screen, the candidate-character function for each recognized character, the rejected-character function, and the deliberate flagging of possibly problematic words after word post-processing are all designed to minimize the user's use of the keyboard. Of course, this does not mean that text the system does not flag is necessarily correct; even staff who enter everything by keyboard make mistakes. Whether to proofread again or to tolerate a small number of errors depends entirely on the needs of the organization using the system.

Result output

Some people only want to reuse part of the text, so an ordinary text file is enough. Some want the output to look exactly like the input document, so there is an original-layout reproduction function. Some care about the tables in the text, so the output is combined with Excel and other software. Whatever the variation, these are only changes in the output file format. If the original format needs to be restored exactly, manual layout is required after recognition, which is time-consuming and labor-intensive.