What is OCR?

You may have installed an OCR device or downloaded an OCR installer.

OCR (Optical Character Recognition) is a branch of pattern recognition (PR). Its purpose is to let a computer understand what it sees, especially written material.

Because OCR is a constant tug-of-war with the recognition rate, its most important topic is how to correct errors or use auxiliary information to raise that rate, and this is where the term ICR (Intelligent Character Recognition) comes from. Different applications have been derived depending on the medium that holds the written material and the way that material is acquired.

The following is a basic introduction to OCR, covering its technology and its applications.

I. Development of optical character recognition

Research on OCR began in countries around the world as early as the 1960s and 1970s. Early work concentrated on character recognition methods, and the only characters recognized were the digits 0 to 9. Take Japan, which also uses square-shaped characters, as an example: research on the basic recognition theory of OCR began there around 1960, with digits as the first target. Between 1965 and 1970 some simple products began to appear, such as a postal code recognition system for printed characters, which read the postal code on mail and helped the post office sort letters by region. This is why countries have long encouraged writing postal codes on addresses.

OCR is an inherently uncertain field of research. Its accuracy behaves like a function approaching a limit: we know the limit, but we can only approach it and never reach it, so OCR is in a permanent tug-of-war with 100%. Too many factors affect accuracy, such as the habits of the writer, the printing quality of the document, the scanning quality of the scanner, the recognition method, and the samples used for training and testing. An OCR product therefore needs a strong recognition core, but its ease of use and the error-correction functions and methods it provides are also important factors in the quality of the product.

The purpose of an OCR system is simple: to convert an image so that the graphics in it are preserved while the data in tables and the characters in the image become computer text. This reduces the storage needed for image data, lets the recognized text be reused and analyzed, and of course saves the labor and time of keyboard entry.

From image to result output, the process goes through image input, image preprocessing, character feature extraction, and comparison and recognition. Finally, misrecognized characters are fixed by manual correction and the result is output.

Each step is introduced below:

Image input: The material to be processed by OCR must be brought into the computer through an optical device such as an image scanner, a fax machine or any photographic equipment. As technology has advanced, input devices such as scanners have become more refined, lighter, thinner and smaller, and of higher quality, which is a great help to OCR. Higher scanner resolution yields clearer images, and faster scanning improves the efficiency of OCR processing.

Image preprocessing: Image preprocessing is the module that has to solve the most problems in an OCR system. Everything from obtaining a binary image (each pixel either black or white), or a grayscale or color image, to isolating the individual character images belongs to image preprocessing. It includes image processing such as normalization, noise removal and image correction, as well as document preprocessing such as layout analysis and the separation of text lines and characters. The theory and techniques of image processing are mature, so many libraries are available commercially or online. Document preprocessing, by contrast, depends on each developer's skill. The image must first be separated into picture, table and text regions, and it may even be necessary to distinguish the layout direction of the article and its headings from its body. The size and font of the text must also be determined so that they match the original document.
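As an illustration of the binarization step mentioned above, here is a minimal sketch that turns a grayscale image into a black-or-white one. It assumes the image is a list of rows of integers from 0 (black) to 255 (white), and it picks the threshold with Otsu's method. That choice is an assumption made for the example; the text does not name a particular thresholding technique. The function names are hypothetical.

```python
def otsu_threshold(gray):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # pixels at or below the candidate threshold
    sum0 = 0  # sum of their gray levels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break             # no light pixels left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray, threshold=None):
    """Map dark pixels (<= threshold) to 1 (ink) and the rest to 0 (paper)."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[1 if value <= threshold else 0 for value in row] for row in gray]


# A tiny 3x3 "scan": a dark diagonal stroke on light paper.
scan = [
    [250, 240, 30],
    [245, 20, 235],
    [25, 250, 240],
]
for row in binarize(scan):
    print(row)
```

A global threshold like this only suits evenly lit scans; real documents often need locally adaptive thresholds plus the noise removal and correction steps listed above.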

Character feature extraction: In terms of recognition rate alone, feature extraction can be called the core of OCR. Which features are used and how they are extracted directly affect recognition quality, so early OCR research produced many reports on feature extraction. Features are the raw material of recognition and fall roughly into two categories. The first is statistical features, such as the ratio of black to white points in a character region. When the character is divided into several zones, the black/white ratios of the zones together form a numeric vector in a feature space, and basic mathematics is enough to compare such vectors. The second is structural features, such as the number and position of stroke endpoints and intersections obtained after thinning the character image, or stroke segments used as features and matched with special comparison methods. Most online handwriting input software on the market uses this structural approach.
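The statistical feature described above can be sketched in a few lines. This example assumes a binary character image stored as rows of 0 (white) and 1 (black), and it uses the fraction of black pixels in each zone rather than a literal black/white ratio, so that an all-black zone does not divide by zero. The function name and the 2 x 2 zoning are illustrative choices, not taken from the text.

```python
def zone_densities(bitmap, rows=2, cols=2):
    """Split a binary character image into rows x cols zones and return
    the fraction of black pixels in each zone as one flat feature vector."""
    height, width = len(bitmap), len(bitmap[0])
    features = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            cells = [bitmap[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            features.append(sum(cells) / len(cells) if cells else 0.0)
    return features


# A 4x4 image of the letter "L".
L_SHAPE = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
]
print(zone_densities(L_SHAPE))  # [0.5, 0.0, 0.75, 0.5]
```

Finer zoning gives a longer vector that separates similar characters better, at the cost of more sensitivity to small shifts in the character's position.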

Comparison database: Once the features of an input character have been computed, whether statistical or structural, there must be a comparison database, or feature database, to match them against. The database should cover the whole character set to be recognized, with a feature set for each character obtained by the same feature extraction method used on the input characters.

Comparison and recognition: This is the module where mathematical theory comes fully into play. Different mathematical distance functions are chosen for different kinds of features. Well-known comparison methods include Euclidean distance comparison, the relaxation method, dynamic programming (DP), the construction and matching of neural-network-style databases, and HMM (Hidden Markov Model). To make recognition results more stable, so-called expert systems have also been proposed, which exploit the differences and complementarity of the various feature comparison methods to give the result especially high confidence.
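Of the methods listed, Euclidean distance comparison is the simplest to illustrate. The sketch below ranks the entries of a small feature database by their distance to an input vector and returns the closest ones as candidates. The database contents, the labels and the function names are made-up values for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(features, database, top_k=3):
    """Return up to top_k (label, distance) pairs, closest match first."""
    scored = sorted((euclidean(features, ref), label)
                    for label, ref in database.items())
    return [(label, dist) for dist, label in scored[:top_k]]


# Hypothetical feature database: one reference vector per character,
# produced by the same extractor that is applied to the input.
DATABASE = {
    "L": [0.5, 0.0, 0.75, 0.5],
    "I": [0.5, 0.0, 0.5, 0.0],
    "T": [0.75, 0.75, 0.5, 0.0],
}

unknown = [0.5, 0.1, 0.7, 0.4]
for label, dist in rank_candidates(unknown, DATABASE):
    print(label, round(dist, 3))
```

Keeping several ranked candidates, rather than only the best match, is what lets later stages choose among similar-looking characters.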

Text post-processing: Because the recognition rate of OCR cannot reach 100%, and because it helps to reinforce the correctness and confidence of the comparison, functions that detect errors or even help correct them have become essential modules of an OCR system. Word-level post-processing is one example. Given each recognized character and its group of similar candidate characters, the system uses the neighboring recognized characters to find the most plausible word and corrects the result accordingly.

Word database: The lexicon built for word-level post-processing.
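A minimal sketch of word-level post-processing follows. It assumes the recognizer returns, for each character position, a list of (character, distance) candidates, and that a word list is available. It then picks the word whose letters all appear among the candidates with the lowest total distance. The data layout and names are assumptions for illustration; real systems usually also weigh word frequency and surrounding context.

```python
def correct_word(candidates, lexicon):
    """candidates: one list per character position of (char, distance)
    pairs, best first. Returns the lexicon word with the lowest total
    distance, or the top-ranked characters if no word fits."""
    best_word, best_cost = None, float("inf")
    for word in lexicon:
        if len(word) != len(candidates):
            continue
        cost = 0.0
        for ch, options in zip(word, candidates):
            distances = dict(options)
            if ch not in distances:
                break              # this word cannot explain the image
            cost += distances[ch]
        else:
            if cost < best_cost:
                best_word, best_cost = word, cost
    if best_word is None:
        # Fall back to the recognizer's first choice at every position.
        return "".join(options[0][0] for options in candidates)
    return best_word


# The recognizer's first choices spell "c1ock"; the lexicon repairs it.
CANDIDATES = [
    [("c", 0.10), ("e", 0.30)],
    [("1", 0.20), ("l", 0.25), ("I", 0.30)],
    [("o", 0.10), ("0", 0.15)],
    [("c", 0.10)],
    [("k", 0.20)],
]
LEXICON = {"clock", "cloak", "block"}
print(correct_word(CANDIDATES, LEXICON))  # clock
```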

Manual correction: This is the last checkpoint of OCR. Up to this point the user may only have clicked the mouse and followed the flow laid out by the software, or simply watched. Here, however, the user may have to spend real attention and time correcting errors, or even hunting for places where OCR may have gone wrong. Good OCR software needs a stable image processing and recognition core to keep the error rate low, but the workflow and features of manual correction also affect the overall efficiency of OCR. Side-by-side comparison of the text image with the recognized characters, the on-screen placement of information, a candidate list for each recognized character, rejection of characters the system cannot recognize, and special marking of characters that post-processing flags as potentially wrong are all designed so that the user touches the keyboard as little as possible. Of course, text that the system does not flag is not necessarily correct, just as a typist entering everything by keyboard still makes mistakes. How much further checking to do at that point depends entirely on the user's needs.
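The rejection and marking functions mentioned above could be driven by a rule like the following sketch. It assumes each recognized character comes with (character, distance) candidates sorted best first. The two thresholds are arbitrary illustrative values, and the status names are hypothetical.

```python
def review_status(candidates, reject_above=0.6, min_margin=0.1):
    """Classify one recognized character for the correction screen.

    'rejected': no candidate is close enough to trust at all.
    'doubtful': the two best candidates are too close to call.
    'ok':       the best candidate wins clearly.
    """
    if not candidates or candidates[0][1] > reject_above:
        return "rejected"
    if len(candidates) > 1 and candidates[1][1] - candidates[0][1] < min_margin:
        return "doubtful"
    return "ok"


print(review_status([("c", 0.1), ("e", 0.5)]))    # ok
print(review_status([("1", 0.2), ("l", 0.25)]))   # doubtful
print(review_status([("?", 0.9)]))                # rejected
```

Only the characters marked rejected or doubtful need to be highlighted for the user, which is what keeps keyboard work to a minimum.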

Result output: Output itself is simple, but it depends on what the user wants from OCR. Some users only need the text for reuse as part of another document, so a plain text file is enough. Some want output that looks exactly like the input document, so the original layout has to be reproduced. Others care mainly about the text in tables, so the output has to work with software such as Excel. Whatever the variation, it is only a change in the format of the output file.