
Optical character recognition

OCR stands for optical character recognition, a widely used and efficient text-scanning technology. It identifies and extracts the text in a picture or PDF and outputs a text document, which makes it convenient to verify user information or to edit the content directly.

A typical OCR pipeline has five stages: input, image processing, text detection, text recognition, and output. Each stage depends on several algorithms working closely together, so turning a picture into text output means passing through a sequence of distinct processes.

OCR technical process

Image input: read image files in various formats.

Image preprocessing: mainly binarization, denoising, and tilt (skew) correction.
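As an illustration of the binarization step, here is a minimal sketch that assumes the image is a list of rows of 8-bit gray levels and picks the threshold with Otsu's method. The text does not name a thresholding method, so Otsu's method is only one common choice, and the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        # Pixels with value <= t form the "dark" class.
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(image):
    """Map a grayscale image (rows of 0-255 ints) to 1 = ink, 0 = background."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]
```

Dark pixels are treated as ink here, which assumes dark text on a light background; a scan with light text on a dark background would need the comparison inverted.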

Layout analysis: divide the document image into paragraphs and lines.
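One simple way to find text lines, sketched below, is a horizontal projection profile: count the ink pixels in each row and treat every run of non-empty rows as one line. This assumes a binarized, deskewed, single-column page (1 = ink); real layout analysis also has to handle columns, tables and figures. The name `split_lines` is hypothetical.

```python
def split_lines(binary):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line."""
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in binary]
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink and start is None:
            start = y                      # first inked row of a new line
        elif not ink and start is not None:
            lines.append((start, y))       # blank row closes the line
            start = None
    if start is not None:                  # line that touches the bottom edge
        lines.append((start, len(profile)))
    return lines
```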

Character cutting: split each line into individual characters, dealing with touching characters and broken strokes, which make simple cutting unreliable.
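The sketch below shows one possible treatment of touching characters, assuming a binarized text line (1 = ink). It first cuts at blank columns, then splits any span wider than a caller-supplied `max_width` once, at its thinnest column. Both the heuristic and the names `cut_characters` and `max_width` are assumptions made for illustration; broken strokes are not handled here.

```python
def cut_characters(line, max_width):
    """Split a binary text line into (left, right) column spans, right exclusive."""
    width = len(line[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in line) for x in range(width)]

    # Pass 1: cut at blank columns.
    spans, start = [], None
    for x in range(width + 1):
        ink = x < width and profile[x] > 0
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            spans.append((start, x))
            start = None

    # Pass 2: a span wider than max_width is assumed to hold two touching
    # characters and is cut once, at its thinnest interior column.
    result = []
    for left, right in spans:
        if right - left > max_width:
            cut = min(range(left + 1, right), key=lambda x: profile[x])
            result.extend([(left, cut), (cut, right)])
        else:
            result.append((left, right))
    return result
```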

Character feature extraction: extract multidimensional features from each character image.
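The text does not say which features are used. As one simple example of a multidimensional feature, the sketch below computes zoning features: the glyph is divided into a grid and the ink density of each cell becomes one component of the vector. The name `zoning_features` and the default 2 x 2 grid are assumptions.

```python
def zoning_features(glyph, zones=2):
    """Ink density of each cell in a zones x zones grid, in row-major order."""
    h, w = len(glyph), len(glyph[0])
    feats = []
    for zy in range(zones):
        y0, y1 = zy * h // zones, (zy + 1) * h // zones
        for zx in range(zones):
            x0, x1 = zx * w // zones, (zx + 1) * w // zones
            ink = sum(glyph[y][x] for y in range(y0, y1) for x in range(x0, x1))
            area = (y1 - y0) * (x1 - x0)
            # An empty cell (glyph smaller than the grid) contributes 0.0.
            feats.append(ink / area if area else 0.0)
    return feats
```

For a 4 x 4 glyph with a solid top-left quadrant and three ink pixels in the bottom-right quadrant, this yields `[1.0, 0.0, 0.0, 0.75]`.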

Character recognition: compare the feature vector extracted from the current character with a library of feature templates, first by coarse classification and then by fine template matching, to identify the character.
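The coarse-then-fine idea can be sketched as follows, assuming the template library is a dictionary from character to feature vector. The coarse stage ranks templates by a cheap one-number summary (mean feature value) to build a shortlist, and the fine stage compares full vectors by squared Euclidean distance. Both distance choices, the `shortlist` parameter and the name `recognize` are assumptions for illustration.

```python
def recognize(features, templates, shortlist=2):
    """Pick the best-matching character from templates (char -> feature vector)."""
    def mean(vec):
        return sum(vec) / len(vec)

    # Coarse stage: keep the templates whose mean value is closest to ours.
    target = mean(features)
    ranked = sorted(templates, key=lambda ch: abs(mean(templates[ch]) - target))
    candidates = ranked[:shortlist]

    # Fine stage: full-vector squared Euclidean distance on the shortlist only.
    def distance(ch):
        return sum((a - b) ** 2 for a, b in zip(features, templates[ch]))

    return min(candidates, key=distance)
```

The shortlist keeps the expensive comparison away from most of the library, which matters when the character set is large.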

Page recovery: identify the layout of the original document and write the recognition results to a text document in that same layout.
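A rough sketch of layout restoration for plain-text output: given recognized words with their top-left pixel positions, group them into rows by vertical position and pad with spaces so that horizontal position is approximately preserved. The fixed `line_height` and `char_width` values and the name `restore_layout` are assumptions; real page recovery would also track fonts, columns and tables.

```python
def restore_layout(items, line_height=10, char_width=5):
    """items: list of (x, y, text) tuples with top-left pixel coordinates."""
    if not items:
        return ""
    # Bucket words into text rows by their vertical position.
    rows = {}
    for x, y, text in items:
        rows.setdefault(y // line_height, []).append((x, text))

    out = []
    for row in range(max(rows) + 1):
        line = ""
        for x, text in sorted(rows.get(row, [])):
            col = x // char_width
            if line and len(line) >= col:
                line += " "              # keep words apart if they would collide
            else:
                line = line.ljust(col)   # pad with spaces up to the target column
            line += text
        out.append(line)                 # rows with no words become blank lines
    return "\n".join(out)
```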

Post-processing correction: correct the recognition results using the linguistic context of the specific language.
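Full context-based correction usually relies on a language model. As a much simpler stand-in, the sketch below replaces a word that is not in a known vocabulary with the closest vocabulary entry by Levenshtein edit distance, provided the distance is small. The vocabulary, the `max_distance` limit and the function names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def correct(word, vocabulary, max_distance=1):
    """Return the nearest vocabulary word, or the input if none is close enough."""
    if not vocabulary or word in vocabulary:
        return word
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_distance else word
```

With the vocabulary `["optical", "character", "recognition"]`, the misread `recogn1tion` is corrected to `recognition`, while an unrelated string such as `xyz` is left unchanged.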