Application of Double-layer PDF in Digitization of Geological Data

Guo Huijin Jia Ma Guofeng Feifei Jess Zhang

(National Geological Archives)

After describing the characteristics and application prospects of double-layer PDF and OCR technology, this paper discusses the significance of converting digitized text and graphic geological data to double-layer PDF. It proposes a choice of conversion method, describes the OCR digital processing system and the methods for improving the recognition rate in detail, and finally sets out the significance of double-layer PDF for the construction of geological archives.

Keywords: double-layer PDF; OCR; recognition rate

At present, geological data collection agencies are stepping up their digitization work. By the end of 2013, more than 20 provincial archives had completed the digitization of their collections, and the digitization of geological data in the National Geological Archives was nearing completion. The massive body of data produced has become an important resource for public geological data information services. Digital data of this kind is static: it is convenient to read and use, but it cannot be searched in full text, which hinders further analysis and processing. Running OCR on the existing data and converting it into double-layer PDF files turns static data into dynamic data, makes it possible to build a full-text database, and enables full-text retrieval of geological data. This has become the direction in which geological data collection institutions are advancing data digitization.

1 About double-layer PDF and OCR technology

Double-layer PDF is a searchable PDF file generated by running OCR on scanned data: the upper layer is the original image and the lower layer is the recognition result, and the two correspond one to one. A double-layer PDF file fully retains the original layout while also supporting selection, copying and retrieval. Such files can be stored on CD-ROM, hard disk or disk array and managed systematically through an index database.
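The one-to-one correspondence between the image layer and the text layer can be illustrated with a minimal Python sketch. It is not a PDF writer and does not reflect the internal structure of any particular software; the class names, the file name and the pixel coordinates are all hypothetical and serve only to show how a search hit in the text layer maps back to a position on the scanned image.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """One recognized word in the text layer (hypothetical model)."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in image pixels


@dataclass
class DoubleLayerPage:
    """A scanned page image paired with its OCR text layer."""
    image_path: str
    words: list = field(default_factory=list)

    def text(self) -> str:
        """Return the selectable, copyable text of the page."""
        return " ".join(w.text for w in self.words)

    def find(self, term: str) -> list:
        """Return the image boxes of every word containing `term`."""
        return [w.box for w in self.words if term.lower() in w.text.lower()]


# Example data: file name and coordinates are made up for illustration.
page = DoubleLayerPage(
    image_path="scan_0001.tif",
    words=[
        Word("Jurassic", (120, 80, 260, 110)),
        Word("sandstone", (270, 80, 420, 110)),
        Word("outcrop", (120, 120, 230, 150)),
    ],
)
hits = page.find("sandstone")
```

Here `page.text()` plays the role of the copyable text layer, while `hits` holds the rectangle on the original image that a viewer would highlight.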

OCR (Optical Character Recognition) is the process in which an electronic device (such as a scanner or digital camera) examines the characters printed on paper, determines their shapes by detecting patterns of light and dark, and then translates those shapes into computer characters through character recognition. In other words, text material is scanned and the resulting image files are analyzed to obtain text and layout information. With the rapid development of computer networks, the conversion of information to electronic form has become an inevitable trend. Text is the most important and most concentrated carrier of information, so converting it to electronic form is particularly important. OCR is the key link in that conversion and has changed the traditional way of entering data from paper media. With OCR, users can convert images of newspapers, books, manuscripts, tables and other printed materials, captured by optical input devices such as cameras and scanners, into text that computers can recognize and process. Compared with traditional manual input, OCR therefore greatly improves the efficiency of data storage, retrieval and processing.

2 Application status

PDF is widely used worldwide in government, finance, law, engineering, medical care and many other sectors, and has become the standard format for official documents in governments, academic institutions and other organizations; PDF documents will therefore make up the bulk of the holdings of archives in the future. Double-layer PDF effectively resolves the tension between recognition cost and ease of reading and use, and it is a promising resource format. OCR technology is relatively mature abroad: companies around the world, including IBM, Motorola, Hewlett-Packard and Microsoft, have carried out research in this field and built OCR into their own products.

OCR technology is now also widely used in China. In research on information retrieval technology, which includes double-layer PDF retrieval, China's "863" program began testing and evaluating Chinese OCR, automatic word segmentation, automatic summarization, automatic search and automatic positioning before 2008. On this basis, China has gradually built a series of digitization projects, such as digital libraries, digital archives, digital newspapers and periodicals, and digital campus networks. Examples include the full-text databases of the General Administration of Press and Publication, the Ministry of Foreign Affairs, and the Central Committee of the Communist Youth League, as well as full-text databases covering 75 years of China Youth and 20 years of Xinhua Digest. As early as 1999, the National Library established the "National Library Document Digitization Center" to digitize and recognize all kinds of collected documents. On this basis it has built three categories of resources: bibliographic stacks, bibliographic databases and full-text databases, which have gradually become the central hub of online information resources in China.

As informatization spreads throughout China, the application prospects of OCR technology are becoming broader. The concepts of the digital library and the digital archive give OCR an increasingly distinctive role in the digitization of paper archives: it saves manpower and material resources, maximizes the utilization value of archival information resources, and serves the public better.

3 The significance of double-layer PDF conversion of digitized data

3.1 An important part of geological informatization

With the advance of social informatization, people depend more and more on information resources, and the demand for efficient management, retrieval and use of archival resources is increasingly urgent. Digitization is an important part of informatization construction, whose core is resource construction. Resource construction comprises three major tasks: first, scanning and digitizing paper materials and building a catalog database; second, filing and managing electronic documents; third, building a full-text database and a full-text retrieval system. Given the progress of digitization work in the archives and the needs of users, obtaining electronic information in true text form, making digitization more effective and thorough, and maximizing its usefulness to users all require applying OCR technology to convert scanned raster files into double-layer PDF, and then building a full-text database and full-text retrieval for geological data.

3.2 A prerequisite for full-text retrieval and full-text database construction for geological data

Practice has shown that full-text retrieval based on double-layer PDF documents effectively improves query efficiency. By indexing the data of the archive database together with the text layer of the double-layer PDF documents, queries can be answered from the index without accessing the database, which effectively reduces the load on the database and the system. Such a system can support at least 100,000 records, millisecond query times and thousands of concurrent accesses per second, achieving both large capacity and high speed. It runs on Linux and Windows and supports various database interfaces. It has the structure and functions of a general search engine: it segments the user's input into words, supports multi-keyword search and keyword-combination search, and is user-friendly. It can also mine user data according to customer needs, increasing the value of the archive full-text retrieval system.
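The idea of answering queries from an index built over the text layer, rather than from the database itself, can be sketched with a small inverted index in Python. This is only an illustration under stated assumptions: the document ids and texts are invented, and the simple `tokenize` function stands in for the word segmentation a production system would use for Chinese text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a stand-in for real word segmentation)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_index(documents):
    """Map each token to the set of document ids whose text layer contains it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every keyword in the query (AND search)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical text layers extracted from three double-layer PDF files.
documents = {
    "R001": "Copper ore survey report, Jurassic strata",
    "R002": "Hydrogeological survey of the northern basin",
    "R003": "Copper deposit drilling log",
}
index = build_index(documents)
both = search(index, "copper survey")
```

A multi-keyword query is resolved by intersecting the posting sets of its keywords, so the lookup touches only the index and never the source files.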

3.3 A prerequisite for standardizing a modern data center

To build a modern data center, the storage structure of electronic files must first be standardized; that is, a general, widely used format for storing and exchanging electronic file information must be established. As the latest standard for the long-term preservation of electronic files, the PDF format has been widely adopted worldwide. It offers strong compatibility, faithful preservation of the original record and a sound security control strategy, and is the best choice for long-term preservation of electronic files. It is therefore imperative to convert the digitized data in the collection into PDF format.

4 Double-layer PDF conversion method

4.1 Common double-layer PDF conversion methods

At present, double-layer PDF conversion technology in China is mature. Under existing technical conditions, the methods fall roughly into the following three types:

4.1.1 Software conversion

Commonly used software includes Adobe Acrobat, ABBYY FineReader 12 (Chinese and English recognition), Readiris Corporate 12 (high English recognition rate), Foxit Phantom 5 (the text layer can be displayed separately), Tsinghua Wentong TH-OCR XP8 (high recognition rate) and Hanwang text network 5800 (good layout recognition and good recognition of pure Chinese text). However, the recognition rate depends directly on the quality of the paper original (printing method, clarity, paper quality, etc.) and on the technical skill of the operator. If the paper original is of good quality, the recognition rate is relatively high; if the quality is poor, the recognition rate is low.

4.1.2 Process-based processing

Following the relevant technical requirements, the images are run through a complete OCR recognition workflow from scratch and the PDF file is regenerated. The result has high text accuracy and accurate text positioning. Because this method amounts to producing the double-layer PDF files through the whole process again, it involves a heavy workload, takes a long time and is costly.

4.1.3 Recognition and reconstruction

The PDF file is regenerated so that the fonts, font sizes and colors of the layout are restored and reconstructed. Text accuracy is high and the pages are clear, but the result differs considerably from the original image; the method is used mainly for books.

4.2 Double-layer PDF conversion of geological data

In 2011, building on its scanning digitization work, the National Geological Archives began experimenting with double-layer PDF conversion. The experiments mainly used the first method, software conversion, in which the software produces a double-layer PDF file directly after automatic OCR processing. Geological data differ from ordinary documents: paper types and printing methods are diverse, there are many handwritten and old materials, and there are many special symbols such as stratigraphic and mathematical symbols, all of which make automatic OCR recognition difficult. Recognition by a single software package cannot meet the requirement of a recognition rate above 90% for full-text retrieval.

From the conversion tests we drew the following conclusions:

1) Geological data are diverse, and the actual recognition rate is affected mainly by printing quality, the age of the material and other factors; the recognition rate of old materials and of materials on poor paper is generally low. The recognition accuracy of handwritten documents depends on the writer's habits and legibility and is generally below 30%; that of mimeographed documents is generally below 50%; that of printed, letterpress-printed and offset-printed documents is relatively high, generally above 90%. For every kind of document, the recognition rate of punctuation marks is very low, and that of special symbols such as stratigraphic and mathematical symbols is almost zero.

2) Current recognition technology does not reach 100% accuracy, so the initial recognition results must be manually proofread, as far as actual needs require, to meet the requirements of full-text retrieval.

3) Scanned geological data files are large in number and in size, and the conversion speed is limited by the speed of the computer. High-specification computers are needed for large-scale conversion and recognition. Batch conversion and manual proofreading are time-consuming and labor-intensive, and dedicated funding is needed to support the work.
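The paper does not state how its recognition rates were measured. One common way to quantify them, assumed here purely for illustration, is character accuracy: one minus the edit distance between the OCR output and a manually verified reference, divided by the length of the reference. The Python sketch below implements that measure and checks it against the 90% requirement mentioned above; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def recognition_rate(reference, ocr_output):
    """Character accuracy: 1 - edit_distance / reference length, floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


def meets_threshold(reference, ocr_output, threshold=0.90):
    """True if the OCR output reaches the required recognition rate."""
    return recognition_rate(reference, ocr_output) >= threshold
```

With this measure, a page whose OCR output differs from the proofread reference in 3 of 10 characters scores 0.7 and would be routed to manual proofreading.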

4.3 Overview and functions of the OCR digital processing system

After comparing the double-layer PDF conversion methods currently used in China, and taking into account the complexity of geological data and the results of the conversion tests, we suggest that double-layer PDF conversion of geological data mainly combine software recognition with process-based processing (Sections 4.1.1 and 4.1.2); that is, an OCR digital processing system is used to ensure efficient, high-quality double-layer PDF conversion. The system comprises the following main steps:

Figure 1 OCR digital processing system schematic diagram

1) Image processing. To improve the recognition rate, the image is "de-blued and decontaminated" to remove noise that lowers the recognition rate, such as specks and underlines. An image quality control program monitors the processing quality automatically.

2) Layout analysis. The system automatically interprets and locates the layout, determines whether each framed area is a horizontal text area, a vertical text area, a table area or an image area, and marks areas of different types with outlines of different colors. Automatic layout analysis runs in the background; the operator confirms the result in the foreground and can correct it manually where necessary.

3) Recognition. Text images are converted into computer character codes. Printed and handwritten Chinese (simplified and traditional), mixed Chinese-English text and tables can be recognized, and the output can be encoded in GB, Big5, GBK or Unicode. Recognition runs in the background.

4) Vertical proofreading. This step has strong error detection and correction capability: character images that were recognized as the same character, in one image or across several, are displayed together, and suspicious characters are marked in a prominent color, so that operators can spot and correct errors easily.

5) Horizontal proofreading. This is the traditional manual proofreading method: the operator compares the recognized text directly with the image to find recognition errors. The system automatically brings up the image corresponding to the text for comparison and uses eye-catching colors to indicate the recognition confidence of the text.

6) Layout restoration. The recognized and corrected text is restored into a digital document with the same layout as the scanned original, in RTF, PDF, HTML or SGML/XML format, which can be read and searched by computer.

7) Data warehousing. The digital documents produced by layout restoration are stored in the database.
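The grouping behind vertical proofreading (step 4) can be sketched in Python as follows. This is a simplified illustration, not the system's actual implementation: the `Glyph` record, the availability of a per-character confidence score and the 0.80 cutoff are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Glyph:
    """One recognized character and where its image came from (hypothetical record)."""
    char: str          # character the recognizer produced
    confidence: float  # recognizer confidence in [0, 1], assumed to be available
    page: int
    box: tuple         # (x0, y0, x1, y1) of the character image on the page


def group_for_vertical_proofreading(glyphs, suspicious_below=0.80):
    """Group glyphs by recognized character, least confident first.

    Returns {char: [(glyph, is_suspicious), ...]} so that all images read as
    the same character can be shown side by side with doubtful ones flagged.
    """
    groups = defaultdict(list)
    for g in glyphs:
        groups[g.char].append(g)
    return {
        char: [(g, g.confidence < suspicious_below)
               for g in sorted(items, key=lambda g: g.confidence)]
        for char, items in groups.items()
    }


# Made-up recognition results spread over several pages.
glyphs = [
    Glyph("岩", 0.98, 1, (10, 10, 40, 40)),
    Glyph("层", 0.91, 1, (45, 10, 75, 40)),
    Glyph("岩", 0.55, 3, (200, 90, 230, 120)),
]
review = group_for_vertical_proofreading(glyphs)
```

An operator scanning the group for one character sees the low-confidence image from page 3 first and can correct it without reading the surrounding text, which is what distinguishes this step from horizontal proofreading.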

4.4 Methods to improve the OCR recognition rate

The double-layer PDF generated by the OCR digital processing system can reach a text-layer error rate as low as one in ten thousand. It preserves the shading and color of the original and supports full-text retrieval and copying; search hits are located precisely on the page, which makes it easy to find the target information quickly. To reduce the workload of horizontal (manual) proofreading and improve efficiency, the recognition rate itself must be raised. Our experiments show that the following methods improve the OCR recognition rate of raster files.

1) Image color setting. Grayscale and color modes reproduce the appearance of the paper original most faithfully and are therefore the first choice for scanning digitization, but both increase the background noise that lowers the recognition rate. If the material contains only text and ordinary black-and-white illustrations, we suggest setting the scanner's image color to black and white to raise the recognition rate. The final color setting should nevertheless follow the specification requirements of the particular job.

2) Resolution setting. The lower the scan resolution, the faster the scan, but the poorer the image quality and the lower the character recognition accuracy; the higher the resolution, the slower the scan but the higher the accuracy. This is not absolute, however: when the resolution is set too high, tiny flaws on the paper may be mistaken for punctuation marks or Chinese characters, and recognition accuracy falls. In our repeated tests, 300 dpi gave the best balance between scanning speed and character recognition accuracy.

3) Image processing. Image processing here means tilt correction and decontamination performed before the scanned image is output. Tilt correction rotates the text so that it is upright, which helps OCR recognition.
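Two of the settings above lend themselves to a short worked example in Python: the pixel size of a page at a given resolution, and the conversion of a grayscale image to black and white. The sketch uses plain lists instead of an imaging library, and the fixed threshold of 128 is an assumption for illustration; scanning software typically chooses its threshold in its own way.

```python
def pixels_at_dpi(width_mm, height_mm, dpi):
    """Pixel dimensions of a page scanned at the given resolution (25.4 mm per inch)."""
    return round(width_mm / 25.4 * dpi), round(height_mm / 25.4 * dpi)


def binarize(gray_rows, threshold=128):
    """Convert 8-bit grayscale rows to black (0) and white (255) with a fixed global threshold."""
    return [[255 if value >= threshold else 0 for value in row]
            for row in gray_rows]


# An A4 page (210 mm x 297 mm) at the 300 dpi setting discussed above.
a4_300 = pixels_at_dpi(210, 297, 300)

# A tiny made-up grayscale patch: dark ink, light paper, and two mid-gray pixels.
patch = [[30, 200],
         [127, 128]]
black_white = binarize(patch)
```

Doubling the resolution to 600 dpi quadruples the pixel count per page, which is one reason the scanning speed drops as the resolution rises.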

Once the double-layer PDF conversion is complete, the data management system can be linked to the PDF files, and the data content, metadata and other related information can be combined into a data package. An index file is then created from the original data of the full-text database, and full-text retrieval is finally realized. A full-text database with full-text retrieval gives high recall and precision, which greatly increases the utilization value of geological data, promotes the compilation and study of geological data, and lays a foundation for research on, and in-depth services based on, the aggregation of geological data information.
