Pixel picture material: how to do simple verification code recognition with Python

Pixelstudio: photo size when importing

First, open the software and tap the Add Picture function in the toolbar at the bottom.

Then you can add pictures directly from the material library, or tap "From device" below to import pictures from your phone.

After tapping "From device", you can import a pixel image stored on the phone or a photo from the album, but the size should be kept within 256*256 pixels.

As shown in the figure, you can adjust the size and position of the imported picture material.

How to save a picture in a pixel wallpaper

Using PixelStyle Photo Editor, you can save an image document in TIFF, JPEG, PNG, GIF, BMP, PSDB, JPEG 2000, PDF, or SVG format.

Step 1: Select "File" -> "Save" in the menu bar.

Step 2: In the Save dialog box, type a name for the image.

Step 3: Select the file format to save.

Exportable file formats and corresponding options.

Tip: If you haven't finished editing and want to reopen the file later, save the document as a PixelStyle image (PSDB).

Step 4: Click the Save button to save the changes to the document.

How to do simple verification code recognition with Python

1 Abstract

Verification codes are very common and important on the Internet today, and they serve as a safeguard for many systems. However, with the development of OCR technology, the security problems exposed by verification codes have become more and more serious. This paper introduces a complete process for recognizing character verification codes, which is a useful reference for both verification code security and OCR recognition technology.

Later, after a year of research, the author obtained a more powerful, direct end-to-end verification code recognition technique based on a CNN (convolutional neural network). That article is not mine; I later organized its source code, and the introduction and source code are here:

TensorFlow end-to-end character verification code recognition in Python: organized source code (shared on GitHub)

2 Keywords

Keywords: security, character picture, verification code recognition, OCR, Python, SVM, PIL

3 Disclaimer

The materials used in this study come from the public image resources of an old Web framework website.

This article only fetched the public image resources of this website and did not perform any unnecessary operations.

The identity of the website was concealed when the related report was written.

The author has notified the relevant website personnel of the vulnerability in this system, and they are actively moving to a new system.

The main purpose of this report is solely OCR exchange and learning.

4 Introduction

For the introduction of the non-technical part of the verification code, please refer to a popular science article written before:

Internet security firewall (1)- popular science of network verification code

It explains the types, usage scenarios, functions, and main recognition technologies of verification codes, but does not involve any technical content. This chapter complements it on the technical side by giving a corresponding recognition solution, so that readers can gain a deeper understanding of the function and security of verification codes.

5 Basic tools

Achieving the goal of this paper requires only simple programming knowledge, because machine learning is booming and there are many packaged open source machine learning solutions. Ordinary programmers can apply these tools without understanding the complicated mathematics behind them.

Main development environment:

python3.5: Python SDK version

PIL: image processing library

libsvm: open source SVM machine learning library

The installation of the environment is not the focus of this article, so it is omitted.

6 Basic process

Generally speaking, the identification process of character verification code is as follows:

Prepare original picture materials

Image preprocessing

Image character cutting

Picture size standardization

Image character labeling

Character image feature extraction

Generate a training data set that pairs the features with the labels

Train on the labeled feature data to generate a recognition model

Use the recognition model to predict new, unknown images

Return the correct character set from the image

7 Material preparation

7.1 Material selection

Because this paper is mainly for introductory study and research, the material needs to be "representative, but not too difficult". So I looked directly on the Internet for a relatively representative and simple character verification code (it felt a bit like hunting for vulnerabilities).

Finally, I found this verification code picture on an old website (probably built on a website framework from decades ago).

Original picture:

Enlarged, clearer view:

This picture meets the requirements. Careful observation reveals the following characteristics.

Features favorable for identification:

Composed of pure Arabic numerals

The number of characters is 4.

The characters are arranged regularly.

The font is uniform.

These are the main reasons this verification code is simple, and they will be used in the subsequent code implementation.

Features unfavorable for identification:

There is interference noise in the background of the picture.

Although this is an unfavorable feature, the interference level is low and the noise can be removed by a simple method.

7.2 Material acquisition

Training requires a lot of material, and saving it manually from the browser is impractical, so it is best to write an automatic download program.

The main steps are as follows:

Use the browser's network capture function to find the interface that generates random verification code images.

Batch request interface to get pictures.

Save the picture to a local disk directory.

These are some basic IT skills, so I won't go into details in this article.

The code for the network request and file saving is as follows:

    import requests

    def downloads_pic(**kwargs):
        pic_name = kwargs.get('pic_name', None)
        url = 'http and _ code _ captcha/'  # address garbled in the source; left as-is
        res = requests.get(url, stream=True)
        # pic_path (the output directory) is assumed to be defined elsewhere
        with open(pic_path + pic_name + '.bmp', 'wb') as f:
            for chunk in res.iter_content(chunk_size=1024):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
                    f.flush()

Loop n times to save n verification code images.

The following are the renderings of dozens of collected material libraries saved to local files:

8 Image preprocessing

Although the current machine learning algorithm is quite advanced, in order to reduce the complexity of later training and improve the recognition rate, it is still necessary to preprocess the picture to make it more friendly to machine recognition.

The processing steps of the raw materials are as follows:

Read the original picture materials.

Color image binarization into black and white image

Remove background noise

8.1 Binarization

The main steps are as follows:

Convert the RGB color image into a grayscale image.

Convert the grayscale image into a binary image according to the set threshold.

    from PIL import Image

    image = Image.open(img_path)
    imgry = image.convert('L')  # convert to a grayscale image
    table = get_bin_table()
    out = imgry.point(table, '1')

The binarization function referenced above is defined as follows:

    def get_bin_table(threshold=140):
        """
        Get the mapping table from gray level to binary value
        :param threshold:
        :return:
        """
        table = []
        for i in range(256):
            if i < threshold:
                table.append(0)
            else:
                table.append(1)
        return table

After PIL conversion, the image is binary: 0 means black and 1 means white. Below is the pixel output for the image 6937 after binarization, noise included:

1111000111111000111111100001111100000011
1110111011110111011111011110111100110111
1001110011110111101011011010101101110111
1101111111110110101111110101111111101111
1101000111110111001111110011111111101111
1100111011111000001111111001011111011111
1101110001111111101011010110111111011111
1101111011111111101111011110111111011111
1101111011110111001111011110111111011100
1110000111111000011101100001110111011111

If you squint and step back from the screen, you can vaguely make out the outline of 6937.
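The thresholding step can also be tried without PIL. The following is a minimal sketch, assuming the grayscale image is held as a plain list of rows of 0..255 integers; the helper names `make_bin_table` and `binarize` and the sample values are hypothetical and only illustrate the lookup-table idea, not the author's exact code.

```python
def make_bin_table(threshold=140):
    """Lookup table mapping each gray level 0..255 to 0 (black) or 1 (white)."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_rows, threshold=140):
    """Apply the lookup table to a grayscale image given as rows of 0..255 ints."""
    table = make_bin_table(threshold)
    return [[table[value] for value in row] for row in gray_rows]


# Hypothetical 3x4 grayscale patch: dark strokes (< 140) on a light background.
gray = [
    [250, 30, 35, 245],
    [240, 20, 200, 139],
    [140, 25, 28, 255],
]
binary = binarize(gray)
for row in binary:
    print("".join(str(v) for v in row))
```

A gray level equal to the threshold counts as white here, so 139 becomes 0 and 140 becomes 1.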

8.2 Noise removal

After conversion to a binary image, the noise needs to be removed. The material selected in this paper is relatively simple, and most of the noise is of the simplest kind, isolated points, so a lot of noise can be removed by detecting these isolated points.

For removing more complex noise, and even interference lines and color blocks, there is a mature algorithm, flood fill (FloodFill), which you can study further if you are interested.

To keep the problem simple, this paper uses a simple home-made method:

Count the black points in the 3x3 neighborhood (nine-square grid) around a given black point.

If fewer than two of the surrounding points are black, the point is an isolated point; find all isolated points this way.

Remove all isolated points at once.

The specific algorithm principle is described in detail below.

As shown in the figure below, all pixels are divided into three categories.

Vertex a

Boundary point b of non-vertex

Internal point c

The schematic diagram of the point classes is as follows:

Among them:

For class A points, 3 adjacent points are counted (as shown in the red box above).

For class B points, 5 adjacent points are counted (as shown in the red box above).

For class C points, 8 adjacent points are counted (as shown in the red box above).

Of course, because the adjacent points lie in different directions relative to the reference point, class A and class B points are further subdivided:

Class A points are subdivided into: upper left, lower left, upper right and lower right.

Class B points are further subdivided into: up, down, left and right.

Class c points do not need subdivision.

Then these subdivision points will become the standard for subsequent coordinate acquisition.

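The main algorithm, implemented in Python, is as follows:

    def sum_9_region(img, x, y):
        """
        9-neighborhood box: the 3x3 box centered on the current point; counts the black points in it.
        :param x:
        :param y:
        :return:
        """
        # todo: check the lower limit of the image's width and height
        cur_pixel = img.getpixel((x, y))  # value of the current pixel
        width = img.width
        height = img.height

        if cur_pixel == 1:  # the current point is white, so its neighborhood is not counted
            return 0

        # Black is 0 and white is 1, so black count = points examined - total.
        if y == 0:  # first row
            if x == 0:  # top-left vertex, 4-neighborhood
                # the 3 points next to the center point
                total = (cur_pixel
                         + img.getpixel((x, y + 1))
                         + img.getpixel((x + 1, y))
                         + img.getpixel((x + 1, y + 1)))
                return 4 - total
            elif x == width - 1:  # top-right vertex
                total = (cur_pixel
                         + img.getpixel((x, y + 1))
                         + img.getpixel((x - 1, y))
                         + img.getpixel((x - 1, y + 1)))
                return 4 - total
            else:  # top edge, non-vertex, 6-neighborhood
                total = (img.getpixel((x - 1, y))
                         + img.getpixel((x - 1, y + 1))
                         + cur_pixel
                         + img.getpixel((x, y + 1))
                         + img.getpixel((x + 1, y))
                         + img.getpixel((x + 1, y + 1)))
                return 6 - total
        elif y == height - 1:  # last row
            if x == 0:  # bottom-left vertex
                # the 3 points next to the center point
                total = (cur_pixel
                         + img.getpixel((x + 1, y))
                         + img.getpixel((x + 1, y - 1))
                         + img.getpixel((x, y - 1)))
                return 4 - total
            elif x == width - 1:  # bottom-right vertex
                total = (cur_pixel
                         + img.getpixel((x, y - 1))
                         + img.getpixel((x - 1, y))
                         + img.getpixel((x - 1, y - 1)))
                return 4 - total
            else:  # bottom edge, non-vertex, 6-neighborhood
                total = (cur_pixel
                         + img.getpixel((x - 1, y))
                         + img.getpixel((x + 1, y))
                         + img.getpixel((x, y - 1))
                         + img.getpixel((x - 1, y - 1))
                         + img.getpixel((x + 1, y - 1)))
                return 6 - total
        else:  # y is not on the top or bottom edge
            if x == 0:  # left edge, non-vertex
                total = (img.getpixel((x, y - 1))
                         + cur_pixel
                         + img.getpixel((x, y + 1))
                         + img.getpixel((x + 1, y - 1))
                         + img.getpixel((x + 1, y))
                         + img.getpixel((x + 1, y + 1)))
                return 6 - total
            elif x == width - 1:  # right edge, non-vertex
                # print('%s,%s' % (x, y))
                total = (img.getpixel((x, y - 1))
                         + cur_pixel
                         + img.getpixel((x, y + 1))
                         + img.getpixel((x - 1, y - 1))
                         + img.getpixel((x - 1, y))
                         + img.getpixel((x - 1, y + 1)))
                return 6 - total
            else:  # interior point, full 9-neighborhood
                total = (img.getpixel((x - 1, y - 1))
                         + img.getpixel((x - 1, y))
                         + img.getpixel((x - 1, y + 1))
                         + img.getpixel((x, y - 1))
                         + cur_pixel
                         + img.getpixel((x, y + 1))
                         + img.getpixel((x + 1, y - 1))
                         + img.getpixel((x + 1, y))
                         + img.getpixel((x + 1, y + 1)))
                return 9 - total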

Tip: This part really tests one's care and patience. The workload is considerable; it took me half an evening to finish.

After calculating the number of black points around each pixel (note: in the image converted by PIL, black points have the value 0), it is only necessary to pick out the points whose count is 1 or 2 (the count includes the point itself); these are the isolated points. This criterion may not be very precise, but it basically meets the needs of this paper.

The preprocessed image is as follows:

Compared with the original picture at the beginning of the article, the isolated noise points have been removed.
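The isolated-point filter can also be written more compactly. The sketch below is an alternative formulation, not the author's code: it assumes the binary image is a plain list of rows (0 = black, 1 = white) and replaces the nine position-specific cases with a bounds check. The names `count_black_in_3x3` and `remove_isolated_points` are hypothetical, as is the sample image.

```python
def count_black_in_3x3(img, x, y):
    """Count black (0) pixels in the 3x3 window centered on (x, y), center included.

    Offsets that fall outside the image are skipped, so corner and edge
    pixels need no special cases.
    """
    height, width = len(img), len(img[0])
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height and img[ny][nx] == 0:
                count += 1
    return count


def remove_isolated_points(img, max_count=2):
    """Return a copy of img with isolated black pixels turned white (1)."""
    height, width = len(img), len(img[0])
    # Collect first, then erase, so earlier removals do not affect later counts.
    isolated = [
        (x, y)
        for y in range(height)
        for x in range(width)
        if img[y][x] == 0 and count_black_in_3x3(img, x, y) <= max_count
    ]
    cleaned = [row[:] for row in img]
    for x, y in isolated:
        cleaned[y][x] = 1
    return cleaned


# Hypothetical image: one stray black pixel at the top-left and a solid 2x2 stroke.
noisy = [
    [0, 1, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
cleaned = remove_isolated_points(noisy)
```

Here the stray pixel has a count of 1 and is removed, while every pixel of the 2x2 stroke has a count of 4 and is kept. Finding all isolated points before erasing any of them mirrors the "remove all isolated points at once" step.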

9 Character segmentation

Because a character verification code image can essentially be regarded as a series of single-character images joined together, we can simplify the research object by decomposing these images to the atomic level, that is, into images that each contain only a single character.

Therefore, our research object changes from combinations of digit strings to just the 10 Arabic numerals, which greatly simplifies and reduces what has to be processed.

9.1 Segmentation algorithm

In real life, the generation of character verification codes is varied, with various distortions and deformations. There is no uniform method for the algorithm of character segmentation. This algorithm also requires developers to carefully study the characteristics of the character image to be recognized.

Of course, the research object selected in this paper simplifies the difficulty of this step as much as possible, which will be introduced slowly below.

Open the verification code picture with image editing software (Photoshop or other), enlarge it to pixel level, and observe its parameters:

The following parameters can be obtained:

The size of the whole picture is 40*10.

The size of a single character is 6*10.

Each character has a 2-pixel margin on its left and right, so the outer characters sit 2 pixels from the image edges and adjacent characters are 4 pixels apart.

The characters touch the top and bottom edges (that is, a margin of 0 pixels).

In this way, it is easy to locate the pixel area occupied by each character in the whole picture, and then it can be segmented. The specific code is as follows:

    def get_crop_imgs(img):
        """
        Cut according to the characteristics of the picture; this must be adapted to the specific verification code.  # see the schematic diagram
        :param img:
        :return:
        """
        child_img_list = []
        for i in range(4):
            x = 2 + i * (6 + 4)  # see the schematic diagram
            y = 0
            child_img = img.crop((x, y, x + 6, y + 10))
            child_img_list.append(child_img)
        return child_img_list

This yields the cut, atomic-level single-character images.
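To check the cutting arithmetic independently of PIL, the boxes can be computed on their own. This is a sketch under the measurements given above (4 characters, each 6*10, 2-pixel outer margin, 4-pixel gap); `get_crop_boxes` and `crop` are hypothetical helpers that work on an image stored as a list of rows.

```python
def get_crop_boxes(char_count=4, char_width=6, char_height=10, margin=2, gap=4):
    """Return (left, upper, right, lower) boxes, one per character."""
    boxes = []
    for i in range(char_count):
        left = margin + i * (char_width + gap)
        boxes.append((left, 0, left + char_width, char_height))
    return boxes


def crop(img, box):
    """Cut a box out of an image stored as a list of rows."""
    left, upper, right, lower = box
    return [row[left:right] for row in img[upper:lower]]


boxes = get_crop_boxes()
print(boxes)
# [(2, 0, 8, 10), (12, 0, 18, 10), (22, 0, 28, 10), (32, 0, 38, 10)]

blank = [[1] * 40 for _ in range(10)]  # hypothetical all-white 40x10 image
pieces = [crop(blank, box) for box in boxes]
```

The last box ends at column 38, which leaves the expected 2-pixel margin before the right edge of the 40-pixel-wide image.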

9.2 Discussion

From the discussion in this part, you can see that if the interference in a verification code (distortion, noise, interference color blocks, interference lines) is not strong enough, the following two conclusions hold:

There is little difference between a 4-character verification code and a 40,000-character verification code.

Pure letters

Case-insensitive. The classification number is 26.

Case sensitive. The classification number is 52.

Pure numbers. The classification number is 10.

A combination of numbers and case-sensitive letters. The classification number is 62.

There is little difference in difficulty between pure-number verification codes and combinations of numbers and letters.

A larger character set means little if it only increases the amount of calculation without making the recognition problem fundamentally harder.

10 Size normalization

The research objects selected in this paper already have a uniform size of 6*10, so this part needs no additional treatment. But for distorted or scaled verification codes, this step is also one of the difficulties of image processing.

11 Model training steps

In the previous steps, the processing and segmentation of a single picture were completed. Next comes the training of the recognition model.

The whole training process is as follows:

Prepare a large quantity of image material that has been preprocessed and cut to the atomic level.

Manually classify the material pictures, that is, label them.

Define the recognition features of a single picture.

Train an SVM model on the labeled feature file to obtain the model file.

12 Material preparation

For the training stage, this paper downloaded a fresh batch of 3000 4-digit verification pictures of the same pattern. These 3000 pictures were then processed and cut to get 12000 atomic-level images.

From these 12000 images, some strongly interfering material that would affect training and recognition was deleted. The effect after cutting is as follows:

13 Material labeling

With the recognition method used in this paper, the machine has no concept of numbers at first. So you need to identify the material manually and tell the machine which image is a 1, and so on.

This process is called labeling.

The specific labeling method is:

Create a directory for each number from 0 to 9, with the corresponding number as the directory name (equivalent to a label).

Judge each image by eye and drag it to the corresponding digit directory.

Each directory contains about 100 pieces of material.

Generally speaking, the more labeled material there is, the stronger the discrimination and prediction ability of the trained model. For example, in this paper, with only a dozen or so labeled samples the recognition rate on new test pictures was basically zero, but with 100 it reached nearly 100%.
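Once the pictures are sorted into digit directories, the labels can be read back from the directory names. The following is a minimal sketch assuming a layout of `root/0` through `root/9`; the function name `load_labeled_paths` and the layout details are assumptions for illustration.

```python
from pathlib import Path


def load_labeled_paths(root):
    """Return (label, path) pairs from root/<digit>/<image> folders."""
    samples = []
    for digit in range(10):
        label_dir = Path(root) / str(digit)
        if not label_dir.is_dir():
            continue  # tolerate digits with no material yet
        for path in sorted(label_dir.iterdir()):
            if path.is_file():
                samples.append((digit, path))
    return samples
```

Each returned pair gives the digit label and the path of one single-character image, ready to be passed to the feature extraction step.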

14 Feature selection

For a cut single-character picture, the pixel-level magnification is as follows:

From a macro point of view, different digit pictures are essentially the corresponding pixels filled with black according to certain rules, so the features ultimately revolve around the pixels.

A digit picture is 6 pixels wide and 10 pixels high, so in the simplest and crudest way one can define 60 features: the values of its 60 pixels. But such a high dimension obviously leads to too much calculation, so the dimension can be reduced appropriately.

By consulting the corresponding literature [2], another simple and rude feature definition is given:

Counting the black pixels in each row gives 10 features.

Counting the black pixels in each column gives 6 features.

Finally, a 16-dimensional feature set is obtained. The implementation code is as follows:

    def get_feature(img):
        """
        Get the feature values of the specified image:
        1. Count by rows of pixels; with a height of 10 there are 10 dimensions, plus 6 columns, 16 dimensions in total.
        :param img:
        :return: a list of 16 values (10 row counts, then 6 column counts)
        """
        width, height = img.size
        pixel_cnt_list = []
        height = 10
        for y in range(height):
            pix_cnt_x = 0
            for x in range(width):
                if img.getpixel((x, y)) == 0:  # black point
                    pix_cnt_x += 1
            pixel_cnt_list.append(pix_cnt_x)
        for x in range(width):
            pix_cnt_y = 0
            for y in range(height):
                if img.getpixel((x, y)) == 0:  # black point
                    pix_cnt_y += 1
            pixel_cnt_list.append(pix_cnt_y)
        return pixel_cnt_list

Then the picture material is converted into features, and a feature file is generated in the format required by libSVM.
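As an illustration of that last step, the sketch below computes the row and column counts on a plain list-of-rows image and formats one sample in libsvm's text format, `<label> <index>:<value> ...`, with indices starting at 1. The helper names `get_feature_from_rows` and `to_libsvm_line` and the small sample glyph are hypothetical; the article's real glyphs are 6*10.

```python
def get_feature_from_rows(img):
    """Row-wise then column-wise black-pixel counts for a binary image (0 = black)."""
    height, width = len(img), len(img[0])
    rows = [sum(1 for x in range(width) if img[y][x] == 0) for y in range(height)]
    cols = [sum(1 for y in range(height) if img[y][x] == 0) for x in range(width)]
    return rows + cols


def to_libsvm_line(label, features):
    """Format one sample as '<label> 1:<v1> 2:<v2> ...' (indices start at 1)."""
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"


# Hypothetical 3x4 glyph instead of the article's 6x10, to keep the example short.
glyph = [
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 1],
]
features = get_feature_from_rows(glyph)
print(features)                     # [2, 1, 3, 1, 3, 2, 0]
print(to_libsvm_line(7, features))  # 7 1:2 2:1 3:3 4:1 5:3 6:2 7:0
```

Writing one such line per labeled image produces a training file that libsvm's tools can read.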
