Object Detection Series (1): R-CNN

Object detection is an important field in computer vision. Before convolutional neural networks appeared, detection and localization relied on hand-crafted image features; those traditional methods were not only slow but also performed poorly. The arrival of convolutional neural networks changed the field dramatically. The best-known detection systems are the R-CNN family, YOLO, and SSD. This article introduces the first member of the R-CNN family: R-CNN itself.

For the technological evolution of the R-CNN family, see "Evolution of deep-learning-based object detection: R-CNN, Fast R-CNN, Faster R-CNN" (listed in the references).

Object detection involves two steps: the first is classification, i.e. determining what is in the image; the second is localization, i.e. finding where in the image each object is. Simply put: what is in the image, and where.

However, objects vary in size and position across images (multi-scale), as well as in viewing angle and pose, and multiple categories can appear in a single image at the same time. All of this makes object detection a hard task.

In technical terms, the task above is: image classification + localization.

Two branches perform the two functions: a classification branch and a regression (localization) branch. The regression branch and the classification branch share the parameters of the network's convolutional layers.

This is still the classification-plus-regression idea from above. The difference is that instead of feeding the whole image into the network as in Idea 1, we now take boxes at different positions in advance, feed each box into the network, compute a score for it, and keep the box with the highest score.

As above, consider recognizing and localizing a cat in the same image. Take four boxes, one at each corner, and run classification and regression on each. Suppose their scores are 0.5, 0.75, 0.6, and 0.8; the bottom-right box scores highest, so the black box at the bottom right is chosen as the predicted object position (which completes the localization task).

This raises another question: how should the boxes be chosen, and how large should they be? Above we took four 221x221 corner boxes from a 257x257 image. Scanning from the top-left to the bottom-right corner with windows of different sizes produces an enormous number of crops, and handling the multi-scale problem requires rescaling the image to several resolutions, which multiplies the computation further. How to obtain boxes is one of the core problems of object detection; R-CNN, Fast R-CNN, and Faster R-CNN progressively refine the solution, as discussed later.
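To make the brute-force flavor of this concrete, here is a minimal sketch (not from the original paper) of enumerating fixed-size windows and scoring each one. `score_box` is a hypothetical stand-in for a CNN classification head that returns the confidence that a crop contains the target object.

```python
# Sketch of the "take boxes, score each box" idea described above.
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

def sliding_window_boxes(img_w: int, img_h: int,
                         win: int, stride: int) -> List[Box]:
    """Enumerate square windows of size `win` over the image."""
    boxes = []
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            boxes.append((x, y, x + win, y + win))
    return boxes

def best_box(image, boxes: List[Box], score_box: Callable):
    """Score every candidate box and keep the highest-scoring one."""
    return max((score_box(image, b), b) for b in boxes)

# For the 257x257 image with 221x221 windows from the text, a stride of 36
# yields exactly the four corner boxes:
print(sliding_window_boxes(257, 257, win=221, stride=36))
```

Even this toy version makes the problem obvious: with smaller windows and strides, the number of crops to classify explodes.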

To summarize the idea so far:

For an image, crop it with boxes of various sizes, feed each crop into a CNN, and let the CNN output the crop's category and a position score.

For choosing detection boxes, we usually use some method to propose boxes that are likely to contain objects (candidate boxes, or region proposals; say, around 1,000 of them). These boxes may overlap and contain one another, which avoids brute-force enumeration of all possible boxes; a sketch follows below.
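As an illustration, here is a sketch of generating candidate boxes with OpenCV's selective search implementation (from the opencv-contrib-python package). The original R-CNN used the authors' MATLAB implementation, so treat this as an illustrative substitute; the input file name is assumed.

```python
# Generate region proposals with OpenCV's selective search (illustrative).
import cv2

img = cv2.imread("input.jpg")  # assumed input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # "fast" mode trades recall for speed
rects = ss.process()               # array of (x, y, w, h) proposals

proposals = rects[:2000]           # keep ~2k boxes, as R-CNN does
print(f"{len(rects)} raw proposals, keeping {len(proposals)}")
```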

With that background, let us take a closer look at the R-CNN family, starting with the R-CNN method itself.

Compared with earlier object detection algorithms, R-CNN greatly improves both accuracy and efficiency. The R-CNN pipeline has four stages: (1) generate roughly 2,000 region proposals with selective search; (2) warp each proposal to a fixed size and extract a feature vector with a CNN (AlexNet); (3) score each feature vector with per-category linear SVMs; (4) refine box positions with class-specific bounding-box regression.

We have briefly introduced the selective search method, which yields roughly 2k candidate boxes. These rectangles come in all shapes and sizes, but because of AlexNet's final fully connected layers, the network requires a fixed input size. Before feeding the candidates to the network, the author therefore rescales them all to a uniform size, using two approaches:

(1) Anisotropic scaling: warp the cropped region directly to 227x227, ignoring its aspect ratio.

(2) Isotropic scaling: because the distortion introduced by warping might affect subsequent CNN training, the author also tested aspect-ratio-preserving scaling, in two variants: either extend the box to a square in the original image (keeping the surrounding context) before cropping and scaling, or crop the box as-is and pad it to a square with the mean pixel value before scaling.

In addition, the author experimented with context padding around the box. In the schematic from the paper, rows 1 and 3 show results with padding = 0, and rows 2 and 4 with padding = 16. The final experiments showed that anisotropic scaling with padding = 16 gives the highest accuracy; a sketch follows below.
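Below is a simplified sketch of this preprocessing: grow the box to include roughly 16 pixels of context (measured in the warped frame), crop with clipping at the image borders, and warp anisotropically to 227x227. The paper defines the padding exactly in the warped coordinate frame; approximating it in the original image, as here, is a common shortcut.

```python
# Crop a proposal with ~16 pixels of warped context, then warp to 227x227.
import cv2

def warp_proposal(img, box, out_size=227, pad=16):
    h, w = img.shape[:2]
    x1, y1, x2, y2 = box
    # Scale the 16-pixel context from output coordinates back to image coords.
    sx = (x2 - x1) / out_size
    sy = (y2 - y1) / out_size
    x1 = max(0, int(x1 - pad * sx)); y1 = max(0, int(y1 - pad * sy))
    x2 = min(w, int(x2 + pad * sx)); y2 = min(h, int(y2 + pad * sy))
    crop = img[y1:y2, x1:x2]
    # Anisotropic scaling: aspect ratio is not preserved.
    return cv2.resize(crop, (out_size, out_size))
```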

Training the convolutional neural network proceeds in two steps: (1) pre-training; (2) fine-tuning.

First, the model is trained on a large dataset (the network used in R-CNN is AlexNet). The trained model is then fine-tuned (transfer learning): the model parameters are initialized from the pre-trained model, and training continues on the target dataset.
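A minimal sketch of this recipe with torchvision's AlexNet: start from ImageNet weights, replace the 1000-way classifier with an (N + 1)-way head (the extra class is background), and fine-tune all layers. NUM_CLASSES = 20 assumes PASCAL VOC, and the optimizer settings are illustrative rather than the paper's exact values.

```python
# Pre-train / fine-tune sketch using torchvision's AlexNet.
import torch
import torch.nn as nn
from torchvision.models import alexnet, AlexNet_Weights

NUM_CLASSES = 20                                        # assumed: PASCAL VOC
model = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1)  # ImageNet pre-training
model.classifier[6] = nn.Linear(4096, NUM_CLASSES + 1)  # +1 for background

# Fine-tune every layer with a small learning rate (illustrative settings).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```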

Additionally, during training the author experimented with different numbers of fully connected layers and found that one fully connected layer worked better than two, possibly because two fully connected layers led to over-fitting.

Another interesting observation: in a CNN, the convolutional layers act as a general-purpose feature extractor, analogous to traditional hand-crafted feature extraction, while the final fully connected layers learn task-specific features. In face gender recognition, for example, the convolutional layers at the front of the model learn face-like features, and the fully connected layers learn the features used to classify gender.

Finally, the trained model is used to extract features from the candidate boxes.
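A sketch of this feature-extraction step, reusing the fine-tuned AlexNet from the snippet above and reading out 4096-dimensional fc7 activations (the paper reports results for both pool5 and fc7 features; fc7 is assumed here):

```python
# Extract fc7 features from warped proposal crops.
import torch

@torch.no_grad()
def extract_features(model, batch):
    """Return (N, 4096) fc7 features for a (N, 3, 227, 227) batch of crops."""
    model.eval()                     # disables dropout in the classifier
    x = model.features(batch)        # convolutional layers
    x = model.avgpool(x)
    x = torch.flatten(x, 1)
    return model.classifier[:6](x)   # stop right after fc7's ReLU
```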

On positive and negative samples: because a proposed bounding box will never coincide exactly with a human-labeled one, an IoU threshold is needed to label boxes during the CNN training stage. The author sets this threshold to 0.5: if the overlap between a candidate box and a ground-truth box exceeds IoU 0.5, the candidate is labeled with that object category (positive sample); otherwise it is treated as background (negative sample).
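A minimal IoU helper and the 0.5-threshold labeling rule, as a sketch (the (x1, y1, x2, y2) box format and the use of label 0 for background are assumptions):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def label_proposal(box, gt_boxes, gt_labels, thresh=0.5):
    """Label 0 (background) unless IoU with some ground truth >= thresh."""
    if not gt_boxes:
        return 0
    overlaps = [iou(box, g) for g in gt_boxes]
    best = max(range(len(overlaps)), key=overlaps.__getitem__)
    return gt_labels[best] if overlaps[best] >= thresh else 0
```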

The author then trains a binary SVM for each category. The definition of positive and negative samples here differs from the one used for CNN training above. The paper tried several IoU thresholds (0.1 to 0.5) and found that 0.3 worked best (choosing 0 lowered accuracy by 4 percentage points; choosing 0.5 lowered it by 5). That is, proposals whose IoU with every ground-truth box of the class is below 0.3 are treated as negatives, while the positives are the ground-truth boxes themselves.
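A sketch of this per-class SVM stage using scikit-learn's LinearSVC as a stand-in for the paper's SVMs (the paper also performs hard-negative mining, omitted here; the C value is illustrative):

```python
# Train one binary linear SVM per category on the CNN features.
import numpy as np
from sklearn.svm import LinearSVC

def train_class_svm(gt_features, proposal_features, proposal_max_iou):
    """Binary SVM for one category.

    gt_features:       features of ground-truth windows of this class (positives)
    proposal_features: features of all region proposals
    proposal_max_iou:  each proposal's max IoU with this class's ground truth
    """
    neg = proposal_features[proposal_max_iou < 0.3]  # the 0.3 rule from the text
    X = np.concatenate([gt_features, neg])
    y = np.concatenate([np.ones(len(gt_features)), np.zeros(len(neg))])
    return LinearSVC(C=1.0).fit(X, y)
```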

Object detection is evaluated by overlap area, and many seemingly accurate detections fail simply because the candidate box is imprecise and its overlap with the ground truth is small. A localization-refinement step is therefore needed.

While implementing bounding-box regression, two subtle issues arose. First, regularization is important: we set λ = 1000 based on the validation set. Second, care is needed when choosing which training pairs (P, G) to use. Intuitively, if P is far from every ground-truth box, the task of transforming P into a ground-truth box G makes no sense, and examples like P would lead to a hopeless learning problem. We therefore learn only from proposals P that are near at least one ground-truth box. "Near" is implemented by assigning P to the ground-truth box G with which it has maximum IoU (in case it overlaps more than one), and only if that overlap exceeds a threshold (0.6, chosen on the validation set). All unassigned proposals are discarded. We do this once per object category to learn a set of class-specific bounding-box regressors.
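As a sketch, here are the regression targets (tx, ty, tw, th) from the paper, and a ridge-regression fit in which scikit-learn's alpha plays the role of λ = 1000 (the paper regresses from pool5 features; the choice of features here is an assumption):

```python
# Class-specific bounding-box regression: targets plus a ridge fit.
import numpy as np
from sklearn.linear_model import Ridge

def regression_targets(P, G):
    """R-CNN targets (tx, ty, tw, th) for proposal P and ground truth G,
    both given as (x1, y1, x2, y2)."""
    pw, ph = P[2] - P[0], P[3] - P[1]
    gw, gh = G[2] - G[0], G[3] - G[1]
    px, py = P[0] + pw / 2, P[1] + ph / 2   # proposal center
    gx, gy = G[0] + gw / 2, G[1] + gh / 2   # ground-truth center
    return np.array([(gx - px) / pw, (gy - py) / ph,
                     np.log(gw / pw), np.log(gh / ph)])

def fit_bbox_regressor(features, targets):
    """features: (N, D) CNN features of proposals with IoU > 0.6 to their
    assigned ground truth; targets: (N, 4) stacked regression_targets."""
    return Ridge(alpha=1000.0).fit(features, targets)  # alpha acts as lambda
```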

At test time, we score each proposal and predict its new detection box once. In principle this procedure could be iterated (re-score the newly predicted box, predict a new box from it, and so on), but we found that iterating does not improve results.

At test time, selective search extracts about 2,000 region proposals from the image; each proposal is warped to 227x227 and propagated forward through the CNN, and the features of the last layer are extracted. Then, for each category, the SVM classifier trained for that category scores the extracted feature vectors, giving every region proposal in the test image a score for that class. Greedy non-maximum suppression (NMS) then removes overlapping redundant boxes, and finally the surviving boxes are refined with class-specific bounding-box regression.
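For completeness, a minimal greedy NMS, reusing the iou() helper defined earlier (the IoU threshold is tuned per class in practice; 0.3 here is illustrative):

```python
def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression: keep the best box, drop overlaps, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep  # indices of the surviving boxes
```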

References:

Rich feature hierarchies for accurate object detection and semantic segmentation (Girshick et al., the R-CNN paper).

RCNN — the pioneering work that introduced CNNs into object detection — article by Xiao Lei.

Evolution of deep-learning-based object detection: R-CNN, Fast R-CNN, Faster R-CNN.

Translation of the R-CNN paper.