Introduction to the Basics of Detection (1): Model Architecture
So how does a machine understand an image? Depending on the needs of the downstream task, there are three main levels of understanding.
The first is classification: the image is structured into a single category of information and described by a predetermined category label or instance ID. This is the simplest and most fundamental image-understanding task, and it was also the first task in which deep learning models achieved a breakthrough and large-scale application. ImageNet is the most authoritative benchmark here, and the annual ILSVRC competition has produced a large number of excellent deep network architectures that serve as the foundation for other tasks. In application, face recognition and scene recognition can both be framed as classification tasks.
The second is detection. Whereas classification focuses on the whole image and gives a global description, detection focuses on specific objects and requires both category and location information for each. Compared with classification, detection provides an understanding of the image's foreground and background: we need to separate the objects of interest from the background and determine each object's description (category and location). The output of a detection model is therefore a list, in which each entry gives the category and location of a detected object as a tuple (the location usually expressed by the coordinates of a rectangular bounding box).
The third is segmentation, which includes semantic segmentation and instance segmentation. The former is an extension of foreground-background separation that requires separating image regions with different semantics; the latter is an extension of the detection task that requires describing the outline of each target (finer-grained than a bounding box). Segmentation is a pixel-level description of the image: it assigns a category (or instance) to every pixel and suits scenes with high demands on understanding, such as road vs. non-road segmentation in autonomous driving.
The two-stage model is named for its two-stage processing of the image, and is also called the region-based method. We take the R-CNN series as the representative of this type.
The paper makes two contributions: 1) CNNs can be used to localize and segment objects based on regions; 2) when supervised training data are insufficient, a model pre-trained on auxiliary data can achieve good results through fine-tuning. The first contribution influenced almost all subsequent two-stage methods; the second, using a model trained on ImageNet as the base network and fine-tuning it on the detection problem, was likewise adopted in later work.
Traditional computer-vision methods often describe images with carefully hand-crafted features (such as SIFT and HOG), whereas deep learning advocates learning features from data. Experience from the image classification task shows that features learned automatically by CNNs have surpassed hand-designed ones. This paper applies convolutional networks to local regions, exploiting their ability to learn high-quality features.
R-CNN abstracts detection into two steps. The first is to propose regions of the image that may contain objects (local crops of the image, called region proposals); the paper uses the selective search algorithm for this. The second is to run the best-performing classification network of the time (AlexNet) on each proposed region to obtain the category of the object it contains.
In addition, two practices in the paper are worth noting.
The first is data preparation. Before feeding region proposals into the CNN, we need to label them against the ground truth; the metric used here is IoU (Intersection over Union). IoU is the ratio of the area of the intersection of two regions to the area of their union, and describes how much the two regions overlap.
The paper specifically notes that the choice of IoU threshold has a large influence on the results. There are two thresholds: one identifies positive samples (e.g., IoU with the ground truth greater than 0.5), the other marks negative samples, i.e., the background class (e.g., IoU less than 0.1). Proposals falling between the two are hard examples: if labeled positive they contain too much background, and if labeled negative they contain part of the object to be detected, so they are ignored during training.
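As a concrete illustration, here is a minimal Python sketch of IoU and the threshold-based proposal labeling described above. The threshold values mirror the examples in the text; the paper's actual values differ between training stages.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, gt_boxes, pos_thresh=0.5, neg_thresh=0.1):
    """Label a proposal by its best IoU against any ground-truth box."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return "positive"
    if best < neg_thresh:
        return "negative"        # background class
    return "ignored"             # hard examples between the thresholds
```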
The other point is the bounding-box regression of position coordinates. This process regresses from the region proposal toward the ground truth, with a log/exp transformation applied to keep the loss at a reasonable magnitude; it can be regarded as a standardizing operation.
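A sketch of this parameterization, following the standard form from the R-CNN paper: center offsets are normalized by the proposal size, and scale ratios are passed through a log so the targets stay in a reasonable range.

```python
import math

def bbox_regression_targets(proposal, gt):
    """Regression targets from a proposal P to a ground-truth box G.
    Boxes are (x1, y1, x2, y2)."""
    px, py = (proposal[0] + proposal[2]) / 2, (proposal[1] + proposal[3]) / 2
    pw, ph = proposal[2] - proposal[0], proposal[3] - proposal[1]
    gx, gy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    tx = (gx - px) / pw        # center offsets, normalized by proposal size
    ty = (gy - py) / ph
    tw = math.log(gw / pw)     # log keeps the scale targets well-behaved
    th = math.log(gh / ph)
    return tx, ty, tw, th      # at inference, invert with exp to decode boxes
```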
The idea of R-CNN is straightforward: transform the detection task into a region classification task, as a proof of concept for deep learning on detection. The model itself has many problems, such as the need to train three separate models (proposal, classification, and regression) and the performance cost of heavily repeated computation. Nevertheless, many of the paper's practices had a broad impact on the exploration of deep models for detection, and much follow-up work aimed to improve on it. It deserves to be called the pioneering paper of the series.
The Fast R-CNN paper points out that R-CNN is slow because the CNN is run separately on every proposal, with no shared computation. It therefore proposes running the base network once over the whole image before the R-CNN subnetwork, sharing most of the computation; hence the name Fast.
The figure above shows the architecture of Fast R-CNN. A feature extractor produces a feature map from the image, the selective search algorithm is run on the original image, and the resulting RoIs (actually coordinate tuples; the term is used interchangeably with region proposals) are mapped onto the feature map. RoI Pooling is then applied to each region of interest to obtain fixed-length feature vectors. These feature vectors are organized into positive and negative samples (keeping a fixed ratio between the two), fed in batches into the parallel R-CNN subnetwork, and classified and regressed simultaneously, with the two losses unified.
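A simplified PyTorch sketch of RoI Pooling's role. The real operator quantizes each RoI into a fixed grid of bins and max-pools within each bin (and maps image coordinates to feature-map coordinates via a spatial scale); this version approximates that with adaptive max pooling over the cropped region.

```python
import torch
import torch.nn.functional as F

def roi_pool(feature_map, roi, output_size=7):
    """Crop one RoI from the feature map and max-pool it to a fixed
    output_size x output_size grid, so every region yields an
    equal-length feature vector for the R-CNN subnetwork.
    feature_map: (C, H, W) tensor; roi: (x1, y1, x2, y2) in feature-map coords."""
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    crop = feature_map[:, y1:y2 + 1, x1:x2 + 1]        # (C, h, w), varies per RoI
    pooled = F.adaptive_max_pool2d(crop, output_size)  # (C, 7, 7), fixed
    return pooled.flatten()                            # fixed-length vector

# usage: a 512-channel feature map and one RoI mapped onto it
fm = torch.randn(512, 38, 50)
vec = roi_pool(fm, (4, 7, 20, 30))   # always 512 * 7 * 7 values
```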
The discussion at the end of the paper is also worth referring to.
This structure of Fast R-CNN is the prototype of the meta-structure adopted by mainstream two-stage detection methods. The paper's biggest contribution is unifying the feature extractor and the object classification and localization heads into a single network, improving feature-utilization efficiency through shared convolution computation.
Faster R-CNN is the foundational work of the two-stage method. Its proposed RPN network replaces the selective search algorithm, allowing the detection task to be completed end-to-end by a neural network. Roughly speaking, Faster R-CNN = RPN + Fast R-CNN. Because the RPN shares convolution computation with the R-CNN subnetwork, the extra computation it introduces is very small, letting Faster R-CNN run at 5 fps on a single GPU while reaching SOTA (state-of-the-art) accuracy.
The main contribution of this paper is the region proposal network (RPN) that replaces the earlier selective search (SS) algorithm. The RPN models the proposal task as binary classification (object or not).
The first step is to generate anchor boxes of different sizes and aspect ratios at each sliding-window position (as shown in the right half of the figure above), set an IoU threshold, and label these anchor boxes as positive or negative according to the ground truth. The training data passed to the RPN are thus anchor boxes (coordinates) plus a binary label for whether each anchor contains an object. The RPN maps each sample to one probability value and four coordinate values: the probability reflects how likely the anchor is to contain an object, and the four coordinates regress the object's position. Finally the binary-classification loss and the coordinate-regression loss are combined as the training objective of the RPN.
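A minimal sketch of anchor generation at a single sliding-window position, using the default scales and ratios from the Faster R-CNN paper (3 scales × 3 ratios = 9 anchors); in the full model these anchors are then shifted to every position of the feature map.

```python
import itertools

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Anchors centered at one position, as (x1, y1, x2, y2) offsets.
    ratio is defined as h / w; each anchor keeps the same area per scale."""
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        area = (base_size * scale) ** 2
        w = (area / ratio) ** 0.5    # from w * h = area and h / w = ratio
        h = w * ratio
        anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors                   # 9 anchors for the defaults above
```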
The region proposals produced by the RPN are filtered by their probability scores and, after a labeling process similar to the one above, passed to the R-CNN subnetwork for multi-class classification and coordinate regression, again combining the two losses through a multi-task loss.
The success of Faster R-CNN lies in "deepening" the detection task with the RPN. Its idea of generating anchor boxes with a sliding window was increasingly adopted by later work (YOLO v2, etc.). This work established the two-stage meta-structure of "RPN + R-CNN" and influenced most subsequent work.
The single-stage model has no intermediate region-proposal stage; predictions are obtained directly from the image, hence the name region-free method.
YOLO is the pioneering work of the single-stage approach. It casts the detection task as a unified, end-to-end regression problem, and gets its name from the fact that it obtains positions and classes simultaneously with a single pass over the image.
YOLO's main pipeline:
1. Data preparation: the image is scaled and divided into equal grid cells, and each cell is assigned, according to IoU with the ground truth, the samples it is responsible for predicting.
2. Convolutional network: adapted from GoogLeNet. Each grid cell predicts a conditional probability for each class, and B boxes are generated per cell; each box predicts five regression values, four encoding its position and the fifth the confidence that the box contains an object (note: any object, not a particular class) together with its positional accuracy (expressed by IoU). At test time, the score of each box is computed as (see the sketch after this list):

Pr(Class_i | Object) × Pr(Object) × IoU = Pr(Class_i) × IoU

The first factor is predicted by the grid cell and the last two by each box; multiplying through the conditional probability yields each box's score for each class. The convolutional network therefore outputs S×S×(B×5+C) prediction values in total, where S is the grid size, B the number of boxes per cell, and C the number of classes.
3. Post-processing: NMS (non-maximum suppression) filters the boxes to obtain the final predictions.
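To make the score computation in step 2 concrete, here is a minimal NumPy sketch that decodes per-box, per-class scores from the S×S×(B×5+C) output; the channel layout (boxes first, class probabilities last) is an assumption for illustration.

```python
import numpy as np

S, B, C = 7, 2, 20   # YOLO v1 defaults: 7x7 grid, 2 boxes per cell, 20 classes

def decode_scores(pred):
    """pred: (S, S, B*5 + C) network output.
    Per-box class score = conditional class probability (from the cell)
    * box confidence, i.e. Pr(Object) * IoU (from the box)."""
    boxes_part = pred[..., :B * 5].reshape(S, S, B, 5)
    confidence = boxes_part[..., 4]              # (S, S, B)
    class_prob = pred[..., B * 5:]               # (S, S, C)
    # broadcast: (S, S, B, 1) * (S, S, 1, C) -> (S, S, B, C)
    return confidence[..., None] * class_prob[:, :, None, :]

pred = np.random.rand(S, S, B * 5 + C)           # 7 x 7 x 30, as in the paper
scores = decode_scores(pred)                     # per-box class scores, fed to NMS
```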
The loss function has three parts: coordinate error, object-confidence error, and classification error. To balance the effects of class imbalance and of large versus small objects, weights are added to the loss terms, and the square roots of width and height are used in the coordinate term.
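A sketch of the square-root trick in the coordinate term, with the paper's weight λ_coord = 5: taking roots of width and height damps the penalty on large boxes so that errors on small objects are not drowned out.

```python
import numpy as np

def coord_loss(pred_box, gt_box, lambda_coord=5.0):
    """YOLO-style coordinate loss for one responsible box.
    Boxes are (x, y, w, h) with w, h >= 0."""
    px, py, pw, ph = pred_box
    gx, gy, gw, gh = gt_box
    loss = (px - gx) ** 2 + (py - gy) ** 2
    # sqrt makes the same absolute size error cost less on large boxes
    loss += (np.sqrt(pw) - np.sqrt(gw)) ** 2 + (np.sqrt(ph) - np.sqrt(gh)) ** 2
    return lambda_coord * loss
```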
YOLO put forward the new single-stage viewpoint. Compared with two-stage methods its speed advantage is obvious, and its real-time performance is impressive. But YOLO has problems of its own: the coarse grid division and the small number of boxes generated per cell limit its ability to detect small objects and objects close to one another.
Compared with YOLO, SSD's outstanding features are that it makes predictions from feature maps at multiple scales, and that it borrows the anchor idea from Faster R-CNN, placing default boxes of different sizes and aspect ratios on those feature maps. SSD is an early culmination of the single-stage line: it achieves accuracy close to that of two-stage models while being an order of magnitude faster. Subsequent single-stage work is mainly built on improvements to SSD.
Finally, let us briefly summarize the basic characteristics of detection models.
A detection model as a whole consists of a backbone network and a detection head. The former acts as a feature extractor, producing representations of the image at different scales and levels of abstraction; the latter learns category and location information from these representations and the supervision signal. Category prediction and position regression are often carried out in parallel, forming a multi-task loss for joint training.
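A minimal PyTorch sketch of this structure: a hypothetical head with two parallel 1×1 convolution branches, one for class scores and one for box offsets (channel counts are illustrative, not from any specific paper).

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the shared structure described above: a backbone produces
    a feature map; two parallel branches predict class scores and box
    offsets for each of num_anchors anchors per spatial position."""
    def __init__(self, in_channels=256, num_anchors=9, num_classes=21):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 1)
        self.reg = nn.Conv2d(in_channels, num_anchors * 4, 1)

    def forward(self, feature_map):
        # both branches read the same features; their losses are summed
        return self.cls(feature_map), self.reg(feature_map)

head = DetectionHead()
fm = torch.randn(1, 256, 38, 50)        # backbone output (illustrative size)
cls_out, reg_out = head(fm)             # parallel classification & regression
```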
The single-stage model, on the other hand, performs class prediction and position regression only once, so its convolution computation is shared more thoroughly; it runs faster and uses less memory. As readers will see in the next article, the two families are also absorbing each other's strengths, which increasingly blurs the boundary between them.