Object detection: an introduction to YOLO and SSD

Object detection is one of the three major tasks of computer vision, alongside image classification and image segmentation; its goal is to locate and classify the objects of interest in an image. Traditional vision approaches draw on techniques such as the Hough transform, sliding windows, feature extraction, edge detection, template matching, Haar features, DPM, BoW and classical machine learning (for example random forests and AdaBoost). With the help of convolutional neural networks, object detection has made great progress in recent years. It is widely used: in autonomous driving, for example, unmanned vehicles rely on object detection to find other vehicles, pedestrians and traffic signs.

Commonly used object detection frameworks fall into two categories. The first is the two-stage (two-shot) approach, which separates the detection of regions of interest from their classification; representative examples are R-CNN, Fast R-CNN and Faster R-CNN. The second is the one-stage (one-shot) approach, which uses a single network to detect and classify regions of interest at the same time; it is represented by YOLO (v1, v2, v3) and SSD.

Two-stage methods appeared earlier. Because they handle the detection and the classification of regions of interest separately, their accuracy is relatively high but their real-time performance is relatively poor, which makes them ill-suited to scenarios such as perception for autonomous driving and unmanned vehicles. This article therefore focuses on SSD and the YOLO family of frameworks.

SSD was proposed in 2016 by W. Liu et al. in the paper SSD: Single Shot MultiBox Detector. Although it appeared slightly later than YOLO (v1) in the same year, it is both faster and more accurate.

The SSD framework adds some extra structures on top of a base CNN (the authors use VGG-16, but other networks can be substituted), which gives the network the following characteristics:

Multi-scale feature map detection

The authors append several feature layers after VGG-16. The size of these layers decreases progressively, which allows predictions to be made at different scales: the deeper and smaller the feature map, the larger the objects it can predict.
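To make the link between feature-map depth and object size concrete, here is a minimal sketch of a linear scale rule of the kind described in the SSD paper. The values s_min = 0.2 and s_max = 0.9 and the six feature-map sizes are assumptions taken from the paper's 300×300 configuration, not from this article, and released implementations differ in the details.

```python
def prior_box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Linearly spaced prior-box scales, one per feature map (shallow to deep).

    s_min and s_max are assumed defaults, expressed as a fraction of the image side.
    """
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


# Assumed feature-map sizes, ordered from shallow (large map) to deep (small map).
feature_map_sizes = [38, 19, 10, 5, 3, 1]
scales = prior_box_scales(len(feature_map_sizes))

for size, scale in zip(feature_map_sizes, scales):
    print(f"{size}x{size} feature map: prior boxes cover about {scale:.2f} of the image side")
```

The smallest (deepest) map gets the largest scale, which is why it is the one responsible for large objects.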

Convolutional network prediction

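Unlike YOLO, which ends in fully connected layers, SSD makes its predictions with convolutions: for each $m \times n$ feature map with $p$ channels that is used for prediction, it applies $(c + 4)k$ convolution kernels of size $3 \times 3 \times p$, where $k$ is the number of prior boxes placed in each cell and $c$ is the number of categories to predict.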
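As a rough illustration of what a convolutional predictor looks like in terms of shapes, the sketch below computes the kernel size, the number of filters and the output tensor shape for one feature map. It assumes an m × n feature map with p channels, k prior boxes per cell, c categories, 3 × 3 kernels and padding that preserves the m × n grid; the function name and the example numbers are hypothetical.

```python
def predictor_shape(m, n, p, k, c):
    """Shapes of a convolutional SSD-style predictor on an m x n x p feature map.

    k is the number of prior boxes per cell, c the number of categories.
    """
    kernel = (3, 3, p)            # each filter spans all p input channels
    num_filters = k * (c + 4)     # c class scores plus 4 box offsets per prior box
    output = (m, n, num_filters)  # assumes padding that keeps the m x n grid
    return kernel, num_filters, output


# Hypothetical example: a 19x19 map with 1024 channels, 6 priors per cell, 21 categories.
kernel, num_filters, output = predictor_shape(19, 19, 1024, 6, 21)
print(kernel, num_filters, output)
```

Because the same small kernels slide over every cell, the number of predictor parameters does not depend on the spatial size of the feature map, unlike a fully connected output layer.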

Setting prior boxes

For each cell of a feature map, we place a set of prior boxes. Then, for every prior box of every cell, we predict the offsets of the box and a confidence for each category. For example, for an $m \times n$ feature map with $k$ prior boxes per cell and $c$ categories to predict, the output size is $(c + 4)kmn$. (This is reflected in the training process.)
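The placement of prior boxes can be sketched as follows. This is a simplified illustration, not the authors' code: it assumes boxes are stored as (cx, cy, w, h) in normalized image coordinates, that every cell gets one box per aspect ratio at a single scale, and that the aspect ratios (1, 2, 1/2) and the 5 × 5 example are arbitrary choices.

```python
import math


def make_prior_boxes(m, n, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) prior boxes, in [0, 1] coordinates, for an m x n feature map."""
    boxes = []
    for row in range(m):
        for col in range(n):
            # Centre of the cell, normalized by the size of the feature map.
            cx = (col + 0.5) / n
            cy = (row + 0.5) / m
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes


m, n, c = 5, 5, 21                 # hypothetical feature-map size and category count
priors = make_prior_boxes(m, n, scale=0.5)
k = len(priors) // (m * n)         # prior boxes per cell
num_outputs = (c + 4) * len(priors)
print(k, len(priors), num_outputs)
```

Each prior box contributes c confidences and 4 offsets, so the total number of predicted values is (c + 4) times the number of prior boxes, that is (c + 4)·k·m·n.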

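Here, if $d = (d^{cx}, d^{cy}, d^{w}, d^{h})$ denotes the center position, width and height of the prior box and $b = (b^{cx}, b^{cy}, b^{w}, b^{h})$ denotes those of the predicted box, the offsets that the network actually predicts are $l^{cx} = (b^{cx} - d^{cx}) / d^{w}$, $l^{cy} = (b^{cy} - d^{cy}) / d^{h}$, $l^{w} = \log(b^{w} / d^{w})$ and $l^{h} = \log(b^{h} / d^{h})$.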
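The offset encoding used by SSD, and its inverse, can be sketched in a few lines. The sketch assumes boxes are (cx, cy, w, h) tuples, that center offsets are normalized by the prior's width and height, and that width and height are encoded as log ratios; the variance scaling that many implementations add is left out, and the function names are hypothetical.

```python
import math


def encode(box, prior):
    """Offsets of `box` relative to `prior`; both are (cx, cy, w, h)."""
    b_cx, b_cy, b_w, b_h = box
    d_cx, d_cy, d_w, d_h = prior
    return (
        (b_cx - d_cx) / d_w,   # centre shift in units of the prior's width
        (b_cy - d_cy) / d_h,   # centre shift in units of the prior's height
        math.log(b_w / d_w),   # log ratio of widths
        math.log(b_h / d_h),   # log ratio of heights
    )


def decode(offsets, prior):
    """Inverse of `encode`: recover the (cx, cy, w, h) box from predicted offsets."""
    l_cx, l_cy, l_w, l_h = offsets
    d_cx, d_cy, d_w, d_h = prior
    return (
        d_cx + l_cx * d_w,
        d_cy + l_cy * d_h,
        d_w * math.exp(l_w),
        d_h * math.exp(l_h),
    )


prior = (0.5, 0.5, 0.2, 0.4)
box = (0.55, 0.45, 0.3, 0.3)
offsets = encode(box, prior)
recovered = decode(offsets, prior)
print(offsets)
print(recovered)
```

During training the ground-truth box is encoded against its matched prior box and used as the regression target; at inference time the predicted offsets are decoded back into a box.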

The overall structure of SSD is as follows. The first five convolutional stages of VGG-16 form the base, and a series of additional convolution layers is cascaded after them. Six of these layers are each convolved (or, for the last layer, average-pooled) to make predictions, the predictions are gathered into a single output, and the final result is obtained through non-maximum suppression (NMS).
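The non-maximum suppression step can be illustrated with a minimal greedy version for a single category. This is a sketch under assumptions: boxes are given as (x1, y1, x2, y2) corners, and the IoU threshold of 0.45 is an assumed default rather than a value stated in this article.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In the example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box is far away and survives. In a full detector this step would be run separately for each category.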

Four feature maps are used for detection, with sizes …, …, … and …; each cell of these feature maps has …, …, … and … preset prior boxes respectively, so the network predicts … bounding boxes in total, and the output dimension (before non-maximum suppression) is ….

To be continued.

References:

chenxp2311's CSDN blog: Paper reading: SSD: Single Shot MultiBox Detector

Xiaojiang Tiger Column: Object Detection | SSD Principle and Implementation

LittleYii's CSDN blog: Object detection paper reading: YOLOv1-YOLOv3 (1)
