
Object Detection - YOLOv3

Traditional object detection algorithms have limited application scenarios and high maintenance costs. Applying deep learning to object detection not only adapts well across scenarios, but also enables transfer learning, which reduces cost.

Among deep learning object detection algorithms, anchor-based methods are mainly divided into one-stage and two-stage approaches.

Two-stage methods first select regions of interest, then further classify and regress the candidate boxes, and finally output the selected boxes with their corresponding classes. Two-stage models include the R-CNN series, such as R-CNN, Fast R-CNN, and Faster R-CNN. Their advantage is high accuracy; their drawback is slow speed.

One-stage methods perform regression and classification directly on the anchors to obtain the final boxes and categories. Algorithms include YOLOv2, YOLOv3, SSD, RetinaNet, and so on. One-stage models infer faster, but accuracy is comparatively lower.

In addition, there are anchor-free methods, including keypoint-based and center-based detection algorithms.

Here are some basic concepts and abbreviations:

Bounding box: the rectangle that encloses a detected object.

Anchor: a predefined prior box used as a reference for prediction.

RoI: Region of Interest.

Region proposal: a candidate region that may contain an object.

Region Proposal Network (RPN): the network that generates candidate regions.

IoU: Intersection over Union (overlap area / union area), used to measure the quality of a predicted box.

mAP: mean Average Precision.

NMS: Non-Maximum Suppression.
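As a concrete reference for the IoU definition above, here is a minimal computation for two axis-aligned boxes in (x1, y1, x2, y2) corner format (the function name and box format are illustrative, not from the original article):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```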

YOLO-series models achieve fast inference while maintaining reasonable accuracy. As the speed/accuracy comparison in the YOLOv3 paper shows, YOLOv3's inference is far faster than other models at comparable accuracy, which makes it well suited to real-time detection applications.

YOLO's name comes from "You Only Look Once", which captures the essence of the approach: a single forward pass over the image produces all detections.

YOLOv1 divides the image into an S×S grid; the grid cell containing the center of an object's ground-truth box is responsible for detecting that object.

Each grid cell predicts B bounding boxes and their corresponding confidences, where the confidence reflects both how certain the model is that the box contains an object and how accurate it believes the prediction to be. Formally, confidence = Pr(Object) × IoU(pred, truth); if no object is present, the confidence should be zero.

Each bounding box consists of 5 predicted values. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell; w and h are the predicted width and height relative to the whole image. Finally, the confidence prediction represents the IoU between the predicted box and any ground-truth box.
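For example, with the paper's settings of S = 7, B = 2, and 20 PASCAL VOC classes, the output is a 7 × 7 × (2 × 5 + 20) = 7 × 7 × 30 tensor.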

YOLOv2 optimizes on top of v1. It adopts DarkNet-19 as the backbone network, raises the input image size from 224 to 448, switches to a fully convolutional structure with batch normalization, computes anchor boxes with k-means clustering, and introduces multi-scale training so that the network learns from images of different scales during training. There is still room for improvement: low recall on small targets, poor detection of dense groups of objects, and detection accuracy that can be further optimized.
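As an illustration of the anchor-clustering step, here is a minimal k-means sketch that uses 1 − IoU between (width, height) pairs as the distance, a common implementation choice; the function names are illustrative, not from any particular library:

```python
import numpy as np

def wh_iou(wh, centers):
    """IoU between (w, h) pairs and cluster centers, as if all boxes share a corner."""
    inter = np.minimum(wh[:, None, 0], centers[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centers[None, :, 1])
    union = wh[:, 0] * wh[:, 1]
    union = union[:, None] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=9, iters=100):
    """Cluster ground-truth (w, h) pairs into k anchor shapes, distance = 1 - IoU."""
    centers = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centers), axis=1)  # highest IoU = nearest
        centers = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i)
                            else centers[i] for i in range(k)])
    return centers[np.argsort(centers[:, 0] * centers[:, 1])]  # sorted by area
```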

YOLOv3 uses a deeper backbone network, DarkNet-53, and adds multi-scale prediction, with 9 anchors of different scales obtained by clustering on the COCO dataset. It uses sigmoid activations for classification, which supports multi-label prediction per target. YOLOv3 offers fast inference, good cost/performance, and strong versatility. Its disadvantages are a comparatively low recall rate, imperfect localization accuracy, and relatively weak detection of groups and of small objects that are close together or occluded.

YOLOv3 makes many changes relative to v1.

Bounding box prediction

YOLOv3 uses the clustered prior boxes as anchor boxes, and the network predicts four coordinate offsets (tx, ty, tw, th) for each bounding box. If the cell is offset from the top-left corner of the image by (cx, cy) and the prior box has width pw and height ph, the predictions are:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw · e^tw
bh = ph · e^th
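A minimal decode sketch of these equations (NumPy; the variable names follow the paper, everything else is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Map raw network outputs to a box center/size in input-image pixels."""
    bx = (sigmoid(tx) + cx) * stride   # cell offset -> pixel coordinates
    by = (sigmoid(ty) + cy) * stride
    bw = pw * np.exp(tw)               # priors are already in pixels
    bh = ph * np.exp(th)
    return bx, by, bw, bh

# e.g. cell (6, 6) of a 13x13 map (stride 32) with prior (116, 90)
print(decode_box(0.2, -0.1, 0.3, 0.1, 6, 6, 116, 90, 32))
```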

YOLOv3 predicts an objectness score for each bounding box with logistic regression. The prior that overlaps a ground-truth box more than any other should have an objectness score of 1. Priors that are not the best match but still overlap a ground-truth box above a threshold are ignored rather than penalized.

Category prediction

Independent sigmoid classifiers are used instead of a softmax; a softmax is unnecessary and would force classes to be mutually exclusive, whereas sigmoids allow multi-label predictions (e.g., "woman" and "person" for the same box).
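A tiny illustration of why independent sigmoids permit multiple labels (pure NumPy; the logits are made up):

```python
import numpy as np

def class_probs(logits):
    # Each class gets its own sigmoid, so probabilities are independent
    # and more than one class can exceed 0.5 for the same box.
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))

print(class_probs([2.0, 1.5, -3.0]))  # roughly [0.88, 0.82, 0.05]
```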

Multi-scale prediction

YOLOv3 uses k-means clustering to determine the bounding-box priors, selecting 9 clusters and 3 scales and then dividing the clusters evenly across the scales. On the COCO dataset the nine clusters are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), and (373×326).

Feature extraction

YOLOv3 uses Darknet-53 for feature extraction. Its distinguishing feature is the addition of residual connections, and it is deeper than the previous network (it has 53 convolutional layers, hence the name Darknet-53).
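A minimal PyTorch-style sketch of the Darknet-53 residual unit (1×1 reduce, 3×3 expand, skip connection); the class name and layer settings are illustrative:

```python
import torch.nn as nn

class DarknetResidual(nn.Module):
    """1x1 bottleneck + 3x3 conv, with a skip connection (Darknet-53 style)."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, half, 1, bias=False),
            nn.BatchNorm2d(half),
            nn.LeakyReLU(0.1),
            nn.Conv2d(half, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # residual: identity plus transform
```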

The whole YOLOv3 forward pipeline works as follows:

Each output branch corresponds to prior boxes of three sizes (3 × 3 = 9 scales in total). In the 32× down-sampled grid, each cell corresponds to a 32×32 area of the input image, which suits large targets; the 8× down-sampled grid suits small targets.

The height h and width w of the output feature map are equivalent to dividing the image into an h×w grid, rather than drawing a grid directly on the image. In other words, the 32× down-sampled output is equivalent to overlaying a grid on the input image in which each cell corresponds to one point on the output feature map.
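For example, with a 416×416 input, the three output maps are 13×13 (stride 32), 26×26 (stride 16), and 52×52 (stride 8), which is equivalent to dividing the input into 13×13, 26×26, and 52×52 grids respectively.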

The C channels of the feature map carry the prediction-box information: coordinates, objectness confidence, and classification scores.

C = B × (1 + 4 + class_num), where B is the number of anchor boxes assigned to that feature map.
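For example, on COCO with B = 3 anchors per scale and class_num = 80, each output map has C = 3 × (1 + 4 + 80) = 255 channels.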

There are three losses: classification loss, localization loss, and objectness loss. Classification uses a sigmoid activation with sigmoid cross-entropy loss. Localization uses a sigmoid activation and sigmoid cross-entropy loss on x and y, and L1 loss on w and h. Objectness uses a sigmoid activation with sigmoid cross-entropy loss.

For predicted boxes matched to a ground-truth box, all three losses are computed.

For boxes that overlap no ground truth, only the objectness loss (with target 0) is computed; boxes that overlap a ground-truth box but are not the best match are ignored, as sketched below.
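Putting the three terms together, here is a minimal sketch of the loss under the matching rules above (PyTorch; the obj_mask and ignore_mask tensors are assumed to be precomputed by the matching step, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

def yolo_loss(pred, target, obj_mask, ignore_mask):
    """pred/target: (..., 4 + 1 + num_classes) in raw-logit form.
    obj_mask marks boxes matched to a ground truth; ignore_mask marks
    non-best boxes that still overlap a ground truth above the threshold."""
    bce = F.binary_cross_entropy_with_logits

    # Localization: sigmoid cross-entropy on x, y; L1 on w, h
    loss_xy = bce(pred[..., 0:2], target[..., 0:2], reduction="none").sum(-1)
    loss_wh = (pred[..., 2:4] - target[..., 2:4]).abs().sum(-1)

    # Objectness: target 1 for matched boxes, 0 elsewhere; ignored boxes excluded
    loss_obj = bce(pred[..., 4], obj_mask.float(), reduction="none")
    loss_obj = loss_obj * (1.0 - ignore_mask.float())

    # Classification: independent sigmoid cross-entropy per class
    loss_cls = bce(pred[..., 5:], target[..., 5:], reduction="none").sum(-1)

    matched = obj_mask.float()  # xy/wh/cls terms apply only to matched boxes
    return ((loss_xy + loss_wh + loss_cls) * matched + loss_obj).sum()
```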