
Image Segmentation: A Detailed Description of Full Convolution Neural Network (FCN)

As one of the three major tasks of computer vision (image classification, object detection, and image segmentation), image segmentation has made great progress in recent years. The technology is also widely used in autonomous driving, for example to identify drivable areas and lane lines.

The Fully Convolutional Network (FCN) is a framework for image semantic segmentation, proposed by Jonathan Long of the University of California, Berkeley and others in the 2015 paper "Fully Convolutional Networks for Semantic Segmentation". Although many articles have introduced this framework, I still want to organize my own understanding of it here.

The network is divided into two parts: a fully convolutional part and a deconvolution part. The fully convolutional part borrows a classic CNN backbone (such as AlexNet, VGG, or GoogLeNet) with the final fully connected layers replaced by convolutions; it extracts features and produces a heatmap. The deconvolution part upsamples the small heatmap to obtain a semantic segmentation map at the original image size.
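The replacement of fully connected layers by convolutions works because a dense layer applied to a flattened feature map is mathematically the same as a convolution whose kernel covers the whole map. A minimal NumPy sketch of this equivalence, using a toy 4×4 single-channel map (a stand-in for VGG's real 512×7×7 features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy feature map: one channel, 4x4 spatial extent
feat = rng.standard_normal((4, 4))

# A "fully connected" layer with one output unit over the flattened map
w = rng.standard_normal(16)
fc_out = w @ feat.ravel()

# The same weights viewed as a 4x4 convolution kernel applied at the
# single valid position -> a 1x1 score map with the identical value
kernel = w.reshape(4, 4)
conv_out = np.sum(kernel * feat)

print(np.isclose(fc_out, conv_out))  # True
```

On a larger input, the convolutional form simply slides the same kernel and produces a spatial grid of scores instead of a single number, which is exactly the heatmap described above.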

The input of the network can be a color image of any size; the output is the same size as the input, with n + 1 channels (n target categories plus 1 for the background).

Replacing the fully connected layers with convolutions is what allows the network to accept input images of any size (above a certain minimum), since only fully connected layers require a fixed input size.

Because the heatmap becomes very small during convolution and pooling (for example, with a VGG backbone the height and width shrink to 1/32 of the original image), we need to upsample it to obtain dense, pixel-level predictions at the original size.
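The 1/32 factor follows directly from the backbone's pooling layers. A quick sanity check, assuming a VGG-16 backbone and the standard 224×224 input:

```python
# VGG-16 contains five 2x2 max-pooling layers, each halving the
# spatial size, so a 224x224 input yields a 7x7 heatmap (1/32 scale).
size = 224
for _ in range(5):
    size //= 2
print(size)  # 7
```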

An intuitive approach is bilinear interpolation, which can easily be implemented as a deconvolution with a fixed convolution kernel. Deconvolution is also called transposed convolution in more recent articles.

In practice, the authors do not fix this convolution kernel, but instead make it a learnable parameter.

If the feature map of the last layer were upsampled directly to the original size using the technique above, many details would be lost, because that feature map is too small. The authors therefore add a skip structure ("skips"), which combines the final prediction (carrying more global information) with predictions from shallower layers (carrying more local detail), so that local predictions are made while respecting the global structure.
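The fusion step itself is just an element-wise sum of score maps at matching resolutions. A simplified NumPy sketch in the style of FCN-16s, with a hypothetical class count of 3 and nearest-neighbour upsampling standing in for the paper's learnable deconvolution:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 3  # hypothetical number of classes (including background)

score32 = rng.standard_normal((C, 7, 7))    # coarse scores at 1/32 scale
score16 = rng.standard_normal((C, 14, 14))  # finer scores from a shallower layer

# 2x upsample the coarse scores (nearest-neighbour here, as a simple
# stand-in for the learnable deconvolution), then sum element-wise.
up = score32.repeat(2, axis=1).repeat(2, axis=2)
fused = up + score16  # FCN-16s-style skip fusion

print(fused.shape)  # (3, 14, 14)
```

Repeating the same pattern one level shallower (fusing with 1/8-scale scores) gives the FCN-8s variant, which recovers the most detail.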

FCN still has some shortcomings, such as:

The results are not fine-grained enough and are insensitive to details;

It does not consider the relationships between pixels, so it lacks spatial consistency.

Reference: zomi, "Fully Convolutional Networks (FCN) explained in detail", Zhihu column

Other articles by the author:

PointNet: Detailed explanation of 3D point cloud classification and segmentation model based on deep learning

Vision-based robot indoor positioning