Explanation of classical deep neural network architectures - VGG, ResNet, Inception
The abstraction underlying deep neural networks is that, given a suitable architecture, we can construct a general function approximator that maps from the input sample space to the target sample space. This sounds simple, but in practice the model trials and iterations needed to build such a network are expensive in both computation and time. Transfer learning, however, makes it feasible to take architectures that have already performed well on specific classification tasks and reuse them for similar, or even seemingly unrelated, tasks. In the course of my studies I read a number of papers on these classic network architectures, and I make a note of them here.
The most striking feature of VGG Net compared to earlier classic architectures is its extensive use of small 3x3 convolutional kernels (and 1x1 kernels in some configurations) with same padding, so that width and height are unchanged by the convolutions, while downsampling of the feature maps is left entirely to 2x2 max pooling layers. Essentially all convolutional neural networks since then have adopted 3x3 kernels. This simple, small-kernel structure is what makes VGG the classic deep network of its generation.
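For concreteness, here is a minimal PyTorch sketch of such a block (my own illustration, not the authors' code; the helper name `vgg_block` and the channel numbers are arbitrary): 3x3 convolutions with padding 1 keep width and height unchanged, and the 2x2 max pool alone halves the spatial resolution.

```python
import torch
import torch.nn as nn

def vgg_block(in_channels: int, out_channels: int, num_convs: int) -> nn.Sequential:
    """A VGG-style block: stacked 3x3 'same' convolutions followed by 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        # padding=1 with a 3x3 kernel keeps width and height unchanged
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU(inplace=True))
        in_channels = out_channels
    # spatial downsampling is left entirely to the pooling layer
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

block = vgg_block(64, 128, num_convs=2)
x = torch.randn(1, 64, 112, 112)
print(block(x).shape)  # torch.Size([1, 128, 56, 56])
```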
The reason for using small convolutional kernels in deep networks is that stacked small kernels can cover the same receptive field over the input as a single large kernel, while the added layers increase the model's capacity and complexity. Furthermore, stacking several layers of small kernels reduces the number of parameters: for example, with C input and C output channels, a single 7x7 kernel requires 7x7xCxC = 49C² parameters, whereas a stack of three 3x3 kernels requires only 3x(3x3xCxC) = 27C² parameters.
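The arithmetic is easy to verify directly. The sketch below (my own check, assuming C = 64 channels and ignoring biases; `count_params` is a hypothetical helper) counts the weights of a single 7x7 layer against a stack of three 3x3 layers.

```python
import torch.nn as nn

C = 64  # arbitrary channel count for illustration

one_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
three_3x3 = nn.Sequential(
    *[nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False) for _ in range(3)]
)

def count_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(count_params(one_7x7))    # 7*7*C*C     = 49 * 64**2 = 200704
print(count_params(three_3x3))  # 3*(3*3*C*C) = 27 * 64**2 = 110592
```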
In the VGG architecture, the authors used 1x1 convolutions mainly to add extra nonlinearity to the network: the 1x1 layers keep the same number of channels as their input, so the feature dimensionality is unchanged before and after the convolution. In the authors' experiments, however, the configuration with 1x1 layers did not perform as well as the purely 3x3 configuration of the same depth, and the VGG variants that have been widely adopted since then are purely 3x3 networks.
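As a small illustration (my own, not the paper's code; the channel count 256 is arbitrary), such a 1x1 block leaves both the spatial size and the channel count unchanged, so its only contribution is the extra nonlinearity:

```python
import torch.nn as nn

# 1x1 layer in the VGG style: input and output channel counts match, so the
# feature dimensionality is preserved and the block only injects an extra ReLU.
conv_1x1 = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=1),
    nn.ReLU(inplace=True),
)
```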
One noteworthy detail is that, to make the network more robust to scale, the authors first scaled all training images to 384x384, then randomly cropped a 224x224 region as the network input, and afterwards fine-tuned the network on images scaled to a specified range of sizes.
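A rough sketch of that preprocessing step using torchvision transforms (my own reconstruction; the values 384 and 224 come from the description above, everything else is illustrative):

```python
from torchvision import transforms

# Training-time preprocessing in the spirit of the description above:
# rescale to 384, then feed a random 224x224 crop to the network.
train_transform = transforms.Compose([
    transforms.Resize((384, 384)),   # rescale the training image
    transforms.RandomCrop(224),      # random 224x224 crop used as network input
    transforms.ToTensor(),
])
```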
Another detail is that the authors used a number of clever tricks, such as ensembling and multi-crop evaluation, to improve their test results; these improvements, however, are generally only meaningful in competitions and are rarely used in real production environments.
ResNet starts from the observation that a deeper network should, intuitively, outperform a shallower network of similar architecture, yet in practice, as the layers deepen, the effect of vanishing gradients becomes more pronounced and the network becomes extremely difficult to train. In the authors' view, this reflects how hard it is for a stack of nonlinearly activated layers to learn something close to an identity mapping. So they do the opposite: instead of learning the target mapping directly, the network learns the residual, that is, the difference between the target mapping and the identity mapping. Given this reference point, the whole learning process becomes easier, which is a brilliant idea!
On this basis, the ResNet network is built by stacking the basic residual units shown in the diagram above.
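A minimal sketch of such a basic unit (my own PyTorch illustration, ignoring downsampling and channel changes for simplicity; the class name is hypothetical):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """A basic residual unit: the stacked layers learn F(x) = H(x) - x,
    and the skip connection adds x back so the block outputs F(x) + x."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = self.relu(self.bn1(self.conv1(x)))
        residual = self.bn2(self.conv2(residual))
        # If the weights drive F(x) toward zero, the block degenerates to the
        # identity mapping, which is exactly the easy reference point.
        return self.relu(residual + x)
```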
The Inception series now consists of five papers:
The first is a brief introduction to the Inception architecture; the second improves on the Inception network and introduces Batch Normalization, a now widely used method for improving network robustness; and the third, Rethinking the Inception Architecture for Computer Vision, is far richer than the first two: the authors give a great deal of advice on building deep convolutional neural networks, and they further improve the first version of the Inception Module, shown below, by replacing the 5x5 convolutional kernel with a stack of two 3x3 convolutional layers. It is a very good paper that rewards repeated reading.
Compared to VGG Net, the Inception network is no longer a stack of plain convolutional layers; instead, it is a stack of different variants of the Inception Module. Although the Inception network is structurally more complex, its parameter count is actually smaller than VGG's thanks to the extensive use of 1x1 convolutional kernels.
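A simplified sketch of one such module (my own illustration, loosely following the variant in which the 5x5 branch is replaced by two stacked 3x3 convolutions; the class name and channel numbers are arbitrary):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel branches whose outputs are concatenated along the channel axis.
    1x1 convolutions reduce the channel count before the more expensive branches."""

    def __init__(self, in_channels: int):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        self.branch3x3 = nn.Sequential(
            nn.Conv2d(in_channels, 48, kernel_size=1),  # channel reduction
            nn.Conv2d(48, 64, kernel_size=3, padding=1),
        )
        # two stacked 3x3 convolutions replace the original 5x5 branch
        self.branch5x5_as_3x3 = nn.Sequential(
            nn.Conv2d(in_channels, 48, kernel_size=1),
            nn.Conv2d(48, 64, kernel_size=3, padding=1),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, 32, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([
            self.branch1x1(x),
            self.branch3x3(x),
            self.branch5x5_as_3x3(x),
            self.branch_pool(x),
        ], dim=1)
```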
An unavoidable question here is: beyond simply making the network more complicated, why does the Inception network perform better?
One argument is that it is hard to know which convolutional kernel size is right when we build a network, whereas the Inception Module lets us try several options in parallel and lets the network figure out for itself which combination is most appropriate.
Another explanation comes from the fifth paper in the series, in which François Chollet, the author of Keras, argues that in a traditional convolutional neural network the kernel must learn features jointly along the width and height directions and along the depth (channel) direction. Once again, the representation of the problem determines how easy it is to learn: can we simplify the task by separating feature extraction in these two directions? That is the core idea behind the Inception network and the Xception network later derived from it.
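The sketch below (my own illustration, not Chollet's code; the class name is hypothetical) shows a depthwise separable convolution in this spirit: a depthwise 3x3 convolution handles the spatial (width/height) directions, and a pointwise 1x1 convolution handles the channel direction.

```python
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Spatial and channel feature extraction split into two steps."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # depthwise: each input channel is convolved independently (groups=in_channels)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        # pointwise: a 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```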