A First Look at the CV Field - Image Classification (I)
Recognizing objects at the level of different species, such as cat-vs-dog classification, is a coarse-grained classification task. It is characterized by large inter-class variance and small intra-class variance. The typical CIFAR-10 dataset, for example, distinguishes among vehicles and animals, all of which are semantically completely distinct objects.
Fine-grained image classification is the classification of subclasses within a large category, e.g., distinguishing different species of birds, breeds of dogs, or models of cars. For example, the Caltech-UCSD Birds-200-2011 dataset is a bird dataset containing 200 classes and 11,788 images, with 15 part locations and 1 bounding box annotated for each image. This level of granularity requires a much finer classifier design.
If we need to distinguish between different individuals, not just species or subclasses, it becomes a recognition problem; the most typical task is face recognition. Face recognition is very meaningful for real-world computer vision applications: security and surveillance, attendance check-in, face unlocking, and other scenarios are all closely tied to this instance-level image classification task.
The MNIST dataset was the benchmark at the time, containing 60,000 training images and 10,000 test images, all grayscale with a size of 28*28. On this dataset, traditional methods such as SVM and KNN performed well; an SVM-based method reduced the MNIST classification error rate to 0.56%, surpassing the artificial neural networks of the time.
Later on, after many iterations, LeNet5 was born in 1998. It is a classical convolutional neural network whose important features, local receptive fields, shared convolutional weights, and spatial pooling, remain the building blocks of modern CNNs.
Although the error rate of LeNet5 is around 0.7%, slightly worse than the best SVM methods, neural network methods quickly surpassed all others as network architectures developed, achieving very good results.
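For concreteness, here is a minimal PyTorch sketch of a LeNet5-style network, assuming 32x32 grayscale inputs (MNIST images padded from 28x28); the layer sizes follow the classic design, but details such as the activation function vary across implementations.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A LeNet5-style CNN: two conv/pool stages followed by three fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

# Example: a batch of 32x32 grayscale images (e.g. MNIST padded by 2 pixels per side).
logits = LeNet5()(torch.randn(8, 1, 32, 32))
print(logits.shape)  # torch.Size([8, 10])
```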
In order to bring more complex image classification tasks to industry, Fei-Fei Li and others spent several years on data collection and organization, and in 2009 the ImageNet dataset was released. ImageNet has more than 14 million images covering more than 20,000 categories, but the benchmark commonly used in papers covers 1,000 categories.
AlexNet burst onto the scene in 2012 as the first truly deep network: it has 3 more layers than LeNet5's 5 layers, a much larger number of parameters, and its input size grows from 28x28 to 224x224. It also introduced GPU training, and from this point on deep learning entered the GPU-is-king era.
AlexNet has the following features: ReLU activations, Dropout in the fully connected layers, local response normalization, overlapping pooling, data augmentation, and parallel training on multiple GPUs.
VGGNet explores the relationship between the depth of a convolutional neural network and its performance. By successfully constructing networks 16 to 19 layers deep, it proved that increasing depth can, to a certain extent, improve final performance and significantly decrease the error rate. At the same time it is highly extensible, and its features transfer very well to other image tasks. To date, VGG is still used to extract image features.
VGGNet can be seen as a deepened version of AlexNet; both consist of two main parts, a convolutional part and a fully connected part. VGGNet uses only 3x3 convolutional kernels and 2x2 max pooling kernels, simplifying the structure of the convolutional neural network. It is a good demonstration that performance can be improved simply by stacking more layers on top of previous architectures. Although simple, it was exceptionally effective, and VGGNet is still chosen as a benchmark model for many tasks today; a sketch of one VGG-style stage follows.
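The sketch below shows one VGG-style stage built from 3x3 convolutions and a 2x2 max pool, and stacks five such stages to obtain a VGG-16-like convolutional body; the helper name vgg_stage and the exact stage composition are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

def vgg_stage(in_ch, out_ch, num_convs):
    """One VGG-style stage: a stack of 3x3 convolutions (stride 1, padding 1)
    followed by a 2x2 max pool that halves the spatial resolution."""
    layers = []
    for i in range(num_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# A VGG-16-like convolutional body is just five such stages stacked:
body = nn.Sequential(
    vgg_stage(3, 64, 2), vgg_stage(64, 128, 2), vgg_stage(128, 256, 3),
    vgg_stage(256, 512, 3), vgg_stage(512, 512, 3),
)
print(body(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])
```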
GoogLeNet also deepens the network, but makes a bolder attempt at the network structure: its depth is 22 layers, yet it has only about 5 million parameters. AlexNet has roughly 12 times as many parameters as GoogLeNet, and VGGNet in turn has about 3 times as many as AlexNet, so GoogLeNet is a better choice when memory or computational resources are limited; moreover, in terms of accuracy, GoogLeNet's performance is also superior.
In general, the most direct way to improve the performance of a network is to increase the depth and width of the network, where depth refers to the number of network layers and width refers to the number of neurons. However, this approach has the following problems:
(1) Too many parameters: if the training dataset is limited, overfitting is likely;
(2) The larger the network and the more parameters, the greater the computational cost, making it difficult to apply in practice;
(3) The deeper the network, the more prone it is to gradient vanishing (the further a gradient propagates backward, the more likely it is to disappear), making the model difficult to optimize.
The solution to these problems is, of course, to increase the depth and width of the network while reducing the number of parameters, and to reduce parameters it is natural to think of turning full connections into sparse connections. However, sparsifying the connections does not actually reduce computation time in practice, because most hardware is optimized for dense matrix computation: a sparse matrix holds less data but does not take correspondingly less time to compute. A more common approach is Dropout, which is equivalent to sampling a thinner network from the original one (to be investigated).
The GoogLeNet team therefore proposed the Inception architecture, which constructs a kind of "basic neuron" structure to build a network that is sparse in design yet computationally efficient.
So what is Inception? It has evolved through V1, V2, V3, V4, and other versions, each an improvement over the last; they are described below.
By designing a sparse network structure that can nevertheless produce dense data, we can increase the performance of the neural network while ensuring efficient use of computational resources. Google proposed the original basic structure of Inception:
The structure places the convolutions commonly used in CNNs (1x1, 3x3, 5x5) and a pooling operation (3x3) in parallel; the branch outputs keep the same spatial dimensions and are concatenated along the channel dimension. On one hand this increases the width of the network, and on the other it increases the network's adaptability to scale.
The convolutional layers in the module can extract fine details of the input, while the 5x5 filter can also cover most of the input from the preceding layer. A pooling operation is also included to reduce overfitting. On top of these, a ReLU is applied after each convolution to increase the nonlinearity of the network.
However, in this original version of Inception, every convolutional kernel operates on all the outputs of the previous layer, so the computation required by the 5x5 convolution is enormous and the resulting stacks of feature maps are very thick. To avoid this, a 1x1 convolution is added before the 3x3 convolution, before the 5x5 convolution, and after max pooling to reduce the channel depth of the feature maps; this forms the Inception v1 module, sketched below.
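Here is a minimal PyTorch sketch of such an Inception v1 module; the class name InceptionV1Block is an assumption, while the channel numbers in the usage example follow the first Inception module (3a) of the GoogLeNet paper.

```python
import torch
import torch.nn as nn

class InceptionV1Block(nn.Module):
    """Inception v1 module: four parallel branches whose outputs share the same spatial
    size and are concatenated along the channel dimension. 1x1 convolutions reduce the
    channel depth before the expensive 3x3/5x5 convolutions and after max pooling."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

# Channel numbers of the first Inception module (3a) in the GoogLeNet paper:
block = InceptionV1Block(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```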
The GoogLeNet network built from these modules has the following characteristics:
(1) GoogLeNet adopts a modularized structure (the Inception structure), which is convenient for adding and modifying;
(2) Instead of a fully connected layer, global average pooling is used at the end of the network, an idea from NIN (Network in Network) that was shown to improve accuracy by 0.6%. However, a fully connected layer is actually still added at the very end, mainly to allow flexible adjustment of the output;
(3) Although the fully connected layers were removed, Dropout is still used in the network;
(4) To avoid vanishing gradients, the network adds two auxiliary softmax classifiers to help gradients propagate back to the earlier layers. Each auxiliary classifier takes the output of an intermediate layer, performs classification, and adds its loss to the final objective with a small weight (0.3). This is equivalent to a form of model fusion, injects additional gradient signal during back-propagation, and provides extra regularization, all of which benefit training. At test time these two extra softmax branches are removed; a sketch of the weighted loss is given after this list.
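The snippet below is a minimal sketch of how the two auxiliary losses can be combined with the main loss during training; the function name googlenet_loss and the random logits are illustrative assumptions, only the 0.3 weighting comes from the description above.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets, aux_weight=0.3):
    """Training loss with two auxiliary classifiers: each auxiliary softmax loss is added
    to the main loss with a small weight (0.3). At test time only main_logits is used and
    the auxiliary heads are discarded."""
    loss = criterion(main_logits, targets)
    loss += aux_weight * criterion(aux1_logits, targets)
    loss += aux_weight * criterion(aux2_logits, targets)
    return loss

# Hypothetical usage with random logits for a batch of 8 images over 1000 classes:
targets = torch.randint(0, 1000, (8,))
main, aux1, aux2 = (torch.randn(8, 1000) for _ in range(3))
print(googlenet_loss(main, aux1, aux2, targets))
```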
The solution in Inception V2 is to modify the internal computational logic of the Inception module and propose a specialized, factorized convolutional structure.
2.1 Factorizing Convolutions
The GoogLeNet team proposed replacing a single 5x5 convolutional layer with a small network of two consecutive 3x3 convolutional layers, which reduces the number of parameters while maintaining the receptive field, as the sketch below shows.
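A quick way to see the saving is to count parameters for the two variants; the channel count of 64 below is an illustrative assumption.

```python
import torch.nn as nn

C = 64  # illustrative channel count

# A single 5x5 convolution versus two stacked 3x3 convolutions (same 5x5 receptive field).
conv5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
conv3x3_twice = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True))

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(conv5x5))        # 64*64*25 = 102400
print(params(conv3x3_twice))  # 2 * 64*64*9 = 73728, about 28% fewer parameters
```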
2.2 Reducing the size of the feature map
If you want to shrink the feature map, there are two options: pool first and then apply the Inception convolutions, or apply the Inception convolutions first and then pool. However, the first method (pooling first) creates a representational bottleneck (features are lost), while the second method is a normal reduction but is computationally expensive. To maintain the feature representation while reducing computation, the structure was changed to use two parallelized modules, convolution and pooling executed in parallel and then merged, as sketched below.
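The following is a minimal sketch of that parallelized downsampling idea, assuming a stride-2 convolution branch and a stride-2 max-pooling branch whose outputs are concatenated; the class name GridReduction, the padding choices, and the channel counts are illustrative assumptions rather than the exact module from the paper.

```python
import torch
import torch.nn as nn

class GridReduction(nn.Module):
    """Parallelized downsampling: a stride-2 convolution branch and a stride-2 max-pooling
    branch run in parallel and their outputs are concatenated, so the spatial size is halved
    without first squeezing the features through a pooling bottleneck."""
    def __init__(self, in_ch, conv_ch):
        super().__init__()
        self.conv_branch = nn.Sequential(
            nn.Conv2d(in_ch, conv_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True))
        self.pool_branch = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        return torch.cat([self.conv_branch(x), self.pool_branch(x)], dim=1)

block = GridReduction(in_ch=192, conv_ch=192)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 384, 14, 14])
```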
Inception V2 was adopted as an improved version of GoogLeNet.
One of the most important improvements in Inception V3 is further factorization: a 7x7 convolution is decomposed into two one-dimensional convolutions (1x7 and 7x1), and likewise 3x3 into (1x3 and 3x1). This speeds up computation, and splitting one convolution into two also increases the network depth further and adds nonlinearity (a ReLU follows each additional layer).
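The sketch below compares a 7x7 convolution with its 1x7 plus 7x1 factorization; the channel count of 128 and the 17x17 input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

C = 128  # illustrative channel count

# A 7x7 convolution factorized into a 1x7 followed by a 7x1 convolution.
conv7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
conv1x7_7x1 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=(1, 7), padding=(0, 3), bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=(7, 1), padding=(3, 0), bias=False), nn.ReLU(inplace=True))

params = lambda m: sum(p.numel() for p in m.parameters())
print(params(conv7x7))      # 128*128*49 = 802816
print(params(conv1x7_7x1))  # 2 * 128*128*7 = 229376, roughly 3.5x fewer parameters

x = torch.randn(1, C, 17, 17)
print(conv7x7(x).shape, conv1x7_7x1(x).shape)  # both torch.Size([1, 128, 17, 17])
```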
Inception V4 mainly uses residual connections to improve the V3 structure, yielding the Inception-ResNet-v1, Inception-ResNet-v2, and Inception-v4 networks.