Paper: Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition.

Fine-grained visual recognition is challenging because it depends heavily on modeling various semantic parts and on fine-grained feature learning. Bilinear-pooling-based models have proved effective for fine-grained recognition, but most previous approaches ignore the fact that inter-layer feature interaction and fine-grained feature learning are interrelated and mutually reinforcing. This paper proposes a new model to address these issues. First, we propose a cross-layer bilinear pooling method to capture inter-layer part feature relations, which outperforms other bilinear-pooling-based methods. Second, we propose a hierarchical bilinear pooling framework that integrates multiple cross-layer bilinear features to enhance their representation capability. Our formulation is intuitive and effective, and it achieves state-of-the-art results on widely used fine-grained recognition datasets.

Methods that rely on manually annotated local parts for fine-grained classification have serious limitations.

Therefore, methods that use only image-level labels have emerged. For example, Simon and Rodner [26] proposed a constellation model that uses a convolutional neural network (CNN) to find constellations of neural activation patterns. Zhang et al. [36] proposed an automatic fine-grained image classification method that selects and describes parts with deep convolutional filters. These models use the CNN as a local part detector and have greatly improved fine-grained recognition. Different from such part-based methods, we regard the activations of different convolution layers as responses to different part attributes: instead of explicitly localizing object parts, we use cross-layer bilinear pooling to capture the inter-layer interaction of part attributes, which proves very useful for fine-grained recognition.

Several studies [3, 6, 17, 12] introduce bilinear pooling frameworks to model local parts. Although promising results have been reported, two limitations hold back further improvement. First, most existing bilinear-pooling-based models use only the activations of the last convolution layer to represent the image, which is insufficient to describe the semantic parts of an object. Second, the intermediate convolution activations are ignored, losing discriminative information that is of great importance for fine-grained classification.

As is well known, information is lost as features propagate through a CNN. To minimize the loss of information useful for fine-grained recognition, we propose a new hierarchical bilinear pooling framework that integrates multiple cross-layer bilinear features to enhance their representation capability. To make full use of the intermediate convolution-layer activations, all cross-layer bilinear features are concatenated before the final classification. Note that the features of different convolution layers are complementary and all contribute to discriminative feature learning. The network therefore benefits from the mutual reinforcement of inter-layer feature interaction and fine-grained feature learning. Our contributions are summarized as follows:

1. We develop a simple yet effective cross-layer bilinear pooling technique that supports inter-layer feature interaction and learns fine-grained representations in a mutually reinforcing way.

2. We propose a hierarchical bilinear pooling framework built on cross-layer bilinear pooling, which integrates multiple cross-layer bilinear modules to obtain complementary information from intermediate convolution layers, thereby improving performance.

3. We conduct comprehensive experiments on three challenging datasets (CUB birds, Stanford Cars, and FGVC-Aircraft), and the results demonstrate the superiority of our method.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the proposed method. Section 4 presents the experiments and result analysis, and Section 5 concludes.

In this section, we briefly review previous work from two angles related to ours: fine-grained feature learning and feature fusion in CNNs.

1. To better model the subtle differences between fine-grained categories, Lin et al. [17] proposed a bilinear architecture that aggregates pairwise features from two independent CNNs, producing a very high-dimensional quadratic-expansion feature via the outer product of feature vectors.

2. The work of [23] uses a tensor approximation of the second-order statistics to reduce the feature dimension.

3. Kong et al. [12] apply a low-rank approximation to the covariance matrix, further reducing the computational complexity.

4. Yin et al. aggregate higher-order statistics by iteratively applying tensor sketch compression to the features.

5. The work of [22] takes the bilinear convolutional neural network as a baseline model and boosts it with ensemble learning.

6. Matrix square-root normalization is proposed in [16] and shown to complement existing normalization schemes.

However, these methods consider only the features of a single convolution layer, which is insufficient to capture the various discriminative parts of an object and to model the subtle differences between subcategories. Our method overcomes this limitation by combining inter-layer feature interaction and fine-grained feature learning in a mutually reinforcing way, and is therefore more effective.

The works of [3, 7, 19, 33] study the effectiveness of feature maps from different convolution layers of a CNN.

We regard each convolution layer as an attribute extractor for different object parts and model their interactions in an intuitive and effective way.

In this section, we develop our hierarchical bilinear model to overcome the above limitations. Before presenting it, we first introduce the general formulation of factorized bilinear pooling for fine-grained image recognition in Section 3.1. On this basis, we propose the cross-layer bilinear pooling technique in Section 3.2, which jointly learns the activations of different convolution layers and captures cross-layer feature interactions, yielding better representation capability. Finally, our hierarchical bilinear model integrates multiple cross-layer bilinear modules to generate more detailed part descriptions, achieving better fine-grained recognition.

Factorized bilinear pooling has previously been applied to visual question answering; Kim et al. [11] proposed an efficient attention mechanism for multimodal learning using a Hadamard-product factorization of bilinear pooling. Here we introduce the basic formulation of factorized bilinear pooling for fine-grained image recognition. Suppose the output feature map of a convolution layer for an image I is X ∈ R^{h×w×c}, with height h, width w, and c channels. We denote the c-dimensional descriptor at a spatial position on X as x = [x_1, x_2, ..., x_c]^T. The bilinear model computes

z_i = x^T W_i x,    (1)

where W_i ∈ R^{c×c} is a projection matrix and z_i is the output of the bilinear model. We need to learn W = [W_1, W_2, ..., W_o] to obtain an o-dimensional output z. Following the matrix factorization in [24], the projection matrix W_i in Eq. (1) can be factorized into two rank-one vectors:

W_i = u_i v_i^T,  so that  z_i = x^T u_i v_i^T x = (u_i^T x)(v_i^T x),    (2)

where u_i ∈ R^c and v_i ∈ R^c. Therefore, the output feature z ∈ R^o is given by

z = P^T (U^T x ∘ V^T x),    (3)

where U ∈ R^{c×d} and V ∈ R^{c×d} are projection matrices, P ∈ R^{d×o} is the classification matrix, ∘ denotes the Hadamard (element-wise) product, and d is a hyperparameter determining the joint embedding dimension.
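As a concrete illustration, this factorized pooling can be sketched in a few lines of NumPy. The toy dimensions below are chosen for illustration only; the variable names mirror the symbols U, V, P, and x:

```python
import numpy as np

rng = np.random.default_rng(0)
c, d, o = 8, 16, 4                 # toy channel / embedding / output dims

U = rng.standard_normal((c, d))    # projection matrix U
V = rng.standard_normal((c, d))    # projection matrix V
P = rng.standard_normal((d, o))    # classification matrix P
x = rng.standard_normal(c)         # one c-dimensional conv descriptor

# Factorized bilinear pooling: z = P^T (U^T x  Hadamard  V^T x)
z = P.T @ ((U.T @ x) * (V.T @ x))

# Sanity check: z_i equals the full bilinear form x^T W_i x, where
# W_i = sum_k P[k, i] * U[:, k] V[:, k]^T
W = np.einsum('ak,bk,ki->abi', U, V, P)
z_full = np.einsum('a,abi,b->i', x, W, x)
assert np.allclose(z, z_full)
```

The check confirms the key point of the factorization: the Hadamard form never materializes the c×c matrices W_i, so the parameter count drops from o·c² to roughly d·(2c + o).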

Fine-grained subcategories usually have similar overall appearance and can only be distinguished by subtle differences in local attributes, such as the color, shape, or beak length of a bird. Bilinear pooling is an important fine-grained recognition technique; however, most bilinear models learn features only from a single convolution layer, entirely ignoring cross-layer interactions. The activations of a single convolution layer are an incomplete description, because each object part has multiple attributes, all of which matter for subcategory classification.

In fact, in most cases multiple part attributes must be considered jointly to determine the category of a given image. Therefore, to capture finer-grained part features, we develop a cross-layer bilinear pooling approach that regards each convolution layer of the CNN as a part-attribute extractor. The features of different convolution layers are then integrated by element-wise multiplication to model the inter-layer interaction of part attributes. Accordingly, Eq. (3) can be rewritten as

z = P^T (U^T x ∘ V^T y),    (4)

where x and y are descriptors from two different convolution layers.
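A minimal NumPy sketch of this cross-layer pooling applied to two whole feature maps follows. It assumes, as is common in bilinear-pooling practice, that the Hadamard interactions are sum-pooled over spatial positions and then passed through signed square-root and L2 normalization; the dimensions and layer names in comments are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
h, w, c, d = 4, 4, 8, 16

X = rng.standard_normal((h * w, c))   # e.g. relu5_2 feature map, flattened spatially
Y = rng.standard_normal((h * w, c))   # e.g. relu5_3 feature map
U = rng.standard_normal((c, d))       # projection for layer X
V = rng.standard_normal((c, d))       # projection for layer Y

# Cross-layer interaction at matching spatial positions, sum-pooled over
# all h*w locations: z_int = sum_s (U^T x_s) * (V^T y_s)
z_int = ((X @ U) * (Y @ V)).sum(axis=0)   # shape (d,)

# Signed square-root followed by L2 normalization, a common post-processing
# step for bilinear features (an assumption here, not spelled out in the text).
z = np.sign(z_int) * np.sqrt(np.abs(z_int))
z = z / np.linalg.norm(z)
```

Note that the descriptors from the two layers interact position-wise, which is why the layers must share the same spatial resolution (true for relu5_1, relu5_2, and relu5_3 in VGG-16).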

The cross-layer bilinear pooling proposed in Section 3.2 is intuitive and effective, and it achieves better representation capability than traditional bilinear pooling models without increasing the number of training parameters. This suggests that exploiting feature interactions between different convolution layers helps capture the discriminative attributes of fine-grained subcategories. We therefore extend cross-layer bilinear pooling to integrate more intermediate convolution layers, further boosting the representation capability of the features. In this section, we propose a generalized hierarchical bilinear pooling (HBP) model that incorporates more convolution-layer features by cascading multiple cross-layer bilinear pooling modules. Specifically, we split each cross-layer bilinear pooling module into an interaction stage and a classification stage, formulated as

z_HBP = P^T z_int,   z_int = concat(U^T x ∘ V^T y,  U^T x ∘ S^T z,  V^T y ∘ S^T z, ...),    (5)

where P is the classification matrix and U, V, S, ... are the projection matrices for the convolution-layer feature vectors x, y, z, ..., respectively. The overall pipeline of the HBP framework is shown in Figure 1.
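The hierarchical combination described above can be sketched as follows: all pairwise cross-layer interactions are concatenated and then classified jointly. This is a toy NumPy illustration with three layers; all sizes and names are hypothetical:

```python
import numpy as np

def cbp(X, Y, U, V):
    """Cross-layer bilinear feature: Hadamard interaction of projected
    descriptors, sum-pooled over spatial positions."""
    return ((X @ U) * (Y @ V)).sum(axis=0)

rng = np.random.default_rng(2)
n, c, d, o = 16, 8, 32, 5     # spatial positions, channels, embed dim, classes

X, Y, Z = (rng.standard_normal((n, c)) for _ in range(3))  # three conv layers
U, V, S = (rng.standard_normal((c, d)) for _ in range(3))  # their projections
P = rng.standard_normal((3 * d, o))                        # classification matrix

# Interaction stage: concatenate every pairwise cross-layer bilinear feature.
z_int = np.concatenate([cbp(X, Y, U, V), cbp(X, Z, U, S), cbp(Y, Z, V, S)])

# Classification stage: a single linear classifier over the combined feature.
logits = P.T @ z_int          # shape (o,)
```

Separating the interaction stage from the classification stage is what lets the model reuse one projection per layer (U, V, S) across all the pairwise interactions it participates in.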

In this section, we evaluate the performance of the HBP model on fine-grained recognition. Section 4.1 introduces the datasets and the implementation details of HBP. Section 4.2 presents model-configuration studies investigating the effectiveness of each component. Section 4.3 compares HBP with state-of-the-art methods. Finally, Section 4.4 uses qualitative visualizations to interpret our model intuitively.

Datasets: CUB-200-2011 [30], Stanford Cars [15], and FGVC-Aircraft [21].

Experiment: We evaluate HBP with a VGG-16 baseline pre-trained on the ImageNet classification dataset, with the last three fully connected layers removed. The framework can also be applied to Inception and ResNet. The input image size is 448×448. Our data augmentation follows common practice: during training we use random sampling (cropping 448×448 patches from 512×S images, where S is the largest image side) and horizontal flipping, and only a center crop at inference. We first train the classifier with logistic regression, and then fine-tune the whole network using stochastic gradient descent with a batch size of 16, momentum of 0.9, weight decay of 5×10^-4, and a learning rate of 10^-3, periodically annealed by a factor of 0.5.
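This learning-rate schedule amounts to a simple step decay. The text specifies only the base rate (10^-3) and the annealing factor (0.5), not the interval, so the 30-epoch period below is a hypothetical placeholder:

```python
def lr_at_epoch(epoch, base_lr=1e-3, factor=0.5, period=30):
    """Step-decay schedule: multiply base_lr by `factor` every `period` epochs.

    `period=30` is a hypothetical value; the paper says only that the rate
    is 'periodically annealed' by 0.5.
    """
    return base_lr * factor ** (epoch // period)
```

For example, with these defaults the rate is 1e-3 for epochs 0-29, 5e-4 for epochs 30-59, and so on.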

Cross-layer bilinear pooling (CBP) has a user-defined projection dimension d. To study the influence of d and verify the effectiveness of the proposed framework, we conduct extensive experiments on the CUB-200-2011 [30] dataset; the results are shown in Figure 2. Note that the results in Figure 2 use relu5_3 for FBP, relu5_2 and relu5_3 for CBP, and relu5_1, relu5_2, and relu5_3 for HBP; quantitative layer-selection experiments are also provided below. In VGG-16 [27] we focus mainly on relu5_1, relu5_2, and relu5_3, because shallower layers contain more generic, less discriminative information. In Figure 2 we compare the performance of CBP and FBP; on this basis we further discuss the multi-layer HBP combination; finally we analyze the influence of the hyperparameter d. From Figure 2 we draw the following conclusions:

First, with the same d, our CBP clearly outperforms FBP, showing that inter-layer feature interaction enhances discriminative ability.

Second, HBP further outperforms CBP, demonstrating that exploiting the activations of intermediate convolution layers is effective for fine-grained recognition. This can be explained by information loss during propagation through the CNN: discriminative features important for fine-grained recognition may be lost after the intermediate convolution layers. Compared with CBP, our HBP considers more feature interactions from intermediate convolution layers and is therefore more robust. Because HBP shows the best performance, we compare it against other state-of-the-art methods in the subsequent experiments.

Third, as d increases from 512 to 8192, the accuracy of all models improves, and HBP saturates at d = 8192. We therefore set d = 8192 in the following experiments.

We then conduct quantitative experiments on the CUB-200-2011 [30] dataset to analyze the influence of layer selection. The accuracies in Table 2 are obtained with the same embedding dimension (d = 8192). We consider combinations of CBP and HBP across different layers. The results show that the performance gain of our framework comes mainly from inter-layer interaction and multi-layer combination. Because HBP-3 shows the best performance, we use relu5_1, relu5_2, and relu5_3 in all experiments in Section 4.3.

We also compare our cross-layer integration with the hypercolumn-based feature fusion of [3]. For a fair comparison, we reimplement the hypercolumn as the concatenation of the relu5_3 and relu5_2 features followed by factorized bilinear pooling (denoted HyperBP), under the same experimental settings. As Table 3 shows, our CBP performs slightly better than HyperBP with roughly half the parameters, which again indicates that our integration framework captures inter-layer feature relations more effectively. This is not surprising, since our CBP is to some extent consistent with human perception. In contrast to HyperBP, whose results degrade when more convolution-layer activations are integrated [3], our HBP captures the complementary information of intermediate convolution layers, and its recognition accuracy improves markedly.

Results on CUB-200-2011. The CUB dataset provides ground-truth annotations of bounding boxes and bird parts. The only supervision we use is the image-level class label. The classification accuracy on CUB-200-2011 is shown in Table 4. The table is divided into three row groups: the first summarizes annotation-based methods (using object bounding boxes or part annotations); the second contains unsupervised part-based methods; the last gives the results of pooling-based methods.

As shown in Table 4, PN-CNN [2] uses both human-defined bounding boxes and strong ground-truth part supervision. SPDA-CNN [35] uses ground-truth parts, and B-CNN [17] uses bounding boxes with a very high-dimensional feature representation (250K dimensions). Even without bounding boxes or part annotations, the proposed HBP (relu5_3 + relu5_2 + relu5_1) achieves better results than PN-CNN [2], SPDA-CNN [35], and B-CNN [17], which proves the effectiveness of our model. Compared with STN [9], which uses a stronger Inception network as its base model, our HBP (relu5_3 + relu5_2 + relu5_1) obtains a 3.6% relative accuracy gain. We even surpass RA-CNN [5] and MA-CNN [37], the most recently proposed state-of-the-art unsupervised part-based methods, by relative accuracy gains of 2.1% and 0.7%, respectively. Compared with the pooling-based baselines B-CNN [17], CBP [6], and LRBP [12], we benefit mainly from better inter-layer interaction and multi-layer integration of features. We also outperform BoostCNN [22], which boosts multiple bilinear networks trained at multiple scales. Although HIHCA [3] proposed a feature-interaction idea for fine-grained recognition similar to ours, our model achieves higher accuracy thanks to the mutually reinforcing framework of inter-layer feature interaction and discriminative feature learning. Note that HBP (relu5_3 + relu5_2 + relu5_1) performs better than CBP (relu5_3 + relu5_2) and FBP (relu5_3), showing that our model captures complementary information across layers.

Results on Stanford Cars. The classification accuracy on Stanford Cars is shown in Table 5. Different car parts are distinctive and complementary, so localization of objects and parts may play an important role here. Although HBP performs no explicit part detection, our result is the best among the current state-of-the-art methods. Thanks to the interactive learning of inter-layer features, we even achieve a 1.2% relative accuracy improvement over PA-CNN [13], which uses human-defined bounding boxes. Compared with unsupervised part-based methods, we observe a clear improvement. Our HBP is also superior to the pooling-based methods BoostCNN [22] and KP [4].

Results on FGVC-Aircraft. Different aircraft models are difficult to distinguish due to subtle differences; one may, for example, need to count the windows of a model. Table 6 summarizes the classification accuracy on FGVC-Aircraft. Our model again reaches the state of the art, with the highest classification accuracy among all methods. Compared with the annotation-based MDTP [32], the part-learning-based MA-CNN [37], and the pooling-based BoostCNN [22], we observe consistent improvement, which highlights the effectiveness and robustness of the proposed HBP model.

To better understand our model, we visualize the responses of different layers of the fine-tuned network on the different datasets. The activation map of each layer is obtained by averaging the magnitude of the feature activations across channels. In Figure 3, we randomly select images from the three datasets and visualize them.
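The channel-averaged activation maps used for these visualizations can be computed as in the following small sketch; the function name and the rescaling to [0, 1] for display are our assumptions:

```python
import numpy as np

def activation_map(feat):
    """Collapse an (h, w, c) conv feature map to an (h, w) map by averaging
    activation magnitude over channels, then rescale to [0, 1] for display."""
    amap = np.abs(feat).mean(axis=-1)
    amap -= amap.min()
    peak = amap.max()
    return amap / peak if peak > 0 else amap

# e.g. relu5_3 of VGG-16 with a 448x448 input produces a 28x28x512 map
feat = np.random.default_rng(3).standard_normal((28, 28, 512))
amap = activation_map(feat)    # (28, 28), values in [0, 1]
```

In practice such a map would be upsampled to the input resolution and overlaid on the image to produce figures like Figure 3.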

The visualization results show that the proposed model can ignore cluttered backgrounds and activates strongly on highly specific regions. The highlighted activation regions are closely related to semantic parts: the head, wings, and breast of a bird; the front bumper, wheels, and lights of a car; the cockpit, tail stabilizer, and engines of an aircraft. These parts are key to distinguishing categories. More importantly, our model is highly consistent with human perception in resolving details when perceiving scenes or objects. As can be seen from Figure 3, the convolution layers (relu5_1, relu5_2, relu5_3) provide the approximate location of the target object; on this basis, the projection layers (project5_1, project5_2, project5_3) further pinpoint the essential parts of the object and distinguish categories through the continued interaction and integration of the features of different parts. This process accords with the Gestalt maxim that the whole is perceived prior to its parts, and it provides an intuitive explanation of how our framework classifies an image: the object is first located, and its discriminative parts are then detected and compared.

In this paper, we proposed a hierarchical bilinear pooling method that combines inter-layer feature interaction and discriminative feature learning to achieve fine-grained fusion of multi-layer features. The proposed network requires no bounding-box or part annotations and can be trained end to end. Extensive experiments on birds, cars, and aircraft demonstrate the effectiveness of our framework. In the future, we will extend this work in two directions: how to effectively integrate more layer features to obtain multi-scale part representations, and how to learn better fine-grained representations by incorporating effective part-localization methods.