[ECCV2020] Paper Translation: Character Region Attention For Text Spotting

A scene text spotter is composed of text detection and recognition modules. Much research has been conducted to unify these modules into an end-to-end trainable model for better performance. A typical architecture places the detection and recognition modules in separate branches, and RoI pooling is commonly used to let the branches share visual features. However, there is still an opportunity to establish more complementary connections between the modules when employing a recognizer that uses an attention-based decoder and a detector that represents spatial information of character regions. This is possible because the two modules share a common subtask, which is to find the location of character regions. Based on these insights, we construct a tightly coupled single-pipeline model. This architecture is formed by using the detection output as the recognizer input and by propagating the recognition loss through the detection stage. The use of the character score map helps the recognizer attend better to character center points, and the recognition loss propagated to the detector module enhances the localization of character regions. In addition, the enhanced sharing stage allows feature rectification and boundary localization for arbitrarily shaped text regions. Extensive experiments demonstrate state-of-the-art performance on publicly available straight and curved benchmark datasets.

Scene text spotting, which includes text detection and recognition, has recently attracted much attention due to its diverse applications in instant translation, image retrieval, and scene parsing. While existing text detectors and recognizers are effective on horizontal text, spotting curved text instances in scene images remains a challenge.

To spot curved text in an image, a classic approach is to cascade existing detection and recognition models and handle each text instance separately. Detectors [32, 31, 2] attempt to capture the geometric properties of curved text by applying sophisticated post-processing techniques, while recognizers apply multi-directional encoding [6] or employ rectification modules [37, 46, 11] to improve accuracy on curved text.

With the development of deep learning, research has been conducted on combining detectors and recognizers into jointly trainable end-to-end networks [14, 29]. Having a unified model not only improves the size and speed efficiency of the model, but also helps the model learn shared features, which improves overall performance. To benefit from this property, attempts have also been made to handle curved text instances with end-to-end models [32, 34, 10, 44]. However, most existing work only uses RoI pooling to share low-level features between the detection and recognition branches. In the training phase, instead of training the entire network, only the shared feature layers are trained using both the detection and recognition losses.

As shown in Fig. 1, we propose a novel end-to-end character-region attentional text spotting model, called CRAFTS. Instead of isolating the detection and recognition modules in two separate branches, we build a single pipeline by establishing complementary connections between the modules. We observe that the recognizer [1], which uses an attention-based decoder, and the detector [2], which encapsulates spatial information about characters, share a common subtask: localizing character regions. By tightly integrating the two modules, the output of the detection stage helps the recognizer attend better to character center points, and the loss propagated from the recognizer to the detection stage enhances the localization of character regions. Moreover, the network is able to maximize the quality of the feature representations used in the common subtask. To the best of our knowledge, this is the first end-to-end work to construct such a tightly coupled loss.

Our contributions are summarized as follows:

(1) We present an end-to-end network that can detect and recognize arbitrarily shaped text.

(2) By utilizing the detector's spatial character information in the rectification and recognition modules, we construct complementary relationships between the modules.

(3) We build a single pipeline by propagating the recognition loss through all features of the entire network.

(4) We achieve state-of-the-art performance on the IC13, IC15, IC19-MLT, and TotalText [20, 19, 33, 7] datasets, which contain a large number of horizontal, curved, and multilingual texts.

Text Detection and Recognition Methods

The detection network uses regression-based [16, 24, 25, 48] or segmentation-based [9, 31, 43, 45] methods to generate text bounding boxes. Some recent methods, such as [17, 26, 47], use Mask R-CNN [13] as a base network and take advantage of both regression and segmentation methods through multi-task learning. Regarding the unit of text detection, all methods can also be sub-categorized by their use of word-level or character-level [16, 2] predictions.

Text recognizers typically use CNN-based feature extractors together with RNN-based sequence generators, and are classified by their sequence generators: connectionist temporal classification (CTC) [35] and attention-based sequential decoders [21, 36]. A detection model provides information about the text region, but it is still a challenge for the recognizer to extract useful information from arbitrarily shaped text. To help the recognition network deal with irregular text, some studies [36, 28, 37] utilize spatial transformer networks (STN) [18]. Moreover, the papers [11, 46] further extend the use of the STN by applying the rectification iteratively. These studies show that running the STN recursively helps the recognizer extract useful features from extremely curved text. In [27], a recursive RoIWarp layer is proposed, which crops individual characters before recognizing them. This work demonstrates that the task of finding character regions is closely related to the attention mechanism used in attention-based decoders.

One approach to constructing a text spotting model is to place the detection and recognition networks in sequence. A well-known two-stage architecture couples the TextBoxes++ [24] detector with the CRNN [35] recognizer. Despite its simplicity, this approach achieves good results.

End-to-end with RNN-based recognizers

EAA [14] and FOTS [29] are end-to-end models based on the EAST detector [49]. The difference between the two networks lies in the recognizer: the FOTS model uses a CTC decoder [35], while the EAA model uses an attention decoder [36]. Both works implement an affine transformation layer to pool the shared features. The proposed affine transformation works well on horizontal text, but shows limitations when dealing with arbitrarily shaped text. TextNet [42] proposes a spatially aware text recognizer with a perspective RoI transformation in the feature pooling layer. The network keeps an RNN layer to recognize text sequences in 2D feature maps, but still shows limitations in detecting curved text due to the limited expressiveness of quadrilaterals.

Qin et al. [34] proposed an end-to-end network based on Mask R-CNN [13]. Given box proposals, features are pooled from the shared layer, and an RoI masking layer is used to filter out background clutter. The proposed method improves performance by ensuring that attention is placed only on the text region. Busta et al. proposed the Deep TextSpotter [3] network and extended their work in E2E-MLT [4]. The network consists of an FPN-based detector and a CTC-based recognizer, and the model predicts multiple languages in an end-to-end manner.

End-to-end with CNN-based recognizers

When dealing with arbitrarily shaped text, most CNN-based models have an advantage in recognizing character-level text. MaskTextSpotter [32] is a model that recognizes text with a segmentation approach. While it has advantages in detecting and recognizing individual characters, it is difficult to train the network because character-level annotations are usually not provided in public datasets. CharNet [44] is another segmentation-based approach that makes character-level predictions. The model is trained in a weakly supervised manner to overcome the lack of character-level annotations: during training, the method performs iterative character detection to create pseudo ground truths.

Although segmentation-based recognizers have enjoyed great success, the approach suffers when the number of target characters increases. As the character set grows, segmentation-based models require more output channels, which increases memory requirements. The journal version of MaskTextSpotter [23] extends the character set to handle multiple languages, but the authors added an RNN-based decoder instead of using the CNN-based recognizer they originally proposed. Another limitation of segmentation-based recognizers is the lack of contextual information in the recognition branch. Because there is no sequential modeling like an RNN, the accuracy of the model decreases on noisy images.

TextDragon [10] is another segmentation-based approach for localizing and recognizing text instances. However, there is no guarantee that the predicted character segments cover individual character regions. To address this issue, the model incorporates CTC to remove overlapping characters. The network shows good detection performance, but shows limitations in the recognizer due to the lack of sequential modeling.

The CRAFT detector [2] was chosen as the base network because of its ability to represent semantic information about character regions. The output of the CRAFT network represents the probability of character centers and the links between them. Since the goal of both modules is to locate the centers of characters, we conjecture that this character-center information can be used to support the attention module in the recognizer. In this work, we make three changes to the original CRAFT model: backbone replacement, link representation, and orientation estimation.

Backbone Replacement

Recent studies have shown that ResNet50 can capture well-defined feature representations for both the detector and the recognizer [30, 1]. Therefore, we swap the backbone network from VGG-16 [40] to ResNet50 [15].

Link Representation

Vertical text is not common in Latin scripts, but it is often found in East Asian languages (e.g., Chinese, Japanese, and Korean). In this work, sequential character regions are connected using a binary center line. The reason for this change is that using the original affinity map on vertical text often produces an ill-defined perspective transformation, which generates invalid box coordinates. To generate the ground-truth link map, a line segment of thickness t is drawn between neighboring characters. Here, t = max((d1 + d2) / 2 × α, 1), where d1 and d2 are the diagonal lengths of the neighboring character boxes and α is a scaling factor. Using this equation makes the width of the center line proportional to the size of the characters. We set α to 0.1 in our implementation.
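
As a concrete illustration, the following Python sketch (not the authors' released code; the box format, `link_map` layout, and helper name are assumptions) shows how a binary center line of thickness t = max((d1 + d2)/2 × α, 1) could be drawn between two neighboring character boxes:

```python
import numpy as np
import cv2

def draw_link_centerline(link_map, box1, box2, alpha=0.1):
    box1 = np.asarray(box1, dtype=np.float32)  # (4, 2) character box corners
    box2 = np.asarray(box2, dtype=np.float32)
    # Diagonal lengths of the two neighboring character boxes.
    d1 = np.linalg.norm(box1[2] - box1[0])
    d2 = np.linalg.norm(box2[2] - box2[0])
    # Line thickness proportional to character size: t = max((d1 + d2) / 2 * alpha, 1).
    t = max(int(round((d1 + d2) / 2 * alpha)), 1)
    # Connect the two box centers with a binary line segment of thickness t.
    c1 = tuple(int(v) for v in box1.mean(axis=0))
    c2 = tuple(int(v) for v in box2.mean(axis=0))
    cv2.line(link_map, c1, c2, color=1, thickness=t)
    return link_map
```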

Orientation Estimation

It is important to obtain the correct orientation of a text box, since the recognition stage needs well-defined box coordinates in order to recognize the text correctly. For this purpose, we add two output channels in the detection stage; the channels are used to predict the angle of each character along the x-axis and the y-axis. The ground truth of the orientation map is generated from the angles of the annotated character boxes.
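
The exact per-pixel encoding of the two channels is not spelled out above; one plausible construction is to fill each character region with the cosine and sine of its box angle. The sketch below is only an illustration under that assumption (`char_boxes` and `char_angles` are hypothetical inputs):

```python
import numpy as np
import cv2

def make_orientation_gt(height, width, char_boxes, char_angles):
    # Two-channel ground truth: per-pixel angle encoded along the x- and y-axes.
    gt = np.zeros((2, height, width), dtype=np.float32)
    for box, theta in zip(char_boxes, char_angles):
        mask = np.zeros((height, width), dtype=np.uint8)
        cv2.fillPoly(mask, [np.asarray(box, dtype=np.int32)], 1)
        gt[0][mask == 1] = np.cos(theta)  # x-axis component
        gt[1][mask == 1] = np.sin(theta)  # y-axis component
    return gt
```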

The sharing stage consists of two modules: the text rectification module and the character region attention (CRA) module. To rectify arbitrarily shaped text regions, a thin-plate spline (TPS) [37] transformation is used. Inspired by [46], our rectification module applies TPS iteratively to better represent text regions. By iteratively updating the control points, the curved geometry of the text in the image can be improved. Through empirical studies, we find that three TPS iterations are sufficient for rectification.
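
The loop below is a rough sketch of that iterative scheme, not the authors' implementation; `control_point_regressor` and `tps_transform` are hypothetical helpers standing in for the control-point prediction head and the thin-plate spline warp:

```python
def rectify_features(shared_features, num_iterations=3):
    # Start from the un-rectified shared features.
    rectified = shared_features
    ctrl_pts = None
    for _ in range(num_iterations):
        # Re-estimate the control points from the current, partially rectified features.
        ctrl_pts = control_point_regressor(rectified)          # hypothetical helper
        # Warp the original features with a TPS defined by the updated control points.
        rectified = tps_transform(shared_features, ctrl_pts)   # hypothetical helper
    return rectified, ctrl_pts
```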

A typical TPS module takes a word image as input, but we feed the character region map and the link map, because they encapsulate the geometric information of the text region. We use twenty control points to tightly cover the curved text region. To use these control points as a detection result, they are transformed back into the original input-image coordinates. Optionally, we perform a 2D polynomial fit to smooth the boundary polygon. Examples of the iterative TPS and the final smoothed polygon output are shown in Figure 4.
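
As a small illustration of the optional smoothing step, the snippet below fits a polynomial to one side of the boundary defined by the control points mapped back to image coordinates (the degree and sampling density are assumptions, not values from the paper):

```python
import numpy as np

def smooth_boundary_side(points, degree=3, samples=50):
    # points: (N, 2) control points along one side of the text polygon, in image coordinates.
    x, y = points[:, 0], points[:, 1]
    coeffs = np.polyfit(x, y, degree)         # fit y as a polynomial of x
    xs = np.linspace(x.min(), x.max(), samples)
    ys = np.polyval(coeffs, xs)
    return np.stack([xs, ys], axis=1)         # densely sampled, smoothed boundary
```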

The modules of the recognition stage are formed based on the results reported in [1]. The recognition stage contains three components: feature extraction, sequence modeling, and prediction. Since the feature extraction module takes high-level semantic features as input, it is lighter than a stand-alone recognizer.

The detailed architecture of the feature extraction module is shown in Table 1. After extracting the features, a bidirectional LSTM is applied for sequence modeling, and then an attention-based decoder makes the final text prediction.
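
A minimal PyTorch-style sketch of this recognition head is given below; it is a simplified stand-in rather than the architecture in Table 1, and the hidden sizes, vocabulary size, and maximum decoding length are assumptions:

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    def __init__(self, feat_dim=256, hidden=256, num_classes=97):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden + hidden, 1)            # additive attention score
        self.decoder = nn.GRUCell(2 * hidden + num_classes, hidden)
        self.classifier = nn.Linear(hidden, num_classes)
        self.num_classes = num_classes

    def forward(self, feats, max_len=25):
        # feats: (B, T, feat_dim) features extracted from the rectified text region.
        ctx, _ = self.rnn(feats)                                  # (B, T, 2*hidden)
        B, T, _ = ctx.shape
        h = ctx.new_zeros(B, self.decoder.hidden_size)
        prev = ctx.new_zeros(B, self.num_classes)                 # previous symbol distribution
        outputs = []
        for _ in range(max_len):
            # Score every time step against the current decoder state.
            score = self.attn(torch.cat([ctx, h.unsqueeze(1).expand(-1, T, -1)], dim=2))
            alpha = torch.softmax(score, dim=1)                   # (B, T, 1) attention weights
            glimpse = (alpha * ctx).sum(dim=1)                    # attended feature
            h = self.decoder(torch.cat([glimpse, prev], dim=1), h)
            logits = self.classifier(h)
            prev = torch.softmax(logits, dim=1)
            outputs.append(logits)
        return torch.stack(outputs, dim=1)                        # (B, max_len, num_classes)
```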

At each time step, the attention-based recognizer decodes textual information by masking the attention output onto the features. Although the attention module works well in most cases, it fails to predict characters when the attention points are misaligned or vanish [5, 14]. Figure 5 shows the effect of using the CRA module: properly placed attention points allow reliable text prediction.

The final loss L used for training consists of the detection loss and the recognition loss: L = Ldet + Lrec. The overall flow of the recognition loss is shown in Figure 6. The loss flows through the weights in the recognition stage and propagates to the detection stage through the character region attention module.
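
In code form, the joint objective is simply the sum of the two losses, so a single backward pass lets the recognition gradient reach the detection-stage weights via the CRA module. A minimal sketch (with hypothetical `det_loss`, `recog_loss`, and `optimizer` variables) looks like this:

```python
# L = L_det + L_rec: the losses are summed before one backward pass, so gradients
# from the recognition loss flow through the character region attention module
# into the detection-stage weights as well.
total_loss = det_loss + recog_loss   # hypothetical loss tensors
optimizer.zero_grad()
total_loss.backward()
optimizer.step()
```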

On the other hand, the detection loss is used as an intermediate loss, so the weights before the detection stage are updated using both the detection and recognition losses.

English datasets. The IC13 [20] dataset consists of high-resolution images: 229 images for training and 233 images for testing. Rectangular boxes are used to annotate word-level text instances. IC15 [19] contains 1000 training images and 500 test images; quadrilateral boxes are used to annotate word-level text instances. TotalText [7] has 1255 training images and 300 test images. Unlike the IC13 and IC15 datasets, it contains curved text instances and is annotated with polygon points.

Multi-language dataset. The IC19 [33] dataset contains 10,000 training images and 10,000 test images. The dataset contains text in seven different languages and is annotated with quadrilateral points.

We jointly train the detector and the recognizer in the CRAFTS model. To train the detection stage, we follow the weakly supervised training method described in [2]. The recognition loss is computed by batch-wise random sampling of cropped word features in each image. The maximum number of words per image is set to 16 to prevent out-of-memory errors. Data augmentation for the detector applies techniques such as cropping, rotation, and color variation. For the recognizer, the corners of the ground-truth box are perturbed within a range of 0% to 10% of the shorter side of the box.
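
A minimal sketch of the corner perturbation used when cropping word features for the recognizer (the noise distribution is an assumption; the text only specifies a 0-10% range of the shorter box side):

```python
import numpy as np

def jitter_box_corners(box, max_ratio=0.10):
    # box: (4, 2) ground-truth quadrilateral; each corner is shifted by up to
    # max_ratio of the shorter side of the box.
    box = np.asarray(box, dtype=np.float32)
    shorter = min(np.linalg.norm(box[1] - box[0]), np.linalg.norm(box[3] - box[0]))
    noise = np.random.uniform(-max_ratio, max_ratio, size=box.shape) * shorter
    return box + noise
```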

The model is first trained on the SynthText dataset [12] for 50k iterations, and then we further train the network on the target datasets. The Adam optimizer is used, and online hard negative mining (OHEM) [39] is applied to enforce a 1:3 ratio of positive to negative pixels in the detection loss. When fine-tuning the model, the SynthText dataset is mixed in at a 1:5 ratio. We use 94 characters to cover letters, digits, and special characters, and 4267 characters for the multilingual dataset.
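
The 1:3 OHEM ratio on the pixel-wise detection loss can be sketched as follows (a simplified illustration, assuming a per-pixel loss tensor and a boolean positive-pixel mask as inputs):

```python
import torch

def ohem_detection_loss(pixel_loss, pos_mask, neg_ratio=3):
    # Keep all positive pixels and only the hardest negatives at a 1:neg_ratio ratio.
    pos_loss = pixel_loss[pos_mask]
    neg_loss = pixel_loss[~pos_mask]
    num_neg = min(neg_loss.numel(), neg_ratio * max(pos_loss.numel(), 1))
    hard_neg, _ = torch.topk(neg_loss, k=num_neg)   # largest-loss (hardest) negatives
    denom = max(pos_loss.numel() + num_neg, 1)
    return (pos_loss.sum() + hard_neg.sum()) / denom
```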

Horizontal datasets (IC13, IC15)

For the IC13 benchmark, we take the model trained on the SynthText dataset and fine-tune it on the IC13 and IC19 datasets. During inference, we set the longer side of the input to 1280.

The results show a significant improvement in performance compared to previous state-of-the-art techniques.

The model trained on the IC13 dataset is then fine-tuned on the IC15 dataset. During evaluation, the input size of the model is set to 2560×1440. Note that we perform the generic evaluation without a generic vocabulary set. The quantitative results on the IC13 and IC15 datasets are presented in Table 2.

A heat map is used to illustrate the character region map and the link map, and the weighted pixel-wise angle values are visualized in the HSV color space.

As shown, the network successfully localizes polygonal regions and identifies characters in curved text regions. The two figures in the upper left corner show examples of fully rotated and highly curved text that were successfully recognized.

Attention aided by character-region attention

In this section, we investigate how character-region attention (CRA) affects the performance of the recognizer by training a separate network without CRA.

Table 5 shows the effect of using CRA on the benchmark datasets. Without CRA, we observe a performance drop on all datasets. In particular, on the perspective dataset (IC15) and the curved dataset (TotalText), we observe a larger gap than on the horizontal dataset (IC13). This implies that feeding character attention information improves the performance of the recognizer when dealing with irregular text. (Translator's question: the experimental data in the table seem more favorable for perspective text; how was this conclusion reached?)

Importance of Orientation Estimation

Orientation estimation is important because scene text images contain a large amount of multi-oriented text. Our pixel-wise averaging scheme is useful for the recognizer to receive well-defined features. We compare against a model that does not use orientation information. On the IC15 dataset, the performance drops from 74.9% to 74.1% (-0.8%), and on the TotalText dataset the h-mean drops from 78.7% to 77.5% (-1.2%). The results show that using correct angle information improves performance on rotated text.

Inference speed

Since the inference speed varies with the input image size, we measure the FPS at different input resolutions, with the longer side set to 960, 1280, 1600, and 2560; the tests yield 9.9, 8.3, 6.8, and 5.4 FPS, respectively. For all experiments, we use an Nvidia P40 GPU and an Intel Xeon CPU. Compared with the 8.6 FPS of the VGG-based CRAFT detector [2], the ResNet-based CRAFTS network achieves a higher FPS on inputs of the same size. Moreover, directly using the control points from the rectification module alleviates the need for post-processing to generate polygons.

The problem of granularity difference

We hypothesize that the granularity difference between the ground-truth boxes and the predicted boxes results in relatively low detection performance on the IC15 dataset. Character-level segmentation methods tend to generalize character connectivity based on spatial and color cues, rather than capturing the complete features of word instances. As a result, the output does not follow the box annotation style required by the benchmark. Figure 9 shows failure cases on the IC15 dataset, demonstrating that even when we observe acceptable qualitative results, they are marked as incorrect by the evaluation.

In this paper, we presented an end-to-end trainable single-pipeline model that tightly couples the detection and recognition modules. Character region attention in the sharing stage leverages the character region map to help the recognizer rectify and better attend to text regions. In addition, we designed the recognition loss to propagate through the detection stage, which enhances the character-localization ability of the detector. Moreover, the rectification module in the sharing stage enables fine localization of curved text and eliminates the need for hand-crafted post-processing. Experimental results validate the state-of-the-art performance of CRAFTS on various datasets.