[21.03] ABINet
Thinking more!
Visual recognition may have reached its limits when it comes to understanding text in challenging scenarios. Researchers are now considering how to enable models to "read" and interpret text more like humans.
Problem Definition
Previous research has often relied solely on visual features for text recognition. However, when text is blurred, occluded, or distorted, the visual features themselves are compromised, leaving the model with too little evidence to recognize the text reliably.
To address this limitation, researchers have begun incorporating language models to help models understand text contextually. For instance, if we see a blurred text like “APP?E,” a language model can help us infer that the intended word is likely “APPLE.”
Human reading behavior is generally characterized by:
- Autonomy: Humans can learn from both visual and linguistic cues independently, with language knowledge supporting visual understanding of text.
- Bidirectionality: When text is unclear, humans use surrounding context to fill in the blanks.
- Iterative Process: Humans can continuously refine their understanding through repeated reasoning and correction.
This paper’s authors propose that models should replicate these human reading characteristics:
- Autonomy: The model uses separate vision and language modules to model text independently by its visual and linguistic features.
- Bidirectionality: The authors introduce a “Bidirectional Cloze Network” that learns from context on both sides of the text.
- Iterative Process: The model refines its predictions iteratively, gradually improving its accuracy.
With this approach in mind, the authors have defined the conceptual foundation. The remaining step is to prove its effectiveness through experimentation!
Solution
The authors propose a new model architecture: ABINet, designed to tackle the challenges of scene text recognition.
Visual Model
The visual model in ABINet functions similarly to traditional network architectures, utilizing ResNet for feature extraction to transform an input image $\mathbf{x}$ into a feature representation:

$$
\mathbf{F}_b = \mathcal{R}(\mathbf{x}) \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}
$$

where $H$ and $W$ represent the image dimensions, and $C$ is the feature dimension.
The visual features are then converted to character probabilities and transcribed in parallel using a positional attention mechanism:

$$
\mathbf{F}_v = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{C}}\right)\mathbf{V}
$$
Here's a breakdown of each component in the attention calculation:
- $\mathbf{Q} \in \mathbb{R}^{T \times C}$: This is the positional encoding of the character sequence, encoding each character's position in the sequence to help the model understand the order of characters. Here, $T$ represents the sequence length, i.e. the number of characters to be processed, while $C$ denotes the feature dimension.
- $\mathbf{K} = \mathcal{G}(\mathbf{F}_b) \in \mathbb{R}^{\frac{HW}{16} \times C}$: The "key" is derived from $\mathbf{F}_b$ through a function $\mathcal{G}(\cdot)$, which in this case is implemented as a mini U-Net. $\mathcal{G}(\cdot)$ refines the flattened visual features while preserving their spatial resolution, so that each positional query can attend to the relevant spatial locations of the image.
- $\mathbf{V} = \mathcal{H}(\mathbf{F}_b) \in \mathbb{R}^{\frac{HW}{16} \times C}$: The "value" is obtained by applying an identity mapping $\mathcal{H}(\cdot)$ on $\mathbf{F}_b$, meaning that $\mathbf{F}_b$ is passed through without additional transformation. This gives $\mathbf{V}$ the same dimensions as $\mathbf{F}_b$, preserving consistency in feature space and spatial resolution.
In essence, the positional attention mechanism maps the visual features into character probability predictions based on positional information: by combining each character's sequential position with the image features, the model can transcribe all characters accurately and in parallel.
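To make this concrete, here is a minimal PyTorch sketch of the positional attention step. The shapes and the single linear layer standing in for the mini U-Net $\mathcal{G}(\cdot)$ are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

# Illustrative shapes: T = max character sequence length, C = feature dim,
# N = (H/4) * (W/4) flattened spatial positions of the backbone feature map.
T, C, N = 26, 512, 8 * 32

Q = torch.randn(1, T, C)    # positional encodings of the character order
F_b = torch.randn(1, N, C)  # flattened visual features from the ResNet backbone

# K = G(F_b): a single linear layer stands in for the mini U-Net G(.)
G = nn.Linear(C, C)
K = G(F_b)

# V = H(F_b): identity mapping, so V is simply F_b
V = F_b

attn = torch.softmax(Q @ K.transpose(1, 2) / C ** 0.5, dim=-1)  # (1, T, N)
F_v = attn @ V  # (1, T, C): one feature vector per character position
```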
Language Model
The language model in ABINet is integrated after the visual model, using several key strategies to improve text recognition performance:
- Autonomous Strategy:
The language model functions independently as a spelling-correction model: it takes character probability vectors as input and outputs a probability distribution for each character. This independence allows the language model to be trained separately on unlabeled text data, improves the model's interpretability and modularity, and means the language model can be replaced or adjusted independently.
To ensure this autonomy, the authors introduce a technique called Blocking Gradient Flow (BGF), which prevents gradients from the language model from flowing back into the visual model, thus keeping the two models decoupled.
This approach enables the model to leverage advancements in natural language processing (NLP). For instance, it allows the use of various pre-trained language models to boost performance as needed.
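A sketch of what blocking gradient flow looks like in practice, assuming a PyTorch setup; the `language_model` here is a hypothetical stand-in module:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the language model; 37 = illustrative charset size.
language_model = nn.Linear(37, 37)

vision_logits = torch.randn(2, 26, 37, requires_grad=True)  # (batch, T, classes)

# Blocking Gradient Flow (BGF): detach the character probabilities before
# they enter the language model, so gradients from the language branch
# never flow back into the visual model and the two stay autonomous.
probs = vision_logits.softmax(dim=-1).detach()
lm_out = language_model(probs)
lm_out.sum().backward()            # backprop reaches the LM parameters only
assert vision_logits.grad is None  # the visual model receives no gradient
```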
- Bidirectional Strategy:
The language model calculates conditional probabilities for bidirectional and unidirectional representations, denoted as $P(y_i \mid y_1, \dots, y_{i-1}, y_{i+1}, \dots, y_n)$ for bidirectional and $P(y_i \mid y_1, \dots, y_{i-1})$ for unidirectional. Bidirectional modeling provides richer semantic information by considering both left and right contexts.
Similar to the Masked Language Model (MLM) in BERT, which replaces a character $y_i$ with a [MASK] token and predicts it from its context, a direct use of MLM is computationally expensive: each string would need to undergo a separate mask operation for every character to be predicted. To improve efficiency, the authors propose the Bidirectional Cloze Network (BCN), which achieves bidirectional representation without repeated masking.
BCN adopts a Transformer decoder-like structure but with unique modifications. Instead of causal masking, BCN uses a custom attention mask to prevent each character from "seeing" itself, thus avoiding information leakage.
The mask matrix $\mathbf{M} \in \mathbb{R}^{T \times T}$ in BCN's multi-head attention block is constructed as follows:

$$
\mathbf{M}_{ij} =
\begin{cases}
0, & i \neq j \\
-\infty, & i = j
\end{cases}
$$
Here, $\mathbf{K}_i = \mathbf{V}_i = P(y_i)\,\mathbf{W}_l$, where $P(y_i)$ is the probability distribution of character $y_i$, and $\mathbf{W}_l$ is a linear mapping matrix.
The multi-head attention computation is given by:

$$
\mathbf{F}_{\mathrm{mha}} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{C}} + \mathbf{M}\right)\mathbf{V}
$$
In this, $\mathbf{Q}$ represents positional encodings in the first layer and the output of the previous layer in subsequent layers, while $\mathbf{K}$ and $\mathbf{V}$ come from the linear mapping of the character probabilities $P(y_i)$.
The BCN’s cloze-like attention mask enables the model to learn a stronger bidirectional representation, capturing more complete semantic context than unidirectional models.
:::tip
Key Points:
- BCN uses a decoder structure but without causal masking, enabling parallel decoding.
- Instead of [MASK], BCN applies a diagonal mask to improve computational efficiency.
:::
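The cloze-style mask itself takes only a few lines to build; a minimal sketch in PyTorch, with an illustrative sequence length:

```python
import torch

T = 26  # illustrative maximum character sequence length

# Cloze-style attention mask: each position may attend to every other
# position, but the diagonal is -inf so a character never "sees" itself.
M = torch.zeros(T, T)
M.fill_diagonal_(float("-inf"))

# Used inside attention as: softmax(Q @ K.transpose(-2, -1) / C**0.5 + M) @ V
# A causal mask would instead set the whole upper triangle to -inf.
```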
- Iterative Strategy:
The authors propose an Iterative Correction strategy to address noise in the language model's input, which can lower prediction confidence during the Transformer's parallel decoding. In the first iteration, the input $y$ is the visual model's probability prediction; in subsequent iterations, it comes from the fusion model's prediction in the previous round.
:::tip
This approach is akin to stacking multiple Transformer layers, with the authors referring to it as "iterative correction."
:::
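A minimal sketch of how the iterations could be wired together, with simple linear layers as stand-ins for the real visual head, BCN, and fusion module (all names and shapes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

T, C, V = 26, 512, 37  # sequence length, feature dim, charset size (illustrative)

# Simple linear layers as stand-ins for the real visual head, BCN, and fusion.
vision_head = nn.Linear(C, V)
language_model = nn.Linear(V, C)  # placeholder for the BCN
fusion_head = nn.Linear(C, V)

F_v = torch.randn(1, T, C)            # visual features
probs = vision_head(F_v).softmax(-1)  # iteration 1 input: visual predictions

for _ in range(3):                        # M correction iterations
    F_l = language_model(probs.detach())  # BGF: gradients are blocked here
    F_f = 0.5 * F_v + 0.5 * F_l           # simplified fusion (gating shown below)
    probs = fusion_head(F_f).softmax(-1)  # refined predictions feed the next round
```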
Fusion
Since the visual model is trained on image data and the language model on text data, these two sources of information need to be aligned. The authors employ a "gated mechanism" to fuse the visual and language features effectively, balancing their contributions in the final prediction.
Visual features $\mathbf{F}_v$ and language features $\mathbf{F}_l$ are concatenated, and a linear mapping compresses the concatenated features back to the common dimension of $\mathbf{F}_v$ and $\mathbf{F}_l$.
The gated vector $\mathbf{G}$ is calculated as:

$$
\mathbf{G} = \sigma\!\left([\mathbf{F}_v, \mathbf{F}_l]\,\mathbf{W}_f\right)
$$
Here, $\sigma$ is the Sigmoid function, which bounds $\mathbf{G}$ between 0 and 1, controlling the balance between the visual and language features in the final output.
The fused feature $\mathbf{F}_f$ is obtained by the weighted combination of $\mathbf{F}_v$ and $\mathbf{F}_l$ as follows:

$$
\mathbf{F}_f = \mathbf{G} \odot \mathbf{F}_v + (1 - \mathbf{G}) \odot \mathbf{F}_l
$$
where $\odot$ represents element-wise multiplication. When $\mathbf{G}$ is close to 1, the visual features $\mathbf{F}_v$ have a stronger influence, while a value close to 0 gives more weight to the language features $\mathbf{F}_l$.
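The gating reduces to a few lines of code; a sketch assuming illustrative shapes and an `nn.Linear` standing in for $\mathbf{W}_f$:

```python
import torch
import torch.nn as nn

T, C = 26, 512              # illustrative sequence length and feature dim
F_v = torch.randn(1, T, C)  # visual features
F_l = torch.randn(1, T, C)  # language features

W_f = nn.Linear(2 * C, C)   # linear mapping applied to the concatenation

G = torch.sigmoid(W_f(torch.cat([F_v, F_l], dim=-1)))  # gate in (0, 1)
F_f = G * F_v + (1 - G) * F_l                          # element-wise fusion
```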
Supervised Training
ABINet is trained end-to-end with a multi-task objective, combining the losses from the visual, language, and fused features.
The objective function is:

$$
\mathcal{L} = \lambda_v \mathcal{L}_v + \frac{\lambda_l}{M} \sum_{i=1}^{M} \mathcal{L}_l^{(i)} + \frac{\lambda_f}{M} \sum_{i=1}^{M} \mathcal{L}_f^{(i)}
$$

where $\mathcal{L}_v$, $\mathcal{L}_l$, and $\mathcal{L}_f$ are cross-entropy losses computed from $\mathbf{F}_v$, $\mathbf{F}_l$, and $\mathbf{F}_f$ respectively, $\mathcal{L}_l^{(i)}$ and $\mathcal{L}_f^{(i)}$ are the losses at the $i$-th iteration, $M$ is the total number of iterations, and $\lambda_v$, $\lambda_l$, $\lambda_f$ are balancing weights.
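As a sanity check on the formula, here is a sketch of the multi-task loss in PyTorch; the function name, flattening scheme, and default weights are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def abinet_loss(logits_v, logits_l_iters, logits_f_iters, targets,
                lambda_v=1.0, lambda_l=1.0, lambda_f=1.0):
    """Multi-task cross-entropy over visual, language, and fused predictions.

    logits_v:        (B, T, V) visual-branch logits
    logits_*_iters:  lists of (B, T, V) logits, one entry per iteration
    targets:         (B, T) ground-truth character indices
    """
    M = len(logits_f_iters)

    def ce(logits):
        return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

    loss_v = ce(logits_v)
    loss_l = sum(ce(x) for x in logits_l_iters) / M  # averaged over iterations
    loss_f = sum(ce(x) for x in logits_f_iters) / M
    return lambda_v * loss_v + lambda_l * loss_l + lambda_f * loss_f
```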