[21.07] TextBPN

Try Some GCN!

Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection


This paper introduces several concepts to address the problem of text detection, primarily focusing on two key techniques: LSTM, which is used to capture sequential information about text boundaries, and GCN (Graph Convolutional Networks), used to refine boundary features. While these topics could be discussed in great depth, the assumption is that readers are already familiar with these concepts.

Let's dive directly into the details.

Problem Definition

We have reviewed several text detection methods, and most of them deal with overlapping text regions by first identifying the "text core" and then expanding outwards to find the text boundaries. These approaches are often slow, or as the paper describes: "complex and inefficient."

To overcome these issues, the authors propose a novel adaptive boundary proposal network for arbitrary-shaped text detection, which directly generates accurate boundaries for text without any post-processing.

Problem Solution

Model Architecture

textbpn arch

The architecture may seem complex at first glance, so let’s break it down into parts.

Backbone + Neck

textbpn backbone

The feature extraction uses a ResNet-50 backbone, and the feature fusion in the neck is done using a U-Net-like architecture through concatenation. The final output from the neck consists of feature maps with 32 channels, retaining the same size as the input image.
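
As a rough sketch of this part (PyTorch-style; the intermediate channel widths and block layout here are assumptions — only the ResNet-50 backbone, the concatenation-based U-Net-like fusion, and the 32-channel, input-resolution output come from the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class UpBlock(nn.Module):
    """Upsample the deeper feature, concatenate it with the skip feature, then fuse."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[2:], mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([x, skip], dim=1))

class BackboneNeck(nn.Module):
    """ResNet-50 backbone with U-Net-like concatenation fusion down to 32 channels."""
    def __init__(self):
        super().__init__()
        r = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)      # stride 2, 64 ch
        self.layer1 = nn.Sequential(r.maxpool, r.layer1)       # stride 4, 256 ch
        self.layer2, self.layer3, self.layer4 = r.layer2, r.layer3, r.layer4
        self.up4 = UpBlock(2048, 1024, 256)
        self.up3 = UpBlock(256, 512, 128)
        self.up2 = UpBlock(128, 256, 64)
        self.up1 = UpBlock(64, 64, 32)

    def forward(self, img):
        c0 = self.stem(img)          # 1/2 resolution
        c1 = self.layer1(c0)         # 1/4
        c2 = self.layer2(c1)         # 1/8
        c3 = self.layer3(c2)         # 1/16
        c4 = self.layer4(c3)         # 1/32
        x = self.up4(c4, c3)
        x = self.up3(x, c2)
        x = self.up2(x, c1)
        x = self.up1(x, c0)
        # 32-channel shared features at the input resolution.
        return F.interpolate(x, size=img.shape[2:], mode="bilinear", align_corners=False)
```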

Boundary Proposal Generation

textbpn boundary proposal

The boundary proposal generation branch consists of multiple dilated convolutions: two 3×3 convolution layers (with different dilation rates) followed by one 1×1 convolution layer. These layers operate on the shared features extracted by the backbone and produce three outputs: a classification map, a distance field map, and a direction field map.

  1. Classification Map: A binary map where each pixel represents the confidence of belonging to text (binary classification: text/non-text).

  2. Direction Field Map: The direction field map $V$ is a two-dimensional unit vector field $(\vec{x}, \vec{y})$ in which each text pixel stores the unit vector pointing from its nearest boundary pixel toward itself. For each text instance $T$, the unit vector for text pixel $p$ with nearest boundary pixel $B_p$ is computed as:

    $$V_{gt}(p) = \begin{cases} \frac{\overrightarrow{B_p p}}{\|\overrightarrow{B_p p}\|}, & p \in T \\ (0, 0), & p \notin T \end{cases}$$

    where $\|\overrightarrow{B_p p}\|$ is the distance between pixel $p$ and its nearest boundary pixel $B_p$.

  3. Distance Field Map: The distance field map $D$ stores, for each text pixel $p$, its distance to the nearest boundary pixel $B_p$, normalized by the instance scale:

    $$D_{gt}(p) = \begin{cases} \frac{\|\overrightarrow{B_p p}\|}{L}, & p \in T \\ 0, & p \notin T \end{cases}$$

    where $L$ is the scale of the text instance $T$, i.e. the maximum boundary distance inside the instance:

    $$L = \max_{p \in T} \|\overrightarrow{B_p p}\|$$

    (A short code sketch of computing both ground-truth fields from a text mask follows this list.)
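
To make these two definitions concrete, here is a minimal sketch (NumPy/SciPy, not the authors' code) that derives both ground-truth maps for one text instance from its binary mask using a Euclidean distance transform; the nearest background pixel stands in for the boundary pixel $B_p$:

```python
import numpy as np
from scipy import ndimage

def gt_fields(text_mask: np.ndarray):
    """Compute D_gt and V_gt for one text instance from a binary mask of shape (H, W)."""
    # For every text pixel: distance to the nearest background pixel (a stand-in
    # for the boundary pixel B_p), plus the coordinates of that nearest pixel.
    dist, (by, bx) = ndimage.distance_transform_edt(text_mask, return_indices=True)

    # Distance field: normalize by the instance scale L (maximum distance inside T).
    L = dist.max() if dist.max() > 0 else 1.0
    D_gt = dist / L                                   # stays 0 outside the text region

    # Direction field: unit vector of B_p -> p, i.e. p - B_p, per the formula above.
    py, px = np.indices(text_mask.shape)
    vy, vx = py - by, px - bx
    norm = np.hypot(vy, vx)
    norm[norm == 0] = 1.0                             # avoid division by zero
    V_gt = np.stack([vx, vy], axis=0) / norm          # shape (2, H, W)
    V_gt *= text_mask[None].astype(V_gt.dtype)        # force (0, 0) for non-text pixels
    return D_gt, V_gt
```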

textbpn boundary proposal

Based on the distance field map $D$, candidate boundary proposals are generated by thresholding with a fixed value $th_d$, but these proposals may include false positives. Therefore, the average confidence of each candidate proposal is computed from the classification map, and proposals whose confidence falls below a threshold $th_s$ are discarded.

These steps and formulas constitute the boundary proposal generation process, allowing effective text region detection while handling boundary issues between adjacent text instances.
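
As an illustration of this thresholding-and-filtering step, here is a small sketch (using OpenCV connected components and contours; the function name and defaults are illustrative, with $th_d = 0.3$ taken from the experiment settings below and $th_s$ chosen per dataset):

```python
import cv2
import numpy as np

def generate_boundary_proposals(distance_field: np.ndarray,
                                classification_map: np.ndarray,
                                th_d: float = 0.3,
                                th_s: float = 0.8):
    """Threshold the distance field into candidate regions, then keep only the
    candidates whose mean classification confidence exceeds th_s."""
    candidate_mask = (distance_field > th_d).astype(np.uint8)
    num, labels = cv2.connectedComponents(candidate_mask)

    proposals = []
    for k in range(1, num):                       # label 0 is background
        region = (labels == k)
        if classification_map[region].mean() < th_s:
            continue                              # discard low-confidence candidates
        # Extract the region contour as the initial boundary proposal.
        contours, _ = cv2.findContours(region.astype(np.uint8),
                                       cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        if contours:
            proposals.append(max(contours, key=cv2.contourArea).squeeze(1))
    return proposals                              # each: (num_points, 2) xy polygon
```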

Adaptive Boundary Deformation

The concept of adaptive boundary deformation builds on several earlier works on contour and boundary deformation.

Building on these ideas, the model refines each boundary proposal by predicting offsets for its control points. Based on CNN features, the model performs per-vertex offset prediction: each boundary proposal is sampled into $N$ control points, which form a closed polygon that carries both topological and sequential context.

For each control point $cp_i = [x_i, y_i]^T$, a feature vector $f_i$ is built by concatenating the 32-dimensional shared features $F_s$ from the CNN backbone with the 4-dimensional prior features $F_p$ (pixel classification, distance field, and the two-channel direction field) sampled at that point:

$$f_i = \text{concat}\{F_s(x_i, y_i), F_p(x_i, y_i)\}$$
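
A rough sketch of gathering $f_i$ for all $N$ control points in one shot (PyTorch; the use of bilinear sampling via `F.grid_sample` is an implementation assumption rather than a detail given in the paper):

```python
import torch
import torch.nn.functional as F

def sample_control_point_features(F_s: torch.Tensor,   # (B, 32, H, W) shared features
                                  F_p: torch.Tensor,   # (B, 4, H, W) prior features
                                  cps: torch.Tensor):  # (B, N, 2) control points (x, y) in pixels
    """Build f_i = concat{F_s(x_i, y_i), F_p(x_i, y_i)} for every control point."""
    B, _, H, W = F_s.shape
    feats = torch.cat([F_s, F_p], dim=1)                       # (B, 36, H, W)

    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    grid = cps.clone()
    grid[..., 0] = 2.0 * cps[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * cps[..., 1] / (H - 1) - 1.0
    grid = grid.unsqueeze(2)                                    # (B, N, 1, 2)

    sampled = F.grid_sample(feats, grid, align_corners=True)    # (B, 36, N, 1)
    return sampled.squeeze(-1).permute(0, 2, 1)                 # (B, N, 36)
```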

Once control point predictions are obtained, boundary refinement takes place. This is done using GCN and RNN, as illustrated below:

Adaptive Boundary Deformation

The encoder combines GCN and RNN to fully utilize both topological and sequential contexts of the boundary. The formula is:

$$X' = \text{RNN}(X) \oplus \text{GCN}(X) \oplus \text{Conv1x1}(X)$$

The RNN branch is a bidirectional LSTM (B-LSTM) with a hidden size of 128; its input can be viewed as a tensor of shape [batch size, number of control points, control-point feature dimension].


The GCN consists of four graph convolution layers with ReLU activation. The formula for the graph convolution layer is:

$$X_g = \text{ReLU}((X \oplus GX)W)$$

where $G = D^{-1/2} \tilde{A} D^{-1/2}$ and $\tilde{A} = A + I_N$, with $A$ the adjacency matrix, $I_N$ the identity matrix, and $D$ the degree matrix. Each control point is connected to its four neighboring control points so that contextual information can flow along the boundary.


In addition to the RNN and GCN branches, a 1x1 convolution layer forms a residual connection. The features from all branches are concatenated and passed to the decoder, which predicts the offset for each control point.
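
Putting these pieces together, a simplified sketch of the deformation encoder and offset decoder might look as follows (PyTorch; the adjacency construction and branch widths follow the description above, while the decoder layout is an assumption, not the released code):

```python
import torch
import torch.nn as nn

def normalized_adjacency(n_points: int, n_neighbors: int = 4) -> torch.Tensor:
    """G = D^{-1/2} (A + I) D^{-1/2} for a cyclic graph in which each control point
    is linked to its `n_neighbors` nearest neighbors along the closed boundary."""
    A = torch.zeros(n_points, n_points)
    for i in range(n_points):
        for step in range(1, n_neighbors // 2 + 1):
            A[i, (i + step) % n_points] = 1.0
            A[i, (i - step) % n_points] = 1.0
    A_tilde = A + torch.eye(n_points)
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)

class GraphConv(nn.Module):
    """X_g = ReLU((X ⊕ GX) W), with ⊕ implemented as channel concatenation."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, X, G):
        return torch.relu(self.linear(torch.cat([X, G @ X], dim=-1)))

class DeformationEncoder(nn.Module):
    """X' = RNN(X) ⊕ GCN(X) ⊕ Conv1x1(X); a small decoder regresses per-point offsets."""
    def __init__(self, in_dim=36, n_points=20):
        super().__init__()
        self.register_buffer("G", normalized_adjacency(n_points))
        self.rnn = nn.LSTM(in_dim, 128, batch_first=True, bidirectional=True)
        self.gcn = nn.ModuleList([GraphConv(in_dim if i == 0 else 128, 128)
                                  for i in range(4)])
        self.res = nn.Conv1d(in_dim, 128, kernel_size=1)        # 1x1 residual branch
        self.decoder = nn.Sequential(nn.Linear(256 + 128 + 128, 128),
                                     nn.ReLU(inplace=True),
                                     nn.Linear(128, 2))         # (dx, dy) per point

    def forward(self, X):                                       # X: (B, N, in_dim)
        rnn_out, _ = self.rnn(X)                                # (B, N, 256)
        g = X
        for layer in self.gcn:
            g = layer(g, self.G)                                # (B, N, 128)
        res = self.res(X.transpose(1, 2)).transpose(1, 2)       # (B, N, 128)
        fused = torch.cat([rnn_out, g, res], dim=-1)            # concatenate branches
        return self.decoder(fused)                              # per-point offsets
```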

Multiple iterations of boundary deformation are performed to refine the initial boundary proposals, progressively improving text detection precision.
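
Reusing `sample_control_point_features` and `DeformationEncoder` from the sketches above, the iterative refinement reduces to a short loop (three iterations, matching the ablation study below):

```python
def refine_boundary(cps, F_s, F_p, encoder, num_iters: int = 3):
    """Iteratively deform a boundary proposal: sample features at the current
    control points, predict per-point offsets, and move the points."""
    for _ in range(num_iters):
        feats = sample_control_point_features(F_s, F_p, cps)  # (B, N, 36)
        offsets = encoder(feats)                              # (B, N, 2)
        cps = cps + offsets                                   # updated control points
    return cps
```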

tip

Here, RNN is used for sequence handling, and GCN for processing information flow between neighboring nodes. Given the advancements in 2021, using transformers for these tasks could be a potential improvement direction in the future.

Training Datasets

  • SynthText: A large synthetic dataset with approximately 800,000 images, containing natural scenes mixed with text in random fonts, sizes, colors, and orientations.
  • CTW1500: A challenging dataset focusing on long curved text, featuring 1,000 training and 500 test images annotated with 14-point polygons to describe arbitrary curve shapes.
  • Total-Text: A dataset for curved text detection, with 1,255 training and 300 test images, including horizontal, multi-directional, and curved text.
  • MSRA-TD500: A multilingual dataset with long text lines in arbitrary orientations, featuring 300 training and 200 test images. Because the training set is small, the HUST-TR400 dataset is added for training.
  • ICDAR 2017 MLT: A large-scale multilingual dataset containing 7,200 training, 1,800 validation, and 9,000 test images, covering nine languages.

Discussion

Ablation Studies

To verify the effectiveness of the adaptive deformation model, the authors conducted ablation studies on Total-Text and CTW-1500 datasets.

  1. Effectiveness of Adaptive Deformation Model:

    ablation deformation

    Four encoder variations were compared: Fully Connected Network (FC) + Conv1x1, RNN, Circular Convolution, and GCN.

    The adaptive deformation model achieved the best performance on both datasets, with an F-measure improvement of 0.94% on Total-Text compared to RNN and 0.85% compared to GCN, without significantly increasing detection time.

  2. Impact of Control Point Number (N):

    ablation control point

    Different numbers of control points (ranging from 12 to 32) were tested. The F-measure decreased when the number of control points was too large or too small. At 20 control points, the model achieved the best performance on both datasets, so this number was fixed for experiments.

  3. Impact of Iteration Count:

    ablation iteration

    Performance improved with more iterations but plateaued after three iterations. Considering the balance between speed and performance, three iterations were chosen.

  4. Impact of Prior Information:

    ablation prior

The importance of different prior information (e.g., classification map, distance field, direction field) was evaluated. Adding distance and direction fields significantly boosted performance, with F-measure improvements of 5.45% and 1.55%, respectively.

Results on Curved Text

total-text

Total-Text Results

ctw1500

CTW1500 Results


The short side of images was scaled to 640, with the long side capped at 1024 pixels. The distance field threshold $th_d$ was fixed at 0.3.

On Total-Text, the classification threshold $th_s$ was set to 0.825, and TextBPN achieved an F-measure of 87.37% with SynthText pretraining, which increased to 87.85% with MLT17 pretraining, outperforming other methods.

On CTW-1500, with $th_s = 0.8$, TextBPN achieved a precision of 87.81% and an F-measure of 85.0%, significantly surpassing methods like TextSnake and DB.

Results on MSRA-TD500

msra-td500

On MSRA-TD500, with $th_s = 0.925$, TextBPN achieved an F-measure of 85.57%, outperforming other techniques such as DB and DRRG.

Visualization Results

textbpn vis

Conclusion

The introduction of GCN and RNN for text detection is indeed a novel strategy. Experimental results demonstrate that this method achieves significant performance improvements on challenging datasets like Total-Text and CTW-1500, particularly in handling irregular and curved text.

Future research could aim to further improve inference speed and flexibility while maintaining detection accuracy, making this method more practical for a wider range of applications.

tip

The classification threshold $th_s$ plays a critical role when evaluating on each dataset. Improper tuning of this parameter can significantly degrade the model's measured performance, which can be a real headache from an engineering perspective.