[19.08] PAN
Pixel Aggregation Strategy
Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network
The authors, after recently publishing PSENet, have swiftly followed up with the release of PAN.
Problem Definition
This time, the goal is straightforward: the authors felt that the previous model architecture was too slow.
The solution? Replace ResNet50 and VGG16 with ResNet18!
Also, modify the FPN by switching it to depthwise separable convolutions!
In short, the focus is on lightweight design—whatever makes the model faster and more efficient!
Solution
Model Architecture
The image above shows the model architecture of PAN, which looks quite similar to PSENet. To improve speed, the authors propose several modifications.
First, they reduced the parameters of the Feature Pyramid Network (FPN), but since this reduction would lower feature expression capability, they introduced a new fusion strategy to compensate.
The paper introduces several abbreviations, so let’s familiarize ourselves with them first:
- FPEM: Feature Pyramid Enhancement Module
- FFM: Feature Fusion Module
Let's break these down in sequence.
FPEM
This is essentially an FPN, but the convolution layers have been modified. The standard convolutions have been replaced with depthwise separable convolutions to reduce parameters and increase speed.
The authors emphasize that this is a stackable module, where stacking multiple layers enhances feature expression, compensating for the shortcomings of lightweight backbone networks.
Stacking multiple feature fusion modules was a common theme in 2019 papers, such as NAS-FPN and BiFPN.
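For intuition, here is a minimal PyTorch sketch of the depthwise separable convolution that replaces the standard $3 \times 3$ convolutions (the class name and the exact BN/ReLU placement are my assumptions, not the paper's reference code):

```python
import torch.nn as nn

class SeparableConv(nn.Module):
    # A 3x3 depthwise convolution followed by a 1x1 pointwise projection.
    # This factorization cuts parameters and FLOPs versus a full 3x3 conv.
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size=3, stride=stride,
            padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.pointwise(self.depthwise(x))))
```

For $C$ channels, the depthwise-plus-pointwise pair needs roughly $9C + C^2$ weights instead of the $9C^2$ of a standard $3 \times 3$ convolution, which is where the savings come from.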
FFM
After stacking several FPEM modules, you’ll end up with many feature maps.
For example, a single FPEM in the paper produces four feature maps at different scales, each with 128 channels, for a total of $4 \times 128 = 512$ channels. Stack $N$ FPEM layers and a naive concatenation would leave you with $512 \times N$ channels, which would be computationally expensive for subsequent processing.
The authors address this by element-wise adding the corresponding output feature maps of each FPEM layer, as shown in the image. This means that regardless of how many layers are stacked, the final fused output still has only 512 channels.
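A rough sketch of this fusion step, assuming each FPEM returns its four pyramid levels ordered from finest to coarsest (the function and variable names are mine):

```python
import torch
import torch.nn.functional as F

def feature_fusion(fpem_outputs):
    """fpem_outputs: list of N FPEM outputs, each a list of 4 maps
    with 128 channels, ordered finest to coarsest."""
    # Element-wise sum of corresponding levels across all stacked FPEMs.
    fused = [torch.stack(level).sum(dim=0) for level in zip(*fpem_outputs)]
    # Upsample every level to the finest resolution, then concatenate:
    # 4 levels x 128 channels = 512 channels, independent of N.
    size = fused[0].shape[-2:]
    fused = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
             for f in fused]
    return torch.cat(fused, dim=1)
```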
Pixel Aggregation Strategy
Once the backbone and neck parts are completed, the final stage is the prediction head.
Pixel aggregation aims to correctly merge the pixels in the text region with their corresponding kernel to reconstruct the complete text instance. This method borrows from clustering concepts, where a text instance is treated as a cluster, the kernel as the cluster center, and the pixels in the text region as the samples to be clustered.
During training, pixel aggregation uses an aggregation loss to guide the pixels of text towards their kernel and a discrimination loss to maintain distance between different kernels.
(1) Aggregation Loss
- The goal of the aggregation loss is to minimize the distance between the pixels of the same text instance and its kernel.
- For each text instance $T_i$, the loss is defined as:

  $$\mathcal{L}_{agg} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|T_i|} \sum_{p \in T_i} \ln\big(\mathcal{D}(p, K_i) + 1\big)$$

  where $N$ is the total number of text instances, $T_i$ is the $i$-th text instance, $K_i$ is its kernel, and $p$ is a pixel within the text instance.
- The distance function $\mathcal{D}(p, K_i)$ is defined as:

  $$\mathcal{D}(p, K_i) = \max\big(\lVert \mathcal{F}(p) - \mathcal{G}(K_i) \rVert_2 - \delta_{agg},\ 0\big)^2$$

  where $\mathcal{F}(p)$ is the similarity vector of pixel $p$, and $\mathcal{G}(K_i)$ is the similarity vector of kernel $K_i$, defined as:

  $$\mathcal{G}(K_i) = \frac{\sum_{q \in K_i} \mathcal{F}(q)}{|K_i|}$$

$\delta_{agg}$ is set to 0.5, filtering out easily classified samples.
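A minimal PyTorch sketch of this loss, assuming per-instance boolean masks and a `(C, H, W)` similarity map (the helper name and tensor layout are assumptions):

```python
import torch

def aggregation_loss(similarity, text_masks, kernel_masks, delta_agg=0.5):
    """similarity: (C, H, W) per-pixel similarity vectors.
    text_masks / kernel_masks: boolean (H, W) masks, one pair per text instance."""
    losses = []
    for text_mask, kernel_mask in zip(text_masks, kernel_masks):
        center = similarity[:, kernel_mask].mean(dim=1, keepdim=True)  # G(K_i)
        pixels = similarity[:, text_mask]                              # F(p), p in T_i
        dist = (pixels - center).norm(dim=0)                           # ||F(p) - G(K_i)||_2
        dist = (dist - delta_agg).clamp(min=0).pow(2)                  # drop easy pixels
        losses.append(torch.log(dist + 1).mean())
    if not losses:  # image without any text instance
        return similarity.sum() * 0
    return torch.stack(losses).mean()
```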
(2) Discrimination Loss
- The discrimination loss aims to ensure sufficient distance between the kernels of different text instances.
- The discrimination loss is defined as:

  $$\mathcal{L}_{dis} = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \ln\big(\mathcal{D}(K_i, K_j) + 1\big)$$

  where $\mathcal{D}(K_i, K_j)$ is the distance between kernels $K_i$ and $K_j$, defined as:

  $$\mathcal{D}(K_i, K_j) = \max\big(\delta_{dis} - \lVert \mathcal{G}(K_i) - \mathcal{G}(K_j) \rVert_2,\ 0\big)^2$$

$\delta_{dis}$ is set to 3, ensuring that the distance between different kernels is no less than this threshold.
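The same idea in code, as a sketch under the same assumed tensor layout as above:

```python
import torch

def discrimination_loss(similarity, kernel_masks, delta_dis=3.0):
    """Penalize pairs of kernel centers G(K_i), G(K_j) closer than delta_dis."""
    if len(kernel_masks) < 2:  # fewer than two instances: nothing to separate
        return similarity.sum() * 0
    centers = torch.stack([similarity[:, m].mean(dim=1) for m in kernel_masks])  # (N, C)
    n = centers.shape[0]
    dist = torch.cdist(centers, centers)        # ||G(K_i) - G(K_j)||_2 for all pairs
    penalty = torch.log((delta_dis - dist).clamp(min=0).pow(2) + 1)
    off_diag = ~torch.eye(n, dtype=torch.bool)  # exclude the i == j terms
    return penalty[off_diag].sum() / (n * (n - 1))
```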
During inference, the predicted similarity vectors guide the merging of text pixels with the corresponding kernel. The steps are as follows:
- Find kernel connected components: Identify the connected components in the kernel segmentation results. Each connected component represents a kernel.
- Merge neighboring text pixels: For each kernel $K_i$, merge neighboring text pixels $p$ (using 4-connectivity) whenever the Euclidean distance between the similarity vectors of pixel $p$ and kernel $K_i$ is below a threshold $d$.
- Repeat merging: Repeat step 2 until no more neighboring text pixels can be merged.
This method leverages aggregation and discrimination losses to guide the correct merging of kernels and text pixels, ultimately reconstructing the complete text instance.
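Putting the three inference steps together, here is a breadth-first sketch in NumPy/OpenCV (the array layout and names are my assumptions; $d$ is the distance threshold from step 2, left as a parameter):

```python
from collections import deque

import cv2
import numpy as np

def aggregate_pixels(kernel_map, text_map, similarity, d):
    """kernel_map / text_map: binary (H, W) segmentation results.
    similarity: (C, H, W) predicted similarity vectors; d: distance threshold."""
    # Step 1: each connected component in the kernel map is one kernel.
    num, labels = cv2.connectedComponents(kernel_map.astype(np.uint8))
    centers = {k: similarity[:, labels == k].mean(axis=1) for k in range(1, num)}
    result = labels.copy()
    queue = deque((y, x) for y, x in zip(*np.nonzero(labels)))
    H, W = text_map.shape
    # Steps 2-3: grow each kernel over 4-connected text pixels until convergence.
    while queue:
        y, x = queue.popleft()
        k = result[y, x]
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W and text_map[ny, nx] and result[ny, nx] == 0:
                if np.linalg.norm(similarity[:, ny, nx] - centers[k]) < d:
                    result[ny, nx] = k
                    queue.append((ny, nx))
    return result  # (H, W) label map: one label per reconstructed text instance
```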
Loss Function
The overall loss function is:

$$\mathcal{L} = \mathcal{L}_{tex} + \alpha \mathcal{L}_{ker} + \beta (\mathcal{L}_{agg} + \mathcal{L}_{dis})$$

where:
- $\mathcal{L}_{tex}$ represents the text region loss.
- $\mathcal{L}_{ker}$ represents the kernel region loss.
- $\mathcal{L}_{agg}$ and $\mathcal{L}_{dis}$ are the aggregation and discrimination losses.
- The parameters $\alpha$ and $\beta$ balance the importance of these losses.
To handle the imbalance between text and non-text pixels, the paper uses Dice Loss for supervising the segmentation of text regions ($\mathcal{L}_{tex}$) and kernel regions ($\mathcal{L}_{ker}$).
The kernel region's ground truth is obtained by shrinking the original polygon ground truth, following the approach used in PSENet, which shrinks polygons by a ratio $r$.
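For reference, this shrinking can be done with the Vatti clipping algorithm as in PSENet, using PSENet's offset $d_i = \text{Area} \times (1 - r^2) / \text{Perimeter}$; a sketch with `pyclipper`, with the ratio $r$ left as a parameter:

```python
import numpy as np
import pyclipper

def shrink_polygon(polygon, r):
    """Shrink a text polygon (list of (x, y) points) to its kernel, PSENet-style."""
    poly = np.array(polygon)
    area = abs(pyclipper.Area(polygon))
    perimeter = np.linalg.norm(poly - np.roll(poly, 1, axis=0), axis=1).sum()
    offset = area * (1 - r ** 2) / (perimeter + 1e-6)
    pco = pyclipper.PyclipperOffset()
    pco.AddPath(polygon, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    shrunk = pco.Execute(-offset)  # negative delta shrinks the polygon inward
    return shrunk[0] if shrunk else polygon  # fall back if the polygon collapses
```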
When calculating $\mathcal{L}_{tex}$, Online Hard Example Mining (OHEM) is employed to ignore easy non-text pixels. When calculating $\mathcal{L}_{ker}$, $\mathcal{L}_{agg}$, and $\mathcal{L}_{dis}$, only text pixels in the ground truth are considered.
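The Dice Loss itself is simple; a sketch of how the masking described above might plug in (the OHEM pixel selection itself is omitted):

```python
import torch

def dice_loss(pred, gt, mask):
    """pred: raw logits; gt: binary target; mask: float mask of pixels to include
    (OHEM-selected pixels for L_tex, ground-truth text pixels for L_ker)."""
    pred = (torch.sigmoid(pred) * mask).reshape(-1)
    gt = (gt.float() * mask).reshape(-1)
    inter = (pred * gt).sum()
    union = pred.pow(2).sum() + gt.pow(2).sum() + 1e-6
    return 1 - (2 * inter + 1e-6) / union
```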
Training Datasets
SynthText
Used for pre-training the model, SynthText is a large-scale dataset containing approximately 800,000 synthetic images, blending natural scenes with random fonts, sizes, colors, and orientations to create highly realistic images.
CTW1500
A challenging dataset for long curved text detection, CTW1500 contains 1,000 training images and 500 test images. Unlike traditional datasets (e.g., ICDAR 2015, ICDAR 2017 MLT), the text instances in CTW1500 are labeled with polygons of 14 points, allowing the description of arbitrary curved text shapes.
Total-Text
Total-Text is a newly released dataset for curved text detection, containing horizontal, multi-directional, and curved text instances. The dataset includes 1,255 training images and 300 test images.
ICDAR 2015
A commonly used dataset for text detection, ICDAR 2015 contains 1,500 images, with 1,000 for training and the remainder for testing. Text areas are annotated using quadrilateral bounding boxes.
MSRA-TD500
This dataset features multi-language, arbitrarily oriented, long text lines. It contains 300 training images and 200 test images, with line-level annotations. Due to the small training set, images from the HUST-TR400 dataset were added as training data.
Training Strategy
PAN uses the lightweight ResNet18 as its backbone, pre-trained on ImageNet.
All networks are optimized using Stochastic Gradient Descent (SGD). Pre-training on SynthText runs for 50K iterations with a fixed learning rate of $1 \times 10^{-3}$. Other experiments train for 36K iterations, starting from an initial learning rate of $1 \times 10^{-3}$.