[20.10] Deformable DETR
Deformable Attention Mechanism
Deformable DETR: Deformable Transformers for End-to-End Object Detection
DETR left an excellent foundation for researchers to build upon.
In DETR, the authors used only the basic Transformer architecture for the simplest form of object detection, without employing any advanced tricks. They simply fed images into the model, and out came the object locations and categories.
What a brilliant paper!
This leaves ample room for researchers to improve the model, allowing for countless future publications based on this foundation.
Problem Definition
Compared to existing object detectors, DETR is remarkably slow. Running at roughly half the inference speed of the popular Faster R-CNN is not, by itself, a big issue.
But the training convergence time is 20 times slower!
What used to take just a day to train now takes 20 days. This is a significant problem—time is precious, and no one wants to wait for weeks to get a model to converge.
Clearly, this issue needs to be addressed.
Solution
The authors believe the issue lies in the attention mechanism of the Transformer. In the Transformer, every pixel attends to every other pixel, which wastes most of the computational power on irrelevant regions.
Thus, instead of using the original Transformer attention mechanism, they borrowed the idea of deformable convolutions and created a "deformable attention mechanism."
At the time of this research, Vision Transformers (ViT) hadn’t been published yet, so the operations were based on individual pixels rather than image patches.
Deformable Attention
In the feature map, for each query element, the authors select a reference point and compute attention only around a few important sampling points near that reference. This differs from the traditional Transformer, which calculates attention over all points in the spatial domain.
Suppose the input feature map is $x \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels, and $H$ and $W$ are the height and width of the feature map, respectively.

Each query element $q$ consists of a content feature $z_q$ and a 2D reference point $p_q$. The deformable attention feature is computed as:

$$
\text{DeformAttn}(z_q, p_q, x) = \sum_{m=1}^{M} W_m \left[ \sum_{k=1}^{K} A_{mqk} \cdot W'_m\, x(p_q + \Delta p_{mqk}) \right]
$$
Where:
- $M$ is the number of attention heads.
- $K$ is the number of sampling points selected for each query, chosen from a small region around the reference point.
- $A_{mqk}$ is the attention weight for the $k$-th sampling point in the $m$-th attention head, constrained to $[0, 1]$ and normalized such that $\sum_{k=1}^{K} A_{mqk} = 1$.
- $\Delta p_{mqk}$ is the offset of the $k$-th sampling point in the $m$-th attention head, which can take arbitrary real values.
- $W_m$ and $W'_m$ are learnable weight matrices responsible for linear transformations of the input features.
- $x(p_q + \Delta p_{mqk})$ represents the feature value at location $p_q + \Delta p_{mqk}$. Since this position is generally fractional (i.e., not a discrete grid point), bilinear interpolation is used to compute it.
The query feature $z_q$ undergoes a linear projection that outputs a tensor of $3MK$ channels:
- The first $2MK$ channels encode the offsets $\Delta p_{mqk}$ for each sampling point.
- The remaining $MK$ channels are passed through a softmax function to compute the attention weights $A_{mqk}$.
This design ensures that both the offsets and attention weights are learned from the query feature rather than based on fixed rules.
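To make the mechanism concrete, below is a minimal single-scale sketch in PyTorch, assuming a feature map of shape $(C, H, W)$ and reference points given in pixel coordinates. The class name `DeformableAttention` and its interface are hypothetical; the official implementation uses a custom CUDA kernel and batches queries across images, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention(nn.Module):
    """Single-scale deformable attention sketch following the notation above."""

    def __init__(self, c_model=256, n_heads=8, n_points=4):
        super().__init__()
        assert c_model % n_heads == 0
        self.M, self.K = n_heads, n_points
        self.c_head = c_model // n_heads
        # Linear projections from the query feature: 2*M*K offset channels
        # and M*K attention-weight channels (3*M*K in total).
        self.sampling_offsets = nn.Linear(c_model, n_heads * n_points * 2)
        self.attention_weights = nn.Linear(c_model, n_heads * n_points)
        self.value_proj = nn.Linear(c_model, c_model)   # W'_m for all heads
        self.output_proj = nn.Linear(c_model, c_model)  # W_m for all heads

    def forward(self, z_q, p_q, x):
        """
        z_q: (N_q, C)  query content features
        p_q: (N_q, 2)  reference points in pixel coordinates (x, y)
        x:   (C, H, W) input feature map
        """
        N_q, C = z_q.shape
        _, H, W = x.shape

        # Offsets and attention weights are predicted from the query feature.
        offsets = self.sampling_offsets(z_q).view(N_q, self.M, self.K, 2)
        weights = self.attention_weights(z_q).view(N_q, self.M, self.K)
        weights = weights.softmax(dim=-1)  # normalize over the K sampling points

        # Sampling locations p_q + Δp, normalized to [-1, 1] for grid_sample.
        loc = p_q[:, None, None, :] + offsets               # (N_q, M, K, 2)
        norm = loc.new_tensor([W - 1, H - 1])
        grid = 2.0 * loc / norm - 1.0

        # Project the values and split the channels into heads.
        v = self.value_proj(x.flatten(1).t())               # (H*W, C)
        v = v.t().reshape(self.M, self.c_head, H, W)         # (M, c_head, H, W)

        # Bilinear sampling of each head at its K fractional locations.
        grid = grid.permute(1, 0, 2, 3)                      # (M, N_q, K, 2)
        sampled = F.grid_sample(v, grid, mode='bilinear',
                                align_corners=True)          # (M, c_head, N_q, K)

        # Weighted sum over the sampling points, then merge the heads.
        w = weights.permute(1, 0, 2)[:, None]                # (M, 1, N_q, K)
        out = (sampled * w).sum(-1)                          # (M, c_head, N_q)
        out = out.permute(2, 0, 1).reshape(N_q, C)
        return self.output_proj(out)
```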
Multi-Scale Computation
Modern object detection frameworks often use multi-scale feature maps to detect objects at different scales. The deformable attention module naturally extends to a multi-scale version, allowing sampling and operations on multiple feature map layers simultaneously.
Assume the input multi-scale feature maps are denoted as $\{x^l\}_{l=1}^{L}$, where each feature map $x^l \in \mathbb{R}^{C \times H_l \times W_l}$. The reference point for query $q$ is represented using normalized coordinates $\hat{p}_q \in [0, 1]^2$. The multi-scale deformable attention module is computed as:

$$
\text{MSDeformAttn}(z_q, \hat{p}_q, \{x^l\}_{l=1}^{L}) = \sum_{m=1}^{M} W_m \left[ \sum_{l=1}^{L} \sum_{k=1}^{K} A_{mlqk} \cdot W'_m\, x^l\big(\phi_l(\hat{p}_q) + \Delta p_{mlqk}\big) \right]
$$
Where:
- $L$ is the number of feature map layers.
- $\phi_l(\cdot)$ is a scaling function that converts normalized coordinates to the actual coordinates on the $l$-th feature map layer.
- The other notations are similar to the single-scale case.
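The multi-scale version can be sketched as an extension of the same sampling core: offsets and attention weights gain an extra level dimension $L$, the attention weights are normalized over all $L \times K$ sampling points of each head, and $\phi_l$ rescales the normalized reference point to each level's resolution. The helper below, `ms_deform_attn_core`, is a hypothetical name; it assumes the value projection $W'_m$ has already been applied and split into heads for each level.

```python
import torch
import torch.nn.functional as F

def ms_deform_attn_core(value_list, p_hat, offsets, weights):
    """
    value_list: list of L tensors, each (M, c_head, H_l, W_l)  per-head values
    p_hat:      (N_q, 2) normalized reference points in [0, 1]
    offsets:    (N_q, M, L, K, 2) sampling offsets (in pixels of each level)
    weights:    (N_q, M, L, K) attention weights, normalized over L*K
    returns:    (N_q, M * c_head)
    """
    N_q = p_hat.shape[0]
    outputs = []
    for l, v in enumerate(value_list):
        _, c_head, H_l, W_l = v.shape
        # phi_l: map normalized coordinates to this level's pixel coordinates.
        scale = p_hat.new_tensor([W_l - 1, H_l - 1])
        loc = p_hat[:, None, None, :] * scale + offsets[:, :, l]  # (N_q, M, K, 2)
        grid = 2.0 * loc / scale - 1.0
        grid = grid.permute(1, 0, 2, 3)                           # (M, N_q, K, 2)
        sampled = F.grid_sample(v, grid, mode='bilinear',
                                align_corners=True)               # (M, c_head, N_q, K)
        w_l = weights[:, :, l].permute(1, 0, 2)[:, None]          # (M, 1, N_q, K)
        outputs.append((sampled * w_l).sum(-1))                   # (M, c_head, N_q)
    out = torch.stack(outputs).sum(0)                             # sum over levels
    return out.permute(2, 0, 1).reshape(N_q, -1)
```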
Model Architecture
As shown in the diagram above, the deformable attention module replaces the Transformer attention modules that process feature maps in the original DETR architecture.
This forms the Deformable DETR.
If you're not interested in theoretical details, you can check out the official implementation on Github.
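For orientation, here is a very high-level structural sketch of how the pieces compose. The `encoder` and `decoder` arguments are assumed to be stacks built around deformable attention layers like the one sketched earlier; all class and argument names are hypothetical and do not follow the official repository's API.

```python
import torch.nn as nn

class DeformableDETRSketch(nn.Module):
    """Conceptual composition only; not the official model definition."""

    def __init__(self, backbone, encoder, decoder,
                 num_queries=300, c_model=256, num_classes=91):
        super().__init__()
        self.backbone = backbone    # e.g. ResNet-50 returning multi-scale features
        self.encoder = encoder      # deformable self-attention + FFN layers
        self.decoder = decoder      # standard self-attn + deformable cross-attn layers
        # Learned object queries (content + positional parts concatenated).
        self.query_embed = nn.Embedding(num_queries, c_model * 2)
        self.class_head = nn.Linear(c_model, num_classes)
        self.box_head = nn.Linear(c_model, 4)  # normalized (cx, cy, w, h)

    def forward(self, images):
        feats = self.backbone(images)          # list of multi-scale feature maps
        memory = self.encoder(feats)           # deformable attention across levels
        hs = self.decoder(self.query_embed.weight, memory)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```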
Training Strategy
The authors conducted experiments on the COCO 2017 dataset.
The backbone was ResNet-50, pretrained on ImageNet. Multi-scale feature maps were extracted directly from the backbone, without using an FPN as the neck.
Deformable attention settings:
- Number of attention heads: M = 8
- Number of sampling points per query: K = 4
Parameters in the deformable transformer encoder are shared across different feature layers.
Most hyperparameters and training strategies followed the setup of DETR, with a few exceptions:
- Focal Loss was used for bounding box classification with a weight of 2.
- The number of object queries was increased from 100 to 300.
The model was trained for 50 epochs, with the learning rate reduced to 0.1 of its initial value at epoch 40. The Adam optimizer was used with a base learning rate of $2 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay of $10^{-4}$.
The learning rate for the linear projections that predict reference points and sampling offsets was further scaled by a factor of 0.1.
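This per-module learning-rate scaling is typically implemented with optimizer parameter groups. Below is a minimal sketch under the assumption that the relevant projections can be identified by substrings in their parameter names; the substrings `sampling_offsets` and `reference_points` are assumptions, not guaranteed to match any particular codebase.

```python
import torch

def build_optimizer(model, base_lr=2e-4, weight_decay=1e-4):
    lr_scaled, regular = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Projections predicting reference points and sampling offsets get 0.1x lr.
        if "sampling_offsets" in name or "reference_points" in name:
            lr_scaled.append(param)
        else:
            regular.append(param)
    param_groups = [
        {"params": regular, "lr": base_lr},
        {"params": lr_scaled, "lr": base_lr * 0.1},
    ]
    return torch.optim.Adam(param_groups, lr=base_lr,
                            betas=(0.9, 0.999), weight_decay=weight_decay)

# Learning rate drops to 0.1x its initial value at epoch 40 (of 50):
# scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40], gamma=0.1)
```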
Discussion
Comparison with DETR
From the table above, compared to Faster R-CNN + FPN, DETR requires more training epochs to converge and performs poorly on small object detection. Deformable DETR, however, achieves better performance with a similar number of training epochs to Faster R-CNN, especially for small object detection.
The detailed convergence curve is shown below:
Through iterative bounding box refinement and a two-stage mechanism, this method further enhances detection accuracy.
Deformable DETR has FLOPs comparable to Faster R-CNN + FPN and DETR-DC5, but it runs 1.6 times faster than DETR-DC5 and is only 25% slower than Faster R-CNN + FPN.
The slower speed of DETR-DC5 is primarily due to the large memory access requirements in the Transformer attention mechanism. Deformable Attention alleviates this issue, although the unordered memory access still makes it slightly slower than traditional convolutions.
Ablation Studies
The table above shows the ablation study of different design choices in the deformable attention module.
Replacing single-scale input with multi-scale input significantly improves detection accuracy, increasing the average precision (AP) by 1.7%, especially for small objects (AP$_S$), which improves by 2.9%. Increasing the number of sampling points $K$ further improves AP by 0.9%.
Using multi-scale deformable attention enables information exchange across different feature layers, providing an additional 1.5% AP boost. Since cross-layer information exchange is already incorporated, adding FPNs does not provide further improvement. Without multi-scale attention and with $K = 1$, the deformable attention module degrades into deformable convolution, leading to a significant drop in accuracy.
SoTA Comparison
In the table above, Deformable DETR uses both iterative bounding box refinement and the two-stage mechanism.
With ResNet-101 and ResNeXt-101, the method achieves 48.7 AP and 49.0 AP, respectively, without any additional optimization techniques. When ResNeXt-101 with DCN is used, accuracy increases to 50.1 AP.
With test-time augmentations (TTA), the method achieves 52.3 AP.
TTA (Test-Time Augmentations) is a technique where an image is augmented multiple times during testing, and the results of these predictions are averaged to increase model robustness and accuracy.
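The paper does not spell out its exact TTA recipe, so the snippet below is only a generic horizontal-flip example for a detector: predictions from the original and flipped views are mapped back to a common coordinate frame and merged with NMS. The model interface (a dict with `boxes` and `scores`) and the class-agnostic merge are assumptions made for brevity.

```python
import torch
import torchvision

@torch.no_grad()
def predict_with_flip_tta(model, image, score_thresh=0.05, iou_thresh=0.5):
    """image: (3, H, W) tensor; model returns {'boxes': (N, 4) in x1y1x2y2, 'scores': (N,)}."""
    W = image.shape[-1]

    out1 = model(image)                          # original view
    out2 = model(torch.flip(image, dims=[-1]))   # horizontally flipped view

    # Map boxes from the flipped view back to original coordinates.
    boxes2 = out2["boxes"].clone()
    boxes2[:, [0, 2]] = W - out2["boxes"][:, [2, 0]]

    boxes = torch.cat([out1["boxes"], boxes2])
    scores = torch.cat([out1["scores"], out2["scores"]])

    # Drop low-confidence boxes, then merge overlapping predictions with NMS.
    keep = scores > score_thresh
    boxes, scores = boxes[keep], scores[keep]
    keep = torchvision.ops.nms(boxes, scores, iou_thresh)
    return boxes[keep], scores[keep]
```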
Conclusion
Deformable DETR significantly reduces the training time compared to traditional DETR. It also performs much better on small object detection.
Though Deformable DETR introduces deformable attention to improve inference speed, it still lags slightly behind traditional convolutional networks like Faster R-CNN due to the unordered memory access. Therefore, speed is still a concern for real-time applications.
Deformable DETR opens up new directions for designing more efficient and practical end-to-end object detectors, with broad research and application potential.