Fast DETR
Add Some Gaussian!
Fast Convergence of DETR with Spatially Modulated Co-Attention.
Continuing our discussion on DETR, we already know how DETR works, so let’s jump straight to the problem at hand.
Defining the Problem
Similar to Deformable DETR, the authors of this paper also observed a major issue:
- DETR's convergence speed is too slow.
The root of the problem lies in the fact that DETR's cross-attention mechanism does not consider predicted bounding boxes. This results in the need for multiple iterations to generate appropriate attention maps for each object query.
Thus, the solution to faster convergence might lie in improving the cross-attention mechanism.
Deformable DETR changed the fundamental way attention operates; this paper instead injects prior knowledge into the existing cross-attention mechanism.
Solving the Problem
Spatially Modulated Co-Attention
The core idea behind the Spatially Modulated Co-Attention (SMCA) mechanism is:
Combining learnable cross-attention maps with hand-crafted query spatial priors.
SMCA dynamically predicts the initial center and scale of the box corresponding to each object query, generating 2D spatial Gaussian-weighted maps. These weight maps are element-wise multiplied with the co-attended feature maps of object queries and image features, allowing for more efficient aggregation of query-relevant information from the visual feature map.
In simple terms, DETR lacks prior knowledge, so its convergence is slow.
To speed it up, we provide it with some prior knowledge.
Initial Predictions
Each object query dynamically predicts the center and scale of the object it is responsible for:

$$\hat{c}_h,\ \hat{c}_w = \text{sigmoid}(\text{MLP}(O_q)), \qquad s_h,\ s_w = \text{FC}(O_q)$$

The object query $O_q$ is projected through a two-layer MLP and a sigmoid to predict a normalized center $\hat{c}_h, \hat{c}_w$ within the range $[0, 1]$, which is then un-normalized to obtain the real center coordinates $c_h, c_w$ in the image.

Additionally, each object query predicts scales $s_h, s_w$ along the height and width axes. Together, the predicted center and scales are used to generate a 2D Gaussian weight map that reweights the cross-attention map, emphasizing features near the predicted object location.
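The prediction step can be sketched in NumPy as below. The dimensions, random weight initialization, and the exponential used to keep the scales positive are illustrative assumptions, not the paper's exact prediction heads:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: 256-d object query, hidden size 256.
d_model = 256
W1 = rng.normal(scale=0.02, size=(d_model, d_model))  # two-layer MLP weights
W2 = rng.normal(scale=0.02, size=(d_model, 2))        # -> (c_h, c_w)
W_s = rng.normal(scale=0.02, size=(d_model, 2))       # FC head -> (s_h, s_w)

object_query = rng.normal(size=(d_model,))

# Normalized center in (0, 1) via two-layer MLP + sigmoid.
hidden = np.maximum(W1.T @ object_query, 0.0)  # ReLU
c_h, c_w = sigmoid(W2.T @ hidden)

# Scales from a separate linear head (exp keeps them positive here).
s_h, s_w = np.exp(W_s.T @ object_query)

# Un-normalize the center to feature-map coordinates, e.g. a 25x34 map.
H, W = 25, 34
center = (c_h * H, c_w * W)
```

Note that the center goes through a sigmoid (it must land inside the image), while the scales are unconstrained outputs of a linear layer.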
Gaussian Weight Map
Once the object center and scale are predicted, SMCA generates a 2D Gaussian-like weight map $G$:

$$G(i, j) = \exp\!\left(-\frac{(i - c_w)^2}{\beta s_w^2} - \frac{(j - c_h)^2}{\beta s_h^2}\right)$$

where $(i, j)$ are spatial indices on the feature map, and $\beta$ is a hyperparameter that adjusts the bandwidth of the Gaussian distribution.

$G$ assigns higher weights to locations near the predicted center and lower weights to locations farther away.
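A minimal sketch of this weight map, with hypothetical center and scale values:

```python
import numpy as np

def gaussian_weight_map(H, W, c_h, c_w, s_h, s_w, beta=1.0):
    """2D Gaussian-like spatial prior over an H x W feature map.

    (c_h, c_w) is the predicted center, (s_h, s_w) the predicted scales,
    and beta tunes the bandwidth of the distribution.
    """
    j, i = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    return np.exp(-((i - c_w) ** 2) / (beta * s_w ** 2)
                  - ((j - c_h) ** 2) / (beta * s_h ** 2))

G = gaussian_weight_map(16, 16, c_h=8.0, c_w=4.0, s_h=3.0, s_w=2.0)
# The peak (weight 1.0) sits at the predicted center (8, 4),
# and weights decay with distance from it.
```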
Using the spatial prior $G$, SMCA modulates the cross-attention map $C$:

$$C = \text{softmax}\!\left(\frac{K^\top Q}{\sqrt{d}} + \log G\right) V$$

SMCA adds the logarithm of the spatial weight map to the cross-attention logits before the softmax over all spatial locations. This increases the focus on the predicted bounding-box region and shrinks the search space of cross-attention, speeding up the model's convergence.

Compared to the original attention mechanism, the only addition is the $\log G$ term.
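The modulation can be sketched as follows; shapes are illustrative, and the small epsilon guarding `log(0)` is an implementation assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def smca_cross_attention(Q, K, V, G, eps=1e-8):
    """Cross-attention whose logits are biased by log G.

    Q: (n_query, d); K, V: (n_pix, d); G: (n_query, n_pix) spatial prior.
    Adding log G before the softmax multiplies the attention weights by G
    (up to renormalization), focusing them near each predicted box.
    """
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d) + np.log(G + eps)
    return softmax(logits, axis=-1) @ V
```

Because the softmax exponentiates its input, adding $\log G$ to the logits is equivalent to multiplying the unnormalized attention weights by $G$, which is exactly the element-wise reweighting described above.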
Multi-Head SMCA
In the multi-head version of SMCA, different cross-attention heads are modulated based on their respective spatial weight maps.
Each attention head starts from the shared center $[c_w, c_h]$ and predicts a head-specific offset $[\Delta c_{w,i}, \Delta c_{h,i}]$ as well as head-specific scales $s_{w,i}, s_{h,i}$.

A Gaussian spatial weight map $G_i$ is then generated for each head, centered at $[c_w + \Delta c_{w,i},\ c_h + \Delta c_{h,i}]$ with the predicted scales.
The cross-attention feature map of head $i$ is expressed as:

$$C_i = \text{softmax}\!\left(\frac{K_i^\top Q_i}{\sqrt{d}} + \log G_i\right) V_i$$
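The per-head weight maps can be sketched as below. The offsets and scales are sampled randomly purely for illustration; in SMCA they come from linear heads applied to the object query:

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, H, W = 8, 16, 16

# Shared center predicted from the object query (hypothetical values).
c_h, c_w = 8.0, 8.0

# Head-specific offsets (dc_h, dc_w) and positive scales (s_h, s_w).
offsets = rng.normal(scale=2.0, size=(n_heads, 2))
scales = np.exp(rng.normal(size=(n_heads, 2)))

j, i = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
beta = 1.0
G_heads = np.stack([
    np.exp(-((i - (c_w + dc_w)) ** 2) / (beta * s_w ** 2)
           - ((j - (c_h + dc_h)) ** 2) / (beta * s_h ** 2))
    for (dc_h, dc_w), (s_h, s_w) in zip(offsets, scales)
])
# G_heads: (n_heads, H, W); each head attends around its own shifted center,
# so different heads can cover different parts of the same object.
```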
Multi-Scale SMCA
To further improve object detection performance, SMCA integrates multi-scale features. The CNN backbone extracts multi-scale visual features with downsampling rates of 16, 32, and 64, respectively. These features are taken directly from the backbone without using a Feature Pyramid Network (FPN).
Each object query dynamically selects the most relevant scale by generating attention weights for the different scales:

$$\alpha_{16},\ \alpha_{32},\ \alpha_{64} = \text{Softmax}(\text{FC}(O_q))$$
Cross-attention is computed at each scale and aggregated with these weights:

$$C = \sum_{i \in \{16, 32, 64\}} \text{softmax}\!\left(\frac{K_i^\top Q}{\sqrt{d}} + \log G\right) V_i \, \alpha_i$$
This mechanism allows each object query to dynamically select the most relevant scale while suppressing irrelevant scale features.
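The scale-selection step can be sketched as follows; the query dimension, FC weights, and the per-scale attention outputs are random stand-ins for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 32

# Hypothetical FC head mapping an object query to one logit per scale.
W_scale = rng.normal(scale=0.1, size=(d, 3))
object_query = rng.normal(size=(d,))
alpha = softmax(W_scale.T @ object_query)  # weights for strides 16/32/64

# Per-scale cross-attention outputs (random stand-ins for the real ones).
C16, C32, C64 = (rng.normal(size=(d,)) for _ in range(3))

# Final feature: scale-weighted sum of per-scale cross-attention results.
C = alpha[0] * C16 + alpha[1] * C32 + alpha[2] * C64
```

Since the weights come from a softmax, they sum to one: a query looking at a small object can put most of its mass on the stride-16 features, while a query for a large object can favor stride 64.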
After cross-attention computation, the updated object query features