Discussion
Our experiments produced a model that performs well on the benchmark dataset.
Here, we discuss some insights and lessons from the training process.
-
While our model can achieve scores close to SoTA, real-world scenarios are much more complex than this dataset. Therefore, we shouldn't overly focus on these scores. Our goal is simply to demonstrate the effectiveness of our model.
-
In our experiments, we found that the current design of our model architecture does not perform well in zero-shot scenarios, meaning the model requires fine-tuning to achieve optimal results in new environments. In the future, we should explore more robust model architectures with better generalization capabilities.
-
As mentioned in the Model Design section, we cannot directly address the amplification error that affects the "Point Regression Model". As a result, the "Heatmap Regression Model" is far more stable than the "Point Regression Model".
-
We defaulted to using FastViT_SA24 as the backbone for the heatmap model due to its effectiveness and computational efficiency.
-
Through experimentation, we found that a 3-layer BiFPN outperforms a 6-layer FPN, so we recommend using BiFPN as the configuration for the Neck section. However, our implementation of BiFPN involves einsum operations, which may pose challenges for other inference frameworks. Therefore, if you encounter conversion errors when using BiFPN, consider switching to the FPN model.
-
Although the "Heatmap Regression Model" demonstrates stability, it requires supervision on high-resolution feature maps, resulting in significantly higher computational costs compared to the "Point Regression Model".
-
However, we cannot overlook the advantages of the "Point Regression Model", including its ability to predict corners beyond the document boundary, lower computational requirements, and a faster and simpler post-processing pipeline. Therefore, we will continue to explore and optimize the "Point Regression Model" to improve its performance.
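
To make the post-processing difference between the two models concrete, here is a minimal sketch in plain Python. It is not the project's actual pipeline: the function names `decode_heatmap` and `decode_point` are hypothetical, and it assumes the heatmap model emits one score grid per corner while the point model emits coordinates normalized by the image size. Under those assumptions, heatmap decoding has to scan the grid for its peak and is confined to the grid, whereas point decoding is a single multiplication and can yield coordinates outside the image.

```python
from typing import List, Tuple


def decode_heatmap(heatmap: List[List[float]],
                   image_size: Tuple[int, int]) -> Tuple[float, float]:
    """Hypothetical decoder: map the peak of one corner heatmap to image pixels.

    image_size is (width, height). The result is always inside the image,
    and its precision is limited by the heatmap resolution.
    """
    rows, cols = len(heatmap), len(heatmap[0])
    best_val = float("-inf")
    best_r, best_c = 0, 0
    # Scan every cell to find the peak (argmax over the grid).
    for r, row in enumerate(heatmap):
        for c, val in enumerate(row):
            if val > best_val:
                best_val, best_r, best_c = val, r, c
    img_w, img_h = image_size
    # Use the center of the winning cell, scaled up to image coordinates.
    x = (best_c + 0.5) * img_w / cols
    y = (best_r + 0.5) * img_h / rows
    return x, y


def decode_point(norm_xy: Tuple[float, float],
                 image_size: Tuple[int, int]) -> Tuple[float, float]:
    """Hypothetical decoder: scale normalized (x, y) to image pixels.

    Values outside [0, 1] are allowed and map to points outside the image.
    """
    img_w, img_h = image_size
    return norm_xy[0] * img_w, norm_xy[1] * img_h


# A 4x4 heatmap with its peak at row 1, column 2.
heatmap = [
    [0.0, 0.1, 0.2, 0.0],
    [0.1, 0.3, 0.9, 0.2],
    [0.0, 0.2, 0.3, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
heatmap_corner = decode_heatmap(heatmap, (400, 200))
point_corner = decode_point((0.625, 0.375), (400, 200))
outside_corner = decode_point((1.25, -0.25), (400, 200))
```

In this sketch, a finer heatmap grid narrows the cell size and so improves localization, which is consistent with the need for high-resolution feature maps noted above. The point decoder multiplies the prediction by the image size, so any error in the normalized output is scaled by the same factor.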