davidhajdu committed: Update README.md
## Model Details

![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_convergence_plot.png)

> We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer. Using box coordinates not only helps using explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using the box width and height information. Such a design makes it clear that queries in DETR can be implemented as performing soft ROI pooling layer-by-layer in a cascade manner. As a result, it leads to the best performance on MS-COCO benchmark among the DETR-like detection models under the same setting, e.g., AP 45.7% using ResNet50-DC5 as backbone trained in 50 epochs. We also conducted extensive experiments to confirm our analysis and verify the effectiveness of our methods.
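The core idea of the abstract — using box coordinates directly as queries — can be sketched in NumPy: each of the four anchor coordinates `(x, y, w, h)` is mapped to a DETR-style sinusoidal embedding, and the four embeddings are concatenated into one positional query. This is a simplified illustration under assumed dimensions, not the model's exact implementation; in particular, the width/height modulation of the attention map described in the paper is omitted here.

```python
import numpy as np

def sine_embed(coord, dim=128, temperature=10000.0):
    """DETR-style sinusoidal embedding of one scalar coordinate in [0, 1]."""
    scale = 2 * np.pi
    freqs = temperature ** (2 * np.arange(dim // 2) / dim)  # (dim/2,)
    angles = coord * scale / freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # (dim,)

def anchor_to_positional_query(box, dim=128):
    """Embed each of (x, y, w, h) and concatenate into one positional query."""
    return np.concatenate([sine_embed(c, dim) for c in box])  # (4 * dim,)

q = anchor_to_positional_query((0.5, 0.5, 0.2, 0.3))
print(q.shape)  # (512,)
```

Because the query is a deterministic function of the box coordinates, updating the box between decoder layers automatically updates the query — the property the paper exploits.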

## Evaluation

![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_results.png)

### Model Architecture and Objective

![image/png](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_model_arch.png)

Overview of DAB-DETR. We extract image spatial features using a CNN backbone followed by Transformer encoders that refine the CNN features. Dual queries, comprising positional queries (anchor boxes) and content queries (decoder embeddings), are then fed into the decoder to probe objects that correspond to the anchors and share patterns with the content queries. The dual queries are updated layer by layer so that they gradually converge on the ground-truth objects.
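The layer-by-layer anchor update can be illustrated with a minimal sketch: each decoder layer predicts an offset that is added to the anchor in logit (inverse-sigmoid) space and then squashed back to `[0, 1]`, the common box-refinement scheme in DETR-like models. The "prediction head" below is hypothetical — it simply pulls the anchor part-way toward a known target so the convergence is visible; in the real model the offsets come from the decoder's learned box-refinement head.

```python
import numpy as np

def inverse_sigmoid(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def refine_anchor(anchor, delta):
    """One decoder layer's update: add the predicted offset in logit space,
    then map back to normalized [0, 1] box coordinates."""
    return sigmoid(inverse_sigmoid(anchor) + delta)

anchor = np.array([0.50, 0.50, 0.30, 0.30])  # initial (x, y, w, h)
target = np.array([0.62, 0.40, 0.20, 0.25])  # ground-truth box (for the demo)
for _ in range(6):  # six decoder layers
    # hypothetical head: move half the remaining distance toward the target
    delta = 0.5 * (inverse_sigmoid(target) - inverse_sigmoid(anchor))
    anchor = refine_anchor(anchor, delta)
print(np.round(anchor, 3))
```

After six such updates the anchor sits within about 0.002 of the target in every coordinate, mirroring how the dual queries "get close to the target ground-truth objects gradually."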