Transformer-Based Object Detection in Natural Images. State-of-the-Art Architectures and Recent Algorithms

Elbek Asqarov

Department of Digital Technologies and Mathematics, Kokand University, Kokand, Uzbekistan

Keywords:

object detection; vision transformer; DETR; self-attention; set prediction; real-time detection; MS-COCO; deep learning.

Abstract

Object detection in natural images is a fundamental computer-vision task that underpins applications ranging from autonomous driving to industrial inspection. Since the introduction of the DEtection TRansformer (DETR), the field has shifted from anchor-based convolutional pipelines toward end-to-end, attention-driven set-prediction frameworks that remove hand-crafted components such as anchor generation and non-maximum suppression. This paper presents a structured review of transformer-based object detectors, tracing their evolution from the original DETR through deformable attention (Deformable DETR), query-design refinements (DAB-DETR, DN-DETR), contrastive denoising (DINO), collaborative hybrid assignment (Co-DETR), and the most recent real-time variants (RT-DETR, RT-DETRv2, RF-DETR). We organise these methods into a taxonomy according to their core innovations in attention mechanisms, query formulation, and label-assignment strategies, and we compare their reported accuracy and inference speed on the MS-COCO benchmark. The analysis shows that contemporary detection transformers now match or surpass convolutional detectors, including the YOLO family, in both accuracy and real-time efficiency, with leading models exceeding 60 mean Average Precision (mAP) on standard benchmarks. We further discuss persistent challenges (small-object localisation, training convergence, and computational cost) and outline promising research directions, including open-vocabulary detection and lightweight deployment.
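
To make the set-prediction idea concrete, the following is a minimal Python sketch of the one-to-one matching step that lets DETR-style detectors dispense with non-maximum suppression. It is an illustration under stated assumptions, not the implementation of any model reviewed here: the names `match_cost` and `bipartite_match`, the toy boxes, and the weight `box_weight=5.0` are hypothetical, the cost omits the generalised-IoU term used in practice, and the brute-force search stands in for the Hungarian algorithm.

```python
from itertools import permutations

def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two (cx, cy, w, h) boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))

def match_cost(pred, target, box_weight=5.0):
    """Cost of assigning one prediction to one ground-truth object.

    Lower is better: high probability for the true class and a nearby box.
    box_weight is an illustrative value, not a tuned setting.
    """
    class_cost = -pred["probs"][target["label"]]
    box_cost = box_weight * l1_distance(pred["box"], target["box"])
    return class_cost + box_cost

def bipartite_match(preds, targets):
    """Return the minimum-cost one-to-one assignment as (pred_idx, target_idx) pairs.

    Brute force over all assignments, which is only viable for toy inputs.
    Assumes there are at least as many predictions as targets.
    """
    best_pairs, best_cost = [], float("inf")
    for pred_indices in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[p], targets[t])
                   for t, p in enumerate(pred_indices))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_indices)]
    return sorted(best_pairs)

# Three predictions (object queries) and two ground-truth objects.
# Predictions 0 and 2 both cover the same object; only one may be matched.
preds = [
    {"box": (0.50, 0.50, 0.20, 0.20), "probs": [0.10, 0.80, 0.10]},
    {"box": (0.10, 0.10, 0.05, 0.05), "probs": [0.90, 0.05, 0.05]},
    {"box": (0.52, 0.48, 0.22, 0.18), "probs": [0.20, 0.60, 0.20]},
]
targets = [
    {"box": (0.12, 0.10, 0.05, 0.06), "label": 0},
    {"box": (0.50, 0.50, 0.20, 0.20), "label": 1},
]

matches = bipartite_match(preds, targets)
matched_preds = {p for p, _ in matches}
# Unmatched predictions are supervised as "no object" during training.
unmatched = [i for i in range(len(preds)) if i not in matched_preds]
print(matches, unmatched)
```

In this toy case the near-duplicate prediction (index 2) is left unmatched and would be trained toward the "no object" class, which is how one-to-one assignment discourages duplicate detections without a separate suppression step.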

References

A. Vaswani et al., “Attention is all you need,” in Proc. NeurIPS, 2017, pp. 5998–6008.

A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021.

N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. ECCV, 2020, pp. 213–229.

X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable transformers for end-to-end object detection,” in Proc. ICLR, 2021.

S. Liu et al., “DAB-DETR: Dynamic anchor boxes are better queries for DETR,” in Proc. ICLR, 2022.

F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang, “DN-DETR: Accelerate DETR training by introducing query denoising,” in Proc. CVPR, 2022, pp. 13619–13627.

H. Zhang et al., “DINO: DETR with improved denoising anchor boxes for end-to-end object detection,” in Proc. ICLR, 2023.

Z. Zong, G. Song, and Y. Liu, “DETRs with collaborative hybrid assignments training,” in Proc. IEEE/CVF ICCV, 2023, pp. 6748–6758.

W. Lv et al., “DETRs beat YOLOs on real-time object detection,” in Proc. IEEE/CVF CVPR, 2024.

W. Lv, Y. Zhao, Q. Chang, et al., “RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer,” arXiv:2407.17140, 2024.

Roboflow, “RF-DETR: A real-time, transformer-based object detection model,” Technical Report, 2025.

Z. Liu et al., “Swin Transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF ICCV, 2021, pp. 10012–10022.

Y. Fang et al., “You only look at one sequence: Rethinking transformer in vision through object detection,” in Proc. NeurIPS, 2021.

M. Minderer et al., “Simple open-vocabulary object detection with vision transformers,” in Proc. ECCV, 2022, pp. 728–755.

X. Hou, M. Liu, S. Zhang, P. Wei, and B. Chen, “Salience DETR: Enhancing detection transformer with hierarchical salience filtering refinement,” in Proc. IEEE/CVF CVPR, 2024.

S. Liu, T. Ren, J. Chen, et al., “Detection transformer with stable matching,” in Proc. IEEE/CVF ICCV, 2023.

T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014, pp. 740–755.

T. Cao et al., “Object detection based on CNN and vision-transformer: A survey,” IET Computer Vision, 2025.

A. M. Rekavandi et al., “Transformers in small object detection: A benchmark and survey of state-of-the-art,” ACM Computing Surveys, 2025.

T. Khan et al., “Object detection with transformers: A review,” Sensors (MDPI), 2025.
