Submitted: 13 June 2023
Posted: 15 June 2023
Abstract
Keywords:
1. Introduction
2. Related Work
3. Materials and Methods

3.1. TE-FPN

3.2. Instance-Based Advanced Guidance Module
4. Results
4.1. COCO and Evaluation Metrics
4.2. Implementation Details
4.3. Main Results
4.4. Ablation Study
4.4.1. TIG-DETR
4.4.2. TE-FPN
4.5. Cityscapes
5. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Girshick R. Fast r-cnn[C]//Proceedings of the IEEE international conference on computer vision. 2015: 1440-1448.
- Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 779-788.
- Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
- Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2980-2988.
- Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
- Liu Z, Lin Y, Cao Y, et al. Swin transformer: Hierarchical vision transformer using shifted windows[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 10012-10022.
- Liu S, Huang D, Wang Y. Learning spatial fusion for single-shot object detection[J]. arXiv preprint arXiv:1911.09516, 2019.
- Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7132-7141.
- Woo S, Park J, Lee J Y, et al. Cbam: Convolutional block attention module[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 3-19.
- Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125.
- Liu S, Qi L, Qin H, et al. Path aggregation network for instance segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 8759-8768.
- Guo C, Fan B, Zhang Q, et al. Augfpn: Improving multi-scale feature learning for object detection[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 12595-12604.
- Luo Y, Cao X, Zhang J, et al. CE-FPN: enhancing channel information for object detection[J]. Multimedia Tools and Applications, 2022, 81(21): 30685-30704.
- Stergiou A, Poppe R, Kalliatakis G. Refining activation downsampling with SoftPool[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 10357-10366.
- He K, Gkioxari G, Dollár P, et al. Mask r-cnn[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2961-2969.
- Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[J]. Advances in neural information processing systems, 2014, 27.
- Uijlings J R R, Van De Sande K E A, Gevers T, et al. Selective search for object recognition[J]. International journal of computer vision, 2013, 104: 154-171.
- Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects in context[C]//Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer International Publishing, 2014: 740-755.
- Loshchilov I, Hutter F. Decoupled weight decay regularization[J]. arXiv preprint arXiv:1711.05101, 2017.
- Redmon J, Farhadi A. Yolov3: An incremental improvement[J]. arXiv preprint arXiv:1804.02767, 2018.
- Zhu X, Su W, Lu L, et al. Deformable detr: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020.
- Carion N, Massa F, Synnaeve G, et al. End-to-end object detection with transformers[C]//Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer International Publishing, 2020: 213-229.
- Fu C Y, Liu W, Ranga A, et al. Dssd: Deconvolutional single shot detector[J]. arXiv preprint arXiv:1701.06659, 2017.
- Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 3431-3440.
- Chen T, Saxena S, Li L, et al. Pix2seq: A language modeling framework for object detection[J]. arXiv preprint arXiv:2109.10852, 2021.
- Tan M, Pang R, Le Q V. Efficientdet: Scalable and efficient object detection[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 10781-10790.
- Liu W, Anguelov D, Erhan D, et al. Ssd: Single shot multibox detector[C]//Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer International Publishing, 2016: 21-37.
- Wang W, Xie E, Li X, et al. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2021: 568-578.
- Han K, Xiao A, Wu E, et al. Transformer in transformer[J]. Advances in Neural Information Processing Systems, 2021, 34: 15908-15919.
- He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
- Redmon J, Farhadi A. YOLO9000: better, faster, stronger[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 7263-7271.
- Zheng M, Gao P, Zhang R, et al. End-to-end object detection with adaptive clustering transformer[J]. arXiv preprint arXiv:2011.09315, 2020.
- Qiu X, Sun T, Xu Y, et al. Pre-trained models for natural language processing: A survey[J]. Science China Technological Sciences, 2020, 63(10): 1872-1897.
- Lin T, Wang Y, Liu X, et al. A survey of transformers[J]. AI Open, 2022.
- Guo M H, et al. Attention mechanisms in computer vision: A survey[J]. Computational Visual Media, 2022, 8(3): 331-368.
- Everingham M, Van Gool L, Williams C K I, et al. The pascal visual object classes (voc) challenge[J]. International journal of computer vision, 2010, 88: 303-338.
- Dai Z, et al. Funnel-transformer: Filtering out sequential redundancy for efficient language processing[J]. Advances in Neural Information Processing Systems, 2020, 33: 4271-4282.
- Roy A, et al. Efficient content-based sparse attention with routing transformers[J]. Transactions of the Association for Computational Linguistics, 2021, 9: 53-68.
- Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2014: 580-587.
- Gao Z, Xie J, Wang Q, et al. Global second-order pooling convolutional networks[C]//Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. 2019: 3024-3033.
- Wang X, Girshick R, Gupta A, et al. Non-local neural networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 7794-7803.
- Mnih V, Heess N, Graves A. Recurrent models of visual attention[J]. Advances in neural information processing systems, 2014, 27.
- Jaderberg M, Simonyan K, Zisserman A. Spatial transformer networks[J]. Advances in neural information processing systems, 2015, 28.
- Hearst M A, Dumais S T, Osuna E, et al. Support vector machines[J]. IEEE Intelligent Systems and their applications, 1998, 13(4): 18-28.
- Wu B, et al. Visual transformers: Token-based image representation and processing for computer vision[J]. arXiv preprint arXiv:2006.03677, 2020.
- Cordts M, Omran M, Ramos S, et al. The cityscapes dataset for semantic urban scene understanding[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 3213-3223.

| Method | Backbone | Schedule | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN* | ResNet-50-FPN | ×1 | 36.4 | 58.1 | 39.1 | 21.3 | 40.5 | 44.6 |
| Faster R-CNN* | ResNet-101-FPN | ×1 | 38.6 | 60.0 | 42.1 | 22.2 | 42.5 | 47.1 |
| Faster R-CNN* | ResNet-101-FPN | ×2 | 39.4 | 61.1 | 43.2 | 22.6 | 42.7 | 50.1 |
| Faster R-CNN* | ResNeXt-101-32x4d-FPN | ×1 | 40.3 | 62.6 | 43.6 | 24.5 | 42.9 | 49.9 |
| Faster R-CNN* | ResNeXt-101-64x4d-FPN | ×1 | 41.7 | 64.9 | 44.4 | 24.7 | 45.8 | 51.3 |
| Mask R-CNN* | ResNet-50-FPN | ×1 | 37.1 | 58.9 | 40.3 | 22.3 | 40.5 | 45.5 |
| Mask R-CNN* | ResNet-101-FPN | ×1 | 39.1 | 61.2 | 42.2 | 22.8 | 42.3 | 49.2 |
| Mask R-CNN* | ResNet-101-FPN | ×2 | 40.0 | 61.8 | 43.7 | 22.7 | 43.4 | 52.1 |
| RetinaNet* | ResNet-50-FPN | ×1 | 35.8 | 55.7 | 38.7 | 19.4 | 39.7 | 44.9 |
| RetinaNet* | MobileNet-v2-FPN | ×1 | 32.9 | 52.1 | 34.9 | 17.9 | 34.8 | 42.6 |
| DETR | ResNet-50 | ×1 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 |
| Deformable DETR | ResNet-50 | ×1 | 43.8 | 62.6 | 47.7 | 26.4 | 47.1 | 58.0 |
| Deformable DETR+IAM | ResNet-50 | ×1 | 44.3 | 62.9 | 48.3 | 26.3 | 47.6 | 60.3 |
| Faster R-CNN* (ours) | ResNet-50-TE-FPN | ×1 | 38.4 | 61.0 | 41.9 | 23.1 | 41.7 | 47.5 |
| Faster R-CNN (ours) | ResNet-101-TE-FPN | ×1 | 40.2 | 62.6 | 43.6 | 23.5 | 43.5 | 50.9 |
| Faster R-CNN (ours) | ResNet-101-TE-FPN | ×2 | 41.1 | 63.4 | 44.3 | 23.6 | 44.1 | 52.7 |
| Faster R-CNN (ours) | ResNeXt-101-32x4d-TE-FPN | ×1 | 41.5 | 63.8 | 45.1 | 24.8 | 45.1 | 52.3 |
| Faster R-CNN (ours) | ResNeXt-101-64x4d-TE-FPN | ×1 | 42.7 | 65.4 | 46.0 | 25.9 | 45.9 | 53.5 |
| Mask R-CNN (ours) | ResNet-50-TE-FPN | ×1 | 38.9 | 61.1 | 42.4 | 23.2 | 42.2 | 49.0 |
| Mask R-CNN (ours) | ResNet-101-TE-FPN | ×1 | 40.4 | 63.0 | 44.2 | 23.7 | 43.3 | 51.4 |
| Mask R-CNN (ours) | ResNet-101-TE-FPN | ×2 | 41.5 | 63.6 | 45.7 | 24.1 | 44.2 | 53.2 |
| RetinaNet (ours) | ResNet-50-TE-FPN | ×1 | 36.9 | 57.9 | 39.6 | 20.8 | 40.1 | 46.4 |
| RetinaNet (ours) | MobileNet-v2-TE-FPN | ×1 | 33.9 | 53.7 | 35.8 | 18.5 | 35.7 | 43.9 |
| TIG-DETR | ResNet-50 | ×1 | 43.1 | 62.1 | 46.2 | 24.7 | 46.8 | 60.5 |
| TIG-DETR | ResNet-50-TE-FPN | ×1 | 44.1 | 62.8 | 48.4 | 25.6 | 47.9 | 62.4 |
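
All COCO results above use the standard evaluation protocol: AP is averaged over IoU thresholds 0.50:0.95, AP50 and AP75 are single-threshold scores, and APS, APM, and APL are restricted to small, medium, and large objects, respectively. As a minimal illustration of how such numbers are obtained (not the authors' pipeline; both file paths are placeholders), COCO-style box AP is typically computed with pycocotools:

```python
# Minimal sketch (illustrative only): computing the AP/AP50/AP75/APS/APM/APL columns
# reported above with pycocotools. Both file paths below are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

ann_file = "annotations/instances_val2017.json"  # COCO ground-truth annotations
res_file = "detections_val2017.json"             # detector outputs in COCO results format

coco_gt = COCO(ann_file)
coco_dt = coco_gt.loadRes(res_file)

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()  # prints AP (0.50:0.95), AP50, AP75, and APS/APM/APL
```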
| IAM | S-IAM | TE-FPN | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
|  |  |  | 40.3 | 60.5 | 42.9 | 22.2 | 44.5 | 57.4 |
| √ |  |  | 43.1 | 62.1 | 46.2 | 24.7 | 46.8 | 60.5 |
|  | √ |  | 40.7 | 60.6 | 44.1 | 22.1 | 44.4 | 59.1 |
|  |  | √ | 43.7 | 62.4 | 47.6 | 26.7 | 47.6 | 60.7 |
| √ |  | √ | 44.1 | 62.8 | 48.4 | 26.6 | 47.9 | 62.4 |
| SRS | ETA | LFA | SRS+FWA | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|---|
|  |  |  |  | 36.2 | 56.1 | 38.6 | 20.0 | 39.6 | 47.5 |
| √ |  |  |  | 36.8 | 59.1 | 39.8 | 20.7 | 40.2 | 48.3 |
|  | √ |  |  | 37.0 | 56.7 | 39.9 | 20.8 | 40.3 | 48.1 |
|  |  | √ |  | 36.8 | 56.5 | 39.3 | 20.6 | 40.2 | 48.0 |
|  |  |  | √ | 37.5 | 57.4 | 40.1 | 21.5 | 40.7 | 49.0 |
|  | √ | √ |  | 37.5 | 57.5 | 40.2 | 21.4 | 41.0 | 49.4 |
|  | √ |  | √ | 37.6 | 58.0 | 40.1 | 21.5 | 41.3 | 49.6 |
|  |  | √ | √ | 37.8 | 57.9 | 40.4 | 21.6 | 41.2 | 49.8 |
|  | √ | √ | √ | 38.2 | 58.8 | 40.9 | 21.9 | 42.3 | 50.4 |
| Method | AP[val] | AP | AP50 | person | rider | car | truck | bus | train | motorcycle | bicycle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN [fine-only] | 31.5 | 26.2 | 49.9 | 30.5 | 23.7 | 46.9 | 22.8 | 32.2 | 18.6 | 19.1 | 16.0 |
| Mask R-CNN [COCO] | 36.4 | 32.0 | 58.1 | 34.8 | 27.0 | 49.1 | 30.1 | 40.9 | 30.9 | 24.1 | 18.7 |
| TE-FPN [fine-only] | 34.2 | 29.5 | 54.8 | 34.0 | 27.8 | 52.7 | 25.6 | 35.2 | 23.0 | 21.1 | 19.1 |
| TE-FPN [COCO] | 39.6 | 34.9 | 61.2 | 39.1 | 31.1 | 54.3 | 31.5 | 43.9 | 31.1 | 26.2 | 22.4 |
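
In the Cityscapes table, AP[val] is instance-segmentation AP on the validation set, while AP, AP50, and the per-class columns are test-set scores. As a hedged sketch only (the helper below, its `predictions` structure, and all paths are illustrative assumptions rather than the authors' code), predictions are usually exported in the format expected by the official cityscapesScripts instance-level evaluator, which then produces the AP, AP50, and per-class numbers:

```python
# Illustrative only: write predictions in the layout expected by the Cityscapes
# instance-level evaluation scripts: one txt file per image, each line giving a
# mask PNG (relative path), a label ID, and a confidence score.
import os
import numpy as np
from PIL import Image

def save_cityscapes_results(image_id, predictions, out_dir):
    """predictions: list of (binary_mask as HxW uint8 array, label_id, confidence)."""
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f"{image_id}.txt"), "w") as f:
        for i, (mask, label_id, score) in enumerate(predictions):
            mask_name = f"{image_id}_{i}.png"
            Image.fromarray((mask > 0).astype(np.uint8) * 255).save(
                os.path.join(out_dir, mask_name))
            f.write(f"{mask_name} {label_id} {score:.6f}\n")
```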
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).