Submitted: 30 March 2025
Posted: 31 March 2025
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Remote Sensing Object Detection
2.2. Vision Mamba
3. Method
3.1. Vision Mamba
3.2. Framework of MambaRetinanet Network
3.3. Synergistic Perception Module (SPM)
3.4. MambaFPN
4. Results and Analysis
4.1. Dataset
4.2. Implementation Details
4.3. Results
4.4. Ablation Experiments
5. Discussion
6. Conclusion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 2016, 39, 1137–1149.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3974–3983.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30.
- Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 16794–16805.
- Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI conference on artificial intelligence, 2021, Vol. 35, pp. 3163–3171.
- Tian, Y.; Zhang, M.; Li, J.; Li, Y.; Yang, H.; Li, W. FPNFormer: Rethink the method of processing the rotation-invariance and rotation-equivariance on arbitrary-oriented object detection. IEEE Transactions on Geoscience and Remote Sensing 2024, 62, 1–10.
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752 2023.
- Gu, A.; Goel, K.; Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396 2021.
- Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems 2021, 34, 572–585.
- Gupta, A.; Gu, A.; Berant, J. Diagonal state spaces are as effective as structured state spaces. Advances in Neural Information Processing Systems 2022, 35, 22982–22994.
- Pei, X.; Huang, T.; Xu, C. Efficientvmamba: Atrous selective scan for light weight visual mamba. arXiv preprint arXiv:2403.09977 2024.
- Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. Localmamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338 2024.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
- Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE transactions on geoscience and remote sensing 2016, 54, 7405–7415.
- Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2786–2795.
- Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive rotated convolution for rotated object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 6589–6600.
- Han, J.; Ding, J.; Li, J.; Xia, G.S. Align Deep Features for Oriented Object Detection. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–11.
- Hou, L.; Lu, K.; Xue, J. Refined one-stage oriented object detection method for remote sensing images. IEEE Transactions on Image Processing 2022, 31, 1545–1558.
- Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1829–1838.
- Cai, X.; Lai, Q.; Wang, Y.; Wang, W.; Sun, Z.; Yao, Y. Poly kernel inception network for remote sensing detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 27706–27716.
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
- Guo, H.; Yang, X.; Wang, N.; Song, B.; Gao, X. A rotational libra R-CNN method for ship detection. IEEE Transactions on Geoscience and Remote Sensing 2020, 58, 5772–5781.
- Zhou, Z.; Zhu, Y. RaFPN: Relation-aware Feature Pyramid Network for Dense Image Prediction. IEEE Transactions on Multimedia 2024.
- Zhang, W.; Jiao, L.; Li, Y.; Huang, Z.; Wang, H. Laplacian feature pyramid network for object detection in VHR optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 2021, 60, 1–14.
- Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning modulated loss for rotated object detection. In Proceedings of the AAAI conference on artificial intelligence, 2021, Vol. 35, pp. 2458–2466.
- Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 11830–11841.
- Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. Advances in Neural Information Processing Systems 2021, 34, 18381–18394.
- Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A general Gaussian heatmap label assignment for arbitrary-oriented object detection. IEEE Transactions on Image Processing 2022, 31, 1895–1910.
- Xu, C.; Ding, J.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Dynamic coarse-to-fine learning for oriented tiny object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7318–7328.
- Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-adaptive selection and measurement for oriented object detection. In Proceedings of the AAAI conference on artificial intelligence, 2022, Vol. 36, pp. 923–932.
- Hou, L.; Lu, K.; Yang, X.; Li, Y.; Xue, J. G-rep: Gaussian representation for arbitrary-oriented object detection. Remote Sensing 2023, 15, 757.
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model, 2024, [arXiv:cs.CV/2401.09417].
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020.
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Advances in neural information processing systems 2024, 37, 103031–103063.
- Yu, W.; Wang, X. Mambaout: Do we really need mamba for vision? arXiv preprint arXiv:2405.07992 2024.
- Han, D.; Wang, Z.; Xia, Z.; Han, Y.; Pu, Y.; Ge, C.; Song, J.; Song, S.; Zheng, B.; Huang, G. Demystify mamba in vision: A linear attention perspective. arXiv preprint arXiv:2405.16605 2024.
- Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722 2024.
- Ruan, J.; Li, J.; Xiang, S. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint arXiv:2402.02491 2024.
- Wang, Z.; Zheng, J.Q.; Zhang, Y.; Cui, G.; Li, L. Mamba-unet: Unet-like pure visual mamba for medical image segmentation. arXiv preprint arXiv:2402.05079 2024.
- Xing, Z.; Ye, T.; Yang, Y.; Liu, G.; Zhu, L. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2024, pp. 578–588.
- Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-based YOLO for object detection. arXiv preprint arXiv:2406.05835 2024.
- Chen, T.; Ye, Z.; Tan, Z.; Gong, T.; Wu, Y.; Chu, Q.; Liu, B.; Yu, N.; Ye, J. Mim-istd: Mamba-in-mamba for efficient infrared small target detection. IEEE Transactions on Geoscience and Remote Sensing 2024.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
- Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE transactions on pattern analysis and machine intelligence 2021, 44, 7778–7796.
- Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Transactions on Geoscience and Remote Sensing 2022, 60, 1–11.
- Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. Mmrotate: A rotated object detection benchmark using pytorch. In Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 7331–7334.
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 2019, 32.
- Zhou, D.; Fang, J.; Song, X.; Guan, C.; Yin, J.; Dai, Y.; Yang, R. Iou loss for 2d/3d object detection. In Proceedings of the 2019 international conference on 3D vision (3DV). IEEE, 2019, pp. 85–94.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 2016, 39, 1137–1149.
- He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for detecting oriented objects in aerial images. arXiv preprint arXiv:1812.00155 2018.
- Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 3520–3529.
- Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic anchor learning for arbitrary-oriented object detection. In Proceedings of the AAAI conference on artificial intelligence, 2021, Vol. 35, pp. 2355–2363.
- Courtrai, L.; Pham, M.T.; Lefèvre, S. Small object detection in remote sensing images based on super-resolution with auxiliary generative adversarial networks. Remote Sensing 2020, 12, 3152.




| Method | Backbone | Plane | BD | Bridge | GTF | SV | LV | Ship | TC | BC | ST |
|---|---|---|---|---|---|---|---|---|---|---|---|
| multi-stage: | | | | | | | | | | | |
| FR OBB[52] | R50 | 71.61 | 47.20 | 39.28 | 58.70 | 35.55 | 48.88 | 51.51 | 78.97 | 58.36 | 58.55 |
| MR[53] | R50 | 76.20 | 49.91 | 41.61 | 60.00 | 41.08 | 50.77 | 56.24 | 78.01 | 55.85 | 57.48 |
| RT[54] | R50 | 71.81 | 48.39 | 45.88 | 64.02 | 42.09 | 54.39 | 59.92 | 82.70 | 63.29 | 58.71 |
| Oriented R-CNN[55] | R50 | 77.95 | 50.29 | 46.73 | 65.24 | 42.61 | 54.56 | 60.02 | 79.08 | 61.69 | 59.42 |
| one-stage: | | | | | | | | | | | |
| DAL[56] | R50 | 71.23 | 38.36 | 38.60 | 45.24 | 35.42 | 43.75 | 56.04 | 70.84 | 50.87 | 56.63 |
| SASM[57] | R50 | 70.30 | 40.62 | 37.01 | 59.03 | 40.21 | 45.46 | 44.60 | 78.58 | 49.34 | 60.73 |
| RetinaNet-O[3] | R50 | 70.63 | 47.26 | 39.12 | 55.02 | 38.10 | 40.52 | 47.16 | 77.74 | 56.86 | 52.12 |
| R3Det w/KLD[29] | R50 | 75.44 | 50.95 | 41.16 | 61.61 | 41.11 | 45.76 | 49.65 | 78.52 | 54.97 | 60.79 |
| OrientedReP[21] | R50 | 73.02 | 46.68 | 42.37 | 63.05 | 47.06 | 50.28 | 58.64 | 78.84 | 57.12 | 66.77 |
| S2A-Net[19] | R50 | 77.84 | 51.31 | 43.72 | 62.59 | 47.51 | 50.58 | 57.86 | 80.73 | 59.11 | 65.32 |
| DCFL[31] | R50 | 78.30 | 53.03 | 44.24 | 60.17 | 48.56 | 55.42 | 58.66 | 78.29 | 60.89 | 65.93 |
| Ours: | | | | | | | | | | | |
| MambaRetinanet | VMamba-T | 79.58 | 56.43 | 46.33 | 63.38 | 45.54 | 52.88 | 56.55 | 83.78 | 66.59 | 63.37 |

| Method | Backbone | SBF | RA | Harbor | SP | HC | CC | Air | Heli | mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| multi-stage: | | | | | | | | | | |
| FR OBB[52] | R50 | 36.11 | 51.73 | 43.57 | 55.33 | 57.07 | 3.51 | 52.94 | 2.79 | 47.31 |
| MR[53] | R50 | 36.62 | 51.67 | 47.39 | 55.79 | 59.06 | 3.64 | 60.26 | 8.95 | 49.47 |
| RT[54] | R50 | 41.04 | 52.82 | 53.32 | 56.18 | 57.94 | 25.71 | 63.72 | 8.70 | 52.81 |
| Oriented R-CNN[55] | R50 | 42.26 | 56.89 | 51.11 | 56.16 | 59.33 | 25.81 | 60.67 | 9.17 | 53.28 |
| one-stage: | | | | | | | | | | |
| DAL[56] | R50 | 20.28 | 46.53 | 33.49 | 47.29 | 12.15 | 0.81 | 25.77 | 0.00 | 38.52 |
| SASM[57] | R50 | 29.89 | 46.57 | 42.95 | 48.31 | 28.13 | 1.82 | 76.37 | 0.74 | 44.53 |
| RetinaNet-O[3] | R50 | 37.22 | 51.75 | 44.15 | 53.19 | 51.06 | 6.58 | 64.28 | 7.45 | 46.68 |
| R3Det w/KLD[29] | R50 | 42.07 | 53.20 | 43.08 | 49.55 | 34.09 | 36.26 | 68.65 | 0.06 | 47.26 |
| OrientedReP[21] | R50 | 35.21 | 50.76 | 48.77 | 51.62 | 34.23 | 6.17 | 64.66 | 5.87 | 48.95 |
| S2A-Net[19] | R50 | 36.43 | 52.60 | 45.36 | 52.46 | 40.12 | 0.00 | 62.81 | 11.11 | 49.86 |
| DCFL[31] | R50 | 43.54 | 55.82 | 53.33 | 60.00 | 54.76 | 30.90 | 74.01 | 15.60 | 55.08 |
| Ours: | | | | | | | | | | |
| MambaRetinanet | VMamba-T | 49.48 | 58.44 | 54.91 | 60.03 | 63.33 | 27.81 | 80.39 | 20.17 | 57.17 |

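
The mAP column of the per-class comparison can be sanity-checked from the per-class AP values. The sketch below assumes that mAP is the unweighted mean of the 18 per-class APs; this is an assumption on our part, although the two rows checked here (MambaRetinanet and DCFL) are consistent with it. The helper name `mean_ap` and the dictionaries are illustrative only, with values copied from the per-class table.

```python
from statistics import mean

# Per-class AP values copied from the per-class comparison table
# (Plane, BD, Bridge, GTF, SV, LV, Ship, TC, BC, ST, then
#  SBF, RA, Harbor, SP, HC, CC, Air, Heli).
per_class_ap = {
    "MambaRetinanet": [
        79.58, 56.43, 46.33, 63.38, 45.54, 52.88, 56.55, 83.78, 66.59, 63.37,
        49.48, 58.44, 54.91, 60.03, 63.33, 27.81, 80.39, 20.17,
    ],
    "DCFL": [
        78.30, 53.03, 44.24, 60.17, 48.56, 55.42, 58.66, 78.29, 60.89, 65.93,
        43.54, 55.82, 53.33, 60.00, 54.76, 30.90, 74.01, 15.60,
    ],
}

# mAP values as reported in the last column of the same table.
reported_map = {"MambaRetinanet": 57.17, "DCFL": 55.08}


def mean_ap(aps):
    """Unweighted mean of per-class APs, rounded to two decimals (assumed definition)."""
    return round(mean(aps), 2)


for method, aps in per_class_ap.items():
    computed = mean_ap(aps)
    status = "matches" if computed == reported_map[method] else "differs from"
    print(f"{method}: computed {computed:.2f} {status} reported {reported_map[method]:.2f}")
```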
| Method | Backbone | mAP |
|---|---|---|
| RetinaNet-O[3] | R50 | 59.16 |
| FR OBB[52] | R50 | 62.00 |
| MR[53] | R50 | 63.41 |
| RT[54] | R50 | 65.03 |
| ReDet[17] | ReR50 | 66.86 |
| DCFL[31] | R50 | 67.37 |
| MambaRetinanet | VMamba-T | 70.21 |

| Method | Backbone | mAP |
|---|---|---|
| RetinaNet-O[3] | R50 | 66.79 |
| KLD [29] | R50 | 72.76 |
| OrientedReP[21] | R50 | 71.94 |
| S2A-Net[19] | R50 | 73.91 |
| DCFL[31] | R50 | 75.35 |
| Oriented R-CNN[55] | R50 | 75.87 |
| ReDet[17] | ReR50 | 76.25 |
| MambaRetinanet | VMamba-T | 77.50 |

| Method | Backbone | mAP |
|---|---|---|
| RetinaNet-O[3] | R50 | 57.55 |
| Oriented R-CNN[55] | R50 | 59.54 |
| RT[54] | R50 | 63.87 |
| DCFL[31] | R50 | 66.80 |
| Oriented R-CNN w/ PKINet[22] | PKINet-S | 67.03 |
| MambaRetinanet | VMamba-T | 71.50 |

| Method | P1 | P2 | P3 | P4 | P5 | mAP |
|---|---|---|---|---|---|---|
| FPN | Conv | Conv | Conv | Conv | Conv | 51.70 |
| MambaFPN | SPM | SPM | Conv | Conv | Conv | 52.66 |
| MambaFPN | SPM | SPM | SPM | Conv | Conv | 54.47 |
| MambaFPN | SPM | SPM | SPM | SPM | Conv | 53.60 |
| MambaFPN | SPM | SPM | SPM | SPM | SPM | 53.61 |

| Method | Kernel Design | mAP |
|---|---|---|
| VSSBlock[36] | | 53.03 |
| EVSSBlock[13] | (3) | 53.00 |
| SPM | (3,3,5,7) | 54.47 |
| SPM | (3,5,7,9) | 53.90 |
| SPM | (3,3,5,7,9) | 54.26 |
| SPM | (3,5,7,9,11) | 53.56 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).