Submitted: 12 May 2025
Posted: 12 May 2025
Abstract

Keywords:
1. Introduction
- (1) Propose SDRFPT-Net, a novel multispectral object detection architecture that extracts and integrates multimodal features through a dual-stream separated spectral structure;
- (2) Design the Spectral Recursive Fusion Module (SRFM), which achieves efficient deep feature interaction through a hybrid attention mechanism and a recursive progressive fusion strategy;
- (3) Develop the Spectral Target Perception Enhancement Module (STPEM), which enhances target feature representation and suppresses background interference;
- (4) Validate SDRFPT-Net on multiple public datasets, where it achieves state-of-the-art detection performance while maintaining computational efficiency.
2. Materials and Methods
2.1. Multispectral Object Detection
2.2. Feature Fusion Strategies
2.3. YOLO Series in Multispectral Object Detection
3. Methodology
3.1. Spectral Hierarchical Perception Architecture (SHPA)
3.1.1. Dual-stream Separated Spectral Architecture Design
- (1) It allows extraction strategies tailored to each spectral domain, so that each stream adapts to the characteristics of its own modality;
- (2) It preserves the unique information of each spectral domain, avoiding the information loss that can occur when both modalities are processed in a single network;
- (3) It captures the feature distributions of different spectral domains through independent parameters, improving the diversity of feature representations.
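The three properties above can be illustrated with a minimal sketch of two streams that share a structure but not parameters. This is an illustration under stated assumptions, not the SDRFPT-Net backbone: the 1x1 projection, the single-channel infrared input, the channel widths, and the names `make_stream`, `rgb_stream`, and `ir_stream` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_stream(in_ch, out_ch):
    """Build a toy per-modality extractor (hypothetical stand-in for a backbone).

    Each call draws its own weights, so the two streams never share parameters.
    """
    w = rng.standard_normal((out_ch, in_ch)) * 0.1

    def forward(x):
        # x has shape (in_ch, H, W); a 1x1 convolution is a per-pixel channel mix.
        y = np.einsum("oc,chw->ohw", w, x)
        return np.maximum(y, 0.0)  # ReLU

    return forward, w

# Assumption: 3-channel visible input, 1-channel infrared input.
rgb_stream, w_rgb = make_stream(in_ch=3, out_ch=8)
ir_stream, w_ir = make_stream(in_ch=1, out_ch=8)

rgb = rng.random((3, 16, 16))
ir = rng.random((1, 16, 16))

f_rgb = rgb_stream(rgb)  # visible-stream features, shape (8, 16, 16)
f_ir = ir_stream(ir)     # infrared-stream features, shape (8, 16, 16)
```

Because both streams emit the same output shape, their features can later be fused element-wise or through attention, while the weights that produced them remain modality-specific.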
3.1.2. Multi-scale Spectral Feature Expansion
3.1.3. Feature Aggregation and Detection
3.2. Spectral Recursive Fusion Module (SRFM)
3.2.1. Hybrid Attention Mechanism
3.2.2. Recursive Progressive Fusion Strategy
- First iteration: Initial fusion stage. Captures basic intra-modal and inter-modal relationships and establishes initial feature interaction;
- Second iteration: Feature reinforcement stage. Builds on the relationships established in the first iteration, strengthening important feature connections and suppressing noise and irrelevant information;
- Third iteration: Feature refinement stage. Performs final optimization and fine-tuning of the features, forming high-quality fusion representations.
- Multi-scale feature selection: The fusion strategy is applied separately at three scales (P3/8, P4/16, and P5/32), ensuring thorough fusion of features at every scale;
- Inter-scale information flow: Information is exchanged between features of different scales through the FPN and PAN structures.
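The strategy described in the list above, three fusion rounds applied independently at P3/8, P4/16, and P5/32, can be sketched as follows. This is a toy illustration, not the SRFM itself: the residual `tanh` mixing in `fusion_step`, the reuse of one weight matrix across rounds, the 640-pixel input, and the channel width are all assumptions made for the example.

```python
import numpy as np

# Scale name -> downsampling stride, as in P3/8, P4/16, P5/32.
STRIDES = {"P3": 8, "P4": 16, "P5": 32}

def fusion_step(f_rgb, f_ir, w):
    """One hypothetical fusion round: both modalities get the same residual update."""
    mixed = np.tanh((f_rgb + f_ir) @ w)
    return f_rgb + mixed, f_ir + mixed

def recursive_fusion(f_rgb, f_ir, w, n_iters=3):
    """Apply the same fusion step n_iters times (weights reused across rounds)."""
    for _ in range(n_iters):
        f_rgb, f_ir = fusion_step(f_rgb, f_ir, w)
    return f_rgb, f_ir

def fuse_pyramid(image_size=640, channels=4, n_iters=3, seed=0):
    """Run recursive fusion separately at each scale on random stand-in features."""
    rng = np.random.default_rng(seed)
    fused = {}
    for name, stride in STRIDES.items():
        side = image_size // stride          # spatial size of this level
        tokens = side * side                 # flatten H x W into a token axis
        f_rgb = rng.standard_normal((tokens, channels))
        f_ir = rng.standard_normal((tokens, channels))
        w = rng.standard_normal((channels, channels)) * 0.1
        out_rgb, out_ir = recursive_fusion(f_rgb, f_ir, w, n_iters)
        fused[name] = out_rgb + out_ir       # simple sum as the fused map
    return fused

fused = fuse_pyramid()
shapes = {name: f.shape for name, f in fused.items()}
```

With a 640 x 640 input, the three levels hold 80 x 80, 40 x 40, and 20 x 20 positions respectively. The inter-scale exchange through FPN and PAN is deliberately left out of this sketch.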
3.3. Spectral Target Perception Enhancement Module (STPEM)
3.3.1. Lightweight Mask Prediction
3.3.2. Similarity Calculation and Adjustment
3.3.3. Feature Enhancement Mechanism
4. Experiments
4.1. Datasets and Evaluation Metrics
4.1.1. Datasets
4.1.2. Metrics
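The result tables report P, R, mAP50, and mAP50:95. As a reminder of how the last two are conventionally defined, the sketch below assumes the common COCO-style convention: mAP50 uses an IoU threshold of 0.50, and mAP50:95 averages over thresholds from 0.50 to 0.95 in steps of 0.05. The helper names are illustrative, and the per-threshold AP values must come from a real evaluator.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Ten IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]

def mean_over_thresholds(ap_per_threshold):
    """Average AP values (dict: threshold -> AP) over the ten thresholds."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)

# Two 2x2 boxes offset by one unit overlap in a 1x1 square: IoU = 1 / 7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A detection counts as a true positive at a given threshold only if its IoU with a ground-truth box reaches that threshold, which is why mAP50:95 rewards tighter localization than mAP50.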
4.2. Experimental Setup
4.3. Comparison with State-of-the-Art Methods
4.3.1. On the FLIR-aligned Dataset
4.3.2. On the LLVIP Dataset
4.4. Ablation Studies
4.4.1. Baseline Model Comparison
4.4.2. Ablation Experiments on Hybrid Attention Mechanism
- Self-attention mechanism (B1): Mainly focuses on target contours and edge information, effectively capturing spatial contextual relationships, with strong response to target boundaries, helping to improve localization accuracy;
- Cross-modal attention mechanism (B2): Presents overall attention to target areas, integrating complementary information from RGB and infrared modalities, but with relatively weak background suppression capability;
- Channel attention mechanism (B3): Demonstrates selective enhancement of specific semantic information, highlighting important feature channels, with strong response to specific parts of targets, improving the discriminability of feature representation.
- Self-attention + Cross-modal attention (B4): The feature map simultaneously possesses excellent boundary localization capability and overall target region representation capability. The heatmap shows precise response to target regions with significant background suppression effect. This combination fully leverages the complementary advantages of self-attention in spatial modeling and cross-modal attention in multi-modal fusion, enabling it to reach 0.424 in mAP50:95, approaching the performance of the full attention mechanism.
- Self-attention + Channel attention (B5): The feature map enhances the representation of specific semantic features while preserving target boundary information. The heatmap shows strong response to key parts of targets, enabling the model to better distinguish different categories of targets. It achieves 0.409 in mAP50:95, marginally above the best single attention mechanism (B1, 0.408).
- Cross-modal attention + Channel attention (B6): The feature map enhances specific channel representation based on multi-modal fusion, but lacks the spatial context modeling capability of self-attention. The heatmap shows some response to target regions, but boundaries are not clear enough and background suppression is relatively weak, which is consistent with its lower performance (0.362 in mAP50:95).
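The three mechanisms compared above can be combined as in the following sketch. It is a simplified illustration, not the paper's hybrid attention block: learned query, key, and value projections are omitted, the channel gate is an unlearned sigmoid of the per-channel mean, and summing the branches before gating is an assumed composition order.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_src, kv_src):
    """Scaled dot-product attention: queries from q_src, keys and values from kv_src."""
    d = q_src.shape[-1]
    scores = q_src @ kv_src.T / np.sqrt(d)
    return softmax(scores) @ kv_src

def channel_attention(x):
    """Gate each channel by a sigmoid of its mean over tokens (no learned weights)."""
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=0, keepdims=True)))
    return x * gate

def hybrid_attention(f_rgb, f_ir):
    """Combine self-, cross-modal, and channel attention for both modalities."""
    # Self-attention: each modality attends to itself (spatial context).
    s_rgb, s_ir = attention(f_rgb, f_rgb), attention(f_ir, f_ir)
    # Cross-modal attention: each modality queries the other one.
    c_rgb, c_ir = attention(f_rgb, f_ir), attention(f_ir, f_rgb)
    # Channel attention re-weights the residual sum of all branches.
    out_rgb = channel_attention(f_rgb + s_rgb + c_rgb)
    out_ir = channel_attention(f_ir + s_ir + c_ir)
    return out_rgb, out_ir

rng = np.random.default_rng(0)
f_rgb = rng.standard_normal((6, 4))  # 6 tokens, 4 channels
f_ir = rng.standard_normal((6, 4))
out_rgb, out_ir = hybrid_attention(f_rgb, f_ir)
```

Dropping one branch from `hybrid_attention` gives the kind of variant that configurations B1 to B6 isolate, although the sketch makes no claim about reproducing their reported scores.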
4.4.3. Ablation Experiments on Spectral Hierarchical Recursive Progressive Fusion Strategy
- Complementarity enhancement: the advanced fusion strategy makes features at different scales complementary, with P3 focusing on details and small targets, P4 processing medium targets, and P5 capturing large-scale structures and semantic information;
- Information flow optimization: features at different scales mutually enhance each other, with semantic information guiding small target detection and detail information precisely locating large target boundaries;
- Noise suppression capability: the advanced fusion strategy demonstrates superior background noise suppression capability at all scales, effectively reducing false detections.
- P3 feature layer (high resolution): As the iteration count increases, the feature map gradually evolves from initial dispersed response (n=1) to more focused target representation (n=2,3), with clearer boundaries and stronger background suppression effect. However, when the iteration count reaches 4 and 5, over-smoothing phenomena begin to appear, with some loss of boundary details.
- P4 feature layer (medium resolution): At n=1, the feature map has basic response to targets but is not focused enough. After 2-3 rounds of iteration, the activation intensity of target areas significantly increases, improving target differentiation. Continuing to increase the iteration count to 4-5 rounds, feature response begins to diffuse, reducing precise localization capability.
- P5 feature layer (low resolution): This layer demonstrates the most obvious evolution trend, gradually developing from blurred response at n=1 to highly structured representation at n=3 that can clearly distinguish main targets. However, obvious signs of overfitting appear at n=4 and n=5, with feature maps becoming overly smoothed and target representation degrading.
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest

| Methods | Modality | P | R | mAP50 | mAP50:95 |
|---|---|---|---|---|---|
| YOLOv5 | Visible | 0.531 | 0.395 | 0.441 | 0.202 |
| YOLOv5 | Infrared | 0.625 | 0.468 | 0.539 | 0.272 |
| YOLOv8 | Visible | 0.532 | 0.396 | 0.448 | 0.218 |
| YOLOv8 | Infrared | 0.559 | 0.514 | 0.549 | 0.288 |
| YOLOv10 | Visible | 0.727 | 0.538 | 0.620 | 0.305 |
| YOLOv10 | Infrared | 0.773 | 0.618 | 0.727 | 0.424 |
| YOLOv10-add | V-I | 0.748 | 0.623 | 0.701 | 0.354 |
| CMA-Det | V-I | 0.812 | 0.468 | 0.518 | 0.237 |
| TFDet | V-I | 0.827 | 0.606 | 0.653 | 0.346 |
| CMAFF | V-I | 0.792 | 0.550 | 0.558 | 0.302 |
| BA-CAMF Net | V-I | 0.798 | 0.632 | 0.704 | 0.351 |
| SDRFPT-Net (ours) | V-I | 0.854 | 0.700 | 0.785 | 0.426 |

| Methods | Modality | P | R | mAP50 | mAP50:95 |
|---|---|---|---|---|---|
| YOLOv5 | Visible | 0.906 | 0.820 | 0.895 | 0.504 |
| YOLOv5 | Infrared | 0.962 | 0.898 | 0.960 | 0.631 |
| YOLOv8 | Visible | 0.933 | 0.829 | 0.896 | 0.513 |
| YOLOv8 | Infrared | 0.956 | 0.901 | 0.961 | 0.645 |
| YOLOv10 | Visible | 0.914 | 0.833 | 0.892 | 0.512 |
| YOLOv10 | Infrared | 0.962 | 0.909 | 0.961 | 0.637 |
| YOLOv10-add | V-I | 0.961 | 0.893 | 0.957 | 0.628 |
| TFDet | V-I | 0.960 | 0.896 | 0.960 | 0.594 |
| CMAFF | V-I | 0.958 | 0.899 | 0.915 | 0.574 |
| BA-CAMF Net | V-I | 0.866 | 0.828 | 0.887 | 0.511 |
| SDRFPT-Net (ours) | V-I | 0.963 | 0.911 | 0.963 | 0.706 |

| ID | SHPA | SRFM | STPEM | mAP50 | mAP50:95 |
|---|---|---|---|---|---|
| A1 | ✔ |  |  | 0.701 | 0.354 |
| A2 | ✔ | ✔ |  | 0.775 | 0.373 |
| A3 | ✔ | ✔ | ✔ | 0.785 | 0.426 |

| ID | Self-attention | Cross-attention | Channel-attention | mAP50 | mAP50:95 |
|---|---|---|---|---|---|
| B1 | ✔ |  |  | 0.776 | 0.408 |
| B2 |  | ✔ |  | 0.749 | 0.372 |
| B3 |  |  | ✔ | 0.730 | 0.384 |
| B4 | ✔ | ✔ |  | 0.774 | 0.424 |
| B5 | ✔ |  | ✔ | 0.763 | 0.409 |
| B6 |  | ✔ | ✔ | 0.729 | 0.362 |
| B7 | ✔ | ✔ | ✔ | 0.785 | 0.426 |

| ID | P3/8 | P4/16 | P5/32 | mAP50 | mAP50:95 |
|---|---|---|---|---|---|
| C1 | Add | Add | Add | 0.701 | 0.354 |
| C2 | SRFM+STPEM | Add | Add | 0.769 | 0.404 |
| C3 | SRFM+STPEM | SRFM+STPEM | Add | 0.776 | 0.410 |
| C4 | SRFM+STPEM | SRFM+STPEM | SRFM+STPEM | 0.785 | 0.426 |

| ID | Iterations (n) | mAP50 | mAP50:95 |
|---|---|---|---|
| D1 | 1 | 0.769 | 0.395 |
| D2 | 2 | 0.783 | 0.418 |
| D3 | 3 | 0.785 | 0.426 |
| D4 | 4 | 0.783 | 0.400 |
| D5 | 5 | 0.761 | 0.417 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).