Submitted: 28 June 2025
Posted: 30 June 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Proposed Method
3.1. P2: Small Object Detection Head
3.2. DCS-Net Architecture
3.2.1. DCAM Module
3.2.2. C2f-DCAM
3.2.3. SCDown Module
- Pointwise Convolution (1×1): compresses the channel dimension from C_in to C_out, reducing redundancy and emphasizing salient features.
- Depthwise Convolution (k×k, stride s): performs channel-wise convolution for spatial downsampling, enabling the network to capture scale-specific information with a reduced parameter count.
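The parameter saving of this pointwise-then-depthwise factorization over a single standard strided convolution can be checked with simple arithmetic. The sketch below is illustrative, not the paper's implementation; the channel sizes (256 → 128) and kernel size (3×3) are assumptions chosen only to make the comparison concrete:

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def scdown_params(c_in: int, c_out: int, k: int) -> int:
    """Pointwise 1x1 conv (channel compression) followed by a
    depthwise k x k conv (spatial downsampling), bias omitted."""
    pointwise = c_in * c_out   # one weight per (input, output) channel pair
    depthwise = c_out * k * k  # one k x k filter per channel
    return pointwise + depthwise

# Illustrative downsampling step: 256 -> 128 channels with a 3x3 kernel.
standard = standard_conv_params(256, 128, 3)  # 294,912 parameters
factored = scdown_params(256, 128, 3)         # 32,768 + 1,152 = 33,920
print(f"standard: {standard}, SCDown-style: {factored}")
```

For these assumed sizes, the factorized form uses roughly an order of magnitude fewer parameters than the standard strided convolution, which is the efficiency argument the module design rests on.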
3.2.4. Overall Architecture
- Stage 1 The input image undergoes convolution and downsampling, reducing the feature-map size to 1/4 of the input while increasing the channel count to 64–128. This stage performs initial low-level feature extraction, focusing on edges and textures. The corresponding heat map mainly highlights edge contours, showing heightened sensitivity to low-level cues such as textures and boundaries; it captures fine spatial detail and lays the groundwork for subsequent feature extraction.
- Stage 2 The feature map is further reduced to 1/8 of the input size, emphasizing the extraction of fine local structure. This helps the model discern the boundary and shape characteristics of small targets. The heat map progressively narrows to the target regions, with pronounced highlights around small targets, indicating that the network is separating foreground from background and developing semantic recognition capability.
- Stage 3 The feature map is reduced further through the C2f-DCAM module. Its Dynamic Convolutional Attention Blending Module (DCAM) enhances the contextual semantic representation of small targets by combining parallel local enhancement (LePE branch) and global dependency modeling (attention branch), which notably improves detection under occlusion, in dense scenes, and against complex backgrounds. The heat map concentrates strongly on specific areas: responses in regions containing small targets are amplified while background structural information is preserved. This suggests that DCAM's global attention mechanism steers the network toward long-range dependencies on critical regions, sharpening the discrimination of small targets in cluttered scenes.
- Stage 4 The feature map is compressed once more, but downsampling is handled by the SCDown module rather than a conventional large-stride convolution. SCDown combines channel-wise (depthwise) spatial compression with a separate pointwise channel compression, cutting the parameter count while preserving essential spatial structure and mitigating information loss. Although the heat map's spatial resolution drops further, high responses over small-target areas are retained, which we attribute to SCDown preserving crucial spatial-layout features during its computational compression. Finally, the SPPF module (Fast Spatial Pyramid Pooling) fuses feature maps from different scales to enhance adaptability to multi-scale objects, especially when large and small objects must be detected simultaneously.
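The resolution schedule above can be sketched as plain arithmetic. The sketch assumes a 640×640 input (typical for YOLOv8 training, not stated in the text) and that each stage after Stage 1 halves the resolution again, following the standard YOLOv8 backbone stride pattern:

```python
def stage_resolutions(input_size: int = 640, num_stages: int = 4) -> list[tuple[int, int]]:
    """Spatial size of the feature map after each backbone stage,
    assuming Stage 1 reduces the input to 1/4 and every subsequent
    stage halves the resolution again (strides 4, 8, 16, 32)."""
    sizes = []
    side = input_size // 4  # Stage 1: 1/4 of the input side length
    for _ in range(num_stages):
        sizes.append((side, side))
        side //= 2          # each later stage halves the resolution
    return sizes

print(stage_resolutions(640))  # [(160, 160), (80, 80), (40, 40), (20, 20)]
```

Under these assumptions, a pixel in the Stage 4 map covers a 32×32 region of the input, which is why preserving spatial layout during the final compression matters for small targets.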
3.3. SDBIoU Loss Function
4. Results
4.1. Dataset
4.2. Experimental Environment and Training Strategy
4.3. Evaluation Metrics
- mAP@0.5 is calculated at an IoU threshold of 0.5.
- mAP@0.5:0.95 is computed by averaging AP values across IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05.
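The threshold sweep behind the second metric can be made concrete with a minimal sketch. The `ap_at_threshold` callable here is a hypothetical stand-in for a full precision-recall-curve AP computation, used only to show the averaging step:

```python
def coco_iou_thresholds() -> list[float]:
    """The ten IoU thresholds 0.5, 0.55, ..., 0.95 used by mAP@0.5:0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

def map_50_95(ap_at_threshold) -> float:
    """Average AP over the ten IoU thresholds; `ap_at_threshold` is any
    callable mapping an IoU threshold to an AP value in [0, 1]."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at_threshold(t) for t in thresholds) / len(thresholds)

# Toy example: an AP that degrades linearly as the threshold tightens.
print(map_50_95(lambda t: 1.0 - t))
```

Because the stricter thresholds (0.8–0.95) demand near-perfect localization, mAP@0.5:0.95 is consistently lower than mAP@0.5, which matches the gap between the two columns in the result tables.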
4.4. Experiment Results
4.4.1. Comparison of Loss Functions
4.4.2. Comparison with Different Mainstream Models
4.5. Ablation Experiments
4.6. Visual Assessment
4.7. Real-Time Object Detection
4.8. Discussion
- Baseline comparison: As shown in Table 5, compared to YOLOv8s, DCS-YOLOv8 achieves a 3.9% increase in mAP@0.5 (from 40.6% to 44.5%) and a 2.6% increase in mAP@0.5:0.95 (from 24.3% to 26.9%), while simultaneously reducing the parameter count from 11.1M to 9.9M, confirming both performance and efficiency gains.
- YOLOv8 series comparison: Table 2 demonstrates that DCS-YOLOv8 outperforms YOLOv8n, YOLOv8s, and YOLOv8m in precision, recall, and mAP metrics, and achieves comparable performance to the heavier YOLOv8l with fewer parameters. Notably, it surpasses YOLOv8s by 2.4% in precision and 2.6% in mAP@0.5:0.95, validating the scalability and effectiveness of our enhancements.
- Mainstream YOLO models comparison: Compared to YOLOv3, YOLOv5s, and YOLOv7, DCS-YOLOv8 delivers a better balance between inference speed and detection accuracy. Although YOLOv7 has fewer parameters, its detection performance lags behind, particularly in complex UAV imagery.
- State-of-the-art detectors comparison: Table 3 illustrates that DCS-YOLOv8 outperforms advanced detectors such as Faster R-CNN, Swin Transformer, Cascade R-CNN, and CenterNet, with a 4.8% increase in mAP@0.5 and a 2.7% increase in mAP@0.5:0.95 over the best-performing baseline, confirming its robustness and adaptability in real-world scenarios.
- Ablation studies: The incremental addition of each module was evaluated. As shown in Table 4 and Table 5, the SDBIoU loss improves mAP@0.5 by 0.2%, the P2 layer by a further 2.5%, and DCAM by 0.6%, while integrating SCDown leads to another 0.6% improvement in mAP@0.5. These results demonstrate that each component meaningfully contributes to the final model performance.
5. Conclusion
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Adil, M.; Song, H.; Jan, M.A.; Khan, M.K.; He, X.; Farouk, A.; Jin, Z. UAV-Assisted IoT Applications, QoS Requirements and Challenges with Future Research Directions. ACM Computing Surveys 2024, 56, 35. [CrossRef]
- Cai, W.; Wei, Z. Remote Sensing Image Classification Based on a Cross-Attention Mechanism and Graph Convolution. IEEE Geoscience and Remote Sensing Letters 2020. [CrossRef]
- Peng, C.; Zhu, M.; Ren, H.; Emam, M. Small Object Detection Method Based on Weighted Feature Fusion and CSMA Attention Module. Electronics 2022. [CrossRef]
- Feng, F.; Hu, Y.; Li, W.; Yang, F. Improved YOLOv8 algorithms for small object detection in aerial imagery. Journal of King Saud University - Computer and Information Sciences 2024, 36. [CrossRef]
- Zhang, X.; Zhang, T.; Jiao, J.L. Remote Sensing Object Detection Meets Deep Learning: A metareview of challenges and advances. Geoscience and remote sensing 2023, 11, 8–44. [CrossRef]
- Jiang, Y.; Xi, Y.; Zhang, L.; Wu, Y.; Tan, F.; Hou, Q. Infrared Small Target Detection Based on Local Contrast Measure With a Flexible Window. IEEE Geoscience and Remote Sensing Letters, 21. [CrossRef]
- Li, Z.; Dong, Y.; Shen, L.; Liu, Y.; Pei, Y.; Yang, H.; Zheng, L.; Ma, J. Development and challenges of object detection: A survey. Neurocomputing 2024, 598, 23. [CrossRef]
- Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sensing 2024, 16, 29. [CrossRef]
- Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision, 2015.
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587. [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis & Machine Intelligence 2017, 39, 1137–1149. [CrossRef]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the Computer Vision & Pattern Recognition, 2016.
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. IEEE 2017, pp. 6517–6525.
- Terven, J.; Cordova-Esparza, D.M.; Romero-Gonzalez, J.A. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Machine Learning and Knowledge Extraction 2023, 5, 1680–1716. [CrossRef]
- Bi, J.; Zhu, Z.; Meng, Q. Transformer in Computer Vision. In Proceedings of the 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), 2021, pp. 178–188. [CrossRef]
- Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision Transformer with Deformable Attention 2022.
- Shah, S.; Tembhurne, J. Object detection using convolutional neural networks and transformer-based models: a review. Journal of Electrical Systems and Information Technology 2023, 10, 1–35. [CrossRef]
- Islam, S.; Elmekki, H.; Pedrycz, R.W. A comprehensive survey on applications of transformers for deep learning tasks. Expert Systems with Application 2024, 241, 122666.1–122666.48. [CrossRef]
- Chen, D.; Zhang, L. SL-YOLO: A Stronger and Lighter Drone Target Detection Model 2024.
- Khalili, B.; Smyth, A.W. SOD-YOLOv8 – Enhancing YOLOv8 for Small Object Detection in Traffic Scenes 2024.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv e-prints 2018.
- Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. IEEE Computer Society 2017. [CrossRef]
- Liu, K.; Tang, H.; He, S.; Yu, Q.; Xiong, Y.; Wang, N. Performance Validation of Yolo Variants for Object Detection. In Proceedings of the BIC 2021: 2021 International Conference on Bioinformatics and Intelligent Computing, 2021.
- Wei, L.; Tong, Y. Enhanced-YOLOv8: A new small target detection model. Digital Signal Processing 2024, 153, 104611. [CrossRef]
- Xu, W.; Cui, C.; Ji, Y.; Li, X.; Li, S. YOLOv8-MPEB small target detection algorithm based on UAV images. Heliyon 2024, 10, 18. [CrossRef]
- Ding, X.; Zhang, Y.; Ge, Y.; Zhao, S.; Song, L.; Yue, X.; Shan, Y. UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition. IEEE 2023.
- Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. arXiv 2019.
- Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation 2020. [CrossRef]
- Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression 2021. [CrossRef]
- Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection 2024. [CrossRef]
- Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, 44, 7380–7399. [CrossRef]
- Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE 2021. [CrossRef]
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows 2021.
- Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6568–6577. [CrossRef]
- Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression 2023.
- Tahir, N.U.A.; Long, Z.; Zhang, Z.; Asim, M.; Elaffendi, M. PVswin-YOLOv8s: UAV-Based Pedestrian and Vehicle Detection for Traffic Management in Smart Cities Using Improved YOLOv8. Drones (2504-446X) 2024, 8. [CrossRef]
- Wang, Y.; Pan, F.; Li, Z.; Xin, X.; Li, W. CoT-YOLOv8: Improved YOLOv8 for Aerial images Small Target Detection. In Proceedings of the 2023 China Automation Congress (CAC), 2023, pp. 4943–4948. [CrossRef]
- Zhang, H.; Li, G.; Wan, D.; Wang, Z.; Dong, J.; Lin, S.; Deng, L.; Liu, H. DS-YOLO: A dense small object detection algorithm based on inverted bottleneck and multi-scale fusion network. Microelectronics Journal 2024, 4.
| Loss Function | Precision/% | Recall/% | mAP@0.5/% | mAP@0.5:0.95/% |
|---|---|---|---|---|
| CIoU | 51.8 | 39.4 | 40.6 | 24.3 |
| DIoU | 52 | 38.9 | 40.6 | 24.5 |
| EIoU | 49.8 | 39.4 | 40 | 24.3 |
| MPIoU[35] | 52.1 | 39 | 40.7 | 23.9 |
| SDBIoU(d=0.5) | 51.1 | 39.5 | 40.4 | 24.3 |
| SDBIoU(d=0.7) | 51.2 | 39.3 | 40.2 | 24.1 |
| SDBIoU(d=0.3) | 51.6 | 40.1 | 40.8 | 24.5 |
| Models | Precision/% | Recall/% | mAP@0.5/% | mAP@0.5:0.95/% | Time/ms | Parameters/M |
|---|---|---|---|---|---|---|
| YOLOv3 | 53.8 | 43.1 | 42.2 | 23.2 | 210 | 18.4 |
| YOLOv5s | 46.7 | 34.9 | 34.5 | 19.4 | 14.1 | 12.0 |
| YOLOv7 | 51.6 | 42.3 | 40.2 | 21.9 | 73.3 | 1.7 |
| YOLOv8n | 45.9 | 34.2 | 34.5 | 19.8 | 5.7 | 3.1 |
| YOLOv8s | 51.8 | 39.4 | 40.6 | 24.3 | 7.1 | 11.1 |
| YOLOv8m | 55.8 | 42.6 | 44.5 | 26.6 | 16.8 | 25.9 |
| YOLOv10n | 45.5 | 33.5 | 34 | 19.8 | 8 | 2.7 |
| YOLOv10s | 51 | 39.4 | 40.8 | 24.6 | 7.6 | 8.1 |
| PVswin-YOLO[36] | 54.5 | 41.8 | 43.3 | 26.4 | 8.8 | 10.1 |
| CoT-YOLO[37] | 53.2 | 41.1 | 42.7 | 25.7 | 12.2 | 10.6 |
| DS-YOLO[38] | 52.4 | 41.6 | 43.1 | 26.0 | 19.7 | 9.3 |
| DCS-YOLOv8 | 54.2 | 42.1 | 44.5 | 26.9 | 10.1 | 9.9 |
| Models | mAP@0.5/% | mAP@0.5:0.95/% |
|---|---|---|
| Faster R-CNN[11] | 36.6 | 21.1 |
| Swin Transformer[33] | 39.7 | 23.1 |
| CenterNet [34] | 39.2 | 22.7 |
| Cascade R-CNN [32] | 39.4 | 24.2 |
| DCS-YOLOv8 | 44.5 | 26.9 |
| Models | Pedestrian | People | Bicycle | Car | Van | Truck | Tricycle | Awning-tricycle | Bus | Motor | mAP@0.5/% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv8s | 44.2 | 34.3 | 13.9 | 80 | 45.5 | 40.2 | 28.5 | 16.6 | 57.8 | 44.8 | 40.6 |
| YOLOv8s-SDBIoU | 44.1 | 34.0 | 14.4 | 79.6 | 45.8 | 38.4 | 29.7 | 15.8 | 60.9 | 44.9 | 40.8 |
| YOLOv8s-SDBIoU-P2 | 50 | 40.7 | 16.6 | 83.3 | 46.7 | 39.7 | 29.1 | 16 | 60.5 | 50.6 | 43.3 |
| YOLOv8s-SDBIoU-P2-DCAM | 51 | 40.8 | 16.9 | 83.8 | 47.5 | 39.3 | 32.3 | 17 | 59.4 | 50.5 | 43.9 |
| YOLOv8s-SDBIoU-P2-DCS-Net | 51.5 | 42.5 | 17.2 | 83.8 | 48.5 | 39.4 | 33.3 | 18.1 | 59.8 | 50.6 | 44.5 |
| Baseline | SDBIoU | P2 | DCAM | SCDown | Precision/% | Recall/% | mAP@0.5/% | mAP@0.5:0.95/% | Time/ms | Parameters/M |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | | 51.8 | 39.4 | 40.6 | 24.3 | 7.1 | 11.1 |
| ✓ | ✓ | | | | 51.6 | 40.1 | 40.8 | 24.5 | 5.5 | 11.1 |
| ✓ | ✓ | ✓ | | | 53.9 | 41.2 | 43.3 | 26.1 | 6.6 | 10.6 |
| ✓ | ✓ | ✓ | ✓ | | 54.8 | 41.8 | 43.9 | 26.5 | 9.7 | 11.3 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 54.2 | 42.1 | 44.5 | 26.9 | 10.1 | 9.9 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).