Preprint
Article


RoadNet: A High-Precision Transformer-CNN Framework for Road Defect Detection via UAV-Based Visual Perception


Submitted: 12 September 2025
Posted: 15 September 2025


Abstract
Automated road defect detection using Unmanned Aerial Vehicles (UAVs) has emerged as an efficient and safe solution for large-scale infrastructure inspection. However, object detection in aerial imagery poses unique challenges, including the prevalence of extremely small targets, complex backgrounds, and significant scale variations. Mainstream deep learning-based detection models often struggle with these issues, exhibiting limitations in detecting small cracks, high computational demands, and insufficient generalization ability for UAV perspectives. To address these challenges, this paper proposes a novel comprehensive network, RoadNet, specifically designed for high-precision road defect detection in UAV-captured imagery. RoadNet innovatively integrates Transformer modules with a convolutional neural network backbone and detection head. This design not only significantly enhances the global feature modeling capability crucial for understanding complex aerial contexts but also maintains the computational efficiency necessary for potential real-time applications. The model was trained and evaluated on a self-collected UAV road defect dataset (UAV-RDD). In comparative experiments, RoadNet achieved an outstanding mAP@0.5 score of 0.9128 while maintaining a fast processing speed of 210.1 ms per image, outperforming other state-of-the-art models. The experimental results demonstrate that RoadNet possesses superior detection performance for road defects in complex aerial scenarios captured by drones.

1. Introduction

The rapid advancement of Unmanned Aerial Vehicle (UAV) technology has revolutionized the field of infrastructure inspection, offering a safe, efficient, and cost-effective alternative to traditional manual or vehicle-based surveys [1]. Utilizing drones for road defect detection allows for the rapid collection of high-resolution imagery over large and potentially hazardous areas, minimizing traffic disruption and inspection risks. However, the vast amount of aerial image data generated by UAVs necessitates the development of robust automated analysis systems, making object detection a core technology in drone-based inspection pipelines [2].
Road defects such as potholes, cracks, and surface deterioration are inevitable due to factors like aging infrastructure, climate change, and heavy traffic loads. These defects not only pose significant threats to transportation safety but also lead to accelerated road degradation and substantially increased long-term maintenance costs if not addressed promptly [3]. Therefore, the timely and accurate identification of these problems from UAV imagery is of paramount importance for modern urban management and smart maintenance strategies.
While deep learning models, particularly Convolutional Neural Networks (CNNs), have achieved remarkable success in various object detection tasks [4], their application to UAV-captured road imagery presents unique and formidable challenges. Firstly, the bird's-eye perspective of drones means that targets like fine cracks and small potholes occupy an extremely small proportion of pixels in the image, making them notoriously difficult to detect against complex backgrounds like asphalt texture, shadows, and occlusions [5]. Secondly, the scale of objects can vary dramatically within and across images due to changes in flight altitude and camera angle, demanding a detection model with superior multi-scale feature representation capabilities. Furthermore, deep models often require substantial computational resources, posing a challenge for real-time or on-device processing scenarios which are highly desirable for UAV applications [6]. Finally, models trained on data from one specific environment often suffer from performance degradation when deployed under different conditions (e.g., varying lighting, road types, seasons), highlighting a critical need for improved generalization ability.
In response to the aforementioned challenges specific to aerial imagery, this paper proposes a novel high-precision road defect detection framework named RoadNet. This deep learning model integrates the global contextual modeling strengths of Transformer architectures with the local feature extraction efficiency of convolutional neural networks, and is specifically designed to address the shortcomings of existing technologies in handling UAV-based inspection data. Experiments on a dedicated UAV road defect dataset verify that RoadNet attains a detection accuracy of over 90% and achieves significant improvements in both precision and recall compared with existing state-of-the-art models. The main contributions of this paper are summarized as follows:
  • To address the critical challenge of capturing long-range dependencies and complex contextual information in expansive aerial imagery, a Transformer module is incorporated into the network. This enhancement enables the model to more accurately identify and delineate defects ranging from minute cracks to large-area potholes, which is a common scenario in UAV inspections.
  • To overcome the limitations of handling extreme scale variations inherent in UAV perspectives, a multi-level feature pyramid network is employed to effectively fuse features across different scales. Coupled with an optimized detection head structure, this ensures robust performance on targets of various sizes while maintaining computational efficiency for detecting small targets.
  • A Spatial-Channel Interaction Module (SCIM) is designed, building upon the Transformer and feature pyramid network. This module facilitates simultaneous capture of global and local features by jointly modeling spatial and channel information, significantly enhancing the feature representation power for complex aerial scenes.

2. Related Works

2.1. Traditional Road Inspection Methods

Traditional road defect detection primarily relies on mechanical and physical measurement approaches, including laser scanning [7], ground penetrating radar (GPR) [8], and high-resolution photogrammetry [9]. For instance, ground penetrating radar utilizes electromagnetic waves to probe the internal structure of roadways, effectively identifying subsurface anomalies such as hidden voids and propagating cracks. Despite their accuracy, these methods suffer from high operational complexity and substantial costs, which severely restrict their large-scale deployment. Photogrammetry techniques, which employ high-resolution cameras to capture road surface images for subsequent analysis, offer a more cost-effective alternative. However, their performance is highly susceptible to environmental variations such as lighting changes and obstructions, leading to unstable detection outcomes. While these conventional approaches provide reliable measurements under controlled conditions, their limitations in cost, efficiency, and operational flexibility render them inadequate for modern large-scale infrastructure inspection requirements, especially from aerial platforms.

2.2. Deep Learning-Based Object Detection

The remarkable progress in artificial intelligence has established deep learning as a powerful paradigm for image processing and object detection tasks. Current deep learning-based detection methodologies can be broadly categorized into two architectural families.
The first category encompasses convolutional neural network (CNN) based approaches, which have demonstrated exceptional performance in various visual tasks including image classification, object detection, and semantic segmentation. In the domain of road defect analysis, popular architectures such as Faster R-CNN [10], YOLO [11], and SSD [12] have been extensively adopted. These models leverage convolutional operations to extract discriminative local features, enabling precise identification of road anomalies. Faster R-CNN achieves accurate defect localization through its region proposal network followed by classification and regression operations [10]. The YOLO framework, renowned for its inference efficiency, has been successfully applied in real-time road inspection systems [13]. SSD incorporates multi-scale feature fusion mechanisms to handle objects of varying dimensions, demonstrating robust performance across diverse defect sizes [12]. However, when deployed for UAV aerial imagery analysis, these CNN-based architectures face significant limitations. The characteristically small size of targets in aerial perspectives (where fine cracks may occupy merely several pixels), substantial scale variations due to altitude changes, and complex background clutter frequently cause performance deterioration, as these models lack effective global contextual modeling and efficient multi-scale representation capabilities [14,15].
The second category involves Transformer-based detection frameworks, which have recently gained prominence due to their exceptional global modeling capacities demonstrated in both natural language processing and computer vision domains. Representative models such as DETR [16] and SWIN Transformer [17] have shown remarkable capabilities in capturing long-range dependencies among image features. DETR revolutionizes object detection by implementing an end-to-end framework through self-attention mechanisms, eliminating the need for hand-designed components like region proposal networks [16]. SWIN Transformer introduces a hierarchical architecture with shifted window attention, demonstrating superior performance in processing high-resolution imagery [17]. These characteristics make Transformer-based models particularly suitable for aerial image analysis where comprehensive contextual understanding is essential for distinguishing true defects from complex background patterns. Nevertheless, their substantial computational requirements present considerable challenges for real-time deployment on resource-constrained UAV platforms or for processing large-scale aerial survey datasets [18].

2.3. UAV-Based Visual Inspection

UAV-based visual inspection has emerged as a rapidly evolving research domain, driven by the operational advantages of drone platforms for infrastructure monitoring [19]. Several investigations have explored the adaptation of existing deep learning architectures for road defect detection from aerial perspectives. For example, [20] implemented an optimized YOLOv5 model on a UAV platform for automated road crack detection, achieving a balance between accuracy and computational efficiency. Similarly, [21] proposed an attention-guided feature fusion network to enhance the detection of small cracks in UAV-captured images. Furthermore, [22] developed a large-scale benchmark dataset for UAV-based road damage detection and provided a comprehensive evaluation of state-of-the-art models, highlighting the critical challenge of scale variation. More recently, [23,24] explored the application of a lightweight Vision Transformer for real-time road inspection from drones, demonstrating its strong global feature extraction capability while addressing its computational demands. These collective efforts underscore the distinctive challenges of UAV-based road inspection—particularly in small object detection, computational efficiency, and model generalization—which demand specialized solutions beyond mere adaptation of existing ground-based models. Our proposed RoadNet framework is designed to address these specific challenges through a novel integration of convolutional and transformer architectures.

3. Method

Object detection in UAV-captured aerial imagery presents distinct challenges that demand specialized network architectures. The bird's-eye perspective and variable flight altitudes lead to extreme scale variations, where targets like road cracks may occupy only a few pixels while large potholes span significant areas. Additionally, complex background elements such as asphalt texture, shadows, and occlusions further complicate accurate detection. Traditional convolutional neural networks (CNNs) typically achieve multi-scale representation through hierarchical feature learning, progressively capturing fine-grained to coarse-grained features across network depths. However, this approach shows inherent limitations in capturing global dependencies and detailed information simultaneously, particularly problematic for UAV-based road defect detection where both minute cracks and extensive depressions must be detected within the same framework.
To address these specific challenges of aerial imagery analysis, our model incorporates a Transformer-based module to enhance the backbone network's multi-scale feature modeling capability. The Transformer's proven success in natural language processing demonstrates its powerful global context modeling ability through long-range dependency capture, a characteristic that translates exceptionally well to visual tasks involving complex aerial scenes. Specifically, we integrate a multi-head self-attention mechanism (MHSA) into critical layers of the backbone network, working synergistically with traditional convolutional units to efficiently combine local feature extraction with global contextual understanding. This integration is particularly valuable for UAV imagery, where understanding global context is essential for distinguishing true defects from similar-looking background patterns. The complete structure of our proposed RoadNet is illustrated in Figure 1.

3.1. Multi-Scale Feature Representation and Optimization Strategies for Aerial Imagery

The multi-scale convolution operation and feature fusion module specifically designed for aerial imagery analysis are shown in Figure 2. This design addresses the critical challenge of extreme scale variations in UAV-captured images, where road defects can range from few-pixel cracks to large-area potholes depending on flight altitude and camera perspective.
The Transformer module significantly enhances the model's ability to capture global dependencies through the self-attention mechanism, which is particularly crucial for analyzing expansive aerial views where defects may be distributed across wide areas. The success of this mechanism in natural language processing demonstrates its capability to effectively handle long-range dependencies, a valuable characteristic for processing high-resolution UAV imagery that requires understanding contextual relationships across the entire frame. Its core calculation method is as follows:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
Here, Q = XW_Q, K = XW_K, and V = XW_V are linear transformations of the input feature X; W_Q, W_K, and W_V are learned parameter matrices; and d_k is the feature dimension, whose square root is used as a scaling factor to prevent the values from becoming too large.
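As a point of reference, the following PyTorch snippet sketches the scaled dot-product attention defined above for a single head; the tensor shapes and function name are illustrative and not taken from the RoadNet implementation.
```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(x, w_q, w_k, w_v):
    """Illustrative single-head attention; x has shape (batch, tokens, dim)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # linear projections Q, K, V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # QK^T / sqrt(d_k)
    weights = F.softmax(scores, dim=-1)            # attention weights
    return weights @ v                             # weighted sum of values
```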
In the actual implementation, the multi-head self-attention mechanism processes different feature subspaces from multiple perspectives by parallelizing multiple independent attention heads. This parallel processing capability is essential for handling the diverse visual patterns found in aerial road imagery. The final output is the concatenation of the attention results from each head.
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)W^{O}$
Here, W^O represents the output linear transformation matrix. By introducing this mechanism at key levels of the backbone network, we achieve joint global and local modeling of multi-scale features. This design is particularly suitable for complex road detection tasks, enabling the capture of large-scale defects while not neglecting fine cracks and edge information.
In the main network, this self-attention mechanism is integrated into specific key layers to simultaneously focus on local and global information during multi-scale feature extraction. For example, for a certain intermediate feature map X, the convolution operation extracts its local features, while the Transformer module captures the global context through multi-head self-attention. This dual approach significantly enhances the expressiveness of the features.
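The paper does not specify exactly how the attention module is wired into the convolutional backbone, so the following PyTorch sketch should be read as one plausible realization: the feature map is flattened into a token sequence for multi-head self-attention and then fused back with the convolutional branch through a residual connection. The module name, normalization, and activation choices are assumptions.
```python
import torch
import torch.nn as nn

class ConvMHSABlock(nn.Module):
    """Hypothetical hybrid block: local conv features + global multi-head self-attention."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                                    # x: (B, C, H, W)
        local = self.conv(x)                                 # local feature extraction
        b, c, h, w = local.shape
        tokens = local.flatten(2).transpose(1, 2)            # (B, H*W, C) token sequence
        attn_out, _ = self.attn(tokens, tokens, tokens)      # global context via MHSA
        tokens = self.norm(tokens + attn_out)                # residual + layer norm
        global_feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return x + global_feat                               # fuse with the input feature map
```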
To fully utilize features of different resolutions, this paper adopts a bottom-up multi-scale feature fusion strategy in the backbone network. The low-level features generated by the convolution operations typically contain rich edge and texture information, while the high-level features capture semantic information. This paper designs a Cascaded Attention Mechanism (CAM) that combines the global modeling ability of the Transformer with the local perception ability of the convolution, achieving efficient context-information integration through a multi-scale feature pyramid.
The key to the cascaded attention mechanism lies in introducing adaptive weights at each feature level to automatically adjust the importance of different feature scales. Specifically, assuming that at a certain scale s the feature map is F_s, the fused feature representation is:
$F'_s = \sigma(W_1 \cdot F_s) + \phi(W_2 \cdot F_{s+1})$
Here, σ and ϕ represent non-linear activation functions; W_1 and W_2 are learnable parameters used to adjust the weights of the different feature scales; and F_{s+1} denotes the features at the next higher level.
This fusion strategy not only enhances the network's adaptability in multi-scale scenarios, but also further optimizes the network's detection performance for fine targets (such as road cracks).
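A minimal sketch of the scale-s fusion step is given below, assuming W_1 and W_2 are implemented as 1×1 convolutions, σ and ϕ as SiLU activations, and the coarser feature F_{s+1} is upsampled to match the spatial resolution of F_s; none of these implementation details are specified in the paper.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedFusion(nn.Module):
    """Sketch of the scale-s fusion: F'_s = sigma(W1 * F_s) + phi(W2 * up(F_{s+1}))."""
    def __init__(self, ch_s, ch_s1):
        super().__init__()
        self.w1 = nn.Conv2d(ch_s, ch_s, 1)     # learnable weight for the current scale
        self.w2 = nn.Conv2d(ch_s1, ch_s, 1)    # learnable weight for the coarser scale
        self.sigma = nn.SiLU()                 # assumed non-linearities
        self.phi = nn.SiLU()

    def forward(self, f_s, f_s1):
        f_s1 = F.interpolate(f_s1, size=f_s.shape[-2:], mode="nearest")  # match resolution
        return self.sigma(self.w1(f_s)) + self.phi(self.w2(f_s1))
```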

3.2. Spatial and Channel Interaction Optimization

In traditional convolutional networks, features in the spatial dimension and the channel dimension are typically processed separately. This separation limits the network's ability to exploit the expressive power of high-dimensional features. Specifically, the spatial dimension contains the geometric shape and position information of objects, while the channel dimension corresponds to the category and semantic information of the features. However, modeling these two types of information independently can easily lead to insufficient information interaction, especially in complex scenarios, which may result in an improper balance between global information and detailed information.
To solve this problem, we design a Spatial-Channel Interaction Module (SCIM). This module simultaneously captures the interdependence of the spatial and channel dimensions, achieving efficient joint feature modeling. The core idea of SCIM is to dynamically adjust the interaction strength of information from different dimensions through an adaptive weight mechanism, thereby generating richer and more expressive features.
Spatial dimension attention is generated through the global pooling operation of the feature map and combined with convolution processing to obtain attention weights, which are used to highlight the features at specific positions. The specific formula is as follows.
$\mathrm{Attention}_{\mathrm{spatial}}(F) = \sigma\left(\mathrm{Conv2D}\left(\mathrm{Concat}\left[\mathrm{AvgPool}(F), \mathrm{MaxPool}(F)\right]\right)\right)$
Here, AvgPool and MaxPool represent global average pooling and global maximum pooling respectively; Concat indicates the feature concatenation operation; σ denotes the Sigmoid activation function. This dual-pooling strategy enhances the model's ability to focus on spatially significant regions in aerial imagery, where defects may be sparse and distributed unevenly.
Channel-dimension attention generates channel weights from a global descriptor of the feature map, which are used to enhance the expressiveness of semantic information.
$\mathrm{Attention}_{\mathrm{channel}}(F) = \sigma\left(W_2 \cdot \mathrm{ReLU}\left(W_1 \cdot \mathrm{AvgPool}(F)\right)\right)$
Here, W_1 and W_2 represent the weight matrices of the fully connected layers, and ReLU denotes the rectified linear activation function.
Based on the outputs of these two attention modules, the fused feature representation is calculated, and further high-dimensional features are extracted through convolution operations.
$F_{\mathrm{SCIM}} = \mathrm{Conv2D}\left(\mathrm{Attention}_{\mathrm{spatial}}(F) + \mathrm{Attention}_{\mathrm{channel}}(F)\right)$
This joint modeling approach effectively integrates local detail information with global contextual understanding, enabling the model to perform significantly better in multi-scale feature extraction and adaptability to complex aerial scenarios. The SCIM module proves particularly valuable for UAV-based inspection tasks, where the model must simultaneously maintain sensitivity to small spatial details while understanding global scene context to reduce false positives from background clutter.
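The sketch below follows the three SCIM equations above, with two common assumptions not stated in the paper: the spatial pooling is taken along the channel axis (so the spatial map keeps its H×W resolution), and each attention map is applied multiplicatively to the input feature F before the final convolution. The 7×7 kernel and the channel-reduction ratio are likewise illustrative.
```python
import torch
import torch.nn as nn

class SCIM(nn.Module):
    """Sketch of the Spatial-Channel Interaction Module following the equations above."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, f):                                     # f: (B, C, H, W)
        # Spatial attention: sigmoid(Conv2D(Concat[AvgPool(F), MaxPool(F)]))
        avg_map = f.mean(dim=1, keepdim=True)
        max_map = f.max(dim=1, keepdim=True).values
        a_spatial = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        # Channel attention: sigmoid(W2 * ReLU(W1 * AvgPool(F)))
        pooled = f.mean(dim=(2, 3))                           # global average pooling
        a_channel = torch.sigmoid(self.fc2(torch.relu(self.fc1(pooled))))[:, :, None, None]
        # F_SCIM: convolution over the spatially and channel-attended features (assumed)
        return self.out_conv(f * a_spatial + f * a_channel)
```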

4. Experiments and Results

4.1. UAV Image Dataset and Experimental Setup

To verify the performance of RoadNet in UAV-based road defect detection, the model was trained and evaluated on a dedicated Unmanned Aerial Vehicle Road Defect (UAV-RDD) dataset, which was specifically collected for this study. All imagery was captured using a DJI Mavic 3 Enterprise drone equipped with a 4/3 CMOS 20MP camera, flown at altitudes between 30 and 50 meters to simulate real-world inspection scenarios. The dataset covers two main types of road defects critical for infrastructure assessment: cracks and potholes, featuring diverse urban, highway, and rural scenarios under various lighting and weather conditions to ensure robustness.
The UAV-RDD dataset comprises 5,842 high-resolution aerial images (3840×2160 pixels), each meticulously annotated by domain experts using the LabelImg tool. The annotation process followed the standard PASCAL VOC format, with bounding boxes for both crack and pothole defects. To address the class imbalance and extreme scale variations inherent in aerial imagery, we implemented a stratified sampling strategy during dataset splitting. The dataset was divided into 70% for training (4,089 images), 15% for validation (876 images), and 15% for testing (877 images), ensuring proportional representation of defect types and environmental conditions across all subsets.
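A stratified 70/15/15 split of this kind could be produced, for example, with scikit-learn; the snippet below is purely illustrative, and image_ids and strata (a per-image key combining defect type and scene condition) are assumed placeholders.
```python
from sklearn.model_selection import train_test_split

# Illustrative stratified 70/15/15 split; `image_ids` and `strata` are assumed to exist.
train_ids, rest_ids, train_s, rest_s = train_test_split(
    image_ids, strata, test_size=0.30, stratify=strata, random_state=42)
val_ids, test_ids = train_test_split(
    rest_ids, test_size=0.50, stratify=rest_s, random_state=42)
```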
Recognizing the challenges of small object detection in aerial imagery, we employed an extensive data augmentation pipeline tailored to UAV characteristics. This included not only standard techniques such as random flipping (horizontal and vertical), cropping (±20%), and color jittering (brightness, contrast, saturation adjustments of ±30%), but also altitude simulation via random scaling (0.5x to 1.5x) to mimic varying flight heights, and affine transformations to simulate different drone pitch and yaw angles. Additionally, we added Gaussian noise to improve model resilience to transmission artifacts and sensor noise common in real-world drone operations.
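One way to realize such a pipeline is with the albumentations library, as sketched below; the specific probabilities and parameter values are illustrative choices, not those used in the paper.
```python
import albumentations as A

# A possible augmentation pipeline approximating the one described above (assumed parameters).
augment = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomScale(scale_limit=0.5, p=0.5),               # ~0.5x-1.5x, mimics altitude change
        A.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, p=0.5),
        A.Affine(rotate=(-10, 10), shear=(-5, 5), p=0.5),     # simulate drone pitch/yaw
        A.GaussNoise(p=0.3),                                  # sensor/transmission noise
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
```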
All experiments were conducted on a uniform hardware platform equipped with an NVIDIA RTX 4060 GPU and an Intel i9-13900K CPU to ensure fair comparison. The models were implemented in PyTorch 1.12 and trained with identical hyperparameters: input image size of 640×640, batch size of 16, AdamW optimizer with an initial learning rate of 0.001, and a cosine annealing scheduler over 300 epochs. All models were trained on the GPU platform. However, inference speed was evaluated on the CPU to simulate a more realistic deployment scenario on edge devices or ground control stations.
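The reported training configuration corresponds roughly to the following PyTorch skeleton; model and train_loader are placeholders, and the loss interface is assumed.
```python
import torch

# Sketch of the stated training setup: AdamW, lr 1e-3, cosine annealing over 300 epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    for images, targets in train_loader:      # batches of 16 images resized to 640x640
        loss = model(images, targets)         # assumed to return the training loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```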

4.2. Ablation Study on Component Effectiveness

To rigorously evaluate the contribution of each proposed module within the context of aerial imagery analysis, we conducted a series of ablation experiments. The results, presented in Table 1, demonstrate the incremental performance gain achieved by integrating Transformer-based global context modeling and the Spatial-Channel Interaction Module (SCIM) into our baseline CNN architecture.
The baseline model (a CSPDarknet backbone with PANet neck and YOLO head) achieved an mAP@0.5 of 0.7357, highlighting the inherent difficulty of detecting small, low-contrast defects in aerial imagery. The incorporation of the Transformer module led to a substantial improvement (+10.9% mAP), underscoring the critical importance of global contextual information for distinguishing defects from complex aerial backgrounds like tar streaks, shadows, and water stains. The full RoadNet model, with both Transformer and SCIM, achieved the best performance (0.9128 mAP), validating the design of SCIM for effective spatial-channel feature interaction, which is paramount for precise localization and classification of varied defect sizes in UAV images. The precision and recall rates also showed significant gains, indicating a superior balance between reducing false positives and minimizing missed detections.
Furthermore, the ablation study confirms that our architectural innovations in RoadNet directly address the core challenges of UAV-based detection. Compared to the baseline, the final model achieved a 37.55% improvement in accuracy and a 31.17% increase in recall rate, resulting in a significant leap in overall performance. Specifically, the trend of key evaluation indicators during the training process is shown in Figure 3, while the comparison of detection results on the validation set is presented in Figure 4. These results fully demonstrate that our proposed architectural innovations play a decisive role in the enhancement of model performance for aerial imagery [17].

4.3. Comparative Evaluation with State-of-the-Art Models

A comprehensive comparative experiment was conducted to benchmark RoadNet against several state-of-the-art object detection models. All models were trained from scratch on our UAV-RDD dataset under the identical experimental setup described in Section 4.1 to ensure a fair and meaningful comparison. The evaluation metrics focus on both detection accuracy (mAP@0.5) and inference speed (ms per image on CPU), the latter being crucial for real-time UAV applications.
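For reference, per-image CPU latency of the kind reported below can be measured with a simple timing loop such as the following; the warm-up count, number of timed runs, and dummy input are assumptions.
```python
import time
import torch

# Illustrative per-image CPU latency measurement consistent with the protocol above.
model = model.eval().to("cpu")
dummy = torch.randn(1, 3, 640, 640)
with torch.no_grad():
    for _ in range(10):                       # warm-up iterations
        model(dummy)
    start = time.perf_counter()
    for _ in range(100):
        model(dummy)
    ms_per_image = (time.perf_counter() - start) / 100 * 1000
print(f"{ms_per_image:.1f} ms per image")
```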
The results in Table 2 demonstrate the superior performance of RoadNet. It achieves the highest detection accuracy (0.9128 mAP@0.5), outperforming the next best model (YOLOv8s) by a significant margin of 25.3%. This notable improvement can be attributed to RoadNet's enhanced capability in capturing global context and fine-grained details, which is essential for accurate defect detection in challenging aerial scenes.
Crucially, RoadNet also achieves the fastest inference speed (210.1 ms/image) among all compared models on a CPU platform, which is the most likely deployment scenario for edge computing on UAVs or ground control stations. It is 4.7% faster than YOLOv5s and 36.5% faster than YOLOv8s. This efficiency breakthrough is vital for practical UAV inspection, enabling near-real-time analysis during flight missions. The dramatic speed advantage over Faster R-CNN and DETR further highlights the computational efficiency of our architecture design.
The combination of state-of-the-art accuracy and leading inference speed makes RoadNet uniquely suitable for real-world UAV-based road inspection tasks, successfully balancing the often-contradictory goals of high precision and computational efficiency.

5. Conclusions

This paper has presented RoadNet, a novel deep learning framework specifically designed to address the critical challenges of road defect detection in Unmanned Aerial Vehicle (UAV) imagery. Confronted with the inherent difficulties of aerial-based inspection—including extreme scale variations, minuscule target sizes, complex background clutter, and the demand for computational efficiency—RoadNet integrates Transformer-based global context modeling with the local feature extraction strengths of convolutional neural networks. The incorporation of a dedicated Spatial-Channel Interaction Module (SCIM) further enhances multi-scale feature representation, enabling the model to precisely localize and classify diverse road defects, from fine cracks to extensive potholes, in complex aerial scenes.
Extensive experiments conducted on a dedicated UAV road defect dataset (UAV-RDD) demonstrate the effectiveness of our approach. Ablation studies confirm the significant individual and synergistic contributions of the Transformer module and SCIM, with the full RoadNet model achieving a remarkable mAP@0.5 of 0.9128. In comparative evaluations, RoadNet not only surpassed state-of-the-art models like YOLOv8s and Faster R-CNN in detection accuracy by a considerable margin but also achieved the fastest inference speed on a CPU platform (210.1 ms/image), underscoring its suitability for real-time UAV applications.
For future work, research will proceed along several promising directions:
Enhanced Generalization: We will incorporate more diversified UAV datasets captured under a wider range of conditions (e.g., severe weather, different times of day, and various geographic locations) to further improve the model's robustness and generalization capability.
Advanced Lightweighting: While efficient, we will explore more aggressive model compression and quantization techniques, including neural architecture search (NAS) for optimal backbone design, to deploy RoadNet on the limited computational resources of UAV embedded systems without significant accuracy loss.
Edge Deployment and Real-time System Integration: The ultimate goal is to deploy an optimized version of RoadNet on edge computing devices within UAVs or ground stations, facilitating a closed-loop system for real-time detection, analysis, and reporting, thereby providing stronger technical support for the development of intelligent and automated transportation infrastructure maintenance.

Author Contributions

Software, W.W. and T.T.; formal analysis, J.X.; investigation, J.Y.; resources, Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by Pingyang, Zhejiang Province of China (No. 250071494).

Data Availability Statement

Data available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhou, Y.; Yue, Y.; Yan, B.; et al. Collaborative Target Tracking Algorithm for Multi-Agent Based on MAPPO and BCTD. Drones, 2025, 9, 521.
  2. Yang, B.; Tao, T.; Wu, W.; et al. MultiDistiller: Efficient Multimodal 3D Detection via Knowledge Distillation for Drones and Autonomous Vehicles. Drones, 2025, 9, 322.
  3. Huang, Y.; Fan, J.Y.; Hu, J.Z.Y. TBi-YOLOv5: A surface defect detection model for crane wire with Bottleneck Transformer and small target detection layer. Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science, 2024, 238, 2425–2438.
  4. Su, Y.; Deng, J.; Sun, R.; et al. A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection. 2022, 26, 313–325.
  5. Wang, A.; Ren, C.; Zhao, S.M.S. Attention guided multi-level feature aggregation network for camouflaged object detection. Image and Vision Computing, 2024, 144, 1.
  6. Li, H.; Zhang, R.; Pan, Y.; et al. LR-FPN: Enhancing Remote Sensing Object Detection with Location Refined Feature Pyramid Network. IEEE, 2024, 1–8.
  7. Ha, T.T.; Chaisomphob, T. Automated Localization and Classification of Expressway Pole-Like Road Facilities from Mobile Laser Scanning Data. Advances in Civil Engineering, 2020, 2020, 5016783.1–5016783.18.
  8. Ma, Y.; Lei, W.; Pang, Z.; et al. Rebar Clutter Suppression and Road Defects Localization in GPR B-Scan Images Based on SuppRebar-GAN and EC-Yolov7 Networks. IEEE Transactions on Geoscience and Remote Sensing, 2024, 62, 1–14.
  9. Lv, Y.; Wang, G.; Hu, X. Machine Learning Based Road Detection from High Resolution Imagery. ISPRS International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2016, 41, 891–898.
  10. Zhang, H.; Shao, F.; Chu, W.; et al. Faster R-CNN based on frame difference and spatiotemporal context for vehicle detection. Signal, Image and Video Processing, 2024, 18, 7013–7027.
  11. Mohd Yusof, N.; Sophian, A.; Mohd Zaki, H.F.; et al. Assessing the performance of YOLOv5, YOLOv6, and YOLOv7 in road defect detection and classification: a comparative study. Bulletin of Electrical Engineering & Informatics, 2024, 13, 350.
  12. Zhang, B.; Fang, S.; Li, Z. Research on Surface Defect Detection of Rare-Earth Magnetic Materials Based on Improved SSD. Complexity, 2021, 2021, 1–10.
  13. Zhang, L.; Yan, S.F.; Hong, J.; et al. An improved defect recognition framework for casting based on DETR algorithm. Journal of Iron and Steel Research International, 2023, 30, 949–959.
  14. Zhu, W.; Zhang, H.; Zhang, C.; Zhu, X.; Guan, Z.; Jia, J. Surface defect detection and classification of steel using an efficient Swin Transformer. 2023, 57, 1572.
  15. Wu, Y.; Liao, K.Y.; Chen, J.; et al. D-former: a U-shaped Dilated Transformer for 3D medical image segmentation. Neural Computing and Applications, 2022, 35, 1931–1944.
  16. Wang, X.; Gao, H.; Jia, Z.; et al. A road defect detection algorithm incorporating partially transformer and multiple aggregate trail attention mechanisms. IOP Publishing Ltd, 2024, 36, 026003.
  17. Kim, G.I.; Yoo, H.; Cho, H.J.; et al. Defect Detection Model Using Time Series Data Augmentation and Transformation. Computers, Materials & Continua, 2024, 78, 1713.
  18. Jiang, T.Y.; Liu, Z.Y.; Zhang, G.Z. YOLOv5s-road: Road surface defect detection under engineering environments based on CNN-transformer and adaptively spatial feature fusion. Measurement, 2025, 242, 115990.
  19. Wang, J.; Meng, R.; Huang, Y.; et al. Road defect detection based on improved YOLOv8s model. Scientific Reports, 2024, 14, 1.
  20. Fang, Z.; Shi, Z.; Wang, X.; et al. Roadbed Defect Detection from Ground Penetrating Radar B-scan Data Using Faster RCNN. 2020, 131–137.
  21. Sadhin, A.H.; Mohd Hashim, S.Z.; Samma, H.; et al. YOLO: A Competitive Analysis of Modern Object Detection Algorithms for Road Defects Detection Using Drone Images. Baghdad Science Journal, 2024, 21, 2167.
  22. Arya, D.; Maeda, H.; Ghosh, S.K.; et al. Global Road Damage Detection: A Large-Scale Dataset and Benchmark. IEEE Transactions on Intelligent Transportation Systems, 2022, 23, 23294–23305.
  23. Kim, G.I.; Yoo, H.; Cho, H.J. Lightweight vision transformer for real-time road damage detection in UAV imagery. IEEE Geoscience and Remote Sensing Letters, 2023, 20, 6003205.
  24. Xiao, J.; Guo, H.; Zhou, J.; Zhao, T.; Yu, Q.; Chen, Y.; Wang, Z. Tiny object detection with context enhancement and feature purification. Expert Systems with Applications, 2023, 211, 118665.
Figure 1. Overall Structure of RoadNet.
Figure 2. Multi-scale Convolution Operations and Feature Fusion.
Figure 3. Trend of Key Evaluation Indicators.
Figure 4. Comparison of Labels and Predictions on the Validation Set.
Table 1. Ablation study of RoadNet components on the UAV-RDD test set.
Backbone | Transformer | SCIM | mAP@0.5 | Precision | Recall | Params (M)
CNN (Baseline) | – | – | 0.7357 | 0.7111 | 0.7172 | 7.2
CNN | ✓ | – | 0.8157 | 0.7904 | 0.7831 | 18.5
CNN | ✓ | ✓ | 0.9128 | 0.9040 | 0.8328 | 21.3
Table 2. Performance comparison of different models on the UAV-RDD test set.
Model | mAP@0.5 | Inference Time (ms) | Platform
RoadNet (Ours) | 0.9128 | 210.1 | CPU (Intel i9-13900K)
Faster R-CNN [20] | 0.4283 | 1221.3 | CPU (Intel i9-13900K)
YOLOv5s [18] | 0.7225 | 220.4 | CPU (Intel i9-13900K)
YOLOv5l [21] | 0.7234 | 993.8 | CPU (Intel i9-13900K)
YOLOv8s [19] | 0.7284 | 330.6 | CPU (Intel i9-13900K)
DETR [16] | 0.7012 | 1850.5 | CPU (Intel i9-13900K)