1. Introduction
Cracks serve as early indicators of structural damage in buildings, bridges, and roads, making their detection vital for structural health monitoring. Analyzing the morphological characteristics, positional information, and extent of internal damage in cracks allows for accurate safety assessments of buildings and infrastructures[1,2]. Timely detection and repair of cracks not only reduce maintenance costs but also prevent further structural deterioration, ensuring safety and durability[3,4].
Traditional crack detection methods, such as visual inspections and manual evaluations, are often costly and inefficient, relying heavily on the expertise of inspectors, which can lead to subjective and inconsistent results[5]. Therefore, the development of objective and efficient automated crack detection methods has become a significant trend in this field. Various sensor-based methods for automatic or semi-automatic crack detection, including crack meters, RGB-D sensors, and laser scanners, have been proposed[6,7,8]. Although these sensors are accurate, they are expensive and challenging to deploy on a large scale.
Advancements in computer vision technology have popularized image-based crack detection methods due to their long-distance, non-contact, and cost-effective nature. Traditional visual detection methods, such as morphological image processing[9,10], filtering[11,12], and percolation models[13], are simple to implement and computationally light but suffer from limited generalization performance. Environmental noise, such as debris around cracks, further complicates detection in practical engineering environments.
Recently, deep learning-based semantic segmentation algorithms have significantly improved the accuracy and stability of image recognition in noisy environments. These algorithms can precisely locate and label crack pixels, providing detailed information on crack distribution, width, length, and shape[14]. However, general scene understanding models often fail to capture the unique features of cracks, which are typically thin, long, and irregularly shaped[15,16]. Cracks can span entire images while occupying less than 5% of the width, necessitating models that effectively capture dependencies between distant pixels. While self-attention mechanisms excel at aggregating long-distance contextual information[17,18,19], they come with high computational costs, limiting detection speed. Additionally, cracks exhibit uneven distribution with significant size variations, necessitating multi-scale feature extraction[20,21,22]. Although methods like DeepLabV3+[23] and SegNext[24] capture multi-scale information, they are computationally intensive and costly for large images.
Current computer vision-based crack detection applications have seen a significant shift towards the utilization of unmanned aerial vehicles (UAVs) due to their flexibility, cost-effectiveness, and ability to cover large and hard-to-reach areas efficiently[25]. However, the computational resources available on UAV-based platforms or other edge devices are typically limited, and such devices often operate without GPUs, making it crucial to deploy lightweight models that can perform low-latency processing and analysis[26]. Researchers have proposed lightweight networks that reduce computational costs by minimizing deep down-sampling, reducing channel numbers, and optimizing convolutional design. However, reducing subsampling stages can lead to insufficient model receptive fields for covering large objects, as seen with ENet[27]. Bilateral backbone models partially address this issue. BiSeNet[28], for instance, adds a context path with fewer channels, and HrSegNet[29] maintains high-resolution features while adjusting parameters to reduce channels. Nevertheless, these two-branch structures increase computational overhead, and reducing channels can hinder the model’s ability to learn relational features effectively.
Furthermore, we analyze the challenges in designing lightweight surface crack segmentation models: 1) existing methods increase computational complexity with large kernel convolutions, multiple parallel branches, and feature pyramid structures to handle various object sizes and shapes; 2) diverse crack image scenes and complex backgrounds limit lightweight model feature extraction, making it difficult to learn effective information from limited datasets; 3) subtle differences between cracked and normal areas complicate segmentation. While adding multiple skip connections and auxiliary training branches can improve accuracy, they increase memory overhead.
To address the aforementioned challenges, we propose CrackScopeNet, a lightweight segmentation network optimized for structural surface cracks. CrackScopeNet features an optimized multi-scale branching architecture and a carefully designed strip-wise context attention (SWA) module.
Figure 1 compares classical semantic segmentation networks, lightweight semantic segmentation networks, and crack-specific segmentation networks with CrackScopeNet in terms of mean intersection over union (mIoU), floating point operations (FLOPs), and parameters on the CrackSeg9k dataset. Only models with mIoU scores above 80% and FLOPs below 100G are shown. CrackScopeNet clearly outperforms all models shown while using significantly fewer FLOPs and parameters than most of them. This stems from a design that captures both local context information around small cracks and long-range context information, so that complete cracks are identified and background interference is suppressed.
First, in the local feature extraction stage, we split the features along the channel dimension and apply three convolutions with different kernel sizes to obtain the local context information of cracks. Subsequently, we combine strip pooling with one-dimensional convolution to capture long-range context information without compressing channel features. Finally, we construct a lightweight multi-scale feature fusion module to aggregate shallow detail and deep semantic information. In these modules, we employ depthwise separable convolution, dropout, and residual connections to mitigate overfitting, vanishing gradients, and exploding gradients, resulting in a lightweight neural network well suited to crack detection.
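The combination of strip pooling and one-dimensional convolution can be illustrated with a toy, single-channel sketch. This is not the authors' SWA implementation: the function names, the fixed smoothing kernel (standing in for learned weights), and the sigmoid gating are assumptions chosen purely for illustration. The sketch only shows how pooling along rows and columns lets every pixel receive context from its whole row and column at a cost linear in the image side length.

```python
import math

def strip_pool(feature):
    """Average a 2D map along each row and each column (strip pooling)."""
    h, w = len(feature), len(feature[0])
    row_means = [sum(row) / w for row in feature]  # one value per row
    col_means = [sum(feature[i][j] for i in range(h)) / h for j in range(w)]  # one per column
    return row_means, col_means

def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; expects an odd-length kernel."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]

def strip_attention(feature, kernel=(0.25, 0.5, 0.25)):
    """Gate each pixel by a sigmoid of its row context plus its column context.

    Hypothetical toy gating: the fixed kernel replaces learned 1D conv weights.
    """
    row_means, col_means = strip_pool(feature)
    row_ctx = conv1d_same(row_means, kernel)
    col_ctx = conv1d_same(col_means, kernel)
    return [
        [x / (1.0 + math.exp(-(row_ctx[i] + col_ctx[j]))) for j, x in enumerate(row)]
        for i, row in enumerate(feature)
    ]

# A thin vertical "crack" occupying column 2 of a 5x5 map.
toy = [[1.0 if j == 2 else 0.0 for j in range(5)] for _ in range(5)]
out = strip_attention(toy)
```

In this toy input the column containing the crack produces the strongest column context, so pixels on the crack receive the largest gate values, while the gating never needs a pairwise comparison between all pixels.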
In summary, our main contributions are as follows.
(1) We propose a novel crack image segmentation model, called CrackScopeNet, designed to meet structural health monitoring requirements. The model effectively extracts information at multiple levels during the down-sampling stage and fuses key features during the up-sampling stage.
(2) We design a lightweight multi-scale branch module to capture local context information and a strip-wise context attention module to capture remote context information. These modules are designed to match the morphological characteristics of cracks, capturing rich context information while minimizing computational cost and alleviating the overfitting issues common in lightweight models with fewer parameters.
(3) CrackScopeNet demonstrates state-of-the-art performance on the CrackSeg9k dataset and exhibits excellent transferability to small datasets in specific scenarios. Additionally, CrackScopeNet has a low inference delay on resource-constrained drone platforms, making it ideal for outdoor crack detection through computer vision. This ensures that drone platforms can perform rapid crack detection and analysis, enhancing the efficiency and effectiveness of structural health monitoring.
4. Experiment
In this section, we first conduct a comprehensive quantitative comparison between CrackScopeNet and the most advanced segmentation models across various metrics, visualize the results, and analyze the detection performance. Subsequently, we explore the transfer learning capability of our model on crack datasets specific to other scenarios. Finally, we perform ablation studies to examine the significance and impact of each component within CrackScopeNet.
4.1. Comparative Experiments
The primary objective is to achieve an exceptional balance between the accuracy of crack region extraction and inference speed. Thus, we compare CrackScopeNet with three types of models: classical general semantic segmentation models, advanced lightweight semantic segmentation models, and the latest models designed explicitly for crack segmentation, totaling 13 models. Specifically, U-Net[51], PSPNet[37], SegNet[52], DeepLabV3+[23], SegFormer[19], and SegNext[24] are selected as six classical high-accuracy segmentation models. BiSeNet[28], BiSeNetV2[53], STDC[54], TopFormer[55], and SeaFormer[56] are chosen for their advantage in inference speed as lightweight semantic segmentation models. Notably, SegFormer, TopFormer, and SeaFormer are Transformer-based methods that have demonstrated outstanding performance on large datasets such as Cityscapes[57]. Additionally, we compare two specialized crack segmentation models, U2Crack[50] and HrSegNet[29], which have been optimized for the crack detection scenario based on general semantic segmentation models.
It is important to note that, to ensure the models can be easily converted to ONNX format and deployed on edge devices with limited computational resources and memory, we select lightweight backbones, MobileNetV2[58] and ResNet-18[59], for the DeepLabV3+ and BiSeNet models, respectively. For SegFormer and SegNext, we choose the lightweight versions SegFormer-B0[19] and SegNext_MSCAN_Tiny[24], which the authors propose for real-time semantic segmentation. For TopFormer and SeaFormer, we find during training that the tiny versions are difficult to converge, so we only utilize their base versions.
Quantitative Results. Table 4 presents the performance of each baseline network and the proposed CrackScopeNet on the CrackSeg9k dataset, with the best values highlighted in bold. Analyzing the accuracy of different types of segmentation networks in the table reveals that larger models generally achieve higher mIoU scores than lightweight models. Specifically, compared to classical high-accuracy models, the proposed CrackScopeNet achieves the best performance in terms of mIoU, recall, and F1 score. Although our model’s precision is 1.26% lower than that of U-Net, U-Net’s recall is 2.24% lower than ours, and our model has roughly 12 times fewer parameters and 48 times fewer FLOPs.
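For reference, the metrics in Table 4 follow the usual definitions based on pixel-level confusion counts. The sketch below assumes binary segmentation (crack versus background) and takes mIoU as the unweighted mean of the two class IoUs; the function name and the example pixel counts are illustrative and not taken from the paper.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 (crack class) and two-class mIoU from pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_crack = tp / (tp + fp + fn)
    iou_background = tn / (tn + fp + fn)
    miou = (iou_crack + iou_background) / 2
    return {"precision": precision, "recall": recall, "f1": f1, "miou": miou}

# Hypothetical pixel counts for one 400x400 image (160,000 pixels in total).
m = segmentation_metrics(tp=7000, fp=1000, fn=1000, tn=151000)

# Consistency check on Table 4: F1 is the harmonic mean of Pr and Re.
pr, rec = 89.34, 89.24  # CrackScopeNet row
f1_from_table = 2 * pr * rec / (pr + rec)
print(round(f1_from_table, 2))
```

Applied to the CrackScopeNet row, the harmonic mean of the reported precision and recall reproduces the reported F1 score after rounding to two decimals.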
In terms of network lightweightness, CrackScopeNet achieves the best balance between accuracy and model size on the CrackSeg9k dataset, as illustrated more intuitively in Figure 1. Our model achieves the highest mIoU with only 1.047M parameters and 1.58G FLOPs. CrackScopeNet’s FLOPs are slightly higher than those of TopFormer and SeaFormer but lower than those of all other small models. Notably, because the crack dataset is small, the learning capability of lightweight segmentation networks is limited, and mainstream lightweight segmentation models do not consider the unique characteristics of cracks, resulting in poor performance. The CrackScopeNet architecture, while maintaining superior segmentation performance, achieves the design goal of a lightweight network structure, making it easily deployable on resource-constrained edge devices.
Moreover, compared to the state-of-the-art crack image segmentation algorithms, the proposed method achieves an mIoU score of 82.15% with only 1.047M parameters and 1.58G FLOPs, surpassing the highest-accuracy versions of the U2Crack and HrSegNet models. Notably, the HrSegNet model employs an Online Hard Example Mining (OHEM) technique during training to improve accuracy. In contrast, we only use a cross-entropy loss function for model parameter updates without deliberately employing training tricks to enhance performance, showcasing the significant benefits of considering crack morphology in our model design.
Qualitative Results. Figure 4, Figure 5, and Figure 6 display the qualitative results of all models. CrackScopeNet achieves superior visual performance compared to other models. From the first, second, and third rows of Figure 4, it can be observed that CrackScopeNet and the segmentation algorithms with larger parameter counts achieve satisfactory results for high-resolution images with apparent crack features. In the fourth row, where the original image contains asphalt with color and texture similar to cracks, CrackScopeNet and SegFormer successfully overcome such background noise interference. This is attributed to their ability to model long-range contextual dependencies, effectively capturing relationships between cracks. In the fifth row, CrackScopeNet exhibits robust performance even under uneven illumination conditions. This can be attributed to the design of CrackScopeNet, which considers both local and global features of cracks, effectively suppressing other noise. Figure 5 clearly shows that lightweight networks struggle to eliminate background noise interference and produce fragmented segmentation results for fine cracks. This outcome is due to the limited parameters learned by lightweight models. Finally, Figure 6 presents the visualization results of the most advanced crack segmentation models. U2Crack[50], based on the ViT[17] architecture, achieves a broader receptive field, somewhat alleviating background noise but at the cost of significant computational overhead. HrSegNet[29] maintains a high-resolution branch to capture rich, detailed features. As seen in the last two columns of Figure 6, with increased channels in the HrSegNet network, more detailed information is extracted, but this also leads to misclassifying background information as cracks, explaining why HrSegNet’s precision score is high while the recall score is low. In summary, CrackScopeNet outperforms other segmentation models with lower parameters and FLOPs by demonstrating excellent crack detection performance under various noise conditions.
Inference on Navio2-based Drones.
In practical applications, there remains a substantial gap for real-time semantic segmentation algorithms designed and validated for mobile and edge devices, which face challenges such as limited memory resources and low computational efficiency. To better simulate edge devices used for outdoor structural health monitoring, we explore the inference speed of models without GPU acceleration. We convert the models to ONNX format and test the inference speed on Navio2-based drones equipped with a representative Raspberry Pi 4B, focusing on models with tiny FLOPs and parameter counts: BiSeNetV2, DeepLabV3+, STDC, HrSegNetB48, SegFormer, TopFormer, SeaFormer, and our proposed model. The test settings are input image size of 3×400×400, batch size of 1, and 2000 testing epochs. To ensure fair comparisons, we do not optimize or prune any models during deployment, meaning the actual inference delay in practical applications could be further reduced based on these test results.
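A latency measurement of this kind can be sketched with a generic timing harness. Here `predict` is a hypothetical callable standing in for a single forward pass (in a deployment like the one described, it would wrap the exported ONNX model's inference call), and the warm-up count and the dummy workload are assumptions made only so the sketch runs on its own.

```python
import statistics
import time

def benchmark(predict, sample, runs=2000, warmup=10):
    """Time `predict(sample)` over `runs` calls after `warmup` untimed calls.

    Returns mean and median latency in milliseconds.
    """
    for _ in range(warmup):
        predict(sample)
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.fmean(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
        "runs": len(latencies_ms),
    }

def dummy_predict(image):
    """Stand-in workload; replace with the real model call."""
    return sum(image)

# Batch size 1; a flat list stands in for a 3x400x400 input tensor.
dummy_input = [0.0] * (3 * 400 * 400)
stats = benchmark(dummy_predict, dummy_input, runs=20, warmup=2)
print(stats)
```

Reporting the median alongside the mean is a design choice that makes the result less sensitive to occasional slow runs caused by background activity on a small single-board computer.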
As shown in Figure 7, the test results indicate that when running on a highly resource-constrained drone platform, the proposed CrackScopeNet architecture achieves faster inference speed compared to other real-time or lightweight semantic segmentation networks based on convolutional neural networks, such as BiSeNet, BiSeNetV2, and STDC. Additionally, TopFormer and SeaFormer are designed with deployment on resource-limited edge devices in mind, resulting in extremely low inference latency. However, these two models perform poorly on the crack datasets due to inadequate data volume. Our proposed CrackScopeNet model, while maintaining rapid inference speed, achieves remarkable crack segmentation accuracy, establishing its advantage over competing models.
These results confirm the efficacy of deploying the CrackScopeNet model on outdoor mobile devices, where high-speed inference and lightweight architecture are crucial for the real-time processing and analysis of infrastructure surface cracks. By outperforming other state-of-the-art models, CrackScopeNet proves to be a suitable solution for addressing challenges associated with outdoor edge computing.
4.2. Scaling Study
To explore the adaptability of our model, we adjust the number of channels and stack different numbers of Crack Scope Modules to cater to a broader range of application scenarios. Since CrackSeg9k is composed of multiple crack datasets, we also investigate the model’s transferability to specific application scenarios.
We adjust the base number of channels after the stem from 32 to 64. Correspondingly, the number of channels in the remaining three feature extraction stages increases from (32, 64, 128) to (64, 128, 160) to capture more features. Meanwhile, the number of Crack Scope Modules stacked in each stage is adjusted from (3, 3, 4) to (3, 3, 3). We refer to the adjusted model as CrackScopeNet_Large. First, we train CrackScopeNet_Large on CrackSeg9k with the same parameter settings as the base version and evaluate the model on the test set. Furthermore, we use the training parameters and weights obtained from CrackSeg9k for these two models as the basis for transferring the models to downstream tasks in two specific scenarios. The Ozgenel dataset consists of high-resolution concrete crack images, cropped to 448x448, that are similar to some scenarios in CrackSeg9k. The Aerial Track Dataset consists of low-altitude drone-captured images of post-earthquake highway cracks, cropped to 512x512, a type of scene not present in CrackSeg9k.
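The two configurations can be summarized in a small data structure. The class and field names below are hypothetical; only the channel widths and module counts come from the text, and the downsampling pattern (a factor of 4 at the stem, then a factor of 2 before each later stage) is inferred from the output sizes listed in Table 1.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class CrackScopeConfig:
    stem_channels: int
    stage_channels: Tuple[int, int, int]
    stage_depths: Tuple[int, int, int]  # Crack Scope Modules stacked per stage

BASE = CrackScopeConfig(32, (32, 64, 128), (3, 3, 4))
LARGE = CrackScopeConfig(64, (64, 128, 160), (3, 3, 3))

def stage_resolutions(input_size: int, stem_stride: int = 4, num_stages: int = 3) -> List[int]:
    """Spatial size of each stage's feature map, halving between stages."""
    size = input_size // stem_stride
    sizes = []
    for _ in range(num_stages):
        sizes.append(size)
        size //= 2
    return sizes

for crop in (400, 448, 512):  # CrackSeg9k, Ozgenel, Aerial Track crop sizes
    print(crop, stage_resolutions(crop))
```

For a 400x400 input this yields feature maps of 100, 50, and 25 pixels per side, which agrees with the output sizes in Table 1.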
Table 5 presents the mIoU scores, parameter counts, and FLOPs of the base model CrackScopeNet and the high-accuracy version CrackScopeNet_Large on the CrackSeg9k dataset and two specific scenario datasets. In this table, mIoU(F) represents the mIoU score obtained after pre-training the model on CrackSeg9k and fine-tuning it on the respective dataset. It is evident that the large version of the model achieves higher segmentation accuracy across all datasets, but with approximately double the parameters and three times the FLOPs. Therefore, if computational resources and memory are sufficient, and higher accuracy in crack segmentation is required, the large version or further stacking of Crack Scope Modules can be employed.
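The FLOPs columns in Table 5 differ between datasets only because the input resolution differs (400x400, 448x448, and 512x512). For a fully convolutional network, FLOPs grow roughly in proportion to the number of input pixels, which the following check illustrates. The helper name is hypothetical, and the proportionality is an approximation rather than a statement made in the paper.

```python
def scale_flops(flops_at_ref, ref_size, new_size):
    """Estimate FLOPs at a new square input size, assuming cost scales with pixel count."""
    return flops_at_ref * (new_size / ref_size) ** 2

# Reported FLOPs (in G) from Table 5, keyed by input side length.
reported = {
    "CSNet": {400: 1.58, 448: 1.98, 512: 2.59},
    "CSNet_L": {400: 5.09, 448: 6.38, 512: 8.33},
}

estimates = {}
for name, table in reported.items():
    for size in (448, 512):
        estimates[(name, size)] = scale_flops(table[400], 400, size)
        print(f"{name} @ {size}x{size}: estimated {estimates[(name, size)]:.2f}G, "
              f"reported {table[size]:.2f}G")

# Ratios behind "approximately double the parameters and three times the FLOPs".
param_ratio = 2.20 / 1.05
flops_ratio = 5.09 / 1.58
print(f"params x{param_ratio:.2f}, FLOPs x{flops_ratio:.2f}")
```

The estimates stay within about 0.01G of the reported values, which suggests the per-dataset FLOPs reflect input size alone rather than any change to the architecture.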
For specific scenario training, whether from scratch or fine-tuning, our models are trained for only 20 epochs. It can be seen that even when training from scratch, our models converge quickly. We attribute this phenomenon to the initial design of CrackScopeNet, which considers the morphology of cracks and successfully captures the necessary contextual information. For training using transfer learning, both versions of the model achieve remarkable mIoU scores on the Ozgenel dataset, with 90.1% and 92.31%, respectively. Even for the Aerial Track dataset, which includes low-altitude remote sensing images of highway cracks not seen in CrackSeg9k, our models still perform exceptionally well, achieving mIoU scores of 83.26% and 84.11%. These results demonstrate the proposed model’s rapid adaptability to small datasets, aligning well with real-world tasks.
4.3. Diagnostic Experiments
To gain more insight into CrackScopeNet, a set of ablation studies is conducted on CrackSeg9k. All the methods mentioned in this section are trained with the same parameters for 200 epochs.
Strip-wise Context Attention. First, we examine the role of the critical SWA module in CrackScopeNet by replacing it with two advanced attention mechanisms, CBAM[36] and CA[35]. The results are shown in Table 6. They demonstrate that without any attention mechanism, merely stacking convolutional layers for feature extraction yields poor performance due to the limited receptive field. When the SWA attention mechanism, based on strip pooling and one-dimensional convolution, is adopted, the network can capture long-range contextual information. Under this configuration, the model exhibits the best performance. Figure 8 shows the class activation maps (CAM)[60] before the segmentation head of CrackScopeNet. It can be observed that without SWA, the model is easily disturbed by shadows, whereas with the SWA module, the model can focus on the global crack areas. Next, we sequentially replace the SWA module with the channel-spatial feature-based CBAM attention mechanism and the coordinate attention (CA) mechanism, which also uses strip pooling. The model parameters do not change significantly, but the performance declines by 0.2% and 0.17%, respectively.
Furthermore, we explore the benefits of different attention mechanisms for other models by optimizing the advanced lightweight crack segmentation network HrSegNetB48[29]. HrSegNetB48 consists of high-resolution and auxiliary branches, merging shallow detail information with deep semantic information at each stage. Therefore, we add SWA, CBAM, and CA attention mechanisms after feature fusion to capture richer features. Table 7 shows the performance of HrSegNetB48 with different attention mechanisms, clearly indicating that introducing the SWA attention mechanism to capture long-range contextual information provides the most significant benefit.
Multi-scale Branch. Then, we examine the effect of the multi-scale branch in our Crack Scope Module. To ensure fairness, we replace the multi-scale branch with a convolution of larger kernel size, 5x5 instead of 3x3. The results with and without the multi-scale branch are shown in Table 6. It is evident that using a 5x5 convolution instead of the multi-scale branch, even with more floating-point computations, decreases the mIoU score (-0.16%). This demonstrates that blindly adopting large kernel convolutions increases computational overhead without significant performance improvement. The benefits brought by the multi-scale branch are further analyzed through the CAM. As shown in the third column of Figure 8, when the multi-scale branch is not used, the network clearly misses the feature information of small cracks, while our model captures the features of cracks of various shapes and sizes.
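To see why a larger kernel raises cost, the multiply-accumulate count of a convolution layer can be computed directly. This is a generic sketch: the 3x3 versus 5x5 comparison mirrors the ablation, but the channel count, feature-map size, and the use of depthwise convolution are assumptions for illustration, not the actual layer settings of CrackScopeNet.

```python
def conv_macs(height, width, in_ch, out_ch, kernel, groups=1):
    """Multiply-accumulate count for a stride-1, same-padded 2D convolution."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    macs_per_output = (in_ch // groups) * kernel * kernel
    return height * width * out_ch * macs_per_output

h = w = 100  # assumed feature-map size
c = 32       # assumed channel count

standard_3x3 = conv_macs(h, w, c, c, 3)
depthwise_3x3 = conv_macs(h, w, c, c, 3, groups=c)
depthwise_5x5 = conv_macs(h, w, c, c, 5, groups=c)

print(depthwise_5x5 / depthwise_3x3)  # ratio is 25/9, about 2.78
print(standard_3x3 / depthwise_3x3)   # ratio equals the channel count
```

Under these assumptions, moving from a 3x3 to a 5x5 kernel multiplies the cost of that layer by 25/9 regardless of resolution, which is consistent with the observation that a single large kernel adds computation without a matching accuracy gain.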
Decoder. CrackScopeNet uses a simple decoder to fuse feature information of different scales, compressing channel features and fusing features from different stages. At present, the most popular decoders incorporate the Atrous Spatial Pyramid Pooling (ASPP)[23] module to introduce multi-scale information. To explore whether the ASPP module benefits our model and whether our proposed lightweight decoder is effective, we replace our decoder with the ASPP method adopted by DeepLabV3+[23]; the results are shown in the last two rows of Table 6. The computational overhead is large because parallel dilated convolutions must be performed on deep semantic information, yet the performance of the model does not improve. We believe this is because local feature information and long-distance context information are already taken into account during feature extraction in each stage, so the decoder does not need a complicated design.
4.4. Experiment Conclusion
Based on the comparative experiments on performance, parameter count, and FLOPs conducted in previous sections, CrackScopeNet demonstrates significant advantages over both classical and lightweight semantic segmentation models. On the composite CrackSeg9k dataset, CrackScopeNet achieves high segmentation accuracy and shows excellent transferability to specific scenarios. Notably, it maintains a low parameter count and minimal FLOPs, which translates to low-latency inference on resource-constrained drone platforms without the need for GPU acceleration. This efficiency is achieved by considering crack morphology characteristics, allowing CrackScopeNet to remain lightweight and computationally efficient. This makes it particularly suitable for deployment on mobile devices in outdoor environments. In summary, CrackScopeNet achieves a better balance between segmentation accuracy and inference speed compared to other networks examined in this study, making it a promising solution for timely drone-based detection and analysis of cracks on infrastructure surfaces.
5. Discussion
In this paper, we present CrackScopeNet, a lightweight infrastructure surface crack segmentation network specifically designed to address the challenges posed by varying crack sizes, irregular contours, and subtle differences between cracks and normal regions in real-world applications. The proposed CrackScopeNet network structure captures the local context information and long-distance dependencies of cracks through a lightweight multi-scale branch and SWA attention mechanism, respectively, and effectively extracts the low-level details and high-level semantic information required for accurate crack segmentation.
Experimental results demonstrate that CrackScopeNet delivers robust performance and high accuracy. It outperforms larger models like SegFormer in terms of efficiency, significantly reducing the number of parameters and computational cost. Furthermore, CrackScopeNet achieves faster inference speeds than other lightweight models such as BiSeNet and STDC, even without GPU acceleration. This makes it highly suitable for deployment on resource-constrained drone platforms, enabling efficient and low-latency crack detection in structural health monitoring. By making the model and code publicly available, we aim to advance the application of UAV remote sensing technology in infrastructure maintenance, providing an efficient and practical tool for the timely detection and analysis of cracks.
A limitation of this work is that, in this era of large models, our model has only been trained and evaluated on datasets containing a few thousand images, because large-scale data collection and manual labeling remain the bottleneck. Recent advances in generative AI and self-supervised learning can help bypass the limitations imposed by data acquisition and manual annotation. Researchers use the inherent structure or attributes of existing data to generate richer "fake images" and "fake labels", and applying these techniques to crack detection is an interesting avenue for future research.
Figure 1.
Comparisons between classical and lightweight semantic segmentation networks and CrackScopeNet on the CrackSeg9k dataset.
Figure 2.
A structural overview of CrackScopeNet. (a) CrackScopeNet consists of three down-sampling stages, and each stage contains N Crack Scope Modules and FFN modules (we refer to the combination of these two modules as a Crack Scope Block). (b) The Crack Scope Module comprises a multi-scale branch, where the input is divided into three parts along the channel dimension, and (c) an SWA module. (d) The Upsampling Block upsamples deep features and stacks them with shallow information, using SWA modules and convolutional layers for feature fusion.
Figure 3.
Samples from the three crack datasets. The first row shows the original images, and the second row shows the masks overlaid on the original images. (a) Samples from the CrackSeg9k dataset. (b) Samples from the Ozgenel dataset. (c) Samples from the Aerial Track dataset.
Figure 4.
Visualization of the segmentation results of the classical segmentation models and our model on the CrackSeg9k test set.
Figure 5.
Visual segmentation results of the lightweight segmentation models and our model on the CrackSeg9k test set.
Figure 6.
Visual segmentation results of the crack-specific segmentation models on the CrackSeg9k test set.
Figure 7.
Results of inference speed test on Navio2-based Drones.
Figure 8.
Visual explanations for different components of CrackScopeNet.
Table 1.
Instance of CrackScopeNet.
Stage | Downsampling: Operation | In ch. | Out ch. | Upsampling: Operation | In ch. | Out ch. | Output Size
S0 | Input |  | 3 | Seg Head | 96 | 2 | 400x400
S1 | Stem | 3 | 32 | Concatenate | 64 | 96 | 100x100
S2 | CS Stage x 3 | 32 | 32 | Up-samp. | 128 | 64 | 100x100
S3 | Down-samp. | 32 | 64 | Concatenate | 64 | 128 | 50x50
 | CS Stage x 3 | 64 | 64 | Up-samp. | 128 | 64 | 50x50
S4 | Down-samp. | 64 | 128 |  |  |  | 25x25
 | CS Stage x 4 | 128 | 128 |  |  |  | 25x25
Table 2.
Sub-datasets in CrackSeg9k.
Name | Number | Material
Masonry[39] | 240 | Masonry structures
CFD[40] | 118 | Paths and sidewalks
CrackTree200[41] | 175 | Pavement
Volker[42] | 427 | Concrete structures
DeepCrack[43] | 443 | Concrete and asphalt surfaces
Ceramic[44] | 100 | Ceramic tiles
SDNET2018[45] | 1,411 | Building facades, bridges, sidewalks
Rissbilder[42] | 2,736 | Building surfaces (walls, bridges)
Crack500[21] | 3,126 | Pavement
GAPS384[46] | 383 | Pavement and concrete surfaces
Table 3.
The parameter settings for training on CrackSeg9k dataset.
Item | Setting
Epoch | 200
Batch Size | 16
Optimizer | AdamW
Weight decay | 0.01
Beta1 | 0.9
Beta2 | 0.999
Initial learning rate | 0.005
Learning rate decay type | poly
GPU memory | 12 GB
Image size | 400x400
Table 4.
Performance of different methods and our method on the CrackSeg9k dataset.
Model | mIoU(%) | Pr(%) | Re(%) | F1(%) | Params | FLOPs
Classical
U-Net[51] | 81.36 | 90.60 | 87.00 | 88.76 | 13.40M | 75.87G
PSPNet[37] | 81.69 | 89.19 | 88.72 | 88.95 | 21.06M | 54.20G
SegNet[52] | 80.50 | 89.71 | 86.57 | 88.11 | 29.61M | 103.91G
DeepLabV3+[23] | 80.96 | 88.555 | 88.29 | 88.42 | 2.76M | 2.64G
SegFormer[19] | 81.63 | 89.815 | 88.05 | 88.92 | 3.72M | 4.13G
SegNext[24] | 81.55 | 89.28 | 88.44 | 88.86 | 4.23M | 3.72G
Lightweight
BiSeNet[28] | 81.01 | 89.74 | 87.26 | 88.48 | 12.93M | 34.57G
BiSeNetV2[53] | 80.66 | 89.36 | 87.11 | 88.22 | 2.33M | 4.93G
STDC[54] | 80.84 | 88.92 | 87.76 | 88.34 | 8.28M | 5.22G
STDC2[54] | 80.94 | 89.54 | 87.33 | 88.42 | 12.32M | 7.26G
TopFormer[55] | 80.96 | 89.28 | 87.60 | 88.43 | 5.06M | 1.00G
SeaFormer[56] | 79.13 | 87.29 | 87.19 | 87.20 | 4.01M | 0.64G
Specific
U2Crack[50] | 81.45 | 90.125 | 87.52 | 88.80 | 1.19M | 31.21G
HrSegNetB48[29] | 81.07 | 90.39 | 86.78 | 88.55 | 5.43M | 5.59G
HrSegNetB64[29] | 81.28 | 90.44 | 87.03 | 88.70 | 9.65M | 9.91G
CrackScopeNet | 82.15 | 89.34 | 89.24 | 89.29 | 1.047M | 1.58G
Table 5.
Evaluation results of the two versions of our model on three different datasets. CSNet and CSNet_L denote CrackScopeNet and CrackScopeNet_Large, respectively. mIoU(F) denotes the mIoU score obtained after pre-training the model on the CrackSeg9k dataset and fine-tuning it on the respective dataset.
Model | CrackSeg9k mIoU | CrackSeg9k Param | CrackSeg9k FLOPs | Ozgenel mIoU | Ozgenel mIoU(F) | Ozgenel FLOPs | Aerial Track mIoU | Aerial Track mIoU(F) | Aerial Track FLOPs
CSNet | 82.15% | 1.05M | 1.58G | 90.05% | 92.11% | 1.98G | 79.12% | 82.63% | 2.59G
CSNet_L | 82.48% | 2.20M | 5.09G | 90.71% | 92.36% | 6.38G | 81.04% | 83.43% | 8.33G
Table 6.
Ablation study on effectiveness of each component in CrackScopeNet.
Attention: SWA | Attention: CA | Attention: CBAM | Multi-branch | Decoder: ours | Decoder: ASPP | mIoU(%) | FLOPs(G)
 |  |  |  | ✓ |  | 81.34 | 1.57
 | ✓ |  | ✓ | ✓ |  | 81.98 | 1.58
 |  | ✓ | ✓ | ✓ |  | 81.95 | 1.58
✓ |  |  |  | ✓ |  | 81.91 | 1.61
✓ |  |  |  |  | ✓ | 82.14 | 2.89
✓ |  |  | ✓ | ✓ |  | 82.15 | 1.58
Table 7.
Adding different attention mechanisms to HrSegNet.
Model | mIoU(%) | Pr(%) | Re(%) | F1(%) | Params | FLOPs
HrSegNetB48 | 81.07 | 90.39 | 86.78 | 88.55 | 5.43M | 5.59G
HrSegNetB48+CBAM | 81.16 | 90.40 | 86.90 | 88.61 | 5.44M | 5.60G
HrSegNetB48+CA | 81.20 | 90.24 | 87.08 | 88.63 | 5.44M | 5.60G
HrSegNetB48+SWA | 81.72 | 89.65 | 88.33 | 88.98 | 5.48M | 5.60G