5.1. Results for Verification Area 1
5.1.1. Flight Altitude of 5 m
The quantitative results for Verification Area 1 at a flight altitude of 5 m are summarized as follows. Under the standard detection-threshold setting, Method 2 improved Precision while largely preserving the crack detections obtained by Method 1, resulting in higher overall performance. This indicates that introducing anomaly detection after object detection can effectively suppress false positives without substantially sacrificing crack detectability.
In contrast, Method 3, which employed a lower object-detection threshold, increased Recall by extracting crack candidates more comprehensively, but this was accompanied by a substantial decrease in Precision due to a large number of newly generated false positives. This result suggests that lowering the detection threshold improves candidate coverage, but it also leads to excessive responses to fine surface irregularities in concrete.
A clear difference was observed between Methods 4 and 5 in the low-threshold setting. In Method 4, many of the false positives generated by Method 3 persisted in the final results, indicating that anomaly detection based solely on grayscale information was insufficient to distinguish true cracks from small surface unevenness. By contrast, Method 5 removed most of these false positives, suggesting that preprocessing with the Frangi filter was more effective because it emphasized crack-like linear geometry rather than brightness alone. The results therefore demonstrate that, for suppressing false positives induced by low-threshold operation, geometric structure-based preprocessing is more suitable than grayscale-only preprocessing.
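The configurations compared above differ in the object-detection threshold and in whether a second-stage anomaly filter is applied. The following minimal sketch illustrates that structure only; it is not the authors' implementation. The candidate records, the thresholds (0.5 as "standard", 0.1 as "low"), the anomaly cutoff, and all function names are hypothetical, and the preprocessing step that distinguishes Methods 4 and 5 is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    confidence: float     # detector confidence (toy value)
    anomaly_score: float  # assumed convention: lower = more crack-like
    is_crack: bool        # ground-truth label for the toy example


def run_pipeline(candidates, det_threshold, anomaly_cutoff=None):
    """Stage 1: confidence threshold. Stage 2 (optional): anomaly filter."""
    kept = [c for c in candidates if c.confidence >= det_threshold]
    if anomaly_cutoff is not None:
        kept = [c for c in kept if c.anomaly_score <= anomaly_cutoff]
    return kept


def precision_recall(kept, candidates):
    """Precision and recall of the kept set against the toy labels."""
    true_pos = sum(c.is_crack for c in kept)
    all_cracks = sum(c.is_crack for c in candidates)
    precision = true_pos / len(kept) if kept else 0.0
    recall = true_pos / all_cracks if all_cracks else 0.0
    return precision, recall


# Toy candidates, invented purely for illustration (not data from the study).
TOY = [
    Candidate(0.90, 0.1, True),
    Candidate(0.80, 0.2, True),
    Candidate(0.70, 0.9, False),
    Candidate(0.30, 0.3, True),
    Candidate(0.20, 0.8, False),
    Candidate(0.20, 0.7, False),
    Candidate(0.15, 0.4, False),
]

# Hypothetical settings: (detection threshold, anomaly cutoff or None).
CONFIGS = {
    "standard threshold only": (0.5, None),
    "standard threshold + anomaly filter": (0.5, 0.5),
    "low threshold only": (0.1, None),
    "low threshold + anomaly filter": (0.1, 0.5),
}

for name, (det, cutoff) in CONFIGS.items():
    p, r = precision_recall(run_pipeline(TOY, det, cutoff), TOY)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

On this toy set, lowering the threshold raises recall and lowers precision, and the second stage recovers part of the lost precision. This mirrors the qualitative pattern described above; the numbers themselves carry no meaning.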
The numerical results are also consistent with this interpretation. Regarding Verification Area 1 at 5 m, Method 5 achieved a Precision of 0.7443, outperforming Method 4 (0.3475) and Method 3 (0.2380), while maintaining a Recall of 0.8674. Its F1-score and F2-score reached 0.8012 and 0.8396, respectively. These values indicate that Frangi-based preprocessing substantially improved the low-threshold results compared with Methods 3 and 4 by recovering Precision while retaining relatively high Recall. However, Method 5 did not surpass Method 2 in the balanced or recall-oriented metrics under this condition. Therefore, for Verification Area 1 at 5 m, the standard-threshold anomaly-based configuration (Method 2) remained the more reliable overall operating strategy, although Method 5 still demonstrated the practical benefit of Frangi-based preprocessing as a countermeasure for low-threshold over-detection.
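As a consistency check, the F-scores quoted for Method 5 can be reproduced from the reported Precision $P = 0.7443$ and Recall $R = 0.8674$. The worked example below assumes the usual $F_\beta$ definition with $\beta = 1$ and $\beta = 2$; this section of the paper does not restate the definition.

```latex
F_\beta = \frac{(1+\beta^{2})\,P\,R}{\beta^{2}P + R}

F_1 = \frac{2 \times 0.7443 \times 0.8674}{0.7443 + 0.8674} \approx 0.801,
\qquad
F_2 = \frac{5 \times 0.7443 \times 0.8674}{4 \times 0.7443 + 0.8674} \approx 0.840
```

Both values agree with the reported 0.8012 and 0.8396 to within the rounding of $P$ and $R$. Because $F_2$ weights Recall more heavily than Precision, it exceeds $F_1$ whenever $R > P$, as is the case here.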
Figure 9 visually supports the same conclusion as the quantitative results. Method 2 removed part of the false detections observed in Method 1 while preserving the main crack pattern. Method 3 produced numerous spurious detections over the concrete surface, whereas Method 5 substantially reduced them. Thus, at 5 m, the proposed framework improved not only the standard-threshold baseline but also the reliability of the low-threshold result through Frangi-based preprocessing.
5.1.2. Flight Altitude of 10 m
The quantitative results for Verification Area 1 at a flight altitude of 10 m are summarized as follows. As in the 5 m case, Method 2 improved Precision relative to Method 1 while only slightly decreasing Recall, and consequently achieved higher F1 and F2 scores. This confirms that anomaly-based false-positive reduction remained effective even at the 10 m image resolution. In other words, under the standard threshold setting, the proposed second-stage verification still contributed to more reliable crack detection.
Method 3 again increased Recall relative to Method 1, but at the cost of a substantial drop in Precision, indicating that the lower detection threshold yielded more crack candidates while also generating many new false positives. This pattern is consistent with the 5 m case and suggests that low-threshold object detection remained highly sensitive to non-crack surface patterns.
However, unlike at 5 m, the effectiveness of anomaly detection in the low-threshold setting became markedly weaker at 10 m. Method 4 showed only a very limited improvement over Method 3, implying that grayscale-based preprocessing could no longer sufficiently distinguish cracks from concrete-surface irregularities under reduced resolution. Method 5 still outperformed Method 4, which suggests that emphasizing line-like geometry remained beneficial. For example, in Verification Area 1, the Precision of Method 4 was 0.1604, whereas that of Method 5 increased to 0.5441. Nevertheless, this value was still far lower than the corresponding 5 m result for Method 5 (0.7443), showing that the benefit of Frangi-based preprocessing deteriorated considerably as image resolution decreased.
This degradation indicates that, at 10 m, the geometric differences between fine cracks and small concrete-surface irregularities became less distinguishable, thereby reducing the effectiveness of both the Frangi filter and the ViT-based anomaly detector. As a result, although Method 5 remained more effective than Method 4 at suppressing false positives generated by low-threshold detection, the overall balance between reliability and crack detectability remained inferior to that of Method 2. Therefore, for Verification Area 1 at 10 m, the most reliable operating strategy was not the low-threshold setting but rather the standard-threshold detection followed by anomaly detection.
Figure 10 visually supports the same interpretation as the quantitative results. Method 2 retained the principal crack detections while removing part of the false positives present in Method 1. In contrast, Method 3 produced dense spurious responses across the concrete surface. Although Method 5 reduced these responses more effectively than Method 4, residual false detections remained more noticeable than in the 5 m results, indicating a decline in discrimination performance due to reduced image resolution.
Overall, the comparison between 5 m and 10 m suggests that the proposed framework remained effective under the standard-threshold setting across both resolutions. In contrast, the benefit of low-threshold operation became increasingly limited as the image resolution decreased. This finding implies that the practical advantage of combining broad candidate extraction with anomaly-based filtering depends strongly on image quality and GSD. To improve the completeness of the altitude-dependent analysis, the detailed results for the 15 m and 20 m cases of Verification Area 1 are additionally provided in Appendix B. These supplementary results show the same overall tendency as observed at 10 m: Method 2 remained the most reliable configuration under the standard-threshold setting, whereas the low-threshold variants, particularly Methods 4 and 5, became less effective as image resolution deteriorated.
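The dependence on GSD noted above can be made concrete with the standard nadir-imaging relation GSD = (sensor width × altitude) / (focal length × image width). The sketch below is illustrative only: the camera parameters are hypothetical placeholders, not the specifications of the UAV used in this study. It shows that GSD scales linearly with altitude, so each pixel at 10 m covers twice the ground distance it covers at 5 m.

```python
def ground_sampling_distance_mm(altitude_m, sensor_width_mm=13.2,
                                focal_length_mm=8.8, image_width_px=5472):
    """Return GSD in millimetres per pixel for a nadir-pointing camera.

    The default camera parameters are hypothetical placeholders.
    """
    # sensor_width / focal_length is dimensionless, so the result is m/pixel.
    gsd_m = (sensor_width_mm * altitude_m) / (focal_length_mm * image_width_px)
    return gsd_m * 1000.0


# Altitudes mentioned in the text (5-20 m for analysis, 30 m for the orthophoto).
for altitude in (5, 10, 15, 20, 30):
    gsd = ground_sampling_distance_mm(altitude)
    print(f"{altitude:>2} m -> {gsd:.2f} mm/pixel")
```

Under this relation, a crack that spans a fixed number of pixels at 5 m spans only half as many at 10 m, which is consistent with the weaker geometric cues reported at higher altitudes.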
Figure 11 provides an overview of the detection performance of the five methods across Verification Areas 1–5 at a flight altitude of 10 m, and highlights the stable effectiveness of Method 2 under the standard-threshold setting despite the reduced image resolution.
Because the subsequent discussion focuses on Verification Areas 2–5 under the 5 m condition, Figure 12 provides an overview of the detection performance of the five methods across Verification Areas 1–5 at a flight altitude of 5 m.
5.2. Results for Verification Areas 2–5
Because the reliability of crack detection decreased with increasing flight altitude in Verification Area 1, the subsequent verification for Areas 2–5 was conducted using aerial images acquired at a flight altitude of 5 m. This setting was adopted in order to evaluate the proposed method under sufficiently high-resolution conditions while focusing on differences in environmental disturbances among sites.
As shown in Figure 12, the results at 5 m across Verification Areas 1–5 exhibited broadly consistent trends. Under the standard-threshold setting, Method 2 generally improved Precision relative to Method 1 while maintaining comparable Recall, indicating that anomaly detection remained effective as a post-processing step for suppressing false positives across different environments. By contrast, Method 3, which adopted a lower detection threshold, increased Recall across all areas but led to a substantial decline in Precision, confirming that broader candidate extraction inevitably induced many additional false positives. When anomaly detection was applied after low-threshold detection, both Method 4 and Method 5 recovered part of the lost Precision. However, their relative effectiveness varied depending on site conditions, suggesting that the benefit of preprocessing depended on the dominant surface patterns and disturbance types present in each area.
5.2.1. Verification Area 2
For Verification Area 2, Method 2 provided the most stable overall improvement over the baseline. Compared with Method 1, Precision increased from 0.7391 to 0.9042, while Recall decreased only slightly from 0.8635 to 0.8440. As a result, both F1-score and F2-score improved, from 0.7965 and 0.8354 to 0.8731 and 0.8554, respectively. These results indicate that, in this area, anomaly detection effectively removed false positives generated by the baseline detector without causing a substantial loss of crack detectability.
Under the low-threshold setting, Method 3 increased Recall to 0.9422 but reduced Precision to 0.3778, demonstrating the expected trade-off between broader candidate extraction and severe over-detection. Method 4 showed only limited recovery, whereas Method 5 substantially improved Precision to 0.8941 while maintaining Recall at 0.8159. Although Method 5 did not surpass Method 2 in balanced metrics, it clearly outperformed Method 4, suggesting that Frangi-based preprocessing was more effective than grayscale-only preprocessing for suppressing false positives induced by the low-threshold operation in this area.
5.2.2. Verification Area 3
Verification Area 3 exhibited a slightly different tendency. As in the other areas, Method 2 markedly improved Precision relative to Method 1, increasing it from 0.5000 to 0.9254, while Recall changed only marginally from 0.5735 to 0.5507. Consequently, the F1-score increased from 0.5343 to 0.6905, confirming that the proposed anomaly detection step again contributed to more reliable crack detection under the standard-threshold setting.
However, under the low-threshold setting, the performance balance differed from that in Verification Area 2. Method 4 achieved Precision and Recall values of 0.5758 and 0.7350, respectively, yielding the highest F2-score among the low-threshold variants (0.6964), whereas Method 5 slightly improved the F1-score to 0.6474 but reduced the F2-score to 0.6438. This suggests that, in Verification Area 3, both preprocessing strategies were effective to some extent, but grayscale-based preprocessing retained a slight advantage in the recall-oriented evaluation, whereas Frangi-based preprocessing provided a marginally better balance between precision and recall.
5.2.3. Verification Area 4
In Verification Area 4, the superiority of Method 2 became particularly clear. Relative to Method 1, Method 2 increased Precision from 0.3981 to 0.7185 while maintaining a high Recall of 0.9408, resulting in substantial improvements in F1-score and F2-score, from 0.5641 and 0.7521 to 0.8148 and 0.8860, respectively. This confirms that anomaly detection introduced after standard-threshold object detection provided the most reliable crack detection for this area.
Method 3 again yielded the highest Recall (0.9915) but at the cost of extremely low Precision (0.1697), indicating the generation of a large number of false positives. Although both Method 4 and Method 5 improved Precision relative to Method 3 and even exceeded Method 1 in balanced metrics, neither method outperformed Method 2. The results likewise indicate that, for this area, introducing anomaly detection into the existing standard-threshold framework was more reliable than lowering the detection threshold and then attempting to remove the newly generated false positives.
5.2.4. Verification Area 5
Verification Area 5 showed another practically important pattern. Method 2 again improved Precision over the baseline, from 0.5352 to 0.8154, while Recall decreased only moderately from 0.6546 to 0.6018. This led to a higher F1-score than Method 1 (0.6925 vs. 0.5889), indicating that the standard-threshold anomaly-detection framework remained effective in this area as well.
At the same time, the low-threshold variants achieved competitive results in Verification Area 5. Method 5 increased Precision to 0.7736 while maintaining Recall at 0.6813, resulting in the highest F1-score among all five methods in this area (0.7245). By contrast, Method 4 achieved the highest F2-score (0.7074) because it retained a somewhat higher Recall of 0.7708. These results further suggest that Method 5 suppressed concrete-surface false positives more effectively than Method 4, but at the cost of discarding some fine crack regions that were misjudged as non-crack. Thus, in Verification Area 5, Frangi-based preprocessing improved precision recovery under the low-threshold setting, whereas grayscale-based preprocessing preserved slightly better crack coverage.
5.2.5. Discussion of Verification Areas 2–5
Taken together, the results for Verification Areas 2–5 reinforce two main findings. First, Method 2 consistently provided a robust and reliable improvement over the conventional baseline across different site conditions. This supports the interpretation that anomaly detection can function as a stable false-positive suppression layer when applied after standard-threshold object detection. Second, the effectiveness of low-threshold detection followed by anomaly filtering was strongly site-dependent. In some areas, especially Verification Area 5, Method 5 achieved competitive or even superior F1 scores, whereas in others, Method 2 remained clearly preferable. These results suggest that the practical value of low-threshold operation depends not only on image resolution but also on the local characteristics of concrete-surface texture and environmental noise.
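The first finding can be checked directly from the Precision and Recall values quoted in Sections 5.2.1, 5.2.2, and 5.2.4. The short script below is a sketch that recomputes F1 and F2 for Methods 1 and 2 under the standard F-beta definition (assumed here, since this section does not restate it). The function and variable names are illustrative. Verification Area 4 is omitted because the text does not quote Method 1's Recall for that area.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score computed from precision and recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# (Precision, Recall) pairs as quoted in the text for Methods 1 and 2.
REPORTED = {
    "Area 2": {"Method 1": (0.7391, 0.8635), "Method 2": (0.9042, 0.8440)},
    "Area 3": {"Method 1": (0.5000, 0.5735), "Method 2": (0.9254, 0.5507)},
    "Area 5": {"Method 1": (0.5352, 0.6546), "Method 2": (0.8154, 0.6018)},
}


def gain(area, beta=1.0):
    """F-beta of Method 2 minus F-beta of Method 1 for one area."""
    m1 = f_beta(*REPORTED[area]["Method 1"], beta=beta)
    m2 = f_beta(*REPORTED[area]["Method 2"], beta=beta)
    return m2 - m1


for area in REPORTED:
    print(f"{area}: F1 gain = {gain(area, 1.0):+.4f}, "
          f"F2 gain = {gain(area, 2.0):+.4f}")
```

The recomputed scores match the reported ones to within rounding. For example, the Area 2 values for Method 2 come out at approximately 0.873 (F1) and 0.855 (F2).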
5.3. General Discussion
The experimental results obtained in this study reveal two important findings regarding practical UAV-based crack inspection of port quay walls. First, integrating anomaly detection into the conventional object detection framework under the standard detection-threshold setting yielded a stable improvement in reliability across different flight conditions and site environments. Method 2 provided the most stable improvement over the conventional baseline across the tested conditions, indicating that anomaly detection can be incorporated into the existing system as a robust false-positive suppression layer without substantially sacrificing crack detectability. This finding is particularly important for practical deployment, because it suggests that the proposed framework can improve inspection reliability even in visually complex inspection environments on port quay walls, where debris and other disturbances frequently interfere with crack identification. Accordingly, the main contribution of this study is not the proposal of a fundamentally new anomaly-detection architecture, but the operational validation of a reliability-oriented inspection framework under practical conditions relevant to UAV-based geomatics and infrastructure monitoring.
Second, the experimental results clarify both the potential and the limitations of the low-threshold detection strategy. Lowering the object detection threshold increased crack candidate coverage and enabled capturing fine crack regions that were likely to be missed in the conventional setting. However, this broader candidate extraction also yielded a large number of false positives due to concrete-surface irregularities and environmental noise. In this context, Method 5, which combined low-threshold object detection with Frangi-based preprocessing and anomaly detection, demonstrated greater versatility than Method 4, which relied primarily on grayscale-based preprocessing. In particular, under the 5 m condition, Method 5 achieved a favorable balance between broader detection and false-positive suppression in some verification areas. These results suggest that emphasizing geometric line-like structures is more effective than relying only on brightness information when distinguishing cracks from visually similar non-crack patterns. In this sense, the study should be understood as a practically oriented comparative evaluation of false-positive suppression strategies across object-detection thresholds, preprocessing conditions, and image resolutions represented by different flight altitudes and GSDs, rather than as a benchmark for proposing a new deep learning backbone.
At the same time, the results also demonstrate that the effectiveness of this low-threshold strategy is strongly dependent on image resolution. As flight altitude increased, the geometric difference between fine cracks and small concrete-surface unevenness became less distinguishable, and the anomaly detection stage was no longer able to sufficiently eliminate the newly generated false positives. These findings indicate that, at flight altitudes of 10 m and above, the current combination of Frangi filtering and the ViT-based anomaly detector approached its discrimination limit, as reduced image resolution weakened the geometric cues required for reliable separation. Accordingly, the present results imply that the practical benefit of the low-threshold strategy is conditional on sufficiently high-resolution imagery. In contrast, the standard-threshold anomaly-detection framework is more robust across operating conditions. The need for condition-specific threshold calibration is a current limitation of the proposed framework, and future work should investigate more principled and automated threshold selection strategies. A supplementary sensitivity analysis for representative conditions is additionally provided in Appendix C, showing that the adopted thresholds corresponded to practical operating points near the balance between false-positive suppression and crack retention.
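One simple form that automated threshold selection could take is a sweep over candidate anomaly cutoffs on a labeled validation set, keeping the cutoff that maximizes a chosen F-beta score. The sketch below is an assumption-laden illustration, not the procedure used in this study or in Appendix C. The scores, labels, cutoff grid, and function names are all hypothetical, and a candidate is assumed to be kept when its anomaly score is at or below the cutoff.

```python
def f_beta_from_counts(tp, fp, fn, beta=2.0):
    """F-beta expressed in true-positive, false-positive, false-negative counts."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def sweep_cutoffs(scores, labels, cutoffs, beta=2.0):
    """Return (cutoff, F_beta) pairs; a candidate is kept if score <= cutoff."""
    total_pos = sum(labels)
    results = []
    for cutoff in cutoffs:
        kept = [lab for s, lab in zip(scores, labels) if s <= cutoff]
        tp = sum(kept)          # kept candidates that are true cracks
        fp = len(kept) - tp     # kept candidates that are not cracks
        fn = total_pos - tp     # true cracks removed by the filter
        results.append((cutoff, f_beta_from_counts(tp, fp, fn, beta)))
    return results


def best_cutoff(scores, labels, cutoffs, beta=2.0):
    """Cutoff with the highest F_beta (first one wins on ties)."""
    return max(sweep_cutoffs(scores, labels, cutoffs, beta), key=lambda r: r[1])


# Toy validation data, invented for illustration (1 = true crack).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.7, 0.8, 0.9]
toy_labels = [1, 1, 0, 1, 0, 0, 0]
toy_cutoffs = [0.15, 0.35, 0.5, 0.95]

for cutoff, score in sweep_cutoffs(toy_scores, toy_labels, toy_cutoffs):
    print(f"cutoff={cutoff:.2f}  F2={score:.4f}")
print("selected:", best_cutoff(toy_scores, toy_labels, toy_cutoffs))
```

Choosing beta = 2 favors crack retention over false-positive suppression. A different beta, or a constraint such as a minimum Recall, would shift the selected operating point.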
Another limitation of the present study is the potential domain gap between the SDNET2018 crack images used to construct the crack feature center and the UAV-acquired images of port quay walls used for verification. These two image domains differ in acquisition geometry, image resolution, lighting conditions, and background texture, and such differences may affect the transferability of the learned crack feature distribution. Therefore, the present results should be interpreted as evidence that the SDNET2018-based crack feature center retained practical utility under the investigated conditions, rather than as proof that the same transferability will hold universally across different UAV inspection settings. Future work should examine feature-center construction using UAV-specific crack datasets and more systematic strategies for reducing domain mismatch.
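To make the notion of a crack feature center concrete, the sketch below assumes that the center is the element-wise mean of feature vectors extracted from reference crack images, and that the anomaly score is the Euclidean distance to that center. Both choices are assumptions made for illustration. This section does not specify how the center or the score is computed, and the two-dimensional vectors stand in for real ViT features.

```python
import math


def feature_center(features):
    """Element-wise mean of equally sized feature vectors."""
    if not features:
        raise ValueError("at least one feature vector is required")
    dim = len(features[0])
    return [sum(vec[i] for vec in features) / len(features) for i in range(dim)]


def anomaly_score(feature, center):
    """Euclidean distance to the center; larger means less crack-like."""
    return math.dist(feature, center)


# Toy reference features (placeholders for embeddings of reference crack images).
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
center = feature_center(reference)

print("center:", center)
print("near candidate:", anomaly_score([0.5, 0.5], center))
print("far candidate:", anomaly_score([3.5, 4.5], center))
```

Under this formulation, a domain gap appears as a systematic offset between the reference features and the features of true cracks in the target imagery. Such an offset inflates the distances of genuine cracks and is one reason a center built from UAV-specific data may transfer better.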
From a geomatics perspective, these findings have two implications. One is methodological: reliability enhancement should not be discussed only in terms of georeferenced output, but also in terms of the fundamental discrimination capability of the image-analysis pipeline itself. The other is operational: for low-cost UAV-based infrastructure inspection, maintaining stable performance under practical field conditions requires not only efficient image acquisition but also an appropriate balance between crack candidate extraction and post-detection verification. In this sense, the proposed anomaly-based framework can be interpreted as a reliability-oriented extension to a low-cost UAV inspection workflow. In other words, the present study focuses on reliability validation at the image-analysis level, while the extension to a full orthophoto-based workflow should be understood as the next implementation step rather than a prerequisite for interpreting the core discrimination results.
To illustrate the practical integration of the proposed framework into a georeferenced inspection workflow, Figure 13 presents an example in which crack detection results were projected onto an orthophoto of the same inspection area. The comparison visually indicates that the anomaly-based configuration (Method 2) reduced many spatially scattered false positives in the orthophoto-based output while preserving the main crack-related detections. In particular, the anomaly-based result shows fewer false detections around the drift-debris-rich central portion of the orthophoto. Because the orthophoto used in Figure 13 was generated from aerial images acquired at 30 m due to site constraints, its spatial resolution is lower than that of the images used in the main quantitative evaluation. Accordingly, some cracks are not clearly represented in the orthophoto-based output, and the figure should be interpreted as an end-to-end illustration of georeferenced integration rather than as a complete crack-detection result; the main quantitative evaluation in this study remains based on single-image analysis.
Nevertheless, several issues remain. The current anomaly detection framework may still remove parts of true crack regions, especially under aggressive filtering conditions, and its effectiveness decreases with lower image resolution. These results also suggest several future directions, including dedicated learning strategies tailored to port quay wall inspection, geometric or morphological completion of crack regions partially removed during anomaly filtering, threshold optimization strategies, and the use of super-resolution techniques. Additional improvements may also be achieved by updating the baseline object detection model itself and by developing hybrid discrimination frameworks that combine geometric, color, and depth-related features. These directions are expected to further improve robustness against changes in altitude and imaging environment.
The scope of the present study was intentionally limited to port quay walls, which represent a practically important but visually challenging inspection target. Accordingly, the current findings should be interpreted as evidence of effectiveness within this domain, based on comparisons across flight altitudes, preprocessing conditions, and threshold settings, rather than as proof of immediate generalization to other infrastructure types or external crack datasets. Validation across different infrastructure assets and cross-dataset settings remains an important subject for future work. In addition, the present study did not include formal statistical significance testing, because the evaluation was designed as a practical comparative analysis across a limited number of site-specific verification cases rather than as a large-scale benchmark experiment with repeated randomized trials. The findings therefore rest primarily on the consistency of performance trends observed across the investigated areas, flight altitudes, and preprocessing conditions. Because all compared methods were evaluated on the same predefined spatial extent for each verification area and flight-altitude condition, the observed differences reflect method-dependent changes under matched visual conditions rather than differences caused by scene selection. Taken together, the present results provide evidence of the comparative benefit of anomaly-based post-processing under condition-dependent calibration, not proof of a universally threshold-free inspection workflow.
To examine the practical applicability of the proposed framework, the computational environment and runtime were also recorded. Table 3 summarizes the hardware environment used for runtime measurement. Because YOLOR-based crack detection and anomaly detection (for both grayscale- and Frangi-based settings) were executed in different software environments, the corresponding Python, PyTorch, and CUDA versions are listed separately.
Table 4 reports the average number of candidate regions and the average total processing time under representative experimental conditions. The runtime values were measured in seconds on two representative site images from Verification Areas 1 and 2, using a single execution for each condition and including image loading and saving. GeoJSON output was excluded because the present study evaluated the framework at the single-image level. Table 4 focuses on representative configurations that include anomaly-based post-processing, because the purpose of the runtime comparison was to examine the additional computational cost introduced by preprocessing and anomaly detection. The runtime comparison was intended solely as a practical reference, not as a strict benchmark analysis.
As expected, the low-threshold setting substantially increased the number of candidate regions and, consequently, the total processing time of the overall workflow.