4.1. Datasets
Experiments are conducted on two representative SAR datasets with distinct complexity characteristics, i.e., SAR-AIRcraft-1.0 and HRSID.
SAR-AIRcraft-1.0 consists of 4,368 SAR images acquired by the GaoFen-3 satellite [21], containing 16,463 aircraft instances across seven fine-grained categories: Boeing 737, Boeing 787, A220, A320/321, A330, ARJ21, and other. This dataset is characterized by complex airport environments, dense target distribution, strong structural interference from airport facilities, multi-scale aircraft instances, and fine-grained classification requirements. Such characteristics result in strong background clutter and heterogeneous scattering patterns. The dataset is split into training, validation, and test sets with a 7:1:2 ratio.
HRSID contains 5,604 SAR images collected from three satellites, with 16,951 ship instances [22]. Unlike SAR-AIRcraft-1.0, it includes only a single ship category without fine-grained subdivision. Ships are distributed in offshore and nearshore scenes, accounting for 81.6% and 18.4%, respectively. Since most targets are located in offshore areas with relatively homogeneous backgrounds, the overall dataset complexity is relatively low. The dataset is split into training, validation, and test sets with an 8:1:1 ratio.
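As an illustration of the two split ratios, the following minimal sketch converts a ratio into integer subset sizes. The function name `split_counts` and the largest-remainder rounding rule are assumptions made here for illustration; the exact per-subset image counts and the assignment of images to subsets used in the experiments are not stated in the text.

```python
def split_counts(n_images, ratio):
    """Turn a ratio such as (7, 1, 2) into integer subset sizes.

    Hypothetical helper: leftover images after integer division are
    assigned by largest remainder, which is only one possible rule.
    """
    total = sum(ratio)
    # Integer share of each subset and the remainder left by the division.
    base = [n_images * r // total for r in ratio]
    rem = [n_images * r % total for r in ratio]
    leftover = n_images - sum(base)
    # Give one extra image to the subsets with the largest remainders.
    order = sorted(range(len(ratio)), key=lambda i: (-rem[i], i))
    for i in order[:leftover]:
        base[i] += 1
    return tuple(base)


# Training, validation, and test sizes under the stated ratios.
aircraft_split = split_counts(4368, (7, 1, 2))  # SAR-AIRcraft-1.0
hrsid_split = split_counts(5604, (8, 1, 1))     # HRSID
```

The sizes always sum to the dataset size, so no image is dropped or counted twice regardless of how the rounding falls.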
4.4. Ablation Experiment
Ablation experiments are conducted on both SAR-AIRcraft-1.0 and HRSID, with the original YOLOv8 serving as the baseline. To ensure fairness and mitigate the risk of overfitting due to limited SAR training data, ResNet18 is used as the backbone in the baseline, while the original FPN-PAN neck and detection head are retained. Based on this baseline, DCSP, SAN, and CMH are progressively incorporated to evaluate their respective effectiveness and complementary contributions.
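The ablation design covers the baseline, each module alone, each pair of modules, and the full model. A minimal sketch of enumerating these settings is given below; the helper names `ablation_configs` and `config_name` are hypothetical and serve only to make the experimental grid explicit.

```python
from itertools import combinations

# The three proposed modules, as named in the text.
MODULES = ("DCSP", "SAN", "CMH")


def ablation_configs(modules=MODULES):
    """Return every subset of modules, ordered from none to all."""
    configs = []
    for k in range(len(modules) + 1):
        configs.extend(combinations(modules, k))
    return configs


def config_name(subset):
    """Label a setting; the empty subset is the unmodified baseline."""
    return "+".join(subset) if subset else "baseline"


names = [config_name(c) for c in ablation_configs()]
```

With three modules this gives eight settings: one baseline, three single-module, three dual-module, and one full model.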
Results on the SAR-AIRcraft-1.0 dataset: The ablation results are presented in Table 1. After introducing DCSP into the backbone, P, R, AP50, AP75, and AP50:95 increase by 0.4%, 0.5%, 0.7%, 0.2%, and 0.4%, respectively. Meanwhile, the number of parameters increases slightly and the inference speed decreases marginally, indicating that DCSP enhances feature representation at a low additional cost. Replacing the original FPN-PAN with SAN improves P, R, AP50, AP75, and AP50:95 by 1.3%, 2.0%, 1.5%, 0.5%, and 0.3%, respectively. The notable gains in precision, recall, and AP50 suggest more efficient multi-scale feature aggregation and better prediction quality in complex airport scenes. In addition, SAN significantly reduces parameters and FLOPs. Although FPS decreases from 285 to 251, this is mainly due to the additional multi-branch fusion and feature alignment operations, which are less hardware-efficient despite their lower theoretical computational cost. Replacing the original detection head with CMH improves P, R, AP50, AP75, and AP50:95 by 2.9%, 2.1%, 2.6%, 0.8%, and 0.7%, respectively. The larger gains in precision, recall, and AP50 indicate that CMH mainly improves overall prediction quality by enhancing the alignment between classification confidence and localization reliability. Consequently, false alarms and missed detections are reduced, while parameters, FLOPs, and FPS remain nearly unchanged.
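Because FLOPs measure theoretical cost while FPS measures wall-clock throughput, the two can move in opposite directions, as with SAN above. The sketch below shows one common way to measure FPS; it is an assumption about the general procedure, not the timing protocol used in the experiments, and `infer` and `sample` are hypothetical placeholders for a model's forward call and an input image.

```python
import time


def measure_fps(infer, sample, warmup=10, runs=100):
    """Estimate frames per second of a callable on a single input.

    Assumption: `infer` runs synchronously. For GPU inference, the
    device would need to be synchronized before reading the clock.
    """
    # Warm-up iterations exclude one-off costs such as caching.
    for _ in range(warmup):
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse timers.
    return runs / elapsed if elapsed > 0 else float("inf")


# Toy stand-in for a detector: any callable taking one input works.
fps = measure_fps(lambda x: sum(x), list(range(1000)), warmup=2, runs=20)
```

Measured this way, operations that are cheap in FLOPs but fragment the computation into many small branches can still lower FPS.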
For the dual-module settings, DCSP+SAN further improves precision, recall, and AP50:95 over SAN alone, with only slight decreases in AP50 and AP75, indicating a trade-off between stronger feature representation and strict localization accuracy. SAN+CMH achieves clear gains in precision and AP50:95 over SAN alone, demonstrating effective complementarity between feature fusion and reliability-aware prediction modulation. DCSP+CMH yields the most notable gains in recall and AP50 compared with DCSP alone, suggesting improved target coverage and coarse-localization quality. The full model achieves the best performance, with P, R, AP50, AP75, and AP50:95 increased by 3.8%, 3.8%, 2.9%, 1.0%, and 1.4%, respectively, compared with the baseline. These results demonstrate that DCSP, SAN, and CMH can be effectively integrated to exploit complementary strengths and improve overall detection performance.
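To make the reported metrics concrete, the following minimal sketch computes precision and recall for one image at a chosen IoU threshold. The greedy, score-ordered matching and the function names `iou` and `precision_recall` are assumptions for illustration; the evaluation code used in the experiments is not specified in the text. A threshold of 0.5 corresponds to the AP50 setting and 0.75 to the AP75 setting.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, thr=0.5):
    """preds: list of (score, box); gts: list of boxes.

    Each ground-truth box can be matched at most once, so duplicate
    detections of the same target count as false alarms.
    """
    matched = set()
    tp = 0
    for _, box in sorted(preds, key=lambda p: -p[0]):
        best_j, best_iou = -1, thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v >= best_iou:
                best_j, best_iou = j, v
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    fp = len(preds) - tp  # false alarms
    fn = len(gts) - tp    # missed detections
    precision = tp / (tp + fp) if preds else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall


# Toy example: one correct box, one duplicate, one background response.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (0, 0, 10, 9)), (0.7, (50, 50, 60, 60))]
p, r = precision_recall(preds, gts, thr=0.5)
```

In this toy case the duplicate and the background response both lower precision, while the unmatched second target lowers recall, which mirrors the false-alarm and missed-detection effects discussed above.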
Heatmaps for the baseline and the progressively enhanced variants with DCSP, SAN, and CMH are presented in Figure 8(a)-(d), respectively, while the ground truth is given in Figure 8(e). As shown in Figure 8(a), the baseline fails to sufficiently highlight the targets’ weak-scattering components, resulting in incomplete activation across the target regions. After introducing DCSP into the backbone, as shown in Figure 8(b), the activations become more concentrated around the dominant scattering centers, and the response intensity over the true target regions is noticeably enhanced. This improvement can be attributed to the enlarged receptive field and stronger contextual aggregation capability of DCSP. Nevertheless, some parts of the targets still exhibit relatively weak responses. In Figure 8(c), after replacing the original FPN-PAN with SAN, the target regions are well activated, and the overall activation distribution becomes more spatially consistent due to enhanced cross-scale feature interaction and feature alignment. However, a few background regions show weak activations. After further replacing the original detection head with CMH, the heatmap in Figure 8(d) exhibits clearer target focus and sharper response boundaries. In particular, non-target strong scattering regions are significantly suppressed, and the activation responses are better aligned with the ground-truth bounding boxes in Figure 8(e). These observations indicate that the proposed reliability-aware classification-regression alignment effectively mitigates clutter-induced false alarms and improves localization consistency within true target regions.
Results on the HRSID dataset: The ablation results are summarized in Table 2. After introducing DCSP into the backbone, the gains in precision and AP50:95 are more pronounced than those observed on the SAR-AIRcraft-1.0 dataset. Although recall decreases slightly by 0.1%, this can be regarded as a normal fluctuation. These results suggest that DCSP primarily improves target discrimination rather than recall on this dataset, since ship targets are set against relatively clean backgrounds, making target coverage less challenging. Replacing the original FPN-PAN with SAN yields clear improvements in precision and AP50:95, while recall decreases noticeably. This indicates that SAN effectively suppresses false alarms but tends to produce fewer positive predictions, leading to a more precision-oriented behavior. Replacing the original detection head with CMH yields moderate overall improvements, while AP75 decreases slightly. This indicates that the benefit of CMH in relatively simple scenes is mainly reflected in overall prediction quality rather than further enhancement of high-IoU localization.
For the dual-module settings, DCSP+SAN further improves all evaluation metrics over DCSP alone, demonstrating complementary effects between stronger feature representation and adaptive feature fusion. SAN+CMH improves recall over SAN alone, indicating that CMH can effectively alleviate the conservative prediction tendency introduced by SAN. DCSP+CMH yields a notable improvement in AP75 compared with CMH alone, suggesting that the combination is particularly beneficial for high-quality localization. The complete model yields consistent performance improvements across all evaluation metrics, demonstrating that the three proposed modules effectively complement one another. In addition, after introducing one or more modules, the variation trends of parameters, FLOPs, and FPS remain generally consistent with those on the SAR-AIRcraft-1.0 dataset.
Heatmaps for the baseline and for sequentially introducing DCSP, SAN, and CMH to the baseline are presented in Figure 9(a)-(d), respectively; the ground truth is given in Figure 9(e). In Figure 9(a), some target regions fail to be activated in certain image samples, while background clutter is activated in others. In Figure 9(b), after introducing DCSP, the activation becomes more concentrated around the central target structure. However, some target regions still fail to be activated. In Figure 9(c), after replacing the original FPN-PAN with SAN, all target regions are correctly activated, while some background regions are weakly activated. In Figure 9(d), after further replacing the original detection head with CMH, the target regions are correctly activated, whereas the background clutter is no longer activated. Overall, Figure 9 verifies that each proposed component contributes progressively to improved target localization and clutter suppression.
4.6. Comparative Experiment
The effectiveness of the proposed detector is evaluated by comparing it with several state-of-the-art detectors on the SAR-AIRcraft-1.0 and HRSID datasets.
Results on the SAR-AIRcraft-1.0 dataset: For a fair comparison, all detectors are evaluated using ResNet18 and ResNet50 as backbones, and the results are presented in Table 5. Across both backbone configurations, the proposed CGAAN consistently achieves the best performance among all compared detectors. When using ResNet18, CGAAN outperforms representative anchor-based and anchor-free detectors, including RetinaNet [43], GFL [44], AutoAssign [45], ATSS [46], and FCOS [47]. Compared with more recent advanced detectors, such as RTMDet [48] and YOLOv10 [49], CGAAN consistently improves all evaluation metrics, demonstrating comprehensive performance gains in false-alarm suppression, target coverage, and localization in complex airport scenes. When replacing the backbone with ResNet50, CGAAN still achieves higher recall, AP50, and AP75 than Faster R-CNN [50], Cascade R-CNN [51], RepPoints [52], SKG-Net [53], and SA-Net [21], demonstrating its robustness across different backbone configurations. However, its performance is lower than in the ResNet18-based setting. This suggests that a deeper backbone is not necessarily more suitable for small-sample SAR datasets.
The detection results of all models using ResNet18 as the backbone are visualized for comparison. Five randomly selected images are used for qualitative comparison, with the results of eight detectors presented in Figure 10(a)-(h). In these figures, green, red, and blue bounding boxes denote correctly detected targets, missed detections, and false alarms, respectively. As shown in Figure 10(a)-(e), early detectors such as RetinaNet, GFL, AutoAssign, ATSS, and FCOS suffer from false alarms and missed detections, and duplicate detections frequently occur. This observation indicates insufficient discrimination between targets and complex background clutter. In Figure 10(f)-(g), corresponding to RTMDet and YOLOv10, the number of missed detections is significantly reduced. However, false alarms remain relatively prominent, suggesting that although these methods improve target coverage, their ability to suppress clutter-induced responses is still limited. In Figure 10(h), the proposed CGAAN further reduces both false alarms and missed detections while alleviating duplicate detections. Moreover, the predicted bounding boxes exhibit better spatial consistency with the ground-truth targets, indicating improved localization accuracy. These observations are consistent with the metrics achieved by CGAAN in Table 5.
Results on the HRSID dataset: The comparison results under the ResNet18 backbone are presented in Table 6. The proposed CGAAN delivers the most favorable performance among all compared detectors. Compared with conventional detectors such as RetinaNet [43], GFL [44], AutoAssign [45], ATSS [46], FCOS [47], DDOD [54], and FoveaBox [55], CGAAN yields consistent improvements across all metrics. It also outperforms recent advanced methods, including RTMDet [48] and YOLOv10 [49], particularly in AP75, indicating better localization quality under stricter evaluation criteria. Although HRSID has a relatively simple background, clear performance differences among detectors still exist. These results suggest the effectiveness and stable performance of the proposed CGAAN in relatively less challenging SAR scenes.
The detection results of RetinaNet, GFL, AutoAssign, ATSS, FCOS, DDOD, FoveaBox, RTMDet, YOLOv10, and the proposed CGAAN, all using ResNet18 as the backbone, are visualized in Figure 11. Three offshore and two nearshore images are selected for comparison. From left to right, the 2nd and 5th columns correspond to offshore images, while the 1st, 3rd, and 4th columns correspond to nearshore images. The color definitions are consistent with Figure 10. As shown in Figure 11(a)-(g), most detectors suffer from false alarms and missed detections. In Figure 11(h)-(i), RTMDet and YOLOv10 significantly reduce missed detections, especially in offshore scenes, while suppressing false alarms to some extent. In Figure 11(j), CGAAN further reduces both missed detections and false alarms across offshore and nearshore scenes. The predicted bounding boxes exhibit better spatial correspondence with the true targets, confirming improved localization consistency. These visual observations are in good agreement with the quantitative results listed in Table 6.