4.4.3. Ablation Experiments on Spectral Hierarchical Recursive Progressive Fusion Strategy
To verify the effectiveness of the spectral fusion strategy, we conducted ablation experiments from two aspects: fusion position selection and recursive progression iterations.
Impact of introducing fusion positions. Multi-scale feature fusion is a key link in multispectral object detection, and effective fusion of features at different scales has a decisive impact on model performance. This section explores the impact of fusion positions and fusion strategies on detection performance through ablation experiments and visualization analysis.
To systematically study the impact of fusion positions on model performance, we designed a series of ablation experiments, as shown in
Table 6. The experiments started from the baseline model (C1, using simple addition fusion at all scales), progressively applying our proposed innovative fusion modules (fusion mechanism combining SRFM and STPEM) at different scales, and finally evaluating the effect of comprehensive application of advanced fusion strategies.
As shown in
Table 6, with the increase in application positions of innovative fusion modules, model performance progressively improves. The baseline model (C1) only uses simple addition fusion at all feature scales, with an mAP50 of 0.701. When applying SRFM and STPEM modules at the P3/8 scale (C2), performance significantly improves to an mAP50 of 0.769. With further application of advanced fusion at the P4/16 scale (C3), mAP50 increases to 0.776. Finally, when the complete fusion strategy is applied to all three scales (C4), performance reaches optimal levels with an mAP50 of 0.785 and mAP50:95 of 0.426.
To intuitively understand the effect differences between different fusion strategies, we visualized and compared feature maps of simple addition fusion and advanced fusion strategies (SRFM+STPEM) at three scales: P3/8, P4/16, and P5/32, as shown in
Figure 15.
To gain a deeper understanding of the performance differences between different fusion strategies, we conducted systematic visualization comparisons at different scales (P3/8, P4/16, P5/32).
At the P3/8 scale, simple addition fusion presents dispersed activation patterns with insufficient target-background differentiation, especially with suboptimal activation intensity for small vehicles; whereas the feature map generated by the SRFM+STPEM fusion strategy possesses more precise target localization capability and boundary representation, with significantly improved background suppression effect and activation intensity distribution more concentrated on target regions, effectively enhancing small target detection performance.
The comparison of P4/16 scale feature maps shows that although simple addition fusion can capture medium target positions, activation is not prominent enough and background noise interference exists; in contrast, the advanced fusion strategy produces more concentrated activation areas with higher target-background contrast and clearer boundaries between vehicles. As an intermediate resolution feature map, P4 (40×40) demonstrates superior structured representation and background suppression capability under the advanced fusion strategy.
At the P5/32 scale, rough semantic information generated by simple addition fusion makes it difficult to distinguish main vehicle targets; whereas the advanced fusion strategy can better capture overall scene semantics, accurately represent main vehicle targets, and effectively suppress background interference. Although the P5 feature map has the lowest resolution (20×20), it has the largest receptive field, and the advanced fusion strategy fully leverages its advantages in large target detection and scene understanding.
Through comparative analysis, we observed three key synergistic effects of multi-scale fusion:
Complementarity enhancement: the advanced fusion strategy makes features at different scales complementary, with P3 focusing on details and small targets, P4 processing medium targets, and P5 capturing large-scale structures and semantic information;
Information flow optimization: features at different scales mutually enhance each other, with semantic information guiding small target detection and detail information precisely locating large target boundaries;
Noise suppression capability: the advanced fusion strategy demonstrates superior background noise suppression capability at all scales, effectively reducing false detections.
Impact of recursive iteration count. The recursive progression mechanism is a key strategy in our proposed SDRFPT-Net model, which can further enhance feature representation capability through multiple recursive progressive fusions. To explore the optimal number of recursive iterations, we designed a series of experiments to observe model performance changes by varying the number of iterations.
Table 7 shows the impact of iteration count on model performance.
The experimental results show that the number of recursive iterations significantly affects model performance. When the iteration count is 1 (D1), the model achieves an mAP50 of 0.769, indicating that even a single round of iteration can provide effective feature fusion. As the iteration count increases to 2 (D2) and 3 (D3), performance continues to improve, reaching mAP50 of 0.783 and 0.785, and mAP50:95 of 0.418 and 0.426 respectively. However, when the iteration count further increases to 4 (D4) and 5 (D5), performance begins to decline, with the mAP50 of 5 iterations significantly dropping to 0.761.
To more intuitively understand the impact of iteration count on feature representation, we conducted visualization analysis of feature maps at three feature scales—P3, P4, and P5—with different iteration counts, as shown in
Figure 16.
Through visualization, we observed the following feature evolution patterns:
P3 feature layer (high resolution): As the iteration count increases, the feature map gradually evolves from initial dispersed response (n=1) to more focused target representation (n=2,3), with clearer boundaries and stronger background suppression effect. However, when the iteration count reaches 4 and 5, over-smoothing phenomena begin to appear, with some loss of boundary details.
P4 feature layer (medium resolution): At n=1, the feature map has basic response to targets but is not focused enough. After 2-3 rounds of iteration, the activation intensity of target areas significantly increases, improving target differentiation. Continuing to increase the iteration count to 4-5 rounds, feature response begins to diffuse, reducing precise localization capability.
P5 feature layer (low resolution): This layer demonstrates the most obvious evolution trend, gradually developing from blurred response at n=1 to highly structured representation at n=3 that can clearly distinguish main targets. However, obvious signs of overfitting appear at n=4 and n=5, with feature maps becoming overly smoothed and target representation degrading.
These observations reveal the working mechanism of recursive progressive fusion: moderate iteration count (n=3) can achieve progressive optimization of features through multiple rounds of interactive fusion of complementary information from different modalities, enhancing target feature representation and suppressing background interference. However, excessive iteration count may lead to "over-fusion" of features, i.e., the model overfits specific patterns in the training data, losing generalization capability.
Combining quantitative and visualization analysis results, we determined n=3 as the optimal iteration count, achieving the best balance between feature enhancement and computational efficiency. This finding is also consistent with similar observations in other research areas, such as the optimal unfolding steps in recurrent neural networks and the optimal iteration count in message passing neural networks, where similar "performance saturation points" exist.
Through the above ablation experiments, we verified the effectiveness and optimal configuration of each core component of SDRFPT-Net. The results show that the spectral hierarchical perception architecture (SHPA), the complete combination of three attention mechanisms, the full-scale advanced fusion strategy, and three rounds of recursive progressive fusion collectively contribute to the model's superior performance. The rationality of these design choices is not only validated through quantitative metrics but also intuitively explained through feature visualization, providing new ideas for multispectral object detection research.