6.3. Ablation Study
To investigate the contribution of each proposed module, ablation experiments were conducted on the tobacco detection task. The baseline model is YOLO11, to which three components were added incrementally: the Edge-Enhanced Feature Stem (EEFS), Multi-Scale Kernel Interaction (MSKI), and Global–Local Synergistic Attention combined with Adaptive Weighted Feature Fusion (GLSA+AWFF).
Table 5 summarizes the results.
The ablation results in Table 5 validate the contribution of each proposed module. Integrating the EEFS module raises mAP0.5 from 0.8862 to 0.9067. This gain underscores the role of explicitly incorporating edge and gradient information into low-level features, which provides stronger spatial cues and sharper boundaries and thereby improves the initial localization of tobacco plants within cluttered aerial imagery.
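The internal design of EEFS is not described in this section. As a generic illustration of what "edge and gradient information" means for a low-level feature map, the sketch below computes a Sobel gradient magnitude on a small grayscale grid. The choice of the Sobel operator and the function name `sobel_magnitude` are assumptions made for illustration only and should not be read as the EEFS implementation.

```python
def sobel_magnitude(img):
    """Sobel gradient magnitude of a 2-D grayscale image given as a list of rows.

    Illustrative stand-in for an edge cue; border pixels are left at 0.0.
    """
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out


# A vertical step edge: dark on the left, bright on the right.
edge = [[0, 0, 1, 1] for _ in range(4)]
response = sobel_magnitude(edge)
```

A flat region produces a zero response, whereas the step edge produces a strong one, which is the kind of boundary cue the paragraph above attributes to edge-aware low-level features.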
Introducing the Multi-Scale Kernel Interaction (MSKI) module addresses the challenge of detecting targets with considerable size variation. Its impact is reflected in the increase of mAP0.5:0.95 from 0.4094 to 0.4100. Although the gain is small, this stricter metric averages over multiple IoU thresholds and is therefore more sensitive to the quality of bounding box regression. The improvement is consistent with the module helping the network capture more robust multi-scale representations, which is important for detecting the small, distant, or partially occluded tobacco plants that are common in UAV perspectives.
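To make the difference between the two accuracy metrics concrete, the following minimal sketch assumes the usual COCO-style convention in which mAP0.5:0.95 is the mean of the AP values obtained at IoU thresholds 0.50, 0.55, ..., 0.95. The function names and the toy AP values are illustrative assumptions, not part of the evaluation code used in this study.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten IoU thresholds: 0.50, 0.55, ..., 0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(ap_per_threshold):
    """Mean of AP values keyed by IoU threshold over 0.50:0.05:0.95."""
    return sum(ap_per_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)


# Toy AP values (hypothetical): AP drops as the IoU requirement tightens.
toy_ap = {t: 1.0 - t for t in IOU_THRESHOLDS}
toy_score = map_50_95(toy_ap)
```

Because a loosely fitted box can pass the 0.5 threshold yet fail at 0.75 or above, the averaged metric rewards tighter boxes, which is why it responds to regression quality rather than to detection alone.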
The GLSA and AWFF mechanisms work together to refine feature discrimination. GLSA suppresses irrelevant background activations and emphasizes informative plant structures, while AWFF adaptively balances contributions from different network stages. This dual strategy stabilizes and consolidates the performance gains, as evidenced by the sustained high scores in both mAP0.5 and mAP0.5:0.95, while simultaneously reducing the total parameter count. This illustrates a design principle for achieving stronger feature representation without simply increasing model capacity.
Integrating all modules yields the best performance, with the highest mAP0.5 of 0.9123 and mAP0.5:0.95 of 0.4152. This outcome indicates that the components complement each other: edge-aware features provide precise low-level cues, multi-scale interactions ensure robustness across object sizes, and attention-guided fusion enhances feature selectivity. This cascaded refinement from low-level to high-level semantics is key to the model's detection accuracy.
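The size of each reported step can be checked with simple arithmetic. The sketch below uses only values quoted in this section (mAP0.5 of 0.8862, 0.9067, and 0.9123; mAP0.5:0.95 of 0.4094 and 0.4100); the helper name `gain` and the rounding to two decimals are illustrative choices.

```python
def gain(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    abs_points = (after - before) * 100.0
    rel_percent = (after - before) / before * 100.0
    return round(abs_points, 2), round(rel_percent, 2)


# (before, after) pairs exactly as quoted in the text.
steps = {
    "EEFS step, mAP0.5": (0.8862, 0.9067),
    "MSKI step, mAP0.5:0.95": (0.4094, 0.4100),
    "full model vs. baseline, mAP0.5": (0.8862, 0.9123),
}

for name, (before, after) in steps.items():
    abs_points, rel_percent = gain(before, after)
    print(f"{name}: +{abs_points} points ({rel_percent}% relative)")
```

In absolute terms, the EEFS step corresponds to 2.05 percentage points of mAP0.5, the MSKI step to 0.06 points of mAP0.5:0.95, and the full model to 2.61 points of mAP0.5 over the baseline.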
Regarding computational efficiency, the full model exhibits a slight increase in GFLOPs to 7.8 and a corresponding minor decrease in FPS compared to the baseline. This represents a highly favorable trade-off, where a relatively small incremental computational cost yields substantial gains in precision and robustness. The model’s efficiency remains well within the operational requirements for real-time UAV-based tobacco inspection, making the proposed architecture not only effective but also practical for deployment in agricultural monitoring scenarios.
To further validate the effectiveness of the proposed modules, we generated corresponding feature map visualizations. Figure 9 compares the attention maps produced by the baseline model, the model with partial modules, and the full model. The full model focuses most precisely on the core regions of the tobacco plants while suppressing noise from complex backgrounds such as soil and shadows. It also shows stronger feature activations on small, overlapping, and marginal target areas, with more distinct and concentrated attention responses. These visual comparisons support the complementary benefits of the modules: the localization enhancement mechanism helps the model concentrate on foreground targets, the multi-scale fusion structure improves adaptability to objects of varying sizes, and the feature refinement path further suppresses background interference while strengthening detailed features. The visualization results therefore align with the quantitative metrics, and together they demonstrate the effectiveness of the proposed modules in improving the model's perceptual accuracy and robustness.
6.4. Ablation Study and Comparative Analysis
Table 6 presents the detection performance of mainstream YOLO models and the proposed full method on the tobacco dataset. The metrics include GFLOPs, parameter count, mAP0.5, mAP0.5:0.95, and FPS for both total processing and inference only. The comparison includes YOLOv5n, YOLOv8n, YOLO12n, the YOLO11 baseline, and the proposed full method incorporating EEFS, MSKI, AWFF, and GLSA modules.
From the table, it is evident that the YOLO11 baseline achieves moderate performance with an mAP0.5 of 88.62% and mAP0.5:0.95 of 40.26%. The other YOLO variants (YOLOv5n, YOLOv8n, YOLO12n) perform comparably, with differences within 1%, indicating limited improvement over YOLO11 under the same dataset conditions.
By integrating EEFS, MSKI, AWFF, and GLSA modules, the full method substantially enhances detection performance, achieving an mAP0.5 of 91.23% and mAP0.5:0.95 of 41.52%. The full model also maintains a reasonable computational cost with 7.8 GFLOPs and 2.10M parameters, while achieving FPS suitable for near real-time UAV deployment.
This improvement reflects that each proposed module contributes to better feature extraction, multi-scale context modeling, adaptive feature fusion, and global-local attention, collectively enhancing detection accuracy and robustness in complex tobacco field environments.
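Table 6 reports FPS both for total processing and for inference only. The measurement protocol is not detailed here, so the following is only a minimal sketch of how the two figures can be separated, assuming a pipeline split into three stages. The callables `preprocess`, `infer`, and `postprocess` and the `warmup` parameter are hypothetical placeholders.

```python
import time


def measure_fps(images, preprocess, infer, postprocess, warmup=3):
    """Return (total_fps, inference_only_fps) measured over `images`.

    The three callables are hypothetical stages of a detection pipeline.
    """
    # Warm-up passes are excluded from timing (caches, lazy initialization).
    for img in images[:warmup]:
        postprocess(infer(preprocess(img)))

    total_time = 0.0
    infer_time = 0.0
    for img in images:
        t0 = time.perf_counter()
        x = preprocess(img)
        t1 = time.perf_counter()
        y = infer(x)
        t2 = time.perf_counter()
        postprocess(y)
        t3 = time.perf_counter()
        total_time += t3 - t0   # preprocessing + inference + postprocessing
        infer_time += t2 - t1   # forward pass only
    n = len(images)
    return n / total_time, n / infer_time
```

By construction the total-processing FPS can never exceed the inference-only FPS. On a GPU, where execution is asynchronous, the device would additionally need to be synchronized before each timestamp for the split to be meaningful.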
Figure 10 illustrates a qualitative comparison of detection results between the YOLO11 baseline and the proposed full method on representative tobacco field images. As shown, the YOLO11 baseline tends to miss densely clustered or small-scale tobacco plants, and its bounding boxes are sometimes misaligned due to scale variation and complex background interference.
In contrast, the proposed full method, integrating EEFS, MSKI, AWFF, and GLSA modules, demonstrates a significant improvement in detection performance. It accurately identifies more tobacco plants, including those in dense or occluded regions, and provides tighter and more precise bounding boxes. The enhanced feature representation and multi-scale context modeling contribute to improved robustness against viewpoint variations and complex field conditions, resulting in more reliable detection across diverse scenarios.
Figure 11 presents the training curves of the YOLO11 baseline and the proposed full method, including precision, recall, mAP0.5, and mAP0.5:0.95. The curves demonstrate that the proposed method consistently outperforms YOLO11 across all evaluation metrics throughout the training process. Notably, both models employed an early stopping strategy, which resulted in convergence and termination around the 175th epoch.
It can be observed that the proposed full method not only achieves higher precision and recall values but also exhibits superior bounding box regression quality, as indicated by the elevated mAP0.5 and mAP0.5:0.95 curves. This validates that the integration of EEFS, MSKI, AWFF, and GLSA modules enhances feature extraction and multi-scale representation, contributing to more stable and accurate detection in complex tobacco field scenarios.
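The early stopping strategy mentioned above can be illustrated with a patience-based rule, in which training halts once the validation score has not improved for a fixed number of epochs. The exact criterion and patience used in the experiments are not stated, so the function name, the `patience` and `min_delta` parameters, and the toy scores below are all assumptions.

```python
def early_stopping_epoch(val_scores, patience, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once `patience` epochs have passed without the validation
    score improving on the best value by more than `min_delta`.
    """
    best = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            best, best_epoch = score, epoch      # new best: reset the counter
        elif epoch - best_epoch >= patience:
            return epoch                         # no improvement for `patience` epochs
    return None


# Hypothetical validation curve that plateaus after the third epoch.
toy_curve = [0.10, 0.20, 0.30, 0.30, 0.30, 0.30]
stop_at = early_stopping_epoch(toy_curve, patience=3)
```

Under such a rule, two models that plateau at similar points in training terminate at similar epochs, which is consistent with both runs ending near the same epoch.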
6.5. Multi-Dimensional Robustness Analysis
To further evaluate the reliability and generalization capability of the proposed method in real-world complex field conditions, robustness tests were conducted on our custom tobacco dataset across multiple dimensions, including varying weather conditions (sunny, cloudy), acquisition heights (5 m, 7 m), and observation angles (orthogonal, oblique). The experimental findings are as follows.
In the orthogonal view subset, the EEFS contributes to more accurate low-level feature extraction, enabling the model to precisely delineate tobacco plant contours even under varying lighting conditions. As a result, detection confidence and recall are improved compared to the YOLO11 baseline, especially for small targets affected by shadows or low contrast.
Under oblique view scenarios, the MSKI and GLSA modules collaboratively enhance the model’s ability to handle geometric distortions and partial occlusions caused by UAV attitude variations. MSKI enlarges the receptive field to capture multi-scale contextual cues, while GLSA jointly models local details and global context, allowing the model to maintain high detection accuracy. The AWFF further strengthens feature aggregation across scales, ensuring that deformed edge features are correctly recognized. Compared with the YOLO11 baseline, detection confidence under oblique views is increased by approximately 4%, effectively reducing misdetections of overlapping targets.
Regarding lighting and scale variations, EEFS combined with AWFF ensures that both fine-grained and high-level features are consistently represented. In high-altitude captures or under low-light conditions, the model maintains stable precision and recall, demonstrating robustness across different environmental perturbations.
Overall, the proposed full method exhibits superior robustness under multi-dimensional complex scenarios. This is attributed to the precise low-level feature extraction of EEFS, the multi-scale contextual modeling of MSKI, the adaptive fusion mechanism of AWFF, and the cross-level attention modeling of GLSA. These modules synergistically suppress spatial deformation and environmental noise, providing a reliable foundation for automated tobacco plant counting on UAV platforms.
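A per-scenario breakdown of the kind used in this analysis can be sketched by grouping image-level detection counts by weather, acquisition height, and viewing angle, then computing precision and recall within each group. The record fields (`weather`, `height_m`, `view`, `tp`, `fp`, `fn`) and the example counts are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def per_scenario_metrics(records):
    """Aggregate TP/FP/FN per (weather, height, view) and return precision/recall."""
    totals = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for r in records:
        key = (r["weather"], r["height_m"], r["view"])
        for name in ("tp", "fp", "fn"):
            totals[key][name] += r[name]

    metrics = {}
    for key, c in totals.items():
        detected = c["tp"] + c["fp"]          # all predicted boxes
        actual = c["tp"] + c["fn"]            # all ground-truth plants
        metrics[key] = {
            "precision": c["tp"] / detected if detected else 0.0,
            "recall": c["tp"] / actual if actual else 0.0,
        }
    return metrics


# Hypothetical image-level counts, for illustration only.
example_records = [
    {"weather": "sunny", "height_m": 5, "view": "orthogonal", "tp": 9, "fp": 1, "fn": 3},
    {"weather": "sunny", "height_m": 5, "view": "orthogonal", "tp": 9, "fp": 1, "fn": 3},
    {"weather": "cloudy", "height_m": 7, "view": "oblique", "tp": 4, "fp": 4, "fn": 0},
]
scenario_metrics = per_scenario_metrics(example_records)
```

Pooling counts within each scenario before dividing, rather than averaging per-image ratios, prevents images containing few plants from dominating a subset's score.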
Table 7 presents detailed performance metrics across different scenarios.