Preprint
Article

This version is not peer-reviewed.

Multi-Scale and Global–Local Feature Enhanced Detection for Tobacco Plants in Complex Field Environments

Submitted: 02 June 2026

Posted: 03 June 2026


Abstract
Accurate detection of tobacco plants in complex field environments is critical for precision agriculture, crop monitoring, and yield estimation. Traditional manual counting methods are time-consuming, labor-intensive, and susceptible to environmental and subjective factors. In this study, we propose an improved YOLO11-based framework for automated tobacco plant detection, specifically designed to address challenges such as scale variation, dense distribution, and background interference. The framework integrates four key modules: the Edge-Enhanced Feature Stem (EEFS) to strengthen low-level feature extraction, the Multi-Scale Kernel Interaction (MSKI) to capture multi-scale contextual information, the Adaptive Weighted Feature Fusion (AWFF) to optimize feature aggregation, and the Global–Local Synergistic Attention (GLSA) to enhance feature discrimination by jointly modeling local details and global context. A comprehensive UAV-based tobacco dataset was constructed, encompassing multiple lighting conditions, collection heights, and observation angles. Experimental results demonstrate that the proposed method significantly outperforms the YOLO11 baseline and achieves superior performance compared to mainstream YOLO variants. Ablation studies and heatmap visualizations confirm the effectiveness of each module. Furthermore, the model exhibits robust performance under multi-dimensional environmental perturbations, including varying illumination, scale, and camera angles. The proposed framework provides a practical and efficient solution for automated tobacco plant counting, offering potential applications in UAV-based precision agriculture and large-scale crop monitoring.

1. Introduction

Tobacco is an economically significant crop worldwide, and monitoring its growth as well as estimating yield are critical for effective agricultural management [1,2]. Within the Precision Agriculture framework, automatic counting of field-grown tobacco plants not only underpins crop growth modeling but also provides essential support for damage assessment, insurance claims, and digital field management [3,4].
Traditional methods for tobacco plant counting rely on manual field surveys, which are labor-intensive, costly, and susceptible to subjective bias, terrain variability, and crop growth stages [2]. Such approaches struggle to meet the timeliness and accuracy requirements of large-scale dynamic monitoring. With the rapid advancement of computer vision and unmanned aerial vehicle (UAV) technologies, UAV-based remote sensing has emerged as a promising approach for automated crop detection [5,6]. Previous studies have applied object detection models and lightweight operators to achieve improved accuracy on specific datasets.
However, applying these models to large-scale, complex tobacco fields presents two main challenges. First, geometric distortions caused by multi-angle imaging: UAV images often contain oblique perspectives due to environmental wind and flight posture variations [7,8,9]. These distortions produce severe projection deformations of tobacco plants, leading to lower confidence in detecting overlapping and densely planted targets. Second, the trade-off between edge-level computing resources and computational redundancy: high-throughput tobacco inspection requires real-time inference, but conventional high-performance models deployed on resource-constrained UAV platforms incur significant memory access costs, resulting in increased latency and power consumption [10,11].
To address these challenges, this study proposes an enhanced object detection framework based on a modified YOLO architecture, designed to balance detection accuracy and computational efficiency for robust tobacco plant counting in complex field environments.
(1) Edge-Enhanced Feature Stem (EEFS) is introduced at the input stage to strengthen low-level feature extraction by integrating gradient-based edge information and pooling-based structural cues, improving contour and texture representation of tobacco plants.
(2) Multi-Scale Kernel Interaction (MSKI) module is embedded into the backbone to capture multi-scale contextual information via multiple depthwise convolutions with varying kernel sizes, enhancing representation of targets with significant scale variations.
(3) Global–Local Synergistic Attention (GLSA) mechanism is incorporated to improve feature discrimination by jointly modeling local fine-grained details and global contextual dependencies, enabling better localization of targets under complex backgrounds.
(4) Adaptive Weighted Feature Fusion (AWFF) mechanism replaces conventional feature concatenation in the neck, adaptively aggregating multi-scale features via learnable weights to emphasize informative signals and suppress less relevant ones, which is particularly beneficial for small and densely distributed targets.
Through extensive evaluation in multi-dimensional and complex field scenarios, the proposed framework demonstrates high detection accuracy while maintaining computational efficiency, providing a practical solution for UAV-based tobacco inspection and supporting crop growth monitoring and yield estimation in smart agriculture.

3. Method

3.1. Overall Network Architecture

To address the challenges of tobacco plant detection in complex field environments, including scale variation, dense distribution, and background interference, this study proposes an enhanced object detection framework based on a modified YOLO architecture. The overall structure of the proposed method is illustrated in Figure 1.
The network consists of three main components: a backbone for feature extraction, a neck for multi-scale feature fusion, and a detection head for prediction. To improve feature representation and detection robustness, several dedicated modules are incorporated into different stages of the network.
At the input stage, an Edge-Enhanced Feature Stem (EEFS) is introduced to strengthen low-level feature extraction. By integrating gradient-based edge information and pooling-based structural cues, the model captures clearer contour and texture characteristics of tobacco plants.
Within the backbone, Multi-Scale Kernel Interaction (MSKI) modules are embedded to enhance multi-scale feature modeling. By employing multiple depthwise convolutions with different kernel sizes, the receptive field is effectively enlarged, improving the representation of targets with significant scale variations.
In the neck, an Adaptive Weighted Feature Fusion (AWFF) mechanism is adopted to replace conventional feature concatenation. With learnable weights, the model adaptively balances the contributions of features from different scales, which is particularly beneficial for small and densely distributed targets.
Furthermore, a Global–Local Synergistic Attention (GLSA) module is incorporated to enhance feature discrimination. This module decomposes feature maps into local and global branches, enabling the network to simultaneously model fine-grained details and long-range contextual dependencies.
Through the coordinated design of these components, the proposed network improves feature extraction, fusion, and representation, resulting in enhanced detection performance in complex agricultural scenarios.

3.2. Edge-Enhanced Feature Stem

In complex field environments, tobacco plants often exhibit blurred boundaries and are easily affected by background interference such as soil, weeds, and shadows. Conventional convolutional stem structures primarily rely on learned filters, which may be insufficient for capturing explicit structural information at early stages. To address this issue, an EEFS is proposed to improve low-level feature representation by incorporating edge priors.
The structure of EEFS is illustrated in Figure 2. The module consists of an initial convolution layer followed by a dual-branch design, including an edge extraction branch and a structural preservation branch. The input features are first processed by a convolution layer, and then split into two parallel branches: an edge extraction branch based on gradient operators and a pooling branch for structural information preservation. The outputs of the two branches are concatenated and further refined by convolution layers.
Given an input image $X \in \mathbb{R}^{H \times W \times C}$, the initial feature is first obtained through a convolution operation:
$$F_0 = \mathrm{Conv}(X)$$
Then, $F_0$ is fed into two parallel branches. The first branch extracts edge information using a gradient-based operator:
$$F_{\mathrm{edge}} = G(F_0)$$
where $G(\cdot)$ denotes an edge extraction function, such as a Sobel operator.
The second branch preserves structural and contextual information via a pooling operation:
$$F_{\mathrm{pool}} = \mathrm{Pool}(F_0)$$
The outputs of the two branches are concatenated along the channel dimension:
$$F_{\mathrm{cat}} = \mathrm{Concat}(F_{\mathrm{edge}}, F_{\mathrm{pool}})$$
Finally, the fused feature is refined through subsequent convolution layers to produce the output:
$$F_{\mathrm{out}} = \mathrm{Conv}(F_{\mathrm{cat}})$$
Compared with conventional stem structures, EEFS explicitly introduces edge priors into the early feature extraction stage, enabling the network to better capture contour and texture information of tobacco plants. Meanwhile, the lightweight design ensures that the computational overhead remains limited, making it suitable for practical agricultural deployment scenarios.
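The dual-branch computation above can be illustrated with a minimal single-channel sketch in plain Python. Several details are assumptions made only for illustration and are not specified in the text: the edge branch uses the Sobel gradient magnitude, the pooling branch uses 3×3 max pooling with stride 1, borders are zero-padded, and the initial and final convolutions are omitted. The helper names (`filter3x3`, `edge_branch`, `pool_branch`, `eefs_concat`) are hypothetical.

```python
from typing import List

Grid = List[List[float]]

# Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(grid: Grid, kernel: Grid) -> Grid:
    """Correlate a 2-D grid with a 3x3 kernel using zero padding."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[dy + 1][dx + 1] * grid[yy][xx]
            out[y][x] = acc
    return out


def edge_branch(f0: Grid) -> Grid:
    """G(F0): Sobel gradient magnitude (assumed form of the edge operator)."""
    gx, gy = filter3x3(f0, SOBEL_X), filter3x3(f0, SOBEL_Y)
    return [[(a * a + b * b) ** 0.5 for a, b in zip(ra, rb)]
            for ra, rb in zip(gx, gy)]


def pool_branch(f0: Grid) -> Grid:
    """Pool(F0): 3x3 max pooling, stride 1, same output size (assumption)."""
    h, w = len(f0), len(f0[0])
    return [[max(f0[yy][xx]
                 for yy in range(max(0, y - 1), min(h, y + 2))
                 for xx in range(max(0, x - 1), min(w, x + 2)))
             for x in range(w)] for y in range(h)]


def eefs_concat(f0: Grid) -> List[Grid]:
    """F_cat = Concat(F_edge, F_pool), returned as a list of two channels."""
    return [edge_branch(f0), pool_branch(f0)]


# A vertical step edge: left half 0, right half 1.
f0 = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
f_cat = eefs_concat(f0)
```

In this toy input the edge channel responds along the step between the second and third columns, while the pooled channel retains the coarse foreground region. These are the two complementary cues that the concatenation is meant to combine.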

3.3. Multi-Scale Kernel Interaction Module

To effectively handle significant scale variations of tobacco plants in field environments, a Multi-Scale Kernel Interaction (MSKI) module is proposed to enhance the feature extraction capability of the backbone network. The structure of the MSKI module is illustrated in Figure 3. The input features are first projected to a higher-dimensional space, followed by parallel depthwise convolutions with different kernel sizes to capture multi-scale information. The extracted features are then fused and refined through pointwise convolution. A channel attention mechanism is further applied to enhance discriminative feature representation.
Given an input feature map $F \in \mathbb{R}^{H \times W \times C}$, the MSKI module first applies a pointwise convolution to adjust the channel dimension:
$$F_1 = \mathrm{Conv}_{1 \times 1}(F)$$
Then, multiple depthwise convolutions with different kernel sizes are applied in parallel to capture features at various receptive fields:
$$F_k = \mathrm{DWConv}_{k \times k}(F_1), \quad k \in K$$
where $K = \{3, 5, 7, 9, 11\}$ denotes the set of kernel sizes.
The multi-scale features are aggregated by summation:
$$F_{\mathrm{ms}} = \sum_{k \in K} F_k$$
Subsequently, a pointwise convolution is used to further fuse channel information:
$$F_2 = \mathrm{Conv}_{1 \times 1}(F_{\mathrm{ms}})$$
To enhance the discriminative capability, a channel-wise attention mechanism is introduced:
$$F_{\mathrm{att}} = A(F_2)$$
where $A(\cdot)$ denotes the channel attention operation.
Finally, the output feature is obtained with an optional residual connection:
$$F_{\mathrm{out}} = F_{\mathrm{att}} + F$$
Compared with standard convolutional blocks, the MSKI module provides a larger effective receptive field while maintaining computational efficiency due to the use of depthwise convolutions. The interaction of multi-scale features enables the network to better adapt to objects with varying sizes, which is particularly important for detecting tobacco plants at different growth stages and distances.
In addition, for a $k \times k$ convolution over $C$ channels, the use of depthwise convolutions reduces the computational complexity per output position from $O(k^2 C^2)$ to $O(k^2 C)$, making the module more suitable for lightweight deployment.
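The complexity argument can be checked with simple arithmetic. The sketch below counts multiply-accumulate operations per output position for the five parallel branches with $K = \{3, 5, 7, 9, 11\}$, comparing standard and depthwise convolutions. The channel width of 64 is a hypothetical value chosen only for illustration, and the function names are not taken from the original implementation.

```python
KERNEL_SIZES = (3, 5, 7, 9, 11)  # K from the text


def standard_conv_macs(k: int, channels: int) -> int:
    """Multiply-accumulates per output position for a k x k conv, C -> C channels."""
    return k * k * channels * channels


def depthwise_conv_macs(k: int, channels: int) -> int:
    """Multiply-accumulates per output position for a k x k depthwise conv."""
    return k * k * channels


channels = 64  # hypothetical channel width, not a value from the paper
standard = sum(standard_conv_macs(k, channels) for k in KERNEL_SIZES)
depthwise = sum(depthwise_conv_macs(k, channels) for k in KERNEL_SIZES)
print(standard, depthwise, standard // depthwise)
```

Under these assumptions the saving factor equals the channel width $C$, which is why large kernels such as 9×9 and 11×11 remain affordable in the depthwise setting.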

3.4. Global–Local Synergistic Attention

In complex agricultural environments, tobacco plants are often affected by background clutter, occlusion, and inter-class similarity, which increases the difficulty of accurate detection. Conventional convolutional operations mainly focus on local receptive fields and lack the ability to model long-range dependencies. To address this limitation, a Global–Local Synergistic Attention (GLSA) module is introduced to enhance feature representation by jointly modeling local details and global contextual information.
The structure of the GLSA module is illustrated in Figure 4. The input feature map is first divided into two subsets along the channel dimension, which are processed by local and global branches, respectively.
Given an input feature map $F \in \mathbb{R}^{H \times W \times C}$, it is first divided into two parts along the channel dimension:
$$F = [F_{\mathrm{local}}, F_{\mathrm{global}}]$$
The local branch focuses on capturing fine-grained spatial information through convolutional operations:
$$F_l = L(F_{\mathrm{local}})$$
where $L(\cdot)$ denotes the local feature extraction function.
The global branch aims to model long-range dependencies and global context:
$$F_g = G(F_{\mathrm{global}})$$
where $G(\cdot)$ represents a global context modeling operation.
The outputs of the two branches are then concatenated and fused:
$$F_{\mathrm{fusion}} = \mathrm{Concat}(F_l, F_g)$$
Finally, a convolution operation is applied to integrate the fused features:
$$F_{\mathrm{out}} = \mathrm{Conv}(F_{\mathrm{fusion}})$$
By explicitly decoupling local and global feature modeling, the GLSA module enables the network to simultaneously capture detailed structural information and long-range contextual dependencies. This design improves the robustness of the model in complex field environments, particularly in scenarios involving occlusion and dense object distributions.
In addition, the lightweight design of the GLSA module introduces minimal computational overhead while providing significant improvements in feature discrimination.
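The split, process, and concatenate pattern of GLSA can be sketched as follows. The concrete branch operators are not specified in the text, so both are stand-ins chosen purely for illustration: the local branch is approximated by a 3×3 mean filter, and the global branch by gating each channel with a sigmoid of its global average. The final fusion convolution is omitted, and all function names are hypothetical.

```python
from math import exp
from typing import List, Tuple

Grid = List[List[float]]


def split_channels(feat: List[Grid], local_ratio: float = 0.5) -> Tuple[List[Grid], List[Grid]]:
    """F = [F_local, F_global]: split along the channel dimension."""
    n_local = int(len(feat) * local_ratio)
    return feat[:n_local], feat[n_local:]


def local_branch(chans: List[Grid]) -> List[Grid]:
    """Stand-in for L(.): 3x3 mean filter with zero padding (assumption)."""
    out = []
    for g in chans:
        h, w = len(g), len(g[0])
        out.append([[sum(g[yy][xx]
                         for yy in range(max(0, y - 1), min(h, y + 2))
                         for xx in range(max(0, x - 1), min(w, x + 2))) / 9.0
                     for x in range(w)] for y in range(h)])
    return out


def global_branch(chans: List[Grid]) -> List[Grid]:
    """Stand-in for G(.): gate each channel by a sigmoid of its global mean (assumption)."""
    out = []
    for g in chans:
        mean = sum(map(sum, g)) / (len(g) * len(g[0]))
        gate = 1.0 / (1.0 + exp(-mean))
        out.append([[gate * v for v in row] for row in g])
    return out


def glsa_fusion(feat: List[Grid], local_ratio: float = 0.5) -> List[Grid]:
    """F_fusion = Concat(L(F_local), G(F_global)); the final Conv is omitted."""
    f_local, f_global = split_channels(feat, local_ratio)
    return local_branch(f_local) + global_branch(f_global)


# Four 2x2 channels: two all-zero channels and two zero-mean checkerboards.
zeros = [[0.0, 0.0], [0.0, 0.0]]
checker = [[1.0, -1.0], [-1.0, 1.0]]
feat = [zeros, zeros, checker, checker]
fused = glsa_fusion(feat, local_ratio=0.5)
```

The `local_ratio` argument corresponds to the share of channels routed to the local branch, so other split ratios can be explored by changing a single value.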

3.5. Adaptive Weighted Feature Fusion

Multi-scale feature fusion plays a critical role in object detection, especially for targets with large scale variations. Conventional methods typically adopt feature concatenation, which treats features from different scales equally and may introduce redundant or less informative signals. To address this limitation, an Adaptive Weighted Feature Fusion (AWFF) mechanism is employed to improve the effectiveness of feature aggregation. The module takes multi-scale feature maps as input and performs weighted fusion with learnable parameters.
Given a set of input feature maps $\{F_i\}_{i=1}^{N}$ from different scales, the fused feature is computed as a weighted sum:
$$F_{\mathrm{fusion}} = \sum_{i=1}^{N} w_i \cdot F_i$$
where $w_i$ denotes the learnable weight corresponding to the $i$-th feature map.
To ensure numerical stability and effective learning, the weights are normalized as:
$$\hat{w}_i = \frac{w_i}{\sum_{j=1}^{N} w_j + \epsilon}$$
where $\epsilon$ is a small constant to avoid division by zero.
The final fused feature is then obtained as:
$$F_{\mathrm{out}} = \sum_{i=1}^{N} \hat{w}_i \cdot F_i$$
Compared with standard concatenation-based fusion, the proposed AWFF mechanism adaptively adjusts the contribution of each feature scale, allowing the network to emphasize more informative features while suppressing less relevant ones. This is particularly beneficial for detecting small and densely distributed tobacco plants in complex field environments.
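The normalized weighted fusion can be written in a few lines. In the sketch below, feature maps are flattened to one-dimensional lists of equal length. Two details are assumptions not stated in the text: the weights are clamped at zero before normalization so that the denominator stays non-negative, and the default $\epsilon$ is set to $10^{-4}$. In a trained network the weights would be learnable parameters rather than fixed numbers.

```python
from typing import List, Sequence


def normalize_weights(weights: Sequence[float], eps: float = 1e-4) -> List[float]:
    """w_hat_i = w_i / (sum_j w_j + eps); weights are clamped at zero first (assumption)."""
    clamped = [max(0.0, w) for w in weights]
    total = sum(clamped) + eps
    return [w / total for w in clamped]


def awff(features: Sequence[Sequence[float]], weights: Sequence[float],
         eps: float = 1e-4) -> List[float]:
    """F_out = sum_i w_hat_i * F_i for same-shape features flattened to 1-D."""
    w_hat = normalize_weights(weights, eps)
    return [sum(w * f[j] for w, f in zip(w_hat, features))
            for j in range(len(features[0]))]


# Two toy "scales"; the second is weighted three times as strongly as the first.
features = [[1.0, 2.0], [3.0, 4.0]]
fused = awff(features, weights=[1.0, 3.0], eps=0.0)
```

Unlike concatenation, the output keeps the same number of channels as each input, and the relative contribution of each scale is controlled by a single scalar.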

3.6. Summary of the Proposed Method

In this study, we present a comprehensive object detection framework tailored for complex field environments, particularly for tobacco plants. The network is composed of a backbone, a neck, and a detection head, augmented with several dedicated modules to improve feature extraction, representation, and fusion.
The Edge-Enhanced Feature Stem (EEFS) strengthens low-level feature representation by integrating gradient-based edge information and structural cues. The Multi-Scale Kernel Interaction (MSKI) modules capture contextual information at multiple scales, addressing the challenges of scale variation. The Global–Local Synergistic Attention (GLSA) module enhances feature discrimination by jointly modeling local details and global context. Finally, the Adaptive Weighted Feature Fusion (AWFF) mechanism adaptively aggregates multi-scale features, emphasizing informative signals and suppressing less relevant ones.
Through the coordinated design of these components, the proposed framework effectively improves the robustness and accuracy of detection in complex agricultural scenarios, providing a strong foundation for subsequent experimental validation.

4. Module Analysis and Ablation

To validate the design choices within each proposed module and provide deeper insights into their mechanisms, we conduct a series of internal ablation studies. The experiments are based on the YOLO11 baseline, with individual modules incrementally added and their key components analyzed.

4.1. Analysis of Edge-Enhanced Feature Stem (EEFS)

The EEFS module incorporates edge priors. We compare different edge extraction operators $G(\cdot)$ in Eq. (2) to identify the most effective one for tobacco plant contours. As shown in Table 1, the Sobel operator yields the best performance, improving $mAP_{50}$ by 2.05% over the baseline with minimal computational overhead. This suggests that the first-order gradient provided by Sobel is sufficient and efficient for enhancing low-level structural features in our agricultural context.

4.2. Analysis of Multi-Scale Kernel Interaction (MSKI)

The MSKI module employs a set of kernel sizes $K$. We experiment with different compositions of $K$ to balance receptive field and computational cost. Table 2 indicates that the combination $\{3, 5, 7, 9, 11\}$ achieves the best $mAP_{0.5:0.95}$, which is crucial for precise bounding box regression. Using only smaller kernels ($\{3, 5, 7\}$) is less effective for larger plants, while adding a larger kernel (13) yields diminishing returns and increased latency.

4.3. Analysis of Global–Local Synergistic Attention (GLSA)

The GLSA module splits channels for local and global processing. We analyze the impact of the split ratio. As presented in Table 3, a balanced split (50% local, 50% global) works best. Assigning too many channels to the global branch hurts fine-grained detail capture, while over-emphasizing the local branch reduces robustness to background clutter. The chosen ratio effectively balances detail and context.

4.4. Visualization of Module Effects

To intuitively demonstrate the function of each module, we visualize the feature maps at different stages. Figure 5 shows intermediate features from the baseline and our full model.
Comparing (b) and (c), the EEFS module successfully enhances the contour response of tobacco leaves. In (d) and (e), the synergistic effect of MSKI and GLSA is evident: the features exhibit stronger and more focused activations on the target plants, especially for overlapping leaves, while responses from soil and shadows are markedly reduced. These visualizations align with the quantitative improvements, confirming that each module operates as intended to refine the feature representation progressively.

5. Dataset Construction

The tobacco plant detection dataset was collected from Field 002 of the Tangdian region in Yunnan Province on June 28, 2024. To ensure the robustness of the model under realistic and complex agricultural conditions, a private dataset was constructed covering multiple weather conditions, flight altitudes, and observation angles. Image acquisition was performed using a UAV equipped with a 20-megapixel CMOS sensor. Data were captured in typical tobacco cultivation areas under both sunny conditions, with strong lighting and shadow interference, and overcast conditions, with low contrast. The UAV flight altitude was strictly maintained between 4 and 8 meters to generate a diverse set of multi-scale samples.
A critical aspect of the dataset design was the inclusion of oblique views deviating 15° to 45° from the nadir direction. This multi-angle setup not only simulates UAV attitude variations during dynamic operations but also enriches the geometric topology of tobacco plants through projective transformations. Consequently, the model is exposed to non-linear distortions, facilitating learning of more discriminative spatial details and significantly expanding the generalization capability in real-field scenarios, as illustrated in Figure 6.
This carefully designed dataset ensures that the proposed detection framework is trained on a diverse set of conditions, promoting robustness, scale-awareness, and angle-invariance, which are essential for practical UAV-based tobacco inspection.
For annotation, all tobacco plants were manually labeled with bounding boxes using the LabelImg tool in Pascal VOC format. Each bounding box was adjusted to fully cover the target plant, and the annotations were saved as XML files containing the box coordinates. During training, Mosaic data augmentation was employed to enhance small-target feature activation and increase sample diversity.
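For reference, Pascal VOC annotations of this kind can be read with the Python standard library alone. The sketch below parses an inline sample. The file name, image size, class label `tobacco`, and box coordinates are hypothetical placeholders, not values from the dataset.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Box = Tuple[str, int, int, int, int]  # (label, xmin, ymin, xmax, ymax)

# Hypothetical annotation in the layout written by LabelImg for Pascal VOC.
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>tobacco</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>140</ymax></bndbox>
  </object>
  <object>
    <name>tobacco</name>
    <bndbox><xmin>200</xmin><ymin>50</ymin><xmax>260</xmax><ymax>120</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc(xml_text: str) -> List[Box]:
    """Return one (label, xmin, ymin, xmax, ymax) tuple per annotated object."""
    root = ET.fromstring(xml_text.strip())
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        coords = [int(float(bb.findtext(tag)))
                  for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((name, *coords))
    return boxes


boxes = parse_voc(SAMPLE_XML)
plant_count = len(boxes)  # one box per plant, so this is the per-image count
```

Because each plant is annotated with exactly one box, the number of parsed objects directly gives the ground-truth plant count for an image.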
To better understand the dataset characteristics, Figure 7 and Figure 8 illustrate label distribution and correlations. Figure 7 presents a visual summary of the spatial distribution of annotated tobacco plants across the dataset. Figure 8 shows the correlation matrix of label occurrences across different image regions, highlighting density patterns and co-occurrence relationships. These analyses confirm that the dataset contains diverse scales, densities, and positional variations, providing a solid foundation for robust model training.

6. Experimental Results and Analysis

6.1. Evaluation Metrics

To comprehensively assess the balance between detection accuracy and inference efficiency of the proposed framework, a dual-dimensional evaluation system was established, encompassing both precision and computational efficiency.

6.1.1. Precision Metrics

For localization accuracy, the mean Average Precision (mAP) at an Intersection over Union (IoU) threshold of 0.5, denoted as $mAP_{0.5}$, was employed. Given a set of predicted bounding boxes $\{B_i^{p}\}$ and ground truth boxes $\{B_j^{gt}\}$, the IoU is computed as:
$$\mathrm{IoU}(B_i^{p}, B_j^{gt}) = \frac{|B_i^{p} \cap B_j^{gt}|}{|B_i^{p} \cup B_j^{gt}|}$$
A prediction is considered a True Positive (TP) if $\mathrm{IoU} \geq 0.5$; otherwise it is a False Positive (FP). Precision (P) and Recall (R) are defined as:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$
where $FN$ denotes False Negatives. Average Precision (AP) for a single class is calculated as the area under the precision-recall curve:
$$AP = \int_0^1 P(R)\, dR$$
The mean Average Precision across $C$ classes is then:
$$mAP = \frac{1}{C} \sum_{c=1}^{C} AP_c$$
To evaluate bounding box regression quality under stricter IoU requirements, the metric $mAP_{0.5:0.95}$ is computed as the average mAP over multiple IoU thresholds:
$$mAP_{0.5:0.95} = \frac{1}{T} \sum_{t=1}^{T} mAP_{IoU=t}$$
where $T$ denotes the number of IoU thresholds sampled between 0.5 and 0.95 at intervals of 0.05 (i.e., $T = 10$), and $mAP_{IoU=t}$ is the mAP evaluated at the $t$-th threshold.
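The definitions above can be made concrete with a small single-class, single-image sketch. It is a simplified illustration rather than the evaluation code used in the experiments: predictions are matched greedily in descending score order, the precision-recall curve is integrated without interpolation, and the toy boxes and scores are hypothetical. Standard evaluation toolkits aggregate over the whole dataset and typically interpolate the curve.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds: Sequence[Tuple[float, Box]], gts: Sequence[Box],
                      thr: float = 0.5) -> float:
    """Area under the raw (non-interpolated) precision-recall curve."""
    matched = set()
    tp = fp = 0
    ap = prev_recall = 0.0
    for _score, box in sorted(preds, key=lambda p: p[0], reverse=True):
        # Greedy matching against the best still-unmatched ground truth box.
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j not in matched:
                v = iou(box, gt)
                if v > best_iou:
                    best_j, best_iou = j, v
        if best_iou >= thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
        precision, recall = tp / (tp + fp), tp / len(gts)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap


def map_50_95(preds: Sequence[Tuple[float, Box]], gts: Sequence[Box]) -> float:
    """Mean AP over the T = 10 thresholds 0.50, 0.55, ..., 0.95 (one class)."""
    thresholds: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)


# Toy example: two ground truth plants, three scored predictions.
gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 28))]
ap50 = average_precision(preds, gts, thr=0.5)
ap50_95 = map_50_95(preds, gts)
```

In this example the third prediction overlaps its target with an IoU of 0.8, so it counts as a true positive at lenient thresholds but as a false positive at 0.85 and above. This illustrates why $mAP_{0.5:0.95}$ is the stricter of the two metrics.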

6.1.2. Efficiency Metrics

The computational cost is quantified in terms of the total number of model parameters (Params) and the floating-point operations (FLOPs). Let $L$ denote the set of all layers in the network, and let $k_l$, $c_{in}^{l}$, and $c_{out}^{l}$ denote the kernel size, input channels, and output channels of layer $l$, respectively. Then FLOPs for convolutional layers can be approximated as:
$$\mathrm{FLOPs} = \sum_{l \in L} 2 \cdot H_l \cdot W_l \cdot c_{in}^{l} \cdot c_{out}^{l} \cdot k_l^{2}$$
where $H_l$ and $W_l$ are the spatial dimensions of the output feature map. Neglecting bias terms, the parameters of the convolutional layers are given by:
$$\mathrm{Params} = \sum_{l \in L} c_{in}^{l} \cdot c_{out}^{l} \cdot k_l^{2}$$
The inference speed is measured by frames per second (FPS), defined as:
$$\mathrm{FPS} = \frac{N}{T_{total}}$$
where $N$ is the number of test images and $T_{total}$ is the total inference time. FPS serves as a key indicator for industrial applicability, demonstrating the real-time potential of the algorithm on UAV-mounted or edge-computing devices.
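As a worked example of these formulas, the sketch below evaluates Params, FLOPs, and FPS for a hypothetical two-layer convolutional stack. The layer shapes and timing values are illustrative assumptions and do not describe the proposed network. Bias terms are ignored, as in the formulas above.

```python
from typing import List, NamedTuple


class ConvLayer(NamedTuple):
    c_in: int    # input channels
    c_out: int   # output channels
    k: int       # kernel size
    h_out: int   # output feature map height
    w_out: int   # output feature map width


def conv_params(layer: ConvLayer) -> int:
    """c_in * c_out * k^2 (bias terms ignored)."""
    return layer.c_in * layer.c_out * layer.k ** 2


def conv_flops(layer: ConvLayer) -> int:
    """2 * H * W * c_in * c_out * k^2 (one multiply and one add per MAC)."""
    return 2 * layer.h_out * layer.w_out * conv_params(layer)


def fps(num_images: int, total_seconds: float) -> float:
    """Frames per second: N / T_total."""
    return num_images / total_seconds


# Hypothetical layers: 3 -> 16 channels at 320x320, then 16 -> 32 at 160x160.
layers: List[ConvLayer] = [ConvLayer(3, 16, 3, 320, 320), ConvLayer(16, 32, 3, 160, 160)]
total_params = sum(conv_params(l) for l in layers)
total_flops = sum(conv_flops(l) for l in layers)
speed = fps(num_images=500, total_seconds=4.0)
```

The example also shows that the second layer dominates the cost despite its smaller spatial size, because the product of input and output channels grows faster than the resolution shrinks.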

6.1.3. Overall Evaluation System

By jointly considering $mAP_{0.5}$, $mAP_{0.5:0.95}$, Params, FLOPs, and FPS, the proposed dual-dimensional evaluation system captures both detection performance and computational efficiency. This framework allows for a fair and comprehensive comparison between different models, highlighting trade-offs between accuracy and speed in practical tobacco plant detection scenarios.

6.2. Comparison with State-of-the-Art Methods

To comprehensively evaluate the effectiveness of the proposed framework, we compare it with several state-of-the-art (SOTA) object detection models on our tobacco plant dataset. The compared methods include:
  • Two-stage detectors: Faster R-CNN [21] with ResNet-50 and FPN.
  • One-stage detectors: YOLOv8 [22], YOLOv10 [23], and PP-YOLOE+ [24].
  • Transformer-based detectors: DINO-DETR [25] (a representative DETR-like model with improved performance on small objects).
All models are trained and evaluated under the same experimental setup described in Section 5.1. The comparison results are summarized in Table 4.
As shown in Table 4, the proposed method achieves the highest detection accuracy among all compared models, with an $mAP_{50}$ of 91.2% and an $mAP_{0.5:0.95}$ of 41.5%. Specifically:
  • Compared to the two-stage detector Faster R-CNN, our method achieves a significant improvement of 5.1% in $mAP_{50}$ while being over 11× faster in inference speed. This demonstrates the advantage of one-stage design for real-time UAV applications.
  • Against the latest YOLO series variants (YOLOv8n and YOLOv10n), our method shows clear gains in accuracy (e.g., +2.7% $mAP_{50}$ over YOLOv8n) with only a marginal increase in parameters and computational cost. This validates the effectiveness of our dedicated modules (EEFS, MSKI, GLSA, AWFF) in enhancing feature representation for complex agricultural scenes.
  • The transformer-based DINO-DETR achieves competitive accuracy (90.1% $mAP_{50}$), benefiting from its global attention mechanism. However, its heavy computational burden (279.3 GFLOPs) and low FPS (9.5) make it impractical for edge deployment on UAVs. In contrast, our method maintains a favorable accuracy-efficiency trade-off.
  • Compared to our own baseline YOLO11, the full model improves $mAP_{50}$ by 2.6% and $mAP_{0.5:0.95}$ by 0.6%, confirming the cumulative contribution of the proposed modules. The slight increase in GFLOPs (from 7.5 to 7.8) and minor FPS drop are acceptable given the substantial accuracy gain.
These results indicate that our framework not only surpasses existing SOTA methods in terms of detection precision but also maintains high inference efficiency suitable for real-time UAV-based inspection.

6.3. Ablation Study

To investigate the contributions of each proposed module, ablation experiments were conducted on the tobacco detection task. The baseline model is YOLO11, and the additional modules were added incrementally: the Edge-Enhanced Feature Stem (EEFS), the Multi-Scale Kernel Interaction (MSKI), and the Global–Local Synergistic Attention combined with Adaptive Weighted Feature Fusion (GLSA+AWFF). Table 5 summarizes the results.
The ablation study results presented in Table 5 systematically validate the contribution of each proposed module. The integration of the Edge-Enhanced Feature Stem (EEFS) leads to a notable improvement, raising the $mAP_{50}$ from 0.8862 to 0.9067. This gain underscores the role of explicitly incorporating edge and gradient information into low-level features, which provides stronger spatial cues and sharper boundaries, thereby enhancing the initial localization accuracy of tobacco plants within cluttered aerial imagery.
Introducing the Multi-Scale Kernel Interaction (MSKI) module addresses the challenge of detecting targets with considerable size variation. Its impact is reflected in the increase of $mAP_{0.5:0.95}$ from 0.4094 to 0.4100. Although this gain is small, the metric averages over multiple IoU thresholds and is therefore more sensitive to the quality of bounding box regression. The result is consistent with the module helping the network capture more robust multi-scale representations, which is crucial for accurately detecting small, distant, or partially occluded tobacco plants that are common in UAV perspectives.
The combination of the Global–Local Synergistic Attention (GLSA) and Adaptive Weighted Feature Fusion (AWFF) mechanisms works synergistically to refine feature discrimination. GLSA suppresses irrelevant background activations and emphasizes informative plant structures, while AWFF balances contributions from different network stages. This dual strategy stabilizes and consolidates performance gains, as evidenced by the sustained high scores in both $mAP_{50}$ and $mAP_{0.5:0.95}$, while simultaneously reducing the total parameter count. This demonstrates an effective design principle for achieving more powerful feature representation without simply increasing model capacity.
The full integration of all modules yields the best performance, achieving the highest $mAP_{50}$ of 0.9123 and $mAP_{0.5:0.95}$ of 0.4152. This outcome demonstrates a clear synergistic effect where the components complement each other: edge-aware features provide precise low-level cues, multi-scale interactions ensure robustness across object sizes, and attention-guided fusion enhances feature selectivity. This cascaded refinement process from low-level to high-level semantics is key to the model’s superior detection accuracy.
Regarding computational efficiency, the full model exhibits a slight increase in GFLOPs to 7.8 and a corresponding minor decrease in FPS compared to the baseline. This represents a highly favorable trade-off, where a relatively small incremental computational cost yields substantial gains in precision and robustness. The model’s efficiency remains well within the operational requirements for real-time UAV-based tobacco inspection, making the proposed architecture not only effective but also practical for deployment in agricultural monitoring scenarios.
To further validate the effectiveness of the proposed modules, we generated corresponding feature map visualizations. As shown in Figure 9, we compare the attention maps produced by the baseline model, the model with partial modules, and the full model. It is evident that the full model achieves the most precise focus on the core regions of the tobacco plants, while effectively suppressing noise from complex backgrounds such as soil and shadows. Furthermore, the full model demonstrates significantly enhanced feature activations on small, overlapping, and marginal target areas, with more distinct and concentrated attention responses. These visual comparisons intuitively confirm the complementary benefits of each module: the localization enhancement mechanism helps the model concentrate on foreground targets, the multi-scale fusion structure improves adaptability to objects of varying sizes, and the feature refinement path further suppresses background interference while strengthening detailed features. Therefore, the visualization results align with the quantitative metrics, jointly demonstrating the effectiveness of the proposed modules in improving the model’s perceptual accuracy and robustness.

6.4. Ablation Study and Comparative Analysis

Table 6 presents the detection performance of mainstream YOLO models and the proposed full method on the tobacco dataset. The metrics include GFLOPs, parameter count, mAP0.5, mAP0.5:0.95, and FPS for both total processing and inference only. The comparison includes YOLOv5n, YOLOv8n, YOLO12n, the YOLO11 baseline, and the proposed full method incorporating EEFS, MSKI, AWFF, and GLSA modules.
From the table, the YOLO11 baseline achieves an mAP0.5 of 88.62% and an mAP0.5:0.95 of 40.26%. The other YOLO variants (YOLOv5n, YOLOv8n, YOLO12n) perform comparably, with differences within 1 percentage point, indicating that they offer limited improvement over YOLO11 on this dataset.
By integrating EEFS, MSKI, AWFF, and GLSA modules, the full method substantially enhances detection performance, achieving an mAP0.5 of 91.23% and mAP0.5:0.95 of 41.52%. The full model also maintains a reasonable computational cost with 7.8 GFLOPs and 2.10M parameters, while achieving FPS suitable for near real-time UAV deployment.
This improvement reflects that each proposed module contributes to better feature extraction, multi-scale context modeling, adaptive feature fusion, and global-local attention, collectively enhancing detection accuracy and robustness in complex tobacco field environments.
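As a quick check of the comparison above, the sketch below recomputes two quantities from the accuracy columns of Table 6: the spread among the four reference YOLO models and the margin of the full method over the best of them. The helper name `spread_and_margin` is hypothetical, and the model labels are written as in the running text.

```python
# Rows copied from Table 6: (model, mAP0.5, mAP0.5:0.95).
rows = [
    ("YOLOv5n", 0.8900, 0.4050),
    ("YOLOv8n", 0.8880, 0.4060),
    ("YOLO11 (baseline)", 0.8862, 0.4026),
    ("YOLO12n", 0.8890, 0.4070),
    ("Ours", 0.9123, 0.4152),
]


def spread_and_margin(table, ours="Ours"):
    """Return {metric: (spread among other models, margin of ours over the best)}.

    Both values are expressed in percentage points.
    """
    others = [row for row in table if row[0] != ours]
    target = next(row for row in table if row[0] == ours)
    result = {}
    for idx, name in ((1, "mAP0.5"), (2, "mAP0.5:0.95")):
        values = [row[idx] for row in others]
        spread = (max(values) - min(values)) * 100
        margin = (target[idx] - max(values)) * 100
        result[name] = (round(spread, 2), round(margin, 2))
    return result


comparison = spread_and_margin(rows)
for metric, (spread, margin) in comparison.items():
    print(f"{metric}: spread {spread} pts, margin over best reference {margin} pts")
```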
Figure 10 illustrates a qualitative comparison of detection results between the YOLO11 baseline and the proposed full method on representative tobacco field images. As shown, the YOLO11 baseline tends to miss densely clustered or small-scale tobacco plants, and its bounding boxes are sometimes misaligned due to scale variation and complex background interference.
In contrast, the proposed full method, integrating EEFS, MSKI, AWFF, and GLSA modules, demonstrates a significant improvement in detection performance. It accurately identifies more tobacco plants, including those in dense or occluded regions, and provides tighter and more precise bounding boxes. The enhanced feature representation and multi-scale context modeling contribute to improved robustness against viewpoint variations and complex field conditions, resulting in more reliable detection across diverse scenarios.
Figure 11 presents the training curves of the YOLO11 baseline and the proposed full method, including precision, recall, mAP0.5, and mAP0.5:0.95. The curves show that the proposed method consistently outperforms YOLO11 across all evaluation metrics throughout the training process. Notably, both models were trained with an early stopping strategy, and training terminated once the models had converged, at around the 175th epoch.
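For readers unfamiliar with early stopping, the following is a minimal sketch of the usual patience-based rule: training halts once the monitored validation metric has not improved for a fixed number of epochs. The class name, the `patience` and `min_delta` values, and the toy metric curve are all illustrative assumptions; the paper does not state the settings used.

```python
class EarlyStopping:
    """Stop training when a monitored metric stops improving.

    patience and min_delta are illustrative, not the paper's settings.
    """

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.best_epoch = -1
        self.stale_epochs = 0

    def step(self, epoch, metric):
        """Record one epoch's metric; return True when training should stop."""
        if metric > self.best + self.min_delta:
            self.best = metric
            self.best_epoch = epoch
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


# Toy validation curve (made-up numbers) that plateaus after epoch 3.
toy_map = [0.50, 0.62, 0.70, 0.74, 0.74, 0.73, 0.74, 0.72]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, value in enumerate(toy_map):
    if stopper.step(epoch, value):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

In practice the weights from the best epoch, rather than the last one, are normally kept for evaluation.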
It can be observed that the proposed full method not only achieves higher precision and recall values but also exhibits superior bounding box regression quality, as indicated by the elevated mAP0.5 and mAP0.5:0.95 curves. This validates that the integration of EEFS, MSKI, AWFF, and GLSA modules enhances feature extraction and multi-scale representation, contributing to more stable and accurate detection in complex tobacco field scenarios.
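The link between mAP0.5:0.95 and bounding box quality follows from how the metric is defined: it averages AP over ten IoU thresholds from 0.50 to 0.95 in steps of 0.05, so a loosely aligned box that still counts as correct at IoU 0.5 is rejected at the stricter thresholds. The sketch below illustrates this with made-up boxes in `(x1, y1, x2, y2)` format; the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds averaged by mAP0.5:0.95 (0.50, 0.55, ..., 0.95).
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Return the IoU thresholds at which pred would count as a true positive."""
    score = iou(pred, truth)
    return [t for t in THRESHOLDS if score >= t]


ground_truth = (0, 0, 100, 100)
loose_box = (20, 0, 120, 100)  # shifted by 20 px: correct at 0.5, not at 0.75
tight_box = (5, 0, 105, 100)   # shifted by 5 px: correct at most thresholds

for name, box in (("loose", loose_box), ("tight", tight_box)):
    passed = thresholds_passed(box, ground_truth)
    print(f"{name}: IoU={iou(box, ground_truth):.3f}, passes {len(passed)} of 10")
```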

6.5. Multi-Dimensional Robustness Analysis

To further evaluate the reliability and generalization capability of the proposed method in real-world complex field conditions, robustness tests were conducted on our custom tobacco dataset across multiple dimensions, including varying weather conditions (sunny, cloudy), acquisition heights (5 m, 7 m), and observation angles (orthogonal, oblique). The experimental findings are as follows.
In the orthogonal view subset, the EEFS contributes to more accurate low-level feature extraction, enabling the model to precisely delineate tobacco plant contours even under varying lighting conditions. As a result, detection confidence and recall are improved compared to the YOLO11 baseline, especially for small targets affected by shadows or low contrast.
Under oblique view scenarios, the MSKI and GLSA modules collaboratively enhance the model’s ability to handle geometric distortions and partial occlusions caused by UAV attitude variations. MSKI enlarges the receptive field to capture multi-scale contextual cues, while GLSA jointly models local details and global context, allowing the model to maintain high detection accuracy. The AWFF further strengthens feature aggregation across scales, ensuring that deformed edge features are correctly recognized. Compared with the YOLO11 baseline, detection confidence under oblique views is increased by approximately 4%, effectively reducing misdetections of overlapping targets.
Regarding lighting and scale variations, EEFS combined with AWFF ensures that both fine-grained and high-level features are consistently represented. In high-altitude captures or under low-light conditions, the model maintains stable precision and recall, demonstrating robustness across different environmental perturbations.
Overall, the proposed full method exhibits superior robustness under multi-dimensional complex scenarios. This is attributed to the precise low-level feature extraction of EEFS, the multi-scale contextual modeling of MSKI, the adaptive fusion mechanism of AWFF, and the cross-level attention modeling of GLSA. These modules synergistically suppress spatial deformation and environmental noise, providing a reliable foundation for automated tobacco plant counting on UAV platforms. Table 7 presents detailed performance metrics across different scenarios.
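To compare the sensitivity of the model to each perturbation, the mAP50 gap between the two scenarios of every subset in Table 7 can be computed directly. This is a small sketch over the reported values only; the dictionary layout and the function name `map50_gaps` are illustrative.

```python
# mAP50 (%) per scenario, copied from Table 7.
table7 = {
    "Viewpoint": {"Orthogonal": 89.6, "Oblique": 85.2},
    "Illumination": {"Sunny": 91.5, "Cloudy": 89.8},
    "Height": {"4 m": 92.8, "8 m": 89.5},
}


def map50_gaps(table):
    """Return {subset: (best scenario, worst scenario, gap in percentage points)}."""
    gaps = {}
    for subset, scenarios in table.items():
        best = max(scenarios, key=scenarios.get)
        worst = min(scenarios, key=scenarios.get)
        gaps[subset] = (best, worst, round(scenarios[best] - scenarios[worst], 1))
    return gaps


gaps = map50_gaps(table7)
for subset, (best, worst, gap) in gaps.items():
    print(f"{subset}: {best} vs. {worst}, gap {gap} pts")

# Subset with the largest gap, i.e. the perturbation the model is most sensitive to.
most_sensitive = max(gaps, key=lambda name: gaps[name][2])
print(f"largest gap: {most_sensitive}")
```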

7. Discussion

7.1. Interpretation of Module Synergy

The ablation studies (Section 6 and Table 5) demonstrate that the proposed modules contribute cumulatively to the final performance. This synergy can be interpreted from a feature representation learning perspective. The EEFS module acts as a strong spatial prior at the input stage, injecting gradient information that mitigates the low-contrast boundary issue common in field imagery. This provides a cleaner starting point for subsequent processing. The MSKI module then builds upon this by offering adaptive receptive fields, allowing the network to dynamically integrate context from multiple scales, which is essential for targets viewed from varying UAV altitudes. Finally, the GLSA and AWFF modules work in tandem to perform feature selection and refinement: GLSA suppresses spatially irrelevant background features, while AWFF adaptively weights features from different depths, ensuring that both fine-grained details and high-level semantics are utilized effectively. This cascaded refinement, from enhancing low-level cues, to modeling multi-scale context, to performing attentive fusion, forms the core of our method’s robustness.
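The idea of weighting features from different depths can be illustrated with a generic fusion rule in which learnable scalars are normalized by a softmax and used to blend same-shaped feature vectors. This is only a sketch of one common formulation, assumed here for illustration; it is not taken from the AWFF definition, and the names `softmax`, `weighted_fusion`, `shallow`, and `deep` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def weighted_fusion(features, logits):
    """Blend same-length feature vectors with softmax-normalized weights.

    logits stand in for learnable parameters; this is an assumed, generic
    formulation rather than the module defined in the paper.
    """
    if len(features) != len(logits):
        raise ValueError("one logit is required per feature vector")
    weights = softmax(logits)
    length = len(features[0])
    fused = [
        sum(weight * feature[i] for weight, feature in zip(weights, features))
        for i in range(length)
    ]
    return fused, weights


shallow = [1.0, 0.0, 2.0]  # toy fine-grained features
deep = [0.0, 4.0, 2.0]     # toy high-level semantic features

# Equal logits give a plain average of the two inputs.
fused_equal, weights_equal = weighted_fusion([shallow, deep], [0.0, 0.0])

# A larger logit shifts the blend toward the deep features (weights 0.25 / 0.75).
fused_deep, weights_deep = weighted_fusion([shallow, deep], [0.0, math.log(3.0)])

print(weights_equal, fused_equal)
print(weights_deep, fused_deep)
```

Because the weights are non-negative and sum to one, the fused output stays on the same scale as its inputs, which is one reason normalized weights are often preferred over unconstrained ones.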

7.2. Robustness and Generalization in Practical Scenarios

The multi-dimensional robustness analysis (Section 6.5 and Table 7) confirms that our model maintains stable performance under varying viewpoints, illumination, and scales. The key to this robustness lies in the complementary nature of our modules. For instance, the performance under oblique views benefits significantly from MSKI’s enlarged context modeling and GLSA’s global dependency capture, which help overcome geometric distortions. The consistent performance across lighting conditions can be attributed to EEFS’s gradient-based features, which are inherently less sensitive to absolute intensity changes compared to raw RGB values. These attributes suggest strong potential for generalization to other crop types or UAV-based phenotyping tasks where similar challenges (scale, occlusion, variable lighting) exist. However, the model’s effectiveness relies on visible plant structures; severe occlusion (e.g., complete covering by mulch) or extreme weather conditions (heavy rain, fog) remain challenging, as they obscure the visual features the model depends on.
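The claim that gradient-based features are less sensitive to absolute intensity can be checked on a toy example. Because the Sobel kernels sum to zero, adding a constant brightness offset to a patch leaves the response unchanged. The sketch below uses plain Python lists and an L1 gradient magnitude (|gx| + |gy|) to keep the arithmetic exact; the patch values and function names are illustrative and are not taken from the EEFS implementation.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(image, kernel):
    """Apply a 3x3 kernel (cross-correlation) over the valid interior of a 2D list."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            acc = 0
            for dr in range(3):
                for dc in range(3):
                    acc += kernel[dr][dc] * image[r + dr - 1][c + dc - 1]
            row.append(acc)
        out.append(row)
    return out


def sobel_magnitude(image):
    """L1 gradient magnitude |gx| + |gy| (an approximation of the Euclidean norm)."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [[abs(x) + abs(y) for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


# Toy grayscale patch: dark soil on the left, a brighter leaf on the right.
patch = [
    [10, 10, 10, 80, 80],
    [10, 10, 10, 80, 80],
    [10, 10, 10, 80, 80],
    [10, 10, 10, 80, 80],
]
# The same patch under uniformly brighter illumination (additive offset).
brighter = [[value + 50 for value in row] for row in patch]

edges = sobel_magnitude(patch)
edges_brighter = sobel_magnitude(brighter)
print(edges)
print(edges == edges_brighter)
```

Note that this invariance holds for additive offsets only; a multiplicative change in illumination scales the gradient response by the same factor.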

7.3. Limitations and Future Work

Despite its advantages, the proposed framework has limitations that point to future research directions. First, while efficient, the model is still a purely visual solution. Integrating multi-spectral or thermal data from UAV sensors could provide complementary information for detecting stressed or early-growth plants, enhancing functionality for precision agriculture. Second, the current method performs per-image detection. Developing a video-based or temporal modeling approach could leverage continuity across UAV video frames to further improve accuracy and stability. Third, for large-scale deployment, model compression techniques such as quantization or neural architecture search could be explored to push the FPS even higher on low-power edge devices. Finally, creating a large-scale, publicly benchmarked dataset for UAV-based crop detection would significantly benefit the community and facilitate more rigorous comparisons.

8. Conclusion

This study presents an improved YOLO11-based framework for automated tobacco plant detection in complex field environments. The proposed method incorporates four key modules: the Edge-Enhanced Feature Stem (EEFS) for enhanced low-level feature extraction, the Multi-Scale Kernel Interaction (MSKI) for capturing multi-scale contextual information, the Adaptive Weighted Feature Fusion (AWFF) for effective feature aggregation, and the Global–Local Synergistic Attention (GLSA) for improved feature discrimination.
Extensive experiments on a custom UAV tobacco dataset demonstrate that the proposed framework significantly outperforms the YOLO11 baseline. Ablation studies confirm that each module contributes positively to detection performance, while heatmap visualizations illustrate the effectiveness of the modules in highlighting tobacco plant features. The model also exhibits strong robustness across multiple environmental variations, including different lighting conditions, collection heights, and oblique camera angles.
In summary, the improved YOLO11-based framework provides a reliable, accurate, and computationally efficient solution for automated tobacco plant counting, offering practical potential for UAV-based precision agriculture and crop monitoring.

Abbreviations

AP Average Precision
AWFF Adaptive Weighted Feature Fusion
CMOS Complementary Metal-Oxide-Semiconductor
DETR DEtection TRansformer
EEFS Edge-Enhanced Feature Stem
FLOPs Floating-Point Operations
FN False Negative
FP False Positive
FPN Feature Pyramid Network
FPS Frames Per Second
GLSA Global–Local Synergistic Attention
IoU Intersection over Union
mAP mean Average Precision
MSKI Multi-Scale Kernel Interaction
Params Parameters
R-CNN Region-based Convolutional Neural Network
SSD Single Shot MultiBox Detector
TP True Positive
UAV Unmanned Aerial Vehicle
VOC Visual Object Classes
XML eXtensible Markup Language
YOLO You Only Look Once

References

  1. Hartanto, S. Sustainable Agriculture in The Tobacco Industry: Future Trends and Challenges. ASA Agribus. Sustain. Agric. 2025, 1, 1.
  2. Shahid, R.; Qureshi, W.S.; Khan, U.S.; Munir, A.; Zeb, A.; Moazzam, S.I. Aerial imagery-based tobacco plant counting framework for efficient crop emergence estimation. Comput. Electron. Agric. 2024, 217, 108557.
  3. Laveglia, S.; Altieri, G.; Genovese, F.; Matera, A.; Di Renzo, G.C. Advances in sustainable crop management: integrating precision agriculture and proximal sensing. AgriEngineering 2024, 6, 3084–3120.
  4. Zaman, Q. Precision agriculture: Evolution, insights and emerging trends; Elsevier, 2023.
  5. Wang, L.; Ai, Q.; Shen, X. Multi-scale Lightweight Algorithm for UAV Aerial Target Detection. Eng. Lett. 2024, 32, 2324–2335.
  6. Wang, G.; Zhang, Y.; Ai, Q. Lightweight Aerial Target Detection Algorithm with Enhanced Small Target Perception. IAENG Int. J. Comput. Sci. 2024, 51, 2123–2134.
  7. Wang, L.; Ai, Q.; Jin, H. Lightweight traffic sign detection algorithm with noise suppression and semantic enhancement. PLoS ONE 2026, 21, e0340810.
  8. Zhang, W.; Cao, H.; Ji, D.; You, D.; Wu, J.; Zhang, H.; Guo, Y.; Zhang, M.; Wang, Y. Angle Effects in UAV Quantitative Remote Sensing: Research Progress, Challenges and Trends. Drones 2025, 9, 665.
  9. Zhao, Z.H.; Sun, H.; Zhang, N.X.; Xing, T.H.; Cui, G.H.; Lai, J.X.; Liu, T.; Bai, Y.B.; He, H.J. Application of unmanned aerial vehicle tilt photography technology in geological hazard investigation in China. Nat. Hazards 2024, 120, 11547–11578.
  10. Wang, Y.; Tang, Z.; Qian, G.; Xu, W.; Huang, X.; Fang, H. A prototype of a lightweight structural health monitoring system based on edge computing. Sensors 2025, 25, 5612.
  11. Li, Z.; Zhao, L.; Lu, Y.; Yue, M.; Li, G. Mamba for Remote Sensing: Architectures, Hybrid Paradigms, and Future Directions. Preprints 2025.
  12. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276.
  13. Edozie, E.; Shuaibu, A.N.; John, U.K.; Sadiq, B.O. Comprehensive review of recent developments in visual object detection based on deep learning. Artif. Intell. Rev. 2025, 58, 277.
  14. Han, Y. Comparative Analysis of Two-Stage and One-Stage Object Detection Models. arXiv 2025.
  15. Saraei, M.; Lalinia, M.; Lee, E.J. Deep learning-based medical object detection: A survey. IEEE Access.
  16. Mohammed, S.Y. Architecture review: Two-stage and one-stage object detection. Frankl. Open 2025, 100322.
  17. Karbouj, B.; Topalian-Rivas, G.A.; Krüger, J. Comparative performance evaluation of one-stage and two-stage object detectors for screw head detection and classification in disassembly processes. Procedia CIRP 2024, 122, 527–532.
  18. Jiao, L.; Wang, M.; Liu, X.; Li, L.; Liu, F.; Feng, Z.; Yang, S.; Hou, B. Multiscale deep learning for detection and recognition: A comprehensive survey. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 5900–5920.
  19. Arwidiyarti, D.; et al. Single shot multibox detector (SSD) in object detection: a review. IJACI Int. J. Adv. Comput. Inform. 2025, 1, 118–127.
  20. Saini, N.; Patel, D.; Das, D.; Chattopadhyay, C. MVDNet: UAV Based Multi-Modal Multi-Vehicle Anchor Free Detection. IEEE Trans. Veh. Technol. 2025.
  21. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Adv. Neural Inf. Process. Syst. 2015, 28.
  22. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. 2023. https://github.com/ultralytics/ultralytics.
  23. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
  24. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250.
  25. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. arXiv 2022, arXiv:2203.03605.
  26. Yasmeen, A.; Daescu, O. Recent Research Progress on Ground-to-Air Vision-Based Anti-UAV Detection and Tracking Methodologies: A Review. Drones 2025, 9.
Figure 1. Overall architecture of the proposed detection framework.
Figure 2. Structure of the EEFS.
Figure 3. Structure of the MSKI module.
Figure 4. Structure of the GLSA module.
Figure 5. Feature map visualizations. (a) Input image. (b) Baseline low-level features: blurred and noisy. (c) Our low-level features after EEFS: plant edges are significantly sharper. (d) Baseline high-level features: activations are diffuse. (e) Our high-level features after MSKI and GLSA: activations are precisely concentrated on tobacco plants, with background effectively suppressed.
Figure 6. Overview of the constructed tobacco plant dataset, illustrating multi-weather, multi-altitude, and multi-angle scenarios.
Figure 7. Spatial distribution of tobacco plant annotations in the dataset.
Figure 8. Correlation matrix of label occurrences across image regions, highlighting density and co-occurrence patterns.
Figure 9. Feature map visualizations showing the influence of each module on target activation.
Figure 10. Qualitative comparison of detection results between YOLO11 baseline and the proposed full method.
Figure 11. Training curves of YOLO11 baseline and the proposed full method, showing precision, recall, mAP0.5, and mAP0.5:0.95.
Table 1. Ablation study on edge extraction operators within the EEFS module.

| Edge Operator G(·) | mAP50 | mAP0.5:0.95 | GFLOPs |
|---|---|---|---|
| None (Baseline) | 0.8862 | 0.4026 | 6.3 |
| Sobel | 0.9067 | 0.4094 | 6.5 |
| Scharr | 0.9041 | 0.4089 | 6.5 |
| Laplacian | 0.8988 | 0.4055 | 6.5 |

Table 2. Ablation study on kernel size combinations within the MSKI module.

| Kernel Set K | mAP50 | mAP0.5:0.95 | FPS-I |
|---|---|---|---|
| {3, 5, 7} | 0.8941 | 0.4082 | 88.5 |
| {3, 5, 7, 9} | 0.8958 | 0.4095 | 85.2 |
| {3, 5, 7, 9, 11} (Ours) | 0.8964 | 0.4100 | 84.0 |
| {3, 5, 7, 9, 11, 13} | 0.8962 | 0.4100 | 79.8 |

Table 3. Ablation study on channel split ratio within the GLSA module.

| Local:Global Ratio | mAP50 | mAP0.5:0.95 | Params (M) |
|---|---|---|---|
| No GLSA (Baseline) | 0.8862 | 0.4026 | 2.58 |
| 25%:75% | 0.8835 | 0.4031 | 2.08 |
| 50%:50% (Ours) | 0.8883 | 0.4045 | 2.07 |
| 75%:25% | 0.8870 | 0.4038 | 2.08 |

Table 4. Performance comparison with state-of-the-art object detection methods on the tobacco plant dataset. The best results are highlighted in bold.

| Method | mAP50 (%) | mAP0.5:0.95 (%) | Params (M) | GFLOPs | FPS | Backbone |
|---|---|---|---|---|---|---|
| Faster R-CNN | 86.1 | 38.5 | 41.2 | 180.5 | 14.2 | ResNet-50 |
| YOLOv8n | 87.8 | 39.2 | 3.0 | 8.2 | 156.3 | CSPDarknet |
| YOLOv10n | 88.5 | 39.8 | 2.7 | 7.9 | 165.0 | CSPDarknet |
| PP-YOLOE+ (s) | 89.3 | 40.5 | 7.1 | 17.2 | 98.7 | CSPResNet |
| DINO-DETR (4-scale) | 90.1 | 41.0 | 47.8 | 279.3 | 9.5 | ResNet-50 |
| YOLO11 (Baseline) | 88.6 | 40.9 | 2.9 | 7.5 | 170.5 | CSPDarknet |
| Ours (Full Model) | 91.2 | 41.5 | 3.2 | 7.8 | 162.0 | Modified CSPDarknet |

Table 5. Ablation experiment results of different modules on tobacco detection.

| Model | GFLOPs | Params | mAP50 | mAP0.5:0.95 | FPS-T | FPS-I |
|---|---|---|---|---|---|---|
| YOLO11 (baseline) | 6.3 | 2.58M | 0.8862 | 0.4026 | 77.32 | 167.07 |
| +EEFS | 6.5 | 2.58M | 0.9067 | 0.4094 | 70.35 | 136.45 |
| +MSKI | 7.7 | 2.58M | 0.8964 | 0.4100 | 54.83 | 83.99 |
| +GLSA+AWFF | 6.7 | 2.07M | 0.8883 | 0.4045 | 68.87 | 112.61 |
| Full model (ours) | 7.8 | 2.10M | 0.9123 | 0.4152 | 66.42 | 110.25 |

Table 6. Comparison of tobacco detection performance of mainstream YOLO models and the proposed full method.

| Model | GFLOPs | Params | mAP0.5 | mAP0.5:0.95 | FPS-T | FPS-I |
|---|---|---|---|---|---|---|
| YOLOv5n | 6.9 | 2.50M | 0.8900 | 0.4050 | 75.00 | 160.0 |
| YOLOv8n | 7.1 | 3.01M | 0.8880 | 0.4060 | 72.50 | 158.0 |
| YOLO11n | 6.3 | 2.58M | 0.8862 | 0.4026 | 77.32 | 167.07 |
| YOLO12n | 6.8 | 2.47M | 0.8890 | 0.4070 | 74.20 | 159.0 |
| Ours | 7.8 | 2.10M | 0.9123 | 0.4152 | 66.42 | 110.25 |

Table 7. Robustness evaluation under different complex scenarios.

| Subset | Scenario | mAP50 (%) | Precision (%) | Recall (%) |
|---|---|---|---|---|
| Viewpoint | Orthogonal | 89.6 | 80.5 | 91.7 |
| Viewpoint | Oblique | 85.2 | 78.3 | 89.0 |
| Illumination | Sunny | 91.5 | 85.8 | 94.9 |
| Illumination | Cloudy | 89.8 | 84.2 | 93.1 |
| Height | 4 m | 92.8 | 85.9 | 94.8 |
| Height | 8 m | 89.5 | 83.7 | 91.9 |
| Overall | Overall | 91.2 | 84.7 | 92.9 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated