Preprint
Article

This version is not peer-reviewed.

An End-to-End Precision Phenotyping Framework: Rice Panicle Detection and Counting in Complex Fields via Lightweight DETR

Submitted: 15 April 2026
Posted: 16 April 2026


Abstract
Accurate, high-throughput quantification of rice panicles plays a vital role in advancing precision yield prediction. However, transitioning to real-time, edge-deployable unmanned aerial vehicle phenotyping is often impeded by extreme spatial scale variations from altitude fluctuations and complex unstructured backgrounds. To address this, we constructed a comprehensive composite dataset specifically capturing multi-altitude and varying illumination field conditions. We then propose Panicle-DETR, a highly optimized precision phenotyping framework incorporating a frequency-aware CSP backbone. By projecting visual perception into the frequency domain, the architecture inherently suppresses low-frequency environmental noise and minimizes computational redundancy. Furthermore, a Lossless Feature Encoder prevents the irreversible pixel decimation of micro-targets across varying operational altitudes, while a composite metric loss explicitly disentangles heavily adhered panicle clusters. Evaluated on our composite dataset, Panicle-DETR achieved an outstanding detection Precision of 90.97% alongside robust agronomic counting stability, demonstrated by a Mean Absolute Error of 4.28 and an \( R^2 \) of 0.957. With a compact footprint of only 13.78 M parameters, this framework fundamentally overcomes the computational and spatial limitations of traditional vision models, establishing a highly reliable paradigm for autonomous, onboard agricultural monitoring.

1. Introduction

The density per unit area and the spatial distribution pattern of rice panicles are direct agronomic traits determining final crop yield. Accurate, high-throughput quantification of these traits in the field is crucial to overcoming the bottlenecks of traditional yield estimation and breeding screening [1,2]. Recently, unmanned aerial vehicle (UAV) remote sensing has revolutionized agricultural phenotyping [3,4]. By capturing high-resolution field imagery, UAV platforms enable the rapid acquisition of large-scale heading-stage traits.
However, translating raw UAV imagery into actionable agronomic insights faces significant engineering hurdles. Historically, UAV-based phenotyping has relied on a rigid “capture-and-process” paradigm, where heavy image data must be transferred to high-performance ground workstations for offline analysis [5,6]. As precision agriculture evolves towards autonomous, real-time decision-making, deploying high-throughput phenotyping systems directly onto UAV edge devices or autonomous agricultural robots has become imperative to bypass transmission latency and data redundancy [7]. Yet, edge deployment in unstructured agricultural environments introduces profound challenges. First, edge devices possess strictly limited computational power and battery capacities, which prohibits the use of parameter-heavy deep learning models. Second, unlike controlled laboratory conditions, UAVs operating in real fields experience unavoidable altitude fluctuations (e.g., dropping from 20 m to 3 m) due to wind interference and uneven terrain. These altitude shifts cause extreme spatial scale variations of the targets. As recently highlighted in small object detection studies, such multi-scale variations often lead to severe background noise interference and the “semantic collapse” of micro-panicles in standard neural networks [8]. Furthermore, severe leaf occlusion and dynamic lighting reflections from flooded paddies exacerbate the difficulty of automated feature extraction.
To bypass manual feature engineering, the advent of deep learning, particularly Convolutional Neural Networks (CNNs), established a robust automated paradigm. Optimized one-stage detectors have been widely explored for agricultural contexts, such as multi-scale information-enhanced weed detection [9] and crop maturity evaluation combining AI with remote sensing [10]. Initial efforts in automated panicle detection employed two-stage models; for example, Zhang et al. [11] improved Faster R-CNN using multi-scale feature fusion. However, the rapid evolution toward lightweight one-stage architectures has dominated recent literature, particularly for UAV applications. For instance, recent comparative studies have underscored the necessity of evaluating YOLO variants under constrained UAV image datasets for precise yield estimation [12], as well as adapting these models to mitigate severe lighting variations and environmental noise in natural fields [13]. Building on these insights, architectures like EfficientNet-based OE-YOLO [14] and DRPU-YOLO11 [15] have been specifically developed to tackle complex infield backgrounds. This architectural evolution is also continued in recent state-of-the-art models like YOLOv10 [16], which aims for real-time end-to-end detection by eliminating NMS post-processing. To accommodate operational constraints, FRPNet addresses multi-altitude scale variations [17], while LKD-YOLO leverages knowledge distillation to facilitate edge deployment [18]. To further resolve specific unstructured field anomalies, researchers have continuously introduced specialized modules, such as rotated bounding boxes to handle inclined panicles [19], enhanced feature fusion networks to improve counting precision [20], and multi-scale enhancement methods like MFNet [21] to mitigate miss-detections in complex backgrounds. 
Additionally, the integration of autonomous vision-guided systems directly into field robotics underscores the critical need for robust, real-time detection in complex canopies [22].
Despite these advancements, CNN-based architectures suffer from inherent spatial limitations. Their localized receptive fields restrict the assimilation of global contextual information, making it exceedingly difficult to separate targets from backgrounds during late growth stages when severe panicle adhesion occurs [23,24]. While Transformer architectures circumvent this by establishing long-range pixel dependencies—with DETR [25] transforming object detection into a set prediction problem and RT-DETR [26] achieving real-time efficiency via a hybrid encoder—their specialized application in edge-based agriculture is still in its infancy. Recent breakthroughs have demonstrated the immense potential of hybrid CNN-Transformer architectures in tackling complex agricultural segmentation and detection tasks [27]. However, directly applying generic RT-DETR to UAV imagery exposes a critical flaw: the faint textures of tiny panicles are easily submerged by background noise, and traditional downsampling operations demand excessive computational resources while causing irreversible feature loss for high-altitude targets.
To bridge this fundamental gap between algorithmic precision and edge-deployable engineering, this paper proposes an end-to-end precision phenotyping framework: Panicle-DETR. Tailored specifically for complex field environments and multi-altitude UAV operations, our lightweight framework introduces three core engineered innovations:
  • FasterFD (Frequency-Aware CSP Backbone): Inheriting the robust Cross Stage Partial (CSP) hierarchy of YOLOv8, this backbone introduces a radical micro-architectural innovation by upgrading standard bottlenecks into C2f-FasterFD modules. By integrating the Partial Convolution (PConv) paradigm with Frequency Dynamic Convolution (FDConv), it effectively suppresses low-frequency background noise and amplifies faint panicle textures, perfectly optimizing the multi-scale feature extraction process for low-power edge devices.
  • LFE (Lossless Feature Encoder): By integrating Space-to-Depth mapping with bidirectional feature pyramids, this encoder achieves complete cross-scale feature reconstruction, successfully overcoming the “semantic collapse” of micro-targets captured from high UAV altitudes.
  • Composite Metric Loss: Integrating NWD and Inner-IoU, this formulation compensates for the gradient failure of traditional IoU on overlapping targets, significantly improving bounding box regression and dense counting precision without adding any inference delay.

2. Materials and Methods

2.1. Study Area and Data Preparation

To engineer a highly robust object detection framework capable of adapting to unstructured field environments, accommodating multiple rice cultivars, and extracting scale-invariant features across varying UAV flight altitudes (ranging from 3 m to 20 m), this study constructed a “Composite Multi-Altitude Rice Panicle Dataset”. This composite corpus strategically fuses a massive public benchmark dataset with targeted, ultra-low-altitude self-collected data to simulate a complete UAV operational envelope.
The foundational generalization data for our framework is derived from the DRPD dataset published by Teng et al. [28]. Acquired using an industrial-grade DJI M300 UAV equipped with a Zenmuse P1 imaging sensor, this dataset provides a massive volume of complex field samples. After uniform cropping, the DRPD subset contributes 5,372 high-resolution image patches (512 × 512 pixels), encompassing 229 rice varieties across four critical growth stages (from initial heading to mid-grain filling). Crucially, it contains images captured at three distinct medium-to-high flight altitudes (7 m, 12 m, and 20 m), with Ground Sample Distances (GSD) ranging from approximately 0.08 to 0.25 cm/pixel. This subset provides over 250,000 expert-annotated panicle instances, serving as a solid foundation to guarantee the model’s generalization capability across diverse cultivars and lighting conditions.
However, in practical precision agriculture phenotyping tasks, relying solely on medium-to-high altitude data is insufficient. UAVs frequently need to perform ultra-low-altitude inspections to capture fine-grained morphological traits. To push the model’s multi-scale perception limits to the extreme and validate its adaptability to consumer-grade lightweight UAV sensors, a targeted supplementary data acquisition campaign was conducted in July 2025 at the Gannan Red Seed Industry Rice Research Base (Ganzhou City, Jiangxi Province, China), as illustrated in Figure 1. Unlike the industrial-grade equipment used in DRPD, our collection deliberately utilized a consumer-grade, lightweight UAV (DJI Mini 3) operating strictly at an ultra-low altitude of approximately 3 m. This specific acquisition strategy was designed to capture ultra-high-definition panicle textures and morphological details, acting as a vital spatial-scale supplement to the DRPD dataset.
To standardize spatial dimensions and prepare raw UAV imagery for network ingestion, a systematic data processing pipeline was established, as illustrated in Figure 2. Initially, the raw ultra-low-altitude images acquired by the DJI Mini 3 were processed using a sliding window cropping strategy. This operation partitioned the high-resolution raw images into uniform 512 × 512 pixel patches, ensuring dimensional consistency with the DRPD dataset. Subsequently, a rigorous quality filtering process was applied to discard invalid patches suffering from severe rotor-induced airflow blur or lacking panicle targets.
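To make the cropping step concrete, the following minimal Python sketch partitions a frame into uniform 512 × 512 patches; the function name and the border-anchoring behavior are our own illustration, not the exact pipeline code:

```python
import numpy as np

def sliding_window_crop(image, patch=512, stride=512):
    """Partition a raw UAV frame (H, W, C) into uniform square patches.

    If the frame size is not a multiple of the stride, a final window is
    anchored to the border so edge pixels are not discarded. Names and
    border handling are illustrative, not the paper's exact pipeline.
    """
    h, w = image.shape[:2]
    ys = list(range(0, h - patch + 1, stride))
    xs = list(range(0, w - patch + 1, stride))
    if ys and ys[-1] + patch < h:
        ys.append(h - patch)          # anchor a last row of windows to the edge
    if xs and xs[-1] + patch < w:
        xs.append(w - patch)          # anchor a last column to the edge
    return [image[y:y + patch, x:x + patch] for y in ys for x in xs]

# a 1024 x 1536 frame yields a 2 x 3 grid of 512 x 512 patches
tiles = sliding_window_crop(np.zeros((1024, 1536, 3), dtype=np.uint8))
```

Patches without panicles or with motion blur would then be discarded by the quality filter described above.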
Following the filtering phase, 646 high-quality ultra-low-altitude image patches were retained. These patches were then manually annotated using the LabelImg software. To guarantee cross-dataset compatibility, our annotation protocol adhered exclusively to the bounding box definitions established by the public dataset, ultimately yielding approximately 10,435 high-precision ground-truth bounding boxes for 3 m ultra-low-altitude rice panicles.
Finally, these meticulously processed 3 m data patches were strategically fused with the 7 m, 12 m, and 20 m sub-datasets from DRPD. This fusion yields our final Composite Multi-Altitude Dataset. By integrating these specific altitude gradients, the dataset forces the Panicle-DETR network to learn highly robust, scale-invariant feature representations, successfully simulating the comprehensive altitude fluctuations a UAV experiences during real-world field inspections.
Given the complexity of unstructured field environments, and to prevent overfitting while improving robustness against illumination and viewpoint variations, we implemented multiple online data augmentation strategies during the training phase, as illustrated in Figure 3. Geometric transformations were applied via random horizontal and vertical flipping to simulate various UAV flight headings. To replicate the highly dynamic illumination patterns typical of unstructured field environments—ranging from harsh solar glare to heavy overcast—we introduced stochastic photometric perturbations across the brightness, contrast, and saturation channels. Noise simulation was conducted by adding Gaussian noise to mimic potential sensor interference during UAV operation. These augmentation strategies effectively expanded the feature space of training samples and improved the model’s generalization capability in unseen scenarios.
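As a concrete illustration, these online augmentations can be sketched in a few lines of numpy; all probabilities and jitter ranges below are illustrative placeholders rather than the exact training values, and real training would also apply matching flips to the bounding boxes:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    """Online augmentation mirroring the strategies described above.
    Probabilities and jitter magnitudes are illustrative only."""
    if rng.random() < 0.5:                       # random horizontal flip
        img = img[:, ::-1]
    if rng.random() < 0.5:                       # random vertical flip
        img = img[::-1, :]
    out = img.astype(np.float32)
    out *= rng.uniform(0.8, 1.2)                 # contrast-style gain
    out += rng.uniform(-20.0, 20.0)              # brightness shift
    out += rng.normal(0.0, 5.0, out.shape)       # Gaussian sensor noise
    return np.clip(out, 0, 255).astype(np.uint8)

out = augment(np.full((512, 512, 3), 128, dtype=np.uint8))
```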

2.2. Panicle-DETR Network Framework

2.2.1. Overall Network Architecture

To address the profound agronomic engineering challenges of extreme spatial scale variations, high-density occlusion, and the semantic collapse of micro-panicle features in unstructured field environments, this study proposes Panicle-DETR, an improved end-to-end, edge-deployable object detection framework. The overall architecture of Panicle-DETR is illustrated in Figure 4.
To resolve the inherent tension between edge-inference speed and detection precision, we anchored our framework on the RT-DETR architecture. While recent advancements in the YOLO family have successfully achieved end-to-end detection by eliminating NMS post-processing, these models still fundamentally rely on localized spatial receptive fields. This structural constraint makes it exceedingly difficult to separate genuine adjacent targets during the late crop growth stages, where severe panicle adhesion and leaf occlusion occur. RT-DETR circumvents this limitation by employing a Transformer architecture; its self-attention mechanism processes global contextual features, allowing for superior relational modeling among densely packed agricultural targets.
Nevertheless, generic RT-DETR models exhibit a distinct vulnerability in agricultural aerial imagery: they struggle to encode the fine-grained textures of tiny objects, and their default heavy backbones (e.g., ResNet) are too computationally expensive for UAVs. To address this, Panicle-DETR establishes a highly efficient hybrid architecture. We replace the conventional backbone with a customized YOLO-style Cross Stage Partial (CSP) network. Rather than relying solely on standard spatial convolutions, the imagery is routed through this upgraded frequency-domain perception backbone, which explicitly decouples faint panicle signals from background noise while maintaining ultra-high throughput. These refined features are then passed into the LFE encoder, engineered to propagate complete spatial information across multiple scales without the decimation typical of strided downsampling. The entire flow is ultimately resolved by an improved decoder and detection head to deliver robust end-to-end predictions.

2.2.2. FasterFD: Frequency-Aware CSP Feature Extraction Backbone

In complex background images captured by low-altitude UAVs, rice panicles often exhibit extremely fine textures and blurred edges. Consequently, traditional spatial-domain convolutions easily overlook this high-frequency morphological information. From a biomorphological perspective, mature rice panicles possess significant high-frequency texture features defined by drastic grayscale variations, especially compared to the flooded soil and mature leaves in the background, which typically present as low-frequency, smooth, continuous signals.
To resolve this while strictly adhering to the stringent inference speed constraints of field-edge devices, we designed the FasterFD backbone. Macro-structurally, FasterFD inherits the robust Cross Stage Partial hierarchy characteristic of the YOLOv8 architecture. It systematically downsamples spatial features via strided convolutions to generate a standard five-level multi-scale representation spanning from P1 to P5. However, generic YOLOv8 architectures rely heavily on standard spatial convolutions within their iconic Cross Stage Partial Bottleneck, or C2f, modules. This over-reliance renders faint panicle features highly susceptible to being submerged by complex environmental noise while simultaneously generating severe computational redundancy.
To break this bottleneck, we introduce a radical micro-architectural innovation. As illustrated in Figure 5, we systematically upgrade standard C2f modules into our proposed C2f-FasterFD modules. This upgraded architecture integrates the highly efficient computational paradigm of FasterNet [29] with advanced frequency-domain filtering, striking an optimal balance between high-frequency perception capabilities and edge-computing efficiency.
The architecture of the C2f-FasterFD module is governed by two core principles. The first involves partial channel processing. Within the bottleneck structure, to explicitly minimize memory access costs on UAV payloads, we adopt a partial convolutional channel splitting strategy. An input feature map \( X \in \mathbb{R}^{C \times H \times W} \) is partitioned along the channel dimension into an active subset \( X_{\text{active}} \in \mathbb{R}^{\frac{C}{r} \times H \times W} \), using a standard split ratio of \( r = 4 \), along with an unmodified identity subset \( X_{\text{identity}} \).
The second principle focuses on the enhancement of frequency-domain features. For the active channel subset, we abandon the traditional \( 3 \times 3 \) spatial convolution and substitute it with frequency dynamic convolution [30]. By projecting spatial signals into the frequency domain via a Fast Fourier Transform, this operation inherently functions as a global high-pass filter. This transformation intrinsically broadens the receptive field, actively amplifying the high-frequency granular details characteristic of panicle textures while aggressively attenuating low-frequency background noise. Mathematically, the forward pass for the active channel subset is defined as:
\[ X'_{\text{active}} = \mathcal{F}^{-1}\left( K \odot \mathcal{F}(X_{\text{active}}) \right) \]
In this formulation, \( \mathcal{F}(\cdot) \) and \( \mathcal{F}^{-1}(\cdot) \) signify the mapping to the frequency domain and its corresponding inverse. The operator \( \odot \) denotes the Hadamard product executed within the frequency domain, modulated by \( K \), a learnable complex-valued weight matrix. By integrating this frequency-aware projection directly into the YOLO-style hierarchical topology, FasterFD not only retains ultra-high throughput but significantly enhances the network’s resilience against blurred or low-resolution targets in dynamic field environments.
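The two principles can be illustrated with a short numpy sketch. Note that this is a conceptual stand-in: a fixed high-pass mask replaces the learnable complex weight \( K \) of FDConv, and the module's learned convolutions are omitted entirely:

```python
import numpy as np

def c2f_fasterfd_sketch(x, r=4, cutoff=0.1):
    """Conceptual forward pass of the two principles above: partial
    channel splitting, then frequency-domain filtering of the active
    subset only. A fixed high-pass mask stands in for the learnable
    complex weight K; the real module also applies learned convolutions.
    """
    c = x.shape[0]
    active, identity = x[: c // r], x[c // r:]   # PConv-style channel split
    spec = np.fft.fft2(active)                   # F(X_active)
    h, w = active.shape[-2:]
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    mask = np.sqrt(fy ** 2 + fx ** 2) > cutoff   # suppress low frequencies
    filtered = np.fft.ifft2(spec * mask).real    # F^{-1}(K ⊙ F(X_active))
    return np.concatenate([filtered, identity], axis=0)

x = np.random.default_rng(0).normal(size=(8, 32, 32))
y = c2f_fasterfd_sketch(x)
```

With \( r = 4 \), only 2 of the 8 channels are filtered; the identity channels pass through untouched, which is the source of the memory-access savings.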

2.2.3. LFE: Lossless Feature Encoder

Resolving the extreme spatial disparities inherent in multi-altitude UAV surveys (varying from 3 m to 20 m) requires a specialized feature fusion architecture. Agricultural micro-target detection is fundamentally constrained by a spatial-semantic dichotomy: early network layers retain the critical textural and geometric granularity of panicles but lack macroscopic context, whereas deep layers encode robust semantics at the cost of severe spatial degradation.
Conventional feature pyramids frequently fail to bridge this divide. Their standard strided downsampling operations (Stride-2 Conv) irreversibly decimate the physical pixel footprints of high-altitude targets—causing severe semantic degradation for micro-panicles spanning merely a handful of pixels.
To address this structural imbalance and maximize spatial information retention, this study proposes the Lossless Feature Encoder (LFE). This module systematically reconstructs the cross-scale feature flow pipeline. First, it integrates the bidirectional topological structure and fast normalized weighted fusion mechanism of BiFPN [31] (Figure 6a). The network adaptively adjusts the fusion weights based on the contribution of the input features, thereby intelligently prioritizing shallow layers rich in micro-target morphology and preventing them from being overshadowed by deep-level macroscopic semantics.
Crucially, to effectively prevent the spatial information loss caused by traditional convolutions during pyramid construction, we innovatively embed SPDConv [32] seamlessly into the cross-scale downsampling paths of the LFE. As shown in Figure 6b, SPDConv leverages a Space-to-Depth mapping paradigm. It rearranges spatial information into the channel dimension through a precise slicing operation, achieving effectively lossless downsampling. Specifically, an input feature map \( X \) of size \( S \times S \times C_{\text{in}} \) is transformed into an output feature map \( X' \) of size \( \frac{S}{2} \times \frac{S}{2} \times 4C_{\text{in}} \). The mapping relationship is defined as:
\[ X'_{x,y} = \left[ X_{2x,2y},\; X_{2x+1,2y},\; X_{2x,2y+1},\; X_{2x+1,2y+1} \right] \]
In summary, the LFE optimally synergizes the adaptive attention of BiFPN with the spatial-preserving characteristics of SPDConv. It ensures that the granular textures of ultra-small panicles captured at 20 m altitudes are robustly transmitted and fused into deep semantic features, significantly elevating the model’s multi-scale robustness under extreme UAV altitude variations.
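The Space-to-Depth mapping itself is a simple, invertible rearrangement, as the following minimal numpy sketch shows (the non-strided convolution that SPDConv applies afterwards is omitted):

```python
import numpy as np

def space_to_depth(x):
    """Lossless 2x downsampling per the mapping above: every 2x2
    spatial block is rearranged into the channel dimension,
    (S, S, C) -> (S/2, S/2, 4C). No pixel is discarded."""
    return np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )

x = np.arange(4 * 4 * 3).reshape(4, 4, 3)
y = space_to_depth(x)
```

Because the operation is a pure permutation of pixels, every input value survives into the output, in contrast to a stride-2 convolution that integrates (and thereby decimates) micro-target footprints.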

2.2.4. Adjustment of the Loss Function

In densely planted paddies, severe occlusion and morphological adhesion among tiny panicles expose two fundamental flaws in traditional Intersection over Union (IoU) loss. First, non-overlapping bounding boxes yield a zero IoU, preventing meaningful gradient backpropagation. Second, marginal positional deviations in micro-targets trigger precipitous IoU drops, destabilizing model convergence. To resolve these issues, we designed a composite regression loss strategy combining metric learning and auxiliary bounding boxes.
  • Normalized Wasserstein Distance Loss
First, to address the issue that small objects are extremely sensitive to positional deviations, we introduce the NWD loss [33]. Unlike the traditional IoU calculation based on geometric overlap, NWD models the bounding box \( B = (c_x, c_y, w, h) \) as a 2D Gaussian distribution \( \mathcal{N}(\mu, \Sigma) \), where the mean vector \( \mu \) and the covariance matrix \( \Sigma \) are defined as:
\[ \mu = \begin{bmatrix} c_x \\ c_y \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{bmatrix} \]
By calculating the second-order Wasserstein distance \( W_2^2 \) between the predicted distribution \( \mathcal{N}_p \) and the ground-truth distribution \( \mathcal{N}_g \), we construct \( \mathcal{L}_{\text{NWD}} \):
\[ \mathcal{L}_{\text{NWD}} = 1 - \exp\left( -\frac{\sqrt{\lVert \mu_p - \mu_g \rVert_2^2 + \lVert \Sigma_p^{1/2} - \Sigma_g^{1/2} \rVert_F^2}}{C} \right) \]
By measuring distributional similarity, NWD smooths the drastic gradient fluctuations caused by minor positional deviations, enabling the model to maintain stable learning capabilities when handling non-overlapping or slightly overlapping samples.
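Because the covariances above are diagonal, \( \Sigma^{1/2} = \mathrm{diag}(w/2,\, h/2) \) and the loss admits a closed form. A minimal sketch follows; the normalizing constant \( C \) is dataset-dependent, and 32 is only an illustrative value:

```python
import math

def nwd_loss(box_p, box_g, C=32.0):
    """NWD loss for axis-aligned (cx, cy, w, h) boxes. For diagonal
    covariances, Sigma^{1/2} = diag(w/2, h/2), yielding this closed
    form. The constant C is dataset-dependent; 32 is illustrative."""
    (cxp, cyp, wp, hp), (cxg, cyg, wg, hg) = box_p, box_g
    w2 = ((cxp - cxg) ** 2 + (cyp - cyg) ** 2              # centre term
          + ((wp - wg) / 2) ** 2 + ((hp - hg) / 2) ** 2)   # covariance term
    return 1.0 - math.exp(-math.sqrt(w2) / C)
```

Identical boxes give a loss of exactly 0, while widely separated boxes still produce a finite, informative value instead of the flat zero-gradient region of IoU.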
  • Inner-IoU Loss
While NWD resolves the metric failure for distant targets, convergence speed remains suboptimal during the fine-grained regression stage for highly overlapping samples. To optimize this, we integrated the Inner-IoU loss [34]. Traditional IoU provides uniform gradients irrespective of sample quality. By applying a spatial scale factor ratio (empirically set to 0.7), Inner-IoU generates contracted “auxiliary inner boxes” within the original anchors:
\[ b_l^{\text{inner}} = c_x - \frac{w \cdot \text{ratio}}{2}, \qquad b_r^{\text{inner}} = c_x + \frac{w \cdot \text{ratio}}{2} \]
\[ b_t^{\text{inner}} = c_y - \frac{h \cdot \text{ratio}}{2}, \qquad b_b^{\text{inner}} = c_y + \frac{h \cdot \text{ratio}}{2} \]
By exclusively computing the intersection ratio between these auxiliary inner boxes (\( \text{IoU}_{\text{inner}} \)), the absolute value of the regression gradient is significantly amplified, thereby rapidly accelerating the bounding box refinement for dense, adhered panicles.
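A minimal pure-Python sketch of this auxiliary-inner-box IoU; the box format and helper names are our own illustration:

```python
def inner_iou(box_p, box_g, ratio=0.7):
    """IoU computed over contracted auxiliary inner boxes, per the
    equations above. Boxes are given as (cx, cy, w, h)."""
    def inner(cx, cy, w, h):
        # contracted box corners (left, top, right, bottom)
        return (cx - w * ratio / 2, cy - h * ratio / 2,
                cx + w * ratio / 2, cy + h * ratio / 2)
    l1, t1, r1, b1 = inner(*box_p)
    l2, t2, r2, b2 = inner(*box_g)
    iw = max(0.0, min(r1, r2) - max(l1, l2))
    ih = max(0.0, min(b1, b2) - max(t1, t2))
    inter = iw * ih
    union = (r1 - l1) * (b1 - t1) + (r2 - l2) * (b2 - t2) - inter
    return inter / union if union > 0 else 0.0
```

Because both boxes shrink by the same ratio, a perfect match still scores 1.0, but small misalignments change the inner overlap proportionally more, which is what amplifies the regression gradient.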
  • Total Loss Formulation
Ultimately, Panicle-DETR is optimized through a synergistic multi-task objective. The total loss formulation, \( \mathcal{L}_{\text{Total}} \), integrates Varifocal Loss (\( \mathcal{L}_{\text{VFL}} \)) for classification with the dual-metric bounding box regression mechanism. To mitigate the extreme scale disparities, the regression penalty seamlessly amalgamates the NWD and Inner-IoU metrics via an interpolation coefficient \( \alpha \):
\[ \mathcal{L}_{\text{Total}} = \lambda_{\text{cls}} \mathcal{L}_{\text{VFL}} + \lambda_{\text{box}} \left[ \alpha \mathcal{L}_{\text{Inner-IoU}} + (1 - \alpha) \mathcal{L}_{\text{NWD}} \right] + \lambda_{L_1} \mathcal{L}_{L_1} \]
Here, \( \lambda_{\text{cls}} \), \( \lambda_{\text{box}} \), and \( \lambda_{L_1} \) serve as structural hyperparameters to explicitly regulate the gradient contribution of each component. By dynamically coupling the distributional sensitivity of NWD with the accelerated boundary convergence of Inner-IoU, this tripartite optimization scheme guarantees exceptional localization resilience amidst hyper-dense, unstructured agricultural conditions.

3. Results

3.1. Experimental Setup

3.1.1. Experimental Environment

To ensure consistent evaluation and eliminate potential hardware-induced performance bottlenecks during deep learning iterations, all models were trained and tested on a workstation equipped with an Intel Core i7-14700KF processor, 64 GB of RAM, and an NVIDIA GeForce RTX 4080 SUPER GPU (16 GB VRAM). Furthermore, the software environment was strictly configured with Ubuntu 22.04 LTS, Python 3.10, PyTorch 2.5.1, and CUDA 12.4 to maintain a highly standardized and reproducible experimental pipeline.

3.1.2. Experimental Parameters

To objectively evaluate the feature extraction capabilities of the models without bias from pre-existing weights, all networks were trained entirely from scratch. Input images were standardized to a spatial resolution of 640 × 640 pixels. The detailed configuration of the unified training hyperparameters is summarized in Table 1.
As outlined in Table 1, the AdamW optimizer was utilized in conjunction with a Cosine Annealing scheduler to prevent premature convergence in local minima. A weight decay factor was systematically applied to suppress overfitting on complex background noise, such as water reflections and soil. To ensure optimal convergence while minimizing redundant computation, training was capped at 300 epochs, supplemented by a 50-epoch early stopping mechanism monitored via the validation loss.
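The scheduler and early-stopping logic described above can be sketched as follows; the learning-rate bounds are illustrative placeholders for the values listed in Table 1:

```python
import math

def cosine_annealing_lr(epoch, total_epochs=300, lr_max=1e-4, lr_min=1e-6):
    """Cosine Annealing schedule over the training run; lr_max/lr_min
    are illustrative placeholders for the actual Table 1 values."""
    cos = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos)

class EarlyStopping:
    """Halt training once validation loss has not improved for
    `patience` consecutive epochs (50 in our configuration)."""
    def __init__(self, patience=50):
        self.patience, self.best, self.stale = patience, float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
        return self.stale >= self.patience   # True -> stop training
```

The schedule starts at `lr_max`, decays smoothly, and reaches `lr_min` at the final epoch, avoiding the abrupt drops of step schedules that can trap the model in local minima.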

3.1.3. Evaluation Metrics

To rigorously quantify the efficacy of Panicle-DETR within unstructured agricultural settings, we established a tripartite evaluation framework encompassing fundamental detection capability, holistic localization accuracy, and agronomic counting variance.
(1) Fundamental Detection Metrics
Recognizing that false negatives (missed panicles) compromise yield projections more severely than false positives (background misclassifications), we prioritized Precision (P), Recall (R), and the F1-Score. Specifically, P quantifies the exactness of the predicted bounding boxes, R evaluates the model’s capacity to retrieve all genuine panicles, and the F1-Score acts as their harmonic mean to indicate overall architectural stability. The corresponding formulations are:
\[ P = \frac{TP}{TP + FP} \times 100\% \]
\[ R = \frac{TP}{TP + FN} \times 100\% \]
\[ F1 = \frac{2 \times P \times R}{P + R} \]
Here, \( TP \) denotes the volume of accurately localized panicles, \( FP \) designates background artifacts (such as foliage or soil) incorrectly classified as targets, and \( FN \) represents the ground-truth panicles overlooked by the detector.
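These metrics follow directly from the raw counts; a minimal sketch (returning fractions rather than percentages):

```python
def detection_metrics(tp, fp, fn):
    """Precision, Recall and F1 from raw TP/FP/FN counts,
    per the formulas above (as fractions, not percentages)."""
    p = tp / (tp + fp)          # exactness of predictions
    r = tp / (tp + fn)          # coverage of genuine panicles
    return p, r, 2 * p * r / (p + r)

# e.g. 90 correct detections, 10 false alarms, 15 missed panicles
p, r, f1 = detection_metrics(tp=90, fp=10, fn=15)
```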
(2) Comprehensive Accuracy Metrics
In alignment with rigorous COCO evaluation protocols, Mean Average Precision (mAP) was adopted as the primary holistic metric. It aggregates the Average Precision (\( AP \)) over the entire precision-recall trajectory. The definitions are expressed as:
\[ AP = \int_0^1 P(R)\, dR, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \]
In this context, \( N \) signifies the total class count (with \( N = 1 \) for our single-class panicle task). We specifically reported \( \text{mAP}_{50} \) (at an Intersection over Union threshold of 0.5) to baseline the general detection performance, alongside \( \text{mAP}_{50:95} \) (averaged across IoU thresholds from 0.5 to 0.95 in 0.05 increments) to rigorously examine the accuracy of bounding box regression under stringent boundary constraints.
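In practice, the \( AP \) integral is approximated from discrete precision-recall samples. The following simplified all-point-interpolation sketch illustrates the idea; the COCO protocol referenced above uses a 101-point variant, so exact values differ:

```python
def average_precision(recalls, precisions):
    """Area under a monotone precision envelope over the P-R curve
    (all-point interpolation; COCO uses a 101-point variant)."""
    pts = sorted(zip(recalls, precisions))
    r = [0.0] + [rec for rec, _ in pts]
    p = [prec for _, prec in pts]
    for i in range(len(p) - 2, -1, -1):       # non-increasing precision envelope
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i] for i in range(len(p)))

ap = average_precision([0.2, 0.5, 1.0], [1.0, 0.8, 0.6])
```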
(3) Agronomic Counting Metrics
To validate the utility of the model for downstream yield prediction, the precision of panicle counting was assessed by measuring the deviation between the algorithmic predictions (\( C_{\text{pred}} \)) and the manual ground-truth annotations (\( C_{\text{gt}} \)). This evaluation utilized the Mean Absolute Error (MAE), the Root Mean Square Error (RMSE), and the Coefficient of Determination (\( R^2 \)):
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} \left| C_{\text{pred},i} - C_{\text{gt},i} \right| \]
\[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( C_{\text{pred},i} - C_{\text{gt},i} \right)^2} \]
\[ R^2 = 1 - \frac{\sum_{i=1}^{n} \left( C_{\text{gt},i} - C_{\text{pred},i} \right)^2}{\sum_{i=1}^{n} \left( C_{\text{gt},i} - \bar{C}_{\text{gt}} \right)^2} \]
where \( n \) indicates the total volume of test set images, and \( \bar{C}_{\text{gt}} \) is the empirical mean of the actual panicle distributions. Minimized MAE and RMSE scores, coupled with an \( R^2 \) approaching unity, denote exceptional counting consistency, which is an absolute prerequisite for high-throughput phenotyping and large-scale digital agriculture.
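All three counting metrics can be computed from paired per-image counts; a minimal sketch:

```python
def counting_metrics(pred, gt):
    """MAE, RMSE and R^2 between predicted and ground-truth per-image
    panicle counts, following the formulas above."""
    n = len(gt)
    mae = sum(abs(p - g) for p, g in zip(pred, gt)) / n
    rmse = (sum((p - g) ** 2 for p, g in zip(pred, gt)) / n) ** 0.5
    mean_gt = sum(gt) / n
    ss_res = sum((g - p) ** 2 for p, g in zip(pred, gt))
    ss_tot = sum((g - mean_gt) ** 2 for g in gt)
    return mae, rmse, 1 - ss_res / ss_tot

# three hypothetical test images with small counting deviations
mae, rmse, r2 = counting_metrics(pred=[10, 21, 29], gt=[10, 20, 30])
```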

3.2. Ablation Study of Engineered Innovations

To quantify the independent contributions and synergistic effects of the proposed architectural modifications, we conducted a combinatorial ablation study. The evaluated components include the FasterFD backbone, the Bidirectional Feature Pyramid Network (BiFPN), and the composite bounding box regression loss (NWD + Inner-IoU). As detailed in Table 2, the baseline model (RT-DETR-r18) was systematically augmented with individual modules and their pairwise combinations to isolate their impacts on terminal localization accuracy and computational overhead.
These architectural permutations elucidate several mechanisms explicitly tailored for UAV constraints. Given that inference economy is paramount for battery-powered drones, integrating the FasterFD module substantially reduced computational demand. Specifically, Model 1 decreased GFLOPs by 26.5% (from 56.9 to 41.8) compared to the baseline, while concurrently improving mAP@50 to 92.95%. This demonstrates that projecting feature extraction into the frequency domain effectively filters out redundant, low-frequency background noise (e.g., homogeneous soil or water reflections), redirecting constrained computational bandwidth exclusively toward the high-frequency textural details of panicles.
Building upon this efficient foundation, the LFE module addresses the spatial degradation of micro-targets. While isolating LFE (Model 2) increased mAP@50 to 93.20% and improved Recall, its dense bidirectional fusion inherently escalated the computational load to 64.7 GFLOPs. This uplift confirms the efficacy of the Space-to-Depth mapping mechanism, which circumvents physical pixel decimation to preserve the minute footprints of high-altitude panicles. Crucially, when LFE is coupled with FasterFD (Model 4), the computational savings from the spectral backbone fully absorb the encoder’s overhead. This synergy stabilizes the final GFLOPs at 53.0 while pushing mAP@50 to 93.60%, achieving a highly competitive balance between multi-altitude accuracy and edge efficiency.
Finally, the composite loss function (NWD + Inner-IoU) optimizes boundaries for dense clusters with zero additional inference overhead. Applied in isolation (Model 3), the gain remained marginal (92.55% mAP@50) because traditional convolutions fail to extract sufficiently clear micro-target features, rendering the advanced metric loss ineffective on highly blurred boxes. However, a powerful synergistic effect emerges in the complete architecture. Once FasterFD and LFE establish high-fidelity, scale-invariant representations, the composite loss decisively pushes Precision to 90.97% and mAP@50 to 93.94%. This verifies that prioritizing distributional similarity over geometric overlap effectively decouples adherent panicles, provided the preceding network supplies robust structural features.

3.2.1. Comparative Analysis of Backbone Architectures

To further validate the selection of FasterFD, we benchmarked it against various mainstream vision backbones. All candidates were integrated into the identical Panicle-DETR paradigm (equipped with BiFPN and composite loss) to strictly isolate their feature extraction capabilities. These encompassed parameter-heavy architectures (ResNet-50 [35], HGNetv2-L [26]), lightweight networks (MobileNetV3 [36], StarNet [37]), and state-of-the-art efficient models (FasterNet [29], EfficientNetV2 [38]). As Table 3 demonstrates, FasterFD achieves an optimal trade-off between detection accuracy and edge-inference efficiency.
Unlike traditional heavy networks such as ResNet-50—which guarantee robust accuracy (93.70% mAP@50) but incur a massive computational cost of 108.5 GFLOPs—FasterFD slightly surpasses this precision (93.94%) while reducing the computational load by more than half (53.0 GFLOPs). This effectively eliminates the parameter redundancy that typically restricts real-time UAV deployment. Conversely, while lightweight networks like StarNet and MobileNetV3 maintain compact footprints (40.0 and 47.7 GFLOPs, respectively), their constrained capacities prove insufficient against complex field backgrounds, resulting in severe accuracy degradation (dropping to 91.80% and 90.50% mAP@50).
Most notably, when benchmarked against the original FasterNet architecture (which relies purely on spatial-domain partial convolutions at an economical 41.5 GFLOPs), FasterFD yields a decisive 1.09% enhancement in mAP@50. While vanilla FasterNet exhibits a low footprint, its spatial-only perception struggles with the severe morphological blurring and specular reflections inherent to irrigated paddies. By embedding partial convolution principles into the robust multi-scale CSP hierarchy of YOLOv8 and projecting feature extraction into the frequency domain, our C2f-FasterFD modules trade a moderate increase in computation for a significant breakthrough in localization accuracy. This confirms that for agricultural targets with blurred edges and severe environmental noise, coupling a CSP topology with frequency-domain perception provides a far more resilient foundation than purely spatial architectures.
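The partial-convolution principle that FasterFD inherits from FasterNet [29] is simple to sketch: convolve only the first 1/n of the channels and pass the remainder through untouched, cutting convolutional FLOPs roughly by the split factor. The naive NumPy loop below is illustrative only; real implementations dispatch to optimized convolution kernels.

```python
import numpy as np

def partial_conv(x, weight, n_div=4):
    # FasterNet-style partial convolution (illustrative loop, not an
    # optimized kernel): apply a 3x3 zero-padded conv to only the
    # first C/n_div channels of a (C, H, W) map; the remaining
    # channels are forwarded unchanged.
    c, h, w = x.shape
    cp = c // n_div
    head, tail = x[:cp], x[cp:]
    pad = np.pad(head, ((0, 0), (1, 1), (1, 1)))
    out = np.empty_like(head)
    for o in range(cp):                        # output channels
        acc = np.zeros((h, w), dtype=x.dtype)
        for i in range(cp):                    # convolved input channels
            for dy in range(3):
                for dx in range(3):
                    acc += weight[o, i, dy, dx] * pad[i, dy:dy + h, dx:dx + w]
        out[o] = acc
    return np.concatenate([out, tail], axis=0)

# identity 3x3 kernel on the convolved split leaves the map unchanged
x = np.arange(4 * 4 * 4, dtype=np.float32).reshape(4, 4, 4)
w_id = np.zeros((1, 1, 3, 3), dtype=np.float32)
w_id[0, 0, 1, 1] = 1.0
assert np.allclose(partial_conv(x, w_id), x)
```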

3.3. Comparative Experiments on the Composite Dataset

To comprehensively validate the superiority of the Panicle-DETR framework in unstructured agricultural environments, we conducted extensive comparative experiments against mainstream state-of-the-art architectures. The evaluation integrates fundamental detection metrics, agronomic counting stability, and qualitative visual validation.

3.3.1. Quantitative Detection Performance and Training Dynamics

Detailed object detection performance and complexity metrics for each model on the test set are presented in Table 4. The quantitative evaluation indicates that Panicle-DETR achieves a highly effective equilibrium between detection accuracy and computational efficiency, specifically tailored for edge deployment.
For fundamental localization accuracy, Panicle-DETR recorded an outstanding Precision of 90.97% and an F1-Score of 89.63%, establishing distinct superiority over all evaluated baselines. Under the standard mAP@50 criterion, our architecture attained a peak score of 93.94%. It is noteworthy that the highly optimized YOLOv8m [41] and the latest YOLOv12m [43] models also demonstrated exceptional capabilities, achieving mAP@50 scores of 93.65% and 93.80%, respectively. However, Panicle-DETR consistently maintains a clear performance margin over these parameter-heavy networks while requiring significantly fewer parameters (13.78 M) and less computation (53.0 GFLOPs).
The performance delta becomes particularly pronounced under stringent spatial overlap thresholds. Specifically, mAP@75 and mAP@50-95 scores of 74.15% and 64.75% were achieved, respectively. This enhanced bounding box regression is directly driven by the integration of the combined NWD and Inner-IoU loss formulation, which enables the predicted bounding boxes to closely encapsulate irregular panicle boundaries.
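The geometric half of that regression loss follows Inner-IoU [34], which evaluates overlap on auxiliary boxes rescaled about each box center; ratios below 1 sharpen the gradient for already-overlapping pairs such as adhered panicles. A minimal sketch follows, where the 0.75 ratio is illustrative rather than the trained value:

```python
def inner_iou(box_a, box_b, ratio=0.75):
    # Inner-IoU sketch: compute IoU over auxiliary boxes shrunk by
    # `ratio` about each center. Boxes are given as (cx, cy, w, h).
    def corners(b):
        cx, cy, w, h = b
        return (cx - w * ratio / 2, cy - h * ratio / 2,
                cx + w * ratio / 2, cy + h * ratio / 2)
    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

assert inner_iou((5, 5, 4, 4), (5, 5, 4, 4)) == 1.0  # identical boxes
```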
Beyond terminal accuracy, the optimization trajectory of the proposed network was characterized by rapid convergence and minimal volatility. The comparative training dynamics—mapped across Precision, Recall, and mAP@50 in Figure 7—reveal that while initial learning phases were universally rapid, late-stage optimization in the YOLO lineage and Faster R-CNN was marred by pronounced metric oscillations.
Conversely, the training progression of Panicle-DETR remained inherently stable (Figure 7). This sustained stability is largely facilitated by the frequency-domain backbone (FasterFD) and the BiFPN multi-scale fusion mechanism, which collaboratively streamline the feature extraction process. To guard against overfitting and preserve generalization on unseen agricultural imagery, an early stopping protocol was enforced, automatically terminating the training phase for Panicle-DETR at epoch 236. By curtailing redundant computational cycles, this mechanism underscores the overarching operational efficiency of the architectural design.
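The early-stopping protocol reduces to a patience counter over a validation metric. The helper below is a hypothetical sketch: the 50-epoch patience matches the hyperparameters in Table 1, but the exact monitored metric is an assumption.

```python
def train_with_early_stopping(metric_per_epoch, patience=50):
    # Hypothetical sketch of the early-stopping protocol: track the
    # best validation metric seen so far and halt once `patience`
    # consecutive epochs pass without improvement. Returns the epoch
    # at which training stopped and the best metric value.
    best, best_epoch = float("-inf"), 0
    for epoch, metric in enumerate(metric_per_epoch, start=1):
        if metric > best:
            best, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best          # patience exhausted
    return len(metric_per_epoch), best  # ran to completion
```

With a metric curve that peaks at epoch 186 and never improves again, this rule would halt training 50 epochs later, at epoch 236, exactly as observed for Panicle-DETR.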

3.3.2. Agronomic Counting Efficacy and Stability

In high-throughput phenotypic analysis, detection confidence alone may not fully reflect the reliability of yield prediction. Therefore, a comprehensive quantitative analysis of panicle counting errors was conducted to evaluate model consistency.
Panicle-DETR demonstrated notable improvements in counting consistency, achieving a Mean Absolute Error of 4.28 and a Root Mean Square Error of 8.57. Notably, while YOLOv8m [41] and the recent YOLOv12m [43] exhibit strong object detection capabilities, their respective Mean Absolute Errors (5.05 and 4.85) in the counting task remain comparatively higher. This performance gap suggests that the adaptive multi-scale fusion mechanism of the BiFPN structure helps preserve the physical boundaries of micro-targets among dense and adherent panicles, effectively mitigating the occurrence of false merging and missed detections.
Table 5. Evaluation of panicle counting performance across different models.

| Model | MAE ↓ | RMSE ↓ | R² ↑ |
|---|---|---|---|
| Faster R-CNN [39] | 8.20 | 12.50 | 0.850 |
| YOLOv5m [40] | 6.35 | 10.80 | 0.895 |
| YOLOv8m [41] | 5.05 | 9.35 | 0.938 |
| YOLOv10m [16] | 5.60 | 9.95 | 0.920 |
| YOLOv11m [42] | 4.95 | 9.25 | 0.942 |
| YOLOv12m [43] | 4.85 | 9.12 | 0.945 |
| RT-DETR-r18 [26] | 6.15 | 10.20 | 0.910 |
| Panicle-DETR (Ours) | 4.28 | 8.57 | 0.957 |
To intuitively analyze the response differences of various models when faced with complex field density distributions, linear regression scatter plots of predicted counts versus ground truth counts were evaluated. In actual dense counting tasks in farmlands, significant heteroscedasticity is typically observed; as the true number of targets increases, the prediction error amplifies accordingly. This pattern is clearly visually corroborated in Figure 8. For legacy models such as Faster R-CNN [39] and YOLOv5m [40], when the actual number of rice panicles exceeds 60, the test sample points exhibit a distinct trumpet-shaped divergence, severely deviating from the ideal theoretical fit. Conversely, the prediction scatter points for Panicle-DETR converge tightly around the regression line even in ultra-high-density regions containing over 100 panicles. Achieving a coefficient of determination (R²) of 0.957, our model effectively overcomes the instability caused by heteroscedasticity without systematic bias, demonstrating immense potential for direct application in large-scale field yield estimation.
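For reproducibility, the three counting metrics reported in Table 5 follow their standard definitions; a NumPy sketch over per-image count pairs is given below.

```python
import numpy as np

def counting_metrics(y_true, y_pred):
    # Standard definitions of MAE, RMSE and the coefficient of
    # determination R^2 over per-image panicle count pairs.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    ss_res = (err ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return mae, rmse, 1.0 - ss_res / ss_tot

# perfect predictions give MAE = RMSE = 0 and R^2 = 1
assert counting_metrics([1, 2, 3], [1, 2, 3]) == (0.0, 0.0, 1.0)
```

Because RMSE squares the residuals, it penalizes the large high-density errors behind the trumpet-shaped divergence far more heavily than MAE, which is why both metrics are reported together.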

3.3.3. Qualitative Visual Validation Across Altitudes

To intuitively corroborate the quantitative metrics and demonstrate the operational robustness of Panicle-DETR under real-world agricultural conditions, a comprehensive qualitative visual assessment was conducted across varying UAV flight altitudes. Figure 9 compares the detection outputs of representative baseline models against our proposed architecture at altitudes of 12 m, 7 m, and 3 m, arranged sequentially from left to right.
Visual comparisons confirm that flight altitude significantly dictates the morphological complexity of the panicles. At the 12 m altitude, severe spatial downsampling causes micro-targets to blend into the complex field background. Traditional architectures exhibit notable False Negatives in these regions, failing to capture severely degraded targets, which ultimately skews large-scale yield estimation. In contrast, Panicle-DETR successfully retrieves these minute semantic footprints. This direct visual evidence validates the efficacy of the Lossless Feature Encoder in preserving cross-scale spatial resolution, explicitly ensuring no panicle is lost to high-altitude pixel decimation.
At the optimal 7 m altitude, where panicle density and morphological adhesion become visually overwhelming, baseline models relying on the standard Intersection over Union metric frequently suffer from bounding box merging, manifesting as grouped False Negatives. Driven by the NWD and Inner-IoU composite loss, which shifts the optimization focus from rigid geometric overlap to internal distributional similarity, Panicle-DETR accurately delineates individual boundaries within these highly adhered clusters. Finally, at the 3 m ultra-low altitude, the narrow field of view inherently produces truncated targets at image borders alongside panicles lacking conspicuous visual features. While traditional baseline models struggle with these partial semantics and produce noticeable missed detections, Panicle-DETR maintains exceptional feature extraction robustness. Facilitated by the frequency-aware FasterFD backbone, our model successfully captures subtle morphological cues, ensuring accurate bounding box localization on severely truncated targets that other architectures persistently fail to detect.

3.4. Error Analysis and Model Limitations

A critical appraisal of failure modes is essential to delineate the engineering boundaries of Panicle-DETR in unstructured agricultural environments. A targeted error analysis on the test set isolated specific environmental scenarios and morphological edge cases that compromise model reliability, with representative instances illustrated in Figure 10.
One primary failure mode, depicted in panel (a), involves complex environmental reflections on the water surface within irrigated paddies. Specular reflections of the sky or dense foliage can mimic the semantic granularity and structural characteristics of rice panicles under specific solar angles, occasionally triggering false positives. Concurrently, extreme physical occlusion remains a persistent bottleneck. Although the Lossless Feature Encoder optimizes multi-scale fusion, the defining textural and geometric footprints of a panicle are irreparably suppressed when almost entirely masked by broad flag leaves, inevitably leading to false negatives.
Furthermore, strong illumination saturation introduces notable detection challenges, as observed in panel (b). Under peak solar intensity, extreme brightness washes out the high-frequency textural distinctiveness relied upon by the FasterFD backbone. The attenuated contrast between the target and the background causes spectral features to merge. This phenomenon leads to coupled failures: missed detections arising from semantic loss, and false positives where overexposed background elements, such as withered leaves, are erroneously assigned high objectness scores. These limitations underscore the inherent constraints of purely visual-based detectors in open fields, pointing toward future research in multi-sensor fusion or illumination-invariant data augmentation.

4. Discussion

The deployment of deep learning models for high-throughput phenotyping via Unmanned Aerial Vehicles remains highly constrained by the unstructured nature of open-field agricultural environments. Traditional computer vision architectures, primarily optimized for standardized datasets, frequently exhibit severe performance degradation when confronted with the dynamic challenges of agricultural inspections—most notably, extreme multi-altitude spatial variations, profound computational hardware limits, and severe morphological occlusion. The empirical results demonstrated by Panicle-DETR provide a comprehensive engineering solution to these inherent bottlenecks.

4.1. Overcoming the Spatial-Semantic Dichotomy in Multi-Altitude Phenotyping

One of the primary catalysts for the architectural redesign in this study was the catastrophic semantic collapse observed in conventional models during UAV altitude fluctuations ranging from 3 m to 20 m. As noted in recent 2025 evaluations of small object detection in aerial imagery [44], standard strided convolutions irreversibly decimate the physical pixel footprints of high-altitude micro-targets. While architectures like YOLOv12m achieve exceptional accuracy at a fixed 7 m altitude, their error rates surge unacceptably at higher elevations.
Our proposed Lossless Feature Encoder directly mitigates this vulnerability. By embedding the Space-to-Depth mapping mechanism within the BiFPN topology, Panicle-DETR circumvents pixel decimation, ensuring that the granular textures of micro-panicles—even at severe high-altitude perspectives between 12 m and 20 m—are losslessly preserved and adaptively fused into deep semantic layers. This specific structural innovation explains why Panicle-DETR sustained robust localization accuracy under extreme spatial downsampling, proving that multi-scale robustness in precision agriculture requires explicit spatial preservation and feature distillation strategies, especially around boundary regions of panicles [45].
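The adaptive fusion inside the BiFPN topology typically follows the fast normalized fusion rule of EfficientDet [31]; whether Panicle-DETR uses this exact variant is an assumption, but the rule itself is easy to sketch:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    # EfficientDet-style fast normalized fusion: clip the learnable
    # per-input weights at zero (ReLU), normalize them to sum to ~1,
    # and blend the (already resampled) feature maps accordingly, so
    # the network learns how much each resolution contributes.
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))
```

For example, fusing a shallow high-resolution map with a deep semantic map under weights (1, 1) yields their near-exact average, while a weight driven to zero silently drops an uninformative scale.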

4.2. Bridging the Gap Between Computer Vision and Agronomic Yield Estimation

In the context of digital agriculture, the ultimate objective of panicle detection is not merely drawing bounding boxes, but delivering highly reliable counting metrics for downstream yield estimation. Many contemporary agricultural vision studies rely exclusively on Mean Average Precision as the terminal metric, which often masks underlying counting deficiencies when assessing complex phenotypic traits such as panicle number per unit area [46].
As previously visualized in our comparative scatter plots, a distinct trumpet-shaped divergence emerged in standard models when the true panicle count exceeded 60 per image. This discrepancy is primarily caused by standard Intersection over Union metrics failing to distinguish heavily adhered panicles, leading to aggressive missed detections during Non-Maximum Suppression. By shifting the regression paradigm from rigid geometric overlap to distributional similarity and amplifying the inner-boundary gradients, Panicle-DETR effectively decoupled these adhered clusters. The resulting reduction in counting error confirms that specifically tailoring the loss function to the morphological characteristics of dense crops is an absolute prerequisite for transitioning from theoretical visual detection to practical intelligent agriculture [47].

4.3. Engineering Viability and Edge-Deployment Trade-offs

For battery-powered agricultural UAVs, computational efficiency dictates the feasibility of real-time onboard processing. Traditional heavy networks such as ResNet-50 demand over 100 GFLOPs, rendering them largely impractical for real-time edge deployment. Conversely, naive lightweight networks suffer severe accuracy drops due to their inability to filter out complex environmental noise, such as water reflections in paddies.
The integration of the FasterFD backbone represents a pivotal engineering trade-off. By explicitly decoupling low-frequency background noise from high-frequency panicle textures within the frequency domain, the network minimizes redundant spatial convolutions. This mechanism allows Panicle-DETR to operate at a highly economical 53.0 GFLOPs while utilizing only 13.78 M parameters, yet achieving an outstanding 93.94% mAP@50. This establishes a new paradigm for edge-deployable agricultural models: leveraging spectral filtering to maximize the utility of constrained computational bandwidth.
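The noise-decoupling intuition can be illustrated with an elementary high-pass filter; this is a conceptual sketch only, not the learned FDConv operator of the actual backbone:

```python
import numpy as np

def suppress_low_frequencies(x, cutoff=2):
    # Illustrative high-pass filter (not the exact FDConv operator):
    # transform a 2-D feature map to the frequency domain, zero the
    # lowest-frequency bins that carry smooth background signal, and
    # transform back, retaining high-frequency texture such as edges.
    spec = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    cy, cx = h // 2, w // 2
    spec[cy - cutoff:cy + cutoff + 1, cx - cutoff:cx + cutoff + 1] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec)))

# a constant (pure low-frequency) map is suppressed to ~zero
assert np.abs(suppress_low_frequencies(np.ones((8, 8)))).max() < 1e-8
```

A uniform water-surface region behaves like the constant map above and is nulled out, whereas fine panicle edges concentrate in the untouched high-frequency bins and pass through largely intact.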

4.4. Limitations and Future Work

Despite the robust performance of Panicle-DETR, several limitations warrant further investigation. Firstly, the current dataset primarily captures images under adequate daytime illumination. The model’s feature extraction resilience under sub-optimal lighting conditions requires subsequent validation. Secondly, while the parameter count has been strictly minimized, the actual deployment on commercial UAVs necessitates further hardware-specific optimizations. Future research will focus on applying INT8 post-training quantization and TensorRT acceleration to Panicle-DETR, aiming to achieve ultra-low-latency processing directly on edge computing modules without sacrificing the multi-altitude counting accuracy. This strategy explicitly aligns with the latest proven trends of deploying lightweight AI models on agricultural edge devices for real-time inference [48].

5. Conclusions

The transition of high-throughput crop phenotyping to autonomous UAVs is primarily hindered by the computational limits of edge devices and the extreme spatial variability of open-field environments. To resolve this spatial-semantic dichotomy, this study introduces Panicle-DETR, an architecture that reconstructs the feature extraction paradigm without relying on brute-force parameter scaling. By projecting visual perception into the frequency domain via the FasterFD backbone, the network explicitly decouples complex environmental noise from critical high-frequency agronomic textures. By integrating the Lossless Feature Encoder with a composite regression metric based on NWD and Inner-IoU, the framework eliminates the pixel decimation inherent to traditional strided convolutions, effectively resolving the morphological adhesion of dense panicle clusters under diverse altitude-induced spatial variations.
Empirically, these architectural innovations establish a highly competitive state-of-the-art for agricultural edge-inference. Panicle-DETR achieves a peak mAP@50 of 93.94% and translates theoretical detection into reliable agronomic metrics, driving the counting Mean Absolute Error down to an exceptional 4.28. Crucially, this robust multi-altitude performance is sustained at an economical computational footprint of 53.0 GFLOPs and 13.78 M parameters. Ultimately, Panicle-DETR transcends conventional bounding-box detection to establish a scalable, hardware-efficient foundation for real-time, onboard yield estimation, paving the way for uninterrupted digital crop management without the prerequisite of cloud computing.

Author Contributions

Conceptualization, J.Z. and Y.C.; methodology, J.Z., Y.C. and J.J.; software, J.Z., Y.C. and S.H.; validation, J.Z., Y.C., J.J., X.Z. and W.H.; formal analysis, J.Z. and Y.C.; investigation, J.Z., Y.C. and X.Z.; resources, Y.C.; data curation, J.Z., J.J. and W.H.; writing—original draft preparation, J.Z.; writing—review and editing, Y.C., S.H., X.Z. and W.H.; visualization, J.Z. and Y.C.; supervision, Y.C.; project administration, Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by the authors.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Madec, S.; Jin, X.; Lu, H.; De Solan, B.; Liu, S.; Duyme, F.; Heritier, E.; Baret, F. Ear density estimation from high resolution RGB imagery using deep learning technique. Agric. For. Meteorol. 2019, 264, 225–234. [Google Scholar] [CrossRef]
  2. Ampatzidis, Y.; Partel, V. UAV-based high throughput phenotyping in citrus utilizing multispectral imaging and artificial intelligence. Remote Sens. 2019, 11, 410. [Google Scholar] [CrossRef]
  3. Chen, R.; et al. High-throughput UAV-based rice panicle detection and genetic mapping of heading-date-related traits. Front. Plant Sci. 2024, 15, 1327507. [Google Scholar] [CrossRef]
  4. Huang, J.; Chen, L.; Zhang, M.; et al. Estimate the Pre-Flowering Specific Leaf Area of Rice Based on Vegetation Indices and Texture Indices Derived from UAV Multispectral Imagery. Agriculture 2025, 15, 2293. [Google Scholar] [CrossRef]
  5. Zhou, C.; Ye, H.; Hu, J.; et al. Automated counting of rice panicle by applying deep learning model to images from unmanned aerial vehicle platform. Sensors 2019, 19, 3106. [Google Scholar] [CrossRef] [PubMed]
  6. Guo, W.; Fukatsu, T.; Ninomiya, S. Automated characterization of flowering dynamics in rice using field-acquired time-series RGB images. Plant Methods 2015, 11, 7. [Google Scholar] [CrossRef] [PubMed]
  7. Shen, J.; Yue, J.; Liu, Y.; Yao, Y.; Feng, H.; Yang, H.; Guo, W.; Ma, X.; Fu, Y.; Shu, M.; Yang, G.; Qiao, H. Analyzing maize stem circumference, stem height, and stem circumference-to-height ratio using UAV, UGV, and deep learning. Comput. Electron. Agric. 2025, 239, 111019. [Google Scholar] [CrossRef]
  8. Yuan, Z.; Gong, J.; Guo, B.; Wang, C.; Liao, N.; Song, J.; Wu, Q. Small Object Detection in UAV Remote Sensing Images Based on Intra-Group Multi-Scale Fusion Attention and Adaptive Weighted Feature Fusion Mechanism. Remote Sens. 2024, 16, 4265. [Google Scholar] [CrossRef]
  9. Heng, Z.; Xie, Y.; Du, D. MIE-YOLO: A Multi-Scale Information-Enhanced Weed Detection Algorithm for Precision Agriculture. AgriEngineering 2026, 8, 16. [Google Scholar] [CrossRef]
  10. Oliveira, T.C.M.; Souza, J.B.C.; Almeida, S.L.H.; et al. Combining Artificial Intelligence and Remote Sensing to Enhance the Estimation of Peanut Pod Maturity. AgriEngineering 2025, 7, 368. [Google Scholar] [CrossRef]
  11. Zhang, Y.; Xiao, D.; Liu, Y.; et al. An algorithm for automatic identification of multiple developmental stages of rice spikes based on improved Faster R-CNN. Crop J. 2022, 10, 1323–1333. [Google Scholar] [CrossRef]
  12. Tanimoto, Y.; Zhang, Z.; Yoshida, S. Object Detection for Yellow Maturing Citrus Fruits from Constrained or Biased UAV Images: Performance Comparison of Various Versions of YOLO Models. AgriEngineering 2024, 6, 4308–4324. [Google Scholar] [CrossRef]
  13. Rodríguez-Lira, D.-C.; Córdova-Esparza, D.-M.; Álvarez-Alvarado, J.M.; Romero-González, J.-A.; Terven, J.; Rodríguez-Reséndiz, J. Comparative Analysis of YOLO Models for Bean Leaf Disease Detection in Natural Environments. AgriEngineering 2024, 6, 4585–4603. [Google Scholar] [CrossRef]
  14. Wu, H.; Guan, M.; Chen, J.; Pan, Y.; Zheng, J.; Jin, Z.; Li, H.; Tan, S. OE-YOLO: An EfficientNet-Based YOLO Network for Rice Panicle Detection. Plants 2025, 14, 1370. [Google Scholar] [CrossRef]
  15. Huang, D.; Chen, Z.; Zhuang, J.; Song, G.; Huang, H.; Li, F.; Huang, G.; Liu, C. DRPU-YOLO11: A Multi-Scale Model for Detecting Rice Panicles in UAV Images with Complex Infield Background. Agriculture 2026, 16, 234. [Google Scholar] [CrossRef]
  16. Wang, A.; Chen, H.; Liu, L.; et al. YOLOv10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
  17. Guo, Y.; Zhan, W.; Zhang, Z.; Zhang, Y.; Guo, H. FRPNet: A Lightweight Multi-Altitude Field Rice Panicle Detection and Counting Network Based on Unmanned Aerial Vehicle Images. Agronomy 2025, 15, 1396. [Google Scholar] [CrossRef]
  18. Wu, P.; Zhao, J. LKD-YOLO: Improved YOLOv8 for Rice Panicle Detection in Drone Images. In Proceedings of the 2025 4th International Symposium on Computer Applications and Information Technology (ISCAIT), Xi’an, China, 21–23 March 2025; pp. 549–553. [Google Scholar]
  19. Liang, Y.; et al. A rotated rice spike detection model and a crop yield estimation application based on UAV images. Comput. Electron. Agric. 2024, 224, 109188. [Google Scholar] [CrossRef]
  20. Yao, M.; et al. Rice counting and localization in unmanned aerial vehicle imagery using enhanced feature fusion. Agronomy 2024, 14, 868. [Google Scholar] [CrossRef]
  21. Qian, Y.; et al. MFNet: Multi-scale feature enhancement networks for wheat head detection and counting in complex scene. Comput. Electron. Agric. 2024, 225, 109342. [Google Scholar] [CrossRef]
  22. Thayananthan, T.; Zhang, X.; Huang, Y.; Chen, J.; Wijewardane, N. K.; Martins, V. S.; Chesser, G. D.; Goodin, C. T. CottonSim: A vision-guided autonomous robotic system for cotton harvesting in Gazebo simulation. Comput. Electron. Agric. 2025, 239, 110963. [Google Scholar] [CrossRef]
  23. Lan, M.; Liu, C.; Zheng, H.; et al. RICE-YOLO: In-field rice spike detection based on improved YOLOv5 and drone images. Agronomy 2024, 14, 836. [Google Scholar] [CrossRef]
  24. Cai, W.; et al. Rice growth-stage recognition based on improved YOLOv8 with UAV imagery. Agronomy 2024, 14, 2751. [Google Scholar] [CrossRef]
  25. Carion, N.; Massa, F.; Synnaeve, G.; et al. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  26. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
  27. Guo, Z.; Cai, D.; Jin, Z.; Xu, T.; Yu, F. Research on unmanned aerial vehicle (UAV) rice field weed sensing image segmentation method based on CNN-transformer. Comput. Electron. Agric. 2025, 229, 109719. [Google Scholar] [CrossRef]
  28. Teng, Z.; Chen, J.; Wang, J.; Wu, S.; Chen, R.; Lin, Y.; Shen, L.; Jackson, R.; Zhou, J.; Yang, C. Panicle-Cloud: An Open and AI-Powered Cloud Computing Platform for Quantifying Rice Panicles from Drone-Collected Imagery to Enable the Classification of Yield Production in Rice. Plant Phenomics 2023, 5, 0105. [Google Scholar] [CrossRef] [PubMed]
  29. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 12021–12031. [Google Scholar]
  30. Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025. [Google Scholar]
  31. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  32. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. arXiv 2022, arXiv:2208.03641. [Google Scholar]
  33. Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  34. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  36. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  37. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 12051–12061. [Google Scholar]
  38. Tan, M.; Le, Q.V. EfficientNetV2: Smaller Models and Faster Training. arXiv 2021, arXiv:2104.00298. [Google Scholar]
  39. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  40. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar] [CrossRef]
  41. Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857. [Google Scholar]
  42. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar]
  43. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  44. Nikouei, M.; et al. Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications. arXiv 2025, arXiv:2503.20516. [Google Scholar] [CrossRef]
  45. Ramachandran, A.; Kumar, S. Border Sensitive Knowledge Distillation for Rice Panicle Detection in UAV Images. Comput. Mater. Contin. 2024, 81, 827–842. [Google Scholar] [CrossRef]
  46. Lu, X.; Shen, Y.; Cen, H.; et al. Phenotyping of Panicle Number and Shape in Rice Breeding Materials Based on Unmanned Aerial Vehicle Imagery. Plant Phenomics 2024, 6, 0265. [Google Scholar] [CrossRef] [PubMed]
  47. Zhu, J.; Lin, Y.; Liu, Y.; Wang, H.; Qin, Y.; Li, M.; Xu, H.; He, Y. Intelligent agriculture: deep learning in UAV-based remote sensing imagery for crop diseases and pests detection. Front. Plant Sci. 2024, 15, 1435016. [Google Scholar] [CrossRef]
  48. Gookyi, D.A.N.; Wulnye, F.A.; Wilson, M.; Danquah, P.; Danso, S.A.; Gariba, A.A. Enabling Intelligence on the Edge: Leveraging Edge Impulse to Deploy Multiple Deep Learning Models on Edge Devices for Tomato Leaf Disease Detection. AgriEngineering 2024, 6, 3563–3585. [Google Scholar] [CrossRef]
Figure 1. Overview of the study area and UAV data acquisition. (a) Geographic location of the Gannan Red Seed Industry Rice Research Base in Jiangxi, China. (b) Actual field conditions during the rice heading stage. (c) The DJI Mini 3 UAV used for 3 m ultra-low-altitude image acquisition.
Figure 2. Workflow of the composite multi-altitude dataset construction.
Figure 3. Examples of data augmentation strategies used in this study. (a) Geometric transformation (flipping); (b) Photometric distortion (brightness/contrast adjustment); (c) Noise injection.
Figure 4. The overall network architecture of the proposed Panicle-DETR framework.
Figure 5. Schematic diagram of the C2f-FasterFD module, illustrating the integration of Partial Convolution (PConv) and Frequency Domain Convolution (FDConv).
Figure 6. The core components of the Lossless Feature Encoder (LFE). (a) Schematic diagram of the BiFPN structure, establishing bidirectional cross-scale connections; (b) Illustration of the Space-to-Depth (SPDConv) downsampling process, achieving resolution reduction without spatial information loss by slicing spatial dimensions into the channel axis.
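The SPDConv operation in Figure 6(b) is a deterministic rearrangement rather than a learned downsampling. As a minimal sketch independent of the authors' code, a 2 × 2 space-to-depth on a nested-list H × W × C feature map:

```python
def space_to_depth(fmap, block=2):
    """Rearrange an H x W x C feature map (nested lists) into
    (H/block) x (W/block) x (C*block*block) by moving each spatial
    block into the channel axis -- no pixel values are discarded."""
    H = len(fmap)
    W = len(fmap[0])
    out = []
    for i in range(0, H, block):
        row = []
        for j in range(0, W, block):
            # concatenate the channels of every pixel in the block
            cell = []
            for di in range(block):
                for dj in range(block):
                    cell.extend(fmap[i + di][j + dj])
            row.append(cell)
        out.append(row)
    return out

# A 4x4 map with 1 channel becomes a 2x2 map with 4 channels.
fm = [[[r * 4 + c] for c in range(4)] for r in range(4)]
out = space_to_depth(fm)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
print(out[0][0])  # [0, 1, 4, 5]
```

Because every input value reappears in the output, small panicle targets cannot be decimated by this step, which is the "lossless" property the LFE relies on.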
Figure 7. Comparison of training convergence curves (Precision, Recall, and mAP@50) among different models. The red dashed line on the axis indicates the early stopping point of our Panicle-DETR model at epoch 236.
Figure 8. Linear regression scatter plots of ground truth versus predicted panicle counts for all evaluated models.
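The agreement shown in Figure 8 is summarized by the Mean Absolute Error and the coefficient of determination. For reference, a minimal plain-Python computation of both metrics using their standard definitions (the per-plot counts below are hypothetical, for illustration only):

```python
def mae(y_true, y_pred):
    """Mean Absolute Error between ground-truth and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical panicle counts for three plots.
truth = [50, 60, 70]
preds = [52, 58, 71]
print(round(mae(truth, preds), 2), round(r_squared(truth, preds), 3))
# 1.67 0.955
```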
Figure 9. Qualitative detection results of representative models across varying flight altitudes.
Figure 10. Representative failure modes of Panicle-DETR. (a) False positives from specular water reflections and false negatives due to severe flag leaf occlusion. (b) Detection failures under extreme solar illumination resulting in texture washout.
Table 1. Detailed configuration of the unified training hyperparameters.
| Parameter | Value/Description |
|---|---|
| Input Resolution | 640 × 640 pixels |
| Optimizer | AdamW |
| Initial Learning Rate | 1 × 10⁻³ (0.001) |
| Momentum | 0.9 |
| Batch Size | 16 |
| Total Epochs | 300 |
| LR Scheduler | Cosine Annealing |
| Early Stopping Patience | 50 epochs |
| Weight Decay | 1 × 10⁻⁴ (0.0001) |
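Table 1 specifies a cosine-annealing schedule but not its implementation. A minimal sketch of the standard cosine-annealing formula under these hyperparameters (an initial rate of 1 × 10⁻³ decayed over 300 epochs; the floor `lr_min` is an assumption, set to zero here):

```python
import math

def cosine_annealing_lr(epoch, total_epochs=300, lr_max=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate for a given (0-indexed) epoch:
    lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

# The schedule starts at the initial rate and decays smoothly toward lr_min.
print(cosine_annealing_lr(0))    # 0.001 at the first epoch
print(cosine_annealing_lr(150))  # 0.0005 at the midpoint
```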
Table 2. Combinatorial ablation study isolating the performance impact and computational overhead of individual modules within the Panicle-DETR architecture.
| Variant | +FasterFD | +LFE | +Loss | Precision (%) | Recall (%) | mAP@50 (%) | GFLOPs |
|---|---|---|---|---|---|---|---|
| Baseline | – | – | – | 89.50 | 86.40 | 92.45 | 56.9 |
| Model 1 | ✓ | – | – | 89.80 | 86.80 | 92.95 | 41.8 |
| Model 2 | – | ✓ | – | 89.70 | 87.60 | 93.20 | 64.7 |
| Model 3 | – | – | ✓ | 89.65 | 86.45 | 92.55 | 56.9 |
| Model 4 | ✓ | ✓ | – | 90.40 | 88.00 | 93.60 | 53.0 |
| Model 5 | ✓ | – | ✓ | 90.05 | 86.85 | 93.05 | 41.8 |
| Model 6 | – | ✓ | ✓ | 89.90 | 87.75 | 93.30 | 64.7 |
| Panicle-DETR | ✓ | ✓ | ✓ | 90.97 | 88.33 | 93.94 | 53.0 |
Table 3. Performance comparison across different backbone feature extraction networks (all integrated with BiFPN for fair evaluation).
| Backbone | Precision (%) | Recall (%) | mAP@50 (%) | GFLOPs |
|---|---|---|---|---|
| ResNet-50 [35] | 90.65 | 87.80 | 93.70 | 108.5 |
| HGNetv2-L [26] | 90.10 | 87.25 | 93.20 | 74.2 |
| MobileNetV3 [36] | 87.80 | 84.15 | 90.50 | 47.7 |
| FasterNet [29] | 89.45 | 86.85 | 92.85 | 41.5 |
| StarNet [37] | 88.50 | 85.10 | 91.80 | 40.0 |
| EfficientNetV2 [38] | 89.25 | 86.80 | 92.80 | 49.5 |
| FasterFD (Ours) | 90.97 | 88.33 | 93.94 | 53.0 |
Table 4. Quantitative comparison of detection performance and computational complexity among different state-of-the-art models.
| Model | Precision (%) | Recall (%) | F1-Score (%) | mAP@50 (%) | mAP@75 (%) | mAP@50-95 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN [39] | 86.50 | 84.20 | 85.33 | 90.50 | 66.80 | 58.90 | 41.35 | 180.6 |
| YOLOv5m [40] | 88.50 | 86.10 | 87.28 | 92.50 | 70.80 | 62.10 | 25.05 | 64.0 |
| YOLOv8m [41] | 90.10 | 87.20 | 88.63 | 93.65 | 71.50 | 62.80 | 25.84 | 78.7 |
| YOLOv10m [16] | 88.80 | 85.90 | 87.33 | 92.70 | 70.90 | 62.30 | 16.45 | 63.4 |
| YOLOv11m [42] | 89.20 | 86.50 | 87.83 | 93.10 | 71.80 | 63.40 | 19.69 | 65.6 |
| YOLOv12m [43] | 90.50 | 87.50 | 88.97 | 93.80 | 72.50 | 63.90 | 19.54 | 58.6 |
| RT-DETR-r18 [26] | 89.50 | 86.40 | 87.92 | 92.45 | 68.10 | 60.50 | 19.87 | 56.9 |
| Panicle-DETR (Ours) | 90.97 | 88.33 | 89.63 | 93.94 | 74.15 | 64.75 | 13.78 | 53.0 |
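The F1-Score column in Table 4 is the harmonic mean of Precision and Recall. As a quick consistency check on the reported values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(90.97, 88.33), 2))  # 89.63, matching the Panicle-DETR row
print(round(f1_score(86.50, 84.20), 2))  # 85.33, matching the Faster R-CNN row
```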
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.