Preprint
Article

This version is not peer-reviewed.

AeroPinWorld: Revisiting Stride-2 Downsampling for Zero-Shot Transferable Open-Vocabulary UAV Detection

   These authors contributed equally to this work.

A peer-reviewed version of this preprint was published in:
Electronics 2026, 15(7), 1364. https://doi.org/10.3390/electronics15071364

Submitted:

27 February 2026

Posted:

02 March 2026


Abstract
Open-vocabulary object detectors enable prompt-driven recognition, yet their zero-shot transfer to unmanned aerial vehicle (UAV) imagery remains fragile under domain shift, where tiny and cluttered targets depend on weak fine-grained cues. We propose AeroPinWorld, a pinwheel-augmented YOLO-World v2 that revisits stride-2 downsampling as a key transfer bottleneck: aggressive resolution reduction can induce aliasing-driven detail loss and sampling-phase sensitivity, which disproportionately harms small-object representations and degrades cross-dataset generalization in aerial scenes. To address this, AeroPinWorld introduces pinwheel-shaped convolution (PConv) as a phase-aware reduction operator. PConv probes complementary offsets via asymmetric padding and directional kernels before feature fusion, strengthening local structure aggregation at downsampling junctions. Importantly, we do not replace all downsampling operations; instead, we selectively substitute PConv at critical pyramid transitions, including the first two backbone reductions (P1/2 and P2/4) and the two bottom-up stride-2 reductions in the head, while keeping later backbone stages unchanged to preserve efficiency. We evaluate under a strict zero-shot cross-dataset protocol by training on COCO2017 for 24 epochs from official pretrained weights and directly testing on two UAV benchmarks, VisDrone2019-DET and UAVDT, without any target-domain fine-tuning, using an offline prompt vocabulary at inference. Experiments demonstrate consistent improvements over the baseline, achieving a +2.3 mAP gain and a +0.9 APS gain on VisDrone, and yielding consistent gains on UAVDT, while maintaining a competitive efficiency profile.

1. Introduction

Open-vocabulary object detection (OVOD) [1,2,3] aims to detect objects beyond a fixed set of training categories by leveraging vision–language priors and text prompts [4], offering a practical route to reduce annotation demand in open-world deployment. This capability is particularly appealing for unmanned aerial vehicle (UAV) perception, where data collection is easy but large-scale, instance-level labeling is expensive and often lags behind rapidly changing scenes. However, UAV imagery also amplifies long-standing detection difficulties: objects are frequently tiny and densely distributed, captured under large viewpoint/altitude changes, motion blur, compression artifacts, and complex backgrounds [5,6,7,8]. These factors create severe domain shift relative to natural-image training corpora and make generalization a first-order concern [9].
In this work, we study a strict and practically important regime: zero-shot cross-dataset UAV detection. Here, zero-shot means that the detector is trained only on a source dataset and is directly evaluated on a target UAV dataset without any target-domain fine-tuning or target-domain training images. Although YOLO-World v2 [10] provides a strong real-time OVOD baseline by integrating language features into multi-scale visual representations and enabling prompt-driven inference with an offline vocabulary, we observe that zero-shot transfer to UAV benchmarks remains fragile, with small objects being the most vulnerable. This suggests that, beyond cross-modal alignment and prompt engineering, the underlying visual feature formation and its stability under domain shift deserve closer attention.
We revisit an often under-emphasized factor for zero-shot transfer: stride-2 downsampling. Standard stride-2 convolution performs aggressive resolution reduction and commits to a single sampling origin, which can induce aliasing-driven loss of fine-grained details and sensitivity to sampling phase under small spatial shifts [11,12]. For UAV imagery, these effects are particularly harmful because tiny objects may occupy only a few pixels in the input and can collapse to one or a few cells at intermediate strides (e.g., 8 or 16). Once such fine-scale cues are attenuated at early downsampling junctions, subsequent feature fusion and text-conditioned reasoning have limited capacity to recover them, leading to missed detections and localization errors under domain shift. Therefore, we argue that stride-2 reduction constitutes a key zero-shot transfer bottleneck for open-vocabulary UAV detection.
To address this bottleneck, we propose AeroPinWorld, a pinwheel-augmented YOLO-World v2 that revisits stride-2 downsampling as a first-class design axis for zero-shot transferable OVOD in UAV scenarios. AeroPinWorld introduces pinwheel-shaped convolution (PConv) [13] as a phase-aware reduction operator. PConv probes complementary offsets via asymmetric padding and directional kernels before feature fusion, strengthening local structure aggregation at downsampling junctions and reducing sensitivity to sampling-phase variations. Importantly, we do not replace all downsampling operations. Instead, we selectively substitute PConv at critical pyramid transitions, including the first two backbone reductions (P1/2 and P2/4) and the two bottom-up stride-2 reductions in the head, while keeping later backbone stages unchanged to preserve efficiency. This selective design targets the stages where tiny-object evidence is most vulnerable in zero-shot transfer.
We evaluate AeroPinWorld under a strict zero-shot cross-dataset protocol: the model is trained exclusively on COCO2017 from official pretrained weights and is then directly tested on two UAV benchmarks, VisDrone2019-DET and UAVDT, without any target-domain fine-tuning, using a fixed offline prompt vocabulary at inference time. VisDrone provides dense multi-class scenes with abundant tiny objects, while UAVDT offers a vehicle-centric complementary benchmark, enabling a more reliable assessment of cross-dataset robustness under aerial domain shift.
The results demonstrate consistent improvements over the YOLO-World v2 baseline in zero-shot transfer, with particularly noticeable gains for small aerial objects under domain shift.
Our contributions are as follows:
  • We study zero-shot cross-dataset open-vocabulary UAV detection and identify stride-2 downsampling as a central zero-shot transfer bottleneck that disproportionately harms tiny-object representations under domain shift.
  • We propose AeroPinWorld, a pinwheel-augmented open-vocabulary UAV detector for zero-shot cross-dataset transfer, which introduces pinwheel-shaped convolution (PConv) as a phase-aware stride-2 reduction via selective insertion at transfer-critical pyramid transitions.
  • We establish a rigorous zero-shot cross-dataset evaluation protocol by training solely on COCO2017 and directly evaluating on VisDrone2019-DET and UAVDT without any target-domain training, demonstrating consistent improvements under this protocol.

3. Method

3.1. Overview: Revisiting Stride-2 Downsampling for Transferable OV-UAV Detection

We propose AeroPinWorld, a transfer-oriented variant of YOLO-World v2 for open-vocabulary UAV detection. As illustrated in Figure 1, AeroPinWorld preserves the prompt-driven detection paradigm and the vision-language fusion design of YOLO-World [10], while revisiting stride-2 downsampling as a hidden yet critical bottleneck for cross-dataset generalization in aerial scenes.
Our central insight is that, under domain shift, the model’s ability to preserve and propagate weak fine-grained cues (tiny targets, thin structures, cluttered backgrounds) is often determined at a few stride-2 junctions where spatial resolution is aggressively reduced. These junctions implicitly decide (i) which high-frequency evidence survives, and (ii) how sensitive the representation is to sampling alignment. Therefore, rather than globally modifying the detector, AeroPinWorld introduces a phase-aware stride-2 reduction block (PConv) [13] and deploys it selectively at transfer-critical pyramid transitions, strengthening low-level structure aggregation while maintaining efficiency.
Concretely, AeroPinWorld makes two architectural changes: (1) Transfer-critical downsampling reinforcement: we replace a small subset of stride-2 convolutions with a PConv-based phase-aware reduction block (Section 3.3). (2) Selective integration with minimal disruption: we apply the replacement only to early backbone reductions and bottom-up stride-2 transitions in the head (Section 3.4), leaving later backbone stages untouched to preserve runtime.

3.2. Revisiting Stride-2 Reduction as a Transfer-Critical Operator

Let $X \in \mathbb{R}^{B \times C \times H \times W}$ be an intermediate feature map. A standard stride-2 convolution can be viewed as a fixed-alignment reduction that jointly samples and aggregates local neighborhoods, producing $Y \in \mathbb{R}^{B \times C \times \frac{H}{2} \times \frac{W}{2}}$. While effective in closed-set in-domain training, fixed-alignment reduction can be brittle for transfer: in UAV imagery, tiny objects may occupy only a handful of pixels, and their discriminative patterns are easily attenuated or destabilized by early resolution reduction and minor sampling misalignment. Once these cues are suppressed at early stride-2 steps, subsequent multi-scale fusion and open-vocabulary matching inherit the information loss, degrading cross-dataset generalization.
AeroPinWorld addresses this by replacing the most transfer-sensitive stride-2 reductions with a phase-aware operator that probes multiple offset hypotheses before committing to a coarse-grid representation. This design reduces alignment sensitivity and preserves locally consistent structure evidence, which is particularly important for dense tiny targets.

3.3. PConv-Based Phase-Aware Stride-2 Block

We adopt the plug-and-play PConv block [13] as a phase-aware stride-2 reduction operator. Given $X \in \mathbb{R}^{B \times C \times H \times W}$, PConv constructs multiple offset probes via asymmetric padding and directional kernels, then fuses them into a downsampled feature.

Offset probing with asymmetric padding.

Let $\mathrm{Pad}(\,\cdot\,;\, l, r, t, b)$ denote zero-padding with $(l, r, t, b)$ pixels on the left, right, top, and bottom. With kernel size $k$ (we use $k = 3$), PConv forms four complementary probes:
$$Y_{w0} = \mathrm{Conv}_{1 \times k}^{\,s=2}\big(\mathrm{Pad}(X;\, k, 0, 1, 0)\big), \qquad Y_{w1} = \mathrm{Conv}_{1 \times k}^{\,s=2}\big(\mathrm{Pad}(X;\, 0, k, 0, 1)\big),$$
$$Y_{h0} = \mathrm{Conv}_{k \times 1}^{\,s=2}\big(\mathrm{Pad}(X;\, 0, 1, k, 0)\big), \qquad Y_{h1} = \mathrm{Conv}_{k \times 1}^{\,s=2}\big(\mathrm{Pad}(X;\, 1, 0, 0, k)\big).$$
Directional kernels ($1 \times k$ and $k \times 1$) encourage anisotropic structure extraction that is common in aerial scenes (edges, elongated patterns, thin structures), while the multi-offset probing reduces dependence on a single sampling alignment.

Channel fusion into a coarse-grid feature.

The four probes are concatenated and fused by a lightweight convolution:
$$Y = \phi\big(\mathrm{Concat}[\,Y_{w0}, Y_{w1}, Y_{h0}, Y_{h1}\,]\big),$$
where $\phi(\cdot)$ is implemented as a small convolution (e.g., $2 \times 2$) to mix offset-specific evidence and output the stride-2 feature $Y$.
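For concreteness, the following minimal PyTorch sketch illustrates the phase-aware stride-2 block described above. It follows the four-probe construction and the $2 \times 2$ fusion; details such as the per-probe channel split ($c_{\mathrm{out}}/4$ per probe), weight sharing between the two probes of each direction, and the SiLU activation are illustrative assumptions rather than a verbatim copy of our implementation.

```python
import torch
import torch.nn as nn


class PhaseAwareDownsample(nn.Module):
    """Minimal sketch of the PConv-style phase-aware stride-2 block (Section 3.3)."""

    def __init__(self, c_in: int, c_out: int, k: int = 3, s: int = 2):
        super().__init__()
        # (left, right, top, bottom) pads matching Y_w0, Y_w1, Y_h0, Y_h1
        pads = [(k, 0, 1, 0), (0, k, 0, 1), (0, 1, k, 0), (1, 0, 0, k)]
        self.pads = nn.ModuleList([nn.ZeroPad2d(p) for p in pads])
        self.conv_w = nn.Conv2d(c_in, c_out // 4, kernel_size=(1, k), stride=s)  # 1xk probes
        self.conv_h = nn.Conv2d(c_in, c_out // 4, kernel_size=(k, 1), stride=s)  # kx1 probes
        self.fuse = nn.Conv2d(c_out, c_out, kernel_size=2, stride=1)             # phi(.)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_w0 = self.conv_w(self.pads[0](x))  # horizontal probe, first offset
        y_w1 = self.conv_w(self.pads[1](x))  # horizontal probe, complementary offset
        y_h0 = self.conv_h(self.pads[2](x))  # vertical probe, first offset
        y_h1 = self.conv_h(self.pads[3](x))  # vertical probe, complementary offset
        y = torch.cat([y_w0, y_w1, y_h0, y_h1], dim=1)
        return self.act(self.fuse(y))        # (B, c_out, H/2, W/2)


if __name__ == "__main__":
    x = torch.randn(1, 64, 160, 160)
    print(PhaseAwareDownsample(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```

With an even-sized input, each of the four probes yields a map of size $(H/2 + 1) \times (W/2 + 1)$, and the $2 \times 2$ fusion returns the final $H/2 \times W/2$ grid.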

Why this helps transfer.

By exposing multiple sub-sampling alignments before reduction, PConv yields a more stable coarse-grid representation that is less sensitive to sampling phase and more robust to domain-specific texture/statistics shifts. This property is crucial when transferring from generic natural-image corpora to UAV imagery dominated by tiny, cluttered targets.

3.4. Selective Integration into YOLO-World v2

AeroPinWorld integrates the proposed phase-aware stride-2 block selectively to maximize transfer gains under a constrained budget. We do not replace every downsampling layer; instead, we identify the stride-2 transitions where (i) fine-detail survival is decided, and (ii) bottom-up pyramid consistency is established.

Replacement locations.

Let $\mathcal{D}$ denote a standard stride-2 reduction operator and $\mathcal{P}$ denote the proposed phase-aware stride-2 block (Section 3.3). AeroPinWorld replaces stride-2 operators only at a small set of transfer-critical transitions:
$$\mathcal{T} = \{\, \mathrm{P1/2},\; \mathrm{P2/4},\; \mathrm{P3}\text{--}\mathrm{P4},\; \mathrm{P4}\text{--}\mathrm{P5} \,\},$$
where P1/2 and P2/4 refer to the first two backbone reductions, and P3–P4 and P4–P5 refer to the two bottom-up reductions in the detection head. For each transition $\tau \in \mathcal{T}$, we instantiate $\mathcal{P}$ in place of $\mathcal{D}$, while keeping all other stride-2 reductions unchanged (a minimal code sketch is given after the list below).
  • Early backbone reductions (detail formation). We apply $\mathcal{P}$ to the first two backbone reductions that generate the P1/2 and P2/4 features. Operating on the highest-resolution internal maps, these layers largely determine whether weak high-frequency cues from tiny objects survive and enter the feature pyramid.
  • Bottom-up head reductions (pyramid construction). We further apply $\mathcal{P}$ to the two bottom-up stride-2 reductions in the head, namely the P3–P4 and P4–P5 transitions. This improves the stability of coarse-level representations under domain shift and mitigates error propagation through multi-scale fusion.
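The sketch below shows how such a selective substitution could be wired up, passing in the phase-aware block class (e.g., the PhaseAwareDownsample sketch from Section 3.3). The module paths listed in TRANSFER_CRITICAL are hypothetical placeholders, not the actual layer names of the YOLO-World v2 codebase.

```python
import torch.nn as nn

# Hypothetical module paths for the four transfer-critical transitions;
# the real names depend on the detector implementation.
TRANSFER_CRITICAL = {
    "backbone.stem.down",       # P1/2
    "backbone.stage1.down",     # P2/4
    "neck.downsample_p3_p4",    # P3 -> P4 (bottom-up)
    "neck.downsample_p4_p5",    # P4 -> P5 (bottom-up)
}


def swap_stride2(model: nn.Module, block_cls, targets=TRANSFER_CRITICAL) -> nn.Module:
    """Replace stride-2 convs at the listed transitions with block_cls
    (e.g., the PhaseAwareDownsample sketch from Section 3.3)."""
    matches = [(name, m) for name, m in model.named_modules()
               if name in targets and isinstance(m, nn.Conv2d) and m.stride == (2, 2)]
    for name, conv in matches:
        parent = model.get_submodule(name.rsplit(".", 1)[0]) if "." in name else model
        child = name.rsplit(".", 1)[-1]
        setattr(parent, child, block_cls(conv.in_channels, conv.out_channels))
    return model
```

All other stride-2 reductions, including those producing P3/8, P4/16, and P5/32, are left untouched.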

What we keep unchanged (and why).

Later backbone reductions (e.g., those producing P3/8, P4/16, P5/32 depending on the backbone design) are kept as standard stride-2 operators. These stages operate on smaller spatial maps and typically dominate runtime; replacing them yields diminishing transfer returns relative to cost. Therefore, AeroPinWorld concentrates modifications on the earliest detail-critical reductions and the head’s bottom-up construction steps, which offer the best transfer benefit per compute.

3.5. Open-Vocabulary Detection Head

AeroPinWorld inherits the open-vocabulary detection pipeline of YOLO-World [10]. Given an input image and a set of text prompts, a text encoder maps prompts into text embeddings, which are fused with multi-scale image features via a vision-language aggregation module. The detection head outputs (i) bounding box predictions and (ii) region-level logits used for classification against the prompt vocabulary. AeroPinWorld does not change the head design; instead, it strengthens the pyramid features consumed by the head, making the improvement orthogonal to other open-vocabulary components.
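As a minimal illustration of the prompt-conditioned classification step, the sketch below computes region-text similarity logits from L2-normalized region features and cached (offline) text embeddings. The fixed temperature is an assumption for illustration; the actual YOLO-World head additionally uses learnable scaling and bias terms that are omitted here.

```python
import torch
import torch.nn.functional as F


def region_text_logits(region_feats: torch.Tensor,  # (num_regions, D) visual features
                       text_embeds: torch.Tensor,   # (num_prompts, D) cached prompt embeddings
                       temperature: float = 0.05) -> torch.Tensor:
    """Cosine-similarity logits between region features and the prompt vocabulary."""
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return region_feats @ text_embeds.t() / temperature  # (num_regions, num_prompts)
```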

3.6. Optimization Objective

We follow the training objective implemented in our codebase, which couples a classification loss with bounding box regression losses. The total loss consists of three components: a classification term $\mathcal{L}_{\mathrm{con}}$, an IoU-based localization term $\mathcal{L}_{\mathrm{iou}}$, and a Distribution Focal Loss term $\mathcal{L}_{\mathrm{dfl}}$ [27].

Assignment and normalization.

Given predictions across all feature levels, a task-aligned assigner produces (i) a foreground mask $\mathcal{F}$ (positive anchors), (ii) assigned target boxes $\{b_i\}$, and (iii) soft target scores $\{y_{i,c}\}$ over classes $c$ for each anchor $i$. We denote the normalization factor
$$S = \max\Big(\sum_{i}\sum_{c} y_{i,c},\; 1\Big),$$
which normalizes all loss components by the total target score mass.

Classification loss $\mathcal{L}_{\mathrm{con}}$.

Let $z_{i,c}$ be the predicted logit for class $c$ at anchor $i$, and $y_{i,c} \in [0, 1]$ be the corresponding soft target score. We use the element-wise BCEWithLogits loss and normalize by $S$:
$$\mathcal{L}_{\mathrm{con}} = \frac{1}{S} \sum_{i}\sum_{c} \Big[ -\, y_{i,c} \log \sigma(z_{i,c}) - (1 - y_{i,c}) \log\big(1 - \sigma(z_{i,c})\big) \Big],$$
where $\sigma(\cdot)$ is the sigmoid function.
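A minimal sketch of $\mathcal{L}_{\mathrm{con}}$, assuming flattened per-anchor logits and the soft target scores produced by the assigner:

```python
import torch
import torch.nn.functional as F


def classification_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """BCE-with-logits over soft target scores, normalized by S = max(sum(y), 1).

    logits, targets: tensors of shape (num_anchors, num_classes); targets lie in [0, 1].
    """
    s = targets.sum().clamp(min=1.0)  # normalization factor S
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="sum")
    return bce / s
```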

IoU loss $\mathcal{L}_{\mathrm{iou}}$ (CIoU).

For each positive anchor $i \in \mathcal{F}$, let $\hat{b}_i$ be the decoded predicted box and $b_i$ the assigned target box. We use Complete IoU (CIoU) [28] and define
$$\mathcal{L}_{\mathrm{iou}} = \frac{1}{S} \sum_{i \in \mathcal{F}} w_i \big(1 - \mathrm{CIoU}(\hat{b}_i, b_i)\big), \qquad w_i = \sum_{c} y_{i,c},$$
where $w_i$ weights each positive anchor by its aggregated target score mass.
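For reference, a minimal CIoU loss following [28] is sketched below for boxes in $(x_1, y_1, x_2, y_2)$ format; production implementations typically detach the trade-off term $\alpha$ from the gradient, which is omitted here for brevity.

```python
import math
import torch


def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Per-box 1 - CIoU for (N, 4) boxes in (x1, y1, x2, y2) format."""
    # intersection and union
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared center distance normalized by the enclosing-box diagonal
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw.pow(2) + ch.pow(2) + eps
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]).pow(2)
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]).pow(2)) / 4
    # aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (
        torch.atan((target[:, 2] - target[:, 0]) / (target[:, 3] - target[:, 1] + eps))
        - torch.atan((pred[:, 2] - pred[:, 0]) / (pred[:, 3] - pred[:, 1] + eps))
    ).pow(2)
    alpha = v / (1 - iou + v + eps)
    return 1 - (iou - rho2 / c2 - alpha * v)
```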

Distribution Focal Loss $\mathcal{L}_{\mathrm{dfl}}$.

We parameterize a box by four distances from an anchor point: $t_i = (l_i, t_i, r_i, b_i)$. For each side $s \in \{l, t, r, b\}$, the model predicts a distribution over $M$ bins ($M = \mathrm{reg\_max}$). Let $p_{i,s} \in \mathbb{R}^{M}$ be the predicted logits and $t_{i,s}$ the continuous target distance. We clamp the target to $[0, M - 1 - \epsilon)$ and define
$$\tilde{t}_{i,s} = \mathrm{clip}(t_{i,s},\, 0,\, M - 1 - \epsilon), \qquad t^{l}_{i,s} = \big\lfloor \tilde{t}_{i,s} \big\rfloor, \qquad t^{r}_{i,s} = t^{l}_{i,s} + 1,$$
$$\omega^{l}_{i,s} = t^{r}_{i,s} - \tilde{t}_{i,s}, \qquad \omega^{r}_{i,s} = 1 - \omega^{l}_{i,s}.$$
Then the per-side DFL is the weighted sum of two cross-entropies:
$$\ell_{\mathrm{dfl}}(i, s) = \omega^{l}_{i,s} \cdot \mathrm{CE}\big(p_{i,s},\, t^{l}_{i,s}\big) + \omega^{r}_{i,s} \cdot \mathrm{CE}\big(p_{i,s},\, t^{r}_{i,s}\big),$$
and we average over the four sides and apply the positive-anchor weight:
$$\mathcal{L}_{\mathrm{dfl}} = \frac{1}{S} \sum_{i \in \mathcal{F}} w_i \cdot \frac{1}{4} \sum_{s \in \{l, t, r, b\}} \ell_{\mathrm{dfl}}(i, s).$$
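The per-side DFL can be sketched as follows, assuming per-anchor distance-bin logits of shape $(N, 4, M)$; the positive-anchor weights $w_i$ and the normalization by $S$ are applied by the caller as in the equation above.

```python
import torch
import torch.nn.functional as F


def dfl_loss(pred_dist: torch.Tensor, target: torch.Tensor, reg_max: int = 16) -> torch.Tensor:
    """Distribution Focal Loss per anchor, averaged over the four box sides.

    pred_dist: (N, 4, reg_max) logits over distance bins for each side.
    target:    (N, 4) continuous target distances, expressed in bin units.
    """
    target = target.clamp(0, reg_max - 1 - 0.01)  # clip to [0, M - 1 - eps)
    tl = target.floor().long()                    # left bin index
    tr = tl + 1                                   # right bin index
    wl = tr.float() - target                      # weight of the left bin
    wr = 1.0 - wl                                 # weight of the right bin
    logp = F.log_softmax(pred_dist, dim=-1)
    ce_l = -logp.gather(-1, tl.unsqueeze(-1)).squeeze(-1)  # CE against the left bin
    ce_r = -logp.gather(-1, tr.unsqueeze(-1)).squeeze(-1)  # CE against the right bin
    return (wl * ce_l + wr * ce_r).mean(dim=1)    # (N,) averaged over the four sides
```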

Total loss.

Finally, the three components are combined with loss gains:
$$\mathcal{L} = \lambda_{\mathrm{box}} \mathcal{L}_{\mathrm{iou}} + \lambda_{\mathrm{cls}} \mathcal{L}_{\mathrm{con}} + \lambda_{\mathrm{dfl}} \mathcal{L}_{\mathrm{dfl}},$$
where $\lambda_{\mathrm{box}}$, $\lambda_{\mathrm{cls}}$, and $\lambda_{\mathrm{dfl}}$ are scalar weights (hyperparameters). Since AeroPinWorld keeps the same detection head and objective as the baseline, improvements can be attributed to the enhanced transferability of multi-scale features induced by the selective stride-2 redesign.

4. Experiments

4.1. Datasets

Source dataset (training).

We train all models on COCO2017 [24] as the source domain, following the standard YOLO-World v2 training pipeline.

Target datasets (zero-shot evaluation).

We evaluate under a strict cross-dataset zero-shot transfer protocol on two UAV benchmarks: (1) VisDrone2019-DET [21] using the val split as our primary target dataset, and (2) UAVDT [5] as a complementary vehicle-centric UAV benchmark. In both cases, no target-domain images or annotations are used for training, adaptation, or fine-tuning; the target datasets are only used for evaluation.

4.2. Zero-Shot Transfer Protocol and Prompt Vocabulary

We follow a train-on-source, test-on-target setting. Specifically, the detector is trained only on COCO2017 and then directly evaluated on the target UAV datasets without any target-domain fine-tuning, prompt tuning, or test-time adaptation. This protocol isolates transferability under domain shift and reflects practical deployment where labeled UAV data may be unavailable.

Prompt vocabulary.

We use dataset-aligned prompts for evaluation.
VisDrone2019-DET (10 classes). We adopt the following prompt list:
$$\mathcal{P}_{\mathrm{VisDrone}} = \{\text{pedestrian, people, bicycle, car, van, truck, tricycle, awning tricycle, bus, motorcycle}\}.$$
UAVDT-DET (3 classes). Since UAVDT focuses on vehicle categories, we use:
$$\mathcal{P}_{\mathrm{UAVDT}} = \{\text{car, truck, bus}\}.$$
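For illustration, the two prompt lists can be declared as below. The commented lines sketch how an offline vocabulary might be attached through the Ultralytics-style YOLO-World interface; the checkpoint name and call are assumptions, and MMYOLO-based codebases instead reparameterize cached text embeddings offline.

```python
# Dataset-aligned prompt vocabularies used only at inference time (no target-domain training).
VISDRONE_PROMPTS = ["pedestrian", "people", "bicycle", "car", "van", "truck",
                    "tricycle", "awning tricycle", "bus", "motorcycle"]
UAVDT_PROMPTS = ["car", "truck", "bus"]

# Hedged usage sketch (assumed interface, not our exact pipeline):
# from ultralytics import YOLOWorld
# model = YOLOWorld("yolov8s-worldv2.pt")      # hypothetical checkpoint name
# model.set_classes(VISDRONE_PROMPTS)          # bake the offline vocabulary into the head
# results = model.predict("drone_scene.jpg", imgsz=1280)
```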

4.3. Implementation Details

Unless otherwise stated, we keep the training and inference settings identical across methods to ensure a fair comparison, and we only vary the stride-2 design described in Section 3.

Training setup.

We train for 24 epochs on COCO2017 from the official YOLO-World v2 pretrained weights for all experiments. We follow a standard YOLO-style recipe (SGD with momentum and weight decay, warmup, and common augmentations such as mosaic and flipping), keeping the recipe fixed across the baseline and AeroPinWorld.

Inference setup and image sizes.

To match UAV detection practice, we use an input size of 1280 for VisDrone evaluation and 640 for UAVDT evaluation. We keep the same post-processing strategy (including NMS settings) across all methods.

Environment and hardware.

All experiments are conducted with Python 3.12.12 and PyTorch 2.6.0 with CUDA 12.4. We run training and evaluation on a workstation equipped with an Intel Ultra 5-225H CPU and an NVIDIA GeForce RTX 3090 Ti GPU.

4.4. Baselines and Compared Methods

We compare the following methods under the same recipe:
  • YOLO-World v2 (baseline): the original model with standard stride-2 downsampling.
  • AeroPinWorld (ours): selectively replaces four transfer-critical stride-2 transitions with the proposed phase-aware stride-2 block (Section 3.4), while keeping the rest of the architecture unchanged.
This design ensures that performance differences are attributable to revisiting stride-2 downsampling rather than confounding changes to the head or training protocol.

4.5. Evaluation Metrics

We report standard box detection metrics on the target datasets, including COCO-style mAP, AP50, and AP75. Since tiny objects dominate UAV scenarios, we additionally report APS. For efficiency, we report parameter count and FLOPs at the same model scale.
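Evaluation follows the standard COCO protocol; a minimal pycocotools sketch is shown below, where the annotation and prediction file names are placeholders.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Zero-shot evaluation: target-domain annotations are used only for scoring.
gt = COCO("visdrone_val_coco_format.json")            # ground truth in COCO format
dt = gt.loadRes("aeropinworld_visdrone_preds.json")   # detections in COCO result format

evaluator = COCOeval(gt, dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, and AP_S / AP_M / AP_L
```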

4.6. Main Results

Zero-shot transfer on VisDrone2019-DET (val, 1280).

Table 1 reports the zero-shot transfer results on VisDrone2019-DET under the strict train-on-COCO / test-on-VisDrone protocol. AeroPinWorld-S consistently improves over the YOLO-World v2-S baseline across all metrics: mAP increases from 0.112 to 0.135 (+0.023, +20.5% relative), AP50 from 0.159 to 0.197 (+0.038), AP75 from 0.098 to 0.118 (+0.020), and APS from 0.054 to 0.063 (+0.009). The gains on both AP50 and AP75 indicate that revisiting a small set of stride-2 junctions improves not only recall but also localization quality under aerial domain shift, while the APS gain confirms stronger robustness on tiny objects.

Comparison with existing baselines in Table 1.

Beyond the direct baseline comparison, AeroPinWorld-S achieves the best mAP/AP50/AP75 among the listed methods in Table 1. Notably, it remains highly competitive despite using substantially fewer parameters than many alternatives: for example, it surpasses ET-FSM (0.131 mAP) and DINO (0.129 mAP) while operating at a compact scale (16.3M parameters). Compared with larger variants such as YOLO-World+UMM (79.3M) or PMG-SAM (136.0M), our improvements suggest that selectively strengthening transfer-critical stride-2 reductions can yield a better transfer gain–efficiency trade-off than merely scaling model capacity.

Zero-shot transfer on UAVDT-DET (640).

Table 2 summarizes results on UAVDT-DET with vehicle prompts {car, truck, bus}. AeroPinWorld shows consistent improvements over the YOLO-World v2 baseline: mAP improves from 0.144 to 0.146 (+0.002), AP50 from 0.265 to 0.270 (+0.005), AP75 from 0.140 to 0.148 (+0.008), and APS from 0.060 to 0.068 (+0.008). While the absolute mAP margin is smaller on UAVDT due to fewer categories and a vehicle-centric label space, the larger gains on AP75 and APS indicate improved localization and stronger robustness on small/distant vehicles.

4.7. Ablation Studies

To understand where the transfer gains come from, we conduct ablations by selectively enabling the proposed phase-aware stride-2 block at different stages. We consider: (i) replacing only the early backbone reductions (P1/2 and P2/4), (ii) replacing only the head bottom-up reductions (P3→P4 and P4→P5), and (iii) replacing all four transitions (full AeroPinWorld). Table 3 reports results on VisDrone to reveal the contribution of each part.
Discussion. Both components provide measurable gains, and their effects are complementary. Replacing only the early backbone reductions (Only B) already improves mAP and APS, suggesting that preserving fine details at the highest-resolution internal maps is beneficial for tiny-object transfer. Replacing only the head bottom-up reductions (Only H) yields a stronger improvement, indicating that stabilizing pyramid construction under domain shift is also critical. Finally, combining both (B+H) achieves the best overall performance, validating the selective four-junction design adopted by AeroPinWorld.

4.8. Qualitative Results

To complement the quantitative results, we provide qualitative comparisons on VisDrone2019-DET (val) in Figure 2. The visualization is conducted under the same zero-shot setting as our main experiments (train on COCO2017 and directly test on VisDrone without any target-domain fine-tuning), and we display representative scenes with dense tiny objects and complex backgrounds.
As shown in Figure 2, AeroPinWorld produces more complete coverage of small and distant targets while suppressing spurious detections in cluttered regions. In particular, the highlighted areas indicate typical failure modes of the baseline in UAV imagery, including (i) missed detections of tiny objects due to weak fine-grained cues and (ii) false positives triggered by background structures (e.g., road markings, poles, or repetitive textures). By selectively strengthening transfer-critical stride-2 junctions, AeroPinWorld preserves more stable local evidence after resolution reduction, which improves the robustness of downstream multi-scale fusion and leads to more reliable predictions in challenging aerial scenes.

4.9. Efficiency Analysis

We report parameters and FLOPs to assess the efficiency impact of AeroPinWorld. As shown in Table 1 and Table 3, AeroPinWorld introduces only a modest FLOPs increase at VisDrone resolution (34.2G → 35.7G, +1.5G) while keeping the parameter count essentially unchanged. This supports the design goal of improving zero-shot transfer robustness by selectively revisiting a small number of stride-2 transitions rather than scaling the overall detector.
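As an illustration of how such numbers can be obtained, the sketch below profiles a stride-2 module with thop (assumed to be installed); the same call applies to the phase-aware block from the Section 3.3 sketch. Note that thop reports multiply-accumulate operations (MACs), which different papers quote either directly or doubled as FLOPs.

```python
import torch
import torch.nn as nn
from thop import profile  # assumption: the thop package is available

# Profile a plain stride-2 conv at a representative early-stage resolution.
block = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
x = torch.randn(1, 64, 160, 160)
macs, params = profile(block, inputs=(x,))
print(f"params: {params / 1e6:.3f} M, MACs: {macs / 1e9:.3f} G")
```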

5. Conclusions

Open-vocabulary detectors offer prompt-driven recognition, yet their zero-shot transfer to UAV imagery remains fragile under domain shift, where tiny and cluttered targets rely on weak fine-grained cues. In this work, we presented AeroPinWorld, a YOLO-World v2 based detector that revisits stride-2 downsampling as a transfer-critical operator for cross-dataset OV-UAV detection. Instead of scaling model capacity or redesigning the detection head, AeroPinWorld selectively strengthens a small set of stride-2 junctions using a phase-aware reduction block, targeting early detail formation in the backbone and bottom-up pyramid construction in the head.
Extensive zero-shot cross-dataset evaluations validate the effectiveness of this design. When trained on COCO2017 and directly tested on VisDrone2019-DET (val) at 1280 resolution, AeroPinWorld improves mAP from 0.112 to 0.135 and boosts AP S from 0.054 to 0.063 with nearly unchanged parameters and a modest FLOPs increase. On UAVDT with vehicle prompts, AeroPinWorld also achieves consistent gains, indicating that revisiting transfer-critical stride-2 transitions generalizes across multiple UAV benchmarks. Ablation results further reveal that reinforcing either early backbone reductions or head bottom-up reductions is beneficial, and combining both yields the best performance, supporting the proposed selective four-junction design.
In future work, we plan to extend AeroPinWorld along two directions: (i) exploring prompt construction and text vocabulary optimization tailored to UAV categories and long-tail objects, and (ii) integrating lightweight test-time adaptation or domain-aware calibration while preserving the strict efficiency constraints of real-time UAV deployment.

Author Contributions

Conceptualization, J.L. and M.G.; Methodology, J.L. and M.G.; Software, J.L. and J.X.; Validation, X.D., H.C. and J.X.; Formal analysis, J.L. and J.X.; Investigation, X.D., H.C. and M.G.; Resources, Y.L.; Data curation, X.D. and H.C.; Writing—original draft preparation, J.L. and J.X.; Writing—review and editing, M.G., X.D., H.C. and Y.L.; Visualization, J.L.; Supervision, Y.L.; Project administration, M.G.; Funding acquisition, Y.L.

Funding

This research was funded by the National Natural Science Foundation of China under Grant 62376207; in part by the Open Fund of Anhui Engineering Research Center for Intelligent Applications and Security of Industrial Internet, China (No. IASII24-02); in part by the Open Project of Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University (No. MMC202301); in part by the Open Projects Program of State Key Laboratory of Multimodal Artificial Intelligence Systems (No. MAIS2024109); in part by the Xidian University Specially Funded Project for Interdisciplinary Exploration, China; and in part by the Fundamental Research Funds for the Central Universities, China.

Data Availability Statement

The COCO2017 dataset is available at https://cocodataset.org/. The VisDrone2019-DET dataset can be accessed at https://github.com/VisDrone/VisDrone-Dataset. The UAVDT dataset is available at https://sites.google.com/view/grli-uavdt/.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pham, C.; Vu, T.; Nguyen, K. LP-OVOD: Open-vocabulary object detection by linear probing. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024; pp. 779–788.
  2. Fang, R.; Pang, G.; Bai, X. Simple image-level classification improves open-vocabulary object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024; Vol. 38, pp. 1716–1725.
  3. Zhu, C.; Chen, L. A survey on open-vocabulary detection and segmentation: Past, present, and future. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024, 46, 8954–8975.
  4. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, 2021; pp. 8748–8763.
  5. Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 2018; pp. 370–386.
  6. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021, 44, 7380–7399.
  7. Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A survey of object detection for UAVs based on deep learning. Remote Sensing 2023, 16, 149.
  8. Qin, J.; Yu, W.; Feng, X.; Meng, Z.; Tan, C. A UAV Aerial Image Target Detection Algorithm Based on YOLOv7 Improved Model. Electronics 2024, 13, 3277.
  9. Wang, K.; Fu, X.; Huang, Y.; Cao, C.; Shi, G.; Zha, Z.J. Generalized UAV object detection via frequency domain disentanglement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023; pp. 1064–1073.
  10. Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024; pp. 16901–16911.
  11. Zhang, R. Making convolutional networks shift-invariant again. In Proceedings of the International Conference on Machine Learning, PMLR, 2019; pp. 7324–7334.
  12. Azulay, A.; Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? Journal of Machine Learning Research 2019, 20, 1–25.
  13. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped convolution and scale-based dynamic loss for infrared small target detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025; Vol. 39, pp. 9202–9210.
  14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 2015, 28.
  15. Gu, X.; Lin, T.Y.; Kuo, W.; Cui, Y. Open-vocabulary object detection via vision and language knowledge distillation. arXiv 2021, arXiv:2104.13921.
  16. Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting twenty-thousand classes using image-level supervision. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 350–368.
  17. Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.N.; et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022; pp. 10965–10975.
  18. Zhang, H.; Zhang, P.; Hu, X.; Chen, Y.C.; Li, L.; Dai, X.; Wang, L.; Yuan, L.; Hwang, J.N.; Gao, J. GLIPv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems 2022, 35, 36067–36080.
  19. Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple open-vocabulary object detection. In Proceedings of the European Conference on Computer Vision; Springer, 2022; pp. 728–755.
  20. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision; Springer, 2024; pp. 38–55.
  21. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
  22. Kayhan, O.S.; Gemert, J.C.v. On translation invariance in CNNs: Convolutional layers can exploit absolute spatial location. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020; pp. 14274–14285.
  23. Li, J. Research on RT-DETR Drone Target Detection Method Based on Bidirectional Feature Attention Fusion. In Proceedings of the 2025 IEEE 7th International Conference on Civil Aviation Safety and Information Technology (ICCASIT); IEEE, 2025; pp. 222–227.
  24. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer, 2014; pp. 740–755.
  25. Zhai, X.; Huang, Z.; Li, T.; Liu, H.; Wang, S. YOLO-Drone: An optimized YOLOv8 network for tiny UAV object detection. Electronics 2023, 12, 3664.
  26. Gong, J.; Yuan, Z.; Li, W.; Li, W.; Guo, Y.; Guo, B. A Lightweight Upsampling and Cross-Modal Feature Fusion-Based Algorithm for Small-Object Detection in UAV Imagery. Electronics 2026, 15, 298.
  27. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems 2020, 33, 21002–21012.
  28. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020; Vol. 34, pp. 12993–13000.
  29. Weng, Z.; Yu, Z. Cross-Modal Enhancement and Benchmark for UAV-based Open-Vocabulary Object Detection. arXiv 2025.
  30. Gao, J.; Jiang, X.; Wang, A.; Gao, Y.; Fang, Z.; Lew, M.S. PMG-SAM: Boosting Auto-Segmentation of SAM with Pre-Mask Guidance. Sensors 2026, 26, 365.
  31. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023; pp. 4015–4026.
  32. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2016, 39, 1137–1149.
  33. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, 2017; pp. 2980–2988.
  34. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019; pp. 9627–9636.
  35. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. DINO: DETR with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605.
  36. Yu, Y.; Sun, X.; Cheng, Q. Expert teacher based on foundation image segmentation model for object detection in aerial images. Scientific Reports 2023, 13, 21964.
  37. Sun, X.; Yu, Y.; Cheng, Q. Unified diffusion-based object detection in multi-modal and low-light remote sensing images. Electronics Letters 2024, 60, e70093.
Figure 1. Overall architecture of AeroPinWorld. AeroPinWorld is a prompt-driven open-vocabulary detector that supports two vocabulary modes: an online vocabulary where text prompts are encoded during training, and an offline vocabulary where prompt embeddings are pre-computed and cached for efficient deployment. Given an input image, AeroPinWorld extracts multi-scale visual features and revisits stride-2 downsampling as a transfer-critical operator for OV-UAV detection. We selectively reinforce four key stride-2 transitions with a phase-aware reduction block (PConv), improving fine-grained cue preservation in early feature formation and stabilizing bottom-up pyramid construction. The resulting features are consumed by a text contrastive head and a box regression head to produce open-vocabulary detections.
Figure 2. Qualitative comparison on VisDrone2019-DET (val). The panels present the input sample, its ground-truth annotations, predictions generated by the baseline, and results from AeroPinWorld. Regions highlighted with yellow dashed lines illustrate cases where AeroPinWorld reduces both missed detections and false positives, yielding more complete coverage of tiny targets.
Table 1. Zero-shot transfer results on VisDrone2019-DET (val, input size 1280).
Method mAP AP50 AP75 APS Params (M) FLOPs (G)
CAGE [29] 0.105 0.156 0.106 0.048 17.4 36.3
PMG-SAM [30] 0.109 0.157 0.110 0.051 136.0 296.2
SAM [31] 0.105 0.159 0.107 0.048 93.7 233.5
Faster R-CNN [32] 0.111 0.168 0.111 0.051 41.1 133.6
RetinaNet [33] 0.123 0.179 0.112 0.056 36.5 129.0
FCOS [34] 0.108 0.161 0.109 0.049 32.1 125.3
DINO [35] 0.129 0.183 0.111 0.051 47.6 146.0
ET-FSM [36] 0.131 0.192 0.114 0.052 41.1 133.6
YOLO-World [10] 0.126 0.179 0.109 0.056 46.8 180.4
YOLO-World+UMM [37] 0.128 0.182 0.113 0.057 79.3 276.0
YOLO-World v2-S [10] 0.112 0.159 0.098 0.054 16.4 34.2
AeroPinWorld-S (ours) 0.135 0.197 0.118 0.063 16.3 35.7
Table 2. Zero-shot transfer results on UAVDT-DET (input size 640, vehicle prompts: car/truck/bus, “s” model scale).
Method mAP AP50 AP75 APS
CAGE [29] 0.135 0.259 0.134 0.056
PMG-SAM [30] 0.139 0.263 0.141 0.058
SAM [31] 0.137 0.258 0.138 0.056
Faster R-CNN [32] 0.142 0.265 0.136 0.059
RetinaNet [33] 0.139 0.255 0.142 0.060
FCOS [34] 0.141 0.263 0.141 0.058
DINO [35] 0.143 0.261 0.134 0.061
ET-FSM [36] 0.138 0.254 0.138 0.058
YOLO-World [10] 0.145 0.263 0.141 0.063
YOLO-World+UMM [37] 0.144 0.265 0.143 0.059
YOLO-World v2 [10] 0.144 0.265 0.140 0.060
AeroPinWorld (ours) 0.146 0.270 0.148 0.068
Table 3. Ablation on VisDrone2019-DET (val, input size 1280). “B” denotes backbone early reductions (P1/2 + P2/4) and “H” denotes head bottom-up reductions (P3→P4 + P4→P5).
Setting mAP AP50 AP75 APS Params (M) FLOPs (G)
Baseline (YOLO-World v2) 0.112 0.159 0.098 0.054 16.4027 34.2
Only B (P1/2 + P2/4) 0.124 0.171 0.109 0.057 16.4031 35.3
Only H (P3→P4 + P4→P5) 0.129 0.182 0.111 0.059 16.3745 35.6
B+H (Full AeroPinWorld) 0.135 0.197 0.118 0.063 16.3745 35.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.