Submitted: 14 January 2026
Posted: 15 January 2026
Abstract
Keywords:
1. Introduction
- Seeking to emulate the difficulty and evaluation possibilities of the UL14 dataset, we developed a novel UAV–satellite pair dataset that extends the FPI task. Our dataset, UAV-Sat, retains UL14's variety of satellite image sizes in the test set, while its content is more general-purpose and representative of diverse UAV applications.
- Development of a lightweight transformer-based localization framework featuring a novel FPN design that incorporates multireceptive DCNv2 modules to enhance the final backbone feature layer, along with DCNv2-based feature alignment between high- and mid-level feature maps (a sketch of the alignment idea follows this list).
- Extensive comparative analysis of our method on two datasets (UL14, UAV-Sat) and an exhaustive ablation study.
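To make the alignment idea concrete, the following is a minimal, hypothetical sketch of DCNv2-based alignment between high- and mid-level feature maps in the spirit of FaPN-style alignment (Huang et al.): offsets and modulation masks are predicted from both maps and used to resample the upsampled coarse features before fusion. Channel sizes and all names are our assumptions for illustration, not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import DeformConv2d  # modulated deformable conv (DCNv2)

class FeatureAlign(nn.Module):
    """Hypothetical sketch: align an upsampled high-level (coarse) feature
    map to a mid-level (fine) one with a modulated deformable convolution,
    using offsets predicted from both maps, then fuse by addition."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        n = kernel_size * kernel_size
        # 2n offset channels (x, y per sampling point) + n modulation masks.
        self.offset_mask = nn.Conv2d(2 * channels, 3 * n, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, kernel_size,
                                padding=kernel_size // 2)
        self.n = n

    def forward(self, mid, high):
        # Bring the coarse map to the mid-level resolution.
        high_up = F.interpolate(high, size=mid.shape[-2:],
                                mode="bilinear", align_corners=False)
        om = self.offset_mask(torch.cat([mid, high_up], dim=1))
        offset, mask = om[:, :2 * self.n], om[:, 2 * self.n:].sigmoid()
        # Resample the coarse features where they misalign, then fuse.
        return mid + self.dcn(high_up, offset, mask)
```

For instance, `FeatureAlign(256)(mid, high)` aligns a (N, 256, H/2, W/2) map `high` to a (N, 256, H, W) map `mid`.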
2. Related Work
2.1. Transformer-Based Computer Vision Architectures
2.2. Feature Pyramid Networks
2.3. Finding Point in Map
3. Materials, Method and Evaluation
3.1. Datasets and Preprocessing
3.1.1. UL14 Dataset
3.1.2. The Creation of UAV-Sat Dataset
4. Inclusion of UAV–satellite pairs in the whole dataset: we chose to exclude pairs that contained UAV images with uninformative or non-pertinent features. For this purpose, UAV images were center-cropped, and images deemed uninformative after this procedure were excluded from the final dataset. We consider an image informative if it contains some sort of permanent feature (e.g., a docking station on a shore, a road within a forest, a permanent natural landmark). We also excluded pairs whose satellite images could not be cropped to 3500×3500 pixels without padding.
5. Train, validation, and test set splitting: after initial image selection, the whole dataset was split into training and test sets (85/15 ratio), and the training set was then split again into training and validation sets (90/10 ratio), yielding final proportions of 76.5/8.5/15 (see the split sketch after this list).
6. Train set preparation: for the training set, satellite images were cropped to 3500×3500 pixels to provide enough coverage for the Random Scale Crop (RSC; see Section 3.2) augmentation and to enrich the final dataset with larger-coverage satellite images. Such cropping yields image pairs whose satellite component covers roughly 1 square kilometer (see the coverage note after this list). UAV images were not processed further (the same center crop from step 1 was retained).
7. Validation and test set preparation: as in the training set, satellite images were cropped to 3500×3500 pixels and the UAV center crop from step 1 was retained. Additionally, each satellite image underwent RSC augmentation in a manner similar to the UL14 dataset: for every image, we constructed 12 satellite crops of varying size, with side lengths ranging from a minimum of 2400 to a maximum of 3500 pixels. Notably, each of these 12 crops had its target location pixel placed at a randomly sampled position (see the RSC sketch after this list).
8. Image format and saving: after initial processing (steps 1-4), images were saved at fixed resolutions of 512×512 (UAV) and 1280×1280 (satellite) for the training set, and 256×256 (UAV) and 768×768 (satellite) for the validation and test sets.
3.2. Data Preprocessing
3.3. Backbone Network
3.4. MuRDE-FPN
3.5. Experimental Setup and Evaluation
5. Experimental Results
5.1. Evaluation on UL14 Dataset
5.2. Evaluation on UAV-Sat Dataset
6. Ablation Study
7. Discussion
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Manoj, H.M.; Shanthi, D.L.; Lakshmi, B.N.; Archana, K.J.; Venkata Naga Jyothi, E.; Archana, K. AI-Driven Drone Technology and Computer Vision for Early Detection of Crop Disease in Large Agricultural Areas. Sci Rep 2025. [CrossRef]
- Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned Aerial Vehicles (UAVs): A Survey on Civil Applications and Key Research Challenges. IEEE Access 2019, 7, 48572–48634.
- Shayea, I.; Dushi, P.; Banafaa, M.; Rashid, R.A.; Ali, S.; Sarijari, M.A.; Daradkeh, Y.I.; Mohamad, H. Handover Management for Drones in Future Mobile Networks—A Survey. Sensors 2022, 22.
- Zheng, T.; Xu, A.; Xu, X.; Liu, M. Modeling and Compensation of Inertial Sensor Errors in Measurement Systems. Electronics (Switzerland) 2023, 12. [CrossRef]
- Lowe, D. Object Recognition from Local Scale-Invariant Features. 1999, 1150. [CrossRef]
- Liu, Y.; Wang, Y.; Wang, D.; Wu, W.; Li, X.X.; Sun, W.; Ren, X.; Song, H. A Scalable Benchmark to Evaluate the Robustness of Image Stitching under Simulated Distortions. Sci Rep 2025, 15. [CrossRef]
- Sarlin, P.-E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching with Graph Neural Networks. 2020.
- Gong, F.; Hao, J.; Du, C.; Wang, H.; Zhao, Y.; Yu, Y.; Ji, X. FIM-JFF: Lightweight and Fine-Grained Visual UAV Localization Algorithms in Complex Urban Electromagnetic Environments. Information 2025, 16, 452. [CrossRef]
- Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens (Basel) 2021, 13, 1–22. [CrossRef]
- Cui, Z.; Zhou, P.; Wang, X.; Zhang, Z.; Li, Y.; Li, H.; Zhang, Y. A Novel Geo-Localization Method for UAV and Satellite Images Using Cross-View Consistent Attention. Remote Sens (Basel) 2023, 15. [CrossRef]
- Xu, Y.; Dai, M.; Cai, W.; Yang, W. Precise GPS-Denied UAV Self-Positioning via Context-Enhanced Cross-View Geo-Localization. arXiv (Cornell University) 2025. [CrossRef]
- Dai, M.; Chen, J.; Lu, Y.; Hao, W.; Zheng, E. Finding Point with Image: An End-to-End for Vision-Based UAV Localization. 2022.
- Chen, J.; Zheng, E.; Dai, M.; Chen, Y.; Lu, Y. OS-FPI: A Coarse-to-Fine One-Stream Network for UAV Geolocalization. IEEE J Sel Top Appl Earth Obs Remote Sens 2024, 17, 7852–7866. [CrossRef]
- Chen, N.; Fan, J.; Yuan, J.; Zheng, E. OBTPN: A Vision-Based Network for UAV Geo-Localization in Multi-Altitude Environments. Drones 2025, 9. [CrossRef]
- Fan, J.; Zheng, E.; He, Y.; Yang, J. A Cross-View Geo-Localization Algorithm Using UAV Image and Satellite Image. Sensors 2024, 24. [CrossRef]
- Ju, C.; Xu, W.; Chen, N.; Zheng, E. An Efficient Pyramid Transformer Network for Cross-View Geo-Localization in Complex Terrains. Drones 2025, 9. [CrossRef]
- He, Y.; Chen, F.; Chen, J.; Fan, J.; Zheng, E. DCD-FPI: A Deformable Convolution-Based Fusion Network for Unmanned Aerial Vehicle Localization. IEEE Access 2024, 12, 129308–129318. [CrossRef]
- Tian, L.; Shen, Q.; Gao, Y.; Wang, S.; Liu, Y.; Deng, Z. A Cross-Mamba Interaction Network for UAV-to-Satellite Geolocalization. Drones 2025, 9, 427. [CrossRef]
- Yao, Y.; Sun, C.; Wang, T.; Yang, J.; Zheng, E. UAV Geo-Localization Dataset and Method Based on Cross-View Matching. Sensors 2024, 24. [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; 2017.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2021.
- Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. 2021.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. 2021.
- Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. arXiv (Cornell University) 2021. [CrossRef]
- Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved Baselines with Pyramid Vision Transformer. 2023. [CrossRef]
- Wang, Q.; Zhang, J.; Yang, K.; Peng, K.; Stiefelhagen, R. MatchFormer: Interleaving Attention in Transformers for Feature Matching. arXiv (Cornell University) 2022. [CrossRef]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-Free Local Feature Matching with Transformers. 2021, 8918. [CrossRef]
- Ye, B.; Chang, H.; Ma, B.; Shan, S.; Chen, X. Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. 2022; p. 341.
- Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. 2017, 936. [CrossRef]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. 2018. [CrossRef]
- Ghiasi, G.; Lin, T.-Y.; Pang, R.; Le, Q.V. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. 2019, 7029. [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. 2020, 10778. [CrossRef]
- Sun, G.; Jiang, X.; Lin, W. DBEENet: Dual-Branch Edge-Enhanced Network for Semantic Segmentation of USV Maritime Images. Ocean Engineering 2025, 341. [CrossRef]
- Xie, C.; Li, M.; Zeng, H.; Luo, J.; Zhang, L. MaSS13K: A Matting-Level Semantic Segmentation Benchmark. arXiv (Cornell University) 2025.
- Thuan, N.H.; Oanh, N.T.; Thuy, N.T.; Perry, S.; Sang, D.V. RaBiT: An Efficient Transformer Using Bidirectional Feature Pyramid Network with Reverse Attention for Colon Polyp Segmentation. 2023. [CrossRef]
- Zhang, R.; Xie, M.; Liu, Q. CFRA-Net: Fusing Coarse-to-Fine Refinement and Reverse Attention for Lesion Segmentation in Medical Images. Biomed Signal Process Control 2025, 109. [CrossRef]
- Zhou, G.; Xu, Q.; Liu, Y.; Liu, Q.; Ren, A.; Zhou, X.; Li, H.; Shen, J. Lightweight Multiscale Feature Fusion and Multireceptive Field Feature Enhancement for Small Object Detection in the Aerial Images. IEEE Transactions on Geoscience and Remote Sensing 2025, 63. [CrossRef]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-Efficient Image Transformers & Distillation through Attention. 2020. [CrossRef]
- Wang, G.; Chen, J.; Dai, M.; Zheng, E. WAMF-FPI: A Weight-Adaptive Multi-Feature Fusion Network for UAV Localization. Remote Sens (Basel) 2023, 15. [CrossRef]
- Xu, W.; Yao, Y.; Cao, J.; Wei, Z.; Liu, C.; Wang, J.; Peng, M. UAV-VisLoc: A Large-Scale Dataset for UAV Visual Localization. 2024. [CrossRef]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; IEEE Computer Society, 2020; pp. 11531–11539. [CrossRef]
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision; Institute of Electrical and Electronics Engineers Inc., 2017; pp. 764–773. [CrossRef]
- Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. 2018. [CrossRef]
- Dong, X.; Qin, Y.; Fu, R.; Gao, Y.; Liu, S.; Ye, Y.; Li, B. Multiscale Deformable Attention and Multilevel Features Aggregation for Remote Sensing Object Detection. IEEE Geoscience and Remote Sensing Letters 2022, 19. [CrossRef]
- Fu, X.; Yuan, Z.; Yu, T.; Ge, Y. DA-FPN: Deformable Convolution and Feature Alignment for Object Detection. Electronics (Switzerland) 2023, 12. [CrossRef]
- Huang, S.; Lu, Z.; Cheng, R.; He, C. FaPN: Feature-Aligned Pyramid Network for Dense Image Prediction. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021; p. 844.
- Li, J.; Wang, Q.; Dong, H. BAFPN: Bidirectionally Aligning Features to Improve Object Localization Accuracy in Remote Sensing Images. Applied Intelligence 2025, 55. [CrossRef]

| Split | UL14: Sat | UL14: UAV | UL14: satellite cover area*, km² | UAV-Sat (ours): Sat | UAV-Sat (ours): UAV | UAV-Sat (ours): satellite cover area*, km² |
|---|---|---|---|---|---|---|
| Train | 6768 | 6768 | 0.1475 | 3330 | 3330 | 1.1025 |
| Test | 27972 | 2331 | 0.0441–0.2916 | 7836 | 653 | 0.5184–1.1025 |

| Method | RDS | MA@3 | MA@5 | MA@20 | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| FPI [12] | 57.22 | - | 18.63 | 57.67 | 44.48 | 14.88 |
| WAMF-FPI [39] | 65.33 | 12.49 | 26.99 | 69.73 | 48.5 | 13.32 |
| OS-FPI [13] | 76.25 | 22.81 | 44.31 | 82.52 | 14.76 | 14.28 |
| DCD-FPI [17] | 77.15 | 25.09 | 47.03 | 83.39 | 13.96 | 11.54 |
| MuRDE-FPI | 84.26 | 30.93 | 55.42 | 93.06 | 14.15 | 11.81 |

| Method | RDS | MA@10 | MA@20 | MA@30 | MA@40 | MA@50 |
|---|---|---|---|---|---|---|
| FPI [12] | 37.61 | 0.15 | 0.64 | 1.2 | 1.8 | 2.6 |
| WAMF-FPI [39] | 43.95 | 0.18 | 0.69 | 1.4 | 2.3 | 3.5 |
| OS-FPI [13] | 59.03 | 0.26 | 1.1 | 2.1 | 3.3 | 4.5 |
| DCD-FPI [17] | 56.44 | 0.33 | 0.94 | 1.8 | 3.1 | 4.1 |
| MuRDE-FPI | 63.74 | 0.38 | 1.1 | 2.4 | 3.8 | 5.3 |

| FPN | FPN++ | ECA | MuRE | MuRDE | FAM | RDS | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|---|---|
| + | | | | | | 68.94 | 10.4 | 13.51 |
| | + | | | | | 76.12 | 11.24 | 13.62 |
| | + | + | | | | 76.01 | 11.24 | 13.62 |
| | + | + | + | | | 80.64 | 11.71 | 13.82 |
| | + | + | | + | | 83.67 | 11.74 | 14.08 |
| | + | + | | + | + | 84.26 | 11.81 | 14.15 |

| SDK | DDK-A | DDK-B | TDK | RDS | GFLOPs | Params (M) |
|---|---|---|---|---|---|---|
| + | | | | 81.96 | 11.26 | 13.7 |
| | + | | | 83.30 | 11.54 | 13.92 |
| | | + | | 83.27 | 11.54 | 13.92 |
| | | | + | 84.26 | 11.81 | 14.15 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
