Submitted: 14 March 2025
Posted: 17 March 2025
Abstract
Keywords:
1. Introduction
- We propose the Quadrant Insight Module (QIM), which captures multi-scale spatial features while preserving intra-class consistency. It adaptively emphasizes discriminative regions within the features and strengthens the overall representation of the object.
- We employ the Integrated Global-Local Attention Module (IGLAM), which bridges global and local information by aggregating highly associated feature stripes, enriching both broad context and fine details while suppressing irrelevant background.
- Extensive experiments on two public datasets, University-1652 and SUES-200, show that GLQINet outperforms competing methods in cross-view geo-localization.
2. Related Work
2.1. Cross-View Geo-Localization
2.2. Part-Based Feature Methods
2.3. Transformer-Based Methods
3. Proposed Method
3.1. Problem Formulation
3.2. Quadrant Insight Module (QIM)
3.3. Integrated Global-Local Attention Module (IGLAM)
3.4. Loss Function
4. Experiment and Analysis
4.1. Datasets
4.2. Evaluation Metrics
4.3. Implementation Details
4.4. Comparison with State-of-the-Art Methods
4.4.1. Comparisons on University-1652
4.4.2. Comparisons on SUES-200
4.5. Ablation Studies and Analysis
4.6. Visualization
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| GLQINet | Global-Local Quadrant Interaction Network |
| QIM | Quadrant Insight Module |
| IGLAM | Integrated Global-Local Attention Module |
| GPS | Global Positioning System |
| CNNs | Convolutional Neural Networks |
| BEV | Bird’s-Eye View |
| NLP | Natural Language Processing |
| HAP | Horizontal Average Pooling |
| VAP | Vertical Average Pooling |
| DAP | Diagonal Average Pooling |
| AAP | Anti-diagonal Average Pooling |
| R@K | Recall@K |
| AP | Average Precision |
| SGD | Stochastic Gradient Descent |
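
The abbreviations HAP, VAP, DAP and AAP above name four directional average-pooling operations. Their exact definition is not part of this excerpt, so the following is only a minimal sketch of one plausible reading for a single-channel H×W feature map: average every row, every column, every diagonal and every anti-diagonal. All function names are hypothetical and are not taken from the authors' code.

```python
from collections import defaultdict


def _mean(values):
    return sum(values) / len(values)


def horizontal_average_pool(fmap):
    """HAP (assumed reading): one value per row, averaged along the width."""
    return [_mean(row) for row in fmap]


def vertical_average_pool(fmap):
    """VAP (assumed reading): one value per column, averaged along the height."""
    return [_mean(col) for col in zip(*fmap)]


def diagonal_average_pool(fmap):
    """DAP (assumed reading): one value per diagonal, i.e. cells sharing j - i."""
    groups = defaultdict(list)
    for i, row in enumerate(fmap):
        for j, value in enumerate(row):
            groups[j - i].append(value)
    # Ordered from the bottom-left corner (j - i most negative) to the top-right.
    return [_mean(groups[k]) for k in sorted(groups)]


def anti_diagonal_average_pool(fmap):
    """AAP (assumed reading): one value per anti-diagonal, i.e. cells sharing i + j."""
    groups = defaultdict(list)
    for i, row in enumerate(fmap):
        for j, value in enumerate(row):
            groups[i + j].append(value)
    # Ordered from the top-left corner (i + j = 0) to the bottom-right.
    return [_mean(groups[k]) for k in sorted(groups)]


# Toy 3x3 feature map.
fmap = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]
hap = horizontal_average_pool(fmap)     # [2.0, 5.0, 8.0]
vap = vertical_average_pool(fmap)       # [4.0, 5.0, 6.0]
dap = diagonal_average_pool(fmap)       # [7.0, 6.0, 5.0, 4.0, 3.0]
aap = anti_diagonal_average_pool(fmap)  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

In a real network the same reductions would be applied per channel on a tensor; plain lists are used here only to keep the sketch dependency-free.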
References
- Zhuang, J.; Dai, M.; Chen, X.; Zheng, E. A faster and more effective cross-view matching method of UAV and satellite images for UAV geolocalization. Remote Sensing 2021, 13, 3979. [CrossRef]
- Cui, Z.; Zhou, P.; Wang, X.; Zhang, Z.; Li, Y.; Li, H.; Zhang, Y. A novel geo-localization method for UAV and satellite images using cross-view consistent attention. Remote Sensing 2023, 15, 4667. [CrossRef]
- Ding, L.; Zhou, J.; Meng, L.; Long, Z. A practical cross-view image matching method between UAV and satellite for UAV-based geo-localization. Remote Sensing 2020, 13, 47. [CrossRef]
- Hou, Q.; Lu, J.; Guo, H.; Liu, X.; Gong, Z.; Zhu, K.; Ping, Y. Feature relation guided cross-view image based geo-localization. Remote Sensing 2023, 15, 5029. [CrossRef]
- Yan, Y.; Wang, M.; Su, N.; Hou, W.; Zhao, C.; Wang, W. IML-Net: A Framework for Cross-View Geo-Localization with Multi-Domain Remote Sensing Data. Remote Sensing 2024, 16, 1249. [CrossRef]
- Middelberg, S.; Sattler, T.; Untzelmann, O.; Kobbelt, L. Scalable 6-dof localization on mobile devices. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II 13. Springer, 2014, pp. 268–283.
- An, Z.; Wang, X.; Li, B.; Xiang, Z.; Zhang, B. Robust visual tracking for UAVs with dynamic feature weight selection. Applied Intelligence 2023, 53, 3836–3849. [CrossRef]
- Krylov, V.A.; Kenny, E.; Dahyot, R. Automatic discovery and geotagging of objects from street view imagery. Remote Sensing 2018, 10, 661. [CrossRef]
- Nassar, A.S.; Lefèvre, S.; Wegner, J.D. Simultaneous multi-view instance detection with learned geometric soft-constraints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6559–6568.
- Chaabane, M.; Gueguen, L.; Trabelsi, A.; Beveridge, R.; O’Hara, S. End-to-end learning improves static object geo-localization from video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2063–2072.
- Lin, T.Y.; Belongie, S.; Hays, J. Cross-view image geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 891–898.
- Castaldo, F.; Zamir, A.; Angst, R.; Palmieri, F.; Savarese, S. Semantic cross-view matching. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015, pp. 9–17.
- Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1395–1403.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
- Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 2020.
- Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology 2021, 32, 4376–4389. [CrossRef]
- Zhu, S.; Shah, M.; Chen, C. Transgeo: Transformer is all you need for cross-view image geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1162–1171.
- Yang, H.; Lu, X.; Zhu, Y. Cross-view geo-localization with layer-to-layer transformer. Advances in Neural Information Processing Systems 2021, 34, 29009–29020.
- Wang, T.; Zheng, Z.; Yan, C.; Zhang, J.; Sun, Y.; Zheng, B.; Yang, Y. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology 2021, 32, 867–879. [CrossRef]
- Lin, J.; Zheng, Z.; Zhong, Z.; Luo, Z.; Li, S.; Yang, Y.; Sebe, N. Joint representation learning and keypoint detection for cross-view geo-localization. IEEE Transactions on Image Processing 2022, 31, 3780–3792. [CrossRef]
- Li, H.; Chen, Q.; Yang, Z.; Yin, J. Drone Satellite Matching based on Multi-scale Local Pattern Network. In Proceedings of the 2023 Workshop on UAVs in Multimedia: Capturing the World from a New Perspective, 2023, pp. 51–55.
- Vaswani, A. Attention is all you need. Advances in Neural Information Processing Systems 2017.
- Peng, J.; Wang, H.; Xu, F.; Fu, X. Cross domain knowledge learning with dual-branch adversarial network for vehicle re-identification. Neurocomputing 2020, 401, 133–144. [CrossRef]
- Zhuang, J.; Chen, X.; Dai, M.; Lan, W.; Cai, Y.; Zheng, E. A semantic guidance and transformer-based matching method for UAVs and satellite images for UAV geo-localization. IEEE Access 2022, 10, 34277–34287. [CrossRef]
- Kumar, R.; Weill, E.; Aghdasi, F.; Sriram, P. Vehicle re-identification: an efficient baseline using triplet embedding. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–9.
- Tian, Y.; Chen, C.; Shah, M. Cross-view image matching for geo-localization in urban environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3608–3616.
- Lin, T.Y.; Cui, Y.; Belongie, S.; Hays, J. Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5007–5015.
- Workman, S.; Souvenir, R.; Jacobs, N. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3961–3969.
- Liu, L.; Li, H. Lending orientation to neural networks for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5624–5633.
- Zhu, S.; Yang, T.; Chen, C. Vigor: Cross-view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3640–3649.
- Sun, Y.; Ye, Y.; Kang, J.; Fernandez-Beltran, R.; Feng, S.; Li, X.; Luo, C.; Zhang, P.; Plaza, A. Cross-view object geo-localization in a local region with satellite imagery. IEEE Transactions on Geoscience and Remote Sensing 2023. [CrossRef]
- Zhu, R.; Yin, L.; Yang, M.; Wu, F.; Yang, Y.; Hu, W. SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Transactions on Circuits and Systems for Video Technology 2023, 33, 4825–4839. [CrossRef]
- Workman, S.; Jacobs, N. On the location dependence of convolutional neural network features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 70–78.
- Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-aware feature aggregation for image based cross-view geo-localization. Advances in Neural Information Processing Systems 2019, 32.
- Shen, F.; Shu, X.; Du, X.; Tang, J. Pedestrian-specific Bipartite-aware Similarity Learning for Text-based Person Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, 2023.
- Shen, F.; Du, X.; Zhang, L.; Tang, J. Triplet Contrastive Learning for Unsupervised Vehicle Re-identification. arXiv preprint arXiv:2301.09498 2023.
- Shen, F.; Xie, Y.; Zhu, J.; Zhu, X.; Zeng, H. Git: Graph interactive transformer for vehicle re-identification. IEEE Transactions on Image Processing 2023. [CrossRef]
- Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 480–496.
- Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 274–282.
- Zhang, X.; Luo, H.; Fan, X.; Xiang, W.; Sun, Y.; Xiao, Q.; Jiang, W.; Zhang, C.; Sun, J. Alignedreid: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184 2017.
- Luo, H.; Jiang, W.; Zhang, X.; Fan, X.; Qian, J.; Zhang, C. Alignedreid++: Dynamically matching local information for person re-identification. Pattern Recognition 2019, 94, 53–61. [CrossRef]
- Shen, F.; Tang, J. IMAGPose: A Unified Conditional Framework for Pose-Guided Person Generation. In Proceedings of the Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- Shen, F.; Jiang, X.; He, X.; Ye, H.; Wang, C.; Du, X.; Li, Z.; Tang, J. IMAGDressing-v1: Customizable Virtual Dressing. arXiv preprint arXiv:2407.12705 2024.
- Shen, F.; Ye, H.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Advancing pose-guided image synthesis with progressive conditional diffusion models. arXiv preprint arXiv:2310.06313 2023.
- Shen, F.; Ye, H.; Liu, S.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models. arXiv preprint arXiv:2407.02482 2024.
- Shen, F.; Wang, C.; Gao, J.; Guo, Q.; Dang, J.; Tang, J.; Chua, T.S. Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model. arXiv preprint arXiv:2502.09533 2025.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976–11986.
- Dou, Z.Y.; Xu, Y.; Gan, Z.; Wang, J.; Wang, S.; Wang, L.; Zhu, C.; Zhang, P.; Yuan, L.; Peng, N.; et al. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18166–18176.
- Hendricks, L.A.; Mellor, J.; Schneider, R.; Alayrac, J.B.; Nematzadeh, A. Decoupling the role of data, attention, and losses in multimodal transformers. Transactions of the Association for Computational Linguistics 2021, 9, 570–585. [CrossRef]
- Zeng, W.; Wang, T.; Cao, J.; Wang, J.; Zeng, H. Clustering-guided pairwise metric triplet loss for person reidentification. IEEE Internet of Things Journal 2022, 9, 15150–15160. [CrossRef]
- Chen, K.; Lei, W.; Zhao, S.; Zheng, W.S.; Wang, R. PCCT: Progressive class-center triplet loss for imbalanced medical image classification. IEEE Journal of Biomedical and Health Informatics 2023, 27, 2026–2036. [CrossRef]
- Wang, T.; Zheng, Z.; Zhu, Z.; Gao, Y.; Yang, Y.; Yan, C. Learning cross-view geo-localization embeddings via dynamic weighted decorrelation regularization. arXiv preprint arXiv:2211.05296 2022.
- Wang, T.; Zheng, Z.; Sun, Y.; Yan, C.; Yang, Y.; Chua, T.S. Multiple-environment Self-adaptive Network for Aerial-view Geo-localization. Pattern Recognition 2024, 152, 110363. [CrossRef]
- Zhu, Y.; Yang, H.; Lu, Y.; Huang, Q. Simple, effective and general: A new backbone for cross-view image geo-localization. arXiv preprint arXiv:2302.01572 2023.
- Hu, Q.; Li, W.; Xu, X.; Liu, N.; Wang, L. Learning discriminative representations via variational self-distillation for cross-view geo-localization. Computers and Electrical Engineering 2022, 103, 108335. [CrossRef]
- Tian, X.; Shao, J.; Ouyang, D.; Shen, H.T. UAV-satellite view synthesis for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology 2021, 32, 4804–4815. [CrossRef]
- Bui, D.V.; Kubo, M.; Sato, H. A part-aware attention neural network for cross-view geo-localization between UAV and satellite. Journal of Robotics, Networking and Artificial Life 2022, 9, 275–284.
- Zhu, R.; Yang, M.; Yin, L.; Wu, F.; Yang, Y. UAV’s status is worth considering: A fusion representations matching method for geo-localization. Sensors 2023, 23, 720. [CrossRef]
- Shen, T.; Wei, Y.; Kang, L.; Wan, S.; Yang, Y.H. MCCG: A ConvNeXt-based multiple-classifier method for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology 2023. [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
- Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12009–12019.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.






| Method | Drone → Satellite R@1 | Drone → Satellite AP | Satellite → Drone R@1 | Satellite → Drone AP |
| --- | --- | --- | --- | --- |
| Self Attention [22] | 91.48 | 92.39 | 93.72 | 91.11 |
| Cross Attention [22] | 91.02 | 92.42 | 94.58 | 91.04 |
| Co-Attention [48] | 91.12 | 92.54 | 94.29 | 90.30 |
| Merged Attention (Ours) [48] | 91.66 | 92.94 | 94.58 | 91.11 |

Dataset: University-1652

| Method | Drone → Satellite R@1 | Drone → Satellite AP | Satellite → Drone R@1 | Satellite → Drone AP |
| --- | --- | --- | --- | --- |
| U-baseline [13] | 58.49 | 63.31 | 71.18 | 58.74 |
| DWDR [52] | 69.77 | 73.73 | 81.46 | 70.45 |
| MuSe-Net [53] | 74.48 | 77.83 | 88.02 | 75.10 |
| LPN [19] | 75.93 | 79.14 | 86.45 | 74.49 |
| SAIG [54] | 78.85 | 81.62 | 86.45 | 78.48 |
| LDRVSD [55] | 78.66 | 81.55 | 89.30 | 79.17 |
| PCL [56] | 79.47 | 83.63 | 87.69 | 78.51 |
| FSRA [16] | 82.25 | 84.82 | 87.87 | 81.53 |
| SGM [24] | 82.14 | 84.72 | 88.16 | 81.80 |
| PAAN [57] | 84.51 | 86.78 | 91.01 | 82.28 |
| MSBA [1] | 86.61 | 88.55 | 92.15 | 84.45 |
| MBF [58] | 89.05 | 90.61 | 93.15 | 88.17 |
| MCCG [59] | 89.64 | 91.32 | 94.30 | 89.39 |
| Ours | 91.66 | 92.94 | 94.58 | 91.11 |
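
The tables report R@1 and AP. As a reminder of what these numbers measure, here is a minimal sketch of Recall@K and average precision for ranked retrieval. It assumes the standard definitions, since the evaluation section's text is not part of this excerpt, and it uses the plain non-interpolated AP; benchmark toolkits may use a slightly different (for example interpolated) variant. The function names and the toy data are illustrative only.

```python
def recall_at_k(ranked_relevance, k):
    """1.0 if any of the top-k ranked gallery items is a true match, else 0.0."""
    return 1.0 if any(ranked_relevance[:k]) else 0.0


def average_precision(ranked_relevance):
    """Non-interpolated AP: mean of the precision at the rank of each true match."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_relevance, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(queries, k=1):
    """Average Recall@K and AP over all queries, reported as percentages."""
    n = len(queries)
    r_at_k = 100.0 * sum(recall_at_k(q, k) for q in queries) / n
    mean_ap = 100.0 * sum(average_precision(q) for q in queries) / n
    return r_at_k, mean_ap


# Toy example: each list marks, in ranked order, whether a gallery image
# shows the same location as the query.
queries = [
    [True, False, True, False],   # AP = (1/1 + 2/3) / 2
    [False, True, False, False],  # AP = 1/2
]
r1, mean_ap = evaluate(queries, k=1)
print(f"R@1 = {r1:.2f}, AP = {mean_ap:.2f}")
```

R@1 only asks whether the top-ranked result is correct, whereas AP also rewards ranking every true match early. This matters in the Satellite → Drone direction, where one satellite query can have several matching drone images.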

Direction: Drone → Satellite

| Method | 150 m R@1 | 150 m AP | 200 m R@1 | 200 m AP | 250 m R@1 | 250 m AP | 300 m R@1 | 300 m AP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LCM [3] | 43.42 | 49.65 | 49.42 | 55.91 | 57.47 | 60.31 | 60.43 | 65.78 |
| ViT [32] | 59.32 | 64.94 | 62.30 | 67.22 | 71.35 | 75.48 | 77.17 | 80.67 |
| LPN [19] | 61.58 | 67.23 | 70.85 | 75.96 | 80.38 | 83.80 | 81.47 | 84.53 |
| SwinV2-T [61] | 66.40 | 71.64 | 77.63 | 81.91 | 84.62 | 87.73 | 90.01 | 92.27 |
| FSRA [16] | 68.25 | 73.45 | 83.00 | 85.99 | 90.68 | 92.27 | 91.95 | 93.46 |
| Ours | 82.07 | 85.55 | 91.50 | 93.33 | 96.72 | 97.49 | 96.82 | 97.42 |

Direction: Satellite → Drone

| Method | 150 m R@1 | 150 m AP | 200 m R@1 | 200 m AP | 250 m R@1 | 250 m AP | 300 m R@1 | 300 m AP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LCM [3] | 57.50 | 38.11 | 68.75 | 49.19 | 72.50 | 47.94 | 75.00 | 59.36 |
| ViT [32] | 82.50 | 58.88 | 87.50 | 62.48 | 90.00 | 69.91 | 96.25 | 84.10 |
| LPN [19] | 83.75 | 66.78 | 88.75 | 75.01 | 92.50 | 81.34 | 92.50 | 85.72 |
| SwinV2-T [61] | 82.51 | 71.18 | 90.03 | 82.20 | 93.23 | 92.11 | 97.52 | 92.09 |
| FSRA [16] | 83.75 | 76.67 | 90.00 | 85.34 | 93.75 | 90.17 | 95.00 | 92.03 |
| Ours | 95.00 | 85.44 | 98.75 | 95.26 | 98.75 | 97.14 | 98.75 | 97.98 |

| Method | Backbone | Drone → Satellite R@1 | Drone → Satellite AP | Satellite → Drone R@1 | Satellite → Drone AP |
| --- | --- | --- | --- | --- | --- |
| Baseline | ResNet-50 [62] | 58.50 | 63.28 | 79.17 | 57.81 |
| GLQINet | ResNet-50 [62] | 78.78 | 81.86 | 88.59 | 78.49 |
| Baseline | ResNet-101 [62] | 59.56 | 64.00 | 81.17 | 60.56 |
| GLQINet | ResNet-101 [62] | 83.01 | 85.64 | 89.87 | 82.93 |
| Baseline | ConvNeXt-Tiny [47] | 76.74 | 79.01 | 88.45 | 77.12 |
| GLQINet | ConvNeXt-Tiny [47] | 91.66 | 92.94 | 94.58 | 91.11 |
| Baseline | ConvNeXt-Small [47] | 82.43 | 85.03 | 91.30 | 82.89 |
| GLQINet | ConvNeXt-Small [47] | 92.40 | 93.65 | 95.58 | 91.90 |
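
To make the backbone comparison above easier to read, the absolute gain of GLQINet over the baseline can be computed directly from the table. The snippet below is plain arithmetic on the reported Drone → Satellite R@1 values; the variable names are illustrative.

```python
# Drone → Satellite R@1 (%) copied from the table above: (baseline, GLQINet).
r1_by_backbone = {
    "ResNet-50": (58.50, 78.78),
    "ResNet-101": (59.56, 83.01),
    "ConvNeXt-Tiny": (76.74, 91.66),
    "ConvNeXt-Small": (82.43, 92.40),
}

# Absolute improvement in percentage points, rounded to the table's precision.
gains = {
    backbone: round(glqinet - baseline, 2)
    for backbone, (baseline, glqinet) in r1_by_backbone.items()
}

for backbone, gain in gains.items():
    print(f"{backbone}: +{gain:.2f} points R@1")
```

The gains come out to 20.28, 23.45, 14.92 and 9.97 points respectively: larger on the ResNet backbones and smaller on the ConvNeXt backbones, whose baselines already start from a higher R@1.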
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).