1. Introduction
Global localization is a prerequisite for long-horizon GNSS-denied UAV autonomy. Visual–inertial odometry (VIO) provides locally accurate relative motion but drifts without bound. A natural correction source is aerial-to-satellite visual place recognition (VPR), matching onboard views to geo-referenced satellite tiles. In practice, the cross-view domain gap and strong perceptual aliasing produce frequent high-similarity false matches at geographically distant locations, making per-frame anchoring unreliable.
Public benchmarks for this setting remain limited. Despite searching available datasets and benchmarks, we were unable to find an open-source dataset that jointly provides low-altitude (50–150 m) UAV imagery, synchronized VIO, and a geo-referenced satellite tile database suitable for trajectory-level evaluation in visually complex rural/village environments. Many available datasets are higher altitude, less challenging visually, lack VIO, omit standardized baselines, or are not publicly released. To study this practically relevant regime, we therefore evaluate NaviLoc on a prepared real-world UAV-to-satellite dataset (Table 1) with VIO and curated satellite tiles.
NaviLoc addresses this failure mode by estimating a trajectory-level solution rather than committing to individual matches. Stage 1 (Global Align) searches for a single global SE(2) transform whose implied trajectory yields consistently high local retrieval scores, exploiting the fact that incorrect transforms are not supported coherently across many frames. Stage 2 (Refinement) refines the aligned trajectory through bounded weighted Procrustes updates on overlapping windows. Stage 3 (Smoothing) computes a closed-form MAP estimate that fuses VIO displacements with VPR anchors while suppressing low-confidence anchors detected from the similarity distribution.
Contributions.
- A training-free three-stage trajectory-level localization method with explicit objectives and closed-form solutions.
- An evaluation on low-altitude (50–150 m) UAV imagery showing 19.5 m mean localization error and a 16.0× improvement over AnyLoc-VLAD [5].
- An ablation demonstrating robustness to hyperparameter variations, and an empirical study showing that distilled ViT descriptors are more effective for trajectory-level alignment than larger foundation models on our benchmark.
- An embedded implementation achieving 9 FPS end-to-end inference on Raspberry Pi 5.
Figure 1 illustrates the three-stage pipeline. We now review related work.
2. Related Work
Visual Place Recognition. VPR evolved from handcrafted descriptors [1] through bag-of-words [2,3] to learned representations. NetVLAD [4] introduced end-to-end learning for place recognition. AnyLoc [5] aggregates foundation model features using VLAD or GeM pooling, achieving state-of-the-art results on diverse benchmarks. MixVPR [6] proposes feature mixing for compact descriptors. These methods produce per-image descriptors without trajectory-level consistency constraints.
Cross-Domain and Aerial Localization. The aerial-to-satellite domain gap presents challenges distinct from ground-level VPR. CVM-Net [13] learns cross-view representations. VIGOR [14] and CVUSA [15] established benchmarks for ground-to-aerial matching. Recent UAV-specific methods [16,17] address the UAV-to-satellite gap. Our approach is complementary: we accept that individual matches are unreliable and leverage trajectory-level statistics.
UAV-to-satellite geo-localization. Several recent works study cross-view matching between UAV imagery and satellite maps in the remote sensing literature [17,18,19,20]. These methods improve per-frame matching, but still face perceptual aliasing under viewpoint and appearance changes; NaviLoc instead aggregates evidence across time using an explicit trajectory-level objective. For broader context on absolute visual localization pipelines and design choices, we refer to a recent survey in Drones [7].
Point Set Registration. The Iterative Closest Point (ICP) algorithm [8,9] alternates between correspondence assignment and transform estimation for point cloud alignment. Procrustes analysis [10] provides closed-form solutions for rigid alignment given correspondences. Extensions include weighted variants [11,12] and robust formulations. Our Stage 1 adapts ICP-style alternating optimization to the VPR similarity objective with provable monotone improvement over evaluated candidates.
Robust Estimation. Huber’s M-estimators [27] downweight outliers in regression. Pose graph optimization [25,26] fuses odometry and visual constraints. Switchable constraints [28] and graduated non-convexity [29] handle outliers in SLAM. Our Stage 3 performs z-score outlier detection and clamps detected outliers to the VIO prior (by setting their anchor weights $w_i$ to zero), yielding a closed-form strictly convex solve without iterative robust optimization.
Visual-Inertial Odometry. VIO methods including MSCKF [21], OKVIS [22], VINS-Mono [23], and ORB-SLAM3 [24] provide accurate relative motion but accumulate drift. We use VIO as a relative motion prior in Stage 3.
Figure 2 illustrates the cross-view domain gap that NaviLoc overcomes.
3. Method
3.1. Problem Formulation
Given $N$ aerial query images with descriptors $q_i \in \mathbb{R}^D$ and VIO-derived positions $v_i \in \mathbb{R}^2$ in a local frame, along with VIO displacements $\delta_i = v_{i+1} - v_i$, and a geo-referenced satellite map with $M$ reference tiles having descriptors $d_j \in \mathbb{R}^D$ and coordinates $c_j \in \mathbb{R}^2$, we seek global positions $x_i \in \mathbb{R}^2$, $i = 1, \dots, N$.
3.2. Stage 1: Global Align
We estimate a global SE(2) transform $(\theta^*, t^*)$ by maximizing the trajectory-level similarity objective:

$$J(\theta, t) = \frac{1}{N} \sum_{i=1}^{N} \max_{j \in \mathcal{N}_r(R_\theta v_i + t)} \mathrm{sim}(q_i, d_j), \qquad (1)$$

where $R_\theta$ is the 2D rotation matrix, $\mathcal{N}_r(p) = \{ j : \| c_j - p \| \le r \}$ is the set of reference tiles within radius $r$ of $p$, and $\mathrm{sim}(q, d) = q^\top d / (\| q \| \| d \|)$ denotes cosine similarity.
Algorithm and intuition. Stage 1 treats VPR as a noisy correspondence generator: for a fixed rotation $\theta$, each frame proposes a translation “measurement” $t_i = c_{g(i)} - R_\theta v_i$ based on the global top-1 match $g(i) = \arg\max_j \mathrm{sim}(q_i, d_j)$. Under perceptual aliasing, many $t_i$ are outliers; we therefore aggregate $\{ t_i \}$ with a robust location estimator. We use coordinate ascent: (1) scan a grid of $K$ rotation angles $\theta_k = 2\pi k / K$; (2) for each $\theta_k$, compute the L1-optimal translation (component-wise median) [27]:

$$\hat t(\theta_k) = \operatorname{median}_i \left( c_{g(i)} - R_{\theta_k} v_i \right), \qquad (2)$$

where $g(i)$ is the global top-1 match; (3) evaluate $J(\theta_k, \hat t(\theta_k))$ and select the best pair; (4) refine with alternating maximization using local targets.
We denote the resulting aligned trajectory by $\tilde p_i = R_{\theta^*} v_i + t^*$. The following result formalizes why the median aggregator is optimal for robust translation estimation under outlier contamination:
Theorem (L1-Optimal Translation (Median)). Let $t_1, \dots, t_N \in \mathbb{R}^2$ be translation residuals for fixed $\theta$. The minimizer $\hat t$ of $\sum_{i=1}^{N} \| t - t_i \|_1$ is given by the component-wise median: $\hat t_x = \operatorname{median}(t_{1,x}, \dots, t_{N,x})$ and $\hat t_y = \operatorname{median}(t_{1,y}, \dots, t_{N,y})$. Consequently, $\hat t$ is robust to outlier residuals: as long as more than half of the residuals are inliers per coordinate, the estimate is controlled by the inlier set rather than the outliers.
Proof. The L1 norm $\| t - t_i \|_1 = |t_x - t_{i,x}| + |t_y - t_{i,y}|$ decomposes across coordinates, so the objective separates into two independent 1D problems: $\min_{t_x} \sum_i |t_x - t_{i,x}|$ and $\min_{t_y} \sum_i |t_y - t_{i,y}|$. The minimizer of $\sum_i |u - u_i|$ over $u$ is the median of $\{ u_i \}$ [27]. Robustness follows because the median is unaffected by arbitrarily large perturbations to fewer than half of the samples. □
Remark (monotone acceptance). In our implementation, we only accept candidate updates that improve (or tie) J, so the sequence of evaluated objective values is monotone non-decreasing and bounded above by 1, hence convergent. This guarantees convergence of the objective sequence, not global optimality.
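For concreteness, the following NumPy sketch implements the Stage 1 grid scan with median aggregation and monotone acceptance. The function names, the brute-force nearest-neighbor search, and the defaults (`K=72`, `radius=300.0`) are our illustrative choices, not the paper's exact implementation; step (4), the alternating refinement, would reuse the same median update with matches restricted to $\mathcal{N}_r$ around the current estimate.

```python
import numpy as np

def stage1_global_align(Q, D, coords, vio, K=72, radius=300.0):
    """Grid scan over K rotation angles with median translation voting.

    Q:      (N, d) L2-normalized query descriptors
    D:      (M, d) L2-normalized reference tile descriptors
    coords: (M, 2) tile coordinates in the map frame [m]
    vio:    (N, 2) VIO positions in the local frame [m]
    """
    # Global top-1 matches g(i): with unit-norm descriptors, cosine = dot product.
    top1 = (Q @ D.T).argmax(axis=1)                      # (N,)
    best = (0.0, np.zeros(2), -np.inf)                   # (theta, t, J)
    for theta in np.linspace(0.0, 2.0 * np.pi, K, endpoint=False):
        c, s = np.cos(theta), np.sin(theta)
        R = np.array([[c, -s], [s, c]])
        # Each frame votes for the translation implied by its global top-1 match.
        votes = coords[top1] - vio @ R.T                 # (N, 2)
        t = np.median(votes, axis=0)                     # L1-optimal, outlier-robust
        J = objective(Q, D, coords, vio @ R.T + t, radius)
        if J >= best[2]:                                 # monotone acceptance
            best = (theta, t, J)
    return best

def objective(Q, D, coords, pred, radius):
    """J: mean over frames of the best local cosine similarity within `radius`."""
    vals = []
    for q, p in zip(Q, pred):
        near = np.linalg.norm(coords - p, axis=1) <= radius
        vals.append(float((D[near] @ q).max()) if near.any() else 0.0)
    return float(np.mean(vals))
```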
3.3. Stage 2: Refinement
After global alignment, we refine positions using sliding-window bounded Procrustes. For each frame index $j$, we compute a local VPR target

$$y_j = c_{k^*(j)}, \qquad k^*(j) = \arg\max_{k \in \mathcal{N}_r(\hat p_j)} \mathrm{sim}(q_j, d_k), \qquad (3)$$

i.e., the coordinate of the best-matching reference tile within radius $r$ of the current predicted position $\hat p_j$, and record its cosine similarity score $s_j$. For each window $W$ with targets $\{ y_j \}_{j \in W}$ and weights $w_j$ derived from the scores $s_j$, we solve:

$$\min_{\theta, t \,:\, |\theta| \le \theta_{\max}} \; \sum_{j \in W} w_j \, \| R_\theta \hat p_j + t - y_j \|^2. \qquad (4)$$

This weighted Procrustes problem admits a closed-form solution for both rotation and translation. The rotation constraint $|\theta| \le \theta_{\max}$ prevents overcorrection when local VPR targets are noisy:
Theorem (Optimal Bounded Procrustes). For the constrained problem (4) in 2D, let $\bar p = \sum_j w_j \hat p_j / \sum_j w_j$ and $\bar y = \sum_j w_j y_j / \sum_j w_j$ be the weighted centroids, and let $H = \sum_j w_j (\hat p_j - \bar p)(y_j - \bar y)^\top$ be the weighted cross-covariance matrix. The optimal rotation angle is $\theta^* = \operatorname{clip}(\tilde\theta, -\theta_{\max}, \theta_{\max})$, where $\tilde\theta = \operatorname{atan2}(H_{12} - H_{21},\, H_{11} + H_{22})$, and the optimal translation is $t^* = \bar y - R_{\theta^*} \bar p$.
Proof. After centering by the weighted centroids, the objective separates into rotation and translation terms. For the rotation, we seek to maximize $\operatorname{tr}(H R_\theta)$. Writing $R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$, the trace expands to $(H_{11} + H_{22}) \cos\theta + (H_{12} - H_{21}) \sin\theta$ [10,11]. This sinusoidal function of $\theta$ is maximized at $\tilde\theta = \operatorname{atan2}(H_{12} - H_{21},\, H_{11} + H_{22})$. The constraint $|\theta| \le \theta_{\max}$ clips this to the nearer boundary if $\tilde\theta$ lies outside. The optimal translation follows as $t^* = \bar y - R_{\theta^*} \bar p$. □
Windows overlap (stride $s < w$), so each frame receives corrections from multiple windows; we average these contributions. Multiple passes are performed because each pass updates the trajectory, which changes the local VPR targets (since they depend on the current predicted positions). Empirically, 2–3 passes suffice for convergence.
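Under our notation, the closed form from the theorem above takes a few lines of NumPy. This is a sketch of a single window solve; the window assembly and cross-window averaging are omitted.

```python
import numpy as np

def bounded_procrustes(P, Y, w, theta_max):
    """Closed-form solve of the bounded weighted Procrustes problem (4).

    P: (n, 2) current window positions; Y: (n, 2) local VPR targets;
    w: (n,) nonnegative weights; theta_max: rotation bound [rad].
    """
    w = np.asarray(w, dtype=float)
    p_bar = (w[:, None] * P).sum(0) / w.sum()           # weighted centroids
    y_bar = (w[:, None] * Y).sum(0) / w.sum()
    H = ((P - p_bar) * w[:, None]).T @ (Y - y_bar)      # 2x2 cross-covariance
    # tr(H R_theta) = (H11 + H22) cos(theta) + (H12 - H21) sin(theta)
    theta = np.arctan2(H[0, 1] - H[1, 0], H[0, 0] + H[1, 1])
    theta = np.clip(theta, -theta_max, theta_max)       # clamp to rotation bound
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    t = y_bar - R @ p_bar                               # optimal translation
    return R, t                                         # corrected window: P @ R.T + t
```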
3.4. Stage 3: Smoothing
Let $a_1, \dots, a_N$ denote the Stage 2 output. We treat these positions as anchors with weights $w_i \ge 0$, and fuse them with VIO constraints via:

$$\min_{x \in \mathbb{R}^{N \times 2}} \; \sum_{i=1}^{N} w_i \| x_i - a_i \|^2 + \lambda \, \| D x - \delta \|^2, \qquad (5)$$

where $D \in \mathbb{R}^{(N-1) \times N}$ with $D_{i,i} = -1$, $D_{i,i+1} = 1$ is the first-difference matrix and $\delta \in \mathbb{R}^{(N-1) \times 2}$ stacks the VIO displacements $\delta_i$.
Outlier detection. For each anchor $a_i$, we recompute its local VPR similarity $s_i$ as the best cosine similarity among reference tiles within radius $r$ of $a_i$. We then compute z-scores $z_i = (s_i - \mu_s) / \sigma_s$, where $\mu_s$ and $\sigma_s$ are the mean and standard deviation of $\{ s_i \}$. Frames with $z_i < -\tau$ are marked as outliers and assigned $w_i = 0$, effectively clamping them to the VIO prior. Inliers use $w_i = 1$. With the anchor weights thus defined, the optimization problem has a unique closed-form solution:
Theorem (Unique Solution). If $w_i > 0$ for at least one $i$, then the objective (5) is strictly convex with unique minimizer given by the linear system (solved independently per coordinate):

$$(W + \lambda D^\top D) \, x = W a + \lambda D^\top \delta, \qquad W = \operatorname{diag}(w_1, \dots, w_N).$$
Proof. The objective (5) is quadratic in $x$. Setting the gradient to zero yields the normal equations $(W + \lambda D^\top D) x = W a + \lambda D^\top \delta$. The matrix $D^\top D$ is the graph Laplacian of a path, which is positive semidefinite with nullspace spanned by the constant vector $\mathbf{1}$. Adding $W$ with at least one $w_i > 0$ eliminates this nullspace: any $v = c \mathbf{1}$ with $c \ne 0$ has $v^\top W v = c^2 \sum_i w_i > 0$. Thus the combined matrix is positive definite, guaranteeing strict convexity and a unique solution [30]. □
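The complete Stage 3 step is likewise compact: z-score gating of the anchor weights followed by an $O(N)$ Thomas-algorithm solve of the tridiagonal normal equations. The sketch below uses binary weights $w_i \in \{0, 1\}$ as in the text; `lam` and `tau` are placeholder names for $\lambda$ and the z-threshold $\tau$, and their default values here are ours.

```python
import numpy as np

def stage3_smooth(anchors, sims, deltas, lam=1.0, tau=2.0):
    """Fuse Stage 2 anchors (N, 2) with VIO displacements (N-1, 2).

    sims: (N,) local VPR similarities used for z-score outlier gating.
    """
    N = len(anchors)
    z = (sims - sims.mean()) / (sims.std() + 1e-12)
    w = np.where(z < -tau, 0.0, 1.0)                    # outliers clamp to VIO prior

    # A = W + lam * D^T D is tridiagonal: D^T D has diagonal [1, 2, ..., 2, 1].
    main = w + lam * np.r_[1.0, 2.0 * np.ones(N - 2), 1.0]
    off = -lam * np.ones(N - 1)
    x = np.empty_like(anchors, dtype=float)
    for k in range(2):                                  # per-coordinate solve
        rhs = w * anchors[:, k] + lam * np.r_[-deltas[0, k],
                                              deltas[:-1, k] - deltas[1:, k],
                                              deltas[-1, k]]
        x[:, k] = thomas(off, main.copy(), off, rhs)
    return x

def thomas(lower, diag, upper, rhs):
    """O(N) forward elimination / back substitution for a tridiagonal system."""
    n = len(diag)
    rhs = rhs.copy()
    for i in range(1, n):
        m = lower[i - 1] / diag[i - 1]
        diag[i] -= m * upper[i - 1]
        rhs[i] -= m * rhs[i - 1]
    out = np.empty(n)
    out[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        out[i] = (rhs[i] - upper[i] * out[i + 1]) / diag[i]
    return out
```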
3.5. Implementation Details
Algorithm 1 summarizes the complete pipeline. We use DeiT-Tiny-Distilled [33], a distilled Vision Transformer [31], for feature extraction (192-dim descriptors). The fixed hyperparameters are the local search radius $r$ (in m), the number of rotation angles $K$, the window size $w$, the stride $s$, the rotation bound $\theta_{\max}$ (in rad), and the z-threshold $\tau$.
Detailed pseudocode for each stage is provided in Appendix A.
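For illustration, the snippet below extracts the 192-dim descriptor with the timm release of DeiT-Tiny-Distilled. The model name is timm's; the token pooling (averaging the class and distillation tokens, then L2-normalizing) is our assumption, as the aggregation is not specified above.

```python
import timm
import torch
import torch.nn.functional as F

# Assumed timm checkpoint; input is a (B, 3, 224, 224) batch normalized
# with the model's expected mean/std.
model = timm.create_model("deit_tiny_distilled_patch16_224", pretrained=True).eval()

@torch.no_grad()
def extract_descriptor(img_batch: torch.Tensor) -> torch.Tensor:
    tokens = model.forward_features(img_batch)   # (B, 2 + num_patches, 192)
    desc = tokens[:, :2].mean(dim=1)             # average class + distillation tokens
    return F.normalize(desc, dim=1)              # L2-normalized 192-dim descriptor
```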
4. Experiments
4.1. Dataset
To the best of our knowledge, there is no publicly available challenging dataset for our target setting—low-altitude (50–150 m) UAV imagery with synchronized VIO and a geo-referenced satellite tile database for trajectory-level evaluation—in particular with long real-world trajectories. We therefore evaluate on a prepared real-world UAV-to-satellite benchmark collected over rural terrain in Ukraine.
Table 1 summarizes the dataset statistics. The dataset comprises 58 aerial query frames captured at 50–150 m AGL along a 2.3 km trajectory with 40 m inter-frame spacing. Reference imagery consists of 462 geo-referenced satellite tiles at 0.3 m/px resolution covering 1.6 km² with 40 m tile spacing. The benchmark is challenging due to strong perceptual aliasing in rural/village scenes (repetitive texture, limited distinctive landmarks) combined with low-altitude viewpoint and appearance changes relative to the satellite map.
Data collection and preparation. The query stream was recorded from a real UAV flight with an onboard camera and a visual-inertial odometry pipeline providing frame-to-frame displacements and a locally consistent 2D trajectory (up to drift). Ground truth positions for evaluation were obtained from the flight’s GNSS logs and used only for scoring (MLE/ATE) and for generating the correspondence visualization in Figure 2. The reference database was built by sampling satellite map imagery via the Google Maps API at zoom level 19 on a 40 m grid over the flight area; each tile is associated with its geographic coordinate. We then convert both query frames and reference tiles into embeddings using the backbones in Table 5.
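A minimal sketch of the 40 m grid layout in a local east/north metric frame follows; the Google Maps API fetch and the conversion between this frame and WGS84 coordinates are omitted, and the extent arguments are illustrative.

```python
import numpy as np

def tile_grid(extent_east_m: float, extent_north_m: float, spacing_m: float = 40.0):
    """Return (M, 2) tile-center coordinates on a regular grid [m]."""
    east = np.arange(0.0, extent_east_m, spacing_m)
    north = np.arange(0.0, extent_north_m, spacing_m)
    ee, nn = np.meshgrid(east, north)
    return np.stack([ee.ravel(), nn.ravel()], axis=1)

# e.g., tile_grid(800, 800) yields a 20 x 20 = 400-tile grid at 40 m spacing
```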
Table 1. Dataset statistics.

| Property | Value |
| --- | --- |
| Query frames | 58 |
| Query spacing | 40 m |
| Trajectory length | 2323 m |
| Flight altitude (AGL) | 50–150 m |
| Reference tiles | 462 |
| Reference tile spacing | 40 m |
| Reference coverage | 1.6 km² |
4.2. Baselines
We compare against:
Raw VIO (IP only): VIO trajectory with initial position aligned to ground truth (no rotation).
VIO+IP+IR (oracle): VIO with initial position and initial rotation estimated from the first displacement (one-segment oracle) or from the first several displacements (multi-segment oracle).
VIO SE(2) oracle: VIO with oracle global SE(2) alignment to ground truth (best possible rigid alignment).
Per-frame VPR (top-k): For each query frame, retrieve the k reference tiles with highest cosine similarity and predict position as their coordinate mean (top-1 uses the single best match; top-3 averages the three best); see the sketch after this list.
AnyLoc-VLAD [5]: State-of-the-art VPR using DINOv2 [32] with VLAD aggregation, evaluated with top-3 retrieval.
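As a sketch, the top-k baseline is a single matrix product over unit-norm descriptors (variable names are ours):

```python
import numpy as np

def per_frame_vpr(Q, D, coords, k=3):
    """Predict each query position as the coordinate mean of its k best tiles."""
    sims = Q @ D.T                              # (N, M) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]     # (N, k) indices of best tiles
    return coords[topk].mean(axis=1)            # (N, 2) predicted positions
```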
4.3. Results
On this challenging real-world benchmark, NaviLoc achieves 19.5 m MLE—a 16.0× improvement over the state-of-the-art AnyLoc-VLAD, and a 32.1× improvement over raw VIO drift. Figure 3 and Figure 4 summarize these results.
4.4. Ablation Studies
Stage contributions. Each stage provides substantial improvement (Table 2). Stage 1 alone achieves 9× improvement by finding the correct global transform. Stage 2 refines local errors, and Stage 3 leverages VIO constraints while excluding outliers.
Outlier threshold. Table 3 shows sensitivity to the z-score threshold.
Hyperparameter sensitivity. Table 4 varies one hyperparameter at a time (others fixed to defaults). Performance is stable across moderate changes, while overly permissive local search radii or underconstrained refinement settings degrade accuracy.
Backbone comparison. Table 5 reveals that DeiT-Tiny-Distilled achieves the best NaviLoc performance despite not having the best per-frame VPR accuracy. Empirically, distillation appears to yield descriptors whose scores are more consistent across time, making trajectory-level alignment easier.
4.5. Computational Efficiency
NaviLoc operates on precomputed descriptors. We benchmark on a Raspberry Pi 5 (ARM Cortex-A76, 8 GB RAM).
Table 6 summarizes the computational performance.
Inference strategy. At each timestep, the system samples an aerial image, extracts a 192-dimensional DeiT-Tiny-Distilled embedding (79 ms), and updates the global trajectory estimate using the C++ NaviLoc algorithm (32 ms). The combined end-to-end latency of 111 ms yields 9.0 FPS on Raspberry Pi 5, enabling real-time embedded deployment. Feature extraction becomes the bottleneck; DeiT-Tiny-Distilled requires 1.3 GFLOPs per image versus 21.1 GFLOPs for DINOv2-S, providing 16× faster extraction while achieving superior localization accuracy.
5. Discussion
We now analyze NaviLoc’s performance across multiple dimensions, compare with alternative approaches, and discuss limitations and future directions.
5.1. Comparison with Existing Methods
Versus per-frame VPR. Per-frame VPR methods treat each query independently, producing 342.1 m MLE (top-3) on our benchmark. NaviLoc achieves 19.5 m—a 17.5× improvement. The key insight is that perceptual aliasing causes individual VPR errors that are spatially inconsistent: incorrect matches scatter across the map, while correct matches cluster near the true trajectory. Trajectory-level optimization exploits this statistical property.
Versus state-of-the-art. AnyLoc [5] represents the state-of-the-art in universal VPR, aggregating DINOv2 [32] features via VLAD pooling. On our benchmark, AnyLoc-VLAD achieves 312.2 m MLE (top-3)—only marginally better than simple per-frame VPR. NaviLoc reduces this to 19.5 m, a 16.0× improvement. This demonstrates that even powerful foundation model features remain susceptible to cross-view aliasing without trajectory-level reasoning.
Versus VIO baselines. Raw VIO accumulates 626.7 m drift over the 2.3 km trajectory. Even with oracle initial rotation alignment, VIO achieves only 90.1 m MLE due to scale and heading drift. NaviLoc’s 19.5 m MLE represents a 32.1× improvement over raw VIO, highlighting the benefit of trajectory-level use of noisy VPR measurements to correct long-horizon drift.
Versus graph-based SLAM. Traditional pose graph optimization [25,26] formulates localization as nonlinear least squares over loop closure constraints. These methods typically require multiple sensors (LiDAR, stereo cameras, IMU), 3D point cloud processing, and GPU acceleration for real-time operation. NaviLoc is fundamentally lighter: it operates on a single monocular camera plus VIO, uses 2D tile descriptors rather than 3D geometry, and requires no GPU. Each stage employs closed-form solutions (median, SVD, linear solve) rather than iterative solvers (Gauss-Newton, Levenberg-Marquardt), and handles outliers via z-score clamping rather than iterative robust optimization [28,29]. This yields deterministic, bounded runtime on CPU-only embedded platforms.
5.2. Computational Efficiency
Real-time embedded deployment. NaviLoc achieves 9 FPS end-to-end on Raspberry Pi 5, combining 79 ms feature extraction (DeiT-Tiny-Distilled) with 32 ms trajectory optimization (C++). This demonstrates that trajectory-level cross-view localization can run in real time on a CPU-only embedded platform.
Comparison with foundation model approaches. While larger backbones can improve standalone VPR on some benchmarks, they are substantially more expensive for embedded CPU deployment. Our ablation (Table 5) shows that DINOv2-ViT-G/14+VLAD achieves 162.0 m MLE versus 19.5 m for DeiT-Tiny-Distilled—larger models did not improve NaviLoc on this cross-view dataset.
Runtime scaling. Let $N$ denote trajectory length and $M$ reference database size. Stage 1 performs $N$ global nearest-neighbor queries, each requiring $O(M)$ descriptor comparisons, yielding $O(NM)$. The angle grid size and iteration counts are small constants. Stage 2 runs in $O(N)$ per pass with constant window size. Stage 3 solves a tridiagonal system in $O(N)$ via the Thomas algorithm [34]. The overall time complexity is $O(NM)$; in practice, feature extraction dominates runtime. For larger reference databases, approximate nearest-neighbor structures could reduce this to sublinear in $M$.
5.3. Portability and Transferability
Training-free operation. Unlike learned cross-view methods [13,14] that require domain-specific training data, NaviLoc operates on pretrained features without fine-tuning. This is critical for GNSS-denied scenarios where ground truth correspondences are unavailable for training.
Backbone flexibility. NaviLoc accepts any image descriptor; our ablation tested five backbones (Table 5). This modularity allows leveraging future feature extractors without algorithmic changes.
Simplicity. NaviLoc has no learned components beyond the feature extractor, no hyperparameter schedules, and no complex data association. This facilitates debugging and deployment in safety-critical applications.
5.4. Robustness Analysis
Hyperparameter stability. Table 4 demonstrates that NaviLoc maintains sub-25 m accuracy across moderate perturbations of most hyperparameters. The outlier threshold (Table 3) shows graceful degradation outside the optimal range. This robustness simplifies deployment: practitioners can use default settings without extensive tuning.
Adaptive outlier detection. Z-score normalization adapts to per-trajectory statistics: a similarity of 0.25 may indicate an outlier in one flight but a reliable match in another. This eliminates per-dataset threshold tuning.
5.5. Feature Descriptor Analysis
Table 5 shows that DeiT-Tiny-Distilled (5M parameters) achieves 19.5 m MLE, while DINOv2-ViT-G/14 (1.1B parameters) achieves 162.0 m—8× worse despite being 200× larger. This empirical finding suggests that model selection for trajectory-level VPR should consider compatibility with geometric reasoning, not just per-frame retrieval accuracy.
5.6. Limitations and Future Work
VIO dependency. NaviLoc assumes VIO provides accurate relative motion over short horizons. Significant VIO failures (e.g., from aggressive maneuvers or texture-poor environments) would propagate to the final estimate. Future work could jointly estimate VIO scale and bias.
Trajectory length. Very short trajectories (<10 frames) provide insufficient statistics for robust similarity aggregation. A minimum flight distance of ∼400 m is recommended.
Seasonal and environmental variation. Our evaluation uses satellite imagery captured under similar conditions to the query flight. Performance under seasonal changes (snow cover, vegetation differences) or varying lighting/weather conditions between the reference map and query images remains underexplored and is a direction for future work.
3D extension. The current SE(2) formulation assumes approximately planar flight. Extending to SE(3) would support altitude variations and complex terrain, potentially leveraging 3D point cloud or mesh references.
Online operation. NaviLoc currently processes trajectories in batch. An incremental variant that updates estimates as new frames arrive would enable tighter integration with flight controllers for real-time autonomous navigation.
Energy efficiency. Reducing onboard computation energy extends flight endurance and lowers environmental impact. Systematic energy profiling [35] could guide model selection and duty-cycling to further improve efficiency.
6. Conclusions
We presented NaviLoc, a three-stage trajectory-level localization pipeline with rigorous mathematical foundations. Each stage has a well-defined objective and provable convergence properties. NaviLoc achieves 19.5 m Mean Localization Error on a UAV-to-satellite benchmark, representing a 16× improvement over AnyLoc-VLAD, the state-of-the-art VPR method. End-to-end inference runs at 9 FPS on Raspberry Pi 5, demonstrating practical viability for real-time embedded UAV navigation.
Author Contributions
Conceptualization, methodology, software, experiments, visualization, writing—original draft: P.S.; supervision, formal analysis, writing—review and editing, funding acquisition: T.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by Academia Tech, Taras Shevchenko National University of Kyiv, and Hackathon Expert NGO.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The experimental data used in this study are available from the corresponding author upon reasonable request for non-commercial, academic research purposes.
Acknowledgments
The authors thank Academia Tech for providing datasets and computational infrastructure to support this research.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AGL | Above Ground Level |
| ATE | Absolute Trajectory Error |
| CPU | Central Processing Unit |
| FPS | Frames Per Second |
| GNSS | Global Navigation Satellite System |
| GPS | Global Positioning System |
| GPU | Graphics Processing Unit |
| ICP | Iterative Closest Point |
| IMU | Inertial Measurement Unit |
| LiDAR | Light Detection and Ranging |
| MAP | Maximum A Posteriori |
| MLE | Mean Localization Error |
| RAM | Random Access Memory |
| SE(2) | Special Euclidean Group in 2D |
| SLAM | Simultaneous Localization and Mapping |
| SVD | Singular Value Decomposition |
| UAV | Unmanned Aerial Vehicle |
| VIO | Visual-Inertial Odometry |
| ViT | Vision Transformer |
| VLAD | Vector of Locally Aggregated Descriptors |
| VPR | Visual Place Recognition |
Appendix A. Detailed Algorithm Pseudocode
This appendix provides detailed pseudocode for each stage of NaviLoc with complete mathematical specifications.
Appendix A.1. Stage 1: Global Align
Appendix A.2. Stage 2: Refinement
Appendix A.3. Stage 3: Smoothing
References
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Sivic, J.; Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the IEEE ICCV, Nice, France, 2003; pp. 1470–1477. [Google Scholar] [CrossRef]
- Gálvez-López, D.; Tardós, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
- Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE CVPR, Las Vegas, NV, USA, 2016; pp. 5297–5307. [Google Scholar] [CrossRef]
- Keetha, N.; Mishra, A.; Karhade, J.; Jatavallabhula, K.M.; Scherer, S.; Krishna, M.; Garg, S. AnyLoc: Towards Universal Visual Place Recognition. IEEE Robot. Autom. Lett. 2024, 9, 1286–1293. [Google Scholar] [CrossRef]
- Ali-bey, A.; Chaib-draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. In Proceedings of the IEEE WACV, Waikoloa, HI, USA, 2023; pp. 2998–3007. [Google Scholar] [CrossRef]
- Couturier, A.; Akhloufi, M.A. A Review on Deep Learning for UAV Absolute Visual Localization. Drones 2024, 8, 622. [Google Scholar] [CrossRef]
- Besl, P.J.; McKay, N.D. A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
- Chen, Y.; Medioni, G. Object Modelling by Registration of Multiple Range Images. Image Vis. Comput. 1992, 10, 145–155. [Google Scholar] [CrossRef]
- Schönemann, P.H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 1966, 31, 1–10. [Google Scholar] [CrossRef]
- Arun, K.S.; Huang, T.S.; Blostein, S.D. Least-Squares Fitting of Two 3-D Point Sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 9, 698–700. [Google Scholar] [CrossRef] [PubMed]
- Umeyama, S. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
- Hu, S.; Feng, M.; Nguyen, R.M.; Lee, G.H. CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization. In Proceedings of the IEEE CVPR, Salt Lake City, UT, USA, 2018; pp. 7258–7267. [Google Scholar] [CrossRef]
- Zhu, S.; Shah, M.; Chen, C. VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval. In Proceedings of the IEEE CVPR, Nashville, TN, USA, 2021; pp. 3640–3649. [Google Scholar] [CrossRef]
- Workman, S.; Souvenir, R.; Jacobs, N. Wide-Area Image Geolocalization with Aerial Reference Imagery. In Proceedings of the IEEE ICCV, Santiago, Chile, 2015; pp. 3961–3969. [Google Scholar] [CrossRef]
- Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization. In Proceedings of the ACM MM, Seattle, WA, USA, 2020; pp. 1395–1403. [Google Scholar] [CrossRef]
- Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens. 2021, 13, 47. [Google Scholar] [CrossRef]
- Zhuo, X.; Koch, T.; Kurz, F.; Fraundorfer, F.; Reinartz, P. Automatic UAV Image Geo-Registration by Matching UAV Images to Georeferenced Image Data. Remote Sens. 2017, 9, 376. [Google Scholar] [CrossRef]
- Zhuang, J.; Dai, M.; Chen, X.; Zheng, E. A Faster and More Effective Cross-View Matching Method of UAV and Satellite Images for UAV Geolocalization. Remote Sens. 2021, 13, 3979. [Google Scholar] [CrossRef]
- Cui, Z.; Zhou, P.; Wang, X.; Zhang, Z.; Li, Y.; Li, H.; Zhang, Y. A Novel Geo-Localization Method for UAV and Satellite Images Using Cross-View Consistent Attention. Remote Sens. 2023, 15, 4667. [Google Scholar] [CrossRef]
- Mourikis, A.I.; Roumeliotis, S.I. A Multi-State Constraint Kalman Filter for Vision-Aided Inertial Navigation. In Proceedings of the IEEE ICRA, Rome, Italy, 2007; pp. 3565–3572. [Google Scholar] [CrossRef]
- Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization. Int. J. Robot. Res. 2015, 34, 314–334. [Google Scholar] [CrossRef]
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
- Kümmerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. g2o: A General Framework for Graph Optimization. In Proceedings of the IEEE ICRA, Shanghai, China, 2011; pp. 3607–3613. [Google Scholar] [CrossRef]
- Huber, P.J. Robust Statistics; John Wiley & Sons: New York, NY, USA, 1981; ISBN 978-0-471-41805-4. [Google Scholar]
- Sünderhauf, N.; Protzel, P. Switchable Constraints for Robust Pose Graph SLAM. In Proceedings of the IEEE IROS, Vilamoura, Portugal, 2012; pp. 1879–1884. [Google Scholar] [CrossRef]
- Yang, H.; Antonante, P.; Tzoumas, V.; Carlone, L. Graduated Non-Convexity for Robust Spatial Perception. IEEE Robot. Autom. Lett. 2020, 5, 1127–1134. [Google Scholar] [CrossRef]
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-83378-3. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the ICLR, 2021. Available online: https://arxiv.org/abs/2010.11929.
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. Available online: https://arxiv.org/abs/2304.07193.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-efficient Image Transformers & Distillation through Attention. In Proceedings of the ICML, 2021; pp. 10347–10357. Available online: https://arxiv.org/abs/2012.12877.
- Golub, G.H.; Van Loan, C.F. Matrix Computations, 4th ed.; Johns Hopkins University Press: Baltimore, MD, USA, 2013; ISBN 978-1-4214-0794-4. [Google Scholar]
- Panchenko, T.V.; Piatygorskiy, N.D. Enrichment of the HEPscore Benchmark by Energy Consumption Assessment. Technologies 2025, 13, 362. [Google Scholar] [CrossRef]