Preprint Article (this version is not peer-reviewed)

NaviLoc: Trajectory-Level Visual Localization for GNSS-Denied UAV Navigation

Submitted: 22 December 2025 | Posted: 25 December 2025
Abstract
Aerial-to-satellite visual localization enables GNSS-denied UAV navigation, but the appearance gap between low-altitude (50–150 m) UAV imagery and nadir satellite tiles makes per-frame visual place recognition (VPR) unreliable. Under perceptual aliasing, high-similarity matches are often geographically inconsistent, so naïve anchoring fails. We introduce NaviLoc, a training-free three-stage trajectory-level estimator that treats VPR as a noisy measurement source and exploits visual-inertial odometry (VIO) as a relative-motion prior. Stage 1 (Global Align) estimates a global SE(2) transform by maximizing an explicit trajectory-level similarity objective. Stage 2 (Refinement) performs sliding-window bounded weighted Procrustes updates. Stage 3 (Smoothing) computes a strictly convex MAP trajectory estimate that fuses VIO displacements with VPR anchors while clamping detected outliers. On a challenging low-altitude rural UAV benchmark, NaviLoc attains 19.5 m mean localization error (MLE), a 16.0× reduction relative to the state-of-the-art AnyLoc-VLAD and a 32.1× reduction relative to raw VIO drift. End-to-end inference runs at 9 FPS on a Raspberry Pi 5, enabling real-time embedded deployment.

1. Introduction

Global localization is a prerequisite for long-horizon GNSS-denied UAV autonomy. Visual–inertial odometry (VIO) provides locally accurate relative motion but drifts without bound. A natural correction source is aerial-to-satellite visual place recognition (VPR), matching onboard views to geo-referenced satellite tiles. In practice, the cross-view domain gap and strong perceptual aliasing produce frequent high-similarity false matches at geographically distant locations, making per-frame anchoring unreliable.
Public benchmarks for this setting remain limited. Despite a thorough search of available datasets and benchmarks, we were unable to find an open-source dataset that jointly provides low-altitude (50–150 m) UAV imagery, synchronized VIO, and a geo-referenced satellite tile database suitable for trajectory-level evaluation in visually complex rural/village environments. Existing datasets are typically captured at higher altitudes, are less visually challenging, lack VIO, omit standardized baselines, or are not publicly released. To study this practically relevant regime, we therefore evaluate NaviLoc on a prepared real-world UAV-to-satellite dataset (Table 1) with VIO and curated satellite tiles.
NaviLoc addresses this failure mode by estimating a trajectory-level solution rather than committing to individual matches. Stage 1 (Global Align) searches for a single global SE(2) transform whose implied trajectory yields consistently high local retrieval scores, exploiting the fact that incorrect transforms are not supported coherently across many frames. Stage 2 (Refinement) refines the aligned trajectory through bounded weighted Procrustes updates on overlapping windows. Stage 3 (Smoothing) computes a closed-form MAP estimate that fuses VIO displacements with VPR anchors while suppressing low-confidence anchors detected from the similarity distribution.
Contributions.
  • A training-free three-stage trajectory-level localization method with explicit objectives and closed-form solutions.
  • An evaluation on low-altitude (50–150 m) UAV imagery showing 19.5 m mean localization error and 16.0× improvement over AnyLoc-VLAD [5].
  • An ablation demonstrating robustness to hyperparameter variations, and an empirical study showing that distilled ViT descriptors are more effective for trajectory-level alignment than larger foundation models on our benchmark.
  • An embedded implementation achieving 9 FPS end-to-end inference on Raspberry Pi 5.
Figure 1 illustrates the three-stage pipeline. We now review related work.

2. Related Work

Visual Place Recognition. VPR evolved from handcrafted descriptors [1] through bag-of-words [2,3] to learned representations. NetVLAD [4] introduced end-to-end learning for place recognition. AnyLoc [5] aggregates foundation model features using VLAD or GeM pooling, achieving state-of-the-art results on diverse benchmarks. MixVPR [6] proposes feature mixing for compact descriptors. These methods produce per-image descriptors without trajectory-level consistency constraints.
Cross-Domain and Aerial Localization. The aerial-to-satellite domain gap presents challenges distinct from ground-level VPR. CVM-Net [13] learns cross-view representations. VIGOR [14] and CVUSA [15] established benchmarks for ground-to-aerial matching. Recent UAV-specific methods [16,17] address the UAV-to-satellite gap. Our approach is complementary: we accept that individual matches are unreliable and leverage trajectory-level statistics.
UAV-to-satellite geo-localization. Several recent works study cross-view matching between UAV imagery and satellite maps in the remote sensing literature [17,18,19,20]. These methods improve per-frame matching, but still face perceptual aliasing under viewpoint and appearance changes; NaviLoc instead aggregates evidence across time using an explicit trajectory-level objective. For broader context on absolute visual localization pipelines and design choices, we refer to a recent survey in Drones [7].
Point Set Registration. The Iterative Closest Point (ICP) algorithm [8,9] alternates between correspondence assignment and transform estimation for point cloud alignment. Procrustes analysis [10] provides closed-form solutions for rigid alignment given correspondences. Extensions include weighted variants [11,12] and robust formulations. Our Stage 1 adapts ICP-style alternating optimization to the VPR similarity objective with provable monotone improvement over evaluated candidates.
Robust Estimation. Huber's M-estimators [27] downweight outliers in regression. Pose graph optimization [25,26] fuses odometry and visual constraints. Switchable constraints [28] and graduated non-convexity [29] handle outliers in SLAM. Our Stage 3 performs z-score outlier detection and clamps detected outliers to the VIO prior (by setting $\alpha_i \approx 0$), yielding a closed-form strictly convex solve without iterative robust optimization.
Visual-Inertial Odometry. VIO methods including MSCKF [21], OKVIS [22], VINS-Mono [23], and ORB-SLAM3 [24] provide accurate relative motion but accumulate drift. We use VIO as a relative motion prior in Stage 3.
Figure 2 illustrates the cross-view domain gap that NaviLoc overcomes.

3. Method

3.1. Problem Formulation

Given $N$ aerial query images with descriptors $F^q = \{f_i^q\}_{i=1}^{N}$ and VIO-derived positions $V = \{(v_i^x, v_i^y)\}_{i=1}^{N}$ in a local frame, along with VIO displacements $\Delta V = \{v_{i+1} - v_i\}$, and a geo-referenced satellite map with $M$ reference tiles having descriptors $F^r = \{f_j^r\}_{j=1}^{M}$ and coordinates $G = \{(g_j^x, g_j^y)\}_{j=1}^{M}$, we seek global positions $P = \{(p_i^x, p_i^y)\}_{i=1}^{N}$.

3.2. Stage 1: Global Align

We estimate a global SE(2) transform $(\theta, t)$ by maximizing the trajectory-level similarity objective:

$$J(\theta, t) = \frac{1}{N} \sum_{i=1}^{N} \max_{j \in B(R_\theta v_i + t,\, r)} \left\langle f_i^q,\, f_j^r \right\rangle \tag{1}$$

where $R_\theta$ is the 2D rotation matrix, $B(x, r) = \{j : \|g_j - x\| \le r\}$ is the set of reference tiles within radius $r$ of $x$, and $\langle \cdot, \cdot \rangle$ denotes cosine similarity.
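To make (1) concrete, the following minimal NumPy sketch (function and variable names are ours, not the reference implementation) evaluates the objective for a candidate transform; it assumes L2-normalized descriptors, so dot products equal cosine similarities, and scores a frame as 0 when no tile falls inside its search ball:

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix R_theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def trajectory_objective(theta, t, sims, V, G, radius=150.0):
    """Evaluate J(theta, t) from Eq. (1).

    sims: (N, M) precomputed similarities <f_i^q, f_j^r>;
    V: (N, 2) VIO positions; G: (M, 2) reference tile coordinates.
    """
    P = V @ rot2d(theta).T + t                             # implied trajectory
    vals = []
    for i, p in enumerate(P):
        in_ball = np.linalg.norm(G - p, axis=1) <= radius  # tiles in B(p, r)
        vals.append(sims[i, in_ball].max() if in_ball.any() else 0.0)
    return float(np.mean(vals))
```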
Algorithm and intuition. Stage 1 treats VPR as a noisy correspondence generator: for a fixed rotation $\theta$, each frame proposes a translation "measurement" $d_i(\theta) = g_{j_i^*} - R_\theta v_i$ based on the global top-1 match $j_i^*$. Under perceptual aliasing, many $j_i^*$ are outliers; we therefore aggregate $\{d_i(\theta)\}$ with a robust location estimator. We use coordinate ascent: (1) scan a grid of $K$ rotation angles $\theta \in \{-\pi,\, -\pi + 2\pi/K,\, \ldots\}$; (2) for each $\theta$, compute the L1-optimal translation (component-wise median) [27]:

$$t(\theta) = \operatorname{median}\,\{g_{j_i^*} - R_\theta v_i\}_{i=1}^{N} \tag{2}$$

where $j_i^* = \arg\max_j \langle f_i^q, f_j^r \rangle$ is the global top-1 match; (3) evaluate $J(\theta, t(\theta))$ and select the best; (4) refine with alternating maximization using local targets.
We denote the resulting aligned trajectory by $P^{(1)} = \{p_i^{(1)}\}_{i=1}^{N}$. The following result formalizes why the median aggregator is optimal for robust translation estimation under outlier contamination:

Theorem (L1-Optimal Translation (Median)). Let $d_i = g_{j_i} - R_\theta v_i \in \mathbb{R}^2$ be translation residuals for fixed $\theta$. The minimizer of

$$\min_{t \in \mathbb{R}^2} \sum_{i=1}^{N} \|d_i - t\|_1 \tag{3}$$

is given by the component-wise median: $t_x = \operatorname{median}\{d_{i,x}\}$ and $t_y = \operatorname{median}\{d_{i,y}\}$. Consequently, $t(\theta)$ is robust to outlier residuals: as long as more than half of the residuals are inliers per coordinate, the estimate is controlled by the inlier set rather than the outliers.

Proof. The L1 norm $\|d_i - t\|_1 = |d_{i,x} - t_x| + |d_{i,y} - t_y|$ decomposes across coordinates, so the objective separates into two independent 1D problems: $\min_{t_x} \sum_i |d_{i,x} - t_x|$ and $\min_{t_y} \sum_i |d_{i,y} - t_y|$. The minimizer of $\sum_i |x_i - t|$ over $t \in \mathbb{R}$ is the median of $\{x_i\}$ [27]. Robustness follows because the median is unaffected by arbitrarily large perturbations to fewer than half of the samples. □
Remark (monotone acceptance). In our implementation, we only accept candidate updates that improve (or tie) $J$, so the sequence of evaluated objective values is monotone non-decreasing and bounded above by $1$, hence convergent. This guarantees convergence of the objective sequence, not global optimality.
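Concretely, a sketch of the Stage 1 grid search with median translation, reusing rot2d and trajectory_objective from the sketch above (the alternating refinement of step (4) is omitted for brevity):

```python
def stage1_global_align(sims, V, G, K=72, radius=150.0):
    """Rotation grid + L1-optimal (median) translation, Eq. (2)."""
    j_star = sims.argmax(axis=1)                   # global top-1 match per frame
    best = (-np.inf, 0.0, np.zeros(2))
    for theta in np.linspace(-np.pi, np.pi, K, endpoint=False):
        # Translation measurements d_i(theta), robustly aggregated by the median.
        t = np.median(G[j_star] - V @ rot2d(theta).T, axis=0)
        J = trajectory_objective(theta, t, sims, V, G, radius)
        if J > best[0]:                            # monotone acceptance of improvements
            best = (J, theta, t)
    _, theta, t = best
    return V @ rot2d(theta).T + t                  # aligned trajectory P^(1)
```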

3.3. Stage 2: Refinement

After global alignment, we refine positions using sliding-window bounded Procrustes. For each frame index $j$, we compute a local VPR target $t_j \in \mathbb{R}^2$ as the coordinate of the best-matching reference tile within radius $r$ of the current predicted position $p_j^{(1)}$, and record its cosine similarity score $s_j$. For each window $\mathcal{W} = \{i, i+1, \ldots, i+W-1\}$ with targets $\{t_j\}_{j \in \mathcal{W}}$ and weights $w_j = \max(0, s_j)^2$, we solve:

$$\min_{R \in SO(2),\, t} \sum_{j \in \mathcal{W}} w_j \left\| R\, p_j^{(1)} + t - t_j \right\|^2, \quad \text{s.t.}\ |\theta| \le \theta_{\max} \tag{4}$$
This weighted Procrustes problem admits a closed-form solution for both rotation and translation. The rotation constraint prevents overcorrection when local VPR targets are noisy:

Theorem (Optimal Bounded Procrustes). For the constrained problem (4) in 2D, let $\bar{p} = \sum_j w_j p_j^{(1)} / \sum_j w_j$ and $\bar{t} = \sum_j w_j t_j / \sum_j w_j$ be the weighted centroids, and let $H = \sum_{j \in \mathcal{W}} w_j (p_j^{(1)} - \bar{p})(t_j - \bar{t})^\top$ be the $2 \times 2$ weighted cross-covariance matrix. The optimal rotation angle is $\theta^* = \operatorname{clip}(\hat{\theta}, -\theta_{\max}, \theta_{\max})$, where $\hat{\theta} = \operatorname{atan2}(H_{10} - H_{01},\, H_{00} + H_{11})$, and the optimal translation is $t^* = \bar{t} - R_{\theta^*}\bar{p}$.

Proof. After centering by the weighted centroids, the objective separates into rotation and translation terms. For the rotation, we seek to maximize $\operatorname{tr}(R_\theta^\top H)$. Writing $R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$, the trace expands to $(H_{00} + H_{11})\cos\theta + (H_{10} - H_{01})\sin\theta$ [10,11]. This sinusoidal function of $\theta$ is maximized at $\hat{\theta} = \operatorname{atan2}(H_{10} - H_{01},\, H_{00} + H_{11})$. The constraint $|\theta| \le \theta_{\max}$ clips this to the boundary if $\hat{\theta}$ lies outside. The optimal translation follows as $t^* = \bar{t} - R_{\theta^*}\bar{p}$. □
Windows overlap (stride S < W ), so each frame receives corrections from multiple windows; we average these contributions. Multiple passes are performed because each pass updates the trajectory, which changes the local VPR targets { t j } (since they depend on the current predicted positions). Empirically, 2–3 passes suffice for convergence.
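A minimal NumPy sketch of the closed-form update for a single window (names are ours; targets and weights come from the local retrieval described above):

```python
import numpy as np

def bounded_procrustes(P_win, targets, weights, theta_max=0.09):
    """Closed-form bounded weighted Procrustes for one window, Eq. (4).

    P_win: (W, 2) current positions p_j^(1); targets: (W, 2) local VPR targets t_j;
    weights: (W,) w_j = max(0, s_j)^2, assumed not all zero.
    """
    w = weights / weights.sum()
    p_bar, t_bar = w @ P_win, w @ targets                      # weighted centroids
    H = (P_win - p_bar).T @ (w[:, None] * (targets - t_bar))   # 2x2 cross-covariance
    theta = np.arctan2(H[1, 0] - H[0, 1], H[0, 0] + H[1, 1])   # unconstrained optimum
    theta = np.clip(theta, -theta_max, theta_max)              # enforce |theta| <= theta_max
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return P_win @ R.T + (t_bar - R @ p_bar)                   # corrected positions R p_j + t*
```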

3.4. Stage 3: Smoothing

Let $P^{(2)} = \{p_i^{(2)}\}_{i=1}^{N}$ denote the Stage 2 output. We treat these positions as anchors $A = \{a_i\}_{i=1}^{N}$ with $a_i := p_i^{(2)}$, and fuse them with VIO constraints via:

$$\min_{P} \|D P - \Delta V\|^2 + \sum_{i=1}^{N} \alpha_i \|p_i - a_i\|^2 \tag{5}$$

where $D$ is the $(N-1) \times N$ first-difference matrix.
Outlier detection. For each anchor $a_i$, we recompute its local VPR similarity $s_i$ as the best cosine similarity among reference tiles within radius $r$ of $a_i$. We then compute z-scores $z_i = (s_i - \bar{s})/\sigma_s$. Frames with $z_i < -\tau$ (default $\tau = 1.5$) are marked as outliers and assigned $\alpha_i \approx 0$ (implemented as $10^{-6}$), effectively clamping them to the VIO prior. Inliers use $\alpha_i = \alpha_{\text{in}} = 0.05$. With the anchor weights thus defined, the optimization problem has a unique closed-form solution:
Theorem (Unique Solution). If $\alpha_i > 0$ for at least one $i$, then the objective (5) is strictly convex with unique minimizer given by the linear system:

$$\left(D^\top D + \operatorname{diag}(\alpha)\right) P = D^\top \Delta V + \operatorname{diag}(\alpha)\, A \tag{6}$$

Proof. The objective (5) is quadratic in $P$. Setting the gradient to zero yields the normal equations $(D^\top D + \operatorname{diag}(\alpha)) P = D^\top \Delta V + \operatorname{diag}(\alpha) A$. The matrix $D^\top D$ is the graph Laplacian of a path, which is positive semidefinite with nullspace spanned by the constant vector $\mathbf{1}$. Adding $\operatorname{diag}(\alpha)$ with at least one $\alpha_i > 0$ eliminates this nullspace: any $x = c\mathbf{1}$ has $x^\top \operatorname{diag}(\alpha)\, x = c^2 \sum_i \alpha_i > 0$. Thus the combined matrix is positive definite, guaranteeing strict convexity and a unique solution [30]. □
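A minimal dense NumPy sketch of Stage 3 (names are ours, not the reference C++ implementation); it forms Eq. (6) explicitly and solves it densely, whereas a production path would exploit the tridiagonal structure (Section 5.2):

```python
import numpy as np

def stage3_smooth(anchors, dV, sims, alpha_in=0.05, alpha_out=1e-6, tau=1.5):
    """MAP smoothing fusing VIO displacements with clamped VPR anchors, Eqs. (5)-(6).

    anchors: (N, 2) Stage 2 positions a_i; dV: (N-1, 2) VIO displacements;
    sims: (N,) local VPR similarities s_i recomputed at the anchors.
    """
    N = len(anchors)
    z = (sims - sims.mean()) / sims.std()            # assumes non-degenerate similarities
    alpha = np.where(z < -tau, alpha_out, alpha_in)  # outliers clamp to the VIO prior

    D = np.zeros((N - 1, N))                         # first-difference matrix
    idx = np.arange(N - 1)
    D[idx, idx], D[idx, idx + 1] = -1.0, 1.0

    A = D.T @ D + np.diag(alpha)                     # positive definite (Theorem)
    b = D.T @ dV + alpha[:, None] * anchors
    return np.linalg.solve(A, b)                     # unique minimizer P*
```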

3.5. Implementation Details

Algorithm 1 summarizes the complete pipeline. We use DeiT-Tiny-Distilled [33], a distilled Vision Transformer [31], for feature extraction (192-dim descriptors). Fixed parameters: search radius $r = 150$ m, $K = 72$ angles, window size $W = 10$, stride $S = 7$, $\theta_{\max} = 0.09$ rad ($\approx 5°$), z-threshold $\tau = 1.5$.
[Algorithm 1: the complete NaviLoc pipeline]
Detailed pseudocode for each stage is provided in Appendix A.
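For reference, the fixed defaults above can be grouped into a single configuration object; a sketch with our own (hypothetical) field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NaviLocConfig:
    """Default hyperparameters from Section 3.5 (field names are ours)."""
    radius_m: float = 150.0       # local search radius r
    n_angles: int = 72            # rotation grid size K
    window: int = 10              # Stage 2 window size W
    stride: int = 7               # Stage 2 stride S
    theta_max_rad: float = 0.09   # Stage 2 rotation bound (about 5 degrees)
    z_tau: float = 1.5            # Stage 3 outlier z-threshold
    alpha_in: float = 0.05        # inlier anchor weight
    alpha_out: float = 1e-6       # outlier anchor weight (clamps to VIO prior)
```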

4. Experiments

4.1. Dataset

To the best of our knowledge, there is no publicly available challenging dataset for our target setting (low-altitude, 50–150 m UAV imagery with synchronized VIO and a geo-referenced satellite tile database for trajectory-level evaluation), in particular with long real-world trajectories. We therefore evaluate on a prepared real-world UAV-to-satellite benchmark collected over rural terrain in Ukraine. Table 1 summarizes the dataset statistics. The dataset comprises 58 aerial query frames captured at 50–150 m AGL along a 2.3 km trajectory with 40 m inter-frame spacing. Reference imagery consists of 462 geo-referenced satellite tiles at 0.3 m/px resolution covering 1.6 km² with 40 m tile spacing. The benchmark is challenging due to strong perceptual aliasing in rural/village scenes (repetitive texture, limited distinctive landmarks) combined with low-altitude viewpoint and appearance changes relative to the satellite map.
Data collection and preparation. The query stream was recorded from a real UAV flight with an onboard camera and a visual-inertial odometry pipeline providing frame-to-frame displacements and a locally consistent 2D trajectory (up to drift). Ground truth positions for evaluation were obtained from the flight’s GNSS logs and used only for scoring (MLE/ATE) and for generating the correspondence visualization in Figure 2. The reference database was built by sampling satellite map imagery via the Google Maps API at zoom level 19 on a 40 m grid over the flight area; each tile is associated with its geographic coordinate. We then convert both query frames and reference tiles into embeddings using the backbones in Table 5.
Table 1. Dataset statistics.
Property Value
Query frames 58
Query spacing 40 m
Trajectory length 2,323 m
Flight altitude (AGL) 50–150 m
Reference tiles 462
Reference tile spacing 40 m
Reference coverage 1.6 km²

4.2. Baselines

We compare against:
  • Raw VIO (IP only): VIO trajectory with initial position aligned to ground truth (no rotation).
  • VIO+IP+IR (oracle): VIO with initial position and initial rotation estimated from the first displacement (one-segment oracle) and from the first k = 10 displacements (multi-segment oracle).
  • VIO SE(2) oracle: VIO with oracle global SE(2) alignment to ground truth (best possible rigid alignment).
  • Per-frame VPR (top-k): For each query frame, retrieve the k reference tiles with highest cosine similarity and predict position as their coordinate mean (top-1 uses the single best match; top-3 averages the three best); see the sketch after this list.
  • AnyLoc-VLAD [5]: State-of-the-art VPR using DINOv2 [32] with VLAD aggregation, evaluated with top-3 retrieval.
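As referenced in the per-frame VPR bullet, the top-k baseline reduces to a few lines; a sketch assuming L2-normalized descriptors (names are ours):

```python
import numpy as np

def per_frame_vpr(Fq, Fr, G, k=3):
    """Per-frame top-k baseline: mean coordinate of the k most similar tiles."""
    sims = Fq @ Fr.T                          # (N, M) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best tiles per query
    return G[topk].mean(axis=1)               # (N, 2) predicted positions
```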

4.3. Results

On this challenging real-world benchmark, NaviLoc achieves 19.5 m MLE—a 16.0× improvement over the state-of-the-art AnyLoc-VLAD, and a 32.1× improvement over raw VIO drift. Figure 3 and Figure 4 summarize these results.

4.4. Ablation Studies

Stage contributions. Each stage provides substantial improvement (Table 2). Stage 1 alone achieves 9× improvement by finding the correct global transform. Stage 2 refines local errors, and Stage 3 leverages VIO constraints while excluding outliers.
Outlier threshold. Table 3 shows sensitivity to the z-score threshold.
Hyperparameter sensitivity. Table 4 varies one hyperparameter at a time (others fixed to defaults). Performance is stable across moderate changes, while overly permissive local search radii or underconstrained refinement settings degrade accuracy.
Backbone comparison. Table 5 reveals that DeiT-Tiny-Distilled achieves the best NaviLoc performance despite not having the best per-frame VPR accuracy. Empirically, distillation appears to yield descriptors whose scores are more consistent across time, making trajectory-level alignment easier.

4.5. Computational Efficiency

NaviLoc operates on precomputed descriptors. We benchmark on Raspberry Pi 5 (ARM Cortex-A76, 8GB RAM). Table 6 summarizes the computational performance.
Inference strategy. At each timestep, the system samples an aerial image, extracts a 192-dimensional DeiT-Tiny-Distilled embedding (79 ms), and updates the global trajectory estimate using the C++ NaviLoc algorithm (32 ms). The combined end-to-end latency of 111 ms yields 9.0 FPS on Raspberry Pi 5, enabling real-time embedded deployment. Feature extraction becomes the bottleneck; DeiT-Tiny-Distilled requires 1.3 GFLOPs per image versus 21.1 GFLOPs for DINOv2-S, providing 16× faster extraction while achieving superior localization accuracy.

5. Discussion

We now analyze NaviLoc’s performance across multiple dimensions, compare with alternative approaches, and discuss limitations and future directions.

5.1. Comparison with Existing Methods

Versus per-frame VPR. Per-frame VPR methods treat each query independently, producing 342.1 m MLE (top-3) on our benchmark. NaviLoc achieves 19.5 m—a 17.5× improvement. The key insight is that perceptual aliasing causes individual VPR errors that are spatially inconsistent: incorrect matches scatter across the map, while correct matches cluster near the true trajectory. Trajectory-level optimization exploits this statistical property.
Versus state-of-the-art. AnyLoc [5] represents the state-of-the-art in universal VPR, aggregating DINOv2 [32] features via VLAD pooling. On our benchmark, AnyLoc-VLAD achieves 312.2 m MLE (top-3)—only marginally better than simple per-frame VPR. NaviLoc reduces this to 19.5 m, a 16.0× improvement. This demonstrates that even powerful foundation model features remain susceptible to cross-view aliasing without trajectory-level reasoning.
Versus VIO baselines. Raw VIO accumulates 626.7 m drift over the 2.3 km trajectory. Even with oracle initial rotation alignment, VIO achieves only 90.1 m MLE due to scale and heading drift. NaviLoc’s 19.5 m MLE represents a 32.1× improvement over raw VIO, highlighting the benefit of trajectory-level use of noisy VPR measurements to correct long-horizon drift.
Versus graph-based SLAM. Traditional pose graph optimization [25,26] formulates localization as nonlinear least squares over loop closure constraints. These methods typically require multiple sensors (LiDAR, stereo cameras, IMU), 3D point cloud processing, and GPU acceleration for real-time operation. NaviLoc is fundamentally lighter: it operates on a single monocular camera plus VIO, uses 2D tile descriptors rather than 3D geometry, and requires no GPU. Each stage employs closed-form solutions (median, SVD, linear solve) rather than iterative solvers (Gauss-Newton, Levenberg-Marquardt), and handles outliers via z-score clamping rather than iterative robust optimization [28,29]. This yields deterministic, bounded runtime on CPU-only embedded platforms.

5.2. Computational Efficiency

Real-time embedded deployment. NaviLoc achieves 9 FPS end-to-end on Raspberry Pi 5, combining 79 ms feature extraction (DeiT-Tiny-Distilled) with 32 ms trajectory optimization (C++). This demonstrates that trajectory-level cross-view localization can run in real time on a CPU-only embedded platform.
Comparison with foundation model approaches. While larger backbones can improve standalone VPR on some benchmarks, they are substantially more expensive for embedded CPU deployment. Our ablation (Table 5) shows that DINOv2-ViT-G/14+VLAD achieves 162.0 m MLE versus 19.5 m for DeiT-Tiny-Distilled—larger models did not improve NaviLoc on this cross-view dataset.
Runtime scaling. Let N denote trajectory length and M reference database size. Stage 1 performs O ( N ) global nearest-neighbor queries, each requiring O ( M ) comparisons, yielding O ( N M ) . The angle grid size and iteration counts are small constants. Stage 2 runs in O ( N ) per pass with constant window size. Stage 3 solves a tridiagonal system in O ( N ) via the Thomas algorithm [34]. The overall time complexity is O ( N M ) ; in practice, feature extraction dominates runtime. For larger reference databases, approximate nearest-neighbor structures could reduce this to sublinear in M.
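For a concrete view of that O(N) solve, the sketch below (names are ours) implements the Thomas algorithm for a tridiagonal system T x = rhs; for the Stage 3 normal equations, T = D^T D + diag(α) has diagonal (1 + α_1, 2 + α_2, ..., 2 + α_{N-1}, 1 + α_N) and off-diagonals equal to -1:

```python
import numpy as np

def thomas_solve(sub, diag, sup, rhs):
    """Thomas algorithm: O(N) solve of a tridiagonal system T x = rhs.

    sub: (N-1,) sub-diagonal; diag: (N,) diagonal; sup: (N-1,) super-diagonal;
    rhs: (N, 2) right-hand side (x and y columns solved together).
    """
    n = len(diag)
    cp = np.zeros(n - 1)
    dp = np.zeros_like(rhs, dtype=float)
    cp[0] = sup[0] / diag[0]
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n):                     # forward elimination
        denom = diag[i] - sub[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = sup[i] / denom
        dp[i] = (rhs[i] - sub[i - 1] * dp[i - 1]) / denom
    x = np.zeros_like(dp)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

A caller would pass sub and sup filled with -1, diag = (1, 2, ..., 2, 1) plus the anchor weights α, and rhs = D^T ΔV + diag(α) A as in Eq. (6).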

5.3. Portability and Transferability

Training-free operation. Unlike learned cross-view methods [13,14] that require domain-specific training data, NaviLoc operates on pretrained features without fine-tuning. This is critical for GNSS-denied scenarios where ground truth correspondences are unavailable for training.
Backbone flexibility. NaviLoc accepts any image descriptor; our ablation tested five backbones (Table 5). This modularity allows leveraging future feature extractors without algorithmic changes.
Simplicity. NaviLoc has no learned components beyond the feature extractor, no hyperparameter schedules, and no complex data association. This facilitates debugging and deployment in safety-critical applications.

5.4. Robustness Analysis

Hyperparameter stability. Table 4 demonstrates that NaviLoc maintains sub-25 m accuracy across ± 50 % perturbations of most hyperparameters. The outlier threshold (Table 3) shows graceful degradation outside the optimal range. This robustness simplifies deployment: practitioners can use default settings without extensive tuning.
Adaptive outlier detection. Z-score normalization adapts to per-trajectory statistics: a similarity of 0.25 may indicate an outlier in one flight but a reliable match in another. This eliminates per-dataset threshold tuning.

5.5. Feature Descriptor Analysis

Table 5 shows that DeiT-Tiny-Distilled (5M parameters) achieves 19.5 m MLE, while DINOv2-ViT-G/14 (1.1B parameters) achieves 162.0 m—8× worse despite being 200× larger. This empirical finding suggests that model selection for trajectory-level VPR should consider compatibility with geometric reasoning, not just per-frame retrieval accuracy.

5.6. Limitations and Future Work

VIO dependency. NaviLoc assumes VIO provides accurate relative motion over short horizons. Significant VIO failures (e.g., from aggressive maneuvers or texture-poor environments) would propagate to the final estimate. Future work could jointly estimate VIO scale and bias.
Trajectory length. Very short trajectories (<10 frames) provide insufficient statistics for robust similarity aggregation. A minimum flight distance of ∼400 m is recommended.
Seasonal and environmental variation. Our evaluation uses satellite imagery captured under similar conditions to the query flight. Performance under seasonal changes (snow cover, vegetation differences) or varying lighting/weather conditions between the reference map and query images remains underexplored and is a direction for future work.
3D extension. The current SE(2) formulation assumes approximately planar flight. Extending to SE(3) would support altitude variations and complex terrain, potentially leveraging 3D point cloud or mesh references.
Online operation. NaviLoc currently processes trajectories in batch. An incremental variant that updates estimates as new frames arrive would enable tighter integration with flight controllers for real-time autonomous navigation.
Energy efficiency. Reducing onboard computation energy extends flight endurance and lowers environmental impact. Systematic energy profiling [35] could guide model selection and duty-cycling to further improve efficiency.

6. Conclusions

We presented NaviLoc, a three-stage trajectory-level localization pipeline with rigorous mathematical foundations. Each stage has a well-defined objective and provable convergence properties. NaviLoc achieves 19.5 m Mean Localization Error on a UAV-to-satellite benchmark, representing a 16× improvement over AnyLoc-VLAD, the state-of-the-art VPR method. End-to-end inference runs at 9 FPS on Raspberry Pi 5, demonstrating practical viability for real-time embedded UAV navigation.

Author Contributions

Conceptualization, methodology, software, experiments, visualization, writing—original draft: P.S.; supervision, formal analysis, writing—review and editing, funding acquisition: T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Academia Tech, Taras Shevchenko National University of Kyiv, and Hackathon Expert NGO.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experimental data used in this study are available from the corresponding author upon reasonable request for non-commercial, academic research purposes.

Acknowledgments

The authors thank Academia Tech for providing datasets and computational infrastructure to support this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AGL Above Ground Level
ATE Absolute Trajectory Error
CPU Central Processing Unit
FPS Frames Per Second
GNSS Global Navigation Satellite System
GPS Global Positioning System
GPU Graphics Processing Unit
ICP Iterative Closest Point
IMU Inertial Measurement Unit
LiDAR Light Detection and Ranging
MAP Maximum A Posteriori
MLE Mean Localization Error
RAM Random Access Memory
SE(2) Special Euclidean Group in 2D
SLAM Simultaneous Localization and Mapping
SVD Singular Value Decomposition
UAV Unmanned Aerial Vehicle
VIO Visual-Inertial Odometry
ViT Vision Transformer
VLAD Vector of Locally Aggregated Descriptors
VPR Visual Place Recognition

Appendix A. Detailed Algorithm Pseudocode

This appendix provides detailed pseudocode for each stage of NaviLoc with complete mathematical specifications.

Appendix A.1. Stage 1: Global Align

[Algorithm A1: Stage 1 (Global Align) pseudocode]

Appendix A.2. Stage 2: Refinement

[Algorithm A2: Stage 2 (Refinement) pseudocode]

Appendix A.3. Stage 3: Smoothing

[Algorithm A3: Stage 3 (Smoothing) pseudocode]

References

  1. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  2. Sivic, J.; Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the IEEE ICCV, Nice, France, 2003; pp. 1470–1477. [Google Scholar] [CrossRef]
  3. Gálvez-López, D.; Tardós, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
  4. Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE CVPR, Las Vegas, NV, USA, 2016; pp. 5297–5307. [Google Scholar] [CrossRef]
  5. Keetha, N.; Mishra, A.; Karhade, J.; Jatavallabhula, K.M.; Scherer, S.; Krishna, M.; Garg, S. AnyLoc: Towards Universal Visual Place Recognition. IEEE Robot. Autom. Lett. 2024, 9, 1286–1293. [Google Scholar] [CrossRef]
  6. Ali-bey, A.; Chaib-draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. In Proceedings of the IEEE WACV, Waikoloa, HI, USA, 2023; pp. 2998–3007. [Google Scholar] [CrossRef]
  7. Couturier, A.; Akhloufi, M.A. A Review on Deep Learning for UAV Absolute Visual Localization. Drones 2024, 8, 622. [Google Scholar] [CrossRef]
  8. Besl, P.J.; McKay, N.D. A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
  9. Chen, Y.; Medioni, G. Object Modelling by Registration of Multiple Range Images. Image Vis. Comput. 1992, 10, 145–155. [Google Scholar] [CrossRef]
  10. Schönemann, P.H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 1966, 31, 1–10. [Google Scholar] [CrossRef]
  11. Arun, K.S.; Huang, T.S.; Blostein, S.D. Least-Squares Fitting of Two 3-D Point Sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 9, 698–700. [Google Scholar] [CrossRef] [PubMed]
  12. Umeyama, S. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
  13. Hu, S.; Feng, M.; Nguyen, R.M.; Lee, G.H. CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization. In Proceedings of the IEEE CVPR, Salt Lake City, UT, USA, 2018; pp. 7258–7267. [Google Scholar] [CrossRef]
  14. Zhu, S.; Shah, M.; Chen, C. VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval. In Proceedings of the IEEE CVPR, Nashville, TN, USA, 2021; pp. 3640–3649. [Google Scholar] [CrossRef]
  15. Workman, S.; Souvenir, R.; Jacobs, N. Wide-Area Image Geolocalization with Aerial Reference Imagery. In Proceedings of the IEEE ICCV, Santiago, Chile, 2015; pp. 3961–3969. [Google Scholar] [CrossRef]
  16. Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization. In Proceedings of the ACM MM, Seattle, WA, USA, 2020; pp. 1395–1403. [Google Scholar] [CrossRef]
  17. Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens. 2021, 13, 47. [Google Scholar] [CrossRef]
  18. Zhuo, X.; Koch, T.; Kurz, F.; Fraundorfer, F.; Reinartz, P. Automatic UAV Image Geo-Registration by Matching UAV Images to Georeferenced Image Data. Remote Sensing 2017, 9, 376. [Google Scholar] [CrossRef]
  19. Zhuang, J.; Dai, M.; Chen, X.; Zheng, E. A Faster and More Effective Cross-View Matching Method of UAV and Satellite Images for UAV Geolocalization. Remote Sensing 2021, 13, 3979. [Google Scholar] [CrossRef]
  20. Cui, Z.; Zhou, P.; Wang, X.; Zhang, Z.; Li, Y.; Li, H.; Zhang, Y. A Novel Geo-Localization Method for UAV and Satellite Images Using Cross-View Consistent Attention. Remote Sensing 2023, 15, 4667. [Google Scholar] [CrossRef]
  21. Mourikis, A.I.; Roumeliotis, S.I. A Multi-State Constraint Kalman Filter for Vision-Aided Inertial Navigation. In Proceedings of the IEEE ICRA, Rome, Italy, 2007; pp. 3565–3572. [Google Scholar] [CrossRef]
  22. Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization. Int. J. Robot. Res. 2015, 34, 314–334. [Google Scholar] [CrossRef]
  23. Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
  24. Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
  25. Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
  26. Kümmerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. g2o: A General Framework for Graph Optimization. In Proceedings of the IEEE ICRA, Shanghai, China, 2011; pp. 3607–3613. [Google Scholar] [CrossRef]
  27. Huber, P.J. Robust Statistics; John Wiley & Sons: New York, NY, USA, 1981; ISBN 978-0-471-41805-4. [Google Scholar]
  28. Sünderhauf, N.; Protzel, P. Switchable Constraints for Robust Pose Graph SLAM. In Proceedings of the IEEE IROS, Vilamoura, Portugal, 2012; pp. 1879–1884. [Google Scholar] [CrossRef]
  29. Yang, H.; Antonante, P.; Tzoumas, V.; Carlone, L. Graduated Non-Convexity for Robust Spatial Perception. IEEE Robot. Autom. Lett. 2020, 5, 1127–1134. [Google Scholar] [CrossRef]
  30. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-83378-3. [Google Scholar]
  31. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Proceedings of ICLR, 2021; Available online: https://arxiv.org/abs/2010.11929.
  32. Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. Available online: https://arxiv.org/abs/2304.07193.
  33. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-efficient Image Transformers & Distillation through Attention. In Proceedings of ICML, 2021; pp. 10347–10357. Available online: https://arxiv.org/abs/2012.12877.
  34. Golub, G.H.; Van Loan, C.F. Matrix Computations, 4th ed.; Johns Hopkins University Press: Baltimore, MD, USA, 2013; ISBN 978-1-4214-0794-4. [Google Scholar]
  35. Panchenko, T.V.; Piatygorskiy, N.D. Enrichment of the HEPscore Benchmark by Energy Consumption Assessment. Technologies 2025, 13, 362. [Google Scholar] [CrossRef]
Figure 1. NaviLoc pipeline. Global Align (Stage 1): global VPR yields coarse targets used to optimize a trajectory-level similarity objective over SE(2). Refinement (Stage 2): windowed local VPR supports bounded weighted Procrustes updates. Smoothing (Stage 3): VIO deltas are enforced while low-confidence anchors are detected via similarity statistics and clamped to the VIO prior.
Figure 2. Ground truth cross-view correspondences. Upper: UAV query frames (50–150 m AGL, above ground level). Lower: corresponding satellite tiles matched by GPS (ground truth). Lines connect each query to its geographically correct reference. The visual similarity between aerial and satellite views is often ambiguous even to human observers—cross-view matching is inherently challenging. Per-frame VPR retrieval frequently returns distant tiles, motivating trajectory-level optimization.
Figure 3. Mean Localization Error comparison. NaviLoc (19.5 m) outperforms all baselines by a wide margin.
Figure 4. Trajectory comparison: ground truth (green, dashed) versus NaviLoc estimate (navy, solid). NaviLoc correctly recovers the global trajectory despite per-frame VPR noise.
Table 2. Quantitative comparison. MLE = Mean Localization Error; ATE = Absolute Trajectory Error.
Method MLE (m) ATE (m) Improvement
Raw VIO (IP only) 626.7 728.9 1.0×
VIO+IP+IR (first delta, oracle)* 98.4 132.5 6.4×
VIO+IP+IR (first k = 10 deltas, oracle)* 90.1 132.2 7.0×
Per-frame VPR (top-1) 404.3 471.8 1.5×
Per-frame VPR (top-3) 342.1 413.8 1.8×
AnyLoc-VLAD (top-3) [5] 312.2 368.5 2.0×
VIO SE(2) oracle* 54.2 67.9 11.6×
NaviLoc: Stage 1 69.3 76.9 9.0×
NaviLoc: Stages 1+2 36.7 42.6 17.1×
NaviLoc: Full 19.5 21.6 32.1×
*Oracle baselines use ground truth information.
Table 3. Outlier detection threshold ablation.
Threshold Outliers MLE (m) Improvement
None 0 26.6 23.5×
z < −2.0 3 21.4 29.3×
z < −1.5 7 19.5 32.1×
z < −1.0 9 27.2 23.1×
Table 4. One-at-a-time hyperparameter sensitivity (full NaviLoc pipeline). Default settings are bold.
Hyperparameter Setting MLE (m)
Rotation grid (K angles) 36 / 72 / 144 32.7 / 19.5 / 23.8
Local search radius r (m) 100 / 150 / 200 36.6 / 19.5 / 83.0
Stage 1 iterations 1 / 2 / 3 / 4 19.8 / 22.2 / 19.5 / 19.5
Window size W (frames) 9 / 10 / 11 / 12 27.6 / 19.5 / 24.6 / 27.2
Stride S (frames) 5 / 7 / 9 33.2 / 19.5 / 35.2
Rotation bound θ_max (rad) 0.05 / 0.09 / 0.12 24.3 / 19.5 / 19.2
Anchor weight α_in 0.02 / 0.05 / 0.10 20.6 / 19.5 / 22.6
Table 5. Backbone comparison.
Backbone Dimensions VPR MLE (m) NaviLoc MLE (m)
DeiT-Tiny-Distilled 192 342.1 19.5
ConvNeXt-Tiny 768 364.1 42.0
LeViT-256 (distilled) 512 346.7 70.6
DeiT-Tiny 192 430.4 105.1
DINOv2-ViT-G/14 + VLAD 98304 356.7 162.0
Table 6. Computational performance on Raspberry Pi 5.
Component Time (ms) FPS
Feature Extraction (DeiT-Tiny-Distilled) 79 12.7
NaviLoc Algorithm (C++) 32 31
End-to-End Inference 111 9.0
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.