Cross-Modal Invariant Representation Learning for Robust Image-to-PointCloud Place Recognition

Shuxin Mo; Bowen Lou

doi:10.20944/preprints202601.2307.v1

Submitted: 28 January 2026

Posted: 29 January 2026


Abstract

Image-to-PointCloud place recognition is vital for autonomous systems, yet faces challenges from the inherent modality gap and drastic environmental variations. We propose Cross-Modal Invariant Representation Learning (CMIRL) to learn highly invariant cross-modal global descriptors. CMIRL introduces an Adaptive Cross-Modal Alignment (ACMA) module, which dynamically projects point clouds based on image semantics to generate view-optimized dense depth maps. A Dual-Stream Invariant Feature Encoder, featuring a Transformer-based Cross-Modal Attention Fusion (CMAF) module, then explicitly learns and emphasizes features shared across modalities and insensitive to environmental perturbations. These fused local features are subsequently aggregated into a robust global descriptor using an enhanced multi-scale NetVLAD network. Extensive experiments on the challenging KITTI dataset demonstrate that CMIRL significantly outperforms state-of-the-art methods in terms of top-one recall and overall recall. An ablation study validates the effectiveness of each proposed module, and qualitative analysis confirms enhanced robustness under adverse conditions, including low light, heavy shadows, simulated weather, and significant viewpoint changes. Strong generalization capabilities on an unseen dataset and competitive computational efficiency further highlight CMIRL's potential for reliable long-term autonomous localization.
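The top-one recall reported above is a retrieval metric over global descriptors: an image query counts as correct when the most similar point-cloud descriptor in the database was captured near the query's true location. The following is a minimal sketch of that evaluation, not the authors' code. The function name `recall_at_k`, the use of cosine similarity, and the 10 m `pos_radius` threshold are assumptions chosen for illustration; the abstract does not specify them.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, query_pos, db_pos, k=1, pos_radius=10.0):
    """Fraction of queries with at least one true match among the top-k retrievals.

    query_desc: (Q, D) global descriptors of the query images.
    db_desc:    (N, D) global descriptors of the database point clouds.
    query_pos:  (Q, 2) ground-truth positions of the queries, in metres.
    db_pos:     (N, 2) ground-truth positions of the database entries, in metres.
    pos_radius: assumed distance threshold for a correct match (hypothetical value).
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)

    # Similarity of every query to every database entry, shape (Q, N).
    sim = q @ d.T

    # Indices of the k most similar database entries per query, shape (Q, k).
    topk = np.argsort(-sim, axis=1)[:, :k]

    # Metric distance between each query and each of its retrieved entries.
    dist = np.linalg.norm(db_pos[topk] - query_pos[:, None, :], axis=2)

    # A query is a hit if any retrieved entry lies within the radius.
    hits = (dist <= pos_radius).any(axis=1)
    return float(hits.mean())
```

Calling `recall_at_k(..., k=1)` gives top-one recall. Larger values of `k` give the recall-at-N figures commonly reported alongside it.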

Keywords: place recognition; image-to-pointcloud; cross-modal invariant representation; transformer; global descriptors

Subject: Computer Science and Mathematics - Computer Vision and Graphics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
