Submitted: 27 June 2026
Posted: 30 June 2026
Abstract
Keywords:
1. Introduction
- Perception Evolution. Analogous to humans moving from coarse visual impressions to precise spatial cognition, VLN perception has evolved from panoramic vision-language alignment to contextualized spatial understanding, enabling semantic entity grounding, 3D spatial construction, and streaming multi-source perception.
- Cognition Evolution. Similar to humans using mental maps to imagine routes before acting, VLN cognition has shifted from reactive decisions based on immediate observations to world-model-driven predictive planning, moving agents from observation-driven reaction to model-based deliberation.
- Learning Evolution. Analogous to humans progressing from imitation to intrinsic learning, VLN learning has evolved from supervised imitation to reward-driven optimization, enabling agents to learn from experience and self-correct via expert trajectories and foundation-model-guided rewards.
- Generalization Evolution. Just as humans transfer knowledge to novel settings and keep adapting throughout life, VLN generalization has evolved from closed-benchmark evaluation toward reliable open-world operation, spanning the environment, horizon, lifelong, scene, and safety dimensions.
2. Preliminaries: The VLN Landscape
2.1. Task Formulation
2.2. Representative Benchmarks
2.3. Standard Metrics
- Success Rate (SR). The percentage of episodes in which the agent stops within a threshold distance (typically 3 meters) of the goal.
- Oracle Success Rate (OSR). SR computed using the closest point along the agent’s trajectory to the goal, indicating whether the agent ever passes near the target.
- Path Length (PL). The total distance traveled by the agent during task completion, where shorter paths indicate higher navigation efficiency.
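- Success weighted by Path Length (SPL). Success weighted by the ratio of the shortest-path length to the longer of the shortest-path length and the actual path length, averaged over episodes, penalizing unnecessarily long trajectories.
- Navigation Error (NE). The average distance between the agent’s final position and the goal, where lower values indicate better performance.
- Normalized Dynamic Time Warping (nDTW). A measure of the fidelity of the agent’s trajectory to the reference path, computed from the dynamic-time-warping alignment cost between the two and normalized to the range (0, 1], where higher values indicate closer adherence.
- Trajectory Length (TL). The total distance traveled by the agent during the navigation episode; as defined here, it measures the same quantity as PL.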
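The metrics above can be illustrated with a minimal sketch that scores a single episode and then averages over episodes. It rests on several simplifying assumptions that are not part of the original text: positions are 2D points, distances are Euclidean (benchmarks typically use geodesic distance in the simulator), the reference path is treated as the shortest path to the goal, and the function names `episode_metrics` and `aggregate` are hypothetical. nDTW is computed as exp(-DTW / (|R| · d_th)), with |R| the number of reference points and d_th the success threshold.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def path_length(path: Sequence[Point]) -> float:
    """Sum of straight-line segment lengths (assumption: Euclidean, not geodesic)."""
    return sum(math.dist(p, q) for p, q in zip(path, path[1:]))


def dtw(pred: Sequence[Point], ref: Sequence[Point]) -> float:
    """Classic dynamic-time-warping alignment cost between two point sequences."""
    n, m = len(pred), len(ref)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(pred[i - 1], ref[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def episode_metrics(
    pred: Sequence[Point], ref: Sequence[Point], threshold: float = 3.0
) -> Dict[str, float]:
    """Score one episode; the goal is taken to be the last reference point."""
    goal = ref[-1]
    ne = math.dist(pred[-1], goal)                   # Navigation Error
    sr = 1.0 if ne <= threshold else 0.0             # Success (stop near goal)
    closest = min(math.dist(p, goal) for p in pred)
    osr = 1.0 if closest <= threshold else 0.0       # Oracle Success
    tl = path_length(pred)                           # Trajectory / Path Length
    shortest = path_length(ref)                      # assumption: ref is shortest path
    denom = max(tl, shortest)
    spl = sr * shortest / denom if denom > 0 else sr
    ndtw = math.exp(-dtw(pred, ref) / (len(ref) * threshold))
    return {"NE": ne, "SR": sr, "OSR": osr, "TL": tl, "SPL": spl, "nDTW": ndtw}


def aggregate(episodes: List[Dict[str, float]]) -> Dict[str, float]:
    """Average every metric over episodes (SR and OSR become rates in [0, 1])."""
    keys = episodes[0].keys()
    return {k: sum(e[k] for e in episodes) / len(episodes) for k in keys}


if __name__ == "__main__":
    reference = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0), (10.0, 0.0)]
    stops_short = [(0.0, 0.0), (4.0, 0.0), (8.0, 0.0)]     # ends 2 m from goal
    overshoots = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]    # passes goal, stops far away
    results = [episode_metrics(p, reference) for p in (stops_short, overshoots)]
    print(aggregate(results))
```

The second trajectory in the usage example passes through the goal but stops far from it, so it counts toward OSR and not toward SR; this is the gap between the two metrics that the definitions above describe.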
3. Perception Evolution: From Visual Grounding to Situated Spatial Understanding
3.1. Semantic Granularity Evolution: From Holistic Views to Open-Vocabulary Semantic Anchors
3.1.1. Holistic Image Perception
3.1.2. Entity and Scene-Level Perception
3.1.3. Open-Vocabulary Entity Perception
3.2. Spatial Structure Evolution: From Local Views to Embodied 3D Space
3.2.1. Topological Spatial Perception
3.2.2. BEV and Map-Based Spatial Representation
3.2.3. 3D Spatial Representation
3.3. Input Realism Evolution: From Static Observations to Situated Sensory Streams
3.3.1. Video Streaming Perception
3.3.2. Multi-Source Perception
3.4. Open Challenges in Perception
4. Cognition Evolution: From Instruction Interpretation to Predictive World Modeling
4.1. Instruction Abstraction Evolution: From Raw Instructions to Executable Task Structures
4.1.1. Fine-Grained Instruction Decomposition
4.1.2. Structured Instruction Constraints
4.2. Spatial Reasoning Evolution: From Grounded Anchors to Relational Spatial Inference
4.2.1. Spatial Relation Reasoning
4.2.2. Memory-Augmented Spatial Inference
4.3. Deliberative Planning Evolution: From Implicit Policies to Explicit Reasoning
4.3.1. Explicit Reasoning Traces
4.3.2. Self-Monitoring and Robust Planning
4.4. World Model Evolution: From Reasoning over Observations to Imagining Future States
4.4.1. Future Prediction and Visual Imagination
4.4.2. Foundation-Model and Self-Evolving World Models
4.4.3. Toward World-Action Models for VLN
6. Generalization Evolution: From Closed Benchmarks to Open-World Deployment
6.1. Environment Generalization: From Closed-Set Evaluation to Zero-Shot Open-World Navigation
6.1.1. LLMs as External Reasoning Modules
6.1.2. VLMs as End-to-End Navigation Engines
6.2. Horizon Generalization: From Short-Horizon Instruction Following to Long-Horizon Agentic Navigation
6.2.1. Benchmarks and Evaluation for Long-Horizon VLN
6.2.2. Hierarchical Planning for Long-Horizon Navigation
6.2.3. Agentic Reasoning for Long-Horizon Navigation
6.3. Lifelong Adaptation: From Episodic Isolation to Continual Learning and Self-Evolution
6.3.1. Continual Learning for Lifelong Deployment
6.3.2. Self-Evolution for Lifelong Navigation
6.4. Scene Generalization: From Structured Indoor Environments to Cross-Platform and City-Scale Navigation
6.4.1. Platform Extension from Indoor Ground Navigation to Outdoor Navigation
6.4.2. Scale Extension toward City-Scale VLN
6.5. Safety Generalization: From Controlled Simulation to Trustworthy Real-World Deployment
6.5.1. Instruction and Perceptual Robustness
6.5.2. Social Awareness and Embodied Deployment Reliability
6.6. Open Challenges in Generalization
7. Conclusion
References
- Gu, J.; Stefani, E.; Wu, Q.; Thomason, J.; Wang, X. Vision-and-language navigation: A survey of tasks, methods, and future directions. In Proceedings of the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 7606–7623.
- Krantz, J.; Wijmans, E.; Majumdar, A.; Batra, D.; Lee, S. Beyond the nav-graph: Vision-and-language navigation in continuous environments. In Proceedings of the European Conference on Computer Vision. Springer, 2020, pp. 104–120. [CrossRef]
- Plikynas, D.; Žvironas, A.; Budrionis, A.; Gudauskis, M. Indoor navigation systems for visually impaired persons: Mapping the features of existing technologies to user needs. Sensors 2020, 20, 636.
- Anderson, P.; Wu, Q.; Teney, D.; Bruce, J.; Johnson, M.; Sünderhauf, N.; Reid, I.; Gould, S.; Van Den Hengel, A. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3674–3683.
- Fried, D.; Hu, R.; Cirik, V.; Rohrbach, A.; Andreas, J.; Morency, L.P.; Berg-Kirkpatrick, T.; Saenko, K.; Klein, D.; Darrell, T. Speaker-follower models for vision-and-language navigation. Advances in neural information processing systems 2018, 31.
- Tan, H.; Yu, L.; Bansal, M. Learning to navigate unseen environments: Back translation with environmental dropout. In Proceedings of the Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 2610–2621.
- Hong, Y.; Wu, Q.; Qi, Y.; Rodriguez-Opazo, C.; Gould, S. A recurrent vision-and-language bert for navigation. arXiv preprint arXiv:2011.13922 2020.
- Chen, S.; Guhur, P.L.; Schmid, C.; Laptev, I. History aware multimodal transformer for vision-and-language navigation. Advances in neural information processing systems 2021, 34, 5834–5847.
- Zhou, G.; Hong, Y.; Wu, Q. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 7641–7649.
- Zhou, G.; Hong, Y.; Wang, Z.; Wang, X.E.; Wu, Q. Navgpt-2: Unleashing navigational reasoning capability for large vision-language models. In Proceedings of the European Conference on Computer Vision. Springer, 2024, pp. 260–278.
- Qiao, Y.; Lyu, W.; Wang, H.; Wang, Z.; Li, Z.; Zhang, Y.; Tan, M.; Wu, Q. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6710–6717.
- Chen, J.; Lin, B.; Xu, R.; Chai, Z.; Liang, X.; Wong, K.Y. Mapgpt: Map-guided prompting with adaptive path planning for vision-and-language navigation. In Proceedings of the Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9796–9810.
- Shi, X.; Li, Z.; Lyu, W.; Xia, J.; Dayoub, F.; Qiao, Y.; Wu, Q. Smartway: Enhanced waypoint prediction and backtracking for zero-shot vision-and-language navigation. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 16923–16930.
- Krantz, J.; Banerjee, S.; Zhu, W.; Corso, J.; Anderson, P.; Lee, S.; Thomason, J. Iterative vision-and-language navigation. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 14921–14930.
- Hong, Y.; Wu, Q.; Qi, Y.; Rodriguez-Opazo, C.; Gould, S. Vln bert: A recurrent vision-and-language bert for navigation. In Proceedings of the Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2021, pp. 1643–1653.
- Chen, S.; Guhur, P.L.; Tapaswi, M.; Schmid, C.; Laptev, I. Think global, act local: Dual-scale graph transformer for vision-and-language navigation. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16537–16547.
- Li, P.; Wu, K.; Xu, S.; Li, F.; Zhao, L.; Chen, L.; Yang, Z.X.; Zheng, N. Think before Go: Hierarchical Reasoning for Image-goal Navigation. arXiv preprint arXiv:2604.17407 2026.
- Qiao, Y.; Qi, Y.; Hong, Y.; Yu, Z.; Wang, P.; Wu, Q. Hop: History-and-order aware pre-training for vision-and-language navigation. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15418–15427.
- Wang, Z.; Li, J.; Hong, Y.; Wang, Y.; Wu, Q.; Bansal, M.; Gould, S.; Tan, H.; Qiao, Y. Scaling data generation in vision-and-language navigation. In Proceedings of the Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 12009–12020.
- Wang, L.; He, Z.; Dang, R.; Shen, M.; Liu, C.; Chen, Q. Vision-and-language navigation via causal learning. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 13139–13150.
- Wu, W.; Chang, T.; Li, X.; Yin, Q.; Hu, Y. Vision-language navigation: a survey and taxonomy. Neural Computing and Applications 2024, 36, 3291–3316.
- Zhang, Y.; Ma, Z.; Li, J.; Qiao, Y.; Wang, Z.; Chai, J.; Wu, Q.; Bansal, M.; Kordjamshidi, P. Vision-and-language navigation today and tomorrow: A survey in the era of foundation models. arXiv preprint arXiv:2407.07035 2024.
- Khan, J.; Aafaq, N.; Ali, Q.; Mohsin, M. A comprehensive review of recent advancements in vision-and-language navigation. Discover Computing 2026, 29, 167.
- Pan, H.; Huang, S.; Yang, J.; Mi, J.; Li, K.; You, X.; Liang, P.; Yang, J.; Liu, Y.; Zhang, J.; et al. Robot Navigation via Foundation Language Models: A Review. ACM Computing Surveys 2026, 58, 1–38. [CrossRef]
- Nguyen, K.; Dey, D.; Brockett, C.; Dolan, B. Vision-based navigation with language-based assistance via imitation learning with indirect intervention. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12527–12537.
- Krantz, J.; Gokaslan, A.; Batra, D.; Lee, S.; Maksymets, O. Waypoint models for instruction-guided navigation in continuous environments. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15162–15171.
- An, D.; Wang, H.; Wang, W.; Wang, Z.; Huang, Y.; He, K.; Wang, L. Etpnav: Evolving topological planning for vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024.
- Jain, V.; Magalhaes, G.; Ku, A.; Vaswani, A.; Ie, E.; Baldridge, J. Stay on the path: Instruction fidelity in vision-and-language navigation. In Proceedings of the Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1862–1872.
- Chen, H.; Suhr, A.; Misra, D.; Snavely, N.; Artzi, Y. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12538–12547.
- Mirowski, P.; Banki-Horvath, A.; Anderson, K.; Teplyashin, D.; Hermann, K.M.; Malinowski, M.; Grimes, M.K.; Simonyan, K.; Kavukcuoglu, K.; Zisserman, A.; et al. The streetlearn environment and dataset. arXiv preprint arXiv:1903.01292 2019.
- Ku, A.; Anderson, P.; Patel, R.; Ie, E.; Baldridge, J. Room-across-room: Multilingual vision-and-language navigation with dense spatiotemporal grounding. In Proceedings of the Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4392–4412.
- Qi, Y.; Wu, Q.; Anderson, P.; Wang, X.; Wang, W.Y.; Shen, C.; Hengel, A.v.d. Reverie: Remote embodied visual referring expression in real indoor environments. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9982–9991.
- Batra, D.; Gokaslan, A.; Kembhavi, A.; Maksymets, O.; Mottaghi, R.; Savva, M.; Toshev, A.; Wijmans, E. Objectnav revisited: On evaluation of embodied agents navigating to objects. arXiv preprint arXiv:2006.13171 2020.
- Thomason, J.; Murray, M.; Cakmak, M.; Zettlemoyer, L. Vision-and-dialog navigation. In Proceedings of the Conference on Robot Learning. PMLR, 2020, pp. 394–406.
- Padmakumar, A.; Thomason, J.; Shrivastava, A.; Lange, P.; Narayan-Chen, A.; Gella, S.; Piramuthu, R.; Tur, G.; Hakkani-Tur, D. Teach: Task-driven embodied agents that chat. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2022, Vol. 36, pp. 2017–2025. [CrossRef]
- Gao, X.; Gao, Q.; Gong, R.; Lin, K.; Thattai, G.; Sukhatme, G.S. Dialfred: Dialogue-enabled agents for embodied instruction following. IEEE Robotics and Automation Letters 2022, 7, 10049–10056. [CrossRef]
- Song, X.; Chen, W.; Liu, Y.; Chen, W.; Li, G.; Lin, L. Towards long-horizon vision-language navigation: Platform, benchmark and method. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 12078–12088.
- Zhang, J.; Ma, K. MG-VLN: Benchmarking Multi-Goal and Long-Horizon Vision-Language Navigation with Language Enhanced Memory Map. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 7750–7757.
- Wang, T.; Li, X.; Lu, F.; Gong, T.; Dong, J.; Xue, W.; Qu, S.; Bai, C.; Chen, G. CoNavBench: Collaborative Long-Horizon Vision-Language Navigation Benchmark. In Proceedings of the The Fourteenth International Conference on Learning Representations, 2026.
- Yue, L.; Zhou, D.; Xie, L.; Zhang, F.; Yan, Y.; Yin, E. Safe-vln: Collision avoidance for vision-and-language navigation of autonomous robots operating in continuous environments. IEEE Robotics and Automation Letters 2024, 9, 4918–4925.
- Li, H.; Li, M.; Cheng, Z.Q.; Dong, Y.; Zhou, Y.; He, J.Y.; Dai, Q.; Mitamura, T.; Hauptmann, A.G. Human-aware vision-and-language navigation: Bridging simulation to reality with dynamic human interactions. Advances in Neural Information Processing Systems 2024, 37, 119411–119442.
- Li, Z.; Lv, Y.; Tu, Z.; Shang, D.; Qiao, H. Vision-language navigation with continual learning. arXiv preprint arXiv:2409.02561 2024.
- Jeong, S.; Kang, G.C.; Choi, S.; Kim, J.; Zhang, B.T. Continual vision-and-language navigation. arXiv preprint arXiv:2403.15049 2024.
- Su, H.T.; Wang, T.J.; Yeh, J.F.; Sun, M.; Hsu, W.H. Vln-nf: Feasibility-aware vision-and-language navigation with false-premise instructions. arXiv preprint arXiv:2604.10533 2026.
- Liu, S.; Zhang, H.; Qi, Y.; Wang, P.; Zhang, Y.; Wu, Q. Aerialvln: Vision-and-language navigation for uavs. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15384–15394.
- Lee, J.; Miyanishi, T.; Kurita, S.; Sakamoto, K.; Azuma, D.; Matsuo, Y.; Inoue, N. CityNav: Language-Goal Aerial Navigation Dataset Using Geographic Information 2024.
- Gao, Y.; Li, C.; You, Z.; Liu, J.; Li, Z.; Chen, P.; Chen, Q.; Tang, Z.; Wang, L.; Yang, P.; et al. OpenFly: A comprehensive platform for aerial vision-language navigation. arXiv preprint arXiv:2502.18041 2025.
- Cai, H.; Rao, Y.; Huang, L.; Zhong, Z.; Dong, J.; Tan, J.; Lu, W.; Zhong, R. AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions. arXiv preprint arXiv:2601.03707 2026.
- Chi, T.C.; Shen, M.; Eric, M.; Kim, S.; Hakkani-Tur, D. Just ask: An interactive learning framework for vision and language navigation. In Proceedings of the Proceedings of the AAAI conference on artificial intelligence, 2020, Vol. 34, pp. 2459–2466. [CrossRef]
- Shridhar, M.; Thomason, J.; Gordon, D.; Bisk, Y.; Han, W.; Mottaghi, R.; Zettlemoyer, L.; Fox, D. Alfred: A benchmark for interpreting grounded instructions for everyday tasks. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10740–10749.
- Banerjee, S.; Thomason, J.; Corso, J. The robotslang benchmark: Dialog-guided robot localization and navigation. In Proceedings of the Conference on Robot Learning. PMLR, 2021, pp. 1384–1393.
- Mehta, H.; Artzi, Y.; Baldridge, J.; Ie, E.; Mirowski, P. Retouchdown: Releasing touchdown on StreetLearn as a public resource for language grounding tasks in street view. In Proceedings of the Proceedings of the third international workshop on spatial language understanding, 2020, pp. 56–62.
- Zhu, F.; Liang, X.; Zhu, Y.; Yu, Q.; Chang, X.; Liang, X. Soon: Scenario oriented object navigation with graph-based exploration. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12689–12699.
- Vasudevan, A.B.; Dai, D.; Van Gool, L. Talk2nav: Long-range vision-and-language navigation with dual attention and spatial memory. International Journal of Computer Vision 2021, 129, 246–266.
- Chen, S.; Guhur, P.L.; Tapaswi, M.; Schmid, C.; Laptev, I. Learning from unlabeled 3d environments for vision-and-language navigation. In Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 638–655.
- Taioli, F.; Rosa, S.; Castellini, A.; Natale, L.; Del Bue, A.; Farinelli, A.; Cristani, M.; Wang, Y. Mind the error! detection and localization of instruction errors in vision-and-language navigation. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 12993–13000.
- Zheng, D.; Huang, S.; Zhao, L.; Zhong, Y.; Wang, L. Towards learning a generalist model for embodied navigation. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13624–13634.
- Li, J.; Padmakumar, A.; Sukhatme, G.; Bansal, M. Vln-video: Utilizing driving videos for outdoor vision-and-language navigation. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 18517–18526.
- O’Neill, A.; Rehman, A.; Maddukuri, A.; Gupta, A.; Padalkar, A.; Lee, A.; Pooley, A.; Gupta, A.; Mandlekar, A.; Jain, A.; et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903.
- Hong, H.; Qiao, Y.; Wang, S.; Liu, J.; Wu, Q. General scene adaptation for vision-and-language navigation. arXiv preprint arXiv:2501.17403 2025.
- Dong, Y.; Wu, F.; He, Q.; Cheng, Z.Q.; Li, H.; Li, M.; Cheng, Z.; Zhou, Y.; Sun, J.; Dai, Q.; et al. Ha-vln 2.0: An open benchmark and leaderboard for human-aware navigation in discrete and continuous environments with dynamic multi-human interactions. arXiv preprint arXiv:2503.14229 2025.
- Wang, L.; Xia, X.; Zhao, H.; Wang, H.; Wang, T.; Chen, Y.; Liu, C.; Chen, Q.; Pang, J. Rethinking the embodied gap in vision-and-language navigation: A holistic study of physical and visual disparities. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9455–9465.
- Zhu, S.; Mou, L.; Li, D.; Ye, B.; Huang, R.; Zhao, H. Vr-robo: A real-to-sim-to-real framework for visual robot navigation and locomotion. IEEE Robotics and Automation Letters 2025.
- Saxena, P.; Raghuvanshi, N.; Goveas, N. Uav-vln: End-to-end vision language guided navigation for uavs. In Proceedings of the 2025 European Conference on Mobile Robots (ECMR). IEEE, 2025, pp. 1–6.
- Wei, M.; Wan, C.; Yu, X.; Wang, T.; Yang, Y.; Mao, X.; Zhu, C.; Cai, W.; Wang, H.; Chen, Y.; et al. Streamvln: Streaming vision-and-language navigation via slowfast context modeling. arXiv preprint arXiv:2507.05240 2025.
- Lin, S.; Li, Z.; Zhao, X.; Zhou, G.; Wang, L.; Wei, R.; Tang, R.; Li, J.; Wang, H.; Pang, J.; et al. VLNVerse: A Benchmark for Vision-Language Navigation with Versatile, Embodied, Realistic Simulation and Evaluation. arXiv preprint arXiv:2512.19021 2025.
- Zhao, X.; Liu, C.; Ji, R.; Zhang, Z.; Zhu, M.; Song, L.; Ren, Z.; Qingliang, L.; Gao, Y.; Du, Z.; et al. CoT-VLNBench: A Benchmark for Visual Chain-of-Thought Reasoning in Vision-Language-Navigation Robots. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 36573–36581.
- Ma, C.Y.; Lu, J.; Wu, Z.; AlRegib, G.; Kira, Z.; Socher, R.; Xiong, C. Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035 2019.
- Wang, X.; Huang, Q.; Celikyilmaz, A.; Gao, J.; Shen, D.; Wang, Y.F.; Wang, W.Y.; Zhang, L. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6629–6638.
- Hao, W.; Li, C.; Li, X.; Carin, L.; Gao, J. Towards learning a generic agent for vision-and-language navigation via pre-training. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13137–13146.
- Qi, Y.; Pan, Z.; Hong, Y.; Yang, M.H.; Van Den Hengel, A.; Wu, Q. The road to know-where: An object-and-room informed sequential bert for indoor vision-language navigation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1655–1664.
- He, K.; Huang, Y.; Wu, Q.; Yang, J.; An, D.; Sima, S.; Wang, L. Landmark-rxr: Solving vision-and-language navigation with fine-grained alignment supervision. Advances in Neural Information Processing Systems 2021, 34, 652–663.
- Wang, S.; Montgomery, C.; Orbay, J.; Birodkar, V.; Faust, A.; Gur, I.; Jaques, N.; Waters, A.; Baldridge, J.; Anderson, P. Less is more: Generating grounded navigation instructions from landmarks. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15428–15438.
- Zhang, Y.; Kordjamshidi, P. Explicit object relation alignment for vision and language navigation. In Proceedings of the Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2022, pp. 322–331.
- Cui, Y.; Xie, L.; Zhang, Y.; Zhang, M.; Yan, Y.; Yin, E. Grounded entity-landmark adaptive pre-training for vision-and-language navigation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 12043–12053.
- Cui, Y.; Xie, L.; Zhao, Y.; Sun, J.; Yin, E. Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations. Information Fusion 2025, p. 104107.
- Lin, B.; Nie, Y.; Wei, Z.; Zhu, Y.; Xu, H.; Ma, S.; Liu, J.; Liang, X. Correctable landmark discovery via large models for vision-language navigation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2024, 46, 8534–8548. [CrossRef]
- Liu, Q.; Zhang, S.; Qiao, Y.; Zhu, J.; Li, X.; Guo, L.; Wang, Q.; He, X.; Wu, Q.; Liu, J. GroundingMate: Aiding Object Grounding for Goal-Oriented Vision-and-Language Navigation. In Proceedings of the 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 1775–1784.
- Raychaudhuri, S.; Ta, D.; Ashton, K.; Chang, A.X.; Wang, J.; Bucher, B. Nl-slam for oc-vln: Natural language grounded slam for object-centric vln. arXiv preprint arXiv:2411.07848 2024.
- Zhao, G.; Li, G.; Chen, W.; Yu, Y. Over-nav: Elevating iterative vision-and-language navigation with open-vocabulary detection and structured representation. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16296–16306.
- Wen, S.; Zhang, Z.; Sun, Y.; Wang, Z. Ovl-map: An online visual language map approach for vision-and-language navigation in continuous environments. IEEE Robotics and Automation Letters 2025.
- Li, D.; Yang, Z.; Qi, G.; Pang, S.; Shang, G.; Ma, Q.; Yang, Z. OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping. In Proceedings of the Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 7444–7452.
- Long, Y.; Cai, W.; Wang, H.; Zhan, G.; Dong, H. Instructnav: Zero-shot system for generic instruction navigation in unexplored environment. arXiv preprint arXiv:2406.04882 2024.
- Zhang, Y.; Yu, H.; Xiao, J.; Feroskhan, M. Grounded vision-language navigation for uavs with open-vocabulary goal understanding. arXiv preprint arXiv:2506.10756 2025.
- Chen, K.; Chen, J.K.; Chuang, J.; Vázquez, M.; Savarese, S. Topological planning with transformers for vision-and-language navigation. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11276–11286.
- Li, H.; Dong, X.; Jiang, H.; Zhou, Y.; Ma, X. CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval. arXiv preprint arXiv:2603.07997 2026.
- Liu, J.; Zhang, Z.; Li, X.; Wang, B.; Hu, Y.; Yin, B. TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation. arXiv preprint arXiv:2603.02972 2026.
- Georgakis, G.; Schmeckpeper, K.; Wanchoo, K.; Dan, S.; Miltsakaki, E.; Roth, D.; Daniilidis, K. Cross-modal map learning for vision and language navigation. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 15460–15470.
- Chen, P.; Ji, D.; Lin, K.; Zeng, R.; Li, T.; Tan, M.; Gan, C. Weakly-supervised multi-granularity map learning for vision-and-language navigation. Advances in Neural Information Processing Systems 2022, 35, 38149–38161.
- Liu, R.; Wang, X.; Wang, W.; Yang, Y. Bird’s-eye-view scene graph for vision-language navigation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 10968–10980.
- Wang, Z.; Li, X.; Yang, J.; Liu, Y.; Jiang, S. Gridmm: Grid memory map for vision-and-language navigation. In Proceedings of the Proceedings of the IEEE/CVF International conference on computer vision, 2023, pp. 15625–15636.
- Zhang, L.; Hao, X.; Xu, Q.; Zhang, Q.; Zhang, X.; Wang, P.; Zhang, J.; Wang, Z.; Zhang, S.; Xu, R. Mapnav: A novel memory representation via annotated semantic maps for vlm-based vision-and-language navigation. In Proceedings of the Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 13032–13056.
- Liu, R.; Wang, W.; Yang, Y. Volumetric environment representation for vision-language navigation. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 16317–16328.
- Wang, Z.; Li, X.; Yang, J.; Liu, Y.; Hu, J.; Jiang, M.; Jiang, S. Lookahead exploration with neural radiance representation for continuous vision-language navigation. In Proceedings of the Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 13753–13762.
- Dai, G.; Zhao, J.; Chen, Y.; Qin, Y.; Zhao, H.; Xie, G.; Yao, Y.; Shu, X.; Li, X. Unitedvln: Generalizable gaussian splatting for continuous vision-language navigation. arXiv preprint arXiv:2411.16053 2024.
- Gao, J.; Liu, R.; Wang, W. 3d gaussian map with open-set semantic grouping for vision-language navigation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9252–9262.
- Miao, B.; Wei, R.; Ge, Z.; Gao, S.; Zhu, J.; Wang, R.; Tang, S.; Xiao, J.; Tang, R.; Li, J.; et al. Towards Physically Executable 3D Gaussian for Embodied Navigation. arXiv preprint arXiv:2510.21307 2025.
- Gao, J.; Liu, R.; Xu, Y.; Cao, T.; Zhang, Y.; Zhang, Z.; Peng, S.; Yang, Y.; Wang, W. Uncertainty-aware gaussian map for vision-language navigation. In Proceedings of the The Fourteenth International Conference on Learning Representations, 2026.
- Zhang, J.; Wang, K.; Xu, R.; Zhou, G.; Hong, Y.; Fang, X.; Wu, Q.; Zhang, Z.; Wang, H. Navid: Video-based vlm plans the next step for vision-and-language navigation. arXiv preprint arXiv:2402.15852 2024.
- Zhang, J.; Wang, K.; Wang, S.; Li, M.; Liu, H.; Wei, S.; Wang, Z.; Zhang, Z.; Wang, H. Uni-navid: A video-based vision-language-action model for unifying embodied navigation tasks. arXiv preprint arXiv:2412.06224 2024.
- Wang, S.; Wang, Y.; Fan, Z.; Wang, Y.; Chen, M.; Wang, K.; Su, Z.; Li, W.; Cai, X.; Jin, Y.; et al. Monodream: Monocular vision-language navigation with panoramic dreaming. In Proceedings of the Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 10074–10082.
- Zheng, D.; Huang, S.; Li, Y.; Wang, L. Efficient-VLN: A Training-Efficient Vision-Language Navigation Model. arXiv preprint arXiv:2512.10310 2025.
- Zhang, J.; Li, A.; Qi, Y.; Li, M.; Liu, J.; Wang, S.; Liu, H.; Zhou, G.; Wu, Y.; Li, X.; et al. Embodied navigation foundation model. arXiv preprint arXiv:2509.12129 2025.
- Zheng, Z.; Mao, Z.; Zhou, X.; Chen, J.; Li, M.; Sun, X.; Zou, H.; Zhang, Z.; Liu, X.; Cao, D.; et al. VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness. arXiv preprint arXiv:2603.07080 2026.
- Chen, C.; Jain, U.; Schissler, C.; Gari, S.V.A.; Al-Halah, Z.; Ithapu, V.K.; Robinson, P.; Grauman, K. Soundspaces: Audio-visual navigation in 3d environments. In Proceedings of the European conference on computer vision. Springer, 2020, pp. 17–36.
- Paul, S.; Roy-Chowdhury, A.; Cherian, A. Avlen: Audio-visual-language embodied navigation in 3d environments. Advances in Neural Information Processing Systems 2022, 35, 6236–6249.
- Liu, X.; Paul, S.; Chatterjee, M.; Cherian, A. Caven: An embodied conversational agent for efficient audio-visual navigation in noisy environments. In Proceedings of the Proceedings of the AAAI conference on artificial intelligence, 2024, Vol. 38, pp. 3765–3773. [CrossRef]
- Yang, Z.; Liu, J.; Chen, P.; Cherian, A.; Marks, T.K.; Le Roux, J.; Gan, C. Rila: Reflective and imaginative language agent for zero-shot semantic audio-visual navigation. In Proceedings of the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 16251–16261.
- Zhu, Y.; Weng, Y.; Zhu, F.; Liang, X.; Ye, Q.; Lu, Y.; Jiao, J. Self-motivated communication agent for real-world vision-dialog navigation. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1594–1603.
- Han, L.; Min, H.; Hwangbo, G.; Choi, J.; Seo, P.H. DialNav: Multi-turn Dialog Navigation with a Remote Guide. In Proceedings of the Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 8514–8523.
- Zhou, S.; Wu, Y.; Wang, T.; Li, X.; Chen, G.; Liu, L.; Bai, C.; Li, X. DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation. arXiv preprint arXiv:2604.12486 2026.
- Yu, B.; Kasaei, H.; Cao, M. Co-navgpt: Multi-robot cooperative visual semantic navigation using large language models. arXiv preprint arXiv:2310.07937 2023.
- Wu, S.; Fu, X.; Wu, F.; Zha, Z.J. Cross-modal semantic alignment pre-training for vision-and-language navigation. In Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 4233–4241.
- Du, M.; Wu, B.; Zhang, J.; Fan, Z.; Li, Z.; Luo, R.; Huang, X.J.; Wei, Z. Delan: Dual-level alignment for vision-and-language navigation by cross-modal contrastive learning. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 2024, pp. 4605–4616.
- Hwang, M.; Jeong, J.; Kim, M.; Oh, Y.; Oh, S. Meta-explore: Exploratory hierarchical vision-and-language navigation using scene object spectrum grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6683–6693.
- Li, S.; Wang, Z.; Zhou, G.; Li, J.; Zeng, X.; Wang, L.; Qiao, Y.; Wu, Q.; Bansal, M.; Wang, Y. Learning Goal-Oriented Language-Guided Navigation with Self-Improving Demonstrations at Scale. arXiv preprint arXiv:2509.24910 2025.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International conference on machine learning. PMLR, 2021, pp. 8748–8763.
- Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting twenty-thousand classes using image-level supervision. In Proceedings of the European conference on computer vision. Springer, 2022, pp. 350–368.
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026.
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Proceedings of the European conference on computer vision. Springer, 2024, pp. 38–55.
- Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Visual language maps for robot navigation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 10608–10615.
- Jatavallabhula, K.M.; Kuwajerwala, A.; Gu, Q.; Omama, M.; Chen, T.; Maalouf, A.; Li, S.; Iyer, G.; Saryazdi, S.; Keetha, N.; et al. Conceptfusion: Open-set multimodal 3d mapping. arXiv preprint arXiv:2302.07241 2023.
- Peng, S.; Genova, K.; Jiang, C.; Tagliasacchi, A.; Pollefeys, M.; Funkhouser, T.; et al. Openscene: 3d scene understanding with open vocabularies. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 815–824.
- Lu, S.; Chang, H.; Jing, E.P.; Boularias, A.; Bekris, K. Ovir-3d: Open-vocabulary 3d instance retrieval without training on 3d data. In Proceedings of the Conference on Robot Learning. PMLR, 2023, pp. 1610–1620.
- Werby, A.; Huang, C.; Büchner, M.; Valada, A.; Burgard, W. Hierarchical open-vocabulary 3d scene graphs for language-grounded robot navigation. In Proceedings of the First Workshop on Vision-Language Models for Navigation and Manipulation at ICRA 2024, 2024.
- Wang, H.; Wang, W.; Liang, W.; Xiong, C.; Shen, J. Structured scene memory for vision-language navigation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2021, pp. 8455–8464.
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 2023.
- He, K.; Jing, Y.; Huang, Y.; Lu, Z.; An, D.; Wang, L. Memory-adaptive vision-and-language navigation. Pattern Recognition 2024, 153, 110511.
- Zhang, S.; Qiao, Y.; Wang, Q.; Yan, Z.; Wu, Q.; Wei, Z.; Liu, J. Cosmo: Combination of selective memorization for low-cost vision-and-language navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 5511–5522.
- An, D.; Qi, Y.; Li, Y.; Huang, Y.; Wang, L.; Tan, T.; Shao, J. Bevbert: Multimodal map pre-training for language-guided navigation. arXiv preprint arXiv:2212.04385 2022.
- Zhang, X.; Xu, Y.; Li, J.; Liu, R.; Hu, Z. Agent journey beyond rgb: Hierarchical semantic-spatial representation enrichment for vision-and-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 18791–18799.
- Wang, Z.; Li, M.; Wu, M.; Moens, M.F.; Tuytelaars, T. Instruction-guided path planning with 3D semantic maps for vision-language navigation. Neurocomputing 2025, 625, 129457.
- Zeng, S.; Qi, D.; Chang, X.; Xiong, F.; Xie, S.; Wu, X.; Liang, S.; Xu, M.; Wei, X. Janusvln: Decoupling semantics and spatiality with dual implicit memory for vision-language navigation. arXiv preprint arXiv:2509.22548 2025.
- Qi, Z.; Zhang, Z.; Yu, Y.; Wang, J.; Zhao, H. Vln-r1: Vision-language navigation via reinforcement fine-tuning. arXiv preprint arXiv:2506.17221 2025.
- Lu, Y.; Sun, S.; Liu, N.; Jiang, B.; Zhang, Y.; Chen, J.; Du, C. STEP-Nav: Spatial-Temporal Efficient Visual Token Pruning for Vision-and-Language Navigation with Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 24097–24105.
- Huang, C.; Mees, O.; Zeng, A.; Burgard, W. Audio visual language maps for robot navigation. In Proceedings of the International Symposium on Experimental Robotics. Springer, 2023, pp. 105–117.
- Shi, Z.; Zhang, L.; Li, L.; Shen, Y. Towards audio-visual navigation in noisy environments: a large-scale benchmark dataset and an architecture considering multiple sound-sources. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 14673–14680.
- Fan, J.; Chen, P.; Li, C.; Du, Q.; Chen, J.; Tan, M. NaVLA2: A Vision-Language-Audio-Action Model for Multimodal Instruction Navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 18234–18242.
- Fan, Y.; Chen, W.; Jiang, T.; Zhou, C.; Zhang, Y.; Wang, X. Aerial vision-and-dialog navigation. In Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 3043–3061.
- Su, Y.; An, D.; Chen, K.; Yu, W.; Ning, B.; Ling, Y.; Huang, Y.; Wang, L. Learning fine-grained alignment for aerial vision-dialog navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 7060–7068.
- Zhu, W.; Hu, H.; Chen, J.; Deng, Z.; Jain, V.; Ie, E.; Sha, F. Babywalk: Going farther in vision-and-language navigation by taking baby steps. In Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 2539–2556.
- Hong, Y.; Rodriguez, C.; Wu, Q.; Gould, S. Sub-instruction aware vision-and-language navigation. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), 2020, pp. 3360–3376.
- Zhang, Y.; Kordjamshidi, P. Vln-trans: Translator for the vision and language navigation agent. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 13219–13233.
- Wang, X.; Wang, W.; Shao, J.; Yang, Y. Lana: A language-capable navigator for instruction following and generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 19048–19058.
- He, Z.; Wang, L.; Li, S.; Yan, Q.; Liu, C.; Chen, Q. A multilevel attention network with sub-instructions for continuous vision-and-language navigation. Applied Intelligence 2025, 55, 657.
- Huang, B.; Zheng, Y.; Lan, C.; Sui, D.; Zhao, X.; Zhang, X.; Xiao, M.; Yu, D. Action-Aware Visual-Textual Alignment for Long-Instruction Vision-and-Language Navigation. ACM Transactions on Multimedia Computing, Communications and Applications 2025, 21, 1–22.
- Wang, S.; Wang, Y.; Lian, G.; Wang, Y.; Chen, M.; Wang, K.; Zhang, B.; Su, Z.; Zhou, Y.; Li, W.; et al. Progress-Think: Semantic Progress Reasoning for Vision-Language Navigation. arXiv preprint arXiv:2511.17097 2025.
- Chen, K.; An, D.; Huang, Y.; Xu, R.; Su, Y.; Ling, Y.; Reid, I.; Wang, L. Constraint-aware zero-shot vision-language navigation in continuous environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025.
- Yin, H.; Wei, H.; Xu, X.; Guo, W.; Zhou, J.; Lu, J. GC-VLN: Instruction as graph constraints for training-free vision-and-language navigation. arXiv preprint arXiv:2509.10454 2025.
- Gao, Y.; Wang, Z.; Han, P.; Jing, L.; Wang, D.; Zhao, B. Exploring spatial representation to enhance LLM reasoning in aerial vision-language navigation. arXiv preprint arXiv:2410.08500 2024.
- Zhou, L.; Xue, R.; Luo, X. Structured Instruction Parsing and Scene Alignment For UAV Vision-Language Navigation. In Proceedings of the 2025 IEEE International Conference on Image Processing (ICIP). IEEE, 2025, pp. 2600–2605.
- Zhang, W.; Gao, C.; Yu, S.; Peng, R.; Zhao, B.; Zhang, Q.; Cui, J.; Chen, X.; Li, Y. Citynavagent: Aerial vision-and-language navigation with hierarchical semantic planning and global memory. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 31292–31309.
- Ma, T.; Zhang, Y.; Wang, Z.; Kordjamshidi, P. Breaking down and building up: Mixture of skill-based vision-and-language navigation agents. arXiv preprint arXiv:2508.07642 2025.
- Chen, B.; Xu, Z.; Kirmani, S.; Ichter, B.; Sadigh, D.; Guibas, L.; Xia, F. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14455–14465.
- Bai, Q.; Chen, Z.; Luo, L.; Du, H.; Lei, Y.; Jiao, Z. Endowing embodied agents with spatial reasoning capabilities for vision-and-language navigation. arXiv preprint arXiv:2504.08806 2025.
- Du, Y.; Fu, T.; Chen, Z.; Li, B.; Su, S.; Zhao, Z.; Wang, C. Vl-nav: real-time vision-language navigation with spatial reasoning. arXiv preprint arXiv:2502.00931 2025.
- Liu, F.; Li, G.; Zou, L.; Chen, Y.; Cheng, P. DroneNav: Unified text-visual representation and structured spatial reasoning for robust UAV vision-and-language navigation. Neurocomputing 2026, p. 133492.
- Yue, L.; Fan, Y.; Lian, S.; Zhao, Y.; Yu, J.; Xie, L.; Zhang, F. Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration. arXiv preprint arXiv:2601.12766 2026.
- Qiao, Y.; Lyu, W.; Wang, H.; Wang, Z.; Li, Z.; Zhang, Y.; Tan, M.; Wu, Q. Open-nav: Exploring zero-shot vision-and-language navigation in continuous environment with open-source llms. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6710–6717.
- Jiang, Z.; Wang, X. SpatialGPT: Zero-Shot Vision-and-Language Navigation via Spatial CoT over Structured Spatial Memory. In Proceedings of the 33rd ACM International Conference on Advances in Geographic Information Systems, 2025, pp. 423–435.
- Liu, C.; Zhou, Z.; Zhang, J.; Zhang, M.; Huang, S.; Duan, H. Msnav: Zero-shot vision-and-language navigation with dynamic memory and llm spatial reasoning. In Proceedings of the ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2026, pp. 20112–20116.
- Zhou, X.; Xiao, T.; Liu, L.; Wang, Y.; Chen, M.; Meng, X.; Wang, X.; Feng, W.; Sui, W.; Su, Z. FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph. arXiv preprint arXiv:2509.13733 2025.
- Zhang, J.; Li, Z.; Wang, S.; Shi, X.; Wei, Z.; Wu, Q. SpatialNav: Leveraging Spatial Scene Graphs for Zero-Shot Vision-and-Language Navigation. arXiv preprint arXiv:2601.06806 2026.
- Li, X.; Wang, Z.; Yang, J.; Wang, Y.; Jiang, S. Kerm: Knowledge enhanced reasoning for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2583–2592.
- Zhou, G.; Hong, Y.; Wu, Q. Navgpt: Explicit reasoning in vision-and-language navigation with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024, Vol. 38, pp. 7641–7649.
- Lyu, K.; Wu, K.; Li, P.; Hu, X.; Si, Q.; Miao, C.; Yang, N.; Wang, Z.; Xiao, L.; Hu, L.; et al. HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System. arXiv preprint arXiv:2603.14807 2026.
- Lin, B.; Nie, Y.; Wei, Z.; Chen, J.; Ma, S.; Han, J.; Xu, H.; Chang, X.; Liang, X. Navcot: Boosting llm-based vision-and-language navigation via learning disentangled reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2025.
- Wang, S.; Wang, Y.; Li, W.; Cai, X.; Wang, Y.; Chen, M.; Wang, K.; Su, Z.; Li, D.; Fan, Z. Aux-think: Exploring reasoning strategies for data-efficient vision-language navigation. arXiv preprint arXiv:2505.11886 2025.
- Zuo, J.; Mu, L.; Jiang, F.; Ma, C.; Xu, M.; Qi, Y. FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation. arXiv preprint arXiv:2601.13976 2026.
- Ding, X.; Wei, J.; Yang, Y.; Jiang, S.; Zhang, Q.; Wu, H.; Jia, F.; Mi, L.; Yan, Y.; Wang, W.; et al. AdaNav: Adaptive Reasoning with Uncertainty for Vision-Language Navigation. arXiv preprint arXiv:2509.24387 2025.
- Xue, W.; Li, M.; Wu, X.; Tang, J.; Yang, D.; Zhang, L. ProFocus: Proactive Perception and Focused Reasoning in Vision-and-Language Navigation. arXiv preprint arXiv:2603.05530 2026.
- Li, X.; Lyu, F.; Wu, H.; Liu, M.; Liu, J.N.; Liu, G. Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning. arXiv preprint arXiv:2604.02318 2026.
- Xin, Z.; Li, W.; Jiang, Y.; Wang, B.; Cong, R.; Qin, J.; Huang, S. DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation. arXiv preprint arXiv:2603.13133 2026.
- Fang, X.; Fang, W.; Wang, C. Hierarchical semantic-augmented navigation: Optimal transport and graph-driven reasoning for vision-language navigation. In Proceedings of the Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- Lin, B.; Nie, Y.; Zai, K.L.; Wei, Z.; Han, M.; Xu, R.; Niu, M.; Han, J.; Zhang, H.; Lin, L.; et al. EvolveNav: Empowering LLM-Based Vision-Language Navigation via Self-Improving Embodied Reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2026.
- Wang, H.; Liang, W.; Van Gool, L.; Wang, W. Dreamwalker: Mental planning for continuous vision-language navigation. In Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 10873–10883.
- Bar, A.; Zhou, G.; Tran, D.; Darrell, T.; LeCun, Y. Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 15791–15801.
- Pan, Y.; Xu, Y.; Liu, Z.; Wang, H. Planning from imagination: Episodic simulation and episodic memory for vision-and-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 6345–6353.
- Perincherry, A.; Krantz, J.; Lee, S. Do visual imaginations improve vision-and-language navigation agents? In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 3846–3855.
- Huang, Y.; Wu, M.; Li, R.; Tu, Z. Vista: Generative visual imagination for vision-and-language navigation. arXiv preprint arXiv:2505.07868 2025.
- Lian, G.; Wang, S.; Wang, Y.; Wang, Y.; Chen, M.; Wang, K.; Zhang, B.; Su, Z.; Li, D.; Fan, Z. MapDream: Task-Driven Map Learning for Vision-Language Navigation. arXiv preprint arXiv:2602.00222 2026.
- Dai, G.; Wang, S.; Zhao, H.; Zhu, B.; Sun, Q.; Shu, X. ThinkMatter: Panoramic-Aware Instructional Semantics for Monocular Vision-and-Language Navigation. IEEE Transactions on Image Processing 2026.
- Liu, F.; Xie, S.; Luo, M.; Chu, Z.; Hu, J.; Wu, X.; Xu, M. NavForesee: A Unified Vision-Language World Model for Hierarchical Planning and Dual-Horizon Navigation Prediction. arXiv preprint arXiv:2512.01550 2025.
- Hu, J.; Chen, J.; Bai, H.; Luo, M.; Xie, S.; Chen, Z.; Liu, F.; Chu, Z.; Xue, X.; Ren, B.; et al. AstraNav-World: World Model for Foresight Control and Consistency. arXiv preprint arXiv:2512.21714 2025.
- Huang, C.; Tang, L.; Zhan, Z.; Yu, L.; Zeng, R.; Liu, Z.; Wang, Z.; Li, J. UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 18315–18323.
- Fan, Z.; Lyu, W.; Song, W.; Zhao, L.; Yang, Y.; Wang, X.; He, J.; Huang, L.; Liu, H.; Sun, B.; et al. PROSPECT: Unified Streaming Vision-Language Navigation via Semantic–Spatial Fusion and Latent Predictive Representation. arXiv preprint arXiv:2603.03739 2026.
- Chen, H.; Jiang, S.; Su, T.; Gao, C.; Chen, X.; Li, Y.; Chen, Z. WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models. arXiv preprint arXiv:2604.07957 2026.
- Wu, K.; Li, P.; Lyu, K.; Zhao, L.; He, Q.; Wang, J.; Liu, J. Dual-Anchoring: Addressing State Drift in Vision-Language Navigation. arXiv preprint arXiv:2604.17473 2026.
- Liu, R.; Wu, S.; Lin, D.; Zhang, W. CVLN-Think: Causal Inference with Counterfactual Style Adaptation for Continuous Vision-and-Language Navigation. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 15299–15305.
- Ye, S.; Ge, Y.; Zheng, K.; Gao, S.; Yu, S.; Kurian, G.; Indupuru, S.; Tan, Y.L.; Zhu, C.; Xiang, J.; et al. World Action Models are Zero-shot Policies. arXiv preprint arXiv:2602.15922 2026.
- Yuan, T.; Dong, Z.; Liu, Y.; Zhao, H. Fast-WAM: Do World Action Models Need Test-time Future Imagination? arXiv preprint arXiv:2603.16666 2026.
- Ye, A.; Wang, B.; Ni, C.; Huang, G.; Zhao, G.; Li, H.; Li, H.; Li, J.; Lv, J.; Liu, J.; et al. GigaWorld-Policy: An Efficient Action-Centered World–Action Model. arXiv preprint arXiv:2603.17240 2026.
- Wang, L.; Zheng, Y.; Chen, Q.; Li, S.; Zhang, Y.; Xing, Z.; Zhang, Q.; Li, X.; Qian, D.; Yang, P.; et al. Latent-WAM: Latent World Action Modeling for End-to-End Autonomous Driving. arXiv preprint arXiv:2603.24581 2026.
- Zhou, Y.; Wang, X.; Shao, H.; Wang, L.; Zhao, G.; Shao, J.; Zhu, J.; Yu, T.; Zhu, Z.; Huang, G.; et al. DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning. arXiv preprint arXiv:2604.01765 2026.
- Yang, H.; Long, Y.; Yu, Z.; Yang, Z.; Wang, M.; Xu, J.; Wang, Y.; Yu, Z.; Cai, W.; Kang, L.; et al. NavSpace: How Navigation Agents Follow Spatial Intelligence Instructions. arXiv preprint arXiv:2510.08173 2025.
- Zhao, X.; Liu, C.; Ji, R.; Zhang, Z.; Zhu, M.; Song, L.; Ren, Z.; Qingliang, L.; Gao, Y.; Du, Z.; et al. CoT-VLNBench: A Benchmark for Visual Chain-of-Thought Reasoning in Vision-Language-Navigation Robots. Proceedings of the AAAI Conference on Artificial Intelligence 2026, 40, 36573–36581.
- Guo, W.; Xu, X.; Liu, Y.; Li, X.; Yin, H.; Chen, H.; Zheng, W.; Feng, J.; Zhou, J.; Lu, J. AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 4065–4075.
- Gao, P.; Wang, P.; Wang, F.; Fujita, H.; Aljuaid, H.; Shang, J.L. DeepVLN: Vision-and-Language Navigation via Deep Reasoning and Collaborative Mechanisms Based on Large Language Models. IEEE Journal of Selected Topics in Signal Processing 2026, 20, 47–62.
- Zhang, Z.; Li, Z.; Rahmati, B.; Yang, R.H.; Ma, Y.; Rasouli, A.; Pakdamansavoji, S.; Wu, Y.; Zhang, L.; Cao, T.; et al. Do World Action Models Generalize Better than VLAs? A Robustness Study. arXiv preprint arXiv:2603.22078 2026.
- Li, X.; Li, C.; Xia, Q.; Bisk, Y.; Celikyilmaz, A.; Gao, J.; Smith, N.A.; Choi, Y. Robust navigation with language pretraining and stochastic sampling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 1494–1499.
- Ma, C.Y.; Wu, Z.; AlRegib, G.; Xiong, C.; Kira, Z. The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, 2019, pp. 6732–6740.
- Ke, L.; Li, X.; Bisk, Y.; Holtzman, A.; Gan, Z.; Liu, J.; Gao, J.; Choi, Y.; Srinivasa, S. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6741–6749.
- Zhu, F.; Zhu, Y.; Chang, X.; Liang, X. Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10012–10022.
- Liu, R.; Wang, W.; Yang, Y. Vision-language navigation with energy-based policy. Advances in Neural Information Processing Systems 2024, 37, 108208–108230.
- Cheng, A.C.; Ji, Y.; Yang, Z.; Gongye, Z.; Zou, X.; Kautz, J.; Bıyık, E.; Yin, H.; Liu, S.; Wang, X. Navila: Legged robot vision-language-action model for navigation. arXiv preprint arXiv:2412.04453 2024.
- Zhu, W.; Zhang, Z.; Wang, X.; Pan, H.; Wang, T.; Geng, T.; Xu, R.; Zheng, F. NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation. arXiv preprint arXiv:2601.18188 2026.
- Li, P.; Wu, K.; Xu, S.; Li, F.; Li, H.; Zhao, L.; Lyu, K.; Chen, L.; Yang, Z.X.; Zheng, N. SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation. arXiv preprint arXiv:2604.27620 2026.
- Hao, H.; Chen, L.; Han, M.; Li, C.; An, D.; Yang, Y.; Li, Z.; Chang, X. LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning. arXiv preprint arXiv:2603.29165 2026.
- Castro, M.G.; Rajagopal, S.; Gorbatov, D.; Schmittle, M.; Baijal, R.; Zhang, O.; Scalise, R.; Talia, S.; Romig, E.; de Melo, C.; et al. VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation. arXiv preprint arXiv:2510.20818 2025.
- Cai, W.; Peng, J.; Yang, Y.; Zhang, Y.; Wei, M.; Wang, H.; Chen, Y.; Wang, T.; Pang, J. Navdp: Learning sim-to-real navigation diffusion policy with privileged information guidance. arXiv preprint arXiv:2505.08712 2025.
- Sun, X.; Si, W.; Ni, W.; Li, Y.; Wu, D.; Xie, F.; Guan, R.; Xu, H.Y.; Ding, H.; Wu, Y.; et al. AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild. arXiv preprint arXiv:2602.09657 2026.
- Xu, P.; Deng, Z.; Deng, J.; Gu, Z.; Wan, S. AerialVLA: A Vision-Language-Action Model for UAV Navigation via Minimalist End-to-End Control. arXiv preprint arXiv:2603.14363 2026.
- Wang, X.; Xiong, W.; Wang, H.; Wang, W.Y. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 37–53.
- Wang, X.; Huang, Q.; Celikyilmaz, A.; Gao, J.; Shen, D.; Wang, Y.F.; Wang, W.Y.; Zhang, L. Vision-language navigation policy learning and adaptation. IEEE transactions on pattern analysis and machine intelligence 2020, 43, 4205–4216.
- Chen, J.; Gao, C.; Meng, E.; Zhang, Q.; Liu, S. Reinforced structured state-evolution for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15450–15459.
- Wang, J.; Wang, T.; Xu, L.; He, Z.; Sun, C. Discovering intrinsic subgoals for vision-and-language navigation via hierarchical reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems 2024, 36, 6516–6528.
- Wang, J.; Wang, T.; Cai, W.; Xu, L.; Sun, C. Boosting efficient reinforcement learning for vision-and-language navigation with open-sourced llm. IEEE Robotics and Automation Letters 2024, 10, 612–619.
- Liu, R.; Kong, P.; Wu, S.; Zhang, W. RewardVLN: An Improved Agent Navigation Based On Visual-Instruction Alignment. In Proceedings of the 2024 International Conference on Advanced Robotics and Mechatronics (ICARM). IEEE, 2024, pp. 126–133.
- Wang, Y.; Sun, Z.; Zhang, J.; Xian, Z.; Biyik, E.; Held, D.; Erickson, Z. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. arXiv preprint arXiv:2402.03681 2024.
- Zhang, Z.; Zhu, W.; Pan, H.; Wang, X.; Xu, R.; Sun, X.; Zheng, F. Activevln: Towards active exploration via multi-turn rl in vision-and-language navigation. arXiv preprint arXiv:2509.12618 2025.
- Ye, S.; Mao, S.; Cui, Y.; Yu, X.; Zhai, S.; Chen, W.; Zhou, S.; Xiong, R.; Wang, Y. ETP-R1: Evolving Topological Planning with Reinforcement Fine-tuning for Vision-Language Navigation in Continuous Environments. arXiv preprint arXiv:2512.20940 2025.
- Li, J.; Wan, C.; Dong, S.; Ding, C.; Wang, Q.; Ma, Z.; Gong, Y. Trajectory-Diversity-Driven Robust Vision-and-Language Navigation. arXiv preprint arXiv:2603.15370 2026.
- Wang, Z.; Lin, Z.; Yang, Y.; Fu, H.; Ye, D. SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization. arXiv preprint arXiv:2512.02631 2025.
- Huang, T.; Li, D.; Yang, R.; Zhang, Z.; Yang, Z.; Tang, H. Mobilevla-r1: Reinforcing vision-language-action for mobile robots. arXiv preprint arXiv:2511.17889 2025.
- Li, H.; Liu, R.; Fan, H.; Yang, Y. Let’s Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments. arXiv preprint arXiv:2603.09740 2026.
- Liu, Q.; Huang, T.; Zhang, Z.; Tang, H. Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884 2025.
- Wang, S.; Luo, Y.; Chen, X.; Luo, A.; Li, D.; Liu, C.; Chen, S.; Zhang, Y.; Yu, J. VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory. arXiv preprint arXiv:2601.08665 2026.
- Ross, S.; Bagnell, D. Efficient reductions for imitation learning. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 2010, pp. 661–668.
- Shi, H.; Deng, X.; Li, Z.; Chen, G.; Wang, Y.; Nie, L. DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation. arXiv preprint arXiv:2508.09444 2025.
- He, G.; Liu, Z.; Xu, K.; Xu, L.; Qiao, T.; Yu, W.; Wu, C.; Xie, W. Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation. arXiv preprint arXiv:2602.06356 2026.
- Liang, X.; Ma, L.; Guo, S.; Han, J.; Xu, H.; Ma, S.; Liang, X. Cornav: Autonomous agent with self-corrected planning for zero-shot vision-and-language navigation. In Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 12538–12559.
- Long, Y.; Li, X.; Cai, W.; Dong, H. Discuss before moving: Visual language navigation via multi-expert discussions. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 17380–17387.
- Xin, Z.; Li, W.; Jiang, Y.; Huang, Z.; Wang, B.; Li, P.; Zhu, J.; Qin, J.; Huang, S. AgentVLN: Towards Agentic Vision-and-Language Navigation. arXiv preprint arXiv:2603.17670 2026.
- Yu, Z.; Long, Y.; Yang, Z.; Zeng, C.; Fan, H.; Zhang, J.; Dong, H. Correctnav: Self-correction flywheel empowers vision-language-action navigation model. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 18737–18745.
- Dong, X.; Zhao, H.; Gao, J.; Li, H.; Ma, X.; Zhou, Y.; Chen, F.; Liu, J. Se-vln: A self-evolving vision-language navigation framework based on multimodal large language models. arXiv preprint arXiv:2507.13152 2025.
- Zhong, Y.; Zhang, Z.; Zhang, R.; Huang, L.; Gao, H.; Wang, S.; Li, D.; Han, R.; Guo, J.; Peng, S.; et al. Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026, Vol. 40, pp. 18845–18854.
- Huang, J.; Huang, J.; Yang, H.; Li, H.; Wang, Y. AERR-Nav: Adaptive Exploration-Recovery-Reminiscing Strategy for Zero-Shot Object Navigation. arXiv preprint arXiv:2603.17712 2026.
- Li, J.; Tan, H.; Bansal, M. Envedit: Environment editing for vision-and-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15407–15417.
- He, K.; Si, C.; Lu, Z.; Huang, Y.; Wang, L.; Wang, X. Frequency-enhanced data augmentation for vision-and-language navigation. Advances in neural information processing systems 2023, 36, 4351–4364.
- Li, J.; Bansal, M. Panogen: Text-conditioned panoramic environment generation for vision-and-language navigation. Advances in neural information processing systems 2023, 36, 21878–21894.
- Wang, S.; Zhou, D.; Xie, L.; Xu, C.; Yan, Y.; Yin, E. Panogen++: Domain-adapted text-guided panoramic environment generation for vision-and-language navigation. Neural Networks 2025, 187, 107320.
- Zhong, Y.; Zhang, R.; Zhang, Z.; Wang, S.; Fang, C.; Zhang, X.; Guo, J.; Peng, S.; Huang, D.; Yan, Y.; et al. World-Consistent Data Generation for Vision-and-Language Navigation. arXiv preprint arXiv:2412.06413 2024.
- Kamath, A.; Anderson, P.; Wang, S.; Koh, J.Y.; Ku, A.; Waters, A.; Yang, Y.; Baldridge, J.; Parekh, Z. A new path: Scaling vision-and-language navigation with synthetic instructions and imitation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 10813–10823.
- Wang, Z.; Li, J.; Hong, Y.; Li, S.; Li, K.; Yu, S.; Wang, Y.; Qiao, Y.; Wang, Y.; Bansal, M.; et al. Bootstrapping language-guided navigation learning with self-refining data flywheel. arXiv preprint arXiv:2412.08467 2024.
- Wang, Z.; Zhu, Y.; Lee, G.H.; Fan, Y. Navrag: Generating user demand instructions for embodied navigation through retrieval-augmented llm. In Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 8430–8440.
- Zheng, Y.; Zhang, L.; Sun, Y.; Shen, Y.; Zhao, S. CaneSpeaker: An LLM-Assisted Speaker for Generating Human-Like Navigation Instructions. ACM Transactions on Multimedia Computing, Communications and Applications 2026, 22, 1–26.
- Han, M.; Ma, L.; Zhumakhanova, K.; Radionova, E.; Zhang, J.; Chang, X.; Liang, X.; Laptev, I. Roomtour3d: Geometry-aware video-instruction tuning for embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 27586–27596.
- Wei, M.; Wan, C.; Peng, J.; Yu, X.; Yang, Y.; Feng, D.; Cai, W.; Zhu, C.; Wang, T.; Pang, J.; et al. Ground slow, move fast: A dual-system foundation model for generalizable vision-and-language navigation. arXiv preprint arXiv:2512.08186 2025.
- Zhang, W.; Ma, C.; Wu, Q.; Yang, X. Language-guided navigation via cross-modal grounding and alternate adversarial learning. IEEE Transactions on Circuits and Systems for Video Technology 2020, 31, 3469–3481.
- Wang, X.E.; Jain, V.; Ie, E.; Wang, W.Y.; Kozareva, Z.; Ravi, S. Environment-agnostic multitask learning for natural language grounded navigation. In Proceedings of the European conference on computer vision. Springer, 2020, pp. 413–430.
- Liang, X.; Zhu, F.; Zhu, Y.; Lin, B.; Wang, B.; Liang, X. Contrastive instruction-trajectory learning for vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022, Vol. 36, pp. 1592–1600.
- Guhur, P.L.; Tapaswi, M.; Chen, S.; Laptev, I.; Schmid, C. Airbert: In-domain pretraining for vision-and-language navigation. In Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 1634–1643.
- Lu, H.; Liu, W.; Zhang, B.; Wang, B.; Dong, K.; Liu, B.; Sun, J.; Ren, T.; Li, Z.; Yang, H.; et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 2024.
- Li, A.; Wang, Z.; Zhang, J.; Li, M.; Qi, Y.; Chen, Z.; Zhang, Z.; Wang, H. UrbanVLA: A vision-language-action model for urban micromobility. arXiv preprint arXiv:2510.23576 2025.
- Huang, Z.; Zhang, Y.; Liu, J.; Song, R.; Tang, C.; Ma, J. TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments. arXiv preprint arXiv:2602.02459 2026.
- Yin, H.; Xu, X.; Wu, Z.; Zhou, J.; Lu, J. SG-Nav: Online 3D scene graph prompting for LLM-based zero-shot object navigation. Advances in Neural Information Processing Systems 2024, 37, 5285–5307.
- Huang, X.; Zhao, S.; Wang, Y.; Lu, X.; Zhang, W.; Qu, R.; Li, W.; Wang, Y.; Wen, C. MSGNav: Unleashing the power of multi-modal 3D scene graph for zero-shot embodied navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 37154–37163.
- Dorbala, V.S.; Sigurdsson, G.; Piramuthu, R.; Thomason, J.; Sukhatme, G.S. CLIP-Nav: Using CLIP for zero-shot vision-and-language navigation. arXiv preprint arXiv:2211.16649 2022.
- Zhang, W.; Zhang, J. Language-Driven Zero-Shot Object Navigation via Dynamic Probabilistic Strategy and Large Language Models. IEEE Access 2025.
- Majumdar, A.; Aggarwal, G.; Devnani, B.; Hoffman, J.; Batra, D. ZSON: Zero-shot object-goal navigation using multimodal goal embeddings. Advances in Neural Information Processing Systems 2022, 35, 32340–32352.
- Chen, J.; Lin, B.; Liu, X.; Ma, L.; Liang, X.; Wong, K.Y.K. Affordances-oriented planning using foundation models for continuous vision-language navigation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 23568–23576.
- Team, I.N. InternVLA-N1: An Open Dual-System Navigation Foundation Model with Learned Latent Plans, 2025.
- Gao, C.; Peng, X.; Yan, M.; Wang, H.; Yang, L.; Ren, H.; Li, H.; Liu, S. Adaptive zone-aware hierarchical planner for vision-language navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14911–14920.
- Song, X.; Chen, W.; Liu, Y.; Chen, W.; Li, G.; Lin, L. Towards long-horizon vision-language navigation: Platform, benchmark and method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 12078–12088.
- Han, Z.; Wang, X.; Liu, B.; Lyu, Q.; Shang, Z.; Dong, J.; Liu, L.; Han, Z. SeqWalker: Sequential-Horizon Vision-and-Language Navigation with Hierarchical Planning. arXiv preprint arXiv:2601.04699 2026.
- Dai, G.; Wang, S.; Wang, Z.; Xie, G.S.; Yang, Y.; Pan, J.; Sun, Q.; Shu, X. History to Future: Evolving Agent with Experience and Thought for Zero-shot Vision-and-Language Navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 15177–15187.
- Wang, X.; Li, G.; Liu, Z.; Wang, Y.; Liu, L.; Han, Z. All-day multi-scenes lifelong vision-and-language navigation with Tucker adaptation. arXiv preprint arXiv:2603.14276 2026.
- Jiang, Y.; Zhang, H.; Luo, X.; He, S. M3E: Continual Vision-and-Language Navigation via Mixture of Macro and Micro Experts. In Proceedings of the Fourteenth International Conference on Learning Representations.
- Yao, X.; Gao, J.; Xu, C. NavMorph: A self-evolving world model for vision-and-language navigation in continuous environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 5536–5546.
- Pei, J.; Liu, Y.; Pan, G.; Jiang, Y.; Liu, H.; Wang, X. OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation. arXiv preprint arXiv:2604.12872 2026.
- Tian, H.; Meng, J.; Zheng, W.S.; Li, Y.M.; Yan, J.; Zhang, Y. Loc4Plan: Locating before planning for outdoor vision and language navigation. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 4073–4081.
- Elnoor, M.; Weerakoon, K.; Seneviratne, G.; Xian, R.; Guan, T.; Jaffar, M.K.M.; Rajagopal, V.; Manocha, D. VLM-GroNav: Robot Navigation Using Physically Grounded Vision-Language Models in Outdoor Environments. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 2391–2398.
- Xu, Y.; Pan, Y.; Liu, Z.; Wang, H. FLAME: Learning to navigate with multimodal LLM in urban environments. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025, Vol. 39, pp. 9005–9013.
- Ning, Y.; Zhao, G.; Qin, Y.; Liu, S.; Liu, Y.; Lin, L.; Li, G. LookasideVLN: direction-aware aerial vision-and-language navigation. arXiv preprint arXiv:2604.17190 2026.
- Taioli, F.; Rosa, S.; Castellini, A.; Natale, L.; Del Bue, A.; Farinelli, A.; Cristani, M.; Wang, Y. I2EDL: Interactive Instruction Error Detection and Localization. In Proceedings of the 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (ROMAN). IEEE, 2024, pp. 1872–1877.
- Li, C.; Tang, W.; Huang, Y.; Zhan, S.S.; Hu, M.; Jia, X.; Liu, Y. Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack. arXiv preprint arXiv:2511.13132 2025.
- Zhang, H.; Xu, M.; Dhafer, A.; Yue, S.; Dong, H.; Hao, Z.D. Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models. arXiv preprint arXiv:2605.00321 2026.
- Wang, Z.; Li, X.; Yang, J.; Liu, Y.; Jiang, S. Sim-to-real transfer via 3D feature fields for vision-and-language navigation. arXiv preprint arXiv:2406.09798 2024.
- Yin, S.; Fu, C.; Zhao, S.; Li, K.; Sun, X.; Xu, T.; Chen, E. A survey on multimodal large language models. National Science Review 2024, 11, nwae403.
- Caffagni, D.; Cocchi, F.; Barsellotti, L.; Moratelli, N.; Sarto, S.; Baraldi, L.; Cornia, M.; Cucchiara, R. The revolution of multimodal large language models: A survey. Findings of the Association for Computational Linguistics: ACL 2024 2024, pp. 13590–13618.
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen technical report. arXiv preprint arXiv:2309.16609 2023.
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 2023.
- Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning. PMLR, 2023, pp. 19730–19742.
- Chang, A.; Dai, A.; Funkhouser, T.; Halber, M.; Niessner, M.; Savva, M.; Song, S.; Zeng, A.; Zhang, Y. Matterport3D: Learning from RGB-D Data in Indoor Environments. In Proceedings of the International Conference on 3D Vision (3DV), 2017.

| Dimension | Gu et al. [1] | Wu et al. [21] | Zhang et al. [22] | Khan et al. [23] | Pan et al. [24] | Ours |
|---|---|---|---|---|---|---|
| Temporal coverage | –2022 | –2023 | –2024 | –2025 | –2025 | 2022–2026 |
| Core perspective | Task taxonomy | Task taxonomy | Foundation model tools | Task taxonomy | Foundation language models | Paradigm evolution |
| Object/landmark grounding | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| 3D scene understanding | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ |
| Streaming / video VLN | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Audio-visual navigation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Memory & history modeling | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| World models | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ |
| LLM/VLM-based reasoning & planning | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Zero-shot & open-world generalization | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Long-horizon navigation | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Agentic navigation & self-correction | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ |
| Continual / lifelong learning | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Self-evolving navigation | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Cross-platform navigation (UAV, outdoor) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| City-scale outdoor VLN | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Trustworthy & safety-aware VLN | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ |
| Social-aware & human-in-the-loop VLN | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |

| Benchmark | Year | Environment | Domain | Highlight |
|---|---|---|---|---|
| R2R [4] | 2018 | Sim. (Matterport3D) | Indoor | Foundational VLN benchmark with step-by-step instructions |
| R4R [28] | 2019 | Sim. (Matterport3D) | Indoor | Long-path extension by concatenating R2R trajectories |
| Touchdown [29] | 2019 | Real (Street View) | Outdoor | First outdoor street-view VLN benchmark |
| StreetLearn [30] | 2019 | Real (Street View) | Outdoor | Large-scale street-level navigation |
| HANNA [25] | 2019 | Sim. (Matterport3D) | Indoor | Help-seeking navigation with subgoal requests |
| Just Ask [49] | 2019 | Sim. (Matterport3D) | Indoor | Active question-asking for ambiguity resolution |
| ALFRED [50] | 2020 | Sim. (AI2-THOR) | Indoor | Household task combining navigation and manipulation |
| REVERIE [32] | 2020 | Sim. (Matterport3D) | Indoor | Remote object grounding with high-level instructions |
| RxR [31] | 2020 | Sim. (Matterport3D) | Indoor | Multilingual extension with denser instructions |
| VLN-CE / R2R-CE [2] | 2020 | Sim. (Habitat) | Indoor | First continuous-environment VLN with low-level control |
| CVDN [34] | 2020 | Sim. (Matterport3D) | Indoor | Cooperative vision-and-dialog navigation |
| ObjectNav [33] | 2020 | Sim. (Habitat) | Indoor | Object-goal navigation in unseen environments |
| RoboSlang [51] | 2020 | Real | Indoor | Real-robot dialog-based VLN |
| Retouchdown [52] | 2020 | Real (Street View) | Outdoor | Refined Touchdown with cleaner annotations |
| SOON [53] | 2021 | Sim. (Matterport3D) | Indoor | Scenario-oriented object navigation with hierarchical reasoning |
| RxR-CE [2] | 2021 | Sim. (Habitat) | Indoor | Continuous-environment counterpart of RxR |
| Talk2Nav [54] | 2021 | Real (Street View) | Outdoor | Long-range outdoor navigation with attention dialog |
| TEACh [35] | 2022 | Sim. (AI2-THOR) | Indoor | Task-oriented embodied agent with chat dialogue |
| DialFRED [36] | 2022 | Sim. (AI2-THOR) | Indoor | Dialog-augmented household task execution |
| HM3D-AutoVLN [55] | 2022 | Sim. (HM3D) | Indoor | Auto-generated instructions on large-scale HM3D |
| IVLN [14] | 2023 | Sim. (Habitat) | Indoor | Iterative VLN with cross-episode persistent memory |
| AerialVLN [45] | 2023 | Sim. (UE4) | Outdoor | First city-scale UAV VLN benchmark |
| Safe-VLN [40] | 2023 | Sim. (Habitat) | Indoor | Collision-aware safe VLN-CE |
| HA-VLN [41] | 2024 | Sim. (Matterport3D) | Indoor | Human-aware VLN with dynamic human activities |
| R2R-IE-CE [56] | 2024 | Sim. (Habitat) | Indoor | Instruction error detection and localization |
| VLNCL [42] | 2024 | Sim. (Matterport3D) | Indoor | First continual learning benchmark for VLN |
| CVLN [43] | 2024 | Sim. (Habitat) | Indoor | Cross-domain continual VLN |
| NaviLLM-Bench [57] | 2024 | Sim. (Mixed) | Indoor | Unified evaluation across multiple VLN tasks |
| VLN-Video [58] | 2024 | Real (Driving Video) | Outdoor | Driving-video-based outdoor VLN |
| CityNav [46] | 2024 | Real (Aerial) | Outdoor | City-scale aerial navigation dataset |
| Open X-E [59] | 2024 | Real | Indoor/Outdoor | Cross-embodiment large-scale dataset |
| LHPR-VLN [37] | 2025 | Sim. (Habitat) | Indoor | First long-horizon VLN benchmark, ∼150-step trajectories |
| MG-VLN [38] | 2025 | Sim. (Habitat) | Indoor | Multi-goal sequential navigation |
| GSA-VLN [60] | 2025 | Sim. (Habitat) | Indoor | Generalized scene adaptation with memory bank |
| HA-VLN 2.0 [61] | 2025 | Sim. (Matterport3D) | Indoor | Multi-human social-norm-aware VLN |
| VLN-PE [62] | 2025 | Sim.+Real | Indoor | Physical-level platform across multi-embodiments |
| VR-Robo [63] | 2025 | Real-Sim-Real | Indoor | High-fidelity digital twins for sim-to-real transfer |
| OpenFly [47] | 2025 | Sim./Real | Outdoor | 100K aerial trajectories, keyframe-aware UAV VLN |
| UAV-VLN [64] | 2025 | Sim. (UE4) | Outdoor | End-to-end velocity-yaw regression for UAV |
| StreamVLN [65] | 2025 | Sim. (Habitat) | Indoor | Streaming video-based VLN with online dialogue |
| CoNavBench [39] | 2026 | Sim. (Habitat) | Indoor | Multi-agent collaborative long-horizon VLN |
| VLNVerse [66] | 2026 | Sim. (Physics) | Indoor | Physics-aware large-scale VLN benchmark |
| VLN-NF [44] | 2026 | Sim. (Habitat) | Indoor | Feasibility-aware VLN with false-premise instructions |
| CoT-VLNBench [67] | 2026 | Sim. (Mixed) | Indoor | Visual chain-of-thought reasoning benchmark |
| AirNav [48] | 2026 | Real (Aerial) | Outdoor | Large-scale UAV VLN dataset for MLLM evaluation |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).