Submitted: 10 October 2025
Posted: 11 October 2025
Abstract
Keywords:
1. Introduction
2. Related Work
3. Methodology
3.1. Research Questions
- RQ1: Can multimodal inputs (vision and LiDAR) improve the zero-shot reasoning capabilities of LLMs for navigation and exploration tasks in simulated AMR environments?
- RQ2: How do different foundational LLMs vary in their capacity to generalize planning behaviors for autonomous navigation under identical simulation conditions?
3.2. System Architecture
3.3. LiDAR Data Processing and Representation
- Spatial locality: Groups measurements from contiguous angular directions.
- Dimensionality reduction: Compresses raw sensor data into compact descriptors.
- Semantic alignment: Maps sensor readings to spatial concepts recognizable by humans and LLMs.
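The three properties above can be illustrated with a minimal sketch. It is not the authors' listing: the function names (`summarize_scan`, `describe_scan`), the eight-sector layout, the assumption that index 0 points straight ahead with angles increasing counter-clockwise, and the distance thresholds are all hypothetical choices made for illustration.

```python
import math

# Hypothetical sector labels, ordered counter-clockwise starting at the front.
SECTOR_NAMES = ["front", "front-left", "left", "rear-left",
                "rear", "rear-right", "right", "front-right"]


def summarize_scan(ranges, max_range=10.0):
    """Reduce a 360-degree scan to the nearest valid range per sector.

    Assumes ranges[0] points straight ahead and indices increase
    counter-clockwise. Returns {sector_name: distance_in_m or None}.
    """
    n_sectors = len(SECTOR_NAMES)
    if len(ranges) % n_sectors != 0:
        raise ValueError("scan length must be a multiple of the sector count")
    width = len(ranges) // n_sectors
    # Spatial locality: rotate by half a sector so "front" is centred on index 0.
    shift = width // 2
    rotated = list(ranges[-shift:]) + list(ranges[:-shift]) if shift else list(ranges)
    summary = {}
    for i, name in enumerate(SECTOR_NAMES):
        sector = rotated[i * width:(i + 1) * width]
        valid = [r for r in sector if math.isfinite(r) and 0.0 < r <= max_range]
        # Dimensionality reduction: one number (or None) per sector.
        summary[name] = min(valid) if valid else None
    return summary


def describe_scan(summary, blocked_below=0.5, near_below=2.0):
    """Semantic alignment: turn per-sector distances into short phrases.

    The thresholds (in metres) are illustrative assumptions.
    """
    parts = []
    for name, dist in summary.items():
        if dist is None:
            parts.append(f"{name}: clear")
        elif dist < blocked_below:
            parts.append(f"{name}: blocked ({dist:.1f} m)")
        elif dist < near_below:
            parts.append(f"{name}: near obstacle ({dist:.1f} m)")
        else:
            parts.append(f"{name}: distant obstacle ({dist:.1f} m)")
    return "; ".join(parts)


if __name__ == "__main__":
    scan = [math.inf] * 360   # no returns anywhere
    scan[0] = 0.3             # obstacle straight ahead
    scan[90] = 1.2            # obstacle to the left
    print(describe_scan(summarize_scan(scan)))
```

Taking the minimum range per sector is a conservative choice for navigation, since the closest return is the one most likely to cause a collision; a mean or median would be a reasonable alternative when the scan is noisy.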
Listing 1. Structured LiDAR data representation for LLM processing.
Listing 2. Template-based natural language generation of LiDAR descriptions.
3.4. LLMs and Prompt
Listing 3. System Prompt.
3.5. Experimental Settings
- T1: Go to the pile of pellet
- T2: Find an object i can use to carry some beers
- T3: Look if something is inside the red box
- T4: Go between the stairs and the oil barrels
- T5: Look for some stairs
- T6: Make a 360-degree turn
- T7: Look around
- T8: Go to the pile of dark brown boxes
- T9: Look for a red cylindrical fire extinguisher
- T10: There’s a fire!
3.6. Evaluation Metric for Spatial and Orientation Accuracy
4. Results
4.1. Multimodal Reasoning Capabilities of LLMs for AMR Planning
4.2. Generalization Across LLMs
5. Discussion and Limitations
5.1. Limitations
6. Conclusion and Future Works
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| AMR | Autonomous Mobile Robot |
| HRI | Human-Robot Interaction |
| LLM | Large Language Model |
| VLFM | Vision-Language Frontier Maps |
| VLM | Vision-Language Model |
Appendix A
Listing A1. Pseudocode of LLM-Based Robot Control System.



| Task | Success Rate | Score |
|---|---|---|
| 1 | 4/5 | 0.49 ± 0.30 |
| 2 | 0/5 | 0.01 ± 0.01 |
| 3 | 1/5 | 0.02 ± 0.03 |
| 4 | 0/5 | 0.01 ± 0.02 |
| 5 | 2/5 | 0.16 ± 0.22 |
| 6 | 5/5 | 1.00 ± 0.00 |
| 7 | 1/5 | N/A |
| 8 | 1/5 | 0.23 ± 0.42 |
| 9 | 2/5 | 0.16 ± 0.18 |
| 10 | 1/5 | 0.004 ± 0.004 |
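The table reports success per task but no aggregate. As a small worked example, and assuming each row counts successes out of five independent trials as the "x/5" notation suggests, the overall success rate can be computed as follows. The function name `overall_success_rate` is hypothetical; the counts are copied from the table.

```python
# Successes per task, copied from the "Success Rate" column above.
SUCCESSES = {1: 4, 2: 0, 3: 1, 4: 0, 5: 2, 6: 5, 7: 1, 8: 1, 9: 2, 10: 1}
TRIALS_PER_TASK = 5  # assumption: every task was attempted five times


def overall_success_rate(successes, trials_per_task):
    """Return (total successes, total attempts, success rate)."""
    total = sum(successes.values())
    attempts = trials_per_task * len(successes)
    return total, attempts, total / attempts


if __name__ == "__main__":
    total, attempts, rate = overall_success_rate(SUCCESSES, TRIALS_PER_TASK)
    print(f"{total}/{attempts} trials succeeded ({rate:.0%})")
```

Under that assumption, 17 of 50 trials succeed, which is 34%. The aggregate hides a wide spread between tasks, from 0/5 on Tasks 2 and 4 to 5/5 on Task 6.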
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).






