Submitted: 23 December 2025
Posted: 24 December 2025
Abstract
Keywords:
1. Introduction
1.1. Background and Motivation
1.2. Problem Statement
1.3. Objectives and Proposed Approach
- Dual-Thread Asynchronous Perception: To ensure low-latency performance on resource-constrained edge devices, we implement a Dual-Thread Asynchronous Architecture. This design decouples the video streaming task from the AI inference task, guaranteeing that the operator receives smooth first-person view (FPV) feedback regardless of the computational load of the object detection model.
- Confidence-Based Semantic Verification: Rather than sending every frame to the cloud, we employ a Confidence-Based Triggering Mechanism: the system invokes the cloud-based VLM only when the edge detection confidence falls below a predefined threshold (e.g., 70%). To preserve spatial context, we utilize a Visual Prompting strategy, overlaying a bounding box on the full-frame image rather than cropping the ROI, allowing the VLM to analyze the object within its environment.
- Markerless Spatial Registration: To overcome the limitations of marker-based tracking, we employ Screen-to-World Raycasting. This technique maps the 2D detection coordinates to 3D spatial anchors, ensuring precise overlay of semantic information without pre-deployed markers.
- Human-in-the-Loop Control: Finally, an Action Manager is introduced to close the decision loop. Operators can issue high-level commands based on AR-visualized semantic insights, establishing a robust Human-in-the-Loop (HITL) system. A high-level sketch of how these four components interact is given after this list.
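The paper's implementation is not reproduced here; purely as a rough illustration of how the four components above fit together, the following Python sketch outlines one control cycle under assumed interfaces. Every name in it (Detection, verify_with_vlm, raycast_to_anchor, show_overlay, confirm, execute) is a hypothetical placeholder, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONF_THRESHOLD = 0.70  # the 70% confidence gate described above

@dataclass
class Detection:
    """Fields mirroring the edge detector output described in Section 3.1 (illustrative)."""
    bbox: tuple                # (x1, y1, x2, y2) in pixels
    class_id: int
    confidence: float          # 0.0 .. 1.0
    action_code: str = "NONE"  # filled in after VLM verification

def control_cycle(frame,
                  detection: Optional[Detection],
                  verify_with_vlm: Callable,    # Section 3.2 stage (hypothetical)
                  raycast_to_anchor: Callable,  # Section 3.3 stage (hypothetical)
                  show_overlay: Callable,       # AR rendering (hypothetical)
                  confirm: Callable,            # operator UI (hypothetical)
                  execute: Callable) -> None:   # Action Manager, Section 3.4 (hypothetical)
    """One pass of the perception -> verification -> registration -> HITL loop."""
    if detection is None:
        return                                         # inference thread has no result yet
    if detection.confidence < CONF_THRESHOLD:
        detection = verify_with_vlm(frame, detection)  # full frame with drawn BBox, not a crop
    anchor = raycast_to_anchor(detection.bbox)         # 2D BBox -> 3D spatial anchor
    show_overlay(anchor, detection)                    # render semantic overlay in AR
    if confirm(detection.action_code):                 # the human stays in the loop
        execute(detection.action_code)                 # only confirmed commands are executed
```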
1.4. Main Contributions
- Proposal of an Asynchronous Hybrid Perception Architecture: We propose a robust edge-cloud framework featuring a Dual-Thread Asynchronous Architecture. This design removes the latency bottleneck that on-device inference would otherwise impose on real-time FPV transmission. Furthermore, by integrating a Confidence-Based Triggering Mechanism, we optimize computational resources, leveraging Cloud VLM intelligence only when Edge AI uncertainty is high.
- Development of Markerless Semantic AR Registration: We develop a spatial registration technique combining deep learning coordinates with Screen-to-World Raycasting. This enables the precise anchoring of semantic insights in unmodeled, unstructured environments without fiducial markers.
- Integration of Visual Prompting for HITL Control: We introduce a Visual Prompting strategy that retains full-frame context for VLM analysis, significantly enhancing the accuracy of semantic verification (e.g., correcting low-confidence detections). Coupled with our Action Manager, this forms a reliable Semantic-Driven Human-in-the-Loop (HITL) control system, transforming unstructured VLM outputs into actionable robot commands.
1.5. Paper Organization
2. Literature Review
2.1. Augmented Reality for Industrial Teleoperation
2.2. Embodied AI and Vision Language Models
2.3. Edge-Cloud Collaboration and Human-in-the-Loop Control
3. Methodology
3.1. System Architecture: Asynchronous Edge Processing
- Thread A (Streaming Task): This high-priority thread captures raw frames from the camera and encodes them into binary data (MJPEG/Base64) for real-time FPV transmission via WebSocket. This ensures the operator receives smooth, low-latency video feedback regardless of the AI processing load.
- Thread B (Inference Task): This thread runs the YOLO object detection model asynchronously. Upon completing an inference cycle, it updates a shared data structure with the latest BBox, Class ID, and Confidence Score. A minimal sketch of this two-thread pattern is given after this list.
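The following is a minimal Python sketch of the two-thread pattern, assuming an OpenCV capture source and the standard threading module; the ws_send and detect callables stand in for the WebSocket writer and the YOLO wrapper, which are not specified here.

```python
import base64
import threading
import time

import cv2  # assumed capture/encoding library; the paper does not name one

# Shared state: Thread A publishes the newest frame, Thread B the newest detection.
state = {"frame": None, "bbox": None, "class_id": None, "confidence": 0.0}
lock = threading.Lock()

def streaming_thread(cap, ws_send):
    """Thread A (high priority): capture, JPEG/Base64-encode, and stream frames over WebSocket."""
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        with lock:
            state["frame"] = frame                      # expose the newest frame to Thread B
        ok, jpeg = cv2.imencode(".jpg", frame)
        if ok:
            ws_send(base64.b64encode(jpeg.tobytes()))   # 'ws_send' = WebSocket write (placeholder)

def inference_thread(detect):
    """Thread B: run the detector on the newest frame and update the shared result."""
    while True:
        with lock:
            frame = None if state["frame"] is None else state["frame"].copy()
        if frame is None:
            time.sleep(0.01)                            # wait for the first frame
            continue
        bbox, class_id, conf = detect(frame)            # e.g., a YOLO wrapper (placeholder)
        with lock:
            state.update(bbox=bbox, class_id=class_id, confidence=conf)

# Illustrative wiring (cap, ws_send, and yolo_detect are placeholders):
# cap = cv2.VideoCapture(0)
# threading.Thread(target=streaming_thread, args=(cap, ws_send), daemon=True).start()
# threading.Thread(target=inference_thread, args=(yolo_detect,), daemon=True).start()
```

Because only the most recent detection is kept, a slow inference cycle never stalls the streaming loop; the AR client simply reuses the last published result until a newer one arrives.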
3.2. Cognitive Layer: Confidence-Based Semantic Verification
- Data Parsing & Visualization: Upon receiving the JSON packet, the system parses the object detection data. If valid detections exist, the Bounding Box (BBox) and Confidence Score are rendered onto the raw image.
- Threshold Evaluation: The system evaluates the Confidence Score (S).
- IF S ≥ 70%: The detection is considered reliable. The system bypasses the VLM and directly proceeds to spatial registration and AR rendering using the local YOLO labels.
- IF S < 70%: The detection is flagged as ambiguous. The system triggers the VLM verification process.
- Visual Prompting Strategy: Unlike previous approaches that cropped the ROI, we retain the full-frame image to preserve spatial context. To focus the VLM's attention, we employ a Visual Prompting technique where the target BBox is visually drawn (e.g., a green rectangle) on the image sent to the VLM.
- Prompt Engineering: The Prompt Engine selects a query tailored to the YOLO Class ID (e.g., "Verify the object inside the green box. Is it a valve or a cap? Describe its condition.") and transmits it to the Gemini VLM via its API. A minimal sketch of this confidence gate and visual prompt is given after this list.
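As a hedged illustration of the threshold check, visual prompting, and class-specific prompt selection described above, the sketch below draws the BBox on the full frame with OpenCV and hands the annotated image to a query_vlm placeholder; the prompt table and the Gemini API call are assumptions, not the paper's implementation.

```python
import cv2

CONF_THRESHOLD = 70  # percent, as in Section 3.2

# Class-specific prompt templates (illustrative; the paper quotes only the valve example).
PROMPTS = {
    "valve": "Verify the object inside the green box. Is it a valve or a cap? Describe its condition.",
    "sign":  "Verify the object inside the green box. What does this sign indicate?",
}

def verify_if_ambiguous(frame, bbox, class_name, confidence, query_vlm):
    """Apply the confidence gate; on low confidence, send a visually prompted full frame to the VLM."""
    if confidence >= CONF_THRESHOLD:
        return class_name, confidence                  # reliable: bypass the VLM entirely

    # Visual Prompting: draw the BBox on the *full* frame instead of cropping the ROI,
    # so the VLM keeps the surrounding spatial context.
    x1, y1, x2, y2 = bbox
    prompted = frame.copy()
    cv2.rectangle(prompted, (x1, y1), (x2, y2), (0, 255, 0), 3)   # green rectangle

    prompt = PROMPTS.get(class_name, "Verify the object inside the green box.")
    return query_vlm(prompted, prompt)                 # placeholder for the Gemini API call
```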
3.3. Markerless Spatial Registration Pipeline
3.3.1. Anchor Selection and Coordinate Normalization
3.3.2. Raycasting and Spatial Mapping
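The two subsections above are listed here without implementation detail. Purely as an illustrative analogue, the sketch below normalizes the BBox's bottom-center anchor point to screen coordinates and back-projects it onto an assumed plane through a pinhole-camera ray; in the actual system, an AR framework's screen-to-world raycast against detected planes would replace this simplified geometry, and the intrinsic matrix K and the plane parameters are assumptions.

```python
import numpy as np

def normalize_anchor(bbox, width, height):
    """Map the BBox's bottom-center pixel (a natural ground-contact anchor) to [0, 1] screen coordinates."""
    x1, y1, x2, y2 = bbox
    u, v = (x1 + x2) / 2.0, float(y2)                  # bottom-center of the box
    return u / width, v / height

def screen_to_world(u_norm, v_norm, width, height, K, plane_point, plane_normal):
    """Back-project a normalized screen point onto a plane expressed in the camera frame.

    K is a 3x3 pinhole intrinsic matrix; the plane stands in for the surface an AR
    raycast would hit. Returns the 3D intersection point, or None if the ray is
    (near-)parallel to the plane or the plane lies behind the camera.
    """
    u, v = u_norm * width, v_norm * height
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    direction = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    direction /= np.linalg.norm(direction)

    denom = float(np.dot(plane_normal, direction))     # ray origin = camera center (0, 0, 0)
    if abs(denom) < 1e-6:
        return None
    t = float(np.dot(plane_normal, plane_point)) / denom
    if t < 0:
        return None
    return t * direction                               # 3D anchor position in camera coordinates
```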
3.4. Action Manager and HITL Control
- Recommendation: The AR interface parses the action_code from the VLM response (e.g., INSPECT) and displays a corresponding interactive button overlaid on the object.
- Validation: The human operator evaluates the visual evidence provided by the VLM. If the operator agrees with the AI's diagnosis, they confirm the action via the UI.
- Execution: The AR application transmits the confirmed command ID to the robot's Action Manager. The Action Manager maintains a Finite State Machine (FSM) and triggers the appropriate actuator behaviors (e.g., stopping motors, adjusting camera zoom, or logging the incident). A minimal FSM sketch is given after this list.
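To make the Action Manager's role concrete, here is a minimal FSM sketch; the state names, command IDs, and actuator hooks are illustrative assumptions, since the paper does not enumerate its state machine.

```python
from enum import Enum, auto

class RobotState(Enum):               # illustrative states; the paper's FSM is not enumerated
    NAVIGATING = auto()
    INSPECTING = auto()
    STOPPED = auto()

class ActionManager:
    """Maps operator-confirmed command IDs to state transitions and actuator behaviors."""

    # (current state, command ID) -> next state; unlisted pairs are rejected.
    TRANSITIONS = {
        (RobotState.NAVIGATING, "INSPECT"): RobotState.INSPECTING,
        (RobotState.NAVIGATING, "STOP"):    RobotState.STOPPED,
        (RobotState.INSPECTING, "RESUME"):  RobotState.NAVIGATING,
        (RobotState.STOPPED,    "RESUME"):  RobotState.NAVIGATING,
    }

    def __init__(self, actuators):
        self.state = RobotState.NAVIGATING
        self.actuators = actuators        # e.g., {"STOP": stop_motors, "INSPECT": zoom_camera} (placeholders)

    def execute(self, command_id: str) -> bool:
        """Run a confirmed command only if it is valid in the current state."""
        next_state = self.TRANSITIONS.get((self.state, command_id))
        if next_state is None:
            return False                  # command not allowed in this state
        self.actuators.get(command_id, lambda: None)()   # stop motors, zoom, log the incident, etc.
        self.state = next_state
        return True
```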
4. Experiments & Results
4.1. Experiment 1: Semantic Verification and Detail Refinement
- Trigger: Since the detected confidence score (S = 54%) fell below the predefined 70% threshold, the VLM pipeline was triggered. Figure 5a depicts this baseline detection provided by the Edge AI.
- Visual Prompt: The full-frame image with an orange BBox overlay was sent to Gemini.
- VLM Response: The VLM returned the analysis: "The object inside the orange box is indeed a valve, but it shows signs of heavy corrosion on the handle. Status: Warning." and assigned a high verification confidence. A sketch of how such a reply can be parsed into structured fields is given after this list.
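The verification reply above is reported as text plus a status and a confidence value. Assuming the prompt asks the VLM to answer in a small JSON schema (the exact schema is not given in the paper), such a reply could be parsed into the fields used downstream as follows.

```python
import json

def parse_vlm_reply(reply_text: str) -> dict:
    """Parse a JSON-formatted VLM reply into the fields used downstream.

    Assumed schema: {"label", "status", "confidence", "description"}; if the model
    answers in free text instead, the raw text is kept as the description.
    """
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return {"label": "UNKNOWN", "status": "UNKNOWN", "confidence": 0.0, "description": reply_text}
    return {
        "label": data.get("label", "UNKNOWN"),
        "status": data.get("status", "OK"),              # e.g., "Warning" for the corroded valve
        "confidence": float(data.get("confidence", 0.0)),
        "description": data.get("description", ""),
    }

# Illustrative reply matching Experiment 1:
# parse_vlm_reply('{"label": "valve", "status": "Warning", "confidence": 0.95, '
#                 '"description": "heavy corrosion on the handle"}')
```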
4.2. Experiment 2: Semantic Navigation at Intersections
- Detection: As shown in Figure 6a,b, the Edge AI successfully identified the scene as an "Intersection" with a high confidence score of 92%. Concurrently, the visual system captured the "STOP" sign.
- Cognitive Analysis: The system forwarded the scene context to the VLM. The VLM analyzed the spatial relationship between the intersection and the "STOP" sign, determining that the left path was restricted.
- Semantic Augmentation: Instead of a simple 2D text warning, the system utilized the Spatial Anchor Manager to instantiate a virtual 3D Barrier object directly onto the left path in the AR view, as illustrated in Figure 6c. This provides intuitive, immersive feedback to the operator.
- Decision Interface: Based on the VLM's recommendation (Action Code: BLOCK_LEFT), the AR UI dynamically generated the control panel shown in Figure 6d. The system filtered out the restricted "Turn Left" option, presenting only the valid commands "Straight" and "Turn Right"; a sketch of this filtering logic is given after this list.
- Result: The operator acknowledged the virtual barrier and selected "Turn Right". The robot executed the command safely. This experiment validates the framework's ability to translate high-level semantic understanding into actionable, safe control constraints in real-time.
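The decision-interface filtering in this experiment can be sketched as a simple lookup from the VLM's action code to the commands that must be hidden; the code-to-command mapping below is assumed, not taken from the paper.

```python
ALL_COMMANDS = ["Turn Left", "Straight", "Turn Right"]

# Assumed mapping from VLM action codes to commands that must be hidden from the operator.
BLOCKED_BY_CODE = {
    "BLOCK_LEFT":  {"Turn Left"},
    "BLOCK_RIGHT": {"Turn Right"},
    "STOP_ALL":    set(ALL_COMMANDS),
}

def available_commands(action_code: str) -> list:
    """Return only the commands that remain valid under the VLM's recommendation."""
    blocked = BLOCKED_BY_CODE.get(action_code, set())
    return [cmd for cmd in ALL_COMMANDS if cmd not in blocked]

# Experiment 2: available_commands("BLOCK_LEFT") -> ["Straight", "Turn Right"]
```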
5. Conclusion and Future Work
5.1. Conclusion
- Real-time FPV Performance: The dual-thread design effectively decouples image streaming from AI inference. This ensures that operators receive low-latency visual feedback regardless of the computational load from object detection tasks.
- Optimized Resource Utilization: The Confidence-Based Triggering mechanism balances the speed of Edge AI with the intelligence of Cloud AI. By invoking the VLM only when edge detection confidence falls below 70%, the system efficiently resolves ambiguity without incurring unnecessary latency or token costs.
- Enhanced Semantic Understanding: The implementation of Visual Prompting, which retains the full-frame context rather than cropping, proved crucial. Experiments demonstrated that the VLM could effectively act as a verifier, elevating low-confidence detections (e.g., from 54% to 95%) and identifying fine-grained details such as corrosion.
- Semantic-Driven HITL Control: The system demonstrated the ability to translate high-level semantic analysis into intuitive AR visualizations (e.g., 3D Barriers) and constrained control interfaces. This establishes a robust Human-in-the-Loop control cycle, ensuring safe navigation in semantically complex environments.
5.2. Limitations and Future Work
- Edge-VLM Deployment: We aim to investigate quantized Vision Language Models (e.g., NanoLLM) that can run directly on the robot's NPU. This would eliminate network latency entirely, enabling continuous semantic analysis even in offline environments.
- Advanced 3D Reconstruction: We intend to integrate 3D Gaussian Splatting technology to achieve denser environmental understanding. This will allow for more precise semantic occlusion and interaction in complex, non-planar 3D spaces.
Author Contributions
References






Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
