Preprint Article

This version is not peer-reviewed.

Semantic-Augmented Reality: A Hybrid Robotic Framework Combining Edge AI and Vision Language Models for Dynamic Industrial Inspection

Submitted: 23 December 2025
Posted: 24 December 2025


Abstract
With the rise of Industry 4.0, Augmented Reality (AR) has become pivotal for human-robot collaboration. However, most industrial AR systems still rely on pre-defined tracked images or markers, limiting adaptability in unmodeled or dynamic environments. This paper proposes a novel Interactive Semantic-Augmented Reality (ISAR) framework that synergizes Edge AI and Cloud Vision-Language Models (VLMs). To ensure real-time performance, we implement a Dual-Thread Asynchronous Architecture on the robotic edge, decoupling video streaming from AI inference. We introduce a Confidence-Based Triggering Mechanism, where a cloud-based VLM is invoked only when edge detection confidence falls below a predefined threshold. Instead of traditional image cropping, we employ a Visual Prompting strategy—overlaying bounding boxes on full-frame images—to preserve spatial context for accurate VLM semantic analysis. Finally, the generated insights are anchored to the physical world via Screen-to-World Raycasting without fiducial markers. This framework realizes a semantic-aware 'Intelligent Agent' that enhances Human-in-the-Loop (HITL) decision-making in complex industrial settings.
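The control flow described in the abstract can be illustrated with a minimal sketch. The Python code below is an assumed illustration, not the authors' implementation: all names (capture_frames, edge_detect, query_cloud_vlm) and the 0.6 confidence threshold are hypothetical stand-ins. It shows two decoupled threads (streaming vs. inference) plus the confidence-based trigger that escalates ambiguous detections to a cloud VLM via visual prompting on the full frame.

```python
# Minimal sketch of a Dual-Thread Asynchronous Architecture with a
# Confidence-Based Triggering Mechanism. All names and values here are
# hypothetical; the paper does not publish its implementation.
import queue
import threading
import time

CONF_THRESHOLD = 0.6  # assumed value; the paper states only "a predefined threshold"

def capture_frames(frame_q: queue.Queue, stop: threading.Event) -> None:
    """Streaming thread: publishes frames at ~30 FPS, dropping stale ones."""
    frame_id = 0
    while not stop.is_set():
        frame = f"frame-{frame_id}"  # stand-in for a camera image
        frame_id += 1
        try:
            frame_q.put_nowait(frame)
        except queue.Full:
            try:
                frame_q.get_nowait()  # drop the stale frame to stay real-time
            except queue.Empty:
                pass
            frame_q.put_nowait(frame)
        time.sleep(1 / 30)

def edge_detect(frame: str):
    """Stand-in for the on-robot edge detector: returns (label, confidence, bbox)."""
    return "valve", 0.42, (120, 80, 200, 160)

def query_cloud_vlm(frame: str, bbox) -> str:
    """Stand-in for the cloud VLM call. Visual prompting: the bounding box is
    overlaid on the FULL frame (not a crop), preserving spatial context."""
    return f"VLM semantic analysis of {frame} with box {bbox} highlighted"

def inference_loop(frame_q: queue.Queue, stop: threading.Event) -> None:
    """Inference thread: decoupled from streaming so detection latency never
    blocks the video feed."""
    while not stop.is_set():
        try:
            frame = frame_q.get(timeout=0.1)
        except queue.Empty:
            continue
        label, conf, bbox = edge_detect(frame)
        if conf < CONF_THRESHOLD:
            # Confidence-based trigger: escalate ambiguous detections to the VLM.
            print(query_cloud_vlm(frame, bbox))
        else:
            print(f"edge result: {label} ({conf:.2f})")

if __name__ == "__main__":
    stop = threading.Event()
    frames: queue.Queue = queue.Queue(maxsize=1)  # depth-1: keep only the freshest frame
    threading.Thread(target=capture_frames, args=(frames, stop), daemon=True).start()
    threading.Thread(target=inference_loop, args=(frames, stop), daemon=True).start()
    time.sleep(0.5)
    stop.set()
```

A depth-1 queue is one simple way to ensure the inference thread always works on the freshest frame; the Screen-to-World Raycasting step that anchors VLM output to the physical scene is omitted from this sketch.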
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.