Preprint Article

This version is not peer-reviewed.

Confidence-Aware Gated Multimodal Fusion for Robust Temporal Action Localization in Occluded Environments

Submitted: 24 February 2026
Posted: 25 February 2026

Abstract
In industrial environments, robust Temporal Action Localization (TAL) is essential; however, frequent occlusions often compromise the reliability of skeletal data, leading to negative transfer in multimodal fusion. To address this challenge, we propose a Gated Skeleton Refinement Module (Gated SRM) that explicitly incorporates OpenPose confidence scores into the network architecture. By applying these scores as a logarithmic bias within a self-attention mechanism, our method achieves soft suppression, dynamically attenuating the attention weights assigned to unreliable joints, before adaptively fusing the refined skeletal features with RGB representations through a learnable gating network. Extensive experiments on the heavily occluded IKEA ASM dataset demonstrate that our approach prevents the catastrophic accuracy degradation typical of naive fusion strategies, improving mean Average Precision (mAP) to 21.77% and outperforming the RGB-only baseline. The system also maintains practical real-time inference speeds of approximately 16 frames per second (FPS). By prioritizing confidence-based data selection over data restoration, this sensor-metadata-driven architecture offers a robust and principled solution for real-world action recognition under occlusion.
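The two mechanisms named in the abstract, a logarithmic confidence bias inside self-attention and a learnable gate over the two modalities, can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the authors' implementation: the function names, the scalar sigmoid gate, and the per-joint confidence vector `conf` are all hypothetical simplifications. The key identity it demonstrates is that adding `log(conf_j)` to the pre-softmax score of key joint `j` multiplies its attention weight by `conf_j`, so a low-confidence joint is softly attenuated rather than hard-masked.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def confidence_biased_attention(q, k, v, conf, eps=1e-6):
    """Self-attention over J joints with a log-confidence bias.

    q, k, v: (J, d) per-joint features; conf: (J,) detector scores in [0, 1].
    Adding log(conf_j + eps) to every score attending TO joint j scales
    its softmax weight by roughly conf_j (soft suppression).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)          # (J, J) raw attention logits
    scores = scores + np.log(conf + eps)   # bias each key column by its confidence
    return softmax(scores, axis=-1) @ v    # (J, d) refined joint features

def gated_fusion(f_skel, f_rgb, w_g, b_g):
    """Hypothetical scalar gate: g = sigmoid(w_g . [skel; rgb] + b_g),
    output = g * skel + (1 - g) * rgb."""
    z = np.concatenate([f_skel, f_rgb]) @ w_g + b_g
    g = 1.0 / (1.0 + np.exp(-z))
    return g * f_skel + (1.0 - g) * f_rgb
```

Running the attention with one joint's confidence near zero shows its column of attention weights collapsing toward zero while the remaining joints renormalize among themselves, which is the "soft suppression" behavior described above.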
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.