Submitted: 28 July 2025
Posted: 29 July 2025
Abstract
Keywords:
1. Introduction
1.1. Problem Statement
- A Novel Dual-Stream Architecture (TransMODAL): We propose a compositional, end-to-end trainable model that fuses two complementary data streams. The first stream uses a pre-trained VideoMAE backbone to extract contextual appearance features from RGB video. The second, a dedicated PoseEncoder stream, processes explicit kinematic information from a sequence of 2D skeletal keypoints. The pose data is generated by a pre-processing pipeline that combines the RT-DETR model for person detection with the ViTPose++ model for keypoint estimation. Because these stages rely on existing foundation models, the architecture concentrates its novelty on the task of multi-modal fusion.
- Adaptive Multi-Modal Fusion and Pruning: We introduce two modules that form the technical core of the architecture. The first, CoAttentionFusion, exchanges information between the two streams through a symmetric cross-attention mechanism: the appearance stream queries the pose stream for kinematic context, while the pose stream simultaneously queries the appearance stream for visual evidence, so each modality enriches its representation with context from the other. The second, AdaptiveSelector, addresses the computational cost inherent in Transformer models. It is a lightweight module with a learnable scoring mechanism that identifies and prunes redundant spatiotemporal tokens from the fused representation. In our ablation study (Section 4.4), it lowers inference latency from 51.5 ms to 35.2 ms per batch with no loss in classification accuracy.
2. Related Work
2.1. Spatiotemporal Convolutions for Action Recognition
2.2. Transformer-Based Video Understanding
2.3. Pose-Guided Action Recognition
3. The TransMODAL Architecture
3.1. Overall Pipeline
- Person Detection and Tracking: For each video, we first apply an off-the-shelf, pre-trained RT-DETR model to detect all persons in every frame [14]. A simple tracker then associates detections across frames, yielding a set of person-centric video clips.
- Pose Estimation: For each person clip, a pre-trained ViTPose++ model estimates a sequence of 2D human poses [15]. As specified in our data loader, we extract the 17 keypoints of the COCO format for each frame, which yields a tensor of skeletal coordinates.
- Dual-Stream Encoding: The pipeline then splits into two parallel streams. The cropped RGB person clip is fed into a frozen VideoMAE backbone, while the corresponding sequence of pose coordinates is passed to our lightweight PoseEncoder.
- Fusion, Selection, and Classification: Visual and pose tokens undergo iterative cross-modal refinement via CoAttentionFusion (inspired by STAR-Transformer’s zigzag attention [26] and MM-ViT’s modality factorization [27]) and are then pruned by AdaptiveSelector using lightweight L₂-norm scoring (echoing DynamicViT’s token sparsification [25]). The remaining tokens are averaged and classified.
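As a rough illustration of the symmetric cross-attention in the fusion step, the sketch below uses single-head scaled dot-product attention in NumPy. The function names, the residual connections, the random (untrained) projection weights, and the toy token counts and width are all assumptions made for illustration; they are not taken from the TransMODAL code, which reports num_heads=8 and embed_dim=768.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head attention in which `queries` attend to `context`."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_queries, n_context)
    return weights @ v, weights

def co_attention_step(rgb_tokens, pose_tokens, params):
    """One symmetric exchange: each stream queries the other (hypothetical helper)."""
    rgb_ctx, _ = cross_attention(rgb_tokens, pose_tokens, *params["rgb_from_pose"])
    pose_ctx, _ = cross_attention(pose_tokens, rgb_tokens, *params["pose_from_rgb"])
    # Assumed residual connections keep each stream's own features.
    return rgb_tokens + rgb_ctx, pose_tokens + pose_ctx

rng = np.random.default_rng(0)
dim = 16                             # toy width, not the paper's 768
rgb = rng.normal(size=(6, dim))      # 6 appearance tokens
pose = rng.normal(size=(4, dim))     # 4 pose tokens
params = {
    name: [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3)]
    for name in ("rgb_from_pose", "pose_from_rgb")
}
rgb_out, pose_out = co_attention_step(rgb, pose, params)
print(rgb_out.shape, pose_out.shape)
```

Both updates are computed from the same unmodified inputs, which is what makes the exchange symmetric rather than sequential.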
3.2. Input Modalities and Encoders
3.2.1. RGB Appearance Stream
3.2.2. Pose Kinematics Stream (PoseEncoder)
3.3. Core Fusion and Selection Modules
3.3.1. Co-Attention Fusion
3.3.2. Adaptive Feature Selector (AdaptiveSelector)
- Frame Selection: It first identifies the most "active" temporal segments of the clip. Given the fused feature tensor for a single batch item, where T is the number of frames and P is the number of patches per frame, it calculates a salience score for each frame t by applying a learnable linear layer to that frame's aggregated patch features, as shown in Equation (3). It then selects the indices of the top_k_frames = 8 frames with the highest scores.
- Token Selection: Within these selected high-salience frames, it then pinpoints the most important spatial regions by applying a second learnable linear layer that computes a salience score for each patch token p, as in Equation (4).
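The two selection stages can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: frame features are aggregated by mean pooling, tokens are selected independently within each kept frame, the two linear layers are reduced to bias-free weight vectors with random values, and the sizes T=16, P=49, D=32 are toy values. The name adaptive_select is hypothetical; top_k_frames=8 and top_k_tokens=12 are the defaults reported for AdaptiveSelector.

```python
import numpy as np

def adaptive_select(fused, w_frame, w_token, top_k_frames=8, top_k_tokens=12):
    """Prune a (T, P, D) token grid to (k_frames, k_tokens, D)."""
    T, P, _ = fused.shape
    k_f, k_t = min(top_k_frames, T), min(top_k_tokens, P)

    # Stage 1: score each frame from its mean-pooled patch features (assumed pooling).
    frame_scores = fused.mean(axis=1) @ w_frame             # (T,)
    frame_idx = np.sort(np.argsort(frame_scores)[-k_f:])    # sorted to keep temporal order

    # Stage 2: score every patch token inside the kept frames.
    kept_frames = fused[frame_idx]                           # (k_f, P, D)
    token_scores = kept_frames @ w_token                     # (k_f, P)
    token_idx = np.sort(np.argsort(token_scores, axis=1)[:, -k_t:], axis=1)

    # Gather the chosen tokens; the index array broadcasts over the feature axis.
    selected = np.take_along_axis(kept_frames, token_idx[:, :, None], axis=1)
    return selected, frame_idx, token_idx

rng = np.random.default_rng(0)
T, P, D = 16, 49, 32                         # toy sizes
fused = rng.normal(size=(T, P, D))
w_frame, w_token = rng.normal(size=D), rng.normal(size=D)

selected, frame_idx, token_idx = adaptive_select(fused, w_frame, w_token)
pooled = selected.reshape(-1, D).mean(axis=0)   # average the surviving tokens
print(selected.shape, pooled.shape)             # (8, 12, 32) (32,)
```

With these toy sizes the selector keeps 8 × 12 = 96 of the original 16 × 49 = 784 tokens before they are averaged for classification.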
3.4. Implementation and Complexity
4. Experimental Evaluation
4.1. Datasets and Protocol
4.2. Implementation Details
4.3. Main Results and Baselines
4.4. Ablation Studies
- Pose is Critical: Removing the pose stream entirely (Row 1) causes the largest drop in accuracy, 1.7 percentage points (from 98.5% to 96.8%), which makes it the most impactful component for accuracy.
- AdaptiveSelector is Efficient: Removing the AdaptiveSelector (Row 2) lowers accuracy by a marginal 0.3 percentage points but increases latency by over 46% (from 35.2 ms to 51.5 ms per batch). Our token pruning strategy therefore reduces computational cost substantially without hurting accuracy.
- top_k_frames Sensitivity: Varying top_k_frames shows a clear trade-off. Reducing it to 4 (Row 3) lowers accuracy by 1.0 percentage point to 97.5%, suggesting that important temporal information is lost. Increasing it to 12 (Row 4) raises latency from 35.2 ms to 39.4 ms with no accuracy gain (98.4% versus 98.5% at our default of 8), which supports our hyperparameter choice.
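The deltas quoted in these observations can be recomputed from the ablation table. The short script below does so; the accuracy and latency values are copied from that table, while the dictionary layout and the helper name compare are only illustrative.

```python
# (Top-1 accuracy in %, latency in ms/batch) per configuration, from the ablation table.
ablation = {
    "Full TransMODAL Model": (98.5, 35.2),
    "No Pose Stream (VideoMAE-only)": (96.8, 29.8),
    "No AdaptiveSelector (use all tokens)": (98.2, 51.5),
    "top_k_frames = 4": (97.5, 31.1),
    "top_k_frames = 12": (98.4, 39.4),
}
base_acc, base_lat = ablation["Full TransMODAL Model"]

def compare(name):
    """Return (accuracy delta in points, latency change in %) versus the full model."""
    acc, lat = ablation[name]
    delta_acc = round(acc - base_acc, 1)
    latency_change = round(100 * (lat - base_lat) / base_lat, 1)
    return delta_acc, latency_change

for name in ablation:
    delta_acc, latency_change = compare(name)
    print(f"{name:40s} {delta_acc:+.1f} pts  {latency_change:+.1f}% latency")
```

Read the other way, enabling the selector cuts latency by roughly 32% relative to the unpruned model (35.2 ms versus 51.5 ms per batch).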
4.5. Qualitative Analysis
5. Analysis and Discussion
6. Conclusion and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Dong, Y.; Zhou, R.; Zhu, C.; Cao, L.; Li, X. Hierarchical activity recognition based on belief functions theory in body sensor networks. IEEE Sensors J. 2022, 22, 15211–15221.
2. Joudaki, M.; Ebrahimpour Komleh, H. Introducing a new architecture of deep belief networks for action recognition in videos. JMVIP. 2024, 11, 1, 43-58.
3. Teng, Q.; Wang, K.; Zhang, L.; He, J. The layer-wise training convolutional neural networks using local loss for sensor-based human activity recognition. IEEE Sensors J. 2020, 20, 7265–7274.
4. Han, Y.; Zhang, P.; Zhuo, T.; Huang, W.; Zhang, Y. Going Deeper with Two-Stream ConvNets for Action Recognition in Video Surveillance. Pattern Recognit. Lett. 2018, 107, 83–90.
5. Abdelbaky, A.; Aly, S. Two-Stream Spatiotemporal Feature Fusion for Human Action Recognition. Vis. Comput. 2021, 37, 1821–1835.
6. Joudaki, M.; Imani, M.; Arabnia, H.R. A New Efficient Hybrid Technique for Human Action Recognition Using 2D Conv-RBM and LSTM with Optimized Frame Selection. Technologies 2025, 13, 53.
7. Xin, C.; Kim, S.; Cho, Y.; Park, K.S. Enhancing Human Action Recognition with 3D Skeleton Data: A Comprehensive Study of Deep Learning and Data Augmentation. Electronics 2024, 13, 747.
8. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), USA, 2017; pp. 6299–6308.
9. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), USA, 2018; pp. 6450–6459.
10. Tong, Z.; Song, Y.; Wang, J.; Wang, L. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 10078–10093.
11. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225.
12. Baradel, F.; Wolf, C.; Mille, J. Human activity recognition with pose-driven attention to RGB. In Proceedings of the 29th British Machine Vision Conference (BMVC), UK, September 2018; pp. 1–14.
13. Song, S.; Liu, J.; Li, Y.; Guo, Z. Modality compensation network: Cross-modal adaptation for action recognition. IEEE Trans. Image Process. 2020, 29, 3957–3969.
14. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Cham, Switzerland, August 2020; pp. 213–229.
15. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose++: Vision transformer for generic body pose estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1212–1230.
16. Zhang, Y.; Guo, Q.; Du, Z.; Wu, A. Human Action Recognition for Dynamic Scenes of Emergency Rescue Based on Spatial-Temporal Fusion Network. Electronics 2023, 12, 538.
17. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; Suleyman, M. The Kinetics human action video dataset. arXiv 2017, arXiv:1705.06950.
18. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of ICML, USA, July 2021; pp. 4.
19. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Canada, 2021; pp. 6836–6846.
20. Fang, H.-S.; Xie, S.; Tai, Y.-W.; Lu, C. RMPE: Regional multi-person pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), USA, 2017; pp. 2334–2343.
21. Jiang, Y.; Yu, S.; Wang, T.; Sun, Z.; Wang, S. Skeleton-Based Human Action Recognition Based on Single Path One-Shot Neural Architecture Search. Electronics 2023, 12, 3156.
22. Wang, J.; Tan, S.; Zhen, X.; Xu, S.; Zheng, F.; He, Z.; Shao, L. Deep 3D human pose estimation: A review. Comput. Vis. Image Underst. 2021, 210, 103225.
23. Bevilacqua, A.; MacDonald, K.; Rangarej, A.; Widjaya, V.; Caulfield, B.; Kechadi, T. Human activity recognition with convolutional neural networks. In Proceedings of ECML PKDD, Cham, Switzerland, September 2018; pp. 541–552.
24. Reilly, D.; Chadha, A.; Das, S. Seeing the pose in the pixels: Learning pose-aware representations in vision transformers. arXiv 2023, arXiv:2306.09331.
25. Rao, Y.; Zhao, W.; Liu, B.; Lu, J.; Zhou, J.; Hsieh, C.J. DynamicViT: Efficient vision transformers with dynamic token sparsification. Adv. Neural Inf. Process. Syst. 2021, 34, 13937–13949.
26. Ahn, D.; Kim, S.; Hong, H.; Ko, B.C. STAR-Transformer: A spatio-temporal cross attention transformer for human action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), USA, January 2023; pp. 3330–3339.
27. Chen, J.; Ho, C.M. MM-ViT: Multi-modal video transformer for compressed video action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), USA, January 2022; pp. 1910–1921.
28. Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR), USA, August 2004; 3, pp. 32–36.
29. Soomro, K.; Zamir, A.R.; Shah, M. Ucf101: A dataset of 101 human action classes from videos in the wild. arXiv 2012, arXiv:1212.0402.
30. Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), USA, November 2011; pp. 2556–2563.
31. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
32. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
33. Li, Y.; Lu, Z.; Xiong, X.; Huang, J. PERF-Net: Pose empowered RGB-Flow Net. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), USA, January 2022; pp. 513–522.
34. Wang, L.; Huang, B.; Zhao, Z.; Tong, Z.; He, Y.; Wang, Y.; Wang, Y.; Qiao, Y. VideoMAE v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), USA, 2023; pp. 14549–14560.
35. Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Freund, I.; Yianilos, P.; Mueller-Freitag, M.; Hoppe, F. The “Something Something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), USA, October 2017; pp. 5842–5850.
36. Tan, T.-H.; Wu, J.-Y.; Liu, S.-H.; Gochoo, M. Human Activity Recognition Using an Ensemble Learning Algorithm with Smartphone Sensor Data. Electronics 2022, 11, 322.




| Module Name (code) | Key Hyperparameters | Parameters (M) | FLOPs (G) |
|---|---|---|---|
| VideoMAE Backbone | embed_dim=768 | 87 (frozen) | N/A |
| PoseEncoder | embed_dim=768, num_joints=17 | 0.22 | 0.02 |
| CoAttentionFusion | num_heads=8, depth=1 | 9.45 | 0.81 |
| AdaptiveSelector | top_k_frames=8, top_k_tokens=12 | 0.05 | <0.01 |
| ActionClassifier | num_classes=6 | 0.60 | <0.01 |
| Total Trainable | | 10.32 | ~0.83 |

| Model | Input Modality | Pre-trained on | KTH Top-1 Acc (%) |
|---|---|---|---|
| Two-stream ConvNets [4] | RGB + Flow | ImageNet | 93.1 |
| ST-VLAD-PCANet [5] | RGB | - | 93.33 |
| 2D Conv-RBM + LSTM [6] | RGB | - | 97.3 |
| VideoMAE(B/16) [10] | RGB | Kinetics | 96.5 |
| TransMODAL (Proposed method) - No Pose | RGB | Kinetics | 96.8 |
| TransMODAL (Proposed method) | RGB + Pose | Kinetics | 98.5 |

| Model | Input Modality | Pre-trained on | UCF101 Top-1 Acc (%) |
|---|---|---|---|
| I3D (Two-Stream) [8] | RGB + Flow | Kinetics | 97.9 |
| R(2+1)D (Two-Stream) [9] | RGB + Flow | Kinetics | 97.3 |
| VideoMAE [10] | RGB | Kinetics | 91.3 |
| PERF-Net [33] | RGB + Flow + Pose | S3D-G | 98.6 |
| TransMODAL (Proposed method) | RGB + Pose | Kinetics | 96.9 |

| Model | Input Modality | Pre-trained on | HMDB51 Top-1 Acc (%) |
|---|---|---|---|
| 2D Conv-RBM + LSTM [6] | RGB | - | 81.5 |
| I3D (Two-Stream) [8] | RGB + Flow | Kinetics | 80.2 |
| R(2+1)D (Two-Stream) [9] | RGB + Flow | Kinetics | 78.7 |
| VideoMAE [10] | RGB | Kinetics | 62.6 |
| PERF-Net [33] | RGB + Flow + Pose | S3D-G | 83.2 |
| VideoMAE V2-g [34] | RGB | UnlabeledHybrid | 88.7 |
| TransMODAL (Proposed method) | RGB + Pose | Kinetics | 84.2 |

| Configuration | Top-1 Acc (%) | Δ vs. Full Model | Latency (ms/batch) |
|---|---|---|---|
| Full TransMODAL Model | 98.5 | - | 35.2 |
| No Pose Stream (VideoMAE-only) | 96.8 | -1.7 | 29.8 |
| No AdaptiveSelector (use all tokens) | 98.2 | -0.3 | 51.5 |
| top_k_frames = 4 | 97.5 | -1.0 | 31.1 |
| top_k_frames = 12 | 98.4 | -0.1 | 39.4 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).