Submitted: 20 January 2026
Posted: 20 January 2026
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Word-Level Sign Language Recognition
2.2. Multiview Action Recognition
3. Method for Dual-View Sign Language Training
3.1. A System for Dual-View Sign Language Training
3.2. NationalCSL-DP Dataset
4. Proposed Algorithm for Dual-View WSLR
4.1. Dual-View Word-Level Sign Language Recognition
4.2. An Efficient Algorithm for WSLR
5. Experiments
5.1. Implementation Details
5.2. Evaluation Metric
5.3. Experiments on Different Feature Extractors
5.4. Comparison with State-of-the-Art Algorithms
5.5. Ablation Study
6. Conclusions
Contributions
Funding Declaration
Data availability statement
Conflicts of interest
References
- Alyami, S.; Luqman, H.; Hammoudeh, M. Isolated Arabic sign language recognition using a transformer-based model and landmark keypoints. ACM Transactions on Asian and Low-Resource Language Information Processing 2024, 23, 1–19.
- Alyami, S.; Luqman, H. Swin-MSTP: Swin transformer with multi-scale temporal perception for continuous sign language recognition. Neurocomputing 2025, 617, 129015.
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21), 2021; pp. 6836–6846.
- Bruce, X.; Liu, Y.; Zhang, X.; Zhong, S.; Chan, K. MMNet: A model-based multimodal network for human action recognition in RGB-D videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, 45, 3522–3538.
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021, 43, 172–186.
- Das, S.; Ryoo, M. ViewCLR: Learning self-supervised video representation for unseen viewpoints. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV’23), 2023; pp. 5573–5583.
- De Coster, M.; Van Herreweghe, M.; Dambre, J. Isolated sign recognition from RGB video using pose flow and self-attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21), 2021; pp. 3441–3450.
- Dinh, N.; Nguyen, T.; Tran, D.; Pham, N.; Tran, T.; Tong, N.; Le Nguyen, P. Sign language recognition: A large-scale multi-view dataset and comprehensive evaluation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV’25), 2025; pp. 7887–7897.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Du, Y.; Xie, P.; Wang, M.; Hu, X.; Zhao, Z.; Liu, J. Full transformer network with masking future for word-level sign language recognition. Neurocomputing 2022, 500, 115–123.
- Fink, J.; Poitier, P.; André, M.; Meurice, L.; Frénay, B.; Cleve, A.; Meurant, L. Sign language-to-text dictionary with lightweight transformer models. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’23), 2023; pp. 5968–5976.
- Guan, Z.; Hu, Y.; Jiang, H.; Sun, Y.; Yin, B. Multi-view isolated sign language recognition based on cross-view and multi-level transformer. Multimedia Systems 2025, 31, 1–15.
- Hasan, K.; Adnan, M. EMPATH: MediaPipe-aided ensemble learning with attention-based transformers for accurate recognition of Bangla word-level sign language. In Proceedings of the International Conference on Pattern Recognition (ICPR’25), 2025; pp. 355–371.
- Hosain, A.; Santhalingam, P.; Pathak, P.; Rangwala, H.; Kosecka, J. Hand pose guided 3D pooling for word-level sign language recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV’21), 2021; pp. 3429–3439.
- Hu, H.; Zhao, W.; Zhou, W.; Li, H. SignBERT+: Hand-model-aware self-supervised pre-training for sign language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023, 45, 11221–11239.
- Hu, H.; Zhou, W.; Pu, J.; Li, H. Global-local enhancement network for NMF-aware sign language recognition. ACM Transactions on Multimedia Computing, Communications, and Applications 2021, 17, 1–19.
- Huang, J.; Zhou, W.; Li, H.; Li, W. Attention-based 3D-CNNs for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology 2018, 29, 2822–2832.
- Ji, Y.; Yang, Y.; Shen, H.; Harada, T. View-invariant action recognition via unsupervised attention transfer (UANT). Pattern Recognition 2021, 113, 107807.
- Jiang, S.; Sun, B.; Wang, L.; Bai, Y.; Li, K.; Fu, Y. Skeleton aware multi-modal sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21), 2021; pp. 3413–3423.
- Jin, P.; Li, H.; Yang, J.; Ren, Y.; Li, Y.; Zhou, L.; Liu, J.; Zhang, M.; Pu, X.; Jing, S. A large dataset covering the Chinese national sign language for dual-view isolated sign language recognition. Scientific Data 2025, 12, 1–10.
- Jing, S.; Wang, G.; Zhai, H.; Tao, Q.; Yang, J.; Wang, B.; Jin, P. Dual-view spatio-temporal feature fusion with CNN-Transformer hybrid network for Chinese isolated sign language recognition. arXiv 2025, arXiv:2506.06966.
- Joze, H.; Koller, O. MS-ASL: A large-scale data set and benchmark for understanding American sign language. In Proceedings of the 30th British Machine Vision Conference (BMVC’19), 2019; p. 100.
- Kang, X.; Yao, D.; Jiang, M.; Huang, Y.; Li, F. Semantic network model for sign language comprehension. International Journal of Cognitive Informatics and Natural Intelligence 2022, 16, 1–19.
- Koller, O.; Camgoz, N.; Ney, H.; Bowden, R. Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019, 42, 2306–2320.
- Kusnadi, A. Motion detection using frame differences algorithm with the implementation of density. In Proceedings of the International Conference on New Media (ICNM’15), 2015; pp. 57–61.
- Kwak, I.; Guo, J.; Hantman, A.; Kriegman, D.; Branson, K. Detecting the starting frame of actions in video. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV’20), 2020; pp. 489–497.
- Li, B.; Yuan, C.; Xiong, W.; Hu, W.; Peng, H.; Ding, X.; Maybank, S. Multi-view multi-instance learning based on joint sparse representation and multi-view dictionary learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 2554–2560.
- Li, D.; Rodriguez, C.; Yu, X.; Li, H. Word-level deep sign language recognition from video: A new large-scale dataset and methods comparison. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV’20), 2020; pp. 1459–1469.
- Li, J.; Wong, Y.; Zhao, Q.; Kankanhalli, M. Unsupervised learning of view-invariant action representations. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS’18), 2018; pp. 1262–1272.
- Li, Y.; Wu, C.; Fan, H.; Mangalam, K.; Malik, J.; Feichtenhofer, C. MViTv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22), 2022; pp. 4804–4814.
- Liu, N.; Li, X.; Wu, B.; Yu, Q.; Wan, L.; Fang, T.; Zhang, J.; Li, Q.; Yuan, Y. A lightweight network-based sign language robot with facial mirroring and speech system. Expert Systems with Applications 2025, 262, 125492.
- Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21), 2021; pp. 10012–10022.
- Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.L.; Yong, M.G.; Lee, J.; Chang, W.T.; Hua, W.; Georg, M.; Grundmann, M. MediaPipe: A framework for building perception pipelines. arXiv 2019, arXiv:1906.08172.
- Ma, Y.; Yuan, L.; Abdelraouf, A.; Han, K.; Gupta, R.; Li, Z.; Wang, Z. M2DAR: Multi-view multi-scale driver action recognition with vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’23), 2023; pp. 5287–5294.
- Nguyen, H.; Nguyen, T. Attention-based network for effective action recognition from multi-view video. Procedia Computer Science 2021, 192, 971–980.
- Rajalakshmi, E.; Elakkiya, R.; Prikhodko, A.; Grif, M.; Bakaev, M.; Saini, J.; Subramaniyaswamy, V. Static and dynamic isolated Indian and Russian sign language recognition with spatial and temporal feature detection using hybrid neural network. ACM Transactions on Asian and Low-Resource Language Information Processing 2022, 22, 1–23.
- Ren, T.; Yao, D.; Yang, C.; Kang, X. The influence of Chinese characters on Chinese sign language. ACM Transactions on Asian and Low-Resource Language Information Processing 2024, 23, 6:1–6:31.
- Rousseeuw, P.; Croux, C. Alternatives to the median absolute deviation. Journal of the American Statistical Association 1993, 88, 1273–1283.
- Sengupta, A.; Jin, F.; Zhang, R.; Cao, S. MM-Pose: Real-time human skeletal posture estimation using mmWave radars and CNNs. IEEE Sensors Journal 2020, 20, 10032–10044.
- Shen, X.; Du, H.; Sheng, H.; Wang, S.; Chen, H.; Chen, H.; Yu, X. MM-WLAuslan: Multi-view multi-modal word-level Australian sign language recognition dataset. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS’24), 2024.
- Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. AdaSGN: Adapting joint number and model size for efficient skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’21), 2021; pp. 13413–13422.
- Sincan, O.; Keles, H. AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods. IEEE Access 2020, 8, 181340–181355.
- Singla, N. Motion detection based on frame difference method. International Journal of Information and Computation Technology 2014, 4, 1559–1565.
- Wang, H.; Chai, X.; Hong, X.; Zhao, G.; Chen, X. Isolated sign language recognition with Grassmann covariance matrices. ACM Transactions on Accessible Computing 2016, 8, 1–21.
- Wang, L.; Ding, Z.; Tao, Z.; Liu, Y.; Fu, Y. Generative multi-view human action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV’19), 2019; pp. 6212–6221.
- Wu, Z.; Ma, N.; Wang, C.; Xu, C.; Xu, G.; Li, M. Spatial–temporal hypergraph based on dual-stage attention network for multi-view data lightweight action recognition. Pattern Recognition 2024, 151, 110427.
- Xia, Z.; Pan, X.; Song, S.; Li, L.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’22), 2022; pp. 4794–4803.
- Xu, Y.; Jiang, S.; Cui, Z.; Su, F. Multi-view action recognition for distracted driver behavior localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’24), 2024; pp. 7172–7179.
- Yamane, T.; Suzuki, S.; Masumura, R.; Tora, S. MVAFormer: RGB-based multi-view spatio-temporal action recognition with transformer. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP’24), 2024; pp. 332–338.
- Yang, Y.; Liang, G.; Wu, X.; Liu, B.; Wang, C.; Liang, J.; Sun, J. Cross-view fused network for multi-view RGB-based action recognition. In Proceedings of the 2024 IEEE 8th International Conference on Vision, Image and Signal Processing (ICVISP’24), 2024; pp. 1–7.
- Zhang, D.; Dai, X.; Wang, X.; Wang, Y.F. S3D: Single shot multi-span detector via fully 3D convolutional networks. In Proceedings of the 2018 British Machine Vision Conference (BMVC’18), 2018; p. 293.
- Zhang, R.; Hu, C.; Yu, P.; Chen, Y. Improving multilingual sign language translation with automatically clustered language family information. In Proceedings of the 31st International Conference on Computational Linguistics (COLING’25), 2025; pp. 3579–3588.
- Zhang, R.; Zhao, R.; Wu, Z.; Zhang, L.; Zhang, H.; Chen, Y. Dynamic feature fusion for sign language translation using hypernetworks. In Findings of the Association for Computational Linguistics (NAACL’25), 2025; pp. 6227–6239.
- Zhou, H.; Zhou, W.; Zhou, Y.; Li, H. Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Transactions on Multimedia 2021, 24, 768–779.
- Zuo, R.; Wei, F.; Mak, B. Natural language-assisted sign language recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’23), 2023; pp. 14890–14900.



Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).



