Submitted: 08 May 2025
Posted: 08 May 2025
Abstract
Keywords:
1. Introduction
2. Related Work
2.1. Self-Supervised Methods
2.2. Supervised Methods
2.3. Multi-task Methods
3. Methods
3.1. Architecture
3.2. Multi-Task Training
3.3. Loss
4. Experimental Setup
4.1. Datasets
4.2. Pre-Processing and Augmentation
4.3. Implementation Details
4.4. Language Model
5. Results
5.1. Comparison to the Latest Methods
5.2. Language Model
5.3. Generalization
5.4. Noise Experiments
6. Future Works
7. Discussion
Author Contributions
Funding
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| SR | Speech recognition |
| VSR | Visual speech recognition |
| ASR | Audio (or automatic) speech recognition |
| AVSR | Audio-visual speech recognition |
| WER | Word error rate |
| LM | Language model |
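
Since every result table reports WER, a brief illustration of the metric may help. The sketch below assumes the conventional definition (word-level edit distance between hypothesis and reference, divided by the number of reference words, expressed as a percentage); the function name `word_error_rate` and the whitespace tokenization are illustrative choices, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent, assuming simple whitespace tokenization."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words (substitution, insertion, deletion = 1).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.
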
References
1. Dua, M.; Akanksha.; Dua, S. Noise robust automatic speech recognition: review and analysis. International Journal of Speech Technology 2023, 26, 475–519.
2. Cui, X.; Iseli, M.; Zhu, Q.; Alwan, A. Evaluation of noise robust features on the Aurora databases. In Proceedings of the INTERSPEECH, 2002, pp. 481–484.
3. Haapakangas, A.; Hongisto, V.; Hyönä, J.; Kokko, J.; Keränen, J. Effects of unattended speech on performance and subjective distraction: The role of acoustic design in open-plan offices. Applied Acoustics 2014, 86, 1–16.
4. Ma, P.; Haliassos, A.; Fernandez-Lopez, A.; Chen, H.; Petridis, S.; Pantic, M. Auto-avsr: Audio-visual speech recognition with automatic labels. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
5. Rouditchenko, A.; Thomas, S.; Kuehne, H.; Feris, R.; Glass, J. mWhisper-Flamingo for Multilingual Audio-Visual Noise-Robust Speech Recognition. arXiv preprint arXiv:2502.01547 2025.
6. Shi, B.; Mohamed, A.; Hsu, W.N. Learning lip-based audio-visual speaker embeddings with av-hubert. arXiv preprint arXiv:2205.07180 2022.
7. Sumby, W.H.; Pollack, I. Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America 1954, 26, 212–215.
8. Cappellazzo, U.; Kim, M.; Chen, H.; Ma, P.; Petridis, S.; Falavigna, D.; Brutti, A.; Pantic, M. Large language models are strong audio-visual speech recognition learners. In Proceedings of the ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
9. Ryumin, D.; Ivanko, D.; Ryumina, E. Audio-visual speech and gesture recognition by sensors of mobile devices. Sensors 2023, 23, 2284.
10. Sun, K.; Yu, C.; Shi, W.; Liu, L.; Shi, Y. Lip-interact: Improving mobile device interaction with silent speech commands. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, 2018, pp. 581–593.
11. Srivastava, T.; Winters, R.M.; Gable, T.; Wang, Y.T.; LaScala, T.; Tashev, I.J. Whispering Wearables: Multimodal Approach to Silent Speech Recognition with Head-Worn Devices. In Proceedings of the 26th International Conference on Multimodal Interaction, 2024, pp. 214–223.
12. Jin, Y.; Gao, Y.; Xu, X.; Choi, S.; Li, J.; Liu, F.; Li, Z.; Jin, Z. EarCommand: "Hearing" Your Silent Speech Commands In Ear. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2022, 6, 1–28.
13. Cha, H.S.; Chang, W.D.; Im, C.H. Deep-learning-based real-time silent speech recognition using facial electromyogram recorded around eyes for hands-free interfacing in a virtual reality environment. Virtual Reality 2022, 26, 1047–1057.
14. Acosta, L.H.; Reinhardt, D. A survey on privacy issues and solutions for Voice-controlled Digital Assistants. Pervasive and Mobile Computing 2022, 80, 101523.
15. Abdolrahmani, A.; Kuber, R.; Branham, S.M. "Siri Talks at You" An Empirical Investigation of Voice-Activated Personal Assistant (VAPA) Usage by Individuals Who Are Blind. In Proceedings of the 20th international ACM SIGACCESS conference on computers and accessibility, 2018, pp. 249–258.
16. Cowan, B.R.; Pantidi, N.; Coyle, D.; Morrissey, K.; Clarke, P.; Al-Shehri, S.; Earley, D.; Bandeira, N. "What can I help you with?" infrequent users’ experiences of intelligent personal assistants. In Proceedings of the 19th international conference on human-computer interaction with mobile devices and services, 2017, pp. 1–12.
17. Pandey, L.; Hasan, K.; Arif, A.S. Acceptability of speech and silent speech input methods in private and public. In Proceedings of the 2021 CHI conference on human factors in computing systems, 2021, pp. 1–13.
18. Haliassos, A.; Mira, R.; Chen, H.; Landgraf, Z.; Petridis, S.; Pantic, M. Unified Speech Recognition: A Single Model for Auditory, Visual, and Audiovisual Inputs. arXiv preprint arXiv:2411.02256 2024.
19. Djilali, Y.A.D.; Narayan, S.; LeBihan, E.; Boussaid, H.; Almazrouei, E.; Debbah, M. Do VSR Models Generalize Beyond LRS3? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 6635–6644.
20. Haliassos, A.; Zinonos, A.; Mira, R.; Petridis, S.; Pantic, M. BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 11431–11435.
21. Haliassos, A.; Ma, P.; Mira, R.; Petridis, S.; Pantic, M. Jointly learning visual and auditory speech representations from raw data. arXiv preprint arXiv:2212.06246 2022.
22. Hsu, W.N.; Shi, B. u-hubert: Unified mixed-modal speech pretraining and zero-shot transfer to unlabeled modality. Advances in Neural Information Processing Systems 2022, 35, 21157–21170.
23. Ma, P.; Mira, R.; Petridis, S.; Schuller, B.W.; Pantic, M. Lira: Learning visual speech representations from audio through self-supervision. arXiv preprint arXiv:2106.09171 2021.
24. Chung, J.; Nagrani, A.; Zisserman, A. VoxCeleb2: Deep speaker recognition. Interspeech 2018.
25. Ephrat, A.; Mosseri, I.; Lang, O.; Dekel, T.; Wilson, K.; Hassidim, A.; Freeman, W.T.; Rubinstein, M. Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics (TOG) 2018, 37, 1–11.
26. Afouras, T.; Chung, J.S.; Senior, A.; Vinyals, O.; Zisserman, A. Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence 2018, 44, 8717–8727.
27. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.
28. Makino, T.; Liao, H.; Assael, Y.; Shillingford, B.; Garcia, B.; Braga, O.; Siohan, O. Recurrent neural network transducer for audio-visual speech recognition. In Proceedings of the 2019 IEEE automatic speech recognition and understanding workshop (ASRU). IEEE, 2019, pp. 905–912.
29. Serdyuk, D.; Braga, O.; Siohan, O. Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video 2022.
30. Liu, X.; Lakomkin, E.; Vougioukas, K.; Ma, P.; Chen, H.; Xie, R.; Doulaty, M.; Moritz, N.; Kolar, J.; Petridis, S.; et al. Synthvsr: Scaling up visual speech recognition with synthetic supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18806–18815.
31. Ahn, Y.J.; Park, J.; Park, S.; Choi, J.; Kim, K.E. SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization. arXiv preprint arXiv:2406.12233 2024.
32. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International conference on machine learning. PMLR, 2023, pp. 28492–28518.
33. Afouras, T.; Chung, J.S.; Zisserman, A. LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496 2018.
34. Ma, P.; Petridis, S.; Pantic, M. End-to-end audio-visual speech recognition with conformers. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7613–7617.
35. Chang, O.; Liao, H.; Serdyuk, D.; Shahy, A.; Siohan, O. Conformer is All You Need for Visual Speech Recognition. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10136–10140.
36. Ma, P.; Petridis, S.; Pantic, M. Visual speech recognition for multiple languages in the wild. Nature Machine Intelligence 2022, 4, 930–939.
37. Bulat, A.; Tzimiropoulos, G. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the IEEE international conference on computer vision, 2017, pp. 1021–1030.
38. Varga, A.; Steeneken, H.J. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech communication 1993, 12, 247–251.
39. Son Chung, J.; Senior, A.; Vinyals, O.; Zisserman, A. Lip reading sentences in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 6447–6456.
40. Fernandez-Lopez, A.; Chen, H.; Ma, P.; Haliassos, A.; Petridis, S.; Pantic, M. SparseVSR: Lightweight and noise robust visual speech recognition. arXiv preprint arXiv:2307.04552 2023.
41. Warden, P. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209 2018.
42. Afouras, T.; Chung, J.S.; Zisserman, A. Asr is all you need: Cross-modal distillation for lip reading. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 2143–2147.
43. Kim, S.; Jang, K.; Bae, S.; Cho, S.; Yun, S.Y. MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition. arXiv preprint arXiv:2502.10447 2025.

| Train VSR | Train ASR | Train AVSR | Shared Encoder | VSR WER (%) | ASR WER (%) | AVSR WER (%) |
|---|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | ✗ | 42.0 | - | - |
| ✗ | ✓ | ✗ | ✗ | - | 2.3 | - |
| ✗ | ✗ | ✓ | ✗ | - | - | 2.3 |
| ✓ | ✓ | ✗ | ✗ | 41.2 | 2.1 | - |
| ✓ | ✓ | ✗ | ✓ | 32.2 | 2.5 | - |
| ✓ | ✗ | ✓ | ✓ | 36.9 | - | 3.7 |
| ✓ | ✓ | ✓ | ✓ | 31.1 | 2.4 | 2.5 |

| Method | Total Hours | Multi-task Training | LM | LRS3 WER (%) | WildVSR WER (%) |
|---|---|---|---|---|---|
| No Additional Data | |||||
| Auto-AVSR [4] | 438 | ✗ | ✓ | 36.3 | - |
| USR [18] | 438 | ✓ | ✗ | 34.3 | - |
| SyncVSR [31] | 438 | ✗ | ✗ | 33.3 | - |
| SyncVSR [31] | 438 | ✗ | ✓ | 31.2 | - |
| MultiAVSR | 438 | ✓ | ✗ | 31.1 | 63.0 |
| MultiAVSR | 438 | ✓ | ✓ | 29.9 | 63.7 |
| Less than 1000h | |||||
| CM-Seq2Seq [34] | 595 | ✗ | ✓ | 43.3 | - |
| Auto-AVSR [4] | 818 | ✗ | ✓ | 33.0 | - |
| Auto-AVSR [4] | 661 | ✗ | ✗ | 32.7 | 62.3‡ |
| SyncVSR [31] | 661 | ✗ | ✗ | 30.4 | - |
| SyncVSR [31] | 661 | ✗ | ✓ | 28.1 | - |
| MultiAVSR | 661 | ✓ | ✗ | 28.1 | 57.8 |
| MultiAVSR | 661 | ✓ | ✓ | 27.3 | 58.2 |
| Less than 3000h | |||||
| u-HuBERT [22] | 2,221 | ✓ | ✗ | 27.2 | - |
| AV-HuBERT [27] | 1,759 | ✗ | ✗ | 26.9 | - |
| Auto-AVSR [4] | 1,759 | ✗ | ✗ | 24.6 | 49.3‡ |
| Auto-AVSR [4] | 1,902 | ✗ | ✓ | 23.5 | - |
| SyncVSR [31] | 1,992 | ✗ | ✗ | 23.4 | - |
| SyncVSR [31] | 1,992 | ✗ | ✓ | 21.5 | - |
| USR [18] | 1,759 | ✓ | ✗ | 22.3 | 46.8† |
| USR [18] | 1,759 | ✓ | ✓ | 21.5 | 46.4 |
| MultiAVSR | 1,968 | ✓ | ✗ | 21.6 | 44.7 |
| MultiAVSR | 1,968 | ✓ | ✓ | 21.0 | 46.0 |
| Greater than 3000h and Extra Proprietary Data | |||||
| RNN-T [28] | 30,000 | ✗ | ✗ | 33.6 | - |
| SparseVSR [40] | 3,068 | ✗ | ✗ | 19.5 | - |
| Auto-AVSR [4] | 3,448 | ✗ | ✓ | 19.1 | 38.6‡ |
| SynthVSR [30] | 7,100 | ✗ | ✗ | 18.2 | - |
| SynthVSR [30] | 7,100 | ✗ | ✓ | 16.9 | - |
| ViT 3D [29] | 90,000 | ✗ | ✗ | 17.0 | - |
| LP Conformer [35] | 100,000 | ✗ | ✗ | 12.8 | - |

| Noise | Model | Task | Clean | SNR 12.5 dB | SNR 7.5 dB | SNR 2.5 dB | SNR -2.5 dB | SNR -7.5 dB | Average |
|---|---|---|---|---|---|---|---|---|---|
| Pink | Auto-AVSR [4] | Audio | 1.0 | 1.4 | 1.9 | 4.3 | 13.1 | 56.8 | 15.5 |
| Pink | MultiAVSR | Audio | 1.2 | 1.4 | 1.9 | 3.7 | 12.0 | 43.0 | 12.4 |
| Pink | Auto-AVSR [4] | Audio-Visual | 0.9 | 1.2 | 1.4 | 2.3 | 6.0 | 16.2 | 5.4 |
| Pink | MultiAVSR | Audio-Visual | 1.2 | 1.2 | 1.6 | 2.0 | 3.9 | 9.8 | 3.7 |
| White | Auto-AVSR [4] | Audio | 1.0 | 2.1 | 4.0 | 10.4 | 30.2 | 88.9 | 27.1 |
| White | MultiAVSR | Audio | 1.2 | 2.2 | 4.0 | 9.7 | 27.2 | 76.0 | 23.8 |
| White | Auto-AVSR [4] | Audio-Visual | 0.9 | 1.4 | 2.3 | 4.3 | 9.5 | 24.2 | 8.3 |
| White | MultiAVSR | Audio-Visual | 1.2 | 1.6 | 2.2 | 3.4 | 7.0 | 14.7 | 5.8 |

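
The Average column in the noise table is consistent with an unweighted mean over the five noisy SNR levels, with the Clean column left out. The sketch below assumes that reading and checks it against the tabulated values; it also computes the relative change between two models at one SNR level. The helper names `mean_noisy_wer` and `relative_reduction` are hypothetical and used only for illustration.

```python
# Values copied from the noise table, ordered by SNR: 12.5, 7.5, 2.5, -2.5, -7.5 dB.
# Assumption: the Clean column is not part of the reported Average.
NOISY_WER = {
    ("Pink", "Auto-AVSR", "Audio"): [1.4, 1.9, 4.3, 13.1, 56.8],
    ("Pink", "MultiAVSR", "Audio"): [1.4, 1.9, 3.7, 12.0, 43.0],
    ("Pink", "Auto-AVSR", "Audio-Visual"): [1.2, 1.4, 2.3, 6.0, 16.2],
    ("Pink", "MultiAVSR", "Audio-Visual"): [1.2, 1.6, 2.0, 3.9, 9.8],
    ("White", "Auto-AVSR", "Audio"): [2.1, 4.0, 10.4, 30.2, 88.9],
    ("White", "MultiAVSR", "Audio"): [2.2, 4.0, 9.7, 27.2, 76.0],
    ("White", "Auto-AVSR", "Audio-Visual"): [1.4, 2.3, 4.3, 9.5, 24.2],
    ("White", "MultiAVSR", "Audio-Visual"): [1.6, 2.2, 3.4, 7.0, 14.7],
}

# Average column as printed in the table, in the same key order.
REPORTED_AVERAGE = dict(
    zip(NOISY_WER, [15.5, 12.4, 5.4, 3.7, 27.1, 23.8, 8.3, 5.8])
)


def mean_noisy_wer(values: list[float]) -> float:
    """Unweighted mean over the noisy SNR levels."""
    return sum(values) / len(values)


def relative_reduction(baseline: float, candidate: float) -> float:
    """Relative reduction of candidate versus baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


if __name__ == "__main__":
    for key, values in NOISY_WER.items():
        computed = round(mean_noisy_wer(values), 1)
        print(key, "computed:", computed, "reported:", REPORTED_AVERAGE[key])

    # Audio-visual models under pink noise at -7.5 dB (last list element).
    baseline = NOISY_WER[("Pink", "Auto-AVSR", "Audio-Visual")][-1]
    candidate = NOISY_WER[("Pink", "MultiAVSR", "Audio-Visual")][-1]
    print("relative reduction (%):", relative_reduction(baseline, candidate))
```

A relative measure is used for the comparison because absolute differences are dominated by the lowest SNR levels, where all error rates are large.
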
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).