Submitted: 17 July 2024
Posted: 19 July 2024
Abstract
Keywords:
I. Introduction
II. Theoretical Overview
A. Basic Speech Recognition Technology
B. Application of Deep Learning Technology in Speech Recognition
C. The Role of Large Language Models in Speech Recognition
III. Integration Method of Deep Learning and Large Language Model
A. Design Principles and Architecture of Integrated Model
B. Process of Integration Implementation
IV. Result Analysis
A. Experimental Setup
- TIMIT dataset: Contains 6,300 sentences spoken by 630 speakers from different dialect regions of American English. Each utterance comes with detailed phoneme-level annotation, which is used to train and test the accuracy of acoustic models.
- LibriSpeech dataset: A larger dataset containing about 1,000 hours of English speech recorded by 2,484 speakers from different backgrounds. It is divided into clean and noisy recording conditions and is used to evaluate model performance under different listening conditions.
- Common Voice dataset: A multilingual dataset provided by Mozilla, containing more than 2,000 hours of recordings that cover multiple languages and accents. It is used to test the multilingual adaptability of the model.
B. Performance Evaluation and Analysis
C. Discussion
V. Conclusion


| Dataset | Model type | WER (%) | RTF |
|---|---|---|---|
| TIMIT | baseline model | 18.5 | 0.09 |
| TIMIT | integrated model | 15.2 | 0.07 |
| LibriSpeech | baseline model | 10.3 | 0.12 |
| LibriSpeech | integrated model | 8.4 | 0.10 |
| Common Voice | baseline model | 22.0 | 0.15 |
| Common Voice | integrated model | 17.8 | 0.11 |
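
The WER and RTF columns in the table above can be read using the standard definitions: word error rate is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and real-time factor is processing time divided by audio duration. The following is a minimal sketch under those assumptions. The source does not state which scoring tool or text normalization the authors used, so the function names (`word_error_rate`, `real_time_factor`, `relative_reduction`, `summarize`) and the whitespace tokenization are illustrative choices, not the authors' implementation. The `RESULTS` values are copied from the table.

```python
from typing import Dict, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: plain whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio decoded."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline: float, integrated: float) -> float:
    """Fractional improvement of the integrated model over the baseline."""
    return (baseline - integrated) / baseline


# (WER in %, RTF) per model, copied from the results table.
RESULTS: Dict[str, Dict[str, Tuple[float, float]]] = {
    "TIMIT": {"baseline": (18.5, 0.09), "integrated": (15.2, 0.07)},
    "LibriSpeech": {"baseline": (10.3, 0.12), "integrated": (8.4, 0.10)},
    "Common Voice": {"baseline": (22.0, 0.15), "integrated": (17.8, 0.11)},
}


def summarize(results: Dict[str, Dict[str, Tuple[float, float]]]) -> Dict[str, Tuple[float, float]]:
    """Return (relative WER reduction %, relative RTF reduction %) per dataset."""
    summary = {}
    for dataset, rows in results.items():
        base_wer, base_rtf = rows["baseline"]
        int_wer, int_rtf = rows["integrated"]
        summary[dataset] = (
            round(100 * relative_reduction(base_wer, int_wer), 1),
            round(100 * relative_reduction(base_rtf, int_rtf), 1),
        )
    return summary


if __name__ == "__main__":
    for dataset, (wer_gain, rtf_gain) in summarize(RESULTS).items():
        print(f"{dataset}: WER -{wer_gain}% relative, RTF -{rtf_gain}% relative")
```

Applied to the table, this arithmetic gives relative WER reductions of roughly 17.8% (TIMIT), 18.4% (LibriSpeech), and 19.1% (Common Voice). All reported RTF values are below 1, meaning both models decode faster than real time.
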

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).