Submitted: 09 March 2024
Posted: 11 March 2024
Abstract
Keywords:
I. Introduction
II. Methods
A. Data Preprocessing
1. Data Source Selection and Collection Methodology
2. Data Collection Methodology
3. Transcription Generation
4. Feature Extraction
B. Implementation
1. Fine-Tuning and Evaluation
2. Loading the Model with 8-Bit Precision
3. Trainable Parameters and LoRA Optimization
4. Text Normalization
5. Fine-Tuning Process
6. Deployment of a Speech Recognition Application with Flask and Nginx
7. Application Architecture
III. Results
IV. Discussion
V. Conclusions
References


| Body System | Duration (hh:mm:ss) |
|---|---|
| Nervous System | 53:32:04 |
| Musculoskeletal System | 61:27:22 |
| Endocrine System | 23:42:37 |
| Respiratory System | 22:54:55 |
| Cardiovascular System | 26:11:11 |
| Digestive System | 19:43:15 |
| Reproductive System | 17:00:24 |
| Urinary System | 6:17:28 |
| Auditory System | 5:10:49 |
| Lymphatic and Blood System | 4:21:04 |
| Accent | Duration |
|---|---|
| African | 16h 33m |
| Algerian | 25h 19m |
| Canadian | 2h 38m |
| Native French | 155h 22m |
| Moroccan | 29h 17m |
| Tunisian | 11h 13m |
| Transcription Tools | WER |
|---|---|
| SpeechRecognition (Python library supporting various engines including Google Web Speech API, Sphinx, etc.) | 0.228 |
| oTranscribe (Free online tool designed for audio transcription with features like playback speed control, bookmarking, etc.) | 0.628 |
| Conformer-CTC Large (Voice recognition model based on the Conformer-CTC architecture, optimized for automatic speech transcription) | 0.200 |
| AssemblyAI (Automatic speech transcription service using machine learning models) | 0.100 |
| Whisper Large-v2 (Voice recognition model developed by OpenAI, designed for automatic speech transcription) | 0.142 |
| Wav2Vec 2.0 (Model developed by Facebook AI, excels in automatic speech recognition using self-supervised learning approach) | 0.257 |
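The WER values in the table above follow the standard definition: word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal self-contained sketch (libraries such as jiwer implement the same metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, dropping one word from a five-word reference yields a WER of 0.2; the example sentences here are illustrative, not drawn from the study's corpus.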
| Learning Rate | WER (%) | Normalized WER (%) | CER (%) | Normalized CER (%) | Train Loss | Validation Loss |
|---|---|---|---|---|---|---|
|  | 50.634 | 45.321 | 26.198 | 22.566 | 5.824 | 5.533 |
|  | 25.781 | 18.774 | 15.337 | 12.103 | 0.378 | 0.510 |
|  | 43.634 | 35.321 | 26.198 | 22.566 | 2.824 | 2.533 |
| Warmup Steps | Normalized WER (%) | Normalized CER (%) | Train Loss | Eval Loss |
|---|---|---|---|---|
| 250 | 21.563 | 18.235 | 0.499 | 0.667 |
| 500 | 18.774 | 15.337 | 0.288 | 0.510 |
| 750 | 18.720 | 15.349 | 0.270 | 0.499 |
| 1000 | 18.504 | 15.320 | 0.235 | 0.487 |
| 1250 | 23.757 | 19.813 | 0.507 | 0.689 |
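The warmup steps compared above control how long the learning rate ramps up before decaying, following the linear warmup-then-decay schedule that Hugging Face's `get_linear_schedule_with_warmup` implements. A sketch of that schedule (the `base_lr` and `total_steps` values below are illustrative, not the study's settings):

```python
def lr_at_step(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)      # warmup phase
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# With 500 warmup steps, the learning rate peaks at step 500
# and then decays linearly until the final step.
for s in (250, 500, 4000):
    print(s, lr_at_step(s, base_lr=1e-5, warmup_steps=500, total_steps=4000))
```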
| Adam Epsilon | Normalized WER (%) | Train Loss | Eval Loss |
|---|---|---|---|
|  | 18.273 | 0.229 | 0.477 |
|  | 17.640 | 0.219 | 0.457 |
| LoRA Configuration (R) | Normalized WER (%) | Train Loss | Eval Loss |
|---|---|---|---|
| R=42 | 17.660 | 0.223 | 0.467 |
| R=52 | 20.298 | 0.491 | 0.649 |
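The R values compared above set the rank of the two low-rank matrices that LoRA trains in place of a full weight update (ΔW = B·A, with B of shape d_out×r and A of shape r×d_in). A small helper makes the parameter savings concrete; the 1280 hidden size below is Whisper large's dimension and is used here only for illustration:

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Trainable parameters in a full weight update vs. a rank-r LoRA update."""
    full = d_in * d_out        # dense ΔW
    lora = r * (d_in + d_out)  # B (d_out x r) plus A (r x d_in)
    return full, lora

# One 1280x1280 projection at rank 42: ~1.6M dense parameters
# shrink to ~108k trainable LoRA parameters.
print(lora_param_counts(1280, 1280, 42))
```

This is why the rank sweep matters: a larger R trains more parameters per adapted layer, which does not necessarily improve WER, as the R=52 row shows.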
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).