Submitted: 27 July 2025
Posted: 28 July 2025
Abstract
Keywords:
1. Introduction
2. Related Works
3. Materials and Methods
3.1. Speech Recognition and Synthesis Systems for the Kazakh Language
| Audio corpus | Data type | Volume | Accessibility |
|---|---|---|---|
| Common Voice [36] | Audio recordings with transcriptions | 150+ hours | Open access |
| KazakhTTS [37] | Audio–text pairs | 271 hours | Conditionally open |
| Kazakh Speech Corpus [18,38] | Speech with transcriptions | 330 hours | Open access |
| Kazakh Speech Dataset (KSD) [2,39] | Speech recordings | 554 hours | Open access |
3.2. Audio and Text Dataset Formation
3.3. ASR Systems and Selection Criteria
- Availability. The most important criterion was the system’s suitability for large-scale use. This covers both technical availability (the ability to deploy and integrate quickly) and legal openness, such as a free license or access to the source code. Preference was given to open-source solutions that required no significant financial investment at the implementation stage.
- Recognition quality. A key technical parameter was the linguistic accuracy of the system: its ability to correctly interpret both standard and accented speech while accounting for the language’s morphological and syntactic features. Particular attention was also paid to the contextual relevance of the recognized text, i.e., the system’s ability to preserve semantic integrity when converting oral speech into written form.
- Efficiency of subsequent processing. An additional criterion was the system’s ability to handle large volumes of input data, implying not only accurate recognition but also support for further processing (for example, automatic translation or categorization of content). Special importance was given to architectural scalability and support for batch processing of audio files, ensuring high throughput.
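The batch-processing requirement above can be sketched as a simple worker-pool pipeline. This is a minimal illustration, not any evaluated system’s API: `transcribe_one` is a hypothetical placeholder for whichever recognizer is chosen (a local model or a cloud API call), and the extension list is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed set of supported input formats (illustrative only).
AUDIO_EXTS = {".wav", ".mp3", ".flac", ".ogg"}

def find_audio_files(root):
    """Collect supported audio files under `root`, sorted for reproducible batches."""
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in AUDIO_EXTS)

def transcribe_one(path):
    """Hypothetical stand-in for a real ASR call (e.g., a Whisper model or an API)."""
    return {"file": path.name, "text": ""}

def batch_transcribe(root, workers=4):
    """Fan files out to a thread pool; I/O-bound API calls overlap well with threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe_one, find_audio_files(root)))
```

For CPU-bound local inference, a process pool (or per-GPU sharding) would replace the thread pool, but the batching structure stays the same.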
3.4. Text-to-Speech (TTS)
| Models | Advantages | Drawbacks |
|---|---|---|
| MMS | Open-source and publicly available; supports 1,100+ languages, including Kazakh; unified model for ASR, TTS, and language identification; trained on a vast amount of data | Less optimized for real-time use; may show degraded performance on specific dialects |
| TurkicTTS | Specifically designed for ten Turkic languages (Azerbaijani, Bashkir, Kazakh, Kyrgyz, Sakha, Tatar, Turkish, Turkmen, Uyghur, and Uzbek); incorporates phonological features of Turkic speech; provides open research resources and benchmarks | Sometimes misidentifies Turkic languages; limited domain coverage and variable audio quality; research-focused with minimal production integration |
| KazakhTTS2 | Tailored for high-quality Kazakh TTS; improved naturalness and prosody over earlier versions; developed for national applications; open-source and available via GitHub | Limited to Kazakh only; requires fine-tuning for expressive or emotional speech |
| ElevenLabs | High-fidelity, human-like voice synthesis; supports multilingual and emotional speech; user-friendly web and API interfaces; fast inference and low-latency output | Commercial licensing with usage restrictions; no access to full training data or fine-tuning options |
| OpenAI TTS | Advanced TTS with realistic prosody and expressiveness; integrated with GPT models for contextual generation; robust handling of punctuation, emphasis, and emotion | Closed model, not open-source; limited user control and customization; subject to usage caps or API quotas |
3.5. Text and Audio Quality Metrics
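The word- and character-level error metrics used in this section (WER, CER) are both ratios of edit distance to reference length. As an illustration only (not the evaluation code used in this study), a minimal pure-Python sketch:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, O(len(hyp)) memory."""
    dp = list(range(len(hyp) + 1))          # row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i              # prev holds the diagonal cell
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution / match
    return dp[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

In practice, established libraries (e.g., jiwer for WER/CER, sacrebleu for BLEU and chrF) are preferable, since they also handle text normalization consistently.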
4. Results
5. Discussion
6. Conclusion and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| ASR | Automatic speech recognition |
| TTS | Text-to-speech |
| STT | Speech-to-Text |
| E2E | End-to-End |
| WER | Word error rate |
| TER | Translation Edit Rate |
| BLEU | Bilingual Evaluation Understudy |
| chrF | CHaRacter-level F-score |
| LoRA | Low-Rank Adaptation |
| CER | Character error rate |
| KSC | Kazakh Speech Corpus |
| MT | Machine translation |
| HMM | Hidden Markov Models |
| PESQ | Perceptual Evaluation of Speech Quality |
| STOI | Short-Time Objective Intelligibility |
| USM | Universal Speech Model |
| USC | Uzbek Speech Corpus |
| MCD | Mel Cepstral Distortion |
| LSD | Log-spectral distance |
| DNSMOS | Deep Noise Suppression Mean Opinion Score |
| DNS | Deep Noise Suppression |
| MOS | Mean Opinion Score |
| MSE | Mean square error |
| MMS | Massively Multilingual Speech |
| MFCC | Mel-frequency cepstral coefficients |
| ISSAI | Institute of Intelligent Systems and Artificial Intelligence |
| NU | Nazarbayev University |
| CTC | Connectionist temporal classification |
| KSD | Kazakh Speech Dataset |
| AI | Artificial Intelligence |
| COMET | Crosslingual Optimized Metric for Evaluation of Translation |
| RNN-T | Recurrent neural network-transducer |
| LSTM | Long Short-Term Memory |
| UzLM | Uzbek language model |
| STS | Speech-to-speech |
| LID | Language identifier |
| DL | Deep Learning |
| IPA | International Phonetic Alphabet |
| API | Application Programming Interface |
| GPT | Generative Pre-trained Transformer |
| WebRTC | Web Real-Time Communication |
| MOS-LQO | Mean Opinion Score - Listening Quality Objective |
| HiFi-GAN | Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis |
| WaveGAN | Generative adversarial network for unsupervised synthesis of raw-waveform audio |
References
- Bekarystankyzy, A.; Mamyrbayev, O.; Mendes, M.; et al. Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets. Sci. Rep. 2024, 14, 13835. [Google Scholar] [CrossRef]
- Kadyrbek, N.; Mansurova, M.; Shomanov, A.; Makharova, G. The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters. Big Data Cogn. Comput. 2023, 7, 132. [Google Scholar] [CrossRef]
- Yeshpanov, R.; Mussakhojayeva, S.; Khassanov, Y. Multilingual Text-to-Speech Synthesis for Turkic Languages Using Transliteration. In Proceedings of the INTERSPEECH; 2023; pp. 5521–5525. [Google Scholar] [CrossRef]
- Mussakhojayeva, S.; Janaliyeva, A.; Mirzakhmetov, A.; Khassanov, Y.; Varol, H.A. KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset. In Proceedings of the INTERSPEECH; 2021; pp. 2786–2790. [Google Scholar] [CrossRef]
- Basak, S.; Agrawal, H.; Jena, S.; Gite, S.; Bachute, M.; Pradhan, B.; Assiri, M. Challenges and Limitations in Speech Recognition Technology: A Critical Review of Speech Signal Processing Algorithms, Tools and Systems. Comput. Model. Eng. Sci. 2023, 135, 1053–1089. [Google Scholar] [CrossRef]
- Rosenberg, A.; Zhang, Y.; Ramabhadran, B.; et al. Speech recognition with augmented synthesized speech. In Proceedings of the IEEE ASRU 2019; pp. 996–1002.
- Zhang, C.; Li, B.; Sainath, T.; et al. Streaming end-to-end multilingual speech recognition with joint language identification. In Proceedings of the INTERSPEECH; 2022. [Google Scholar]
- Zhang, Y.; Han, W.; Qin, J.; et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv 2023, arXiv:2303.01037. [Google Scholar]
- Fendji, J.L.K.E.; Tala, D.C.M.; Yenke, B.O.; Atemkeng, M. Automatic Speech Recognition Using Limited Vocabulary: A Survey. Appl. Artif. Intell. 2022, 36. [Google Scholar] [CrossRef]
- Metze, F.; Gandhe, A.; Miao, Y.; et al. Semi-supervised training in low-resource ASR and KWS. In Proceedings of the ICASSP 2015; pp. 5036–5040. [CrossRef]
- Du, W.; Maimaitiyiming, Y.; Nijat, M.; et al. Automatic Speech Recognition for Uyghur, Kazakh, and Kyrgyz: An Overview. Appl. Sci. 2023, 13, 326. [Google Scholar] [CrossRef]
- Mukhamadiyev, A.; Mukhiddinov, M.; Khujayarov, I.; Ochilov, M.; Cho, J. Development of Language Models for Continuous Uzbek Speech Recognition System. Sensors 2023, 23, 1145. [Google Scholar] [CrossRef] [PubMed]
- Veitsman, Y.; Hartmann, M. Recent Advancements and Challenges of Turkic Central Asian Language Processing. In Proceedings of the First Workshop on Language Models for Low-Resource Languages; ACL: Abu Dhabi, UAE, 2025; pp. 309–324. [Google Scholar]
- Oyucu, S. A Novel End-to-End Turkish Text-to-Speech (TTS) System via Deep Learning. Electronics 2023, 12, 1900. [Google Scholar] [CrossRef]
- Polat, H.; Turan, A.K.; Koçak, C.; Ulaş, H.B. Implementation of a Whisper Architecture-Based Turkish ASR System and Evaluation of Fine-Tuning with LoRA Adapter. Electronics 2024, 13, 4227. [Google Scholar] [CrossRef]
- Musaev, M.; Mussakhojayeva, S.; Khujayorov, I.; et al. USC: An open-source Uzbek speech corpus and initial speech recognition experiments. In Speech and Computer. Lecture Notes in Computer Science; Springer, 2021; pp. 437–447. [Google Scholar]
- Kozhirbayev, Z. Kazakh Speech Recognition: Wav2vec2.0 vs. Whisper. J. Adv. Inf. Technol. 2023, 14, 1382–1389. [Google Scholar] [CrossRef]
- Khassanov, Y.; Mussakhojayeva, S.; Mirzakhmetov, A.; et al. A Crowdsourced Open-Source Kazakh Speech Corpus and Initial Speech Recognition Baseline. Proceedings of EACL 2021; pp. 697–706.
- Kozhirbayev, Z.; Islamgozhayev, T. Cascade Speech Translation for the Kazakh Language. Appl. Sci. 2023, 13, 8900. [Google Scholar] [CrossRef]
- Orken, M.; Dina, O.; Keylan, A.; et al. A study of transformer-based end-to-end speech recognition system for Kazakh language. Sci. Rep. 2022, 12, 8337. [Google Scholar] [CrossRef]
- Kapyshev, G.; Nurtas, M.; Altaibek, A. Speech recognition for Kazakh language: a research paper. Procedia Comput. Sci. 2024, 231, 369–372. [Google Scholar] [CrossRef]
- Mussakhojayeva, S.; et al. Noise-Robust Multilingual Speech Recognition and the Tatar Speech Corpus. In Proceedings of the ICAIIC 2024, Osaka, Japan; 2024; pp. 732–737. [Google Scholar] [CrossRef]
- Mussakhojayeva, S.; Khassanov, Y.; Varol, H.A. KSC2: An industrial-scale open-source Kazakh speech corpus. Proceedings of INTERSPEECH 2022; pp. 1367–1371.
- Vaswani, A.; Shazeer, N.; Parmar, N.; et al. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. Proceedings of INTERSPEECH 2020; pp. 5036–5040. [CrossRef]
- Radford, A.; Kim, J.; Xu, T.; et al. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
- Watanabe, S.; Hori, T.; Karita, S.; et al. ESPnet: End-to-End Speech Processing Toolkit. arXiv 2018, arXiv:1804.00015. [Google Scholar]
- Conneau, A.; Khandelwal, K.; Goyal, N.; et al. Unsupervised Cross-lingual Representation Learning at Scale. Proceedings of ACL 2020; pp. 8440–8451.
- Hu, E.; Shen, Y.; Wallis, P.; et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- ESPnet Toolkit. Available online: https://github.com/espnet/espnet (accessed on 10 June 2025).
- Ghoshal, A.; Boulianne, G.; Burget, L.; et al. The Kaldi speech recognition toolkit. In Proceedings of the ASRU; 2011. [Google Scholar]
- Wolf, T.; et al. Transformers: State-of-the-art Natural Language Processing. Proceedings of EMNLP 2020; pp. 38–45. [CrossRef]
- Shen, J.; et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proceedings of the ICASSP 2018; pp. 4779–4783. [CrossRef]
- Ren, Y.; et al. FastSpeech 2: Fast and High-Quality End-to-End Text to Speech. arXiv 2020, arXiv:2006.04558. [Google Scholar]
- Kong, J.; et al. HiFi-GAN: Generative Adversarial Network for Efficient and High Fidelity Speech Synthesis. arXiv 2020, arXiv:2010.05646. [Google Scholar]
- Common Voice. Available online: https://commonvoice.mozilla.org/ru/datasets (accessed on 10 June 2025).
- KazakhTTS. Available online: https://github.com/IS2AI/Kazakh_TTS (accessed on 10 June 2025).
- Kazakh Speech Corpus. Available online: https://www.openslr.org/102/ (accessed on 10 June 2025).
- Kazakh Speech Dataset. Available online: https://www.openslr.org/140/ (accessed on 10 June 2025).
- GPT-4o-transcribe (OpenAI). Available online: https://platform.openai.com/docs/models/gpt-4o-transcribe (accessed on 2 July 2025).
- Whisper. Available online: https://github.com/openai/whisper (accessed on 2 June 2025).
- Soyle. Available online: https://github.com/IS2AI/Soyle (accessed on 2 June 2025).
- ElevenLabs Scribe. Available online: https://elevenlabs.io/docs/capabilities/speech-to-text (accessed on 20 June 2025).
- Voiser. Available online: https://voiser.net/ (accessed on 30 June 2025).
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: a Method for Automatic Evaluation of Machine Translation. Proceedings of ACL 2002; pp. 311–318.
- Gillick, L.; Cox, S. Some Statistical Issues in the Comparison of Speech Recognition Algorithms. In Proceedings of the ICASSP 1989; Vol. 1, pp. 532–535. [Google Scholar] [CrossRef]
- Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of AMTA 2006. https://www.cs.umd.edu/~snover/pub/amta06_ter_final.pdf.
- Popović, M. chrF: character n-gram F-score for automatic MT evaluation. Proceedings of WMT 2015; pp. 392–395.
- Rei, R.; Farinha, A.C.; Martins, A.F.T. COMET: A Neural Framework for MT Evaluation. Proceedings of EMNLP 2020; pp. 2685–2702.
- Kubichek, R. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of the IEEE Pacific Rim Conference 1993; Vol. 1, pp. 125–128. [Google Scholar] [CrossRef]
- Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ). In Proceedings of the ICASSP 2001; Vol. 2, pp. 749–752. [Google Scholar] [CrossRef]
- Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
- Reddy, A.; et al. DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors. In Proceedings of the ICASSP; 2020. [Google Scholar]
- MMS (Massively Multilingual Speech). Available online: https://github.com/facebookresearch/fairseq/tree/main/examples/mms (accessed on 10 June 2025).
- TurkicTTS. Available online: https://github.com/IS2AI/TurkicTTS (accessed on 12 June 2025).
- KazakhTTS2. Available online: https://github.com/IS2AI/Kazakh_TTS (accessed on 12 June 2025).
- ElevenLabs TTS. Available online: https://elevenlabs.io/docs/capabilities/text-to-speech (accessed on 2 July 2025).
- OpenAI TTS. Available online: https://platform.openai.com/docs/guides/text-to-speech (accessed on 30 June 2025).
| Models | Advantages | Drawbacks |
|---|---|---|
| ChatGPT Transcribe | State-of-the-art accuracy; real-time streaming via WebSocket/WebRTC; robust multilingual support; strong in noisy environments; integration with multimodal GPT-4o | Access limited to the OpenAI API; not open-source; requires a consistent internet connection |
| Whisper | Open-source and freely available; high multilingual accuracy; language detection and translation support; pretrained models in various sizes | Possible lags in low-latency applications; large models are resource-intensive |
| Soyle | Focus on Kazakh and low-resource Turkic languages; local development for national use; regional speech support | Limited language coverage; scarce public documentation; restricted deployment options |
| ElevenLabs | Fast transcription optimized for the voice-cloning ecosystem; high-quality speaker labeling; integration with TTS tools | Primarily focused on English; closed-source; transcription quality degrades in noisy environments |
| Voiser | High accuracy in Kazakh and Turkish; real-time and batch transcription; punctuation and speaker diarization; cloud access | Proprietary and closed-source; limited global language range; little academic benchmarking |
| Model | BLEU% | WER% | TER% | chrF | COMET |
|---|---|---|---|---|---|
| Whisper | 13.22 | 77.10 | 74.87 | 55.30 | 0.42 |
| GPT4 Transcribe | 45.57 | 43.75 | 42.35 | 76.99 | 0.86 |
| Soyle | 38.66 | 48.14 | 36.30 | 80.35 | 0.97 |
| ElevenLabs | 43.33 | 42.77 | 41.89 | 77.36 | 0.88 |
| Voiser | 38.41 | 40.65 | 31.97 | 80.88 | 1.01 |
| Model | BLEU% | WER% | TER% | chrF | COMET |
|---|---|---|---|---|---|
| Whisper | 21.97 | 60.55 | 54.36 | 68.36 | 0.30 |
| GPT4 Transcribe | 53.46 | 36.22 | 23.04 | 81.15 | 1.02 |
| Soyle | 74.93 | 18.61 | 18.61 | 95.60 | 1.23 |
| ElevenLabs | 59.45 | 30.84 | 17.27 | 88.04 | 1.13 |
| Voiser | 47.04 | 37.11 | 22.95 | 84.51 | 1.05 |
| Model | STOI | PESQ | MCD | LSD | DNSMOS |
|---|---|---|---|---|---|
| MMS | 0.09 | 1.12 | 145.16 | 1.15 | 4.63 |
| TurkicTTS | 0.11 | 1.16 | 129.54 | 1.06 | 5.92 |
| KazakhTTS2 | 0.10 | 1.09 | 150.53 | 1.11 | 8.79 |
| ElevenLabs | 0.10 | 1.10 | 164.29 | 1.34 | 6.13 |
| OpenAI TTS | 0.09 | 1.12 | 123.44 | 1.16 | 7.43 |
| Model | STOI | PESQ | MCD | LSD | DNSMOS |
|---|---|---|---|---|---|
| MMS | 0.12 | 1.11 | 148.40 | 1.20 | 3.91 |
| TurkicTTS | 0.15 | 1.14 | 145.49 | 1.12 | 6.39 |
| KazakhTTS2 | 0.12 | 1.07 | 137.03 | 1.12 | 8.96 |
| ElevenLabs | 0.13 | 1.08 | 139.75 | 1.29 | 6.38 |
| OpenAI TTS | 0.14 | 1.14 | 117.11 | 1.19 | 7.04 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).