Introduction
Garo is a Tibeto-Burman language spoken by approximately 1.2 million people primarily in Meghalaya, India, and parts of Bangladesh. Despite its significant speaker population, Garo remains technologically underserved with limited digital resources and no publicly available speech recognition systems.
Automatic speech recognition has seen remarkable progress with transformer-based models like Whisper [1], which demonstrate strong multilingual capabilities. However, Whisper's training data does not include Garo, resulting in poor zero-shot performance.
In this work, we fine-tune Whisper-small on Garo speech data from the Vaani dataset [2] and evaluate its performance comprehensively. Our contributions include: (1) a publicly available ASR model for Garo achieving under 10% WER; (2) comprehensive evaluation with detailed error analysis; (3) analysis of challenges specific to Garo phonology and morphology; and (4) open release of the trained model for community use.
Dataset
We use the Vaani-transcription-part dataset [2], a component of the larger Vaani project by ARTPARK-IISc. The dataset contains spontaneous, image-prompted speech collected across 165 districts in India, covering 109 languages. This collection methodology captures naturalistic speech patterns as speakers describe images shown to them.
The Garo subset is split into standard training, validation, and test partitions following an 80/8/10 ratio. Audio characteristics include a mean duration of approximately 4 seconds per sample at 16 kHz sampling rate. The relatively low type-token ratio in the dataset reflects Garo’s agglutinative nature, where productive morphology creates diverse word forms from a smaller set of roots and affixes.
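As a rough sketch, the split can be reproduced with the Hugging Face datasets library; the Garo configuration name and the exact split proportions below are assumptions for illustration, not the authors' exact procedure:

```python
from datasets import load_dataset, DatasetDict

# Load the Garo portion of the Vaani transcription data.
# NOTE: the configuration name "Garo" is an assumption; check the dataset card.
garo = load_dataset("ARTPARK-IISc/Vaani-transcription-part", name="Garo", split="train")

# Approximate the reported 80/8/10 ratio with two successive splits
# (the exact partitioning used in the paper is not specified).
first = garo.train_test_split(test_size=0.18, seed=42)
rest = first["test"].train_test_split(test_size=10 / 18, seed=42)

splits = DatasetDict(
    {
        "train": first["train"],      # ~82%, close to the reported 80%
        "validation": rest["train"],  # ~8%
        "test": rest["test"],         # ~10%
    }
)
print({name: len(ds) for name, ds in splits.items()})
```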
We apply the following preprocessing steps: removal of XML tags and bracketed corrections, conversion to lowercase, removal of punctuation except word-internal characters (·, -), and whitespace normalization. Audio is resampled to 16 kHz and converted to mel-spectrograms following Whisper’s preprocessing pipeline.
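A minimal sketch of this normalization and feature-extraction pipeline is shown below; the regular expressions, field names, and the use of librosa are our reading of the steps above, and the exact rules used in training may differ:

```python
import re

import librosa
from transformers import WhisperProcessor

# The processor bundles Whisper's log-mel feature extractor and tokenizer.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

def normalize_text(text: str) -> str:
    """Text cleaning as described above (rules are approximate)."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip XML-style tags
    text = re.sub(r"\[[^\]]*\]", " ", text)   # strip bracketed corrections
    text = text.lower()
    # Remove punctuation but keep the word-internal characters · and -.
    text = re.sub(r"[^\w\s·-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def prepare_example(example: dict) -> dict:
    """Resample audio to 16 kHz and convert to Whisper input features."""
    # "audio_path" and "transcription" are assumed field names.
    audio, _ = librosa.load(example["audio_path"], sr=16_000)
    features = processor(audio, sampling_rate=16_000, return_tensors="pt")
    labels = processor.tokenizer(normalize_text(example["transcription"])).input_ids
    return {"input_features": features.input_features[0], "labels": labels}
```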
Methodology
Whisper-small consists of 12 encoder and 12 decoder transformer layers with 244M total parameters and a 30-second context window. The encoder processes mel-spectrogram inputs while the decoder generates transcriptions autoregressively.
We fine-tune all model parameters using standard practices including the AdamW optimizer, learning rate scheduling with warmup, mixed precision training (FP16), and gradient accumulation. Checkpoints are evaluated periodically on the validation set, and the best-performing checkpoint is selected based on validation loss. Full training hyperparameters will be detailed in the extended version of this work.
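A hedged sketch of this setup using the Hugging Face Seq2SeqTrainer follows; every numeric value is an illustrative placeholder, since the actual hyperparameters are deferred to the extended version of this work:

```python
from transformers import (
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    WhisperForConditionalGeneration,
)

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# All numeric values are placeholders, not the values used in the paper.
# The Trainer uses AdamW and a linear warmup/decay schedule by default.
args = Seq2SeqTrainingArguments(
    output_dir="garo-whisper-small",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,     # effective batch size 32 (placeholder)
    learning_rate=1e-5,                # placeholder
    warmup_steps=500,                  # placeholder
    fp16=True,                         # mixed-precision training
    eval_strategy="steps",             # "evaluation_strategy" in older releases
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,       # select checkpoint by validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=10,               # placeholder
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    # Datasets are assumed to be the splits from the dataset sketch,
    # mapped with `prepare_example`; a collator that pads input_features
    # and labels is also required.
    train_dataset=splits["train"],
    eval_dataset=splits["validation"],
)
trainer.train()
```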
Evaluation
We evaluate using two standard ASR metrics: Word Error Rate (WER) and Character Error Rate (CER). WER is computed as (S + D + I) / N, where S, D, and I are the numbers of substitution, deletion, and insertion errors, and N is the total number of words in the reference. CER follows the same formulation at the character level. For fair comparison, we apply consistent normalization to both references and predictions, including lowercase conversion, punctuation removal, and whitespace normalization.
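The metric computation can be reproduced, for example, with the jiwer library; the normalization function below mirrors the steps listed above and is our approximation, and the sample strings are purely illustrative:

```python
import re

import jiwer

def normalize(text: str) -> str:
    """Lowercase, strip punctuation except word-internal · and -, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s·-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def score(references: list[str], hypotheses: list[str]) -> dict:
    refs = [normalize(r) for r in references]
    hyps = [normalize(h) for h in hypotheses]
    return {
        "wer": jiwer.wer(refs, hyps),  # (S + D + I) / N at the word level
        "cer": jiwer.cer(refs, hyps),  # same edit-distance formula over characters
    }

# Illustrative strings only, not taken from the test set.
print(score(["an·cheng re·angaha"], ["an·cheng reangaha"]))
```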
Results
Main Results
Our fine-tuned model achieves a WER of 9.74% (97.5% relative improvement) and CER of 3.82% (98.1% relative improvement). The model produces perfect transcriptions (0% WER) for over 60% of test samples. The dramatic improvement over zero-shot performance demonstrates the effectiveness of fine-tuning for low-resource languages not represented in Whisper’s training data.
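These relative figures follow directly from Table 1: (382.7 − 9.74) / 382.7 ≈ 97.5% for WER and (203.5 − 3.82) / 203.5 ≈ 98.1% for CER.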
Table 1. Performance comparison with zero-shot baseline.

| Model | WER (%) | CER (%) |
| --- | --- | --- |
| Whisper-small (zero-shot) | 382.7 | 203.5 |
| GaroASR (fine-tuned) | 9.74 | 3.82 |
Error Distribution
The zero median for both WER and CER indicates that most samples are transcribed perfectly, with errors concentrated in a subset of challenging cases. This bimodal distribution suggests the model has learned core Garo patterns effectively but struggles with specific phenomena.
Inference Speed
The model achieves a real-time factor of approximately 0.05× (i.e., transcription runs roughly 20× faster than the audio duration), enabling practical deployment for real-time applications including live transcription and voice interfaces.
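Real-time factor is simply processing time divided by audio duration; a minimal way to measure it is sketched below, where the model identifier stands in for the released fine-tuned checkpoint and the file path is illustrative:

```python
import time

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Placeholder identifier for the fine-tuned Garo checkpoint.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").eval()

audio, sr = librosa.load("sample.wav", sr=16_000)  # illustrative file path
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    model.generate(inputs.input_features)
elapsed = time.perf_counter() - start

rtf = elapsed / (len(audio) / sr)  # RTF = processing time / audio duration
print(f"RTF = {rtf:.3f} (values below 1 mean faster than real time)")
```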
Error Analysis
Error Patterns
Code-switching with English: Analysis of samples containing English loanwords reveals significant challenges, with a notable proportion exhibiting high error rates (WER > 30%). These errors often cascade, affecting surrounding Garo words as the model struggles to reconcile conflicting language patterns.
Annotation noise: A subset of samples contain bracketed corrections in the reference transcriptions, indicating transcriber uncertainty or dialectal variation. These corrections introduce ambiguity into evaluation, as the model may produce valid alternatives that differ from the corrected reference.
Compound word boundaries: Garo’s agglutinative morphology creates compound forms with hyphens. The model occasionally segments these compounds incorrectly, reflecting tension between Garo’s productive morphology and the word-tokenization assumptions of the underlying model.
CER vs WER Analysis
The CER/WER ratio of 0.39 indicates that errors are predominantly partial word mistakes rather than complete word substitutions or omissions. For agglutinative languages like Garo, a single morpheme error can invalidate an entire word from a WER perspective while preserving most characters. The low CER suggests the model captures Garo phonology and most morphological patterns effectively.
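For reference, this ratio follows directly from the headline results: 3.82 / 9.74 ≈ 0.39.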
Discussion
The achieved WER of 9.74% represents strong performance for a low-resource language with no prior ASR systems. The 97.5% improvement over zero-shot performance validates fine-tuning as an effective strategy for low-resource ASR, consistent with findings from other low-resource languages [12,13]. The low CER/WER ratio (0.39) provides evidence that the model handles Garo's agglutinative morphology effectively.
The error rate on code-switched utterances highlights a significant limitation, likely reflecting limited English-Garo code-switching in training data, conflicting phonological and orthographic patterns between languages, and the model’s bias toward producing Garo outputs. Code-switching is common in multilingual communities, making this a priority for future work.
This ASR system enables several practical applications for the Garo-speaking community including speech-to-text services, educational tools for language learning and literacy, documentation of oral traditions and cultural heritage, accessibility features for voice-controlled interfaces, and support for content creation in Garo language media. The real-time inference speed makes the model suitable for interactive applications.
Conclusions
We present a publicly available ASR system for Garo, achieving 9.74% WER and 3.82% CER through fine-tuning Whisper-small. The model produces perfect transcriptions for over 60% of test cases and operates at 20× real-time speed. Our error analysis reveals that code-switching with English and annotation noise present the primary challenges, while the model handles native Garo phonology and morphology effectively. By releasing this model publicly, we aim to support technological inclusion for the Garo-speaking community and contribute to the broader effort of developing language technologies for low-resource languages worldwide.
Limitations
This work has several limitations: high error rate on English loanwords limits applicability in code-switching contexts; training data may not capture full dialectal variation across Garo-speaking regions; annotation uncertainty affects a portion of test samples; only Whisper-small was evaluated, and larger model variants may achieve better performance; evaluation was conducted on a single test set from one collection methodology; and the model has not been tested on conversational or telephone speech.
Ethics Statement
This research aims to support linguistic diversity and technological inclusion for the Garo-speaking community. The Vaani dataset was collected with appropriate consent and ethical approval from participants. We release this model openly to benefit speakers of Garo and researchers working on low-resource languages. We acknowledge that ASR technology can be used for both beneficial and harmful purposes, including surveillance. We encourage responsible deployment that respects speaker privacy, cultural context, and community needs.
Funding
This research was funded by MWire Labs.
Data Availability Statement
Acknowledgments
We thank ARTPARK-IISc for creating and releasing the Vaani dataset, which made this work possible. We also acknowledge the Garo-speaking participants who contributed their speech to the dataset.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, 2023; pp. 28492–28518. [Google Scholar]
- ARTPARK-IISc. Vaani: Capturing the true diversity of India’s spoken languages. 2024. Available online: https://huggingface.co/datasets/ARTPARK-IISc/Vaani-transcription-part (accessed on 22 January 2025).
- Besacier, L.; Barnard, E.; Karpov, A.; Schultz, T. Automatic speech recognition for under-resourced languages: A survey. Speech Commun. 2014, 56, 85–100. [Google Scholar] [CrossRef]
- Conneau, A.; Baevski, A.; Collobert, R.; Mohamed, A.; Auli, M. Unsupervised cross-lingual representation learning for speech recognition. In Proceedings of Interspeech, 2020; pp. 2426–2430. [Google Scholar]
- Gisslen, N.R.; Hedlund, E.F.; Brown, S.; Bird, S. Breaking the transcription bottleneck: Fine-tuning ASR models for extremely low-resource fieldwork languages. In Proceedings of the 6th Workshop on the Use of Computational Methods in the Study of Endangered Languages, 2025. [Google Scholar]
- Pratap, V.; Tjandra, A.; Shi, B.; Tomasello, P.; Babu, A.; Kundu, S.; et al. Scaling speech technology to 1,000+ languages. arXiv 2023, arXiv:2305.13516. [Google Scholar]
- Prajapati, A.; Kumar, K.; Jyothi, P. Improving on the limitations of the ASR model in low-resourced environments using parameter-efficient fine-tuning. In Proceedings of the 21st International Conference on Natural Language Processing (ICON), 2024. [Google Scholar]
- Schneider, S.; Baevski, A.; Collobert, R.; Auli, M. wav2vec: Unsupervised pre-training for speech recognition. In Proceedings of Interspeech, 2019; pp. 3465–3469. [Google Scholar]
- Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems, 2020; Volume 33, pp. 12449–12460. [Google Scholar]
- Zhang, Y.; Park, D.S.; Han, W.; Qin, J.; Gulati, A.; Shor, J.; et al. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv 2023, arXiv:2303.01037. [Google Scholar] [CrossRef]
- Prabhavalkar, R.; Rao, K.; Sainath, T.N.; Li, B.; Johnson, L.; Jaitly, N. Multilingual speech recognition for Indian languages. In Intelligent Computing: Proceedings of the 2022 Computing Conference, 2022; Springer; pp. 528–542. [Google Scholar]
- Sinha, S.; Saabith, A.L.S.; Jyothi, P. Model adaptation for ASR in low-resource Indian languages. arXiv 2023, arXiv:2307.07948. [Google Scholar] [CrossRef]
- Muralidaran, D.; Singh, Y.K.; Kumar, R. Whispering in Ol Chiki: Cross-lingual transfer learning for Santali speech recognition. In Findings of the Association for Computational Linguistics: IJCNLP-AACL, 2025. [Google Scholar]