Submitted: 05 March 2025
Posted: 07 March 2025
Abstract
Keywords:
1. Introduction

2. Related Work
2.1. Challenges in Multilingual Translation for Indian Languages
- Data Scarcity and Quality: One of the biggest problems in developing effective MT systems for Indian languages is the lack of large, high-quality parallel corpora. This scarcity hampers the ability to train robust models that can handle the nuances of each language [5].
- Transliteration and Code-Mixing: Indian languages often appear in transliterated forms or code-mixed with English, especially in informal settings like social media. Current state-of-the-art multilingual language models (LMs) struggle to handle these variations effectively [6].
- Diverse Linguistic Landscape: The linguistic diversity in India, with languages from four major language families, requires MT systems to be highly adaptable and capable of handling multiple scripts and dialects [7].
2.2. MuRIL: Multilingual Representations for Indian Languages
2.3. Hybrid Models and Neural Networks
2.4. VAKTA-SETU: Speech-to-Speech Machine Translation
3. Proposed Work
3.1. System Architecture
3.2. Components of System
3.2.1. Text-to-Speech (TTS)
- x represents the input text sequence.
- y represents the generated speech waveform (audio output).
3.2.2. Speech Recognition Module
- $\hat{x}$ represents the most likely transcribed text.
- $P(x \mid y)$ is the conditional probability of the text x given the audio input y.
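The two definitions above correspond to the standard maximum a posteriori formulation of speech recognition. The expression below is a reconstruction under that assumption, not a formula taken from the original text; the symbols $\hat{x}$ and $P(x \mid y)$ are the conventional notation for the quantities the bullets describe.

```latex
\hat{x} = \arg\max_{x} P(x \mid y)
```

Here the maximization runs over all candidate transcriptions x for a given audio input y.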
3.2.3. Speech-to-Text (STT)
- y represents the input speech waveform (audio input).
- x represents the transcribed text output.
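The components above can be read as stages of a single speech-to-speech pipeline. The following is a minimal sketch that assumes the stages are chained as STT, then translation, then TTS; the class name `SpeechTranslationPipeline` and the three callables are hypothetical placeholders, and the stub stages stand in for real models so that the example runs without any model or network access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechTranslationPipeline:
    """Chains three injected stages: STT -> translation -> TTS (assumed order)."""

    speech_to_text: Callable[[bytes], str]   # y (audio input) -> x (text)
    translate: Callable[[str], str]          # source text -> target text
    text_to_speech: Callable[[str], bytes]   # x (text) -> y (audio output)

    def run(self, audio_in: bytes) -> bytes:
        # Each stage consumes the previous stage's output.
        source_text = self.speech_to_text(audio_in)
        target_text = self.translate(source_text)
        return self.text_to_speech(target_text)


def make_stub_pipeline() -> SpeechTranslationPipeline:
    """Builds a pipeline from toy stages (hypothetical stand-ins for real models)."""
    return SpeechTranslationPipeline(
        speech_to_text=lambda audio: audio.decode("utf-8"),
        translate=lambda text: text.upper(),
        text_to_speech=lambda text: text.encode("utf-8"),
    )


if __name__ == "__main__":
    print(make_stub_pipeline().run(b"hello"))
```

Passing the stages in as callables keeps the sketch independent of any particular recognition, translation, or synthesis library.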
3.3. User Flow Diagram

4. IndicTrans2 Model Overview
4.1. Model Architecture
4.2. Training Methods
- Data Augmentation: To improve model robustness, various data augmentation methods are applied. This includes synthetically generating additional training data by introducing controlled variations in the existing text to expand the dataset and expose the model to a broader set of linguistic patterns.
- Subword Tokenization (e.g., SentencePiece): The training pipeline incorporates subword tokenization techniques like SentencePiece to break down words into subword units. This approach helps handle out-of-vocabulary words effectively, improves language modeling, and ensures efficient handling of morphologically rich languages.
- Back-Translation: Back-translation augments the training data by translating monolingual target-language sentences into the source language with a reverse (target-to-source) model. Each synthetic source sentence is then paired with its original target sentence to form an additional training example. This technique enlarges the training dataset, particularly for low-resource language pairs, and enhances the model’s capability to produce fluent and accurate translations.
- Optimization Strategy: The model is trained with an adaptive optimizer such as Adam or AdamW, combined with a custom learning rate schedule. This helps the model converge efficiently and reduces the risk of unstable updates during training.
- Multi-Task Learning: To further improve cross-lingual transfer capabilities, the training process employs multi-task learning. Related language pairs are trained together, allowing representations to be shared across linguistically similar languages. This joint training approach enhances performance on both high-resource and low-resource languages by transferring knowledge between them.
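As an illustration of the back-translation step, the sketch below builds synthetic parallel pairs from target-side monolingual text in the usual formulation of the technique. It is not the training code of the model described here: the function `back_translate`, the callable `reverse_translate`, and the toy lookup table are hypothetical names introduced only for this example, and the lookup table stands in for a trained target-to-source model.

```python
from typing import Callable, Iterable, List, Tuple


def back_translate(
    target_monolingual: Iterable[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return (synthetic_source, real_target) pairs built from target-side text."""
    pairs = []
    for target_sentence in target_monolingual:
        synthetic_source = reverse_translate(target_sentence).strip()
        if synthetic_source:  # drop sentences the reverse model could not translate
            pairs.append((synthetic_source, target_sentence))
    return pairs


# Toy stand-in for a trained Hindi-to-English model (hypothetical lookup table).
_TOY_HI_TO_EN = {"नमस्ते": "hello", "धन्यवाद": "thank you"}


def toy_reverse_model(sentence: str) -> str:
    return _TOY_HI_TO_EN.get(sentence, "")


if __name__ == "__main__":
    monolingual_hindi = ["नमस्ते", "धन्यवाद", "अज्ञात"]
    for source, target in back_translate(monolingual_hindi, toy_reverse_model):
        print(source, "->", target)
```

The resulting pairs would be mixed with the genuine parallel data when training the source-to-target model; the target side of every synthetic pair is always human-written text.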
4.3. Evaluation Metrics
- BLEU (Bilingual Evaluation Understudy): BLEU is a widely used metric for evaluating the quality of machine-translated text by comparing it to one or more human-generated reference translations. It calculates the clipped n-gram precision between the model’s output and the reference and applies a brevity penalty to overly short translations [13].
- TER (Translation Edit Rate): TER measures the number of edits (insertions, deletions, substitutions, and shifts) required to transform the model’s translation into the reference translation, normalized by the length of the reference. It quantifies how much post-editing is needed for the output [14].
- ChrF (Character F-score): ChrF is a character-level metric that measures the precision and recall of character n-grams. It is particularly useful for evaluating translations in morphologically rich languages where word-level matching might not capture nuances effectively [15].
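To make the n-gram matching behind BLEU and ChrF concrete, the following is a simplified, single-reference sketch of both scores. It is an illustration only and not a replacement for a standard evaluation toolkit: the BLEU variant is unsmoothed and sentence-level, and the ChrF variant assumes character n-grams up to order 6, whitespace removed, and beta = 2. These parameter choices and the function names are assumptions made for this example.

```python
import math
from collections import Counter
from typing import Sequence


def ngrams(items: Sequence, n: int) -> Counter:
    """Count all contiguous n-grams in a token list or a string."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against a single reference."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: 1 if the hypothesis is longer than the reference.
    if len(hyp) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Character n-gram F-score, averaged over n-gram orders 1..max_n."""
    hyp, ref = hypothesis.replace(" ", ""), reference.replace(" ", "")
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        if not hyp_counts or not ref_counts:
            continue  # string shorter than n characters
        overlap = sum((hyp_counts & ref_counts).values())
        precisions.append(overlap / sum(hyp_counts.values()))
        recalls.append(overlap / sum(ref_counts.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu("the cat sat on a mat", reference))
    print(chrf("the cat sat on a mat", reference))
```

Because ChrF matches character sequences rather than whole words, a hypothesis that differs from the reference only in an inflectional ending still receives partial credit, which is the property that makes it useful for morphologically rich languages.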
5. Requirements and Environment Setup
5.1. Hardware and Software Requirements
5.1.1. Hardware Requirements
- Processor (CPU)
- Graphics Processing Units (GPU)
- RAM: 16 GB
- Storage: 500 GB
- System: Intel Core i5 processor
5.1.2. Software Requirements
- Operating System
- Languages and Libraries:
  - Python
  - Streamlit
- Deep Learning Frameworks:
  - TensorFlow/PyTorch
  - Hugging Face Transformers
- Text-to-Speech (TTS) and Speech-to-Text (STT) Libraries:
  - Google Text-to-Speech (gTTS) API
  - SpeechRecognition
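A quick way to confirm that the listed libraries are available before launching the application is to probe for them without importing them. The sketch below is one possible check, not part of the described system; the helper `check_environment` and the `REQUIRED_MODULES` mapping are hypothetical, while the module names are the usual import names of the listed packages.

```python
import importlib.util
from typing import Dict

# Display name -> import name for the libraries listed above.
REQUIRED_MODULES = {
    "Streamlit": "streamlit",
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Hugging Face Transformers": "transformers",
    "gTTS": "gtts",
    "SpeechRecognition": "speech_recognition",
}


def check_environment(modules: Dict[str, str] = REQUIRED_MODULES) -> Dict[str, bool]:
    """Report which modules can be found, without actually importing them."""
    status = {}
    for display_name, import_name in modules.items():
        try:
            status[display_name] = importlib.util.find_spec(import_name) is not None
        except (ImportError, ValueError):
            status[display_name] = False
    return status


if __name__ == "__main__":
    for name, available in check_environment().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

Since the requirements list TensorFlow and PyTorch as alternatives, a missing entry for one of the two is not necessarily a problem.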
5.2. Development Environment and Tools
5.2.1. Integrated Development Environment (IDE)
- VSCode for code editing and debugging.
- Jupyter Notebooks for testing and experimenting with machine learning models in Python.
5.2.2. Version Control
- Git/GitHub for version control and collaborative development.
5.2.3. Deployment
- The final application will be deployed as a web application, accessible via standard web browsers.
6. Results and Experiments
6.1. Project Modules
6.1.1. HomeScreen

6.1.2. Source and Target Language Selection

6.1.3. Audio Input

6.1.4. Output Screen

Conclusion
References
- Inaguma, H.; Duh, K.; Kawahara, T.; Watanabe, S. Multilingual End-to-End Speech Translation. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 570–577.
- Patil, P.S.; Dhande, M.S.; More, M.M.; Dalvi, M.S. Text to Speech and Language Conversion in Hindi and English Using CNN. International Journal for Research in Applied Science and Engineering Technology 2022, 10.
- Al-Bakhrani, A.; Amran, G.A.; Al-Hejri, A.; Chavan, S.; Manza, R.; Nimbhore, S. Development of Multilingual Speech Recognition and Translation Technologies for Communication and Interaction. 2023, pp. 711–723.
- Wikipedia contributors. Languages of India, 2024. Accessed: 2024-11-09.
- Mhaskar, S.; Bhat, V.; Batheja, A.; Deoghare, S.; Choudhary, P.; Bhattacharyya, P. VAKTA-SETU: A Speech-to-Speech Machine Translation Service in Select Indic Languages. ArXiv 2023, abs/2305.12518.
- Sannigrahi, S.; Bawden, R. Investigating Lexical Sharing in Multilingual Machine Translation for Indian Languages. 2023, pp. 181–192.
- Padmane, P.; Pakhale, A.; Agrel, S.; Patel, A.; Pimparkar, S.; Bagde, P. Multilingual Speech and Text Recognition and Translation. International Journal of Innovations in Engineering and Science 2022.
- Khanuja, S.; Bansal, D.; Mehtani, S.; Khosla, S.; Dey, A.; Gopalan, B.; Margam, D.; Aggarwal, P.; Nagipogu, R.; Dave, S.; et al. MuRIL: Multilingual Representations for Indian Languages. ArXiv 2021, abs/2103.10730.
- Rudrappa, N.T.; Reddy, M.V.; Hanumanthappa, M. KHiTE: Multilingual Speech Acquisition to Monolingual Text Translation. Indian Journal of Science and Technology 2023.
- Jayanthi, N.; Lakshmi, A.; Raju, C.S.K.; Swathi, B. Dual Translation of International and Indian Regional Language using Recent Machine Translation. In Proceedings of the 2020 3rd International Conference on Intelligent Sustainable Systems (ICISS), 2020, pp. 682–686.
- Mujadia, V.; Sharma, D. Towards Speech to Speech Machine Translation focusing on Indian Languages. 2023, pp. 161–168.
- Gala, J.; Chitale, P.A.; Raghavan, A.K.; Gumma, V.; Doddapaneni, S.; M, A.K.; Nawale, J.A.; Sujatha, A.; Puduppully, R.; Raghavan, V.; et al. IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages. Transactions on Machine Learning Research 2023.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Isabelle, P.; Charniak, E.; Lin, D., Eds., Philadelphia, Pennsylvania, USA, 2002; pp. 311–318.
- Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, Massachusetts, USA, August 8-12, 2006; pp. 223–231.
- Popović, M. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation; Bojar, O.; Chatterjee, R.; Federmann, C.; Haddow, B.; Hokamp, C.; Huck, M.; Logacheva, V.; Pecina, P., Eds., Lisbon, Portugal, 2015; pp. 392–395.
- Proksch, S.O.; Wratil, C.; Wäckerle, J. Testing the Validity of Automatic Speech Recognition for Political Text Analysis. Political Analysis 2019, 27, 339–359.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).