Submitted:
07 May 2025
Posted:
08 May 2025
Abstract
Keywords:
1. Introduction
2. Low-Resource Languages and Creation of Parallel Sentences Datasets
- OPUS – an extensive repository of parallel datasets in various languages, including data for many language pairs [15];
- TED Talks – subtitles for TED talks are often available in multiple languages, allowing the creation of a parallel dataset [16];
- Europarl – parallel dataset from European Parliament proceedings in multiple languages [17].
- Europarl – Official documents of the European Parliament.
- GNOME, KDE, Ubuntu – Software interfaces and technical documentation.
- Tanzil – Multilingual translations of the Quran.
- OpenSubtitles – Movie subtitles in multiple languages.
- WikiMatrix – Multilingual parallel texts extracted from Wikipedia.
3. Methods and Materials
3.1. Research Workflow
- Preparation.
- AI system selection.
- Corpus creation.
- Literature review – reviewing the current state of research in machine translation for low-resource languages, with a focus on the Kazakh-Kyrgyz language pair and related technologies.
- Searching for KZ-KG and KZ-EN parallel sentence datasets – identifying available parallel corpora for the Kazakh-Kyrgyz and Kazakh-English language pairs in open data sources such as OPUS, and evaluating their quality and coverage.
- Searching for AI systems – investigating AI systems and neural machine translation models that can be used for the Kazakh-Kyrgyz language pair, with a focus on pre-trained models and open-source solutions.
- Selecting translation quality metrics – choosing evaluation metrics, based on the literature review, to assess the quality and performance of the translation systems.
- Defining AI system selection criteria – establishing clear criteria for selecting the most suitable AI model for the translation task, considering factors such as model architecture, efficiency, training data availability, and translation accuracy.
- Creating a test dataset – curating a test dataset of parallel sentences from the selected sources, ensuring that it is balanced and representative of different domains so that translation quality can be evaluated comprehensively.
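The last step, building a test set that is balanced across domains, can be illustrated with a minimal sketch. It assumes each sentence pair carries a domain label; the labels, the function name `sample_balanced_test_set`, and the `per_domain` parameter are hypothetical and are not taken from the workflow described above.

```python
import random
from collections import defaultdict


def sample_balanced_test_set(pairs, per_domain, seed=0):
    """Draw up to `per_domain` sentence pairs from every domain.

    `pairs` is an iterable of (domain, source, target) tuples. The domain
    labels are an assumption: they would have to come from corpus metadata.
    """
    by_domain = defaultdict(list)
    for domain, source, target in pairs:
        by_domain[domain].append((source, target))

    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    test_set = []
    for domain in sorted(by_domain):
        candidates = by_domain[domain]
        k = min(per_domain, len(candidates))  # small domains contribute all they have
        test_set.extend((domain, s, t) for s, t in rng.sample(candidates, k))
    return test_set
```

Sampling the same number of pairs per domain keeps a large domain from dominating the averaged scores, at the cost of under-representing it relative to its share of the corpus.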
3.2. Translation Quality Metrics
- BLEU (Bilingual Evaluation Understudy) measures the n-gram overlap between machine-generated translations and reference translations. Sentence-level BLEU scores were calculated using the `sentence_bleu` function from SacreBLEU in Python [27].
- WER (Word Error Rate) measures the number of word-level errors (insertions, deletions, and substitutions) in the translated text, normalized by the length of the reference. A lower WER indicates better translation accuracy.
- ChrF (Character n-gram F-score) evaluates translations at the character level, making it particularly useful for assessing morphologically rich languages. The `sacrebleu.sentence_chrf` function was used to compute segment-level ChrF scores, which were then averaged [28].
- METEOR (Metric for Evaluation of Translation with Explicit ORdering) evaluates translations based on synonymy, stemming, and word order.
- COMET is a neural evaluation metric that considers semantic adequacy and fluency [29].
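The metrics above can be sketched in Python as follows. The WER function is a plain word-level edit distance written for illustration, since the text does not say which WER implementation was used. The helper `average_sentence_scores` is a hypothetical wrapper around the SacreBLEU functions `sentence_bleu` and `sentence_chrf`, called with their default settings, and it needs the third-party `sacrebleu` package.

```python
from statistics import mean


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def average_sentence_scores(hypotheses, references):
    """Average segment-level BLEU and chrF (requires `pip install sacrebleu`)."""
    import sacrebleu  # imported lazily so that WER works without it

    bleu = mean(sacrebleu.sentence_bleu(h, [r]).score
                for h, r in zip(hypotheses, references))
    chrf = mean(sacrebleu.sentence_chrf(h, [r]).score
                for h, r in zip(hypotheses, references))
    return bleu, chrf
```

Both SacreBLEU functions take one hypothesis string and a list of reference strings, which is why each reference is wrapped in a one-element list.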
3.3. Formatting of Mathematical Components
3.4. AI Systems and Selection Criteria
4. Results
4.1. Translation Quality Assessment Results by Various AI Systems
4.2. Selection of an AI System [Table with Quality Metric Values + Paid System Rejection + Choosing the Two Best Systems + Speed Tests + Final Selection of AI System]
4.3. Parameters of a Corpus Created
4.4. Translation Errors and Their Correction
5. Discussion
6. Conclusions and Future Work
Author Contributions
Funding
Conflicts of Interest
References
- Cieri, C.; Maxwell, M.; Strassel, S.; Tracey, J. Selection Criteria for Low Resource Language Programs. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia, 2016; European Language Resources Association (ELRA); pp. 4543–4549.
- Ivasubramanian, R.; Umamaheswari, T.; Babu, S.B.G.; Inakoti, R.; Salome, J.; Y. M., Dr; Sivasubramanian, R. Natural Language Processing in Low-Resource Language Contexts. Front. Health Inform. 2024, 13, 1578–1584.
- Pakray, P.; Gelbukh, A.; Bandyopadhyay, S. Natural language processing applications for low-resource languages. Nat. Lang. Process. 2025, 31, 183–197.
- Bekarystankyzy, A.; Mamyrbayev, O.; Mendes, M.; Fazylzhanova, A.; Assam, M. Multilingual end-to-end ASR for low-resource Turkic languages with common alphabets. Sci. Rep. 2024, 14.
- Tukeyev, U.; Amirova, D.; Karibayeva, A.; Sundetova, A.; Abduali, B. Combined Technology of Lexical Selection in Rule-Based Machine Translation. In Recent Advances in Systems, Control and Information Technology; Springer: Cham, Switzerland, 2017; pp. 491–500.
- Tukeyev, U.; Karibayeva, A.; Abduali, B. Neural Machine Translation System for the Kazakh Language Based on Synthetic Corpora. MATEC Web Conf. 2019, 252, 03006.
- Karibayeva, A.; Abduali, B.; Amirova, D. Formation of the Synthetic Corpus for Kazakh on the Base of Endings Complete System. In Proceedings of the International Conference Turklang-2018, 2018; pp. 114–118.
- Ahmadnia, B.; Serrano, J.; Haffari, G. Persian-Spanish Low-Resource Statistical Machine Translation Through English as Pivot Language. In Proceedings of Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, 2017; INCOMA Ltd; pp. 24–30.
- Elmadani, K.N.; Buys, J. Neural Machine Translation between Low-Resource Languages with Synthetic Pivoting. In Proceedings of LREC-COLING 2024, Torino, Italia, 2024; ELRA and ICCL; pp. 12144–12158.
- Pontes, J.J.A. Bilingual Sentence Alignment of a Parallel Corpus by Using English as a Pivot Language. In Proceedings of JISIC, Quito, Ecuador, 2014; Association for Computational Linguistics; pp. 13–20.
- Alekseev, A.; Turatali, T. KyrgyzNLP: Challenges, Progress, and Future. arXiv 2024, arXiv:2411.05503v1.
- Riemland, M. Theorizing Sustainable, Low-Resource MT in Development Settings: Pivot-Based MT between Guatemala’s Indigenous Mayan Languages. Transl. Spaces 2023, 12.
- Lin, D.; Murakami, Y.; Ishida, T. Towards Language Service Creation and Customization for Low-Resource Languages. Information 2020, 11, 67.
- Trieu, H.-L.; Tran, V.; Ittoo, A.; Nguyen, L. Leveraging Additional Resources for Improving Statistical Machine Translation on Asian Low-Resource Languages. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2019, 18, 1–22.
- Tiedemann, J. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012; pp. 2214–2218.
- Karakanta, A.; Orrego-Carmona, D. Subtitling in Transition: The Case of TED Talks. 2023.
- Koehn, P. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proceedings of Machine Translation Summit X, Phuket, Thailand, 2005; pp. 79–86.
- Tiedemann, J.; Thottingal, S. OPUS-MT – Building Open Translation Services for the World. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation; 2020.
- Balahur, A.; Turchi, M. Comparative Experiments Using Supervised Learning and Machine Translation for Multilingual Sentiment Analysis. Comput. Speech Lang. 2014, 28, 56–75.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv 2015, arXiv:1409.0473.
- Koehn, P.; Knowles, R. Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation; 2017; pp. 28–39.
- Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th ACL; 2020; pp. 8440–8451.
- Costa-jussà, M.R.; Cross, J.; et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv 2022, arXiv:2207.04672.
- Hou, X. Research on Translation Corpus Building with the Assistance of AI. In Proceedings of the 2021 International Conference on Smart Technologies and Systems for IoT; 2022; pp. 136–140.
- Reixa, I.D.; et al. Corpus PaGeS: A Multifunctional Resource for Language Learning, Translation and Cross-Linguistic Research. In Parallel Corpus for Contrastive and Translation Studies; John Benjamins: Amsterdam, Netherlands, 2019; pp. 103–121.
- Post, M. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation; 2018; pp. 186–191.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of ACL 2002, 2002.
- Popović, M. chrF: Character n-gram F-score for Automatic MT Evaluation. In Proceedings of WMT 2015, 2015.
- Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A Neural Framework for MT Evaluation. In Proceedings of EMNLP 2020, 2020.
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Ponde, H.B.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374.
- Federmann, C.; Kocmi, T.; Xin, Y. NTREX-128 – News Test References for MT Evaluation of 128 Languages. In Proceedings of the First Workshop on Scaling Up Multilingual Evaluation; 2022; pp. 21–24.
- Goyal, N.; Gao, C.; Chaudhary, V.; Chen, P.-J.; Wenzek, G.; Ju, D.; Krishnan, S.; Ranzato, M.; Guzman, F.; Fan, A. The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation. Trans. Assoc. Comput. Linguist. 2022, 10, 522–538.
- Abduali, B.; Tukeyev, U.; Zhumanov, Z.; Israilova, N. Study of Kyrgyz-Kazakh Neural Machine Translation. In Recent Challenges in Intelligent Information and Database Systems; Nguyen, N.T., et al., Eds.; Springer: Singapore, 2025.
- Zhumanov, Z.; et al. Integrated Technology for Creating Quality Parallel Corpus. In Advances in Computational Collective Intelligence; pp. 511–524.
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144.
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
- Costa-jussà, M.R.; Cross, J.; Fan, A.; Ghazvininejad, M.; Gu, J.; Gunasekara, C.; He, Y.; Kalbassi, E.; Liptchinsky, V.; Liu, Z.; et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv 2022, arXiv:2207.04672.
- Author(s). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. [Details missing].
- Ziegler, A.; Kalliamvakou, E.; Li, X.A.; Rice, A.; Rifkin, D.; Simister, S.; Sittampalam, G.; Aftandilian, E. Measuring GitHub Copilot’s Impact on Productivity. Commun. ACM 2024, 67, 54–63.
- Gemma Team; Mesnard, T.; et al. Gemma: Open Models Based on Gemini Research and Technology. arXiv 2024, arXiv:2403.08295.


| Specification | Value |
| Graphics Card | NVIDIA RTX 4090 24 GB |
| Graphics Memory Type | GDDR6X |
| Graphics Memory Size | 24 GB |
| CUDA Cores | 16,384 |
| Core Clock Speed | 2.23 GHz |
| RAM | 128 GB DDR4/DDR5 |
| RAM Type | DDR4 or DDR5 |
| Network Interfaces | 10Gb Ethernet (or higher) |
| Power Supply | 850 W or higher |
| Metric | Value range | Interpretation |
| BLEU | 0–100 | >50 – Excellent; 30–50 – Good; 10–30 – Fair; <10 – Poor |
| WER | 0–1 (example: a WER of 0.8 means an 80% error rate for the compared sentences) | <0.2 – Excellent; 0.2–0.4 – Acceptable; >0.4 – Poor |
| ChrF | 0–100 | >60 – Excellent; 40–60 – Good; <40 – Weak |
| METEOR | 0–1 | >0.5 – Excellent; 0.3–0.5 – Moderate; <0.3 – Low |
| COMET | 0–1 | >0.5 – High quality; 0.3–0.5 – Acceptable; <0.3 – Weak |
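The interpretation bands in the table above can be encoded as a small lookup, sketched below. The thresholds are transcribed from the table. The function name `interpret` is hypothetical, and assigning a boundary value such as a BLEU of exactly 50 to the higher band is an assumption, since the table leaves shared boundaries ambiguous.

```python
# Lower bounds and labels for metrics where a higher score is better,
# listed from the best band down. Thresholds come from the table above.
HIGHER_IS_BETTER = {
    "BLEU": [(50, "Excellent"), (30, "Good"), (10, "Fair"), (float("-inf"), "Poor")],
    "ChrF": [(60, "Excellent"), (40, "Good"), (float("-inf"), "Weak")],
    "METEOR": [(0.5, "Excellent"), (0.3, "Moderate"), (float("-inf"), "Low")],
    "COMET": [(0.5, "High quality"), (0.3, "Acceptable"), (float("-inf"), "Weak")],
}


def interpret(metric: str, value: float) -> str:
    """Map a metric value to the qualitative band given in the table."""
    if metric == "WER":  # WER is the only metric where lower is better
        if value < 0.2:
            return "Excellent"
        if value <= 0.4:
            return "Acceptable"
        return "Poor"
    for lower_bound, label in HIGHER_IS_BETTER[metric]:
        # Assumption: a value sitting on a boundary goes to the higher band.
        if value >= lower_bound:
            return label
    raise ValueError(f"cannot interpret {metric}={value!r}")
```

For example, `interpret("BLEU", 36.6)` falls in the 30–50 band, and `interpret("WER", 0.87)` falls above the 0.4 threshold.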
| Dataset name | Dataset link | Number of parallel sentences |
| OPUS [15] | https://opus.nlpl.eu/results/kk&ky/Corpus-result-table | 102 345 |
| NTREX [31] | https://huggingface.co/datasets/davidstap/NTREX | 2 000 |
| Flores 101 [32] | https://huggingface.co/datasets/severo/flores_101 | 2 000 |
| System | BLEU | WER | METEOR | COMET | ChrF |
| Google Translate | 14.0 | 0.92 | 0.078 | 0.692 | 23.01 |
| ChatGPT | 36.6 | 0.87 | 0.151 | 0.818 | 31.56 |
| Nllb-200-3.3 | 30.4 | 0.88 | 0.126 | 0.755 | 27.12 |
| DeepSeek | 33.0 | 0.87 | 0.145 | 0.819 | 31.56 |
| Copilot | 26.2 | 0.87 | 0.146 | 0.812 | 31.38 |
| Gemma-2-27b | 31.0 | 0.87 | 0.136 | 0.802 | 30.25 |
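To compare the systems on a single metric, the scores in the table above can be sorted as in the sketch below. The values are copied from the table. The function name `rank_systems` is hypothetical, and ranking on one metric at a time is an illustrative simplification, not necessarily the selection procedure used in the study.

```python
# Scores copied from the results table above.
RESULTS = {
    "Google Translate": {"BLEU": 14.0, "WER": 0.92, "METEOR": 0.078, "COMET": 0.692, "ChrF": 23.01},
    "ChatGPT": {"BLEU": 36.6, "WER": 0.87, "METEOR": 0.151, "COMET": 0.818, "ChrF": 31.56},
    "Nllb-200-3.3": {"BLEU": 30.4, "WER": 0.88, "METEOR": 0.126, "COMET": 0.755, "ChrF": 27.12},
    "DeepSeek": {"BLEU": 33.0, "WER": 0.87, "METEOR": 0.145, "COMET": 0.819, "ChrF": 31.56},
    "Copilot": {"BLEU": 26.2, "WER": 0.87, "METEOR": 0.146, "COMET": 0.812, "ChrF": 31.38},
    "Gemma-2-27b": {"BLEU": 31.0, "WER": 0.87, "METEOR": 0.136, "COMET": 0.802, "ChrF": 30.25},
}

LOWER_IS_BETTER = {"WER"}  # every other metric in the table rewards higher values


def rank_systems(results, metric):
    """Return system names ordered from best to worst on one metric."""
    best_first = metric not in LOWER_IS_BETTER
    return sorted(results, key=lambda name: results[name][metric], reverse=best_first)


if __name__ == "__main__":
    for metric in ("BLEU", "COMET", "ChrF"):
        print(metric, rank_systems(RESULTS, metric))
```

Because several systems share a WER of 0.87, ranking on WER alone cannot separate them. The sort is stable, so tied systems keep the order in which they appear in `RESULTS`.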
| Indicator | Gemma-2-27b | Nllb-200-3.3 |
| Translation speed | 3 sentences per minute | 300 sentences per minute |
| Time for full translation of 302 530 sentences | ~2.5 months | ~2 days |
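The relationship between translation speed and total translation time can be checked with a short calculation. The sketch below is an idealized estimate based only on the per-minute rates in the table. It assumes uninterrupted processing at a constant rate, and the function name `translation_days` is hypothetical.

```python
def translation_days(num_sentences: int, sentences_per_minute: float) -> float:
    """Idealized wall-clock time in days at a constant translation rate."""
    minutes = num_sentences / sentences_per_minute
    return minutes / (60 * 24)


CORPUS_SIZE = 302_530  # sentences, from the table above

if __name__ == "__main__":
    for system, rate in {"Gemma-2-27b": 3, "Nllb-200-3.3": 300}.items():
        print(f"{system}: {translation_days(CORPUS_SIZE, rate):.1f} days")
```

At 3 sentences per minute the estimate is about 70 days, in line with the reported ~2.5 months. At 300 sentences per minute it is under one day, so the reported ~2 days presumably includes time that the per-minute rate does not capture.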
| Corpus name | Quantity of sentences | Quantity of words | Size of file |
| KZ-KG | 302 530 | ~ 10 000 000 | 139.5 MB |
| Original text in Kazakh | Translated text using Nllb-200-3.3 into Kyrgyz | Explanation of the identified errors |
| Мемлекет басшысы Нұрсұлтан Назарбаев Ресей Федерациясының Президенті Владимир Путинге Дoнецк маңында бoлған ТУ - 154 жoлаушылар ұшағының апатынан адамдардың қаза бoлуына байланысты көңіл айтты . | Президент Владимир Путинге Дoнецк шаарынын жанындагы ТУ-154 учагынын кыйрашынан каза бoлгoндoргo байланыштуу көңүл айтты . | “Мемлекет басшысы Нұрсұлтан Назарбаев Ресей Федерациясының Президенті” (Head of State Nursultan Nazarbayev, President of the Russian Federation) was reduced to just “Президент” (President); it should be “Мамлекет башчысы Нурсултан Назарбаев Рoссия Федерациясынын Президенти”. “жoлаушылар” (passengers) was omitted from the translation; it should be “жүргүнчүлөр”. “адамдардың” (people's) was omitted from the translation; it should be “адамдар”. |
| үкімет басшысы еққдб-ның қазақстандағы мемлекеттік-жекешелік әріптестік саласындағы ірі жoбаны - үлкен алматы автoмoбиль айналма жoлын (үаааж) іске асыруға қатысуының маңыздылығын атап өтті. | Өкмөт башчысы ӨКМдин Казакстандагы мамлекеттик-жеке өнөктөштүк тармагындагы ири дoлбooрду - чoң Алматы автoжoл айланма жoлун (УААЖ) ишке ашырууга катышуусунун маанилүүлүгүн белгиледи. | The abbreviations were translated incorrectly: “еққдб-ның” became “ӨКМдин” and “үаааж” became “УААЖ”. |
| ертең қр премьер-министрі асқар мамин шанхай ынтымақтастық ұйымына (шыұ) мүше мемлекеттердің үкімет басшылары кеңесінің (үбк) oтырысына қатысады | Эртең Кыргызстандын премьер-министри Шанхай кызматташтык уюмуна мүчө өлкөлөрдүн өкмөт башчыларынын кеңешинин (ШК) жыйынына катышат . | “қр премьер-министрі асқар мамин” (Prime Minister of the Republic of Kazakhstan Askar Mamin) was translated as “Кыргызстандын премьер-министри” (Prime Minister of Kyrgyzstan): the country was rendered as Kyrgyzstan instead of Kazakhstan. |
| қр премьер-министрі асқар мамин арыс қаласындағы зардап шеккен үйлерді , әлеуметтік нысандар мен инженерлік желілерді қалпына келтіру жұмыстарының барысымен танысу мақсатында жұмыс сапарымен түркістан oблысына барды . | Кыргызстандын премьер-министри Аскар Мамин шаардагы кыйраган үйлөрдү , сoциалдык oбъектилерди жана инженердик тармактарды калыбына келтирүү иштеринин жүрүшү менен таанышуу максатында Түркстан oблусуна иш сапары менен барды . | “қр” (RK) was translated as “Кыргызстандын” (Kyrgyzstan); it should be “КР” (RK) or “Казакстаннын” (Kazakhstan). “арыс” was omitted from the translation: the phrase “арыс қаласындағы” (in the city of Arys) was rendered simply as “шаардагы” (in the city), without the name of the city. |
| асқар мамин заңнамалық өзгерістерді әзірлеу және енгізу қажеттілігін айтты . | Жoгoрку министр мыйзамдык өзгөртүүлөрдү иштеп чыгуу жана киргизүү зарылдыгын айтты . | “асқар мамин” (Askar Mamin) was replaced with a different phrase, “Жoгoрку министр” (senior minister). |
| "қр премьер-министрі асқар маминнің төрағалығымен өткен үкімет oтырысында "" еңбек "" нәтижелі жұмыспен қамтуды және жаппай кәсіпкерлікті дамытудың 2017-2021 жылдарға арналған мемлекеттік бағдарламасын іске асыру барысы қаралды" | "Өкмөттүн жыйынында 2017-2021-жылдарга "" эмгекти "" натыйжалуу пайдалануу жана массалык ишкердикти өнүктүрүү бoюнча мамлекеттик прoграмманы ишке ашыруунун жүрүшү каралды ." | “қр премьер-министрі асқар маминнің төрағалығымен өткен” (chaired by the Prime Minister of the Republic of Kazakhstan, Askar Mamin) was omitted from the translation. |
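Some of the errors in the table above, such as a dropped parenthesised abbreviation, could in principle be flagged automatically before manual correction. The sketch below is a hypothetical heuristic, not the correction procedure used in the study: it only compares how many parenthesised abbreviations appear on each side, so it would not catch an abbreviation that is present but wrongly rendered.

```python
import re

# A parenthesised token without spaces, e.g. "(шыұ)" or "(УААЖ)".
ABBREVIATION = re.compile(r"\(([^()\s]+)\)")


def parenthesised_abbreviations(sentence: str) -> list[str]:
    """Return the contents of all single-token parentheses in a sentence."""
    return ABBREVIATION.findall(sentence)


def flag_abbreviation_mismatch(source: str, target: str) -> bool:
    """True when source and target contain different numbers of abbreviations."""
    return len(parenthesised_abbreviations(source)) != len(parenthesised_abbreviations(target))


if __name__ == "__main__":
    # Excerpts from the third example in the table above.
    source = "ұйымына (шыұ) мүше мемлекеттердің үкімет басшылары кеңесінің (үбк)"
    target = "уюмуна мүчө өлкөлөрдүн өкмөт башчыларынын кеңешинин (ШК) жыйынына"
    print(flag_abbreviation_mismatch(source, target))
```

Flagged pairs would still need human review, because a differing count shows only that something was dropped or merged, not what the correct translation is.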
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).