Submitted: 29 March 2026
Posted: 31 March 2026
Abstract
Keywords:
1. Introduction
- We develop KokborokMT, a significantly improved NMT system for Kokborok, by fine-tuning NLLB-200 [6] with a novel trp_Latn language token.
- We construct a 36,052-sentence parallel corpus combining professional translations from SMOL [7], WMT Bible data, and synthetic back-translations generated from Tatoeba English sentences using Gemini Flash.
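- We demonstrate that synthetic data augmentation via LLM back-translation improves nearly every evaluation metric across both test sets and both translation directions; the sole exception is LaBSE cosine similarity for en→trp on the SMOL test set (0.7707 without synthetic data vs. 0.7596 with it).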
- We provide comprehensive ablation studies comparing zero-shot NLLB, fine-tuning without synthetic data (System 1), and fine-tuning with synthetic data (System 2).
- We investigate LaBSE-based quality filtering for synthetic data and report that filtering scores are unreliable for Kokborok due to the language’s absence from LaBSE’s training data, an important negative finding for the community.
- We conduct human evaluation with three annotators including a linguistic expert and a domain specialist, yielding mean scores of 3.74/5 (adequacy) and 3.70/5 (fluency).
- We release the model and evaluation scripts publicly to facilitate further research on Kokborok NLP.
2. Background and Related Work
2.1. Kokborok: Language and Script
2.2. Prior NLP Work on Kokborok
2.3. Low-Resource MT and Back-Translation
3. Data
3.1. Parallel Corpus Construction
3.1.1. SMOL (9,284 Sentences)
3.1.2. WMT Bible Corpus (1,769 Sentences)
3.1.3. Synthetic Back-Translation (24,999 Sentences)
3.2. Quality Filtering Investigation
3.3. Data Splits and Deduplication
- SMOL Test Set (500 sentences): Randomly sampled from SMOLDOC sentences only, ensuring domain diversity and professional translation quality.
- WMT Test Set (499 sentences): Randomly sampled from the WMT Bible corpus, enabling direct comparison with WMT shared task results.
- Development Set (500 sentences): Randomly sampled from remaining SMOL sentences after test extraction.
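The split procedure described above can be sketched roughly as follows. This is a simplified illustration, not the authors' released script: it assumes sentence pairs are held as (English, Kokborok) tuples, uses exact-match deduplication, and draws test and dev sets from a single pool, whereas the paper samples the SMOL test set from SMOLDOC sentences only. The function name `make_splits`, the seed, and the toy data are all hypothetical.

```python
import random


def make_splits(pairs, test_size, dev_size, seed=0):
    """Deduplicate sentence pairs, then carve out disjoint test and dev sets."""
    # Exact-match deduplication that keeps the first occurrence of each pair.
    unique = list(dict.fromkeys(pairs))
    # A seeded generator makes the random sample repeatable.
    rng = random.Random(seed)
    rng.shuffle(unique)
    test = unique[:test_size]
    dev = unique[test_size:test_size + dev_size]
    train = unique[test_size + dev_size:]
    return train, dev, test


# Toy placeholder pairs (hypothetical content), with one deliberate duplicate.
toy = [(f"en {i}", f"trp {i}") for i in range(50)] + [("en 0", "trp 0")]
train, dev, test = make_splits(toy, test_size=5, dev_size=5)
print(len(train), len(dev), len(test))
```

Sampling the held-out sets before assembling the training data, and deduplicating first, keeps a test sentence from also appearing in the training set.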
3.4. Data Statistics
4. Methodology
4.1. Base Model and Language Token
4.2. Training Setup
4.2.1. Hyperparameters
4.2.2. Model Selection
4.3. Experimental Conditions
- Zero-Shot NLLB: The base NLLB-200-distilled-600M model with trp_Latn token added but no fine-tuning.
- System 1 (No BT): Fine-tuned on SMOL + WMT only (11,053 pairs; 23,098 with both directions). No synthetic data.
- System 2 (Full): Fine-tuned on SMOL + WMT + Gemini synthetic data (36,052 pairs; 72,096 with both directions). Primary system.
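Training "with both directions" means each parallel pair contributes one en→trp example and one trp→en example. A minimal sketch of that expansion is shown below; the dictionary layout and the function name `both_directions` are assumptions for illustration, `eng_Latn` is the standard NLLB code for English, and `trp_Latn` is the token introduced in this paper. The sentences are placeholders, not real corpus data.

```python
def both_directions(pairs, en_tag="eng_Latn", trp_tag="trp_Latn"):
    """Expand (English, Kokborok) pairs into examples for both directions."""
    examples = []
    for en, trp in pairs:
        # English -> Kokborok example.
        examples.append(
            {"src_lang": en_tag, "tgt_lang": trp_tag, "src": en, "tgt": trp}
        )
        # Kokborok -> English example built from the same pair.
        examples.append(
            {"src_lang": trp_tag, "tgt_lang": en_tag, "src": trp, "tgt": en}
        )
    return examples


# Placeholder pairs (hypothetical; stand-ins for real corpus sentences).
pairs = [
    ("English sentence 1", "<Kokborok sentence 1>"),
    ("English sentence 2", "<Kokborok sentence 2>"),
]
examples = both_directions(pairs)
print(len(examples))
```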
5. Evaluation
5.1. Automatic Metrics
- chrF [14]: Character n-gram F-score via sacreBLEU.
- ROUGE-L [15]: Longest common subsequence F-measure.
- METEOR [16]: Alignment-based metric using WordNet.
- TER [17]: Translation edit rate (lower is better).
- Cosine Similarity: Semantic similarity via LaBSE embeddings [11].
- COMET [18]: Neural evaluation using Unbabel/wmt22-comet-da.
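To make one of these metrics concrete, the following is a minimal pure-Python sketch of a sentence-level ROUGE-L score: the F1 of precision and recall computed from the longest common subsequence (LCS) of tokens. Whitespace tokenisation and the balanced F1 weighting are assumptions made here for illustration; the paper's evaluation scripts may tokenise and weight differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hypothesis, reference):
    """ROUGE-L F1 over whitespace tokens (illustrative, not the official scorer)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    lcs = lcs_length(hyp, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(round(score, 4))
```

In this toy English example the LCS is "the cat on the mat" (5 tokens) against 6-token hypothesis and reference, so precision, recall, and F1 are all 5/6.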
5.2. Human Evaluation
5.2.1. Data
5.2.2. Annotators
5.2.3. Criteria
5.2.4. Results
5.3. Reproducibility
5.4. Automatic Evaluation Results
6. Analysis
6.1. Impact of Fine-Tuning
6.2. Impact of Synthetic Back-Translation
6.3. LaBSE Quality Filtering
6.4. Human Evaluation Analysis
6.5. Domain Generalisation
6.6. Translation Direction Asymmetry
6.7. Comparison with Prior Work
7. Limitations
8. Conclusion
Data Availability Statement
Acknowledgments
Appendix A. sacreBLEU Signatures
- EN→TRP (SMOL): BLEU = 15.25 47.9/20.4/10.2/5.5 (BP=1.000 ratio=1.009 hyp_len=8142 ref_len=8068)
- EN→TRP (WMT): BLEU = 17.30 46.9/22.6/12.0/7.4 (BP=0.988 ratio=0.988 hyp_len=12733 ref_len=12884)
- TRP→EN (SMOL): BLEU = 38.56 65.5/43.1/32.2/24.7 (BP=0.997 ratio=0.997 hyp_len=8595 ref_len=8620)
- TRP→EN (WMT): BLEU = 28.03 58.2/33.8/21.7/14.5 (BP=1.000 ratio=1.010 hyp_len=14381 ref_len=14243)
Appendix B. Training Loss Curves
| Epoch | Train Loss | Val Loss |
|---|---|---|
| System 2 (Full, With BT) | | |
| 1 | 0.3573 | 0.3816 |
| 2 | 0.2889 | 0.3187 |
| 3 | 0.2626 | 0.2896 |
| 4 | 0.2439 | 0.2723 |
| 5 | 0.2335 | 0.2607 |
| 6 | 0.2250 | 0.2533 |
| 7 | 0.2177 | 0.2477 |
| 8 | 0.2129 | 0.2450 |
| 9 | 0.2085 | 0.2429 |
| 10 | 0.2064 | 0.2422 |
| System 1 (No BT) | | |
| 1 | 8.4454 | 1.4459 |
| 2 | 1.8205 | 0.3779 |
| 3 | 0.4388 | 0.3199 |
| 4 | 0.3998 | 0.2862 |
| 5 | 0.3571 | 0.2651 |
| 6 | 0.3399 | 0.2506 |
| 7 | 0.3200 | 0.2403 |
| 8 | 0.3130 | 0.2333 |
| 9 | 0.3100 | 0.2291 |
| 10 | 0.3034 | 0.2278 |
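One common way to use curves like these is to keep the checkpoint with the lowest validation loss. The sketch below applies that rule to the validation losses in the table above. Note that this selection criterion is an assumption made here for illustration; the table alone does not say which criterion the authors used. The helper name `best_epoch` is hypothetical.

```python
# Validation losses per epoch (epochs 1-10), copied from the table above.
val_loss = {
    "System 2 (Full, With BT)": [
        0.3816, 0.3187, 0.2896, 0.2723, 0.2607,
        0.2533, 0.2477, 0.2450, 0.2429, 0.2422,
    ],
    "System 1 (No BT)": [
        1.4459, 0.3779, 0.3199, 0.2862, 0.2651,
        0.2506, 0.2403, 0.2333, 0.2291, 0.2278,
    ],
}


def best_epoch(losses):
    """Return (1-based epoch, loss) for the lowest validation loss."""
    idx = min(range(len(losses)), key=losses.__getitem__)
    return idx + 1, losses[idx]


best = {name: best_epoch(losses) for name, losses in val_loss.items()}
for name, (epoch, loss) in best.items():
    print(f"{name}: epoch {epoch}, val loss {loss:.4f}")
```

Under this rule both systems select epoch 10, since validation loss decreases at every epoch in both runs.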
Appendix C. Human Evaluation Data Pipeline
References
- Debbarma, K.; Patra, B.G.; Das, D.; Bandyopadhyay, S. Morphological Analyzer for Kokborok. In Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing. The COLING 2012 Organizing Committee, 2012, pp. 41–52.
- Pal, S.; Pakray, P.; Laskar, S.R.; Laitonjam, L.; Khenglawt, V.; Warjri, S.; Dadure, P.K.; Dash, S.K. Findings of the WMT 2023 Shared Task on Low-Resource Indic Language Translation. In Proceedings of the Eighth Conference on Machine Translation. Association for Computational Linguistics, 2023, pp. 682–694. https://doi.org/10.18653/v1/2023.wmt-1.56.
- Pakray, P.; Pal, S.; Vetagiri, A.; Krishna, R.; Dash, S.K.; Maji, A.K.; Laitonjam, L.; Lyngdoh, S.; Manna, R. Findings of the WMT 2024 Shared Task on Low-resource Indic Languages Translation. In Proceedings of the Ninth Conference on Machine Translation. Association for Computational Linguistics, 2024, pp. 654–668. https://doi.org/10.18653/v1/2024.wmt-1.54.
- Pakray, P.; Krishna, R.M.; Pal, S.; Vetagiri, A.; Dash, S.K.; Maji, A.K.; Lyngdoh, S.A.; Laitonjam, L.; Jamatia, A.; Sambyo, K.; et al. Findings of WMT 2025 Shared Task on Low-resource Indic Languages Translation. In Proceedings of the Tenth Conference on Machine Translation (WMT). Association for Computational Linguistics, 2025. https://doi.org/10.18653/v1/2025.wmt-1.29.
- ANVITA Team. ANVITA: A Multi-pronged Approach for Enhancing Machine Translation of Extremely Low-Resource Indian Languages. In Proceedings of the Tenth Conference on Machine Translation (WMT). Association for Computational Linguistics, 2025, pp. 1240–1247. https://doi.org/10.18653/v1/2025.wmt-1.101.
- NLLB Team; Costa-jussà, M.R.; Cross, J.; Çelebi, O.; et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv preprint arXiv:2207.04672, 2022. https://doi.org/10.48550/arXiv.2207.04672.
- Caswell, I.; Nielsen, E.; Luo, J.; Cherry, C.; Kovacs, G.; Shemtov, H.; Talukdar, P.; Tewari, D.; et al. SMOL: Professionally Translated Parallel Data for 115 Under-Represented Languages. In Proceedings of the Tenth Conference on Machine Translation (WMT). Association for Computational Linguistics, 2025. https://doi.org/10.18653/v1/2025.wmt-1.85.
- Debbarma, A.; Bhattacharya, P.; Purkayastha, B.S. Named Entity Recognition for a Low Resource Language. International Journal of Recent Technology and Engineering, 2019, Vol. 8, pp. 587–590. https://doi.org/10.35940/ijrte.B2085.098319.
- Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2016, pp. 86–96. https://doi.org/10.18653/v1/P16-1009.
- Tiedemann, J. The Tatoeba Translation Challenge – Realistic Data Sets for Low Resource and Multilingual MT. In Proceedings of the Fifth Conference on Machine Translation. Association for Computational Linguistics, 2020, pp. 1174–1182. https://doi.org/10.18653/v1/2020.wmt-1.139.
- Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT Sentence Embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2022, pp. 878–891. https://doi.org/10.18653/v1/2022.acl-long.62.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2002, pp. 311–318. https://doi.org/10.3115/1073083.1073135.
- Post, M. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Association for Computational Linguistics, 2018, pp. 186–191. https://doi.org/10.18653/v1/W18-6319.
- Popović, M. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2015, pp. 392–395. https://doi.org/10.18653/v1/W15-3049.
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, 2004, pp. 74–81.
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Association for Computational Linguistics, 2005, pp. 65–72.
- Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas, 2006, pp. 223–231.
- Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2020, pp. 2685–2702. https://doi.org/10.18653/v1/2020.emnlp-main.213.
- Graham, Y.; Baldwin, T.; Moffat, A.; Zobel, J. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse. Association for Computational Linguistics, 2013, pp. 33–41.

| Source | Sentences | Type |
|---|---|---|
| SMOL (train) | 9,284 | Professional |
| WMT Bible (train) | 1,769 | Professional |
| Gemini BT (Tatoeba) | 24,999 | Synthetic |
| Total Train | 36,052 | |
| Dev (SMOL) | 500 | Professional |
| Test (SMOL) | 500 | Professional |
| Test (WMT) | 499 | Professional |

| Annotator | Adequacy | Fluency |
|---|---|---|
| Linguistic expert | 3.76 | 3.76 |
| Native speaker | 3.64 | 3.54 |
| Native researcher (Kokborok) | 3.84 | 3.80 |
| Mean | 3.74 | 3.70 |
| (expert vs researcher) | 0.672 | 0.677 |
| (expert vs native) | 0.134 | 0.109 |
| (native vs researcher) | 0.004 | 0.019 |

| System | Direction | BLEU | chrF | ROUGE-L | METEOR | TER↓ | Cos Sim | COMET |
|---|---|---|---|---|---|---|---|---|
| SMOL Test Set (general domain) | | | | | | | | |
| Zero-Shot NLLB | en→trp | 0.50 | 11.89 | 0.0261 | 0.0132 | 139.51 | 0.1939 | 0.2697 |
| Zero-Shot NLLB | trp→en | 0.63 | 17.07 | 0.0675 | 0.0526 | 130.30 | 0.1872 | 0.2880 |
| System 1 (No BT) | en→trp | 13.35 | 46.22 | 0.3873 | 0.3112 | 75.95 | 0.7707 | 0.6938 |
| System 1 (No BT) | trp→en | 32.91 | 49.41 | 0.5498 | 0.5091 | 55.07 | 0.7466 | 0.6604 |
| System 2 (Full) | en→trp | 15.25 | 47.67 | 0.3896 | 0.3138 | 74.26 | 0.7596 | 0.6958 |
| System 2 (Full) | trp→en | 38.56 | 53.92 | 0.5919 | 0.5602 | 50.15 | 0.7911 | 0.6926 |
| WMT Test Set (Bible domain) | | | | | | | | |
| Zero-Shot NLLB | en→trp | 0.09 | 12.76 | 0.0136 | 0.0056 | 121.59 | 0.3361 | 0.2545 |
| Zero-Shot NLLB | trp→en | 0.32 | 16.89 | 0.0560 | 0.0390 | 123.30 | 0.2701 | 0.2888 |
| System 1 (No BT) | en→trp | 14.87 | 42.34 | 0.3908 | 0.3136 | 86.11 | 0.7268 | 0.6718 |
| System 1 (No BT) | trp→en | 24.99 | 45.14 | 0.5078 | 0.4306 | 70.23 | 0.7889 | 0.6413 |
| System 2 (Full) | en→trp | 17.30 | 47.11 | 0.4332 | 0.3483 | 76.81 | 0.7479 | 0.7064 |
| System 2 (Full) | trp→en | 28.03 | 48.18 | 0.5449 | 0.4713 | 66.31 | 0.8171 | 0.6640 |
| WMT 2025 Shared Task Best Systems [4] (Bible test set only; other metrics not reported) | | | | | | | | |
| WMT Best | en→trp | 6.99 | 38.08 | 0.367 | 0.300 | 76.26 | – | – |
| WMT Best | trp→en | 2.99 | 25.52 | 0.218 | 0.163 | 117.73 | 0.487 | – |
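A short script can tabulate where System 2 improves on System 1, using the values from the results table above. This is an illustrative check rather than part of the authors' evaluation code; the data layout and the function name `regressions` are assumptions. TER is treated as lower-is-better and every other metric as higher-is-better.

```python
# Whether a higher score is better, per metric (TER is an error rate).
HIGHER_IS_BETTER = {
    "BLEU": True, "chrF": True, "ROUGE-L": True, "METEOR": True,
    "TER": False, "Cos Sim": True, "COMET": True,
}
METRICS = list(HIGHER_IS_BETTER)

# Scores copied from the results table, keyed by (test set, direction).
system1 = {
    ("SMOL", "en→trp"): [13.35, 46.22, 0.3873, 0.3112, 75.95, 0.7707, 0.6938],
    ("SMOL", "trp→en"): [32.91, 49.41, 0.5498, 0.5091, 55.07, 0.7466, 0.6604],
    ("WMT", "en→trp"): [14.87, 42.34, 0.3908, 0.3136, 86.11, 0.7268, 0.6718],
    ("WMT", "trp→en"): [24.99, 45.14, 0.5078, 0.4306, 70.23, 0.7889, 0.6413],
}
system2 = {
    ("SMOL", "en→trp"): [15.25, 47.67, 0.3896, 0.3138, 74.26, 0.7596, 0.6958],
    ("SMOL", "trp→en"): [38.56, 53.92, 0.5919, 0.5602, 50.15, 0.7911, 0.6926],
    ("WMT", "en→trp"): [17.30, 47.11, 0.4332, 0.3483, 76.81, 0.7479, 0.7064],
    ("WMT", "trp→en"): [28.03, 48.18, 0.5449, 0.4713, 66.31, 0.8171, 0.6640],
}


def regressions(base, new):
    """List (condition, metric, base score, new score) where `new` is not better."""
    found = []
    for condition, base_scores in base.items():
        for metric, b, n in zip(METRICS, base_scores, new[condition]):
            improved = n > b if HIGHER_IS_BETTER[metric] else n < b
            if not improved:
                found.append((condition, metric, b, n))
    return found


worse = regressions(system1, system2)
print(f"{len(worse)} of {len(system1) * len(METRICS)} comparisons do not improve:")
for condition, metric, b, n in worse:
    print(condition, metric, b, "->", n)
```

Of the 28 comparisons, 27 favour System 2; the only one that does not is LaBSE cosine similarity for en→trp on the SMOL test set.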
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).