Preprint
Article

This version is not peer-reviewed.

Building Neural Machine Translation for Garo: Corpus, Ablation, and Agentic Reranking

Submitted: 05 April 2026

Posted: 07 April 2026


Abstract
We present GaroNMT, a neural machine translation system for Garo (ISO 639-3: grt), a Tibeto-Burman language spoken by approximately one million people in Northeast India and written in a Latin-based orthography with distinctive glottal stop and nasal characters. Despite its sizeable speaker population, Garo has received virtually no attention in NLP research, and no machine translation system was publicly available prior to this work. We fine-tune NLLB-200-distilled-600M after introducing a custom language token (grt_Latn), and conduct a systematic ablation study across six configurations, including zero-shot baselines, gold-only fine-tuning, and combined LLM back-translation and gold training, with and without continued pretraining (CPT) on Garo monolingual data. Our best model (B1: fresh base, BT + gold) achieves BLEU 14.06 / ChrF++ 51.38 in-domain and BLEU 16.50 / ChrF 54.52 out-of-domain in the English-to-Garo direction, and BLEU 29.50 / ChrF++ 49.23 in-domain and BLEU 45.37 / ChrF 60.15 out-of-domain in the Garo-to-English direction, compared to a zero-shot baseline of BLEU 0.23 / 0.56 respectively. We find that CPT on 55,623 Garo sentences converges to near-zero loss within one epoch and provides no downstream translation benefit. We additionally identify and document a Garo-specific evaluation challenge: inconsistent use of two Unicode interpunct characters, U+00B7 (middle dot) and U+2219 (bullet operator), artificially suppresses automatic metrics, so text must be normalised before scoring. We release our 15,441-pair gold parallel corpus and models under CC-BY-4.0 to support future NLP research on Garo and related Northeast Indian languages.
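The interpunct issue described in the abstract can be illustrated with a minimal Python sketch. The function name `normalise_interpunct` is hypothetical, and treating U+00B7 as the canonical form is an assumption made here for illustration; the abstract does not state which code point the authors normalise to. The word "A·chik" is used only as an example string.

```python
# Minimal sketch, not the authors' implementation.
# Assumption: U+00B7 is treated as the canonical interpunct.
MIDDLE_DOT = "\u00b7"       # U+00B7 MIDDLE DOT
BULLET_OPERATOR = "\u2219"  # U+2219 BULLET OPERATOR


def normalise_interpunct(text: str, canonical: str = MIDDLE_DOT) -> str:
    """Map both interpunct variants to one canonical code point."""
    return text.replace(MIDDLE_DOT, canonical).replace(BULLET_OPERATOR, canonical)


# The two strings look identical on screen but differ in one code point,
# so any string-matching metric treats them as different tokens.
hypothesis = "A\u2219chik"
reference = "A\u00b7chik"

print(hypothesis == reference)  # False
print(normalise_interpunct(hypothesis) == normalise_interpunct(reference))  # True
```

Applying the same normalisation to both hypotheses and references before computing BLEU or ChrF keeps visually identical words from being counted as mismatches.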
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

