Building Neural Machine Translation for Garo: Corpus, Ablation, and Agentic Reranking

Badal Nyalang; Walmatchi Momin

doi:10.20944/preprints202604.0370.v1

Submitted: 05 April 2026

Posted: 07 April 2026


Abstract

We present GaroNMT, a neural machine translation system for Garo (ISO 639-3: grt), a Tibeto-Burman language spoken by approximately one million speakers in Northeast India, written in a Latin-based orthography with distinctive glottal stop and nasal characters. Despite its significant speaker population, Garo has received virtually no attention in NLP research, with no publicly available machine translation system prior to this work. We fine-tune NLLB-200-distilled-600M by introducing a custom language token (grt_Latn), and conduct a systematic ablation study across six configurations including zero-shot baselines, gold-only fine-tuning, and combined LLM-backtranslation and gold training, with and without continued pretraining (CPT) on Garo monolingual data. Our best model (B1: fresh base, BT + gold) achieves BLEU 14.06 / ChrF++ 51.38 in-domain and BLEU 16.50 / ChrF 54.52 out-of-domain in the English-to-Garo direction, and BLEU 29.50 / ChrF++ 49.23 in-domain and BLEU 45.37 / ChrF 60.15 out-of-domain in the Garo-to-English direction, compared to zero-shot baseline BLEU scores of 0.23 / 0.56, respectively. We find that CPT on 55,623 Garo sentences converges to near-zero loss within one epoch and provides no downstream translation benefit. We additionally identify and document a Garo-specific evaluation challenge: Unicode interpunct inconsistency between U+00B7 and U+2219 artificially suppresses automatic metrics and requires normalisation before scoring. We release our 15,441-pair gold parallel corpus and models under CC-BY-4.0 to support future NLP research on Garo and related Northeast Indian languages.
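The interpunct normalisation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that U+00B7 (MIDDLE DOT) is chosen as the canonical form; the abstract does not say which code point the authors normalise to, and the function name `normalise_interpunct` is hypothetical rather than taken from the released code. The idea is to map both variants to one code point in hypotheses and references before any metric is computed.

```python
# Minimal sketch of interpunct normalisation before scoring.
# Assumption: U+00B7 (MIDDLE DOT) is the canonical form; the source does not
# state which of the two code points the authors normalise to.
MIDDLE_DOT = "\u00b7"       # U+00B7 MIDDLE DOT
BULLET_OPERATOR = "\u2219"  # U+2219 BULLET OPERATOR


def normalise_interpunct(text: str, canonical: str = MIDDLE_DOT) -> str:
    """Map both interpunct variants in `text` to a single code point."""
    table = {ord(MIDDLE_DOT): canonical, ord(BULLET_OPERATOR): canonical}
    return text.translate(table)


# Illustrative strings only: the same word typed with each variant.
hypothesis = "A\u2219chik"
reference = "A\u00b7chik"

# Without normalisation the strings differ, so string-matching metrics
# would count a mismatch; after normalisation they compare equal.
raw_match = hypothesis == reference
normalised_match = normalise_interpunct(hypothesis) == normalise_interpunct(reference)
```

Applying the same function to both sides of every evaluation pair keeps the comparison symmetric, whichever code point is picked as canonical.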

Keywords: neural machine translation; Garo (grt); low-resource languages; Tibeto-Burman; NLLB-200; backtranslation; agentic reranking; parallel corpus; Northeast India; Unicode normalization

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
