Preprint
Article

This version is not peer-reviewed.

Building Neural Machine Translation for Garo: Corpus, Ablation, and Agentic Reranking

Submitted:

05 April 2026

Posted:

07 April 2026


Abstract
We present GaroNMT, a neural machine translation system for Garo (ISO 639-3: grt), a Tibeto-Burman language spoken by approximately one million speakers in Northeast India, written in a Latin-based orthography with distinctive glottal stop and nasal characters. Despite its significant speaker population, Garo has received virtually no attention in NLP research, with no publicly available machine translation system prior to this work. We fine-tune NLLB-200-distilled-600M by introducing a custom language token (grt_Latn), and conduct a systematic ablation study across six configurations including zero-shot baselines, gold-only fine-tuning, and combined LLM-backtranslation and gold training, with and without continued pretraining (CPT) on Garo monolingual data. Our best model (B1: fresh base, BT + gold) achieves BLEU 14.06 / ChrF++ 51.38 in-domain and BLEU 16.50 / ChrF 54.52 out-of-domain in the English-to-Garo direction, and BLEU 29.50 / ChrF++ 49.23 in-domain and BLEU 45.37 / ChrF 60.15 out-of-domain in the Garo-to-English direction, compared to a zero-shot baseline of BLEU 0.23 / 0.56 respectively. We find that CPT on 55,623 Garo sentences converges to near-zero loss within one epoch and provides no downstream translation benefit. We additionally identify and document a Garo-specific evaluation challenge: Unicode interpunct inconsistency between U+00B7 and U+2219 artificially suppresses automatic metrics and requires normalisation before scoring. We release our 15,441-pair gold parallel corpus and models under CC-BY-4.0 to support future NLP research on Garo and related Northeast Indian languages.

1. Introduction

Northeast India is home to over 220 languages spanning multiple language families, including Tibeto-Burman, Austro-Asiatic, and Tai-Kadai (Nyalang 2026). Despite this extraordinary linguistic diversity, the vast majority of these languages remain severely underrepresented in natural language processing research. The region’s languages are typically absent from large multilingual models, parallel corpora, and shared tasks, creating a compounding digital divide for millions of speakers.
Garo is one of the more widely spoken languages in this group. It belongs to the Bodo-Garo branch of the Tibeto-Burman family and is spoken primarily in the Garo Hills districts of Meghalaya and in adjacent parts of Assam and Tripura. Estimates of the speaker population range from 800,000 to over one million (Eberhard et al. 2023). The language is written in a Latin-based orthography that features distinctive characters including glottal stops (represented with an interpunct, encoded as either · (U+00B7) or ∙ (U+2219)) and nasal forms, giving it a unique orthographic profile among Indic languages.
Although its speaker population is comparable to that of many European languages that enjoy substantial NLP resources, Garo has received virtually no attention from the NLP community. No machine translation system, parallel corpus, or language model specifically targeting Garo existed prior to this work. The language is absent from NLLB-200 (NLLB Team et al. 2022), IndicTrans2 (Gala et al. 2023), and other major multilingual translation systems covering South and Southeast Asian languages.
This paper makes the following contributions:
  • We present GaroNMT, the first publicly available neural machine translation system for Garo, supporting bidirectional translation between Garo (Latin script) and English.
  • We release a curated gold parallel corpus of 15,441 sentence pairs spanning mixed domains, building on an earlier 2,500-pair release (Labs 2025).
  • We conduct a systematic ablation study across six configurations examining zero-shot performance, the effect of LLM-based backtranslation augmentation, and continued pretraining on monolingual Garo data.
  • We document a Garo-specific evaluation artefact: Unicode interpunct inconsistency between U+00B7 (·) and U+2219 (∙) artificially suppresses automatic MT metrics and must be normalised before evaluation.
  • We report a null result for CPT: monolingual pretraining converges within one epoch and provides no downstream translation benefit, a finding relevant to practitioners working on similar low-resource settings.
  • We introduce a GaroBERT-guided agentic reranking layer over GaroNMT, the first such system demonstrated for a Northeast Indian language.
  • We release LaBSE-Garo, a fine-tuned cross-lingual embedding model achieving 89.8% mean translation retrieval accuracy on Garo-English pairs, enabling future corpus mining and quality estimation work for Garo.
  • All models will be released under CC-BY-4.0 upon acceptance.

2. Background

2.1. The Garo Language

Garo (grt) is classified under the Bodo-Garo subgroup of the Tibeto-Burman language family. It is recognised as an associate official language in Meghalaya, where it serves as the primary language of the Garo Hills region. The language exhibits a subject-object-verb (SOV) basic word order, agglutinative morphology, and a system of nominal classifiers.
The standard orthography uses the Latin script with several augmentations. The glottal stop, a phonemically distinctive sound in Garo, is represented using the interpunct character (· or ∙), as in an·ching (we/us) or ong·a (is/exists). This orthographic complexity, combined with agglutinative morphology, makes Garo non-trivial for multilingual tokenizers trained primarily on European and South Asian scripts.
Garo’s agglutinative morphology means that a single English word may correspond to multiple valid Garo surface forms. For example, “farmer” may be rendered as gamgipa or gamgiparang depending on grammatical context. Politeness and register distinctions may also be expressed lexically: “Please sit down” may be translated as Asongbo (bare imperative) or Ka·sapae asongbo (with explicit politeness marker), both of which are correct. This morphological and pragmatic richness has direct implications for automatic evaluation, as single-reference BLEU penalises valid alternatives.

2.2. NLP for Northeast Indian Languages

NLP research on Northeast Indian languages has historically been sparse. Recent years have seen growing momentum, with the release of multilingual models such as NE-BERT (Nyalang 2026), which covers nine Northeast Indian languages including Garo, and the construction of monolingual corpora for several of the region’s languages. However, sequence-to-sequence tasks such as machine translation and automatic speech recognition have received comparatively little attention.
For Garo specifically, NE-BERT represents the first documented pretraining work in the Latin script, demonstrating that encoder representations for the language can be learned from relatively small monolingual corpora. Our work extends this into the generative, seq2seq setting, targeting practical deployment for bidirectional translation.

2.3. Low-Resource Neural Machine Translation

Neural machine translation for low-resource languages has advanced significantly through multilingual pretrained models. NLLB-200 (NLLB Team et al. 2022), trained on data for 200 languages, provides a strong foundation for fine-tuning on new or underrepresented languages. Prior work has demonstrated that adding custom language tokens to NLLB and fine-tuning on small gold corpora can yield usable translation quality for languages absent from the original training set (Adelani et al. 2022).
LLM-based backtranslation, generating synthetic parallel data by translating monolingual text using a large language model, has emerged as a practical augmentation strategy when human-translated data is limited (Sennrich et al. 2016). Continued pretraining (CPT) on target-language monolingual data is another commonly proposed adaptation strategy, though its impact in extremely low-resource settings is less well-studied (Müller et al. 2021). Our ablation directly addresses this gap.

3. Data

3.1. Gold Parallel Corpus

Our primary training resource is a curated gold parallel corpus of English-Garo sentence pairs constructed with an explicit focus on deployment-oriented coverage, spanning general conversation, government and administrative language, agriculture, health, education, and everyday scenarios. This intentional mixed-domain design distinguishes it from corpora built around a single domain such as Bible translations or legal texts.
The corpus includes approximately 1,500 short entries (single words to two- or three-word phrases) for lexical and idiomatic coverage, and approximately 13,500 full sentences including complex multi-clause constructions. Starting from 17,797 raw pairs, we removed null entries (13), deduplicated on the English source side (2,356 removed), removed entries exceeding 500 characters (1), and removed entries under 3 words (7), yielding 15,441 clean pairs. We shuffled with seed 42 and split into 500 test pairs (held out throughout), 500 validation pairs, and 14,441 training pairs. An earlier 2,500-pair subset was previously released (Labs 2025); the full corpus will be released under CC-BY-4.0 upon acceptance.
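The cleaning and splitting steps described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function names, the choice to measure length on either side of a pair, and counting words on the English side are assumptions, since the text does not specify them.

```python
import random

def clean_pairs(pairs, max_chars=500, min_words=3):
    """Filter raw (english, garo) pairs in the order given in the text."""
    seen_english = set()
    cleaned = []
    for en, grt in pairs:
        # 1. Drop null or empty entries.
        if not en or not grt or not en.strip() or not grt.strip():
            continue
        en, grt = en.strip(), grt.strip()
        # 2. Deduplicate on the English source side (keep the first occurrence).
        if en in seen_english:
            continue
        seen_english.add(en)
        # 3. Drop over-long entries (assumption: either side may trigger this).
        if len(en) > max_chars or len(grt) > max_chars:
            continue
        # 4. Drop very short entries (assumption: words counted on the English side).
        if len(en.split()) < min_words:
            continue
        cleaned.append((en, grt))
    return cleaned

def split_pairs(pairs, seed=42, n_test=500, n_val=500):
    """Shuffle with a fixed seed, then carve off test and validation sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test
```

Applied to 15,441 clean pairs, `split_pairs` leaves 14,441 pairs for training after the two 500-pair held-out sets are removed.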

3.2. Backtranslation Data

We generated synthetic parallel data using Gemini Flash, translating English sentences into Garo in batches of 50 with structured output prompting and automatic retry logic. Of approximately 55,000 attempted pairs, 52,386 usable pairs were retained after cleaning (deduplication, null removal, TSV parsing error filtering), a success rate of approximately 95.2%. We treat this data as noisy silver-standard: the quality of LLM-generated Garo is inherently limited by the model’s sparse Garo knowledge, and no language-ID filtering was applied in the current work.

3.3. Monolingual Data

For CPT experiments, we used a monolingual Garo corpus of 55,623 sentences collected through MWire Labs’ Northeast Indian language data collection programme.

4. Model and Training

4.1. Base Model and Tokenizer

We use NLLB-200-distilled-600M (NLLB Team et al. 2022) as our base model. This 615M parameter sequence-to-sequence transformer covers 200 languages; Garo is not among them. We add a custom special token grt_Latn (ID: 256,204) to the tokenizer vocabulary and resize the model’s embedding matrices accordingly. We use the slow NllbTokenizer rather than the fast tokenizer, as the fast tokenizer lacks the lang_code_to_id mapping required for forced BOS decoding.

4.2. Training Configurations

We evaluate six configurations: two zero-shot baselines and four fine-tuned models. All fine-tuned configurations share identical hyperparameters (Table 1) and are evaluated on the same held-out 500-pair test set.

Zero-shot Fresh (ZS-F).

NLLB-200-distilled-600M with grt_Latn token added, no fine-tuning.

Zero-shot CPT (ZS-C).

The model after continued pretraining on 55,623 Garo monolingual sentences using a denoising objective (source equals target). CPT ran for 3 epochs; validation loss converged from 0.001097 at epoch 1 to 0.000363 at epoch 2, indicating near-complete convergence of the denoising task within the first two epochs. No translation fine-tuning was applied.

A1 – Fresh base, gold only.

Fine-tuned from NLLB-200-distilled-600M on 14,441 gold training pairs in both directions simultaneously, yielding 28,882 training examples.

A2 – CPT base, gold only.

The ZS-C checkpoint fine-tuned identically to A1. Isolates the effect of monolingual CPT on gold-only fine-tuning.

B1 – Fresh base, BT + gold.

Fine-tuned from NLLB-200-distilled-600M on the combined 52,386 BT pairs and 14,441 gold pairs in both directions, yielding 133,654 training examples. We use a single-stage approach mixing BT and gold data rather than a sequential pipeline, following observations that phased training provides no consistent benefit at this data scale.

B2 – CPT base, BT + gold.

The ZS-C checkpoint fine-tuned on the same combined dataset as B1.
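The example counts quoted for A1 and B1 follow from emitting every sentence pair once per direction. A minimal sketch of that expansion is shown below; the dictionary layout and the helper name `make_bidirectional` are illustrative assumptions, while `eng_Latn` is the NLLB code for English and `grt_Latn` is the custom token introduced in this work.

```python
def make_bidirectional(pairs, en_code="eng_Latn", grt_code="grt_Latn"):
    """Turn (english, garo) pairs into one training example per direction."""
    examples = []
    for en, grt in pairs:
        # English -> Garo
        examples.append({"src_lang": en_code, "tgt_lang": grt_code,
                         "src": en, "tgt": grt})
        # Garo -> English
        examples.append({"src_lang": grt_code, "tgt_lang": en_code,
                         "src": grt, "tgt": en})
    return examples

gold = [
    ("Please sit down.", "Ka\u00B7sapae asongbo."),
    ("Farmers work until night.", "Gamgiparang walona kingking kam ka\u00B7a."),
]
print(len(make_bidirectional(gold)))   # 4

# Counts reported in the text:
print(2 * 14_441)                      # 28882  (A1, A2: gold only)
print(2 * (52_386 + 14_441))           # 133654 (B1, B2: BT + gold)
```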

4.3. Evaluation Protocol and Unicode Normalisation

All models are evaluated on the held-out 500-pair in-domain test set using BLEU (Papineni et al. 2002) via sacreBLEU (Post 2018), ChrF++ (Popović 2015), TER, and METEOR. Both translation directions are evaluated.
A critical preprocessing step is required before scoring: normalisation of Garo interpunct characters. Two Unicode code points are used interchangeably for the Garo glottal stop marker in practice, U+00B7 (middle dot, ·) and U+2219 (bullet operator, ∙). Without normalisation, identical surface forms differing only in interpunct code point are scored as mismatches, artificially suppressing all metrics. We normalise all predictions and references to U+00B7 before scoring and recommend this as a standard preprocessing step for all future Garo NLP evaluation. All results reported in this paper apply this normalisation.
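The effect of the two code points on string comparison can be illustrated with a short sketch. The helper name `normalise_interpunct` is hypothetical; the replacement itself is the one recommended in this paper, and only the Python standard library is used.

```python
import unicodedata

MIDDLE_DOT = "\u00B7"        # canonical interpunct recommended in this paper
BULLET_OPERATOR = "\u2219"   # visually near-identical variant found in Garo text

def normalise_interpunct(text: str) -> str:
    """Map every U+2219 to U+00B7 so both variants compare equal."""
    return text.replace(BULLET_OPERATOR, MIDDLE_DOT)

prediction = "re\u00B7angaha"
reference = "re\u2219angaha"

print(unicodedata.name(MIDDLE_DOT))        # MIDDLE DOT
print(unicodedata.name(BULLET_OPERATOR))   # BULLET OPERATOR
print(prediction == reference)             # False
print(normalise_interpunct(prediction) == normalise_interpunct(reference))  # True
```

In an evaluation script, the same function would be applied to every hypothesis and every reference before they are passed to the metric implementation.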
We additionally evaluate B1 and B2 on a 100-pair out-of-domain test set generated using Gemini 2.5 Flash Thinking, covering domains not represented in the gold training corpus. We acknowledge that LLM-generated references may favour models trained on LLM-generated BT data; these results are reported as a preliminary OOD signal pending human-authored evaluation.

5. Results

Table 2. In-domain results (500-pair gold test set, Unicode-normalised). ZS-F = zero-shot fresh NLLB; ZS-C = zero-shot after CPT only. TER is lower-is-better (indicated by ↓). Bold = best result per column among fine-tuned models.

| Run | Data | en→grt BLEU | en→grt ChrF++ | en→grt TER↓ | en→grt METEOR | grt→en BLEU | grt→en ChrF++ | grt→en TER↓ | grt→en METEOR |
|---|---|---|---|---|---|---|---|---|---|
| ZS-F | – | 0.23 | 8.95 | 221.27 | 0.056 | 0.56 | 14.78 | 148.81 | 0.087 |
| ZS-C | Mono CPT | 0.69 | 11.42 | 136.04 | 0.064 | 0.71 | 12.74 | 106.61 | 0.056 |
| A1 | Gold | 12.31 | 48.57 | 80.37 | 0.290 | 27.36 | 47.34 | 63.04 | 0.450 |
| A2 | CPT + Gold | 12.68 | 48.44 | 79.45 | 0.292 | 25.76 | 46.02 | 66.09 | 0.437 |
| B1 | BT + Gold | **14.06** | **51.38** | 75.70 | **0.321** | **29.50** | 49.23 | **59.88** | **0.485** |
| B2 | CPT + BT + Gold | 13.25 | 51.26 | **75.08** | 0.314 | 29.36 | **49.36** | 60.20 | 0.483 |

Table 3. Out-of-domain results (100-pair LLM-generated test set). References generated by Gemini 2.5 Flash Thinking; scores may favour BT-trained models.

| Run | en→grt BLEU | en→grt ChrF | grt→en BLEU | grt→en ChrF |
|---|---|---|---|---|
| B1 | 16.50 | 54.52 | 45.37 | 60.15 |
| B2 | 14.55 | 52.88 | 45.32 | 60.55 |

Zero-shot baselines.

Both zero-shot configurations produce near-zero BLEU (ZS-F: 0.23 en→grt, 0.56 grt→en; ZS-C: 0.69 en→grt, 0.71 grt→en), confirming that NLLB-200 has no usable Garo translation capability out of the box. CPT alone (ZS-C) lowers TER (from 221.27 to 136.04 en→grt and from 148.81 to 106.61 grt→en) but yields no meaningful BLEU gain.

BT augmentation helps consistently.

Comparing A1 and B1, adding 52,386 BT pairs improves en→grt BLEU from 12.31 to 14.06 (+1.75) and grt→en BLEU from 27.36 to 29.50 (+2.14). ChrF++ improves by +2.81 and +1.89 respectively. Gains are consistent across all metrics and both directions.
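The deltas quoted above can be recomputed directly from the in-domain results table. The sketch below hard-codes the A1 and B1 rows; the dictionary layout and the helper name `delta` are illustrative assumptions.

```python
# In-domain scores for A1 (gold only) and B1 (BT + gold), copied from Table 2.
IN_DOMAIN = {
    "A1": {
        "en-grt": {"BLEU": 12.31, "ChrF++": 48.57, "TER": 80.37, "METEOR": 0.290},
        "grt-en": {"BLEU": 27.36, "ChrF++": 47.34, "TER": 63.04, "METEOR": 0.450},
    },
    "B1": {
        "en-grt": {"BLEU": 14.06, "ChrF++": 51.38, "TER": 75.70, "METEOR": 0.321},
        "grt-en": {"BLEU": 29.50, "ChrF++": 49.23, "TER": 59.88, "METEOR": 0.485},
    },
}

def delta(baseline, system, direction, metric):
    """Score of `system` minus score of `baseline`, rounded to 3 decimals."""
    diff = IN_DOMAIN[system][direction][metric] - IN_DOMAIN[baseline][direction][metric]
    return round(diff, 3)

for direction in ("en-grt", "grt-en"):
    for metric in ("BLEU", "ChrF++", "TER", "METEOR"):
        print(direction, metric, delta("A1", "B1", direction, metric))
```

BLEU, ChrF++, and METEOR deltas come out positive and TER deltas negative in both directions; since TER is lower-is-better, all four metrics move in the favourable direction.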

CPT is a null result.

In the en→grt direction, A2 differs from A1 by only +0.37 BLEU and −0.13 ChrF++; in the grt→en direction, A2 is lower than A1 by 1.60 BLEU and 1.32 ChrF++. CPT therefore yields no consistent gain over the fresh base. CPT on 55,623 sentences converged to validation loss 0.000363 by epoch 2, indicating the denoising objective was essentially solved before translation fine-tuning began. The CPT step added compute without transferable translation signal.

B1 vs. B2.

In-domain results are mixed: B2 leads on TER en→grt (75.08 vs. 75.70), while B1 leads on BLEU, ChrF++, and METEOR in the en→grt direction. On OOD evaluation, B1 leads clearly on en→grt (BLEU 16.50 vs. 14.55, ChrF 54.52 vs. 52.88), which is the primary deployment direction. Given the null CPT result and B1’s OOD advantage, B1 is the recommended model.

5.1. Qualitative Analysis

Table 4 presents representative translations from B1 illustrating both model strengths and the limitations of single-reference evaluation.
Example 4 is particularly instructive. The reference translates “Please sit down” as bare Asongbo (“sit down”), omitting the politeness marker entirely. B1 produces Ka·sapae asongbo, correctly adding the Garo politeness particle for “please”. A native speaker would judge the model output superior, yet BLEU assigns zero n-gram overlap. This illustrates that single-reference BLEU systematically underestimates quality for morphologically and pragmatically rich languages like Garo, and motivates human evaluation as a necessary complement to automatic metrics.

6. Discussion

6.1. The Unicode Interpunct Evaluation Artefact

During development, we observed anomalously high automatic scores (BLEU 100.00) on preliminary evaluation runs, followed by unexpectedly low scores (BLEU ∼13) on subsequent runs using identical models. Investigation revealed that two interpunct Unicode code points, U+00B7 and U+2219, were being used inconsistently across the corpus, predictions, and references depending on the input source and editing environment. Without normalisation, a prediction of re·angaha (U+00B7) is scored as a mismatch against reference re∙angaha (U+2219) despite being identical in meaning and pronunciation.
This is not a tokenization edge case but a fundamental orthographic inconsistency in the current state of Garo digital text. We recommend that all future Garo NLP work adopt U+00B7 as the canonical interpunct and apply the following normalisation before any evaluation:
text = text.replace('\u2219', '\u00B7')

6.2. On the CPT Null Result

CPT converging to validation loss 0.000363 within two epochs on a denoising (src=tgt) objective indicates that the model rapidly memorised the Garo monolingual data without meaningfully updating cross-lingual representations. The downstream translation quality is unchanged, consistent with this interpretation.
We hypothesise two contributing factors. First, NLLB-200’s multilingual pretraining already covers Tibeto-Burman-adjacent languages, providing a partial initialisation for Garo morphology. Second, 55,623 sentences is insufficient for a denoising CPT regime to reshape encoder representations meaningfully; the model solves the task by shallow pattern matching rather than acquiring deeper Garo linguistic structure. Future work should examine CPT with larger corpora (500k+ sentences) and structured objectives such as masked language modelling on the encoder only.

6.3. Deployment Context

GaroNMT is part of the Kren Stack, MWire Labs’ suite of language technology tools for Northeast Indian languages. The best-performing model (B1) is integrated into MWire Console (console.mwirelabs.com) and accessible via API. The mixed-domain gold corpus design was motivated directly by deployment requirements: a translation system for Northeast India must handle agricultural extension messaging, administrative forms, health communication, and conversational interfaces within a single model.

6.4. GaroBERT-Guided Agentic Reranking

We explore a post-hoc agentic quality layer over B1, combining GaroNMT with GaroBERT (Nyalang 2026), a masked language model pretrained on monolingual Garo text. The pipeline operates as follows: B1 generates n-best hypotheses via beam search; each hypothesis is Unicode-normalised and scored by GaroBERT using pseudo-log-likelihood (masking each token in turn and measuring the model’s surprisal); beam scores and GaroBERT fluency scores are normalised to $[0, 1]$ and blended as $\alpha \cdot s_{\text{beam}} + (1 - \alpha) \cdot s_{\text{fluency}}$; the highest-scoring hypothesis is selected. If the best blended score falls below a confidence threshold, the pipeline retries with a wider beam ($n = 10$), constituting an autonomous quality-driven decision loop.
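The blending and retry logic can be sketched as below. Several details are assumptions rather than the authors' implementation: min-max scaling is used for the normalisation to [0, 1], the first pass uses n = 5, the threshold is left as a required argument because no value is reported, and `generate_nbest` and `score_fluency` are hypothetical callables standing in for B1 beam search and GaroBERT pseudo-log-likelihood scoring.

```python
from typing import Callable, List, Sequence, Tuple

def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; if all scores tie, give each 1.0 (assumption)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def rerank(hypotheses: Sequence[str], beam_scores: Sequence[float],
           fluency_scores: Sequence[float], alpha: float = 0.9) -> Tuple[str, float]:
    """Pick the hypothesis maximising alpha * s_beam + (1 - alpha) * s_fluency."""
    beam_n = min_max(beam_scores)
    flu_n = min_max(fluency_scores)
    blended = [alpha * b + (1 - alpha) * f for b, f in zip(beam_n, flu_n)]
    best = max(range(len(hypotheses)), key=lambda i: blended[i])
    return hypotheses[best], blended[best]

def translate_with_retry(source: str,
                         generate_nbest: Callable[[str, int], Tuple[List[str], List[float]]],
                         score_fluency: Callable[[str], float],
                         threshold: float, alpha: float = 0.9,
                         n_first: int = 5, n_retry: int = 10) -> Tuple[str, float]:
    """Rerank an n-best list; retry once with a wider beam if confidence is low."""
    best, score = "", 0.0
    for n in (n_first, n_retry):
        hyps, beam_scores = generate_nbest(source, n)
        # Unicode-normalise each hypothesis before fluency scoring.
        hyps = [h.replace("\u2219", "\u00B7") for h in hyps]
        fluency = [score_fluency(h) for h in hyps]
        best, score = rerank(hyps, beam_scores, fluency, alpha)
        if score >= threshold:
            break  # confident enough; no wider beam needed
    return best, score
```

With α close to 1 the beam score decides the ranking in most cases, and the fluency term mainly separates hypotheses whose beam scores are nearly equal.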
We tune α on the development set and find that α = 0.9 yields the best BLEU, indicating that beam scores dominate and GaroBERT contributes as a tie-breaker. On the 500-pair in-domain test set, the agentic system achieves BLEU 14.16 / ChrF++ 51.21 versus the B1 baseline of 14.06 / 51.20, a marginal gain. We observe that GaroBERT fluency scores do not reliably proxy translation adequacy: the encoder captures surface Garo fluency but cannot verify meaning preservation, a known limitation of monolingual rerankers for morphologically rich languages. These results suggest that adequacy-aware reranking, potentially via cross-lingual sentence embeddings or reference-free metrics, is a productive direction for future work. To our knowledge, this is the first demonstrated agentic MT pipeline for any Northeast Indian language. As a byproduct of this investigation, fine-tuning LaBSE (Feng et al. 2022) on 14,441 Garo-English pairs yields 89.8% mean translation retrieval accuracy (91.2% src2trg, 88.4% trg2src), demonstrating that cross-lingual alignment for Garo is achievable with modest supervised data.
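For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute it: each source embedding retrieves its nearest target embedding by cosine similarity, a hit is counted when the retrieved index matches, and the two directions are averaged. That this matches the authors' exact evaluation procedure is an assumption; the function names and the toy vectors are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieval_accuracy(queries, candidates):
    """Fraction of queries whose nearest candidate shares the query's index."""
    hits = 0
    for i, q in enumerate(queries):
        nearest = max(range(len(candidates)), key=lambda j: cosine(q, candidates[j]))
        hits += int(nearest == i)
    return hits / len(queries)

def mean_retrieval_accuracy(src_vecs, trg_vecs):
    """Return (src2trg, trg2src, mean) for aligned embedding lists."""
    src2trg = retrieval_accuracy(src_vecs, trg_vecs)
    trg2src = retrieval_accuracy(trg_vecs, src_vecs)
    return src2trg, trg2src, (src2trg + trg2src) / 2

# Toy 2-d vectors standing in for sentence embeddings of three aligned pairs.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
trg = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
print(mean_retrieval_accuracy(src, trg))
```

Under this definition the reported mean of 89.8% is the average of the two directional accuracies, 91.2% and 88.4%.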

7. Limitations

OOD reference quality.

The out-of-domain test set uses LLM-generated Garo references, which may favour models trained on LLM-generated BT data. Human-authored OOD references are needed for unbiased generalisation evaluation.

Single-reference evaluation.

All automatic metrics are computed against a single reference. Garo’s agglutinative morphology and pragmatic richness mean that multiple valid surface forms exist for most inputs. Single-reference BLEU systematically underestimates quality, as illustrated by the Ka·sapae politeness marker example.

BT quality.

No language-ID filtering was applied to the 52,386 BT pairs. Some portion may contain fluent but incorrect Garo. Future iterations should apply fastText language identification as a quality filter before training.

Script coverage.

This work targets Garo in Latin script only. Alternative orthographic conventions are not covered.

Human evaluation.

Systematic human evaluation by native Garo speakers has not yet been conducted. This is the most important gap for future work.

8. Future Work

The most immediate priority is human evaluation by native Garo speakers using adequacy and fluency ratings on a standardised scale, which MWire Labs is actively planning through community partnerships. Second, an independently collected, human-authored OOD test set from news, government text, and spoken language transcripts will provide unbiased generalisation evaluation. Third, fastText-based language-ID filtering of BT data should be applied in future training rounds. Fourth, iterative gold corpus expansion, currently ongoing at MWire Labs, will enable staged improvement tracking as more data becomes available. Fifth, adequacy-aware reranking via cross-lingual embeddings or reference-free quality estimation metrics represents a natural extension of the agentic pipeline introduced in Section 6. Finally, integration with the Garo ASR system under development as part of the NE-Stack will enable end-to-end spoken-language translation for Garo.

9. Conclusion

We have presented GaroNMT, the first neural machine translation system for Garo, a Tibeto-Burman language of Northeast India with approximately one million speakers. Our systematic ablation across six configurations establishes zero-shot baselines (BLEU 0.23 en→grt, 0.56 grt→en), quantifies the gain from gold fine-tuning (BLEU 12.31 / 27.36), and shows that LLM-based backtranslation augmentation provides consistent further improvement (BLEU 14.06 / 29.50 in-domain, 16.50 / 45.37 OOD for B1). Continued pretraining on 55,623 Garo monolingual sentences reaches near-zero loss within one epoch and provides no downstream benefit, a null result with practical value for low-resource MT practitioners. We document and resolve a Garo-specific Unicode interpunct evaluation artefact, identify reference inadequacy as a systematic challenge for single-reference evaluation of morphologically rich languages, and introduce the first agentic MT reranking pipeline for a Northeast Indian language. We release our corpus and models under CC-BY-4.0 to support the growing ecosystem of NLP tools for Northeast India’s languages.

Acknowledgments

The authors thank the Garo-speaking annotators and language contributors whose work made the parallel corpus possible. This work is part of the NE-Stack research programme at MWire Labs, focused on building language technology infrastructure for Northeast Indian languages.

Appendix A. Reference Inadequacy: Three Systematic Failure Modes

Automatic evaluation metrics such as BLEU are computed against a single reference translation. For Garo, we observe systematic cases where the model output is linguistically superior to the reference yet receives zero or near-zero n-gram credit. We document three distinct phenomena below, with concrete examples from our 500-pair in-domain test set.

Appendix A.1. Native Vocabulary vs. Loanwords

Garo has absorbed loanwords from Hindi, Assamese, and Bengali through prolonged contact. Reference translations in our corpus occasionally use these loanwords where a native Garo equivalent exists. In such cases, a model that produces the native form is penalised despite being more linguistically faithful.
English: I am going to the market.
Reference: Anga bazarona re·angenga.
B1 Prediction: Anga antiona re·angenga.
Analysis: bazar is a Hindi/Assamese loanword; anti is the native Garo word for market. The model is more linguistically faithful; BLEU assigns zero credit for the content word.

Appendix A.2. Semantic Precision: Eating vs. Drinking

Garo lexicalises eating solid food and consuming liquids with distinct verb stems. The verb cha·a is the standard form for eating solid food; chi·a is associated with drinking or consuming liquids. References occasionally use the less precise form, while the model selects the semantically correct verb.
English: Have you eaten rice?
Reference: Na·a mi chi·ahama?
B1 Prediction: Na·a mi cha·ahama?
Analysis: chi·a (consume liquid) is semantically imprecise for eating rice. cha·a (eat solid food) is the contextually correct verb. The model is more accurate; BLEU penalises it for not reproducing the reference’s imprecision.

Appendix A.3. Politeness and Register

Garo expresses politeness through the dedicated particle Ka·sapae, equivalent to “please” in imperatives. References in our corpus occasionally omit this marker when translating polite English imperatives. The model correctly identifies and translates the pragmatic intent.
English: Please sit down.
Reference: Asongbo.
B1 Prediction: Ka·sapae asongbo.

English: Please speak slowly.
Reference: Saksate saksate nengmitaniko on·bo.
B1 Prediction: Ka·sapae saksate nengmitaniko on·bo.

Analysis: In both cases the reference omits the politeness marker. The model adds it correctly, producing a more faithful translation of the English source. BLEU penalises the additional token despite it being semantically necessary.
These three phenomena, loanword substitution, lexical semantic precision, and pragmatic register, collectively demonstrate that single-reference BLEU is a particularly unreliable metric for Garo and motivate multi-reference evaluation and human judgement as essential complements for future Garo MT research.

References

  1. Adelani, David Ifeoluwa, Jesujoba Alabi, Angela Fan, et al. 2022. A few thousand translations go a long way! Leveraging pre-trained models for African machine translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Seattle, United States: Association for Computational Linguistics, pp. 3053–3070.
  2. Eberhard, David M., Gary F. Simons, and Charles D. Fennig. 2023. Ethnologue: Languages of the world.
  3. Feng, Fangxiaoyu, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Dublin, Ireland: Association for Computational Linguistics, pp. 878–891.
  4. Gala, Jay, Pranjal A Madhani, Mitesh M Khapra, et al. 2023. IndicTrans2: Towards high-quality and accessible machine translation models for all 22 scheduled Indian languages. Transactions on Machine Learning Research.
  5. Labs, MWire. 2025. garo-english-parallel-corpus (revision 66dbd04).
  6. Müller, Benjamin, Antonios Anastasopoulos, Benoît Sagot, and Djamé Seddah. 2021. Unseen languages are not unseen: Language families are enough for cross-lingual transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, August, pp. 4795–4806.
  7. NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia-Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2022. No language left behind: Scaling human-centered machine translation. arXiv:2207.04672.
  8. Nyalang, Badal. 2026. NE-BERT: A multilingual language model for nine Northeast Indian languages. In Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026). Rabat, Morocco: Association for Computational Linguistics, March, pp. 1–12.
  9. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics, July, pp. 311–318.
  10. Popović, Maja. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation. Lisbon, Portugal: Association for Computational Linguistics, September, pp. 392–395.
  11. Post, Matt. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Brussels, Belgium: Association for Computational Linguistics, October, pp. 186–191.
  12. Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin, Germany: Association for Computational Linguistics, August, pp. 86–96.

Table 1. Shared hyperparameters for all fine-tuned configurations.

| Hyperparameter | Value |
|---|---|
| Base model | NLLB-200-distilled-600M |
| Epochs | 10 |
| Batch size (per device) | 32 |
| Gradient accumulation | 2 (effective: 64) |
| Learning rate | 5e-5 |
| Warmup steps | 200 |
| Precision | bf16 |
| Max sequence length | 128 tokens |
| Beam size (decoding) | 5 |
| Random seed | 42 |
| Hardware | NVIDIA A40 (48GB) |

Table 4. Sample translations from B1 (en→grt). Example 4 illustrates reference inadequacy: the model correctly adds the Garo politeness marker Ka·sapae for “please”, which the single reference omits.

| English | Reference | B1 Prediction | Note |
|---|---|---|---|
| Farmers work until night. | Gamgiparang walona kingking kam ka·a. | Gamgiparang walona kingking kam ka·a. | Exact match |
| I went to the Army Camp. | Anga Army Camp-on re·angaha. | Anga Army Camp-ona re·angaha. | Interpunct variant only |
| She was very happy to see her family. | Ua gisinni gipinrangko nangna pilaknan man·a. | Ua gisinni gipin-rangko nangan dake man·a. | Valid morphological alt. |
| Please sit down. | Asongbo. | Ka·sapae asongbo. | Model > reference |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.