Submitted: 12 April 2026
Posted: 14 April 2026
Abstract
Keywords:
1. Introduction
- We propose MINT, a 14.7M parameter seq2seq transformer incorporating RoPE, SiLU activations, DropPath, and weight tying, trained from scratch on XL-Sum across seven Indic languages.
- We introduce a two-phase training curriculum that separates fluency learning (cosine-decayed learning with warmup) from content fidelity learning (flat low-rate fine-tuning with coverage and attention entropy losses).
- We conduct the first identical-regime comparison of a lightweight custom model against IndicBART and mT5-small on seven Indic languages, controlling for all training hyperparameters.
- We demonstrate and document that the standard rouge_score library produces incorrect scores for Indic scripts, and provide a whitespace-based alternative.
- We show that MINT reaches approximately 73% of IndicBART’s ROUGE-1 on the six overlapping languages (0.1024 vs. 0.1409 macro-averaged) using 3.3% of its parameters, with ROUGE-per-parameter ratios roughly 22× better than IndicBART and 29× better than mT5-small.
2. Related Work
2.1. Multilingual Summarization
2.2. Efficient NLP Architectures
2.3. Indic NLP
3. The MINT Architecture
3.1. Overview
3.2. Rotary Position Embeddings
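As a minimal sketch of how rotary embeddings act on the query and key vectors of each attention head, the rotation can be written as follows. The base of 10,000 is the conventional default from Su et al. (2024) and is an assumption here; the value recorded in Appendix A was not preserved.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim).

    Each channel pair (i, i + head_dim/2) is rotated by an angle proportional
    to the token position, so relative offsets are encoded in the dot product.
    base=10000 is an assumed default, not a value confirmed by the paper.
    """
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```

In a seq2seq stack, this function would be applied to queries and keys (not values) in both encoder self-attention and decoder self-attention, in place of learned absolute position embeddings.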
3.3. SiLU Feed-Forward Sublayers
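A sketch of the position-wise feed-forward sublayer implied by the Appendix A settings (d_model = 256, FFN dim = 1024, dropout = 0.1). Note this is the plain SiLU FFN, not the gated GLU variant of Shazeer (2020):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN with SiLU activation (Ramachandran et al., 2017),
    using the dimensions listed in Appendix A."""
    def __init__(self, d_model: int = 256, d_ffn: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ffn),
            nn.SiLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ffn, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```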
3.4. Stochastic Depth (DropPath)
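A minimal DropPath sketch at the Appendix A rate of 0.05, following the usual per-sample residual-branch dropping of Huang et al. (2016); the exact placement within MINT's residual connections is an assumption:

```python
import torch
import torch.nn as nn

class DropPath(nn.Module):
    """Stochastic depth: randomly zero an entire residual branch per sample
    during training, rescaling survivors to keep the expectation unchanged."""
    def __init__(self, drop_rate: float = 0.05):
        super().__init__()
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_rate == 0.0:
            return x
        keep = 1.0 - self.drop_rate
        # One Bernoulli draw per sample, broadcast over all other dims.
        mask = x.new_empty(x.shape[0], *([1] * (x.dim() - 1))).bernoulli_(keep)
        return x * mask / keep

# Typical usage inside a transformer block: x = x + drop_path(sublayer(x))
```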
3.5. Weight Tying
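Three-way tying as listed in Appendix A (encoder embedding = decoder embedding = LM head) reduces to sharing a single weight matrix; a sketch with the paper's vocabulary and model dimensions:

```python
import torch.nn as nn

vocab_size, d_model = 32000, 256
shared_embedding = nn.Embedding(vocab_size, d_model)

encoder_embed = shared_embedding           # encoder and decoder share the table
decoder_embed = shared_embedding
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = shared_embedding.weight   # output projection reuses the same matrix
```

With a 32k vocabulary and d = 256, this single tied matrix accounts for roughly 8.2M of the 14.7M total parameters, which is why tying matters at this scale.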
3.6. Multilingual Tokenizer
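A hedged sketch of training such a tokenizer with the sentencepiece library, using the Appendix A settings (32,000 unigram pieces) and the <lang_xx> tags from Section 4.4. The corpus path and the character_coverage value are illustrative assumptions, not values confirmed by the paper:

```python
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="xlsum_train_all7.txt",      # hypothetical path: one line per sentence, all 7 languages
    model_prefix="mint_sp32k",
    vocab_size=32000,
    model_type="unigram",              # unigram LM segmentation (Kudo, 2018)
    character_coverage=0.9999,         # assumed: high coverage for 5 distinct scripts
    user_defined_symbols=[
        "<lang_hi>", "<lang_bn>", "<lang_mr>", "<lang_ta>",
        "<lang_te>", "<lang_pa>", "<lang_ur>",
    ],
)
```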
4. Training
4.1. Dataset and Preprocessing
Many XL-Sum articles end in boilerplate footers. In Hindi these include cross-link phrases glossed as “Read also”, social media follow invitations, and BBC Hindi branding text; analogous native-script footer phrases appear in Bengali and Marathi. Similar patterns hold for Tamil, Telugu, Punjabi, and Urdu. We detect these footers by scanning for trigger phrases and truncating at the first match, as in the sketch below. This cleaning step is applied to both training and evaluation data to ensure consistency.
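A sketch of the truncation step, assuming a simple substring scan; the trigger-phrase lists themselves (native-script strings) are left as placeholders to be filled in:

```python
# Per-language trigger phrases; the actual native-script strings go here
# (empty placeholders shown for illustration).
FOOTER_TRIGGERS = {lang: [] for lang in ("hi", "bn", "mr", "ta", "te", "pa", "ur")}

def strip_footer(article: str, lang: str) -> str:
    """Truncate an article at the first occurrence of any footer trigger phrase."""
    cut = len(article)
    for phrase in FOOTER_TRIGGERS.get(lang, ()):
        idx = article.find(phrase)
        if idx != -1:
            cut = min(cut, idx)
    return article[:cut].rstrip()
```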
4.2. Two-Phase Training of MINT

Phase 1 — Fluency Learning (Epochs 1–15).
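A sketch of the Phase 1 schedule described in Appendix A (linear warmup followed by cosine decay over the 15 × 506 Phase 1 steps). The peak learning rate of 3e-4 is an assumed placeholder, and 3,500 warmup steps is one point in the 3,000–4,000 range given there:

```python
import math
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(warmup_steps: int, total_steps: int):
    """Linear warmup to the peak LR, then cosine decay toward zero."""
    def fn(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))
    return fn

model = nn.Linear(8, 8)  # stand-in for the MINT model
# lr=3e-4 is an assumed placeholder; betas and weight decay are from Appendix A.
optimizer = AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.01)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine(3500, 15 * 506))
```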
Phase 2 — Content Fidelity Refinement (Epochs 16–25).
4.3. Loss Function
Label-Smoothed Cross-Entropy.
Coverage Loss.
Attention Entropy Regularization.
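The three terms above might combine as in the following sketch. The label-smoothing value (0.1) and the weights λ_cov = 0.5 and λ_ent = 0.1 are taken from Appendix A; the coverage formulation (a See et al.-style penalty on re-attended source positions) and the sign of the entropy term are assumptions, as the paper's exact equations are not reproduced here. The function name and argument layout are illustrative:

```python
import torch
import torch.nn.functional as F

def mint_loss(logits, targets, cross_attn, pad_id, lam_cov=0.5, lam_ent=0.1):
    """Composite Phase 2 loss: label-smoothed CE + coverage + attention entropy.

    logits:     (batch, tgt_len, vocab)
    targets:    (batch, tgt_len)
    cross_attn: (batch, tgt_len, src_len), rows sum to 1
    """
    ce = F.cross_entropy(
        logits.transpose(1, 2), targets,
        ignore_index=pad_id, label_smoothing=0.1,
    )
    # Coverage penalty: discourage re-attending to already-covered source tokens.
    coverage = torch.cumsum(cross_attn, dim=1) - cross_attn  # attention before step t
    cov_loss = torch.minimum(cross_attn, coverage).sum(dim=-1).mean()
    # Entropy term: penalize diffuse cross-attention distributions.
    # NOTE: the sign (penalize vs. encourage entropy) is an assumption.
    ent = -(cross_attn * (cross_attn + 1e-9).log()).sum(dim=-1).mean()
    return ce + lam_cov * cov_loss + lam_ent * ent
```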
4.4. Identical Regime for Fair Comparison
- Maximum source length: MINT uses up to 768 tokens per source sequence; IndicBART and mT5-small impose a hard 512-token limit due to their pretrained positional encodings. Articles that exceed 512 tokens are truncated for the baselines.
- Input format: IndicBART uses text </s> <2xx> (source) / <2xx> text </s> (target) format with AlbertTokenizer; mT5-small uses a summarize: text prefix with the standard T5 tokenizer; MINT uses <lang_xx> text format with our custom SentencePiece tokenizer. These formatting differences are inherent to each architecture’s design.
| Hyperparameter | Value (all models) |
|---|---|
| Learning rate | (flat, no scheduler) |
| Warmup steps | 0 |
| Optimizer | AdamW |
| AdamW (β₁, β₂) | 0.9, 0.999 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Coverage loss weight (λ_cov) | 0.5 |
| Attention entropy weight (λ_ent) | 0.1 |
| Steps/epoch | 506 |
| Language order | Random shuffle each step |
| Zero-grad position | Before backward pass |
| Max target length | 128 |
| Beam size | 4 |
| Diversity groups | 2 |
| Diversity penalty | 0.5 |
| N-gram blocking | 4 |
| Length penalty | 1.5 |
| Max generation length | 80 |
| Evaluation samples | 200 per language |
| Evaluation every | 2 epochs |
| Epochs | 10 (IndicBART, mT5-small) / 25 (MINT) |
| Max source length | 512 (IndicBART, mT5-small) / 768 (MINT) |
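For the pretrained baselines, the decoding rows of this table map directly onto Hugging Face's diverse beam search arguments; a sketch for mT5-small (hub id google/mt5-small; MINT uses its own decoder with the same settings):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/mt5-small")
mdl = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

inputs = tok("summarize: <article text>", return_tensors="pt",
             truncation=True, max_length=512)
summary_ids = mdl.generate(
    **inputs,
    num_beams=4,               # beam size
    num_beam_groups=2,         # diversity groups
    diversity_penalty=0.5,     # diversity penalty
    no_repeat_ngram_size=4,    # n-gram blocking
    length_penalty=1.5,
    max_length=80,             # max generation length
)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```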
5. Experiments
5.1. Evaluation Metrics
Whitespace-Based ROUGE.
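A minimal illustrative sketch of such a whitespace-based ROUGE-N, computing F1 over clipped n-gram counts; this is not necessarily the paper's exact implementation:

```python
from collections import Counter

def whitespace_rouge_n(reference: str, hypothesis: str, n: int = 1) -> float:
    """ROUGE-N F1 over whitespace-delimited tokens: a script-agnostic
    alternative to rouge_score, whose default tokenizer discards
    non-Latin characters."""
    def ngrams(text: str) -> Counter:
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref, hyp = ngrams(reference), ngrams(hypothesis)
    if not ref or not hyp:
        return 0.0
    overlap = sum((ref & hyp).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```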
Subword ROUGE.
BERTScore.
LaBSE Embedding Similarity.
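A sketch of the LaBSE similarity metric, assuming the sentence-transformers release of the model and cosine similarity over normalized embeddings:

```python
from sentence_transformers import SentenceTransformer, util

labse = SentenceTransformer("sentence-transformers/LaBSE")

def labse_similarity(reference: str, hypothesis: str) -> float:
    """Cosine similarity between LaBSE embeddings of reference and hypothesis."""
    emb = labse.encode([reference, hypothesis], convert_to_tensor=True,
                       normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item()
```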
5.2. Baselines
IndicBART
mT5-small
5.3. Architecture Comparison
| Model | Params | Vocab | Min VRAM | Type | Urdu support |
|---|---|---|---|---|---|
| MINT (ours) | 14.7M | 32k | 15 GB | scratch | ✔ |
| IndicBART | 440M | 64k | 48 GB | pretrained | × |
| mT5-small | 556M | 250k | 48 GB | pretrained | ✔ |
| mBART-50 | 610M | 250k | 48 GB | pretrained | ✔ |
5.4. Main Results
5.5. Semantic Metrics
5.6. Efficiency Analysis
6. Analysis
6.1. Training Dynamics
6.2. Language-wise Performance Analysis
6.3. Qualitative Analysis
6.4. Ablation Study
6.5. The ROUGE Compatibility Problem
For example, when a five-word Indic-script sentence is compared to itself, our implementation finds all 5 whitespace-delimited tokens overlapping perfectly, giving ROUGE-1 = 1.0, while the rouge_score library returns 0.0 for the same pair: its default tokenizer lowercases the text and discards all characters outside [a-z0-9], reducing Indic text to an empty token sequence.
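The discrepancy is easy to reproduce. In this sketch the Hindi sentence is an arbitrary five-word stand-in, and whitespace_rouge_n refers to the Section 5.1 sketch:

```python
from rouge_score import rouge_scorer

text = "यह एक छोटा उदाहरण है"  # a five-word Hindi stand-in sentence
scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)
print(scorer.score(text, text)["rouge1"].fmeasure)  # 0.0: all tokens stripped
print(whitespace_rouge_n(text, text, n=1))          # 1.0: perfect self-overlap
```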
7. Discussion

Does Scale Matter for Indic Summarization?
The Data Imbalance Challenge.
Limitations.
Why Not Adapters or LoRA?
8. Conclusion
Acknowledgments
Appendix A Complete Hyperparameter Tables
| MINT Architecture | |
|---|---|
| Encoder layers | 3 |
| Decoder layers | 4 |
| Model dim d | 256 |
| Attention heads | 8 |
| FFN dim | 1024 |
| Activation | SiLU |
| Dropout | 0.1 |
| DropPath rate | 0.05 |
| Position encoding | RoPE |
| Weight tying | Encoder emb = Decoder emb = LM head |
| Vocabulary | 32,000 (SentencePiece Unigram) |
| Total parameters | 14,767,104 |
| Phase 1 Training (Epochs 1–15) | |
| Initial learning rate | |
| LR scheduler | Linear warmup + cosine decay |
| Warmup steps | 3,000–4,000 |
| Optimizer | AdamW |
| AdamW (β₁, β₂) | 0.9, 0.999 |
| Weight decay | 0.01 |
| Label smoothing | 0.1 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Max source length | 768 tokens |
| Max target length | 128 tokens |
| Embedding freeze | Epoch 1 only |
| Loss | Label-smoothed CE only |
| Phase 2 Training (Epochs 16–25) | |
| Learning rate | (flat) |
| LR scheduler | None |
| Warmup steps | 0 |
| Optimizer | AdamW (moments reset) |
| AdamW (β₁, β₂) | 0.9, 0.999 |
| Weight decay | 0.01 |
| Gradient clipping | 1.0 |
| Batch size | 16 |
| Coverage loss weight (λ_cov) | 0.5 |
| Attention entropy weight (λ_ent) | 0.1 |
| Decoding (all models, identical) | |
| Beam size | 4 |
| Diversity groups | 2 |
| Diversity penalty | 0.5 |
| N-gram blocking | 4 |
| Length penalty | 1.5 |
| Max generation length | 80 |
References
- Dabre, Raj; Kunchukuttan, Anoop; Puduppully, Ratish. IndicBART: A pre-trained model for natural language generation of Indic languages. Findings of the Association for Computational Linguistics: ACL 2022, 2022. [Google Scholar]
- Hasan, Tahmid; Bhattacharjee, Abhik; Islam, Md. Saiful; Mubasshir, Kazi; Li, Yuan-Fang; Kang, Yong-Bin; Rahman, M. Sohel; Shahriyar, Rifat. XL-Sum: Large-scale multilingual abstractive summarization for 44 languages. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021; pp. 4693–4703. [Google Scholar]
- Houlsby, Neil; Giurgiu, Andrei; Jastrzebski, Stanislaw; Morrone, Bruna; de Laroussilhe, Quentin; Gesmundo, Andrea; Attariyan, Mona; Gelly, Sylvain. Parameter-efficient transfer learning for NLP. International Conference on Machine Learning (ICML), 2019. [Google Scholar]
- Hu, Edward J.; Shen, Yelong; Wallis, Phillip; Allen-Zhu, Zeyuan; Li, Yuanzhi; Wang, Shean; Wang, Lu; Chen, Weizhu. LoRA: Low-rank adaptation of large language models. International Conference on Learning Representations (ICLR), 2022. [Google Scholar]
- Huang, Gao; Sun, Yu; Liu, Zhuang; Sedra, Daniel; Weinberger, Kilian. Deep networks with stochastic depth. European Conference on Computer Vision (ECCV), 2016. [Google Scholar]
- Kakwani, Divyanshu; Kunchukuttan, Anoop; Golla, Satish; Gokul, N.C.; Bhattacharyya, Avik; Khapra, Mitesh M.; Kumar, Pratyush. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020; 2020. [Google Scholar]
- Kudo, Taku. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018; Volume 1. [Google Scholar]
- Lewis, Mike; Liu, Yinhan; Goyal, Naman; Ghazvininejad, Marjan; Mohamed, Abdelrahman; Levy, Omer; Stoyanov, Veselin; Zettlemoyer, Luke. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020; pp. 7871–7880. [Google Scholar]
- Li, Xiang Lisa; Liang, Percy. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021; Volume 1. [Google Scholar]
- Lin, Chin-Yew. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, 2004; pp. 74–81. [Google Scholar]
- Liu, Yinhan; Gu, Jiatao; Goyal, Naman; Li, Xian; Edunov, Sergey; Ghazvininejad, Marjan; Lewis, Mike; Zettlemoyer, Luke. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 2020, 8, 726–742. [Google Scholar] [CrossRef]
- Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 2020, 21(140), 1–67. [Google Scholar]
- Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar] [CrossRef]
- Ramesh, Gowtham; Doddapaneni, Sumanth; Bheemaraj, Aravinth; Jobanputra, Mayank; Raghavan, AK; Sharma, Ajitesh; Sahoo, Sujit; Diddee, Harshita; Mahalakshmi, J; Kakwani, Divyanshu; Kumar, Navneet; Pradeep, Aswin; Nagaraj, Srihari; Kumar, T. Deepak; Kunchukuttan, Anoop; Kumar, Pratyush; Khapra, Mitesh. Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association for Computational Linguistics 2021, 10. [Google Scholar] [CrossRef]
- Shazeer, Noam. GLU variants improve transformer. arXiv 2020, arXiv:2002.05202. [Google Scholar] [CrossRef]
- Su, Jianlin; Ahmed, Murtadha; Lu, Yu; Pan, Shengfeng; Bo, Wen; Liu, Yunfeng. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
- Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Łukasz; Polosukhin, Illia. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS); 2017; Volume 30. [Google Scholar]
- Xue, Linting; Constant, Noah; Roberts, Adam; Kale, Mihir; Al-Rfou, Rami; Siddhant, Aditya; Barua, Aditya; Raffel, Colin. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021; pp. 483–498. [Google Scholar]
- Zhang, Tianyi; Kishore, Varsha; Wu, Felix; Weinberger, Kilian Q.; Artzi, Yoav. BERTScore: Evaluating text generation with BERT. International Conference on Learning Representations (ICLR), 2020. [Google Scholar]


Table: XL-Sum dataset statistics per language, before and after footer cleaning (Section 4.1).

| Language | Script | Train (orig.) | Train (clean) | Val | Test |
|---|---|---|---|---|---|
| Hindi (hi) | Devanagari | 70,778 | 70,343 | 8,847 | 8,847 |
| Bengali (bn) | Bengali | 8,102 | 7,864 | 1,012 | 1,012 |
| Marathi (mr) | Devanagari | 10,903 | 10,654 | 1,362 | 1,362 |
| Tamil (ta) | Tamil | 16,222 | 15,584 | 2,027 | 2,027 |
| Telugu (te) | Telugu | 10,421 | 10,037 | 1,302 | 1,302 |
| Punjabi (pa) | Gurmukhi | 8,215 | 8,006 | 1,026 | 1,026 |
| Urdu (ur) | Perso-Arabic | 67,665 | 66,223 | 8,458 | 8,458 |
| Total | – | 192,306 | 188,711 | 24,034 | 24,034 |
Table: Main results — whitespace ROUGE-1 and ROUGE-2 on the test sets (Section 5.4).

| Language | MINT R-1 | MINT R-2 | IndicBART R-1 | IndicBART R-2 | mT5-small R-1 | mT5-small R-2 |
|---|---|---|---|---|---|---|
| Hindi (hi) | 0.1914 | 0.0487 | 0.2956 | 0.0752 | 0.2360 | 0.0667 |
| Bengali (bn) | 0.0744 | 0.0086 | 0.0754 | 0.0070 | 0.1172 | 0.0336 |
| Marathi (mr) | 0.0721 | 0.0149 | 0.1187 | 0.0225 | 0.0945 | 0.0202 |
| Tamil (ta) | 0.0705 | 0.0160 | 0.0843 | 0.0147 | 0.0990 | 0.0281 |
| Telugu (te) | 0.0480 | 0.0054 | 0.0498 | 0.0036 | 0.0691 | 0.0110 |
| Punjabi (pa) | 0.1578 | 0.0261 | 0.2214 | 0.0193 | 0.1908 | 0.0452 |
| Urdu (ur) | 0.2170 | 0.0655 | – | – | 0.2539 | 0.0797 |
| Average (7 lang) | 0.1187 | 0.0265 | – | – | 0.1515 | 0.0407 |
| Average (6 lang) | 0.1024 | 0.0199 | 0.1409 | 0.0237 | 0.1344 | 0.0341 |
Table: Semantic evaluation metrics by language (Section 5.5).

| Language | BERTScore-F1 | Subword R-1 | Subword R-2 | LaBSE Sim. |
|---|---|---|---|---|
| Hindi (hi) | 0.8593 | 0.2017 | 0.0605 | 0.4866 |
| Bengali (bn) | 0.8435 | 0.1890 | 0.0324 | 0.4001 |
| Marathi (mr) | 0.8488 | 0.1554 | 0.0382 | 0.4207 |
| Tamil (ta) | 0.8498 | 0.1526 | 0.0375 | 0.4262 |
| Telugu (te) | 0.8459 | 0.1608 | 0.0292 | 0.3733 |
| Punjabi (pa) | 0.8444 | 0.2030 | 0.0444 | 0.3674 |
| Urdu (ur) | 0.8559 | 0.2174 | 0.0663 | 0.5400 |
| Average | 0.8497 | 0.1828 | 0.0441 | 0.4306 |
Table: Efficiency comparison (Section 5.6). Avg R-1 (6L) is the macro-average whitespace ROUGE-1 over the six languages supported by all three models.

| Model | Params | VRAM | Avg R-1 (6L) | R-1/M params | Train time/epoch |
|---|---|---|---|---|---|
| MINT (ours) | 14.7M | 15 GB | 0.1024 | 0.00697 | ∼21 min |
| IndicBART | 440M | 48 GB | 0.1409 | 0.00032 | ∼27 min |
| mT5-small | 556M | 48 GB | 0.1344 | 0.00024 | ∼22 min |
Table: Ablation study, whitespace ROUGE-1/ROUGE-2 (Section 6.4).

| Model Variant | ROUGE-1 | ROUGE-2 |
|---|---|---|
| MINT full (Phase 2, epoch 20) | 0.2104 | 0.0595 |
| w/o coverage loss (λ_cov = 0) | 0.1891 | 0.0523 |
| w/o attention entropy (λ_ent = 0) | 0.2049 | 0.0571 |
| w/o diverse beam search (standard beam-4) | 0.1943 | 0.0538 |
| w/o DropPath (drop rate = 0) | 0.2018 | 0.0572 |
| w/o Phase 2 refinement (Phase 1 best) | 0.1914 | 0.0487 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

