Submitted: 06 March 2026
Posted: 06 March 2026
Abstract
Keywords:
1. Introduction
2. NLP Foundations Inside LLM Architectures
2.1. Tokenization and Subword Modeling
2.2. Syntactic and Semantic Representations
2.3. Discourse and Pragmatics
3. The Low-Resource Language Challenge
3.1. The Data Imbalance Problem
3.2. Cross-Lingual Transfer and Multilingual Pre-Training
3.3. Data Augmentation and Synthetic Data Strategies
4. State-of-the-Art Approaches and Benchmarks
4.1. Instruction Tuning and RLHF for Multilingual Settings
4.2. Benchmark Landscape
5. Case Studies in Low-Resource NLP Toolkits
5.1. CalamanCy: A Tagalog NLP Toolkit
5.2. AfroNLP and African Language Models
5.3. Lessons Across Case Studies
6. Open Problems and Future Directions
7. Conclusion
References
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).