Submitted: 01 October 2024
Posted: 03 October 2024
Abstract
Keywords:
1. Introduction
1.1. Background and Motivation
2. Literature Review
2.1. High-Resource Languages
2.2. Low-Resource Languages
2.3. Cross-Lingual Transfer Learning
2.4. Tokenization Challenges
3. Methodology
3.1. Language Selection
- High-resource languages: English, Mandarin
- Medium-resource languages: Hindi, Swahili
- Low-resource languages: Nepali, Malagasy
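As an illustration only, the three tiers above could be recorded as a small lookup table so that later evaluation code can group scores by resource level. The names `LANGUAGE_TIERS` and `tier_of` are hypothetical and do not come from the paper; the two-letter keys are ISO 639-1 codes.

```python
# Hypothetical lookup table for the language tiers listed above.
# Keys are ISO 639-1 codes; "zh" is the code for Chinese, used here for Mandarin.
LANGUAGE_TIERS = {
    "high": {"en": "English", "zh": "Mandarin"},
    "medium": {"hi": "Hindi", "sw": "Swahili"},
    "low": {"ne": "Nepali", "mg": "Malagasy"},
}


def tier_of(code: str) -> str:
    """Return the resource tier ("high", "medium", or "low") for a language code."""
    for tier, languages in LANGUAGE_TIERS.items():
        if code in languages:
            return tier
    raise KeyError(f"language code not in the study: {code!r}")


if __name__ == "__main__":
    for code in ("en", "sw", "mg"):
        print(code, tier_of(code))
```

Keeping the tiers in one place means a per-tier average can be computed without repeating the language lists in each script.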
3.2. Datasets and Benchmarks
- XGLUE [6]: A cross-lingual benchmark for language understanding and generation that includes tasks such as text classification, question answering, and text generation. Each task is scored with its own metric (for example accuracy, F1, or BLEU), which allows model performance to be compared across languages.
- TyDi QA [7]: A multilingual question-answering dataset that covers typologically diverse languages, including lower-resourced languages such as Swahili. TyDi QA focuses on information-seeking questions and requires models to find accurate, contextually relevant answers in each language.
3.3. Model Evaluation
- Accuracy: Measures the percentage of correct predictions made by the model across various tasks.
- F1-score: The harmonic mean of precision and recall, used to evaluate model performance on tasks with imbalanced classes, where accuracy alone can be misleading.
- BLEU score: A metric that evaluates the quality of machine translation outputs by measuring their n-gram overlap with reference translations.
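The three metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, F1 is computed for a single positive class, and BLEU is the unsmoothed sentence-level variant with one reference and uniform weights over 1- to 4-grams.

```python
import math
from collections import Counter


def accuracy(gold, pred):
    """Fraction of predictions that exactly match the gold labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_score(gold, pred, positive=1):
    """Harmonic mean of precision and recall for one positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed sentence-level BLEU against a single tokenized reference."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = ngram_counts(hypothesis, n)
        ref = ngram_counts(reference, n)
        # Clipped overlap: each hypothesis n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(hypothesis))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    gold, pred = [1, 0, 1, 1], [1, 0, 0, 1]
    print(accuracy(gold, pred), f1_score(gold, pred))
    ref = "the cat sat on the mat".split()
    print(sentence_bleu(ref, "the cat sat on".split()))
```

Reported BLEU scores are normally computed at corpus level with a standard tokenizer, so values from a sentence-level sketch like this one are not directly comparable with published numbers.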
4. Results
4.1. Performance Analysis by Task
4.2. Tokenization Challenges in Low-Resource Languages
5. Conclusions
References
- OpenAI. GPT-4 technical report. 2023.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
- Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, 2017.
- Pratik Joshi, Sebastin Santy, Akshay Budhiraja, Kalika Bali, and Monojit Choudhury. The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095, 2020.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), pages 1715–1725, 2016.
- Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, et al. XGLUE: A new benchmark dataset for cross-lingual pre-training, understanding and generation. arXiv preprint arXiv:2004.01401, 2020.
- Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, et al. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 2020.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).