Submitted: 15 October 2024
Posted: 17 October 2024
Abstract
Keywords:
1. Introduction
- Develop 108 Arabic word embedding models using advanced techniques like Word2Vec and FastText. By experimenting with a wide range of hyperparameters, we can explore how different configurations affect the quality of the embeddings, ensuring a deep dive into what works best for the Arabic language.
- Capture the richness of Arabic language variations by training the models on different types of Arabic text. We’ll use three main corpora:
  - A news article corpus that, while small, offers valuable up-to-date content.
  - A books corpus containing 32 Arabic books, which offers a diverse set of writing styles and linguistic structures, helping us grasp the complexity of Arabic morphology.
  - A Wikipedia corpus, which is larger and provides broad coverage of various topics, ensuring the embeddings can handle a wide range of language uses.
- Incorporate an analogy test adapted from Google’s famous Analogy Test dataset, but modified specifically for Arabic. This version will be based on a recently developed Arabic benchmark, ensuring we can properly evaluate how well the embeddings understand relationships between words in Arabic.
- Dive deep into analyzing each model’s performance by testing different combinations of hyperparameters and corpora. We’ll evaluate each model with a variety of metrics, paying close attention to how the size of the corpora and the tuning of hyperparameters impact the overall effectiveness of the embeddings.
- Share the Arabic word embedding models and datasets with the wider NLP community. By making these resources available, we hope to support further research and development in Arabic natural language processing.
2. Research Background
3. Related Work
4. Building Arabic Word Embeddings
4.1. Corpus Collection
- Book corpus: Classical Arabic is the variety of Arabic commonly used in religious contexts; its formal structure distinguishes it from modern and dialectal Arabic. Classical Arabic, and ancient Arabic in particular, has received little attention in word embedding research. To address this gap, we constructed our first corpus from 32 well-known Arabic books, whose digital versions were obtained from the Almaktabah Alshamelah database². This database offers a wealth of Arabic words along with detailed descriptions. Notably, one of these books, Lesan Al-arab, is considered to be the comprehensive dictionary of all Arabic dictionaries [17]. Because these books provide various forms and descriptions for numerous Arabic words, they contribute to the morphological richness of the embedding model. With this corpus, we aim to analyze how effective Arabic books are for training word embedding models and to assess how well the resulting models capture rare and ancient Arabic words. This corpus contains approximately 11 million words.
- Watan corpus: To build a topically rich, medium-sized corpus, we used the Arabic corpus Watan-2004 [18]. This corpus contains about 20,000 Arabic news articles covering a wide range of topics such as culture, religion, economy, local news, international news, and sports, all provided in HTML format; additional information about the corpus can be found in [18]. Table 2 shows the topic distribution in the Watan corpus. This corpus contains approximately 6 million words.
- Wiki corpus: Wikipedia³ provides articles on a wide range of topics in many languages, so we selected it as the source of our large corpus. We collected recent articles from 2023, using the WikiExtractor tool [19] to gather articles from Wikimedia, a website that archives articles in different languages⁴. The collected files comprise around 2 million recent Arabic articles across different topics. This corpus contains approximately 111 million words.
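The word counts quoted above can be reproduced in spirit with a few lines of Python. The sketch below is illustrative only: it assumes plain whitespace tokenization and counts every distinct token as a vocabulary entry, whereas the exact counting rules used for the three corpora are not stated here. The function name `corpus_stats` and the sample sentences are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, Counter]:
    """Return (number of tokens, number of distinct tokens, frequency table)."""
    counts: Counter = Counter()
    for line in lines:
        # Whitespace splitting is an assumption made for this sketch.
        counts.update(line.split())
    return sum(counts.values()), len(counts), counts


# Tiny hypothetical sample standing in for a real corpus file.
sample = ["كتاب جديد", "كتاب قديم"]
n_words, n_vocab, freq = corpus_stats(sample)
print(n_words, n_vocab)
```

For a real corpus, the same function can be fed an open file object so that the text is streamed line by line instead of being loaded into memory at once.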
4.2. Corpus Preprocessing
- Tokenization: Because the preprocessing program receives the text as a single input block, tokenization is the first step. It breaks the text into individual tokens (words), which makes the later steps easier to apply. For this, we use the tokenize function from the pyarabic⁵ library.
- Cleaning: Next, we remove unwanted data from the text: digits, non-Arabic text, punctuation marks, emojis, and symbols.
- Normalization: Arabic is a rich language in which the same word can be written in several forms. To make our models more robust, we apply several normalization steps that prevent redundant word forms.
  - Tashkeel Removal: Arabic uses diacritical marks (tashkeel) on letters to indicate how each letter is pronounced. Because a model treats every distinct string as a separate input, changing a single mark makes the model perceive the word as a new one. To prevent the model from learning a different vector for each marked version of a word, we removed all tashkeel marks using the strip_tashkeel function from the pyarabic library.
  - Letters Normalization: We standardized the written form of certain Arabic letters that can appear in more than one shape. This step ensures that the model treats each letter consistently, which prevents redundancy and improves overall performance.
  - Tatweel Removal: Some Arabic writers use tatweel, an elongation character that stretches a word with extra dash-like strokes, to add a touch of elegance to the text. We removed it using the strip_tatweel function from the pyarabic library.
- Repeated Letters Removal: Unintentionally repeated letters within a word create spurious word forms that affect training. To address this issue, we added a step that removes repeated letters within the same word.
- Stop Words Removal: Stop words are words that are used frequently but do not carry significant meaning. Removing them reduces the size of the corpus and lets the model focus on more meaningful words. We started from the Arabic stop words list provided by NLTK⁶, which contains 754 words. However, we noticed that stop words in Arabic can appear in various forms, such as different tenses, different written formats, or with the conjunction letter attached directly to them. To address this, we expanded the NLTK list by adding the conjunction letter to each word and by applying normalization to words that can be written in different formats. Additionally, we added present and future verb forms to the list. The resulting stop words list contains around 936 words.
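The preprocessing steps listed above can be sketched in a few lines of standard-library Python. This is not the authors' implementation: regular expressions stand in for the pyarabic functions so that the example runs without extra dependencies, whitespace splitting stands in for the pyarabic tokenizer, the letter mapping (alef variants, alef maqsura, ta marbuta) is an assumption because the text does not list which letters are standardized, and collapsing only runs of three or more identical letters is likewise an assumed rule. All function and variable names are hypothetical.

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")     # fathatan .. sukun
TATWEEL = "\u0640"                            # elongation character
NON_ARABIC = re.compile(r"[^\u0621-\u0652]")  # anything outside Arabic letters/marks
REPEATS = re.compile(r"(.)\1{2,}")            # same character 3+ times in a row

# Assumed letter mapping; the source does not specify the letters involved.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0629": "\u0647",  # ta marbuta -> heh
})


def normalize_token(token: str) -> str:
    """Clean and normalize a single token."""
    token = NON_ARABIC.sub("", token)   # cleaning: digits, Latin, punctuation, emojis
    token = TASHKEEL.sub("", token)     # tashkeel removal
    token = token.replace(TATWEEL, "")  # tatweel removal
    token = token.translate(LETTER_MAP) # letters normalization
    return REPEATS.sub(r"\1", token)    # repeated letters removal (assumed rule)


def expand_stop_words(base):
    """Add normalized forms and forms prefixed with the conjunction letter waw."""
    expanded = set()
    for word in base:
        norm = normalize_token(word)
        expanded.update({norm, "\u0648" + norm})
    return expanded


def preprocess(text: str, stop_words) -> list:
    """Tokenize on whitespace, normalize each token, drop empties and stop words."""
    tokens = (normalize_token(t) for t in text.split())
    return [t for t in tokens if t and t not in stop_words]


stops = expand_stop_words({"في", "إلى"})  # tiny stand-in for the expanded NLTK list
result = preprocess("ذَهَبَ الولد إلى المدرســـة في 2024!", stops)
print(result)
```

Collapsing only runs of three or more letters is a deliberate choice in this sketch: legitimate words such as الله contain a doubled letter, so a rule that collapsed every pair would corrupt them.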
4.3. Model Training
5. Model Evaluation
5.1. Analogy Test
5.2. Sentiment Analysis
5.3. Similarity Score
6. Results and Discussion
6.1. Analogy Test
6.2. Sentiment Analysis
| Model | Best score (%) |
|---|---|
| Word2vec-Watan Corpus | 91 |
| Word2vec-Book Corpus | 79 |
| Word2vec-Wiki Corpus | 85 |
| Fasttext-Watan Corpus | 94 |
| Fasttext-Book Corpus | 96 |
| Fasttext-Wiki Corpus | 99 |



6.3. Similarity Score


7. Conclusion
- In the analogy test, a larger vocabulary had a positive influence on the results. Additionally, Fasttext models trained with the skip-gram architecture proved effective at solving analogy test questions.
- For sentiment analysis, vocabulary size played a crucial role. Furthermore, the CBOW architecture improved accuracy for Fasttext models, while the skip-gram architecture improved accuracy for Word2Vec models. A smaller window size also had a positive effect on accuracy.
- Finally, in the similarity score test, Fasttext models delivered the highest scores. A smaller window size and a smaller vector size positively impacted the results.
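For readers unfamiliar with how analogy and similarity tests are typically scored, the following is a minimal sketch of the common vector-offset approach: the answer to "a is to b as c is to ?" is taken to be the word whose vector is closest, by cosine similarity, to b - a + c. It is not the evaluation code used in this work. The two-dimensional toy vectors are invented for illustration, questions containing out-of-vocabulary words are skipped (an assumption), and libraries usually normalize vectors to unit length before the arithmetic, which this sketch does not do.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word closest to b - a + c, excluding the three query words."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(vectors, questions):
    """Percentage of in-vocabulary questions (a, b, c, d) answered with d."""
    usable = [q for q in questions if all(w in vectors for w in q)]
    if not usable:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in usable)
    return 100.0 * correct / len(usable)


# Invented toy embeddings: man, woman, king, queen, apple.
toy = {
    "رجل": [1.0, 0.0],
    "امرأة": [1.0, 1.0],
    "ملك": [3.0, 0.0],
    "ملكة": [3.0, 1.0],
    "تفاحة": [-1.0, 2.0],
}
questions = [("رجل", "امرأة", "ملك", "ملكة")]
print(solve_analogy(toy, "رجل", "امرأة", "ملك"))
print(analogy_accuracy(toy, questions))
```

The same `cosine` function is the usual basis for a similarity score between two words, so one helper serves both kinds of test in this sketch.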
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Lai, S.; Liu, K.; He, S.; Zhao, J. How to Generate a Good Word Embedding? IEEE Intelligent Systems 2017, PP, 1–1.
2. Maas, A.L.; Daly, R.E.; Pham, P.T.; Huang, D.; Ng, A.Y.; Potts, C. Learning Word Vectors for Sentiment Analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Portland, Oregon, USA, 2011; pp. 142–150.
3. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. Proceedings of Workshop at ICLR 2013, 2013.
4. Salama, R.A.; Youssef, A.; Fahmy, A. Morphological Word Embedding for Arabic. Procedia Computer Science 2018, 142, 83–93, Arabic Computational Linguistics.
5. Joulin, A.; Grave, E.; Bojanowski, P.; Douze, M.; Jégou, H.; Mikolov, T. FastText.zip: Compressing text classification models. arXiv preprint, 2016; arXiv:1612.03651.
6. Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 2017, 5, 135–146.
7. Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning Word Vectors for 157 Languages. 2018; arXiv:cs.CL/1802.06893.
8. Azroumahli, C.; El Younoussi, Y.; Achbal, F. An Overview of a Distributional Word Representation for an Arabic Named Entity Recognition System; 2018; pp. 130–140.
9. Azroumahli, C.; Rybinski, M.; Younoussi, Y.; Montes, J. Comparative study of Arabic Word Embeddings: Evaluation and Application. 2020, 12, 349–362.
10. Yagi, S.; Elnagar, A.; Fareh, S. A benchmark for evaluating Arabic word embedding models. Natural Language Engineering 2022, 1–26.
11. Soliman, A.B.; Eissa, K.; El-Beltagy, S.R. AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP. Procedia Computer Science 2017, 117, 256–265, Arabic Computational Linguistics.
12. Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; Joulin, A. Advances in Pre-Training Distributed Word Representations. 2017; arXiv:cs.CL/1712.09405.
13. Naili, M.; Chaibi, A.H.; Ben Ghezala, H.H. Comparative study of word embedding methods in topic segmentation. Procedia Computer Science 2017, 112, 340–349, Knowledge-Based and Intelligent Information Engineering Systems: Proceedings of the 21st International Conference, KES-2017, 6-8 September 2017, Marseille, France.
14. Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Doha, Qatar, 2014; pp. 1532–1543.
15. Athiwaratkun, B.; Wilson, A.G.; Anandkumar, A. Probabilistic FastText for Multi-Sense Word Embeddings. 2018; arXiv:cs.CL/1806.02901.
16. Oueslati, O.; Cambria, E.; HajHmida, M.B.; Ounelli, H. A review of sentiment analysis research in Arabic language. Future Generation Computer Systems 2020, 112, 408–430.
17. Kayraldeen, A. Ketab Alaalam; 1926.
18. Abbas, M. Watan 2004 Corpus, 2004.
19. Attardi, G. WikiExtractor. 2015. Available online: https://github.com/attardi/wikiextractor.
20. Chen, Z.; He, Z.; Liu, X.; Bian, J. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Medical Informatics and Decision Making 2018, 18, 65.
21. Zahran, M.A.; Magooda, A.; Mahgoub, A.Y.; Raafat, H.; Rashwan, M.; Atyia, A. Word Representations in Vector Space and their Applications for Arabic. Computational Linguistics and Intelligent Text Processing; Gelbukh, A., Ed.; Springer International Publishing: Cham, 2015; pp. 430–443.
22. Dahou, A.; Xiong, S.; Zhou, J.; Haddoud, M.H.; Duan, P. Word Embeddings and Convolutional Neural Network for Arabic Sentiment Classification. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers; The COLING 2016 Organizing Committee: Osaka, Japan, 2016; pp. 2418–2427.
23. Gladkova, A.; Drozd, A.; Matsuoka, S. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn’t. Proceedings of the NAACL Student Research Workshop; Association for Computational Linguistics: San Diego, California, 2016; pp. 8–15.
24. Khusainova, A.; Khan, A.; Rivera, A.R. SART - Similarity, Analogies, and Relatedness for Tatar Language: New Benchmark Datasets for Word Embeddings Evaluation. Computational Linguistics and Intelligent Text Processing; Gelbukh, A., Ed.; Springer Nature Switzerland: Cham, 2023; pp. 380–390.
25. Elrazzaz, M.; Elbassuoni, S.; Shaban, K.; Helwe, C. Methodical Evaluation of Arabic Word Embeddings. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers); Association for Computational Linguistics: Vancouver, Canada, 2017; pp. 454–458.
26. Abdul-Mageed, M.; Elbassuoni, S.; Doughman, J.; Elmadany, A.; Nagoudi, E.M.B.; Zoughby, Y.; Shaher, A.; Gaba, I.; Helal, A.; El-Razzaz, M. DiaLex: A Benchmark for Evaluating Multidialectal Arabic Word Embeddings. Proceedings of the Sixth Arabic Natural Language Processing Workshop; Association for Computational Linguistics: Kyiv, Ukraine (Virtual), 2021; pp. 11–20.
27. Brahimi, B.; Touahria, M.; Tari, A. Improving sentiment analysis in Arabic: A combined approach. Journal of King Saud University - Computer and Information Sciences, 2019.
28. Ibrahim, H.S.; Abdou, S.M.; Gheith, M. Sentiment Analysis For Modern Standard Arabic And Colloquial. CoRR, 2015.
29. Iqbal, F.; Hashmi, J.M.; Fung, B.C.M.; Batool, R.; Khattak, A.M.; Aleem, S.; Hung, P.C.K. A Hybrid Framework for Sentiment Analysis Using Genetic Algorithm Based Feature Reduction. IEEE Access 2019, 7, 14637–14652.
30. Nassif, A.B.; Elnagar, A.; Shahin, I.; Henno, S. Deep learning for Arabic subjective sentiment analysis: Challenges and research opportunities. Applied Soft Computing 2021, 98, 106836.
31. Terad, M. المعجم المفصل في المترادفات في اللغة العربية [The Detailed Dictionary of Synonyms in the Arabic Language]; Dar Alkotob Alarabiah, 2011.

| 1 | Our work is available at: https://github.com/AzzahAllahim/ArabicWordEmbedding |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
| 6 | |
| 7 | Their analogy dataset is publicly available at: https://github.com/UBC-NLP/dialex |
| 8 | |
| 9 | Both lexicons are available in our repository: https://github.com/AzzahAllahim/ArabicWordEmbedding |
| 10 | |
| 11 | The test is available in our repository: https://github.com/AzzahAllahim/ArabicWordEmbedding |







| Ref | Multiple vector size | Multiple window size | Arabic Analogy | Models Availability | Corpus Availability | Books Corpus |
|---|---|---|---|---|---|---|
| [11] | √ | | | | | |
| [9] | √ | √ | | | | |
| Our paper | √ | √ | √ | √ | √ |

| Topic | Number of articles |
|---|---|
| Culture | 2782 |
| Religion | 3860 |
| Economy | 3468 |
| Local news | 3596 |
| International news | 2035 |
| Sports | 4550 |

| Corpus | Source | Number of words | Vocabulary size |
|---|---|---|---|
| Watan | Watan newspaper | 6M | 35211 |
| Book | 32 Arabic books | 11M | 181432 |
| Wiki | Wikipedia | 111M | 445977 |

| Architecture | Model | Best score (%) |
|---|---|---|
| Word2vec | Watan Corpus - CBOW | 40 |
| Word2vec | Watan Corpus - Skip-gram | 39 |
| Word2vec | Books Corpus - CBOW | 45 |
| Word2vec | Books Corpus - Skip-gram | 51 |
| Word2vec | Wiki Corpus - CBOW | 84 |
| Word2vec | Wiki Corpus - Skip-gram | 86 |
| Fasttext | Watan Corpus - CBOW | 34 |
| Fasttext | Watan Corpus - Skip-gram | 40 |
| Fasttext | Books Corpus - CBOW | 45 |
| Fasttext | Books Corpus - Skip-gram | 49 |
| Fasttext | Wiki Corpus - CBOW | 80 |
| Fasttext | Wiki Corpus - Skip-gram | 90 |

| Model | Score |
|---|---|
| Azroumahli et al. [9] | 60% |
| AraVec [11] | 79% |
| Our model-Fasttext approach | 90% |
| Our model-Word2Vec approach | 86% |

| Model | Score |
|---|---|
| AraVec [11] | 90% |
| Our model-Fasttext approach | 99% |
| Our model-Word2Vec approach | 91% |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).