Towards Bengali Word Embedding: Corpus Creation, Intrinsic and Extrinsic Evaluations

Distributional word vector representation, or word embedding, has become an essential ingredient in many natural language processing (NLP) tasks such as machine translation, document classification, information retrieval and question answering. Investigation of embedding models helps to reduce the feature space and improves the capture of semantic as well as syntactic relations. This paper presents three embedding techniques (Word2Vec, GloVe, and FastText) with different hyperparameters, implemented on a Bengali corpus consisting of 180 million words. The performance of the embedding techniques is evaluated in extrinsic and intrinsic ways. Extrinsic performance is evaluated by text classification, which achieved a maximum accuracy of 96.48%. Intrinsic performance is evaluated by word similarity (semantic, syntactic and relatedness) and analogy tasks. The maximum Pearson correlation (r) accuracy is 60.66% for semantic similarity (S_s) and 71.64% for syntactic similarity (S_y), whereas relatedness (R_s) obtained 79.80%. The semantic word analogy tasks achieved 44.00% accuracy, while the syntactic word analogy tasks obtained 36.00%.


Introduction
Word embedding is a distributional vector representation of words in which syntactic and semantic interpretations are derived from enormous amounts of unlabeled text. Word embedding has recently come to be considered a powerful tool due to its many applications in NLP, and has thus gained much attention from NLP researchers. It is a growing research area for well-resourced languages like English, where an embedding algorithm can readily generate a model (Devlin et al., 2019). However, it is very complicated to adopt an embedding algorithm directly for resource-constrained languages such as Bengali due to the scarcity of resources. As a result, low-resource languages trail behind in NLP tool development. Bengali is the most widely spoken language of Bangladesh and the second most spoken of the 22 official languages of India, which makes Bengali the 7th most spoken language in the world (Hossain and Hoque, 2019). However, due to the shortage of resources, the development of Bengali NLP tools is struggling. Bengali-speaking people lack access to modern NLP tools, which may affect the sustainable use of language technologies for their language. Therefore, Bengali word embedding is an essential prerequisite for developing any Bengali NLP tool.
Two well-known evaluation methods are used extensively to evaluate embedding techniques: extrinsic and intrinsic (Zhelezniak et al., 2019a). Extrinsic evaluation refers to downstream tasks such as machine translation (MT) (Banik et al., 2020) and part-of-speech (POS) tagging (Priyadarshi and Saha, 2020). Intrinsic evaluation aims to assess the quality of language processing tasks such as semantic and syntactic word similarity (Pawar and Mago, 2018), word relatedness (Gladkova and Drozd, 2016), and word analogy (Schluter, 2018). The unavailability of a standard Bengali embedding corpus and the inadequacy of resources make such model generation and evaluation very challenging. Moreover, no generalized embedding model is available to date for Bengali downstream tasks. Thus, the proposed work introduces Bengali embedding model generation and evaluation techniques with different hyperparameter settings. Specifically, the key contributions of this work are:
• Acquire a raw monolingual Bengali corpus of 180 million words, of which 13 million are unique.
• Construct and annotate the intrinsic and extrinsic evaluation datasets, and evaluate the annotation purity.
• Generate ninety embedding models from combinations of three algorithms (Word2Vec, GloVe, and FastText) and variations of model parameters.
• Examine the influence of hyperparameters on the embedding models' performance.
As far as we are aware, the proposed work is the first attempt to generate large-scale Bengali embedding models evaluated with both intrinsic and extrinsic evaluators.

Related Work
Distributional word vector representation, or embedding model generation, is a well-established research agenda in the NLP domain. Plenty of research has been carried out on word embedding for high-resource languages, but it remains a barrier for low-resource languages. The first intrinsic evaluation dataset, RG-65 (Rubenstein and Goodenough, 1965), contains 65 contextual synonymy pairs. WordSimilarity-353, introduced by Finkelstein et al. (2001), contains 353 word pairs rated by 13 subjects. More recently, three embedding model evaluation datasets have been introduced: SimLex-999 (Hill et al., 2015), SemEval 2017 (Camacho-Collados et al., 2017), and MEN (Bruni et al., 2014). An Italian embedding model was developed by Di Gennaro et al. (2020), which achieved 53.74% overall analogy accuracy on 19,791 texts, with a semantic analogy accuracy of 59.20% on 8,915 texts and a syntactic accuracy of 48.80% on 10,876 texts. However, that work considered only the 3COSADD similarity score and did not consider word relatedness or extrinsic evaluation. Ercan and Yıldız (2018) devised a Turkish word similarity and relatedness system that produced a Turkish embedding dataset derived from English word similarity and relatedness datasets (e.g., WordSimilarity-353, MEN, and SimLex-999). This work achieved Spearman scores (ρ) of 0.667 for WordSimilarity-353, 0.68 for MEN and 0.67 for SimLex-999. The developed embedding model was not evaluated with extrinsic evaluation (Chiu et al., 2016).
Although most current work on embedding models, resource creation and evaluation has been conducted for high-resource languages (e.g., English, German, and French), some comprehensive research has been conducted on low-resource languages such as Assamese, Gujarati, Hindi, Kannada, and Turkish (Ercan and Yıldız, 2018). Pre-trained word embedding models have also been introduced for Indian languages: one such system generates embeddings for 14 Indian languages with 8 different approaches, comprising 436 models. These models were evaluated by extrinsic evaluators and achieved more than 90.00% accuracy for UPOS and XPOS tagging using the universal dependency treebank datasets (Nivre et al., 2016), with NER tagging accuracy of about 95.00% using FastText embedding. That work aimed to solve NLP problems for 14 Indian languages, but the generated models were not subjected to intrinsic evaluation. To the best of our knowledge, only a single study has been conducted concerning Bengali word embedding, using Word2Vec (Sadman et al., 2019); however, it considered only intrinsic evaluation with a self-built dataset. Our approach considers ninety embedding models based on GloVe, FastText and Word2Vec and measures their performance using both intrinsic and extrinsic evaluators.

Methodology
The principal aim of our research is to investigate the effect of intrinsic and extrinsic evaluation on Bengali word embedding models. Thus, the proposed scheme comprises three main parts: corpus creation, word embedding model development, and evaluation. Figure 1 illustrates the abstract view of our work.

Corpus Creation
We collected Bengali texts from various online sources and distributed them into two sets: a word embedding corpus (E) and an embedding model evaluation corpus (E_v). We used a Python crawler to collect 910,720 Bengali text files over a twenty-four-month period (September 10, 2018 to September 11, 2020), which were forwarded to the data preprocessing step. Initially, non-Bengali alphabets and digits were removed from the text files. Next, the preprocessing step removed HTML tags, hashtags, URLs, punctuation and redundant white space. Finally, duplicate texts were deleted from the archive. Preprocessing produced 882,352 usable texts and removed 28,368 documents left empty by the various preprocessing operations. These usable preprocessed data were randomly distributed into two sets: one for embedding model evaluation (100,000 texts) and another for the word embedding corpus (782,352 texts). The embedding corpus (W_e) (180,081,093 words in total) is fed to the embedding techniques.
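The cleaning steps above can be sketched as a small regex pipeline. This is a minimal illustration, not the authors' actual code; the exact patterns are assumptions, and the sketch uses the Bengali Unicode block (U+0980 to U+09FF) to keep Bengali characters while dropping everything else.

```python
import re

def preprocess(text):
    """Minimal sketch of the described cleaning steps (assumed regexes)."""
    text = re.sub(r"<[^>]+>", " ", text)                # strip HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"#\S+", " ", text)                   # strip hashtags
    text = re.sub(r"[^\u0980-\u09FF\s]", " ", text)     # keep only the Bengali block
    text = re.sub(r"[\u09E6-\u09EF]", " ", text)        # drop Bengali digits as well
    return re.sub(r"\s+", " ", text).strip()            # collapse white space
```

Deduplication and the empty-document filter would then operate on the cleaned strings, e.g. by discarding texts whose cleaned form is empty or already seen.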
The embedding model evaluation corpus (E_v) is a combination of intrinsic (I_d) and extrinsic (E_d) datasets. To perform extrinsic evaluation, 60,000 of the 100,000 text documents were randomly selected. This dataset was labelled manually, followed by majority voting to assign the most suitable label. Two linguistic experts annotated each document into one of six pre-defined categories: Accident (A_t), Crime (C_e), Entertainment (E_t), Health (H_h), Politics (P_s) and Sports (S_p). Of the 60,000 text documents, both experts agreed on 54,858 labels. The developed corpus (E_d) achieved a Kappa score (K) of 78.53%, which indicates reasonable agreement between annotators for the downstream task. To perform intrinsic evaluation of the embedding models, four sub-datasets are used for four measures: semantic word similarity (S_s), syntactic word similarity (S_y), relatedness (R_r), and analogy tasks (A_t). The intrinsic datasets and their corresponding Kappa scores are shown in Table 1; the similarity and relatedness datasets reach substantial agreement, whereas A_t reaches moderate agreement.
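The Kappa score used above is Cohen's kappa, which corrects raw two-annotator agreement for agreement expected by chance. A minimal stdlib-only sketch (not the authors' implementation):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement from each annotator's label distribution
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)
```

With the reported K = 78.53%, the pair of annotations falls in the "substantial agreement" band of the usual Landis and Koch interpretation.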

Word Embedding Model Development
We consider three well-known embedding techniques for the Bengali corpus: Word2Vec, GloVe, and FastText. To examine the effect of hyperparameters, we varied the embedding dimension (size), minimum word frequency count (min count), contextual window size (window) and number of iterations (epoch) for each embedding technique.
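The ninety models arise from crossing algorithms with hyperparameter settings. The paper does not list the exact grid, so the values below are purely illustrative; they are chosen only so that 18 settings per architecture reproduce the reported counts (Word2Vec 2 x 18 = 36, GloVe 18, FastText 2 x 18 = 36):

```python
from itertools import product

# Hypothetical grid: 6 sizes x 3 windows = 18 settings per architecture.
# Word2Vec and FastText each train a CBOW and a skip-gram (SG) variant.
sizes = [50, 100, 150, 200, 250, 300]
windows = [5, 10, 15]
architectures = [
    ("Word2Vec", "CBOW"), ("Word2Vec", "SG"),
    ("GloVe", None),
    ("FastText", "CBOW"), ("FastText", "SG"),
]

configs = [
    {"algo": algo, "variant": variant, "size": size, "window": window}
    for (algo, variant), size, window in product(architectures, sizes, windows)
]
```

Each configuration in `configs` would then be trained on the embedding corpus W_e; min count and epoch variations (also explored in the paper) would extend the grid in the same way.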

Results and Discussion
Intrinsic evaluations are performed for a total of ninety embedding models (Word2Vec = 36, GloVe = 18 and FastText = 36). Among these, the results of the best four embedding models are presented for the extrinsic and intrinsic evaluators.

Intrinsic evaluation results
The word similarity (semantic and syntactic) score is calculated by cosine similarity (C). Model performance is then measured by the Spearman (ρ) and Pearson (r) correlations between model scores and human ratings (Zhelezniak et al., 2019b). The well-known word analogy solver 3COSADD (Mikolov et al., 2013b) is used to solve the analogy tasks. These measures are used to evaluate the word similarity and analogy tasks. In order to maintain consistency, all models were trained on our developed corpus.
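The two core operations, cosine similarity and the 3COSADD analogy solver, can be sketched as follows. This is an illustrative stdlib-only implementation, assuming the common additive form cos(d, b - a + c); with unit-normalized vectors this coincides with the original argmax of cos(d, b) - cos(d, a) + cos(d, c) in Mikolov et al. (2013b):

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def three_cos_add(vectors, a, b, c):
    """Solve 'a is to b as c is to ?' by maximizing cos(d, b - a + c),
    excluding the three query words from the candidates."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))
```

For word similarity, the cosine scores over the dataset's word pairs are correlated (Spearman ρ, Pearson r) against the averaged annotator ratings; for analogies, accuracy is the fraction of items where the returned word matches the gold answer.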
Word similarity: Table 2 shows the intrinsic evaluation performance of the embedding models. Annotator word similarity ratings range from 0 to 10, and the cosine similarity scores are scaled by a factor of ten to match; all values in Table 2 are then scaled by a factor of one hundred. The maximum semantic correlation values are S_s(ρ) = 60.02% and S_s(r) = 60.66% for GloVe (size = 300, window = 15). The highest syntactic correlation is S_y(ρ) = 70.41% for GloVe (size = 250, window = 10), whereas S_y(r) = 71.64% is obtained with GloVe (size = 250, window = 5). The highest relatedness correlations, R_s(ρ) = 79.22% and R_s(r) = 79.80%, were achieved with GloVe (size = 250, window = 10). Eight semantic word pairs and four syntactic word pairs could not be processed by the embedding models (E_m) of any technique, while all relatedness word pairs were fully processed by all embedding techniques.
Word analogy results: The semantic analogy results are shown in Table 3, while Table 4 reports the syntactic analogy performance on our corpus. Due to the unavailability of Bengali semantic and syntactic analogy datasets, we developed the A_t dataset, in which 50 analogy items are used for the semantic tasks and another fifty for the syntactic tasks. GloVe (size = 300, window = 15) achieved the maximum semantic analogy accuracy of 38.00% (Add) and 44.00% (Mul). The minimum semantic analogy accuracy, 20.00% (Add) and 26.00% (Mul), was obtained by the pre-trained FastText model (Grave et al., 2018). The maximum syntactic analogy accuracies, 30.00% (Add) and 36.00% (Mul), were achieved by the GloVe (size = 300, window = 15) model, while the pre-trained FastText model (Grave et al., 2018) obtained 20.00% (Add) and 24.00% (Mul).

Extrinsic evaluation results
The E_d is a Bengali text classification dataset partitioned into three sets: training (39,079), validation (6,000) and testing (9,779). The text classifier is trained with a multi-kernel CNN architecture (Kim, 2014). The performance of the classifier is assessed with extrinsic evaluators including accuracy (A), micro-averaged F1-score, average precision (A_p), average recall (A_r) and the confusion matrix (CM) (Wu et al., 2020). These evaluators measure the embedding models' (E_m) downstream (text classification) performance, reported in Tables 5 and 6. The GloVe model achieved the highest accuracy of 96.48%. For clarity, we present only the results of the best four of the ninety embedding models. Table 6 depicts the confusion matrix of the GloVe model (size = 200, window = 10) for text classification. The most correctly predicted class is Politics, and the most incorrectly predicted class is Crime; the highest misclassification occurred between the Crime and Accident pair. Figure 2 shows example semantic, syntactic and relatedness word-pair scores obtained from the GloVe model and human annotators (the cosine similarity score is normalized by a factor of ten, and annotator scores range from 1 to 10). The GloVe and FastText (SG) models' accuracy is considerable for semantic and relatedness similarities. In the case of extrinsic evaluation, the performance of the GloVe and FastText embedding models is significant for the text classification task.
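The extrinsic evaluators above can all be derived from the confusion matrix. A minimal sketch (illustrative, not the authors' code) that computes accuracy, macro-averaged precision and recall, and micro-averaged F1; note that for single-label multi-class classification, micro-averaged F1 reduces to accuracy:

```python
def classification_metrics(cm):
    """Metrics from a confusion matrix where cm[i][j] counts items
    with true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    accuracy = correct / total
    precisions, recalls = [], []
    for i in range(k):
        pred_i = sum(cm[r][i] for r in range(k))  # column sum: predicted as i
        true_i = sum(cm[i])                       # row sum: truly class i
        precisions.append(cm[i][i] / pred_i if pred_i else 0.0)
        recalls.append(cm[i][i] / true_i if true_i else 0.0)
    # micro F1: global TP / (TP + (FP + FN) / 2) = accuracy in this setting
    micro_f1 = accuracy
    return accuracy, sum(precisions) / k, sum(recalls) / k, micro_f1
```

Applied to the six-class matrix of Table 6, this yields the accuracy, A_p, A_r and F1 figures reported for each embedding model.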

Conclusion and Future Work
In this work, we generated ninety embedding models for the Bengali language. These models were developed using combinations of three embedding techniques (GloVe, Word2Vec, and FastText) and various hyperparameters. All models were evaluated by extrinsic and intrinsic evaluators on our developed corpus. The performance of an embedding model depends significantly on the hyperparameters, the corpus and the nature of the model. Although the GloVe model performed better than Word2Vec and FastText, there is no single generalized embedding model for both intrinsic and extrinsic NLP tasks: the embedding models are highly corpus-oriented, and the best hyperparameters vary from one task to another. In the future, the existing Bengali corpus can be extended for embedding model generation to alleviate out-of-vocabulary problems. Context-dependent representation techniques (such as BERT, ELMo and XLNet) will be investigated to find a suitable embedding technique for Bengali. In addition, more analogy tasks can be considered to assess the performance of different embedding models with various intrinsic and extrinsic evaluators.