WHAT ARE COVID-19 ARABIC TWEETERS TALKING ABOUT?

The new coronavirus outbreak (COVID-19) has swept the world since December 2019, posing a global threat to all countries and communities on the planet. Information about the outbreak has been spreading rapidly on different social media platforms at an unprecedented level. As the virus continues to spread in different countries, people increasingly share information and use social media to stay up to date with the latest news. It is crucial to capture the discussions and conversations happening on social media to better understand human behavior during pandemics and to adapt strategies for combating the pandemic. In this work, we analyze the Arabic content of Twitter to capture the main topics discussed among Arabic users. We utilize Non-negative Matrix Factorization (NMF) to discover the main issues and topics in a dataset of Arabic tweets posted from early January to the end of April, and we identify the most frequent unigrams, bigrams, and trigrams of the tweets. The discovered topics are then presented and discussed. They can be roughly classified into COVID-19 origin topics, prevention measures in different Arabic countries, prayers and supplications, news and reports, and topics related to preventing the spread of the disease, such as curfew and quarantine. To the best of our knowledge, this is the first work addressing the detection of COVID-19 related topics from Arabic tweets.


Introduction
In recent years, social networks have become a remarkable source for reflecting societies' interests in and reactions to a specific topic. Analyzing the content and diffusion of social network information has proven useful and is increasingly used in many fields to characterize an event of interest, e.g., political, sports, or medical events. Lately, it has become worthwhile to direct this capability toward the pandemic spread of the coronavirus. Consequently, an expedited research effort has been devoted to analyzing social network content and activity during the pandemic to help recognize and characterize the social response [1].
Meanwhile, with coronavirus infection spreading around the world, Arabic countries have been suffering from the outbreak of COVID-19 like the rest of the world. Nowadays, many individuals' activities and conversations related to the pandemic are carried out through social media platforms such as Facebook, Twitter, and Instagram. Twitter is one of the most popular social media platforms and has seen strong growth in the Arabic region, where the number of posts reaches 17 million tweets per day according to the Arab social media report [2].
Due to Twitter's overwhelming usage and popularity, tweet content mining can potentially provide valuable information during health crises. Several studies have shown that Twitter can be exploited as a data source for detecting the outbreak of a pandemic, as was the case with the H1N1 virus [3] and outbreaks of influenza [4]. Moreover, it plays an important role in understanding public behaviors and impressions regarding health crises, for example by analysing tweet text to identify the main concerns about the Zika virus [5] and by filtering the topics related to Ebola [6].
Recently, the rise of coronavirus cases in the Arabic countries has led to escalating discussions related to the COVID-19 pandemic on social media platforms. Identifying the main concerns, thoughts, and topics regarding the coronavirus crisis might therefore be useful to public health professionals and social scientists, since it provides an instantaneous snapshot of Arabic social opinions and behavioural responses. To this end, the main goal of this paper is to obtain an overview of the most discussed topics in Arabic tweets posted by Arabic tweeters by employing text mining techniques. This study presents a first step toward extracting the main topics discussed by Arabic Twitter users regarding the COVID-19 pandemic. We use Non-negative Matrix Factorization (NMF), a topic modelling method, to identify latent topics.

Methodology
In this section, we describe the workflow of the methodology we adopted for this study and explain its main steps in detail. The workflow is depicted in Figure 1 and is composed of the following steps:
• Dataset preparation.
• Topics discovery and themes identification:
  - NMF for topic modelling.

Dataset preparation
We use the dataset of the Arabic Twitter COVID-19 collection 2 [7], which contains 3,934,610 Arabic tweets related to COVID-19. The original dataset was collected through Twitter's streaming API and covers the time span from January 1, 2020 to April 30, 2020. To build a better-quality dataset for the experiment, the following filtering and cleaning steps were applied to the tweet collection to remove noise from the data:
• Filtering non-Arabic tweets: many of the tweets found were multilingual, since Arab users may post tweets written in languages other than Arabic. Therefore, we opted to filter out the multilingual tweets [8]. The non-Arabic tweets were identified using the language field in the tweet metadata [9].
• Filtering out the retweets: retweets were removed from the dataset to eliminate tweets with duplicated content.
• Filtering out short tweets: tweets with one or two words are usually ambiguous and do not provide meaningful information. Therefore, tweets with fewer than three words were filtered out.
Applying the previous filtering steps, we ended up with 2,426,850 tweets.
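The three filtering steps above can be sketched as a single predicate over tweet records. This is a minimal illustration rather than the pipeline used in the study: it assumes each tweet is a dictionary shaped like a Twitter API v1.1 object with `lang`, `text`, and (for retweets) `retweeted_status` fields, and the function names `is_retweet`, `keep_tweet`, and `filter_tweets` are hypothetical.

```python
def is_retweet(tweet):
    # Assumption: retweets either carry a "retweeted_status" field
    # or have text that starts with the conventional "RT @" prefix.
    return "retweeted_status" in tweet or tweet.get("text", "").startswith("RT @")


def keep_tweet(tweet, min_words=3):
    # Step 1: keep only tweets whose language field is Arabic.
    if tweet.get("lang") != "ar":
        return False
    # Step 2: drop retweets to avoid duplicated content.
    if is_retweet(tweet):
        return False
    # Step 3: drop tweets with fewer than `min_words` whitespace-separated words.
    return len(tweet.get("text", "").split()) >= min_words


def filter_tweets(tweets):
    # Return the tweets that pass all three filters, preserving order.
    return [t for t in tweets if keep_tweet(t)]
```

Counting words by whitespace splitting is a simplification; a real pipeline might count words only after mentions and URLs have been stripped.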

Text pre-processing
To prepare the tweets for text mining, it is necessary to represent them in a form that is more suitable for analysis. The pre-processing involved applying several steps to the entire dataset with the aim of reducing trivial noise in the data. The following text pre-processing techniques were applied:
• Noise symbols and characters removal: we first deleted all mentions and URLs from each tweet. We then deleted all emojis, Arabic and English punctuation, all numbers, and all non-alphabetic characters. We also removed non-Arabic letters, keeping only characters from the Arabic alphabet. Finally, we removed leading and trailing spaces as well as line breaks.
• Removing the Arabic vowel diacritics, 'Tashkeel': 'Tashkeel' [10] are diacritical marks that appear above or below letters and indicate pronunciation in accordance with syntactic and grammatical rules. Words carrying diacritical marks therefore take different surface forms even when they share the same origin. To unify the forms of similar words, we removed the diacritics from the tweets.
• Tokenization: we tokenized each tweet into words, which resulted in a group of raw tokens stored as an array of strings. In Arabic, the word "virus" can be spelled in two ways, pronounced "Fairus" or "Firus". Therefore, we applied an extra normalization step to unify the word "virus" in Arabic by converting one spelling to the other.
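The pre-processing steps above can be sketched with regular expressions from the Python standard library. This is an illustrative approximation, not the authors' code. The Unicode ranges are assumptions: U+064B to U+0652 for the Tashkeel marks and U+0621 to U+064A for the basic Arabic letters. The direction of the "virus" normalization, from the "Fairus" spelling to the "Firus" spelling, is also an assumption made for this example. The name `preprocess` is hypothetical.

```python
import re

# Mentions and URLs are removed first so their ASCII parts do not linger.
MENTION_OR_URL = re.compile(r"@\w+|https?://\S+|www\.\S+")
# Assumption: Tashkeel marks occupy U+064B (fathatan) to U+0652 (sukun).
TASHKEEL = re.compile(r"[\u064B-\u0652]")
# Assumption: basic Arabic letters occupy U+0621 to U+064A; everything else
# (emojis, digits, punctuation, Latin letters) is replaced by a space.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")

# The two spellings of "virus"; the choice of canonical form is an assumption.
VIRUS_VARIANT = "\u0641\u0627\u064A\u0631\u0648\u0633"   # "Fairus"
VIRUS_CANONICAL = "\u0641\u064A\u0631\u0648\u0633"       # "Firus"


def preprocess(tweet):
    """Clean one tweet and return its list of Arabic tokens."""
    text = MENTION_OR_URL.sub(" ", tweet)
    text = TASHKEEL.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    # split() also discards leading/trailing spaces and line breaks.
    tokens = text.split()
    return [VIRUS_CANONICAL if t == VIRUS_VARIANT else t for t in tokens]
```

Removing the diacritics before the non-Arabic filter matters here: the filter replaces characters with spaces, so applying it first would split a word at every diacritic.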

NMF for topic modelling
Non-negative Matrix Factorization (NMF) is an unsupervised technique for reducing the dimensionality of non-negative matrices [11]. It has been successfully applied in the field of text mining to identify topics [12,13]. Our study utilized NMF because of its ability to give semantically meaningful results. A study by O'Callaghan et al. [14] found that NMF produced more coherent topics than other popular topic modelling techniques such as the latent Dirichlet allocation (LDA) model. To apply NMF, the pre-processed tweets were transformed into log-based Term Frequency-Inverse Document Frequency (TF-IDF) vectors, forming a matrix in which each row corresponds to a term and each column to a document [15]. Basing NMF on TF-IDF values has proven useful, since TF-IDF accounts for the importance of a word to a document within a collection of texts [14,16].
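For reference, the factorization can be written out as follows. The symbols are notation assumed here, not taken from the original text: A is the TF-IDF matrix with m terms (rows) and n tweets (columns), k is the number of topics, W holds the term weights of each topic, and H holds the topic weights of each tweet. The squared Frobenius norm shown is the commonly used objective; the text does not state which objective was optimized.

```latex
A \approx W H, \qquad
\min_{W \ge 0,\; H \ge 0} \; \lVert A - W H \rVert_F^2,
\qquad
A \in \mathbb{R}_{\ge 0}^{m \times n},\;
W \in \mathbb{R}_{\ge 0}^{m \times k},\;
H \in \mathbb{R}_{\ge 0}^{k \times n}
```

Under this orientation, the top terms of topic j are the rows with the largest entries in column j of W. Libraries that store documents as rows, such as scikit-learn, work with the transpose of A.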

Topic model coherence evaluation employing word2vec
Because it is difficult to define a similarity measure in a high-dimensional sparse vector space, we draw on word embedding techniques to determine the number of topics. We opted to use the Topic Coherence-Word2Vec (TC-W2V) metric presented in [14], which measures the coherence between the words assigned to a topic via Word2Vec. Word2Vec is a model that represents words as vectors. It is one of the most promising techniques in NLP for capturing the meaning of words [17]. We trained our word2vec model on the 2,426,850 tweets using the Skip-gram algorithm with a dimension of 50. The word vectors were produced using the Gensim package in Python. As shown in Figure 3, the highest average value was 0.3504, obtained with k=11. Based on the result of the TC-W2V metric, we trained the NMF model with the optimal number of topics, k equal to 11, using the scikit-learn implementation of NMF (including NNDSVD initialization).
A basic frequency analysis of unigrams, bigrams, and trigrams over time reflects the change in Arabic tweeters' trends and concerns during the pandemic. After applying the pre-processing steps, we constructed frequency tables of unigrams, bigrams, and trigrams for the entire pre-processed dataset. Then, we analyzed the frequency of each n-gram over the whole dataset and explored the topmost unigrams, bigrams, and trigrams week by week.
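The TC-W2V score used to select k can be sketched as follows. The sketch assumes the usual reading of the metric in [14]: the coherence of a topic is the mean pairwise cosine similarity between the word vectors of its top terms, and the score of a model is the mean over its topics. The `vectors` argument is a plain dictionary standing in for the word vectors trained with Gensim, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def topic_coherence(top_terms, vectors):
    # Mean pairwise cosine similarity over all pairs of top terms.
    # Requires at least two terms, all present in `vectors`.
    pairs = list(combinations(top_terms, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


def model_coherence(topics, vectors):
    # Score of a topic model: mean coherence across its topics.
    return sum(topic_coherence(t, vectors) for t in topics) / len(topics)
```

To choose the number of topics, one would fit a model for each candidate k, take the top terms of each topic, and keep the k with the highest `model_coherence`.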
For unigram frequency, we generated a word cloud of the top 10,000 unigrams in the dataset. Figure 4 illustrates that the Arabic word for "corona" was the most frequent word. We also investigated the volume of three words associated with the COVID-19 pandemic that appeared in Arabic tweets in January: the Arabic words for "corona", "epidemic", and "Wuhan". Figure 5 plots the number of occurrences of these words. A clear increase is seen over the last two weeks of January, with the highest counts reached on the 25th of January: 13,938, 3,717, and 2,013 occurrences for "corona", "epidemic", and "Wuhan", respectively. With respect to bigrams and trigrams, Table 1 shows the top 10 bigrams and trigrams in the dataset. From the constructed table, we created a list of bigrams and trigrams, and tracked each of them after merging it with the other n-grams that have the same meaning. The day with the highest frequency in each month is reported in Table 2. Months in which a bigram or trigram has zero or very low frequency counts were omitted from the table.
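The n-gram frequency tables behind these counts can be built with the standard library alone. The following is a minimal sketch under the assumption that each tweet has already been tokenized into a list of strings. The names `ngrams` and `ngram_counts` are hypothetical, and the English tokens in the usage note stand in for the Arabic ones.

```python
from collections import Counter


def ngrams(tokens, n):
    # All contiguous n-token windows of one tweet, joined by a space.
    # Returns an empty list when the tweet has fewer than n tokens.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(tokenized_tweets, n):
    # Frequency table of n-grams over a collection of tokenized tweets.
    counts = Counter()
    for tokens in tokenized_tweets:
        counts.update(ngrams(tokens, n))
    return counts
```

For example, `ngram_counts([["corona", "virus", "spread"], ["corona", "virus"]], 2).most_common(1)` yields the bigram "corona virus" with a count of 2. Grouping tweets by day or week before counting gives the per-period frequencies tracked in the figures.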
In February 2020, news about the coronavirus started to spread across Arabic countries. The bigrams "corona virus" and "corona covid" appeared mostly in the first week of the month, while the bigram "quarantine" started to increase over the last days of February, as shown in Figure 6. Similarly, Figure 7 shows the evolution of the top three trigrams. The trigram "corona virus infection" had its highest occurrences in the second week, whereas "corona virus spread" started with higher occurrences in the first week of February. We also noticed that the trigram "order flight ban" was the most frequent trigram at the end of February. In March 2020, the number of coronavirus infections was increasing rapidly in Arabic countries, and so was the number of tweets about the virus. We tracked the bigrams and trigrams for both March and April 2020 in the same way.
The bigram list was separated into two lists: bigrams related to coronavirus, and a second list comprising the "Health ministry" bigram and four bigrams about prevention measures, as shown in Figure 8 and Figure 9, respectively. In terms of the frequency of coronavirus-related bigrams, Figure 8 shows that the bigram pattern was more stable in March than in April.
Regarding the second list, the bigrams "quarantine" and "curfew" were the most frequent bigrams from the second week of March to the end of the fourth week. These bigrams were still used during April, albeit less frequently, as shown in Figure 9. Moreover, the bigrams "washing hand" and ...
In terms of trigram frequency, "home quarantine activities" was the most frequent trigram in March. Although this trigram was the sixth most frequent trigram in the entire dataset, as listed in Table 1, it appeared only a few times in April. The trigram "corona virus spread" was used ... The remaining trigrams, which include the supplications "oh God, remove the affliction" and "Allah, let our lives be extended so that we live to see the holy month of Ramadan", reached their highest frequency in March and continued to appear in April with lower frequency. Moreover, the trigram "please stay at home" appeared in March only. Figure 10 shows the frequency of the top trigrams for both March and April.

Exploratory Topic Discovery
We analyzed the 11 topics extracted from the tweets using the NMF model described earlier in Section 2.3. The distributions and the top-7 terms associated with each topic are shown in Table 3. To provide an overview of the main topics discussed regarding the coronavirus in Arabic tweets, we inspected a few chosen tweets from each topic along with the top bigrams and trigrams, and observed the following:
• Topic 1: Prevention measures taken against the virus, staying at home, and protection from coronavirus infection. The countries most frequently mentioned in the tweets were Saudi Arabia, Egypt, Lebanon, China, Jordan and Oman.
• Topic 2: Quarantine, its impact on individuals, and quarantine activities, as well as appeals to increase charitable donations.
• Topic 3: Corona as a global epidemic, the suspension of schools, and the coronavirus epidemic.
• Topic 4: China, flight cancellations from and to China, and discussion of the spread of the virus in Wuhan city.
• Topic 5: Curfew; tweeters mostly mention Kuwait, Saudi Arabia, and Jordan. Moreover, appeals to stay at home, such as "please stay home", were mostly written in Gulf dialect.
• Topic 6: Mainly the spread of coronavirus in Egypt. Most of the tweets were written in Egyptian dialect.
• Topic 7: Supplications, such as may Allah save us and protect Muslims.

Conclusions
This paper presents a preliminary analysis and topic extraction of Arabic tweets posted during the COVID-19 pandemic from January to April 2020. An analysis of the topmost frequent bigrams and trigrams showed change in topic