DATASET | doi:10.20944/preprints202004.0263.v1
Online: 16 April 2020 (08:15:28 CEST)
The spread of the coronavirus (COVID-19) across the globe has affected our lives on many different levels. The world we knew before the spread of the virus has changed. Every country has taken preventive measures, including social distancing, travel restrictions, and curfews, to control the spread of the disease. With these measures in place, people have shifted to online social media platforms, such as Twitter, to maintain connections. In this paper, we describe a coronavirus data set of Arabic tweets collected from January 1, 2020, primarily from hashtags populated from Saudi Arabia. This data set is available to the research community to gain a better understanding of the societal, economic, and political effects of the outbreak and to help policy makers make better decisions for fighting this epidemic.
REVIEW | doi:10.20944/preprints202103.0734.v1
Online: 30 March 2021 (12:17:49 CEST)
(1) Background: Spain launched an official campaign, #EsteVirusLoParamosUnidos, to try and unite the efforts of the entire country through citizen cooperation to combat coronavirus. The research goal is to analyze the Twitter campaign’s repercussion on general citizen feeling. (2) Methods: The research is based on a composite design that triangulates from a theoretical model, a quantitative analysis and a qualitative analysis. (3) Results: Of the 7357 tweets in the sample, 72.32% were found to be retweets. Four content families were extracted: politics, education, messages to society and defense of occupational groups. The feelings expressed ranged along a continuum, from unity, admiration and support at one end to discontent and criticism regarding the health situation at the other. (4) Conclusions: The development of networked sociopolitical and technical measures that enable citizen participation facilitates the development of new patterns of interaction between governments and digital citizens, increasing citizens’ possibilities of influencing the public agenda and therefore strengthening citizen engagement vis-à-vis such situations.
ARTICLE | doi:10.20944/preprints202002.0170.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Twitter; dataset; redundancy; reduction; archive
Online: 13 February 2020 (12:45:44 CET)
Data from social networks like Twitter is a valuable source for research but full of redundancy, making it hard to provide large-scale, self-contained, and small datasets. Data recording is a common problem in social media-based studies and could be standardized, but this is rarely done. This paper reports on lessons learned from a long-term evaluation study recording the complete public sample of the German and English Twitter stream. It proposes a recording solution that simply chunks a linear stream of events to reduce redundancy. If an event is observed multiple times within the time span of a chunk, only the latest observation is written to the chunk. A 10-gigabyte raw Twitter dataset covering 1.2 million tweets from 120,000 users, recorded between June and September 2017, was used to analyze expected compression rates. The resulting datasets need only between 10% and 20% of the original data size without losing any event, metadata, or relationships between single events. This kind of redundancy-reducing recording makes it possible to curate large-scale (even nation-wide), self-contained, and small datasets of social networks for research in a standardized and reproducible manner.
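The chunking idea described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the event layout `(timestamp, event_id, payload)`, the function name `chunk_stream`, and the fixed chunk length in seconds are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Assumed event layout: (timestamp in seconds, event id, payload).
Event = Tuple[int, str, dict]


def chunk_stream(events: Iterable[Event], chunk_seconds: int) -> Iterator[List[Event]]:
    """Yield one list per time chunk, keeping only the latest observation of each id."""
    current_chunk: Optional[int] = None
    latest: Dict[str, Event] = {}
    for event in events:  # events are assumed to arrive in timestamp order
        timestamp, event_id, _payload = event
        chunk_index = timestamp // chunk_seconds
        if current_chunk is not None and chunk_index != current_chunk:
            yield list(latest.values())  # close the finished chunk
            latest = {}
        current_chunk = chunk_index
        latest[event_id] = event  # a later observation overwrites an earlier one
    if latest:
        yield list(latest.values())


stream = [
    (0, "tweet-1", {"retweets": 0}),
    (10, "tweet-2", {"retweets": 0}),
    (20, "tweet-1", {"retweets": 5}),  # same tweet observed again in the same chunk
    (70, "tweet-1", {"retweets": 9}),  # falls into the next 60-second chunk
]
chunks = list(chunk_stream(stream, chunk_seconds=60))
```

In this toy stream the first chunk stores two events instead of three, because the earlier observation of `tweet-1` is replaced by the later one.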
ARTICLE | doi:10.20944/preprints202007.0172.v1
Online: 9 July 2020 (07:25:19 CEST)
The new coronavirus outbreak (COVID-19) has swept the world since December 2019, posing a global threat to all countries and communities on the planet. Information about the outbreak has been spreading rapidly on different social media platforms at an unprecedented level. As the outbreak continues to spread in different countries, people increasingly share information and stay up to date with the latest news. It is crucial to capture the discussions and conversations happening on social media to better understand human behavior during pandemics and to adjust possible strategies to combat the pandemic. In this work, we analyze the Arabic content of Twitter to capture the main topics discussed among Arabic users. We utilize Non-negative Matrix Factorization (NMF) to discover the main issues and topics based on a dataset of Arabic tweets from early January to the end of April, and identify the most frequent unigrams, bigrams, and trigrams of the tweets. The final discovered topics are then presented and discussed. They can be roughly classified into COVID-19 origin topics, prevention measures in different Arabic countries, prayers and supplications, news and reports, and finally topics related to preventing the spread of the disease, such as curfew and quarantine. To the best of our knowledge, this is the first work addressing the issue of detecting COVID-19 related topics from Arabic tweets.
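The n-gram counting step mentioned in this abstract can be sketched with the standard library. This covers only the frequency counts, not the NMF topic model. Whitespace tokenization, the English sample texts, and the function names are assumptions for illustration; real Arabic tweets would need proper normalization and tokenization.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts: Iterable[str], n: int, k: int) -> List[Tuple[Tuple[str, ...], int]]:
    """Count n-grams over all texts and return the k most frequent."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), n))  # naive whitespace tokenization
    return counts.most_common(k)


sample = ["stay home stay safe", "stay home save lives", "wash hands stay home"]
top_unigrams = top_ngrams(sample, n=1, k=2)
top_bigrams = top_ngrams(sample, n=2, k=1)
```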
REVIEW | doi:10.20944/preprints202008.0235.v1
Online: 10 August 2020 (04:45:36 CEST)
Background: Twitter is a major tool for communication in emergencies such as natural disasters. This online social network allows the user to produce content, and it is not designed exclusively for news releases, as opposed to other service providers. Aim: The aim of this study is to investigate Twitter uses in natural disasters and pandemics. Methods: The included studies reported the role of Twitter in natural disasters. Studies set in contexts other than natural disasters (such as man-made disasters) and studies of other social media were excluded. A comprehensive literature search of electronic databases, including MEDLINE, Web of Science, CINAHL, PsycINFO, Cochrane Register of Controlled Trials (CENTRAL) and EMBASE, was used to identify records matching the inclusion criteria that were published up to May 2020. The study characteristics extracted from the qualified studies included year of publication, findings, and geographical location of the study. A narrative synthesis was used for this literature review. Results: The search identified 822 articles of which 780 articles were removed, 256 were not available, 311 papers were not relevant, 16 were duplicated articles, and 197 were non-related to the emergencies. 45 articles met the selection criteria and were included in the review. Eleven themes were found in the narrative synthesis including early warning, disseminating information and misinformation, advocacy, personal gains, assessment, various roles of organizations, public mood, geographical analysis, charity, using influencers, and trust. Conclusions: It is recommended that influential individuals be identified in each country and community before disasters occur so that the necessary information can be disseminated in response to disasters. Preventing the spread of misinformation is one of the most important issues in times of disaster, especially pandemics. 
Disseminating accurate, transparent, and prompt information from relief organizations and governments can help. Also, analyzing Twitter data can be a good source for understanding the mental state of the community, estimating the number of injured people, estimating the points affected by natural disasters, and modeling the prevalence of epidemics. Therefore, various groups such as politicians, the government, non-governmental organizations, aid workers, and the health system can use this information to plan and implement interventions.
ARTICLE | doi:10.20944/preprints202007.0257.v1
Subject: Social Sciences, Geography Keywords: Twitter; Spatiotemporal analysis; Mega-events; Olympic Games
Online: 12 July 2020 (14:25:23 CEST)
Olympic Games have a huge impact on the cities where they are held, both during the actual celebration of the event and before and after it. This study presents a new approach based on spatial analysis, GIS, and data from Location Based Social Networks to model the spatiotemporal dimension of impacts associated with the Rio 2016 Olympic Games. Geolocalized data from Twitter are used to analyze the activity pattern of users from two different viewpoints. The first monitors the activity of Twitter users during the event: the arrival of visitors, where they came from, and the use residents and tourists made of different areas of the city. The second assesses the spatiotemporal use of the city by Twitter users before the event, compared to the use during and after the event. The results reveal not only which spaces were the most used while the Games were being held but also changes in the urban dynamics after the Games. Both approaches can be used to assess the impacts of mega-events and to improve the management and allocation of urban resources such as transport and public services infrastructure.
ARTICLE | doi:10.20944/preprints202202.0306.v1
Subject: Medicine & Pharmacology, Other Keywords: COVID-19; Twitter; Blood Clots; Social Media; Clots
Online: 24 February 2022 (09:35:35 CET)
After the first weeks of vaccination against SARS-CoV-2, several cases of acute thrombosis were reported. These news reports began to be shared frequently across social media platforms. The aim of this study was to conduct an analysis of Twitter data related to the overall discussion. Data was retrieved from 14th March to 14th April using the keyword ‘blood clots’. A dataset of n=266,677 tweets was retrieved, and a systematic random sample of 5% of tweets (n=13,334) was entered into NodeXL for further analysis. Social network analysis was used to analyse the data by drawing upon the Clauset-Newman-Moore algorithm. Influential users were identified by drawing upon the betweenness centrality metric. Text analysis was applied to identify the key hashtags and websites used at this time. More than half of the network consisted of retweets, and the largest groups within the network were broadcast clusters where a number of key users were retweeted. The most popular narratives highlighted the low risk of obtaining a blood clot from a vaccine and the higher blood clot risks of commonly consumed medicines. A wide variety of actors drove the discussion on Twitter, ranging from writers, physicians, the general public, academics, and celebrities to journalists. Twitter was used to highlight the low potential of obtaining a blood clot from a vaccine and to encourage vaccination among the public.
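The 5% systematic random sample reported in this abstract can be illustrated as follows. This is a sketch of the general technique (random start, then every k-th item), not the authors' procedure. The function name and the use of integer positions in place of real tweet records are assumptions.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def systematic_sample(items: Sequence[T], fraction: float, rng: random.Random) -> List[T]:
    """Pick a random start in the first interval, then take every step-th item."""
    step = round(1 / fraction)   # 5% corresponds to every 20th item
    start = rng.randrange(step)  # random offset within the first interval
    return list(items[start::step])


tweet_positions = range(266_677)  # stand-in for the retrieved tweets
sample = systematic_sample(tweet_positions, fraction=0.05, rng=random.Random(0))
```

With 266,677 items and a step of 20, the sample holds 13,333 or 13,334 items depending on the random start, which is consistent with the reported n=13,334.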
ARTICLE | doi:10.20944/preprints202110.0070.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Online social networks (OSNs); Deep Learning; cyberbullying; Twitter
Online: 5 October 2021 (08:27:41 CEST)
Online social networks (OSNs) play an integral role in facilitating social interaction; however, these social networks increase antisocial behavior, such as cyberbullying, hate speech, and trolling. Aggression or hate speech that takes place through short message service (SMS) or the Internet (e.g., in social media platforms) is known as cyberbullying. Therefore, automatic detection utilizing natural language processing (NLP) is a necessary first step that helps prevent cyberbullying. This research proposes an automatic cyberbullying method to detect aggressive behavior using a consolidated deep learning model. This technique utilizes multichannel deep learning based on three models, namely, the bidirectional gated recurrent unit (BiGRU), transformer block, and convolutional neural network (CNN), to classify Twitter comments into two categories: aggressive and not aggressive. Three well-known hate speech datasets were combined to evaluate the performance of the proposed method. The proposed method achieved promising results. The accuracy of the proposed method was approximately 88%.
ARTICLE | doi:10.20944/preprints202110.0015.v1
Subject: Medicine & Pharmacology, Nursing & Health Studies Keywords: Social media; Community; Facebook; Twitter; Google; Information; Interaction
Online: 1 October 2021 (12:03:09 CEST)
Background: Caregivers often use the internet to access information related to stroke care to improve preparedness, thereby reducing uncertainty and enhancing the quality of care. Method: Social media communities used by caregivers of people affected by stroke were identified using popular keywords searched for using Google. Communities were filtered based on their ability to provide support to caregivers. Data from the included communities were extracted and analysed to determine the content and level of interaction. Results: There was a significant rise in the use of social media by caregivers of people affected by stroke. The most popular social media communities were charitable and governmental organizations with the highest user interaction – this was for topics related to stroke prevention, signs and symptoms, and caregiver self-care delivered through video-based resources. Conclusion: Findings show the ability of social media to support stroke caregiver needs and practices that should be considered to increase their interaction and support.
ARTICLE | doi:10.20944/preprints202004.0031.v1
Subject: Keywords: Saudi Arabia, COVID-19, Sentiment Analysis, Twitter, Measures
Online: 3 April 2020 (11:45:33 CEST)
Background: Countries around the world are facing extraordinary challenges in implementing various measures to slow down the spread of the novel coronavirus (COVID-19). Guided by international recommendations, Saudi Arabia has implemented a series of infection control measures after the detection of the first confirmed case in the country. However, in order for these measures to be effective, public attitudes and compliance must be conducive as perceived risk is strongly associated with health behaviors. The primary objective of this study is to assess Saudis’ attitudes towards COVID-19 preventive measures to guide future health communication content. Methods: Naïve Bayes machine learning model was used to run Arabic sentiment analysis of Twitter posts through the Natural Language Toolkit (NLTK) library in Python. Tweets containing hashtags pertaining to seven public health measures imposed by the government were collected and analyzed. Results: A total of 53,127 tweets were analyzed. All measures, except one, showed more positive tweets than negative. Measures that pertain to religious practices showed the most positive sentiment. Discussion: Saudi Twitter users showed support and positive attitudes towards the infection control measures to combat COVID-19. It is postulated that this conducive public response is reflective of the overarching, longstanding popular confidence in the government. Religious notions may also play a positive role in preparing believers at times of crises. Findings of this study broadened our understanding to develop proper public health messages and promote stronger compliance with control measures to control COVID-19.
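A Naïve Bayes sentiment classifier of the kind named in this abstract can be sketched from scratch. The study used the NLTK library on Arabic tweets; the version below is a standalone multinomial Naïve Bayes with add-one smoothing, written only to show the idea. The class name, the whitespace tokenization, and the English toy examples are assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocabulary: Set[str] = set()

    def fit(self, examples: List[Tuple[str, str]]) -> None:
        for text, label in examples:
            tokens = text.split()  # naive whitespace tokenization
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)

    def predict(self, text: str) -> str:
        total = sum(self.label_counts.values())
        scores: Dict[str, float] = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)  # log prior
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in text.split():
                # Smoothed log likelihood of the token given the label.
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            scores[label] = score
        return max(scores, key=lambda name: scores[name])


# Hypothetical toy training data, not taken from the study.
model = NaiveBayes()
model.fit([
    ("great measure thank you", "positive"),
    ("support the curfew", "positive"),
    ("bad decision", "negative"),
    ("curfew is terrible", "negative"),
])
prediction = model.predict("thank you for the support")
```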
DATA DESCRIPTOR | doi:10.20944/preprints202206.0246.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: dataset; twitter; tweets; IMDb ratings; movies; sentiment analysis; NLP
Online: 17 June 2022 (04:39:16 CEST)
In this paper we present a dataset that contains a collection of tweets generated as reactions to the release of 50 different movies. The dataset can be used to gain useful insights into the conversation generated around a particular movie. It is particularly suitable for sentiment analysis and other NLP techniques. The dataset contains approximately 2.5 million tweets with their related metadata and covers 50 movies. For each movie, its IMDb rating is included. The movies are the 25 releases with the highest number of votes during 2020 and 2021. The collected tweets represent the reactions of the Twitter community during the first week after the US release date of each movie. The number of tweets per movie ranged from 1,000 to approximately 200,000, with an average of 50,000 per release. We used the Internet Archive Wayback Machine to retrieve the IMDb movie rating one week after the US release date. The tweets and related metadata were collected using the Tweet Downloader tool.
Subject: Social Sciences, Other Keywords: COVID-19; Twitter; Geo-Tagged; Metropolitan; Computational Social Science
Online: 18 May 2021 (10:24:58 CEST)
One of the unfortunate findings from the ongoing COVID-19 crisis is the disproportionate impact the crisis has had on people and communities who were already socioeconomically disadvantaged. It has, however, been difficult to study this issue at scale and in greater detail using social media platforms like Twitter. Several COVID-19 Twitter datasets have been released, but they have very broad scope, both topically and geographically. In this paper, we present a more controlled and compact dataset that can be used to answer a range of potential research questions (especially pertaining to computational social science) without requiring extensive preprocessing or tweet-hydration from the earlier datasets. The proposed dataset comprises tens of thousands of geotagged (and in many cases, reverse-geocoded) tweets originally collected over a 255-day period in 2020 over 10 metropolitan areas in North America. Since there are socioeconomic disparities within these cities (sometimes to an extreme extent, as witnessed in ‘inner city neighborhoods’ in some of these cities), the dataset can be used to assess such socioeconomic disparities from a social media lens, in addition to comparing and contrasting behavior across cities.
ARTICLE | doi:10.20944/preprints202207.0196.v1
Subject: Medicine & Pharmacology, Nutrition Keywords: Food Security; Machine Learning; Topic Modeling; Twitter; Natural Language Processing
Online: 13 July 2022 (09:14:39 CEST)
Objective: Food security during public health emergencies relies on situational awareness of needs and resources. Artificial intelligence (AI) has revolutionized situational awareness during crises, allowing the allocation of resources to needs through machine learning algorithms. Limited research exists monitoring Twitter for changes in the food security-related public discourse during the COVID-19 pandemic. We aim to address that gap with AI by classifying food security topics on Twitter and showing topic frequency per day. Methods: Tweets were scraped from Twitter from January 2020 through December 2021 using food security keywords. Latent Dirichlet Allocation (LDA) topic modeling was performed, followed by time-series analyses on topic frequency per day. Results: 237,107 tweets were scraped and classified into topics, including food needs and resources, emergency preparedness and response, and mental/physical health. After the WHO’s pandemic declaration, there were relative increases in topic density per day regarding food pantries, food banks, economic and food security crises, essential services, and emergency preparedness advice. Threats to food security in Tigray emerged in 2021. Conclusions: AI is a powerful yet underused tool to monitor food insecurity on social media. Machine learning tools to improve emergency response should be prioritized, along with measurement of impact. Further food insecurity word patterns testing, as generated by this research, with supervised machine learning models can accelerate the uptake of these tools by policymakers and aid organizations.
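The per-day topic frequency described in this abstract can be sketched as a simple aggregation. The sketch assumes each tweet has already been assigned a topic label by an upstream topic model (the LDA step is not shown). The function name, the record layout, and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def topic_share_per_day(records: Iterable[Tuple[date, str]]) -> Dict[date, Dict[str, float]]:
    """Return, for each day, the share of that day's tweets assigned to each topic."""
    counts: Dict[date, Counter] = defaultdict(Counter)
    for day, topic in records:
        counts[day][topic] += 1
    shares: Dict[date, Dict[str, float]] = {}
    for day, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[day] = {topic: n / total for topic, n in topic_counts.items()}
    return shares


# Hypothetical labeled tweets as (day, topic) pairs.
records = [
    (date(2020, 3, 10), "food banks"),
    (date(2020, 3, 10), "mental health"),
    (date(2020, 3, 12), "food banks"),
    (date(2020, 3, 12), "food banks"),
    (date(2020, 3, 12), "emergency preparedness"),
    (date(2020, 3, 12), "food banks"),
]
shares = topic_share_per_day(records)
```

Using shares rather than raw counts makes days with different tweet volumes comparable, which is one plausible reading of "relative increases in topic density per day".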
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Online Social Media prediction, Covid-19 prediction, Twitter, Google Trends
Online: 3 June 2021 (11:37:56 CEST)
As the coronavirus disease 2019 (COVID-19) continues to rage worldwide, the United States has become the most affected country with more than 34.1 million total confirmed cases up to June 1, 2021. In this work, we investigate correlations between online social media and Internet search for the COVID-19 pandemic among 50 U.S. states. By collecting the state-level daily trends through both Twitter and Google Trends, we observe a high but state-different lag correlation with the number of daily confirmed cases. We further find that the predictive accuracy measured by the correlation coefficient is positively correlated to a state’s demographic, air traffic volume and GDP development. Most importantly, we show that a state’s early infection rate is negatively correlated with the lag to the previous peak in Internet search and tweeting about COVID-19, indicating that earlier collective awareness on Twitter/Google correlates with lower infection rate. Lastly, we demonstrate that correlations between online social media and search trends are sensitive to time, mainly due to the attention shifting of the public.
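The lag correlation discussed in this abstract can be illustrated with a small standard-library sketch: shift the online-activity series against the case series and keep the lag with the highest Pearson correlation. The function names and the toy series are assumptions; the paper's exact procedure is not specified here.

```python
import math
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def best_lag(signal: Sequence[float], cases: Sequence[float], max_lag: int) -> Tuple[int, float]:
    """Return (lag, r) for the lag in days at which signal best predicts later cases."""
    results = []
    for lag in range(max_lag + 1):
        earlier = signal[:len(signal) - lag]  # activity on day t
        later = cases[lag:]                   # cases on day t + lag
        results.append((lag, pearson(earlier, later)))
    return max(results, key=lambda pair: pair[1])


# Toy daily series: cases repeat the activity pattern three days later.
activity = [1, 3, 7, 4, 2, 1, 1, 1, 1, 1]
cases = [1, 1, 1, 1, 3, 7, 4, 2, 1, 1]
lag, r = best_lag(activity, cases, max_lag=5)
```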
ARTICLE | doi:10.20944/preprints202005.0015.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; coronavirus; machine learning; sentiment analysis; textual analytics; Twitter
Online: 2 May 2020 (13:52:28 CEST)
Along with the Coronavirus pandemic, another crisis has manifested itself in the form of mass fear and panic phenomena, fuelled by incomplete and often inaccurate information. There is therefore a tremendous need to address and better understand COVID-19's informational crisis and gauge public sentiment, so that appropriate messaging and policy decisions can be implemented. In this research article, we identify public sentiment associated with the pandemic using Coronavirus specific Tweets and R statistical software, along with its sentiment analysis packages. We demonstrate insights into the progress of fear-sentiment over time as COVID-19 approached peak levels in the United States, using descriptive textual analytics supported by necessary textual data visualizations. Furthermore, we provide a methodological overview of two essential machine learning classification methods, in the context of textual analytics, and compare their effectiveness in classifying Coronavirus Tweets of varying lengths. We observe a strong classification accuracy of 91% for short Tweets, with the Naive Bayes method. We also observe that the logistic regression classification method provides a reasonable accuracy of 74% with shorter Tweets, and both methods showed relatively weaker performance for longer Tweets. This research provides insights into Coronavirus fear sentiment progression, and outlines associated methods, implications, limitations and opportunities.
ARTICLE | doi:10.20944/preprints201905.0141.v1
Subject: Mathematics & Computer Science, Other Keywords: twitter spam detection; adversarial machine learning; online social networks; survey
Online: 13 May 2019 (01:49:41 CEST)
Online Social Networks (OSNs), such as Facebook and Twitter, have become a very important part of many people’s daily lives. Unfortunately, the high popularity of these platforms makes them very attractive to spammers. Machine-learning (ML) techniques have been widely used as a tool to address many cybersecurity application problems (such as spam and malware detection). However, most of the proposed approaches do not consider the presence of adversaries that target the defense mechanism itself. Adversaries can launch sophisticated attacks to undermine deployed spam detectors during either the training or the prediction (test) phase. Not considering these adversarial activities at the design stage makes OSNs’ spam detectors prone to a range of adversarial attacks. This paper thus surveys the attacks against Twitter spam detectors in an adversarial environment. In addition, a general taxonomy of potential adversarial attacks is proposed by applying common frameworks from the literature. Examples of adversarial activities on Twitter are provided, based on observation of Arabic trending hashtags. A new type of spam tweet (adversarial spam tweet), which can be used to undermine a deployed classifier, was found. In addition, possible countermeasures that could increase the robustness of Twitter spam detectors against such attacks are investigated.
ARTICLE | doi:10.20944/preprints202008.0487.v1
Subject: Social Sciences, Geography Keywords: Twitter; data reliability; risk communication; data mining; Google Cloud Vision API
Online: 22 August 2020 (02:32:40 CEST)
While Twitter has been touted to provide up-to-date information about hazard events, the reliability of tweets is still a concern. Our previous publication extracted relevant tweets containing information about the 2013 Colorado flood event and its impacts. Using the relevant tweets, this research further examined the reliability (accuracy and trueness) of the tweets by examining the text and image content and comparing them to other publicly available data sources. Both manual identification of text information and automated (Google Cloud Vision API) extraction of images were implemented to balance accurate information verification and efficient processing time. The results showed that both the text and images contained useful information about damaged/flooded roads/street networks. This information will help emergency response coordination efforts and informed allocation of resources when enough tweets contain geocoordinates or locations/venue names. This research will help identify reliable crowdsourced risk information to enable near-real time emergency response through better use of crowdsourced risk communication platforms.
ARTICLE | doi:10.20944/preprints202007.0648.v1
Subject: Keywords: Twitter; Social Media; NLP; Tweet; User Categorizations and Mathematical Frame Work
Online: 26 July 2020 (17:23:33 CEST)
Social networking applications such as Twitter have increasingly gained significance in terms of socio-economic, political, and religious as well as entertainment sectors. This in turn, has witnessed a wide gamut of information explosion in the social networking realm that can tend to be both useful as well as misleading at the same point of time. Spam detection is one such solution that caters to this problem through identification of irrelevant users and their data. However, existing research has so far laid primary focus on user profile information through activity detection and relevant techniques that may underperform when these profiles exhibit characteristics of temporal dependency, poor reflection of generated content from the user profile, etc. This is the primary motivation for this paper that addresses the aforementioned problem of user profiles by focusing on both profile information and content-based spam detection. To this end, this work delivers three significant contributions. Firstly, exhaustive use of Natural language processing (NLP) techniques has been rendered towards creation of a new comprehensive dataset with a wide range of content-based features. Secondly, this dataset has been fed into a customized state-of-art hybrid machine learning model that has been exclusively built using a combination of both machine learning and deep learning techniques. Extensive simulation based analysis not only records over 98% accuracy but also establishes the practical applicability of this proposal by proving that modeling based on the mixed profile and content-generated data is more capable of spam detection in contrast to each of these standalone approaches. Finally, a novel methodology based on logistic regression is proposed and supported by analytical formulations. This paves the way for the custom-built dataset to be analyzed and corresponding probabilities to be obtained that differentiate legitimate users from spammers. 
The obtained mathematical outcome can henceforth be used for future prediction of user categories through appropriate parameter tuning for any given dataset. This makes our method a truly generic one capable of identifying and classifying different user categories.
ARTICLE | doi:10.20944/preprints202007.0019.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; Coronavirus; reopen; sentiment analysis; Twitter; Census; Binary Logit Model
Online: 3 July 2020 (08:35:46 CEST)
Investigating and classifying the sentiments of social media users (e.g., positive, negative) towards an item, situation, or system is very popular among researchers. However, they rarely discuss the underlying socioeconomic factors associated with such sentiments. This study attempts to explore the factors associated with positive and negative sentiments about reopening the economy in the United States (US) amidst the COVID-19 global crisis. It takes into consideration situational uncertainties (i.e., changes in work and travel patterns due to lockdown policies), economic downturn and associated trauma, and emotional factors such as depression. To understand people's sentiment about reopening the economy, Twitter data was collected, representing all 50 US states and Washington, DC. State-wide socioeconomic characteristics of the people (e.g., education, income, family size, and employment status), built environment data (e.g., population density), and the number of COVID-19 related cases were collected and integrated with the Twitter data to perform the analysis. A binary logit model was used to identify the factors that influence people toward a positive or negative sentiment. The results from the logit model demonstrate that family households, people with low education levels, people in the labor force, low-income people, and people with higher house rent are more interested in reopening the economy. In contrast, households with a high number of members and high income are less interested in reopening the economy. The model correctly classifies 56.18% of the sentiments. The Pearson chi-square test indicates that overall this model has high goodness-of-fit. 
This study provides a clear indication to policymakers of where to allocate resources and what policy options they can undertake to improve the socioeconomic situation of the people and mitigate the impacts of pandemics, both in the current situation and in the future.
ARTICLE | doi:10.20944/preprints202009.0213.v1
Subject: Social Sciences, Geography Keywords: twitter; discourse analysis; Covid-19; Coronavirus; disinformation; misinformation; social media activity; downplay
Online: 10 September 2020 (03:28:01 CEST)
Misinformation can amplify humanity's most significant challenges. As the novel coronavirus spreads across the world, concerns are growing about the spread of misinformation and about people downplaying the severity of the virus. This article investigates social media activity on Twitter in May 2020 with respect to COVID-19: the themes of tweets, where the discussion is emerging from, disinformation shared about the virus, and its relationship with the COVID-19 incidence rate at the state and county level. A geodatabase of all geotagged COVID-19 related tweets was compiled. Multiscale Geographically Weighted Regression (MGWR) was employed to examine the association between social media activity, population, and the spatial variability of disease incidence; our results suggest that MGWR could explain 96.7% of the variation. Moreover, content analysis of the COVID-19 related Twitter dataset reveals a strong spatial relationship between social media activity and known cases of COVID-19. Discourse analysis was conducted on tweets to index tweets downplaying the pandemic or disseminating disinformation; the findings suggest that states where Twitter users spread more misinformation and showed more resistance to pandemic management measures in May are experiencing a surge in the number of cases in July.
Subject: Mathematics & Computer Science, Algebra & Number Theory Keywords: Urdu Twitter Dataset; Urdu Natural language processing (NLP); Urdu text Sentiments and Emoticons
Online: 24 March 2021 (12:03:46 CET)
This article presents a dataset of tweets in the Urdu language. There are 1,140,824 tweets in the dataset, collected from Twitter for September and October 2020. This large-scale corpus of tweets is generated by pre-processing that includes removing columns containing user information, retweet counts, and follower information; removing duplicate tweets and unnecessary punctuation, links, symbols, and spaces; and finally extracting emojis if present in the tweet text. In the final dataset, each tweet record contains columns for tweet id, text, and the emoji extracted from the text with a sentiment score. Emojis are extracted to validate Machine Learning models used for multilingual sentiment and behavior analysis. They are extracted using a Python script that searches for emojis from a list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
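The emoji-extraction step described in this abstract could look roughly like the following minimal sketch. It is not the authors' script: the three-entry `EMOJI_LEXICON`, its placeholder scores, the function names, and the choice to average the scores of all matched emojis are assumptions made for illustration, standing in for the 751-emoji list the abstract mentions.

```python
# Hypothetical mini-lexicon: emoji -> (description, sentiment score).
# The abstract refers to a list of 751 emojis; these entries and their
# scores are placeholders for illustration only.
EMOJI_LEXICON = {
    "\U0001F602": ("face with tears of joy", 0.22),
    "\u2764": ("red heart", 0.75),
    "\U0001F62D": ("loudly crying face", -0.09),
}


def extract_emojis(text, lexicon=EMOJI_LEXICON):
    """Return (emoji, description, score) for each lexicon emoji in text, in order."""
    found = []
    for ch in text:
        if ch in lexicon:
            description, score = lexicon[ch]
            found.append((ch, description, score))
    return found


def annotate(tweet_id, text):
    """Build a record; emoji columns are added only when an emoji is present."""
    record = {"tweet_id": tweet_id, "text": text}
    hits = extract_emojis(text)
    if hits:
        record["emoji"] = [h[0] for h in hits]
        record["emoji_description"] = [h[1] for h in hits]
        # Assumption: one score per tweet, taken as the mean over matched emojis.
        record["sentiment_score"] = sum(h[2] for h in hits) / len(hits)
    return record
```

Calling `annotate(1, "some text \U0001F602")` yields a record with the extra emoji columns, while a tweet without any lexicon emoji keeps only `tweet_id` and `text`.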
Subject: Social Sciences, Accounting Keywords: Covid-19; Twitter; sustainable cities; sustainable citizenship; environmental awareness; responsible consumption; sustainable tourism
Online: 5 February 2021 (22:15:27 CET)
The social confinement resulting from the COVID-19 crisis temporarily reduced greenhouse gas emissions. Although experts consider that the decrease in pollution rates was not drastic, some surveys detect a growth in social concern about the climate. In this new climate-conscious environment, municipalities and local governments are promoting a new way of living and caring for cities, even before they can regain national and international freedom of movement. This work analyzes the connection between the new climate awareness arising from the COVID-19 crisis, the proposals of sustainable citizenship around the world, and its communication on Twitter to educate the new eco-conscious audience. The methodology mixes quantitative and qualitative analysis, using the Twitonomy Premium tool and the Twitter research tool, with data extracted at the end of December 2020. Among the top 10 most influential and active accounts, the results show educational institutions, local institutions, companies, neighborhood associations, and influencers. The impossibility of experiencing city life has not prevented citizen education and commitment to make real change for when the city and its citizens return to normality. This new normality, however, must be different: more ecological, more responsible, more sustainable, and practiced from early childhood.
ARTICLE | doi:10.20944/preprints201901.0077.v1
Subject: Mathematics & Computer Science, Other Keywords: Privacy; security; Machine Learning; K-Means; Natural Language Processing; Twitter; Private Information Retrieving
Online: 8 January 2019 (15:40:09 CET)
The violation of privacy, whether one's own or other people's, is a very current problem that concerns not only the web but also private life. In the 1990s it was expected that, by now, any routine operation then carried out "manually" would be performed through mobile phones or personal computers. The problem pertains to the distribution network that allows information to be shared and brought together; as a result, the network becomes unsafe if subjected to attacks. Nowadays we put personal information on the web because otherwise we are seen as “weak”. This work aims to measure and analyze how much information is shared by users of a pre-established social network, and it is carried out through a set of machine learning algorithms and techniques.
ARTICLE | doi:10.20944/preprints201803.0247.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: security; social sentiment sensor; hackers; social media; statistics; L1 regression; twitter; cyber attacks
Online: 29 March 2018 (07:47:48 CEST)
In recent years, online social media information has been a subject of study in several data science fields due to its impact on users as a communication and expression channel. Data gathered from online platforms such as Twitter has the potential to facilitate research on social phenomena based on sentiment analysis, which usually employs Natural Language Processing and Machine Learning techniques to interpret sentimental tendencies related to users' opinions and make predictions about real events. Cyber attacks are not isolated from opinion subjectivity on online social networks. Various security attacks are performed by hacker activists motivated by reactions to polemic social events. In this paper, a methodology for tracking social data that can trigger cyber attacks is developed. Our main contribution lies in the monthly prediction of tweets with content related to security attacks and the incidents detected based on ℓ1 regularization.
REVIEW | doi:10.20944/preprints202211.0427.v1
Subject: Social Sciences, Organizational Economics & Management Keywords: Corporate Social Responsibility; Twitter; Stakeholder Management; Social Media Communication; Social Media; CSR; Communication Strategy
Online: 23 November 2022 (01:14:53 CET)
Corporate social responsibility (CSR) has become increasingly important for companies in recent years. On the one hand, regulatory frameworks require the disclosure of measures for sustainable management. On the other hand, for long-term corporate success, stakeholders must be strategically engaged in the dialog on sustainability aspects. Social media, and Twitter in particular, offer the potential to foster a meaningful stakeholder dialogue on CSR topics. Due to Elon Musk's acquisition in the fall of 2022, this strategic disruption provides an opportunity to systematically capture the platform's past activities and strategies to synthesize practical information that can guide Twitter usage decision making and be used for research to serve as the basis for future comparative longitudinal studies of changes in usage. We conducted a literature review including 42 papers to contribute to the body of evidence on CSR communication strategies on Twitter across industries and countries by deriving interdisciplinary suggestions for strategic CSR-related stakeholder management. Results cover relevant CSR topics, prioritized stakeholder groups for CSR communication on Twitter and successful communication strategies for companies to obtain beneficial results, such as generating social media capital. The results contribute to the strategic planning and implementation of CSR stakeholder management on Twitter and offer starting points for future studies on social media mining and CSR communication strategies.
ARTICLE | doi:10.20944/preprints202210.0238.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: NLP; NLU; Twitter; Sentiment Analysis; Opinion Mining; Nigeria; Election; Machine Learning; BERT; LSTM; SVM
Online: 17 October 2022 (12:01:42 CEST)
Introduction: Social media platforms such as Facebook, LinkedIn, and Twitter, among others, have been used as a tool for staging protests, opinion polls, and campaign strategy, as a medium of agitation, and as a place for expressing interests, especially during elections. Past studies have established people’s opinions on elections using social media posts. The advent of state-of-the-art algorithms for unstructured text processing implies tremendous progress in natural language processing and understanding. Aim: In this work, a Natural Language framework is designed to understand the Nigeria 2023 presidential election based on public opinion using a Twitter dataset. Methods: Raw datasets concerning discourse around the Nigeria 2023 elections, comprising 2,059,113 records with 18 dimensions, were collected from Twitter. Sentiment analysis was performed on the preprocessed dataset using three different machine learning models, namely: Long Short-Term Memory (LSTM) Recurrent Neural Network, Bidirectional Encoder Representations from Transformers (BERT), and Linear Support Vector Classifier (LSVC) models. Personal tweet analysis of the three candidates provided insight into their campaign strategies and personalities, while public tweet analysis established the public’s opinion about them. The performance of the models was also compared using accuracy, recall, false positive rate, precision and F-measure. Results: The LSTM model gave an accuracy, precision, recall, AUC and F-measure of 88%, 82.7%, 87.2%, 87.6% and 82.9% respectively; the BERT model gave an accuracy, precision, recall, AUC and F-measure of 94%, 88.5%, 92.5%, 94.7% and 91.7% respectively; while the LSVC model gave an accuracy, precision, recall, AUC and F-measure of 73%, 81.4%, 76.4%, 81.2% and 79.2% respectively. Conclusion: The experimental results show that sentiment analysis and other Natural Language Processing tasks can aid in the understanding of the social media space. Results also revealed the leverage of each aspirant towards winning the election.
We conclude that sentiment analysis can form a general basis for generating insights for election and modeling election outcomes.
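For readers unfamiliar with the evaluation measures named in this abstract, the following minimal sketch shows how accuracy, precision, recall, false positive rate and F-measure are derived from a confusion matrix. It assumes a binary labelling (1 = positive class), which the abstract does not state; the function name and the toy labels are hypothetical and unrelated to the paper's data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute standard classification metrics for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
        "f_measure": f_measure,
    }


# Toy labels for illustration only.
example = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 0, 1, 1, 0])
```

AUC, which the abstract also reports, needs the models' continuous scores rather than hard labels and is therefore left out of this sketch.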
COMMUNICATION | doi:10.20944/preprints202206.0172.v3
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Monkeypox; monkey pox; Twitter; Dataset; Tweets; Social Media; Big Data; Data Mining; Data Science
Online: 25 July 2022 (09:41:19 CEST)
ARTICLE | doi:10.20944/preprints202109.0216.v2
Subject: Medicine & Pharmacology, Other Keywords: Rural Health; Twitter Messaging; Social Media; Covid-19, SARS-CoV-2; coronavirus; social network analysis
Online: 19 November 2021 (14:41:47 CET)
Individuals from rural areas are increasingly using social media as a means of communication, receiving information, or actively complaining of inequalities and injustices. This study captured 57 days’ worth of Twitter data from June to August 2021 related to rural health using English language keywords. The study utilised social network analysis and natural language processing to analyse the data. It was found that Twitter served as a fruitful platform to raise awareness of problems faced by those living in rural areas. Overall, Twitter was utilised in rural areas to express complaints, to debate, and share information. Twitter could be leveraged as a powerful social listening tool for individuals and organisations who want to gain insight into popular narratives around rural health.
ARTICLE | doi:10.20944/preprints202108.0516.v1
Subject: Social Sciences, Other Keywords: machine learning; time; naive bayes classification; recurrent neural networks, Twitter; social media data; automatic classification
Online: 27 August 2021 (11:23:50 CEST)
Machine learning (ML) is increasingly useful as data grows in volume and accessibility, as it can perform tasks (e.g. categorisation, decision making, anomaly detection, etc.) through experience and without explicit instruction, even when the data are too vast, complex, highly variable, or full of errors to be analysed in other ways. Thus, ML is great for natural language, images, or other complex and messy data available in large and growing volumes. Selecting an ML algorithm depends on many factors, as algorithms vary in supervision needed, tolerable error levels, and ability to account for order or temporal context, among many other things. Importantly, ML methods for explicitly ordered or time-dependent data struggle with errors or data asymmetry. Most data are at least implicitly ordered, potentially allowing a hidden ‘arrow of time’ to affect non-temporal ML performance. This research explores the interaction of ML and implicit order by training two ML algorithms on Twitter data before performing automatic classification tasks under conditions that balance volume and complexity of data. Results show that performance was affected, suggesting that researchers should carefully consider time when selecting appropriate ML algorithms, even when time is only implicitly included.
ARTICLE | doi:10.20944/preprints202105.0447.v1
Subject: Keywords: Vaccine; Sentiment analysis; Public Sentiment Scenarios framework; COVID-19; Coronavirus; Twitter; Textual analytics; Public policy
Online: 19 May 2021 (13:51:57 CEST)
There exists a compelling need to better understand the temporal dynamics of public sentiment towards COVID-19 vaccines in the US on a national and state-wise level for facilitating appropriate public policy applications. Our analysis of social media data from early February of 2021 and late March of 2021 shows that in spite of overall strength of positive sentiment, and increasing numbers of Americans being fully vaccinated, negative sentiment about COVID-19 vaccines still persists among sections of people who are hesitant towards the vaccine. In this study, we performed sentiment analytics on vaccine tweets, studied changes in public sentiment over time, conducted vaccination sentiment validation using actual vaccination data from the US CDC and Household Pulse Survey (HPS), explored influence of maturity of Twitter user-accounts and generated geographic mapping of sentiments by location of Twitter users. Furthermore, we leverage the emotion polarity based Public Sentiment Scenarios (PSS) framework which was developed for COVID-19 sentiment analytics, to systematically analyze directions for public policy processes to potentially improve the administration of vaccines. Application of the PSS framework provides important time sensitive insights for state and federal government agencies and associated organizations to better implement public policy processes for healthcare management, communication, transparency, motivation and societal operational policies such as social distancing. These insights are expected to contribute to processes that can expedite the vaccination program and move closer to the cherished herd immunity goal.
ARTICLE | doi:10.20944/preprints202106.0196.v3
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Twitter; Social Media; Social Networking; Social Network Analytic; DistilBERT; Text Similarity; Natural Language Processing; Character Computing
Online: 17 February 2022 (13:15:23 CET)
Social media platforms have been an undeniable part of the lifestyle for the past decade. Analyzing the information being shared is a crucial step to understanding human behavior. Social media analysis aims to guarantee a better experience for the user and increased user satisfaction. To derive any further conclusion, it is first necessary to know how to compare users. In this paper, a hybrid model is proposed to measure the similarity of Twitter profiles and quantify their degree of likeness by calculating features that consider users’ behavioral habits. For this, first, the timeline of each profile is extracted using the official Twitter API. Then, in parallel, three aspects of a profile are considered. Behavioral ratios are time-series-related information showing the consistency and habits of the user. Dynamic time warping is utilized to compare the behavioral ratios of two profiles. Next, the audience network is extracted for each user, and Jaccard similarity is used to estimate the similarity of the two sets. Finally, for the content similarity measurement, the tweets are preprocessed according to the feature extraction method; TF-IDF and DistilBERT are employed for feature extraction, and the resulting features are compared using the cosine similarity method. Results have shown that TF-IDF has slightly better performance; therefore, the more straightforward solution is selected for the model. As a case study on the similarity level of different profiles, a Random Forest classification model trained on almost 20,000 users achieved 97.24% accuracy. This comparison enables us to find duplicate profiles with nearly the same behavior and content.
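Two of the comparison measures named in this abstract, Jaccard similarity for audience sets and dynamic time warping (DTW) for behavioral time series, can be sketched as follows. This is a textbook formulation under stated assumptions, not the authors' implementation: the function names, the absolute-difference local cost, and the convention that two empty sets have similarity 1.0 are illustrative choices.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty audiences count as identical
    return len(a & b) / len(a | b)


def dtw_distance(x, y):
    """Classic O(len(x) * len(y)) dynamic time warping distance.

    cost[i][j] holds the cheapest alignment cost of x[:i] with y[:j].
    """
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # x[i-1] aligned with an earlier y as well
                cost[i][j - 1],      # y[j-1] aligned with an earlier x as well
                cost[i - 1][j - 1],  # one-to-one step
            )
    return cost[n][m]
```

Because DTW allows one sample to align with several samples of the other series, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the series differ in length, which is why it suits activity ratios sampled over timelines of unequal length.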
DATA DESCRIPTOR | doi:10.20944/preprints202206.0146.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; COVID; Omicron; online learning; remote learning; online education; Twitter; dataset; Tweets; social media; Big Data
Online: 21 July 2022 (08:05:19 CEST)
COMMUNICATION | doi:10.20944/preprints202206.0383.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: Exoskeleton; Twitter; Tweets; Big Data; social media; Data Mining; dataset; Data Science; Natural Language Processing; Information Retrieval
Online: 21 July 2022 (04:06:53 CEST)
Exoskeleton technology has been advancing rapidly in the recent past due to its multitude of applications and diverse use cases in assisted living, military, healthcare, firefighting, and Industry 4.0. The exoskeleton market is projected to grow to several times its current value within the next two years. Therefore, it is crucial to study the degree and trends of user interest, views, opinions, perspectives, attitudes, acceptance, feedback, engagement, buying behavior, and satisfaction towards exoskeletons, for which the availability of Big Data of conversations about exoskeletons is necessary. The Internet of Everything style of today's living, characterized by people spending more time on the internet than ever before, with a specific focus on social media platforms, holds the potential for the development of such a dataset through the mining of relevant social media conversations. Twitter, one such social media platform, is highly popular amongst all age groups, and the topics found in its conversation paradigms include emerging technologies such as exoskeletons. To address this research challenge, this work makes two scientific contributions to this field. First, it presents an open-access dataset of about 140,000 tweets about exoskeletons that were posted in a 5-year period from May 21, 2017, to May 21, 2022. Second, based on a comprehensive review of the recent works in the fields of Big Data, Natural Language Processing, Information Retrieval, Data Mining, Pattern Recognition, and Artificial Intelligence that may be applied to relevant Twitter data for advancing research, innovation, and discovery in the field of exoskeleton research, a total of 100 Research Questions are presented for researchers to study, analyze, evaluate, ideate, and investigate based on this dataset.
ARTICLE | doi:10.20944/preprints201808.0269.v1
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: social sensing; supervised learning; statistical methods; social networks; twitter; tweets; natural disaster; random forest, kernel density estimation
Online: 15 August 2018 (11:34:43 CEST)
In recent years, online social networks have received important consideration in spatial modelling fields given the critical information that can be extracted from them for events in real time; one of the most latent issues is that regarding various natural disasters such as earthquakes. Although it is possible to retrieve data from these social networks with embedded geographic information provided by GPS, in many cases this is not possible. An alternative solution is to reconstruct specific locations using probabilistic language models, more specifically those based on Named Entity Recognition (NER), which extracts names from a user’s description of an event occurring in a specific place (e.g., a collapsed building on a specific avenue). In this work, we present a methodology to use Twitter as a social sensor system for disasters. The proposed methodology scores NER locations with a kernel density estimation function for different subtopics originating from a natural disaster and maps them into a geographic space. The methodology is evaluated with tweets related to the 2017 earthquake in Mexico.
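The idea of scoring candidate locations with a kernel density estimate can be illustrated with a minimal sketch. It assumes an isotropic Gaussian kernel over plain (x, y) coordinates with a hand-picked bandwidth; the abstract does not specify the kernel, the bandwidth selection, or the coordinate handling, and all names and coordinates below are hypothetical.

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Return a density function estimated from 2D points.

    Assumption: isotropic Gaussian kernel; coordinates are treated as
    planar, which is only reasonable over a small area.
    """
    if not points:
        raise ValueError("at least one point is required")
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            sq_dist = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return norm * total

    return density


def rank_candidates(candidates, density):
    """Sort (name, x, y) candidates from highest to lowest density."""
    return sorted(candidates, key=lambda c: density(c[1], c[2]), reverse=True)


# Hypothetical geocoded mentions for one subtopic, plus two candidate places.
mentions = [(10.00, 20.00), (10.01, 20.02), (9.99, 19.98), (10.50, 20.40)]
density = gaussian_kde_2d(mentions, bandwidth=0.05)
ranking = rank_candidates([("site_a", 10.00, 20.00), ("site_b", 10.30, 20.30)], density)
```

In this toy setup `site_a` ranks first because three of the four mentions cluster around it; in practice the bandwidth controls how far a mention's influence spreads and would need to be tuned.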
ARTICLE | doi:10.20944/preprints201908.0226.v1
Subject: Earth Sciences, Environmental Sciences Keywords: crowdsourcing; citizen science; ecotourism; Facebook; Flickr; photo-elicitation; Instagram; photovoice; social media; social networking sites; Twitter; wildlife conservation
Online: 21 August 2019 (10:34:58 CEST)
The first two decades of the 21st century have seen the emergence of the modern citizen science movement, increased demand for niche eco and wildlife tourism experiences, and the willingness of people to voluntarily share information and photographs online. To varying extents, the rapid growth of these three phenomena has been driven by the availability of portable smart devices, access to the Web 2.0 internet from almost anywhere on the planet, and the development of applications and services, including social media/networking sites (SNSs). In addition, the number of peer-reviewed publications that explore how text and images shared on SNSs can be data-mined for academic research has surged in recent years. This systematic quantitative review has two goals. The first goal is to provide an overview of how the photographs that ecotourists share online are contributing to wildlife tourism research. The second goal is to promote the emerging photovoice technique as a theoretical context for social research based on the photographs and comments that ecotourists share on SNSs. From the perspectives of community benefits, conservation behaviours, and environmental education, there are many similarities between authentic ecotourism experiences and quality ecological citizen science programs. Much of the literature regarding the theory and practice of citizen science reports on the difficulties of attracting, training, motivating and retaining community members. The synthesis of this review is that crowdsourcing wildlife and tourism data from comments and photographs that ecotourists share on SNSs is a credible method of research that provides a self-replenishing pool of citizen scientists.
ARTICLE | doi:10.20944/preprints202205.0238.v2
Subject: Mathematics & Computer Science, Information Technology & Data Management Keywords: COVID-19; SARS-CoV-2; Omicron; Twitter; tweets; sentiment analysis; big data; Natural Language Processing; Data Science; Data Analysis
Online: 7 July 2022 (08:36:40 CEST)
This paper presents the findings of an exploratory study on the continuously generated Big Data on Twitter related to the sharing of information, news, views, opinions, ideas, knowledge, feedback, and experiences about the COVID-19 pandemic, with a specific focus on the Omicron variant, which is the globally dominant variant of SARS-CoV-2 at this time. A total of 12028 tweets about the Omicron variant were studied, and the specific characteristics of tweets that were analyzed include sentiment, language, source, type, and embedded URLs. The findings of this study are manifold. First, from sentiment analysis, it was observed that 50.5% of tweets had the ‘neutral’ emotion. The other emotions, ‘bad’, ‘good’, ‘terrible’, and ‘great’, were found in 15.6%, 14.0%, 12.5%, and 7.5% of the tweets, respectively. Second, the findings of language interpretation showed that 65.9% of the tweets were posted in English. It was followed by Spanish or Castilian, French, Italian, Japanese, and other languages, which were found in 10.5%, 5.1%, 3.3%, 2.5%, and <2% of the tweets, respectively. Third, the findings from source tracking showed that “Twitter for Android” was associated with 35.2% of tweets. It was followed by “Twitter Web App”, “Twitter for iPhone”, “Twitter for iPad”, “TweetDeck”, and all other sources, which accounted for 29.2%, 25.8%, 3.8%, 1.6%, and <1% of the tweets, respectively. Fourth, studying the type of tweets revealed that retweets accounted for 60.8% of the tweets, followed by original tweets and replies, which accounted for 19.8% and 19.4% of the tweets, respectively. Fifth, in terms of embedded URL analysis, the most common domain embedded in the tweets was found to be twitter.com, followed by biorxiv.org, nature.com, wapo.st, nzherald.co.nz, recvprofits.com, science.org, and other URLs.
Finally, to support similar research and development in this field centered around the analysis of tweets, we have developed an open-access Twitter dataset that comprises tweets about the SARS-CoV-2 omicron variant since the first detected case of this variant on November 24, 2021.
ARTICLE | doi:10.20944/preprints202111.0023.v1
Subject: Engineering, Other Keywords: Twitter; Social Media Analysis; User Behavior Mining; Crime Detection; Feature Extraction; Graph Analysis; Natural Language Processing; Text Classification; Aspect-based Sentiment Analysis; DistilBERT
Online: 1 November 2021 (15:25:19 CET)
Maintaining a healthy cyber society is a big challenge due to users’ freedom of expression and behavior. It can be addressed by monitoring and analyzing users’ behavior and taking proper actions towards them. This research aims to present a platform that monitors the public content on Twitter by extracting tweet data. After maintaining the data, the users’ interactions are analyzed using Graph Analysis methods. Then the users’ behavioral patterns are analyzed by applying Metadata Analysis, in which the timeline of each profile is obtained; the time-series behavioral features of users are also investigated. Then, in the Abnormal Behavior Detection Filtering component, the interesting profiles are selected for further examination. Finally, in the Contextual Analysis component, the contents are analyzed using natural language processing techniques: a binary text classification model (SVM + TF-IDF with 88.89% accuracy) detects whether a tweet is related to crime. Then, a sentiment analysis method is applied to the crime-related tweets to perform aspect-based sentiment analysis (DistilBERT + FFNN with 80% accuracy), because sharing positive opinions about a crime-related topic can threaten society. This platform aims to provide the end user (the police) with suggestions to control hate speech or terrorist propaganda.
REVIEW | doi:10.20944/preprints202005.0234.v1
Subject: Mathematics & Computer Science, Artificial Intelligence & Robotics Keywords: SIRD; Twitter; GHSI; Pre-symptomatic; EHR; Contact tracing; On-line survey; qRT-PCR; X-ray; CT/HRCT; CNN; Autoencoder; Drug affinity; CPI; and Inflation.
Online: 14 May 2020 (11:25:57 CEST)
The world is now experiencing a major health calamity due to the coronavirus disease (COVID-19) pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The foremost challenge facing the scientific community is to explore the growth and transmission capability of the virus. The use of artificial intelligence (AI), such as deep learning, in (i) rapid disease detection from X-ray/computerized tomography (CT)/high-resolution computed tomography (HRCT) images, (ii) accurate prediction of the epidemic patterns and their saturation throughout the globe, (iii) identification of the epicenter in each country/state and forecasting the disease from social networking data, (iv) prediction of drug-protein interactions for repurposing the drugs, and (v) socio-economic impact and prediction of future relapses, has attracted much attention. In the present manuscript, we describe the role of various AI-based technologies for rapid and efficient detection from CT images, complementing quantitative real-time polymerase chain reaction (qRT-PCR) and immunodiagnostic assays. AI-based technologies to anticipate the current pandemic pattern, the possibility of future relapses, and the socio-economic impact are also discussed. We inspect how the virus transmits depending on different factors, such as population density and mobility, among others. We describe how AI-based mobile apps for contact tracing and surveys can prevent transmission. A modified deep learning technique can assess the affinity of the most probable drugs to treat COVID-19. Here, a few effective antiviral drugs, such as Geneticin, Avermectin B1, and Ancriviroc, among others, are reported with their appropriate validation from previous investigations.
ARTICLE | doi:10.20944/preprints202211.0005.v1
Subject: Life Sciences, Other Keywords: Epidemics; Twitter; Natural Language Processing; Topic Modelling; Sentiment Analysis; ARI; Cholera; Ebola; HIV/AIDS; Influenza; Malaria; Spanish influenza; Swine flu; Tuberculosis; Typhus; Yellow fever; and Zika
Online: 1 November 2022 (01:17:14 CET)
At the end of 2019, while the world was being hit by the COVID-19 virus and, consequently, was living through a global health crisis, many other pandemics were putting humankind in danger. The role of social media is of paramount importance in these kinds of contexts, since they help health systems cope with emergencies by contributing to activities such as the identification of public concerns, the detection of infection symptoms, and the tracing of virus diffusion. In this paper, we have analyzed comments on events related to cholera, Ebola, HIV/AIDS, influenza, malaria, Spanish influenza, swine flu, tuberculosis, typhus, yellow fever, and Zika, collecting 369,472 tweets from the 3rd of March to the 15th of September, 2022. Our analysis started with the collection of comments composed of unstructured texts, to which we applied natural language processing solutions. Afterward, we employed topic modelling and sentiment analysis techniques to obtain a collection of people’s concerns and attitudes toward these pandemics. According to our findings, people's discussions were mostly about malaria, influenza, and tuberculosis, and the focus was on the diseases themselves. As regards emotions, the most popular were fear, trust, and disgust, where trust mainly concerns HIV/AIDS tweets.