Money Often Costs Too Much: A Study to Investigate The Effect Of Twitter Sentiment On Bitcoin Price Fluctuation

Introduced in 2009, Bitcoin has demonstrated huge potential as the world’s first digital currency and has been widely used as a financial investment. Our research aims to uncover the relationship between Bitcoin prices and people’s sentiments about Bitcoin on social media. Among various social media platforms, micro-blogging is one of the most popular. Millions of people use micro-blogging platforms to exchange ideas, broadcast views, and provide opinions on different topics related to politics, culture, science, and technology. This makes them a potentially rich source of data for sentiment analysis. We therefore chose one of the busiest micro-blogging platforms, Twitter, to perform sentiment analysis on Bitcoin. We used the ELMo embedding model to convert Bitcoin-related tweets into vector form and an SVM classifier to divide the tweets into three sentiment categories: positive, negative, and neutral. We then used the sentiment data to examine its relation to Bitcoin price fluctuation using a linear mixed model.


Introduction
Micro-blogging, as the name suggests, refers to a broadcast medium where content consists of short sentences, individual images, or video links [1]. Twitter, Tumblr, and Pinterest are notable micro-blogging platforms. Twitter is the most prominent, with 166 million daily active users as of Q1 2020 [2]. Users can post micro-posts, or tweets, of up to 280 characters and can attach images, videos, GIFs, and website links to their tweets. Twitter provides a convenient way for millions around the world to broadcast their views globally. Furthermore, tweets can be marked with hashtags, which can be used to search for tweets on a specific topic. Together, these users contribute 500 million tweets per day on a variety of topics [3]. We used a dataset consisting of tweets related to Bitcoin as a source for sentiment analysis.
Bitcoin was introduced in 2008 as a decentralized digital currency that uses a peer-to-peer network for transactions, eliminating the need for a central authority like a central bank [4]. Initially, Bitcoin was primarily adopted by black markets such as Silk Road [5], which exclusively accepted Bitcoin as payment, but over time Bitcoin has gained wider acceptance by online merchants and vendors [6]. Venture capitalists have invested in various Bitcoin-based startups since 2012 [7], [8], and in December 2017 the Chicago Board Options Exchange and Chicago Mercantile Exchange started trading Bitcoin futures [9], [10]. All these developments popularised Bitcoin as an investment option, which resulted in the spectacular increase in its price in 2017, climbing to an all-time high of USD 19,783.06 on 17 December 2017 [11]. This invited the curiosity of the general public and further popularised the idea of alternative currencies. Bitcoin has not only revolutionized the concept of digital currencies but has also cultivated the seeds for tech development based on blockchain, such as smart contracts [12], peer-to-peer energy trading [13], and blockchain-based supply chain management [14]. As of August 1, 2020, the number of cryptocurrencies exceeds 2000 and the market capitalization of cryptocurrencies stands at over USD 300 billion [15]. Before investing in crypto trading options, it is very important to understand the nature and behavior of this new and emerging market. Various factors affect the price of a cryptocurrency, and understanding them is crucial to understanding the crypto market. This paper is solely dedicated to finding a correlation, if one exists, between Bitcoin price and people's sentiments on Twitter and determining how closely they are related.
We start by defining our hypothesis, which lays the foundation for our research, followed by the research questions that we aim to answer. In the Literature Review section, we examine various techniques used to represent text as word embeddings, methodologies to perform sentiment analysis, enhancements in the field, and the drawbacks and advantages of each approach. We also review the literature on Bitcoin price prediction and summarise the methodologies and techniques used by researchers to identify the impact of various factors on Bitcoin price fluctuation. Furthermore, we explore research papers on the challenges and complexities involved in performing sentiment analysis on multilingual text.
Once equipped with adequate knowledge of the literature, we devised our methodology, which is described in the Methodology and Results section. In the first subsection, we outline the data collection and transformation process for our tweet dataset, followed by data preparation, which involved manually labeling 35,000 tweets into sentiment categories (positive, negative, or neutral) and converting the tweets into word embeddings. We implemented various classification algorithms on our labeled data and, to identify the best algorithm, performed a resampling procedure and a statistical test. We classified the remaining tweets into sentiment categories using the selected algorithm. Finally, we used a linear mixed model to determine the relationship between Bitcoin prices and sentiments. In the end, we discuss the results and suggest future work to further advance the research.

Hypothesis
Inspired by the work and enhancements in the field of sentiment analysis [16], [17] and the future potential of Bitcoin as the world's first digital currency and a new investment option, we tried to understand the effect of people's sentiments on Bitcoin price. To understand this effect, we bind our whole research into the following hypothesis. H0: There is no relationship between Bitcoin price and Twitter sentiments.

Research Question
To reject our null hypothesis we formulate some questions and try to answer them through our methodology. These questions form the base of our research: i) Are tweets on Bitcoin a good source for sentiment analysis on Bitcoin? ii) Can a model be created that accurately classifies tweets into different sentiment categories? iii) Does there exist a correlation between Bitcoin price and people's sentiments on Twitter? To answer the first research question, we manually read and labeled a sample of tweets to check whether tweets can be categorized into positive, negative, or neutral sentiments. To answer the second, we used ELMo embeddings to convert our text data into embedding vectors, trained different machine learning classifiers, and selected the one with the highest accuracy to classify our data into the three categories; we further performed significance tests to support our claim. For our last research question, we used a linear mixed model to find the correlation between Bitcoin prices and sentiments.

Literature Review
The literature review is divided into three sections. The first section is on sentiment analysis. It contains research on how sentiment analysis is performed, the different methodologies used by different researchers, results, and enhancements with time. The second section is on Bitcoin price fluctuation, factors that affect Bitcoin price, and various models used by researchers to find correlations and predict Bitcoin price. In the third section, we look at the challenges with sentiment analysis on multilingual texts and describe why we chose only English for sentiment analysis.

Sentiment Analysis
Opinion mining is one of the key factors in understanding people's perspectives and creating new strategies in many key areas. Much of the work on sentiment analysis involves classifying documents by their overall sentiment. In [18], tweets are classified into positive and negative using polarity scores, with an accuracy of 80.6 percent, and the results are used to understand the effect on stock prices. The authors found that the approach was good with respect to time complexity and accuracy, but it did not consider other aspects of a sentence, such as syntax and semantics, and arbitrary weights were assigned to sentences via the polarity score. Further developments in the field of sentiment analysis included models like unigram, feature-based, and tree kernel-based models [19]. The results show improvement in sentiment scores. This approach was better than the one discussed before [18] because both the prior polarity of words and their part-of-speech tags were used to create features, although semantic analysis, topic modeling, and parsing were still not considered during feature creation. Later, the development of WordNet 3.0, an enhanced lexical resource for opinion mining, produced better results than previously used techniques [20]. The work presents the use of enhanced lexical resources designed for opinion mining applications and for supporting sentiment classification. The results showed significant improvement: 19.4 and 21.96 percent relative improvements in classifying positive and negative sentiment, respectively, as well as improvement over other baseline models. WordNet is a rich vocabulary resource built from thousands of words, their meanings, parts of speech, and degrees of positivity and negativity ranging from 0 to 1, and it is widely used by researchers for text classification. However, WordNet does not consider syntax and semantics, which is a shortcoming of this work [21].
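As a minimal sketch of the polarity-score idea used in [18], a tweet can be classified by summing per-word polarities from a lexicon and thresholding the total. The lexicon and thresholds below are hypothetical illustrations; the actual lexicon and weighting scheme of [18] are not reproduced here.

```python
# Hypothetical polarity lexicon for illustration only; real lexicons
# (e.g. those used in [18]) contain thousands of scored words.
LEXICON = {"good": 1, "great": 2, "bullish": 2, "bad": -1, "crash": -2, "bearish": -2}

def polarity_score(tweet):
    # Sum the polarity of every lexicon word appearing in the tweet.
    return sum(LEXICON.get(word, 0) for word in tweet.lower().split())

def classify(tweet):
    # Threshold the total score into the three sentiment categories.
    score = polarity_score(tweet)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

As the review above notes, this kind of scoring assigns weights to sentences without considering syntax or semantics, which is the limitation the later embedding-based approaches address.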
A significant body of work in natural language processing emerges through the development of linguistic analysis [19]. In linguistic analysis, not only the polarity of individual words but the whole sentence is considered, adding more meaning and weight to the analysis when performing text classification. The classifier created using the corpus generated through linguistic analysis performs better than the rest of the baseline models; its performance is measured using the F-score [19]. The bigram model was found to outperform the n-gram model, and attaching negation words provided very high accuracy, so the classifier generates very accurate results. Although the accuracy was better than in all previous research, it did not improve beyond a certain point. The new approach of phrase-level sentiment analysis, as discussed in [22], changes the overall perspective on classifying text. All previous work discussed in [18], [19], [20] only involves identifying positive and negative reviews. But to answer complicated problems such as a) mining product reviews and b) multi-perspective question answering and summarization, and opinion-oriented information extraction, a better approach was needed: sentence-level or phrase-level sentiment analysis. The researchers used phrase-level sentiment analysis to answer these questions. The first step is to determine whether an expression is neutral or polar, followed by disambiguating the polarity of the polar expression. Using this approach, the system automatically identifies contextual polarity even for large datasets with improved accuracy, and the results are better than those of the baseline models. Even for the simplest classifier, the accuracy was 48 percent higher than that of previously used classifiers.
Finally, advancements in opinion mining led to the development of content-based approaches, which are the most effective of all but are computationally expensive [21]. Many embedding models aim to provide transfer learning to a wide variety of natural language processing tasks after being trained on huge corpora to learn deep semantic representations. One of the most popular pre-trained models is ELMo [21]. ELMo is a pre-trained machine learning model that can encode sentences into deeply contextualized embeddings. The results show that such models have outperformed previous state-of-the-art approaches such as traditional word embeddings for many natural language processing tasks [18], [19], [20], [23], [22], including calculating semantic similarity and relatedness, text classification, etc. Such advancements have enabled complicated tasks such as calculating document relatedness in a research paper recommender system [21]. The results show that ELMo outperformed the other baseline models but was outperformed by USE, BERT, and SciBERT. The possible reason is that ELMo was trained on corpora containing news crawls and natural language inference data, while the latter are trained on technical and scientific texts.

Bitcoin Analysis
Launched in 2009, Bitcoin has gained immense popularity owing to its spectacular adoption and stunning price increase from 2013 onwards. Such popularity also opened various trading and investment options for Bitcoin, such as Bitcoin futures [9]. Sentiment is always considered an important driver of Bitcoin price fluctuation, especially during periods of excessive volatility [24]. The researchers used an autoregressive (AR) model to show the relationship between sentiments and Bitcoin prices. They were successful in capturing a significant impact of positive sentiments on prices, and they also found that positive sentiments have more impact than negative sentiments. The results present only a marginal explanatory value of Bitcoin price fluctuation due to sentiments, which was expected because only one source of sentiment was used. In the next paper, the researchers used a time-series model (a vector error correction model) to analyse the relationship between various economic indicators, tweets, and technological factors [25]. Sentiment analysis was performed on a daily basis using a support vector machine (SVM). The results show that Bitcoin prices are positively related to positive sentiments, Wikipedia searches, and hash rate (a measure of mining difficulty), while the USD and EUR exchange rates have a negative impact on Bitcoin. In the long-run analysis, the Bitcoin price and poor S&P 500 performance have positive and negative impacts, respectively. The paper shows significant findings and scope for future research: the dataset used by the researchers was not very large, a vector autoregressive model might be better than OLS for studying the short-term dynamics of price, and sentiment analysis could be used to predict long-term prices with appropriate algorithmic processes.
Investors are also interested in predicting the price of cryptocurrencies other than Bitcoin (called altcoins), as they offer comparable investment potential. Researchers have also studied the pattern of another cryptocurrency, Ether, and compared its behavior with Bitcoin [26]. In this paper, researchers used Twitter data and Google Trends data to understand the price fluctuations of the two most popular cryptocurrencies, Bitcoin and Ether.
The studies show that tweet volume, rather than sentiment, was the predictor of price direction [26]. A linear model that took Google Trends and tweet volumes as input data was able to accurately predict the price direction but not its value. The results show that sentiment analysis of tweets is not a good predictor when the price is falling, while both tweet and Google Trends volume are highly correlated with the price; the relationship is robust during periods of high variance and non-linearity. The paper also presents a wider scope for future work. Only linear models were used, so complex relationships were not captured. The tweets collected were found to be biased towards positive sentiments, as people tweet positively even when prices are falling, so a more robust model is required to incorporate a balanced measure of volume. The paper highlights that the price of a cryptocurrency is highly correlated with the search volume index and tweet volume during falls as well as rises in price, so the authors noted that a multiple regression model with lagged variables would be more accurate in predicting future price changes. The scope of the internet and micro-blogging sites is further analyzed to understand the dynamics of the Bitcoin price using cross-correlation and linear regression analysis [27]. Sentiments are analyzed using algorithms to automatically understand public opinion, evaluation, and attitude. The results show that there exists a relationship between future Bitcoin price and the volume of tweets at a daily level. The result was generated using two months of data: tweets were collected for a period of 60 days, along with the corresponding Bitcoin price data. The authors suggested that the analysis can be extended beyond 60 days and that pre-trained models such as BERT, ELMo, and USE can be used for better sentiment analysis.

Challenges With Multilingual Text
There exist many challenges when performing sentiment analysis on multilingual text, as explained in [28], [29], [30]. The first challenge is language identification: there is no pre-trained model that can identify all active languages in the world, and doing it manually requires a lot of effort, time, and the aid of language experts [28]. The second challenge is performing the sentiment analysis task itself: there are no resources like WordNet 3.0, POS taggers, etc. for code-mixed languages [28]. Stop-word removal and stemming are important NLP steps for improving a model's performance. Traditional classification models and optimized models such as BERT have shown promising results with monolingual datasets (datasets containing only one language), but such models are not successful on code-mixed language. To use such a model, we would need to train it on huge corpora rich in the linguistic complexities of each language that forms part of the text under analysis; finding or creating such corpora is not only difficult but nearly impossible, as explained in [28]. This is the third challenge with multilingual text. The fourth challenge is that we have to rely heavily on linguistic experts to understand the context of the languages, which increases dependency, time, and effort [31]. After examining the challenges, computational costs, and complexities involved in the analysis of multilingual texts, we decided to consider only English for sentiment analysis.

Methodology and Results
This section is divided into four subsections. The first subsection contains information on data collection and transformation. Data transformation involves removing punctuation, stop words, and non-English text. The second subsection provides information on data preparation. Data preparation involves categorizing data into three sentiment categories: positive, neutral, and negative. The third subsection provides information on model training and the implementation of algorithms. We compare the results from different algorithms using a re-sampling method and a statistical test and select the one with maximum accuracy. The fourth subsection consists of information on the linear mixed model and how it is used to find a correlation between Bitcoin prices and Twitter sentiments.

Data Collection and Transformation
We collected the Twitter data on Bitcoin from Kaggle [32]. The data was in CSV format. The dataset had 16 million records, with each record having 8 columns: first name, last name, id, likes, tweets, retweets, date. For our analysis, we used data for the year 2017, as 2017 was the year of maximum fluctuations and the best results are captured during such periods [27]. The tweets in the dataset were in many different languages, including Chinese, Spanish, and French. After cleaning the data, we were left with 190,000 records for the whole year. The cleaning process involved removing special characters, digits, punctuation marks, and non-English sentences. We also removed stop words, after labeling the sentences into the three categories: positive, negative, and neutral.
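The character-level cleaning steps above can be sketched as follows. This is an illustrative sketch, not the exact pipeline used: the stop-word list below is a hypothetical stand-in, and the non-English filtering step is omitted since it requires a language-identification tool.

```python
import re

# Hypothetical stop-word list for illustration; a full list (e.g. NLTK's)
# would be used in practice.
STOP_WORDS = {"the", "is", "a", "an", "to", "of", "and"}

def clean_tweet(text):
    """Remove special characters, digits, and punctuation, then lowercase."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # keep only letters and spaces
    text = re.sub(r"\s+", " ", text).strip().lower()  # collapse whitespace
    return text

def remove_stop_words(text):
    """Drop stop words after sentiment labeling, as described above."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)
```

Removing stop words only after labeling preserves negations and function words that a human annotator may rely on when judging sentiment.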

Data Preparation
To prepare our data we manually labeled 35,000 tweets into positive, negative, and neutral sentiments. To evaluate the validity of our labeled dataset we calculated the Kappa coefficient [33]. The Kappa coefficient, also known as Cohen's Kappa, is a statistical method used to measure inter-rater reliability for categorical data. It is given by Eq. 1, κ = (P_o − P_e) / (1 − P_e), where P_o is the relative observed agreement among raters and P_e is the hypothetical probability of chance agreement. The Kappa coefficient value for our data was 0.83, which indicates that the authors of this paper labeled the data into positive, negative, and neutral categories with near-complete agreement and no conflict of opinion. Next, we used ELMo embeddings to convert the sentences into word embeddings. A word embedding is a numerical representation of a string of words, in which similar strings are represented by similar embeddings. Figure 2 shows a diagrammatic representation of word embeddings [34]. ELMo is a bi-directional LSTM language model used to compute contextualized character-based word representations [21]. It has two layers stacked together, and each layer makes two passes, one forward and one backward. In a bi-directional language model, both the left and right contexts of the target word are used, meaning the target word can be predicted using either the left or the right context [35]. A diagrammatic representation of ELMo embeddings is shown in figure 3 [34]. While most embeddings only capture the frequency of words to weight them and hence are unable to capture the meaning of words, ELMo embeddings capture both syntax and semantics [36]. We used the ELMo TensorFlow configuration trained on the One Billion Word Benchmark shown in table 4. Furthermore, this module supports text in both string and tokenized form.
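Cohen's Kappa described above can be computed directly from two raters' label sequences. The following is a minimal self-contained sketch of Eq. 1:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa = (P_o - P_e) / (1 - P_e) for two raters' label lists."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # P_o: relative observed agreement among the raters
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P_e: chance agreement from each rater's marginal label frequencies
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)
```

For example, for five tweets where two raters disagree on one label, P_o = 0.8 and the chance-corrected agreement comes out lower than the raw agreement, which is exactly what Kappa is designed to capture.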
The rest of the ELMo parameters were kept at their default configuration, with fixed embeddings at each LSTM layer, learnable aggregation of the 3 layers, and a fixed mean-pooled vector representation of the input. We split our dataset into training and test sets with a 70:30 ratio, which is a standard split.
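The 70:30 split can be sketched as a simple shuffle-and-cut over the labeled records. This is an illustrative stand-in for whatever split utility was actually used (e.g. scikit-learn's `train_test_split`); the seed value is arbitrary.

```python
import random

def train_test_split(records, train_ratio=0.7, seed=42):
    """Shuffle records and split them into training and test sets (70:30 by default)."""
    shuffled = records[:]                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # deterministic shuffle for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]
```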

Model Training And Implementation Of Algorithms
Now that we had the labeled data for 35,000 records and the ELMo representation of all the tweets, we needed to find the best classifier to classify the remaining tweets into positive, negative, and neutral categories. For training and testing our algorithms, we used an equal number of records from each sentiment category, i.e. the same number of positive, negative, and neutral records. We implemented one parametric [37] and three non-parametric machine learning models [38]. For the non-parametric models, we used three classifier algorithms: i) K-nearest neighbours (KNN) [39], ii) decision tree [40], and iii) SVM [41]. For the parametric model, we used one classifier algorithm, Naïve Bayes [42].
We used the K-fold cross-validation technique to find the algorithm with maximum accuracy [43]. The results of the four algorithms, with K equal to 10, are shown in the highlighted table 5; a red-green diverging color gradient shows how significantly different the results are for each algorithm at each fold. The table also shows the average results for all algorithms. From the results, we can conclude that SVM is a better classifier algorithm than the other algorithms under consideration. To test whether the differences between the algorithms are statistically significant [44], we used ANOVA followed by a post-hoc comparison test [45]. To conduct the ANOVA analysis, we formulated the following hypothesis: "H0: There is no significant difference between the machine learning algorithms". Once we had rejected the null hypothesis, we performed a post-hoc comparison test to identify the machine learning algorithm that is significantly different from all the others. For the post-hoc comparison we used Tukey's HSD [46]. Table 1 shows the ANOVA result.
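The K-fold procedure partitions the data into K disjoint test folds, training on the remaining K−1 folds each time. A minimal sketch of the fold generation (library implementations such as scikit-learn's `KFold` would be used in practice):

```python
def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation over n records."""
    indices = list(range(n))
    fold_size = n // k
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder so every record is tested exactly once
        end = start + fold_size if i < k - 1 else n
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx
```

Each algorithm's per-fold accuracy (as in table 5) is obtained by training on `train_idx` and scoring on `test_idx`, then averaging over the K folds.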
From table 1, the p-value is less than 0.05, the 5% significance level. So, we reject our null hypothesis, which states that "H0: There is no significant difference among algorithms", and accept the alternative hypothesis. Further, we used a post-hoc test to find the algorithms that are significantly different from each other. Table 2 shows the results of the Tukey HSD test.
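The ANOVA F-statistic behind table 1 compares between-algorithm variance to within-algorithm (across-fold) variance. A minimal sketch of the computation, taking each algorithm's per-fold accuracies as a group (library routines such as `scipy.stats.f_oneway` would be used in practice):

```python
def one_way_anova_f(groups):
    """One-way ANOVA F-statistic over groups of per-fold accuracy scores."""
    k = len(groups)                              # number of algorithms
    n = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # between-group sum of squares: how far each algorithm's mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # within-group sum of squares: fold-to-fold spread inside each algorithm
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within
```

The p-value in table 1 is then obtained from the F-distribution with (k−1, n−k) degrees of freedom; a large F (small p) indicates the algorithms' mean accuracies genuinely differ.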
From the results of the post-hoc test in table 2, we conclude that the accuracy of the non-parametric models is significantly different from that of the parametric model. In other words, the non-parametric models are better than the parametric model for our research work. This is because there exists a bias in the data: the dimension of the ELMo word embeddings is 1024, while the number of records used to train and test the algorithms is 27,000. This introduced bias into the training and test data, so the non-parametric models produced better results because their high variance [47] balances the bias in our data. Further analysis from the post-hoc test shows that there is no significant difference among the non-parametric models. We chose the SVM classifier over the other non-parametric models because, after performing 10-fold cross-validation, the mean accuracy of SVM was approximately 0.92 (figure 5), the maximum of all models. So, we selected the combination of ELMo+SVM to label our complete dataset.
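The ELMo+SVM step can be sketched as follows. Random stand-in vectors replace the real 1024-dimensional ELMo embeddings here (an assumption for illustration only), and only two of the three sentiment classes are shown to keep the example small.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two well-separated 8-dim clusters standing in for "positive" and "negative"
# tweet embeddings (real ELMo vectors are 1024-dim and come from the model)
X = np.vstack([rng.normal(1.0, 0.1, (20, 8)), rng.normal(-1.0, 0.1, (20, 8))])
y = ["positive"] * 20 + ["negative"] * 20

# Fit an SVM on the embedding vectors, as in the ELMo+SVM combination above
clf = SVC(kernel="rbf").fit(X, y)
```

Once trained on the 35,000 manually labeled tweets, the same `predict` call is applied to the embeddings of the remaining unlabeled tweets.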

Finding Correlation Between Bitcoin And Twitter Sentiments
In this last part, we use a linear mixed model to understand the effect of Twitter sentiments on Bitcoin price. Linear mixed models are used when we need to find a correlation between dependent and independent variables while accounting for both fixed and random effects [48]. The equation is shown in figure 6. In our model, we consider the per-day counts of positive, negative, and neutral sentiments as fixed effects and the volume (the total number of positive, negative, and neutral tweets) as a random effect, while the dependent variable is the difference between the per-day opening and closing prices. The model is tested at different significance levels: figure 7 shows the result at the 5% significance level, figure 8 at the 1% level, and figure 9 at the 10% level.
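With the variables described above, the model in figure 6 takes the general linear mixed form (the coefficient symbols below are illustrative; the exact specification is the one shown in figure 6):

```latex
y_t = \beta_0 + \beta_1\,\mathrm{pos}_t + \beta_2\,\mathrm{neg}_t + \beta_3\,\mathrm{neu}_t
      + u_{v(t)} + \varepsilon_t,
\qquad u_v \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)
```

where \(y_t\) is the open-to-close price difference on day \(t\), \(\mathrm{pos}_t\), \(\mathrm{neg}_t\), and \(\mathrm{neu}_t\) are the per-day sentiment counts (fixed effects), and \(u_{v(t)}\) is the random effect associated with the day's tweet volume.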
From the results, we can see that the p-value is greater than the significance level at all the levels under consideration. Hence, we do not have enough evidence to reject the null hypothesis, and we therefore fail to reject the hypothesis that there exists no relationship between sentiments and Bitcoin price.

Conclusion and Future work
From the results, we conclude that Bitcoin prices are not affected by people's sentiments on Twitter. It is therefore possible that other factors, such as technology, government policies, institutional investors, new ventures, partnerships, news on cryptocurrency, and new crypto technologies, affect the price of Bitcoin. The results also hint at the possibility that Bitcoin price affects people's opinions and sentiments rather than the other way around. In other words, the price fluctuation of Bitcoin may drive people's emotions on Twitter: whatever people tweet about Bitcoin, positive or negative, may be a reaction to Bitcoin price fluctuation. For example, someone holding Bitcoin may express gratitude if the price rises, while others may show sadness and anger. So, the results suggest that tweets on Bitcoin are little more than the aftermath of Bitcoin price fluctuations.
The above research is restricted to only one language (English), so it can be extended to multiple languages. For word embedding, a newer model such as BERT can be used. Instead of tweets, news headlines or statements by prominent personalities can be used as a source of sentiments. The period used is limited to a single year in which the price of Bitcoin was highly volatile; the research can be extended to 2-3 years of data. The above work is limited to a single cryptocurrency, and it is possible that different cryptocurrencies behave differently with regard to people's sentiments. The above work also does not examine the relationships between cryptocurrencies, such as how the price of one affects another or why Bitcoin is more volatile than all other cryptocurrencies. Finally, weights could be assigned to each sentiment instead of labels, depending on the type of context, the person who released the statement, and the likes on a comment; per-day total weights could then be used instead of per-day counts of different tweets.