Submitted:
22 July 2024
Posted:
23 July 2024
Abstract
Keywords:
1. Introduction
2. Literature Review
3. Methodology
3.1. Data Collection
3.1.1. Bitcoin Market Data
3.1.2. Online Community Activity
- Relevance and Influence: Popular posts typically garner the most attention and engagement within the community. By focusing on these posts, we capture the discussions that are most likely to influence and reflect the broader sentiment and opinions of the subreddit members. These high-engagement posts often drive significant discussions and can have a more substantial impact on market perceptions and behaviors.
- Quality Over Quantity: Collecting a vast number of posts indiscriminately can introduce a lot of noise, as many low-engagement posts might not contribute meaningful insights to the analysis. By selecting only the most popular posts, we ensure a higher quality dataset that is rich in content and context, which is more suitable for robust sentiment and topic analysis.
- Resource Efficiency: Analyzing the entire stream of posts and comments from a busy subreddit like r/cryptocurrency can be resource-intensive in terms of computational power and time. Focusing on the most popular posts allows us to perform a more manageable and efficient analysis while still capturing the essential dynamics of the community.
- Reflecting Community Trends: Popular posts are often indicative of trending topics and hot-button issues within the cryptocurrency community. By analyzing these posts, we can better understand the current trends, concerns, and sentiments that are prevalent among active participants in the subreddit.
- Enhanced Sentiment Analysis: Sentiment analysis benefits from context-rich data. Popular posts and their comments are likely to contain more detailed and passionate expressions of sentiment, providing a clearer picture of the community’s emotional and attitudinal landscape.
- Focus on Key Influences: High-engagement posts are more likely to be shared and discussed beyond the subreddit, potentially influencing wider public opinion and media coverage. By analyzing these posts, we can gain insights into the key influences shaping the narrative around Bitcoin and other cryptocurrencies.
3.2. Text Preprocessing
- Tokenization: The text is split into individual words or tokens. This helps in handling each word separately during further processing.
- Lowercasing: All text is converted to lowercase to ensure uniformity. For example, "Bitcoin" and "bitcoin" are treated as the same token.
- Removing Punctuation and Special Characters: Punctuation marks and special characters (e.g., #, @, !) are removed to avoid their interference in the analysis. This step helps in reducing noise in the data.
- Removing URLs and Hyperlinks: URLs and hyperlinks are extracted and discarded as they do not contribute meaningful information for sentiment or thematic analysis.
- Removing Stopwords: Common words that do not carry significant meaning (e.g., "and", "the", "is") are removed. We use the NLTK [26] stopword list and expand it based on domain-specific knowledge.
- Stemming and Lemmatization: Words are reduced to a base or root form. Stemming strips affixes heuristically, whereas lemmatization uses dictionary-based methods to map each word to its lemma. For instance, "running" becomes "run". We employed lemmatization using the WordNet lemmatizer from NLTK, as it preserves context better than stemming.
- Handling Emojis and Emoticons: Emojis and emoticons are often used to convey sentiment. We replace them with corresponding text descriptions (e.g., ":)" becomes "smiley_face") to retain their sentiment information.
- Removing Redundant Whitespace: Multiple spaces, tabs, and newlines are normalized to a single space to ensure consistent formatting.
- Retaining Domain-Specific Terms: Cryptocurrency discussions often involve specific jargon (e.g., "HODL", "FOMO", "moon"). These terms are retained to ensure the relevance and accuracy of sentiment and topic analysis.
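The steps above can be illustrated with a minimal sketch. It is not the pipeline used in this study: the stopword set and the emoticon map are small hypothetical stand-ins (the study uses the extended NLTK stopword list), the function name `preprocess` is an assumption, and lemmatization is omitted so that the example runs with the Python standard library only.

```python
import re
from typing import List

# Assumption: a tiny illustrative stopword set, standing in for the
# extended NLTK list described in the text.
STOPWORDS = {"and", "the", "is", "a", "to", "of", "in", "it"}

# Assumption: a toy emoticon map; the actual mapping is not given in the text.
EMOTICONS = {":)": "smiley_face", ":(": "sad_face"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9_\s]")


def preprocess(text: str) -> List[str]:
    """Return the cleaned tokens of one comment (lemmatization omitted)."""
    # Drop URLs first so their punctuation is not turned into tokens.
    text = URL_RE.sub(" ", text)
    # Replace emoticons with text labels before punctuation is stripped.
    for emoticon, label in EMOTICONS.items():
        text = text.replace(emoticon, f" {label} ")
    text = text.lower()
    # Remove punctuation and special characters; underscores are kept
    # so that labels such as "smiley_face" survive.
    text = NON_WORD_RE.sub(" ", text)
    # split() without arguments tokenizes and collapses redundant whitespace.
    tokens = text.split()
    # Domain terms such as "hodl" or "moon" are retained because they are
    # not in the stopword set.
    return [tok for tok in tokens if tok not in STOPWORDS]
```

For example, `preprocess("Bitcoin is going to the MOON!!! :) https://example.com/x  #HODL")` yields `["bitcoin", "going", "moon", "smiley_face", "hodl"]`. The order of operations matters: URLs and emoticons are handled before punctuation removal, otherwise their characters would be stripped piecemeal.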
3.3. Sentiment Analysis
- Text Preprocessing: The text of each comment was preprocessed using the steps described in Section 3.2.
- Sentiment Classification: Using the VADER model, each comment was assigned a sentiment score. Comments were categorized based on their scores: positive (score > 0.5), neutral (score between -0.5 and 0.5), and negative (score < -0.5).
- Aggregation: Sentiment scores were aggregated on a daily basis to track sentiment trends over time. The distribution of comments into positive, neutral, and negative categories was calculated for each day.
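The classification and aggregation steps can be sketched as follows. This is an illustrative sketch only: it assumes each comment already carries a sentiment score in [-1, 1] (for instance the VADER compound score), it treats the boundary values ±0.5 as neutral, and the names `classify` and `daily_distribution` as well as the date strings are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

LABELS = ("positive", "neutral", "negative")


def classify(score: float) -> str:
    """Map a sentiment score to a category using the thresholds in the text."""
    if score > 0.5:
        return "positive"
    if score < -0.5:
        return "negative"
    # Assumption: scores exactly equal to +/-0.5 count as neutral.
    return "neutral"


def daily_distribution(
    scored: Iterable[Tuple[str, float]]
) -> Dict[str, Dict[str, float]]:
    """Return, for each day, the share of comments in each category.

    `scored` is an iterable of (day, score) pairs, one per comment.
    """
    counts = defaultdict(Counter)
    for day, score in scored:
        counts[day][classify(score)] += 1
    result = {}
    for day, day_counts in counts.items():
        total = sum(day_counts.values())
        result[day] = {label: day_counts[label] / total for label in LABELS}
    return result
```

Given the hypothetical input `[("2021-01-01", 0.8), ("2021-01-01", 0.1), ("2021-01-01", -0.7), ("2021-01-01", 0.5)]`, the function returns shares of 0.25 positive, 0.5 neutral, and 0.25 negative for that day.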
3.4. Topic Modelling
- Text Preprocessing: Similar to sentiment analysis, text preprocessing was performed to clean the data as described in Section 3.2. Additionally, words that appear very infrequently (e.g., only once or twice in the dataset) were removed to enhance model performance. These words often do not contribute significant information and can introduce noise.
- Choosing the Number of Topics: The optimal number of topics was determined using a perplexity plot and multiple evaluation metrics, as produced by the ldatuning [29] package in GNU R [30], with each metric offering a different perspective on model evaluation.
  - Perplexity [28]: Perplexity measures how well a probabilistic model predicts a sample, with lower values indicating better generalization performance.
  - CaoJuan2009 [31]: This metric aims to minimize the distance between topics. The optimal number of topics is identified at the minimum point of the curve.
  - Arun2010 [32]: Similar to CaoJuan2009, this metric also seeks the minimum value, which reflects the most distinct topics.
  - Griffiths2004 [33]: This metric evaluates the likelihood of the model. The optimal number of topics is indicated by the maximum value of the curve.
  - Deveaud2014 [34]: This metric measures topic coherence, with higher values indicating better coherence.

  In order to decide the optimal number of topics to use, the combined performance of all metrics should be taken into account.
- Model Training: The LDA model was trained on the preprocessed dataset. The top ten terms for each topic were extracted to interpret the themes.
- Topic Distribution: The distribution of topics over time was analyzed to observe changes in the thematic focus of the subreddit.
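The removal of very infrequent words mentioned in the preprocessing step can be sketched as below. This is a minimal illustration, not the code used in the study: the function name `prune_rare_terms` and the default threshold `min_count=3` (which drops words occurring once or twice in the whole corpus) are assumptions.

```python
from collections import Counter
from typing import List


def prune_rare_terms(docs: List[List[str]], min_count: int = 3) -> List[List[str]]:
    """Drop tokens whose total corpus frequency is below `min_count`.

    `docs` is a list of tokenized documents. Frequencies are counted over
    the whole corpus, not per document.
    """
    freq = Counter(tok for doc in docs for tok in doc)
    return [[tok for tok in doc if freq[tok] >= min_count] for doc in docs]
```

Documents that become empty after pruning are kept as empty lists here; whether to discard them before fitting the LDA model is a separate choice that the sketch leaves open.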
3.5. Correlation Analysis
- Between the number of retrieved posts and BTC closing prices and volumes at various lags.
- Between the number of retrieved comments and BTC closing prices and volumes at various lags.
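A lagged correlation of this kind can be sketched as follows. The sketch assumes Pearson correlation on two equally long daily series and pairs activity on day t with the market value on day t + lag; the lag direction, the lag range, and the names `pearson` and `lagged_correlations` are assumptions, not details taken from the study.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series.

    Raises ZeroDivisionError if either series is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def lagged_correlations(
    activity: Sequence[float], market: Sequence[float], max_lag: int
) -> Dict[int, float]:
    """Correlate activity[t] with market[t + lag] for lag = 0..max_lag."""
    if len(activity) != len(market):
        raise ValueError("series must have the same length")
    result = {}
    for lag in range(max_lag + 1):
        # Align the two series by trimming the tail of one and the head
        # of the other, so that index t pairs with index t + lag.
        shifted_activity = activity[: len(activity) - lag]
        shifted_market = market[lag:]
        result[lag] = pearson(shifted_activity, shifted_market)
    return result
```

Here `activity` would hold the daily number of posts or comments and `market` the daily BTC closing price or volume. A peak at a positive lag would suggest that community activity leads the market series by that many days, although correlation alone does not establish causation.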
4. Results
4.1. Bitcoin Market Data
4.2. Online Community Data
4.2.1. Correlations with BTC Market Data
4.3. r/Cryptocurrency Comments
4.3.1. Correlations with BTC Market Data
4.4. Sentiment Analysis
4.4.1. Correlations with BTC Market Data
4.5. Topic Modelling
- scam: The presence of this term with the highest beta value indicates that discussions often involve concerns about fraudulent activities in the cryptocurrency market. This term suggests that the community is vigilant about identifying and discussing potential scams.
- dip: This term refers to a temporary decline in cryptocurrency prices. Its prominence suggests that community members frequently discuss price fluctuations and strategies for navigating market downturns.
- pump: The term "pump" is associated with rapid increases in asset prices, often as a result of coordinated efforts. Discussions around "pump" suggest a focus on market manipulation tactics and their impacts.
- cash: This term could refer to liquid assets or fiat currency in the context of cryptocurrency trading. Its inclusion indicates discussions about liquidity, cashing out, or converting crypto to cash.
- fund: This term suggests topics related to investment funds, funding sources, or financial backing within the cryptocurrency space. It highlights conversations about financial strategies and investment opportunities.
- elon: The presence of Elon Musk’s first name suggests that his influence on the cryptocurrency market, especially through tweets and public statements, is a significant topic of discussion.
- meme: The term "meme" indicates the role of internet culture and humor in cryptocurrency discussions. Memes often reflect market sentiment and can influence trading behavior.
- tweet: This term reinforces the influence of social media, particularly Twitter, on market movements. Tweets from influential figures can drive significant changes in market dynamics.
- origin: This term may refer to the origin or beginnings of certain cryptocurrencies, projects, or movements within the market. It suggests historical discussions and tracing the roots of market trends.
- pull: This term could refer to "rug pulls," a type of scam where developers abandon a project and take investors’ funds, or to pulling out investments. Its inclusion highlights concerns about exit strategies and potential scams.
- bitcoin: As the dominant term with the highest beta value, "bitcoin" indicates that a significant portion of the discussion focuses on Bitcoin, the most well-known and widely discussed cryptocurrency.
- doge: The inclusion of "doge" (referring to Dogecoin) suggests that another popular cryptocurrency is a frequent topic of conversation. Dogecoin’s meme origins and its community-driven popularity often make it a subject of interest.
- bank: This term points to discussions about the role of traditional banking institutions in the cryptocurrency space. It may involve topics like banks’ interactions with cryptocurrencies, the impact of crypto on banking, or the adoption of blockchain technology by banks.
- govern: The presence of this term indicates discussions about government policies, actions, and involvement in the cryptocurrency market. This could include regulatory frameworks, government-backed cryptocurrencies, or geopolitical influences.
- countri: This term suggests that discussions often focus on how different countries are approaching cryptocurrencies. Topics may include national regulations, adoption rates, and international differences in crypto policies.
- mine: The term "mine" refers to cryptocurrency mining, the process of validating transactions and generating new coins. Discussions may cover mining technologies, environmental impacts, profitability, and geographical distribution of mining operations.
- currenc: This term likely represents "currency," highlighting the broader discussion about cryptocurrencies as a form of digital money. This includes debates on their viability as currency, comparison with fiat currencies, and their role in the financial system.
- flat: This term is likely a misspelling or abbreviation of "fiat," referring to traditional government-issued currencies. Discussions might compare fiat currencies to cryptocurrencies, covering topics like stability, value, and adoption.
- cap: The term "cap" likely refers to market capitalization, a common metric used to assess the value of cryptocurrencies. Discussions may involve the market cap rankings of different cryptocurrencies, trends, and their implications.
- regul: Short for "regulation," this term signifies discussions about the regulatory environment surrounding cryptocurrencies. This includes laws, compliance requirements, regulatory challenges, and their impact on the market.
- long: The term "long" refers to a long-term investment strategy, indicating discussions about holding assets over an extended period to realize gains. This term suggests that a significant portion of the community engages in or discusses long-term investment approaches.
- hodl: "Hodl" is a popular term in the cryptocurrency community, derived from a misspelling of "hold." It represents the strategy of holding onto cryptocurrency investments regardless of market volatility. Its presence indicates strong discussions around the hodling philosophy.
- bear: This term refers to a bear market, characterized by declining prices. The inclusion of "bear" suggests that the community frequently discusses market downturns and strategies for navigating bearish conditions.
- shitcoin: A derogatory term used to describe cryptocurrencies with little to no value or potential. The presence of this term suggests that community members are critical and discerning about the quality and viability of various cryptocurrencies.
- space: This term likely refers to the broader cryptocurrency ecosystem or market space. Discussions around "space" may include market trends, developments, and the overall state of the cryptocurrency industry.
- risk: The term "risk" highlights discussions about the inherent risks associated with cryptocurrency investments. Topics may include risk management strategies, volatility, and the factors contributing to investment risk.
- bull: In contrast to "bear," the term "bull" refers to a bull market, characterized by rising prices. Discussions involving "bull" suggest that the community also focuses on bullish conditions and strategies for capitalizing on upward market trends.
- drop: This term indicates price drops or market corrections. Its presence suggests that community members frequently discuss sudden declines in cryptocurrency prices and their implications.
- ada: This term likely refers to Cardano’s cryptocurrency (ADA). The inclusion of "ada" indicates that specific cryptocurrencies, particularly Cardano, are a significant topic of discussion within this theme.
- bit: Likely referring to Bitcoin or bits as a unit of Bitcoin. The term "bit" suggests discussions about Bitcoin in general or its fractional units.
- eth: The term "eth" (Ethereum) has the highest beta value, indicating that discussions frequently involve Ethereum. This suggests a significant focus on one of the most prominent and influential cryptocurrencies in the market.
- exchang: This term likely refers to cryptocurrency exchanges, platforms where users can trade cryptocurrencies. The prominence of this term suggests extensive discussions about exchange-related topics, such as trading strategies, exchange reviews, and transaction experiences.
- moon: In the cryptocurrency community, "moon" refers to significant price increases. Discussions involving "moon" suggest that community members are interested in and hopeful for substantial price surges and investment returns.
- fee: The term "fee" indicates discussions about transaction costs associated with trading or transferring cryptocurrencies. This can include exchange fees, gas fees on Ethereum, and other costs that impact traders and investors.
- nft: Non-fungible tokens (NFTs) are unique digital assets representing ownership of specific items or content. The presence of "nft" suggests that the community is actively discussing this burgeoning sector within the cryptocurrency space.
- asset: The term "asset" points to discussions about cryptocurrencies as financial assets. Topics might include asset management, valuation, and the role of different cryptocurrencies in investment portfolios.
- coinbas: Likely referring to Coinbase, one of the largest and most popular cryptocurrency exchanges. This term indicates that discussions frequently involve Coinbase, its services, and user experiences.
- token: This term refers to various types of cryptocurrency tokens, which can represent assets, utility, or value within specific platforms. Discussions about tokens might cover new token offerings, token performance, and their utility within ecosystems.
- predict: The term "predict" suggests discussions about price predictions, market forecasts, and analytical methods used to anticipate future market movements.
- secur: Likely referring to "secure" or "security," this term indicates discussions about the security of cryptocurrency assets, exchanges, and transactions. Topics might include best practices for securing assets, security breaches, and regulatory measures.
5. Discussion and Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| BTC | Bitcoin |
| LDA | Latent Dirichlet Allocation |
| NFT | Non-Fungible Token |
| NLP | Natural Language Processing |
| NLTK | Natural Language Toolkit |
| PRAW | Python Reddit API Wrapper |
| VADER | Valence Aware Dictionary and sEntiment Reasoner |
References
- Nakamoto, S. Bitcoin: A peer-to-peer electronic cash system, 2008.
- Breidbach, C.F.; Tana, S. Betting on Bitcoin: How social collectives shape cryptocurrency markets. Journal of Business Research 2021, 122, 311–320. [CrossRef]
- Kang, K.; Choo, J.; Kim, Y. Whose Opinion Matters? Analyzing Relationships Between Bitcoin Prices and User Groups in Online Community. Social Science Computer Review 2020, 38, 686–702. [CrossRef]
- Oikonomopoulos, S.; Tzafilkou, K.; Karapiperis, D.; Verykios, V. Cryptocurrency Price Prediction using Social Media Sentiment Analysis. In Proceedings of the 2022 13th International Conference on Information, Intelligence, Systems & Applications (IISA), 2022, pp. 1–8. [CrossRef]
- Tandon, C.; Revankar, S.; Palivela, H.; Parihar, S.S. How can we predict the impact of the social media messages on the value of cryptocurrency? Insights from big data analytics. International Journal of Information Management Data Insights 2021, 1, 100035. [CrossRef]
- Raheman, A.; Kolonin, A.; Fridkins, I.; Ansari, I.; Vishwas, M. Social Media Sentiment Analysis for Cryptocurrency Market Prediction, 2022, [arXiv:cs.CL/2204.10185].
- Steinert, L.; Herff, C. Predicting altcoin returns using social media. PLOS ONE 2018, 13, 1–12. [CrossRef]
- Garg, S.; Panwar, D.S.; Gupta, A.; Katarya, R. A Literature Review On Sentiment Analysis Techniques Involving Social Media Platforms. In Proceedings of the 2020 Sixth International Conference on Parallel, Distributed and Grid Computing (PDGC), 2020, pp. 254–259. [CrossRef]
- Loginova, E.; Tsang, W.K.; van Heijningen, G.; Kerkhove, L.P.; Benoit, D.F. Forecasting directional bitcoin price returns using aspect-based sentiment analysis on online text data. Machine Learning 2020, 113, 4761–4784. [CrossRef]
- Phillips, R.C.; Gorse, D. Mutual-Excitation of Cryptocurrency Market Returns and Social Media Topics. In Proceedings of the 4th International Conference on Frontiers of Educational Technologies, New York, NY, USA, 2018; ICFET ’18, pp. 80–86. [CrossRef]
- Wooley, S.; Edmonds, A.; Bagavathi, A.; Krishnan, S. Extracting Cryptocurrency Price Movements from the Reddit Network Sentiment. In Proceedings of the 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019, pp. 500–505. [CrossRef]
- Gurrib, I.; Kamalov, F. Predicting bitcoin price movements using sentiment analysis: a machine learning approach. Studies in Economics and Finance 2022, 39, 347–364.
- Kraaijeveld, O.; De Smedt, J. The predictive power of public Twitter sentiment for forecasting cryptocurrency prices. Journal of International Financial Markets, Institutions and Money 2020, 65, 101188. [CrossRef]
- Phillips, R.C.; Gorse, D. Predicting cryptocurrency price bubbles using social media data and epidemic modelling. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), 2017, pp. 1–7. [CrossRef]
- Naeem, M.A.; Mbarki, I.; Shahzad, S.J.H. Predictive role of online investor sentiment for cryptocurrency market: Evidence from happiness and fears. International Review of Economics & Finance 2021, 73, 496–514. [CrossRef]
- Lamon, C.; Nielsen, E.; Redondo, E. Cryptocurrency price prediction using news and social media sentiment. SMU Data Sci. Rev 2017, 1, 1–22.
- Pang, Y.; Sundararaj, G.; Ren, J. Cryptocurrency Price Prediction using Time Series and Social Sentiment Data. In Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, New York, NY, USA, 2019; BDCAT ’19, pp. 35–41. [CrossRef]
- Narman, H.S.; Uulu, A.D. Impacts of Positive and Negative Comments of Social Media Users to Cryptocurrency. In Proceedings of the 2020 International Conference on Computing, Networking and Communications (ICNC), 2020, pp. 187–192. [CrossRef]
- Agosto, A.; Cerchiello, P.; Pagnottoni, P. Sentiment, Google queries and explosivity in the cryptocurrency market. Physica A: Statistical Mechanics and its Applications 2022, 605, 128016. [CrossRef]
- Georgoula, I.; Pournarakis, D.; Bilanakos, C.; Sotiropoulos, D.; Giaglis, G.M. Using time-series and sentiment analysis to detect the determinants of bitcoin prices. Available at SSRN 2607167 2015.
- .
- Murphy, J.J. Technical analysis of the financial markets: A comprehensive guide to trading methods and applications; Penguin, 1999.
- Reddit. Accessed Jun 15, 2024. https://reddit.com/.
- PRAW: The Python Reddit API Wrapper. Accessed Jun 15, 2024. https://praw.readthedocs.io.
- Glenski, M.; Pennycuff, C.; Weninger, T. Consumers and Curators: Browsing and Voting Patterns on Reddit. IEEE Transactions on Computational Social Systems 2017, 4, 196–206. [CrossRef]
- NLTK: Natural Language Toolkit. Accessed Jun 15, 2024. https://www.nltk.org/.
- VADER: Valence Aware Dictionary and sEntiment Reasoner. Accessed Jun 15, 2024. https://vadersentiment.readthedocs.io/.
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. Journal of machine Learning research 2003, 3, 993–1022.
- Nikita, M. ldatuning: Tuning of the Latent Dirichlet Allocation Models Parameters 2020. R package version 1.0.2.
- R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2023.
- Cao, J.; Xia, T.; Li, J.; Zhang, Y.; Tang, S. A density-based method for adaptive LDA model selection. Neurocomputing 2009, 72, 1775–1781. [CrossRef]
- Arun, R.; Suresh, V.; Veni Madhavan, C.E.; Narasimha Murthy, M.N. On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations 2010. pp. 391–402.
- Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proceedings of the National academy of Sciences 2004, 101, 5228–5235.
- Deveaud, R.; SanJuan, E.; Bellot, P. Accurate and effective latent concept modeling for ad hoc information retrieval. Document numérique 2014, 17, 61–84. [CrossRef]
- Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time series analysis: forecasting and control; John Wiley & Sons, 2015.
- Hyndman, R.J.; Athanasopoulos, G. Forecasting: principles and practice, 2nd ed.; Melbourne: OTexts, 2018.
- Mai, F.; Shan, Z.; Bai, Q.; Wang, X.S.; Chiang, R.H. How Does Social Media Impact Bitcoin Value? A Test of the Silent Majority Hypothesis. Journal of Management Information Systems 2018, 35, 19–52. [CrossRef]
- Wang, S.; Vergne, J.P. Buzz Factor or Innovation Potential: What Explains Cryptocurrencies’ Returns? PLOS ONE 2017, 12, 1–17. [CrossRef]
- Hutto, C.; Yardi, S.; Gilbert, E. A longitudinal study of follow predictors on twitter. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2013; CHI ’13, pp. 821–830. [CrossRef]
- Ba, C.T.; Zignani, M.; Gaito, S. The role of cryptocurrency in the dynamics of blockchain-based social networks: The case of Steemit. PLOS ONE 2022, 17, 1–22. [CrossRef]
- Corbet, S.; Lucey, B.; Yarovaya, L. Datestamping the Bitcoin and Ethereum bubbles. Finance Research Letters 2018, 26, 81–88. [CrossRef]
- Wołk, K. Advanced social media sentiment analysis for short-term cryptocurrency price prediction. Expert Systems 2020, 37, e12493. [CrossRef]
- Linton, M.; Teo, E.G.S.; Bommes, E.; Chen, C.Y.; Härdle, W.K. Dynamic Topic Modelling for Cryptocurrency Community Forums. In Applied Quantitative Finance; Härdle, W.K.; Chen, C.Y.H.; Overbeck, L., Eds.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2017; pp. 355–372. [CrossRef]
| Min | Q1 | Mean | Median | Q3 | Max | SD |
|---|---|---|---|---|---|---|
| 36 | 787 | 1493.8 | 1157.5 | 1743 | 43428 | 2073.9 |

| Min | Q1 | Mean | Median | Q3 | Max | SD |
|---|---|---|---|---|---|---|
| 1 | 48 | 163.6 | 88 | 175 | 9032 | 276.3 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
