Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

Retweet Prediction based on Heterogeneous Data Sources: The Combination of Text and Multilayer Network Features

Version 1: Received: 29 September 2022 / Approved: 8 October 2022 / Online: 8 October 2022 (03:00:41 CEST)

A peer-reviewed article of this Preprint also exists.

Meštrović, A.; Petrović, M.; Beliga, S. Retweet Prediction Based on Heterogeneous Data Sources: The Combination of Text and Multilayer Network Features. Appl. Sci. 2022, 12, 11216.

Abstract

Retweet prediction is an important task related to problems such as information spreading analysis, automatic detection of fake news, and social media monitoring. In this study, we explore the possibilities of retweet prediction based on heterogeneous data sources. To classify a tweet according to its number of retweets, we combine features extracted from a multilayer network and from the text. More specifically, we introduce a multilayer framework that proposes a multilayer network representation of Twitter. This formalism captures different users' actions and complex relationships, as well as other key properties of communication on Twitter. We select a set of local network measures from each layer and construct a set of multilayer network features. In addition, we adopt a BERT-based language model, namely Cro-CoV-cseBERT, to capture the high-level semantics and structure of tweets as a set of text features. We then train six machine learning (ML) algorithms on the retweet prediction task: random forest, multilayer perceptron, light gradient boosting machine, category embedding model, neural oblivious decision ensembles, and attentive interpretable tabular learning model. We compare the performance of all six algorithms in three setups: (i) using only text features, (ii) using only multilayer network features, and (iii) using both sets of features. We evaluate all setups with standard evaluation measures, i.e., precision, recall, F1-score, and accuracy. For this task, we prepare and use an empirical dataset of 199,431 tweets in the Croatian language posted between January 1, 2020 and May 31, 2021. Our results indicate that integrating multilayer network features with text features yields a prediction model that performs better than one using just one set of features.
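
The three feature setups and the four evaluation measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `build_features` and `binary_metrics`, the toy feature values, and the two-class labelling are all assumptions made for illustration, since the abstract does not state how many retweet classes are used or which network measures are selected.

```python
from typing import Dict, List, Sequence


def build_features(text_vec: Sequence[float],
                   network_vec: Sequence[float],
                   setup: str) -> List[float]:
    """Assemble one tweet's feature vector for a given setup.

    setup is a hypothetical label: "text" (i), "network" (ii), or "both" (iii).
    """
    if setup == "text":
        return list(text_vec)
    if setup == "network":
        return list(network_vec)
    if setup == "both":
        # Setup (iii): concatenate text and multilayer network features.
        return list(text_vec) + list(network_vec)
    raise ValueError(f"unknown setup: {setup!r}")


def binary_metrics(y_true: Sequence[int],
                   y_pred: Sequence[int],
                   positive: int = 1) -> Dict[str, float]:
    """Precision, recall, F1-score and accuracy for a two-class task (assumed)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy values only; real text features would come from a language model
    # and real network features from per-layer local network measures.
    print(build_features([0.1, 0.2], [3.0], "both"))
    print(binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
```

In a comparison like the one described, the same classifier would be trained once per setup on the corresponding feature vectors, and the resulting predictions scored with the same four measures.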

Keywords

retweet prediction; multilayer network; natural language processing

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
