Preprint Article Version 1 Preserved in Portico This version is not peer-reviewed

A Method Based on NLP for Twitter Spam Detection

Version 1 : Received: 25 July 2020 / Approved: 26 July 2020 / Online: 26 July 2020 (17:23:33 CEST)

How to cite: Chowdhury, R.; Das, K.G.; Saha, B.; Bandyopadhyay, S.K. A Method Based on NLP for Twitter Spam Detection. Preprints 2020, 2020070648 (doi: 10.20944/preprints202007.0648.v1).

Abstract

Social networking applications such as Twitter have gained increasing significance in the socio-economic, political, religious, and entertainment sectors. This growth has in turn produced an explosion of information on social networks that can be useful and misleading at the same time. Spam detection addresses this problem by identifying irrelevant users and their data. However, existing research has focused primarily on user profile information, using activity detection and related techniques that may underperform when profiles exhibit temporal dependency, reflect the user's generated content poorly, and so on. This motivates the present paper, which addresses these limitations by combining profile information with content-based spam detection. To this end, the work makes three contributions. First, Natural Language Processing (NLP) techniques are used extensively to create a new, comprehensive dataset with a wide range of content-based features. Second, this dataset is fed into a customized, state-of-the-art hybrid model built from a combination of machine learning and deep learning techniques. Extensive simulation-based analysis records over 98% accuracy and establishes the practical applicability of the proposal by showing that modeling on mixed profile and content-generated data detects spam better than either approach alone. Finally, a novel methodology based on logistic regression is proposed and supported by analytical formulations. This allows the custom-built dataset to be analyzed and yields probabilities that differentiate legitimate users from spammers. The resulting mathematical outcome can then be used to predict user categories in the future through appropriate parameter tuning for any given dataset, which makes our method generic and capable of identifying and classifying different user categories.
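
As an illustration of the logistic regression step mentioned above, the following minimal sketch shows how a fitted model could turn a feature vector into a spammer probability. It is not the authors' implementation: the feature names, the weights in WEIGHTS, the intercept BIAS, and the 0.5 threshold are all hypothetical values chosen for demonstration, and the paper's actual features and fitted parameters are not given in this abstract.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_probability(features, weights, bias):
    """Return P(spammer | features) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def classify(features, weights, bias, threshold=0.5):
    """Label a user by comparing the predicted probability to a threshold."""
    p = spam_probability(features, weights, bias)
    return "spammer" if p >= threshold else "legitimate"


# Hypothetical mixed features (assumed, not from the paper):
# [URLs per tweet, hashtag ratio, followers/following ratio]
WEIGHTS = [1.5, 0.8, -1.2]  # illustrative coefficients, not fitted values
BIAS = -1.0                 # illustrative intercept

if __name__ == "__main__":
    for user in ([2.0, 0.5, 0.1], [0.0, 0.1, 2.0]):
        print(user, round(spam_probability(user, WEIGHTS, BIAS), 3),
              classify(user, WEIGHTS, BIAS))
```

In practice the weights would be estimated from the labeled dataset, and the threshold is the kind of parameter that could be tuned for a given dataset.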

Subject Areas

Twitter; Social Media; NLP; Tweet; User Categorization; Mathematical Framework
