A Method Based on NLP for Twitter Spam Detection

Social networking applications such as Twitter have gained increasing significance in the socio-economic, political, religious and entertainment sectors. This, in turn, has led to an explosion of information in the social networking realm that can be useful and misleading at the same time. Spam detection addresses this problem by identifying irrelevant users and their data. However, existing research has so far focused primarily on user profile information, through activity detection and related techniques, which may underperform when the profiles exhibit temporal dependency, reflect the user's generated content poorly, etc. This is the primary motivation for this paper, which addresses the aforementioned problem by focusing on both profile information and content-based spam detection. To this end, this work delivers three significant contributions. Firstly, natural language processing (NLP) techniques are used extensively to create a new comprehensive dataset with a wide range of content-based features. Secondly, this dataset is fed into a customized state-of-the-art hybrid model built from a combination of machine learning and deep learning techniques. Extensive simulation-based analysis not only records over 98% accuracy but also establishes the practical applicability of this proposal by showing that modeling based on mixed profile and content-generated data detects spam better than either standalone approach. Finally, a novel methodology based on logistic regression is proposed and supported by analytical formulations. This allows the custom-built dataset to be analyzed and the probabilities that differentiate legitimate users from spammers to be obtained. The resulting mathematical outcome can then be used to predict user categories on any given dataset through appropriate parameter tuning. This makes our method a generic one capable of identifying and classifying different user categories.
Keywords: Twitter, Social Media, NLP, Tweet, User Categorization, Mathematical Framework


Introduction
Over the past few years, social media has brought drastic changes to socio-economic as well as organizational development. Facebook, Twitter and LinkedIn are the leading social media platforms; they enable users to interact with each other by sharing and consuming information and by building meaningful connections with people. Because Twitter data is freely available and voluminous, the features derived from it are effective for research in various domains such as spam detection, personality identification, sarcasm detection and event detection. This huge microblogging platform has more than 313 million monthly active users, who post around 350,000 tweets per minute, or around 500 million tweets per day [1]. Twitter is also infested by spammers who use it for private or organizational gain. Recent reports from Twitter indicate that around 9.9 million spammy Twitter accounts were identified per week. Twitter spam [2], also known as unsolicited tweets, contains malicious information and links. Spammers spread spam through various unfair means, such as replying to users in abusive and bold language to seek attention, posting hostile links, creating redundant profiles either manually or with automated tools, posting identical updates and trolling the latest links to catch attention. A few spam and non-spam accounts, along with their respective tweets, are listed in Table 1. Spam tweets often contain uniform resource locators (URLs) that link to adult content or to content unrelated to the context. Through these URLs, spammers redirect users to malicious sites that contain viruses. By using spoofing, they obtain the personal information of users. Spammers are also adopting technologies such as 'bots' [3][4], which automatically follow a massive number of users per day.
Researchers and Twitter's anti-spam team are collaboratively trying to restrict spammers at the user level as well as at the tweet level. Twitter has applied a number of restrictions in the recent past: it has suspended accounts that behave abnormally and reduced the number of accounts that a user can follow in a day. According to reports published in 2019 [5], a verified account can follow up to 1000 accounts per day, whereas an unverified account can follow up to 400. Moreover, an active Twitter user can follow up to 5000 accounts; to follow more, the account must itself reach a certain threshold of followers. Researchers, on the other hand, are applying various content-based, URL-based, graph-based and account-based methods for spam detection [6][7]. These methods have been further categorized into clustering, classification and hybrid problems. The various graph-based, account-based, content-based and URL-based features are listed in Figure 1. Broadly, spam detection has been performed either at the user level or at the tweet level. Although tweet-level detection can identify spam tweets in real time, it is difficult to capture user-level characteristics from a single tweet. In contrast, user-level detection provides more distinctive information about a particular user from the profile. User-level detection is thus better suited to distinguishing a legitimate user from a spammer and is therefore adopted in this paper.
In this paper, along with the account-based features, we propose four new groups of content-based features, namely stylistic, embedded, topic-word-based and hashtag-based features. The applicability of the proposed features has been verified with various machine learning, deep learning and hybrid frameworks. The main contributions of the paper are:
1. Creation of a more up-to-date dataset containing 1200 legitimate and 800 spam accounts.
2. In line with the previous contribution, application of various NLP methods for feature extraction. Techniques such as latent Dirichlet allocation (LDA) and latent semantic analysis (LSA) are used to create various derived features.
3. Exhaustive investigation of the applicability of the derived features through various machine learning, deep learning and hybrid methods. This establishes the feasibility and partial superiority of the proposed methodology over existing methods in the literature.
4. A novel methodology based on logistic regression, proposed and implemented on the given dataset. The subsequent analytical framework delivers the corresponding probabilities for legitimate and spam users and can further be used for user categorization on any given dataset.
The rest of the paper is structured as follows. Section 2 presents the literature survey, Section 3 describes dataset creation and preprocessing, Section 4 covers feature extraction, Section 5 presents the proposed model, and Section 6 analyzes the results and performs various comparisons. Section 7 then introduces and implements the proposed methodology for user categorization using a mathematical framework based on logistic regression. Finally, Section 8 concludes the paper.

Literature survey
In the Twitter environment, broadcasting an event requires identifying the group of users relevant to that event. Twitter spam is information or an event that is dispensable for a particular group of users. In a report, Nexgate [8] mentioned that, on average, at least one in every 200 social media posts is spam, and according to [9] approximately 15% of Twitter users are automated bots. The demand for various social media platforms is increasing rapidly. According to social media statistics [10], approximately one third of the global population will be connected to social media by 2020. Spam therefore remains an ongoing social threat that calls for detection and identification. This section describes research on various account-based, graph-based, content-based and URL-based techniques at the user level as well as the tweet level. In [11][12][13][14][15], the researchers used various account-based features described in Figure 1. However, in the current scenario, these account-based features can be easily fabricated. Among the graph-based methods, the authors of [16] estimated the relationship between sender and receiver, where the connectivity represents the strength of the connection. Their experimental results show that most spam comes from accounts that have little relation with their receivers. In another work, Alex Hai Wang [17] proposed a directed social graph model to identify follower and friend relationships, using four graph-based features, namely follower, friend, mutual friend and stranger. The proposed model records above 90% accuracy. Amleshwaram et al. [18] presented the CATS system, based on bait-oriented, behavioral-entropy-oriented, URL-based and content-entropy-based features. Their work provides low latency and a fast detection rate.
Considering the problem of their dynamic variation, it can be concluded that there is a lot of room for improvement. Improvements can be made effectively by embedding new features, by analyzing the content of as many tweets as possible from a particular user and by applying various hybrid and ensemble approaches.

Dataset creation and preprocessing
This section describes the data collection and preprocessing procedures. Two different datasets have been considered for the experiments. The first is the Social Honeypot (SHP) dataset [33], collected between Dec 30, 2009 and Aug 2, 2010, which contains 22,223 polluters and 19,276 legitimate users. The second is a manually created custom dataset containing 1200 legitimate and 800 spam users. As mentioned in the previous section, present-day spammers are more intelligent: they may pretend to be legitimate users and post spam tweets only after gaining acceptance from other users [31]. Therefore, to test the applicability of the derived features in the current scenario, the custom dataset was prepared in 2019. For the legitimate class, the tweets of different verified users were collected via the tweepy API. More than 2000 tweets per user were considered, with users belonging to different categories such as sports, business, politics and education. For the polluted class, a spammy keyword dictionary was applied. The dictionary, created by Snovio in 2019, contains more than 550 catchy spam trigger words. In the custom dataset, unverified Twitter accounts containing more than two spammy keywords are marked as polluted. The details of the custom dataset preparation are shown in Figure 2.
Profile age is selected as a feature because spammers generally create new accounts to replace their suspended accounts within a short interval of time, so a spam account is generally younger than a genuine account. Similarly, the reputation score, which is based on the following-follower relationship, creates a significant distinction among spam users, celebrity users and genuine users. For a genuine user the reputation score is less than one, for a celebrity user it is nearly equal to one and for a spam user it is nearly equal to zero.
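The labelling rule used for the custom dataset (verified accounts are legitimate; unverified accounts with more than two spammy keywords are polluted) can be sketched as follows. This is a minimal illustration, not the authors' code: the keyword set is a tiny hypothetical stand-in for the Snovio dictionary, the function names are invented, "more than two" is assumed to mean more than two distinct keywords, and unverified accounts below the threshold are assumed to be left unlabelled.

```python
import re

# Hypothetical stand-in for the Snovio spam-trigger dictionary
# (the dictionary described in the text has more than 550 entries).
SPAM_KEYWORDS = {"free", "winner", "click here", "cash bonus", "act now"}


def count_spam_keywords(tweets, keywords=SPAM_KEYWORDS):
    """Return how many distinct dictionary keywords occur in a user's tweets."""
    text = " ".join(tweets).lower()
    found = set()
    for keyword in keywords:
        # Word-boundary match, so "free" does not fire on "freedom".
        if re.search(r"\b" + re.escape(keyword) + r"\b", text):
            found.add(keyword)
    return len(found)


def label_account(verified, tweets, threshold=2):
    """Apply the labelling rule described in the text.

    Assumption: unverified accounts that do not exceed the threshold
    are returned as "unlabeled" rather than forced into either class.
    """
    if verified:
        return "legitimate"
    if count_spam_keywords(tweets) > threshold:
        return "polluted"
    return "unlabeled"
```

Counting distinct keywords rather than total occurrences keeps a single repeated word from marking an account as polluted; the text does not say which of the two the authors used.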
Tweet frequency is considered as the next feature because a spam user sends a large number of tweets within a short period of time, so the tweet frequency count is very high for spammers. The next two significant features are the length of the screen name and the length of the profile description: since spammers frequently generate new accounts, they do not provide proper information in their profiles.
Spammers generate an enormous number of tweets with repeated URLs for promotional purposes, and these tweets contain far fewer emoticons, stop words, hashtags and punctuation marks than those of genuine users. For pornographic users, the slang word count is very high.
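The content-level counts mentioned above (URLs, emoticons, stop words, hashtags, punctuation and slang words) could be computed along the following lines. This is a rough sketch under several assumptions: the stop-word and slang lists are small hypothetical placeholders, the emoticon pattern is an approximation that only covers common ASCII emoticons, and averaging the counts over a user's tweets is one possible aggregation, not necessarily the one used in the paper.

```python
import re
import string

# Hypothetical placeholder word lists; real lists would be far larger.
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "for", "on"}
SLANG_WORDS = {"lol", "omg", "wtf"}

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
# Approximate ASCII emoticons such as :) ;-) =D :P
EMOTICON_RE = re.compile(r"[:;=8][\-o\*']?[\)\]\(\[dDpP/\\]")


def stylistic_counts(tweet):
    """Count stylistic markers in a single tweet."""
    urls = URL_RE.findall(tweet)
    text = URL_RE.sub(" ", tweet)       # drop URLs so their punctuation is not counted
    emoticons = EMOTICON_RE.findall(text)
    text = EMOTICON_RE.sub(" ", text)   # drop emoticons for the same reason
    hashtags = HASHTAG_RE.findall(text)
    text = HASHTAG_RE.sub(" ", text)
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "urls": len(urls),
        "emoticons": len(emoticons),
        "hashtags": len(hashtags),
        "stop_words": sum(token in STOP_WORDS for token in tokens),
        "slang_words": sum(token in SLANG_WORDS for token in tokens),
        "punctuation": sum(ch in string.punctuation for ch in text),
    }


def user_stylistic_features(tweets):
    """Average each count over all tweets of one user (an assumed aggregation)."""
    if not tweets:
        return {}
    totals = {}
    for tweet in tweets:
        for name, value in stylistic_counts(tweet).items():
            totals[name] = totals.get(name, 0) + value
    return {name: total / len(tweets) for name, total in totals.items()}
```

URLs, emoticons and hashtags are stripped before punctuation is counted, so that characters such as `:`, `/` and `#` inside them do not inflate the punctuation count.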

Hashtag based feature
Hashtags are a type of metadata tag used on various social media platforms; they let users apply dynamic, user-generated tags that make it easy for others to find messages on a specific topic or phenomenon. The hashtag-based features capture the information contained in the hashtags. Latent semantic analysis (LSA), a popular natural language processing technique, is used here to extract the vector representation of the hashtags. LSA has been applied to every hashtag of a particular user, and a 25-dimensional vector has been prepared. LSA works in 3 steps.

1) Initially, a separate technique called term frequency-inverse document frequency (tf-idf) is used to find the frequency of each word in each document.
2) In the next step, singular value decomposition (SVD) is applied to the tf-idf matrix for dimensionality reduction.
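The two steps listed above can be written compactly as follows. The tf-idf weighting shown is one common variant and is an assumption, since the text does not specify which one is used; the symbols N (number of documents), df_t (number of documents containing term t), X, U, Σ, V and e_j (the j-th standard basis vector) are introduced here for illustration. The rank k = 25 is chosen to match the length of the hashtag vector; how the per-hashtag vectors are combined into one vector per user is not stated in this section.

```latex
% Step 1: tf-idf weight of term t in document d (assumed variant)
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{N}{\mathrm{df}_t}

% Step 2: SVD of the term-document matrix X \in \mathbb{R}^{m \times n}
% with entries w_{t,d}, truncated to the k largest singular values
X = U \Sigma V^{\top}, \qquad
X_k = U_k \Sigma_k V_k^{\top}, \qquad k = 25

% Resulting k-dimensional representation of document j
\hat{d}_j = \Sigma_k V_k^{\top} e_j \in \mathbb{R}^{k}
```

The truncation requires k to be no larger than the rank of X, so at least 25 distinct terms and 25 documents are needed for a full 25-dimensional representation.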

Word embedding based feature
Word embedding is a vector representation of text in which words with similar meanings have similar representations. In the present work, the word-embedding-based features are the vector representations of the similar words present in a tweet. GloVe and word2vec are two popular vector representation techniques used in

Topic word based feature
Topic words are the important keywords present in a particular document. In this work, the Latent Dirichlet Allocation (LDA) topic modeling technique is applied to identify the topic words present in a tweet.

Proposed model
The proposed model, as illustrated in Figure 3, comprises 4 primary stages: i) tweet extraction; ii) preprocessing; iii) feature extraction; iv) train-test split. Both the SHP and the custom dataset initially contain the raw tweets. In the preprocessing phase, all non-English tweets are removed. In the feature extraction phase, a 338-dimensional feature vector is prepared for training. Elements 1-8 contain the stylistic features and elements 9-33 represent the 25-dimensional hashtag feature. The next 200 elements are the embedded feature containing the GloVe output of the tweets. A 100-dimensional topic-word-based feature is added next by applying LDA followed by LSA on the tweets. Finally, 5 account-based features are considered.
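The layout of the feature vector can be checked with a short sketch: the segment sizes and their order are taken from the description above (8 + 25 + 200 + 100 + 5 = 338), while the segment names and the helper functions are illustrative and not part of the original work.

```python
# Segment sizes and order as described in the text; names are illustrative.
SEGMENTS = [
    ("stylistic", 8),
    ("hashtag", 25),
    ("embedding", 200),
    ("topic", 100),
    ("account", 5),
]


def feature_layout(segments=SEGMENTS):
    """Map each segment name to its 1-based, inclusive (start, end) positions."""
    layout = {}
    start = 1
    for name, size in segments:
        layout[name] = (start, start + size - 1)
        start += size
    return layout


def build_feature_vector(parts, segments=SEGMENTS):
    """Concatenate per-segment feature lists into one flat vector.

    `parts` maps each segment name to a sequence of numbers of the expected
    length; a ValueError is raised if any segment has the wrong size.
    """
    vector = []
    for name, size in segments:
        values = list(parts[name])
        if len(values) != size:
            raise ValueError(
                f"segment {name!r} has {len(values)} values, expected {size}"
            )
        vector.extend(float(value) for value in values)
    return vector
```

With these sizes, `feature_layout()` places the embedding segment at positions 34-233, the topic segment at 234-333 and the account features at 334-338, which agrees with the stated total length of 338.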