Efficiencies of Feature Engineering in the Machine Learning approach for Fake News Classification

The rapid infiltration of fake news is a flaw in the otherwise valuable internet, a virtually global network that allows for the simultaneous exchange of information. While a common, and normally effective, approach to such classification tasks is to design a deep learning-based model, the subjectivity behind the writing and production of misleading news invalidates this technique. Deep learning models are unexplainable in nature, and because they lack the explicit features used in traditional machine learning, their results cannot be contextualized. This paper emphasizes the need for feature engineering to address this problem effectively: containing the spread of fake news at the source, not after it has become globally prevalent. Insights from the extracted features were used to manipulate the text, which was then tested on deep learning models. The study thereby shows the previously unknown yet substantial impact that these features have on deep learning models.

The different underlying motives of fake news are the essence of this research, so that its spread can be contained at the source, not after it has impacted elections or reached the opposite side of the world.

Related Work
Fake news has been around since the origins of language. In a similar format to its modern implications, individuals have spread false or merely exaggerated rumors, formally known as propaganda, to give themselves a higher ranking. Efforts to spread propaganda have only evolved into more resilient methods. Now with the global network that is the internet, virtually any person can state their opinion on a digital platform for millions of people to see, regardless of credibility or underlying motives.
Due to the longevity of this crisis, its evolution can be observed in full: from its role in ancient conflicts to its current daily presence on social media. This provides the opportunity to identify distinctive characteristics of fake news throughout history and test their validity on modern fake news. Some indicators include longer titles, more all-capitalized words, more proper nouns yet fewer total nouns, and fewer stop words [3]. Another supervised machine learning approach extracts features such as a simple bag-of-words, n-grams (contiguous sequences of words), and term frequency, specifically tf-idf, which accounts for the significance of a term and not solely its frequency.
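The n-gram and tf-idf features mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library and one common, unsmoothed tf-idf variant; the function names and the toy corpus are assumptions made for illustration and are not taken from the cited work.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term.
    This is one common variant; libraries often add smoothing.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus, invented for illustration only.
corpus = [
    "the senate passed the bill".split(),
    "the shocking truth about the bill".split(),
    "the senate vote".split(),
]
bigrams = ngrams(corpus[0], 2)
scores = tf_idf(corpus)
```

In this toy corpus, a word that appears in every document ("the") receives a weight of zero, while a word confined to one document ("shocking") is weighted highest, which is the sense in which tf-idf captures significance rather than raw frequency.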
In summary [4], there are currently five main approaches used to classify fake news: language, topic-agnostic, machine learning, knowledge-based, and hybrid. The language approach considers the content by analyzing the structure of the language, grammar, and syntax, among others. The topic-agnostic approach looks at other features, such as the number of advertisements, longer headlines, and whether the author's name is included. Machine learning is the traditional statistics-based method of training on data to improve and fine-tune algorithms; examples include crowdsourcing, rumor identification, and the Twitter crawler [5]. The knowledge-based approach uses external sources to verify the validity of news, for instance expert-oriented, computational-oriented, and crowdsourcing-oriented fact checking. Lastly, the hybrid approach combines human and machine learning based methods.
Natural Language Processing (NLP) systems, a subset of the language approach, are used to understand, analyze, and quantify elements of subjective human language. It has been observed that Deep Learning (DL), a subset of Machine Learning (ML), is an approach for various NLP-related tasks [6], including fake news detection. Specifically, common DL approaches are NN, CNN, and RNN, with CNN having an apparently fast performance and "functional evidence for representational learning and feature extraction." Overall, DL eliminates the need for background knowledge as well as the demands of human engineering, which has made it a popular method for fake news detection. One study [7] reinforces the supremacy of DL classification; it compares three DL models, Long Short-Term Memory (LSTM), Neural Network with Keras (NN-KERAS), and Neural Network with TensorFlow (NN-TF), to two standard supervised ML models, Naive Bayes and Support Vector Machine (SVM). As expected, the DL models outperform significantly: LSTM in particular reaches an average accuracy rate of 94.21%, with others following closely, such as KNN with 92.99%. Another study [8] applied a combination of DL approaches, a joint CNN-LSTM (Convolutional Neural Network and LSTM) model. It uses the CNN as a "trainable feature detector" for the input text, resulting in powerful convolutional features. These are then passed to the LSTM, which generates a description of the content, allowing it to classify or map content onto its correct label. The highest accuracy was lower, at 72.25%, which could be credited to the small dataset; however, it was higher than that of the individual SVM and CNN models, validating the efficacy of combining DL methods. A similar approach [9], using a CNN-RNN on a much larger dataset, yields a 99% accuracy rate.
DL methods have undoubtedly brought higher accuracy rates, and although the above results are very impressive, they lack a crucial component that is only attainable through traditional feature engineering: the insight and contextualization of features. There are potential biases in fake news that DL accounts for, but because there are no explicitly determined features, the researcher is unable to understand the distinctive trends and loses the opportunity to find the motives of fake news. These hidden motives could shed light on how to eliminate fake news at its first appearance, instead of simply filtering it out after it eventually surfaces.

Dataset and Tools
For classification purposes, a simple pre-labeled dataset in which each text or article has a true/false label is sufficient. The dataset chosen, the ISOT dataset [10,11], is from the University of Victoria, Canada. It includes real sources from 2016-2017: articles labeled "true" were collected from Reuters.com, and "fake" texts came from a variety of sources flagged by PolitiFact (an American organization for fact-checking articles) and Wikipedia [12]. Below is the description of the data from which our features are extracted.
Alongside standard Python libraries (NumPy, pandas, Matplotlib), various additional packages were imported to access certain features: readability, emotion, sklearn, nltk, text2emotion, spacy. Other packages used to implement machine learning algorithms include XGBRegressor, Linear Regression, Logistic Regression, Random Forest Classifier, and sklearn.metrics (used to evaluate the accuracy of models).
As seen in Tables 2 and 3, the preliminary details were initially extracted from the texts: article title, text, type, and date of publication. The original preprocessing did not remove the incorrect punctuation and preexisting mistakes in fake news articles. As a result, deep learning methods could extract these details as biased features, even though proper punctuation is a minor factor. Many misleading articles are generated by the same well-respected news outlets that produce truthful articles, and automated grammar checks are applied to both types of articles, making this characteristic largely inconsistent and therefore unreliable.
While the data had this minimal organization, we applied the following additional steps to prepare it:
• To start, only the "true" texts had their source as the first word(s). This was extracted to create a separate "source" column as seen below. Since "fake" articles did not have a source, no feature could be extracted from this information; leaving it in the text would only harm other features, such as readability scores and tf-idf (term frequency).
• Additionally, a small but significant number of rows, mostly among the "fake" articles, were completely empty in the text column while still having an assigned title. Approximately 500 rows of this nature were dropped to prevent unnecessary errors later.
• Approximately 20% of the concatenated dataset was reserved to evaluate and fine-tune the algorithms modeled on the remaining 80%, the training dataset. After applying train_test_split(), the final training and testing datasets amount to 33817 and 10450, respectively.
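The three preparation steps above can be sketched roughly as follows. This is a standard-library illustration and not the authors' pipeline, which applied train_test_split(). The assumed prefix format ("CITY (Reuters) - ..."), the regular expression, the helper names, and the toy rows are all hypothetical.

```python
import random
import re

# Assumed prefix format, e.g. "WASHINGTON (Reuters) - ..."; the paper does not
# specify the exact delimiter, so this pattern is a guess for illustration.
SOURCE_PREFIX = re.compile(r"^\s*(?P<source>[^-]{1,80}?\(Reuters\))\s*-\s*")


def split_source(text):
    """Return (source, body); source is None when no prefix is found."""
    match = SOURCE_PREFIX.match(text)
    if match is None:
        return None, text
    return match.group("source").strip(), text[match.end():]


def prepare(rows, test_fraction=0.2, seed=0):
    """Drop rows with empty text, strip source prefixes, split train/test."""
    cleaned = []
    for row in rows:
        if not row["text"].strip():
            continue  # a title is present but the article text is empty
        source, body = split_source(row["text"])
        cleaned.append({**row, "source": source, "text": body})
    random.Random(seed).shuffle(cleaned)  # seeded for reproducibility
    n_test = round(len(cleaned) * test_fraction)
    return cleaned[n_test:], cleaned[:n_test]


# Toy rows, invented for illustration only.
rows = [
    {"title": "t1", "text": "WASHINGTON (Reuters) - The Senate voted.", "label": "true"},
    {"title": "t2", "text": "LONDON (Reuters) - Markets rose.", "label": "true"},
    {"title": "t3", "text": "You will not believe this.", "label": "fake"},
    {"title": "t4", "text": "   ", "label": "fake"},
    {"title": "t5", "text": "Shocking claim goes viral.", "label": "fake"},
    {"title": "t6", "text": "Another viral story.", "label": "fake"},
]
train, test = prepare(rows)
```

Moving the source into its own column keeps a label-revealing token out of the text, so that later text-based features are not trivially biased by it.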
Other characteristics, somewhat distinct to either true or false articles, include the subject (World-News, Political-News, Government-News, Middle east, US News, left-news, politics, News), length of the article, and presence of a source. If any of these elements remain, there is potential for biased features and, ultimately, a biased, inaccurate model only viable for this specific dataset.
Stop words, often defined as the most common words of a language, are filtered out because they provide low-level information [13] that could hinder the focus on more important information, such as in the calculation of readability scores. Horne and Adali [3] introduce the idea that fake news has significantly fewer stop words; a potential explanation is that fake news writers "are attempting to squeeze as much substance into the titles", which is necessary to create the desired reaction. A reader, for instance, is overwhelmed by this "substance", possibly negative details regarding a politician, and is promptly convinced that the only "correct" reaction is to oppose said politician. While the stop words provide little meaning, their abundance could reveal a distinction. Thus, there is value both in ridding the text of stop words and in leaving them included; two versions of the dataset, with and without stop words, were finalized.
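A minimal sketch of how the two dataset variants described above could be produced is given below. The stop word set is a small illustrative subset and the tokenizer is a simple regular expression; both are assumptions, and a full list (for example the one shipped with nltk) would be used in practice.

```python
import re

# Small illustrative subset; NOT the full stop word list used in the study.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "that", "it"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def without_stop_words(text):
    """Return the text with every stop word removed."""
    return " ".join(t for t in tokenize(text) if t not in STOP_WORDS)


def both_variants(text):
    """Keep the original text and its stop-word-free counterpart side by side."""
    return {"with_stop_words": text, "without_stop_words": without_stop_words(text)}


variants = both_variants("The truth is out in the open")
```

Keeping both variants allows stop word abundance to be measured on the first while content-focused features are computed on the second.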

Feature Extraction
The following features were selected for analysis:
• Emotion. Because fake news is often written to incite an emotional reaction in the reader [14], it can be reasoned that quantifying the emotions of truthful and misleading text will yield different results. Ideally, truthful news should be as fact-based and literal as possible if it is to accurately report the newsworthy event. The following emotions (Happy, Angry, Surprise, Sad, Fear) were extracted from the datasets via the text2emotion Python package. These quantified values were then input into machine learning methods: Multiple Linear Regression, Logistic Regression, and XgBoost.
• Readability. Readability scores were calculated for each text to estimate the reading difficulty and other elements of complexity: Gunning Fog score and grade level; Flesch Kincaid score and grade level; Flesch score, ease, and grade level; Coleman Liau score and grade level; Dale Chall score and grade level; Ari score, grade level, and ages; Linsear Write score and grade level; Spache score and grade level; Smog score and grade level. Only numeric values were input into the standard models (the same as for emotion, above).
• General Features. These are features without package dependencies, including average sentence length and stop word count ratio. The former refers to the average number of words in a sentence of a particular text; the latter is the frequency of stop words in a text, normalized by the total word count of the text.
• Term Frequency. A list of all unique words across the entirety of the text was compiled. Then, each word's frequency was calculated to create a list of the 200 most frequent words.
• Named Entity Recognition. To classify each unique word, Named Entity Recognition (NER) [15] was applied. The frequency of each category within an article is extracted as a feature and is similarly input into the same models as above.
Accuracy is the percentage of correct predictions over total predictions. Feature Importance quantifies the usefulness of a feature in predicting a target variable.
The calculated readability scores are evaluated in Table 6. Most features did not have a substantial impact, with the exception of the Dale Chall, Ari, and Spache scores (bolded in Table 6). These features are separately reevaluated with XgBoost in Table 7 and yield a marginally higher accuracy.
In addition to the evaluations of Table 8, when applied to the combination of all emotion features, MLR yields a 58.97% accuracy, and a 58.49% accuracy on the text without stop words.
Table 9 depicts the general features that, together, produce a 76.44% accuracy. This exceeds any emotion feature accuracy in Table 8 and the accuracy of the cumulative readability scores in Table 6. However, once the top three weighted readability scores are evaluated separately from the others, they yield a higher 78.03% accuracy, indicating that several scores in Table 6 were counterproductive and were harming the prediction.
After combining the top-yielding features of each category (NER, General Features, and Readability Scores), a new highest accuracy of approximately 88.23% was achieved via XgBoost (Table 10).
The most important features are bolded. Additional hyperparameter tuning of the XgBoost model settled on the following parameter values: max_depth = 7 and alpha = 5.
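Several of the package-free features described in this section can be sketched with the standard library alone. The sketch below covers average sentence length, stop word count ratio, the most frequent terms, and the Automated Readability Index (ARI), one of the three readability scores found most useful above. The sentence splitter, tokenizer, stop word subset, and function names are simplifying assumptions; the study itself relied on dedicated packages for readability scores.

```python
import re
from collections import Counter

# Small illustrative subset; NOT the full stop word list used in the study.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "that", "it"}


def sentences(text):
    """Naive sentence split on ., ! and ? (an assumption, not a real tokenizer)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def words(text):
    """Extract word tokens made of letters, digits, and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)


def average_sentence_length(text):
    """Average number of words per sentence."""
    sents = sentences(text)
    return len(words(text)) / len(sents) if sents else 0.0


def stop_word_ratio(text):
    """Number of stop words divided by the total word count."""
    tokens = [w.lower() for w in words(text)]
    return sum(t in STOP_WORDS for t in tokens) / len(tokens) if tokens else 0.0


def automated_readability_index(text):
    """ARI = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43."""
    tokens, sents = words(text), sentences(text)
    if not tokens or not sents:
        return 0.0
    chars = sum(c.isalnum() for t in tokens for c in t)  # letters and digits only
    return 4.71 * chars / len(tokens) + 0.5 * len(tokens) / len(sents) - 21.43


def top_terms(texts, k=200):
    """Return the k most frequent lowercase words across all texts."""
    counts = Counter(w.lower() for t in texts for w in words(t))
    return [term for term, _ in counts.most_common(k)]


sample = "The cat sat. The dog ran fast!"
features = {
    "avg_sentence_length": average_sentence_length(sample),
    "stop_word_ratio": stop_word_ratio(sample),
    "ari": automated_readability_index(sample),
}
```

Each function returns a single number per text, so the outputs can be placed in columns and passed to any of the models named above.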

Analysis of Results
Since pronouns are exempt from the "person" label given by NER, if every name were changed to its respective pronoun (he, she, they), the text would still be grammatically correct even if it loses some clarity. To quantify the impact of this feature, a publicly available, fully developed Long Short-Term Memory (LSTM) deep learning model built specifically for classifying fake news was tested [16]. To create a "no-name" dataset on a large scale, all entities labeled "person" in each article were extracted from the original, preprocessed dataset and replaced with the pronoun "they". Both the normal and "no-name" datasets were input to the LSTM model to observe the impact of this one feature on a DL model. The LSTM model has a sense of memory, meaning it can learn the order dependence and context required to make predictions. Therefore, only a minimal decline in accuracy was expected between the original and "no-name" datasets. However, the LSTM did show a notable change in the prediction distributions, shown in Figure 3. While the overall averages of the predictions on the original and "no-name" test data were similar (0.4738337 and 0.4736906), the plotted distributions of the predictions depict less certainty on the "no-name" test data.
Another publicly available model developed to classify fake news was tested with the "no-name" test dataset [17]. It uses a Recurrent Neural Network (RNN) algorithm and is similar to the prior LSTM model in that it is also deep learning based. Again, there are no explicit features that the model is trained on, leaving little insight into the behavior of the system. When tested on randomly selected texts, the model was successfully tricked into classifying the input as true when it was, in fact, fake.
Specifically, an original fake article was input into the RNN model and received the correct prediction, "false". Then, when the same article, taken from the "no-name" dataset, was input, the opposite prediction resulted: the model classified it as "true." In Figure 1, there are isolated words that are disproportionately frequent in either the true or the false class. For instance, there is an abundance of names in fake articles: "Donald", "Trump", "Hillary", "Clinton", "Obama". This could be traced back to the fact that an overwhelming majority of fake news is generated for political motives, such as the positive or negative hyperbolization of a politician's actions.
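The construction of the "no-name" text described in this section can be sketched as a span replacement. The sketch assumes that the character spans of entities labeled "person" have already been produced by an NER tagger; here they are hand-labeled for a made-up sentence. The function name and the example are hypothetical, and capitalization at the start of a sentence is deliberately not handled.

```python
def replace_persons(text, person_spans, pronoun="they"):
    """Replace each (start, end) character span labeled "person" with a pronoun.

    person_spans is assumed to come from an NER tagger; spans must not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(person_spans):
        pieces.append(text[cursor:start])  # text before the entity
        pieces.append(pronoun)             # the entity itself is dropped
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last entity
    return "".join(pieces)


# Hand-labeled toy spans standing in for real NER output.
sentence = "Jane Doe said John Roe would resign."
spans = [(0, 8), (14, 22)]
no_name = replace_persons(sentence, spans)
```

Because only the labeled spans are touched, the remaining wording of the article stays identical, which isolates the effect of names on a classifier's prediction.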

Discussion
Despite the original speculation that emotion is a significant aspect of misleading news, the emotion features produced the lowest prediction accuracies of all the features. The readability scores, however, were considerably more reliable features, which is plausible because misleading news is generated with the motive to deceive, not with the usual purpose to educate and inform. Therefore, lower readability, such as less clarity and more primitive vocabulary, sentence structure, and language, is a likely characteristic of fake news, which is reflected in the results. As for general features such as average sentence length and stop word count ratio, the idea that misleading news often tries to disguise itself as informative could justify the high accuracies. This follows the tendency of fake news articles to have longer, more informative titles: an effort to concentrate as much information as possible to distract from the fabricated parts. Since stop words provide little meaning, it can be reasoned that fake news authors naturally avoid stop words, as they are more concerned with the details provided than with the coherency and clarity that most news sources prioritize. However, this may not necessarily apply to misleading news spread through social media, which is still a valid type of fake news. NER provides insight into the most distinctive categories between fake and true, including Date, Person, Product, GPE, and Time. Most significantly, the person tag can be explained by the tendency to exaggerate politicians' actions. This is very commonly seen among the two most prominent political parties of the United States, Republican and Democratic. Thus, Figure 1 highlights words such as "Trump", "Hillary", "Clinton", "Obama", and "Republican". These are in accordance with the results achieved with the NER features.
When applying the "no-name" test datasets to other LSTM and RNN models, the very slight change in results with the LSTM is explained by its prominent "memory" aspect. This capability allows it to quickly adapt to new types of data. Since the RNN does not have this feature, it is prone to overlooking this variation in the data and is not able to adapt as well as the LSTM model. Thus, individual deceptions were successful against the RNN. However, the LSTM was still affected in its distribution of predictions. There are increased amounts in the 0.2-0.3 and 0.7-0.8 ranges, as pictured in Figure 3. This shows more uncertainty when predicting on the "no-name" test dataset than on the original test dataset. The "no-name" text is still mostly grammatically correct, yet it is an instance of unforeseen text that the model was not trained for. Especially in such a digital and rapidly changing world, creating a versatile model is required to tackle the fake news epidemic. This is best achieved by returning to feature engineering, because the context it provides is crucial to comprehending the different underlying motives, which will allow for a greater understanding of how best to solve the issue.
Figure 3: Prediction distribution on LSTM, original test vs. "no-name" test.
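The comparison of prediction distributions discussed above can be sketched by binning the model outputs into ten equal ranges and measuring how many fall away from 0 and 1. The prediction values below are invented purely to illustrate the computation and are not the study's outputs; the function names and the 0.2 and 0.8 thresholds are likewise assumptions.

```python
def prediction_histogram(predictions, n_bins=10):
    """Count predictions in equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in predictions:
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        counts[index] += 1
    return counts


def share_uncertain(predictions, low=0.2, high=0.8):
    """Fraction of predictions strictly between low and high."""
    return sum(low < p < high for p in predictions) / len(predictions)


# Invented toy outputs: both lists have nearly the same mean, yet differ in spread.
original_preds = [0.05, 0.05, 0.95, 0.95]
no_name_preds = [0.05, 0.25, 0.75, 0.95]

original_hist = prediction_histogram(original_preds)
no_name_hist = prediction_histogram(no_name_preds)
```

Two sets of predictions can share almost the same average while one of them places more mass in the middle bins, which is why the distribution, and not only the mean, is worth inspecting.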

Conclusion
The subjectivity of fake news makes it much more difficult to classify than medical data, for instance. Therefore, the heavy reliance on data-driven models using deep learning is a flawed approach to this issue. The deep learning models are not explainable in nature, making the feature engineering aspect of traditional machine learning a more appealing route. As per the technique described in this paper, understanding the origins of the feature-dependent models is vital to contextualizing the results and tackling this urgent issue effectively.