Submitted: 10 September 2024
Posted: 11 September 2024
Abstract
Keywords:
1. Introduction
2. Related Works
3. Methodology




- Support Vector Classifier (SVC): The SVC is the SVM variant used for classification problems. It represents data points in a multidimensional space and divides them into classes by finding the best separating hyperplane: a line in 2D space, or a hyperplane in 3D space and above, that separates the classes with the greatest margin.
- Support Vector Regression (SVR): Unlike SVC, SVR is used for regression problems. Instead of trying to fit the widest possible street between two classes while limiting margin violations, SVR tries to fit as many cases as possible on the street while limiting the number of cases that fall outside it. The width of the street is governed by the hyperparameter epsilon. With a kernel, SVR amounts to a linear regression in a high-dimensional (possibly infinite-dimensional) feature space.
- Logistic Regression: Logistic regression is a statistical model that uses a logistic function to model a binary dependent variable. In other words, the probability of an event is estimated by fitting the data to a logistic curve. The output can be interpreted as the probability that a given input belongs to a class (mostly reliable or mostly unreliable).
- reading and preparing the data;
- initialization and ML;
- data analysis.
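The three stages listed above could be sketched with scikit-learn roughly as follows. The class names `TfidfVectorizer`, `SVC`, `SVR`, and `LogisticRegression` match the option names used by the application. The toy in-memory corpus, the `build_model` and `evaluate` helpers, the linear kernels, and the 0.5 threshold that turns SVR output into a class label are assumptions made for illustration, not the authors' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, SVR

# Stage 1: reading and preparing the data.
# A toy corpus stands in for the CSV file (assumption).
# Label 1 = mostly reliable, label 0 = mostly unreliable.
reliable = [
    "the ministry published the official budget report on tuesday",
    "researchers released peer reviewed data on vaccine safety",
    "the court ruling was confirmed by public records",
    "the statistics office reported a drop in unemployment",
    "the senator voted for the bill according to the congressional record",
    "the agency released audited figures for the fiscal year",
]
unreliable = [
    "shocking secret cure that doctors do not want you to know",
    "anonymous post claims the election was decided by aliens",
    "miracle pill melts fat overnight without any effort",
    "viral rumor says the moon landing was staged in a basement",
    "you will not believe what this celebrity secretly did",
    "chain message warns that tap water controls your mind",
]
texts = reliable + unreliable
labels = [1] * len(reliable) + [0] * len(unreliable)
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)


# Stage 2: initialization and ML.
def build_model(name):
    """Return a TF-IDF + estimator pipeline (hypothetical helper)."""
    estimators = {
        "SVC": SVC(kernel="linear"),
        "SVR": SVR(kernel="linear", epsilon=0.1),
        "LogisticRegression": LogisticRegression(max_iter=1000),
    }
    return make_pipeline(TfidfVectorizer(), estimators[name])


# Stage 3: data analysis.
def evaluate(name):
    """Fit one model and score it on the held-out split."""
    model = build_model(name).fit(X_train, y_train)
    raw = model.predict(X_test)
    # SVR returns real values; thresholding at 0.5 is an assumption.
    pred = (raw >= 0.5).astype(int) if name == "SVR" else raw
    return {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, zero_division=0),
    }


results = {name: evaluate(name) for name in ("SVC", "SVR", "LogisticRegression")}
for name, metrics in results.items():
    print(name, metrics)
```

Wrapping the vectorizer and the estimator in one pipeline keeps the vocabulary fitted on the training split only, so the held-out texts do not leak into feature extraction.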
4. Results of the Experiments
- Learning Curve: Plots the number of training examples on the X-axis against the corresponding model score, such as accuracy or F1 score, on the Y-axis. The red curve represents the training score, while the green curve represents the cross-validation score. The chart shows how well the model learns as the number of training examples grows.
- Model Scalability: Plots the number of training examples on the X-axis against the time taken to fit the model on the training data on the Y-axis. It shows how the fit time changes with the size of the training set.
- Model Performance: This graph shows the relationship between the number of fits (X-axis) and the model performance (Y-axis), allowing us to explore how the model’s performance changes with different numbers of fits.
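One common way to obtain the data behind charts like these is scikit-learn's `learning_curve` with `return_times=True`; whether the authors used this function is an assumption. The sketch below uses a synthetic corpus and a TF-IDF plus logistic regression pipeline, both chosen only for illustration. It prints the series instead of plotting them: training-set size against training and cross-validation scores (learning curve), training-set size against fit time (scalability), and fit time against cross-validation score, which is one possible reading of the performance chart.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline

# Synthetic corpus (assumption): 20 texts with alternating labels,
# 1 = mostly reliable, 0 = mostly unreliable.
subjects = ["budget", "health", "election", "climate", "trade",
            "education", "energy", "housing", "transport", "defense"]
texts, labels = [], []
for subject in subjects:
    texts.append(f"official report confirms verified {subject} figures")
    labels.append(1)
    texts.append(f"shocking rumor reveals secret {subject} conspiracy")
    labels.append(0)
labels = np.array(labels)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation at three training-set sizes.
# return_times=True additionally returns fit and score times per fold.
train_sizes, train_scores, cv_scores, fit_times, _ = learning_curve(
    model,
    texts,
    labels,
    cv=5,
    train_sizes=np.linspace(0.5, 1.0, 3),
    scoring="accuracy",
    return_times=True,
)

# Average over the folds to get one point per training-set size.
rows = zip(
    train_sizes,
    train_scores.mean(axis=1),
    cv_scores.mean(axis=1),
    fit_times.mean(axis=1),
)
for n, train_score, cv_score, fit_time in rows:
    print(f"{n:>3} examples  train={train_score:.2f}  "
          f"cv={cv_score:.2f}  fit_time={fit_time:.4f}s")
```

Each returned score array has one row per training-set size and one column per fold, so the per-fold spread can also be drawn as a shaded band around the mean curves.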
5. Application Infrastructure
- -h, --help: Used to provide program usage information and describe available options.
- train -dataset_path ./data/factcheck.csv [-x text] [-y target] [-save_to ./result] [-model SVC] [-vectorizer TfidfVectorizer] [-kfold 10] [-test_size 0.2]: Primarily used to train ML models on the specified CSV-formatted data.
  - dataset_path: path to the dataset.
  - x: name of the column containing the input text. Default: “text”.
  - y: name of the column containing the output labels. Default: “target”.
  - save_to: path where the trained model file is saved. Default: the directory from which the program is started. Default model name: “model.mdl”.
  - model: the model to train. Three models are available: SVC, SVR, and LogisticRegression. Default: SVC.
  - vectorizer: the text vectorization approach. Three approaches are available: CountVectorizer, TfidfVectorizer, and HashingVectorizer. Default: TfidfVectorizer.
  - kfold: number of folds to use for cross-validation. Default: 1.
  - test_size: size of the test set. Default: 0.
- validate -model_path ./model.mdl -dataset_path ./data/factcheck.csv [-x text] [-y target] [-test_size 0.2]: Validates the model using the provided dataset.
  - model_path: path to the trained model.
  - dataset_path: path to the dataset.
  - x: name of the column containing the input text. Default: “text”.
  - y: name of the column containing the output labels. Default: “target”.
  - test_size: size of the test set. Default: 0.2.
- predict -model_path ./model.mdl -text “fake news text”: This may be used for prediction via a previously trained model or to extract information from a given text.
  - model_path: path to the trained model.
  - text: text for prediction.
- visualize -model_path ./model.mdl -text “fake news text” [-features 60] [-save_to ./result]: Generates an HTML visualization of model predictions for a given text input using LIME (local interpretable model-agnostic explanations).
  - model_path: path to the trained model.
  - text: text to predict.
  - features: the maximum number of tokens displayed in the table. Default: 40.
  - save_to: path where the rendered HTML result is saved. Default: “./results/1.html”.
- host -model_path ./model.mdl [-address 0.0.0.0] [-port 5000]: Can be used to host the trained model as a service on the specified port, enabling other systems or applications to take advantage of the model for prediction tasks.
  - model_path: path to the trained model.
  - address: IP address for the API host. Default: 0.0.0.0.
  - port: port for the API host. Default: 5000.
- /model/predict?text=here is text (GET method): Accepts the text to classify and returns the predicted result in JSON format.
- /model/visualize?text=here is text (GET method): Accepts the text to classify and returns an image with the prediction and the model explanation.
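The commands and endpoints described in this section could be declared with Python's standard `argparse` module roughly as follows. The option names and defaults are taken from the descriptions above. The program name `factcheck`, the `build_parser` and `endpoint_url` helpers, the `http` scheme, and the use of "." as the default `save_to` for training are assumptions, and the sketch only parses arguments without running any model.

```python
import argparse
from urllib.parse import urlencode

MODELS = ["SVC", "SVR", "LogisticRegression"]
VECTORIZERS = ["CountVectorizer", "TfidfVectorizer", "HashingVectorizer"]


def build_parser():
    """Declare the five sub-commands (hypothetical helper)."""
    # argparse adds -h/--help to the program and to every sub-command.
    parser = argparse.ArgumentParser(prog="factcheck")  # name is an assumption
    sub = parser.add_subparsers(dest="command", required=True)

    train = sub.add_parser("train", help="train a model on a CSV dataset")
    train.add_argument("-dataset_path", required=True)
    train.add_argument("-x", default="text")
    train.add_argument("-y", default="target")
    train.add_argument("-save_to", default=".")  # start directory (assumption)
    train.add_argument("-model", default="SVC", choices=MODELS)
    train.add_argument("-vectorizer", default="TfidfVectorizer",
                       choices=VECTORIZERS)
    train.add_argument("-kfold", type=int, default=1)
    train.add_argument("-test_size", type=float, default=0.0)

    validate = sub.add_parser("validate", help="validate a trained model")
    validate.add_argument("-model_path", required=True)
    validate.add_argument("-dataset_path", required=True)
    validate.add_argument("-x", default="text")
    validate.add_argument("-y", default="target")
    validate.add_argument("-test_size", type=float, default=0.2)

    predict = sub.add_parser("predict", help="predict the class of a text")
    predict.add_argument("-model_path", required=True)
    predict.add_argument("-text", required=True)

    visualize = sub.add_parser("visualize", help="render a LIME explanation")
    visualize.add_argument("-model_path", required=True)
    visualize.add_argument("-text", required=True)
    visualize.add_argument("-features", type=int, default=40)
    visualize.add_argument("-save_to", default="./results/1.html")

    host = sub.add_parser("host", help="serve the model over HTTP")
    host.add_argument("-model_path", required=True)
    host.add_argument("-address", default="0.0.0.0")
    host.add_argument("-port", type=int, default=5000)
    return parser


def endpoint_url(address, port, endpoint, text):
    """Build the GET URL for /model/predict or /model/visualize."""
    query = urlencode({"text": text})  # percent-encodes the text parameter
    return f"http://{address}:{port}/model/{endpoint}?{query}"


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["host", "-model_path", "./model.mdl"])
print(args)
print(endpoint_url(args.address, args.port, "predict", "here is text"))
```

Sub-parsers keep each command's options separate, so `-save_to` can carry a different default for `train` and `visualize` without conflict.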
6. Conclusion
References
- Aïmeur, E., Amri, S., Brassard, G. (2023). Fake News, Disinformation and Misinformation in Social Media: A Review. Soc. Netw. Anal. Min. 13(1): 30.
- Apejoye, A. (2015). Comparative Study of Social Media, Television and Newspapers’ News Credibility.
- Yuan, L., Jiang, H., Shen, H., Shi, L., Cheng, N. (2023). Sustainable Development of Information Dissemination: A Review of Current Fake News Detection Research and Practice. 458.
- Lazer, D. M. J. et al. (2018). The Science of Fake News. Science 359, 1094–1096. https://www.science.org/doi/10.1126/science.aao2998.
- Vlachos, A., Riedel, S. (2014). Fact Checking: Task Definition and Dataset Construction. ACL Workshop.
- Ribeiro, M. T., Singh, S., Guestrin, C. (2016). “Why Should I Trust You?” Explaining the Predictions of Any Classifier. KDD ’16.
- Conroy, N. K., Rubin, V. L., Chen, Y. (2015). Automatic Deception Detection: Methods for Finding Fake News. Proc. Assoc. Inf. Sci. Technol. 52, 1–4. https://doi.org/10.1002/pra2.2015.145052010082.
- Castillo, C., Mendoza, M., Poblete, B. (2011). Information Credibility on Twitter. In: Proceedings of the 20th International Conference on World Wide Web, pp 675–684.
- Khanam, Z., Alwasel, B. N., Sirafi, H., Rashid, M. Fake News Detection Using Machine Learning Approaches.
- Wang, W. Y. (2017). “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection. ACL ’17. Available at https://paperswithcode.com/dataset/liar.
- Zhou, X., Zafarani, R. (2020). A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities.
- Tang, D., Qin, B., Liu, T. (2015). Document Modeling with Gated Recurrent Neural Network for Sentiment Classification.
- Zhang et al. (2019). Simple RNN of Hidden Layer.
- Chen, Y., Cheng, Q., Cheng, Y., Yang, H. (2019). Applications of Recurrent Neural Networks in Environmental Factor Forecasting.
- Shu, K., Wang, S., Liu, H. (2019). Beyond News Contents: The Role of Social Context for Fake News Detection.
- Monti, F., Frasca, F., Eynard, D., Mannion, D., Bronstein, M. M. (2019). Fake News Detection on Social Media Using Geometric Deep Learning.
- Lupei, M., Mitsa, O., Sharkan, V., Vargha, S., Gorbachuk, V. (2022). The Identification of Mass Media by Text Based on the Analysis of Vocabulary Peculiarities Using Support Vector Machines. 2022 International Conference on Smart Information Systems and Technologies (SIST). DOI: 10.1109/sist54437.2022.9945774.
- Lupei, M., Mitsa, O., Sharkan, V., Vargha, S., Lupei, N. (2023). Analyzing Ukrainian Media Texts by Means of Support Vector Machines: Aspects of Language and Copyright.






| ID | Source | Sample | Features | Target |
| 1 | LIAR dataset | 10270 | Article Body | Mostly Reliable: 1; Mostly Unreliable: 0 |
| 2 | Politifact parser | 4472 | Article Title | Mostly Reliable: 1; Mostly Unreliable: 0 |
| 3 | Politifact parser | 23640 | Article Body | Mostly Reliable: 1; Mostly Unreliable: 0 |
| Dataset ID | K-fold | Model | F1 score | Accuracy |
| 1 | 5 | Logistic Regression | 0.67 | 0.593 |
| 1 | 5 | SVC | 0.696 | 0.595 |
| 1 | 5 | SVR* | 0.691 | 0.595 |
| 2 | 5 | Logistic Regression | 0.845 | 0.158 |
| 2 | 5 | SVC | 0.032 | 0.838 |
| 2 | 5 | SVR* | 0.045 | 0.838 |
| 3 | 5 | Logistic Regression | 0.822 | 0.78 |
| 3 | 5 | SVC | 0.883 | 0.793 |
| 3 | 5 | SVR* | 0.772 | 0.772 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).