Submitted: 14 November 2023
Posted: 15 November 2023
Abstract
Keywords:
1. Introduction
- Machine learning modeling that uses meta-vectorizer and meta-classifier methods to determine the best model.
- Experiments with self-learning scenarios that apply thresholds of 0.6, 0.7, 0.8, and 0.9 to the proposed model, using labeled datasets with proportions of 5%, 10%, and 20%, and with 80% of the data used for training and the rest for testing.
- Experimental evaluation of the machine learning methods proposed in this study and their comparison based on the above-mentioned scenarios.
2. Materials and Methods
2.1. Datasets
- All elements of hate speech are needed, namely hate speech, non-hate speech, and even very negative hate speech.
- The authenticity of the videos and their contents is guaranteed because they are broadcast by official broadcasting institutions/channels.
- Public opinion in YouTube video comments is fair because there is no limit on message length, there is no censorship, and users can interact with each other through replies.
- Comments and conversations are free and unstructured and contain emoticons, punctuation, and special characters that reflect the sentiments and emotional state of the user.
- Some comments directly respond to the contents of the video, some comments respond to other comments (replies), and some reveal the user’s side.
- Comments contain elements of ambiguity, such as satire, polysemy, slang words, stop-words, and metaphors.
2.1.1. Population and Sampling
- Our initial observations found hate speech in YouTube video comments on the topics of the 2019 Indonesian presidential debate and Covid-19. This study assumes that the data are well-matched to our research objectives.
- Data from YouTube video comments are publicly available and can be downloaded free of charge.
- The data collection stage downloaded all comments from presidential debates 1 to 5, each broadcast in full by 2 official channels. In total, comments from 10 presidential debate videos and 5 news videos about Covid-19 were downloaded.
- Comments were also downloaded from official channel videos that do not feature the complete presidential debate but were deemed necessary because of their high number of views (more than 10,000), high number of comments (more than 1,000), and exciting topics.
2.2. Pre-processing
- Cleaning the text: removing unnecessary or irrelevant characters, such as punctuation marks or special characters, from the text [31].
- Tokenization: splitting the text into smaller units, called tokens, such as words or phrases [32].
- Removing stop-words: stop-words are common words that carry little meaning [33], such as “the” or “and”, and can be removed from the text to reduce its size and improve the performance of the NLP model.
- Stemming and lemmatization: techniques that reduce words to their base form [34], called the stem or lemma, which reduces the number of distinct word forms in the text and improves the model’s ability to understand the text.
- Part-of-speech tagging: identifying the parts of speech in the text, such as nouns or verbs, which can be useful for certain NLP tasks [35].
- Normalization: formatting the text consistently, which includes converting all words to lowercase to make it easier for the NLP model to process the text [36].
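The first steps of this list can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical, lowercasing is applied before cleaning so that one character class suffices, and stemming, lemmatization, and part-of-speech tagging are left out because they need language-specific resources.

```python
import re

# Hypothetical, tiny stop-word list; a real run would use a full list
# for the language of the comments.
STOP_WORDS = {"the", "and", "a", "is", "of", "to"}


def preprocess(comment):
    """Normalize, clean, tokenize, and drop stop-words from one comment."""
    text = comment.lower()                    # normalization: lowercase everything
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # cleaning: drop punctuation, emoticons, symbols
    tokens = text.split()                     # tokenization on whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


print(preprocess("The debate is GREAT!!! :) and the moderator..."))
```

Note that the character class above keeps only ASCII letters and digits, which is an assumption of this sketch; text with accented or non-Latin characters would need a wider pattern.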
2.3. Meta-Vectorization Based on Text Feature Extraction
2.3.1. Term Frequency-Inverse Document Frequency (TF-IDF)
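As a reminder of what a TF-IDF weight measures, the sketch below computes the common textbook form, term frequency multiplied by log(N / document frequency), over tokenized documents. It is an assumption that this unsmoothed variant matches the one used in the study; the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [["debate", "great"], ["debate", "bad", "bad"]]
weights = tf_idf(docs)
print(weights)
```

A term that occurs in every document ("debate" here) gets weight zero, while a term concentrated in one document ("bad") gets a high weight, which is the property that makes TF-IDF useful as a text feature.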
2.3.2. Word Embedding (Word2Vec)
2.4. Meta-Classification Using Machine Learning Algorithms
2.4.1. Support Vector Machine (SVM)
2.4.2. Decision Tree (DT)
2.4.3. K-Nearest Neighbors (KNN)
2.4.4. Naive Bayes (NB)
3. Results
3.1. Experimental Scenario Setup
3.2. Experimental Results of the Machine Learning-Based Approach
Listing 1. Text Auto-Annotation Based on Semi-Supervised and Self-Learning Approach
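The steps of Listing 1 are not reproduced in this copy, so the following is not the authors' procedure. It is a minimal sketch of a generic self-training loop, assuming a classifier that returns a label together with a confidence score: a pseudo-label is accepted only when the confidence reaches the threshold, and the model is retrained on the enlarged labeled set. The `TokenVoteClassifier`, its confidence measure, the `self_train` function, and the toy data are all hypothetical stand-ins for the SVM, DT, KNN, and NB models of Section 2.4.

```python
from collections import Counter


class TokenVoteClassifier:
    """Toy stand-in classifier (hypothetical): scores each label by how often
    its training documents contained the tokens of the input."""

    def fit(self, docs, labels):
        self.counts = {}
        for tokens, label in zip(docs, labels):
            self.counts.setdefault(label, Counter()).update(tokens)
        return self

    def predict_with_confidence(self, tokens):
        # Add-one smoothing keeps every label's score positive.
        scores = {
            label: 1 + sum(counter[t] for t in tokens)
            for label, counter in self.counts.items()
        }
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())


def self_train(labeled, unlabeled, threshold, max_rounds=10):
    """Move confidently predicted documents into the labeled set, round by round."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = TokenVoteClassifier()
    for _ in range(max_rounds):
        model.fit([doc for doc, _ in labeled], [label for _, label in labeled])
        remaining, accepted = [], 0
        for doc in pool:
            label, confidence = model.predict_with_confidence(doc)
            if confidence >= threshold:
                labeled.append((doc, label))  # accept the pseudo-label
                accepted += 1
            else:
                remaining.append(doc)         # keep for a later round
        pool = remaining
        if accepted == 0 or not pool:
            break                             # nothing new learned, or nothing left
    return labeled, pool


seed = [(["hate", "stupid"], "hs"), (["good", "debate"], "non-hs")]
unlabeled = [["stupid", "hate", "hate"], ["good", "good", "debate"], ["weather"]]
annotated, leftover = self_train(seed, unlabeled, threshold=0.7)
print(len(annotated), leftover)
```

In this sketch a higher threshold admits fewer but more reliable pseudo-labels per round; documents that never reach the threshold stay unlabeled.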
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A

Appendix B
| No | Labeled Data | Unlabeled Data | Threshold | SVM TF-IDF | DT TF-IDF | KNN TF-IDF | NB TF-IDF | SVM Word2Vec | DT Word2Vec | KNN Word2Vec | NB Word2Vec |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.2 | 0.8 | 0.6 | 0.793 | 0.856 | 0.594 | 0.812 | 0.823 | 0.728 | 0.827 | 0.766 |
| 2 | 0.2 | 0.8 | 0.7 | 0.782 | 0.861 | 0.601 | 0.804 | 0.858 | 0.753 | 0.866 | 0.833 |
| 3 | 0.2 | 0.8 | 0.8 | 0.828 | 0.891 | 0.613 | 0.793 | 0.888 | 0.775 | 0.88 | 0.861 |
| 4 | 0.2 | 0.8 | 0.9 | 0.853 | 0.895 | 0.593 | 0.801 | 0.9 | 0.796 | 0.893 | 0.865 |
| 5 | 0.1 | 0.9 | 0.6 | 0.756 | 0.915 | 0.721 | 0.786 | 0.849 | 0.824 | 0.859 | 0.791 |
| 6 | 0.1 | 0.9 | 0.7 | 0.764 | 0.915 | 0.761 | 0.786 | 0.919 | 0.845 | 0.93 | 0.862 |
| 7 | 0.1 | 0.9 | 0.8 | 0.821 | 0.907 | 0.751 | 0.803 | 0.896 | 0.803 | 0.902 | 0.804 |
| 8 | 0.1 | 0.9 | 0.9 | 0.87 | 0.911 | 0.743 | 0.784 | 0.899 | 0.797 | 0.905 | 0.828 |
| 9 | 0.05 | 0.95 | 0.6 | 0.76 | 0.946 | 0.711 | 0.756 | 0.888 | 0.885 | 0.896 | 0.847 |
| 10 | 0.05 | 0.95 | 0.7 | 0.858 | 0.946 | 0.711 | 0.745 | 0.94 | 0.878 | 0.95 | 0.818 |
| 11 | 0.05 | 0.95 | 0.8 | 0.901 | 0.949 | 0.739 | 0.768 | 0.967 | 0.885 | 0.966 | 0.825 |
| 12 | 0.05 | 0.95 | 0.9 | 0.934 | 0.971 | 0.74 | 0.759 | 0.968 | 0.934 | 0.969 | 0.876 |
Appendix C
| Scenario Comparison | SVM TF-IDF (%) | DT TF-IDF (%) | KNN TF-IDF (%) | NB TF-IDF (%) | SVM Word2Vec (%) | DT Word2Vec (%) | KNN Word2Vec (%) | NB Word2Vec (%) |
|---|---|---|---|---|---|---|---|---|
| 1-2 | -1.10 | 0.50 | 0.70 | -0.80 | 3.50 | 2.50 | 3.90 | 6.70 |
| 2-3 | 4.60 | 3.00 | 1.20 | -1.10 | 3.00 | 2.20 | 1.40 | 2.80 |
| 3-4 | 2.50 | 0.40 | -2.00 | 0.80 | 1.20 | 2.10 | 1.30 | 0.40 |
| 5-6 | 0.80 | 0.00 | 4.00 | 0.00 | 7.00 | 2.10 | 7.10 | 7.10 |
| 6-7 | 5.70 | -0.80 | -1.00 | 1.70 | -2.30 | -4.20 | -2.80 | -5.80 |
| 7-8 | 4.90 | 0.40 | -0.80 | -1.90 | 0.30 | -0.60 | 0.30 | 2.40 |
| 9-10 | 9.80 | 0.00 | 0.00 | -1.10 | 5.20 | -0.70 | 5.40 | -2.90 |
| 10-11 | 4.30 | 0.30 | 2.80 | 2.30 | 2.70 | 0.70 | 1.60 | 0.70 |
| 11-12 | 3.30 | 2.20 | 0.10 | -0.90 | 0.10 | 4.90 | 0.30 | 5.10 |
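The values in this table are consistent with the difference between the Appendix B scores of two consecutive scenarios, expressed in percentage points. The sketch below checks this for the row "1-2"; the names `COLUMNS`, `SCORES`, and `compare` are hypothetical, and only the two Appendix B rows needed for the check are copied in.

```python
# Column order as in Appendix B.
COLUMNS = ["SVM TF-IDF", "DT TF-IDF", "KNN TF-IDF", "NB TF-IDF",
           "SVM Word2Vec", "DT Word2Vec", "KNN Word2Vec", "NB Word2Vec"]

# Scores of scenarios 1 and 2, copied from Appendix B.
SCORES = {
    1: [0.793, 0.856, 0.594, 0.812, 0.823, 0.728, 0.827, 0.766],
    2: [0.782, 0.861, 0.601, 0.804, 0.858, 0.753, 0.866, 0.833],
}


def compare(first, second):
    """Change from scenario `first` to scenario `second`, in percentage points."""
    return {
        name: round((after - before) * 100, 2)
        for name, before, after in zip(COLUMNS, SCORES[first], SCORES[second])
    }


row_1_2 = compare(1, 2)
print(row_1_2)
```

A positive value means the higher threshold of the second scenario improved the score for that model and vectorizer combination; a negative value means it reduced it.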
References
1. Alrehili, A. Automatic Hate Speech Detection on Social Media: A Brief Survey. 2019 IEEE/ACS 16th International Conference on Computer Systems and Applications (AICCSA). IEEE, 2019.
2. Al-Makhadmeh, Z.; Tolba, A. Automatic hate speech detection using killer natural language processing optimizing ensemble deep learning approach. Computing 2019, 102, 501–522.
3. Rajman, M.; Besançon, R. Text Mining: Natural Language techniques and Text Mining applications. In Data Mining and Reverse Engineering; Springer US, 1998; pp. 50–64.
4. Fortuna, P.; Nunes, S. A Survey on Automatic Detection of Hate Speech in Text. ACM Computing Surveys 2018, 51, 1–30.
5. Cahyana, N.H.; Saifullah, S.; Fauziah, Y.; Aribowo, A.S.; Drezewski, R. Semi-supervised Text Annotation for Hate Speech Detection using K-Nearest Neighbors and Term Frequency-Inverse Document Frequency. International Journal of Advanced Computer Science and Applications 2022, 13.
6. Aman, S.; Szpakowicz, S. Identifying Expressions of Emotion in Text. In Text, Speech and Dialogue. TSD 2007; Springer Berlin Heidelberg, 2007; pp. 196–205.
7. Krouska, A.; Troussas, C.; Virvou, M. The effect of preprocessing techniques on Twitter sentiment analysis. 2016 7th International Conference on Information, Intelligence, Systems & Applications (IISA). IEEE, 2016.
8. Savigny, J.; Purwarianti, A. Emotion classification on youtube comments using word embedding. 2017 International Conference on Advanced Informatics, Concepts, Theory, and Applications (ICAICTA). IEEE, 2017.
9. Ningtyas, A.M.; Herwanto, G.B. The Influence of Negation Handling on Sentiment Analysis in Bahasa Indonesia. 2018 5th International Conference on Data and Software Engineering (ICoDSE). IEEE, 2018.
10. Mariel, W.C.F.; Mariyah, S.; Pramana, S. Sentiment analysis: a comparison of deep learning neural network algorithm with SVM and naïve Bayes for Indonesian text. Journal of Physics: Conference Series 2018, 971, 012049.
11. Mao, R.; Liu, Q.; He, K.; Li, W.; Cambria, E. The Biases of Pre-Trained Language Models: An Empirical Study on Prompt-Based Sentiment Analysis and Emotion Detection. IEEE Transactions on Affective Computing 2022, pp. 1–11.
12. Dashtipour, K.; Gogate, M.; Gelbukh, A.; Hussain, A. Extending persian sentiment lexicon with idiomatic expressions for sentiment analysis. Social Network Analysis and Mining 2021, 12.
13. Imran, A.S.; Yang, R.; Kastrati, Z.; Daudpota, S.M.; Shaikh, S. The impact of synthetic text generation for sentiment analysis using GAN based models. Egyptian Informatics Journal 2022, 23, 547–557.
14. Balli, C.; Guzel, M.S.; Bostanci, E.; Mishra, A. Sentimental Analysis of Twitter Users from Turkish Content with Natural Language Processing. Computational Intelligence and Neuroscience 2022, 2022, 1–17.
15. Jain, D.K.; Boyapati, P.; Venkatesh, J.; Prakash, M. An Intelligent Cognitive-Inspired Computing with Big Data Analytics Framework for Sentiment Analysis and Classification. Information Processing & Management 2022, 59, 102758.
16. Kabakus, A.T. A novel COVID-19 sentiment analysis in Turkish based on the combination of convolutional neural network and bidirectional long-short term memory on Twitter. Concurrency and Computation: Practice and Experience 2022, 34.
17. Al-Laith, A.; Shahbaz, M.; Alaskar, H.F.; Rehmat, A. AraSenCorpus: A Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text Corpus. Applied Sciences 2021, 11, 2434.
18. Balakrishnan, V.; Lok, P.Y.; Rahim, H.A. A semi-supervised approach in detecting sentiment and emotion based on digital payment reviews. The Journal of Supercomputing 2020, 77, 3795–3810.
19. Ibrohim, M.O.; Budi, I. Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter. Proceedings of the Third Workshop on Abusive Language Online. Association for Computational Linguistics, 2019.
20. Zhang, Z.; Robinson, D.; Tepper, J. Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network. In The Semantic Web; Springer International Publishing, 2018; pp. 745–760.
21. Davidson, T.; Warmsley, D.; Macy, M.; Weber, I. Automated Hate Speech Detection and the Problem of Offensive Language. Proceedings of the International AAAI Conference on Web and Social Media 2017, 11, 512–515.
22. Cahyani, D.E.; Patasik, I. Performance comparison of TF-IDF and Word2Vec models for emotion text classification. Bulletin of Electrical Engineering and Informatics 2021, 10, 2780–2788.
23. Abduljabbar, D.A.; Omar, N. Exam questions classification based on Bloom’s taxonomy cognitive level using classifiers combination. Journal of Theoretical and Applied Information Technology 2015, 78, 447–455.
24. Soliman, A.B.; Eissa, K.; El-Beltagy, S.R. AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP. Procedia Computer Science 2017, 117, 256–265.
25. Kumar, C.S.P.; Babu, L.D.D. Novel Text Preprocessing Framework for Sentiment Analysis. In Smart Intelligent Computing and Applications; Springer Singapore, 2018; pp. 309–317.
26. Ramachandran, D.; Parvathi, R. Analysis of Twitter Specific Preprocessing Technique for Tweets. Procedia Computer Science 2019, 165, 245–251.
27. Mohammed, M.; Omar, N. Question classification based on Bloom’s taxonomy cognitive domain using modified TF-IDF and word2vec. PLOS ONE 2020, 15, e0230442.
28. Babanejad, N.; Agrawal, A.; An, A.; Papagelis, M. A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020.
29. Albalawi, R.; Yeap, T.H.; Benyoucef, M. Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis. Frontiers in Artificial Intelligence 2020, 3.
30. Arora, M.; Kansal, V. Character level embedding with deep convolutional neural network for text normalization of unstructured data for Twitter sentiment analysis. Social Network Analysis and Mining 2019, 9.
31. Elgibreen, H.; Faisal, M.; Sulaiman, M.A.; Abdou, S.; Mekhtiche, M.A.; Moussa, A.M.; Alohali, Y.A.; Abdul, W.; Muhammad, G.; Rashwan, M.; Algabri, M. An Incremental Approach to Corpus Design and Construction: Application to a Large Contemporary Saudi Corpus. IEEE Access 2021, 9, 88405–88428.
32. Rai, A.; Borah, S. Study of Various Methods for Tokenization. In Applications of Internet of Things; Springer Singapore, 2020; pp. 193–200.
33. Manalu, S.R. Stop words in review summarization using TextRank. 2017 14th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON). IEEE, 2017.
34. Zeroual, I.; Lakhouaja, A. Arabic information retrieval: Stemming or lemmatization? 2017 Intelligent Systems and Computer Vision (ISCV). IEEE, 2017.
35. AlKhwiter, W.; Al-Twairesh, N. Part-of-speech tagging for Arabic tweets using CRF and Bi-LSTM. Computer Speech & Language 2021, 65, 101138.
36. Sharma, A.; Kumar, S. Ontology-based semantic retrieval of documents using Word2vec model. Data & Knowledge Engineering 2023, 144, 102110.
37. Liang, H.; Sun, X.; Sun, Y.; Gao, Y. Text feature extraction based on deep learning: a review. EURASIP Journal on Wireless Communications and Networking 2017, 2017.
38. Garouani, M.; Ahmad, A.; Bouneffa, M.; Hamlich, M.; Bourguin, G.; Lewandowski, A. Using meta-learning for automated algorithms selection and configuration: an experimental framework for industrial big data. Journal of Big Data 2022, 9.
39. Kamyab, M.; Liu, G.; Adjeisah, M. Attention-Based CNN and Bi-LSTM Model Based on TF-IDF and GloVe Word Embedding for Sentiment Analysis. Applied Sciences 2021, 11, 11255.
40. Saifullah, S.; Fauziyah, Y.; Aribowo, A.S. Comparison of machine learning for sentiment analysis in detecting anxiety based on social media data. Jurnal Informatika 2021, 15, 45.
41. Fauziah, Y.; Saifullah, S.; Aribowo, A.S. Design Text Mining for Anxiety Detection using Machine Learning based-on Social Media Data during COVID-19 pandemic. Proceeding of LPPM UPN “Veteran” Yogyakarta Conference Series 2020–Engineering and Science Series, 2020, pp. 253–261.
42. Capelle, M.; Hogenboom, F.; Hogenboom, A.; Frasincar, F. Semantic news recommendation using wordnet and bing similarities. Proceedings of the 28th Annual ACM Symposium on Applied Computing - SAC’13. ACM Press, 2013.
43. Sivakumar, S.; Videla, L.S.; Kumar, T.R.; Nagaraj, J.; Itnal, S.; Haritha, D. Review on Word2Vec Word Embedding Neural Net. 2020 International Conference on Smart Electronics and Communication (ICOSEC). IEEE, 2020.
44. Landgraf, A.J.; Bellay, J. word2vec Skip-Gram with Negative Sampling is a Weighted Logistic PCA, 2017.
45. Alkomah, F.; Ma, X. A Literature Review of Textual Hate Speech Detection Methods and Datasets. Information 2022, 13, 273.
46. Saifullah, S.; Drezewski, R. Non-Destructive Egg Fertility Detection in Incubation Using SVM Classifier Based on GLCM Parameters. Procedia Computer Science 2022, 207, 3248–3257. Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 26th International Conference KES2022.
47. Bansal, M.; Goyal, A.; Choudhary, A. A comparative analysis of K-Nearest Neighbor, Genetic, Support Vector Machine, Decision Tree, and Long Short Term Memory algorithms in machine learning. Decision Analytics Journal 2022, 3, 100071.
48. Kuchipudi, R.; Uddin, M.; Murthy, T.; Mirrudoddi, T.K.; Ahmed, M.; P, R.B. Android Malware Detection using Ensemble Learning. 2023 International Conference on Sustainable Computing and Smart Systems (ICSCSS). IEEE, 2023, pp. 297–302.
49. Degirmenci, A.; Karal, O. Efficient density and cluster based incremental outlier detection in data streams. Information Sciences 2022, 607, 901–920.
50. Kesarwani, A.; Chauhan, S.S.; Nair, A.R. Fake News Detection on Social Media using K-Nearest Neighbor Classifier. 2020 International Conference on Advances in Computing and Communication Engineering (ICACCE). IEEE, 2020, pp. 1–4.
51. Xu, S. Bayesian Naïve Bayes classifiers to text classification. Journal of Information Science 2018, 44, 48–59.
52. Mwaro, P.N.; Ogada, D.K.; Cheruiyot, P.W. Applicability of Naïve Bayes Model for Automatic Resume Classification. International Journal of Computer Applications Technology and Research 2020, 9, 257–264.
53. Zhang, F.; Fleyeh, H.; Wang, X.; Lu, M. Construction site accident analysis using text mining and natural language processing techniques. Automation in Construction 2019, 99, 238–248.
54. Saifullah, S.; Cahyana, N.H.; Fauziah, Y.; Aribowo, A.S.; Dwiyanto, F.A.; Drezewski, R. Text Annotation Automation for Hate Speech Detection using SVM-classifier based on Feature Extraction. International Conference on Advanced Research in Engineering and Technology, 2022.
55. Kocoń, J.; Figas, A.; Gruza, M.; Puchalska, D.; Kajdanowicz, T.; Kazienko, P. Offensive, aggressive, and hate speech analysis: From data-centric to human-centered approach. Information Processing & Management 2021, 58, 102643.
56. Maniruzzaman, M.; Rahman, M.J.; Ahammed, B.; Abedin, M.M. Classification and prediction of diabetes disease using machine learning paradigm. Health Information Science and Systems 2020, 8.
57. Machova, K.; Mach, M.; Vasilko, M. Comparison of Machine Learning and Sentiment Analysis in Detection of Suspicious Online Reviewers on Different Type of Data. Sensors 2021, 22, 155.





| No | Training Data (%) | Unlabeled Data (%) | Threshold |
|---|---|---|---|
| 1 | 20 | 80 | 0.6 |
| 2 | 20 | 80 | 0.7 |
| 3 | 20 | 80 | 0.8 |
| 4 | 20 | 80 | 0.9 |
| 5 | 10 | 90 | 0.6 |
| 6 | 10 | 90 | 0.7 |
| 7 | 10 | 90 | 0.8 |
| 8 | 10 | 90 | 0.9 |
| 9 | 5 | 95 | 0.6 |
| 10 | 5 | 95 | 0.7 |
| 11 | 5 | 95 | 0.8 |
| 12 | 5 | 95 | 0.9 |
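The twelve scenarios in this table form a full grid of three labeled-data proportions and four thresholds. A minimal sketch of how such a grid could be enumerated follows; the variable names and the dictionary layout are hypothetical and chosen only to mirror the table columns.

```python
from itertools import product

LABELED_PERCENTAGES = [20, 10, 5]   # "Training Data (%)" column
THRESHOLDS = [0.6, 0.7, 0.8, 0.9]   # "Threshold" column

# One dict per scenario, numbered in the same order as the table rows.
scenarios = [
    {
        "no": number,
        "training": labeled,
        "unlabeled": 100 - labeled,
        "threshold": threshold,
    }
    for number, (labeled, threshold) in enumerate(
        product(LABELED_PERCENTAGES, THRESHOLDS), start=1
    )
]

for scenario in scenarios:
    print(scenario)
```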
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
