González, F.; Torres-Ruiz, M.; Rivera-Torruco, G.; Chonona-Hernández, L.; Quintero , R. A Natural-Language-Processing-Based Method for the Clustering and Analysis of Movie Reviews and Classification by Genre. Mathematics2023, 11, 4735.
González, F.; Torres-Ruiz, M.; Rivera-Torruco, G.; Chonona-Hernández, L.; Quintero , R. A Natural-Language-Processing-Based Method for the Clustering and Analysis of Movie Reviews and Classification by Genre. Mathematics 2023, 11, 4735.
González, F.; Torres-Ruiz, M.; Rivera-Torruco, G.; Chonona-Hernández, L.; Quintero , R. A Natural-Language-Processing-Based Method for the Clustering and Analysis of Movie Reviews and Classification by Genre. Mathematics2023, 11, 4735.
González, F.; Torres-Ruiz, M.; Rivera-Torruco, G.; Chonona-Hernández, L.; Quintero , R. A Natural-Language-Processing-Based Method for the Clustering and Analysis of Movie Reviews and Classification by Genre. Mathematics 2023, 11, 4735.
Abstract
The large quantity of information retrieved from communities, public data repositories, web pages, or data mining can be sparsed and poorly classified. This work shows how to employ unsupervised classification algorithms such as K-means proper to classify user reviews into their closest category, forming a balanced data set. Moreover, we found that the text vectorization technique significantly impacts the clustering formation, comparing TF-IDF and Word2Vec. The value for mapping a cluster with movie genre was 81.34% ± 20.48 of the cases when the TF-IDF was applied, whereas Word2Vec only yielded a 53.51% ± 24.1. In addition, we highlight the impact of the removal of stop-words. Thus, we detected that pre-compiled lists are not the best method to remove stop-words before clustering because there is much ambiguity, centroids are poorly separated, and only 57% of clusters could match a movie genre. Thus, our proposed approach achieved a 94% of accuracy. After analyzing the classifiers’ results, we appreciated a similar effect when divided by the stop-words method removal. Statistically significant changes were observed, especially in precision metric and Jaccard scores in both classifiers, using custom-generated stop lists rather than pre-compiled ones. Reclassifying sparse data is strongly recommended as using custom-generated stop lists.
Keywords
Text document clustering; K-means; TF-IDF; NLP; Text vectorization; machine learning; movie reviews
Subject
Computer Science and Mathematics, Artificial Intelligence and Machine Learning
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.