Preprint Article, Version 1 (preserved in Portico). This version is not peer-reviewed.

Integrating Text Classification in Topic Discovery with Semantic Embedding Models

These authors contributed equally to this work.
Version 1: Received: 11 May 2023 / Approved: 12 May 2023 / Online: 12 May 2023 (08:52:57 CEST)

How to cite: Lezama-Sánchez, A.L.; Tovar Vidal, M.; Reyes-Ortiz, J.A. Integrating Text Classification in Topic Discovery with Semantic Embedding Models. Preprints 2023, 2023050908. https://doi.org/10.20944/preprints202305.0908.v1

Abstract

Topic discovery is the task of finding the main ideas in large amounts of textual data. It indicates the recurring topics in the documents, providing an overview of the texts. Current topic discovery models receive the texts with or without Natural Language Processing pre-processing. This pre-processing consists of stopword removal, text cleaning, and normalization (lowercase conversion). A topic discovery model that receives texts with or without pre-processing generates general topics, since the input data consist of many uncategorized texts. General topics do not offer a detailed overview of the input texts, and manual text categorization is a time-consuming and tedious task. Accordingly, it is necessary to integrate an automatic text classification task into the topic discovery process to obtain specific topics whose top words contain relevant relationships based on class membership. Text classification analyzes the words that make up a document to decide which class or category it belongs to; integrating text classification before topic discovery therefore provides latent topics, depicted by their top words, with high coherence in each topic based on the previously obtained classes. Therefore, this paper presents an approach that integrates text classification into topic discovery from large amounts of English textual data, namely the 20-Newsgroup and Reuters corpora. The text classification is accomplished with a Convolutional Neural Network (CNN) incorporating three embedding models based on semantic relationships. Topic discovery over the categorized texts is carried out with the Latent Dirichlet Allocation (LDA), Probabilistic Latent Semantic Analysis (PLSA), and Latent Semantic Analysis (LSA) algorithms. An evaluation was performed based on the normalized topic coherence metric. The 20-Newsgroup corpus was classified, and twenty topics with ten top words were discovered for each class, obtaining a normalized topic coherence of 0.1723 with LDA, 0.1622 with LSA, and 0.1716 with PLSA. The Reuters corpus was also classified, obtaining a normalized topic coherence of 0.1441 when applying the LDA algorithm to obtain 20 topics for each class.
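
The classify-then-discover pipeline described in the abstract can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' implementation: it assumes the gensim library is available, replaces the CNN classifier and the semantic embedding models with a placeholder classify() function, and uses gensim's c_npmi measure as the normalized topic coherence.

    # A minimal, illustrative sketch of the classify-then-discover pipeline,
    # assuming the gensim library. The paper's CNN classifier and semantic
    # embedding models are replaced here by a placeholder classify() function;
    # class labels and parameters are assumptions for illustration only.
    from collections import defaultdict
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel, CoherenceModel

    def classify(tokens):
        # Placeholder for the CNN-based classifier described in the abstract.
        return "sci.space" if "orbit" in tokens else "misc"

    def topics_per_class(tokenized_docs, num_topics=20, top_words=10):
        # Group tokenized documents by their predicted class.
        by_class = defaultdict(list)
        for tokens in tokenized_docs:
            by_class[classify(tokens)].append(tokens)

        results = {}
        for label, class_docs in by_class.items():
            dictionary = Dictionary(class_docs)
            bow = [dictionary.doc2bow(doc) for doc in class_docs]
            # Discover topics within this class only (LDA shown; PLSA and LSA
            # would be applied analogously).
            lda = LdaModel(bow, id2word=dictionary, num_topics=num_topics,
                           passes=5, random_state=0)
            # Normalized topic coherence, here gensim's c_npmi measure.
            coherence = CoherenceModel(model=lda, texts=class_docs,
                                       dictionary=dictionary,
                                       coherence="c_npmi").get_coherence()
            results[label] = (lda.show_topics(num_topics=num_topics,
                                              num_words=top_words), coherence)
        return results

Running topics_per_class on a tokenized corpus would return, for each predicted class, twenty topics of ten top words together with their normalized coherence, mirroring the per-class evaluation reported above.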

Keywords

Deep Learning; Topic Discovery; Latent Dirichlet Allocation; Latent Semantic Analysis; Probabilistic Latent Semantic Analysis

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
