Preprint Article · Version 1 · Preserved in Portico · This version is not peer-reviewed

Research on Multilingual News Clustering Based on Cross-Language Word Embeddings

Version 1: Received: 9 May 2023 / Approved: 15 May 2023 / Online: 15 May 2023 (07:29:01 CEST)

How to cite: Wu, L.; Li, R.; Lam, W. Research on Multilingual News Clustering Based on Cross-Language Word Embeddings. Preprints 2023, 2023050990. https://doi.org/10.20944/preprints202305.0990.v1

Abstract

News events emerge around the world incessantly, and grouping reports of the same event from different countries is of significant importance for public opinion control and intelligence gathering. Because news is highly diverse, relying solely on human translators would be costly and inefficient, while relying solely on machine translation systems would incur considerable overhead in invoking translation interfaces and storing translated texts. To address this issue, we focus on the clustering of cross-lingual news. Specifically, we represent each news article by combining a sentence-vector representation of its headline in a mixed semantic space with the topic probability distribution of its content. When training the cross-lingual model, we employ knowledge distillation to fit the two monolingual semantic spaces into one mixed semantic space. We abandon traditional static clustering methods such as K-Means and AGNES in favor of the incremental clustering algorithm Single-Pass, which we further modify to better suit cross-lingual news clustering. Our main contributions are as follows: (1) We adopt a standard English BERT as the teacher model and XLM-RoBERTa as the student model and, through knowledge distillation, train a cross-lingual model that can represent sentence-level Chinese and English texts. (2) We use the LDA topic model to represent a news article as a combination of a cross-lingual headline vector and a topic probability distribution over its content, introducing measures such as topic similarity to address the cross-lingual issue in content representation. (3) We adapt the Single-Pass clustering algorithm to the news setting: our optimizations include adjusting the distance measure between samples and clusters, adding a cluster-merging operation, and incorporating a news time parameter.
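
To make the distillation step concrete, the following is a minimal sketch, assuming the sentence-transformers library is used (the abstract does not specify an implementation): a monolingual English sentence encoder serves as the teacher, XLM-RoBERTa as the student, and the student is trained on Chinese-English parallel pairs to reproduce the teacher's English sentence vectors, so both languages land in one mixed semantic space. The teacher checkpoint name and the parallel-corpus path are illustrative placeholders.

    # Minimal knowledge-distillation sketch, assuming the sentence-transformers
    # library; the teacher checkpoint and the corpus path are placeholders,
    # not the authors' released setup.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, models, losses
    from sentence_transformers.datasets import ParallelSentencesDataset

    # Teacher: a monolingual English sentence encoder built on standard BERT.
    teacher = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

    # Student: XLM-RoBERTa with mean pooling, to be aligned with the teacher.
    xlmr = models.Transformer("xlm-roberta-base")
    pooling = models.Pooling(xlmr.get_word_embedding_dimension())
    student = SentenceTransformer(modules=[xlmr, pooling])

    # Parallel zh-en pairs (tab-separated, one pair per line). The teacher embeds
    # the English side; the student learns to reproduce that vector for both the
    # English and the Chinese sentence via an MSE loss.
    data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
    data.load_data("news-parallel-en-zh.tsv")   # placeholder path
    loader = DataLoader(data, shuffle=True, batch_size=32)
    loss = losses.MSELoss(model=student)

    student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=1000)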
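
Likewise, a sketch of the representation and the modified Single-Pass procedure: each article is a cross-lingual headline vector plus an LDA topic distribution, compared to cluster centroids with a weighted similarity, dampened when its publication time falls outside the cluster's recent window, with a cluster-merging pass afterwards. The weight alpha, both thresholds, the time window, and the use of cosine similarity between topic distributions are assumptions for illustration, not the paper's exact formulas.

    # Sketch of a Single-Pass variant for cross-lingual news clustering
    # (illustrative only; weights, thresholds, and the time penalty are assumed).
    import numpy as np

    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    class Cluster:
        def __init__(self, title_vec, topic_dist, ts):
            self.title_sum = np.array(title_vec, dtype=float)
            self.topic_sum = np.array(topic_dist, dtype=float)
            self.times = [ts]
            self.n = 1

        @property
        def title_centroid(self):
            return self.title_sum / self.n

        @property
        def topic_centroid(self):
            return self.topic_sum / self.n

        def add(self, title_vec, topic_dist, ts):
            self.title_sum += title_vec
            self.topic_sum += topic_dist
            self.times.append(ts)
            self.n += 1

    def doc_cluster_sim(title_vec, topic_dist, ts, cluster,
                        alpha=0.7, time_window_days=7.0):
        """Weighted headline/topic similarity to a cluster centroid,
        penalized when the article lies outside the cluster's time window."""
        sim = alpha * cos(title_vec, cluster.title_centroid) \
            + (1 - alpha) * cos(topic_dist, cluster.topic_centroid)
        gap = abs(ts - np.mean(cluster.times))      # gap in days
        if gap > time_window_days:
            sim *= time_window_days / gap           # simple time penalty (assumption)
        return sim

    def single_pass(docs, assign_th=0.6, merge_th=0.8):
        """docs: iterable of (title_vec, topic_dist, timestamp_in_days)."""
        clusters = []
        for title_vec, topic_dist, ts in docs:
            if clusters:
                sims = [doc_cluster_sim(title_vec, topic_dist, ts, c) for c in clusters]
                best = int(np.argmax(sims))
                if sims[best] >= assign_th:
                    clusters[best].add(title_vec, topic_dist, ts)
                    continue
            clusters.append(Cluster(title_vec, topic_dist, ts))

        # Final merge pass: fuse clusters whose headline centroids are highly
        # similar (the merge criterion here is an assumption).
        merged = []
        for c in clusters:
            for m in merged:
                if cos(c.title_centroid, m.title_centroid) >= merge_th:
                    m.title_sum += c.title_sum
                    m.topic_sum += c.topic_sum
                    m.times += c.times
                    m.n += c.n
                    break
            else:
                merged.append(c)
        return merged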

Keywords

news; cross-language word embedding; LDA model; text clustering

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
