Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Ensemble-based Short Text Similarity: An Easy Approach for Multilingual Datasets using Transformers and WordNet in Real-world Scenarios

Version 1 : Received: 7 July 2023 / Approved: 10 July 2023 / Online: 12 July 2023 (03:02:18 CEST)

A peer-reviewed article of this Preprint also exists.

Gagliardi, I.; Artese, M.T. Ensemble-Based Short Text Similarity: An Easy Approach for Multilingual Datasets Using Transformers and WordNet in Real-World Scenarios. Big Data Cogn. Comput. 2023, 7, 158.

Abstract

When integrating data from different sources, problems arise from synonymy, differing languages, and concepts of different granularity. This paper proposes a simple but effective approach to evaluating the semantic similarity of short texts, especially keywords. The method matches keywords from different sources and languages by combining transformer-based and WordNet-based methods. Key features of the approach include an unsupervised pipeline, mitigation of the lack of context in keywords, scalability to large archives, support for multiple languages, and adaptability to real-world scenarios. The work aims to provide a versatile tool for different cultural heritage archives without requiring complex customization. The objectives of the paper are to explore different approaches to identifying similarities in 1-gram or n-gram tags, to evaluate and compare different pre-trained language models, and to define integrated methods that overcome their limitations. The proposed pipeline was validated through tests on the QueryLab portal, a search engine for cultural heritage archives.
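
As a rough illustration of the ensemble idea described in the abstract, the following minimal sketch averages the scores of several independent similarity functions and ranks candidate keywords by the result. It is not the authors' pipeline: the two scorers are toy stand-ins (a character n-gram overlap in place of a transformer embedding similarity, and a small hand-written synonym table, `TOY_SYNSETS`, in place of WordNet), and every function name here is hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A scorer maps a pair of keywords to a similarity in [0, 1].
Scorer = Callable[[str, str], float]


def char_ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Toy stand-in for an embedding-based scorer: Jaccard overlap of character n-grams."""
    def grams(s: str) -> set:
        s = s.lower().strip()
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


# Hypothetical multilingual synonym groups; a real system would query WordNet instead.
TOY_SYNSETS = [
    {"feast", "festival", "festa"},
    {"bread", "pane", "pain"},
]


def toy_lexicon_score(a: str, b: str) -> float:
    """Toy stand-in for a WordNet-based scorer: 1.0 if both words share a synonym group."""
    a, b = a.lower().strip(), b.lower().strip()
    if a == b:
        return 1.0
    return 1.0 if any(a in group and b in group for group in TOY_SYNSETS) else 0.0


def ensemble_score(a: str, b: str, scorers: Sequence[Scorer],
                   weights: Optional[Sequence[float]] = None) -> float:
    """Weighted mean of the individual scores (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(scorers)
    return sum(w * f(a, b) for f, w in zip(scorers, weights)) / sum(weights)


def best_matches(keyword: str, candidates: List[str], scorers: Sequence[Scorer],
                 top_k: int = 3) -> List[Tuple[str, float]]:
    """Rank candidate keywords by ensemble similarity to the query keyword."""
    scored = [(c, ensemble_score(keyword, c, scorers)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


if __name__ == "__main__":
    scorers = [char_ngram_jaccard, toy_lexicon_score]
    for candidate, score in best_matches("festa", ["festival", "pane", "mountain"], scorers):
        print(f"{candidate}: {score:.3f}")
```

Averaging is only one possible combination rule; a maximum or a vote over per-scorer top matches would fit the same interface, and the sketch makes no claim about which rule the paper adopts.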

Keywords

semantic textual similarity; pretrained language models; transformers; WordNet; QueryLab; ensemble methods

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
