Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Q8VaxStance: Dataset Labeling System for Stance Detection Towards Vaccines in Kuwaiti Dialect

Version 1: Received: 10 August 2023 / Approved: 11 August 2023 / Online: 14 August 2023 (09:00:24 CEST)

A peer-reviewed article of this Preprint also exists.

Alostad, H.; Dawiek, S.; Davulcu, H. Q8VaxStance: Dataset Labeling System for Stance Detection towards Vaccines in Kuwaiti Dialect. Big Data Cogn. Comput. 2023, 7, 151.

Abstract

The Kuwaiti dialect is a variety of Arabic spoken in Kuwait; it differs significantly from standard Arabic and from the dialects of neighboring countries in the same region. Few NLP research papers focusing on the Kuwaiti dialect have been published. In this study, we created Kuwaiti dialect language resources using Q8VaxStance, a vaccine stance labeling system for a large dataset of tweets. The resulting dataset helps fill this gap and provides a valuable resource for researchers studying vaccine hesitancy in Kuwait. It also contributes to the Arabic natural language processing field by providing a dataset for developing and evaluating machine learning models for stance detection in the Kuwaiti dialect. The proposed vaccine stance labeling system combines the benefits of weakly supervised learning and zero-shot learning; to evaluate it, we ran 52 experiments on 42,815 unlabeled tweets extracted between December 2020 and July 2022. The results show that using keyword detection in conjunction with zero-shot model labeling functions is significantly better than using only keyword detection labeling functions or only zero-shot model labeling functions. Furthermore, for the total number of generated labels, using Arabic in both the labels and the prompt, or a mix of Arabic labels and an English prompt, gives a statistically significant difference compared with using English in both the labels and the prompt. Finally, the best Macro-F1 values were obtained in experiments KHZSLF-EE4 and KHZSLF-EA1, both at 0.83. In terms of the total amount of automatically labeled data, experiment KHZSLF-EE4 labeled 42,270 tweets, while experiment KHZSLF-EA1 generated 42,764 labels.
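As an illustration of the idea described in the abstract, the following minimal Python sketch combines a keyword detection labeling function with a zero-shot style labeling function and reports the two evaluation metrics mentioned (number of generated labels and Macro-F1). It is not the authors' implementation: the keyword lists, the `toy_scorer` stand-in for a zero-shot model, the 0.7 threshold, and the simple majority vote used to aggregate votes are all assumptions made for this example, and English keywords are used here only for readability.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Label convention for this sketch (an assumption, not taken from the paper).
PRO, ANTI, ABSTAIN = 1, 0, -1

# Hypothetical keyword lists; a real system would use Kuwaiti-dialect terms.
PRO_KEYWORDS = ("get vaccinated", "vaccine works")
ANTI_KEYWORDS = ("no to vaccine", "vaccine is dangerous")


def keyword_lf(tweet: str) -> int:
    """Keyword detection labeling function: vote only when a keyword matches."""
    text = tweet.lower()
    if any(k in text for k in PRO_KEYWORDS):
        return PRO
    if any(k in text for k in ANTI_KEYWORDS):
        return ANTI
    return ABSTAIN


def make_zero_shot_lf(score_fn: Callable[[str], float],
                      threshold: float = 0.7) -> Callable[[str], int]:
    """Wrap a scorer (tweet -> probability of PRO) as a labeling function.

    In practice score_fn would call a zero-shot classifier with a prompt and
    candidate labels; here it is any callable, so the sketch runs offline.
    """
    def lf(tweet: str) -> int:
        p = score_fn(tweet)
        if p >= threshold:
            return PRO
        if p <= 1 - threshold:
            return ANTI
        return ABSTAIN
    return lf


def majority_vote(votes: Sequence[int]) -> int:
    """Aggregate votes, ignoring abstentions; ties and empty votes abstain."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def label_tweets(tweets: Sequence[str],
                 lfs: Sequence[Callable[[str], int]]) -> List[int]:
    """Apply every labeling function to every tweet and aggregate the votes."""
    return [majority_vote([lf(t) for lf in lfs]) for t in tweets]


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             classes: Sequence[int] = (PRO, ANTI)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def toy_scorer(tweet: str) -> float:
    """Hypothetical stand-in for a zero-shot model's PRO probability."""
    text = tweet.lower()
    if "protect" in text:
        return 0.9
    if "refuse" in text:
        return 0.1
    return 0.5


if __name__ == "__main__":
    tweets = [
        "Get vaccinated to protect your family",
        "I refuse, the vaccine is dangerous",
        "Traffic is bad today",
    ]
    labels = label_tweets(tweets, [keyword_lf, make_zero_shot_lf(toy_scorer)])
    labeled = sum(1 for label in labels if label != ABSTAIN)
    print("labels:", labels, "| tweets labeled:", labeled)
```

The design choice to let each labeling function abstain is what makes the "total number of generated labels" metric meaningful: adding a second, complementary labeling function can raise coverage on tweets that the keyword lists miss, while Macro-F1 on a small hand-labeled sample checks that the extra labels are not bought at the cost of quality.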

Keywords

Arabic NLP; Kuwaiti Dialect; Dataset Labeling; Stance Detection; Weak supervised learning; Zero-shot learning

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
