Preprint Article, Version 1. Preserved in Portico. This version is not peer-reviewed.

Leveraging Ensemble Method with Transformer for Robust Drug Use Detection on Twitter

Version 1 : Received: 21 July 2023 / Approved: 24 July 2023 / Online: 24 July 2023 (10:57:47 CEST)

How to cite: AlGhannam, R. G.; Ykhlef, M.; Al-Dossari, H. Leveraging Ensemble Method with Transformer for Robust Drug Use Detection on Twitter. Preprints 2023, 2023071577. https://doi.org/10.20944/preprints202307.1577.v1

Abstract

Social media platforms increasingly enable the propagation of content from groups related to drug use, posing risks for the wider population and, in particular, for individuals who are susceptible to drug use and drug addiction. Detecting drug use content on social media platforms is a priority for governments, technology companies, and drug law enforcement organizations. To counter this issue, various techniques have been developed to identify and promptly remove drug use content, while also blocking its creators from network access. In this paper, we introduce a manually annotated Twitter dataset comprising 156,521 tweets published between 2008 and 2022, compiled specifically for drug use detection. The dataset was annotated by several groups of expert annotators, who classified each tweet as either drug use or non-drug use. Exploratory data analysis was conducted to understand the dataset's characteristics. Various classification algorithms, including support vector machine (SVM), XGBoost, random forest (RF), naive Bayes (NB), long short-term memory (LSTM), and BERT, were applied to the dataset. Among the traditional machine learning models, SVM with term frequency-inverse document frequency (TF-IDF) features achieved the highest F1-score (0.9017). However, a BERT model whose textual features were concatenated with numerical and categorical features in an ensemble method surpassed the traditional models, attaining an F1-score of 0.9112. To facilitate future research and improve the accuracy of English-language online drug use classification, the dataset will be made publicly available.
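For readers unfamiliar with the two quantities the abstract relies on, the following is a minimal sketch of textbook TF-IDF weighting and the binary F1-score. It is not the authors' implementation: the abstract does not state which TF-IDF variant, tokenizer, or F1 averaging was used, so the whitespace tokenization, the `tf * log(N / df)` formula, and the function names `tfidf` and `f1_score` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumption: lowercase whitespace tokenization and unsmoothed IDF;
    the paper may use a different variant.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2PR / (P + R) for the positive class; 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In this sketch a term that occurs in every document receives weight zero, which is why TF-IDF tends to emphasize distinctive vocabulary over common words. The F1-score balances precision and recall, so it is less forgiving than plain accuracy when the drug use and non-drug use classes are imbalanced.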

Keywords

deep learning, transformer, drug use detection, exploratory data analysis, natural language processing.

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
