Data Descriptor
Version 1
This version is not peer-reviewed
A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis
Version 1: Received: 22 March 2021 / Approved: 24 March 2021 / Online: 24 March 2021 (12:03:46 CET)
How to cite: Batra, R.; Kastrati, Z.; Imran, A.S.; Daudpota, S.M.; Ghafoor, A. A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis. Preprints 2021, 2021030572
Abstract
This article presents a dataset of tweets in the Urdu language. The dataset contains 1,140,824 tweets collected from Twitter for the months of September and October 2020. This large-scale corpus was produced by pre-processing the collected tweets: dropping the columns containing user information, retweet counts, and follower information; removing duplicate tweets; stripping unnecessary punctuation, links, symbols, and spaces; and finally extracting any emojis present in the tweet text. In the final dataset, each tweet record contains columns for the tweet id, the text, and the emoji extracted from the text together with its sentiment score. Emojis are extracted to validate machine learning models used for multilingual sentiment and behavior analysis. The extraction is performed by a Python script that searches each tweet for emojis from a list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
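The pre-processing and emoji lookup described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' script: the names `EMOJI_LEXICON`, `clean_text`, `extract_emojis`, and `build_records` are hypothetical, the three lexicon entries and their scores are placeholders standing in for the 751-emoji list, and the cleaning rules (removing links and collapsing whitespace only) are assumptions, since the abstract does not specify which punctuation or symbols count as unnecessary.

```python
import re

# Hypothetical stand-in for the 751-emoji list: emoji -> (description, score).
# Entries and scores are illustrative placeholders, not values from the dataset.
EMOJI_LEXICON = {
    "😂": ("face with tears of joy", 0.2),
    "❤": ("red heart", 0.7),
    "😭": ("loudly crying face", -0.1),
}

# Longest keys first so multi-code-point emojis would match before their prefixes.
EMOJI_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOJI_LEXICON, key=len, reverse=True))
)
URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove links and collapse repeated whitespace (assumed cleaning rules)."""
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def extract_emojis(text):
    """Return (emoji, description, score) for each lexicon emoji, in order of appearance."""
    return [(m.group(0), *EMOJI_LEXICON[m.group(0)]) for m in EMOJI_RE.finditer(text)]


def build_records(rows):
    """Turn (tweet_id, raw_text) pairs into cleaned, de-duplicated records.

    Assumption: duplicates are detected on the cleaned text, and only the
    first matching emoji is recorded for each tweet.
    """
    seen = set()
    records = []
    for tweet_id, raw_text in rows:
        text = clean_text(raw_text)
        if text in seen:
            continue  # skip duplicate tweets
        seen.add(text)
        matches = extract_emojis(text)
        emoji, description, score = matches[0] if matches else (None, None, None)
        records.append(
            {
                "tweet_id": tweet_id,
                "text": text,
                "emoji": emoji,
                "emoji_description": description,
                "sentiment_score": score,
            }
        )
    return records
```

Calling `build_records` on a list of `(tweet_id, text)` pairs yields one dictionary per unique tweet, with the emoji fields left as `None` when no lexicon emoji is found. How the published dataset handles tweets containing several emojis is not stated in the abstract, so the first-match rule here is only one possible choice.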
Supplementary and Associated Material
https://data.mendeley.com/datasets/rz3xg97rm5/1: Published Dataset
Keywords
Urdu Twitter dataset; Urdu natural language processing (NLP); Urdu text sentiments and emoticons
Subject
Computer Science and Mathematics, Algebra and Number Theory
Copyright: This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.