Working Paper, Data Descriptor, Version 1. This version is not peer-reviewed.

A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis

Version 1 : Received: 22 March 2021 / Approved: 24 March 2021 / Online: 24 March 2021 (12:03:46 CET)

How to cite: Batra, R.; Kastrati, Z.; Imran, A.S.; Daudpota, S.M.; Ghafoor, A. A Large-Scale Tweet Dataset for Urdu Text Sentiment Analysis. Preprints 2021, 2021030572

Abstract

This article presents a dataset of tweets in the Urdu language. The dataset contains 1,140,824 tweets from September and October 2020, collected from Twitter. This large-scale corpus was produced by pre-processing the collected tweets: removing columns containing user information, retweet counts, and follower information; removing duplicate tweets; removing unnecessary punctuation, links, symbols, and spaces; and finally extracting emojis if present in the tweet text. In the final dataset, each tweet record contains columns for the tweet id, the text, and any emoji extracted from the text together with a sentiment score. Emojis are extracted to validate machine learning models used for multilingual sentiment and behavior analysis. The extraction uses a Python script that searches each tweet for emojis from a list of the 751 most frequently used emojis. If an emoji is present in the text, a column with the emoji description and sentiment score is added.
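The pre-processing steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' script: the names `EMOJI_TABLE`, `extract_emojis`, `clean_text`, and `preprocess` are hypothetical, the three-entry table stands in for the 751-emoji list, and its descriptions and scores are placeholders rather than values from the dataset. Punctuation and symbol removal is left out because the abstract does not say which characters count as unnecessary, and deduplicating on the cleaned text is an assumption.

```python
import re

# Hypothetical stand-in for the 751-entry emoji list:
# emoji -> (description, sentiment score). Placeholder values only.
EMOJI_TABLE = {
    "\U0001F602": ("face with tears of joy", 0.2),
    "\u2764": ("red heart", 0.7),
    "\U0001F62D": ("loudly crying face", -0.1),
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")


def extract_emojis(text):
    """Return (emoji, description, score) for every listed emoji found in text."""
    return [(e, desc, score)
            for e, (desc, score) in EMOJI_TABLE.items()
            if e in text]


def clean_text(text):
    """Remove links and collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def preprocess(records):
    """Keep only id, cleaned text, and emoji info; drop duplicate tweets.

    `records` is an iterable of dicts with at least 'id' and 'text';
    any other keys (user, retweet, or follower fields) are discarded.
    """
    seen = set()
    result = []
    for rec in records:
        text = clean_text(rec["text"])
        if text in seen:  # assumption: duplicates are judged on cleaned text
            continue
        seen.add(text)
        result.append({"id": rec["id"],
                       "text": text,
                       "emojis": extract_emojis(text)})
    return result
```

Calling `preprocess` on a list of raw tweet dicts returns one dict per unique tweet with only the `id`, `text`, and `emojis` keys, which mirrors the column layout the abstract describes for the final dataset.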

Keywords

Urdu Twitter dataset; Urdu natural language processing (NLP); Urdu text sentiments and emoticons

Subject

Computer Science and Mathematics, Algebra and Number Theory
