Stock Values and Earnings Call Transcripts: a Dataset Suitable for Sentiment Analysis

The dataset comprises a collection of earnings call transcripts, the related stock prices, and the related sector index. It contains a total of 188 transcripts, 11970 stock prices, and 1196 sector index values. All of these data originate from the period 2016-2020 and relate to the NASDAQ stock market. The data were collected using Yahoo Finance and Thomson Reuters Eikon: Yahoo Finance provided the daily stock prices and traded volume, while Thomson Reuters Eikon was the source of the earnings call transcripts. The dataset can be used as a benchmark for evaluating NLP techniques and machine learning algorithms, and for understanding their potential in financial applications. The dataset can also be expanded by extending the covered period and following a similar procedure.


Specific subject area
The specific subject area of this research is sentiment analysis. Sentiment analysis is a natural language processing (NLP) technique for determining the sentiment (positive or negative) expressed in data. NLP, in turn, is a field of research that investigates the ability of computers to understand and manipulate natural languages, such as English.
A crucial step of textual sentiment analysis is pre-processing the text documents. This pre-processing phase consists of multiple pre-processing techniques, whose effects were studied.

Type of data
Table, Text

How data were acquired
The stock values and sector index were acquired through Yahoo Finance (website). The earnings call transcripts were acquired through Thomson Reuters Eikon (software).

Parameters for data collection
The companies whose stock values and earnings call transcripts are included were selected on the condition that they are listed on NASDAQ. The date range for the stock values and earnings call transcripts is 2016-2020.

Description of data collection
The stock values were acquired by using Yahoo Finance. Yahoo Finance provides news, information, commentary, and reports on the subject of finance. This website lets users search for specific companies with its search bar. When entering a company such as "Apple Inc.", the website will direct the user to a summary of general financial information about the company.

Methods for data acquisition
All of the stock values and the sector index were acquired by utilizing the Yahoo Finance search bar.
Searching for a company such as Apple Inc. results in a summary of financial information about this company. The stock values and sector index within the dataset, however, are presented in the "Historical Data" tab. Selecting this tab, specifying the time period January 1st, 2016 to October 1st, 2020, and selecting "Apply" shows the data presented in this dataset. Lastly, selecting "Download" provides a CSV file containing all of these data.
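As an illustration of how such a downloaded CSV file could be read, the following is a minimal Python sketch using only the standard library. The column names (Date, Open, High, Low, Close, Adj Close, Volume) are an assumption about the layout of the Yahoo Finance export, the function name load_prices is hypothetical, and the sample rows are made-up values that do not come from the dataset.

```python
import csv
import io
from datetime import date

# Made-up sample rows; the header is an assumption about the export layout.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2016-01-04,100.0,102.0,99.0,101.5,101.5,1000000
2016-01-05,101.5,103.0,100.5,102.0,102.0,1200000
2020-10-02,110.0,112.0,109.0,111.0,111.0,900000
"""


def load_prices(csv_text, start=date(2016, 1, 1), end=date(2020, 10, 1)):
    """Return {date: closing price} for rows dated within [start, end]."""
    prices = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        day = date.fromisoformat(row["Date"])
        if start <= day <= end:
            prices[day] = float(row["Close"])
    return prices


prices = load_prices(SAMPLE_CSV)
for day in sorted(prices):
    print(day.isoformat(), prices[day])
```

The default date bounds mirror the period specified above, so the row dated after October 1st, 2020 is skipped. To read a file from disk instead, the file contents could be passed as csv_text.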
The earnings call transcripts were acquired through Thomson Reuters Eikon. Selecting the "advanced event search" option shows unfiltered financial information about many different sorts of events. Specifying the event type by selecting "Earnings Conference Call" will filter this information by only showing information about earnings calls. Additionally, selecting "Transcript" from the "Content Type" selector will show only earnings calls that can be provided together with a transcript of the earnings call. Lastly, specifying the company and time period will show a list with the earnings call transcripts contained in this dataset. For efficiency purposes, the save batch icon makes it possible to download this whole list of transcripts.

Results: Data Description
The folder named "Stock Values and Sector Index" in the dataset contains all of the CSV files that were acquired through the aforementioned method. These files consist of individual tables for each NASDAQ company and for the NASDAQ sector index. The folder is structured as portrayed in Table 1.

Determining positive and negative transcripts
Two formulas are used to determine whether a transcript is positive or negative. The first is the stock ratio, which has the following form:

stock ratio = (stock value one day after earnings call) / (stock value n days before earnings call)

The stock ratio expresses the relative change of the stock value: a ratio above 1 indicates an increase, and a ratio below 1 indicates a decrease. However, an increase in stock value does not by itself imply that the earnings call is positive, as there are other variables to consider. To factor in an additional variable called investor mood, a second formula is defined:

sector ratio = (sector value one day after earnings call) / (sector value n days before earnings call)

Here, sector refers to the NASDAQ Composite. The sector ratio is taken into account to capture the mood of the sector index. If the stock ratio is higher than the sector ratio, the earnings call is labeled positive. If not, the transcript is deemed negative.
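The labeling rule above can be sketched in Python as follows. This is a minimal illustration rather than the code used to build the dataset: the function names are hypothetical, the series are assumed to be lists of closing values aligned by trading day, "n days before" is assumed to mean n trading days, and the numbers in the example are made up.

```python
def ratio(value_after, value_before):
    """Value one day after the call divided by the value n days before it."""
    if value_before <= 0:
        raise ValueError("value_before must be positive")
    return value_after / value_before


def label_transcript(stock_before, stock_after, sector_before, sector_after):
    """Label a transcript by comparing the stock ratio with the sector ratio."""
    stock_ratio = ratio(stock_after, stock_before)
    sector_ratio = ratio(sector_after, sector_before)
    # A tie counts as negative, following the "if not" branch of the rule.
    return "positive" if stock_ratio > sector_ratio else "negative"


def label_from_series(stock, sector, call_index, n):
    """Apply the rule to two aligned series.

    Assumption: index i in both lists refers to the same trading day and
    call_index is the position of the earnings call day.
    """
    before, after = call_index - n, call_index + 1
    if before < 0 or after >= len(stock) or after >= len(sector):
        raise IndexError("series too short for the requested window")
    return label_transcript(stock[before], stock[after],
                            sector[before], sector[after])


# Made-up values: the stock rises from 101 to 108, the sector from 8010 to 8040.
stock = [100.0, 101.0, 102.0, 103.0, 108.0]
sector = [8000.0, 8010.0, 8020.0, 8030.0, 8040.0]
print(label_from_series(stock, sector, call_index=3, n=2))
```

In this example the stock ratio (108 / 101) exceeds the sector ratio (8040 / 8010), so the transcript would be labeled positive. Whether n counts trading days or calendar days is not specified above, so that choice is left to the user.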

Discussion
The collected dataset provides the following value:
- These data can prove useful, as they may help to further uncover the dynamics of correlational relationships between stock values and earnings call transcripts.
- The data can easily be expanded, e.g. by extending the date range. Additionally, the data are easy to use and readable by multiple programming languages.
- Both practitioners at companies and scholars can benefit from the use of these data. Companies and scholars typically rely on homemade datasets, which leads to discrepancies between studies. The adoption of a shared dataset for benchmarking analysis will promote a homogeneous evaluation of the results.
- The data have so far been used primarily with a limited number of NLP techniques and machine learning algorithms. Consequently, this dataset offers the possibility to explore different approaches.

Conclusions
In this preprint we presented a dataset designed for performing sentiment analysis in the stock market. Information regarding daily price and volume was collected using Yahoo Finance, while Thomson Reuters Eikon was used for collecting the earnings call transcripts. Details about the procedure have been described in the previous sections. The dataset contains 188 transcripts, 11970 stock prices, and 1196 sector index values. All of these data originate from the period 2016-2020 and relate to the NASDAQ stock market. The dataset can be used for developing and benchmarking NLP techniques and machine learning algorithms.