Shared Data Set for Free-Text Keystroke Dynamics Authentication Algorithms

Identifying or authenticating a computer user is a necessary step to keep networked systems secure and to prevent fraudulent users from accessing accounts. Keystroke dynamics authentication can be used as an additional authentication method. Keystroke dynamics involves in-depth analysis of how a user types on the keyboard, such as how long a key is pressed or the time between two consecutive keys. This field has seen continuous growth in scientific research: in the last five years alone, about 10,000 scientific papers have been published. One of the main problems facing researchers is the small number of public data sets that capture how users type on the keyboard. This paper aims to provide researchers with a data set that captures how 80 users typed free text on the keyboard. The data were collected in a single session via a web platform. The data set contains 410,633 key events collected over a total time interval of almost 24 hours. In similar research, most data sets consist of texts written by users in English; the language in which the users wrote for this research is Romanian. This paper also provides an extensive analysis of the collected data set and presents information relevant to analyzing it in future research.

conclusion and future works. The paper also has five appendices that contain details about the shared keystroke dynamics data set.

The evolution of research in the field
In the last 5 years alone, over 10,000 scientific papers have been published about keystroke dynamics. Survey papers have also been published, as keystroke dynamics biometrics has drawn intense research interest over the past couple of decades [26]. Table 1 lists the number of scientific papers in the field of "keystroke dynamics" and in the field of "free text keystroke dynamics". Figure 1(a) illustrates the growing interest in both fields [2]. The number of scientific publications was counted by searching for the two text sequences on scholar.google.com, filtered on 5-year intervals [2]. Papers that address the branch "free text keystroke dynamics" represent about half of the total, reaching about 5,000 papers published in the last 5 years [2]. Figure 1(b) shows a hierarchy chart of the volume of publications about "keystroke dynamics".

Development of the platform for the acquisition of the data
The first step in this research was to create a web platform for acquiring the input data. For this, the website at https://sites.google.com/view/cataliniapa was created, with a form that captures, besides the text typed by the users, the way they type on the keyboard. A program written in JavaScript captures the keystroke times. To make the collected information downloadable, a Google Sheet file was configured, and the information collected through the web form was transmitted to it using the platform https://api.apispreadsheets.com/. The platform for acquiring input data was completed and made functional by integrating the JavaScript script with the data transfer application for the Google Sheet file. The steps described can be followed in the graph in Figure 2.
Figure 2. Steps taken to create the platform for retrieving data on how users type.

Acquisition and initial processing of the data
The acquisition and initial processing of the input data went through the following steps. Data were collected from 80 volunteers using a web program written in JavaScript: through a form, both the keys typed on the keyboard and the times at which they were typed were recorded. The collected data were initially stored in a Google Sheet file via the https://api.apispreadsheets.com/ platform. With a program written in the C programming language, the data collected in the Google Sheet file were processed and transformed into key events with three columns: the first column is the key code of the key, the second column is 0 or 1 (0 represents a key press and 1 a key release), and the third column is the timestamp at which the key event occurred. The file in this form is the input file for the continuous authentication algorithm developed in this thesis using the keystroke dynamics method. The steps described above are summarized in the graph in Figure 3.
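The three-column key-event form described above can be read with a short script. The following is a minimal Python sketch, not the authors' C program: it assumes whitespace-separated columns and millisecond timestamps, and the sample rows and the names `KeyEvent`, `parse_events` and `hold_times` are hypothetical, introduced only for illustration.

```python
from typing import NamedTuple


class KeyEvent(NamedTuple):
    key_code: int      # first column: key code
    is_up: bool        # second column: 0 = key pressed, 1 = key released
    timestamp_ms: int  # third column: timestamp (milliseconds assumed)


def parse_events(text):
    """Parse whitespace-separated 'key_code flag timestamp' rows."""
    events = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        code, flag, ts = line.split()
        events.append(KeyEvent(int(code), flag == "1", int(ts)))
    return events


def hold_times(events):
    """Pair each key-down with the next key-up of the same key code.

    Returns (key_code, down_ms, up_ms) tuples in order of release.
    """
    pending = {}
    result = []
    for ev in events:
        if not ev.is_up:
            # Keep the first down time; repeated downs (auto-repeat) are ignored.
            pending.setdefault(ev.key_code, ev.timestamp_ms)
        elif ev.key_code in pending:
            result.append((ev.key_code, pending.pop(ev.key_code), ev.timestamp_ms))
    return result


# Made-up sample rows; the real file layout may differ in separators or units.
SAMPLE = """\
65 0 1000
65 1 1090
66 0 1150
67 0 1230
66 1 1260
67 1 1310
"""

keystrokes = hold_times(parse_events(SAMPLE))
```

Pairing by key code rather than by position keeps overlapping keystrokes (the second key pressed before the first is released) correctly matched.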

The platform for data acquisition
To conduct research in the field of keystroke dynamics biometrics, researchers need input data obtained from computer users in different real situations. The necessary data are the keys typed on the keyboard and the times at which they are pressed. The difference between the time when a key is released and the time when the same key is pressed is the keystroke time. Another important piece of information is the time between two keys: the difference between the time a key was released and the time the next key was pressed [7].
This information can only be obtained in a restricted or controlled environment, with the consent of those participating in the experiment. The agreement of the participants is necessary because, with access to this data, it is possible to reconstruct the original text that the user typed on the keyboard; if, for example, a user is monitored while sending e-mails or doing other activities, the information may be confidential.
For the purpose of the research, the authors developed their own environment to obtain data from volunteers: a web environment, written in JavaScript, that captures keys and typing times while the user completes a form on a web page [7]. The website was created on the sites.google.com platform and can be accessed at https://sites.google.com/view/cataliniapa.
To capture the keys and typing times, the authors created a web form through which users were invited to answer several generic questions. Each user was to write the text freely, without reproducing a specific predefined text. For each text box, a series of generic questions was formulated to guide the user toward a certain topic. The questions asked were about the weather, the ideal day or the educational system. For building the research database, the topic of the text is not relevant, only the way it is typed.
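The two timing features can be written compactly. The notation below is an assumption introduced here for illustration and does not appear in the source: for the i-th keystroke, let t_i^down and t_i^up be the press and release timestamps.

```latex
\begin{align}
  H_i &= t^{\mathrm{up}}_i - t^{\mathrm{down}}_i
      && \text{(keystroke time of key } i\text{)} \\
  F_i &= t^{\mathrm{down}}_{i+1} - t^{\mathrm{up}}_i
      && \text{(time between key } i \text{ and key } i+1\text{)}
\end{align}
```

As a made-up example, a key pressed at 1000 ms and released at 1090 ms has H = 90 ms; if the next key is pressed at 1150 ms, then F = 60 ms.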
The text written by users is in Romanian. Most datasets in the literature are texts captured from users who have written in English [7]. After all the fields in the form were completed, and before the captured data were sent, the participants' consent to the use of the data for scientific research was obtained. Figure 4 shows two of the questions on the form: the first is about the weather and the second about the ideal day. Each user was instructed to follow these rules when filling out the form:
"1. You have to write a free text about the subject managed by guidance questions;
2. You have to write a text of about 500 characters for each question (this means that all the lines in a text window should be filled);
3. Do not copy the answer from other sources;
4. You have to write the answer to the questions on the spot, without consulting external sources;
5. You have to write ideas fluently, as they come to mind;
6. Do not do other activities while completing the answer to the questions. The request is to allocate about 15 minutes to complete the form;
7. The written text must be in Romanian;
8. The written text should be as generic as possible, not personal;
9. The text should be written from a physical keyboard, computer or laptop, not a touchscreen device (not a phone or tablet);
10. Please take about 15 minutes to complete the form to answer all questions without being interrupted by other activities;"
In one of the questions on the form, users were asked to describe the scene in Figure 5 in as much detail as possible. The form included statistical questions about the user's age, gender and whether a computer or laptop keyboard was used. The statistical questions were followed by four questions for which the way of typing the answer was captured. The four questions asked in the form are presented in Table 2.
How did people live then? Why were they worried? How did they spend an ordinary day? (The painting from Figure 5)

Experiments and results
In order to compare the data set obtained in the present research with data sets from other research, we implemented a continuous authentication algorithm. The algorithm used as input the data collected from the 80 users. We used the Equal Error Rate (EER) to quantify the performance of the algorithm. The results obtained are comparable to the results obtained in similar research.
The distance between users was calculated using the information obtained from the di-graphs, building each user's pattern with a sample size of 1000 keys. Two distinct methods for calculating distances were used: the Manhattan distance and the A distance proposed by Gunetti & Picardi in [27].
With the Manhattan distance, the result obtained is EER = 13.89%. This performance was obtained by analyzing the 12 most common di-graphs. The distance was calculated using the total time of the di-graph.
With the A distance, the result obtained is EER = 6.55%. This performance was obtained by analyzing all the collected di-graphs. The distance was calculated using the total time of the di-graph.
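The source does not describe how the EER was computed, so the following is only a generic Python sketch of the metric. It assumes distance scores where smaller means more similar, sweeps every observed score as a decision threshold, and reports the error at the threshold where the false acceptance rate (FAR) and false rejection rate (FRR) are closest. The function name and the sample scores are hypothetical.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from two lists of distance scores.

    genuine:  distances between samples of the same user
    impostor: distances between samples of different users
    A sample is accepted when its distance is <= threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap = None
    eer = None
    for t in thresholds:
        frr = sum(d > t for d in genuine) / len(genuine)     # genuine rejected
        far = sum(d <= t for d in impostor) / len(impostor)  # impostor accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Made-up scores for illustration only.
genuine_scores = [0.1, 0.2, 0.3, 0.6]
impostor_scores = [0.4, 0.5, 0.7, 0.8]
eer = equal_error_rate(genuine_scores, impostor_scores)  # 0.25 for these scores
```

With few scores the FAR and FRR curves may not cross exactly, which is why the sketch averages the two rates at the closest point instead of requiring equality.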
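A Manhattan distance between two typing profiles can be sketched as follows. This is an illustration in Python under stated assumptions, not the authors' implementation: each profile is assumed to be a mapping from a di-graph to its mean duration in milliseconds, the distance is the plain sum of absolute differences over the selected di-graphs that both profiles contain, and all names and values are hypothetical.

```python
def manhattan_distance(profile_a, profile_b, digraphs):
    """Sum of absolute differences of mean di-graph durations.

    profile_a, profile_b: dict mapping di-graph -> mean duration (ms)
    digraphs: the di-graphs to compare, e.g. the most common ones
    """
    shared = [g for g in digraphs if g in profile_a and g in profile_b]
    if not shared:
        raise ValueError("the profiles share none of the selected di-graphs")
    return sum(abs(profile_a[g] - profile_b[g]) for g in shared)


# Made-up profiles restricted to three di-graphs for brevity.
user_a = {"DE": 210.0, "IN": 180.0, "RE": 250.0}
user_b = {"DE": 190.0, "IN": 200.0, "RE": 245.0}
distance = manhattan_distance(user_a, user_b, ["DE", "IN", "RE"])  # 45.0
```

Restricting the comparison to a fixed list of frequent di-graphs keeps the number of summed terms the same across user pairs, so the distances remain comparable.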
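The A distance can be sketched in a similar way. The Python example below follows a common reading of the measure attributed to Gunetti & Picardi: two durations of the same di-graph count as similar when the ratio of the larger to the smaller is at most a threshold t, and the distance is one minus the fraction of shared di-graphs that are similar. The threshold value 1.25, the treatment of equal durations as similar, and all names and sample values are assumptions, since the source does not state the parameters used.

```python
def a_distance(sample_a, sample_b, t=1.25):
    """A-style distance between two typing samples.

    sample_a, sample_b: dict mapping di-graph -> duration (ms, positive)
    t: similarity threshold on the ratio max/min (assumed value)
    """
    shared = set(sample_a) & set(sample_b)
    if not shared:
        raise ValueError("the samples share no di-graphs")
    similar = 0
    for g in shared:
        low, high = sorted((sample_a[g], sample_b[g]))
        if low > 0 and high / low <= t:
            similar += 1
    return 1 - similar / len(shared)


# Made-up samples: three shared di-graphs, two of them similar.
s1 = {"DE": 200, "IN": 150, "RE": 300, "TE": 100}
s2 = {"DE": 220, "IN": 210, "RE": 310, "AR": 120}
dist = a_distance(s1, s2)  # 1 - 2/3
```

Because the measure depends on ratios rather than absolute differences, it is less sensitive to a user typing uniformly faster or slower in one session.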

Discussion
Each of us has a rhythm, a certain speed, and a typing pattern that is formed over time and is unique. We can differentiate the users of a computer, and identify or authenticate them in a system, only by capturing these details. To analyze a user's typing pattern, we need to capture and process it using an algorithm.
To identify a user who is now in front of a computer, using a keyboard, his or her typing characteristics must already be stored in a database. The database is needed to compare the typing behavior captured live with the patterns of the users enrolled in the system, and thus to identify the user. In other words, the mode of operation is similar to username and password authentication: users enter their username and password, and the system looks them up in the database and compares what was entered with what was previously registered in order to make a decision.

Analysis of user information
The dataset contains keyboard typing data collected from 80 users. Of the 80 users, 35 reported being male and 44 female, while one user did not report gender. The users' ages range from 16 to 59 years, with an average of 28.19 years. Data were collected from users who used a computer or laptop keyboard. A total of 64 users used a laptop keyboard to complete the form, 15 used a computer keyboard, and one user did not state which keyboard was used. Detailed statistical data for each user can be found in Appendix A, Table A1.

Analysis of time and key events collected from users
The form created to acquire data for research purposes was completed by 80 users. They provided data for 410,633 key events [7]. The total time used by all 80 users to complete the form was 23 hours, 28 minutes and 19 seconds.
The average time spent by users on the data collection platform was 17 minutes and 36 seconds. Table A2 in Appendix B shows the completion time of the form for each user, as well as the average and the total time. The time is expressed both in milliseconds, in the second column of the table, and in minutes, in the third column. The fourth column shows the total number of key events collected from each of the 80 users. The total number of key events collected from all users is 410,633, an average of 5132 key events per user. Each key event contains the Key Code, a Down Event or Up Event flag, and the Time Stamp.
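The reported averages follow directly from the totals, which can be checked with a few lines of Python. The totals are taken from the text; the variable names are illustrative.

```python
TOTAL_KEY_EVENTS = 410_633
USERS = 80

# Total completion time: 23 hours, 28 minutes and 19 seconds.
total_seconds = 23 * 3600 + 28 * 60 + 19          # 84499 seconds

avg_seconds = total_seconds / USERS                # 1056.2375 seconds
avg_min, avg_sec = divmod(int(avg_seconds), 60)    # 17 minutes, 36 seconds

avg_events = TOTAL_KEY_EVENTS / USERS              # 5132.9125 events per user
```

The per-user average of about 5132.9 key events corresponds to the reported value of 5132 when the fractional part is dropped.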

Keys distribution analysis
A total of 100 different keys were monitored. The key pressed most often by users in the experiment was the SPACE key, pressed 32,387 times in total, which is 16.17% of all keys pressed. The next three most frequently used keys are the vowels A, E and I. The A key was pressed 20,965 times (10.47% of the total), the E key 18,256 times (9.11%), and the I key 15,994 times (7.99%). The BACKSPACE key was also pressed frequently, 12,195 times.
Table A3 in Appendix C lists all the keys pressed by users in order of their frequency in the collected data set. The 30 most common keys are represented graphically in Figure 6; they account for 98.73% of the keys used. Compared with studies of character usage in Romanian, the collected database respects the general rules and accurately reproduces the general characteristics of the Romanian language. According to the study in [28], the most used consonants in Romanian are R and T, while the least used are X and J, apart from the letters K, Q, W and Y, which are not specific to the language. The data set follows these rules: the most used consonants are T and R, and the least used are J, X, K, Q, W and Y.
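A key frequency table of this kind can be derived from the key events by counting only the press events. The Python sketch below assumes events given as (key code, flag, timestamp) tuples with flag 0 for a press, and uses made-up events; the key codes 32, 65 and 69 are assumed to follow the JavaScript keyCode convention for SPACE, A and E, which the source does not confirm.

```python
from collections import Counter


def key_shares(events):
    """Return (key_code, presses, percent_of_presses), most frequent first."""
    presses = Counter(code for code, flag, _ in events if flag == 0)
    total = sum(presses.values())
    return [(code, n, 100.0 * n / total) for code, n in presses.most_common()]


# Made-up events: (key_code, 0 = press / 1 = release, timestamp in ms).
events = [
    (32, 0, 0), (32, 1, 50),
    (65, 0, 100), (65, 1, 160),
    (32, 0, 200), (32, 1, 260),
    (69, 0, 300), (69, 1, 350),
    (32, 0, 400), (32, 1, 450),
    (65, 0, 500), (65, 1, 560),
]
table = key_shares(events)
```

Counting press events only avoids double counting, since every keystroke contributes both a press and a release event to the file.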
The distribution of letters of the English alphabet (a-z) in the dataset is shown in Table A4 from Appendix D.
Each user has a unique way of typing text on the keyboard. This pattern is specific to the user and does not change during a writing session or in the short term. The typing pattern may change over time or may differ if the same user uses different keyboards. The differences between users, on the other hand, can even be analyzed visually, as for example in Figure 7(a). The graph shows the typing times for two users from the database. For user0001, both the average of the times and their standard deviation are larger: most of the time intervals for user0001 are between 50 and 150 milliseconds. In contrast, user0002 shows less variation between keystrokes, with most of the time intervals in the range of 50-75 milliseconds [7].
Figure 7(b) shows the first 1000 time intervals between two consecutive keys, the flight time (UD time). This time interval can take negative values, while the pressing time of a single key cannot. A negative value occurs when the second key in a di-graph is pressed before the first key is released. The figure shows the times for three users. User0001 has the most negative time values, while user0003 has the most time values close to 0. The time value is close to 0 when the second key is pressed at almost exactly the moment the first is released. User0002 has the fewest negative time intervals, and the highest average of the three users analyzed. In the analysis of the typing pattern, both the times for which keys are pressed and the times between two consecutive keys are analyzed. Di-graph analysis takes into account the order in which characters are typed by users. From the database collected from users, a total of 200,227 di-graphs could be created and analyzed, of which 1,530 are unique; that is, only 1,530 distinct 2-character combinations occur. The most used di-graphs in the text are presented in Table A5 in Appendix E. These are di-graphs that appear more than 1000 times each in the texts collected from users.
A user's profile for testing can be built from the most frequent time intervals. Figure 8(a) shows the distribution of typing time (Down-Up time) for 7 users. The time distribution is approximately normal, close to a Gaussian or a Laplace distribution. However, both the mean and the standard deviation differ from user to user.
The distribution of time intervals between two consecutive keys is shown for five different users in Figure 8(b). For two users, user0056 and user0059, the number of key intervals peaks at the value 0 on the graph. User0056 also has the most negative intervals. User0055 has a distribution of time intervals that is completely different from the other four users: the times are distributed at higher values, which means that user0055 types at a slower pace.
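Di-graph extraction and the UD flight time can be sketched as follows. This Python example assumes keystrokes already paired into (key, press time, release time) tuples ordered by press time, with made-up values; it is an illustration, not the authors' processing code.

```python
from collections import Counter


def digraphs_with_flight(keystrokes):
    """Build di-graphs from consecutive keystrokes.

    keystrokes: list of (key, down_ms, up_ms), ordered by down_ms
    Returns a list of ((first_key, second_key), ud_time_ms).
    The UD time is negative when the second key is pressed
    before the first key is released.
    """
    result = []
    for (k1, _d1, u1), (k2, d2, _u2) in zip(keystrokes, keystrokes[1:]):
        result.append(((k1, k2), d2 - u1))
    return result


# Made-up keystrokes; the second and third overlap in time.
keystrokes = [
    ("D", 1000, 1090),
    ("E", 1150, 1260),
    ("D", 1230, 1310),
    ("E", 1400, 1480),
]
pairs = digraphs_with_flight(keystrokes)
counts = Counter(g for g, _ in pairs)

total_digraphs = len(pairs)    # 3
unique_digraphs = len(counts)  # 2: ("D", "E") and ("E", "D")
```

The same counting over the whole data set yields the totals of all di-graphs and of unique di-graphs reported in the text.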
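The per-user mean and standard deviation of the Down-Up times can be computed with Python's standard library. The sketch below uses made-up hold times and hypothetical user labels, and it uses the population standard deviation; the source does not state which estimator was used.

```python
from statistics import mean, pstdev


def summarize(hold_times_by_user):
    """Return {user: (mean_ms, std_ms)} for each user's Down-Up times."""
    return {
        user: (mean(times), pstdev(times))
        for user, times in hold_times_by_user.items()
    }


# Made-up Down-Up times in milliseconds.
hold_times_by_user = {
    "userA": [80, 100, 120],
    "userB": [60, 70, 65],
}
summary = summarize(hold_times_by_user)
```

In this example userA has both a larger mean and a larger spread than userB, which is the kind of difference between users described above.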

Comparison of the related works
Table 3 shows the characteristics of the databases used in previous scientific research, so that they can be compared with the characteristics of the data set obtained in this research. These characteristics were published and compiled in [29]. The last row of the table gives the characteristics of the data set in this paper.
Table 3 (last row): This paper data set, 2020, 80 users, 1 session.

Contributions of the paper and future works
The objective of this paper was to collect a database of the typing behavior of 80 users and to make it available to other interested researchers. A database was created with typing data from 80 users, comprising more than 410,000 key events and a total acquisition time of approximately 24 hours.
There are new possibilities to continue research in new directions, such as:
• Expanding the keystroke dynamics database by collecting data from a larger number of users;
• Expanding the database by collecting data from the 80 users in new sessions in order to research the evolution of the typing pattern over time.

Conclusions
Authentication via keystroke dynamics is a topic of interest in the field of security and privacy, especially in authentication and access control. It is also a topic addressed in the field of human computer interaction (HCI), especially in interaction paradigms and interaction devices. Keystroke dynamics involves in-depth analysis of how a user types on the keyboard, such as how long a key is pressed or the time between two consecutive keys. This field has seen continuous growth in scientific research in recent years. One of the main problems facing researchers is the small number of public data sets that capture how users type on the keyboard. The objective of this paper was to collect a database of the typing behavior of 80 users and to make it available to other interested researchers. A database was created with typing data from 80 users, comprising more than 410,000 key events and a total acquisition time of approximately 24 hours. The data set is available at https://sites.google.com/view/cataliniapa/timisoara-kd-data-set.

Data Availability Statement:
Publicly available datasets were analyzed in this study. This data can be found here: https://sites.google.com/view/cataliniapa/timisoara-kd-data-set

Acknowledgments:
We thank the 80 volunteers who responded positively and filled out the data acquisition form so that we have this complete and available data set for future research in the field of keystroke dynamics authentication.

Conflicts of Interest:
The authors declare no conflict of interest.