Feature Selection Model based on EEG signals to Assess the Cognitive Workload in Drivers

Abstract: In recent years, research has focused on developing mechanisms to assess subjects' cognitive workload when performing activities that demand high concentration, such as driving a vehicle. These mechanisms have used several tools to analyze cognitive workload, of which electroencephalographic (EEG) signals are the most widely used due to their high precision. However, one of the main challenges in using EEG signals is finding the appropriate information to identify cognitive states. Here we present GALoRIS, a new feature selection model for pattern recognition that uses information from EEG signals and is based on machine learning techniques. GALoRIS combines Genetic Algorithms and Logistic Regression to create a new fitness function that identifies and selects the critical EEG features that contribute to recognizing high and low cognitive workload, and structures a new dataset capable of optimizing the model's predictive process. We found that GALoRIS identifies data related to high and low cognitive workload of subjects while driving a vehicle using information extracted from multiple EEG signals, reducing the original dataset by more than 50% and maximizing the model's predictive capacity, achieving a precision rate greater than 90%.


Introduction
Driving a vehicle is a complex activity subject to demands that change continually due to factors such as the speed limit, obstacles on the road, and traffic, among others. Drivers must maintain a high degree of concentration, which increases the demand on their cognitive workload, because the slightest carelessness can cause an accident [1]. In recent years, various tools have been used to assess the cognitive workload demand placed on drivers, such as subjective measures [2,3], vehicle performance measures [4,5], and physiological measures [6,7], with electroencephalographic (EEG) signals being the most widely used to identify cognitive states due to their high precision [8].
EEG signals allow the behavior of a person's brain activity to be analyzed in real time. However, this type of physiological signal generates a large amount of information per second, which grows in proportion to the collection time and the number of sensor channels, producing large volumes of data that are complex and demanding to process [9,10].
One of the main challenges in working with EEG signals is finding the right information to identify cognitive states. In this context, feature selection methods have been developed for pattern recognition using physiological signals. Feature selection (FS) algorithms aim to find a set of features with relevant information that can identify or describe an event, allowing the performance of the prediction models to be maximized [11].
Many investigations have developed models that implement FS to identify cognitive workload from physiological signals. In [12], an emotion recognition system for affective states is developed based on EEG signals using a support vector machine (SVM) classifier. The classifier obtained 75% and 71.21% performance and had problems identifying negative emotions. In [13], the authors present a quaternion-based analysis technique for EEG signals to extract the features of the registered cognitive activity. The model achieved 86.44% accuracy but required a minimum number of samples to obtain better results, increasing the analysis and information processing time. The work in [14] presents an online classification method based on common spatial patterns for feature extraction, using SVM as a classifier and achieving 86.3%, 91.8%, and 92.0% accuracy. In [15], different classifiers based on EEG signals were developed using linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbors, linear SVM, SVM with a radial basis function (RBF) kernel, and naive Bayes. SVM obtained the best accuracy, 82.14%. In these systems, the strategy used to extract the information could cause a loss of vital data. In [16], the authors propose a system to detect vigilance levels using EEG signals that combines SVM algorithms with multi-particle optimization, obtaining 84.1% accuracy. The model showed low performance in some predictions due to the complexity of the data. In [17], the authors developed a model to predict mental workload based on a linear discrimination function, achieving 85% accuracy. In this model, some of the physiological measures used could not effectively reflect the mental workload, which could affect the model's prediction precision. In [18], the common spatial pattern algorithm was used to extract information from EEG signals, and a classifier developed with the extreme learning algorithm obtained 87.5% accuracy. The model was highly sensitive to the kernel configuration, which affected its performance. In [11], an evolutionary computing algorithm was used to find an optimal dataset, obtaining 96.97% accuracy. This model presented premature convergence problems in the evolutionary algorithm. Finally, in [19], the Bayesian low learning algorithm was implemented to select a dataset, and SVM with RBF was used, achieving 89.7% accuracy. This model presented problems in the data collection phase.
In summary, although several models have been proposed in recent years for pattern recognition using FS algorithms based on physiological signals, the models described above rely on traditional techniques that either remove only a small percentage of the original dataset's features, resulting in heavy prediction models, or analyze the information from a specific signal, comparing only some features on small datasets and discarding relevant information. We propose a new feature selection model for pattern recognition using information from EEG signals called GALoRIS (Genetic Algorithms and Logistic Regression for the Structuring of Information). GALoRIS combines genetic algorithms (GA) and logistic regression (LoR) to create a new fitness function based on machine learning techniques and statistical analysis. It explores the fusion of EEG information, identifies the critical features that contribute to recognizing cognitive states, and structures a new dataset capable of optimizing the classification process. The dataset obtained from applying the feature selection algorithm is used as the input index for recognizing cognitive states in the predictive model.
GAs are adaptive and robust computational procedures modeled on the mechanisms of natural genetic systems and inspired by Charles Darwin's theory of natural evolution [20]. GAs are used to solve optimization problems in complex models by searching for the best feature set, especially when the search space is large and complex [18].
A GA selects and combines different features, evaluating the data to obtain the most suitable dataset. This is defined as an evolutionary process, which iterates until the data meet the established conditions [21]. Traditional models search for optimal solutions in parallel in multiple directions by creating large populations, ruling out possible solutions and producing computationally expensive and time-consuming models [22]. To deal with this problem, a dedicated fitness function is designed based on the LoR classifier's performance to direct the search toward an optimal solution. LoR is a technique characterized by its effectiveness, simplicity, and low computational cost during training and execution. LoR models each element's probability of belonging to a group, yielding feature weights that are used to evaluate how well each candidate competes with the other possible solutions; the features with the best values are stored to create new, better populations.

The results obtained from the GALoRIS model are used as EEG signal indexes for pattern recognition in four classifiers, developed with SVM with linear and RBF kernels, linear regression (LiR), and k-nearest neighbors (k-NN), which predict two cognitive states: low and high cognitive workload.
The rest of the work is organized as follows. Section 2 describes the methodology. Section 3 presents the experimentation. The results are presented in Section 4. Finally, the conclusions and discussion are given in Section 5.

Methodology
In this investigation, EEG signal information is collected from subjects while they face a real driving scenario. In addition, subjective measures (NASA-TLX (Task Load Index) and ISA (Instantaneous Self-Assessment)) and a vehicle performance measure (error rate (ER)) are collected to evaluate the cognitive states the subject experienced during the experiment.
To analyze the collected information, Student's t-test is used to identify statistically significant differences in the data collected during the experiment and to establish a collection criterion that discards information, constructing a new dataset, defined as the search space, that GALoRIS uses to explore the data. Pearson's correlation is also applied to identify the association of ISA, NASA-TLX, and ER with the EEG signals and to assess whether the subject experienced an internal cognitive workload during the different phases of the experiment [24].
GALoRIS is developed to recognize the most representative features that identify the subject's low and high cognitive workload states while driving. GALoRIS selects and evaluates the features, identifying the key elements that contribute to the recognition of cognitive states and restructuring a new dataset that is used in four classifiers developed with the supervised algorithms SVM-RBF, SVM-Linear, k-NN, and LiR.
The general architecture of the cognitive workload prediction model is shown in Figure 1.

Statistical Analysis
Student's t-test is calculated for ISA, NASA-TLX, ER, and the EEG frequency bands to obtain the p-values, where each measure is contrasted between the two cognitive workload states. The established hypotheses are:
$H_0$: There is no significant difference between the information obtained during the two tasks of the experiment.
$H_1$: There is a significant difference between the information obtained during the two tasks of the experiment.
If the error probability ($p$) of the samples is greater than the significance level $\alpha = 0.05$, hypothesis $H_1$ is rejected.
The t-test results of the EEG signals are used to establish an EEG information collection criterion that builds the search space with the relevant information that GALoRIS will explore. The criterion is defined as

$$p \le \alpha \;\therefore\; Ch \in SearchSpace$$

where $p$ is the t-test p-value, $\alpha$ is the significance level, and the EEG samples ($Ch$) with a value of $p \le \alpha$ are placed in the search space. Additionally, as in [25,26,27,28,29], we use Pearson's correlation between the implemented measures to determine the association between measures and cognitive states, as a method of validating the subject's internal state. The hypothesis is that, if the EEG signals are correlated with the subjective and vehicle performance measures, the subject experienced the same level of cognitive workload internally and externally.
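The collection criterion above can be illustrated with a minimal sketch, assuming the t-test p-values have already been computed for each band and channel. The function name `build_search_space` and the example p-values are hypothetical and are not taken from the study.

```python
ALPHA = 0.05  # significance level stated in the text


def build_search_space(p_values, alpha=ALPHA):
    """Keep only the entries whose t-test p-value satisfies p <= alpha."""
    return [name for name, p in p_values.items() if p <= alpha]


# Hypothetical p-values, one per (band, channel) pair; illustrative only.
example_p_values = {
    "alpha_ch1": 0.003,
    "alpha_ch2": 0.210,
    "beta_ch1": 0.050,
    "theta_ch1": 0.640,
}

search_space = build_search_space(example_p_values)
```

Entries that fail the criterion are simply left out, so the search space explored later contains only features that differ significantly between the two tasks.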
Pearson's correlation quantifies the relation between two variables by calculating an index that measures their degree of association. It is applied to the ISA, NASA-TLX, ER, delta, theta, alpha, beta, and gamma measurements. The analysis correlates the average value obtained in each session for each measurement (8 measures * 2 tasks). A correlation coefficient of 0 indicates no correlation, while a coefficient of -1 or +1 indicates a perfect correlation [30].
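As a reference for the index described above, the following is a minimal sketch of Pearson's correlation coefficient written with the Python standard library only. The function name `pearson` is illustrative, and the inputs stand for any two measures (for example, an ISA series and a band-power series).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A result near +1 or -1 indicates a strong linear association between the two measures, and a result near 0 indicates none.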

GALoRIS
In this section, the architecture of the GALoRIS model is presented, as shown in Figure 2. GALoRIS proposes a new design of the chromosome's structure and of the fitness function, based on LoR, to model the features' weights and determine the direction of the search. GALoRIS also proposes a new selection technique to identify the best set of features efficiently. The model consists of six phases, which are presented below.

Population
The population is a set of individuals, or chromosomes, each of which represents a possible solution to the problem. A chromosome is made up of elements known as genes, which are initially selected at random and then modeled through the fitness function.
To create the chromosomes, a matrix defined as a feature space is built, where each element of the matrix represents a gene that the algorithm can select to build a chromosome. The search space is defined as presented in Equation (1), where the channels of the delta band are organized first, followed by the channels of the theta, alpha, beta, and gamma bands, following the order of the frequency ranges. $Ch$ represents the channels of each band, defined as presented in Equation (2):

$$Ch = [\,3, 4, 3, 7, 8, 5, 2, 8, 8\,] \quad (2)$$

where $Ch$ must meet the collection criterion $p \le \alpha \;\therefore\; Ch \in SearchSpace$. The dataset format for the search space is frequency bands * channels * number of samples (5*9*8210). All the information is standardized to the range [0, 1].
A new chromosome structure is also defined. The structure contains the features together with the parameters evaluated for the chromosome, which direct the search for elements. The general form of the structure is presented in Figure 3, where the genes of the chromosome are encoded as a binary string, $g_i \in \{0,1\}$, $i = 1, 2, \dots, n$. Whenever a gene's value is 1, the corresponding feature is selected to form the new chromosome and continue the evolutionary process.
The adaptation parameters are used as evaluation criteria to determine whether the chromosome continues in the evolutionary process.
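The binary encoding described above can be sketched as follows. This is only an illustration under assumed names: `random_chromosome`, `selected_features`, and the feature labels of the form `band_chN` are hypothetical, and the 5 bands by 9 channels layout follows the search-space format given in the text.

```python
import random

BANDS = ["delta", "theta", "alpha", "beta", "gamma"]
N_CHANNELS = 9  # channels per band, as in the 5*9*8210 search space

# Hypothetical feature labels, ordered by frequency band and then by channel.
FEATURE_NAMES = [f"{band}_ch{c + 1}" for band in BANDS for c in range(N_CHANNELS)]


def random_chromosome(n_features, rng):
    """Create a chromosome as a random binary string (1 = feature selected)."""
    return [rng.randint(0, 1) for _ in range(n_features)]


def selected_features(chromosome, feature_names):
    """Return the names of the features whose gene is set to 1."""
    return [name for bit, name in zip(chromosome, feature_names) if bit == 1]


rng = random.Random(42)  # fixed seed so the example is reproducible
chromosome = random_chromosome(len(FEATURE_NAMES), rng)
active = selected_features(chromosome, FEATURE_NAMES)
```

Keeping the chromosome as a plain bit list makes the later crossover and mutation steps simple list operations.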

Fitness Function
The fitness function (FF) evaluates the quality of each proposed chromosome to find the best combination of genes while preserving high genetic diversity in the population. For each generated chromosome, FF calculates the adaptation parameters based on the performance of the logistic regression algorithm. The parameters explore the properties of the chromosome to determine whether it can compete with other chromosomes. The chromosome's features are divided into two sets: the first builds the LoR model, and the second is used to assess the quality of the chromosome and explore the effectiveness of the features according to the criteria. Equation (3) presents the general logistic regression model used to calculate the adaptation parameters, where $\beta_0$ is the intercept, $Ch$ represents the observations of the channel, and $\beta_i$ is the coefficient estimated with the logit function for each band variable, which determines the global fit of the model and the individual significance level of each variable. The general model is defined in Equation (4), where the values of $\beta_0$ and $\beta_i$ are estimated from each frequency band or element that makes up the chromosome.
The adaptation parameters calculated from each generated chromosome are: the accuracy of the fit of the chromosome's elements, the error rate of the fit, the number of genes in the chromosome, and the significant elements of each chromosome. Each parameter is explained below.
The accuracy of the fit evaluates the performance of the generated chromosome and is calculated as presented in Equation (5), where the number of correct predictions is divided by the total number of predictions:

$$Accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \quad (5)$$

The range of values is [0,1], where 1 indicates a high level of accuracy.
The error rate of the fit quantifies the error in the predictions of each chromosome by evaluating the number of incorrect predictions. It is calculated as presented in Equation (6):

$$Error = Y - Y' \quad (6)$$

where the difference between the actual values $Y$ and the predicted values $Y'$ is calculated. The range of values is [0,1], where values close to 0 indicate that the chromosome obtained a lower fitting error.
The number of genes in the chromosome evaluates the number of elements selected to build the chromosome. This parameter aims to obtain a chromosome with fewer components that is still capable of describing the behavior of the data, reducing the probability of error, the analysis time, and the algorithm's execution time.
The significant elements parameter evaluates the contribution of each gene of the chromosome by comparing the gene's p-value with the significance level $\alpha = 0.05$. If the p-value is less than or equal to the significance level, the evaluated variable is relevant and should remain in the final chromosome.
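The following is a minimal sketch of how a logistic-regression-based fitness evaluation of this kind might look, using only the Python standard library. It is not the authors' implementation. The gradient-descent fit, the function names, and the use of `1 - accuracy` as the error rate are assumptions made for illustration, and the per-gene significance test (p-values of the coefficients) is omitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=1.0, epochs=2000):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def fitness(chromosome, X_train, y_train, X_val, y_val):
    """Return the adaptation parameters of one binary chromosome."""
    idx = [i for i, bit in enumerate(chromosome) if bit == 1]
    if not idx:  # an empty chromosome cannot build a model
        return {"accuracy": 0.0, "error_rate": 1.0, "n_genes": 0}

    def project(rows):
        # Keep only the columns selected by the chromosome.
        return [[row[i] for i in idx] for row in rows]

    w, b = fit_logistic(project(X_train), y_train)
    preds = [
        1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) >= 0.5 else 0
        for row in project(X_val)
    ]
    accuracy = sum(p == t for p, t in zip(preds, y_val)) / len(y_val)
    # Assumption: the error rate is taken as the share of wrong predictions.
    return {"accuracy": accuracy, "error_rate": 1.0 - accuracy, "n_genes": len(idx)}
```

The first data split fits the model and the second split scores it, mirroring the two feature sets described in the text. In practice a statistics library would also supply the coefficient p-values needed for the significant elements parameter.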

Selection
Selection consists of building a list of chromosomes using the criteria established in the adaptation parameters, as described in Equation (7). The process begins by comparing the parameter values of each chromosome: the chromosome with the higher accuracy and the lower error rate is positioned at the top of the list. If these values are equal for two chromosomes, the one with the fewest elements has the highest priority.
The elements with a p-value below the significance level ($p < \alpha$) are joined in a single vector to create a new chromosome, which is passed on to the next generation, as shown in Figure 4. This process directs the selection of elements toward new chromosomes with better properties, selecting the features that carry relevant information.
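The ranking and merging steps described above can be sketched as follows, assuming each evaluated chromosome is represented by a dictionary of its adaptation parameters. The names `rank_chromosomes` and `merge_significant`, the dictionary keys, and the example values are hypothetical.

```python
def rank_chromosomes(evaluated):
    """Sort by higher accuracy, then lower error rate, then fewer genes."""
    return sorted(
        evaluated,
        key=lambda c: (-c["accuracy"], c["error_rate"], c["n_genes"]),
    )


def merge_significant(gene_p_values, alpha=0.05):
    """Join the genes with p < alpha from several chromosomes into one set.

    gene_p_values: list of {feature_name: p_value} dicts, one per chromosome.
    """
    merged = set()
    for p_values in gene_p_values:
        merged.update(name for name, p in p_values.items() if p < alpha)
    return sorted(merged)


# Hypothetical adaptation parameters for three chromosomes.
candidates = [
    {"name": "c1", "accuracy": 0.90, "error_rate": 0.10, "n_genes": 12},
    {"name": "c2", "accuracy": 0.95, "error_rate": 0.05, "n_genes": 20},
    {"name": "c3", "accuracy": 0.95, "error_rate": 0.05, "n_genes": 13},
]
ranked = rank_chromosomes(candidates)
```

Here `c3` is ranked ahead of `c2` because both have the same accuracy and error rate but `c3` uses fewer genes.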

Crossing
Once the best chromosomes are selected based on FF, the reproduction process begins with the crossing between chromosomes, as shown in Equation (8). This phase consists of cutting each chromosome at two selected points to generate new segments; the central segment of one parent and the lateral segments of the other parent are chosen to create the offspring chromosomes [31]. Crossing makes it possible to combine all parts of the chromosomes and generate chromosomes that were not present in the initial population.
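A two-point crossover of the kind described above can be sketched as follows. The function name `two_point_crossover` is illustrative, and the cut points are passed explicitly here so that the example is deterministic; in a GA they would normally be drawn at random.

```python
def two_point_crossover(parent_a, parent_b, cut1, cut2):
    """Swap the central segment [cut1:cut2] between two parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if not 0 <= cut1 <= cut2 <= len(parent_a):
        raise ValueError("cut points must satisfy 0 <= cut1 <= cut2 <= length")
    # Each child keeps the lateral segments of one parent
    # and takes the central segment of the other.
    child_a = parent_a[:cut1] + parent_b[cut1:cut2] + parent_a[cut2:]
    child_b = parent_b[:cut1] + parent_a[cut1:cut2] + parent_b[cut2:]
    return child_a, child_b


child_a, child_b = two_point_crossover([1] * 6, [0] * 6, 2, 4)
```

With parents of all ones and all zeros, the swapped central segment is easy to see in each child.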

Mutation
Mutation generates a new chromosome that differs from its parents, in order to maintain diversity within the population and avoid premature convergence. It consists of randomly inverting some of the genes of the chromosome to introduce variability within the population and to differentiate the mutated chromosomes from the rest of the new population [31].
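A simple bit-flip mutation consistent with this description can be sketched as follows. The function name `mutate` and the per-gene application of the mutation rate are assumptions for illustration.

```python
import random


def mutate(chromosome, rate, rng):
    """Invert each gene independently with probability `rate`."""
    return [1 - gene if rng.random() < rate else gene for gene in chromosome]


rng = random.Random(7)  # fixed seed for reproducibility
mutated = mutate([0, 1, 1, 0, 1, 0, 0, 1], 0.1, rng)
```

The function returns a new list, so the parent chromosome is left unchanged and can still take part in selection.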

Stopping Rules
To stop the evolutionary process of the model, two stopping rules are defined, at least one of which must be fulfilled. The first rule is met when the established number of chromosome generations is completed. This number is defined based on experimentation and the number of features in the search space. The second rule is met when the evaluation criteria of the fitness function are fulfilled (accuracy = 1, error rate = 0).
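The two rules above can be expressed as a single predicate, sketched below. The function and parameter names are hypothetical.

```python
def should_stop(generation, max_generations, best_accuracy, best_error_rate):
    """Stop when the generation budget is spent or the fit is perfect."""
    budget_spent = generation >= max_generations
    perfect_fit = best_accuracy == 1.0 and best_error_rate == 0.0
    return budget_spent or perfect_fit
```

A GA loop would call this check once per generation, after the fitness of the new population has been evaluated.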

Information Structuring
A new dataset is constructed from the feature selection results, integrating the elements of the generated chromosome so that it can be used as the input index for recognizing patterns in the prediction model. Equation (9) presents the general structure of the new dataset, where the chromosome represents the new dataset, defined as

$$Chromosome = \{x_i, y_i\}_{i=1}^{n} \quad (9)$$

where $x$ are the selected features, $y$ is the categorization of the data, and $n$ is the number of samples. To organize large amounts of EEG information from multiple channels, $x$ and $y$ are structured as presented in Equation (10):

$$x = [\,B_1\,Ch_{11}, Ch_{12}, \dots, Ch_{1m},\; B_2\,Ch_{21}, Ch_{22}, \dots, Ch_{2m},\; \dots\,] \quad (10)$$

where $x$ contains the EEG signal information following the order of the frequency ranges and $y$ contains the information of the two cognitive states. In total, 8210 samples are used.

Classifiers
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 22 September 2020 doi:10.20944/preprints202009.0521.v1
In this investigation, four classifiers are developed using the new dataset generated by GALoRIS. The classifiers are designed in three steps using the SVM (linear and RBF kernels), LiR, and k-NN algorithms. The first step consists of pre-processing the information, where the data are divided into two groups, training and testing: 90% of the samples are used to train the model and 10% to test it. The second step consists of building the model with the training data, adjusting its parameters and configuration. The last step is to evaluate the trained model using the testing data.
To divide the information into training and testing sets, $k$-fold cross-validation ($k = 10$) is implemented. $k$-fold cross-validation helps to avoid overfitting during the model's construction and is the most widely used technique in prediction studies [32]. It randomly divides the data into $k$ subsets of equal size, where one subset is used in the validation step and the remaining $k-1$ subsets are used in the training step. The process is repeated $k = 10$ times, and performance metrics are calculated to evaluate the model of each cycle. The results are averaged to obtain a single estimate. The advantages of the technique are that all test sets are independent and that the reliability of the result is improved by repeating the evaluation $k$ times [22,33].
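The partitioning scheme can be sketched with the standard library as follows. The function name `k_fold_indices` and the fixed seed are illustrative; the sample count of 8210 and $k = 10$ are taken from the text.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(8210, 10))
```

Each sample appears in exactly one validation fold, so every cycle trains on roughly 90% of the data and validates on the remaining 10%.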
The metrics used to evaluate the performance of the model are sensitivity and accuracy. The sensitivity metric evaluates the cases that are correctly classified as true and is calculated from the predictions made correctly as low cognitive workload (CLCW) and the predictions made incorrectly as high cognitive workload (IHCW), as shown in Equation (8):

$$Sensitivity = \frac{CLCW}{CLCW + IHCW} \times 100 \quad (8)$$

The accuracy metric is related to the total number of predictions made correctly and is calculated from CLCW, the predictions made correctly as high cognitive workload (CHCW), the predictions made incorrectly as low cognitive workload (ILCW), and IHCW, as shown in Equation (9):

$$Accuracy = \frac{CLCW + CHCW}{CLCW + CHCW + ILCW + IHCW} \times 100 \quad (9)$$
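The two metrics can be computed directly from the four prediction counts, as in the short sketch below. The function names are illustrative; the argument names follow the abbreviations used in the text.

```python
def sensitivity(clcw, ihcw):
    """Percentage of low-workload cases that were predicted as low."""
    return clcw / (clcw + ihcw) * 100


def accuracy(clcw, chcw, ilcw, ihcw):
    """Percentage of all predictions that were correct."""
    return (clcw + chcw) / (clcw + chcw + ilcw + ihcw) * 100
```

For example, with hypothetical counts of 45 correct and 5 incorrect predictions in each class, both metrics evaluate to about 90%.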

Label
In the real world, data are not labeled; therefore, in recent years labeling indices have been developed that use the frequency bands δ, θ, α, β, and γ to identify different states, as shown in Table 1. However, these indices use only some bands and/or channels to evaluate people's states.
In this research, a labeling technique is developed that identifies low and high cognitive workload levels in order to categorize the EEG information, using the generated chromosome.
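The thresholding idea behind the labeling technique in this section (average each sample, derive an interval size from the maximum and minimum of the averages, then assign a binary label) can be sketched as follows. This is a sketch under assumptions: the function name `label_samples` is hypothetical, and the cut-off is placed one interval above the minimum, which coincides with the interval size itself when the minimum is 0.

```python
def label_samples(samples, n_states=2):
    """Label each sample 0 (low) or 1 (high) from its average value."""
    means = [sum(row) / len(row) for row in samples]
    max_value, min_value = max(means), min(means)
    interval = (max_value - min_value) / n_states  # interval size per state
    # Assumption: the cut-off sits one interval above the minimum.
    threshold = min_value + interval
    return [1 if m > threshold else 0 for m in means]


# Hypothetical standardized samples (rows) over two selected features.
labels = label_samples([[0.0, 0.2], [0.2, 0.4], [0.8, 1.0], [0.7, 0.9]])
```

Samples labeled 1 correspond to the peaks in the timeline, that is, the moments treated as high cognitive workload.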
The labeling technique consists of defining the upper and lower thresholds of the dataset by calculating the average of each sample to obtain a vector. The vector's maximum and minimum values are then calculated, and their difference is divided by the number of cognitive states to obtain the interval size for each state, as shown in Equation (10):

$$Threshold = \frac{maxvalue - minvalue}{cognitive\ states} \quad (10)$$

where $maxvalue$ and $minvalue$ represent the maximum and minimum values of the samples of the vector, $cognitive\ states$ is the number of states to evaluate, and $Threshold$ is the size of the interval per state. The value $x$ of each sample is compared with the threshold, where $x < Threshold$ is labeled 0 and $x > Threshold$ is labeled 1. This technique finds the peaks in the timeline, defined as moments of high cognitive workload during the experiment.

Design of the Experiment
The Lane Change Test (LCT) version 1.2 simulator, which simulates the most frequent driving conditions of a vehicle, is used to perform the experiment [43]. LCT is designed to quantitatively measure the degradation of the subject's performance while driving and performing secondary tasks [44,45,46]. LCT consists of driving on a three-lane highway with a length of 3000 m at a maximum speed of 60 km/h. Along the way, traffic signs that appear next to the highway every 150 m instruct the driver to change lanes. The signs are activated when the vehicle is 40 m away from them. The participant must carry out the action indicated by the sign while respecting the traffic rules [47]. The experiment lasted approximately 80 min and was divided into three phases:
Baseline: The participant takes a seat and puts the Emotiv EPOC sensor on the head [48]. The subject keeps their eyes closed and remains acoustically isolated for 10 minutes while the sensor is activated to collect information.
First task (Task_1): The participant starts driving the vehicle without any distraction. During driving, the EEG signals, ISA, and ER are collected. At the end, NASA-TLX is applied.
Second task (Task_2): In order to increase the subject's cognitive workload, the stress induction protocol proposed in [7] is applied as a second task. In this task, a series of random digits is read aloud, and the participant has to repeat them in the order in which they were given. All measurements are collected.
The study was conducted in accordance with the Declaration of Helsinki, and the experimental protocol was developed following the ethics committee's regulations of the Polytechnic University of Catalonia and the Governing Council Commission of Inquiry (Agreement no. 45/2015). All methods were performed in accordance with relevant guidelines and regulations. All the patients signed informed consent for a research protocol.

Subjective Measures
ISA is a questionnaire applied every 2 minutes during the activity. The participant must provide the number that best describes their stress level on a scale of 1 to 5: (1) boring, (2) relaxed, (3) comfortable, (4) little busy, and (5) very busy [49]. To calculate the questionnaire's weighting, a weight from 1 to 10 is assigned according to the difficulty of the task during the experiment, where 1 represents a task of low difficulty and 10 a task of high difficulty. The assigned weight is multiplied by the number provided and averaged across the activities to obtain the ISA weighting, which ranges from 1 to 100.
NASA-TLX is an instrument applied after the exercise that evaluates six factors, defined as dimensions, that characterize the subjective workload [50]. The methodology proposed in [17] is used to obtain the scale, giving a value that ranges from 1 to 100.

Measurement of Vehicle Performance
Vehicle performance is associated with the ability to keep the vehicle within safety margins. To assess this capacity, ER is used in this investigation. ER evaluates the number of activities performed incorrectly relative to the total number of activities presented during the experiment. In [15], the authors explain the relationship between ER and high levels of cognitive workload: the greater the number of activities carried out during a task, the higher the cognitive workload, which increases the error rate. To estimate the ER of each subject, Equation (11) is defined, where the sum of the activities carried out erroneously is divided by the total number of activities presented during the task:

$$ER = \frac{\sum \text{activities carried out erroneously}}{\text{total activities}} \quad (11)$$

Collection and Extraction of EEG Signals
To analyze the collected information, a feature extraction process is implemented. This method consists of transforming the original signals into a vector of features that represents the signal's behavior. In the literature, features in the time domain, frequency domain, and time-frequency domain are distinguished [51]. In this investigation, the signal is analyzed in the frequency domain using the power spectral density (PSD). PSD determines the distribution of the signal power over a frequency range, facilitating the extraction of the most popular features in the context of cognitive workload [52]. These features are defined as frequency bands: Delta (0.5-4 Hz), Theta (4-8 Hz), Alpha (8-12 Hz), Beta (12-30 Hz), and Gamma (30-100 Hz) [23,53,54].
The signals are sensitive to artifacts, that is, activity generated by body movement that alters the quality of the signal [55]. To remove artifacts, a fifth-order Butterworth filter with cutoff frequencies of 1 and 100 Hz is applied, based on [29,56,57]. The Butterworth filter has a more linear response than other filters, allowing efficient filtering and decomposition of EEG signals [58].
The Fast Fourier Transform (FFT) is calculated with a Hanning window of 128 samples over a length of T = 5 s to convert the signal from the time domain to the frequency domain and extract the magnitude of the power spectrum of the delta, theta, alpha, beta, and gamma frequency bands.
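The windowed spectral analysis described above can be illustrated with a small standard-library sketch. It uses a direct discrete Fourier transform instead of an FFT for brevity, which gives the same spectrum for a 128-sample window. The function name `band_peak_power` and the sampling rate of 128 Hz used in the usage note are assumptions, not values stated in the text, and the Butterworth pre-filtering step is omitted.

```python
import cmath
import math

# Frequency bands in Hz, as listed in the text.
BANDS = {
    "delta": (0.5, 4),
    "theta": (4, 8),
    "alpha": (8, 12),
    "beta": (12, 30),
    "gamma": (30, 100),
}


def band_peak_power(signal, fs):
    """Return the maximum power-spectrum value found in each band."""
    n = len(signal)
    # Hanning window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(signal, window)]
    peaks = {name: 0.0 for name in BANDS}
    for k in range(1, n // 2 + 1):  # positive frequencies only
        freq = k * fs / n
        coeff = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        power = abs(coeff) ** 2
        for name, (low, high) in BANDS.items():
            if low <= freq < high:
                peaks[name] = max(peaks[name], power)
    return peaks
```

As a usage example, passing a 128-sample, 10 Hz sine wave sampled at an assumed 128 Hz yields its largest value in the alpha band, since 10 Hz lies between 8 and 12 Hz.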
The data format is channel*sample_number*frequency_bands (9*8210*5). All information is standardized. To obtain the EEG signal data, an interface is developed in LabVIEW to collect the EEG signals and extract the frequency bands using PSD. Figure 5 shows the interface, where the frequency distribution of the signal extracted from each of the bands can be observed. The maximum value of the power spectrum's magnitude is stored in a file with the extension *.csv [59].

Dataset and Parameters
The works in [8,46,47] suggest that combining the information of the bands helps to identify cognitive states and yields better classifier results. In this research, seven subsets are built based on four principles to analyze the behavior of the information, the relationship between the features, and the performance of the prediction model, as shown in Table 2. First, a dataset with all the data is built to analyze the data. Second, a dataset is built with the information of the alpha band, which is characterized by efficiently recognizing cognitive states [28]. Third, a dataset is built with the information of the beta and gamma bands, which are related to a single cognitive state [60,61]. Finally, four datasets are built with information related to the two cognitive states [62,63], in which the information of the bands is combined. All datasets follow the statistical selection criterion $p \le \alpha \;\therefore\; Ch \in SearchSpace$, under which only the channels $Ch$ whose t-test p-value does not exceed the significance level $\alpha$ enter the search space.
The parameter configurations defined in this work are based on [31,64,65,66] and on the tests performed during the development of the classifier. For GALoRIS, the number of generations is 30, with a population size of 100 for each generation. A tournament selection of size t = 5 is configured, in which individuals are drawn t times to be selected. A two-point crossover is used, with a crossing probability of 0.8, to perform the mating between two individuals. The mutation is simple, with a mutation probability of 0.1.
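The parameter values listed above can be collected in a small configuration, and tournament selection can be sketched as follows. The dictionary keys and the function name `tournament_select` are hypothetical, and the sketch assumes the common form of tournament selection in which t individuals are drawn at random and the fittest of them wins.

```python
import random

# GA parameters as stated in the text; the key names are illustrative.
GA_CONFIG = {
    "generations": 30,
    "population_size": 100,
    "tournament_size": 5,
    "crossover_prob": 0.8,
    "mutation_prob": 0.1,
}


def tournament_select(population, fitnesses, t, rng):
    """Draw t distinct individuals at random and return the fittest one."""
    contenders = rng.sample(range(len(population)), t)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


# Hypothetical population and fitness values for illustration.
population = ["a", "b", "c", "d", "e"]
fitnesses = [0.1, 0.9, 0.3, 0.5, 0.2]
chosen = tournament_select(population, fitnesses, 2, random.Random(0))
```

A larger tournament size increases selection pressure, because the best individuals are more likely to be among the contenders.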

Subjective and Vehicle Performance Measures
The results obtained from ISA, NASA-TLX, and ER in the experiment are presented in Table 3. The results obtained in Task_2 are greater than in Task_1 for all measures, showing that the subjects experienced an increase in cognitive workload across the phases of the experiment. The data of subject_2 were excluded because the subject felt sick during the experiment.
Table 4 presents a descriptive analysis of each of the frequency bands extracted from the EEG signals. The results show that the values of the alpha, beta, and gamma bands are higher in Task_2 than in Task_1, whereas the values of the delta and theta bands are higher in Task_1. These results arise because each band is related to a cognitive state [8,67,68,69,70]. For example, an increase in delta [71] or theta [61,72] wave activity is associated with low cognitive workload, fatigue, or a state of relaxation. An increase in alpha [28,73], beta [71], or gamma [68,74] wave activity is associated with high cognitive workload, stress, or mental overload.
Table 5 shows the results of the t-test, including the mean, the standard deviation, and the p-value of each measure obtained during Task_1 and Task_2.
Table 6 presents the correlation index between the subjective measures, the vehicle performance measure, and the EEG signals, where the correlation is generally medium-high. Of the examined measures, ISA and ER presented a medium-high correlation with alpha (r2=0.3, r2=0.6), beta (r2=0.4, r2=0.6), delta (r2=-0.5, r2=-0.7), and gamma (r2=0.6, r2=0.8), suggesting a convergence between these measures. NASA-TLX is independent of the physiological measures, as in [75], which may be because it is a post-exercise measure. The theta band also showed independence from the subjective and performance measures.

Labeling Results
Table 7 presents the GALoRIS results, where the adaptation parameters obtained from each dataset can be observed. For example, in subset_1, the proposed method reduced the number of attributes from 36 to 13 features on average, 64% fewer than the original data, and obtained 97% accuracy in the fit of the elements. A considerable reduction in the dimensionality of the original dataset produces a more efficient model that is well suited to real-time applications.
The results obtained from GALoRIS are compared with the feature selection algorithms most used in the literature to analyze EEG signals, Mutual Information (MI) and Principal Component Analysis (PCA) [76]. MI and PCA are evaluated using the seven datasets proposed in this research, and the results are presented in Table 9.

Conclusions and Discussions
Related work found in the literature is shown in Figure 7. In [77], a feature extraction method based on rhythm entropy was explored to classify EEG signals. The classification rate reached 89.7% using SVM with leave-one-out cross-validation (LOOCV). In [22], a model with GA and SVM is proposed to classify several databases, obtaining 91% on average. In [78], an algorithm for stable EEG signal patterns based on a graph regularized extreme learning machine is proposed, achieving 69.67% and 91.07% accuracy. In [79], the authors propose a feature selection algorithm based on the mutual partial information algorithm, which eliminates the less significant information of the EEG signals, and develop a classifier using the linear discrimination analysis algorithm, obtaining 88.7% accuracy. In [80], the Granger causality algorithm was implemented to extract the most relevant features of the EEG signals, and a classifier developed with SVM obtained 82.66% accuracy. GALoRIS-SVM obtains an accuracy of 96.14% in the data classification, significantly improving classifier performance.
In this study, we have introduced a new feature selection model for pattern recognition called GALoRIS. GALoRIS selects EEG features by exploring the fusion of information and identifying the principal features that contribute to recognizing cognitive states, and it structures a new dataset capable of optimizing the classification process in order to build a robust and powerful learning model.
The results of this research demonstrate several aspects. First, the measures proposed in this research make it possible to evaluate the level of cognitive workload of subjects while driving a vehicle. Second, statistical tests evaluated the relation between the measures and the cognitive states to observe the subject's internal behavior and determine whether different cognitive workload levels arise during the experiment. The statistical results show that, when the level of difficulty increased, the drivers perceived an increase in the cognitive workload demand, which affected their concentration and increased their errors. Third, combining features from multiple sources can improve the model; in fact, classification performance improves by 10% to 20% compared to using features from a single data source. Finally, the main objective of GALoRIS is to propose a new search strategy that explores the information of the EEG signals more efficiently and identifies the features that can help describe cognitive states while driving a vehicle. The GALoRIS results show that feature selection algorithms for pattern recognition are fundamental to obtaining high precision in the prediction models. In addition, GALoRIS proved able to handle datasets of various sizes, selecting the attributes with relevant properties, reducing the original dataset by 64%, and maximizing the predictive capacity of the prediction models to achieve 98% accuracy in the classification of the information. The features used in this research can be considered a reference point for identifying high and low cognitive workload of vehicle drivers.
As future work, a new dataset will be used to assess the predictive ability of the model developed in this research.
Data Availability: The datasets generated during and/or analysed during the current study are available from the corresponding authors on reasonable request.
Author Contributions: E.B., and A.R. defined the experimental setup and acquired the experimental data. E.B., A.R., and J.G. processed and analysed data. All the authors co-wrote the manuscript and approved the final text.