Machine Learning for Stress Detection from Electrodermal Activity: A Scoping Review

Early detection of stress can help prevent long-term illnesses such as depression and anxiety. This article presents a scoping review of stress detection based on electrodermal activity (EDA) and machine learning (ML). From an initial set of 395 articles retrieved from six scientific databases, 58 were finally selected according to the established inclusion and exclusion criteria. The scoping review has made it possible to analyse all the steps to which EDA signals are subjected: acquisition, preprocessing, processing and feature extraction. Finally, all the ML techniques applied to the features of this signal for stress detection have been studied. Support vector machines and artificial neural networks stand out among the supervised learning methods given their high performance values. In contrast, unsupervised learning is not very common in the detection of stress through EDA. This scoping review concludes that the use of EDA for the detection of arousal variation (and stress detection) is widespread, and that the ML methods found during this review predict it with very good results.


Introduction
Stress usually produces a mental state of tiredness that may result in physical and mental disorders such as major depression, insomnia or generalised anxiety [1,2,3,4]. From an emotional point of view, calm and stress situations cause a variation in arousal. Arousal is the degree of activation or deactivation produced in the brain in the presence of a stimulus [5]. Stress detection is a very current topic in many clinical and educational areas. For this reason, there is a growing interest in developing methods that make automatic detection possible. In addition, in order to detect emotions in all places and at all times, the technologies used are focusing on the adoption of wearable devices and the implementation of machine learning (ML) techniques.
These technologies usually work with the physiological conditions of the human body [6,7]. The acquisition, processing and monitoring of physiological variables allow the creation of a map of the physical, mental and cognitive state of a subject [8,9]. Numerous physiological variables are used for stress detection. In particular, we will focus on the analysis of electrodermal activity (EDA), as it has proven to be very effective in the estimation of arousal. We are interested not only in stress detection itself but, specifically, in the different ML methods used to detect stress from EDA. For this reason, a scoping review is conducted that focuses on the use of EDA for stress detection through ML techniques.
The remainder of the article is structured as follows. Section 2 provides a short explanation about the methodology followed to do the review, pointing out the most important steps. Section 3 introduces a brief summary about the status of the topic addressed in the review. Section 4 describes the most relevant results and provides a discussion about the most important studies found. Finally, Section 5 offers the conclusions of this work. Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 2 November 2020 doi:10.20944/preprints202011.0043.v1

Review Strategy
A scoping review aims to quickly map out the concepts in a given area [10]. Such a review makes it possible to rely on the main sources and evidence available. It should be done especially when an area is complex or has not been subject to a thorough examination before. This type of review has been widely used in certain research areas to systematically identify their core body of literature. As a result of work conducted in a scoping process, it has become possible to systematically review and classify a considerable volume of peer-reviewed literature in a specific field [10,11].
In this article, we have undertaken a scoping review that connects the concepts of bio-signals and ML techniques for the detection of physiological states in people. Within this more generic field, the specific scoping review presented in this article aims to study all the previous works dedicated to the detection of human stress using any type of ML technique on collected electrodermal activity (EDA) signals.

Search Strategy and Study Selection
A total of six scientific databases have been selected for the most comprehensive search possible on ML and EDA. The selected databases are ScienceDirect, Scopus, IEEE Xplore, PubMed, ACM Digital Library and Web of Science. Articles labelled as "electrodermal activity", "bio-signals", "machine learning", "deep learning", "stress detection", "support vector machine", "artificial neural network" and "deep neural network", among others, have been searched. The search covers the period from the earliest records kept in each database until June 30, 2020. A total of 395 articles were returned from the initial searches of the above databases. Table 1 shows the terms searched for each of the databases. In order to have as small a set of search terms as possible without losing the scope of the review, the queries have been refined by successive searches. This has allowed us to keep a reasonably manageable number of keywords without losing the perspective and focus of the scoping review.

Inclusion and Exclusion Criteria
A series of inclusion and exclusion criteria have been established to help us refine the results of the review and arrive at the final number of selected articles. The criteria used to include or exclude each of the publications are listed below:
• Only one copy of each of the duplicate papers initially collected in the different databases should be kept.
• Publications that evaluate the design and performance of new ML-based methods and algorithms should be included.
• Excessively general reviews must be excluded. Only reviews focused on the search topic should be included.
• Articles with few participants or with a small or non-representative dataset should be excluded.
• Articles that do not include any type of system or method for the detection of emotional state based on ML techniques will be excluded.

Study Selection
Figure 1 details the scheme followed to obtain the final selection of the articles through the scoping review. The search process has been carried out in the stages of identification, screening and inclusion of the articles based on the given criteria. The identification phase comprises the direct searches in the various databases. A total of 395 papers have been found, of which 46 were retrieved from Scopus, 52 from IEEE Xplore, 63 from ScienceDirect, 30 from PubMed, 97 from ACM Digital Library, and 107 from Web of Science. In the screening phase, the papers have been selected and eliminated according to the inclusion and exclusion criteria mentioned above. First, 89 duplicates have been removed, leaving 306 papers. Moreover, 144 articles have been removed because, after reading their abstracts, they were found to be outside the scope of the review. The criterion here has been to select only papers using EDA signals, individually or in conjunction with other signals, and employing ML techniques. Finally, in the last stage (inclusion), 104 of the remaining 162 articles have been removed after a thorough reading of their complete content. In this way, 58 articles have been studied in depth in this scoping review.
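The selection flow can be checked with a few lines of arithmetic. The following minimal sketch uses only the counts reported in the paragraph above; the variable names are chosen for illustration.

```python
# Records returned per database, as reported in the text.
records_per_database = {
    "Scopus": 46,
    "IEEE Xplore": 52,
    "ScienceDirect": 63,
    "PubMed": 30,
    "ACM Digital Library": 97,
    "Web of Science": 107,
}

identified = sum(records_per_database.values())  # identification stage
after_duplicates = identified - 89               # duplicates removed
after_abstracts = after_duplicates - 144         # out of scope after reading the abstract
included = after_abstracts - 104                 # removed after full reading

print(identified, after_duplicates, after_abstracts, included)
```

The four printed values correspond to the papers identified, the papers left after duplicate removal, the papers left after abstract screening, and the papers finally included.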

Paper Classification Categories
The analysis of literature studied in this scoping review has identified different criteria to organise the articles and their approaches. The papers selected in the search can be classified into three main categories, as can be seen in Figure 2. The first category, Bio-signal, is grounded on the different bio-markers used for the detection of calmness/stress. The specific bio-signals used to detect calm/stress can be easily identified within this group. The dimensionality of the data source can also be determined, i.e. whether a single source or multiple indicators are used for detection. In addition, the type of data used for detection can be identified, differentiating between raw data, processed data and two-dimensional matrices. This category will be extensively covered in Sections 3.3 and 3.4.
The second category, Learning Method, is based on the use and implications of different learning methods for the detection. The majority of the papers analysed rely on supervised classification algorithms, while the use of unsupervised classifiers is less frequent. This second category will be expanded in Section 3.7.
The last category, Application, focuses on the creation of applications that employ different types of classifiers for a specific use. It centres on the goals that are to be achieved, focusing on the creation of applications for the detection, grouping, diagnosis and future prediction of emotional or calm/stress states. Other basic classification principles are whether the application works with a large or small number of participants and signals, and whether the system is intended to be used offline or in real time. The category on applications is not covered in depth in this article, as it falls somewhat outside its objectives. Table 2 lists all the documents used in this scoping review, grouped according to this classification. Naturally, some documents are repeated across several columns of the table, because in most of them there is no unambiguous separation between categories. For example, the creation of an application requires ML methods and bio-signals to be processed.

Data Analysis
After the final selection of the papers, different measurement indicators have been studied as part of the scoping review. Clearly, with the increasing variety of devices and the growing popularity and ease of use of ML techniques, the number of articles in the studied field has increased significantly. One decisive indicator of the importance of the topics studied is the scientific production of articles over time. As can be seen in Figure 3, the growth is almost exponential (R² = 0.80). Presumably, given that the search in 2020 has been limited to the first half of the year, the production of scientific literature will increase and surpass that of previous years.

Methods in Machine Learning for Stress Detection
The human body can be seen as an electromechanical system composed of perceptive, affective and cognitive processes. The inherent dynamic changes allow different measurements to be made on various bio-signals. These time-varying signals make it possible to establish with sufficient precision the physical, psychological and cognitive state of the human being [79,80]. Most biological signals involve electrical activity and conductivity, as well as changes in flow, temperature, volume, pressure, sound and acceleration [76,81,82,83].
There are many physiological variables that can be collected from the human body. The most common are the following. (a) The electrocardiogram (ECG) measures changes in the heartbeat, and its beating pattern makes it possible to diagnose various coronary diseases. (b) Electromyography (EMG) monitors changes in neuromuscular activity. (c) Blood volume pulse (BVP) measures changes in blood volume, which affects blood pressure by changing the cardiac output. (d) Electrooculography (EOG) allows monitoring of eye movements. (e) Pupillography or pupillometry (PUP) is based mainly on the measurement of the pupil diameter under basal conditions and after applying different stimuli. (f) Electroencephalography (EEG) measures the variation of electrical signals produced in different areas of the brain. (g) The inter-breath (IBR) measures the rate of breathing. (h) Acceleration (Acc) monitors body movements. (i) Skin temperature (TMP) is used to quantify temperature changes. (j) Electrodermal activity (EDA) is used to check the state of activation (arousal) and, consequently, one important component of the emotional state of the person. Table 3 summarises the main characteristics and properties of these bio-markers. Correct monitoring of these signals provides us with a map of the physical, cognitive and emotional state of a person. It must be considered that the path from the discovery of the bio-signals to their practical use depends to a great extent on the technology available. Historically, "handcrafted" signals were obtained according to the criteria of an expert technician: specialists were in charge of processing the signals and transforming them into the data used by the detection system.
Obviously, this procedure has clear limitations: (a) it is inefficient, since everything depends on the efficiency of the professional; and (b) if the expert does not have much experience, the signals may present deficiencies and cause the system to operate incorrectly. Currently, this procedure can be performed with almost no expert intervention. ML contributes to the automation of analysis based on biomedical signals [84].

Acquisition Model
The acquisition of the signal is one of the most important processes in using EDA (or any other bio-signal). Most authors covered in this review agree that a good acquisition process is crucial for the subsequent recognition system to work properly. Most acquisition systems follow the model shown in Figure 4, which has three distinct phases. The first phase is the acquisition of the raw signal, which originates from the device that measures the physiological variable. The next phase is preprocessing. This stage is designed to eliminate from the signal all the defects introduced during the acquisition process. As part of this operation, artefacts are removed and the signal is filtered, making it smoother and eliminating possible noise. The final phase is signal processing. In this stage, as a rule, a series of features is extracted from the signal so that it can later be used by a classifier.

Datasets and Experimental Designs
According to the outcomes of our scoping review, the authors always choose between two different procedures to acquire the raw signals. The first one is to create an experimental design to obtain the EDA signal that will be processed later. As a rule, this phase is done as shown in Figure 5. A first step is to start the experiment, then the baseline recording of the input data is started. Next, the person is subjected to a sensory stimulus, most commonly visual and auditory, and the individual's reaction is recorded. The process is labelled and repeated as many times as necessary.
A second alternative procedure to the previous one uses several datasets already validated by the scientific community. These datasets were also recorded in the same manner, usually containing a number of other physiological signals recorded in addition to the EDA signal for use in multi-class classifiers. The most common datasets used for EDA analysis are MAHNOB [85], DEAP [86], BioVid [87] and UT Dallas Database [88].

Signal Preprocessing
As mentioned above, at this stage the signal is preprocessed for cleaning. The preprocessing procedure is used to clean, adapt and prepare the obtained signals for further processing. This process is also critical, as many authors agree that the effectiveness of a classification system begins at this stage. As a rule, it comprises two different steps: the detection and elimination of artefacts, and the filtering of noise.

Artefact Detection and Removal, and Noise Filtering
An artefact is defined as an uncontrolled distortion, addition or error in a signal. During the acquisition of the EDA signal, an artefact may be due to changes in pressure on the electrodes that capture the signal, or to their disconnection. This type of motion artefact (MAt) can degrade the signal very quickly and, as a result, make it unusable [63]. With MAts it is important to decide whether it is better just to detect them or, on the contrary, to suppress them. Artefacts are suppressed by applying various smoothing filters to the signal [89,90]. In most cases this procedure causes a loss of information in the EDA signal.
Detection of the MAt, in turn, consists of identifying each of the signal segments in which the artefact appears, so that it can be eliminated in later phases. Often this type of removal must be done manually. The human operator must be an expert in the application of the filters in order to remove the MAt without losing information. Fortunately, there are studies that automatically detect and remove artefacts in EDA [63,91].
Closely associated with the artefact detection and/or removal process is noise reduction or elimination. Noise can be defined as an unwanted random disturbance superimposed on a useful information signal, originating from natural and sometimes artificial sources. Because the EDA signal evolves slowly, the noise of concern is high-frequency noise [83]. Therefore, the EDA signals are filtered to remove artefacts and noise recorded during the acquisition period. Generally, two different types of filters are used: firstly, a low-pass filter with a 4 Hz cutoff frequency, and secondly, a Gaussian filter to smooth the signal and attenuate artefacts and noise.
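The two filtering steps mentioned above can be sketched as follows. This is a minimal illustration, not the implementation of any reviewed study: the windowed-sinc FIR design, the number of taps, the Gaussian width, the 32 Hz sampling rate and the synthetic test signal are all assumptions made for the example; only the 4 Hz cutoff comes from the text.

```python
import numpy as np

def lowpass_fir(x, fs, cutoff_hz=4.0, numtaps=65):
    """Windowed-sinc (Hamming) FIR low-pass filter with unit gain at 0 Hz.

    numtaps is assumed to be odd so that the output is aligned with the input.
    """
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2.0 * cutoff_hz / fs * n) * np.hamming(numtaps)
    h /= h.sum()
    pad = numtaps // 2
    padded = np.pad(x, pad, mode="edge")  # repeat edge values to limit border effects
    return np.convolve(padded, h, mode="valid")

def gaussian_smooth(x, sigma_samples=4.0):
    """Smooth a signal with a normalised Gaussian kernel truncated at 4 sigma."""
    radius = int(4 * sigma_samples)
    n = np.arange(-radius, radius + 1)
    g = np.exp(-0.5 * (n / sigma_samples) ** 2)
    g /= g.sum()
    padded = np.pad(x, radius, mode="edge")
    return np.convolve(padded, g, mode="valid")

# Synthetic stand-in for an EDA trace (not real data): a slow component
# plus a 10 Hz disturbance, sampled at an assumed rate of 32 Hz.
fs = 32.0
t = np.arange(0, 60, 1 / fs)
clean = 2.0 + 0.5 * np.sin(2 * np.pi * 0.05 * t)
noisy = clean + 0.2 * np.sin(2 * np.pi * 10.0 * t)

filtered = gaussian_smooth(lowpass_fir(noisy, fs))
```

Because the EDA signal varies slowly, the slow component passes almost unchanged while the 10 Hz disturbance, which lies well above the 4 Hz cutoff, is strongly attenuated.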

Signal Processing
The next step that the signal goes through is the processing phase. In this stage, the EDA signal (already preprocessed) is treated to obtain its features. These features quantify each of the obtained segments in a unique way. This treatment of the EDA signal is known as the deconvolution process. Although some of the authors studied do not use this process, deconvolution analysis is very common because it makes it possible to standardise and treat a large group of signals from different participants [17,92].

EDA Deconvolution
As already stated in this review, the processing of EDA signals consists of different phases. Typically, these phases are preprocessing, filtering, artefact removal and discrete deconvolution. EDA measurements are usually made on the palms of the hand, the fingertips or the wrist. These measurements are composed of two superimposed signals: a first signal, which varies slowly, called the skin conductance level (SCL), and a second signal that varies rapidly, the skin conductance response (SCR). The SCL signal establishes the base level of the signal, while the SCR is closely related to the activity of the sudomotor system which, in turn, is closely associated with the sympathetic nervous system [93].
The deconvolution procedure consists of separating the SCR signal from the SCL one. This approach makes it possible to minimise any effect that race, sex, and age might have on the SC signal. Figure 6 illustrates the process. As can be seen, the SCR "driver" is the one that can be used to detect the level of excitation of the individual. Mathematically, the sudomotor nerve function may be considered a driver consisting of a train of impulses that evolves over time. This response is embedded in the SCR and SCL signals. The outcome is represented by a convolution (*) of the driver with the impulse-response function (IRF), which describes how the response to an impulse evolves over time, as shown in equation (1).
The SC signal is formed by the SCL and SCR signals, as displayed in equation (2).
Thus, by deconvolution of equation (3), the tonic signal driver is obtained as:
After this stage, the signals obtained can be used in the feature extraction process.
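The relations described in this subsection can be sketched in LaTeX as follows. The first two lines transcribe the statements in the text (the SCR as the convolution of its driver with the IRF, and the SC as the sum of SCL and SCR). The last two lines are assumptions that are not spelled out above: that the tonic component can also be written as a driver convolved with the same IRF, and that the deconvolution is carried out as a division in the Fourier domain, which is one common way to invert a convolution. The symbols Driver_SCL, Driver_SCR and the Fourier operator are notation chosen for this sketch.

```latex
\begin{align}
\mathrm{SCR}(t) &= \big(\mathrm{Driver}_{\mathrm{SCR}} * \mathrm{IRF}\big)(t)
  = \int_{0}^{t} \mathrm{Driver}_{\mathrm{SCR}}(\tau)\,\mathrm{IRF}(t-\tau)\,d\tau \\
\mathrm{SC}(t) &= \mathrm{SCL}(t) + \mathrm{SCR}(t) \\
% Assumption: SCL(t) = (Driver_SCL * IRF)(t), so both drivers share one IRF.
\mathrm{SC}(t) &= \big((\mathrm{Driver}_{\mathrm{SCL}} + \mathrm{Driver}_{\mathrm{SCR}}) * \mathrm{IRF}\big)(t) \\
% Assumption: deconvolution via the convolution theorem, valid where F{IRF} is non-zero.
\mathcal{F}\{\mathrm{Driver}_{\mathrm{SCL}} + \mathrm{Driver}_{\mathrm{SCR}}\}
  &= \frac{\mathcal{F}\{\mathrm{SC}\}}{\mathcal{F}\{\mathrm{IRF}\}}
\end{align}
```

Under these assumptions, the slowly varying part of the recovered sum is the tonic driver and the remaining train of impulses is the phasic (SCR) driver.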

Feature Extraction
Many of the bibliographic references consulted in this survey coincide in using different features to quantify the signal. Four main groups of features can be distinguished: time domain features, which refer to all the variables that can be defined in terms of time; frequency domain features, which refer to all the parameters that can be defined in terms of frequency or based on it; statistical features, defined as variables that belong to the statistical field; and morphological features, which quantify the shape of the signal. Table 4 details several features that are used to characterise the different segments of the SC, SCL and SCR. It should be noted that these features are used to characterise the signals more accurately. It is good practice to select the features that are best suited according to their proven performance. In the time domain, the following features are commonly used: mean amplitude (Mean); amplitude standard deviation (SD), the first and second derivatives (D1, D2), their means (D1M, D2M) and their standard deviations (D1SD and D2SD) [29]; sum rise time (SRT), sum fall time (SFT), rise rate mean (RM), rise rate standard deviation (RRSTD); decay rate mean (DCRM), decay rate standard deviation (DCRSD); phasic value mean (PHVM), phasic value standard deviation (PHVSD); startle time mean (STM), startle time standard deviation (STSD), startle RMS mean (STRMS), startle RMS standard deviation (STRMSSD); startle RMS overall (STRMSOV); electrodermal level (EDL), electrodermal response (EDR); cumulative maximum (CMax), cumulative minimum (CMin); smallest window elements (SWE); dynamic range (DR); root-mean-square level (RMS), peak-magnitude-to-RMS ratio (PMRMSR); root-sum-of-squares level (RSSL); peak (P), peak location (PLoc), peak-to-peak time (PPT), peak intervals differing by more than 50 ms (pNN50) [24,40,41,53,65,75].
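A few of the time domain features listed above can be computed directly from a signal segment. The sketch below is illustrative only: the reviewed studies may define these features differently, and the choices made here (population standard deviation, derivatives approximated by per-sample finite differences, dynamic range as maximum minus minimum) are assumptions.

```python
import numpy as np

def time_domain_features(segment):
    """Compute a handful of time domain features for one signal segment."""
    x = np.asarray(segment, dtype=float)
    d1 = np.diff(x)        # first derivative, approximated per sample
    d2 = np.diff(x, n=2)   # second derivative, approximated per sample
    rms = np.sqrt(np.mean(x ** 2))
    return {
        "Mean": x.mean(),
        "SD": x.std(),
        "D1M": d1.mean(),
        "D1SD": d1.std(),
        "D2M": d2.mean(),
        "D2SD": d2.std(),
        "DR": x.max() - x.min(),            # dynamic range
        "RMS": rms,                         # root-mean-square level
        "PMRMSR": np.max(np.abs(x)) / rms,  # peak-magnitude-to-RMS ratio
        "RSSL": np.sqrt(np.sum(x ** 2)),    # root-sum-of-squares level
    }

# Toy segment used only to show the call; it is not EDA data.
features = time_domain_features([1.0, 2.0, 4.0, 7.0])
```

In practice such a function would be applied to each labelled segment of the SC, SCL or SCR signal, producing one feature vector per segment for the classifier.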
According to the morphology of the signal, we can find these different features: epoch-capacity (EC) is a relation between the number of epochs and the total number of them; epoch-peak (EP); epoch peak counter (EPC) is a number of epochs in all times; entropy (EN) [18]. On the other hand, there are features that result from different measurements, such as arc length (AL), integral area (IN), normalised mean power (AP), root mean square (RMS), perimeter to area ratio (IL) and energy to perimeter ratio (EL) [29]. These parameters arise from the need to understand the morphological differences in the shape of the SCR driver. It is not only a matter of studying signal peaks; changes in the general morphology of the signals are also of interest.
Finally, in the frequency domain, the following parameters can usually be found: sum spectral components (SSP), spectral power (SP), mean spectral components (MSSP), median spectral components (SSPMed), frequency non-specific of skin conductance response (NSSCRs), fast

Machine Learning for Stress Detection
As a general rule, these kinds of signal-based experiments yield a large amount of data that has to be processed further. It is necessary to use techniques that help us to classify this enormous amount of data. Most authors prefer ML-based techniques to techniques based on pure statistical analysis (ANOVA, MANOVA and Student's t-test). For this reason, an understanding of existing ML models, their main characteristics and methods of evaluation, and their most relevant results is essential. Therefore, we next study the metrics used, the different classification methods employed, and the most relevant results described in each of the publications reviewed.

Evaluation Metrics
According to the literature studied, the process of stress and calm detection is limited to a binary classification problem. The different metrics used to measure performance have been identified. They are presented below, but to understand what they mean, it is necessary to introduce the terms true positive (TP), true negative (TN), false positive (FP) and false negative (FN). On the one hand, TP and TN are used to indicate whether an element of a set has been classified as corresponding to its category. On the other hand, it is possible to define FP and FN when an element belongs to a class completely contrary to that in which it was classified.
The metrics used to assess the performance of the detection are as follows:
• The accuracy (ACC) is the simplest metric. It is defined as the number of correct predictions (TP and TN) divided by the total number of predictions (TP, TN, FP and FN).
• The precision (P) refers to the ratio of successful positive predictions. It is measured as the number of true positives (TP) divided by the total number of cases predicted as positive (TP and FP).
• The recall (R) is defined as the proportion of positive cases captured. Recall describes the sensitivity of the model for identifying the positive class. It is computed as the number of true positives divided by the sum of true positives and false negatives.
• The specificity (S), also named true negative rate (TNR), is defined as the ratio between the true negatives and the total number of true negatives and false positives (negatives that have been labelled as positive). The remaining proportion of actual negatives, which are predicted as positive, is the false positive rate (FPR). The sum of specificity and false positive rate is always 1.
• The F1-score, also called F-measure, is a measurement of the accuracy of a test. It is defined as the harmonic mean of precision and recall. It is applied as a statistical assessment of performance.
• The area under the receiver operating characteristic (ROC) curve (AUC) is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents a degree or measure of separability. It tells how much a model is capable of distinguishing between classes. The higher the AUC, the better the model is at distinguishing between classes. The curve is constructed by plotting sensitivity (true positive rate) on the y-axis against the false positive rate (1 − specificity) on the x-axis.
• The confusion matrix (CM), also known as an error matrix, is a specific table layout that makes it possible to visualise the performance of an algorithm. Every row of the matrix is an instance of an expected class, while every column is an instance of an actual class (or vice versa).
• The kappa-coefficient is a measure of how closely the instances classified by the machine learning classifier match the data labelled as ground truth, controlling for the accuracy of a random classifier as measured by the expected accuracy.
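Most of these metrics follow directly from the four confusion-matrix counts. The following minimal sketch uses the standard definitions (precision as TP / (TP + FP), F1 as the harmonic mean of precision and recall, Cohen's kappa from observed and expected accuracy). The counts in the example are hypothetical and do not come from any reviewed study.

```python
def binary_metrics(tp, tn, fp, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # sensitivity, true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    # Expected accuracy of a random classifier with the same marginals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "f1": f1,
        "kappa": kappa,
    }

# Hypothetical counts for a stress (positive) versus calm (negative) classifier.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

The function assumes that every denominator is non-zero, that is, that both classes occur in the ground truth and that at least one positive prediction is made.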

Classification Methods
In the studies analysed in this scoping review, different classification methods have been found. These methods can be grouped according to different criteria. In the first place, there is direct vs hierarchical classification. Furthermore, considering the duration of the classification, there is long-term vs short-term classification. Finally, we can distinguish between supervised and unsupervised learning methods. Another aspect that must be considered is that ML models have some limitations due to the large number of parameters handled. Consequently, it is necessary to know how to implement methods that help reduce the number of redundant or non-relevant parameters. Dimensionality reduction techniques are becoming significant in the areas of ML, data mining and bio-informatics.
Therefore, one of the main objectives of dimensionality reduction is to eliminate redundant and interdependent features, reducing the dataset to a lower-dimensional space. The number of variables must be reduced, but the bulk of the information in the data must be maintained. In the literature related to signal processing methods, the following feature reduction methodologies are usually used. Principal component analysis (PCA) is a standard statistical data analysis technique that attempts to explain observable signals as a linear mixture of orthogonal principal components, chosen so that each component captures the maximum remaining variance. Linear discriminant analysis (LDA) is typically used to reduce the dimensionality by maximising the separation between the different classes. Finally, independent component analysis (ICA) is an analysis and data processing strategy that attempts to recover unobservable signals or sources from monitored mixtures under the sole assumption of mutual independence between them.
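As an illustration of feature reduction, PCA can be sketched with a singular value decomposition of the centred feature matrix. This is a minimal example under stated assumptions: the feature matrix is synthetic (six correlated features generated from two hidden factors), and the function name and return values are chosen for this sketch rather than taken from any reviewed study.

```python
import numpy as np

def pca(features, n_components):
    """Project features onto their first principal components.

    Returns the projected data and the fraction of variance explained
    by each retained component.
    """
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the orthogonal principal directions, ordered by variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    components = vt[:n_components]
    return centered @ components.T, explained[:n_components]

# Synthetic feature matrix (not real data): 200 segments, 6 redundant features.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 6))

scores, ratio = pca(X, n_components=2)
```

Here two components retain almost all of the variance of the six original features, which is the situation in which a simpler classifier can be trained on the reduced representation.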
The use of these feature reduction techniques considerably lowers the computational cost, since the resulting classifier is simpler and only attends to the important features of the signal. Many of the papers studied in this overview use these techniques and report good results compared with those that do not use them. Below is an explanation of the different methods used.

Direct vs Hierarchical Classification
In many of the articles analysed in this review it is possible to find these two types of classification: direct and hierarchical. A direct classification consists in classifying the emotional state of the person in a direct way considering one or more physiological variables. This process is quite common when dealing with simple emotional models or classifiers that only seek to find a state, as in our case with the use of EDA for stress detection (increase or decrease of arousal). On the other hand, when a hierarchical classification is proposed, there are two distinct stages. In the first stage, the emotional state is established (positive or negative) and in a second stage, the more complex emotional state is classified [15].

Long-term vs Short-term Affective State Classification
Another aspect to consider is whether a classification of the emotional state should consider the duration of the experiment, as well as the evolution of the signals over time. The first aspect to be pointed out is the need for a classifier that works quickly and is consistently robust over a long period of time. In this sense, such a classification can be defined as short-term and long-term. The former is aimed at almost instantaneously determining results, while the latter is oriented towards long-term applications. In the context of stress detection, a long-term classification is usually recommended [29].

Supervised vs Unsupervised Learning
Within the different learning methodologies, there are (apart from reinforcement learning and stochastic learning) two other main groups, namely supervised and unsupervised learning [94]. From the perspective of classification, these are opposite approaches. The first one uses labelled datasets to learn how to make predictions. The second one uses unlabelled datasets to analyse whether there are large groups of data that reflect a situation (e.g. stress detection or emotional state detection).

Supervised Learning Methods
As mentioned above, supervised learning techniques are based on training a classifier from a dataset that is already labelled. Once the system has learned to identify the different patterns, the classifier is able to effectively distinguish between the different classes. In our particular case, it has to distinguish between calm and stress states.
There is a wide range of classifiers with supervised learning. Below you can find the most commonly used classifiers, as well as a small summary of their internal operation. The following are worth mentioning: support vector machines, auto-hidden Markov models, discriminatory analysis, decision trees, naive Bayes, logistic regression, k-nearest neighbours, artificial neural networks, long short-term memory and recurrent neural networks: • Support vector machines (SVMs) were originally conceived by Vapnik in 1995 to solve a dual classification challenge [95]. The algorithm has been evolving to solve the multi-classification problem [96]. This algorithm seeks a hyperplane that optimally divides elements of different classes. The hyperplane is the one with the minimum distance (margin) to the closest points to itself. Depending on the number of features and their dimensions, sometimes a transformation in vector space is required to achieve the best possible separation. The search for the separation hyperplane in the transformed spaces, usually of very high dimension, is based on the kernel function. From the point of view of emotion detection by EDA it is one of the most used algorithms, since it responds very well to the different features obtained during the acquisition process. The most used kernels are linear [24,65,71,72]; quadratic [35,65,75]; polynomial [65,71,75]; Gaussian [65,71]; and radial [17,19,[26][27][28][33][34][35][36][38][39][40][41]44,47,49,53,55,58,60,63,64,67,69,72,74,78,97]. • An auto-hidden Markov model (AHMM) is defined as a discrete-time Markov chain whose states are hidden. In each state there is an observed output in accordance with a state-dependent distribution. In an auto-regressive HMM (AR-HMM) the outputs to be monitored are generated by an auto-regressive model with coefficients that depend on the current state of the HMM [15,16]. 
These models have been widely used for classification purposes using three different criteria: (a) given a sequence of observations and a number of different HMMs, determine the most likely HMM from which the observations are derived; (b) for each sequence of observations that came from a particular HMM, determine the most likely sequence of events that gave rise to these observations; (c) for all sequences of observations, identify the most likely parameters of the HMM that gave rise to the sequence [15,16]. The different approaches are used to determine the status of each person from the EDA signals.
• Discriminant analysis (DA) groups those methods based on Bayes' theorem to estimate probabilities. A DA makes predictions based on the probability that a new set of input data belongs to each class. The class with the highest probability is considered the output class. It takes into account both the prior probability of each class and the probability of the data belonging to each class.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 2 November 2020 doi:10.20944/preprints202011.0043.v1
• Decision trees (DTs) are described as algorithms that perform repeated splits in the dataset to provide maximum data separation. The resulting structure is similar to a tree [98]. The most frequently used criterion is information gain, which implies that the entropy reduction caused by dividing the dataset is maximised in every split. The most used classifiers of this type are the medium tree and the regression tree [13,18,27,55,64,68], as well as ensemble methods such as random forest and bagged trees [18,75].
• Naive Bayes is defined as a type of probabilistic classifier that aims to process and categorise data.
The operation of this classifier is simple. It is essentially a technique that applies probability theory to classify data. Naive Bayes classification algorithms make use of Bayes' theorem. The central idea is that the probability of an event may be adjusted as new data are entered. This classifier is not a single algorithm, but a family of machine learning algorithms that rely on statistical independence. They are relatively easy to implement and run more efficiently than more complex algorithms. A Bayes classifier is called naive because it assumes that all the properties of a particular data item are independent of each other. This classifier fits calm/stress detection well, as the problem has two classes. During this study it has been found that the most used naive Bayes methods in our scoping review are naive- [...] [20,41], and, finally, convolutional NN (CNN) [17,34,35,67,69,78].
• Long short-term memory (LSTM) and recurrent neural networks (RNNs). An RNN is a generalised version of the neural network which has an internal memory. An LSTM, in turn, is a customised model of recurrent neural network which facilitates the memory of past data. The LSTM is well suited to classifying, processing and predicting time series with time intervals of unknown duration [100]. Due to the properties of this type of network for predicting signals over time, many authors use it to work with physiological signals. In this scoping review these networks have been used as LSTM [62] and in ensemble-based methods like CNN+LSTM [62] and the adaptive neuro-fuzzy inference system (ANFIS-based short-term) [53].
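Returning to the decision-tree description above, the information-gain criterion can be illustrated with a minimal sketch. It is not taken from any of the reviewed papers: the feature values, the labels and the helper names (`entropy`, `information_gain`, `best_threshold`) are hypothetical and chosen only to show how a single split on one EDA feature would be scored.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


def best_threshold(samples):
    """Try every midpoint between consecutive feature values and keep the best split."""
    labels = [label for _, label in samples]
    values = sorted({value for value, _ in samples})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for value, label in samples if value <= threshold]
        right = [label for value, label in samples if value > threshold]
        gain = information_gain(labels, left, right)
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical toy data: (mean skin conductance level, label). Not from the review.
samples = [(0.8, "calm"), (1.1, "calm"), (1.3, "calm"),
           (2.4, "stress"), (2.9, "stress"), (3.5, "stress")]

threshold, gain = best_threshold(samples)
print(f"best threshold: {threshold:.2f}, information gain: {gain:.2f} bits")
```

A full decision tree would repeat this search recursively over every feature and every resulting subset; the sketch only covers one split on one feature.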

Unsupervised Learning Methods
The second group of learning methods addressed is unsupervised learning [101]. This type of method is based on learning from an unlabelled dataset. The model obtained is automatically adapted to the observations and is mainly created through the use of clustering methods. According to the literature found in the scoping review, the following unsupervised methods have been used:
• K-means is a clustering method aimed at splitting an unlabelled dataset of n observations into k groups in which every single observation belongs to the group whose mean value is the closest. The algorithm looks for a predefined number of clusters in an unlabelled multi-dimensional dataset and rests on a simple notion of what an optimal cluster looks like. The approach relies on two conditions. First, the centre of a cluster is the arithmetic mean of all data points associated with the cluster. Second, each point is closer to its own cluster centre than to any other cluster centre. The fundamental idea is to find the clusters that cover the greatest number of elements while minimising inertia [40].
• K-medoids is a grouping approach for partitioning a dataset into k groups or clusters, each group being represented by one of its own data points. These points are called cluster medoids. The term medoid describes an item within a cluster for which the average dissimilarity between it and all other members of the cluster is minimal. It corresponds to the most central point of the cluster. K-medoids is a robust alternative to K-means, since the algorithm is less sensitive to noise and outliers [40].
• A self-organising map (SOM) is a type of ANN that is trained using unsupervised learning to generate a low-dimensional map, typically two-dimensional [102].
SOMs differ from other ANNs in that they apply competitive learning rather than error-correction learning (such as backpropagation with gradient descent), and in that they use a neighbourhood function to preserve the topological properties of the input space. In the selected literature we have found the use of SOMs for the detection of arousal [40,44].
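The two K-means conditions described above (centres are cluster means, and every point belongs to its nearest centre) and the medoid definition can be sketched as follows. This is only an illustrative sketch using the standard alternating procedure commonly known as Lloyd's algorithm; the two-dimensional points, the initial centres and the function names are assumptions made for the example and do not come from the reviewed studies.

```python
import math


def mean_point(points):
    """Arithmetic mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def k_means(points, centres, iterations=20):
    """Alternate between assigning points to the nearest centre and updating centres."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Keep the old centre if a cluster ends up empty
        new_centres = [mean_point(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters


def medoid(points):
    """Member of `points` with the smallest total distance to all other members."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Hypothetical two-feature observations (e.g. tonic level, response rate)
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
centres, clusters = k_means(points, [(0.0, 0.0), (6.0, 6.0)])
print("k-means centres:", centres)

# One outlier pulls the mean away, while the medoid stays inside the dense group
cluster_with_outlier = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (9.0, 9.0)]
print("mean:  ", mean_point(cluster_with_outlier))
print("medoid:", medoid(cluster_with_outlier))
```

The last two lines show why a medoid-based method is less affected by outliers: the mean of the contaminated cluster moves to a location where no observation lies, whereas the medoid remains an actual data point in the dense region.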

Results and Discussion
This section presents the results obtained during this scoping review. As has been noted throughout the paper, different analyses of the data obtained are carried out in this type of review. Our analysis focuses on determining which are the best classifiers (supervised and unsupervised) for calm/stress detection. In addition, an analysis has been carried out to evaluate whether EDA, as a single variable or in combination with other physiological signals, is a good estimator of arousal (stress) variation.

Bio-markers Used in the Papers
One of the considerations taken during this study was to analyse the number of articles that use only EDA to make the different classifications, as well as those in which, in addition to the EDA signal, other biological markers are used to reinforce the classification results. As can be seen in Table 5, the publications have been grouped according to the number of variables used. Note that the EDA variable is present in all eligible documents. As already mentioned throughout the article, this is because EDA is a very good estimator of the activation level due to its connection with the sympathetic nervous system through the sudomotor system. The works found include between 5 and 260 participants and use other variables besides EDA, such as BVP, TMP, EEG, EOG, EMG, Acc, PUP and IBR.
Another variable that helps to determine different emotional states in the participants is BVP, which combined with EDA also gives very good prediction results [12,15,20,22,38,48,56,57,58]. Table 5 shows additional variables. The next major group (besides EDA and BVP) includes TMP. These articles focus on the integration of this variable for stress detection. When the EMG signal is added, the results improve slightly. This may be due to the fact that this physiological signal comes from the central nervous system and complements EDA very well. Another variable used for stress measurement is EEG combined with EDA. This type of signal is widely used on its own and provides good results in stress detection. Nonetheless, EEG requires very expensive and precise devices and quite specific knowledge to set up the acquisition of the signals. Finally, IBR also provides additional information to the classifiers, but without achieving great improvements.

Supervised Learning Methods
A large number of the papers found use supervised learning methods (see Table 6). Below are the results obtained for each of the ML algorithms used.

Support Vector Machines
SVMs are the most widely used classification methods for detecting stress. Throughout the present survey, results have been obtained for each of the existing configurations. SVMs with linear, quadratic, cubic, polynomial, Gaussian, radial and radial with PCA kernels have been proposed for stress detection.
In the case of the linear kernel, the average accuracy in the articles studied is 55.64%. This value can be considered poor but, as a rule, linear classifiers do not work very well with this type of physiological variable. For the quadratic kernel, on the other hand, the average result is 85.26%, which is already very good for EDA. The cubic kernel also works very well, giving an accuracy of 82.86%, which is in the same range as the quadratic kernel. The worst classifier is the SVM with polynomial kernel. In most works where it is applied, accuracy is very poor, with an average of 46.56%. The papers that apply a medium Gaussian kernel report a performance of 76.33%.
The radial kernel, the most used in the literature, obtains average results of 75.34%. This performance is considered acceptable because other measures, such as the ROC curve or the sensitivity and specificity values, are very high, in many cases approaching 1 (the maximum achievable level). This suggests that it is a more robust classifier than the rest, which explains why it is used so frequently. In addition, it should be noted that these classifiers present values higher than 90% at the prediction level, only comparable with the performance of the different topologies and configurations of ANNs [41]. Finally, when a feature reduction analysis (PCA) is applied to the previous approach, the average classification result is 82.24%.
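Because the reviewed studies report different measures (accuracy, precision, F1-score, sensitivity and specificity), it helps to recall how each one derives from the same confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not results from any of the surveyed papers, and "stress" is assumed to be the positive class.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard performance measures for a two-class (calm/stress) problem."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # also called recall or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 50 stress windows and 50 calm windows
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts the accuracy is 0.85 while sensitivity is 0.80 and specificity is 0.90, which illustrates why averages of different measures across papers should be compared with care.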

Auto-Hidden Markov Models
Two types of Markov-chain algorithms are used. The first, the auto-hidden Markov chain, achieves a result of 88.6% both with and without PCA. With a more classic approach, the standard Markov chain, the value is 68.7% when the baseline is taken into account, while the accuracy increases to 79.83% when the baseline is not considered.

Discriminant Analysis
In the case of discriminant analysis, the mean prediction values for the linear and quadratic configurations are 73.54% and 64.81%, respectively. As can be seen, a higher-order configuration worsens the results. If a feature reduction algorithm is applied to the linear discriminant, the result obtained is 71.09%, and for a Gaussian-type configuration it is 71%. These results suggest that the only configuration that yields acceptable results is the linear discriminant. This is largely due to the internal workings of the classifier, as well as its ability to remove those features that do not provide relevant information. Something similar occurs in papers where feature removal is performed, such as in the case of LDA, as will be seen below.

Decision Trees
Other classifiers commonly used for processing EDA (and other bio-signals) are decision trees. This type of classifier works by recursively splitting the data into smaller and smaller subsets until an element can be assigned to a particular class. The method is very effective when the model has multiple features. In most of the articles selected in the scoping review, this type of algorithm has been implemented with the Matlab library called "App learner" using standard configurations (Gini criterion). The average result obtained for the regression trees is 73.30%. Another configuration used in this type of tree is the "medium" configuration, which has given a result of 70.8%. Turning to the ensemble methods, the result for random forest is 77.85%, while the "bagged trees" ensemble method reaches a precision of 78.35%. This configuration is widely used for classification as it gives better results than many other ensemble methods.

Naive Bayes
As for the Bayes classifier, the result of 49.46% obtained for the Gaussian configuration is quite poor. This is because this type of classifier assumes that the variables are independent (which is not the case for EDA signals). The results improve considerably, up to 70.8%, when this method is combined with PCA.
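The independence assumption mentioned above becomes visible in code: a Gaussian naive Bayes classifier scores each class by adding one log-likelihood term per feature, with no term that models how features vary together. The following is a minimal sketch under that assumption; the function names and the two-feature toy data are hypothetical and are not drawn from the reviewed studies.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def fit_gaussian_nb(X, y):
    """Estimate the prior and the per-feature mean/variance of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        model[label] = {
            "prior": len(rows) / len(X),
            "means": [mean(c) for c in columns],
            # Small constant avoids division by zero for constant features
            "vars": [pvariance(c) + 1e-9 for c in columns],
        }
    return model


def log_gaussian(x, mu, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


def predict(model, row):
    """Pick the class maximising log prior + sum of per-feature log likelihoods."""
    def score(label):
        params = model[label]
        # Features are treated as independent: their log likelihoods are simply added
        return math.log(params["prior"]) + sum(
            log_gaussian(x, mu, var)
            for x, mu, var in zip(row, params["means"], params["vars"])
        )
    return max(model, key=score)


# Hypothetical features per window: (tonic level, number of responses per minute)
X = [(0.9, 2.0), (1.1, 3.0), (1.0, 2.5), (2.9, 8.0), (3.1, 9.0), (3.0, 8.5)]
y = ["calm", "calm", "calm", "stress", "stress", "stress"]

model = fit_gaussian_nb(X, y)
print(predict(model, (1.2, 3.2)))
print(predict(model, (2.8, 7.5)))
```

Since PCA produces uncorrelated components, applying it before this classifier brings the data closer to the independence assumption, which is one plausible reading of the improvement reported above.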

Logistic Regression
The average precision is 75.01% for the standard configuration of the classifier based on logistic regression and 57.54% for a zero-regression configuration. Changing to a configuration that would be expected to perform better therefore does not improve the results; in this case they even get noticeably worse. This type of classifier is not widely used with biological signals compared to others found in this study, so the results are in line with expectations.

K-Nearest Neighbours
KNN is one of the most widely used classifiers for physiological signals (and is also widely used for EDA). This type of classifier assigns each sample to a class according to its nearest labelled neighbours, without building a model first. Many configurations can be used with this classifier. The most widely used according to the literature reviewed is KNN-Medium. This configuration uses a moderate number of neighbours, which makes it more robust to the noise produced by outliers. In our scoping review, these classifiers have an average performance of 79.04%. The approaches with KNN-Weighted show a precision of 76.53%. For the cubic configuration the classifier has almost the same level of prediction as the weighted configuration, with an average precision of 76.85%. KNN-Fuzzy, which uses fuzzy logic to operate, yields a precision of 78.65%. Finally, the configuration that uses a smaller number of neighbours, KNN-Fine, gives very good results of 82.8%, although it is more sensitive to noise.
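The trade-off between KNN-Fine and KNN-Medium described above can be illustrated with a small sketch in which only the number of neighbours k changes. The values k = 1 and k = 5 are arbitrary choices for the example and are not claimed to match the presets used in the reviewed papers; the training points, including one deliberately mislabelled sample, are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples closest to `query`."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with one noisy label inside the calm group
train = [
    ((1.0, 1.0), "calm"),
    ((0.9, 1.1), "calm"),
    ((1.2, 0.9), "calm"),
    ((1.0, 1.2), "calm"),
    ((1.1, 1.0), "stress"),   # mislabelled outlier
    ((3.0, 3.0), "stress"),
    ((3.1, 2.9), "stress"),
    ((2.9, 3.2), "stress"),
]

query = (1.12, 1.0)
print("k=1:", knn_predict(train, query, k=1))  # follows the single noisy neighbour
print("k=5:", knn_predict(train, query, k=5))  # the vote of the calm group prevails
```

With a single neighbour the prediction follows the noisy sample, whereas a larger neighbourhood outvotes it. A small k can still score well when the labels are clean and the classes are well separated, which is consistent with the good KNN-Fine results reported above.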

Artificial Neural Networks
Regarding the use of artificial neural networks for the detection of stress through EDA, it must be recognised that there are many configurations, topologies and types. Therefore, on many occasions a configuration that works well in certain cases obtains only unremarkable results in other studies. The authors studied do agree that ANNs provide very good results in many different configurations. Thus, a convolutional analysis with a 1-D-CNN configuration obtains an F1-score of 85.57%, and a 1-D-CNN-E network ensemble achieves an F1-score of 92.2%.
For ANNs there are many configurations in the literature, the results of which are as follows. AdaBoost achieves an accuracy of 99.69%. The 3-layer and 5-layer configurations provide a precision of 95.02% and 98.81%, respectively. For the multi-layer perceptron in the feed-forward configuration, the accuracy drops to 81.50%. Similarly, for the same perceptron with the back-propagation algorithm, the classification accuracy is 79.13%. As for the implementation of new methods in ANNs, there is a configuration that uses the LUCCK method (concave and convex kernel) with a result of 89.23%, in line with those obtained previously. The use of cellular networks has also brought results of 89.38%. However, for the rest of the configurations the results do not improve with respect to the previous ones, with a precision of 73.13% and an accuracy of 80.5% on CNN (MLP-BP) and CNN-neuro-fuzzy (MLP-NF), respectively.

Long Short-Term Memory and Recurrent Neural Networks
Another major field that has opened up in signal processing, detection and classification is that of networks with memory, as is the case of RNNs and their variant, the LSTM. This type of network may be used alone or combined with other configurations in an ensemble. For an LSTM network the F1-score is 81.4%, while CNN+LSTM obtains an F1-score of 79.13% and the ANFIS configuration variant reaches 95%. Although there is little literature on this type of classifier, it should be regarded as a good alternative when the aim is to use a time-domain dataset based on the processed skin conductance response (SCR).

Unsupervised Learning
In terms of unsupervised learning, there is very little literature on activation detection or analysis of emotional states (see Table 7). Below are the most used methods detected throughout this review and their most important results. We believe that the lack of development of these techniques is not due to their being more difficult to implement than the supervised learning methods mentioned above. Rather, most experiments are directed at data collection and labelling, so it is much easier to work with an already labelled dataset. It should be noted that, since there are not many studies of this type, they cannot be compared with others and these results must be taken with caution. One of the unsupervised learning algorithms used is K-means, which has a precision of 77.5%. Its advantages are that it is very easy to implement and usually performs well with large datasets. On the other hand, it is very sensitive to the number of clusters chosen, which in many cases has to be found by hand before letting the system learn. It is also quite sensitive to noise, so outliers can be a determining factor in achieving a successful detection. In order to minimise the effects of the noise produced by outliers in a dataset, the K-medoids approach has been tested. The result of 75.5% precision is at the same level as that obtained for K-means. Finally, as an alternative to the previous unsupervised learning techniques, there are the methods based on self-organising maps. These variants of ANNs are widely used because they make it possible to generate two-dimensional maps, as in convolutional-ANN analysis, for later classification. In this case, the results obtained for this classifier are at the same level as the previous ones (77.5%).

Conclusions
This paper has presented a scoping review of the use of physiological signals for stress detection, focusing on the use of electrodermal activity (EDA) and various machine learning (ML) techniques. An initial set of 395 documents has been considered, from which 58 have been selected to carry out the scoping review in depth. These articles have made it possible to provide a global perspective on a specific topic such as the use of EDA, individually or in conjunction with other variables, for the detection of stress (variation of arousal) using ML techniques.
Several important issues have emerged during this study that should be considered by researchers interested in signal acquisition in general and EDA processing in particular. The first point to note is that the classification process must be kept in mind from the moment the signals are obtained (acquisition process). Without a robust acquisition process the signals become useless for a correct later classification. All operations carried out in later stages will be useless if the first step of the whole process is not carried out correctly.
In addition, most of the authors studied in this scoping review emphasised that this process is not exempt from dealing with interference, artefacts or noise in the signals. A correct application of the different filters during the preprocessing stage is crucial for the following phases. When working with EDA signals, all the articles surveyed highlighted that the signals must go through a deconvolution process to homogenise and normalise them. The standardisation process makes it possible to use a dataset with a large amount of data coming from different people without being affected by their race, sex or age. In fact, those studies in which there was no deconvolution process have been discarded, due to the poor results obtained by any classifier.
Once the signals have been preprocessed, the next very important step is to obtain the different features that allow us to quantify them. These features are then used in the subsequent classification stage. At this point, most authors agree on using features from different domains, generally the temporal domain and the frequency domain. There are also approaches that analyse the shape of the signal (morphological features) and those that analyse the signal statistically (statistical features). There is no consensus on the number of variables or the minimum number of features that should be used. The general approach is to use different types and fit the model using LDA, PCA or ICA to perform dimensionality reduction.
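As a rough illustration of the standardisation and feature extraction steps summarised above, the sketch below standardises one window of an EDA signal and computes a handful of time-domain and statistical descriptors. It is only a sketch under simplifying assumptions: it does not implement deconvolution, frequency-domain features or dimensionality reduction, and the sample values, feature names and function names are hypothetical rather than taken from the reviewed papers.

```python
from statistics import mean, pstdev


def z_score(signal):
    """Standardise a signal to zero mean and unit standard deviation."""
    mu, sigma = mean(signal), pstdev(signal)
    return [(x - mu) / sigma for x in signal]


def window_features(window):
    """A few time-domain and statistical descriptors of one signal window."""
    diffs = [b - a for a, b in zip(window, window[1:])]
    # Local maxima: samples strictly higher than both neighbours
    peaks = sum(1 for a, b, c in zip(window, window[1:], window[2:]) if b > a and b > c)
    return {
        "mean": mean(window),
        "std": pstdev(window),
        "range": max(window) - min(window),
        "mean_abs_diff": mean(abs(d) for d in diffs),
        "n_peaks": peaks,
    }


# Hypothetical skin conductance samples (microsiemens) from one participant
eda = [2.0, 2.1, 2.6, 2.3, 2.2, 2.9, 2.5, 2.4]

standardised = z_score(eda)
features = window_features(standardised)
for name, value in features.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Standardising each participant's signal before extracting features is one simple way to make windows from different people comparable; the resulting feature vectors would then be passed to a dimensionality reduction step and a classifier.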
Furthermore, during the review, two clearly differentiated pathways for the estimation of the participant's emotional state have been found. The first approach aims to use only EDA for detection, while the second combines the EDA signal with other physiological signals such as BVP, ECG and EMG, among others. Both methods have their advantages and disadvantages. One advantage of using EDA individually is the possibility of using very small non-invasive devices with great autonomy. Another advantage is that the results using only EDA are quite good. A disadvantage is that the use of this variable alone limits us to the problem of stress (arousal variation) detection. This is because EDA only reflects the response of the sympathetic nervous system (sudomotor nervous system). In contrast, using more physiological signals has the advantage that it makes it possible to monitor several systems and, as a consequence, several types of responses provide a better mapping of the physical, psychological and cognitive state of a subject. Nonetheless, a disadvantage is that the use of different signals makes the system more complex, more difficult to maintain and computationally more costly when using a classifier.
If we now consider classification methods based on ML techniques, it is clear that there are two main groups of methods, those based on supervised and unsupervised learning. Throughout the literature, most authors are committed to the use of techniques based on supervised learning. This is largely due to the fact that the experiments and datasets are labelled for each of the states. For this reason, few articles use unsupervised techniques.
Among the supervised learning methods, SVMs and many of the ANN topologies report better classification results than the rest, followed closely by KNN algorithms. This is because in most cases these algorithms work very well when facing many features of the signals (EDA and others). Among SVMs, those implementing quadratic, cubic and radial (with PCA) kernels stand out, with accuracies of 85.26%, 82.86% and 82.24%, respectively. ANNs, on the other hand, stand out for their accuracy in different configurations, especially ANN-AdaBoost with 99%, and different configurations of the MLP with 95% and 98% for the 3-layer and 5-layer, respectively. Moreover, from the perspective of unsupervised learning, the different results are at a similar level. K-means, K-medoids and SOMs are acceptable classifiers showing around 75% accuracy. As a final conclusion we can say that the individual use of EDA for the detection of arousal variation is very widespread, having achieved very good results in its prediction. Furthermore, its use in combination with other physiological signals and with the help of robust and novel ML techniques has increased over time. For this reason, emotion detection is being integrated in a non-invasive manner into user-centred devices and, at the same time, the robustness and precision of today's systems and applications have been enhanced. This use allows us to monitor the emotional state of the subject continuously. Although it is not a perfect detection model, combined with other variables it makes it possible to work towards the design of increasingly useful and interesting applications in health and education. Specifically, stress detection will gain an important place in many areas.
Acknowledgments: This work has been partially supported by Spanish Ministerio de Ciencia, Innovación y Universidades, Agencia Estatal de Investigación (AEI) / European Regional Development Fund (FEDER, UE) under EQC2019-006063-P and DPI2016-80894-R grants, and by CIBERSAM of the Instituto de Salud Carlos III. Roberto Sánchez-Reolid holds the BES-2017-081958 scholarship from Spanish Ministerio de Educación y Formación Profesional.

Conflicts of Interest:
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: