EEG signal classification: an application to the emotion-related brain anticipatory activity.

: Machine Learning (ML) approaches have been fruitfully applied to several classification problems of neurophysiological activity. Considering the relevance of emotion in human cognition and behaviour, ML found an important application field in emotion identification based on neurophysiological activity. Nonetheless, the literature results present a high variability depending on the neuronal activity measurement, the signal features and the classifier type. The present work aims to provide new methodological insight on ML applied to emotion identification based on electrophysiological brain activity. For this reason, we recorded EEG activity while emotional stimuli, high and low arousal (auditory and visual) were provided to a group of healthy participants. Our target signal to classify was the pre-stimulus onset brain activity. Classification performance of three different classifiers (LDA, SVM and kNN) was compared using both spectral and temporal features. Furthermore, we also contrasted the classifiers performance with static and dynamic (time evolving) features. The results show a clear increased in classification accuracy with temporal dynamic features. In particular, the SVM classifiers with temporal features showed the best accuracy (63.8 %) in classifying high vs. low arousal auditory stimuli


Introduction
In last decades, the vision of the brain has moved from a passive stimuli elaborator to an active reality builder.In other words, the brain is able to extract the information from the environment, building up inner models of external reality.These models are used to optimize the behavioural outcome to react to upcoming stimuli [1][2][3][4].
One of the main theoretical model assumes that the brain, in order to regulate body reaction, runs an internal model of the body in the world, as described by embodied simulation framework [5].'An increasingly popular hypothesis is that the brain's simulations function as Bayesian filter for incoming sensory input, driving action and constructing perception and other psychological phenomena, including emotion' [6].In light of this, it is possible to consider emotions, not only as a reaction to the external world, but also as partially shaped by our internal representation of the environment, which help us to anticipate possible scenarios and therefore to regulate our behaviour.
The construction model of emotion [7] argues that the human being actively builds-up his/her emotions in relation to the everyday life and social context in which is placed.We actively generate a familiar range of emotions in our reality, based on their usefulness and relevance in our environment.In this scenario, in a familiar context we are able to anticipate which emotions will be probably elicited, depending on our model.As a consequence, the study of the anticipation of/preparation to forthcoming stimuli, may represent a precious window on the understanding of the individual internal model and emotion construction process, resulting in a better understanding of human behaviour.
A strategy to study preparatory activity could be related to experimental paradigm in which cues are provided regarding the forthcoming stimuli, allowing in this way the investigation of the brain activity dedicated to the elaboration of incoming stimuli [8,9].Cue experimental paradigm predicting the emotional valence of the forthcoming stimuli showed that the brain anticipatory activation facilitates for example successful reappraisal via reduced anticipatory prefrontal cognitive elaboration and better integration of affective information in paralimbic and subcortical system [10].Furthermore, preparation to forthcoming emotional stimuli has relevant implication also for psychological clinical conditions, as mood disorders or anxiety [11,12].
Recently the study of brain anticipatory activity has been extended to statistically unpredictable stimuli [13][14][15], providing experimental hints of specific anticipatory activity before stimuli randomly presented.Starting from the abovementioned studies, we focused on the extension of brain anticipatory activity to statistically unpredictable emotional stimuli.
According to the so-called dimensional model, the emotion can be defined in terms of 3 different attributes (or dimensions): valence, arousal and dominance.Valence measures the positiveness (ranging from unpleasant to pleasant), arousal measures the activation level (ranging from boredom to frantic excitement) and dominance measures the controllability (i.e. the sense of control) [16].
Emotions can be estimated from various physiological signals [17], such as the skin conductance, the EEG and the EEG.The latter has been received a considerable amount of attention in the last decade, introducing in the emotion recognition field several Machine Learning (ML) and signal processing techniques, originally developed in other contexts, such as the Brain-Computer Interfaces [18].Emotion recognition has been re-drawn as a Machine Learning problem, where proper EEG-related features are used as input to specific classifiers.
The most common features belong the spectral domain, in the form of spectral powers in delta, theta alpha and gamma bands [19], as well as Power Spectral Density (PSD) bins [20].The remaining belong to the time domain, in the form of Event-Related De/Synchronizations (ERD/ERS) and Event-Related Potentials (ERP) [19], as well as shape-related indices such as the Hjorth Parameters and the fractal dimension [20].
The most commonly used classifier is the Support Vector Machine (SVM) with Radial Basis Function (RBF) kernel, followed by the k-nearest neighbour (kNN) and the Linear Discriminant Analysis (LDA) [19,20].Finally, most of the classifiers are implemented as not-adaptive (i.e.static) [19], in contrast to the dynamic versions that take in account the temporal variability of the features [21].
The classification performances are very variable, because of the different features and classifiers adopted.The following examples are taken from [19] -in particular, from the subset (17 over 63) of reviewed papers that focused on the arousal classification.Using an SVM (RBF kernel) and spectral features (e.g.STFT), Lin and colleagues obtained 94.4% of accuracy (i.e. percentage of corrected classification) [22], while using similar spectral features (e.g.PSD) and classifier (SVM with no kernel) Koelstra and colleagues obtained 55.7% [23].Liu and Sourina obtained 76.5% using temporal features (e.g.fractal dimension) with a SVM (no kernel) [24], while Murugappan and Murugappan obtained 63% using similar temporal features and a SVM with polynomial kernel [25].Finally, Thammasan and collegues obtained 85.3% using spectral features (e.g.PSD), but with a kNN (with k=3) [26].All the classifiers were static.
The purpose of the present work is to provide new methodological advancements on the ML classification of emotions, based on the brain anticipatory activity.For this purpose, we compared the performances of tree different classifiers (namely LDA, SVM, kNN) trained using two types of EEG features (namely, spectral and temporal).In addition, each classifier was dynamically-trained, to take in account the temporal variability of the features.The results provided useful insights regarding the best classifier-features configuration to better discriminate emotion-related brain anticipatory activity.
The paper is organized as follows: in Section 2, we provided a brief but comprehensive and self-contained starting point on Machine Learning; in Section 3, we described experiment design, the data processing and the classification processes; finally, in Sections 5-6 we presented, respectively, the results and the discussion.

The Machine Learning (ML)
In this section we briefly discuss the Machine Learning (ML) and the general classification problem, as well as some classification algorithms and feature selection methods.We are aware that the treatment is far from being fully exhaustive, but we hope it will serve as a comprehensive and self-contained starting point for novice readers.A more complete and precise treatment can be found in textbooks, such as [27,28].

The classification
The classification aims to assign a discreate labels (also known as class) to a m-dimensional numeric instance (also known as feature vector or, simply, feature)  ∈ ℝ  by means of a function : ℝ  →  ∈ ℤ (also known as classifier).
Without loss of generality, will concentrate only on binary classification (class  1 versus  2 ).The multiclass classification can be performed using a cascade of binary classifications, following either the OVR (one-versus-rest) or the OVO (one-versus-one) strategies.In the OVR, each class   is classified against the corresponding aggregated class  ̃ = ⋃   ≠ , while in the OVO each class   is tested against each class  ≠ [28] (pp.182-183).
The classes  1 and  2 are usually numerically coded as, respectively, +1 and −1 .The classification can be, thus, reformulated into finding a discriminant function ℎ: ℝ  → ℝ such that: The discriminant function can be either linear or not-linear, corresponding to, respectively, linear or not-linear classifiers.
The so-called Bayes classifier represent the optimum (i.e. the best of all the possible classifiers), since gives the lowest conditional classification risk [27] (pp.24-25) (i.e. the effects of a wrong classifications) and classification error (i.e. the probability of wrong classifications) [27] (pp.[26][27].It is a non-linear classifier with a discriminant function ℎ given by [27] (p.31): where Pr{ 1 |} and Pr{ 2 |} are the posterior probabilities of, respectively, classes  1 and  2 .Thanks to the Bayes theorem and applying a logarithmic transform, eq. ( 2) can be equivalently re-written as [27] (p.31): The functions  1 () ≐ (| 1 ) and  2 () ≐ (| 2 ) are the two class-conditional probability density functions (PDF), namely the PDFs of the feature vectors belonging to classes  1 and  2 , while the quantities ℘ 1 and ℘ 2 are the a piori probabilities for classes  1 and  2 .
The knowledge of both the class-conditional PDFs and the a priori probabilities, or the posterior probabilities allows a direct implementation of the Bayes classifier using eq.(2-3) and it is a statistical problem.Unfortunately, in practice, the available information is rarely (if ever) such complete.
The theory of Machine Learning (ML) addresses this problem using a "learning from examples" approach: the classifier is built (or trained) using a so-called training set, a finite collection of prototypical and annotated (i.e. with the class information) features.
The so-called informative (or discriminative) classifiers estimate from the training set the class-conditional PDFs or prior probabilities [29].More in dept, the so-called informative parametric classifiers assumes a specific class-conditional PDF (e.g.multivariate gaussian PDF) whose parameters (e.g.mean and covariance matrix) are estimated from the training set.The informative non-parametric classifiers, on the contrary, do not assume any specific PDF [27] (pp.84, 161).
Once found, the class-conditional PDFs or prior probabilities are inserted (plugged) into, respectively, eq. ( 2) or (3) in order to directly implement the Bayes classifier: for this reason, informative classifiers are also called plug-in classifiers [30].
Finally, the so-called generative classifiers directly implement a discriminant function by maximizing a proper criterion (e.g. the so-called Marginal Risk function, as shown the SVM section) on the whole domain of the training set [29].

LDA
The linear classifier, also known as Linear Discriminant Analysis (LDA) [27] (p.117), is an example of informative parametric classifier.It corresponds to a discriminant function ℎ that is a linear-affine combination of the feature, in the form: where  ∈ ℝ  and  ∈ ℝ are the classifier's parameter, also known as projection vector and threshold, respectively.
Assuming the class-conditional PDFs as multivariate gaussian with mean values  1 ,  2 and equal covariance matrices  1 =  2 ≡ , as well as equal a priori probabilities (℘ 1 = ℘ 2 ), the Bayes classifier of eq. ( 2) reduces to a linear classifier, with parameters [27] (p.39): Both sample means  1 ,  2 and the sample covariance matrix  of eq. ( 4) are estimated using a maximum-likelihood approach [27] (p.89): where   is the number of training features belonging to the class   and  =  1 +  2 .
As shown in eq. ( 4), the covariance matrix  ̃ must be invertible (i.e.non-singular) but, unfortunately, this requirement is not always met.Regardless of the training set, for example, when the number of training features are less than the feature dimensionality (i.e.  <  + 2),  ̃ is certainly singular.To overcome to this problem, several LDA variations have been proposed.In the Regularized LDA (RLDA),  ̃ is replaced by its shrinkage estimation (also known as ridge estimation) defined as  ̃() =  ̃+ , where  is the shrinkage parameter and  is the identity matrix.In the pseudo-inverse LDA, the inverse  ̃−1 is replaced by the pseudoinverse  ̃+ [31].

SVM
The Support Vector Machine (SVM) is an example of discriminative classifier with a linear discriminant function (a linear discriminative classifier).Before going into the classifier's details, we need to explain some introductory concepts.
Geometrically, a discriminant function    +  defines a hyperplane in the -dimensional space with equation Π = { ∈ ℝ  |   +  = 0}.For each training feature   , a signed distance from the hyperplane Π is defined as   =     + .The minimum positive distance  + refers to the minimum distance of the feature vectors belonging to  1 , while the minimum negative distance  − refers to the minimum distance of the feature vectors belonging to  2 .Training features with distances equal to the minimum distance (either positive of negative) are called support vectors.Finally, the margin of the classifier is defined as the sum of the minimum positive and negative distances [27] (pp.259-263).
In the so-called hard margin SVM, the training set is assumed linearly separable, that is, there exist a proper linear discriminant function that correctly classify each feature vector   .The projection vector and the threshold are obtained by maximizing the margin, solving the following quadratic optimization problem [32]: where   = {−1, +1} is the class of the -th feature and  is the cardinality of the training set.
Margin maximization is equivalent to the minimization of the criterion , called Expected Risk function, defined as [32]: The expected risk function represents the mean classification error.It depends from both the training set and the chosen hyperplane and does not require any probabilistic assumption on the training set.
Hard margin SVM can be modified (the so-called SVM with soft margins) to work also with not linear-separable dataset, by adding a set of non-negative slack variables {  ≥ 0} =1  .
They take in account each misclassification error: if the feature   is correctly classified, the corresponding slack variable   is set to zero, otherwise its value is changed accordingly.The optimization problem for SVM with soft margins is the following [32]: where  is an additional parameter that control the penalty of the classification errors.SVM with soft margins is not the only solution to treat not linear-separable data sets.By projecting the features into a potentially much higher dimensional space, the dataset can be made linear-separable.The projection is obtained by using a not-linear function : ℝ  → ℝ > .In estimating the classifier parameters, the optimization problems are thus solved in the (higher dimensional) transformed space and the following classification is performed on the (higher dimensional) transformed feature vectors ().
Working with high dimensional data generally increase both the complexity of the estimation problems (the so-called curse of dimensionality) and the computational cost of the classification.If the former problem (estimation complexity) is mitigated by the fact that in the higher dimensional space the selected classifier is much simpler (linear), the latter remains [33].In fact, as shown in eq. ( 4), the linear classification of each feature vector requires () additions and multiplications, corresponding to the scalar product   .
Fortunately, for certain mapping , the scalar product ()  () can be equally expressed in terms of a kernel function : ℝ  × ℝ  → ℝ such that [33]: The property described in eq. ( 13) is often called kernel trick.Once defined a proper kernel , the scalar product can be implicitly computed in the higher dimensional space without explicitly using (or even knowing) the mapping function .This allows a straightforward re-formulation of all the linear classification algorithms in the higher dimensional space: SVM, for example, generalize to the so-called Nonlinear SVM (also known as SVM with kernel).A commonly used kernel is the gaussian kernel (or Radial Basis Function, RBF), defined as [32]: 2.1.3.kNN k-Nearest Neighbours (kNN) is an example of informative non-parametric classifiers.It estimates the posterior probabilities Pr{ 1 |} and Pr{ 2 |} in order to implement the Bayesian discriminant function of eq. ( 2).
The -neighborhood of a feature , ℬ  (), is defined as the first  training features closest (according to a proper distance function) to the feature  .In kNN, the posterior probabilities Pr{ 1,2 |} are estimated from every neighbourhood ℬ  (), ∀ ∈ ℝ  as the ratio of neighbours belonging to class  1,2 [34]: where the notation ~  stands for " belongs to class   ".
As described in eq. ( 13), kNN training (i.e.posterior probabilities estimation) coincide also with the kNN classification: an unknown feature  is assigned to the class most frequently represented among all its  nearest sample features [27] (182).
The proper choice of  influences the kNN's performances.Intuitively, as  increases, the classification error gets closer to the minimum (i.e. the Bayes error), since the probability estimation becomes globally more accurate.On the other hand, as  increases, also the risk to include into ℬ  () very distant training features increases and the estimation becomes locally less accurate [27] (p.184).
For the special case of  = 1, the classification error of the 1NN classifier can be bounded by the Bayes error   as:   ≤  1 ≤ 2  [27] (p.179).
As previously mentioned, kNN build the k-neighbourhood according to a proper distance or, more correctly, a proper metric.For this reason, kNN is also defined a metric classifier.A commonly used class of metrics is the Minkowsky distance   , defined as [27] (p.188): (14) where  ∈ ℕ + .In particular,  1 is called Manhattan (or city block) distance,  2 is called Euclidean distance, while  ∞ ≐ max =1,… {|  −   |} is called infinite-norm distance.

Classification performances
In introducing the Pattern Recognition, we underlined that the classifiers are built using a set of previously annotated class-prototypical features, the training set.It is common practice to extract from the training set a subset of annotated feature (the test set) and use it to evaluate the performances of the trained classifiers -but not to train it.
Since the training set is limited, the specific train/test splitting introduce a bias in both the training and performance evaluation.This can be avoided following the so-called  -fold cross-validation scheme.The original training set  is partitioned into  disjoint and equal sized sets,  = ⋃    =1 . The classifier is, then, trained -times using, each time, as test set a different partition   and as training set the remaining ⋃   ≠ .Finally, the overall performance are computed as the average over the  single performances [27] (pp.483-485).
With the general term performance, we mostly refer to the classification accuracy , defined as the ratio between the number of correctly classified features and the total number of features.Introducing the chance level accuracy  0 as the ratio between the number of features for each class (i.e.how much balanced is the training set), we can additionally define as performance the Kappa statistic:  = ( −  0 )/ 0 [35].Compared to ,  is a more robust performance measure, since it is normalized by the class unbalances.Another solution to take in account the class unbalances is to compare (using e.g. a t-test) the  cross-validated accuracies against  random accuracies, obtained from a random classifier [35].

Dynamic classifiers
To classify a time-varying signal (i.e. to perform a dynamic classification), an ordered sequence of features {  } =1  (i.e.temporal features), corresponding to  adjacent temporal windows, is extracted.The temporal features are fed into either a "dynamic" classifiers, such as Hidden Markow Model (HMM) [21], or an ordered sequence of "static" classifiers {  } =1  [36][37][38][39].The former fully takes in account the signal's temporal variability, since it uses the entire sequence during the training phase.The latter train each static classifier   , using only the corresponding features   , but provides an ordered sequence of accuracies {  } =1  , where each   corresponds to   .

Feature selection
As stated in the previous sections, the curse of dimensionality arises when the number of available training features is small, compared to the feature dimension .In such situations, the parameter estimation becomes problematic (see e.g. the problem of the singularity of the estimated covariance matrix described in section 2.1.1)and the trained classifier usually underperforms.
As a rule of thumb, the number of training features  should be an exponential function of the dimensionality (e.g. = 10  ), with a ratio growing with the complexity of the classifiers [30]: by fixing the feature dimension , linear classifiers requires, for example, a less numerous training set.Additionally, even with an adequate training set, feature dimensionality impact on both the training and classification speed.In fact, as stated in sect.(2.1.2),linear classification requires () multiplications and sums to compute each scalar product.Reducing the feature dimensionality by means of so-called feature selection algorithms, a classifier can be made more robust (i.e. less sensitive to the curse of dimensionality) and efficient (in terms of computational speed).
Feature selection can be broadly described as a mapping function : ℝ  → ℝ  such as: where  <  and { 1 ,  2 , … ,   } ⊂ {1,2, … , }.In other words, a feature selection algorithm performs a projection of the original feature vector onto a lower dimensional subspace defined by a subset of scalar features.The best subspace, as selected among all the possible 2  , should not significantly decrease the classification performances, both globally (i.e.how features are classified, overall) and locally (i.e.how the single feature is classified) [40].
Features selection algorithms can be broadly grouped according to the following criteria [41]: 1. Label information.Supervised algorithms take in account the class information, while unsupervised algorithms do not, the training features as belonging to a same class.2. Search strategy.Filter algorithms (also known as classifier-independent) are based on a two-step "ranking and selecting" criterium: scalar features are first ranked according to a proper criterion; then only the "best" ones are selected.Wrapper methods (also known as classifier-dependent methods) uses the selected classifier, following an "ad hoc" approach: the selected scalar features are those that gives the best classification performance An example of supervised filter algorithm is the biserial correlation coefficient.Given a training set  composed by  + features belonging to the class +1 and  − features belonging to the class −1, the biserial correlation coefficient for the -th scalar feature   is given by [42]: where (⋅), (⋅) are, respectively, the sample mean and sample standard deviation operators and   + ,   − are the subset of   belonging to, respectively, the classes +1 and −1.score is obtained summing the  coefficients of each scalar feature   .Once sorted the scores   2 in descendent order, the feature selection is made simply by selecting the first scalar features whose summing score get a percentage (e.g.95%) of the total feature score.

Stimuli and experimental paradigm
In the present experiment, we used two sensory categories of stimuli (i.e., visual and auditory), which were extracted from two standardized international archives.Visual stimuli consisted of pictures of 28 faces extracted from the NIMSTIM archive [43], whereas auditory stimuli consisted of 28 sounds chosen from the International Affective Digitized Sounds (IADS) archive [44].
For each sensory category, the stimuli were further extracted according to their arousal value.We selected 14 neutral faces and 14 fearful faces from the NIMSTIM inventory, and 14 low-and 14 high-arousal sounds were selected from the IADS repertoire and balanced by arousal with the NIMSTIM stimuli set.These materials are available at: https://doi.org/10.6084/m9.figshare.6874871.v3[37] All participants were presented with two different experimental tasks, which were delivered in separate blocks (see Figure 1).We defined the two condition as Passive and Active Task.Both the block presentation order and the response button were counterbalanced between subjects to avoid possible response biases.The two tasks are described in the Figure 1.The figure illustrates the sequence of events and the temporal trial structure relative to the passive (top) and the active (bottom) tasks, which were delivered in blocks.Within each task, the stimuli were randomly presented and equally distributed according to either sensory category (faces or sounds) and arousal level (high or low) Figure 1.Experimental tasks.

Passive task
As shown in Figure 1, at the beginning of each trial, participants were presented with a warning signal, a fixation cross presented centrally on the screen for 300 ms.After that, a fixed 1000ms blank inter-stimulus interval (ISI) was delivered, followed by a 500ms target stimulus.The target stimulus could be either the picture of a face presented on the centre of the screen or a sound delivered bilaterally through two loudspeakers, with a 50% distribution.Half of the stimuli within each category were low-arousal and the other half were high-arousal, equally distributed.Participants were told that they had to guess which kind of stimulus they would be presented with.No behavioural responses were required until they actually received the stimulus target.At target onset, participants had to discriminate between visual or auditory stimuli by pressing two different buttons on the response box.The response buttons were counterbalanced across participants.After the response, the stimulus target disappeared and a blank screen was presented for a jittered duration between 1000 and 1200 ms (inter-trial interval) before the beginning of the next trial.

Active task
In the active task, event sequence and timing were the same as those in the passive task.As illustrated in Figure 1, the only difference in comparison with the passive task was that, after the prestimulus ISI, participants were presented with a slide showing a central question mark.They were then asked to make an explicit choice about the sensory category of the upcoming stimulus by pressing the response box.This allowed us to obtain an overt behavioural measure of the anticipation of random events as the percentage of correct responses compared to chance-level for each stimulus category.Immediately after participants' response, stimuli were presented for 500 ms.As with the passive task, the response buttons were counterbalanced across participants.
A total of 200 trials sorted by stimulus category and arousal was presented in each task, for an experiment duration of about 18 minutes.In both tasks, stimuli presentation was fully randomized.Specifically, the trial-type randomization was generated online during the ISI by using a true random number generator (TrueRNG-2™).The TrueRNG hardware uses the avalanche effect in a semiconductor junction to generate true random numbers.Randomization via an external TrueRNG device does not rely on seed-based randomization algorithms, but on current fluctuations within the device, assuring a true random distribution.The RNG was interfaced with the stimuli presentation software E-Prime™ 2.0.8.90.

EEG recording
During the entire experiment, the EEG signal was continuously recorded using a Geodesic high-density EEG system (EGI GES-300) through a pre-cabled 128-channel HydroCel Geodesic Sensor Net (HCGSN-128) referenced to the vertex (CZ), with a sampling rate of 500 Hz.The impedance was kept below 60kΩ for each sensor.To reduce the presence of EOG artefacts, subjects were instructed to limit both eye blinks and eye movements, as much as possible.

EEG preprocessing
The continuous EEG signal was off-line band-pass filtered (0.1-45Hz) using a Hamming-windowed sinc FIR (finite impulse response) filter (order = 16500) and then downsampled at 250 Hz.The EEG was epoched starting from 200 ms before the cue onset and ending at the stimulus onset.The initial epochs were 1300 ms long from the cue onset, including 300 ms of cue/fixation cross presentation and 1000 ms of interstimulus interval (ISI).
All epochs were visually inspected to remove bad channels and rare artefacts.Artefact-reduced data were then subjected to the ICA (independent component analysis) [45].All independent components were visually inspected, and those related to eye blinks, eye movements, and muscle artefacts, according to their morphology and scalp distribution, were discarded.The remaining components were back-projected to the original electrode space to obtain cleaner EEG epochs.
The remaining ICA-cleaned epochs that still contained excessive noise or drift (±100 μV at any electrode) were rejected and the removed bad channels were reconstructed.Data were, then, re-referenced to the CAR (common average reference) and the epochs were baseline-corrected by subtracting the mean signal amplitude in the pre-stimulus interval.From the original 1300 ms long epochs, final epochs were obtained only from the 1000 ms long ISI.

Static Spectral Features
From each epoch and each channel , the PSD (Power Spectral Density) was estimate by a Welch's periodogram using a 250 points long Hamming's windows with 50% overlapping.PSD was first log-transformed to compensate the skewness of power values [46], then the spectral bins corresponding to alpha, beta and theta bands -defined, respectively, as 13~30, 6~13 and 4~6 [47] -were summed together.Finally, alpha, beta and theta total powers were computed as: As a measure of the emotional arousal, we computed ratio between beta and alpha total powers    /   [48], while to measure the cognitive arousal we computed the ratio between beta and theta total powers    /   [49].For each epoch, the feature (with a dimensionality of 256) was obtained concatenating the betaover-alpha and beta-over-theta ratio of all the channel:

Static Temporal features
It has been previously showed that arousal level (high or low) can be estimated from the Contingent Negative Variation potentials [37].The feature extraction procedure, therefore, follows the classical approach for the Event-Related Potentials [50].Each epoch from each channel was first band pass filtered (0.05~10) using a zero-phase 2 nd order Butterworth filter and decimated to a sample frequency of 20.EEG signal was thus normalized (i.e.z-scored) according to the temporal mean and the temporal standard deviation: (  ) = ( ̃(  ) −   )/  where  ̃(  ) is the raw signal from i-th channel at time point   ,   and   are, respectively, the temporal mean and the temporal standard deviation of the i-th channel.For each epoch, the feature (with a dimensionality of 2560) was obtained concatenating all normalized signal from each channel:  ≐ [ 1 ( 1 ),  1 ( 2 ), … ,  1 ( 20 ), … ,  128 ( 1 ),  128 ( 2 ), … ,  128 ( 20 ) ] (16)

Dynamic Features
Each epoch was partitioned into 125 temporal segments, 500 ms long and shifted by 1/250 s (1 sample).Within each time segment, we extracted the dynamic spectral and temporal features, following the same approaches described in, respectively, sect.3.4 and 3.5.Dynamic temporal features had a dimensionality of 1280, corresponding to 0.5 × 20 = 10 samples per channel.Dynamic spectral features had the same dimensionality of their static counterparts (256), but the Welch's periodogram was computed using a 16 points long Hamming's window (zero-padded to 250 points) with 50% of overlapping.

Feature reduction and Classification
The extracted features (both static and dynamic) were grouped according to the stimulus type (sound or image) and the task (active or passive), in order to classify the group-related arousal level (high or low).A total of 4 binary classification problems (high arousal VS low arousal) were performed: active image (Ac_Im), active sound (Ac_So), passive image (Ps_Im) and passive sound (Ps_So).
Static features features were reduced by means of the biserial correlation coefficient  2 with the threshold set at 90% of the total feature score.In order to identify the discriminative power of each EEG channel, a series of scalp plots (one for each feature type and each group) of the coefficients were drawn.Since each channel is associated with  > 1 features (as well as   2 coefficients), the coefficients (one coefficient for each channel) are calculated as mean value.In other words, spectral and temporal features had, respectively, 2 and 20 scalar features for each EEG channel: to compute their scalp plots, we averaged, respectively 2 and 20  2 coefficients of each channel.To enhance the visualization of the plots, the coefficients were finally normalized to the total score and expressed as percentage.
Each classification problem was addressed by mean of 3 classifiers: LDA with pseudo-inverse covariance matrix, soft-margin SVM with penalty parameter  = 1 and RBF kernel, kNN with euclidean distance and k=1.Additionally, a random classifier, giving a uniform pseudo-random class (Pr{HA} = Pr{LA} = 0.5), served as a benchmark [35].The accuracy of the classifiers was measured repeating for 10 times a 10-folds crossvalidation scheme.The feature selection was computed within each crossvalidation step, to avoid overfitting and reduce biased results [42].
For each group (Ac_Im, Ac_So, Ps_Im, Ps_So) and each feature type (static spectral, static temporal), the classification produced a 10 × 4 matrix containing the mean accuracies (one for each of the 10-folds crossvalidation repetitions) of each classifier.
Dynamic features were reduced and classified similarly to the static ones.For each temporal segment, the associated features were reduced by means of the biserial correlation coefficient (threshold at 90%) and the classifiers (SVM, kNN, LDA and random) were evaluated using a 10-folds crossvalidation scheme -repeated for 10 times.
For each group, each feature type (dynamic spectral, dynamic temporal), each temporal segment and each classifier, the classification produced 10 sequences of mean accuracies {  } =1 125 -one for each repetition of the 10-fold crossvalidation scheme.

Statistical Analysis
The results of the static classifications were compared against the benchmark classifier by means of 2-samples t-test (right tail).
The results of dynamic classifications (i.e. based on dynamic spectral or dynamic temporal features) were compared following a segment-by-segment approach.For each group, the accuracy sequences of the dynamic classifiers (SVM, kNN and LDA) were compared with the benchmark accuracy sequence.Each sample    , with  = {SVM, kNN, LDA}, was tested against   Random by means of 2-sample t-tests (right tail).The corresponding p-value sequences {   } =1 125 were Bonferroni-Holm corrected for multiple comparisons.Finally, the best accuracy point was detected as the left extreme of the temporal window corresponding to the highest significant accuracy.

Static Features
In Figures 1-2 are shown the scalp distributions of  2 coefficients for each binary static classification problem, grouped for feature (spectral, temporal) and groups (Ps_Im, Ps_So, Ac_Im, Ac_So).
The temporal feature gave the most consistent topographical pattern, showing that the regions that best differentiate between high vs low stimuli (auditory and visual) were located over the central-parietal electrodes, whereas a more diffuse pattern in the scalp topography emerged for the spectral features.
In Figures 3-4 are shown the box plots of the accuracies of static temporal and spectral classifications, grouped for condition.From left to right: LDA, SVM, kNN and random.Note that SVM accuracies (the 2 nd boxplot from the left) are always shown as a line because its accuracies were constant within each crossvalidation step (see also Tables 1-2).
Note that all the accuracies refer to the same static classification problem (High arousal VS Low arousal), performed using different classifiers (SVM, LDA, kNN) and features (spectral, temporal), on different groups (Ps_Im, Ps_So, Ac_Im, Ac_So).
The following Table 1 report the accuracies for static features, ordered in descendent order and grouped for classifier, feature and group.Note that all the accuracy plots refer to the same dynamic classification problem (High arousal VS Low arousal), performed using different classifiers (SVM, LDA, kNN) and features (spectral, temporal), on different groups (Ps_Im, Ps_So, Ac_Im, Ac_So).

Discussion
The aim of the study was to provide new methodological insights regarding the ML approaches in the classification of the anticipatory emotion-related EEG signal, by testing the performance of different classifiers on different features.
From the ISIs (Inter Stimulus Intervals, i.e. the 1000 ms long window preceding each stimulus onset) we extracted two kind of "static" features, namely spectral and temporal, the most commonly used features in the field emotion recognition [19,20].As spectral features we used the beta-over-alpha and the beta-over-theta ratio, whereas for the temporal feature we concatenated the decimated EEG values Additionally, we extracted the temporal sequences both static spectral and temporal features, using a 500 ms long window moving along the ISI to build, respectively, "dynamic" spectral and temporal features.This step is crucial for our work since, considering the temporal resolution of the EEG, an efficient classification should take in account the temporal dimension, to provide information about when the difference between two condition are maximally expressed and therefore classified.
We trained and tested three different classifiers (LDA, SVM, kNN, the most commonly used in the field of emotion recognition [19,20]) using both static and dynamic features comparing their accuracies against a random classifier, that served as benchmark.
Our goal was to identify the best classifier (static VS dynamic) and the best feature type (spectral vs temporal) to classify the arousal level (high VS low) of 56 auditory/visual.The stimuli, extracted from two standardized datasets (NIMSTIM [43] and IADS [44]), for visual and auditory stimuli, respectively) were presented in a randomized order, triggered by a hardware true RNG device.
Considering the number of group ( 4), the number of classifiers (3) and the number of feature types (2), each classification (static or dynamic) produced a total of 24 accuracies, whose significances were statistically tested (using a 2 samples t-test and the benchmark's accuracies).
Within the 9 significant accuracies obtained using static features, the classifier that obtained the highest number of accuracies was the SVM (6 significant accuracies), followed by kNN (2 significant accuracies) and LDA (1 significant accuracies).The most frequent feature was the temporal (5 significant accuracies).Finally, the best (static) feature-classifier combination was the SVM with spectral features (51.8%), followed by LDA with spectral features (51.4%) and kNN with temporal features (51%).
Within the 13 significant accuracies obtained using dynamic features, the classifier that obtained the highest number of accuracies was the SVM (6 significant accuracies), followed by LDA (4 significant accuracies) and kNN (3 significant accuracies).The most frequent feature was the spectral (8 significant accuracies).Finally, the best (dynamic) feature-classifier combination was the SVM with temporal features (63.8%), followed by kNN with temporal features (63.70%) and LDA with temporal features (63.68%).Spectral features produced only the 5th highest accuracy (53.12% with LDA).The 3 best accuracies were all within the first 100ms of the ISI, albeit a non-significant Spearman's correlation between accuracy and time was observed (r=-0.308,p=0.306).
Our results show that globally the SVM presents the best accuracy, independently from the feature type (temporal or spectral), but more importantly the combination of SVM with dynamic temporal feature produced the best classification performance.This finding is particularly relevant, considering the application of EEG in cognitive science.In fact, due to its high temporal resolution, EEG is often applied to investigate the timing of neural processes most of the time in relation to behavioural performance.
Our results therefore suggest that, in order to best classify emotions based on the electrophysiological brain activity, temporal dynamic of the EEG signal should be taken in account with a dynamic feature and consequently with a dynamic classifier.In fact, by including also time evolution of the feature in the ML model, it is possible to infer when two different conditions maximally diverge, allowing possible interpretation of the timing of the cognitive processes and the behaviour of the underlying neural substrate.
Finally, the main contribution of our results for the scientific community in ML is that they provide a methodological advancement generally valid both for the investigation of emotion based on ML approach with EEG signal and also for the investigation of preparatory brain activity.

Figure 1 .
Figure 1.Spectral features.Scalp distribution of the  2 coefficients (normalized to the total score and expressed as percentage) grouped for tasks and stimulus type.(a) Active task: left Image, right Sound; (b) Passive task: left Image, right Sound.

Figure 2 .
Figure 2. Temporal features.Scalp distribution of the  2 coefficients (normalized to the total score and expressed as percentage), grouped for tasks and stimulus type.(a) Active task: left Image, right Sound; (b) Passive task: left Image, right Sound.

4. 2 .
Dynamic featuresIn the following Figures 5-11 are shown the results of the significant dynamic classifications.In the upper plot is shown the mean (bold line) and the standard deviation (shaded) of the accuracy sequence.In the lower plot (black dashed line) it is shown the Bonferroni-Holm corrected p-values sequence, discretized (as a stair graph) as significant (p<0.05) or not-significant (p>0.05).

Figure 5 .
Figure 5. Spectral dynamic features.Accuracy (mean value, coloured line; standard deviation, shaded line) and p-values (black dotted line) in Ac_Im group for LDA (a) and SVM (b) classifiers.

Figure 6 .
Figure 6.Spectral dynamic features.Accuracy (mean value, coloured line; standard deviation, shaded line) and p-values (black dotted line) in Ac_So group for LDA (a) and SVM (b) classifiers.

Figure 7 .
Figure 7. Spectral dynamic features.Accuracy (mean value, coloured line; standard deviation, shaded line) and p-values (black dotted line) in Ps_Im group for LDA (a) and SVM (b) classifiers.

Figure 8 .
Figure 8. Spectral dynamic features.Accuracy (mean value, coloured line; standard deviation, shaded line) and p-values (black dotted line) in Ac_So group for SVM (a) and kNN (b) classifiers.

Table 1 .
Static features.Ordered accuracies grouped for classifier, feature and group.

Table 4
reports the accuracies for dynamic features, ordered in descendent order and grouped for classifier, feature group and time.

Table 4 .
Dynamic features.Ordered accuracies grouped for classifier, feature and group.