Using a Combination of Electroencephalographic and Acoustic Features to Accurately Predict Emotional Responses to Music

Music has the ability to evoke a wide variety of emotions in human listeners. Research has shown that treatment for depression and other mental health disorders is significantly more effective when it is complemented by music therapy. However, because each person experiences music-induced emotions differently, there is no systematic way to accurately predict how individuals will respond to different types of music. In this experiment, a model is created to predict listeners' emotional responses to music from both their electroencephalographic (EEG) data and the acoustic features of the music. By using recursive feature elimination (RFE) to select the most relevant and best-performing features from the EEG and the music, a regression model is fit whose predicted responses correlate strongly with participants' actual music-induced emotional responses. With a mean correlation of r = 0.788, this model is significantly more accurate than previous work on predicting music-induced emotions (e.g., a 370% increase in accuracy compared to Daly et al. (2015)). The results of this regression fit suggest that accurately predicting how people respond to music from brain activity is possible. Furthermore, by applying this model to features extracted from any musical clip, the music most likely to evoke a happier and more pleasant emotional state in an individual can be determined. This may allow music therapy practitioners, as well as music listeners more broadly, to select music that will improve mood and mental health.


I. INTRODUCTION
Music is known to be an extremely powerful tool that can make listeners feel pleasure, happiness, sadness, and even fear (Fritz et al. 2009). Music therapy is a health intervention that has proven to be an effective treatment for poor mental health, mood disorders, and depression (Maratos et al. 2008).
For example, music therapy was determined to significantly improve mood when compared to treatment as usual (Maratos et al. 2008; Ramirez et al. 2018). Furthermore, Chen (1992) determined that antidepressant drugs coupled with music therapy were more effective than antidepressant drugs alone in improving mental health. Additionally, Koelsch & Jäncke (2015) concluded that music can reduce pain and anxiety in patients with heart disease by lowering heart rate and blood pressure.
In music therapy, a therapist prescribes musical selections for a patient to listen to based on the therapist's expertise, experience, and evaluation of the patient (Tamplin & Baker 2006; Maratos et al. 2008). However, to determine the optimal music to prescribe, the individual's emotional reaction to the music must be predicted. Predicting a person's emotional response to a musical selection they have never heard is a considerable challenge because people experience music-induced emotions differently depending on their life experiences, moods, influences, gender (Hunter, Schellenberg & Schimmack 2010; McRae et al. 2008), and a wide variety of other factors. Despite music therapy's promising results as a healing agent for depression and mental health disorders, it does not currently rely on a systematic method to predict emotional responses to music on an individual level.
Many models of the relationship between attributes of music and the emotions they induce have been created. For example, Schubert (2004) studied the effects of musical features and music-theoretic attributes such as dynamics and melodic contour on emotion. Gabrielsson & Lindström (2001) subsequently studied the effects of variations in tempo, articulation, dynamics, and intonation on perceived emotion using a model in which emotional responses ranging from strong to weak were plotted across two axes: one for pleasantness and one for excitement. This model also featured plots of tempo and other musical descriptors, and it may provide information about how individual descriptors affect emotional responses. However, because people perceive emotions differently, individual-level emotional predictions are necessary, and physiological data may therefore be of significant value.
Other researchers have used physiological measurements as correlates of induced emotional responses. For example, Etzel et al. (2006) tested the effect of music meant to evoke different moods, including happiness, sadness, and fear, on cardiovascular activity. Similar research has found that certain selections of pleasant and happiness-inducing music may increase heart rate (Brouwer et al. 2013). Researchers have also turned to brain activity to measure music-induced emotions; such experiments have been performed using functional magnetic resonance imaging (fMRI), for example (Koelsch et al. 2006; Brattico et al. 2011). That being said, using EEG to predict music-induced emotions has proven to be a difficult feat because EEG is non-stationary and very noisy. Therefore, by selecting only the most important and relevant descriptors, a combination of acoustic and EEG features may be used to train a model that predicts music-induced emotions with high accuracy on the individual level. Since classical-style music has been shown to induce strong emotions (Schaefer 2017; Kreutz et al. 2007), this study utilizes the dataset created by Daly et al. (2015), in which classical music clips were played to participants while their EEG was recorded. Descriptive features extracted from the musical clips (from the set of stimuli of Eerola & Vuoskoski 2010) and from the recorded brain activity are then used to train a Lasso regression model to predict each participant's emotional responses to music.

Methods and Experimental Data
This analysis utilized the EEG and emotional response data gathered by Daly et al. (2015). The data consisted of thirty-one individuals, thirteen males and eighteen females, whose ages ranged from 18 to 66. Each participant's electroencephalogram (EEG) was recorded from 19 electrode channels positioned from nasion to inion according to the 10-20 EEG placement system (Milnik 2006).
This study also featured classical music stimuli drawn from a dataset of 360 excerpts from film scores including titles such as Psycho (1960), Gladiator (2000), and Big Fish (2003). The musical stimuli in this dataset were specifically chosen to induce specific emotions in human listeners (Eerola & Vuoskoski 2010).
The participants were told to stay still, and each listened to 40 musical selections, each 15 seconds long, drawn randomly from Eerola & Vuoskoski (2010). Immediately after each clip, they responded to a series of eight Likert-scale questions, on a scale from strongly disagree to strongly agree, to identify their emotional response along eight axes: happiness, sadness, pleasantness, fear, anger, tenderness, tension, and energy-level (Daly et al. 2015).
Because some of these emotion categories are likely to be highly correlated (Larsen & McGraw 2001), Principal Component Analysis (PCA) was used to reduce the eight axes to three principal components (PCs), which together explain a large amount (75%) of the variance in the participants' music-induced emotional responses (Daly et al. 2015). These three PCs represent valence-arousal, a measurement of pleasure and happiness; energy-arousal, a quantification of liveliness; and tension-arousal, a measurement of tenseness (Schimmack & Grob 2000). It is these three PCs that are subjected to further analysis.
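The dimensionality reduction described above can be sketched with scikit-learn. The ratings below are synthetic stand-ins (the real per-participant Likert data are not reproduced here); the number of clips, axes, and components match the study's setup.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real ratings: 40 clips x 8 Likert axes
# (happiness, sadness, pleasantness, fear, anger, tenderness, tension, energy).
base = rng.normal(size=(40, 3))       # three latent affect dimensions
mixing = rng.normal(size=(3, 8))      # makes the eight axes correlated
ratings = base @ mixing + 0.3 * rng.normal(size=(40, 8))

pca = PCA(n_components=3)
pcs = pca.fit_transform(ratings)      # analogues of valence-, energy-, tension-arousal
print(pcs.shape)                      # (40, 3)
print(pca.explained_variance_ratio_.sum())
```

Because the eight axes are strongly correlated, three components recover most of the variance, mirroring the 75% figure reported by Daly et al. (2015).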
Acoustic features from the music played to each participant and various features from the participants' EEG data are then extracted. Through recursive feature elimination (RFE), the best-performing features are selected to accurately predict a participant's self-reported emotional response to music along each of the three PC axes.

EEG Features
Because EEG signals are so noisy (Daly et al. 2012; Jiang et al. 2019), they were first preprocessed to remove artefacts and noise. The Discrete Wavelet Transform (DWT) was then used to decompose the EEG signals into delta (δ) [0-6 Hz), theta (θ) [6-12 Hz), alpha (α) [12-24 Hz), beta (β) [24-48 Hz), and gamma (γ) [48-80 Hz) wavelet bands. The DWT provides high temporal resolution at high frequencies and high frequency resolution at low frequencies, making it an extremely convenient tool for processing non-stationary EEG and rendering these signals more suitable for feature extraction than the raw EEG alone (Qazi et al. 2016). From the δ, θ, α, β, and γ wavelet bands, a total of 285 features were extracted from the EEG signals using Minimum Norm Estimates (MNE). RFE was further utilized to verify that the features below were beneficial to training the model.
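The dyadic band splitting can be illustrated with a minimal numpy sketch. This uses a Haar wavelet for simplicity (the study does not specify the mother wavelet; practical pipelines typically use PyWavelets with a Daubechies family), and a 160 Hz sampling rate is assumed purely for illustration.

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the Haar DWT: returns (approximation, detail)."""
    x = x[: len(x) // 2 * 2]                  # truncate to even length
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

def decompose(signal, levels=4):
    """Dyadic decomposition: detail bands from highest to lowest frequency,
    plus the final approximation (lowest band)."""
    bands = []
    approx = signal
    for _ in range(levels):
        approx, detail = haar_dwt_level(approx)
        bands.append(detail)
    bands.append(approx)
    return bands

fs = 160                                      # assumed sampling rate (Hz)
t = np.arange(0, 2, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.default_rng(1).normal(size=t.size)
bands = decompose(eeg, levels=4)              # ~gamma, beta, alpha, theta, delta
print([len(b) for b in bands])
```

Each level halves the band: at fs = 160 Hz the four detail bands cover roughly 40-80, 20-40, 10-20, and 5-10 Hz, with the final approximation holding 0-5 Hz, analogous to the five bands above.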
The energy of each wavelet band, i.e., the strength of the signal over a time interval (the area under the squared signal), was computed. The mean and standard deviation of each wavelet band were also extracted and used to train the model. The standard deviation supplies information about how close the samples are to the signal's mean and is defined as

σ = √( (1/N) Σᵢ (xᵢ − μ)² ),

where xᵢ is the i-th sample of a signal with mean μ.

The entropy of an EEG signal measures the uncertainty of its outcomes and is calculated as

H(x) = −Σᵢ P(xᵢ) log₂ P(xᵢ),

where x is a random variable with possible outcomes x₁, …, xₙ occurring with probabilities P(xᵢ).

The Hjorth parameter of mobility is a normalized slope descriptor: the square root of the variance of the first derivative of the signal divided by the variance of the signal (Nascimben, Ramsøy & Bruni 2019). It is proportional to the standard deviation of the signal's power spectrum and has been noted to be advantageous for emotion recognition (Li et al. 2018). As hypothesized, adding the Hjorth parameter to the model significantly improved its accuracy. The mobility parameter is defined as

Hₘ(t) = √( var(y′(t)) / var(y(t)) ),

i.e., the square root of the variance of the first derivative of the signal y(t) divided by the variance of y(t).

Finally, power frequency-band features compute the power spectrum of an EEG signal within each frequency band (Al-Fahoum & Al-Fraihat 2013). The power of each of the 19 electrodes' signals was extracted in each of the five wavelet bands, resulting in 95 features.
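The per-band statistics above are straightforward to compute. The sketch below implements them with numpy on a synthetic band; the histogram-based entropy (16 bins) is one common discretization choice, not necessarily the one used in the study.

```python
import numpy as np

def band_energy(x):
    """Energy: sum of squared amplitudes of the wavelet band."""
    return np.sum(x ** 2)

def shannon_entropy(x, bins=16):
    """Entropy of the amplitude distribution: -sum p_i * log2(p_i)."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                              # ignore empty bins (0 * log 0 = 0)
    return -np.sum(p * np.log2(p))

def hjorth_mobility(x):
    """Mobility: sqrt(var of first derivative / var of signal)."""
    return np.sqrt(np.var(np.diff(x)) / np.var(x))

rng = np.random.default_rng(2)
band = np.sin(2 * np.pi * np.linspace(0, 10, 1000)) + 0.1 * rng.normal(size=1000)
feats = [band_energy(band), band.mean(), band.std(),
         shannon_entropy(band), hjorth_mobility(band)]
print(feats)
```

Repeating these five statistics for every band and electrode is what yields the large EEG feature vector described above.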

Acoustic Features
From each of the 360 musical clips used as stimuli, a range of specific acoustic features was extracted. These included both spectral and temporal types, ranging from each piece's key and tonal qualities to its tempo. A total of 352 musical features were selected, and RFE was again used to ensure the acoustic features were beneficial to the model. Furthermore, for each of the eleven acoustic feature types (except key and BPM), the kurtosis, maximum, mean, median, minimum, skewness, and standard deviation were also extracted. The acoustic feature types are described below.
Chroma and chromagrams can be described as the distribution of a musical signal's energy across the twelve pitch classes through time. From the chroma, both the intensity and the pitch class are extracted, for a total of 84 features. BPM, or beats per minute, gives the number of beats in the signal per minute. A low tempo of 40 BPM, for example, is very likely to indicate a slower, solemn musical stimulus, while a tempo of 160 BPM is likely to indicate an upbeat, faster-paced piece. BPM proved to be an extremely important feature in determining emotional responses to a musical selection.
Zero-crossing rate (ZCR) is the rate at which the signal of a musical stimulus changes from positive to negative or vice versa (crossing zero) in a fixed amount of time; ZCR contributed a total of seven features. Key (major versus minor) turned out to be an extremely significant feature in determining emotional responses to an audio signal. Whether a key is major or minor is determined by analyzing the notes or pitches present in the signal, and it is widely acknowledged that songs in major keys tend to sound bright and cheerful while songs in minor keys sound more melancholy.
MFCCs, or Mel-frequency cepstral coefficients, are computed by taking the logarithm of the Fourier coefficients of an audio signal that has been converted to the Mel scale (Stevens 1937). MFCCs ultimately represent the timbre of the audio signal. In Fig. 1, the redder a coefficient is at a given time, the more intense that coefficient is.
Tonnetz is a feature that captures the tonal centroids, or harmonic components, of the signal.
The spectral centroid of a musical signal identifies the frequency upon which the energy of the spectrum is centered, i.e., where the spectrum's center of mass is located. The spectral roll-off is the frequency below which a given percentage (normally 85%) of the spectral energy of the signal lies (Jang et al. 2008). The spectral contrast is the difference in level between the crests and troughs of the spectrum. The spectral flatness measures how noise-like a signal is (with a value of 1.0 indicating the spectrum resembles white noise). Finally, the spectral bandwidth, or the extent of the power transfer around the center frequency of the audio signal, is extracted (Theimer et al. 2008). Altogether, the spectral features contribute 77 features to the model.
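Several of these descriptors can be computed directly from the definitions above. The numpy sketch below (real pipelines typically use a library such as librosa) implements ZCR, spectral centroid, and spectral roll-off on a synthetic 440 Hz test tone; the tone and sampling rate are illustrative choices only.

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.signbit(x).astype(int))))

def spectral_centroid(x, fs):
    """Magnitude-weighted mean frequency of the spectrum."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    return np.sum(freqs * mag) / np.sum(mag)

def spectral_rolloff(x, fs, pct=0.85):
    """Frequency below which `pct` of the spectral energy lies."""
    energy = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1 / fs)
    cum = np.cumsum(energy)
    return freqs[np.searchsorted(cum, pct * cum[-1])]

fs = 22050
t = np.arange(0, 1.0, 1 / fs)
tone = np.sin(2 * np.pi * 440 * t)            # an A4 test tone
print(round(spectral_centroid(tone, fs)))     # ~440
```

For a pure tone, the centroid and roll-off both sit at the tone's frequency, and the ZCR is about twice the frequency divided by the sampling rate, which matches the intuition behind each definition.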

Training the Model
A total of 637 extracted features (285 EEG features and 352 acoustic features) were used to train a Lasso-regularized linear regression model (with the alpha parameter set to 10) to predict the participants' emotional responses to music in terms of the PCs. To best fit the data and generate accurate predictions, a multi-task, cross-validated Lasso regression model with five folds was used. In each of the five folds, the data were split into a training and a testing set, and the training-set features were related to the PCs to fit a linear regression model of the participants' recorded emotional responses.
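The training scheme can be sketched with scikit-learn's `MultiTaskLasso` and a five-fold split. All data here are synthetic stand-ins with smaller dimensions than the real 637-feature set, and the alpha value is scaled down for the synthetic data (the paper reports alpha = 10 on its actual features).

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n_trials, n_features, n_pcs = 200, 60, 3      # stand-ins for the real dimensions
X = rng.normal(size=(n_trials, n_features))
W = rng.normal(size=(n_features, n_pcs))
Y = X @ W + rng.normal(scale=0.5, size=(n_trials, n_pcs))  # three response PCs

# Out-of-fold predictions: each trial is predicted by a model
# that never saw it during training.
preds = np.zeros_like(Y)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = MultiTaskLasso(alpha=0.1).fit(X[train_idx], Y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

print("out-of-fold predictions:", preds.shape)
```

The multi-task variant fits all three PCs jointly with a shared sparsity pattern, which is one natural reading of the "multi-task, cross-validated" description above.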
After being trained, the model attempted to predict the participants' response PCs from the previously unseen features in the testing set. By measuring how close the model's predicted PCs are to the participants' actual, recorded PCs for each data point in the testing set of each fold, the model's performance and statistical significance can be determined.

Recursive Feature Elimination (RFE)
RFE, first utilized for gene selection, is an effective method for determining the importance of features. The RFE algorithm was initially trained on the whole set of features (637 features: 285 from the EEG and 352 from the music). The algorithm then determined the importance of each feature to the correlation model by assigning weights and eliminating the lowest-ranking features (Li et al. 2018). This process occurred recursively over several rounds until all the features had been ranked. Using a cross-validated RFE (RFECV) selector, the 17 feature types (comprising the 637 features) were ranked to determine which were most beneficial to the regression model's correlation, ensuring that the retained features would only improve the model's prediction accuracy.
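The elimination loop above is what scikit-learn's `RFECV` implements: fit an estimator, rank features by coefficient magnitude, drop the weakest, and repeat, with cross-validation choosing how many to keep. The sketch below uses a small synthetic problem (40 candidate features, 5 of them informative) rather than the study's 637 features, and a Lasso base estimator as an assumed analogue of the paper's regressor.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 40))        # 40 candidate features (stand-in for 637)
w = np.zeros(40)
w[:5] = [3, -2, 2, 1.5, -1.5]         # only the first 5 features are informative
y = X @ w + rng.normal(scale=0.5, size=150)

# Recursively drop the lowest-|coefficient| feature; 5-fold CV picks the cutoff.
selector = RFECV(estimator=Lasso(alpha=0.05), step=1, cv=5)
selector.fit(X, y)
print("features kept:", selector.n_features_)
print("ranks of informative features:", selector.ranking_[:5])  # 1 = selected
```

On this toy problem the five informative features survive elimination, illustrating how RFE can isolate the "most beneficial" subset described above.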

Correlation Analysis
The model's prediction performance is first evaluated on the EEG and acoustic features separately, and then on the two combined. Each feature subset is used to predict the response PCs individually, and for each PC the mean correlation (r) between the actual recorded PC and the predicted PC is calculated. The results of these correlations are displayed in Table 1. Despite the noise and non-stationarity of the EEG data (Hassani & Karami 2015) and the fact that the three PCs capture only 75% of the variance in the participants' emotional responses (Daly et al. 2015), training the model on each feature subset produced predicted response PCs that correlate highly with the recorded PCs (p<0.001). As displayed in Table 1, the model's correlation was highest when the EEG and acoustic features were combined.
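The r and p values reported here are standard Pearson correlations between predicted and recorded PC values; a minimal sketch with scipy follows, using synthetic stand-ins for the two vectors.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
actual = rng.normal(size=120)                         # recorded PC values (stand-in)
predicted = actual + rng.normal(scale=0.8, size=120)  # noisy model predictions

# Pearson r measures the linear agreement; p tests r against the null of r = 0.
r, p = pearsonr(actual, predicted)
print(f"r = {r:.3f}, p = {p:.2e}")
```

With 40 clips per participant across 31 participants, even moderate correlations reach p < 0.001, which is why all three feature subsets in Table 1 are reported as highly significant.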
These correlations are unmatched by previous work on predicting music-induced emotions from EEG; for example, our r-values are almost 370% better than those of the comparable study by Daly et al. (2015).
Table 1: The mean correlation performance of the regression model at predicting the response PCs from both acoustic and EEG features, acoustic features alone, and EEG features alone. All results are highly statistically significant (p<0.001), but the model performs notably well when trained on both acoustic and EEG features.

RFE Analysis
The RFE selector's ranking of the 637 features determined which of the music and EEG feature types were most important to the regression model. Among the acoustic features, the key and tempo of the song performed best; among the EEG features, the Hjorth parameter of mobility was the most vital for training the model. Additional rankings are summarized in Table 2.
Table 2: A ranking of the acoustic and EEG features. The acoustic features were ranked in the selector from 1 to 352. The mean feature score of each of the 11 acoustic feature types is denoted by μ; the lower the μ value, the more important the feature type was to the regression model. Similarly, the EEG features were ranked in the selector from 1 to 285, with μ' denoting the mean feature score of each of the 6 EEG feature types.

Demographic Analysis
Because the subjects of this study varied in age and gender, and because age and gender have been noted to convey differences in affective responses to emotions conveyed by music (Vieillard & Gilet 2013; Hunter et al. 2011), we tested whether participant gender and age affected our model's results. Predictions were therefore calculated on an individual, per-participant level, and correlations between participants' gender and age and the predictions were noted.
T-tests showed that the model's ability to predict emotional responses was not influenced by gender or age (p=0.413 and p=0.278, respectively).
The strength of the subjects' emotional responses to the music, as taken from the Likert scales, was also tracked. On a scale from 0 to 4, with 4 being an extremely strong emotional response and 0 a very weak one, female participants had an average emotional response strength of 2.48 while males' was 1.99. This difference indicates that the female participants were 24.2% more emotionally expressive toward the music than the males (p<0.001). In contrast, comparing emotional response strengths across ages 18-66 revealed no statistically significant differences (p=0.250).
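A comparison like this is an independent-samples t-test on the two groups' response strengths. The sketch below uses synthetic data whose group means mirror the reported 2.48 vs. 1.99 (the per-trial values, spread, and the Welch variant are assumptions, not the study's actual data or test specification).

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(6)
# Hypothetical per-trial response strengths on the 0-4 scale:
# 18 females and 13 males, 40 clips each; means mirror the reported values.
female = np.clip(rng.normal(2.48, 0.4, size=18 * 40), 0, 4)
male = np.clip(rng.normal(1.99, 0.4, size=13 * 40), 0, 4)

t, p = ttest_ind(female, male, equal_var=False)   # Welch's t-test
print(f"t = {t:.2f}, p = {p:.1e}")
```

With this many trials per group, a mean difference of about 0.5 on a 0-4 scale is detected at p far below 0.001, consistent with the significance reported above.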

Pop Songs (Acoustic Test) For Valence
The ability to predict music-induced emotions at an individual level is vital to music therapy and to prescribing music that improves mood and mental health. Because this model achieved such a high correlation between predicted and actual emotional responses, we tested whether it could benefit music therapy by predicting which songs would elicit the most positive emotional response in the participants. Although the combination of EEG and acoustic features performed best, the acoustic features alone yielded a relatively high correlation compared to previous research (Daly et al. 2015; Song & Dixon 2015). Thus, the acoustic features were tested to see whether they could predict responses to pieces of music the participants had not previously heard.
As stated in 2.1, the first PC, valence, describes a musical stimulus's positiveness (McConnell & Shore 2010). For example, music with higher valence induces happier emotions than clips with low valence.
Given that most people listen to pop music rather than classical music, and because music therapy most likely requires pieces longer than 15 seconds, the model was tested on a random set of longer, three-minute pop songs instead of the original short classical excerpts. The model was trained on the acoustic features extracted from the classical music in 2.3 and tested on the held-out acoustic features extracted from the pop songs. The model's valence output was then isolated and the songs ordered from largest to smallest predicted valence, indicating which songs would elicit the most cheerful emotional responses in the 31 participants of this study.
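The train-on-classical, rank-pop-by-valence procedure can be sketched as follows. Everything here is a synthetic stand-in (feature counts, song counts, and the regressor's alpha are illustrative, not the study's values); the point is the shape of the pipeline: fit on one corpus, predict PCs for unseen songs, then sort by the valence column.

```python
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(7)
n_feats = 30                                   # stand-in for the 352 acoustic features
X_classical = rng.normal(size=(200, n_feats))  # training clips' acoustic features
W = rng.normal(size=(n_feats, 3))
Y = X_classical @ W + rng.normal(scale=0.5, size=(200, 3))  # 3 response PCs

model = MultiTaskLasso(alpha=0.1).fit(X_classical, Y)

X_pop = rng.normal(size=(10, n_feats))         # 10 held-out pop songs
valence = model.predict(X_pop)[:, 0]           # column 0 = valence PC
ranking = np.argsort(valence)[::-1]            # highest predicted valence first
print("song ranking (indices):", ranking)
```

The resulting ordering is the analogue of Table 3: the top-ranked songs are those predicted to induce the most positive emotional state.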
The model's performance in predicting participants' valence responses to pop songs they had never heard, based on their emotional responses to short classical music clips, is summarized in Table 3.
Table 3: The model was trained on acoustic features extracted from short classical music clips and ordered a random selection of 10 pop songs from highest to lowest predicted valence. As the valence rating increases, so does the song's level of induced happiness.

IV. DISCUSSION
Although music therapy is an effective treatment for mental health disorders and depression (Maratos et al. 2008), it does not currently employ a systematic method to predict emotional responses to music on an individual level. Predicting human emotional responses to music with high accuracy is a challenging problem because factors such as a person's age, gender, mood, and memories affect how people emotionally respond to music. The challenge is heightened by combining the acoustic features of the music with the non-stationarity and noisiness of EEG.
Results from this study show that by combining EEG-derived features with acoustic features, emotional responses to music can be predicted with significantly higher accuracy. This suggests that emotional responses to music depend not only on the properties of the music itself but also on the listener's idiosyncrasies and internal processes.
By training a regression model with the best-performing EEG and acoustic features, music-induced emotions can be predicted with higher accuracy than previously reported. This experiment's outcomes were compared to those of Daly et al. (2015), which used the same dataset. By processing the EEG and audio differently and extracting only the most informative features, our model achieved much better correlations. For valence-arousal, they achieved a correlation of 24.3% ± 0.5%, while we achieved 77.4%. For energy-arousal, they achieved a mean correlation of 15.8% ± 0.6%, while our model reached 79.1%. Finally, for tension-arousal, they achieved only 10.2% ± 0.5%, while we reached a mean correlation of 79.8%. This represents a 370% increase in accuracy over Daly et al. (2015).
Other researchers have analyzed emotions elicited via the DEAP dataset, using EEG to predict music-video-induced emotions (Nascimben et al. 2019; Kumar et al. 2016). However, these experiments extracted features solely from the EEG data and used music videos rather than music alone as stimuli. Nascimben et al. (2019) achieved a cross-validation accuracy of 65.4%, and Kumar et al. (2016) used a single cross-validation run to achieve accuracies of 57.6% for valence and 62.0% for arousal. Our model is 32% more accurate.
Age and gender differences have been noted to affect emotional responses to music (e.g., Vieillard & Gilet (2013) and Hunter et al. (2011)). While there were no differences in emotional responses across age, emotional responses varied heavily across gender, supporting Hunter et al. (2011): female responses to the music were 24.2% more intense than those of males, indicating that females are more emotionally expressive toward music. That being said, our model's prediction accuracy for music-induced emotions remained consistent across gender and age.
Results from testing our model on a variety of pop songs show that it can be used in a clinical setting to improve music therapy techniques. For example, by giving people small selections of songs and recording their emotional responses to the clips on the eight axes, our model can prescribe a list of songs targeting specific emotions, e.g., happiness. Ideally, these treatments would also use patients' EEG as a parameter, but given that EEG may be expensive and infeasible, it is not strictly necessary (as demonstrated in 3.4). Ultimately, the model created in this experiment can predict emotional responses to music on an individual level at significantly higher correlations than previously recorded. It has the potential to serve as a music-prescription system, recommending songs that induce happier and more pleasant emotional states, and to advance both music therapy as a treatment for mental health disorders and the emerging field of brain-computer music interfaces. Future work will strive to use these findings to create machine-generated music with acoustic properties that induce specific emotional states.

ACKNOWLEDGMENTS
The author would like to thank Tyler Giallanza for his support and contributions.