Recognition of emotional states using EEG signals based on time-frequency analysis and SVM classifier

ABSTRACT


INTRODUCTION
Emotion, a vital non-verbal channel of social interaction, is a psychological state that prompts us to react to a situation based on past experience [1]. Emotion recognition can be applied in various fields such as education, cognitive science, entertainment, machine learning, self-control and security, biomedical engineering, marketing, and production. According to Russell, any discrete emotion can be deduced from its level of arousal and valence using his Circumplex Model of Emotion [2].
Many existing methods used facial expressions, speech signals, and self-ratings to classify emotions [3], [4]. However, the systems used in these methods usually fail to capture all the detailed emotional inputs, such as hand gestures or the tone of voice, which leads to vague and biased outcomes [3]. Some approaches relied on subjective measurements, which can affect the end result because the share of anomalous trials can be significant [4]. As Electroencephalogram (EEG) signals gained influence in research, it was observed that human emotion can be represented more accurately with EEG signals than with facial gestures, speech signals, or self-reported information [5]. Emotional activities cause the brain to generate signals in the form of waves, and the technique used to record these signals is known as EEG [5]. Predicting emotions using EEG signals was first introduced by Musha et al. [6]. Since then, interest in this area has grown, driven by machine learning applications and the availability of less expensive EEG measuring devices. In numerous investigations, experiments were restricted to very few subjects, and methods developed for a single subject were oversimplified and could not be applied broadly to multiple subjects [7]. Many studies were limited to a few channels in order to avoid high complexity and computational cost; however, this can lead to weak results due to a lack of adequate information [8]. For feature extraction, previous studies used different methodologies, including Sample Entropy [9], Autoregressive (AR) Model [10], Discrete Wavelet Transform (DWT) [11], and Fast Fourier Transform (FFT) [12].
Classification of these extracted features was done using various classifiers by the researchers such as Support Vector Machine (SVM) [13,14], Neural Network [15], and k-nearest neighbor (KNN) [16].
To address the issues specified above [3], [4], [7], [8], a novel methodology for emotion recognition based on time-frequency analysis is proposed and evaluated with EEG signals from the DEAP dataset [17]. In the proposed model, only the EEG signals from the prefrontal cortex are initially retrieved for further work, since emotional activities occur mainly in the frontal and temporal lobes of the brain [18]. This also reduces the total number of channels used in the feature extraction method, as the irrelevant channels are discarded beforehand. This led to lower computational cost and thus increased the efficiency of the algorithms used in our technique [19]. Existing methods working with frequency bands extracted all five types of frequency bands, namely delta (0.5-4 Hz), theta (4-7 Hz), alpha (7-13 Hz), beta (13-30 Hz), and gamma (30-60 Hz) [20]. According to the state-of-the-art methods, the emotional and cognitive activities of the brain are well represented by the alpha, beta, and theta frequency bands [8], [21]. Therefore, these specific bands were considered in this paper. The dataset is distributed into four emotional quadrants, which are high arousal-high valence (HAHV), low arousal-high valence (LAHV), low arousal-low valence (LALV), and high arousal-low valence (HALV). The data in each quadrant were then averaged over all participants for the specific emotion. The reason for averaging the samples by quadrant is the inconsistency in the emotions felt by the participants. Not all participants feel the same emotion for a particular video, which indicates that some samples in the EEG signals are anomalous and can greatly affect the end result. The premise of averaging is to reduce data deviation and to come statistically closer to the actual value.
This is expected to improve the accuracy of the classification process. Finally, the statistical features extracted in the frequency domain were fed into the SVM classifier in order to classify the emotions.
The subsequent sections of the paper have been organized as follows. Section 2 introduces the dataset used in this paper, as well as the proposed approach for recognizing emotion. Section 3 provides the experimental results along with their analysis. Finally, Section 4 concludes the paper.

PROPOSED METHOD
The proposed model illustrated in Figure 1 represents the overall flow of our work. For this research, the EEG signals were first collected and then preprocessed. Next, bands of specific frequencies were extracted from the preprocessed data. Subsequently, suitable features were extracted and selected to be fed into the classifier. Finally, an SVM classifier was used to classify these selected features.

Data description
In our research, we used the DEAP dataset [17] as the source of brain signals. It is a multimodal dataset which can be used to analyze human affective states. The data collection process was carried out in a controlled light environment. A Biosemi ActiveTwo system was used to record the EEG signals of each participant. Two computers, synchronized periodically with the help of markers, were used: one for recording the signals and another for presenting the stimuli. The experiment was carried out using 40 one-minute music videos displayed on a 17-inch screen, at a resolution of 800x600 to minimize eye movements. For the experiment, 32 AgCl electrodes were used to record the EEG signals at a sampling rate of 512 Hz. 16 male and 16 female participants with ages ranging from 19 to 37 were chosen to conduct the experiment. After 20 video trials, the subjects could take a break for snacks. Each trial lasted for 63 seconds: a 60-second video and a 3-second pre-trial baseline. At the end of each video trial, the subjects gave a manual rating on a scale of 1-9 for each of four dimensions (arousal, dominance, liking, and valence).

Signal preprocessing
The dataset was downsampled to 128 Hz, and then Electrooculography (EOG) artifacts caused by eye movements were removed. The signals were then band-pass filtered between 4 Hz and 45 Hz. To create a common reference, the data were averaged. Later on, the data were segmented into 60-second trials by removing the 3-second pre-trial baseline and were arranged in Experiment_id order.
For our research, we have chosen the pre-processed data files that include the EEG signals of each participant. Each participant's file has two arrays, data and labels, which are described in Table 1. The data array contains the EEG signals of the participant for all the videos, whereas the labels array contains, for each video, the ratings for the four labels (valence, arousal, dominance, and liking).
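The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the pipeline used to produce the DEAP preprocessed files: the function name `preprocess`, the filter order, and the use of SciPy are assumptions, and EOG artifact removal is omitted.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, resample_poly

def preprocess(raw, fs_in=512, fs_out=128, band=(4.0, 45.0), order=4):
    """raw: array of shape (channels, samples) sampled at fs_in Hz.
    Hypothetical helper; filter order 4 is an assumption."""
    # Downsample 512 Hz -> 128 Hz (polyphase resampling applies anti-alias filtering).
    x = resample_poly(raw, up=1, down=fs_in // fs_out, axis=1)
    # Zero-phase Butterworth band-pass between 4 Hz and 45 Hz.
    sos = butter(order, band, btype="bandpass", fs=fs_out, output="sos")
    x = sosfiltfilt(sos, x, axis=1)
    # Common average reference: subtract the mean over channels at each sample.
    return x - x.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
raw = rng.standard_normal((32, 63 * 512))  # synthetic stand-in for one 63-second trial
clean = preprocess(raw)
trial = clean[:, 3 * 128:]                 # drop the 3-second pre-trial baseline
print(clean.shape, trial.shape)
```

With 32 channels and a 63-second trial, the downsampled array has 8064 samples per channel, which matches the sample count of the preprocessed DEAP arrays described later.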

Signal refining
As the first step towards band extraction, the data were rearranged to make them appropriate for the extraction process. The preprocessed data from the DEAP dataset used for our work contained 32 files, one for each participant. Each file contained two arrays: a 3D array, named Data, of size 40x40x8064, and a 2D array, named Label, of size 40x4. The 3D data array was used throughout the course of our work. A total of 40 channels were recorded, out of which 32 were EEG channels and 8 were peripheral channels. Previous studies showed that the information related to emotions is concentrated mostly in the frontal and temporal areas of the brain [18]. However, in order to decrease the computational cost of our proposed method, we only worked with the channels related to the frontal lobe of the brain, namely Fp1, F3, F7, FC5, FC1, Fp2, Fz, F4, F8, FC6, and FC2. Classification and feature extraction directly from the 3D array were laborious, as it was hard to manipulate the data as per our requirements. For this reason, the preprocessed data were re-sorted into 40 files, each representing one music video used in the DEAP dataset. Each video file contained an array of size 8064x352, where the rows represent the length of the data and the columns represent the 11 selected channels of each of the 32 participants (11 x 32 = 352), as described in Table 2.
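The rearrangement from per-participant arrays to per-video arrays can be sketched as below. This is an illustration under stated assumptions: the 0-based channel positions in `FRONTAL_IDX` are assumed from the commonly documented DEAP channel order and should be checked against the dataset description, `per_video_arrays` is a hypothetical helper, and the demo uses synthetic data with 128 samples per trial instead of 8064 to keep memory use small.

```python
import numpy as np

# Assumed 0-based positions of Fp1, F3, F7, FC5, FC1, Fp2, Fz, F4, F8, FC6, FC2
# in the DEAP channel order; verify against the dataset documentation.
FRONTAL_IDX = [0, 2, 3, 4, 5, 16, 18, 19, 20, 21, 22]

def per_video_arrays(participant_data):
    """participant_data: list of arrays, each (videos, channels, samples).
    Returns an array of shape (videos, samples, participants * 11)."""
    # Keep only the frontal channels -> (participants, videos, 11, samples).
    stacked = np.stack([d[:, FRONTAL_IDX, :] for d in participant_data])
    p, v, c, s = stacked.shape
    # Reorder to (videos, samples, participants, channels), then merge the last two axes.
    return stacked.transpose(1, 3, 0, 2).reshape(v, s, p * c)

rng = np.random.default_rng(0)
n_samples = 128  # 8064 in the real data; shortened for the demo
data = [rng.standard_normal((40, 40, n_samples)) for _ in range(32)]
videos = per_video_arrays(data)
print(videos.shape)  # (40, 128, 352)
```

In this layout, column `p * 11 + c` of a video array holds frontal channel `c` of participant `p`.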

Band extraction
The EEG signals used for our research are in the time domain. Existing research demonstrated that emotional activities are recognized with better accuracy when the features are extracted in the frequency domain [22]. This is done by applying FFT to the time-domain signal. FFT is an algorithm that is used to convert a signal from the time domain to the frequency domain.
For X and Y of length n, these transforms are defined as follows:

$$Y(k) = \sum_{j=1}^{n} X(j)\, W_n^{(j-1)(k-1)}, \qquad X(j) = \frac{1}{n} \sum_{k=1}^{n} Y(k)\, W_n^{-(j-1)(k-1)} \qquad (1)$$

where $W_n = e^{-2\pi i/n}$ is one of the n roots of unity. As discussed earlier, emotional activities cause the brain to generate signals in the form of waves. These signals, which can be subdivided into 5 frequency bands, are correlated with emotional activities. The 5 frequency bands are delta, theta, alpha, beta, and gamma [21]. As stated in [21], the alpha, beta, and theta bands represent the emotional and cognitive processes of the brain better than the other 2 bands. For this reason, we extracted these 3 bands using a Butterworth band-pass filter after applying FFT on the EEG signals.
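As a rough illustration of band extraction, the sketch below isolates the theta, alpha, and beta bands with a Butterworth band-pass filter and, separately, measures the energy in each band from the FFT. The text does not specify exactly how the FFT and the filter were combined, so the two are shown here as independent views of the same signal; the helper names, the filter order, and the 10 Hz test tone are assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 128  # Hz, sampling rate of the preprocessed signals
BANDS = {"theta": (4.0, 7.0), "alpha": (7.0, 13.0), "beta": (13.0, 30.0)}

def extract_band(x, low, high, fs=FS, order=4):
    """Zero-phase Butterworth band-pass along axis 0 (filter order is an assumption)."""
    sos = butter(order, (low, high), btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=0)

def band_energy_fft(x, low, high, fs=FS):
    """Sum of squared FFT magnitudes for frequencies in [low, high)."""
    spectrum = np.fft.rfft(x, axis=0)
    freqs = np.fft.rfftfreq(x.shape[0], d=1.0 / fs)
    mask = (freqs >= low) & (freqs < high)
    return np.sum(np.abs(spectrum[mask]) ** 2, axis=0)

t = np.arange(8064) / FS
x = np.sin(2 * np.pi * 10 * t)  # synthetic 10 Hz tone, which lies in the alpha band
bands = {name: extract_band(x, lo, hi) for name, (lo, hi) in BANDS.items()}
energy = {name: band_energy_fft(x, lo, hi) for name, (lo, hi) in BANDS.items()}
print(max(energy, key=energy.get))
```

For the 10 Hz test tone, almost all of the energy ends up in the alpha band in both views, which is a quick sanity check that the band edges are applied as intended.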

Feature extraction
The experimenters in [17] also provided information regarding the emotions that are supposed to be felt after watching each video. Each video can be placed in one of the 4 emotional quadrants, which are HAHV, LAHV, LALV, and HALV, as represented in Figure 2. Individuals can respond differently to a specific video, which can result in the presence of irregular samples in the EEG signals. The influence of these erroneous samples needs to be reduced in each quadrant in order to minimize the inconsistency in the samples. For our research, in order to reduce the data deviation, the extracted band values of the videos were averaged according to their corresponding quadrant along with the specific emotion, as illustrated in Table 3.

Table 3. Videos assigned to each emotional quadrant
HAHV   LAHV   LALV   HALV
1      8      16     10
2      9      22     21
3      12     23     31
4      13     24     32
5      14     25     33
6      15     26     34
7      17     27     35
11     18     28     36
-      19     29     37
-      20     30     38

After sorting the videos in accordance with their quadrant and averaging the bands of all the videos from each quadrant, 4 video files were created which contained only the average values of the extracted bands. These band values were further scaled so that the SVM classifier is not dominated by large band values. Once the band values were scaled, the features of the input signals were extracted. For our work, we extracted the statistical features minimum, maximum, variance, standard deviation, wave entropy, power bandwidth, skewness, and kurtosis, grouped by the location or central tendency (Statistical Features I), the dispersion or spread (Statistical Features II), and the shape of the distribution (Statistical Features III), as illustrated in Table 4.
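The quadrant averaging, scaling, and feature extraction steps can be sketched as follows. This is a minimal sketch on synthetic data: the video-to-quadrant mapping is taken from Table 3, while the global min-max scaling, the helper names, and the array sizes are assumptions. Wave entropy and power bandwidth are left out because their exact definitions are not given in the text.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Video-to-quadrant assignment from Table 3 (1-based video numbers).
QUADRANTS = {
    "HAHV": [1, 2, 3, 4, 5, 6, 7, 11],
    "LAHV": [8, 9, 12, 13, 14, 15, 17, 18, 19, 20],
    "LALV": [16, 22, 23, 24, 25, 26, 27, 28, 29, 30],
    "HALV": [10, 21, 31, 32, 33, 34, 35, 36, 37, 38],
}

def quadrant_average(band_values):
    """band_values: (40 videos, samples, columns) -> dict of (samples, columns) arrays."""
    return {q: band_values[np.array(ids) - 1].mean(axis=0) for q, ids in QUADRANTS.items()}

def scale_global(x):
    """Assumed scaling: map the whole array to [0, 1] using its global min and max."""
    return (x - x.min()) / (x.max() - x.min())

def statistical_features(x):
    """Column-wise minimum, maximum, variance, standard deviation, skewness, kurtosis."""
    return np.column_stack([
        x.min(axis=0), x.max(axis=0), x.var(axis=0), x.std(axis=0),
        skew(x, axis=0), kurtosis(x, axis=0),  # kurtosis here is excess (Fisher) kurtosis
    ])

rng = np.random.default_rng(0)
band_values = rng.standard_normal((40, 256, 352))  # synthetic stand-in for one band
averaged = quadrant_average(band_values)
features = {q: statistical_features(scale_global(a)) for q, a in averaged.items()}
print({q: f.shape for q, f in features.items()})
```

Each quadrant yields one feature row per column (participant-channel pair), so the classifier input in this sketch would have 352 rows per quadrant.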

Emotion classification
Numerous machine learning algorithms have been used in existing studies, out of which SVM is regarded as one of the most efficient classifiers for classifying emotions [8], [9]. The basic idea of the SVM is to determine a decision hyperplane in order to classify data samples into two classes. The optimum hyperplane for separating the two groups is determined by maximizing the distance between the hyperplane and the nearest data points of both classes [23]. The classification procedure builds a confusion matrix by partitioning the sample data into a training set and a test set, for training and validation respectively, using a technique called k-fold cross validation [24]. This technique randomly divides the data into k equal subsets, and the procedure is repeated k times (10 times in our case). Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form the training set [24].
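The partitioning performed by k-fold cross validation can be illustrated with a short sketch. The function name `k_fold_indices` and the sample count of 352 are chosen only for illustration; any sample count works.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices once, then cut them into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        # The remaining k-1 folds together form the training set.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=352, k=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Every sample appears in exactly one test fold, so each sample is used for validation once and for training k-1 times.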
In our paper, different combinations of features were used for training and testing the SVM classifier in order to generate the confusion matrix. This matrix was then used to find the accuracy under k-fold cross validation. Here, we combined SVM with 10-fold cross validation, with the kernel and regularization parameters selected by the grid-search method. To implement the SVM, the LIBSVM library was used, which is a widely used library for support vector machines [25].
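A grid search over the kernel and regularization parameter with 10-fold cross validation might look like the sketch below. It uses scikit-learn's `SVC` (which is built on LIBSVM) rather than the LIBSVM interface itself, and the synthetic two-class data, the parameter grid, and the standardization step are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic two-class data with 3 features per sample (e.g. one feature combination).
X = np.vstack([rng.normal(0.0, 1.0, (100, 3)), rng.normal(2.0, 1.0, (100, 3))])
y = np.array([0] * 100 + [1] * 100)

# Candidate kernels and regularization values (an assumed grid).
param_grid = {"svc__kernel": ["linear", "rbf"], "svc__C": [0.1, 1, 10]}
cv = KFold(n_splits=10, shuffle=True, random_state=0)

search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),  # scale features, then fit the SVM
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Here `best_score_` is the mean accuracy over the 10 folds for the best parameter setting, which corresponds to the cross-validated accuracy reported in the results.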

RESULTS AND ANALYSIS
Classifying the statistical features to obtain a decent outcome was not an easy process. Various aspects had to be considered before reaching a conclusion, as the initial trials did not generate a satisfying output. In this paper, 10-fold cross validation was combined with the SVM classifier, with the regularization and kernel parameters selected via a grid-search approach. In order to determine the accuracy of k-fold cross validation for the classification technique, (2) was used.
The equation for accuracy is defined as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.

At the very beginning, prior to averaging the extracted band values according to their corresponding quadrants, all the statistical features from Table 4 were fed into the SVM classifier at once. They were used to train and test the SVM in order to construct the confusion matrix, which was then used to determine the accuracy based on the 10-fold cross validation. However, the very first trial did not provide the expected output, since the accuracy was found to be only 2.03%. This occurred because the features, before averaging the data, do not contain distinguishable characteristics, as can be seen from Figure 3. A box-and-whisker plot was used to graphically represent each feature through its quartiles, as shown in Figure 3. Afterward, we averaged the band values in accordance with their quadrant to reduce the data deviation and improve the accuracy of the classification process. Once the data were averaged, they were scaled and the statistical features were extracted again as discussed in Section 2.5. The features were once more fed into the classifier all at once. This time, there was a significant improvement in classification accuracy, which increased from 2.03% to 32.14%.
The accuracy before and after averaging the data, using all the features, is summarized in Table 5. It can be observed from the table that the features do not offer a satisfactory outcome prior to averaging the data, because not all the participants feel the same emotion for a specific category of video. Hence, the SVM classifier could not produce the expected model due to the irregularity in the data for each quadrant.

Table 5. Accuracy for Classification before and after Averaging the Video Data
State of the Data            Accuracy (%)
Before averaging the data    2.03
After averaging the data     32.14

Even though the classification accuracy improved after averaging the band values, it still did not provide the expected output. Subsequently, we reconstructed the boxplot to analyze each feature, this time after averaging the data according to their corresponding quadrants, as can be seen in Figure 3(b).
A box-and-whisker plot was used to distinguish each feature once the data were averaged with respect to their corresponding quadrants, as shown in Figure 3(b). It can be observed from the figure that the data of the features skewness, kurtosis, and wave entropy can be easily distinguished from each other, as the data do not overlap and deviate significantly. The situation is similar for the features skewness and power bandwidth, as these two features also show significant distinguishable characteristics. On the other hand, the features mean, variance, and standard deviation deviated somewhat from each other, but most of the data were still seen to be overlapping. Furthermore, the difference among the features minimum, maximum, and variance was not prominent, as the deviation in the data was very small. Based on these observations, we segregated the features into the following combinations:
1) Feature combination A: mean, standard deviation, variance.
2) Feature combination B: skewness, kurtosis, wave entropy.
3) Feature combination C: minimum, maximum, variance.
4) Feature combination D: skewness, power bandwidth.
It was observed that the classification accuracy improved by a significant amount, as illustrated in Table 6. Combination B provided the best result, with an accuracy of 92.36% for the quadrant pair HAHV_LALV and 89.11% for the quadrant pair HALV_LAHV, whereas combination C provided the worst result, with an accuracy of 11.23% for HAHV_LALV and 15.69% for HALV_LAHV. The reason feature combination B provides a better result is that the samples of the features skewness, kurtosis, and wave entropy can be easily distinguished from each other, as the data deviate significantly (see Figure 3(b)). Similarly, the accuracy for feature combination D was also satisfactory for the same reason. On the other hand, feature combinations A and C could not provide a satisfactory result due to the overlap and similarity of the data. As a result, the SVM classifier produces a confusion matrix with low accuracy for these combinations. From the above results, we could conclude that the shape of the distribution represents the emotional activities of the brain well and thus provides a better classification accuracy.
As DEAP is a public dataset, we compared our proposed approach with existing methods that have already used the DEAP dataset for recognizing emotions. Table 7 lists some of the existing methods for emotion recognition using the DEAP dataset. It also provides information regarding the emotions identified, the features extracted, and the accuracy of emotion classification for each method. It can be observed that most of the existing methods extracted more than one type of feature and a few identified more than two emotions. However, for our work, only the statistical features were considered for the feature extraction process and only two states of emotion were identified, namely valence and arousal. Furthermore, it can also be seen from the table that the accuracy of the listed methodologies stays below 80%, whereas our proposed approach provided an accuracy of 92.36%. Thus, it can be said that our approach classifies emotions more accurately than many of the existing approaches.

Table 7. Existing methods for emotion recognition using the DEAP dataset
Reference          Emotions identified                            Features extracted                        Accuracy (%)
[26]               Valence and arousal                            DWT, WE, and Statistical                  71.40
[27]               Male/female valence and male/female arousal    Statistical, Linear, and Non-statistical  78.60
[28]               Stress and calm                                Statistical, PSD, and HOC                 71.40
[29]               Valence, arousal, and dominance                Statistical and HFD                       71.40
[30]               Excitation, happiness, sadness, and hatred     WT (db5), SE, CC, and AR                  78.60
[31]               Anger, surprise, and other                     HHS, HOC, and STFT                        78.60
Proposed Approach  Valence and arousal                            Statistical                               92.36

CONCLUSION
In this paper, the preprocessed EEG signals from the DEAP dataset were used to classify two types of emotions, namely valence and arousal. The samples in the dataset were first transferred from the time domain to the frequency domain by applying FFT, followed by the extraction of the alpha, beta, and theta frequency bands, which are particularly significant for emotion recognition. Subsequently, the extracted bands were averaged according to their quadrant for each emotion, and the averaged band values were used to extract statistical features. After that, the extracted features were scaled and various feature combinations were fed into the SVM classifier for emotion recognition. It was observed that our approach predicts emotions with an accuracy of 92.36% using the skewness, kurtosis, and wave entropy features. Our proposed model shows better results than the existing methods compared on the DEAP dataset.
Prommy Sultana Ferdawoos is an undergraduate student in the Department of Computer Science and Engineering at BRAC University. She will complete her Bachelor of Science (B.Sc.) in Computer Science by the end of December 2018. She aspires to pursue a career in the fields of Brain-Computer Interface, Bioinformatics, Artificial Intelligence, Networking, and Image Processing.