Emotion recognition from syllabic units using k-nearest-neighbor classification and energy distribution

ABSTRACT
This paper proposes an emotion recognition scheme based on syllabic (consonant-vowel) units extracted from the Moroccan Arabic dialect emotional database (MADED). For each manually segmented syllable, acoustic features including pitch, formants, and the logarithmic energy distribution over six frequency bands are computed, and a k-nearest-neighbor classifier is used to discriminate four emotions: anger, sadness, joy, and neutral. Binary and multiple classification experiments on the syllables /ba/, /du/, /ki/, and /ta/ show encouraging recognition rates, particularly for the binary tasks.


INTRODUCTION
Interpreting emotional information is imperative for the social interactions that we have every day [1], [2]. This interpretation involves many different components, such as body language, posture, and facial and vocal expressions. Information related to these components is usually obtained from physiological sensors, sound, or images [3]-[8]. The data is manipulated at a very low level (sound samples or even pixels of images), and the gap between this low-level data and the interpretation that humans make of it is enormous. Indeed, the manifestation of emotions is an especially intricate area of human communication lying at the intersection of several disciplines, such as psychology, psychiatry, audiology, and computer science [9]-[11]. The analysis of conversations, or 'speech analytics', is one of the recent challenges in many applications, for example health monitoring [12], video games [13], and computer science [14]. A typical domain in which emotional state is learned from the analysis of conversations is the call center: a better understanding of customer needs translates into better management and greater benefit for the enterprise [15]-[17].
Machine learning relies on data, which usually come from different signals. One of the most frequently used signals is the speech signal: speech is indeed one of the fundamental modalities that humans use to communicate and to convey emotional information.

PROPOSED METHOD
In this article, a new feature extraction scheme is proposed, based on a pseudo-phonetic approach [45], [46]. The key point of our work is to extract the features from segments such as syllabic units in order to remedy the constraint of linguistic variation. These segments are identified by manual segmentation of the speech signal. Our method is based on extracting cues using signal processing techniques. Low-level descriptors (LLDs) and high-level descriptors (HLDs), as shown in Table 2, are obtained from a voice signal labelled with four emotions (anger, sadness, joy, and neutral). Each chosen audio sequence must first satisfy the audibility criterion and then passes through the following process:
- Modulating the signal according to a set of contextual, cultural, and linguistic variables whose purpose is to allow communication (emotional or not);
- Capturing the signal produced by the speaker;
- Annotating the signal [47];
- Segmenting the signal into syllabic units;
- Extracting acoustic features using signal processing techniques;
- Classifying with the KNN algorithm.

Data preparation
Any scientific study in machine learning is extremely dependent on the data used to describe the phenomenon to be modelled. Therefore, collecting data adapted to the task we want to accomplish is a crucial step. Nowadays there has been some genuine work in the area of emotion recognition in general and emotion recognition from speech in particular; however, a large portion of this work has been evaluated on acted speech [48], [49], and very little work has been done on real, spontaneous speech [50].
The study presented in this article is located in the context of emotion detection during interactions between Moroccan citizens, aged 16-60 years, expressing four basic emotions: happiness, anger, neutral state, and sadness. Our corpus, the Moroccan Arabic dialect emotional database (MADED), is obtained from uncontrolled recordings collected from real situations that can be extremely diverse. The selected subset includes situations taking place in different contexts (indoor and outdoor scenes, monologue, and dialogue). The emotions are validated and labelled with the interface shown in Figure 2, built by our team. First, the data set is converted to .wav format and cut into syllabic units, which constitute the basic unit of our processing. The syllables are then sorted by type, with syllables of the same type stored in the same folder.

Figure 2. Desktop annotation tool used for emotion evaluation
In the literature, there is a great deal of discussion on the length of audio from which emotion can be extracted reliably [51]-[54]. In our study, we propose a basic methodology for segmenting the audio based on a pseudo-phonetic approach. The main idea is to extract features (formants, pitch, and energies in six bands) from the speech signal according to the syllabic segment (CV), where C denotes a consonant and V a vowel. For all our experiments, we considered only four emotions, namely joy, anger, sadness, and neutral, because the rule-based emotion extraction system that we used catered to only these four emotions. There were 979 sound records related to these four emotions, specifically 250 syllables for /ba/, 240 for /du/, 270 for /ki/, and 219 for /ta/, as shown in Table 3.

Speech features extraction
At present, there is no agreement on the best set of descriptors for an automatic emotion recognition system. The most common practice is to select a large number of descriptors to obtain a richer representation. However, increasing the number of descriptors too much for a corpus of reduced size can lead to performance degradation and thus be counterproductive. To solve this problem, it is necessary to adopt a strategy that reduces the number of descriptors while keeping an acceptable recognition rate. The system we propose is composed of the standard acoustic features shown in Table 2, which have served as the challenge baseline set since the INTERSPEECH 2009 Emotion Challenge [55]. The novelty comes from the way we computed the energy of the CV segments [56]. The main tools we used are MATLAB code and the Praat toolbox [57], [58].
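As an illustration, the sketch below shows one way to extract frame-level pitch and midpoint formant values from a syllable file using the praat-parselmouth Python wrapper. The authors used Praat and MATLAB directly, so the package choice, file name, and analysis parameters here are assumptions for illustration only.

```python
# Hedged sketch: pitch and formant extraction with praat-parselmouth.
# File name, parameters, and this wrapper are illustrative assumptions;
# the authors used Praat and MATLAB scripts directly.
import numpy as np
import parselmouth

snd = parselmouth.Sound("ba_001.wav")      # one manually segmented CV syllable

# Pitch contour (F0) over the syllable
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]     # Hz, 0 where unvoiced
f0_voiced = f0[f0 > 0]

# First three formants at the syllable midpoint (Burg method)
formants = snd.to_formant_burg()
t_mid = snd.duration / 2
f1, f2, f3 = (formants.get_value_at_time(i, t_mid) for i in (1, 2, 3))

# Simple high-level descriptors (HLDs) summarizing the pitch contour
features = {
    "f0_mean": float(np.mean(f0_voiced)) if f0_voiced.size else 0.0,
    "f0_std": float(np.std(f0_voiced)) if f0_voiced.size else 0.0,
    "F1": f1, "F2": f2, "F3": f3,
}
print(features)
```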

Computation of logarithmic energy characteristics based on DFT
In the pre-processing phase, we divided the speech signal, sampled at 22050 Hz, into time segments of 11.6 ms with an overlap of 9.6 ms. Thereafter we applied a Hamming window to each segment, followed by zero-padding. Finally, we calculated a 512-point discrete Fourier transform (DFT). To compute the energy and its distribution for each syllable, as shown in Figure 3, we chose six specific frequency bands as in [59], [60]:

$$E_B(n) = \log \sum_{s=l_B}^{u_B} |X_n(s)|^2 \qquad (1)$$

where the band index $B$ goes from 1 to 6, the frequency index $s$ ranges over the DFT indices between $l_B$ and $u_B$, the lower and upper boundaries of each frequency band, and $X_n(s)$ is the DFT of frame $n$. After that, normalization is applied to each frame according to (2):

$$\bar{E}_B(n) = \frac{E_B(n)}{E(n)} \qquad (2)$$

where $\bar{E}_B(n)$ is the normalized energy of band $B$ in frame $n$, $E(n)$ is the overall energy in frame $n$, and $E_B(n)$ is the energy of band $B$ in frame $n$.
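A minimal numerical sketch of this pre-processing chain is given below. The six band edges are placeholders (the paper takes its band definitions from [59], [60], which are not reproduced here); the frame lengths follow the 11.6 ms / 9.6 ms figures quoted above.

```python
# Hedged sketch of the framing, Hamming windowing, 512-point DFT, and
# band-energy computation described above. The six band edges below are
# placeholders; the paper takes its band definitions from [59], [60].
import numpy as np

FS = 22050                          # sampling rate (Hz)
FRAME = int(round(0.0116 * FS))     # 11.6 ms -> ~256 samples
OVERLAP = int(round(0.0096 * FS))   # 9.6 ms  -> ~212 samples
HOP = FRAME - OVERLAP
NFFT = 512
BAND_EDGES_HZ = [0, 400, 800, 1500, 2500, 4000, 8000]   # assumed, 6 bands

def band_energies(signal: np.ndarray) -> np.ndarray:
    """Return an (n_frames, 6) array of normalized log band energies."""
    window = np.hamming(FRAME)
    # DFT bin indices of the band boundaries
    edges = [int(f * NFFT / FS) for f in BAND_EDGES_HZ]
    frames = []
    for start in range(0, len(signal) - FRAME + 1, HOP):
        frame = signal[start:start + FRAME] * window
        spectrum = np.fft.rfft(frame, n=NFFT)        # zero-padded to 512 points
        power = np.abs(spectrum) ** 2
        e_total = np.log(np.sum(power) + 1e-12)      # overall frame energy E(n)
        e_bands = np.array([
            np.log(np.sum(power[lo:hi]) + 1e-12)     # band energy E_B(n), eq. (1)
            for lo, hi in zip(edges[:-1], edges[1:])
        ])
        frames.append(e_bands / e_total)             # normalization as in (2)
    return np.asarray(frames)
```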

THE RECOGNITION MODEL
The recognition model consists of two steps: (1) extracting the acoustic features to build a data set and (2) applying the classification model. The data set is obtained using Praat and MATLAB code. For the classification task, we used KNN [61] to assign each instance of the data set to an emotion class. The KNN algorithm is a supervised learning technique that can be used for both classification and regression. To make a prediction, the KNN algorithm relies on the entire dataset: for a new observation that is not part of the dataset and whose output we want to predict, the algorithm searches for the K instances of the dataset closest to that observation. The algorithm then uses the output variables Y of these K neighbors to compute the value of Y for the observation to be predicted. When KNN is used for classification, the prediction is the mode of the Y values of the K closest observations.

Algorithmic composition
We can represent the operations of KNN with the following pseudo-code:

Algorithm 1
Start Algorithm
Input data:
  a) A data set D.
  b) A function d for computing distances.
  c) An integer K.
For a new observation X for which we want to predict its output variable Y, do:
  Step 1: Compute the distances between X and all the observations of the data set D.
  Step 2: Retain the K observations of the data set D closest to X according to the distance function d.
  Step 3: Take the values of Y of the K retained observations:
    a) For a regression, compute the mean (or the median) of the retained Y values.
    b) For a classification, compute the mode of the retained Y values (this is our case).
  Step 4: Return the value computed in Step 3 as the value predicted by KNN for observation X.
End Algorithm

KNN needs a function to compute the distance between two observations. In our case the data are continuous, so the Euclidean distance is a good candidate.
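A minimal, self-contained Python sketch of this pseudo-code, using the Euclidean distance discussed next, might look as follows; the array shapes, label strings, and tie-breaking rule are illustrative assumptions.

```python
# Hedged sketch of Algorithm 1: plain KNN classification with Euclidean distance.
# X_train has one row per syllable (columns are acoustic descriptors);
# y_train holds the emotion labels (e.g. "anger", "joy", ...).
from collections import Counter
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    # d(x, y) = sqrt(sum_j (x_j - y_j)^2)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def knn_predict(X_train: np.ndarray, y_train: list, x_new: np.ndarray, k: int) -> str:
    # Step 1: distances between x_new and every training observation
    distances = [euclidean(row, x_new) for row in X_train]
    # Step 2: indices of the K closest observations
    nearest = np.argsort(distances)[:k]
    # Step 3 (classification): mode of the labels of the K neighbours
    votes = Counter(y_train[i] for i in nearest)
    # Step 4: return the predicted class (ties broken by first-seen order)
    return votes.most_common(1)[0][0]

# Illustrative usage with made-up feature vectors
X_train = np.array([[0.2, 0.5], [0.3, 0.4], [0.8, 0.9], [0.9, 0.8]])
y_train = ["neutral", "neutral", "anger", "anger"]
print(knn_predict(X_train, y_train, np.array([0.85, 0.85]), k=3))  # -> "anger"
```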

Euclidean distance
It is a distance that computes the square root of the sum of the squared differences between the coordinates of two points:

$$d(x, y) = \sqrt{\sum_{j=1}^{n} (x_j - y_j)^2}$$

where $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$. For all our experiments, we used the software STATISTICA [62], a set of data mining tools allowing the processing and selection of parameters and offering different learning algorithms. This software is increasingly used in the pattern recognition community. It includes many well-known algorithms such as KNN, SVM, and decision trees (J48), as well as meta-algorithms.
STATISTICA's KNN is a memory-based model defined by a set of examples (objects) for which the outcome is known (i.e., the examples are labeled). In KNN, the independent and dependent variables can be either categorical or continuous. For continuous dependent variables the problem is a regression; otherwise it is a classification. Hence, KNN in STATISTICA can handle both classification and regression problems. Given a new example (object), we want to approximate its outcome based on the KNN examples. To make the decision, KNN must find the K examples that are nearest in distance to the new object. For regression problems, KNN predictions are obtained by averaging the outcomes of the K nearest neighbors; for classification problems, KNN uses the majority-vote rule. The value of K strongly impacts the prediction accuracy. To find the optimal value of K we can use the cross-validation procedure available in STATISTICA.
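STATISTICA performs this search for K through its own interface. A roughly equivalent sketch using scikit-learn, as a stand-in for STATISTICA's built-in cross-validation (an assumption, not the authors' tool), could be:

```python
# Hedged sketch: choosing K by cross-validation, here with scikit-learn as a
# stand-in for STATISTICA's built-in procedure (an assumption; the authors
# worked inside STATISTICA itself).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def best_k(X: np.ndarray, y: np.ndarray, k_values=range(1, 16), folds=5) -> int:
    """Return the K with the highest mean cross-validated accuracy."""
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                           cv=folds, scoring="accuracy").mean()
        for k in k_values
    }
    return max(scores, key=scores.get)

# Example call (X: syllable feature matrix, y: emotion labels):
# k_opt = best_k(X, y)
```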

EXPERIMENTAL RESULTS
We ran the KNN algorithm several times, each time aiming at a different goal. The classifications we performed were accordingly binary or multiple. All experiments were carried out on data sets collected from the syllabic units /ba/, /du/, /ki/, and /ta/. Four classification tasks are presented in this work (a sketch of the corresponding label setups follows this list):
a. We tested the ability of the proposed model to detect each of the studied emotions. The classification in this case is binary: the targeted emotion (N: neutral, H: joy, S: sadness, A: anger) was labelled by its name and the others by O (others).
b. Still from the binary classification perspective, we examined to what extent our model can separate positive from negative emotions.
c. Within each group of emotions (positive and negative), we tested whether the proposed feature vector is a satisfactory tool to distinguish between them (according to [63], positive emotions include joy and neutral, and negative emotions include sadness and anger).
d. Finally, a multiple classification was performed to evaluate the whole system.
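As referenced above, the following sketch shows how the four task-specific label sets can be derived from the original emotion labels; the positive/negative grouping follows [63] as stated in the text, while the helper functions themselves are illustrative assumptions.

```python
# Hedged sketch: deriving the label sets for the four classification tasks
# from the original emotion labels. Grouping of positive/negative follows [63].
POSITIVE = {"neutral", "joy"}
NEGATIVE = {"sadness", "anger"}

def one_vs_others(labels, target):
    # Task (a): targeted emotion vs. "O" (others)
    return [lab if lab == target else "O" for lab in labels]

def positive_vs_negative(labels):
    # Task (b): positive vs. negative emotions
    return ["positive" if lab in POSITIVE else "negative" for lab in labels]

def within_group(samples, labels, group):
    # Task (c): keep only the samples of one group and classify within it
    kept = [(s, lab) for s, lab in zip(samples, labels) if lab in group]
    kept_samples = [s for s, _ in kept]
    kept_labels = [lab for _, lab in kept]
    return kept_samples, kept_labels

# Task (d) simply uses the original four emotion labels unchanged.
```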
The results of these experiments are presented below. As shown in Figure 4, the analysis of the syllabic unit /ba/ using the proposed set of features gives high detection accuracies. Indeed, we obtained for example 60.20% for all emotions and 73.63% for distinguishing between positive and negative emotions. In the same way, the rates are respectively 86.21% and 78.95% within the positive emotions (neutral vs joy) and within the negative emotions (anger vs sadness). The KNN algorithm recognized the neutral state with more than 89.55%, sadness with 85.07%, anger with 79.10%, and joy with 76.12%. Similarly, Figure 6 shows the accuracies that we obtained for the CV /ki/. The KNN algorithm recognized joy with more than 93.20%, anger with 84.47%, the neutral state with 81.55%, and sadness with 78.15%. The binary classification between positive and negative emotions achieved a recognition rate of 72.33%. Furthermore, considering each group of emotions separately, the results reached 87.62% within the positive emotions and 75% within the negative ones. Concerning the classification of all emotions, the rate reached its minimum value of 52.43%.
Finally, the study of the CV /ta/ shows that the neutral state presents the highest recognition rate at 90.13%, followed by anger with 84.57%, sadness with 78.15%, and lastly joy with 77.16%. For negative vs positive emotions, the accuracy is 75.92%. Within each category of emotion (positive and negative), the accuracy is 89.74% and 77.38% respectively. The rate obtained for all emotions together is 61.73%, which can be considered acceptable, as shown in Figure 7.

DISCUSSION
Our study aims to provide an automatic emotion recognition model with a reduced set of acoustic features. Measurements were carried out using the MADED database, from which we extracted four plosive consonants /b/, /d/, /k/, /t/ associated with the three vowels /a/, /u/, and /i/. These choices are based on previous works [64], where the authors proved that energy and its distribution in specific bands play an important role in characterising Arabic plosive consonants. The results we obtained are quite satisfactory compared to previous works, as shown in Table 1. It should be pointed out that our study raised many questions. Indeed, it is seen from the results that the consonant /d/ achieves the best rates in almost all cases (especially in the neutral case, 94.95%), which may lead us to think that the place of articulation of the consonant has a role to play in determining the emotion under consideration. Moreover, syllables associated with the same vowel (/ba/ and /ta/) seem to present almost the same results. However, when we investigate the recognition rates more carefully, we can see that:
a. Negative emotions present the best rates (78.95% for /ba/ and 77.38% for /ta/, versus 70.18% for /du/ and 75% for /ki/).
b. When the vowel /a/ is associated with the plosive /b/, sadness is recognized better than anger; the opposite occurs when /a/ is associated with /t/.
A brief comparison between the syllables /ki/ and /du/ shows that joy presents the best recognition rate for both (93.20% and 91.41% respectively), but differences occur in the neutral and multiple-classification cases. These rates establish, in fact, how close objects of the proposed emotional representation are to each other. Indeed, the exploratory nature of our study dictated the choice of the KNN algorithm rather than SVM or an artificial neural network (ANN) for the classification task. Our main concern is to establish to what extent the feature vector succeeds in evaluating similarities between instances of the same emotion.

CONCLUSION
This work gives a good grounding in modeling emotion with acoustic features. The method given here uses energy and its distribution in six bands as the principal tool for distinguishing between the four basic emotions: neutral, sadness, joy, and anger. The classical KNN algorithm is used to perform the classification task. In some cases, the results were conclusive but not exhaustive. This study can be extended in future work to richer corpora with different utterance representations in different languages and with different algorithms such as neural networks and support vector machines.