A pre-trained model vs dedicated convolution neural networks for emotion recognition

ABSTRACT


INTRODUCTION
As computer technology has advanced, efforts have been made to develop smart devices capable of simulating the human mind. Human ambition was not limited to training machines to perform human tasks; the goal became to develop devices capable of analyzing and distinguishing human emotions. Emotions can be recognized in various ways: from heart rate variability (HRV), electroencephalography (EEG) signals, galvanic skin response (GSR) signals, speech emotion recognition (SER), and facial expression recognition (FER). FER is one of the most effective communication mechanisms, enabling human-machine interaction (HMI) systems to understand humans' internal emotions through facial expressions, which play an important role in social interaction [1]. Automatic recognition of facial expressions is an important topic in computer vision research because of its practical value in many fields, such as security, interactive games, health care, patient status monitoring [2], and commercial advertising, where it reveals consumers' reactions [3]. In autonomous driving systems (ADS), FER is used to identify the driver's emotions; an ADS can exploit FER to improve safety and prevent traffic accidents [4], [5]. Building a machine capable of distinguishing facial expressions or detecting faces [6], [7] is a difficult task because of the wide variety of faces in terms of age, gender, and other attributes [8]. Deep learning is the technology that has brought artificial intelligence (AI) devices closest to human-level intelligence, and its use in the field of facial expression recognition has given promising results.

The proposed ERCNN model
Table 1 shows the difference between the proposed ERCNN model and the original VGG16. The proposed ERCNN architecture consists of 24 basic layers. Seventeen Conv2D layers are used, each with a (3,3) kernel size and multiple filters; the number of filters ranges from 64 to 512, and the rectified linear unit (ReLU) is the activation function used with each Conv2D layer. Batch normalization follows each Conv2D layer to make the deep network faster and more stable by normalizing and standardizing its inputs. Five MaxPooling2D layers with a (2,2) pool size retain the most important of the features extracted by the convolution layers, thereby reducing arithmetic operations and helping to prevent overfitting. Each max-pooling layer is followed by a dropout layer with a dropout probability of 0.3, which drops nodes at random and further prevents overfitting. One flattening layer and one fully connected (dense) layer with a SoftMax activation function complete the network. The architecture of the proposed ERCNN model is shown in Figure 1. The training parameters were specified as follows: 200 epochs, a batch size of 265, a width and height of (48,48), and 8 classes for FER+. Adam was chosen as the optimizer. The padding option "same" was used to avoid image-size loss after applying the CNN kernel, and the stride is 1. Cross-entropy was used as the loss function to speed up training and improve the model's classifications.
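A minimal Keras sketch of the layer stack described above. The per-block distribution of the 17 Conv2D layers and the pooling strides are assumptions here; the text gives only the totals, the 64-to-512 filter range, and the layer types.

```python
from tensorflow.keras import layers, models

def build_ercnn(input_shape=(48, 48, 1), num_classes=8):
    """24-layer ERCNN sketch: 17 Conv2D + 5 MaxPooling2D + Flatten + Dense."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Five VGG-style blocks; the 2-2-4-4-5 split of the 17 Conv2D layers
    # is an assumption, as is the default (2,2) pooling stride.
    for filters, n_convs in [(64, 2), (128, 2), (256, 4), (512, 4), (512, 5)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(filters, (3, 3), strides=1,
                                    padding="same", activation="relu"))
            model.add(layers.BatchNormalization())  # after every Conv2D
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
        model.add(layers.Dropout(0.3))              # after every max-pooling
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Calling `build_ercnn().summary()` prints the resulting stack for inspection against Figure 1.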

VGG16 pre-trained model
VGG16 is a convolutional neural network used for image classification. It was trained on the ImageNet dataset, which it classifies into 1,000 classes. In this paper, following the transfer-learning approach, a pre-trained VGG16 model is created: VGG16 is loaded from Keras and fine-tuned to fit the requirements of identifying human emotions. To create the emotion recognition model, the initial layers of VGG16 are frozen and the last four layers are retrained on the extended data to predict the emotions. The original fully connected layers are removed, and a new fully connected layer is created to meet the requirements of the emotion recognition task. For retraining, the VGG16 pre-trained model parameters are set as follows: the input layer has a (224,224,3) shape; an AveragePooling2D layer with a (2,2) pool size is used; the rectified linear unit (ReLU) is the activation function in the fully connected training layer, which is followed by a dropout layer with a dropout probability of 0.5; and SoftMax is the activation function in the fully connected output (prediction) layer, whose number of classes is set to 8 to match the emotions in the extended FER+ data. The extended dataset consists of grayscale (single-channel) images of 48×48 pixels, whereas VGG16 expects 224×224 color images (3 channels). The grayscale images are therefore resized from 48×48 to 224×224 and converted into three-channel grayscale images using ImageDataGenerator.
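The fine-tuning setup above can be sketched as follows. The width of the new fully connected layer is an assumption, since the text does not state it, and `weights=None` can be passed to skip the ImageNet weight download when experimenting.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

def build_vgg16_fer(weights="imagenet", num_classes=8):
    """VGG16 transfer-learning sketch: freeze all but the last four layers."""
    base = VGG16(weights=weights, include_top=False,
                 input_shape=(224, 224, 3))
    for layer in base.layers[:-4]:   # initial layers stay frozen
        layer.trainable = False
    model = models.Sequential([
        base,
        layers.AveragePooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),  # hidden width is an assumption
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```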

COMPARISON BETWEEN THE ERCNN AND THE VGG16
CNN is a well-known deep learning algorithm that has been effectively employed in the identification of high-dimensional data, particularly images [25]. The convolution procedure converts an image into another image of comparable size using a weighted matrix, and convolution is used to extract the feature map [26]. In this work, therefore, the focus has been on increasing the number of convolution layers. Table 1 shows the difference between the architecture of the proposed model and the original VGG16 architecture [27]. The number of convolution layers is increased to 17, compared with 13 in the original. One fully connected layer is used, whereas the original VGG16 has three fully connected layers with dropout of 0.5 after the first two; batch normalization is added after each convolution layer, whereas the original uses no batch normalization. In the proposed model, max pooling uses a stride of 1 with dropout of 0.3, while the stride in the original VGG16 is 2. The original VGG16 uses SGD as its optimizer, but in our work Adam was used.

THE DATA SET

FER+ corrected dataset
Barsoum et al. [28] presented the FER+ dataset, which modifies FER2013 with multiple labels for each face image. The wrong labels of the FER2013 dataset were corrected through crowdsourcing, using 10 taggers to label each image, as shown in Figure 2, which indicates the FER+ corrected labels. The number of emotions in FER+ becomes 10 classes: neutral, happiness, surprise, sadness, anger, disgust, fear, contempt, unknown, and not face (NF). In FER+, 80% of the images are designated as training samples, 10% as validation samples (public test), and 10% as test samples (private test). In this work, the script (FERPlus/src/generate_training_data.py) is used to obtain the CSV file for the corrected data from (Microsoft/FER+) [29]. The FER+ dataset contains 35,710 images, 177 fewer than the original FER2013 data (35,887); the difference results from deleting the NF class and the unknown class, which contains blurred images.

New data
The first step in modifying the dataset is to add new data to FER+. Graduate students in machine learning at New York University created the new facial expression data [30]. The new data consists of 13,690 grayscale images with dimensions of 48×48. We added 65% of the new data to the training data and 35% to the validation data. The new data contains 8 emotion classes: anger, surprise, disgust, fear, neutral, happiness, sadness, and contempt. The neutral emotion contains the largest number of images (6,868) and contempt the smallest (9), as shown in Table 2.

Pre-processing phase
After downloading and reading the data, it is preprocessed in a series of steps: the data is split into training, validation, and test sets; the pixel strings are converted to lists of integers; the data is converted to NumPy arrays and the grayscale images are normalized by 255; and the training data is shuffled. When the dataset is used with VGG16, the images are resized to 224×224. VGG16 only accepts color (RGB) images, so each grayscale image must be turned into a three-channel grayscale image using ImageDataGenerator.
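The conversion and shuffling steps can be sketched in NumPy, assuming (as in the FER2013/FER+ CSV layout) that each image is stored as a space-separated pixel string:

```python
import numpy as np

def preprocess(pixel_strings, size=48):
    """Space-separated pixel strings -> normalized (N, size, size, 1) array."""
    arrays = [np.asarray(s.split(), dtype=np.float32) for s in pixel_strings]
    images = np.stack(arrays).reshape(-1, size, size, 1)
    return images / 255.0  # normalize grayscale values into [0, 1]

def shuffle_data(images, labels, seed=0):
    """Shuffle images and labels together with one shared permutation."""
    idx = np.random.default_rng(seed).permutation(len(images))
    return images[idx], labels[idx]
```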

Combining the datasets
The extended data is a combination of the two sets. To integrate the new data into FER+, several operations were performed on the new data: the user id column was removed, and the new data labels were made lowercase to match the FER+ labels. The new emotion labels are anger, surprise, disgust, fear, neutral, happiness, sadness, and contempt. The images in the new data are saved to the FER folders, and the distribution of the new data is defined as 65% added to the training data and 35% to the validation data. Images of a particular emotion from the new data are added to the corresponding emotion in the FER+ dataset (e.g., sad to sad, angry to angry).
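These merge steps can be sketched with pandas; the column names `user.id` and `emotion` are assumptions about the new data's CSV headers:

```python
import pandas as pd

def split_new_data(new_df, train_frac=0.65, seed=42):
    """Drop the user id column, lowercase the labels, split 65%/35%."""
    df = new_df.drop(columns=["user.id"])              # column name assumed
    df = df.assign(emotion=df["emotion"].str.lower())  # match FER+ labels
    train = df.sample(frac=train_frac, random_state=seed)
    val = df.drop(index=train.index)                   # remaining 35%
    return train, val
```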

Applying data augmentation
One of the most prominent problems in emotion recognition systems is the scarcity of facial expression data, which became especially apparent when deep learning began to be used in computer vision. Training deep learning models requires large amounts of data to reach satisfactory results. Data augmentation was used to avoid the overfitting problem by enlarging the emotions dataset [31]. In this work, augmentation was applied using several techniques: rotation of up to 40 degrees, a width-shift range of 0.3, a height-shift range of 0.3, a zoom range of 0.3, and horizontal flipping. Different versions of the original images are thus created, increasing the diversity of the extracted features.
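The listed settings map directly onto Keras's `ImageDataGenerator`, the class the text uses elsewhere for resizing; a configuration sketch:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# The augmentation settings stated above.
datagen = ImageDataGenerator(
    rotation_range=40,       # rotate images by up to 40 degrees
    width_shift_range=0.3,
    height_shift_range=0.3,
    zoom_range=0.3,
    horizontal_flip=True,
)
```

Passing training batches through `datagen.flow(...)` then yields randomly transformed copies of the original images.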

Implementation of the proposed work system
The proposed work system was implemented in the Python programming language. It was trained and tested with Keras and TensorFlow on the Kaggle platform, which provides free kernel access to Nvidia K80 GPUs; using the GPU yields a 12.5× speedup during deep learning model training. The parameters were selected to suit the proposed work through trial and error; after trying many combinations, the parameters used in both models are those shown in Table 3.

Testing the proposed ERCNN with the expanded data
In this experiment, the ERCNN model is trained and tested using the extended data (FER+, new data). With 65% of the new data added to the training data and 35% to the validation data, the accuracy obtained was 87.133% in the public test and 82.648% in the private test. Figure 3(a) presents the model's loss and accuracy. The confusion matrix for class prediction on the FER+ test dataset is shown in Figure 3.

Testing VGG16 pre-trained model with expanded data
In this experiment, four layers of the VGG16 pre-trained model are trained with the extended data (FER+, new data). When 65% of the new data was added to the FER+ training data, the model's performance was poor: the network did not converge well, as shown in Figure 5, which reveals the overfitting that occurred. The predictions on internet images were also poor, as shown in Figure 6(a), even though the public accuracy was 74.253% and the private accuracy was 66.498%. To improve the model's performance and reduce the overfitting, the amount of training data was increased by adding 90% of the new data to the training data and 10% to the validation data. Figure 6(b) shows some improvement in the model's performance when tested on internet images, in terms of emotion prediction, and the overfitting became less pronounced. Figure 7 shows the loss and accuracy curves for the VGG16 pre-trained model; the accuracy was 71.685% in the public test and 67.338% in the private test. The confusion matrix for the VGG16 pre-trained model on the FER+ test dataset is shown in Figure 8, from which the precision and recall for each class are calculated; the highest precision was 100% for the happiness class. From Figure 4, it is clear that the predictions of the proposed ERCNN model for each emotion were correct and had high prediction accuracy: the accuracy for the happy face is 99.92%, the neutral face 97.54%, and the sad face 97.92%. When the VGG16 pre-trained model was evaluated on the internet images, its performance was poor even after increasing the amount of training data: only one of three faces' emotions was predicted correctly, as shown in Figure 7. The results of the comparison are shown in Table 4.

Comparing with the existing work
In this section, the proposed work system is compared with studies that used the same dataset (FER+) for training and testing. When the ERCNN model was trained and tested using the FER+ dataset, the results were 87.133% in the public test and 82.648% in the private test. This was higher than the accuracy obtained when training and testing the VGG16 pre-trained model with the FER+ dataset, where the accuracy was 71.685% in the public test and 67.338% in the private test. The proposed model was also tested (extra validation) on images containing one or more faces with different emotions in a single image. To test the proposed ERCNN on such images, the Dlib detector and the OpenCV library were used. Excellent results were achieved, and the proposed ERCNN model proved effective in the prediction phase. Lian et al. [19] used the DenseNet-BC architecture, which has three dense blocks of 16 layers linked with global average pooling (GAP); they trained the model on the FER+ dataset, combining the training data with the public test data, tested it on the private data, and obtained an accuracy of 81.93%. Table 5 compares the results of the proposed model with those of other studies.

CONCLUSION
In this paper, two models for the recognition of human emotions were compared: a pre-trained model, trained on a large set of images (the ImageNet dataset, with 1,000 classes) and then retrained on the FER dataset, and an ERCNN model based on the VGG16 network that was built from scratch and trained on facial expression data only. The obtained results proved the effectiveness of the proposed model for the recognition of human emotions: the proposed ERCNN model outperformed VGG16 in accuracy and time, as well as when the models were evaluated on images from outside the data used in training and testing. The main goal of this work is to enhance the accuracy of identifying human emotions through facial expressions by using a CNN's ability to extract features from images (FER+) and classify them. Our next step will be to train and test the proposed model on other data. We also aspire to create a hybrid network that combines the proposed ERCNN model with the VGG16 pre-trained model to obtain greater diversity in the extracted features.