Local feature extraction based facial emotion recognition: A survey

Despite recent technological advances, recognizing facial and emotional expressions remains a major challenge. The human face can be regarded as a composition of textures arranged in micro-patterns. Local Binary Pattern (LBP) based texture algorithms have seen a sharp increase in use and have proven effective in a variety of tasks and in extracting essential attributes from an image. Over the years, many LBP variants have been reviewed in the literature. What is still missing, however, is a thorough and comprehensive analysis of their individual performance. This work aims to fill this gap by performing a large-scale performance evaluation of 46 recent state-of-the-art LBP variants for facial expression recognition. Extensive experimental results on the well-known and challenging benchmark KDEF, JAFFE, CK and MUG databases, acquired under different facial expression conditions, indicate that a number of the evaluated state-of-the-art LBP-like methods achieve promising results that are better than or competitive with several recent state-of-the-art facial recognition systems. Recognition rates of 100%, 98.57%, 95.92% and 100% have been reached on the CK, JAFFE, KDEF and MUG databases, respectively.


INTRODUCTION
With the development of artificial intelligence and pattern recognition, computer-based facial expression recognition has attracted many researchers in computer vision. Several studies have shown that facial expressions contribute to a better understanding of conversations [1,2] and help to convey an individual's internal emotions; they are also considered the main modality for human communication. Recent progress in psychology and neuroscience gives a more positive interpretation of the role of emotions in human behavior [3]. A facial emotion recognition system consists of three important steps: face detection, feature extraction and classification. Taking an image or a series of images as input, the most important step is feature extraction, which describes the input images and computes their characteristic vector using a given operator. Indeed, poor features lead to poor recognition quality even with the best classifiers. Because of their excellent performance, LBP-based techniques have emerged as one of the most prominent local image descriptors. Although initially intended for texture analysis [4], the LBP descriptor has given excellent results in different applications because of its invariance to monotonic global gray-level changes and its robustness to illumination changes in real-world applications, including face recognition. Another equally important property is its computational simplicity and the short length of its histogram vector, which make it suitable for analyzing images in challenging real-time settings. The success of LBP in numerous applications has given rise to a large number of LBP variants, which have been proposed and continue to be proposed.
Indeed, since Ojala's work [4], and because of its flexibility and effectiveness, the general LBP-like philosophy has proven very popular, and a great variety of LBP variants have been proposed in the literature to improve the discriminative power, robustness and applicability of LBP. The main objective of this study is to perform a large-scale performance evaluation for facial emotion recognition, assessing 46 recent state-of-the-art texture features on four widely used benchmark databases. The performance of the adopted facial expression recognition system, coupled with the best evaluated texture descriptor on each dataset, is compared against that of state-of-the-art approaches. We show in the experimental section that some descriptors originally proposed for applications other than facial emotion recognition outperform several recent state-of-the-art systems. The remainder of this paper is organized as follows. Section 2 reviews the traditional LBP operator as well as some of its recent and popular variants. Section 3 reviews the few existing surveys on texture-descriptor-based classification and recognition as well as the evaluated state-of-the-art LBP-like methods. Section 4 presents the experimental results in detail and compares the best performing descriptors on each tested dataset with recent state-of-the-art facial emotion recognition systems. Finally, Section 5 concludes the paper and proposes some future research perspectives.

BRIEF REVIEW OF EXISTING METHODS
The original Local Binary Pattern (LBP) operator proposed by Ojala et al. [4], which encodes the pixel-wise information in an image, is a powerful texture analysis descriptor. It aims to detect micro-textons in local regions. The values I_p of the pixels in a 3×3 grayscale image patch around the central pixel are turned into binary values (0 or 1) by comparing them with I_c, the value of the central pixel. The resulting binary digits are concatenated to characterize a local structure pattern, and the code is then converted into a decimal number. Once the LBP code of each pixel is obtained, a histogram is built to represent the texture image. For a 3×3 neighborhood, the kernel function of the LBP operator is given in Eq. (1), where I_p (p ∈ {1, 2, ..., P}) denotes the gray levels of the peripheral pixels, P is the number of neighboring pixels (P=8) and ϕ(·) is the Heaviside step function.
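As an illustration of the encoding described above, the following Python sketch computes the LBP code of a 3×3 patch using the commonly cited form LBP = Σ_{p=1..P} ϕ(I_p − I_c)·2^(p−1), with ϕ(z) = 1 for z ≥ 0 and 0 otherwise. The function names, the clockwise neighbor ordering starting at the top-left pixel and the convention ϕ(0) = 1 are assumptions of this sketch and are not taken from the paper.

```python
def lbp_code(patch):
    """Return the LBP code of the centre pixel of a 3x3 grey-level patch.

    `patch` is a list of three rows of three integers. The neighbour
    ordering (clockwise from the top-left corner) is an assumption.
    """
    center = patch[1][1]
    # (row, col) positions of the P = 8 neighbours, clockwise from top-left.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for p, (r, c) in enumerate(order):
        bit = 1 if patch[r][c] - center >= 0 else 0  # Heaviside step
        code += bit << p  # weight 2^p for p = 0..7, i.e. 2^(p-1) for p = 1..8
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of `image`."""
    hist = [0] * 256
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            patch = [row[x - 1:x + 2] for row in image[y - 1:y + 2]]
            hist[lbp_code(patch)] += 1
    return hist


patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch))  # 241 with the ordering assumed above
```

Because only the sign of I_p − I_c is used, adding a constant to every pixel, or applying any other strictly increasing gray-level mapping, leaves the code unchanged, which is the invariance property mentioned in the introduction.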
The local binary patterns by neighborhoods (nLBPd) operator [5] encodes the relationship between pairs of the peripheral pixels I_0, I_1, I_2, ..., I_7 around the central pixel I_c in a 3×3 square neighborhood. Each pixel is compared either with its sequential neighbor or with the neighbor at a distance d. The kernel function of the nLBPd code is defined by Eq. (2). When d=1, the binary code of the central pixel I_c is obtained as in Eq. (3). The Local Graph Structure (LGS) descriptor introduced by Abusham et al. [6] draws on the dominating-set concept from graph theory to encode the spatial information around each pixel of the image.
LGS is based on local graph structures in a local graph neighborhood. Its graph structure covers more neighbor pixels to the left of the target pixel than to the right. To overcome this imbalance, the Extended Local Graph Structure (ELGS) operator was proposed [7]. ELGS uses the LGS texture descriptor to build two descriptions, one horizontal and one vertical, and then combines them into a global description.
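To make the nLBPd encoding described earlier in this section more concrete, the sketch below compares each peripheral pixel I_i with the neighbor d steps further along the ring, I_((i+d) mod 8). The function name, the strict ">" comparison, the circular indexing and the choice of I_0 as the most significant bit are assumptions made for illustration; the exact conventions are those of Eqs. (2) and (3) in [5].

```python
def nlbp_code(neighbors, d=1):
    """Sketch of an nLBPd code from the 8 peripheral pixels I_0..I_7.

    Bit i is 1 when I_i > I_((i + d) mod 8). The central pixel is not used:
    only relations between peripheral pixels are encoded. The comparison
    direction and the bit order are assumptions of this sketch.
    """
    if len(neighbors) != 8:
        raise ValueError("expected the 8 peripheral pixels of a 3x3 patch")
    code = 0
    for i in range(8):
        bit = 1 if neighbors[i] > neighbors[(i + d) % 8] else 0
        code = (code << 1) | bit  # I_0 ends up as the most significant bit
    return code


ring = [5, 3, 8, 8, 1, 7, 2, 9]
print(nlbp_code(ring, d=1))  # 149 (binary 10010101)
print(nlbp_code(ring, d=2))  # 49  (binary 00110001)
```

Changing d changes which pairs of peripheral pixels are related, so the same patch yields different codes for d=1 and d=2.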

EVALUATED STATE-OF-THE-ART LBP VARIANTS
The pioneering LBP work [4] and its success in numerous computer vision problems and applications have inspired the development of a great number of new powerful LBP variants. The LBP descriptor can be adapted to the requirements of many different applications. Indeed, after Ojala's work, several modifications and extensions of LBP (e.g., Heikkila et al. [8]) have been developed with the aim of increasing its robustness and discriminative power. These extensions and modifications, usually developed in conjunction with their intended applications (see Table 1), focus on several aspects of the LBP method: quantization to multiple levels via thresholding; sampling local feature vectors and pixel patterns with some neighborhood topology; combining multiple complementary features, both within LBP-like descriptors and with non-LBP descriptors, for images and videos; and, finally, regrouping and merging patterns to increase distinctiveness. Several works in the literature are devoted to surveying LBP and its variants. One can cite: (a) Hadid et al. [42] reviewed 13 LBP variants and provided a comparative analysis on two different problems, gender classification and texture classification. (b) Fernandez et al. [43] attempted to build a general framework for texture analysis that the authors refer to as histograms of equivalent patterns (HEP). A set of 38 LBP variants and non-LBP methods were implemented and experimentally assessed on eleven texture datasets. (c) Huang et al. [44] presented a survey of LBP variants in the application area of facial image processing.
However, that survey includes no experimental study of the LBP methods themselves. (d) Nanni et al. [45] examined the performance of LBP-based texture descriptors in a fairly specific and narrow application, namely classifying cell and tissue images from five datasets. It can be inferred that only a limited number of published works survey LBP-like methods in texture and face recognition, and that surveys dedicated to facial emotion recognition in particular are practically nonexistent. Note that most of these works remain limited in the number of LBP-like descriptors reviewed and datasets tested, lack recent LBP variants, and in some cases do not include an experimental evaluation. Since no broad assessment has been performed on a large number of LBP variants, and considering the recent rapid increase in the number of publications on LBP-like descriptors, this paper aims to provide such a comparative study for the facial emotion recognition problem and to offer a more up-to-date introduction to the area. To this end, 46 recent state-of-the-art LBP variants are evaluated and compared over four challenging, representative and widely used facial expression databases. The performance of the best texture descriptor on each dataset is also compared with that of state-of-the-art facial emotion recognition systems. Note that for each descriptor we used the original source code when it was freely available; otherwise we developed our own implementation. The evaluated state-of-the-art texture descriptors and their intended applications are summarized in Table 1.

EXPERIMENTAL RESULTS AND DISCUSSION
In this section, the state-of-the-art LBP variants summarized in Table 1 are extensively evaluated and compared over four publicly available facial expression datasets (see Section 4.2). In addition, the performance of the best performing method on each dataset is compared against that of recent state-of-the-art facial emotion recognition systems. The following subsections describe: 1) the experimental configuration; 2) the datasets considered in the experiments; 3) the obtained results; and 4) comparisons with other existing approaches.

Experimental configuration
In order to systematically evaluate the performance of the tested methods, we set up a comparative analysis through a supervised image classification task. Similar to most state-of-the-art facial expression recognition systems, the adopted system, shown in Figure 1, involves several steps: 1) image preprocessing to adjust and resize faces to a common resolution; 2) feature extraction using the evaluated LBP variants; 3) histogram vector calculation, in which, in order to incorporate more spatial information into the final feature vectors, the obtained feature images are spatially divided into multiple non-overlapping regions and a histogram is extracted from each region. For example, the LBP code map is divided into m×n non-overlapping sub-regions, from each of which a sub-histogram feature is extracted and normalized to sum to one. By concatenating these regional sub-histograms into a single vector, a final LBP-based facial emotion representation is obtained; and 4) image classification using the SVM classifier. In this step, the images of each dataset are first divided by a random split into two subsets, one for training and the other for testing. In the experiments, we tackled the 7-expression classification problem, and overall results are computed as the average of the per-class accuracies rather than the average accuracy over all samples, which avoids a bias toward expressions with more samples in the databases. Tables 2 and 3 report the average accuracy of each tested descriptor obtained on the CK, JAFFE, KDEF and MUG databases. The first column gives the name of the descriptor along with the parameter used in the case of a parametric descriptor. The other columns give the accuracy obtained for each tested emotion category, abbreviated as follows; NE: NEUTRAL, HA: HAPPY, FE: FEAR, SA: SAD, AN: ANGRY, DI: DISGUST, SU: SURPRISE, Acc: Accuracy.
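Two parts of the configuration above lend themselves to a short sketch: the regional histogram representation (step 3) and the averaging of per-class accuracies. The Python code below is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `bins` parameter and the handling of leftover rows and columns (they are simply ignored) are hypothetical choices.

```python
def regional_histogram(code_map, m, n, bins=256):
    """Concatenate sum-normalised histograms of m x n non-overlapping blocks.

    `code_map` is a list of rows of integer codes in [0, bins). Rows and
    columns left over when the size is not divisible by m or n are ignored,
    which is a simplification made for this sketch.
    """
    block_h = len(code_map) // m
    block_w = len(code_map[0]) // n
    if block_h == 0 or block_w == 0:
        raise ValueError("code map is smaller than the m x n grid")
    feature = []
    for i in range(m):
        for j in range(n):
            hist = [0.0] * bins
            for row in code_map[i * block_h:(i + 1) * block_h]:
                for code in row[j * block_w:(j + 1) * block_w]:
                    hist[code] += 1.0
            total = sum(hist)  # equals block_h * block_w, never zero here
            feature.extend(h / total for h in hist)
    return feature


def mean_per_class_accuracy(y_true, y_pred):
    """Average of the per-class recognition rates (not the pooled accuracy)."""
    rates = []
    for label in sorted(set(y_true)):
        idx = [k for k, t in enumerate(y_true) if t == label]
        correct = sum(1 for k in idx if y_pred[k] == label)
        rates.append(correct / len(idx))
    return sum(rates) / len(rates)


# Toy example: 4 "HA" samples all correct, 2 "SA" samples with one error.
y_true = ["HA", "HA", "HA", "HA", "SA", "SA"]
y_pred = ["HA", "HA", "HA", "HA", "SA", "HA"]
print(mean_per_class_accuracy(y_true, y_pred))  # 0.75, whereas pooled accuracy is 5/6
```

The toy example shows why the two measures differ when classes are unbalanced: the pooled accuracy is dominated by the larger class, while the per-class average weights each expression equally. For a descriptor with 256 possible codes and a 14x14 grid, the concatenated vector would have 14 × 14 × 256 = 50176 entries.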

Performance analysis on Cohn-Kanade (CK) Database
For this database, we used a subset of 10 sequences that reflect only the samples expressing the seven categories of emotions, and then selected the last four frames of each sequence, which have the highest expression intensity. The optimal number of non-overlapping sub-regions for computing the histogram features is 14x14 for all the tested descriptors. For each emotion expression, two images are used as the training set and the other two as the test set. Table 2 reports the recognition rates recorded on the CK dataset using the 46 evaluated state-of-the-art texture descriptors. It can be inferred that almost all the tested descriptors produce good results on the CK dataset, with an average accuracy above 96%. Twenty-seven LBP-like methods, such as RALBGC, BGC1, BGC2, BGC3, dLBPα and ELGS, differentiate all classes perfectly (average accuracy equal to 100%), leaving essentially no room for improvement. Note that all the evaluated descriptors reached a score of 100% for the "Happy" and "Surprise" classes.

Performance analysis on JAFFE Database
In this second experiment, each emotion in the JAFFE database is posed by 10 female subjects with three samples each. One image per person and per emotion expression is used for testing, making a total of 70 samples in the testing phase, while the remaining 140 samples form the training set. All faces are preprocessed to align them into canonical images with a resolution of 128x128. The histograms are computed on the feature images spatially divided into 12x12 non-overlapping sub-regions. It is apparent from Table 2 that the DSLGS, ELGS and SLGS operators yield the highest average rate, reaching a score of 98.57%. Next come eight descriptors, BGC2, CSLBP, dLBPα, ILBP, LCCMSP, LDENP, LGCP and OS LTP, which reached a recognition rate of 97.14%. Several tested LBP-like descriptors recognize some classes perfectly, with an accuracy of 100%. Note that there is a significant performance drop for all the tested descriptors on the "sadness" class, where the accuracy is in the range [50%, 90%]. It also emerges from Table 2 that some methods, such as CSALTP, GTUC and LMEBP, produce the worst performance on almost all classes, with accuracy sometimes below 70%. We would also point out that although parametric methods such as eCS LTP, ILTP, GTUC and AELTP are regarded as "optimized", since their parameter values are tuned during the experiment, their performance is markedly weaker than that of the non-parametric ones.
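The reported JAFFE rates can be read against the size of the test set. Assuming, as the split described above implies, 10 test images per emotion class, each per-class accuracy moves in steps of 10% and the averaged rate in steps of 1/70. The short Python sketch below (the helper name is hypothetical) shows per-class profiles that are consistent with the reported averages; the actual per-class values are those of Table 2.

```python
def mean_accuracy(per_class_correct, per_class_total=10):
    """Mean of per-class accuracies in percent, rounded to two decimals."""
    rates = [100.0 * c / per_class_total for c in per_class_correct]
    return round(sum(rates) / len(rates), 2)


# One misclassified test image out of 70 (e.g. a single "sadness" error).
print(mean_accuracy([10, 10, 10, 10, 10, 10, 9]))  # 98.57
# Two misclassified images, in one class or spread over two classes.
print(mean_accuracy([10, 10, 10, 10, 10, 10, 8]))  # 97.14
print(mean_accuracy([10, 10, 10, 10, 10, 9, 9]))   # 97.14
```

Under this assumption, the gap between the top score of 98.57% and the next group at 97.14% corresponds to a single test image.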

Performance analysis on KDEF database
We chose the images of both sessions for each subject, and only the 0° view angle is considered. The subset contains 70 subjects, each of whom expresses the seven emotion categories twice. Thus, in total we use 980 images. We resized all the faces of the KDEF database to a fixed-size template with a resolution of 256x256, and the faces were then split into 14x14 blocks for region-based feature extraction. Since each subject expresses the seven categories twice, we selected one facial image per subject and per category for the training phase and the other one for the test phase.
It is apparent from Table 3 that the LGS operator ranks first on the KDEF database, achieving a recognition rate of 95.92% with perfect recognition (100%) of the happy and neutral categories, followed by the DSLGS, SLGS and LBP descriptors, which reached a score of 95.31%. Next come seven descriptors, BGC2, BGC3, CSLBP, dLBPα, ELGS, ILBP and LQPAT, which achieved accuracies in the range [94.08%, 94.90%]. Twenty-six LBP-like methods then attained accuracies in the range [90.20%, 93.88%]: three descriptors, RLBP, BGC1 and SMEPOP, reached 93.88%, while MMEPOP and DBC attained 90.20% and 90.41%, respectively. Accuracies in the range [80.61%, 86.53%] were achieved by eight LBP-like methods, with 80.61% achieved by ALTP and 86.53% by XCS LBP. We can observe from Table 3 that the worst performance, 59.39%, was attained by the CSALTP descriptor.

Performance analysis on Multimedia Understanding Group (MUG) Database
We used 924 facial expression images, i.e., 132 images for each facial expression. All faces were adjusted and resized to a common resolution of 256x256. They were then split into 18x18 blocks for region-based feature extraction. For this experiment, in each emotion category, we used four images per subject, two for the training phase and two for the test phase. Table 3 gathers the obtained experimental results. Clearly, eight of the tested descriptors, ELGS, LDTP, LDENP, LGCP, LNDP, LTP, LQPAT and SMEPOP, differentiate all classes perfectly (100% accuracy), leaving no room for improvement. In addition, thirty-one LBP-like methods give accuracies in the range [99.03%, 99.68%], LBP attained 98.73%, DBC reached 98.05%, XCS LBP obtained 97.40% and, finally, GTUC attained an accuracy of 97.08%.
As we can observe, all tested methods obtain very promising results on the MUG dataset, except three state-of-the-art methods, AELTP, LMEBP and CSALTP, which attained the lowest accuracies among the tested methods. The lowest accuracy, 71.43%, was obtained by CSALTP, followed by 84.09% for AELTP and 89.94% for LMEBP.

Comparison with state-of-the-art methods
In this section, we compare the performance of the best performing descriptors on each database with that of existing state-of-the-art methods. We should note that the results may not be directly comparable with those of other state-of-the-art approaches because of differences in the partitioning of the dataset into training and testing sets, the number of classes, the number of subjects and the features used. Nevertheless, the distinctive results of each approach can still be reported. The results extracted from the reviewed state-of-the-art papers, as well as the recognition rates reached by the best performing evaluated LBP variants on each database, are arranged in Table 4.
It can be observed from Table 4 that, except for the JAFFE and KDEF databases, where the number of samples used is roughly the same for almost all the existing systems, the number of samples used on the CK and MUG databases varies from one existing approach to another. Given two different systems to compare on a given database, a fair and accurate comparison of their results is possible in two cases. In the first, the number of samples used and the configuration into train/test sets are the same. In the second, the system using fewer samples must at least be tested with a more demanding train/test configuration than the one using more samples. We followed the second case when comparing the state-of-the-art methods with the adopted system, which uses the most difficult configuration in terms of train/test sets. Indeed, almost all the existing state-of-the-art systems use a partition where the number of training images exceeds that of test images (e.g., 10-fold), while in this study the half-half configuration is adopted. Examining Table 4, we can make the following findings: (a) KDEF database: the LGS operator is the best performing method, achieving higher performance than the recent state-of-the-art systems with a recognition rate of 95.92%. (b) JAFFE database: the accuracy recorded by three LBP-like variants exceeds those obtained by the state-of-the-art approaches. Indeed, it emerges from encoding process justifies the robustness and effectiveness of LGS, DSLGS and ELGS descriptors.
On the other hand, we note that the CSALTP descriptor performs poorly in the KDEF experiment, reaching just 59.39%, and also in the JAFFE and MUG experiments, where the majority of the tested descriptors obtained very high results. The reason is the user-specified threshold used in this operator, which must be determined for each experiment by testing many values, at a high computational cost. Apart from this, all the other descriptors record good performance, demonstrating the discriminative power of the local description concept.

CONCLUSION AND FUTURE WORKS
We reported in this work a comprehensive comparative experimental analysis of a large number of recent state-of-the-art LBP-like descriptors for facial expression recognition. It is noteworthy that the choice of an appropriate descriptor is crucial and generally depends on the intended application and on many factors, such as computational efficiency, discriminative power, robustness to illumination and the imaging system used. The experiments presented herein constitute a good reference when trying to find an appropriate method for a given application. Our experiments on facial expression recognition included a detailed and comprehensive performance study of 46 texture descriptors from the literature covering numerous application areas such as texture classification, image retrieval, finger vein recognition, medical image analysis, face recognition, facial expression analysis, etc. To show descriptor performance over several challenging situations, the tested descriptors were applied to four well-known and widely used datasets: the JAFFE, CK, KDEF and MUG databases. The main finding that can be drawn from the overall performance in the experiments is that, although some LBP-like features were originally conceived and proposed for texture classification, they show considerable performance in facial expression recognition. Indeed, even though they were not specifically designed for facial expression recognition, some LBP variants outperform all state-of-the-art approaches over the tested databases. It is important to note that the descriptors based on dominating set and graph theory show significant performance stability compared with the other evaluated state-of-the-art descriptors, as they are often found among the best performing LBP variants on the four tested databases.
For the KDEF database, the LGS operator, which is based on dominating set and graph theory, is the best performing descriptor, reaching a score of 95.92% and outperforming the recent state-of-the-art systems. For the JAFFE database, the best recognition rate, 98.57%, was achieved by three descriptors also based on dominating set and graph theory: DSLGS, ELGS and SLGS. Twenty-seven LBP variants, again including those based on dominating set and graph theory, reached a score of 100% on the CK database. Finally, many evaluated LBP variants, such as the LDTP, LDENP, LGCP, LNDP, LQP and SMEPOP descriptors as well as the ELGS operator, reached a score of 100% on the MUG database. As future work, we plan to extend this study to include the evaluation of deep features and deep classifiers. Furthermore, we wish to further explore the power of texture descriptors in other applications such as compound emotion recognition, gender classification, face recognition, texture classification, etc., in order to assess their ability to handle various classification problems.