Comparison of convolutional neural network models for user’s facial recognition

ABSTRACT


INTRODUCTION
Face detection is one of the research topics that remains active in the state of the art [1], gaining high relevance in applied systems such as access control to secure areas [2], attendance management systems [3], or recognition of young people at risk of vulnerability [4]. The techniques used for face detection are varied, for example, binary descriptors [5], mechanisms based on three-way decisions [6], and alignment learning [7], all of them based mainly on image and video analysis [8]. One of the most efficient families of techniques for extracting features from images or videos is deep learning.
Deep learning techniques have witnessed notable advancements in face identification, with convolutional neural networks (CNNs) emerging as prominent players [9]. CNNs are object-recognition-focused networks that excel in pattern recognition tasks [10]. Their architecture has been continuously improved to enhance performance [11], and they find applications across various knowledge domains, particularly in image classification [12], [13]. Noteworthy CNN-based architectures include the CNN-LSTM, which combines CNNs with long short-term memory networks for sequential data processing [14], region-based networks (R-CNN) [15], and the fast R-CNN, which improves detection speed for region-based networks [16]. These advancements have greatly contributed to the progress of face identification techniques based on deep learning.

ISSN: 2088-8708 
Comparison of convolutional neural network models for user's facial … (Javier Orlando Pinzón-Arenas)
Deep learning has facilitated the development of various applications in face recognition, including real-time human action recognition, with CNN-based models demonstrating impressive performance [17]-[20]. However, the current state of the art lacks a comparative evaluation of different CNN architectures specifically for face detection. This work aims to address this gap by comprehensively evaluating 10 CNN-based architectures using transfer learning [21].
By focusing on face detection, this research contributes to a better understanding of the effectiveness and suitability of various CNN models in this specific domain. Among the applications that this comparative analysis enables is the development of access control systems based on user recognition. The article presents the methodology employed, based on convolutional networks with transfer learning.
Next, the methods and materials are presented, describing the database and the architectures to be evaluated. The models compared were AlexNet, VGG-16, VGG-19, GoogLeNet, Inception V3, ResNet-18, ResNet-50, and ResNet-101, plus two additional proposed models called shallow CNN and shallow directed acyclic graph CNN (DAG-CNN). The results section follows, analyzing the activations of the best-performing networks, and finally the conclusions reached are presented.

METHOD
A database consisting of three categories was created to carry out the comparison. Two categories are registered users to be recognized (Javier and Robinson). The other category represents a random group of individuals, used to verify that other people are not recognized (Others). To build the database, photos of the users' faces were taken in different positions so that the network can recognize the person even when the face is not fully frontal. For the "Others" category, the CelebA database [22] is used, providing faces with varied characteristics, some of them similar to the registered users. In total, 3,840 images are used for training, of which 1,940 belong to the "Others" category.
Almost twice as many images are used for that category as for the two users in order to expose the network to more characteristics of unregistered subjects, so that if a person has similar traits, the network can still determine that the subject is neither of the registered users. For validation of the networks, 525 images are used, distributed as 75 images for each user and 375 for the "Others" category, in order to verify that, even with many different subjects, the networks are capable of distinguishing them. Figure 1 shows samples of the images for each category. The size of the images varies according to the neural network used, since not all of them take the standard input size of 224×224 pixels.
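The dataset composition described above can be summarized with a short sanity check. This is a minimal sketch: the totals (3,840 training, 1,940 "Others", 525 validation) come from the text, while the even 950/950 split between the two registered users is an inference from those totals (3,840 − 1,940 = 1,900), not a figure stated in the paper.

```python
# Image counts from the text; the per-user training split is inferred.
train_counts = {"Javier": 950, "Robinson": 950, "Others": 1940}
val_counts = {"Javier": 75, "Robinson": 75, "Others": 375}

total_train = sum(train_counts.values())  # 3,840 training images
total_val = sum(val_counts.values())      # 525 validation images
others_ratio = train_counts["Others"] / train_counts["Javier"]  # ~2x
```

The ratio of roughly 2:1 between "Others" and each registered user reflects the design choice of giving the network a broader sample of unregistered faces.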
Eight of the most well-known CNN models are selected to compare their facial recognition capacity. These models are AlexNet [23], the two versions of the visual geometry group (VGG) model, VGG-16 and VGG-19 [24], GoogLeNet [25], Inception V3 [26], and the ResNet models with 18, 50, and 101 layers [27]. Two additional basic models are also proposed, to verify whether low-depth architectures can maintain a level of recognition as good as the pre-trained models. Both proposed architectures are basic CNNs built from convolution blocks. The first is a sequential network named shallow CNN, since it is less deep than its counterparts (apart from AlexNet). The second comprises two branches, one with 3×3 convolution filters and another with 5×5 convolution filters, to learn different facial patterns. The latter is called shallow DAG-CNN because of its multiple paths; its depth remains similar to the previous one, although it has a total of twelve convolution layers. A general diagram of the two architectures can be seen in Figure 2, where S refers to the filter stride, P to the padding used, and the last value represents the number of filters used in that layer. The weights of the two proposed networks were initialized using the He method [28].
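The He initialization used for the two proposed networks can be sketched as follows. This is a minimal NumPy illustration of the method in [28] (weights drawn from a zero-mean normal with variance 2/fan-in); the example filter shape of 3×3 with 3 input channels and 8 filters is an arbitrary choice for demonstration, not taken from the paper's architectures.

```python
import numpy as np

def he_init(shape, rng=None):
    """He (Kaiming) normal initialization for a conv filter bank.

    shape = (kernel_h, kernel_w, in_channels, num_filters); fan-in is
    the number of inputs feeding each output unit of the layer.
    """
    rng = rng or np.random.default_rng(0)
    fan_in = shape[0] * shape[1] * shape[2]
    std = np.sqrt(2.0 / fan_in)  # variance 2/fan_in, per He et al.
    return rng.normal(0.0, std, size=shape)

# Example: weights for a 3x3 conv layer, 3 input channels, 8 filters.
w = he_init((3, 3, 3, 8))
```

The 2/fan-in variance keeps activation magnitudes stable through ReLU layers, which is why it suits networks like these that are trained from scratch.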
The same training parameters were set for all the networks, including the two proposed architectures, to avoid giving any network an advantage over another. For the pre-trained models, mixed transfer learning is performed, i.e., the weights of the first convolution layers are frozen and used as feature extractors while the rest of the layers are fine-tuned. The parameters are as follows: learning rate of 10⁻³ with a reduction factor of 0.5 every four epochs, training for eight epochs, with a mini-batch size of 8 per iteration. A small learning rate is selected because the models were mostly pre-trained with a learning rate of 0.1, and their weights are expected not to deviate greatly from the initial ones. Likewise, training for many epochs is unnecessary and would risk overfitting. For the classification section, the learning rate is multiplied by a factor of 10 since, at this stage, those layers have had no initial learning, so their rate must be higher for their learning curve to be steeper.
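The learning-rate schedule described above can be expressed as a small function. This is a sketch of the stated hyperparameters (base rate 10⁻³, halved every four epochs, classifier layers at 10× that rate); the function names are illustrative and not from the paper.

```python
def feature_lr(epoch, base_lr=1e-3, drop=0.5, period=4):
    """Piecewise-constant schedule: multiply by `drop` every `period` epochs."""
    return base_lr * drop ** (epoch // period)

def classifier_lr(epoch, factor=10):
    """The freshly initialized classification layers learn 10x faster."""
    return factor * feature_lr(epoch)

# Over the eight training epochs, the feature-extractor rate is
# 1e-3 for epochs 0-3 and 5e-4 for epochs 4-7.
schedule = [feature_lr(e) for e in range(8)]
```

With only eight epochs, the schedule produces a single halving, consistent with the goal of keeping the pre-trained weights close to their initial values.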
During training, three of the networks had learning problems: AlexNet and the two VGG models, whose gradients tended to grow abruptly, preventing the networks from finishing training. For this reason, for these models it was decided to carry out transfer learning with complete fine-tuning of all layers. Being deep pre-trained networks, allowing their generalized features to adapt across all layers mitigated the gradient problem.

RESULTS AND DISCUSSION
For the first performance evaluation, all networks were compared during training, as shown in Figure 3, and tested with the validation set, yielding the accuracy curves in Figure 3(a). The networks with the worst behavior were AlexNet and the VGG models; although their losses decreased, as shown in Figure 3(b), these networks fell into overfitting early, remaining below 75% accuracy. The rest of the networks achieved an accuracy above 95%, with shallow DAG-CNN and ResNet-18 as the two slowest-learning networks. The fastest models were GoogLeNet, ResNet-50, and ResNet-101, achieving more than 98% accuracy in their first epoch.
The GoogLeNet and ResNet-101 models obtained the best recognition performance, discriminating all the images without errors. The two shallow networks also obtained results above 98%; the shallow DAG-CNN even achieved the same result as Inception V3 without any previous learning and with fewer layers. AlexNet and the VGG models maintained a low level of recognition because they confused the two registered users, recognizing them as a single user, despite correctly classifying the "Others" category. Table 1 shows the accuracies obtained with each of the trained architectures. To extend the testing and comparison of the models, their operation was verified in real time, adding a higher level of difficulty by changing the style of one of the users, namely the beard and hairstyle. Most photos of the user "Javier" are like the examples shown in Figure 1, where he has hair and no facial hair. For the real-time tests, the user appears with a full beard and a shaved head, to verify whether the networks can still recognize him. Face detection is performed using the Viola-Jones algorithm [29], with which the bounding box is cropped and then sent to the neural network.
Each neural network was tested with a video sequence of the user "Javier". Figure 4 shows a frame taken from the sequence, where the category in which each model classified the user's face is displayed. The AlexNet and VGG models consistently misclassified the user into other categories. The shallow networks and ResNet-18, although capable of recognizing the user, showed constant variation between categories, repeatedly classifying the user as unknown even when the face was frontal, as can be seen in the figure. The deeper networks managed to maintain accurate recognition during most of the video, with few category changes, without confusing him with the other user or classifying him as unknown. With the models that performed better in the real-time tests, i.e., those with less variation in the user's classification, another test was done in which the user makes slight rotations of the face, to check whether the networks could maintain accurate recognition. In this test, only two networks maintained a correct classification, Inception V3 and ResNet-101, as seen in Figure 5. Their success rates are 92% and 76%, respectively, evidencing the better performance of the Inception architecture.

Figure 5. Face rotation tests
The capability of these two networks (Inception V3 and ResNet-101) arises because, thanks to their large number of convolution blocks, they can learn specific patterns of the user's face, which improves their recognition even when the user's style changes. Architectures with less depth can recognize the user adequately, but if a change not contemplated in the learning set is made, they fail to recognize the user because the learned characteristics are more general. This is illustrated in Figure 6 with the first-layer activations of the shallow DAG-CNN (3×3 filter branch) and of ResNet-101. In the shallow architecture, the network focuses on general face patterns, such as the eyes, the nostrils, and specific parts of the face. However, these are repeated through most filters without capturing variations of other user characteristics, as seen in the first image of activations. It can also distinguish shapes and edges, but only in small sections of the face (such as the shape of the eye), without taking into account, for example, the general shape of the person's skull. Another learned pattern is found mainly in the forehead and is repeated by several filters, so that a not very specific feature of the user covers many activations. ResNet-101, although it also focuses its learning on these parts of the face, does so in a more detailed way, without generalizing or joining several sections in a single filter, aiming instead to discriminate specific patterns and achieving a better distribution of what each filter has learned. For instance, some filters learned the shape of the head, others the location of the eyes, and others the nose and its nostrils. In the real-time test, the features learned by each network stand out even more: the shallow network mostly keeps its activations on the whole face without discriminating the user's characteristics, while ResNet-101 can even highlight the shape of the user's face and head and the location of the ears.

CONCLUSION
In this work, a comparison of different CNN models for facial recognition was made to verify the performance of each one. These comparisons demonstrated that the best networks for this application were GoogLeNet and ResNet-101, which recognized the two users correctly without error and rejected all subjects not belonging to the database. Nevertheless, the shallow networks without pre-training, the shallow CNN and the shallow DAG-CNN, also obtained high performance, even matching the capacity of Inception V3.
An additional test was added in which one of the users changed his style before being recognized. With this, it was found that the two networks with the greatest capability to withstand drastic changes in certain user characteristics are Inception V3 and ResNet-101, which can learn detailed user features thanks to their depth. They managed to maintain consistent recognition of the subject, even during face rotations. This robustness was demonstrated using the layer activations, comparing the learning of one of these networks against a shallow one, evidencing that the deep networks learn more detailed patterns, allowing them to discriminate characteristic features of the user.

Figure 1. Examples of images used in the database

Figure 3. (a) Network accuracy and (b) network loss during training, evaluated on the validation set

Figure 4. Tests performed in real-time

Figure 6. First-layer activations of (a) the shallow DAG-CNN 3×3 filter branch and (b) the ResNet-101, obtained from an image of the validation set (top) and an image of the test video (bottom)

Table 1. Comparison of user's face recognition results