Deep learning for pose-invariant face detection in unconstrained environment

ABSTRACT


INTRODUCTION
Face detection can be defined as the process of locating faces in a given image; the system should positively identify a region of the image as a face. According to Yang et al. and Hjelmas et al., face detection is the process of finding the regions of an input image in which faces are present [1][2]. A lot of work has been done on detecting still, frontal faces, both in-plane and against complex backgrounds [3]. With advances in information technology and computational power, computers have become more interactive with humans. This human-computer interface (HCI) relies mostly on traditional devices such as the mouse, keyboard, and display, yet one of the most important media for interaction is the face and its expressions [4], [5].
Several algorithms address frontal face detection [6], but only a small number of techniques address non-frontal or multi-view face detection [7]. Most techniques scan the image with a sub-window and classify each sub-window as a face or non-face pattern. Statistical learning methods are used for the classification because the pixels of a face are highly correlated, whereas non-face sub-windows show less regularity. A nonlinear classifier is therefore necessary to cope with the large variations in lighting and illumination, facial expression, pose, and appearance. Examples of such classifiers are neural networks [8] and Support Vector Machines [9]. One approach used two neural network classifiers, the first for pose estimation and the second for conventional face detection. Schneiderman et al. [10] proposed a technique that detects faces with out-of-plane rotation. In [11], Jones and Viola [12] extend this framework. Convolutional Neural Networks (CNNs) [13] are the most recent cascade framework, with quick rejection of background regions. Research on multi-view face detection [14,15] has grown with the success of CNNs in many computer vision problems.
A CNN can be viewed as a series of layers. The initial layers respond to discriminative low-level patterns. The next layers respond to intermediate patterns that are composed of low-level patterns, and so on. The inspiration for CNNs, and for neural networks in general, has been the biological understanding of the brain. It has long been known that the brain is made up of about 100 billion densely connected neurons. CNNs mimic these neurons and their connections. A layer in a CNN is made up of m × n neurons, and the neurons of neighbouring layers are connected. In this section, we describe the neurons, the connections, and the various types of layers found in modern CNNs. Zhang et al. studied improving multi-view face detection with a multi-task deep CNN [16]. Farfade et al. [17] examined multi-view face detection using a deep CNN. Parkhi et al. addressed face recognition from either a single photograph or a set of faces tracked in a video [18]. Li et al. analyzed a CNN cascade for face detection [19].
Face detection is a well-studied problem in computer vision. Contemporary face detectors can easily detect near-frontal faces. The difficulty of face detection comes from two aspects: the large search space of possible face sizes and positions, and the large visual variation of human faces in cluttered environments. The former imposes a time-efficiency requirement, while the latter requires the face detector to solve a binary classification problem accurately. In uncontrolled settings, extreme illumination and exaggerated expressions can cause large visual differences in facial appearance and reduce the robustness of the face detector [20]. It is therefore important to develop a method that detects faces reliably, as pointed out in [21]. Accordingly, this research concentrates on multi-view face detection using a deep convolutional neural network.
In this work, we present a novel deep convolutional neural network (DCNN) architecture for multi-view face detection. In most previous work the features were selected manually, that is, handcrafted, whereas in a convolutional neural network the features are selected automatically, even under complex visual variations. CNNs need considerable computational power because the entire image must be scanned exhaustively at multiple scales. To speed up detection, we therefore propose a CNN cascade structure that rejects false detections quickly in the early stages. The main contributions of our work are as follows: 1. We design a CNN cascade for fast face detection. 2. Our architecture is able to detect pose-invariant faces in changing environments. 3. Our design is able to handle multi-resolution images. 4. We improve the state-of-the-art performance on the Face Detection Data Set and Benchmark (FDDB).

IMPLEMENTATION METHOD
In the implementation, face detection and image retrieval are achieved with a direct visual matching technique. A probabilistic measure of the resemblance between face images is computed on the basis of Bayesian analysis in order to support the detection of multiple faces. A neural network is then developed and trained to improve on the outcome of the Bayesian analysis. Next, training and verification are used to test other images that contain similar facial features. Deep learning can be performed with supervisory signals.
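The identification signal is the cross-entropy loss

$$\mathrm{Ident}(f, t, \theta_{id}) = -\sum_{i=1}^{n} p_i \log \hat{p}_i = -\log \hat{p}_t$$

where $f$ is the feature vector, $t$ represents the target class, $\theta_{id}$ denotes the softmax layer parameters, $p_i$ is the target probability distribution ($p_i = 0$ for all $i$ except $p_t = 1$), and $\hat{p}_i$ is the predicted probability distribution. The verification signal regularizes the features and reduces intra-personal variations, as in Hadsell et al. [22]. With a cosine-similarity measure it can be written as

$$\mathrm{Verif}(f_i, f_j, y_{ij}, \theta_{ve}) = \frac{1}{2}\left(y_{ij} - \sigma(w d + b)\right)^2$$

where $d$ is the cosine similarity between the feature vectors $f_i$ and $f_j$, $\theta_{ve} = \{w, b\}$ denotes the learnable scaling and shifting parameters, $\sigma$ is the sigmoid function, and $y_{ij}$ is the binary target indicating whether the two compared face images belong to the same identity. The convolution operation is represented as

$$y^{j} = \max\left(0,\; b^{j} + \sum_{i} k^{ij} * x^{i}\right)$$

where $x^{i}$ is the $i$-th input map, $y^{j}$ is the $j$-th output map, $k^{ij}$ is the convolution kernel between the $i$-th input map and the $j$-th output map, $b^{j}$ is a bias term, and $*$ denotes convolution. Max-pooling is given by

$$y^{i}_{j,k} = \max_{0 \le m,\, n < s} x^{i}_{j \cdot s + m,\; k \cdot s + n}$$

where each neuron in the $i$-th output map $y^{i}$ pools over an $s \times s$ non-overlapping region of the $i$-th input map $x^{i}$.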
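To make the two supervisory signals concrete, the following Python sketch evaluates them on plain lists of numbers. It assumes that the identification signal is the usual cross-entropy against a one-hot target and that the verification signal compares a sigmoid of the scaled cosine similarity with the binary target. The function names and the values of w and b are illustrative and are not taken from the implementation described here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


def identification_loss(scores, target):
    """Cross-entropy with a one-hot target: -log(p_hat[target])."""
    return -math.log(softmax(scores)[target])


def cosine_similarity(f_i, f_j):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    return dot / (norm_i * norm_j)


def verification_loss(f_i, f_j, y_ij, w, b):
    """0.5 * (y_ij - sigmoid(w * d + b))**2, with d the cosine similarity."""
    d = cosine_similarity(f_i, f_j)
    sigma = 1.0 / (1.0 + math.exp(-(w * d + b)))
    return 0.5 * (y_ij - sigma) ** 2


# Hypothetical features of the same identity (y_ij = 1).
print(identification_loss([2.0, 0.5, 0.1], target=0))
print(verification_loss([1.0, 0.2], [0.9, 0.3], 1, w=5.0, b=0.0))
```

The verification term shrinks as the scaled similarity of a matching pair grows, which is the sense in which it pulls features of the same identity together.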
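The layer connected to both the 3rd and 4th convolutional layers computes

$$y_j = \max\left(0,\; \sum_{i} x^{1}_{i} \cdot w^{1}_{i,j} + \sum_{i} x^{2}_{i} \cdot w^{2}_{i,j} + b_j\right)$$

where $x^{1}$, $w^{1}$, $x^{2}$, $w^{2}$ represent the neurons and weights in the 3rd and 4th convolutional layers, respectively, and $b_j$ is a bias term. The output of the ConvNet is an n-way softmax that predicts the probability distribution over n unique identities.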
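The convolution and max-pooling operations can be illustrated with a small pure-Python sketch. It handles a single input map and a single kernel, applies a max(0, ·) rectification to the result, and, as is common in CNN code, slides the kernel without flipping it. These simplifications and all names are assumptions made for illustration.

```python
def conv2d_relu(x, k, bias=0.0):
    """Valid 2-D sliding-window product of map x with kernel k, then max(0, .)."""
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    y = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            s = bias
            for i in range(kh):
                for j in range(kw):
                    s += k[i][j] * x[r + i][c + j]
            row.append(max(0.0, s))  # rectification
        y.append(row)
    return y


def max_pool(x, s):
    """Max over non-overlapping s x s regions.

    Trailing rows or columns that do not fill a whole region are dropped.
    """
    out_h = len(x) // s
    out_w = len(x[0]) // s
    return [
        [max(x[r * s + m][c * s + n] for m in range(s) for n in range(s))
         for c in range(out_w)]
        for r in range(out_h)
    ]


# Hypothetical 3x3 input map and 2x2 kernel.
feature_map = conv2d_relu([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[0, 1], [1, 0]])
print(feature_map)
print(max_pool(feature_map, 2))
```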
DCNNs are mostly adopted for classification, and they are also adopted for face detection and recognition. Most of these approaches use a cascade strategy and take patches at various locations and scales as inputs.

Proposed algorithm for deep convolutional neural network (DCNN)
This work develops an algorithm for multi-view face detection using a deep convolutional neural network. The steps of the implementation are described below.
Step 1: Face detection and image retrieval are achieved with a direct visual matching technique, which matches the face directly. This technique uses an image similarity metric, which can be either normalized correlation or Euclidean distance, and corresponds to the approach known as template matching. The similarity between two images is expressed by a similarity measure, denoted by $S(I_a, I_b)$, where $I_a$ and $I_b$ are the two images being compared.
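The two similarity metrics named in Step 1 can be sketched as follows. The example assumes that each image has already been flattened into a list of pixel intensities of equal length; the function names and the sample values are illustrative only.

```python
import math


def euclidean_distance(img_a, img_b):
    """Euclidean distance between two flattened images (smaller = more similar)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(img_a, img_b)))


def normalized_correlation(img_a, img_b):
    """Zero-mean normalized correlation in [-1, 1] (larger = more similar).

    Returns 0.0 when either image is constant, since the correlation is
    undefined in that case.
    """
    mean_a = sum(img_a) / len(img_a)
    mean_b = sum(img_b) / len(img_b)
    da = [p - mean_a for p in img_a]
    db = [q - mean_b for q in img_b]
    denom = math.sqrt(sum(v * v for v in da)) * math.sqrt(sum(v * v for v in db))
    if denom == 0.0:
        return 0.0
    return sum(u * v for u, v in zip(da, db)) / denom


# Hypothetical template and a brighter copy of it.
template = [10, 20, 30, 40]
candidate = [60, 70, 80, 90]
print(euclidean_distance(template, candidate))
print(normalized_correlation(template, candidate))
```

Normalized correlation is unaffected by a uniform brightness offset, whereas the Euclidean distance is not, which is one reason to prefer it for template matching under changing illumination.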
Step 2: The next step is to measure the probabilistic similarity in terms of $\Delta$, the intensity difference between the two images, given by $\Delta = (I_a - I_b)$. The resemblance between face images is computed on the basis of Bayesian analysis in order to support the detection of multiple faces.
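Step 3: The probabilistic calculation of resemblance also supports multiple face detection. Statistical analysis is used to characterize the various types of image variation. The similarity measure $S(I_a, I_b)$ between the pair of images $I_a$ and $I_b$ is given in terms of the a posteriori probability of intrapersonal variation:

$$S(I_a, I_b) = P(\Omega_I \mid \Delta) = \frac{P(\Delta \mid \Omega_I)\, P(\Omega_I)}{P(\Delta \mid \Omega_I)\, P(\Omega_I) + P(\Delta \mid \Omega_E)\, P(\Omega_E)}$$

where $\Omega_I$ and $\Omega_E$ denote the intrapersonal and extrapersonal classes of variation. If the multi-view face detection is done for a single person, then $P(\Omega_I \mid \Delta) > P(\Omega_E \mid \Delta)$, or equivalently $S(I_a, I_b) > \frac{1}{2}$.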
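A minimal sketch of the posterior similarity in Steps 2 and 3 is given below. It assumes, purely for illustration, that the intensity difference follows a zero-mean isotropic Gaussian under each of the two classes of variation, with a small spread for differences within one person and a large spread for differences between persons. The standard deviations, the prior, and all names are hypothetical.

```python
import math


def log_gaussian(delta, sigma):
    """Log density of a zero-mean isotropic Gaussian evaluated at vector delta."""
    n = len(delta)
    squared = sum(d * d for d in delta)
    return (-0.5 * squared / sigma ** 2
            - n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi))


def bayesian_similarity(img_a, img_b, sigma_same, sigma_diff, prior_same=0.5):
    """Posterior probability that the difference image is a same-person variation."""
    delta = [a - b for a, b in zip(img_a, img_b)]
    log_same = log_gaussian(delta, sigma_same) + math.log(prior_same)
    log_diff = log_gaussian(delta, sigma_diff) + math.log(1.0 - prior_same)
    gap = log_diff - log_same
    if gap > 700.0:  # exp() would overflow; the posterior is effectively zero
        return 0.0
    return 1.0 / (1.0 + math.exp(gap))


# Hypothetical flattened images: a near-duplicate pair and a dissimilar pair.
s_close = bayesian_similarity([10, 20, 30, 40], [11, 19, 30, 41], 1.0, 3.0)
s_far = bayesian_similarity([10, 20, 30, 40], [20, 30, 40, 50], 1.0, 3.0)
print(s_close > 0.5, s_far > 0.5)
```

Working with log densities avoids underflow when the images contain many pixels; the decision rule is the comparison of the returned value with one half.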
Step 4: A neural network is then developed and trained to improve on the outcome of the Bayesian analysis.
Step 5: Next, training and verification are used to test other images that contain similar facial features.
The code is implemented step by step as follows:
a. The DCNN object is created.
b. The graphical user interface is initialized.
c. The MCR (misclassification rate) calculation is initialized and a plot of the MCR is created, showing the current epoch, iteration, RMSE (root mean square error), and MCR value for the image data.
d. The training data is loaded.
e. The training data is pre-processed, errors are removed, and the image data is simulated.
The screenshot of the CNN training progress is shown in Figure 1, which also includes the plots of RMSE and MCR during training. The CNN is trained to minimize the risk of the following softmax loss function.
$$R = -\sum_{x_i \in \beta} \log P(y_i \mid x_i)$$

Here $\beta$ represents the batch used in an iteration of stochastic gradient descent, and $x_i$ and $y_i$ denote a training sample in the batch and its label. In the training run shown, the Hessian calculation is started, the current epoch is 3, the iteration count is 759, the RMSE is 0.18, the MCR is 0.90, and the value of theta is 8.000e-05. The RMSE plot during training appears as a zigzag line, and the MCR plot appears as a curved line.
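The quantities tracked during training can be sketched as follows. The example assumes that RMSE is computed between network outputs and targets, that MCR is the fraction of wrongly labelled samples, and that the softmax risk is the summed negative log-probability of the correct labels over a batch. The function names and sample values are illustrative and do not come from the training code described above.

```python
import math


def rmse(outputs, targets):
    """Root mean square error between two equally long sequences."""
    n = len(targets)
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(outputs, targets)) / n)


def misclassification_rate(predicted, actual):
    """Fraction of samples whose predicted label differs from the true label."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def softmax_risk(batch_scores, batch_labels):
    """Sum over the batch of -log softmax(scores)[label], via log-sum-exp."""
    risk = 0.0
    for scores, label in zip(batch_scores, batch_labels):
        m = max(scores)
        log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
        risk += log_sum - scores[label]
    return risk


# Hypothetical mini-batch of two face/non-face score pairs.
print(rmse([0.9, 0.2], [1.0, 0.0]))
print(misclassification_rate([1, 0, 1, 1], [1, 0, 0, 1]))
print(softmax_risk([[2.0, 0.1], [0.3, 1.5]], [0, 1]))
```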

CNN Structure
The CNN structure adopted in the present study is shown in Figure 2. It consists of the 12-net, 24-net, and 48-net structures.
a. 12-net CNN. It is the first CNN, which scans the image quickly in the test pipeline. The 12-net is applied to an image of dimensions $W \times H$ with 12×12 detection windows at a pixel spacing of 4. This results in a map of size

$$\left(\left\lfloor \frac{W - 12}{4} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{H - 12}{4} \right\rfloor + 1\right)$$

Each point on this map defines a 12×12 detection window on the test image. The minimum size of face accepted when testing an image is $T$. First, an image pyramid is built from the test image in order to cover faces at varied scales. Each level of the image pyramid is resized by $12/T$ and serves as the input image for the 12-net CNN. Under this structure, 2500 detection windows are created, as shown in Figure 1.
b. 12-calibration-net. The 12-calibration-net is used for bounding-box calibration. The detection window has dimensions $(x, y, w, h)$, where $x$ and $y$ give the position of the window along the two axes and $w$ and $h$ are the width and height, respectively. A calibration pattern adjusts the window according to its size as

$$\left(x - \frac{x_n w}{s_n},\; y - \frac{y_n h}{s_n},\; \frac{w}{s_n},\; \frac{h}{s_n}\right)$$

where $s_n$ is the scale change and $x_n$, $y_n$ are the offsets of the $n$-th pattern. In the present study the number of patterns is N = 45, formed by combining the values of $s_n$, $x_n$, and $y_n$. The image is cropped according to the detection window, at a size of 12×12, and the crop serves as the input image of the 12-calibration-net. Because the patterns obtained as output are not orthogonal, the average of the patterns is taken. A threshold $t$ is used to remove the patterns that are not high-confidence.
c. 24-net CNN. To further reduce the number of detection windows, a binary classification CNN called the 24-net is used. The detection windows that remain after the 12-calibration-net are resized to 24×24 images and re-evaluated by the 24-net.
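The number of detection windows produced by the dense 12×12 scan can be worked out with a short sketch. It assumes the standard sliding-window count for a window of 12 and a spacing of 4, and it builds the pyramid by doubling the handled face size from level to level; the growth factor, the function names, and the 800×600 example image with a minimum face size of 40 are assumptions, not values given in the text.

```python
def detection_map_size(width, height, window=12, stride=4):
    """Number of window positions (columns, rows) for a dense scan."""
    if width < window or height < window:
        return (0, 0)
    cols = (width - window) // stride + 1
    rows = (height - window) // stride + 1
    return (cols, rows)


def pyramid_sizes(width, height, min_face, window=12, growth=2.0):
    """Resized image sizes, one per pyramid level.

    Level k is resized by window / (min_face * growth**k), so that faces of
    that size map onto the 12x12 detection window. `growth` is a
    hypothetical parameter and must be greater than 1.
    """
    if growth <= 1.0 or min_face <= 0:
        raise ValueError("growth must exceed 1 and min_face must be positive")
    sizes = []
    face = float(min_face)
    while True:
        w = round(width * window / face)
        h = round(height * window / face)
        if min(w, h) < window:
            break
        sizes.append((w, h))
        face *= growth
    return sizes


# Assumed example values: an 800x600 image and a minimum face size of 40.
levels = pyramid_sizes(800, 600, min_face=40)
cols, rows = detection_map_size(*levels[0])
print(levels[0], cols * rows)
```

For these assumed values the first pyramid level is 240×180 and yields 58 × 43 = 2494 windows, which is on the order of the 2500 windows mentioned above.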
A multi-resolution structure is also adopted in the 24-net; with it, the overall overhead of the 12-net CNN structure is reduced and the structure becomes more discriminative.
d. 24-calibration-net CNN. It is another calibration CNN, similar to the 12-calibration-net. It also uses N calibration patterns, and the calibration process is the same as for the 12-calibration-net.
e. 48-net CNN. It is the most effective CNN, applied after the 24-calibration-net, but it is considerably slower. It follows the same procedure as the 24-net, although this CNN is more complex than the other sub-structures. It also adopts the multi-resolution technique, as in the 24-net.
f. 48-calibration-net CNN. It is the last stage, or sub-structure, of the CNN cascade. The number of calibration patterns is the same as for the 12-calibration-net, i.e. N = 45. A pooling layer is used in this sub-structure in order to obtain more accurate calibration.
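The calibration step shared by the three calibration nets can be sketched as follows. The sketch assumes that each pattern is a triple of a scale change and two offsets, that a window (x, y, w, h) is adjusted by shifting it by the offsets relative to its size and dividing its size by the scale change, and that the patterns whose confidence exceeds the threshold are averaged before being applied. The pattern values below are illustrative choices that happen to give 45 combinations; they are not stated in the text.

```python
from itertools import product


def apply_pattern(box, s_n, x_n, y_n):
    """Adjust a window (x, y, w, h) by one calibration pattern."""
    x, y, w, h = box
    return (x - x_n * w / s_n, y - y_n * h / s_n, w / s_n, h / s_n)


def calibrate(box, patterns, scores, threshold):
    """Average the patterns scoring above `threshold` and apply the result.

    The window is returned unchanged when no pattern is confident enough.
    """
    kept = [p for p, c in zip(patterns, scores) if c > threshold]
    if not kept:
        return box
    s_n = sum(p[0] for p in kept) / len(kept)
    x_n = sum(p[1] for p in kept) / len(kept)
    y_n = sum(p[2] for p in kept) / len(kept)
    return apply_pattern(box, s_n, x_n, y_n)


# Illustrative pattern set: 5 scale changes x 3 x-offsets x 3 y-offsets = 45.
SCALES = (0.83, 0.91, 1.0, 1.10, 1.21)
OFFSETS = (-0.17, 0.0, 0.17)
PATTERNS = list(product(SCALES, OFFSETS, OFFSETS))

# Hypothetical confidence scores: only the first pattern is confident.
scores = [0.9] + [0.0] * (len(PATTERNS) - 1)
print(len(PATTERNS), calibrate((100, 100, 48, 48), PATTERNS, scores, 0.5))
```

Averaging is used instead of picking the single best pattern because neighbouring patterns overlap in effect, so several of them can be confident at once.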

RESULT AND DISCUSSION
Examples of input images for two different identities, together with the generated pose-invariant outputs, are illustrated in Figure 3. The figure shows the detected faces at various angles and poses, for left and right profile faces as well as the frontal face. Our detector gives results for images with varying poses and resolutions. The performance of modern face detection solutions on multi-view face data sets is unsatisfactory. With multi-resolution in the CNN, as shown in Figure 5, the number of false detections stops increasing (at 10000 falsely detected faces) and the detection rate is achieved.

Comparison of Face Detectors
The effectiveness of the developed method is compared with existing methods and techniques. The proposed method performs well in terms of accuracy and recognition rate. We compare our method with other approaches, including EdgeBox [23], Faceness [24], and DeepBox [25], on the AFLW data set. Our method processes the input image at low resolution to quickly reject non-face regions, which supports accurate detection. The use of calibration nets in the cascade improves the quality of the bounding boxes. We also show that our detector can easily be tuned to a faster version with a minor decrease in performance. Without multi-resolution in the CNN, more faces are falsely detected than with multi-resolution, as shown in Figure 4. The overall test sequence is shown in Figure 5. First, the test image is applied to the system; the 12-net scans the whole image and quickly rejects about 90% of the detection windows. The remaining detection windows are processed by the 24-net and the calibration CNNs. In the subsequent stages, highly overlapped windows are removed. The 48-net then takes the remaining detection windows, evaluates them with calibrated bounding boxes, and outputs the detected bounding boxes. Figure 6 shows the detection results at the different stages of the structure.
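The test sequence described above, in which each stage rejects low-scoring windows and highly overlapped windows are then removed, can be sketched in a few lines. The sketch assumes that overlap is measured by intersection over union and removed greedily in order of score; the scoring functions stand in for the trained nets and, like the thresholds, are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two windows given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Indices of the windows kept after greedy overlap removal."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def run_cascade(windows, stages, iou_threshold=0.5):
    """Pass windows through (score_fn, threshold) stages, pruning after each."""
    for score_fn, threshold in stages:
        scored = [(w, score_fn(w)) for w in windows]
        scored = [(w, s) for w, s in scored if s >= threshold]
        boxes = [w for w, _ in scored]
        scores = [s for _, s in scored]
        keep = non_max_suppression(boxes, scores, iou_threshold)
        windows = [boxes[i] for i in keep]
    return windows


# Hypothetical windows and a stand-in scorer that favours the left of the image.
candidates = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
stages = [(lambda box: 1.0 / (1.0 + box[0]), 0.1)]
print(run_cascade(candidates, stages))
```

Because every stage only sees the windows that survived the previous one, the more expensive later nets are evaluated on a small fraction of the original candidates.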

CONCLUSION
In this work, we develop an algorithm for detecting multi-view faces using a deep convolutional neural network. A major contribution of this research is a procedure that can assemble a large data set with little label noise while reducing the amount of manual annotation involved. The main idea of the algorithm is to leverage the high capacity of DCNNs for classification and feature extraction, in order to learn a single classifier for detecting faces from different views and to reduce the computational complexity by simplifying the detector architecture. To this end, we first converted the fully connected layers into convolutional layers by reshaping the layer parameters. By exploring a few key features of the network structure, we achieve high-performance convolutional networks at a relatively small scale. Our detector gives results for images with varying poses and resolutions.