Gender classification using a custom convolutional neural network architecture

Received Sep 9, 2019; Revised May 20, 2020; Accepted Jun 11, 2020

Gender classification demonstrates high accuracy in many previous works. However, it does not generalize very well in unconstrained settings and environments. Furthermore, many proposed convolutional neural network (CNN) based solutions vary significantly in their characteristics and architectures, which calls for an optimal CNN architecture for this specific task. In this work, a hand-crafted, custom CNN architecture is proposed to distinguish between male and female facial images. This custom CNN requires smaller input image resolutions and significantly fewer trainable parameters than some popular state-of-the-art architectures such as GoogleNet and AlexNet. It also employs batch normalization layers, which results in better computational efficiency. Based on experiments using publicly available datasets such as the LFW, CelebA and IMDB-WIKI datasets, the proposed custom CNN delivered the fastest inference time in all tests, needing only 0.92ms to classify 1200 images on GPU, 1.79ms on CPU, and 2.51ms on VPU. The custom CNN also delivers performance on par with the state of the art and even surpassed these methods in CelebA gender classification, where it delivered the best result at 96% accuracy. Moreover, in a more challenging cross-dataset inference, the custom CNN trained using the CelebA dataset gives the best gender classification accuracy for tests on the IMDB and WIKI datasets, at 97% and 96% accuracy respectively.


INTRODUCTION
Human facial analysis has become one of the most significant tasks in computer vision, since it plays a vital role in social interactions. Alongside related tasks such as the characterization of age, facial attributes, expressions, and personality, automatic gender classification has various important applications such as intelligent user interfaces, user identification, social interaction, visual surveillance, collecting demographic statistics for marketing, and behaviour recognition. Therefore, many research efforts have been devoted to designing automated systems that can classify gender [1][2][3][4]. Although this task has been largely addressed in the past, the reported performances are far from optimal, especially under unconstrained conditions [5,6]. Moreover, the complexity of this task largely depends on the context of the application and the training protocols. A gender classification model can be trained and tested on the same dataset, known as in-dataset inference, or on different datasets, known as cross-dataset inference. Besides, the facial images used in these datasets can be captured under controlled or uncontrolled/unconstrained environments, which further increases the complexity of the task. One state-of-the-art result in gender classification was obtained by Jia and Cristianini, who used 4 million images to train their method called C-Pegasos [7] and tested it using a cross-dataset inference strategy on the unconstrained LFW dataset. More recently, however, Afifi and Abdelhamed [8] performed similar cross-dataset tests, and their results show that poorer classification performance may be obtained, which, according to them, is due to the different conditions under which images are collected in different datasets, such as occlusions, illumination changes, backgrounds, etc.
Recently, deep neural networks, and more specifically the convolutional neural network (CNN) [9], have become the gold standard for object recognition. CNNs have improved nearly all areas of computer vision, including human action recognition [10], hand-written digit recognition [11], face verification and classification [12][13][14] and face detection [15]. However, there are two problems associated with CNNs in particular: (1) the enormous amount of data required to train the network, such as in [12], and (2) the memory requirement of the network due to the computation of massive numbers of parameters, which often limits the application of CNNs on embedded platforms such as mobile phones, as well as on cloud services. For example, two state-of-the-art CNN architectures, GoogleNet [16] and AlexNet [17], contain 6,799,700 and 62,378,344 parameters respectively. As another example, a 16-layer CNN described in [18] has a weight file bigger than 500MB and requires about 3.1×10^10 floating-point operations per image. Thus, a CNN can be regarded as a high-capacity classifier with a very large number of trainable parameters, which requires it to learn from larger datasets [16] because tuning and estimating each parameter from a small number of samples is difficult. To reduce the effect of these limitations, the CNN can be optimized to reduce the complexity of its layers through several approaches: reducing the number of convolutional layers, reducing the number of neurons in fully connected layers, and/or reducing the resolution of the input images. However, this must be done carefully to ensure that the resulting architecture can still learn the task at hand, e.g. gender classification, and generalize well on unseen data. The improvement in computation should not compromise the accuracy of the classification.
In this work, the problem of gender classification and the high complexity of existing deep neural networks is addressed by focusing on reducing the complexity of the CNN and improving the memory requirement as well as the time required for network inference. In particular, the goal of this paper is to propose a complete design of a low-complexity hand-crafted CNN, which is tested on the gender classification task under very challenging unconstrained conditions as well as in cross-dataset inference experiments. This relatively simpler and minimized model achieves state-of-the-art performance and shows a significant boost in inference time when compared against several existing CNN architectures. Note that hand-crafting an architecture is a very challenging and time-consuming process that requires expert knowledge due to the large number of architectural choices [19]. The proposed simplified CNN can learn from a relatively smaller dataset and perform classification on a larger dataset.
One of the important works on gender classification, called Face Tracer [20], employs a combination of Adaboost and Support Vector Machines that selects and trains on the optimal set of features for each attribute based on the salient structure of faces. Similarly, PANDA-w and PANDA-1 [2] combine deep learning and part-based models by training pose-normalized CNNs to classify various human attributes such as hair style, gender, expression and clothing style, and work well for images under large variability of pose, appearance, viewpoint, occlusion, and articulation. Liu et al., [21] proposed cascades of dual CNNs, called LNet and ANet. These CNNs are pre-trained in different sessions but jointly fine-tuned with attribute tags. ANet is pre-trained for attribute prediction using a huge number of face identities, whereas LNet is pre-trained for face localization using a huge number of general object categories. This approach surpassed the state of the art by a considerable margin and disclosed important facts about learning face representations. By combining the intermediate layers of a deep CNN using a separate CNN followed by a multi-task learning algorithm, Hyperface [22] boosts their individual performances by exploiting the synergy among the tasks. The authors showed that Hyperface is able to extract both holistic and local information in faces and thus outperforms many competitive algorithms for face detection, pose estimation, gender recognition and landmark localization. Jia and Cristianini [7] presented a simple yet effective classifier of face images called C-Pegasos, generated by training a linear classification algorithm on a massive dataset that is automatically assembled and labelled. They used four million images and more than 60,000 features to train these classifiers. By employing ensembles of linear classifiers, C-Pegasos achieved an accuracy of 96.86% when tested on the LFW dataset.
Recently, Afifi and Abdelhamed [8] proposed an approach based on the behaviour of humans in gender classification. They rely on a foggy face that combines isolated facial features and a holistic feature, instead of dealing with the face image as a sole feature. The foggy face is then used to train a CNN, followed by score fusion based on AdaBoost to classify the gender class. Antipov et al., [23] suggested an ensemble model of CNNs to enhance the gender classification accuracy from facial images in the LFW dataset. Their ensemble model is purposely designed in such a way that it minimizes the memory requirements and computation time. Likewise, the local deep neural network (Local-DNN) [3] is proposed for gender classification, relying on two fundamental ideas: deep architectures and local features. Overlapping regions in the visual field are used to train the model via discriminative feed-forward networks built from multiple layers. The authors showed that Local-DNN outperformed other deep-learning-based methods and attains state-of-the-art results in multiple benchmarks.
On the other hand, several works focus on designing custom or hand-crafted CNN architectures. Most of these works discuss CNN design issues such as hyperparameters [19,24], new optimizations of the objective function [19], an improved triplet loss function [25], structure compression of CNNs [26] and estimating the complexity of CNN architectures [27]. These methods share a similar objective, which is to find the most optimized hand-crafted CNN architecture that produces high recognition performance while keeping its complexity as low as possible to allow faster computation and a smaller memory requirement.
Additionally, the research community is also addressing a much more realistic general problem for gender classification, namely gender classification from facial images in the wild. Many researchers are now focused on experiments involving recent and bigger databases that encompass more variations, including identity, ethnicity, age, illumination, image resolution and pose. This has called for solutions on how to obtain a model that generalizes well on a dataset and gives good inference performance on a completely new, unseen dataset. In doing so, it is very important to ensure the model does not require constant retraining and can work well in challenging, unconstrained conditions. Thus, cross-dataset tests have been adopted previously in [1,7,8,23] to measure the performance of gender classifiers on new datasets that present this type of challenge. The rest of this paper is arranged as follows. In Section 2 the proposed custom CNN, which can be used to automatically classify gender, is explained. Section 3 describes and analyzes the results on publicly available datasets such as the LFW, CelebA and IMDB-WIKI datasets. Finally, Section 4 draws some conclusions and discusses future work.

RESEARCH METHOD
The convolutional neural network architecture adopted in this work intends to reduce complexity by optimizing the convolutional layers, reducing the number of neurons in the fully connected layers and reducing the resolution of the input images. The resulting network should still deliver state-of-the-art performance while maintaining a very compact size. To this end, the AlexNet architecture is used as the starting network template. AlexNet has an architecture very similar to LeNet by LeCun et al., [28] and contains eight main layers, the first five of which are convolutional layers. Some of the convolutional layers are followed by Max Pooling layers, and the final three layers are fully connected layers. Firstly, the input is reduced from the original 227×227 resolution to 64×64. Previous works on face recognition have shown that this image resolution is sufficient to achieve good results [29]. Changing the size of the image input layer requires the sizes of the subsequent convolution layers to be altered too. For more stable and improved learning, two convolution layers are added to the network. Finally, to further reduce the number of trainable parameters, the number of neurons in all fully connected (FC) layers is reduced, from 4096, 4096, and 1000 neurons to 100, 50 and 2 neurons at each FC layer, respectively. It is important to note that the final FC layer acts as the softmax layer.
To speed up the training, the normalization layers in the network, which are based on local response normalization (LRN), are replaced with batch normalization layers [30]. The batch normalization layer makes normalization a part of the model architecture and performs the normalization for each training mini-batch. It allows the network to be trained at higher learning rates, which is otherwise difficult and highly unstable with normal LRN layers. Batch normalization has been shown to achieve the same state-of-the-art performance with 14 times fewer training iterations [30]. An overview of the architecture of the proposed custom CNN is shown in Figure 1.
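The transformation applied by a batch normalization layer can be sketched in plain Python. This is a minimal sketch of the standard formulation of [30] for a batch of scalar activations; the function name `batch_norm` is illustrative, and the learned scale and shift (`gamma`, `beta`) are fixed to their defaults for simplicity:

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch: subtract the batch mean, divide by the
    batch standard deviation, then scale by gamma and shift by beta
    (gamma and beta are learned during training in a real network)."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
# The normalized batch has (approximately) zero mean and unit variance,
# which is what allows training at higher learning rates.
```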
Subsequently, the layers, kernel sizes, numbers of kernels, and strides for each layer used in the custom CNN architecture are given in Table 1. According to Table 1, the custom CNN has 7 convolution layers in total, with 6 batch normalization layers between them. Max Pooling layers are also added between the convolution layers to reduce the spatial dimension of the input volume for the next layers. The activation function used in this network is the rectified linear unit (ReLU). Several dropout layers are also added alongside the ReLU activations to prevent overfitting. To demonstrate that this custom CNN possesses fewer parameters than several existing CNNs, the parameters of the custom CNN design are computed. The following subsections show in detail how to compute the size of the output features of each convolution layer and how to compute the number of parameters associated with the convolution layers and fully connected layers of a CNN.
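The layer ordering described above can be expressed as a structural sketch in Python. Since Table 1 itself is not reproduced in the text, kernel sizes and strides are omitted; the listing only encodes the layer counts the text states (7 convolution layers, 6 batch normalization layers between them, pooling between convolutions, and FC layers of 100, 50 and 2 neurons), and the layer names are illustrative:

```python
# Structural sketch of the custom CNN, as an ordered list of (kind, name)
# pairs. Kernel sizes/strides are specified in Table 1 and omitted here.
layers = []
for i in range(1, 8):
    layers.append(("conv", f"conv{i}"))
    layers.append(("relu", f"relu{i}"))
    if i < 7:  # 6 batch-norm and pooling layers sit between the 7 convs
        layers.append(("batchnorm", f"bn{i}"))
        layers.append(("maxpool", f"pool{i}"))
for neurons in (100, 50, 2):  # the final FC layer acts as the softmax
    layers.append(("fc", f"fc{neurons}"))

n_conv = sum(1 for kind, _ in layers if kind == "conv")
n_bn = sum(1 for kind, _ in layers if kind == "batchnorm")
n_fc = sum(1 for kind, _ in layers if kind == "fc")
```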

Size of the output features of a convolution layer and max pooling layer
To compute the size of the output features of a CNN, let W_out denote the width of the output image, W_in the width of the input image, K the width of the convolution kernels, N the number of kernels used, S the stride of the convolution, P the padding and F_p the pool size. The widths of the output of a convolution layer and of a Max Pooling layer are given by:

W_out = (W_in - K + 2P) / S + 1 (1)

W_out = (W_in - F_p) / S + 1 (2)
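As a quick check of (1) and (2), the two output-width formulas can be written as small helper functions (the function names are illustrative):

```python
def conv_output_width(w_in, kernel, stride, padding):
    """Spatial width of a convolution layer's output, per equation (1)."""
    return (w_in - kernel + 2 * padding) // stride + 1

def pool_output_width(w_in, pool_size, stride):
    """Spatial width of a max-pooling layer's output, per equation (2)."""
    return (w_in - pool_size) // stride + 1

# A 64x64 input through a 3x3 convolution with stride 1 and padding 1
# keeps its width (64), and a 2x2 pooling with stride 2 then halves it (32).
conv_w = conv_output_width(64, kernel=3, stride=1, padding=1)
pool_w = pool_output_width(conv_w, pool_size=2, stride=2)
```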

Parameters of a convolution layer
In a CNN, there are two kinds of parameters in each layer, namely the weights and the biases. The total number of parameters is the sum of all weights and biases. Let the numbers of weights and biases of a convolution layer be W_c = K² × C × N and B_c = N respectively, where C is the number of channels of the input image. Thus, the number of parameters P_c of the convolution layer is:

P_c = K² × C × N + N (3)

Parameters of an FC layer connected to a convolution layer
Let the numbers of weights and biases of an FC layer connected to a convolution layer be W_cf = O² × N × F and B_cf = F respectively, where F is the number of neurons in the FC layer and O is the width of the output image of the previous convolution layer. Thus, the number of parameters P_cf of this FC layer can be computed from:

P_cf = O² × N × F + F (4)

Parameters of an FC layer connected to another FC layer
Let the numbers of weights and biases of an FC layer connected to another FC layer be W_ff = F_(i-1) × F_i and B_ff = F_i respectively, where F_(i-1) is the number of neurons in the previous FC layer. Thus, the number of parameters P_ff of this FC layer is:

P_ff = F_(i-1) × F_i + F_i (5)

Finally, the total number of parameters P_total = P_c + P_cf + P_ff can be determined. It is important to note that there are no parameters associated with the pooling, dropout, and ReLU layers. The pool size, stride, and padding are all hyperparameters whose values are set before the learning process begins. Although better results can be achieved when hyperparameters are properly adjusted [24], these hyperparameters are non-trainable and external to the model. Throughout this work, the experiments run on a computer system with an Intel i7-6700 CPU @ 3.40 GHz, 16GB of RAM and a GTX1080Ti as the main GPU. The hyperparameters used in training the custom CNN are as follows: momentum = 0.9, mini batch size = 500, L2 regularization = 0.001, initial learning rate = 0.1, learn rate drop factor = 0.9, and learn rate drop period = 10.
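The three parameter-count formulas (3), (4) and (5) can be verified with small helper functions (the function names are illustrative). As a sanity check, the first convolution layer of AlexNet (11×11 kernels, 3 input channels, 96 kernels) should come out to 34,944 parameters:

```python
def conv_params(kernel, channels, num_kernels):
    """Weights (K^2 * C * N) plus biases (N) of a convolution layer, per (3)."""
    return kernel ** 2 * channels * num_kernels + num_kernels

def fc_after_conv_params(out_width, num_kernels, neurons):
    """Weights (O^2 * N * F) plus biases (F) of an FC layer that is
    connected to a convolution layer, per (4)."""
    return out_width ** 2 * num_kernels * neurons + neurons

def fc_after_fc_params(prev_neurons, neurons):
    """Weights (F_(i-1) * F_i) plus biases (F_i) of an FC layer that is
    connected to another FC layer, per (5)."""
    return prev_neurons * neurons + neurons

# AlexNet's first convolution layer: 11*11*3*96 + 96 = 34,944 parameters.
alexnet_conv1 = conv_params(kernel=11, channels=3, num_kernels=96)
# AlexNet's last FC layer (4096 -> 1000): 4096*1000 + 1000 = 4,097,000.
alexnet_fc8 = fc_after_fc_params(prev_neurons=4096, neurons=1000)
```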

RESULTS AND DISCUSSION
In this section, the results of gender classification experiments on several datasets are presented. Critical discussions of the performance of the proposed custom CNN are presented in terms of its accuracy and inference speed on GPU, CPU and an embedded system. The performance in gender recognition is also compared against state-of-the-art methods such as AlexNet and GoogleNet to highlight the advantages of the proposed method. Besides, the custom CNN is also tested under a cross-dataset inference constraint, where the custom CNN trained using the CelebA dataset is tested on the IMDB and WIKI datasets.

Datasets description
In the experiments, several publicly available datasets are used, namely the Labeled Faces in the Wild (LFW) dataset, the CelebFaces Attributes (CelebA) dataset and the IMDB-WIKI dataset. The LFW dataset [31] contains 13,233 photos of 5,749 celebrities taken in unconstrained environments, which are divided into 10 folds for cross-validation. Each fold contains both male and female images, as suggested by the restricted protocol. Performance is measured using the restricted protocol, in which only gender labels are available during training. The LFW gender labels used in this work are those determined by Afifi and Abdelhamed [8], who used the attribute values suggested by Kumar et al., [32] to label the images by gender and subsequently removed incorrect labels by manually reviewing each category of male and female images three times. In this paper, a variant of the LFW dataset called LFWA [33] is used, which contains the same images as the original LFW dataset; however, images in the LFWA dataset are aligned using a commercial face alignment software. CelebA [34] is a large-scale dataset with large facial diversity, huge quantity and comprehensive annotations: it has 202,599 images from 10,177 identities, with 40 binary attribute annotations and 5 landmark locations for each image. The images in this dataset also contain background clutter and pose variations. The IMDB-WIKI dataset [35,36] contains 523,051 images, making it one of the largest public face datasets available. These face images were crawled from the IMDb and Wikipedia websites. In total, the dataset contains 460,723 face images of 20,284 celebrities from IMDb and 62,328 images from Wikipedia, with gender and age labels supplied for training.
For the IMDB-WIKI dataset, only photos whose second strongest face detection score is below a certain threshold are chosen; thus, the numbers of images used in this work from this dataset are 33,181 and 3,210 for IMDB and WIKI respectively. In all experiments, all face images are aligned, cropped and resized to 64×64. Some examples of the images from the LFW, CelebA and IMDB-WIKI datasets are shown in Figure 2. The numbers of male and female images for each dataset are summarized in Table 2.

Parameters of custom CNN architecture
Firstly, the complexity of the custom CNN architecture is compared against the GoogleNet and AlexNet architectures by computing the number of parameters. The feature size (output) of a convolution layer and of a Max Pooling layer are computed using (1) and (2). Afterward, the number of parameters of a convolution layer, of an FC layer connected to a convolution layer, and of an FC layer connected to another FC layer are calculated using (3), (4) and (5) respectively. These parameters are given in Table 3.
Based on Table 3, the total number of parameters of the custom CNN is 2,041,796. It is interesting to note that 95% of the parameters come from the 6th and 7th convolution layers and the first FC layer. This number of parameters is subsequently compared against the numbers of parameters of the GoogleNet and AlexNet architectures, given in Table 4. According to Table 4, AlexNet has the largest number of parameters at more than 62M, while GoogleNet has 6.8M parameters. This means AlexNet requires a larger memory space during training than the other methods. The proposed custom CNN possesses the lowest number of parameters compared with GoogleNet and AlexNet; in fact, the custom CNN has 3 times fewer parameters than GoogleNet and 30 times fewer than AlexNet. This has a positive impact on the memory requirement of the custom CNN during training, which enables it to be trained with a mini batch size of 500, compared to GoogleNet and AlexNet, for which the mini batch size is set to only 30 throughout the experiments. The input image resolution for the custom CNN is also significantly smaller at 64×64 pixels, compared to the 224×224 pixels and 227×227 pixels required by GoogleNet and AlexNet respectively. In total, the input for the custom CNN is 92% smaller than the input for GoogleNet and AlexNet. This allows the custom CNN to be trained much faster, and on relatively cheaper computer systems with lower specifications. Nevertheless, based on the results presented in the subsequent sections, the custom CNN is demonstrated to extract and learn highly complex features from significantly smaller images in much less time without sacrificing accuracy.
Since training large datasets on a CPU or Vision Processing Unit (VPU) is painfully slow, it is more appropriate to show the computational advantage of the custom CNN by measuring the average inference time for 1200 images from the 10 folds of the LFW dataset using the custom CNN, GoogleNet and AlexNet. The inferences are run on a GTX1080Ti (GPU), an Intel i7-6700 @ 3.40 GHz (CPU) and a Movidius Neural Compute Stick (Movidius NCS) (VPU). The average inference times measured in this experiment are given in Figure 3.
According to Figure 3, the custom CNN requires only 0.92ms to classify 1200 images on GPU, while GoogleNet and AlexNet require significantly longer inference times of 4.42ms and 1.95ms respectively. On CPU, the average inference time for the custom CNN is just 1.79ms, while GoogleNet and AlexNet require 62.37ms and 20.41ms respectively. On VPU, the custom CNN again requires the least inference time at just 2.51ms, while GoogleNet and AlexNet require 95.78ms and 55.60ms respectively. This puts into perspective how the custom CNN capitalizes on its smaller image size and parameter count, which enables it to run inference at significantly higher speed in GPU, CPU and VPU environments. The most important takeaway is that this allows the custom CNN to be deployed in real-time tasks, whether on a GPU/CPU-based system or on a fully embedded system such as those relying on a more energy-efficient VPU like the Movidius NCS. In the following section, it is shown that even though the custom CNN works with a smaller number of parameters, its performance in gender classification is in fact on par with, and occasionally better than, the state of the art.
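The timing procedure can be sketched as a small wall-clock harness. This is only an illustration of how average inference time per batch might be measured; the `classify` callable and the dummy batch below are placeholders, not the trained networks or data used in the experiments:

```python
import time

def average_inference_ms(classify, images, repeats=10):
    """Average wall-clock time, in milliseconds, for `classify` to process
    the batch of images, averaged over several repeats."""
    start = time.perf_counter()
    for _ in range(repeats):
        classify(images)
    return (time.perf_counter() - start) / repeats * 1000.0

# Dummy stand-in classifier: labels each "image" by the sign of its pixel sum.
dummy_batch = [[0.1, -0.2, 0.3]] * 1200
elapsed_ms = average_inference_ms(lambda batch: [sum(img) > 0 for img in batch],
                                  dummy_batch)
```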

Performance in gender classification
The custom CNN is trained to classify gender from face images in the LFWA dataset, and its performance is evaluated in terms of accuracy, true positive rate (TPR), false positive rate (FPR) and precision. These performances are averaged over 10 runs from the 10 cross-validation folds mentioned earlier. Two variants of images are used, namely grayscale and RGB images. The performance of the custom CNN for gender classification is compared against that of the GoogleNet and AlexNet classifiers on the same task, shown in Figures 4 and 5. According to Figure 4, slightly better performance is achieved using RGB images compared with grayscale images for all tested classifiers: an accuracy improvement of 0.01 is obtained by the custom CNN, GoogleNet and AlexNet when using RGB images instead of grayscale images. According to Figures 4 and 5, GoogleNet gives the best accuracy for RGB images at 0.96, while the custom CNN delivers 0.95 accuracy, better than AlexNet at 0.90. In terms of TPR, GoogleNet again delivers the best performance at 0.98, while the custom CNN and AlexNet deliver 0.97 and 0.93 respectively. Similarly, GoogleNet delivers the best FPR and precision, with the custom CNN delivering the second-best performance, followed by AlexNet with the worst performance of the three classifiers. A closer look at the accuracy of the tested classifiers for each test fold in Figure 6 shows that the custom CNN and GoogleNet have comparable performance and are more stable, with only slight fluctuations in accuracy, compared to AlexNet. These performances of the custom CNN are quite impressive considering its significantly less complex architecture, which contains fewer parameters and runs inference at significantly higher speed than state-of-the-art classifiers such as GoogleNet and AlexNet.
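The four reported metrics follow directly from the confusion-matrix counts of a binary classifier. A minimal sketch (the counts below are illustrative, not taken from the experiments):

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, true positive rate, false positive rate and precision
    from the confusion-matrix counts of a binary classifier."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)   # true positive rate (recall/sensitivity)
    fpr = fp / (fp + tn)   # false positive rate
    precision = tp / (tp + fp)
    return accuracy, tpr, fpr, precision

# Illustrative counts: 95 true positives, 5 false positives,
# 90 true negatives, 10 false negatives.
acc, tpr, fpr, prec = classification_metrics(tp=95, fp=5, tn=90, fn=10)
```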
Afterwards, the performance of gender classification on the LFWA and CelebA datasets is evaluated in the form of the Receiver Operating Characteristic (ROC) curve, one of the most important evaluation metrics for checking any classification model's performance. The Area Under the Curve (AUC) is used to measure performance at the various threshold settings applied to the output of the final softmax layer. The ROC curves from this experiment are shown in Figure 7. Based on Figure 7, for the LFWA dataset, GoogleNet has the best AUC, followed by the custom CNN and AlexNet. Interestingly, the AUCs of the custom CNN and GoogleNet are not too different from one another, proving again that the custom CNN can deliver performance on par with the GoogleNet classifier despite its much simpler architecture. Moreover, the AUC of the custom CNN on the CelebA dataset exceeds those of GoogleNet and AlexNet, making the custom CNN the best-performing classifier on CelebA. This performance on the CelebA dataset also shows that even though the custom CNN contains fewer parameters than the other tested classifiers, it can learn to classify the gender of more than 200K images correctly most of the time.
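The AUC can also be computed without plotting the full ROC curve, using the equivalent rank-based (Mann-Whitney) formulation: the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. A minimal sketch with illustrative scores:

```python
def auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney formulation:
    fraction of positive/negative pairs where the positive sample
    receives the higher score (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that ranks every positive above every negative sample
# attains a perfect AUC of 1.0.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```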

Cross-dataset inference
To evaluate the robustness and generalization performance of the custom CNN, the accuracy of gender classification is measured in a cross-dataset inference fashion. Cross-dataset inference implies that a model is trained on one dataset and tested on a completely different dataset. This implementation is similarly used in [1] and is not transfer learning, since the same classification layer is kept and tested on the new dataset. This is a very challenging test, since the variations and inherent attributes contained in one dataset may not exist in another, thus introducing large variability in the test dataset. However, it can be used to validate whether the model generalizes well and does not overfit the training data. In this experiment, the custom CNN, GoogleNet and AlexNet are trained on the LFWA and CelebA datasets and subsequently tested on the LFWA, CelebA, IMDB and WIKI datasets. The IMDB and WIKI datasets are not used at all during training; only the LFWA and CelebA datasets are used to train the models. The results of this experiment are given in Table 6, where shaded areas indicate results obtained from tests on the same dataset (in-dataset inference). To simplify the presentation of the results, a train dataset-test dataset notation is used. For example, LFWA-LFWA indicates LFWA is used in training and LFWA is used in testing; in this case, different portions of data from the same dataset are used for training and testing. According to Table 6, LFWA-LFWA inference yields the best result using GoogleNet at 96% accuracy, while LFWA-CelebA inference yields the best result using the custom CNN at 94% accuracy. CelebA-LFWA inference yields the best result using GoogleNet at 94% accuracy, while CelebA-CelebA inference yields the best result at 96% accuracy using the custom CNN. GoogleNet and the custom CNN both deliver the best result in IMDB-LFWA inference at 96% accuracy. On the other hand, the custom CNN delivers the best result in IMDB-CelebA inference at 97% accuracy.
Again, for WIKI-LFWA inference, the custom CNN and GoogleNet both yield the best result at 94% accuracy, while the custom CNN yet again delivers the best result at 96% accuracy for WIKI-CelebA inference. Overall, the custom CNN trained using the CelebA dataset gives the best gender classification accuracy for the CelebA, IMDB and WIKI datasets at 96%, 97% and 96% respectively, which highlights its robustness and its ability to generalize to completely different datasets. Another important factor is that CelebA is a very large dataset, so most of the variations that exist in LFWA, IMDB and WIKI may have been captured by the custom CNN from the CelebA images. Several kinds of variability, in terms of 1) identity, age and ethnicity, 2) pose and illumination conditions, and 3) image resolution, may be shared across these datasets. On the other hand, AlexNet fails to generalize well in the cross-dataset inference experiments, delivering poor performance in all cross-dataset tests; its worst results are 77%, 70% and 60% accuracy for the LFWA-CelebA, IMDB-CelebA and WIKI-CelebA inferences respectively. These results also show that the custom CNN can generalize well from a smaller dataset and perform classification on a larger dataset: the custom CNN trained using the LFWA dataset can correctly classify gender in the larger CelebA dataset with a good accuracy of 93%.
One of the challenges of CNNs is to comprehend what exactly happens at each layer during training. It is well known that the earlier layers extract low-level features of the image, while the final layers essentially decide on the class of the image. The first layer normally finds edges or corners, whereas intermediate layers interpret these basic features to look for overall shapes or components, like a cat or a ball. The final few layers assemble those features into complete interpretations of the trained classes. To learn more about the features learned by the custom CNN in classifying gender, the features learned at the 4th convolution layer and the softmax layer are visualized for the LFWA and CelebA datasets respectively. These features can be visualized using DeepDream, which produces a dream-like, hallucinogenic appearance in deliberately over-processed images [38,39]. The visualization generates images that strongly activate a particular channel of a network layer. The DeepDream visualizations from the custom CNN trained to classify gender are illustrated in Figure 9. At the 4th convolution layer, the features are more mixed and give different impressions, even though the training samples contain only two classes. This is due to the many learned features contained in the images, which belong to different gender classes. At the softmax layer, only 2 distinct learned features appear: male on the left, and female on the right.

CONCLUSION
In this work, a hand-crafted, custom CNN architecture designed to distinguish between male and female facial images is presented. This custom CNN contains only 7 convolutional layers and 2 fully connected layers, with batch normalization layers used between the convolutional layers. It requires a relatively smaller input image and, as a result, has significantly fewer parameters than other architectures such as GoogleNet and AlexNet; in fact, the custom CNN has 3 times fewer parameters than GoogleNet and 30 times fewer than AlexNet. Extensive experiments using various publicly available unconstrained datasets demonstrated the advantages of the custom CNN. It delivered the fastest inference speed in all tests, requiring only 0.92ms to classify 1200 images on GPU, 1.79ms on CPU, and 2.51ms on VPU. The proposed custom CNN also yielded performance on par with the state of the art and even surpassed these methods in CelebA gender classification, where it delivered the best result at 96% accuracy. Moreover, in the cross-dataset inference experiments, the custom CNN trained using the CelebA dataset gave the best gender classification accuracy for the IMDB and WIKI datasets at 97% and 96% respectively, which highlights its robustness and ability to generalize to completely different datasets. In future, the performance of the custom CNN will be evaluated on other classification tasks such as classifying people and objects.