Real-time human detection for electricity conservation using pruned-SSD and Arduino

Received Jul 1, 2020 Revised Aug 2, 2020 Accepted Sep 24, 2020

Electricity conservation techniques have gained importance in recent years. Many smart techniques have been devised to save electricity with the help of assistive devices such as sensors. Although they save electricity, the sensors add cost to the system. This work aims to develop a system that supplies electric power only when it is actually needed, i.e., the system enables the power supply when a human is present in the location and disables it otherwise. The system avoids additional cost by using closed circuit television (CCTV) cameras, which are already installed in most places for security reasons. Human detection is performed by a modified single shot detector (SSD) with a specific hyperparameter tuning method. The model is further pruned to reduce the computational cost of the framework, which in turn reduces the processing time of the network drastically. The model passes its output to an Arduino microcontroller, which enables the power supply in and around the location only when a human is detected and disables it when the human exits. The model is evaluated on the CHOKEPOINT dataset and on real-time video surveillance footage. Experimental results show an average accuracy of 85.82% with a processing time of 2.1 seconds per frame.


INTRODUCTION
An unmanned electric power management system (UEPMS) is a challenging but necessary means of saving electrical resources in most countries. Several research efforts on UEPMS use sensors to detect the presence or absence of humans and manage the power supply accordingly. The use of sensors incurs additional cost; hence this work aims to develop a system that manages the electrical resource using existing closed circuit television (CCTV) surveillance cameras. The surveillance video captured by the CCTV cameras is used to detect human presence or absence and to enable or disable the power supply, thereby avoiding additional cost. Human detection in surveillance camera footage has been an interesting [1] and challenging [2] topic in recent years. Traditional hand-crafted methods such as local binary pattern (LBP) and histogram of oriented gradients (HOG) are time consuming and have proved less efficient than recent convolutional neural network (CNN) based algorithms [3]. Various deep learning (DL) object detection algorithms have shown promising results in classifying objects and detecting their location [4]. The first category of deep neural network (DNN) detectors is the two-stage approach, such as the regional CNN (R-CNN), faster regional CNN (faster R-CNN) and region based fully convolutional network (R-FCN) [5, 6], in which candidate regions are first proposed by a separate stage (the region proposal network in faster R-CNN) and a classifier then processes these regions for classification [7]. The second category is the one-stage approach, such as you only look once (YOLO) and single shot multibox detector (SSD), in which both the class probabilities and the bounding boxes are produced by the CNN itself [8].
In this work, a modified SSD based on the visual geometry group network (VGG16) is used for human detection in surveillance camera footage [9]. The CNN-based network takes its input from the CHOKEPOINT dataset, which consists of frames of surveillance video. The model is initially trained with a set of hyperparameters obtained from the orthogonal array tuning method (OATM); the optimal factor values are derived from the OATM factor table. Once the model has converged to the minimal loss, the network is pruned to remove its less important parameters. Pruning reduces the complexity of the network while maintaining the accuracy of the model. The less important weights are identified by the HRank algorithm proposed by Lin et al. After pruning, the network is retrained with the set of hyperparameters obtained from OATM. The process is iterated until the loss is similar to that obtained by the model before pruning. The output of the classifier is fed to an Arduino microcontroller to manage the power supply. The Arduino enables the power supply only in and around the location (2.1 metres) where a human is detected and disables it when no human is detected. The system is also validated on a real-time dataset of surveillance video from an indoor living room, where the footage is converted to frames at a rate of 5 per second. The heavily compressed model shows promising prediction accuracy with reduced training time.

RELATED WORKS
Object detection has gained a lot of attention from researchers in various applications, ranging from small crack detection to human detection and from lesion detection [10] to detection in satellite images [11]. By overcoming the shortfalls of traditional hand-crafted methods, DL has achieved enormous growth in these detection tasks. Object detection in surveillance videos is one of the most challenging tasks owing to the lack of clarity in the frames. DL-based detection started with the regional convolutional neural network (R-CNN), which uses selective search to locate objects with bounding boxes [12]. Approximately 2000 candidate regions are extracted in this process, which is extremely time consuming. This was mitigated by the spatial pyramid pooling network (SPP-Net), which operates on the feature maps [13], and further by faster R-CNN [14], which uses a separate network called the region proposal network (RPN) to generate candidate regions. A facial expression recognition system for emotional audio and video data was developed using faster R-CNN [15].
Several modified versions of faster R-CNN were developed to improve prediction accuracy at minimal cost [16]. Although these techniques improved accuracy, detection remains time consuming because it requires two stages for prediction. This was overcome by the YOLO model, in which the entire process is carried out by a single neural network, which makes optimization considerably easier [17]. Another one-stage approach is the single shot multibox detector (SSD), which achieves promising results on real-time surveillance videos [18]. One such application detects small objects at increased speed using contextual information in SSD [19]. Compared with the detection algorithms listed above, SSD achieves relatively promising results at increased speed by applying prediction filters to every feature map produced. Although SSD achieves good results, a major challenge of DL is the long training time the model requires [20, 21]. Hence the proposed work uses the SSD architecture with a specific hyperparameter tuning method to reduce the training time of the network [22]. As a deep network designed for a real-time application involves high computational cost, pruning is applied in many applications to keep the model simple. The importance of weights is evaluated by either data-dependent or data-independent methods, among which optimal brain damage [23], optimal brain surgeon [24], the absolute value method [25] and the LASSO regression method [26] are well-known techniques. An interesting method by Mingbao Lin et al. uses HRank filter pruning, which prunes the model by calculating the rank of the feature map generated by each filter. The technique then arranges the filters in decreasing order of rank and eliminates the least important ones.
One of the important tools in the literature for controlling the power supply is the Arduino microcontroller, which is used in a wide range of applications. A system to quantify the energy of a given load and plan appropriate energy conservation policies has been designed [27]. Another application was developed for smart home energy management systems (SHEMS) [28] to enable or disable the power supply when a human is detected or undetected, respectively. All these systems use sensors, which incur additional cost. Hence our system uses a modified SSD to detect humans in existing CCTV surveillance footage, with an Arduino microcontroller to manage the power supply accordingly.

PROPOSED FRAMEWORK
This section develops a modified SSD with an orthogonal array based tuning method, which reduces the training time by using the obtained set of factor values. The model is further pruned to reduce the complexity of the DNN, which in turn greatly reduces the computational cost. The working model of the proposed architecture is given in Figure 1. The output of the model is passed to an Arduino microcontroller to manage the power supply of the environment.

Datasets and data augmentation
The proposed system is evaluated using the standard CHOKEPOINT dataset, consisting of 64,204 frames of surveillance video in an indoor environment. The system is also evaluated on a real-time surveillance video of a living room, which is converted to frames at a rate of 5 frames per second (fps). This video yields 9000 frames in total; 7420 frames (including augmented frames) are used for training and 3180 frames are used for testing. The validation set consists of 371 images used to tune the hyperparameters. Augmentation is applied to make the network more robust. The original training images are shrunk and cropped. Gamma correction is applied to vary the intensity and brightness of the frames, using two sets of values, (0.5 and 3.0) and (0.5 and 2.0), for each of the three channels individually [29]. Hide-and-seek [30] is finally used to divide the entire image into patches with a division number of 4 so that the network learns the fine details of the patches. The sub-patches are hidden with a probability of 0.25. For both datasets, the positive bounding boxes are those with a match score greater than 0.5 against the ground truth, and the remaining boxes are considered negative. All experiments are conducted on an NVIDIA TITAN X GPU with 12 GB of memory and 32 GB of DDR4 RAM.
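The gamma correction and hide-and-seek steps described above can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work. It assumes that a channel is a list of rows of 8-bit values, that gamma correction follows out = 255 (in / 255)^gamma, that a division number of 4 means a 4 × 4 grid, and that hidden patches are filled with a constant; the function names and the `fill` parameter are hypothetical.

```python
import random


def gamma_correct(channel, gamma):
    """Gamma-correct one 8-bit channel given as a list of rows.

    Assumed convention: out = 255 * (in / 255) ** gamma, so gamma < 1
    brightens and gamma > 1 darkens the channel.
    """
    return [[round(255 * (value / 255) ** gamma) for value in row]
            for row in channel]


def hide_and_seek(channel, grid=4, hide_prob=0.25, fill=0, rng=None):
    """Hide each cell of a grid x grid partition with probability hide_prob.

    Returns the augmented copy and the list of hidden (row, column) cells.
    """
    rng = rng or random.Random()
    height, width = len(channel), len(channel[0])
    out = [row[:] for row in channel]  # leave the input untouched
    hidden = []
    for gy in range(grid):
        for gx in range(grid):
            if rng.random() < hide_prob:
                hidden.append((gy, gx))
                for y in range(gy * height // grid, (gy + 1) * height // grid):
                    for x in range(gx * width // grid, (gx + 1) * width // grid):
                        out[y][x] = fill
    return out, hidden


if __name__ == "__main__":
    frame = [[128] * 8 for _ in range(8)]
    brighter = gamma_correct(frame, 0.5)
    augmented, hidden_cells = hide_and_seek(brighter, rng=random.Random(0))
    print(len(hidden_cells), "of 16 patches hidden")
```

For a colour frame, the same functions would be applied to each of the three channels, with the hidden cells shared across channels.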

Model construction
The proposed framework uses a modified SSD, which is a feed-forward convolutional neural network based on VGG16. The structure consists of a truncated VGG16 with an additional convolutional structure attached to its end. It eliminates the need for the RPN used in faster R-CNN, thereby increasing the detection speed drastically [31]. Instead of an RPN, it uses multi-scale features and default boxes to bring its prediction accuracy on par with the average accuracy obtained by faster R-CNN [32]. The entire architecture of the proposed system is given in Figure 1. The input is a colour image I, which is fed into the initial base network consisting of the five convolutional blocks of VGG16, which detect edges, blobs, textures and object parts. The sixth and seventh fully connected layers of VGG16 are replaced with convolutional layers, while the eighth fully connected layer and the dropout layers are removed. ReLU is the activation function used at every convolutional layer except the output layer.
The feature map produced by a convolutional layer is
$$Y = \mathrm{ReLU}(W * I) \qquad (1)$$
where Y is the feature map, W is the weight associated with the input I, and $*$ denotes convolution. Each of these convolutional layers, of size $m \times n$ with $p$ channels, produces a feature map to which a prediction (classification) layer of size $3 \times 3 \times p$ is applied. The values are produced for each location of the feature map, yielding $(c + 4)z$ outputs per location, where $c$ is the number of class scores and 4 is the number of box offsets for each of the $z$ filters. The feature map is then processed by the subsequent convolutional layers up to the final layer. Appropriate default boxes must be selected for good prediction; hence each ground truth box is matched with the default boxes using the Jaccard overlap with a threshold of 0.5. This yields a large number of negative samples, of which only those with the highest confidence loss are selected, giving a positive-to-negative ratio of 1:3. Finally, a sigmoid function is used at the output layer for binary classification of the frames, and non-maximum suppression (NMS) produces the final object detections.
The weights of the network are initialized using the He initialization technique, which draws each weight from a zero-mean normal distribution with variance 2/fan_in:
$$W \sim \mathcal{N}\left(0, \frac{2}{\mathrm{fan\_in}}\right) \qquad (2)$$
where fan_in represents the number of input units. It provides a controlled initialization, thereby increasing the convergence rate. Learning rate, momentum, dropout and weight decay are the four hyperparameters considered for tuning, which is performed by the orthogonal array tuning method (OATM) based on Taguchi's factor table. This method yields a set of factor values that makes the model's convergence easier and faster. The hyperparameters are represented as factors, and their candidate values are defined as levels in the table. The orthogonal array table, with 4 factors and 9 rows, is generated using the Weibull++ software. The model is trained on the frames of the training dataset, which is divided into 10 batches.
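The default-box matching and negative sampling described above can be sketched as follows. This is a simplified illustration rather than the training code of this work: boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, the function names are hypothetical, and the additional SSD step that assigns every ground truth box to its single best default box is omitted.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, ground_truth, threshold=0.5):
    """Indices of default boxes whose best overlap exceeds the threshold."""
    positives = []
    for index, box in enumerate(default_boxes):
        best = max((jaccard(box, gt) for gt in ground_truth), default=0.0)
        if best > threshold:
            positives.append(index)
    return positives


def mine_hard_negatives(confidence_loss, positives, ratio=3):
    """Keep the ratio * len(positives) negatives with the highest loss."""
    positive_set = set(positives)
    negatives = [i for i in range(len(confidence_loss))
                 if i not in positive_set]
    negatives.sort(key=lambda i: confidence_loss[i], reverse=True)
    return negatives[:ratio * len(positives)]


if __name__ == "__main__":
    defaults = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    truth = [(0, 0, 10, 10)]
    positives = match_default_boxes(defaults, truth)
    print(positives, mine_hard_negatives([0.1, 0.2, 0.9], positives))
```

Keeping only the highest-loss negatives bounds the class imbalance at 1:3 while still presenting the network with the background boxes it currently misclassifies most confidently.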
The model is trained for 90 epochs with each set of hyperparameters obtained by OATM, and the accuracy for each set is estimated. The row with the highest accuracy is selected, and its values are the optimal values to which the hyperparameters are set. Every batch is iterated for 4 epochs, and the average of these four experiments is taken as the accuracy of a single set of hyperparameters.
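The row selection described above can be illustrated with a small sketch of a 9-run, 4-factor, 3-level orthogonal array. The array is built here with modular arithmetic instead of the Weibull++ software, so its row order may differ from Table 1. The level values in `FACTOR_LEVELS` are hypothetical placeholders, not the values used in this work, and `evaluate` stands for any user-supplied function that returns the accuracy of one setting.

```python
from collections import Counter
from itertools import product


def l9_orthogonal_array():
    """Build a 9-run, 4-factor, 3-level orthogonal array (levels 0..2)."""
    return [(a, b, (a + b) % 3, (a + 2 * b) % 3)
            for a, b in product(range(3), repeat=2)]


def is_orthogonal(array, levels=3):
    """True if every pair of columns holds each level pair equally often."""
    n_columns = len(array[0])
    for first in range(n_columns):
        for second in range(first + 1, n_columns):
            counts = Counter((row[first], row[second]) for row in array)
            if len(counts) != levels ** 2 or len(set(counts.values())) != 1:
                return False
    return True


# Hypothetical level values, for illustration only.
FACTOR_LEVELS = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "momentum": [0.8, 0.9, 0.95],
    "dropout": [0.3, 0.4, 0.5],
    "weight_decay": [1e-5, 1e-4, 1e-3],
}


def candidate_settings(factor_levels):
    """Translate each array row into a dictionary of hyperparameter values."""
    names = list(factor_levels)
    return [{name: factor_levels[name][level]
             for name, level in zip(names, row)}
            for row in l9_orthogonal_array()]


def best_setting(settings, evaluate):
    """Return the setting for which evaluate(setting) is highest."""
    return max(settings, key=evaluate)


if __name__ == "__main__":
    for setting in candidate_settings(FACTOR_LEVELS):
        print(setting)
```

With four factors at three levels, a full grid would need 81 training runs; the orthogonal array covers every pair of levels of every two factors in only 9 runs.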
The accuracies and the corresponding values of the selected hyperparameters are presented in Table 1. Level 5 in Table 1 obtains the highest accuracy, and hence its values are the optimal values for this phase of SSD. The specifications of the constructed model, such as the kernel size, the input and output dimensions of the feature maps of every layer, and the total parameters used in every layer, are given in Table 2. The number of parameters and the associated computation of each convolution layer are calculated as
$$\mathrm{parameters} = \mathrm{kernel\_size}^2 \times \mathrm{channel}_{in} \times \mathrm{channel}_{out}, \qquad \mathrm{computation} = \mathrm{parameters} \times X \times Y \qquad (3)$$
where kernel_size, channel_in and channel_out are the kernel size of the weight filter, the number of input channels and the number of output channels of each convolution layer, and X and Y represent the horizontal and vertical dimensions of the feature maps of each convolution layer. The amount of computation is directly proportional to the number of parameters in every layer of the network [34]. Therefore, reducing the number of parameters in both the convolution and fully connected (FC) layers reduces the computational cost. There are various strategies to decrease the model size, such as pruning [35] and quantization [36]; pruning is selected for our work as it has yielded promising results in various applications [37].
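As a worked illustration of the parameter and computation counts discussed above, the following sketch applies the standard convolution-layer counts to the first two 3 × 3 layers of VGG16. The function names are hypothetical, bias terms are counted optionally, and the computation is expressed in multiply-accumulate operations.

```python
def conv_parameters(kernel_size, channels_in, channels_out, bias=True):
    """Number of trainable parameters of one square-kernel convolution."""
    weights = kernel_size * kernel_size * channels_in * channels_out
    return weights + (channels_out if bias else 0)


def conv_multiply_accumulates(kernel_size, channels_in, channels_out, x, y):
    """Multiply-accumulate count for an output feature map of size x by y."""
    weights = kernel_size * kernel_size * channels_in * channels_out
    return weights * x * y


if __name__ == "__main__":
    # First two convolution layers of VGG16 on a 300 x 300 input.
    for c_in, c_out in [(3, 64), (64, 64)]:
        print(c_in, c_out,
              conv_parameters(3, c_in, c_out),
              conv_multiply_accumulates(3, c_in, c_out, 300, 300))
```

Removing a filter lowers channels_out of its own layer and channels_in of the following layer, so both counts shrink in two layers at once.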

Pruning
The SSD network consists of a set of N convolutional layers, where the i-th convolutional layer is denoted by $N_i$. Pruning is defined as the removal of filters or parameters that are considered less important. Hence, the entire set of filters of a layer is divided into two subsets, which contain the important and the less important filters respectively. The numbers of important and less important filters are denoted by K and L respectively, where K + L = Q is the total number of filters. Inspired by [38], we perform pruning in the same way, where filter pruning in general is formulated as
$$\min_{\delta_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{Q} \delta_{ij}\, \mathcal{L}(w_j^i) \qquad (4)$$
$$\text{s.t.} \quad \sum_{j=1}^{Q} \delta_{ij} = L \qquad (5)$$
In (4), $w_j^i$ is the j-th filter of layer $N_i$, and $\delta_{ij}$ is 1 if the filter is assigned to the less important set and 0 otherwise. The importance of the filter is measured by $\mathcal{L}(\cdot)$. Hence the objective is to minimize (4) so that the L least important filters are removed. As every feature map of layer $N_i$ plays a different role, (4) is reformulated as
$$\min_{\delta_{ij}} \sum_{i=1}^{N} \sum_{j=1}^{Q} \delta_{ij}\, \mathbb{E}_{I \sim D(I)}\left[\hat{\mathcal{L}}\left(Y_j^i(I,:,:)\right)\right] \qquad (6)$$
where Y represents the feature maps and I is the input image sampled from the distribution D(I), such that
$$\mathcal{L}(w_j^i) \propto \mathbb{E}_{I \sim D(I)}\left[\hat{\mathcal{L}}\left(Y_j^i(I,:,:)\right)\right] \qquad (7)$$
The evaluation of the filter is defined as
$$\hat{\mathcal{L}}\left(Y_j^i(I,:,:)\right) = \mathrm{Rank}\left(Y_j^i(I,:,:)\right) \qquad (8)$$
Singular value decomposition is applied, where
$$Y_j^i(I,:,:) = \sum_{k=1}^{r} \sigma_k m_k n_k^{T} \qquad (9)$$
such that
$$Y_j^i(I,:,:) = \sum_{k=1}^{r'} \sigma_k m_k n_k^{T} + \sum_{k=r'+1}^{r} \sigma_k m_k n_k^{T} \qquad (10)$$
where $r = \mathrm{Rank}(Y_j^i(I,:,:))$, $r' < r$, and $\sigma_k$, $m_k$ and $n_k$ are the top-k singular value, left singular vector and right singular vector respectively. Thus, the rank of the feature map generated by every filter is calculated, the filters are arranged in decreasing order of rank, and the least important (bottom-most) filters are eliminated. The model is trained again with the hyperparameters obtained by OATM for all the factor values. The optimal set of hyperparameters obtained in this phase differs from the earlier one because the network has been pruned. The HRank algorithm is applied again, and the process is iterated until the loss is similar to that obtained by the model without pruning. The entire working model of the pruning methodology is given in Figure 2.
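The rank-based selection of filters can be sketched in pure Python as follows. This is only an illustration of the idea, not the HRank reference implementation: the rank is computed here by Gaussian elimination with a hypothetical tolerance `tol` instead of a singular value decomposition, feature maps are plain nested lists, and the function names are assumptions.

```python
def matrix_rank(matrix, tol=1e-9):
    """Rank of a 2-D list via Gaussian elimination with partial pivoting."""
    rows = [[float(value) for value in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


def average_ranks(feature_maps):
    """feature_maps[image][filter] is a 2-D map; return mean rank per filter."""
    n_filters = len(feature_maps[0])
    totals = [0.0] * n_filters
    for maps_for_image in feature_maps:
        for j, feature_map in enumerate(maps_for_image):
            totals[j] += matrix_rank(feature_map)
    return [total / len(feature_maps) for total in totals]


def filters_to_prune(ranks, n_prune):
    """Indices of the n_prune filters with the lowest average rank."""
    order = sorted(range(len(ranks)), key=lambda j: ranks[j])
    return sorted(order[:n_prune])


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    constant = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    ranks = average_ranks([[identity, constant]])
    print(ranks, filters_to_prune(ranks, 1))
```

In practice the rank would be computed on tensors with a numerical library, and the average would be taken over a small batch of input images for every layer.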

Training and testing process
The frames are resized to 300×300 before being fed into the modified SSD network. The dataset is divided into a training set and a testing set in the ratio of 70:30. The training set consists of 47,299 frames, of which the validation set is a subset containing 5% of the training frames (2364 frames) used to tune the hyperparameters of the network. The testing set contains 20,271 frames. The weights of the network are initialized using the He initialization technique. The frames of the training set are fed into the network, and the experiments are run for all the levels of Table 1 (OATM). The model is then pruned by removing the less important parameters of the network using the HRank algorithm. The network is retrained with the set of hyperparameters obtained by OATM. The process is iterated until the loss is similar to that obtained by the model before pruning. The validation and test sets are passed through the network after training; the experiment is run for 90 epochs, and the mean average precision (mAP) is taken for each level. The highest mAP obtained for the test set is 87.21% before pruning and 85.82% after pruning. Pruning achieves a compression rate of 42%, which reduces the testing time to 2.1 seconds from the 4.5 seconds of the un-pruned model.

Evaluation
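The predicted values and the ground truth values of a bounding box are represented as $p = \{p_{cx}, p_{cy}, p_w, p_h\}$ and $g = \{g_{cx}, g_{cy}, g_w, g_h\}$ respectively. The loss function is the total loss calculated by summing the classification and the regression losses, which is denoted by
$$L = \frac{1}{N}\left(L_{cls} + L_{reg}\right)$$
where the subscripts cx and cy denote the centre of the bounding box, p is the predicted value, g is the ground truth of the bounding boxes and N is the number of matched default boxes. The regression loss is calculated using the formula given by [39, 40]:
$$L_{reg} = \sum_{i=1}^{N} \sum_{k \in \{cx, cy, w, h\}} \mathrm{smooth}_{L1}\left(p_k^{(i)} - g_k^{(i)}\right)$$
where
$$\mathrm{smooth}_{L1}(z) = \begin{cases} 0.5 z^2, & |z| < 1 \\ |z| - 0.5, & \text{otherwise} \end{cases}$$
The classification loss is the cross-entropy loss between the predicted class scores and the ground truth labels. The average precision is the area under the precision-recall curve,
$$AP = \int_0^1 p(r)\, dr$$
where p and r denote precision and recall respectively. mF1 is the mean of the F1 scores, where F1 is given by
$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
and precision and recall are given by
$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}$$
where TP, FP and FN denote true positives, false positives and false negatives respectively.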
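The smooth L1 term and the precision, recall and F1 measures defined above can be checked numerically with a short sketch. The function names are hypothetical, and the regression loss is the simplified form that sums smooth L1 over the (cx, cy, w, h) differences of matched boxes.

```python
def smooth_l1(z):
    """Quadratic near zero, linear for |z| >= 1."""
    return 0.5 * z * z if abs(z) < 1 else abs(z) - 0.5


def regression_loss(predicted, ground_truth):
    """Sum smooth L1 over (cx, cy, w, h) for matched pairs of boxes."""
    return sum(smooth_l1(p - g)
               for p_box, g_box in zip(predicted, ground_truth)
               for p, g in zip(p_box, g_box))


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); each is 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    print(regression_loss([(0.5, 0.0, 0.0, 3.0)], [(0.0, 0.0, 0.0, 0.0)]))
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```

The linear branch of smooth L1 keeps a badly localized box from dominating the gradient, while the quadratic branch gives a smooth minimum for small errors.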

Power supply management
The output of the pruned SSD is fed to the Arduino microcontroller, which is connected to the electrical supply of a room or any other indoor environment. The controller enables the power supply in and around the location (2 m) where a human is detected and disables it when no human is detected. Therefore, the electrical resource is used only when and where it is actually needed, which conserves it efficiently. Using the proposed framework, the average monthly electricity consumption of a residential environment is reduced from 90 to 72 units (kWh), a saving of 18 kWh or one fifth (20%) of the total electricity consumption.
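The host-side switching logic that forwards the detector output to the microcontroller can be sketched as below. Everything in this sketch is an assumption made for illustration: the `PowerController` class, the "ON" and "OFF" command strings, and the `send` callable (which in a real deployment might write to a serial port connected to the Arduino) are hypothetical. The hold-off of several frames before switching off is also an addition that guards against flicker from a single missed detection; setting `hold_off_frames=1` reproduces the immediate switch-off described above.

```python
class PowerController:
    """Turn a zone on when a human is seen and off after a hold-off period."""

    def __init__(self, send, hold_off_frames=10):
        self.send = send                      # callable that forwards a command
        self.hold_off_frames = hold_off_frames
        self.frames_since_seen = 0            # frames since the last detection
        self.power_on = False

    def update(self, human_detected):
        """Process one frame result and return the current power state."""
        if human_detected:
            self.frames_since_seen = 0
            if not self.power_on:
                self.power_on = True
                self.send("ON")
        elif self.power_on:
            self.frames_since_seen += 1
            if self.frames_since_seen >= self.hold_off_frames:
                self.power_on = False
                self.send("OFF")
        return self.power_on


if __name__ == "__main__":
    controller = PowerController(print, hold_off_frames=2)
    for detected in [False, True, True, False, False]:
        controller.update(detected)
```

Commands are sent only on state changes, so the link to the microcontroller carries one message per switch event and not one per frame.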

RESULTS AND DISCUSSIONS
The prediction accuracies of the modified SSD model based on the OATM technique for both datasets are given in Table 3. The validation and test accuracies are measured, and the prediction accuracy achieved by the modified SSD is found to be very close to that of the original SSD, while the training time is greatly reduced, thereby speeding up processing drastically. The average loss on the validation and test sets for the modified SSD (without pruning) is 13.475% for both datasets, as shown graphically in Figure 3. SSD with other pruning methods, such as sparse structure selection and generative adversarial learning (GAL), was also implemented for our prediction task, but SSD with HRank pruning outperforms the other techniques. The results of the various pruning techniques in terms of accuracy, compression rate and floating-point operations (FLOPs) are presented in Table 4. Our model trains faster than the model created by the multi-layer pruning framework [41], as the latter implements three stages of pruning only after completely training the network from scratch, which increases the training time many fold, whereas our model is initially trained using OATM, which reduces the training time drastically. Secondly, our model uses the HRank algorithm to focus on low-rank feature maps rather than eliminating zero weights based on sparsity statistics, with a negligible loss of 0.83% at 42.01% compression. As our framework uses low-resolution indoor CCTV images, which are dull, the accuracy is affected by further pruning, and hence we stop pruning at this level. Although the accuracy of the pruned SSD is slightly lower than that of the traditional SSD, this difference is small compared with the drastic improvement in the processing speed of the pruned SSD.
Pruning reduces the model's complexity, which in turn reduces the average prediction time on the test data from 4.4 seconds for the un-pruned SSD to 2.1 seconds for the pruned SSD. The prediction speed and the loss of the SSD with and without pruning are given in Figure 4. The processing time on the test data set is compared with that of various earlier models, and the results are presented in Figure 5. Among all the techniques, the proposed modified pruned SSD performs best, yielding the lowest processing time of 2.2 seconds on average. The model detects the human along with the location, represented by a bounding box, as shown in Figure 6. This is then processed by the controller, and the results of the controller are given in Figure 7. As the rotation of a fan (when power is enabled) would not be clearly visible in an image, we use two bulbs in Figure 7, one representing the light and the other representing the fan.

CONCLUSION
UEPMS is one of the most essential services in day-to-day life owing to the depletion of resources. Unlike existing sensor-based methods, this system uses existing CCTV footage to detect humans and enables the power supply only in the location (2 m) where humans are detected. The system uses a modified SSD for human detection with a specific hyperparameter tuning method to decrease the training time of the model. The model is further pruned by the HRank algorithm to decrease the computational cost, thereby increasing the processing speed of the network. An Arduino microcontroller is used to manage the power supply of the system. The proposed architecture achieves an average prediction accuracy of 85.82% at a compression rate of 42% of the original network. With the monthly consumption reduced from 90 to 72 kWh, the proposed system saves about one fifth of the total electricity consumption. As the system is developed only for indoor environments, occlusion handling in larger places such as malls or in outdoor environments could be taken up as a future direction.