Deep-learning based single object tracker for night surveillance



INTRODUCTION
The role of video surveillance is to provide a protective means by monitoring and analyzing any abnormality in a scene. It is becoming ever more important with the increasing number of crimes. Crime can take place at any time of day, but it is more prevalent at night, especially after midnight. An automated video surveillance system can provide continuous, around-the-clock monitoring with minimal dependency on security officers.
Over the past decades, research in automated video surveillance has evolved tremendously, and significant progress can be observed through the availability of many commercial products in the market. Thanks to new breakthroughs in software technology, it has become more effective and affordable. The key technology behind the effectiveness of these systems is the ability to detect and track moving objects even in dark environments, especially at night.
Both object detection and object tracking are fundamental components of an automated video surveillance application. The object detection task is to detect the presence of an object of interest in a video frame, while object tracking connects and analyzes the object's movements across successive frames. The information derived from the tracker can be used to further analyze and deduce object activities [12]. Motivated by the capability of convolutional neural networks (CNNs), this paper proposes a method for online tracking of an object of interest for night surveillance applications through a deep-learning approach. A network with three convolutional layers and three fully connected layers is used to model the object appearance, as proposed in [13]. The fully connected layers are updated online to cater for changes in the target object's appearance as it moves around the scene under different lighting conditions. Various hyperparameters for online learning are investigated, including the choice of optimization algorithm, the online learning rate and the training sample ratio, to find the optimal tracker setup. The main contributions of this work are:
- An online target tracking framework for night surveillance video that uses a deep-learning approach to dynamically represent the target appearance model.
- A study of the impact of online learning hyperparameters on the overall tracking accuracy.
The remainder of this paper is organized as follows: Section 2 discusses some related works on visual object tracking. Section 3 describes the proposed method, followed by experimental results and discussion in Section 4. Finally, Section 5 concludes all the research findings.

RELATED WORKS
This section discusses the general approach to visual object tracking, followed by specialized trackers for night surveillance applications and the evolution of object tracking algorithms towards deep learning. A good object tracker is an algorithm that provides accurate object localization with a consistent tracking label across successive frames. Object tracking has been an active research field for the past several decades and has demonstrated good progress in different scenarios and applications. Most tracking algorithms follow the tracking-by-detection paradigm, whereby the object of interest is detected in every frame and the detections are used to update the tracking states of the object. This approach depends heavily on the detection accuracy; thus, an improvement in the detection algorithm leads to better tracking accuracy. Among others, some good tracking-by-detection algorithms are presented in [14][15][16][17][18][19][20]. Some of these approaches function well under good lighting conditions; however, their performance deteriorates as the environment becomes darker, as in night surveillance applications.
Previously, one of the common approaches to improve tracking performance for night surveillance was to introduce a preprocessing module to enhance image quality in underexposed and low-contrast environments. Examples of such preprocessing steps are histogram equalization, histogram specification and intensity mapping. Another approach is to analyze the contrast level so that object detection is improved before tracking is performed. This is based on the assumption that the human visual system depends on the spatial relation of a neighbourhood to its background. Huang et al. [1] used contrast-change information between successive frames to improve object detection accuracy in night video applications. Local contrast is computed by dividing the local standard deviation of image intensity by the local mean intensity; the object is then detected by thresholding the contrast change between successive frames. The computation is quite fast, but the local contrast information indicating the presence of an object of interest might be misleading, as the background itself may contain high local contrast. Conversely, the object might have an appearance similar to the background, producing low local contrast. Later, in [21], Huang et al. proposed motion prediction and spatial nearest neighbour data association to further suppress false detections. In [2], Wang et al. improved Huang's CC model by introducing salient contrast change (SCC), which involves two more steps: online learning and analysis of the detected object trajectories. By applying a threshold on the contrast change output, it is more sensitive to slight changes in the lighting level. Nazib et al. [22] multiplied Shannon's entropy estimation with their own contrast estimation to produce an illumination-invariant representation.
In [23], vehicles in night surveillance videos are detected by computing HOG features as input to a support vector machine (SVM), which classifies each detected object as a vehicle or not, before a Kalman filter is applied to track the vehicles.
Apart from the previously mentioned approaches, a few studies have exploited camera technology to increase detection and tracking accuracy in night environments. In [24], the researchers used far-infrared (FIR) cameras to obtain foreground information through a background subtraction technique. In [25], a near-infrared camera was used to detect pedestrians with an adaptive preprocessing technique for the night environment. Another study [26] fused two different camera types, a visible-light camera and an FIR camera mounted on a car, to detect pedestrians during both day and night. Even with the help of improved camera technology, the total cost of such systems rises because of the more complex sensing hardware.
Deep learning was popularized by the introduction of AlexNet in 2012, when it won the ImageNet image classification competition [27]. Since then, deep learning has been widely applied in many applications, overshadowing traditional machine learning approaches such as SVM and artificial neural networks (ANN). In [28], a CNN is used to detect human presence in night surveillance videos as an input to an object tracker. The proposed network consists of five convolutional layers and three fully connected layers. The input image is first resized to 183x119 before histogram equalization is applied for the human detection task. That method is closer to human/background classification in night scenes than to a tracking problem. Another early effort in applying CNNs to object tracking is [29], where an online tracking framework based on multi-domain representations is proposed. Its architecture consists of multiple shared layers, referred to as domain-independent layers, while only the classification layer is domain-specific. The shared layers are trained offline using multiple annotated video sequences, while a classification layer is trained separately for each domain. When a new sequence or domain is given, a new classification layer is constructed to compute the target score from the new input. Then, the fully connected layers within the shared layers and the new classification layer are updated periodically so that the network adapts to the new domain. In [30], multiple CNNs are maintained in a tree structure (TCNN) to represent multi-modal target appearance; the CNN models in the branch whose appearance is most similar to the current target estimate are updated. In [3,13], general tracking frameworks for thermal infrared videos have been proposed.
Thermal images exhibit similar properties to night surveillance images, where the target object usually has low contrast and negligible texture. In [3], multiple models are maintained to represent the target appearance in different cases, such as temporary occlusion. During network updates, parent nodes are replaced by the new node so that there is no redundancy in the pool of target appearance models. In [13], a Siamese approach is used, in which pairs of patches are compared to find the most likely location of the target object in the current frame.

PROPOSED METHOD
Figure 2 illustrates the overall workflow of the proposed tracking methodology. In the first frame, the tracker is initialized using a single ground truth bounding box that encloses the object. Positive and negative candidates are then generated from the given bounding box. Positive samples are patches or subimages that represent the object of interest, while negative samples are subimages that belong to the background. Let n_t and m_t be the numbers of positive and negative training samples, respectively. Positive training samples are generated by randomly shifting the initial bounding box within a small distance (a shifted patch must have at least 80% overlap with the original bounding box), and negative samples are generated by randomly shifting the initial bounding box such that the overlap is minimal (at most 10% with respect to the initial bounding box). After generating all the training samples, appearance features are extracted using the CNN to produce a feature vector of length 512. Both sets of positive and negative feature vectors are then used to train the fully connected layers, resulting in the trained model.
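The overlap-constrained sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function names and the shift distributions are assumptions, and boxes are given as [x, y, w, h]:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def generate_samples(box, n_pos=500, n_neg=1000, rng=None):
    """Randomly shift `box` until enough positives (IoU >= 0.8) and
    negatives (IoU <= 0.1) are collected; in-between shifts are discarded."""
    if rng is None:
        rng = np.random.default_rng(0)
    pos, neg = [], []
    w, h = box[2], box[3]
    while len(pos) < n_pos or len(neg) < n_neg:
        # small shifts tend to yield positives, large shifts negatives
        scale = rng.choice([0.05, 2.0])
        dx, dy = rng.normal(0, scale * w), rng.normal(0, scale * h)
        cand = [box[0] + dx, box[1] + dy, w, h]
        o = iou(box, cand)
        if o >= 0.8 and len(pos) < n_pos:
            pos.append(cand)
        elif o <= 0.1 and len(neg) < n_neg:
            neg.append(cand)
    return pos, neg
```

In practice the sampler would also crop and resize each candidate patch to the network input size; only the box geometry is shown here.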
During online tracking, the process starts by generating candidate sample locations pivoted on the last known location of the object. The total number of samples extracted is smaller than the number of training samples, to speed up tracking. The features are then extracted and evaluated with the trained network. The network outputs the probabilities that the patch belongs to the foreground object or to the background. The locations of the n samples with the highest foreground probabilities are then used to update the estimated location of the tracked object in the current frame. Finally, the network is retrained, or updated, periodically to capture the changes in the object's appearance as it moves around the scene under different lighting exposures and backgrounds.

Network architecture
The network architecture consists of three convolutional layers and three fully connected (FC) layers. The weights and biases of the three convolutional layers are obtained from VGG-M [31], which was trained on the ImageNet dataset [32]. VGG-M is an eight-layer network in which the first five layers are convolutional layers, functioning as feature extractors, and the last three layers are dense FC layers. The original input size of VGG-M is 224x224; however, the proposed network uses only the first three convolutional layers with an input size of 75x75. Thus, all training and testing samples are resized to match this input size. The full network architecture used in this work is illustrated in Figure 3.
The first convolutional layer consists of 96 filters with a 7x7 kernel and a stride of 2 in both x and y directions, followed by a ReLU activation function, local response normalization and 3x3 max pooling, producing feature maps of size 17x17x96. The second convolutional layer consists of 256 filters with a 5x5 kernel, followed by a ReLU activation function, local response normalization and 3x3 max pooling, producing 3x3x256 feature maps. Finally, the third layer consists of 512 filters with a 3x3 kernel, producing a 1x1x512 feature map.
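The feature-map sizes above follow from the standard convolution/pooling size formula. The sketch below verifies the chain; note that the strides of the pooling layers and of the second convolution are not all stated explicitly in the text, so a stride of 2 is assumed where needed to reproduce the reported sizes:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * pad - kernel) // stride + 1

s = 75                         # input patch resized to 75x75
s = conv_out(s, 7, stride=2)   # conv1: 96 filters, 7x7, stride 2 -> 35
s = conv_out(s, 3, stride=2)   # 3x3 max pooling (stride 2 assumed) -> 17
assert s == 17                 # conv1 block output: 17x17x96
s = conv_out(s, 5, stride=2)   # conv2: 256 filters, 5x5 (stride 2 assumed) -> 7
s = conv_out(s, 3, stride=2)   # 3x3 max pooling (stride 2 assumed) -> 3
assert s == 3                  # conv2 block output: 3x3x256
s = conv_out(s, 3, stride=1)   # conv3: 512 filters, 3x3 -> 1
assert s == 1                  # conv3 output: 1x1x512 feature vector
```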
Both the positive and negative feature vectors are then used to train the three FC layers. The final output from the last softmax layer is a pair of probabilities representing the likelihood that the input image patch belongs to the tracked object and the likelihood that it belongs to the background. Initially, all FC parameters are randomly initialized. In this work, three optimization algorithms are compared for training the FC layers, gradient descent [33], Adam [34] and AdaGrad [35], with four learning rates: 0.00125, 0.001, 0.00075 and 0.0005.

Network learning parameters
In this work, only the last three FC layers undergo retraining so that the network adapts to changes in the object appearance. In the first frame, the weights of these layers are randomly initialized, while the biases are fixed to 0.05. The numbers of positive and negative samples, the initial learning rate and the number of epochs are set to 500, 1000, 0.0005 and 150, respectively. The cross-entropy loss (1) is used to train the network, where p is the true distribution, q is the predicted distribution and x ranges over the output classes:

H(p, q) = -Σ_x p(x) log q(x)    (1)

Since the network outputs two probabilities, foreground and background, there are two classes and the probabilities for each sample sum to 1. Let the true label be p(x=0) = y and p(x=1) = 1 - y, and the predicted probability be q(x=0) = ŷ and q(x=1) = 1 - ŷ, so that:

H(p, q) = -[y log ŷ + (1 - y) log(1 - ŷ)]    (2)

The loss function is then computed by averaging the cross entropy over all N input samples:

L = -(1/N) Σ_{n=1}^{N} [y_n log ŷ_n + (1 - y_n) log(1 - ŷ_n)]    (3)

During online learning, the number of training epochs is reduced to 75, while the other two parameters, the learning rate and the numbers of positive and negative samples, vary according to the best setup. Three optimizers, stochastic gradient descent, AdaGrad and Adam (adaptive moment estimation), are compared to find the optimal values of the model parameters (weights and biases) by minimizing the loss function.
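The averaged two-class cross entropy can be written in a few lines of numpy (an illustrative helper, not the authors' code; the clipping constant is an assumption to keep the logarithm finite):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average cross entropy over N samples for a two-class (fg/bg) output.
    y_true: true labels (1 = foreground), y_pred: predicted fg probability."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily, which is what drives the foreground/background separation during retraining.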

Optimizer #1: Stochastic gradient descent (SGD)
Gradient descent [33] is a popular optimization technique and has been widely used in network learning [28,29,36]. At time step t, gradient descent computes the gradient of the loss function with respect to the model parameters and uses it to update the network. The gradient is the vector of partial derivatives of the loss with respect to every weight and bias, averaged over the training samples:

g_{t,i} = (1/N) Σ_{n=1}^{N} ∂L_n/∂θ_{t,i}    (4)

where N is the number of training samples. Each weight and bias θ_i is then updated by subtracting the learning rate times the calculated gradient:

θ_{t+1,i} = θ_{t,i} - η g_{t,i}    (5)

where η is the learning rate; note that the same learning rate is applied to all parameter updates. The process is repeated until the loss function converges or the maximum number of epochs is reached. One gradient descent operation consists of one iteration over all training samples. Stochastic gradient descent (SGD) differs in that, instead of using the whole training set, it randomly selects a few training samples in each iteration to optimize the model parameters. This makes SGD computationally efficient and popular for online network training. Nevertheless, since SGD uses only a few samples, the path to convergence is noisy.
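The mini-batch update can be sketched on a toy problem (fitting the mean of some data by minimizing squared error). This is an illustration of the SGD step only, with assumed helper names, not the tracker's training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step(theta, grad_fn, data, lr=0.05, batch_size=8):
    """One SGD step: estimate the gradient on a random mini-batch only,
    then move theta against it (theta <- theta - lr * gradient)."""
    batch = data[rng.choice(len(data), size=batch_size, replace=False)]
    return theta - lr * grad_fn(theta, batch)

# toy example: minimize (theta - x)^2 over the data, i.e. fit its mean
data = rng.normal(5.0, 1.0, size=256)
grad = lambda theta, batch: np.mean(2 * (theta - batch))
theta = 0.0
for _ in range(500):
    theta = sgd_step(theta, grad, data)
```

Because each step sees only a small batch, theta keeps fluctuating around the optimum instead of settling exactly on it, which is the noisy convergence mentioned above.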

Optimizer #2: Adam (adaptive moment estimation)
The Adam optimizer [34] (adaptive moment estimation) computes a different learning rate for each parameter using estimates of the first and second moments of the gradient, i.e. the exponential moving average of the gradient and the uncentered moving average of its square. Compared to the SGD gradient step, it introduces three more hyperparameters, β_1, β_2 and ε, which are the exponential decay rate of the first moment, the exponential decay rate of the second moment, and a very small constant to prevent division by zero, respectively. For parameter i at time step t, with gradient g_{t,i}:

m_{t,i} = β_1 m_{t-1,i} + (1 - β_1) g_{t,i}    (6)
v_{t,i} = β_2 v_{t-1,i} + (1 - β_2) g_{t,i}^2    (7)

The moment estimates are bias-corrected before they are used to update the model parameters. This step is important because m_0 and v_0 are initialized to zero, which biases the estimates towards zero:

m̂_{t,i} = m_{t,i} / (1 - β_1^t)    (8)
v̂_{t,i} = v_{t,i} / (1 - β_2^t)    (9)

After estimating the moments, each parameter is updated as:

θ_{t+1,i} = θ_{t,i} - η m̂_{t,i} / (√v̂_{t,i} + ε)    (10)

where η is the learning rate. Note that the learning rate is now scaled by the ratio of the bias-corrected first moment to the square root of the second moment.
Since its introduction in 2015, the Adam optimizer has been widely used in network learning [37]. It has a fast convergence rate and is thus practical for training large models with large training sets.
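The update rule above can be sketched on a toy quadratic problem. This is a minimal illustration of the Adam steps, not the tracker's training code; the toy objective and step counts are arbitrary:

```python
import numpy as np

def adam(grad_fn, theta, steps=2000, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from bias-corrected estimates of the
    first moment (moving average of gradients) and the second moment
    (moving average of squared gradients)."""
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g            # first-moment moving average
        v = b2 * v + (1 - b2) * g ** 2       # second-moment moving average
        m_hat = m / (1 - b1 ** t)            # bias correction
        v_hat = v / (1 - b2 ** t)            # bias correction
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# toy example: minimize f(theta) = sum(theta^2), whose gradient is 2*theta
theta = adam(lambda th: 2 * th, np.array([1.0, -2.0]))
```

Because the step is normalized by the root of the second moment, the effective step size is roughly the learning rate regardless of the raw gradient magnitude, which is why Adam needs less manual tuning than plain SGD.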

Optimizer #3: Adagrad
The AdaGrad optimizer [35] is a gradient-based learning algorithm that computes a different learning rate for each parameter. AdaGrad performs small updates on parameters associated with frequently occurring features and large updates on parameters associated with infrequently occurring features. It achieves this by modifying the general learning rate in (5) based on the past gradients of parameter i. The gradient step in AdaGrad becomes:

θ_{t+1,i} = θ_{t,i} - η g_{t,i} / √(G_{t,i} + ε)    (11)

where G_{t,i} is the accumulated sum of the squares of the previous gradients with respect to parameter i up to time step t:

G_{t,i} = Σ_{τ=1}^{t} g_{τ,i}^2    (12)

Note that since the squared gradient values are all positive, the accumulated sum G_{t,i} keeps increasing during training, which causes the effective learning rate in (11) to shrink and eventually become infinitesimally small; at that point the optimizer can no longer learn anything new. Despite this weakness, AdaGrad still performs better than SGD because the learning rate does not have to be manually fine-tuned. AdaGrad was used at Google [38] to train large neural networks to recognize cats in YouTube videos, and in [39] to train GloVe word embeddings, as infrequent words require much larger updates than frequent ones.
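The accumulation and the shrinking effective rate can be sketched as follows, again on a toy quadratic rather than the tracker's actual training loop:

```python
import numpy as np

def adagrad(grad_fn, theta, steps=500, lr=0.5, eps=1e-8):
    """AdaGrad: divide the learning rate by the root of the accumulated
    squared gradients, so frequently-updated parameters get smaller steps."""
    G = np.zeros_like(np.asarray(theta, dtype=float))
    for _ in range(steps):
        g = grad_fn(theta)
        G = G + g ** 2                         # only grows over time
        theta = theta - lr * g / np.sqrt(G + eps)
    return theta

# toy example: minimize f(theta) = theta^2, whose gradient is 2*theta
theta = adagrad(lambda th: 2 * th, 3.0)
```

On this well-behaved problem the shrinking step size is harmless, but on a long training run the monotone growth of G is exactly what eventually stalls learning, as noted above.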

Learning rate
Choosing a learning rate can be difficult. Too small a learning rate leads to slow convergence, while too large a learning rate can hinder convergence, causing the loss function to fluctuate or the training to diverge. In this work, learning rates of 0.00125, 0.001, 0.00075 and 0.0005 are tested to find an optimal setup.

Object location estimation
Given an input frame during online tracking, the system estimates the object location by analyzing the output probabilities of the network. The network outputs two probabilities: (1) the probability that the input sample belongs to the foreground object and (2) the probability that it belongs to the background. The final object location is estimated as the weighted average of the five samples with the highest foreground probabilities, where each weight is based on the sample's probability value.
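The top-k weighted average described above can be sketched as follows (an illustrative helper with assumed names, not the authors' code; boxes are rows of [x, y, w, h]):

```python
import numpy as np

def estimate_location(boxes, fg_probs, top_k=5):
    """Weighted average of the top-k candidate boxes, weighted by their
    foreground probabilities. boxes: (N, 4) array of [x, y, w, h]."""
    idx = np.argsort(fg_probs)[-top_k:]        # indices of the top-k scores
    w = fg_probs[idx] / fg_probs[idx].sum()    # normalized weights
    return (boxes[idx] * w[:, None]).sum(axis=0)
```

Averaging several high-scoring candidates rather than taking the single best one smooths out per-frame localization noise in the network scores.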

RESULTS AND DISCUSSION
For validation purposes, 14 night-scene videos of size 352x288 have been collected. In each video, the tracked object is about 30x70 pixels, and the total number of frames across all videos is 3646. The chosen videos contain the challenges of varying lighting conditions, occlusion and move-stop-move behaviour. Snapshots of the three camera views are shown in Figure 4. The ground truth is generated manually by an expert in computer vision drawing the object bounding box in each frame.

Implementation details
The tracking code is implemented in Python with the TensorFlow library. The initial location of the tracked object is given as a bounding box ([x_0, y_0, width_0, height_0]). In the first frame, the learning rate, number of epochs, number of positive samples and number of negative samples are initialized to 0.0005, 150, 500 and 1000, respectively. Sample extraction in the first frame is the most important step, as it is the only ground truth known to the tracker. Figure 5 shows examples of positive and negative samples extracted in the first frame of three different test sequences. The tracker is then updated online periodically through weak supervision, as the ground truth for subsequent frames is not known.

Performance metric
To evaluate the performance of the night tracker, we use one of the VOT evaluation metrics, accuracy (Ac). Accuracy measures how well the tracked bounding box matches the ground truth box by computing the intersection-over-union (IoU) area; a higher overlap represents better tracking accuracy. The tracker is not re-initialized in the event of a tracking failure (where the IoU is zero). The accuracy is defined as:

Ac = (1/N_f) Σ_{i=1}^{N_f} |B_{T,i} ∩ B_{G,i}| / |B_{T,i} ∪ B_{G,i}|

where N_f denotes the number of frames in the test video, and B_{T,i} and B_{G,i} are the bounding boxes of the object in frame i from the tracker output and the ground truth, respectively.
Table 1 shows the accuracy comparison between the three optimizers: SGD, Adam and AdaGrad. For a fair comparison, the learning rate, number of positive samples and number of negative samples are fixed to 0.001, 50 and 100, respectively. Default values of Adam's hyperparameters β_1, β_2 and ε are set to 0.9, 0.999 and 1e-08, respectively. On average, the Adam optimizer produces the best accuracy, followed by AdaGrad. AdaGrad performs significantly better on Cam01-video08 than the other two optimizers, while SGD performs the worst on most of the test videos. This indicates that adaptive learning rate methods outperform a fixed learning rate. As the number of iterations for each training round is set to a minimum, SGD may fail to converge, which contributes to its poor performance. Sample frames with overlaid tracking output for Cam01-video08, Cam02-video02 and Cam03-video02 are shown in Figure 6. Green, blue and magenta bounding boxes correspond to the outputs of the SGD, Adam and AdaGrad optimizers, respectively. In Figure 6, the first row corresponds to Cam01-video08, in which the AdaGrad optimizer gives the highest accuracy.
Initially, all three optimizers produce good results, as shown in frame #2; eventually, the SGD model drifts into the background (frame #27), followed by the Adam model (frame #71). The second row shows Cam02-video02, in which Adam gives the best accuracy while the others give almost 0% accuracy (their bounding boxes are stuck in the background area, as it contains more texture than the tracked object). The third row corresponds to Cam03-video02, in which all three optimizers produce poor accuracy, possibly caused by the similarity between the foreground appearance and the background. Table 2 shows the accuracy comparison between four learning rates. In this experiment, Adam is chosen as the base optimizer, while the numbers of positive and negative samples are set to 50 and 100, respectively. On average, a learning rate of 0.00075 gives the best accuracy, followed by 0.0005; the results also indicate that as the learning rate increases, the average tracking accuracy decreases. Table 3 shows the accuracy comparison between four combinations of the numbers of positive and negative training samples used during the online update, with Adam and a learning rate of 0.00075 as the base setup. On average, a combination of 50 positive and 100 negative training samples returns the best accuracy. The number of negative samples is twice the number of positive samples, which caters for the larger background area compared to the concentrated foreground samples.
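The per-video accuracy figures discussed in this section come down to averaging frame-wise IoU, which can be sketched as (illustrative helpers, not the evaluation toolkit itself; boxes are [x, y, w, h]):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def accuracy(tracked, truth):
    """VOT-style accuracy: mean IoU over all frames of a sequence."""
    return float(np.mean([iou(t, g) for t, g in zip(tracked, truth)]))
```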

CONCLUSION
In conclusion, the proposed tracking scheme is able to track an object of interest in night surveillance videos. The Adam optimizer shows superior accuracy compared to SGD and AdaGrad on most of the test videos. The best learning rate is found to be 0.00075, achieved with a 2:1 training sample ratio between negative and positive samples. This tracker can thus be implemented in higher-level night surveillance applications.