A hierarchical RCNN for vehicle and vehicle license plate detection and recognition

ABSTRACT


INTRODUCTION
Vehicle and vehicle license plate tracking and recognition systems play an important role in intelligent transportation systems. They are commonly used in automatic driving, vehicle theft prevention, access control security, high-efficiency roadway services, and so on. Most of these systems are based on traffic image or video analysis, where computer vision techniques are commonly employed. Machine learning and deep learning techniques have recently brought substantial progress, demonstrating higher classification accuracy and robustness. Both deep learning and computer vision methods [1]-[15] have been developed for various vehicle-related applications, such as vehicle detection, vehicle classification, vehicle plate recognition, and road condition monitoring [16]. Compared with conventional computer vision methods, the newly developed deep learning techniques offer better generalization capacity and robustness to uncertainties, noise, and occlusion in images, at the cost of a higher computation load and a demand for larger sample sets. Several methods have been proposed to detect vehicles and vehicle license plates based on these techniques, such as the dirty number plate detection system [1], the license number plate recognition system based on a convolutional neural network (CNN) and a recurrent neural network (RNN) [2], the technique to improve the car plate recognition rate based on support vector machines (SVM) [3], the vehicle detection system based on a region convolutional neural network (RCNN) [4], and the vehicle detection and counting system using a convolutional regression neural network (CRNN) [5].
However, deep learning-based techniques face several challenges in vehicle and vehicle license plate detection and recognition. Firstly, fake objects in the complex environment around the vehicle, such as buildings, trees, pedestrians, and other surrounding objects, confuse the deep neural networks and reduce the detection accuracy. Secondly, weather and illumination conditions, such as strong sunlight, fog, rain, and shadow, blur the vehicles and license plates. Thirdly, human factors, such as the speed of the vehicle and driver behavior, also affect detection. All of the above make it difficult to achieve accurate and efficient detection.
In this paper, we propose a hierarchical RCNN system for vehicle detection and license plate recognition under complex traffic backgrounds. The system contains four steps: i) a high-level RCNN detects vehicles from traffic images or video; ii) a lower level RCNN locates the license plate within the detected vehicle regions; iii) a smaller RCNN is trained to detect individual characters in the detected license plate; iv) finally, an RCNN classifier recognizes each individual character. To reduce computation, the training data were cropped into smaller blocks. The main innovation of the proposed method lies in the hierarchical structure, with multiple-level networks for the sub-tasks. In contrast to the common all-in-one deep learning structure, the proposed method treats license plate recognition as multiple sub-tasks, which enables the network depth and structure to be optimized for each sub-task. Each RCNN is designed with careful consideration of the complexity of the problem to be solved at the corresponding level, so an RCNN of appropriate depth is chosen for each step. The experimental results show that the proposed system achieves higher speed and accuracy.
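The four steps above can be sketched as a chain of crops, where each stage only sees the region returned by the stage before it. This is a minimal illustration, not the implementation used in the paper: the names `Box`, `crop`, and `read_plates` are hypothetical, and the four stage functions are stand-ins for trained RCNNs.

```python
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]  # illustrative: a grayscale image as rows of pixel values


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given in the coordinates of the image it was detected in."""
    x: int
    y: int
    w: int
    h: int


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    return [row[box.x:box.x + box.w] for row in image[box.y:box.y + box.h]]


def read_plates(
    image: Image,
    detect_vehicles: Callable[[Image], List[Box]],
    detect_plates: Callable[[Image], List[Box]],
    detect_chars: Callable[[Image], List[Box]],
    classify_char: Callable[[Image], str],
) -> List[str]:
    """Run the four hierarchical steps and return one string per detected plate."""
    plates = []
    for vehicle_box in detect_vehicles(image):           # step i
        vehicle = crop(image, vehicle_box)
        for plate_box in detect_plates(vehicle):         # step ii, vehicle region only
            plate = crop(vehicle, plate_box)
            # step iii: order the character boxes from left to right
            char_boxes = sorted(detect_chars(plate), key=lambda b: b.x)
            # step iv: classify each character crop
            plates.append("".join(classify_char(crop(plate, b)) for b in char_boxes))
    return plates


# Stand-in stages (hypothetical): fixed boxes, and a "classifier" that just
# reads the top-left pixel value of the character crop.
image = [[x for x in range(12)] for _ in range(8)]
demo = read_plates(
    image,
    detect_vehicles=lambda img: [Box(2, 1, 8, 6)],
    detect_plates=lambda img: [Box(1, 3, 6, 2)],
    detect_chars=lambda img: [Box(3, 0, 3, 2), Box(0, 0, 3, 2)],
    classify_char=lambda img: str(img[0][0]),
)
```

Passing the stages as functions keeps the sketch independent of any particular network, which mirrors the point that a network of different depth can be chosen for each level.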
The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 explains the proposed method. Section 4 presents the experiments and analysis. Section 5 concludes the paper and indicates future work.

RESEARCH METHOD

Vehicle and license plate detection
Vehicle detection, a fundamental stage of an automatic driving system, has been widely studied in the computer vision literature, and many algorithms have been developed. According to the literature, vehicle detection is categorized into moving vehicle detection and static vehicle detection. In this paper, we focus only on moving vehicle detection. Feature-based methods, background subtraction, frame difference, optical flow, machine learning, and combined methods are used to detect moving vehicles. A method using optical flow to detect a moving vehicle was proposed in [6]. Color and texture features were used to reduce the effects of the complex background, and a fuzzy background subtraction was developed [7] for moving vehicle detection. The frame difference method for moving vehicle detection was improved in [8] using image contrast and a morphological filter. Optical flow was combined with k-nearest neighbor (KNN) to classify different kinds of moving vehicles [12]. An anti-tracking system [17] was proposed to detect vehicles based on Haar features and a modified adaptive boosting algorithm.
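For readers unfamiliar with the frame difference idea mentioned above, the plain baseline can be sketched as follows. This is the basic technique only, not the improved method of [8]. The function names and the threshold value are assumptions chosen for illustration.

```python
from typing import List

Frame = List[List[int]]  # illustrative: grayscale frame as rows of 0..255 values


def frame_difference_mask(prev: Frame, curr: Frame, threshold: int = 25) -> Frame:
    """Mark a pixel as moving (1) when its intensity changed by more than threshold.

    The threshold of 25 is an arbitrary illustrative value, not taken from the paper.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def moving_fraction(mask: Frame) -> float:
    """Fraction of pixels flagged as moving."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total if total else 0.0


prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 200, 10], [10, 10, 12]]
mask = frame_difference_mask(prev_frame, curr_frame)
```

Only the pixel that changed from 10 to 200 exceeds the threshold. The small change from 10 to 12 is treated as noise.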
The vehicle license plate, as an ID, uniquely identifies a vehicle, so automatic license plate detection and recognition is one of the important tasks in an intelligent driving system. Various methods have been developed. A method [10] was developed to detect vehicle plates using salient features, with a reported success rate of 93.1%. The rear vehicle lights were used to detect the range of the license plate, and the license plate was then identified using a histogram algorithm [11], which was validated only on vehicles from five countries, with an average success rate of 90.4%. An automatic license plate recognition system [12] was proposed using an updated YOLO, with a license plate recognition accuracy of 78%. However, accuracy remains an issue for real applications. Moreover, most of these methods have high computational requirements, because the system needs to search for vehicles and vehicle license plates in large, high-resolution images or videos captured from the road.

Region convolution neural network (RCNN)
Deep learning is a family of machine learning methods built on neural networks [18]. Various deep learning network structures have been developed, such as AlexNet [19], OverFeat [20], GoogLeNet [21], and ResNet [22]. RCNN is among the most popular deep learning models because of its significant success in image processing. The RCNN is made up of neurons that have tunable weights and biases. The input data can be images, the layers of RCNN  lanes and vehicles using CNN. Front collisions were predicted using a CNN with an adaptive region-of-interest [14]. Several recent studies [15], [23] developed license plate recognition systems based on RCNN. In this paper, a hierarchical RCNN is employed to obtain high accuracy and fast speed for vehicle license plate detection.

PROPOSED METHOD
Deep learning techniques are commonly hungry for computation resources. Compared to CNN, the RCNN has the advantage of higher computation efficiency. The traditional all-in-one method does not leave much room for modifications according to the real-world scenario, such as the task complexity, the features to be considered, the image quality, and the relationship among sub-tasks. This paper proposes a hierarchical structure that separates the all-in-one RCNN into multiple levels, with each level focusing on a single sub-task. Therefore, the network at each level is developed according to the real demand of its sub-task only. This lowers the complexity of the total task, which results in a smaller network size, a lower computation load, and higher total accuracy.
Taking vehicle plate recognition as the application, this paper employs RCNNs for 3 levels of sub-tasks: vehicle detection, plate detection, and character level detection and recognition. These 3 tasks differ in the complexity of the background and the object features. The main idea is to use RCNNs with different complexity levels for these tasks. Figure 1 depicts the proposed hierarchical structure of the solution for vehicle-plate-character detection and identification, where multiple RCNNs of feasible complexity are considered for the different tasks. The hierarchical structure is not limited to RCNN. Various pre-trained deep neural networks can be used in place of the RCNN. In this paper, the RCNN is considered just as an example to show the idea.
The vehicle level RCNN is the most complex one, because in smart traffic monitoring systems the vehicles are commonly found in busy traffic contexts. Detecting vehicles with acceptable accuracy is critical for total system performance. At this level, a 15-layer RCNN is employed, as shown in Figure 2, where the RCNN has 3 folds of convolution-pooling layers, which are pre-trained.
After the vehicle is detected by the vehicle level RCNN, the region of the vehicle is known. The plate level RCNN focuses only on this region instead of the whole image, which greatly reduces the computation load. Meanwhile, this strategy avoids false hits in plate detection, because other potential candidates in the background are excluded.
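As a small illustration of this cropping strategy, the sketch below maps a plate box found inside a vehicle crop back to full-image coordinates and reports how much of the image the plate level network has to search. The names `to_image_coords` and `area_fraction` are hypothetical, and the image size and box values are made-up numbers, not measurements from the experiments.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x: int  # left edge
    y: int  # top edge
    w: int  # width
    h: int  # height


def to_image_coords(vehicle_box: Box, plate_box_in_crop: Box) -> Box:
    """Translate a plate box from vehicle-crop coordinates to full-image coordinates."""
    return Box(
        vehicle_box.x + plate_box_in_crop.x,
        vehicle_box.y + plate_box_in_crop.y,
        plate_box_in_crop.w,
        plate_box_in_crop.h,
    )


def area_fraction(region: Box, image_w: int, image_h: int) -> float:
    """Share of the full image covered by region."""
    return (region.w * region.h) / (image_w * image_h)


# Illustrative numbers only: a 1280x720 frame with one detected vehicle.
vehicle = Box(400, 300, 320, 240)
plate_in_crop = Box(110, 180, 100, 30)
plate_in_image = to_image_coords(vehicle, plate_in_crop)
searched = area_fraction(vehicle, 1280, 720)
```

With these assumed numbers the plate level network searches one twelfth of the frame, which is the kind of saving the text refers to.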
Because detecting a license plate on a vehicle is a relatively simpler problem than detecting vehicles against the background, the plate level RCNN can be smaller (having fewer layers). Even fewer features are needed to detect and recognize characters in the plate image, so theoretically the character level RCNN can be smaller than the plate level one. However, the plate level RCNN is already very shallow (only 2 convolution layers), so reducing the layers further is not feasible, as it would leave the RCNN with too little fitting capacity. The character level RCNN therefore takes the same structure as the plate level one. It should be noted that if the vehicle level deep network had more convolution layers, one could expect the character level RCNN to be smaller than the plate level one.

EXPERIMENTS AND DISCUSSION
To validate the proposed method, experiments are designed based on the publicly accessible Canadian Institute for Advanced Research 10-class (CIFAR-10) database [24].

Experiment environment
All the experiments in this paper were completed on a Lenovo ThinkPad X1 laptop with 16 GB RAM and an Intel(R) Core(TM) i7-8550 CPU at 1 GHz. The operating system is Windows 10 x64. The software is MATLAB R2019a with the following toolboxes: computer vision, image processing, deep learning, and statistics and machine learning. The RCNNs are modified and trained based on the CIFAR-10 network [25].

Reduce the layers of RCNN
Matching the complexity of the RCNN to the task is the main advantage of the proposed strategy. This experiment aims to demonstrate the feasibility of reducing the CIFAR-10 Network. The original CIFAR-10 Network (15-layer RCNN) has 1 image input layer, 3 folds of Convolution-ReLU-Pooling layers as the middle layers, and the final layers, as shown in Figure 2. The first experiment removes 1 fold of Convolution-ReLU-Pooling layers from the middle layers. The modified network has the structure shown in Figure 3.
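The layer counts used in this section follow from simple bookkeeping: 1 input layer plus 3 layers per Convolution-ReLU-Pooling fold plus the final layers. For the total to be 15 with 3 folds, there must be 5 final layers. The sketch below reproduces the 15, 12, and 9 layer counts. The particular names of the five final layers are an assumption for illustration, since the text does not list them, and `build_layers` is a hypothetical helper.

```python
from typing import List

# Assumption: the text only implies that there are 5 final layers;
# this specific composition is illustrative.
FINAL_LAYERS = ["fc1", "relu_fc", "fc2", "softmax", "classification"]


def build_layers(n_folds: int) -> List[str]:
    """List the layer names of a network with n_folds Convolution-ReLU-Pooling folds."""
    if n_folds < 1:
        raise ValueError("at least one convolution fold is required")
    layers = ["image_input"]
    for i in range(1, n_folds + 1):
        layers += [f"conv{i}", f"relu{i}", f"pool{i}"]
    return layers + FINAL_LAYERS


# Map the number of folds to the total layer count.
layer_counts = {folds: len(build_layers(folds)) for folds in (3, 2, 1)}
```

Removing one fold always removes exactly three layers, which is why the networks compared here differ by 3 layers each.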
After training on the CIFAR-10 dataset, the weights of the first convolution layer for the original network (3 convolution layers) and the modified 12-layer RCNN (2 convolution layers) are shown in Figure 4 and Figure 5, respectively. The weights show that both networks have captured the basic features of the images in the dataset. Therefore, the modification of the structure does not significantly damage the feature extraction capacity.
The next experiment further reduces the middle layers to a single convolution layer (9-layer RCNN). The weights of the single-convolution network, shown in Figure 6, appear strongly random, which means the network failed to capture the image features. This is because the removal of two convolution layers has damaged the learning capacity. The accuracy over the training iterations in Figure 7 confirms this point. The 12-layer and 15-layer RCNNs have similar accuracy, which means that removing one fold of layers does not significantly affect the learning capacity. However, the accuracy of the 9-layer RCNN is significantly lower. Considering that the CIFAR-10 dataset has 10 classes, the accuracy of 10% obtained by the 9-layer RCNN is no better than a random guess. It should be noted that, owing to the transfer learning strategy, the RCNNs do not need to reach 100% accuracy when training the middle layers, as long as the middle layers can capture the image features. Table 1 shows the accuracy of the 3 RCNNs after training the middle layers on the 10 classes. As far as computation load is concerned, the modified RCNNs (12-layer and 9-layer networks) have higher efficiency. Considering both accuracy and time efficiency, the 12-layer RCNN is a suitable choice for the lower-level classifiers in the hierarchical structure in Figure 1.
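The random-guess argument can be made explicit. With balanced classes, as in CIFAR-10, a classifier that has learned nothing is expected to score 1/N on N classes. The helper names and the tolerance below are assumptions for illustration, and the 0.70 value in the usage lines is a made-up number, not a result from Table 1.

```python
def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy of uniform random guessing on balanced classes."""
    if n_classes < 1:
        raise ValueError("n_classes must be positive")
    return 1.0 / n_classes


def is_near_chance(accuracy: float, n_classes: int, tolerance: float = 0.02) -> bool:
    """True when accuracy is within tolerance of the chance level.

    The tolerance of 0.02 is an arbitrary illustrative choice.
    """
    return abs(accuracy - chance_accuracy(n_classes)) <= tolerance


nine_layer_is_chance = is_near_chance(0.10, n_classes=10)  # 10% on 10 classes
trained_is_chance = is_near_chance(0.70, n_classes=10)     # illustrative value only
```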

Vehicle license plate detection and recognition
This experiment focuses on the specific application of vehicle license plate detection and recognition. Table 2 shows the accuracy of each RCNN in the proposed structure. From Table 2, one finds that the RCNNs at the vehicle and plate levels have similar accuracy, although the classification capacity was lowered when a convolution layer was removed. This is because the problem complexity of plate detection is lower than that of vehicle detection, so a smaller RCNN can achieve similar accuracy. This also means there is some room to modify the all-in-one RCNN without significant damage to the accuracy. The character level accuracy is lower than that of the other two levels, although the characters have fewer features than the plate and the vehicle. This is because the classification problem at the character level becomes a multiple-class (36 classes) problem. The final layers of the character level network should be improved, and more training sets should be considered. The experiments demonstrated the reduction of the RCNNs and the performance of the total system, which validated the proposed strategy.
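The 36 classes at the character level correspond to the ten digits and the twenty-six letters. A minimal sketch of the label set and of turning predicted class indices into a plate string is given below. The ordering of the classes (digits first, then lowercase letters) and the helper name `decode_plate` are assumptions for illustration.

```python
import string
from typing import Iterable

# Assumed class ordering: '0'-'9' followed by 'a'-'z' (36 classes in total).
CHAR_CLASSES = string.digits + string.ascii_lowercase


def decode_plate(class_indices: Iterable[int]) -> str:
    """Convert predicted class indices (ordered left to right) into a plate string."""
    chars = []
    for index in class_indices:
        if not 0 <= index < len(CHAR_CLASSES):
            raise ValueError(f"class index out of range: {index}")
        chars.append(CHAR_CLASSES[index])
    return "".join(chars)


n_classes = len(CHAR_CLASSES)
example = decode_plate([3, 10, 35])
```

With 36 balanced classes the chance level drops to 1/36, so the character level is a harder classification problem than the two-class vehicle and plate levels even though the images are simpler.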

CONCLUSION
This paper proposed a hierarchical strategy for vehicle and vehicle license plate detection and identification based on RCNN. The complexity of each RCNN was matched to its task. In this way, RCNNs of multiple complexity levels can be employed in the same system. As an example, a sample vehicle detection system was developed for smart traffic monitoring, and a license plate recognition RCNN was then considered for vehicle identification.
The vehicle level RCNN, which has the highest complexity, was employed to detect the vehicle from the background. For this RCNN, the task can be considered a two-class (vehicle and background) classification problem. The detected vehicle region (a portion of the original image) was then passed to the license plate level RCNN, where the license plate was detected. This is also a two-class classification problem (license plate, with the vehicle body as background). In this way, the computation load is greatly reduced. Furthermore, the disturbances from outside the vehicle were excluded, which improved the success rate of the plate level RCNN. Finally, the detected license plate became the input of the character level RCNN, where the individual characters were detected and classified. This RCNN solved a multiple-class classification problem, where the numbers ('0' to '9') and letters ('a' to 'z') are the classification targets. Future work includes: (1) separating the total task into more levels and designing the network at each level according to the specific features of the corresponding sub-task; (2) improving the final layers of the character level RCNN to improve the accuracy of plate recognition.