Regional feature learning using attribute structural analysis in bipartite attention framework for vehicle re-identification

ABSTRACT


INTRODUCTION
Detection and re-identification of vehicles [1] have gained significant importance in computer vision because of their varied applications in video surveillance. In intelligent transportation, it is highly challenging to accurately perform vehicle re-identification (Re-ID) by retrieving images of vehicles captured by multiple non-overlapping cameras. This is mostly performed in illegal vehicle search systems and urban surveillance. Due to the similarity in vehicle appearance, person re-identification methods fail to provide adequate results for vehicle re-identification. The difficult cases can be grouped as similar vehicles of varied models (inter-class similarity) as in Figure 1(a), the same vehicle appearing different (intra-class differences) as in Figure 1(b), and similar-looking vehicles of the same model as in Figure 1(c). More emphasis is given to the distinct regional features which differentiate every vehicle, such as the ocular lights, inspection markings of yearly maintenance, decorative mirror hangings and personalized designs [2]. These minute details of each vehicle are used for accurate re-identification. Thus, the key component of vehicle re-identification is to distinguish the fine-grained dissimilarities of each vehicle, thereby highlighting the distinct regions.

ISSN: 2088-8708 Regional feature learning using attribute structural analysis in bipartite attention … (Cynthia Sherin)

The global features of the vehicle, such as color, model and make details about the brand, and the spatial-temporal information from varied camera angles provide the vehicle direction in the given space and time [3] along with its appearance. The license plate details distinguish each vehicle individually but are commonly not available because of privacy issues, the possibility of faking, distortions, occlusion, motion blur and illumination in real-world scenarios. Lastly, the vehicle orientation and camera angles [4] are sometimes unreliable because of the low
resolutions in capturing minute visual appearances. Personalized local features are considered in addition to the overall structural appearance to enhance the performance of re-identification. An ordered vehicle dataset with 27 labelled classes of attributes is collected and named the Attributes27 dataset. Its attributes are annotated with bounding boxes and include various types of cars, windshield glass, wheels, tissue box and back mirror. These images were gathered from a variety of surveillance cameras in urban locations and feature several vehicles from various angles (top, front, rear, and the left and right sides of the vehicle), both during the day and at night. Table 1 provides a comprehensive list of the vehicle attributes, which are combined with the personalized attributes to improve the efficiency of vehicle re-identification. By using the characteristic attribute data gathered, the Attributes27 dataset can be leveraged for diverse purposes such as vehicle analysis along with tracking [5] and detection, fine-grained model-based identification of vehicles [6], logo detection [7], and license plate detection for improved re-identification [8]. Vehicle tracking, model detection, and vehicle re-ID require varying degrees of structural data.

Numerous existing works explain the retrieval mechanisms of positive images along with the loss relationship using multiple vehicle re-identification datasets. With the aid of 20 cameras, Liu et al. [9] gathered 40,000 images of 619 vehicle identities from unobstructed traffic conditions and devised an appearance-based re-identification method to extract the global and semantic characteristics on the VeRi dataset. To calculate the relative distance between vehicles, Liu et al.
[10] used a double-branched deep convolutional neural network (CNN) on a large labeled re-ID dataset called VehicleID, which contains 222,000 images of 26,267 vehicle identities from varied real-world surveillance cameras. Distinct vehicle datasets based on regional features, with properly labelled attributes [11] and bounding boxes, are utilized for license plate and logo identification [12]. Along with this information, semantic data including vehicle color, surface texture, proportion, and geometric data also enhance the re-identification performance. Vehicle model, color, and unique identification features are taken into account by Liu et al. [13] in their multi-branch architecture to improve the performance of re-identification. The cross-sectional manipulation of vehicles is improved by generative adversarial networks (GAN) [14], by feeding the network with vehicle images from various angles. The classification system of Zhou and Shao [15] divided the images into various classes based on viewpoints, which were combined to generate the global features. He et al. [16] trained the model by merging the local model with the attention heatmaps to obtain the region of interest (ROI) recommendations. Teng et al. [17] obtained the final features of the image by using the spatial and channel attention data as a weight [18] to enhance the re-identification task. Zhang et al. [19] presented a part-guided attention learning model to uniquely identify fine-grained attributes such as distinct designs, yearly maintenance stickers, and tissue boxes to emphasize the various part regions. As a result, the regional features extracted by the self-attention model are vital for bringing out the distinctive qualities of each vehicle image.

The relationship between the correctly and wrongly identified images in comparison with the query is determined by the loss function in vehicle re-ID. To produce the loss function, Ding et al.
[20] employed L2 norms to limit the loss function for the positive and negative images. Schroff et al. [21] developed a constant value from the triplet loss function by distinguishing the similar and dissimilar image pairs.
A bipartite framework divides the images into two non-coinciding parts using the numerous labelled attributes. The pattern branch and the identity branch are the two main branches that make up the bipartite framework. The self-attention model trains the labeled images effectively, and the partition-alignment model enhances the accuracy of re-identification.
The significant contributions are, in brief, as follows:
− To gather an annotated dataset of labelled vehicle images comprising 27 attributes per class. For efficient vehicle re-identification, the selective features of individual vehicles are examined using the multi-class attributes.
− To construct a bipartite framework with multiple attributes and independent feature maps for the pattern branch as well as the identity branch. The resultant framework produces an effective triplet loss function, thereby diminishing the conflict between the multiple attributes.
− To design a self-attention model that produces centralized multi-attribute attention heatmaps, in combination with a partition-alignment model that precisely identifies the windshield area and bumper sections of the vehicle to highlight the fine-grained information.
The remainder of this paper is organized as follows: Section 2 elaborates the mechanism of the proposed method along with its architecture. Section 3 presents the experimental results obtained for the proposed method on the Attributes27 dataset and the standard vehicle re-ID dataset VeRi-776. Section 4 concludes the overall analysis.

METHOD

2.1. Characteristics of Attributes27 dataset
Our proposed Attributes27 dataset comprises 8,271 images gathered from a variety of datasets under various environments. The VAC21 vehicle dataset, which contains 21 classes of labeled attributes, served as a model for the Attributes27 dataset. To strengthen the regional feature learning capability and the detection accuracy of smaller areas, our dataset has been updated with extra attributes. The labeled attributes for each image in the dataset are displayed in Table 1. The visual representation of the labeled attributes is given in Figure 2. This dataset captures minute information of the vehicle at various levels. At the entry level, the vehicle type is determined by the body type, such as bus, jeep, truck, sedan, sport utility vehicle (SUV), or hatchback. At the second level, the vehicle make and model are determined. The final level records the unique vehicle characteristics like yearly maintenance stickers, newer signs, and decorative hangings.

The partition-alignment blocks are applied to the generated heatmaps, which serve as the input to create the ROIs. To extract the regional features, the resulting ROIs are given as input to a small CNN, ResNet-18. In order to train the model, the triplet loss for both branches is obtained.
Figure 3. Architectural design of bipartite framework

Loss functions
As the classification loss ignores the fine-tuned appearance attributes acquired from semantic features, triplet loss tends to be comparatively more effective. The training phase is accompanied by both the triplet loss and the label-smoothing cross-entropy loss. The cross-entropy loss with label smoothing is explained with (1).
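For orientation, one commonly used form of label-smoothed cross-entropy is sketched below. It is given under the assumption that the standard formulation applies and is not a transcription of (1); the symbols $q_i$ (smoothed target for class $i$) and $z_i$ (un-normalized logit of class $i$) are notation introduced here for illustration, and the one-hot target of 1 for the ground-truth class $t$ is softened by the smoothing parameter $\varepsilon$.

```latex
L_{ce} = -\sum_{i=1}^{N} q_i \,\log \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)},
\qquad
q_i =
\begin{cases}
1 - \dfrac{N-1}{N}\,\varepsilon, & i = t \\[6pt]
\dfrac{\varepsilon}{N}, & i \neq t
\end{cases}
```

With $\varepsilon = 0$ this reduces to the ordinary cross-entropy; other variants distribute the smoothing mass slightly differently.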
Here the ground-truth class t is assigned a target value of 1, ε stands for the smoothing parameter, N denotes the number of classes, and $z_i$ stands for the un-normalized logit of the i-th class. Pre-processing the input data is highly recommended for combining the cross-entropy and triplet losses. The input images are thus divided into a number of groups G1, G2, ..., Gk with a total of n similar images per group throughout the training phase. Each triplet unit for the given input consists of anchor, positive, and negative images. The loss function is given by (2).
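A standard margin-based triplet loss consistent with this description can be written as follows. This is a sketch under the assumption that $g_a$, $g_p$ and $g_n$ denote the anchor, positive and negative images of a triplet (the subscripts $p$ and $n$ are chosen here for illustration), that $d$ is a distance between their embeddings, and that $M$ is the margin:

```latex
L_{tri} = \sum_{(g_a,\, g_p,\, g_n)} \max\bigl( d(g_a, g_p) - d(g_a, g_n) + M,\; 0 \bigr)
```

A triplet contributes zero once the negative is farther from the anchor than the positive by at least $M$.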
$g_a$ stands for the anchor image, the images matching $g_a$ are represented as $g_p$, and the negative images, which are not similar to $g_a$, are represented as $g_n$. d denotes the distance between $g_a$, $g_p$ and $g_n$. M stands for the margin value which separates the non-matching images from the matching ones. Thus, the overall loss is denoted as (3).
where $\lambda$ is the allocated weight. The dual branches, namely the pattern branch and the identity branch, are subjected to the triplet loss. Let us assume that there are sample images for the provided anchor image that

Self-attention block
The bipartite structure has a separate feature space for each branch, and an attention model adds more emphasis to the specific areas. The self-attention model relates different positions of the same input image, and the acquired output belongs to the same input series. Thus, the independent features are brought together by the self-attention [22] mechanism incorporated in the bipartite network. Figure 5(a) gives the brief architecture of the self-attention framework. The three 1×1 convolution layers C1, C2, C3 are fed with the input X to generate the f, g and h heatmaps. The correlation matrix is generated as the product of the heatmaps g and f, yielding the probability map s. For the input X, the final self-attention map SA is generated by the product of the probability map s and the feature map h. The self-attention map is given as (4).
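The computation described above can be sketched in plain Python on a flattened feature map. This is a minimal illustration, not the authors' implementation: it assumes the probability map is obtained with a row-wise softmax over the correlation matrix, treats each 1×1 convolution as a matrix applied per position, and uses hypothetical names (`self_attention`, `w_f`, `w_g`, `w_h`).

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Normalize every row so that it sums to 1 (assumed normalization)."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

def self_attention(x, w_f, w_g, w_h):
    """x is C x P (channels by flattened positions); w_* play the role of C1..C3."""
    f = matmul(w_f, x)
    g = matmul(w_g, x)
    h = matmul(w_h, x)
    # corr[j][i] is the inner product of g at position j and f at position i.
    corr = matmul(transpose(g), f)
    s = softmax_rows(corr)          # probability map, P x P
    # Each output position j is a weighted sum of h over all positions i.
    sa = matmul(h, transpose(s))    # self-attention map, C x P
    return s, sa

x = [[1.0, 0.0], [0.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
s, sa = self_attention(x, eye, eye, eye)
```

In a real network the three weight matrices would be learned, and the spatial map would be reshaped back to height × width after the product.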

Partition-alignment block
With the generated attention heatmap of every attribute, the ROIs are constructed by the partition-alignment block for deriving the final regional features. Multi-level attention pooling is used in the partition-alignment block to produce the ROIs. SAp and SAi are the self-attention heatmaps produced by the pattern and identity branches respectively, as indicated in the sample structure of the partition-alignment block shown in Figure 5(b). Various average pooling sizes are used since the windshield area is smaller than the bumper area. After the detection of the pattern branch, the bumper area is detected by applying 4×12 average pooling on the heatmap SAp. For detecting the smaller inspection stickers in the occluded windshield area of SAi, a 3×3 pooling layer is utilized. The 4×10 average pooling layer processes the pooled map, which is assigned as the seed. The resulting heatmaps are separately linked with the input image to distinguish the bumper and windshield regions. The algorithmic explanations of both the self-attention block and the partition-alignment block are listed in Table 2.
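The pooling-and-seed idea can be illustrated with a small sketch. It assumes stride-1 average pooling without padding and takes the window with the highest mean attention as the candidate region; the function names are hypothetical, and the toy kernel size stands in for the 4×12 and 3×3 kernels mentioned above.

```python
def avg_pool2d(heatmap, kh, kw):
    """Stride-1, no-padding average pooling over a 2D list (assumed settings)."""
    rows, cols = len(heatmap), len(heatmap[0])
    pooled = []
    for r in range(rows - kh + 1):
        pooled_row = []
        for c in range(cols - kw + 1):
            window_sum = sum(heatmap[r + i][c + j]
                             for i in range(kh) for j in range(kw))
            pooled_row.append(window_sum / (kh * kw))
        pooled.append(pooled_row)
    return pooled

def seed_region(heatmap, kh, kw):
    """Return (top, left, bottom, right) of the window with the highest mean."""
    pooled = avg_pool2d(heatmap, kh, kw)
    positions = ((r, c) for r in range(len(pooled))
                 for c in range(len(pooled[0])))
    # The pooled pixel with the highest value acts as the seed.
    top, left = max(positions, key=lambda rc: pooled[rc[0]][rc[1]])
    return top, left, top + kh - 1, left + kw - 1

# Toy 6x6 heatmap with a hot 2x2 patch at rows 3-4, columns 1-2.
toy = [[0.0] * 6 for _ in range(6)]
for r in (3, 4):
    for c in (1, 2):
        toy[r][c] = 1.0
box = seed_region(toy, 2, 2)
```

The returned box would then be used to crop the corresponding region of the input image before it is passed to the regional feature extractor.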
Table 2. Algorithmic explanation of self-attention block and partition-alignment block

Algorithm 1: Outline of self-attention block
Requirement: Let X be the input image.
To Accomplish: Generation of the feature maps f, g, and h along with the resultant self-attention map SA.
Step 1: Generate the feature maps h, f and g by applying the input X on three individual 1×1 convolution layers C1, C2, and C3 respectively.
Step 2: The probability map s is produced by the product of f and g.
Step 3: The self-attention map SA for each branch is generated by the product of h and the probability map s.

Algorithm 2: Outline of partition-alignment block
Requirement: Generated heatmaps for the pattern and identity branch, SAp and SAi respectively.
To Accomplish: Obtain bumper and windshield region coordinates.

RESULTS AND DISCUSSION
To precisely accomplish the re-identification of vehicles, the entire experimental analysis is carried out on an Amazon Elastic Compute Cloud (EC2) instance of the Amazon Web Services (AWS) console. PyTorch and TensorFlow are used in conjunction with a high-frequency Intel Xeon Scalable Processor in the p2.xlarge instance for accelerated computing. The bipartite self-attention framework is analyzed on both the Attributes27 and VeRi-776 datasets. The VeRi-776 dataset consists of 50,000 images of 776 distinct vehicles taken by 20 cameras from various angles. The testing set contains 11,579 images of 200 vehicle identities, while the training set has 37,778 images of 576 vehicles. ResNet-50 and ResNet-18, initialized with ImageNet pretrained weights in the training phase, are applied to the bipartite framework and the subnet respectively. Through data augmentation, the input images are randomly cropped to 224×224×3 with a probability of 0.5. The Adam optimizer [23] is employed with a weight decay of 5×10⁻⁴ and a momentum of 0.9. Each batch of augmented images contains 16 different identities, with a batch size of 120. Instances with the same ID are selected at random, and the count is set to 10. The learning rate is set to 0.001 for 100 epochs. The similarity between the query and the gallery images is measured by the Euclidean distance (L2), which is used to assess the final ranking outcome.
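The ranking step at the end of this setup can be sketched as follows. It is a minimal illustration that assumes each image has already been mapped to a feature vector; `rank_gallery` is a hypothetical helper name and the vectors are made-up toy values.

```python
import math

def euclidean(a, b):
    """L2 distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(query, gallery):
    """Return gallery indices ordered from nearest to farthest from the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: euclidean(query, gallery[i]))

# Toy features: distances from the query are 5, 1 and 2 respectively.
ranking = rank_gallery([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
```

The resulting order is what the evaluation metrics operate on: the earlier a correct identity appears in the ranking, the better the score.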
The re-identification performance is evaluated by mean average precision (mAP) and the cumulative match curve (CMC). The area under the precision-recall curve is termed the average precision (AP), and mAP is the mean of the average precision obtained for the i-th query, taken over a total of N gallery images. CMC gives the cumulative percentage of queries whose correct match appears among the top-K images. Rank@1 and rank@5 scores are used to analyze the accuracy of re-identification. The formula for mAP is given as (5).
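Both metrics can be sketched in a few lines. The example assumes that each query has already been turned into a ranked list of booleans marking whether the gallery image at that rank shares the query identity; the function names are hypothetical, and protocol details such as excluding same-camera images are not handled.

```python
def average_precision(matches):
    """AP for one query; matches is a ranked list of True/False flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_matches):
    """Mean of the per-query AP values."""
    return sum(average_precision(m) for m in all_matches) / len(all_matches)

def cmc_at_k(all_matches, k):
    """Fraction of queries with at least one correct match in the top k."""
    return sum(1 for m in all_matches if any(m[:k])) / len(all_matches)

queries = [
    [True, False, True],   # AP = (1/1 + 2/3) / 2
    [False, True, False],  # AP = 1/2
]
map_score = mean_average_precision(queries)
rank1 = cmc_at_k(queries, 1)
rank5 = cmc_at_k(queries, 5)
```

Here `rank1` and `rank5` correspond to the rank@1 and rank@5 scores mentioned above.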
The performance evaluation of the bipartite self-attention framework on the VeRi-776 and Attributes27 datasets, for both the self-attention block and the partition-alignment block, is given in Table 3. The training accuracy and loss, averaged over 10 epochs, are tabulated for the trained ResNet-50 and ResNet-18 CNN models for the two blocks respectively to exhibit their robustness on the VeRi-776 and Attributes27 datasets. The overall performance of the bipartite framework is analyzed using precision, accuracy and recall. The graph of these metrics in Figure 6 indicates that the accuracy, precision and recall of the bipartite framework with self-attention are 86.4%, 86.1% and 86.3% on VeRi-776 and 98.5%, 97.1% and 98.3% on the Attributes27 dataset respectively.

Figure 6. Overall performance analysis for the bipartite framework

Comparison with other methods and results
Using the multi-view images from the VeRi-776 and Attributes27 datasets, feature maps are created to obtain the ROIs that provide the regional features needed to compare the re-ID performance with other cutting-edge techniques. mAP and CMC are employed on the VeRi-776 dataset for comparison with the state-of-the-art methods, as shown in Figure 7. The findings of the comparative analysis on the VeRi-776 dataset show that VAMI+STR [15] and FACT [9] obtain more information from multi-attributes. VAMI [15] and OIFE(4views+ST) [24] get the multi-view information with the help of an attention model. MSVF and SCAN utilize a dual-branch architecture, and PR+GLBW [16] makes use of LocalNet [25] to get the local information of each vehicle. Our bipartite framework, when applied to the Attributes27 and VeRi-776 datasets, displayed a good improvement in performance of 98.5% and 84.3% respectively. Due to the varied image categories, the accuracy on the VeRi-776 dataset is lower than on the Attributes27 dataset.

CONCLUSION
In this paper, we have gathered the regional features with the help of 27 labelled classes of structured attributes in the Attributes27 dataset. Both the Attributes27 and VeRi-776 datasets are tested on the proposed bipartite attention framework. With triplet loss incorporated in the bipartite framework, the VeRi-776 dataset achieved 99.1% accuracy and the Attributes27 dataset 98.4%, compared with only 65.7% and 78.6% for the other techniques. The bipartite architecture overcomes the conflict between the triplet losses. The attention heatmaps generated by the self-attention block combine the individual features, and with these heatmaps the partition-alignment block detects the local regions. A performance of 84.3% and 98.5% is achieved on the VeRi-776 and Attributes27 datasets respectively, with effective detection of windshield and bumper areas in the captured front-angle images.

Figure 1. Changes in vehicle appearance (a) similar vehicles of varied models, (b) different appearances of same vehicle, and (c) different vehicles of same model

Figure 2. Attributes distribution in a sample image of Attributes27 dataset


ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 13, No. 5, October 2023: 5824-5832

have different identities. L is the distance between the sample images and the anchor image. When the identity branch is trained, the distance L increases as the images belong to various identities, and when images of the same model are trained, L decreases. With the advantage of the dual branch, the proposed system overcomes the conflicts of a single-branch framework, thereby improving the performance. The experimental analysis on the VeRi-776 and Attributes27 datasets results in 84.3% and 98.5% mAP respectively for the proposed bipartite structure with self-attention. Figure 4 displays the comparative study of the framework with enhanced performance and accurate detection of windshield and bumper zones.

Figure 4. Comparative study of self-attention bipartite framework

Figure 5. Bipartite network structure (a) self-attention block and (b) windshield and bumper regions highlighted by partition-alignment block

Table 1. Hierarchical list of labelled attributes (Attributes27)

Table 2 (continued), Algorithm 2:
, the pixel with the highest value becomes the seed value;
Step 6: The pixel in   ′′ includes the region covered by SAi, adding them to the candidate areas;
Step 7: The coordinates   denotes the candidate region's maximum value;
Step 8: Return the values   and   .

Table 3. Average values of accuracy and loss data for Attributes27 and VeRi-776 dataset