Content-based product image retrieval using squared-hinge loss trained convolutional neural networks

ABSTRACT


INTRODUCTION
Users typically search for an item in an online store by entering text containing the product's keywords. However, text keywords cannot distinguish products by their visual appearance. A visual search system addresses this problem by using an image as the query instead of text. Visual features representing the image are extracted and matched against those in the database to retrieve similar images [1]. Image features represent the shapes or the color distribution of the image. Shape-based image retrieval uses edges or moment invariants, while color-based retrieval uses a histogram of the image pixel values. In addition to these features, key-point-based features, including scale-invariant feature transform (SIFT) [2] and speeded-up robust features (SURF) [3], are also used in visual search.
Recent studies have focused on convolutional neural networks (CNNs) since they outperform other approaches in large-scale object detection and image classification [4] in the ImageNet large scale visual recognition challenge (ILSVRC) competitions [5]. CNN models are also applied in image matching and retrieval. CNN-based image features are obtained by first training the CNN model for an image classification task using a specific dataset and a loss function. Then, the model is modified by removing its classification layer. After that, an input image is processed with the modified model to extract the feature [6].

CNN-based features have been applied in content-based product image retrieval. The CNN model is trained for a classification task under product category label supervision. The output is the model's last layer, trained using the softmax loss function for multiclass classification. Along with softmax loss, the squared-hinge loss function is also known for its performance in multiclass classification [7]. However, to our knowledge, studies on content-based retrieval, specifically for product images, have not applied the squared-hinge loss function in model training.
This study proposes a method for extracting features from product images based on CNN models trained with squared-hinge loss as an alternative to softmax loss. The extracted image features are then indexed using the nearest-neighbor (NN) indexing technique. Retrieval experiments were conducted on different CNN models with softmax and squared-hinge loss functions. We evaluate the training process and the retrieval results to obtain the best configuration for the feature extraction method.
The contributions of this study are: i) our method can be used as an alternative to existing CNN-based feature extraction methods, specifically for content-based product image retrieval; ii) we present the configuration of CNN model, training parameters, and loss function that achieves the best result; and iii) we believe that our work can be applied to content-based retrieval in e-commerce shops.
The rest of the paper is organized as follows. Section 2 reviews related work on CNN features and product retrieval. Details of the method for feature extraction, indexing, and matching are given in section 3. Experimental results with discussion are presented in section 4. Finally, section 5 concludes the paper with a summary.

CNN-BASED FEATURES
A deep convolutional network consists of two parts: i) the convolution and pooling layers and ii) the fully connected layers [8], as shown in Figure 1. The first layer in the convolutional part is the input layer, which accepts a raw red, green, blue (RGB) pixel image. A convolutional layer is a set of feature maps with neurons, and the parameters of the layer are the set of filters or kernels. The pooling layer reduces the activation map's spatial dimension, the number of parameters in the network, and the computational complexity. The fully connected (FC) layer has neurons connected to the previous layer as in a conventional neural network [9]. A loss value is a penalty for a mismatch between the desired output and the actual output of the last layer. CNN features are extracted from the nodes in the FC layer since this layer has a global receptive field representing the global features [10]. Researchers use various CNN models for feature extraction. For example, [11], [12] use the FC layers before the last layer, named FC6 and FC7, of the AlexNet model [13]. Razavian et al. [14] apply a different CNN model, i.e., OverFeat [15], and extract features from the first FC layer (layer 22) of the model. Moreover, HybridNet [16] was used in [17] to extract features from the activation of the first FC layer (i.e., FC6).
The CNN feature extraction described above was performed for general image classification and retrieval. Researchers have also applied CNN feature extraction to specific image types, such as product images. The work in [18] uses a self-built network model, which is simpler than the standard pre-defined models for classifying images. In [19], CNN features from the VGG-19 model were applied to retrieve fashion product images. Alternatively, Elleuch et al. [20] used features from the Inception V3 model [21]. The feature is extracted on a clothing dataset from the bottleneck layer, the layer just before the last output layer.

METHOD
The method in this study consists of three steps, as shown in Figure 2: model training, feature extraction and indexing, and image retrieval. In model training, a transfer learning scheme is applied to the pre-trained CNN model. Then, the model is trained with images and category labels from the dataset using squared-hinge loss. The output of this step is the fine-tuned CNN model. After that, image features are extracted with the fine-tuned CNN model and indexed using the nearest-neighbor (NN) indexing technique. In image retrieval, query-by-example is performed by matching features from the query with the indexed features in the database using the k-NN search algorithm.
Figure 2. The proposed method for content-based product image retrieval

CNN models
The CNN architecture has evolved, particularly in the convolutional module. VGG19 [21] is one model that uses the conventional convolutional module. The Inception module [22] instead applies several filter sizes to a single image block, then concatenates the outputs and passes them to the next layer rather than being limited to a single filter size. MobileNetV2 [23] uses depthwise convolution, which applies a single convolutional filter per input channel, to reduce the number of parameters. A pointwise convolution is then used to create a linear combination of the outputs of the depthwise convolution. Unlike the types above, ResNet [24] uses the residual module, a skip-connection block that learns residual functions with reference to the layer inputs instead of learning unreferenced functions.
The layer before the last fully connected layer (FCn−1) provides the image feature vector extracted from the CNN model. Each node in the layer corresponds to one element of the feature vector. Therefore, the number of nodes in FCn−1 should be considered when selecting the CNN model for the image feature extractor since it determines the feature dimension. We use CNN models with different numbers of FCn−1 nodes, as shown in Table 1, to examine the relationship between the feature dimension and retrieval accuracy.

Transfer learning on the CNN model
Deep transfer learning tries to improve accuracy by adopting a high-accuracy model from another domain. We perform transfer learning from a pre-trained ImageNet model [5] as the source domain to the product image dataset as the target domain. First, the source model's convolution and pooling layer parameters are transferred to the target as in the network-based transfer learning scheme [25]. Then, to fine-tune the model, the source model's last FC layer (FCn) is replaced by a new layer (FC'n). This new layer has the same number of nodes as the number of classes in the product dataset. After that, the model is retrained using the product image dataset. Figure 3 shows the transfer learning process from the pre-trained model.

Loss function
We trained the CNN model using a loss function for classification tasks. Since this is a multiclass problem, the loss function should produce multiclass outputs. The softmax loss $L_S$ in (2) is typical for a multiclass classification task.
Here, $f_{y_i}$ is the output of the fully connected layer $f$ for an input with label $y_i$, $f_j$ is the output for label $j$, $N$ is the number of samples, and $K$ is the total number of classes.
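For reference, the standard softmax loss written with the symbols defined above takes the following form. This is the textbook definition and is assumed, not confirmed, to match (2).

```latex
L_S = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{f_{y_i}}}{\sum_{j=1}^{K} e^{f_j}}
```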
Alternatively, for multiclass classification, we can use the squared-hinge loss $L_H$ expressed in (3), where $m$ is the specified margin value.
$L_H$ is a more local objective since it operates on uncalibrated scores for all classes, while $L_S$ computes probabilities for all labels.
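The difference between the two objectives can be illustrated with a minimal pure-Python sketch. The squared-hinge form used here, the sum over wrong classes of max(0, m + f_j − f_{y_i}) squared, is one common multiclass variant and is an assumption; several variants exist and the exact form of (3) may differ. The function names are hypothetical.

```python
import math


def softmax_loss(scores, labels):
    """Mean softmax (cross-entropy) loss over a batch of raw class scores."""
    total = 0.0
    for f, y in zip(scores, labels):
        peak = max(f)  # subtract the maximum for numerical stability
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in f))
        total += log_sum - f[y]
    return total / len(scores)


def squared_hinge_loss(scores, labels, margin=1.0):
    """Mean multiclass squared-hinge loss (assumed variant, see lead-in)."""
    total = 0.0
    for f, y in zip(scores, labels):
        total += sum(
            max(0.0, margin + f[j] - f[y]) ** 2
            for j in range(len(f))
            if j != y
        )
    return total / len(scores)


# One sample whose true class (index 0) already beats the others by the margin.
scores = [[2.0, 0.0, -1.0]]
labels = [0]
print(squared_hinge_loss(scores, labels))  # zero: the margin is satisfied
print(softmax_loss(scores, labels))        # still positive
```

Once the true class leads every other class by at least the margin, the squared-hinge term contributes nothing, whereas the softmax loss keeps pushing the scores apart. This is the sense in which the hinge objective is more local.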

Feature extraction and indexing
Feature extraction is the inference process of the fine-tuned CNN model. The extracted feature is a CNN-based feature, a vector whose values are the node values of FCn−1, the fully connected layer before the last layer in the model. This layer has a global receptive field, so it can be used as a global image feature [10]. The feature vector is then normalized with the l2-norm as in (4) to form the image feature F. This normalization gives every feature vector unit length, so all of its values lie in the range [−1, 1].
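The normalization step can be sketched as follows, assuming the usual definition F = f / ||f||_2. The function name and the small `eps` guard against a zero vector are illustrative additions, not part of the described method.

```python
import math


def l2_normalize(feature, eps=1e-12):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in feature))
    # eps avoids division by zero for an all-zero vector (assumption)
    return [v / max(norm, eps) for v in feature]


print(l2_normalize([3.0, 4.0]))  # → [0.6, 0.8]
```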
The feature vectors are stored in a feature database before being used in the image search process. An indexing technique is applied to store and retrieve features efficiently. Nearest-neighbor indexing is used since it is straightforward and sufficient for image search with CNN-based features.

Image retrieval
The retrieval process begins with preprocessing the query image. Then, the image feature is extracted using the fine-tuned CNN model from the model training step. After that, the extracted feature is normalized using l2 normalization as in (4). This normalized feature is used as the query to find similar image features in the indexed feature database. Feature similarity is measured by the Euclidean distance between the query and each image feature in the database. The similarity search on the database is performed using the k-NN search algorithm. The k feature vectors in the indexed feature database with the smallest distance to the query are returned as the query result.
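The matching step can be sketched with an exhaustive search. This is a brute-force stand-in that returns exact nearest neighbors, not the graph-based index used in the experiments; the function name and the toy vectors are hypothetical.

```python
import heapq
import math


def knn_search(query, database, k):
    """Return (distance, index) pairs for the k vectors closest to the query.

    Distances are Euclidean; results are sorted from nearest to farthest.
    """
    distances = (
        (math.dist(query, vector), index)
        for index, vector in enumerate(database)
    )
    return heapq.nsmallest(k, distances)


# Toy database of already normalized 2-dimensional features.
database = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
for distance, index in knn_search([0.0, 1.0], database, k=2):
    print(index, round(distance, 3))
```

An exhaustive scan costs one distance computation per database vector for every query, which is why an index structure is preferable at the scale of the datasets used here.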

Experimental setup
We use hierarchical class image data with three levels from general to specific: superclass, class, and image. A superclass refers to a product category (e.g., bicycle, sofa, and shirt), a class represents a product item in the category, and an image is a picture of the product item. A product item may have more than one picture. The experiments were conducted using labeled product image datasets with fine-grained categories: Stanford online products (SOP) [26] and InShop DeepFashion (InShop) [27]. SOP includes home product images, while InShop contains clothing images. SOP consists of 12 superclasses, 22,634 classes, and 120,053 images. The InShop dataset contains 23 superclasses, 7,982 classes, and 52,712 images. Both the SOP and InShop datasets are also used in [28].
CNN model training was performed on a GPU-enabled machine for 100 epochs. The training is optimized with stochastic gradient descent (SGD), and the learning rate is set to $10^{-3}$. Feature indexing and k-NN searching are implemented using neighborhood graph and tree (NGT) [29], a graph-based indexing library. Results on diverse datasets in [30] show that CNN model training achieves the best accuracy at batch sizes greater than 64. However, a larger batch size requires more computational resources. Given our limited computational resources, we found that 96 is the optimum batch size in our experiments.

Evaluation metrics
Retrieval results on both datasets, using all images in the test split as queries, are evaluated with two metrics; for both, a higher value means better performance.
− P@k (precision at k), expressed as (5).
− mAP@k (mean average precision at k), written as (6), where $r_i$ is the number of relevant results in the top i. This metric considers the order of the retrieval results.
We use k=50 in the experiments, taken as the maximum number of query results displayed to a user searching for an item in an online shop.
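The two metrics can be sketched as below. The definitions are assumptions based on common usage: P@k is the fraction of relevant results in the top k, and AP@k averages the precision r_i / i over the ranks i ≤ k that hold a relevant result. The exact normalization in (6) may differ. Relevance lists are hypothetical 0/1 flags in ranked order.

```python
def precision_at_k(relevance, k):
    """Fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


def average_precision_at_k(relevance, k):
    """Average of r_i / i over ranks i <= k that hold a relevant result."""
    hits = 0      # r_i: relevant results seen in the top i
    score = 0.0
    for i, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0


def mean_average_precision_at_k(all_relevance, k):
    """Mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, k) for r in all_relevance) / len(all_relevance)


ranked = [1, 0, 1, 1, 0]  # relevance of the top-5 results for one query
print(precision_at_k(ranked, 5))          # → 0.6
print(average_precision_at_k(ranked, 5))  # about 0.806
```

Unlike P@k, the average precision drops when relevant results appear lower in the ranking, which is why it reflects result order.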

Model training
CNN model training was conducted using the squared-hinge and softmax loss functions. We use a learning curve, with loss value and epoch as the axes, to analyze training performance. In general, training yields good-fit models on all datasets since the curves descend smoothly and converge.

Retrieval results
Retrieval experiments were performed on the SOP and InShop datasets using CNN-based features from models trained with softmax (S) and squared-hinge (H) loss. In Figure 6(a), the P@k results for retrieval using features from ResNet18-H and MobileNetV2-H tend to show better accuracy on the SOP dataset. In Figure 6(b), the mAP@k results for ResNet18-H and MobileNetV2-H differ only slightly, and ResNet18-H gets the best result at each k.
The findings for the InShop dataset are comparable to those for the SOP dataset. Figure 7(a) and Figure 7(b) show the P@k and mAP@k results on the InShop dataset. ResNet18-H and MobileNetV2-H continue to exhibit the highest accuracy in P@k and mAP@k. However, the gap between the two is now more pronounced, with MobileNetV2-H outperforming all other models. We calculated the P@k and mAP@k accuracy gap between features trained with squared-hinge loss and features trained with softmax loss. Across all models and k=1 to 50, retrieval accuracy using squared-hinge loss trained features exceeds that of softmax loss trained features by 3.3% on average on SOP and by 3.7% on InShop. These results confirm that using CNN features from a model trained with squared-hinge loss improves accuracy compared to a model trained with softmax loss.
Feature vectors extracted from the CNN models have various dimensions, from 4096 in VGG19 down to 512 in ResNet18. The feature dimension affects the computational resource requirements since extraction, indexing, and matching operations are performed on each feature vector element. From this point of view, the ResNet18 feature is preferred over the other features since it has the lowest dimension while still giving competitive accuracy.
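As a rough worked example of the effect of dimension, the raw storage needed for the SOP features can be estimated as below. The calculation assumes 32-bit floating-point values and ignores any overhead of the index structure; both are assumptions, and the function name is hypothetical.

```python
def raw_feature_bytes(num_images, dim, bytes_per_value=4):
    """Raw storage for num_images feature vectors of the given dimension."""
    return num_images * dim * bytes_per_value


sop_images = 120_053  # number of images in SOP, as reported above
resnet18_bytes = raw_feature_bytes(sop_images, 512)
vgg19_bytes = raw_feature_bytes(sop_images, 4096)

print(f"ResNet18: {resnet18_bytes / 2**20:.1f} MiB")
print(f"VGG19:    {vgg19_bytes / 2**20:.1f} MiB")
print(f"ratio:    {vgg19_bytes // resnet18_bytes}x")
```

Under these assumptions the 4096-dimensional features need eight times the storage of the 512-dimensional ones, and each Euclidean distance computation touches eight times as many elements.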

CONCLUSION
Image retrieval using features extracted from various fine-tuned CNN models has been evaluated on fine-grained product image datasets. Retrieval results using our method show that features from CNN models trained with squared-hinge loss improve retrieval accuracy compared to features from softmax-trained models. Overall, MobileNetV2-H features achieve the best retrieval accuracy. However, ResNet18-H has the advantage of the lowest feature dimension while still achieving competitive accuracy. Since images of a product item in an online store differ in quality from images taken by users, distance metric learning should be considered for calculating the distance between features.

Figure 5. Loss graph of training CNN models in (a) SOP and (b) InShop dataset

Figure 6. Retrieval results of CNN-based features trained with softmax loss (S) and squared-hinge loss (H) on the SOP dataset in (a) P@k and (b) mAP@k metrics
Int J Elec & Comp Eng, ISSN: 2088-8708

Table 1. Main module and FCn−1 nodes of CNN models