Thai culture image classification with transfer learning

ABSTRACT


INTRODUCTION
Deep learning is a subfield of machine learning inspired by the structure and function of neural networks in the brain. It involves training artificial neural networks on large datasets, allowing the network to learn and make intelligent decisions on its own. One area where deep learning has been particularly successful is computer vision, which includes tasks such as image classification [1], [2], object detection [3], [4], and image segmentation [5], [6]. Deep learning has made significant progress in recent years with the development of convolutional neural networks (CNNs) and other novel deep learning architectures [7]-[9]. These networks are able to learn complex patterns and features in images, allowing them to achieve state-of-the-art results on a variety of tasks [10]-[14]. It has also been used to improve the efficiency of image and video processing, making it possible to run these tasks on mobile devices and other low-power platforms [15].
Although deep learning has been the method of choice for many tasks, it requires a large amount of data and computational power. Nevertheless, the difficulty of training a deep network on a new small dataset can be mitigated by using transfer learning. Transfer learning is a machine learning technique that allows a model trained on a large dataset to be fine-tuned for a specific task [16]-[18]. It has been widely used in a variety of applications [19], [20], such as natural language processing [21], [22] and speech recognition [23], [24]. In the field of image classification, transfer learning has been shown to be particularly effective when there is limited data available for the target task [25]. In this paper, we explore the use of transfer learning for classifying images of Thai culture.
Thailand is a country with a rich and diverse cultural heritage, and accurately classifying images of Thai culture is important for a variety of applications, such as tourism [26], education [27], and cultural preservation [28]. For example, a machine learning model that can classify images of Thai temples, traditional dress, and cultural festivals could be used to create a virtual tour of Thailand for tourists [7] or to help educators teach about Thai culture. However, building a machine learning model from scratch to classify Thai cultural images can be challenging due to the limited availability of annotated data [29]. Many traditional machine learning models require large amounts of labeled data to achieve good performance [30], and it can be time-consuming and expensive to manually annotate a dataset for a specific task [31], [32]. In this paper, we propose to utilize transfer learning for the Thai culture image classification problem. By exploiting models pre-trained on a large image dataset, we can leverage the knowledge learned by the models on our target task.
The main contributions of our paper include: i) Collecting a Thai culture image dataset, which consists of 1,000 high-quality images from 10 well-known Thai cultures and traditions; ii) Investigating the performance of three famous CNNs, namely MobileNet, EfficientNet, and residual network (ResNet), as pre-trained models for Thai culture image classification; iii) Exploring how pre-trained models can be utilized for Thai culture image classification by comparing training the models with random initialization, utilizing them as feature extractors, and fully fine-tuning them; and iv) Explaining the quantitative accuracy of the best-performing model, EfficientNet, with gradient-weighted class activation mapping (Grad-CAM), which confirms that the model focuses on relevant features in the input images. For the remainder of this paper, we present our methodology, including details about our Thai culture image dataset, preprocessing steps, and transfer learning approaches. We then report on the results of our experiments and discuss the limitations of our approach. Potential directions for future work are also considered, ensuring a comprehensive exploration of opportunities within this domain.

RESEARCH METHOD
In this section, we introduce our dataset for Thai cultural image classification. We then summarize the details of our pre-trained models and how they can be fine-tuned for our dataset. Lastly, we explain the hyperparameter settings used in our research.

Dataset collection
We used a dataset of Thai cultural images that we collected and annotated ourselves. The dataset consists of a total of 1,000 images, split evenly into 10 classes. Figure 1 shows a sample of each Thai cultural tradition: Figure 1(a) the merit-making ceremony of hoisting a cloth flag, Figure 1(b) the merit-making ceremony of offering candles during the Buddhist Lent, Figure 1(c) ประเพณีสารทเดือนสิบ (the merit-making ceremony of offering food to monks in the tenth lunar month), Figure 1(d) ประเพณีสงกรานต์ (the Songkran festival), Figure 1(e) ประเพณีวิ่งควาย (the running with a buffalo festival), Figure 1(f) ลอยกระทง (the Krathong festival), Figure 1(g) ประเพณียี่เป็ง (the merit-making ceremony of offering candles to the spirits), Figure 1(h) ประเพณีบุญบั้งไฟ (the merit-making ceremony of lighting candles), Figure 1(i) ประเพณีตักบาตรดอกไม้ (the merit-making ceremony with flowers), and Figure 1(j) ประเพณีชักพระ (the tradition of carrying the Buddha image in a procession). Each image was manually labeled with its corresponding class by four annotators, and any discrepancies were resolved through consensus. The images in the dataset were collected from various online sources, such as Google Image Search and Flickr. We made sure to include a diverse range of images within each class and to ensure that the images were representative of Thai culture. We also made sure to exclude any images that were unrelated to Thai culture or that were inappropriate for the target audience. To preprocess the dataset, we resized each image to 224×224 pixels and applied standard image augmentation techniques, such as horizontal flipping and random cropping. We also randomly split the dataset into training and test sets with a ratio of 80:20.
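The augmentation and split described above can be sketched in NumPy. The actual pipeline used Keras utilities, so the helper names (`augment`, `train_test_split`) and the 200-pixel crop size are illustrative assumptions, not the paper's implementation; resizing back to 224×224 after cropping is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img, crop=200):
    """Randomly flip an H x W x C image horizontally and take a random
    crop (crop size is an illustrative assumption)."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]            # horizontal flip
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    return img[top:top + crop, left:left + crop, :]

def train_test_split(images, labels, test_ratio=0.2):
    """Shuffle and split the dataset 80:20, as in the paper."""
    idx = rng.permutation(len(images))
    cut = int(len(images) * (1 - test_ratio))
    return (images[idx[:cut]], labels[idx[:cut]],
            images[idx[cut:]], labels[idx[cut:]])

# Toy stand-in for the 1,000-image dataset (10 classes, 100 each)
images = rng.random((1000, 224, 224, 3))
labels = np.repeat(np.arange(10), 100)
x_tr, y_tr, x_te, y_te = train_test_split(images, labels)
print(x_tr.shape[0], x_te.shape[0])    # 800 training, 200 test images
print(augment(images[0]).shape)        # (200, 200, 3)
```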

Pretrained models
In this study, we investigated the use of transfer learning for image classification of Thai cultural images with three state-of-the-art convolutional neural networks: MobileNet, EfficientNet, and ResNet, since they represent networks at three different scales. MobileNet is a lightweight architecture designed for mobile and embedded devices, making it efficient in terms of computational resources. EfficientNet is an architecture that aims to improve the accuracy and efficiency of CNNs. ResNet is a deep residual network that has achieved state-of-the-art performance on several image classification benchmarks.
Table 1 compares the number of parameters and the inference times on the central processing unit (CPU) and graphics processing unit (GPU) for all three networks. The CPU and GPU used to measure the time were an Intel(R) Xeon(R) with a clock speed of 2.00 GHz and an Nvidia Tesla P100 with 16 GB of memory, respectively. From the table, we can notice that the number of parameters for ResNet was significantly larger than for the other two. Nevertheless, the difference in inference time was less pronounced, especially on the GPU. By comparing the performance of these architectures, we hope to gain a better understanding of how different CNN architectures affect the transfer learning process and the final classification performance.

MobileNet
MobileNet [33] is a small, low-latency, and low-power convolutional neural network designed for mobile and embedded devices. It was developed by Google in 2017 and has been widely used in various applications, such as image classification, object detection, and semantic segmentation. MobileNet is based on the concept of depthwise separable convolutions, which decompose a standard convolution into two separate operations: a depthwise convolution and a pointwise convolution. This allows MobileNet to achieve a much smaller model size and faster inference times compared to traditional CNNs, while still maintaining a reasonable level of accuracy. MobileNet's design is closely related to the Xception architecture, which also builds on depthwise separable convolutions; the network consists of a series of depthwise separable convolutions followed by a few fully connected layers. This reduces the number of parameters and computations in the model, making it more suitable for mobile and embedded devices with limited resources.
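The parameter saving from depthwise separable convolutions can be seen by counting weights directly. The layer sizes below (3×3 kernel, 256 input and output channels) are illustrative choices, not values from the paper:

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """One k x k filter per input channel (depthwise), then a 1 x 1
    pointwise convolution mixing channels."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

# A typical mid-network layer: 3 x 3 kernel, 256 -> 256 channels
std = conv_params(3, 256, 256)                 # 589,824 weights
sep = depthwise_separable_params(3, 256, 256)  # 67,840 weights
print(std, sep, round(std / sep, 1))           # roughly an 8.7x reduction
```

In general the reduction factor is about 1/c_out + 1/k², which is why MobileNet's model size drops so sharply for large channel counts.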

EfficientNet
EfficientNet is a family of convolutional neural network models developed by Google in 2019 that aims to improve the efficiency of CNNs while maintaining a high level of accuracy [34]. The EfficientNet models are designed to be scalable in terms of model size, number of parameters, and computational cost, making them suitable for a wide range of applications and hardware platforms. EfficientNet models build on the MobileNet architecture to achieve a high level of efficiency. They also use a compound scaling method that scales the model's depth, width, and input resolution in a balanced manner to achieve optimal performance. EfficientNet models have achieved state-of-the-art performance on various benchmarks, including image classification, object detection, and semantic segmentation. They have also been widely used in practical applications, such as mobile devices, internet of things (IoT) devices, and edge computing systems.
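The compound scaling rule can be illustrated numerically. The coefficients α = 1.2, β = 1.1, γ = 1.15 are the values reported in the original EfficientNet paper (they are not given in this paper); depth, width, and resolution are scaled as α^φ, β^φ, γ^φ under the constraint α·β²·γ² ≈ 2, so total FLOPs roughly double with each increment of φ:

```python
# Compound scaling coefficients from the original EfficientNet paper
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth, width, resolution factors

def scale(phi):
    """Scale depth, width, and resolution jointly by the compound
    coefficient phi, as in EfficientNet's scaling rule."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The constraint alpha * beta^2 * gamma^2 ~= 2 ties the three factors
# together so FLOPs grow by roughly 2^phi.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor, 2))   # close to 2

d, w, r = scale(2)              # e.g. a B2-style scaling step
print(round(d, 2), round(w, 2), round(r, 2))
```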

ResNet
Residual network (ResNet) is a convolutional neural network architecture developed by Microsoft researchers in 2015 [35]. It was designed to address the problem of vanishing gradients, a common issue in very deep CNNs where the gradients of the parameters tend to become very small during training, leading to slow convergence and poor performance. ResNet addresses this problem by introducing a residual block, a building block of the network that allows the input to be added to the output of the block. This lets the gradients flow more easily through the network, enabling the model to learn deeper and more complex representations. ResNet has achieved state-of-the-art performance on various benchmarks, including image classification, object detection, and semantic segmentation. It has also been widely used in practical applications, such as image recognition and natural language processing.
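The residual idea can be sketched minimally in NumPy, using dense matrices as stand-ins for the convolutions; the dimensions and weight scale are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """A simplified residual block: the input is added back to the
    transformed output, giving the identity-shortcut its name."""
    out = relu(x @ w1)    # first transformation (stands in for conv + ReLU)
    out = out @ w2        # second transformation (stands in for conv)
    return relu(out + x)  # shortcut: add the input before the activation

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
w1 = rng.standard_normal((8, 8)) * 0.01   # near-zero weights
w2 = rng.standard_normal((8, 8)) * 0.01

y = residual_block(x, w1, w2)
# With near-zero weights the block approximates the identity map on its
# input, which is why stacking many such blocks stays trainable: the
# shortcut gives gradients a direct path through the network.
print(np.allclose(y, relu(x), atol=1e-2))
```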

Transfer learning
Transfer learning is a machine learning technique that involves using a model pre-trained on one task as the starting point for a model on a related task. The idea behind transfer learning is that the features learned by the pre-trained model on the original task can be useful for the new task, allowing the model to learn more efficiently and potentially achieve better performance. Transfer learning can be useful in situations where there is a limited amount of data or computational resources available for the new task. By using a pre-trained model as a starting point, the model can benefit from the knowledge and representations learned on the original task, allowing it to perform better on the new task with less data and fewer resources. There are two main approaches to applying transfer learning when using a pre-trained model as a starting point for a new task: using the pre-trained model as a fixed feature extractor or fully fine-tuning the whole model on the new task [36].

Feature extractor
Using the pre-trained model as a fixed feature extractor involves using the output of the pre-trained model as input features for a new model trained on the new task. The weights of the pre-trained model are not updated during training, and only the new model is trained on the new task. For computer vision tasks, the baseline pre-trained model is usually trained on a large and diverse dataset such as the ImageNet dataset [29]. In this case, the features learned by the pre-trained model can be useful for the new task, allowing the new model to learn more efficiently and potentially achieve better performance. Figure 2 demonstrates the overview of using a pre-trained network as a fixed feature extractor. In the figure, the blue component represents the convolutional layers' parameters being frozen, while the red components refer to the fully connected layers being trained on a new small image dataset.

Fine-tuning
Fully fine-tuning involves updating all layers of the pre-trained model on the new task. This approach is suitable when the new task is significantly different from the original task and requires a more specialized model. Fine-tuning all layers of the model allows it to learn task-specific features that may not be present in the pre-trained model. However, it also requires a larger amount of data and computational resources compared to using the pre-trained model as a fixed feature extractor. Figure 3 depicts the overview of fully fine-tuning the pre-trained model on the new task. In this case, both the feature extractor and the classifier are trained on a new small dataset.
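The fixed-feature-extractor setup can be mimicked with plain NumPy: a frozen random projection stands in for the pre-trained backbone, and only a linear classifier head is trained. Everything here (dimensions, learning rate, toy labels) is illustrative; the paper's experiments used Keras CNNs:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "pre-trained" backbone: a fixed projection plus ReLU stands in
# for the convolutional feature extractor.
W_frozen = rng.standard_normal((20, 10))
W0 = W_frozen.copy()   # kept only to verify the backbone never changes

def extract_features(x):
    return np.maximum(x @ W_frozen, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy 2-class dataset standing in for the small target dataset
x = rng.standard_normal((100, 20))
y = (x[:, 0] > 0).astype(int)

feats = extract_features(x)                      # computed once, backbone frozen
feats = (feats - feats.mean(0)) / feats.std(0)   # standardize before the head

W_head = np.zeros((10, 2))                       # the only trainable parameters
for _ in range(300):                             # gradient descent on the head
    p = softmax(feats @ W_head)
    grad = feats.T @ (p - np.eye(2)[y]) / len(x)
    W_head -= 0.5 * grad

acc = ((feats @ W_head).argmax(axis=1) == y).mean()
print(acc)   # the small head alone fits the data reasonably well
```

Full fine-tuning would instead update `W_frozen` as well, which is exactly what costs more data and compute.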

Hyperparameter settings
To prevent an exhaustive search and reduce the time and resources required, we set most hyperparameters to their default values. The Adam optimizer was used to adjust the networks' weights with a batch size of 32. For training the fixed feature extractors, the learning rate was set to 1×10⁻³, and β1 and β2 were set to 0.9 and 0.999, respectively. For fully fine-tuning the whole network, we lowered the learning rate to 1×10⁻⁵ in order to prevent significant changes to the models' parameters. All training was carried out for 100 epochs. All of our experiments were conducted using the Keras framework [37] in Python.
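For reference, one Adam update with the stated hyperparameters (β1 = 0.9, β2 = 0.999, learning rate 1×10⁻³) can be written out directly; in practice the paper relies on Keras's built-in optimizer, so this is only a sketch of the update rule:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with the paper's hyperparameters (lr = 1e-3 for
    feature extraction; the paper uses lr = 1e-5 for full fine-tuning)."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.5, -0.5])

w, m, v = adam_step(w, grad, m, v, t=1)
print(w)   # on the first step each weight moves by ~lr against its gradient
```

The bias-corrected step size is close to `lr` regardless of gradient magnitude, which is why lowering `lr` to 1×10⁻⁵ for full fine-tuning directly limits how far the pre-trained weights can drift.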

RESULTS AND DISCUSSION
To fully understand how transfer learning improves the performance of the models, we compare and discuss the proposed models in the following subsections. Specifically, the accuracy and inference time of the models were evaluated quantitatively. Lastly, the best model was qualitatively examined with the Grad-CAM technique.

Quantitative results
To measure the performance of the models, we compared the quantitative results of the three pre-trained models in terms of classification accuracy. Additionally, training time was used to evaluate the three different training strategies. We then discussed the best model when both metrics were taken into account.

Accuracy
We compared the performance of MobileNet, EfficientNet, and ResNet when trained with random initialization, used as fixed feature extractors, and fully fine-tuned for image classification on the proposed Thai culture image dataset. We used the same training and test sets for all models and training strategies. Their performance was evaluated using classification accuracy, defined as the fraction of correctly classified images over all tested images. The results of the experiments are shown in Table 2, which demonstrates that all three models performed best when used as feature extractors, followed by full fine-tuning and then random initialization. Comparing the three networks, EfficientNet performed best at 95.87%, followed by ResNet at 95.04% and MobileNet at 92.56%. The results were similar to [36], which also pointed out that intermediate-size networks tend to perform better when repurposed for a new smaller dataset. Interestingly, fully fine-tuning a network tends to have positive effects, but in this case it worsened the accuracies of all three networks. We argue that this is because of the size of the new dataset. The pre-trained models were trained on the large and diverse ImageNet dataset, but the proposed dataset is significantly smaller. In this case, the features learned by the pre-trained models on the original task were more useful for the new task when the models were used as feature extractors. When the models were fully fine-tuned, they were adapted to the new task by training all of their layers on the new dataset. This allows the models to learn task-specific features that may not be present in the pre-trained models, but it also requires a larger amount of data and computational resources; otherwise, the models can overfit the training set. As a result, the models appear to have overfit the training set when fully fine-tuned compared to when they were used as feature extractors. When the models were trained from random initialization, the best-performing model was ResNet with an accuracy of 60.33%, followed by MobileNet and EfficientNet at 52.89% and 47.08%, respectively. These results highlight the critical contribution of transfer learning on a small dataset, especially for EfficientNet, which increased its accuracy from 47.08% to 95.87%.

Training time
We additionally compared the performance of the models based on their training time on the GPU. The results are shown in Table 3. From the table, we can notice that the training time of the fixed feature extractor was the smallest compared to random initialization and full fine-tuning because the number of updated parameters was significantly lower. Besides, as random initialization and full fine-tuning update the same number of parameters, their training times are relatively similar to each other. These results further favor using the pre-trained model as a fixed feature extractor on the proposed Thai culture image dataset, as it was able to achieve the highest accuracy while requiring the least amount of training time.

Qualitative results
To better understand the reasoning behind the performance of our best model, an EfficientNet model trained as a fixed feature extractor, we used the Grad-CAM [38] visualization technique to generate heatmaps of the regions in the images that the model attended to when making predictions. Grad-CAM is a visualization technique for understanding the decisions made by a convolutional neural network. It produces a heatmap that indicates the regions of the input image that are most important for the prediction made by the CNN. Grad-CAM can be used to explain the decisions made by the CNN, which can be useful for debugging and improving the performance of the model by highlighting the most relevant parts of the input image for making predictions.
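The Grad-CAM computation itself reduces to a weighted sum of the last convolutional layer's activation maps, with weights given by spatially averaged gradients. The NumPy sketch below uses fabricated toy activations and gradients (not values from our model) to show the mechanics:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM: weight each activation map by the spatially averaged
    gradient of the class score, sum over maps, and apply ReLU.

    activations: (H, W, K) feature maps from the last conv layer
    gradients:   (H, W, K) gradients of the class score w.r.t. them
    """
    weights = gradients.mean(axis=(0, 1))             # alpha_k: global average pool
    cam = np.tensordot(activations, weights, axes=([2], [0]))
    cam = np.maximum(cam, 0.0)                        # keep positive influence only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

# Toy example: 2 feature maps on a 4 x 4 grid; the class gradient
# favors map 0, so the heatmap follows that map's active region.
acts = np.zeros((4, 4, 2))
acts[1:3, 1:3, 0] = 1.0    # "object" activates map 0 in the centre
grads = np.zeros((4, 4, 2))
grads[..., 0] = 1.0        # class score depends on map 0 only

cam = grad_cam(acts, grads)
print(cam)                 # heatmap is 1 in the centre, 0 elsewhere
```

In the real setting, the upsampled `cam` is overlaid on the input image, which is how the heatmaps discussed above were produced.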

CONCLUSION
In this study, we investigated the use of transfer learning for the task of image classification on the Thai culture image dataset. The dataset was constructed from 1,000 images manually annotated into 10 well-known Thai cultures and traditions. We used MobileNet, EfficientNet, and ResNet as pre-trained models and evaluated their performance when trained with random initialization, used as feature extractors, and fully fine-tuned. The results showed that all three models performed better when used as feature extractors compared to random initialization and full fine-tuning, with EfficientNet achieving the highest accuracy of 95.87%. Favorably, using pre-trained models as feature extractors also takes the least amount of training time; in the case of EfficientNet, it only took 24 ms/iteration. Our qualitative findings from Grad-CAM additionally supported that the proposed method effectively localized the relevant information in the images for prediction.
Future work could focus on improving the performance of image classification on the Thai culture image dataset by exploring different pre-trained models and fine-tuning strategies. This may include experimenting with different fractions of the network's layers that are allowed to be updated during fine-tuning. Additionally, novel neural network architectures such as vision transformers could be adapted to further enhance performance.

Figure 1. Sample images from the Thai culture image dataset, which consists of 1,000 images of various aspects of Thai cultural traditions: (a) the merit-making ceremony of hoisting a cloth flag, (b) the merit-making ceremony of offering candles during the Buddhist Lent, (c) the merit-making ceremony of offering food to monks on the tenth lunar month, (d) the Songkran festival, (e) the running with a buffalo festival, (f) the Krathong festival, (g) the merit-making ceremony of offering candles to the spirits, (h) the merit-making ceremony of lighting candles, (i) the merit-making ceremony with flowers, and (j) the tradition of carrying the Buddha image in a procession. The images were collected from various online sources and manually annotated by assigning each image to one of the 10 well-known traditions

Figure 2. A pre-trained network can be used as a fixed feature extractor, where the weights of the feature extraction layers are not changed during training, but the weights of the classifier are updated during training on a new small dataset

Figure 3. Fully fine-tuning the pre-trained model on the new task involves adapting the model's parameters to the new small data by training all layers of the model including the feature extractor and the classifier

Thai culture image classification with transfer learning (Munlika Rattaphun)

Table 1. Comparison of MobileNet, EfficientNet, and ResNet in terms of parameters, CPU inference time, and GPU inference time

Table 2. Classification accuracy of MobileNet, EfficientNet, and ResNet when trained with random initialization, as a fixed feature extractor, and with full fine-tuning on the Thai culture image dataset

Table 3. Training time of MobileNet, EfficientNet, and ResNet when trained with random initialization, as a fixed feature extractor, and with full fine-tuning on the Thai culture image dataset