PithaNet: a transfer learning-based approach for traditional pitha classification

ABSTRACT


INTRODUCTION
Pitha is a type of food that originated in the Indian subcontinent and is equally popular in Bangladesh and India. Pithas are similar to pancakes, dumplings, or fritters. The main ingredient of a pitha is a batter composed of rice flour or wheat flour, which is further moulded and can be filled with either sweet or savory fillings. Pithas, or traditional handcrafted cakes, are a wintertime treat that Bengalis are renowned for enjoying. In the winter, date juice and molasses made from sugarcane and dates are available, which are essential ingredients for pithas. Hence, this dish is immensely popular during winter in Bangladesh. When farmers harvest paddy from the field in the late autumn, Bangladesh's pitha season officially begins. In addition to the winter season, these pithas can be found on festive occasions such as weddings, Eid festivities, and puja celebrations. Many well-known pithas, including patishapta pitha, bhapa pitha, chitoi pitha, til er pitha, kolar pitha, tel er pitha, and nakshi pitha, to name but a few, are seasonal during the winter. Food image classification, covering fast food, vegetables, fruits, cake, and so forth, has attracted a lot of interest in the research community. Nevertheless, the classification of food images remains a very recent area of study. Therefore, the classification of traditional pithas has been studied in this research.

ISSN: 2088-8708 
PithaNet: A transfer learning-based approach for traditional … (Shahriar Shakil)

the form of names or components can also be correctly gathered. Given that the categorization accuracy reached 70%, it is envisaged that this method will be implemented in a mobile-based system and provide a simple way for users to learn about Indonesian cuisine [12]. In this research, deep learning-based automatic food classification methods are described. For classifying food images, SqueezeNet and VGG16 CNNs are employed. SqueezeNet can reach quite a respectable accuracy of 77.20% even with fewer parameters. The accuracy of the projected VGG16 has significantly improved, reaching 85.07%, thanks to the deeper network [13]. For effective categorization of food photos, a support vector machine (SVM) classifier, feature concatenation, and deep feature extraction are employed. The proposed model is evaluated on three openly accessible datasets, FOOD-101, FOOD-5K, and FOOD-11, with accuracy as the performance metric. According to the testing results, the accuracy for the FOOD-5K dataset is 99.00%, and for the FOOD-11 and FOOD-101 datasets it is 88.08% and 62.44%, respectively [14].
Singla et al. [15] tested a deep CNN-based GoogLeNet model for the classification and recognition of food and non-food objects. The photos used in the trials were gathered from two of their own image databases as well as from available image datasets, social sites, and imaging tools such as smartphones and wearable cameras. The classification of food versus non-food exhibits a high accuracy of 99.2%, and the recognition of food categories exhibits an accuracy of 83.6%. Another study aimed to create a pre-trained structure for food recognition. Three different methods were applied to accomplish this goal, and their outcomes were compared with those of the well-known pre-trained models AlexNet and CaffeNet. The transfer learning technique was used to apply these pre-trained models to their problem. Test findings demonstrate that, as predicted, pre-trained models outperform the suggested models in terms of output. The maximum improvement in classification performance achieved by the Adam technique was 32.85%; the maximum improvement achieved by Nesterov was 14.77% [16].
In [17], a dataset of 1,676 pictures of Indonesian traditional cakes, with 20% testing data and 80% training data, was subjected to a CNN approach. Preprocessing, operational datasets, dataset visualization, modeling methodologies, performance evaluations, and error analysis were all used in the stages, leading to the conclusion that performance evaluation reached a level of 65.00% [17]. In order to classify images of Punakawan puppets, another work used a Gaussian filter as a preprocessing technique and a VGG16 learning architecture for classification. The study discovered that the maximum accuracy, 98.75%, was achieved while utilizing contrast limited adaptive histogram equalization (CLAHE) + red, green, and blue (RGB) + Gaussian filter and thresholding images [18].
Recent CNN breakthroughs have been applied to recognizing 12 different forms of illnesses affecting rice plants. Additionally, the performance of 8 different cutting-edge convolutional neural network models has been assessed with a focus on diagnosing diseases of rice plants. The validation and testing accuracy of the proposed model are 96.5% and 95.3%, respectively, and it properly diagnoses illnesses in rice plants [19]. Another study proposes an effective CNN-based fish categorization technique. Three types of splitting, 80%-20%, 75%-25%, and 70%-30%, were tested on a novel dataset of indigenous fish species found in Bangladesh. The proposed model improved CNN's classification capacity with a highest accuracy of 98.46% [20]. The Kaggle-180-birds dataset was classified in another study using three different classifiers: a deep learning method (ResNet50), classical machine learning (ML) algorithms (SVM and decision tree (DT)), and a transfer learning-based deep learning algorithm (ResNet50-pretrained). The outcomes showed that the transfer learning classifier had the best classification effect, with a 98% to 100% accuracy rate [21]. These various research methodologies provide useful guidance, because our study is extremely unique and there has not been any significant research on this particular pitha classification topic.
Table 1 shows that most prior work addresses food image classification, such as fast food, cake, and other foods, using CNN models or pre-trained models. However, this work addresses a traditional food, namely pitha classification, using transfer learning models such as EfficientNetB6, ResNet50, and VGG16. Among them, VGG16 showed the best accuracy. There has been no significant prior work on traditional pitha classification.

METHOD
In this section, we discuss the datasets, data pre-processing, model building, statistical analysis, and the general architecture of our proposed model. The workflow diagram of this study is shown in Figure 1. The overfitting issue with model training has been addressed using image augmentation approaches, which also improve model performance [22]. By concentrating on horizontal flips, rotation, zooming, shear, height-shift, width-shift, and rescaling while augmenting, the CNN models become less sensitive to the precise location of the item [23]. The rotation angle was 30, the zoom range 0.2, the height-shift and width-shift ranges 0.2, and the shear range 0.2. The augmentations were done in such a way that the quality of the images was not lost. As 10% (524) of the images were used to evaluate the model, the augmentation technique was applied to the remaining 4,716 images, from which we generated 10,000 images. After that, all images were resized to 224×224. Figure 5 shows the differences between original images and augmented images.
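Part of this augmentation pipeline can be sketched in NumPy. The helper `augment` below is hypothetical and covers only the rescale, horizontal-flip, and shift steps; in practice a library utility such as Keras's ImageDataGenerator would also handle rotation, zoom, and shear:

```python
import numpy as np

def augment(img, rng):
    """Sketch of three of the augmentations used here: rescale by 1/255,
    random horizontal flip, and width/height shifts of up to 20%.
    Rotation, zoom, and shear are omitted for brevity."""
    out = img.astype(np.float32) / 255.0                     # rescale
    if rng.random() < 0.5:
        out = out[:, ::-1]                                   # horizontal flip
    h, w = out.shape[:2]
    dy = int(rng.integers(-int(0.2 * h), int(0.2 * h) + 1))  # height shift
    dx = int(rng.integers(-int(0.2 * w), int(0.2 * w) + 1))  # width shift
    return np.roll(out, (dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(224, 224, 3))  # dummy 224x224 RGB image
aug = augment(img, rng)
print(aug.shape)  # → (224, 224, 3)
```

Note that flips and shifts preserve the 224×224 image size, so the augmented set can be fed directly to the networks described below.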

CNN
Most recently, CNNs have made significant advances in deep learning, resulting in a remarkable rise in the accuracy of image identification and recognition. A CNN is a type of artificial neural network (ANN) that is mostly used for image detection and processing because of its capacity to spot patterns in visual data. CNN was the top contender for our experiment because it does not require feature engineering. Had we used a traditional ML approach on our pitha dataset, we would also have needed feature extraction algorithms to select features across the different types and shapes of pithas, which would take a lot of time. We also compared handcrafted features with CNN, and CNN performed better.

Convolutional layer
Filtering operations are carried out mostly via convolutional layers. This layer is responsible for handling the vast majority of the computations of a CNN model. To construct the feature map, the set of images is fed into this layer. The kernel, also known as the feature detector, looks for features in the image. The kernel size can vary, but 3×3 or 5×5 matrices are commonly used. Following the convolution layer, the rectified linear unit (ReLU), which is frequently used in neural networks, performs the nonlinear activation [24].
Any negative input causes the function to return 0, while any positive value causes it to return that value. Thus, it may be expressed as in (1): f(x) = max(0, x).
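As a minimal illustration, (1) can be written directly in code:

```python
def relu(x):
    """ReLU from (1): f(x) = max(0, x)."""
    return max(0.0, x)

print([relu(v) for v in (-3.0, 0.0, 2.5)])  # → [0.0, 0.0, 2.5]
```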

Feature map
CNNs are computationally expensive due to their depth and numerous parameters. Therefore, dimension reduction between the layers is required. The dimension is reduced in the pooling layer by a down-sampling process [25]. The pooling layer summarizes the salient features in a specific area of the feature map. If the input image has dimension W×W, h represents the filter size, the result matrix is marked M[m×n], and p and s represent the padding size and stride size respectively, then the formula to calculate the feature-map size is given in (2): m = n = (W − h + 2p)/s + 1, and Figure 6 shows the calculations behind the feature map [26].

Pre-trained models
Pre-trained models include network weights that have already undergone training. Utilizing an architecture that has already been learnt from a classification task minimizes the number of steps necessary for the output to converge, because, in general, the features collected for similar classification tasks will be comparable. Initializing the model with pre-trained weights saves time during training and is therefore more effective than starting with random weights. According to a survey of the literature on similar classification tasks, researchers have utilized different pre-trained models for food classification, including AlexNet, VGG16, EfficientNet, and SqueezeNet. In this paper we have selected three popular pre-trained models: VGG16, ResNet50, and EfficientNetB6. A brief explanation of these three base models is given in this section.

VGG16
VGG-16 is a CNN that is 16 layers deep. It was first introduced at ILSVRC-2014 by Simonyan and Zisserman [2]. The model, which has roughly 138 million parameters, won the top prize by achieving 92.7% top-5 accuracy on the ImageNet dataset, which contains more than 14 million images from close to 1,000 classes. The input in this architecture is 224×224. It has convolution layers with 3×3 kernels and stride 1, max-pooling layers that employ 2×2 kernels with stride 2, and a total of three fully connected layers, the last of which uses the SoftMax activation function. It is currently one of the most popular options in the community for extracting features from images. As our pitha dataset has various classes, we put it on our list in the hope that it would perform well.

ResNet50
A well-known neural network called the residual network, often known as ResNet, provides the basis for numerous computer vision tasks. ResNet for image recognition was initially developed by He et al. [3] in a research paper titled "Deep residual learning for image recognition", which took first place in the ILSVRC-2015 contest. ResNet's core innovation was its ability to train incredibly deep neural networks. In the 34-layer net, every 2-layer block is swapped out for a 3-layer bottleneck block to produce a 50-layer ResNet, which has a fixed input size of 224×224 pixels and almost 23 million parameters. The reason for choosing ResNet50 was that we wanted to test a deeper network than VGG16 and VGG19 in this experiment. Although ResNet50 contains more layers, its overall size is actually significantly smaller, because global average pooling is used rather than fully connected layers in the architecture.

EfficientNet B6
Making use of a compound coefficient, the EfficientNet architecture and scaling algorithm uniformly scale all depth, width, and resolution dimensions, as proposed by Tan and Le [27].

Transfer learning
Conventional ML models need to be trained from scratch with a substantial amount of data, which is computationally costly. Transfer learning is a powerful deep learning approach in which a network trained for a task "A" is repurposed for a new task "B". In a CNN model, the initial layers typically identify features common to all images; the model only attempts to differentiate between classes in the final few layers. Because the higher layers deal with task-specific features, we control how much of the network's weights to retrain. Since PithaNet has a moderate number of samples and the dataset is comparable, we can avoid a lengthy training process by reusing the model's prior knowledge; training the classifier and the upper layers of the convolutional base should be sufficient. As a result, this paper concentrates on preserving the original configuration of the convolutional base, keeping more layers frozen to avoid overfitting. The typical approach for image classification problems is to append two or more densely connected layers after the convolutional base, where the final dense layer uses the SoftMax activation function in the case of multiclass classification. To minimize overfitting, a global average pooling layer is added, which reduces the overall number of parameters. Figure 7 depicts the architecture of the proposed modified transfer learning model.
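A minimal sketch of this setup, assuming a Keras implementation with VGG16 as the base; the 256-unit dense layer is an illustrative head size, not a value taken from the paper:

```python
import tensorflow as tf

def build_pitha_model(num_classes=8, weights="imagenet"):
    """Sketch of the modified transfer-learning model: a frozen VGG16
    convolutional base, global average pooling instead of Flatten, and a
    SoftMax classification head for the eight pitha classes."""
    base = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = False  # keep the convolutional base frozen

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),  # reduces parameter count
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_pitha_model()  # downloads ImageNet weights on first use
```

The same pattern applies to the ResNet50 and EfficientNetB6 variants by swapping the base constructor.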
Table 2 summarises the input size, learning rate, epoch count, and parameter count of the adjusted models. As EfficientNetB6 and VGG16 have significantly more parameters than ResNet50, this study used a learning rate of 0.1 for those two and 0.001 for ResNet50. Figure 8 displays the training, validation, and test accuracies on 8,000 training images, 2,000 validation images, and 524 testing images. The EfficientNetB6 model's training accuracy is about 96.5%, its validation accuracy 87.5%, and its test accuracy 90%. In comparison, ResNet50 performs slightly better than EfficientNetB6, providing 97.6% training accuracy, 89.6% validation accuracy, and 90% test accuracy. Nonetheless, VGG16 gives a satisfactory result, with a training accuracy of about 95.5%, a validation accuracy of 91.5%, and a test accuracy of 91.5%. An epoch is one complete pass over the training dataset, and the batch size is the number of samples processed before the model is updated. All the models were trained for 50 epochs with a batch size of 16. Adam was used as the optimizer, categorical cross-entropy as the loss function, and accuracy as the evaluation metric. Figure 9 displays the training and validation accuracy and loss of the EfficientNetB6, ResNet50, and VGG16 models. It also indicates that these models are neither overfitted nor underfitted; they perform well on the real test images, which can be better understood from the confusion matrix. Table 3 demonstrates the accuracy of the 8 classes for the EfficientNetB6, ResNet50, and VGG16 models. Accuracy is the percentage of correctly predicted data points across all data points, and it can be calculated following (3). Table 4 demonstrates the precision and recall of the 8 classes for the three models. Precision is defined as the ratio of correctly identified positive samples to all samples classified as positive. It is one metric for measuring the effectiveness of a machine learning model: a model's precision determines how many of the discovered objects are actually relevant, and it can be found using (4). Recall, or sensitivity, refers to the quantity of positive records that were accurately predicted; it is computed as the proportion of correctly classified positive samples to all positive instances. Recall measures how well the model can differentiate positive samples, and it increases as more positive samples are found. It can be formulated by (5). Table 5 demonstrates the F1-score and specificity of the 8 classes for the three models. The F1-score, or F-measure, is an evaluation metric for classification, specified as the harmonic mean of recall and precision; it is used in statistics to assess how accurate a test or model is, and is represented mathematically in (6). Specificity may be defined as the ability of the algorithm or system to predict a genuine negative of each available category. It is frequently referred to in the literature as the true negative rate, and it may be calculated using (7).
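The metrics referenced in (3) through (7) follow the standard confusion-matrix definitions, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives:

```latex
\begin{align}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{3}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{4}\\
\text{Recall} &= \frac{TP}{TP + FN} \tag{5}\\
\text{F1-score} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}\\
\text{Specificity} &= \frac{TN}{TN + FP} \tag{7}
\end{align}
```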

CONCLUSION
The intended outcome of this research is to classify eight different varieties of Bangladeshi pitha using a pre-trained CNN model. The study shows that VGG16, ResNet50, and EfficientNetB6 all perform well. Among these, the CNN-based VGG16 model provides the highest accuracy, about 92%. Although image classification is a typical task in computer vision, it becomes quite complex when classifying multiple classes, as some pithas differ from one another only in shape. Even though it can be challenging to operate under certain constraints, we made an effort to overcome these obstacles despite our lack of resources. Most importantly, the F1-score of the VGG16 model for each class is nearly 90%, which indicates that almost all classes are predicted successfully.
In the future, we would like to develop an automated application that can identify not only the pithas, but also their ingredients and calorie information. Anyone using that app will be able to see the name and features of a pitha by clicking on its image. This will be beneficial for future generations in gaining more knowledge about our traditional pithas, which will enhance their cultural awareness as well. Future additions to this study will also include new datasets and pitha variations.


Data collection and pre-processing is the first step of this research work. Here, we preprocess the data using augmentation and image-resizing techniques. Afterward, EfficientNetB6, ResNet50, and VGG16 are used for model building. Finally, we test all these models with real data and make predictions.

Figure 1. Workflow diagram of the study

Figure 7. Proposed transfer learning architecture

Figure 8. Accuracy of CNN models during training and testing

Figure 11 shows the number of misclassified images: out of 525 images, EfficientNetB6 misclassified 51, ResNet50 misclassified 54, and VGG16 performed best with only 44 misclassified images. VGG16 had some difficulty with the class "Chitoi Pitha", with 11 misclassifications, but in the rest of the classes the misclassification rate is significantly low. After reviewing all of these results, it is clear that VGG16 performs well on the PithaNet dataset.

Table 1. Exclusive review of recent work on food image classification

Tan and Le came up with the EfficientNet architecture in the research paper titled "EfficientNet: Rethinking model scaling for CNNs" [28], presented at the 2019 International Conference on ML. They suggested a brand-new scaling technique that scales the network's depth, width, and resolution uniformly. With the help of neural architecture search, they developed a new baseline model and scaled it up to produce the EfficientNets family of deep learning models, which beat earlier CNNs in terms of performance and accuracy. The input size of EfficientNetB6 is 224×224 and it has more than 40M parameters [28]. Besides VGG16, a more popular and established model, and ResNet50, chosen for its deeper layers and different architecture, we chose EfficientNet hoping to obtain a good result with reasonable parameter count and cost.

Table 2. The input size, learning rate, number of epochs, and different parameters of the modified models