Food sales prediction model using machine learning techniques

ABSTRACT


INTRODUCTION
Food sales are an active research topic today because of their importance to companies, merchants, and entrepreneurs [1]. Food sales prediction is a critical and useful step for these companies: it lets them estimate the quantities of food they will need in the future and ensure that food is not wasted because it has passed its expiration date [2]. Accurate prediction of food sales is difficult because external parameters change dynamically. In general, food sales prediction is driven by time series data [1]. The huge amount of data collected and the large number of interacting features make the data impossible to process manually. This is where artificial intelligence comes in: its algorithms process the data automatically and predict future sales. Many such algorithms exist, and they differ in the quality of the predictions they provide.
Many researchers have addressed the topic of food sales prediction, and many algorithms have been used. However, most researchers used only one, two, or three algorithms, and they often used a single dataset. In contrast, this study uses more than one dataset and more than one algorithm. The related works are summarized below.
In the study in [3], the researchers designed two prediction models, a main prediction model and a weather prediction model, to demonstrate the effect of weather factors on the sales prediction of fresh products. The algorithms used were random forest (RF), ridge regression (RR), and support vector machine (SVM). The relative reduction rates of the root mean square error (RMSE) for the three algorithms were 68.90% for ridge regression, 23.66% for RF, and 59.52% for SVM. The relative reduction rates of the mean absolute percentage error were 66.2% for ridge regression, 34.99% for RF, and 61.13% for SVM.
ISSN: 2088-8708 Food sales prediction model using machine learning techniques (Hussam Mezher Merdas) 6579
Prabhakar et al. [4] predicted the future sales of one of the leading grocery store chains in the market. Their study explains the algorithms used to predict the unit sales of products sold by this grocery store. Predictive modeling is used to solve the analytics problem arising from daily trading, continuous selling, and the relationships between products. The RMSE results of this study were 11.97 for linear regression, 9.5 for gradient boosting, and 9.82 for the neural network (NN); the SVM was still running after 5 hours without producing a result.
Arunraj et al. [5] aimed to build a model that takes into account the effects of all the factors influencing demand in order to predict the daily sales of perishable food in a store. Their prediction system calculates uncertainty and accuracy, which are important measures for preventing food waste. The study reports an RMSE of 0.613.
Posch et al. [6] designed a model for predicting future sales of items in restaurants using two Bayesian generalized additive models. The first model assumes normally distributed future sales, while the second is based on the negative binomial distribution. They used two datasets collected from point of sale (POS) systems in a restaurant and a staff canteen. The results showed that the proposed approach often provided the best and most robust point predictions, and that the model based on the negative binomial distribution was more accurate than the other.
Meulstee and Pechenizkiy [7] presented an ensemble learning approach for sales prediction. Dynamic integration of classifiers is employed to handle fluctuations and seasonal changes in consumer demand. Their study focuses on how food wholesale forecasting is currently operated in the company and how product prediction can be improved. They evaluated their approach using real data collected by the retailing company. The results show that the ensemble learning approach outperforms the currently used baseline, that it can handle seasonal changes if the feature set for a target product is complemented with additional features, and that an ensemble can become more accurate if information about holidays and the weather is presented explicitly in the feature set.
Finally, Schmidt et al. [8] divided each day into three sales periods and focused on the middle timeslot, with sales predictions extending up to one week ahead. They used a dataset covering three years of sales, from 2016 to 2019, from a restaurant. They preprocessed the raw data and converted it into a workable dataset, then tested many machine learning (ML) algorithms on it, such as a recurrent neural network (RNN) and a ridge model. The ridge regression model was the best for one-day forecasting, with a mean absolute error (MAE) of 214, and the temporal fusion transformer model was the best for one-week forecasting, with an MAE of 216.
This paper uses two datasets. The first, whose features are strongly correlated, was obtained from publicly available data on Alibaba's Tianchi platform; it consists of 15 attributes and 1,000 objects and contains sales data for different cities and branches. The second, whose features are weakly correlated, contains sales data from ten stores in different cities and was sourced from Kaggle; it consists of 12 attributes and 8,523 objects. Both are used so that their results can be compared. The two datasets were examined for missing values, which were filled in using appropriate mathematical operations, such as taking the mean of the remaining values. Repeated values and spelling errors in the entries were also corrected, so that the datasets were filtered and prepared appropriately before being presented to the algorithms. Ten algorithms are applied, and several metrics are then used to evaluate them and select the best three as the appropriate algorithms for such datasets. The main objective is to help companies know the future sales of their products, thereby reducing losses from food spoilage and increasing profits.
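As an illustration of the mean-based treatment of missing values described above, the following minimal sketch fills the gaps in one numeric column. The function name `impute_mean` and the sample values are hypothetical and are not taken from either dataset.

```python
from statistics import mean

def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

# Hypothetical sales column with two missing entries.
sales = [120.0, None, 80.0, 100.0, None]
filled = impute_mean(sales)
```

Here the observed values average to 100.0, so both gaps are filled with that value while the observed entries are left unchanged.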

THE PROPOSED MODEL
The proposed model is based on two key steps, as shown in Figure 1. The first step is to use two datasets with different content and different degrees of correlation between their features. The goal of this step is to make a practical comparison between the two datasets by applying the necessary processing, feeding them to several algorithms, and observing the results, so that subsequent studies or companies with similar databases can benefit. The second step is to apply ten different ML algorithms to the preprocessed data and then evaluate these algorithms using several different metrics. Using these metrics, the best three algorithms are determined and recommended for future use in predicting food sales.

METHOD

General steps
The main steps of the proposed model start with loading the two datasets and performing visualization and pre-processing. The degree of correlation between the features of each dataset is then measured separately. After that, the ten ML algorithms are applied to each dataset, and the results are evaluated using accuracy measures. Finally, the three algorithms that gave the highest accuracy are identified for each dataset separately, as shown in Figure 2.

Dataset correlation
To measure the degree of correlation between the features of each dataset, a correlation-matrix heatmap was used. It is a visualization tool provided by the Seaborn library in Python: a matrix of colors that describes the correlation between the attributes of a selected dataset. As shown in Figures 3(a) and 3(b), a higher color intensity between two attributes indicates that one attribute depends strongly on the other, while a lower intensity indicates a weak dependence. It can therefore be concluded that the correlation between the attributes of dataset 1 is higher than that between the attributes of dataset 2.
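The values that such a heatmap colors can be sketched as a pairwise correlation matrix. The example below assumes the Pearson coefficient (the text does not state which coefficient was used) and computes it in plain Python. The column names and numbers are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Map every pair of column names to their correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical attributes, not taken from either dataset.
data = {
    "unit_price": [10.0, 20.0, 30.0, 40.0],
    "quantity": [1.0, 2.0, 3.0, 4.0],
    "rating": [4.0, 1.0, 3.0, 2.0],
}
corr = correlation_matrix(data)
```

With pandas and Seaborn installed, the same matrix would typically be obtained with `DataFrame.corr()` and drawn with `seaborn.heatmap`. In this toy data, `unit_price` and `quantity` are perfectly correlated, while `rating` is only weakly related to either.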

ML algorithms used
In this study, ten supervised ML algorithms were used: multilayer perceptron, SVM, multiple linear regression, decision trees, RF regressor, K-nearest neighbor (KNN), bagging regressor, least absolute shrinkage and selection operator (LASSO), random sample consensus (RANSAC), and gradient boosting. The goal is to achieve a comprehensive and effective comparison. The following subsections describe the selected ML algorithms.

Multilayer perceptron
The multilayer perceptron (MLP) is the most well-known and most frequently used type of neural network [9]. It is a mathematical system inspired by the human nervous system and the structure of the brain [10]. The MLP has three basic kinds of layers: the input layer, the hidden layer, and the output layer. In multilayer feed-forward networks, information is transmitted in one direction, from the input layer to the output layer. Boolean functions have been used in feed-forward networks [11]. The ability of the network to perform complex functions depends on the number of hidden layers and the number of nodes in each hidden layer [11]. However, increasing the number of hidden layers and nodes adds complexity and does not necessarily mean the network performs better.
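The one-directional flow from input layer to hidden layer to output layer can be sketched with a tiny network. The weights below are hand-picked for illustration so that the network reproduces the Boolean XOR function; they are not learned and are not the configuration used in this study. The ReLU activation is also an assumption.

```python
def relu(v):
    """Rectified linear activation."""
    return max(0.0, v)

def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: input layer -> one hidden layer -> single output node."""
    hidden = [relu(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(hidden_weights, hidden_biases)]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias

# Hand-picked (hypothetical) weights for a 2-2-1 network computing XOR.
HIDDEN_W = [[1.0, 1.0], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUT_W = [1.0, -2.0]
OUT_B = 0.0

def xor(a, b):
    return mlp_forward([a, b], HIDDEN_W, HIDDEN_B, OUT_W, OUT_B)
```

A single linear layer cannot represent XOR, so this small case shows why the hidden layer matters for more complex functions.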

SVM
SVMs are a supervised learning method used for classification and regression [12]. SVM is based on ML theory and aims to improve predictive accuracy. It uses a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory [12]. It is used for many applications in the regression field. Support vector regression (SVR) was later derived from SVM and performs well on nonlinear regression problems [13]. The idea is that, based on the information in the given training samples, the input variables are mapped from the primal space to a higher-dimensional feature space in which a linear regression function is built [14].

Multiple linear regression
Linear regression is a method that estimates the value of one variable from the value of another [15]. In machine learning, linear regression is a popular and widely used method, largely because of its simplicity [16]. It uses past data to predict values. Linear regression has two types: simple and multiple linear regression. Multiple linear regression is the most popular type of predictive analysis [17]. It determines the relationship between multiple independent variables on the one hand and a dependent variable on the other [18]. Its equation can be written as:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
where $y$ is the predicted (dependent) variable, $\beta_i$ is a parameter, $x_i$ is an independent variable, and $\varepsilon$ is the random error.
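As a rough sketch of how the parameters of a multiple linear regression can be estimated, the example below solves the least-squares normal equations in plain Python. The helper names (`solve`, `fit_linear`, `predict`) and the data are hypothetical. The data is generated without noise, so the random error term is zero and the parameters are recovered almost exactly.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_linear(rows, targets):
    """Return [b0, b1, ..., bn] minimizing the sum of squared errors."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 models the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(coeffs, row):
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], row))

# Hypothetical data generated exactly by y = 2 + 3*x1 - x2.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
targets = [2 + 3 * x1 - x2 for x1, x2 in rows]
coeffs = fit_linear(rows, targets)
```

In practice a library routine would normally be used instead of hand-written elimination; the sketch only makes the link between the equation and the fitted parameters explicit.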

Decision tree
The concept of a decision tree in artificial intelligence is derived from that of a real tree [19]. A decision tree is used to represent decisions and decision making. The decision tree algorithm is a supervised learning algorithm used for regression and classification [20]. It can be used to build a model that predicts the value of the target variable using decision rules learned from historical data (training data) [21]. When predicting with a decision tree, the attribute values at the root are compared with the attribute values of the record. The depth of the decision tree controls its efficiency [22].
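A single decision rule of a regression tree can be sketched as follows. The example searches for the threshold on one attribute that minimizes the squared error and predicts the mean target on each side. The squared-error criterion, the function names, and the data are assumptions made for illustration. A full tree would apply the same search recursively up to a chosen depth.

```python
def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing the squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate rule: x <= threshold
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]

def predict(split, x):
    """Compare the record's attribute value with the learned rule."""
    threshold, left_mean, right_mean = split
    return left_mean if x <= threshold else right_mean

# Hypothetical data: low prices sell more than high prices.
prices = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
sales = [50.0, 52.0, 51.0, 20.0, 22.0, 21.0]
split = best_split(prices, sales)
```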

Random forest regressor
RF combines multiple decision trees and uses the bagging mechanism to produce efficient results [23]. Bagging is one of the most common ensemble techniques [24]. Ensemble methods use several learning models to give a better prediction than any single machine learning algorithm alone [25]. RF can be used for classification and regression. For regression, the algorithm returns the mean of the predictions of the individual trees [25]. Controlling the depth of the RF allows it to be optimized to give the best prediction result [26]. Each decision tree makes a number of splits, and the maximum depth is determined by these splits [23].

K-nearest neighbors
The KNN algorithm is a supervised learning approach used for regression and classification [27]. Based on the similarity between the target and other available cases, KNN calculates the prediction value using a distance measure [28]. The Euclidean distance is the most popular choice [28]. KNN works by finding the K most similar neighbors (instances) to the test point in the dataset [29]. Using the Euclidean distance formula
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$
KNN finds the distance between points [30]. Selecting the best value of K is difficult: if K is small, noise affects the result, and if K is large, the computation becomes expensive [30]. To select the best K for our data, we ran the algorithm several times with different values of K.
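A minimal sketch of KNN regression with the Euclidean distance is given below. Averaging the targets of the K nearest neighbors is assumed as the prediction rule, and the training points are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k training points closest to the query."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical training data with two clusters of points.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_y = [10.0, 12.0, 14.0, 40.0, 44.0]
```

Calling `knn_predict` with several values of `k` on held-out points is one simple way to compare candidate values of K, in the spirit of the repeated runs described above.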

Bagging regressor
Bootstrap aggregation, also called bagging, is an ensemble machine learning algorithm that combines many decision trees to make a prediction [24]. Bagging serves as the basis for several algorithms. Bootstrap aggregation derives from the idea of the bootstrap sample [24]: a sample drawn from the dataset with replacement, so that the same example may be selected multiple times within a new sample. The bootstrap sampling technique estimates a population statistic from a small data sample by drawing multiple bootstrap samples, calculating the statistic on each, and taking the mean of the statistic over all samples. This gives an accurate approach to estimating statistical quantities [24]. An ensemble of decision tree models can be created with the same approach, by drawing multiple bootstrap samples from the training dataset and fitting a decision tree on each. Finally, combining the predictions of the decision trees gives a more robust prediction than a single decision tree [31].
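The procedure can be sketched in a few lines: draw bootstrap samples with replacement, fit one tree per sample, and average the predictions. To keep the example short, each tree is restricted to a single split (a stump), which is a simplification and not the configuration used in this study. All names and data are hypothetical.

```python
import random
from statistics import mean

def fit_stump(points):
    """Fit a one-split regression tree to a list of (x, y) pairs."""
    xs = sorted({x for x, _ in points})
    if len(xs) < 2:  # degenerate sample: predict the mean everywhere
        m = mean(y for _, y in points)
        return lambda x: m
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in points if x <= t]
        right = [y for x, y in points if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def bagging_fit(points, n_models, rng):
    """Fit one stump per bootstrap sample."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]  # drawn with replacement
        models.append(fit_stump(sample))
    return models

def bagging_predict(models, x):
    """Combine the individual predictions by averaging them."""
    return mean(model(x) for model in models)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
points = [(1.0, 50.0), (2.0, 52.0), (3.0, 51.0),
          (10.0, 20.0), (11.0, 22.0), (12.0, 21.0)]
models = bagging_fit(points, 25, rng)
```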

LASSO regression
LASSO is a method used for regression [32]. It works as a regularization technique. Regularization is applied on top of regression methods to avoid overfitting the data and so improve prediction accuracy [33]. LASSO uses an operation called shrinkage, in which data values are shrunk toward a central point such as the mean [34]. Ridge regression and LASSO regression are the main regularization techniques [32]. This study focuses on LASSO regression, which uses the L1 regularization technique [32].
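The effect of the L1 penalty can be illustrated in the simplest case of a single feature without an intercept. Assuming the objective $\frac{1}{2n}\sum_i (y_i - w x_i)^2 + \alpha |w|$ (a common formulation; the exact objective used in this study is not stated), the minimizer has a closed form based on soft-thresholding. The function names and data below are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha, returning exactly 0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_1d(xs, ys, alpha):
    """Minimize (1/(2n)) * sum((y - w*x)^2) + alpha * |w| over a single weight w."""
    n = len(xs)
    rho = sum(x * y for x, y in zip(xs, ys)) / n  # correlation of feature and target
    z = sum(x * x for x in xs) / n                # scale of the feature
    return soft_threshold(rho, alpha) / z

# Hypothetical data following y = 2x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
```

With `alpha = 0` the weight is the ordinary least-squares value 2.0; a moderate penalty shrinks it, and a large enough penalty sets it exactly to zero, which is how LASSO can discard features.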

RANSAC regression
The RANSAC algorithm is an iterative method that estimates the parameters of a model from a set of input data in a way that copes with outliers [35]. It was proposed by Fischler and Bolles [36]. RANSAC was used for regression in our study because it works very well with outliers [37]. It works by selecting a random subset of the data samples and using this subset to estimate the model parameters [38]. In the next step, it finds the samples that lie within the model's error tolerance and treats them as a consensus set. The data samples in the consensus set are called inliers and the rest are called outliers [39]. If the consensus set is large enough, RANSAC retrains the model using its samples. These steps are repeated for a number of iterations, and the model with the smallest error is returned.
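The steps above can be sketched for the simple case of fitting a line $y = ax + b$. The parameter names (`tolerance`, `min_inliers`, `n_iter`), the choice of two points per random subset, and the data are assumptions made for this illustration.

```python
import random

def fit_line(points):
    """Least-squares line y = a*x + b through the given points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

def ransac_line(points, n_iter, tolerance, min_inliers, rng):
    best_model, best_error = None, float("inf")
    for _ in range(n_iter):
        sample = rng.sample(points, 2)  # random minimal subset
        if sample[0][0] == sample[1][0]:
            continue  # vertical pair cannot be written as y = a*x + b
        a, b = fit_line(sample)
        # Consensus set: samples within the error tolerance of the candidate model.
        consensus = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tolerance]
        if len(consensus) < min_inliers:
            continue
        a, b = fit_line(consensus)  # refit on the inliers only
        error = sum((y - (a * x + b)) ** 2 for x, y in consensus) / len(consensus)
        if error < best_error:
            best_model, best_error = (a, b), error
    return best_model

rng = random.Random(0)  # fixed seed so the sketch is reproducible
inliers = [(float(x), 2.0 * x + 1.0) for x in range(10)]  # points on y = 2x + 1
outliers = [(2.5, 40.0), (6.5, -30.0)]
model = ransac_line(inliers + outliers, 50, 0.5, 8, rng)
```

Because the two outliers never fall within the tolerance of a line through the inliers, the returned model is fitted on the inliers alone.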

Gradient boosting
Gradient boosting is a popular machine learning technique used for classification and regression [40]. It works well with missing values and outliers, and it can discover nonlinear relationships between features [41]. Gradient boosting is an ensemble method that combines multiple weak models to give better performance [42]. It is a boosting algorithm that minimizes the bias error of the model [43]. The mean square error (MSE) is used as the cost function for regression tasks [44]. Gradient boosting is a very powerful technique for prediction [41]. It optimizes the prediction accuracy over multiple functions and finally gives a high-accuracy prediction.
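A minimal sketch of gradient boosting for regression with a squared-error cost is shown below. Each round fits a weak model (here a one-split stump, chosen as an assumption to keep the code short) to the current residuals, which are the negative gradient of the squared error, and adds a scaled copy of it to the ensemble. The names, learning rate, and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """One-split regression tree (threshold, left_mean, right_mean) fitted to residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]

def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds, learning_rate):
    """Return (base prediction, learning rate, list of stumps)."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # negative gradient of (y - p)^2 / 2
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical nonlinear data: y = x^2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [x * x for x in xs]
mse_by_rounds = {
    n: mse(ys, [boosted_predict(gradient_boost(xs, ys, n, 0.5), x) for x in xs])
    for n in (0, 1, 20)
}
```

The training MSE falls as rounds are added, which illustrates how many weak models combine into a stronger one.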

RESULTS AND DISCUSSION
The proposed model in this study adopted two metrics, RMSE and MSE, to compare the accuracy obtained from each algorithm. The results are shown in Tables 1 and 2. Using two types of datasets and ten different algorithms, the following was found:
a. The algorithms give excellent outputs when applied to the first dataset, whose features are highly correlated, unlike the second dataset, whose features are weakly correlated.
b. The three best algorithms on dataset 1, with the highest accuracy and the lowest MSE and RMSE, are SVR, LASSO, and the bagging regressor. The three best algorithms on dataset 2 are gradient boosting, the random forest regressor, and the decision tree.
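For reference, the two metrics and the selection of the three best algorithms can be sketched as follows. The values and the model names (`model_a` to `model_d`) are placeholders for illustration only and are not the results reported in Tables 1 and 2.

```python
from math import sqrt

def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error, in the same units as the target."""
    return sqrt(mse(actual, predicted))

def best_three(scores):
    """Names of the three models with the lowest error."""
    return sorted(scores, key=scores.get)[:3]

# Placeholder values, not taken from the study.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 9.0, 9.0]
scores = {"model_a": 2.0, "model_b": 0.5, "model_c": 1.0, "model_d": 3.0}
```

Lower values of both metrics indicate better predictions, so ranking by either metric in ascending order yields the same kind of shortlist as described in item b.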

CONCLUSION
The correlation between the features of a dataset has a significant impact on the results of the applied algorithms. For datasets whose features are highly correlated, the SVR, LASSO, and bagging regressor algorithms are preferable. For datasets whose features are weakly correlated, gradient boosting, the random forest regressor, and decision trees are preferable. This study offers a roadmap for future studies, saving them time in choosing the appropriate algorithm for datasets with given characteristics, and it saves companies effort by giving them appropriate algorithms to apply to their datasets.

Figure 1. The proposed model

Figure 3. The correlation between features: (a) selected dataset 1 and (b) selected dataset 2

Table 1. Results of dataset 1

Table 2. Results of dataset 2