Random forest model for forecasting vegetable prices: a case study in Nakhon Si Thammarat Province, Thailand

ABSTRACT


INTRODUCTION
Nakhon Si Thammarat is a province in the south of Thailand, where most of the population is engaged in agriculture. The main problem for vegetable cultivation in the province is drought. According to the statistics, Nakhon Si Thammarat experienced a total of 5 droughts from 2013 to 2019. In 2016, 12 districts were at the highest drought level, and 883.54 square kilometres of agricultural land were damaged [1]. Besides the unfavourable climate, farmers face plant disease, pest infestation, and low prices, as they cannot set the prices they desire [2].
Although the price of vegetables has a large impact on the population, it is volatile and changes quickly, which makes future prices difficult to predict consistently. Nonetheless, vegetable price prediction is necessary so that the general public can know vegetable prices in advance [3].
Much current research focuses on making forecasting models more accurate by using modern statistical and computing methods such as machine learning (ML) and artificial intelligence (AI), depending on the goals and nature of the problem [4]. ML is a subdomain of AI [5]. It is the science of training computers to act without being explicitly programmed [6]. AI aims to make computers perform tasks on their own, and such systems can be both accurate and fast. In machine learning, a model is created and trained using techniques such as supervised learning, unsupervised learning, and reinforcement learning [6]. The data in machine learning is made up of examples, and each example is described by a set of attributes, also known as variables [7], [8]. There are two types of supervised learning: classification and regression. The dependent variable is discrete in a classification problem and continuous in a regression problem [9].
Random forest is a machine learning technique that employs a large number of classification or regression sub-trees. It is a popular prediction algorithm because it is versatile and can analyze large datasets. Furthermore, it has high prediction accuracy and provides information on which variables are important for classification [10].
In previous research, a variety of machine learning techniques have been applied to data analysis in order to identify patterns and trends. For example, one study compared the performance of random forest and multiple regression models in predicting apartment prices [11], while another used linear regression and random forest regression to forecast ticket prices for public transportation [12]. In addition, decision trees and random forest models were utilized to predict crop prices [13], and machine learning methods were employed to forecast the prices of agricultural products [8] and used cars [14]. A comparison was also conducted on the efficiency of machine learning models for predicting bird's eye chili prices in Nakhon Si Thammarat province [15]. Moreover, deep learning has been applied to forecasting in some cases [16], [17]. However, machine learning models trained on a small dataset to predict vegetable prices may overfit the dataset and might not be efficient. Therefore, we propose to i) use random forest models to forecast vegetable prices in Nakhon Si Thammarat Province and ii) compare the results across crops.

METHOD

Dataset
The Meteorological Station and the Provincial Commercial Office in Nakhon Si Thammarat province provided historical data on the climate and vegetable prices between 2011 and 2020 for this study in comma-separated values (CSV) file format. The dataset consists of 7 attributes, namely month, temperature (degrees Celsius), rainfall (mm), humidity (%), season, average price per month (Baht), and average price per year (Baht). The dataset contains neither missing data nor significant outliers. Table 1 displays the attributes of the dataset and their data types.

Research tools
In this study, we chose to run the experiments with Scikit-learn [18], a comprehensive open-source machine learning package for Python. Scikit-learn covers four major machine learning topics: data transformation, supervised learning, unsupervised learning, and model evaluation and selection. It provides various ready-to-use pre-processing algorithms and machine learning models which can be directly applied to the collected dataset.

Research process
We followed the setup in [19] and divided the dataset into two parts for this study: the training set and the test set. The training set, which contains 84 data points (70%), is used to train the model. The test set, which contains 36 data points (30%), is reserved for measuring the performance of the models. Figure 1 [20] depicts a more detailed overview of how machine learning models are trained and tested.
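The 70/30 split described above can be sketched in plain Python as follows. This is a minimal illustration, not the study's actual procedure: the function name `split_train_test`, the seeded shuffle, and the placeholder records are all assumptions, since the text does not say whether the records were shuffled before splitting.

```python
import random


def split_train_test(records, train_fraction=0.7, seed=42):
    """Split records into a training and a test list (hypothetical helper).

    Assumption: records are shuffled before splitting; the paper does not
    state whether the split was random or chronological.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder for the 84 + 36 = 120 monthly records; not real data.
monthly_records = list(range(120))
train_set, test_set = split_train_test(monthly_records)
print(len(train_set), len(test_set))
```

With 120 records and a training fraction of 0.7, the sketch reproduces the 84/36 partition reported in the text.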

Accuracy measures for forecasting
The performance of the models was measured with three metrics that are commonly used for regression problems: mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) [8], [21]. To formally define the metrics, let $y_i$ and $\hat{y}_i$ be the observed price and the forecasted price of data point $i$, respectively, and let $n$ be the number of data points in the test set.

The MAE determines the average size of the error in a series of forecasts without taking into account their direction. It is the test sample's average of the absolute differences between prediction and actual observation, with all individual deviations given equal weight. It can be formally defined as (1).

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (1)$$

Figure 1. Overview of how the machine learning models are trained and tested [20]

The RMSE is the square root of the average of the squared errors. In other words, it is the square root of the average squared difference between the estimated and actual values. Because the errors are squared, large errors are amplified and have a significantly greater effect on the value of the performance indicator, while the impact of relatively minor errors is reduced. This property of the squared error is sometimes referred to as penalizing large errors or being sensitive to outliers. It is mathematically defined as (2).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \qquad (2)$$

The MAPE is an extension of the MAE that expresses the error as a percentage of the observed value and satisfies the criteria of reliability, ease of interpretation, and clarity of presentation. It is formally defined as (3). Interpretation criteria to evaluate the performance of the predictive model using the MAPE are shown in Table 2 [22].

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (3)$$
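As a worked illustration of the three metrics, the sketch below computes MAE, RMSE, and MAPE in plain Python. The function names and the three price pairs are made up for illustration and are not taken from the study's data.

```python
import math


def mae(actual, predicted):
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(y - y_hat) for y, y_hat in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error: square root of the average squared error."""
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    ratios = sum(abs((y - y_hat) / y) for y, y_hat in zip(actual, predicted))
    return 100.0 * ratios / len(actual)


# Illustrative prices only (hypothetical values, in Baht).
observed = [20.0, 25.0, 40.0]
forecast = [22.0, 24.0, 36.0]

print(mae(observed, forecast))   # (2 + 1 + 4) / 3
print(rmse(observed, forecast))  # sqrt((4 + 1 + 16) / 3)
print(mape(observed, forecast))  # 100 * (0.10 + 0.04 + 0.10) / 3
```

In this toy case the RMSE is larger than the MAE because the single error of 4 Baht is weighted more heavily once squared.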

Random forest model
Random forest is an ensemble machine learning method that combines several tree-based predictors. It is a supervised method that can handle both regression (problems with continuous dependent variables) and classification (problems with categorical dependent variables) tasks. The core concept of the method is to integrate many decision trees to decide the final output rather than depending on individual decision trees, which reduces model variance [23]-[26]. Random forest constructs numerous decision trees by sampling different subsets of the given training data. The tree predictions are then combined, by majority vote for classification or by averaging for regression, to get the final prediction. As a consequence, over-fitting is reduced and predictive accuracy is improved [27]. An overview of how the algorithm works is depicted in Figure 2. The random forest training algorithm is mainly defined as follows.

Algorithm:
Step 1: From the dataset, pick M random records.
Step 2: Based on these M records, build a decision tree.
Step 3a: Choose the number of trees for the forest and repeat steps 1 and 2 for each tree.
Step 3b: In the case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output), and the final prediction is the average of these values.

Figure 2. General structure of a random forest [28]

For each sub-tree, the prediction function f(x) is defined by formulas (4) and (5) [29]:

$$f(x) = \sum_{m=1}^{M} c_m \, \mathbb{1}(x \in R_m) \qquad (4)$$

where, in this formula, $M$ is the number of regions in the feature space, $R_m$ is the region corresponding to $m$, and $c_m$ is the constant corresponding to $m$:

$$\mathbb{1}(x \in R_m) = \begin{cases} 1, & \text{if } x \in R_m \\ 0, & \text{otherwise} \end{cases} \qquad (5)$$

In the case of a classification problem, the final decision is made by the majority vote of all trees.
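The steps above can be sketched in plain Python for the regression case. This is a deliberately simplified illustration under stated assumptions, not the scikit-learn implementation used in the study: each tree is a one-split "stump" on a single feature (so its prediction is a piecewise constant over two regions), records are sampled with replacement, and the names `fit_stump`, `fit_forest`, and `predict_forest` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(rows):
    """Step 2 (simplified): fit a one-split regression tree on (x, y) pairs.

    The returned function is piecewise constant over two regions,
    x <= threshold and x > threshold.
    """
    best = None
    xs = sorted({x for x, _ in rows})
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        c_left, c_right = mean(left), mean(right)
        sse = (sum((y - c_left) ** 2 for y in left)
               + sum((y - c_right) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, c_left, c_right)
    if best is None:  # every sampled x is identical: predict the mean
        constant = mean(y for _, y in rows)
        return lambda x: constant
    _, threshold, c_left, c_right = best
    return lambda x: c_left if x <= threshold else c_right


def fit_forest(rows, n_trees, sample_size, seed=0):
    """Steps 1 to 3a: draw a random sample and build one tree per sample."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in range(sample_size)]  # Step 1
        trees.append(fit_stump(sample))                          # Step 2
    return trees


def predict_forest(trees, x):
    """Step 3b (regression): average the predictions of all trees."""
    return mean(tree(x) for tree in trees)


# Toy data (hypothetical): the target jumps from 10 to 20 at x = 5.
toy_rows = [(x, 10 if x < 5 else 20) for x in range(10)]
toy_forest = fit_forest(toy_rows, n_trees=25, sample_size=10, seed=1)
print(predict_forest(toy_forest, 0), predict_forest(toy_forest, 9))
```

A full random forest also grows deeper trees and considers a random subset of features at each split; those details are omitted here to keep the averaging step visible.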

Results
This study developed a random forest model for predicting vegetable prices in Nakhon Si Thammarat province using scikit-learn (random forest regressor). Six hyper-parameter combinations were investigated, specifically three values for the number of estimators (50, 100, and 150) and two values for the maximum depth (5 and 10). Table 3 displays the model's predicted outcomes.
Setting the number of estimators to 50 and the maximum depth to 10 consistently results in the least error in terms of MAE, RMSE, and MAPE. According to Table 4, the MAPE was less than 10 for pumpkin and eggplant, indicating that the random forest forecast was highly accurate for these crops, while the result for lentils was good.

Table 5 compares the actual and forecasted prices of pumpkin, eggplant, and lentils over a 12-month period, using the least-error model (50 estimators and a maximum depth of 10). Figure 3 shows that the forecasted vegetable prices were nearly identical to the actual prices for pumpkin in Figure 3(a), eggplant in Figure 3(b), and lentils in Figure 3(c).
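A search over the six hyper-parameter combinations could look roughly like the following sketch, which uses scikit-learn's `RandomForestRegressor`, `train_test_split`, and `mean_absolute_error`. The data here are synthetic stand-ins generated with NumPy (the study's CSV files are not reproduced), and the random seeds, the four made-up features, and the choice of MAE as the selection criterion are assumptions for illustration only.

```python
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 120 monthly records with four numeric features.
# These values are NOT the study's data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 4))
y = 30 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=1.0, size=120)

# 70% training / 30% test, as in the paper's setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Evaluate every combination of the three estimator counts and two depths.
results = {}
for n_estimators, max_depth in product([50, 100, 150], [5, 10]):
    model = RandomForestRegressor(
        n_estimators=n_estimators, max_depth=max_depth, random_state=0
    )
    model.fit(X_train, y_train)
    results[(n_estimators, max_depth)] = mean_absolute_error(
        y_test, model.predict(X_test)
    )

best_params = min(results, key=results.get)
print(best_params, results[best_params])
```

On synthetic data the best combination will generally differ from the one reported in the paper; the sketch only shows the mechanics of comparing the six settings on a held-out test set.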

Discussion
In this study, a random forest model was developed to predict vegetable prices in the province of Nakhon Si Thammarat. The results showed that the random forest model was an appropriate model for forecasting crop prices because the forecasted outcomes were quite accurate. The findings are consistent with previous research, which found that random forest makes predictions with low RMSE and performs well with a high R-squared value [14]. Another study showed that random forest was a suitable model for predicting bird's eye chili prices in Nakhon Si Thammarat province [15]. A random forest approach for real-time price forecasting was found to be suitable and to produce consistent results in the New York power market [30]. Furthermore, random forest has been used to predict house prices, with an error margin of 5 between the predicted and actual prices [31].

CONCLUSION
Forecasting vegetable prices is essential for farmers who want to know the price of their crops in advance. In this study, the random forest model was used to forecast vegetable prices. The dataset used in the study included seven attributes.

Kritaphat Songsri-in finished MEng and Ph.D. in computing from Imperial College London in 2011 and 2020, respectively. He has been a lecturer in the department of computer science at Nakhon Si Thammarat Rajabhat University since 2020. His research interests include Machine Learning, Deep Learning, and Computer Vision. He has published in and is a reviewer for multiple international conferences and journals such as IEEE Transactions on Image Processing and IEEE Transactions on Information Forensics & Security. Dr. Songsri-in was a recipient of the Royal Thai Government Scholarship covering his undergraduate and postgraduate degrees in 2010. He received the Best Student Paper Awards at the IEEE 13th International Conference for Automatic Face and Gesture Recognition (FG2018) and the 6th National Science and Technology Conference (NSCIC2021). In 2021, his Ph.D. thesis received an award from the National Research Council of Thailand (NRCT). He can be contacted at kritaphat_son@nstru.ac.th.