Comparative analysis of short-term demand forecasting models using ARIMA and deep learning

Received Sep 9, 2020; Revised Dec 11, 2020; Accepted Jan 13, 2021

Forecasting consists of taking historical data as input and using it to predict future observations, thus determining future trends. Demand prediction is a crucial component of the supply chain process that allows each member to enhance its performance and profit. Nevertheless, because of demand uncertainty, supply chains usually suffer from problems such as the bullwhip effect. As a solution to these logistics issues, this paper presents a comparative analysis of four time series demand forecasting models: the autoregressive integrated moving average (ARIMA), a statistical model; the multi-layer perceptron (MLP), a feedforward neural network; the long short-term memory network (LSTM), a recurrent neural network; and the convolutional neural network (CNN or ConvNet), a deep learning model. The experiments are carried out using a real-life dataset provided by a supermarket in Morocco. The results clearly show that the convolutional neural network gives slightly better forecasting results than the long short-term memory network.


INTRODUCTION
A supply chain is an entire network of distributed, interacting organizations (suppliers, retailers, distributors, and clients) involved in the processes of purchasing, inventory management, production, distribution and delivery of a product or a service, from raw materials to the final customer [1]. Supply chain management covers information flows, physical distribution, and financial transactions. In other words, all the supply chain members share a common objective: improving their businesses and profits while creating value represented by the product or service delivered to the customer [2].
The problem considered here is demand forecasting, a crucial component of the supply chain process. Indeed, prediction consists of building models from historical data in order to use them to predict future observations. Nevertheless, the uncertain character of customer demand causes difficulties for decision makers, who must react in an efficient and fast manner. Besides, with the advancement of technology, supply chains become more complex. So, in order to keep up with this evolution and achieve an effective supply chain, companies must be able to plan not only present but also future sales or demand. Certainly, sales forecasting benefits supply chain planning activities including purchasing, inventory, production and distribution; it helps managers make optimal decisions. As background in the literature, the autoregressive integrated moving average (ARIMA) is a linear time series forecasting method that has been widely used to model problems in many areas. Some applications of ARIMA are as follows: Fanoodi et al. [3] use ARIMA to reduce demand uncertainty in a blood platelet supply chain. Ji et al. [4] use a hybrid ARIMA and deep neural network model to forecast future carbon prices. Matyjaszek et al. [5] implement an ARIMA model to forecast coking coal prices in the mining industry. Ohyver and Pudjihastuti [6] use the ARIMA model to forecast rice prices. These are only a few examples of the application of ARIMA, which remains widely employed in multiple domains to make accurate forecasts. ARIMA is also used for demand forecasting. For instance, Oliveira et al. [7] propose a double seasonal ARIMA model to make accurate short-term water demand forecasts. Al-Musaylh et al. [8] apply ARIMA, MARS, and SVR to short-term electricity demand forecasting. Anggraeni et al. [9] compare ARIMA and ARIMAX results for kids' clothes demand forecasting. Amini et al. [10] apply ARIMA to an electric vehicle charging demand time series forecasting problem. And Wang et al. [11] implement a modification approach to seasonal ARIMA for electricity demand forecasting.
Neural networks, on the other hand, are used for nonlinear time series modeling. Some applications in demand forecasting include the following. In [12], neural networks are applied to demand forecasting of an engine oil. Sharma et al. [13] apply neural networks to predict energy demand, carbon dioxide emissions and wind generation. Matino et al. [14] implement neural network-based models for a blast furnace gas production and demand prediction problem. And Silva et al. [15] use denoised neural networks to forecast tourism demand. In [16], a convolutional neural network is applied to the problem of predicting the short-term supply-demand gap of ride-sourcing services, where three hexagon-based convolutional neural networks (H-CNN) are proposed. Ke et al. [17] propose a hybrid approach, the fusion convolutional long short-term memory network, in which multiple convolutional long short-term memory (LSTM) layers, standard LSTM layers, and convolutional layers are fused to forecast passenger demand on an on-demand ride service platform. Amarasinghe et al. [18] compare convolutional neural networks for energy load forecasting based on the historical loads of a single residential customer against long short-term memory sequence-to-sequence models (LSTM S2S), factored restricted Boltzmann machines (FCRBM), artificial neural networks (ANN) and support vector machines (SVM). The obtained results demonstrate that the CNN outperformed the SVM while producing comparable results to the other neural network and deep learning models.
The present paper develops the proposed solution, which is using time series forecasting models and deep learning models for demand prediction. In addition, the convolutional neural network has rarely been used for this purpose in the literature; using it in comparison with other models constitutes the main contribution of the paper. The novelty presented in this paper is the performance evaluation and comparative study using a real dataset provided by a supermarket in Morocco. The predictive performances of four prediction methods are presented: the autoregressive integrated moving average (ARIMA), a statistical model; the multilayer perceptron (MLP), a feedforward neural network [19, 20]; the convolutional neural network (CNN or ConvNet), a deep learning model; and the long short-term memory model (LSTM), a recurrent neural network. These models were chosen in order to compare the different types of neural networks, mainly the feedforward, the convolutional and the recurrent networks.
The remainder of this paper is organized as follows. The second part presents an overview of the proposed prediction methods based on time series techniques. The third section is dedicated to explaining the adopted methodology and illustrating the obtained forecasting results in a comparative analysis based on numerical experiments using a real dataset from a local supermarket. Finally, concluding remarks and future work are presented.

PROPOSED PREDICTION METHODS
Time series can be defined as a series of data points recorded and analyzed in time order; it is a sequence taken at equally spaced time periods [21]. The only independent variable in time series methods is time. Time series analysis is the process of extracting meaningful information and statistics from the data, while time series forecasting is the process of estimating future values of the data based on previously observed data plus some other features [22]. Contrary to supervised model forecasting, where data is predicted using multiple features, in time series forecasting a variable is predicted using past observations of the same variable. Figure 1 shows different methods used in time series forecasting. Forecasting techniques are categorized into two families:
- Parametric methods, based on mathematical tools and statistical analysis of historical data, such as linear and nonlinear regression and the autoregressive integrated moving average. Nevertheless, those methods are complicated to use.

- Non-parametric methods, based on machine learning, with the ability to learn and approximate any nonlinear function. Those techniques are mostly based on artificial intelligence, such as artificial neural networks, which offer flexible parameters during the learning and implementation phases.

Autoregressive integrated moving average (ARIMA)
The ARIMA methodology was originally developed by Box and Jenkins in the 1970s. A model is denoted ARIMA(p,d,q) and is composed of:
- An autoregressive model of order p (number of lags).
- Integrated, referring to the step of removing non-stationarity by differencing the data; d is the degree of differencing.
- A moving average model of order q.

Autoregressive model
The autoregressive model AR(p) of order p states that the output variable depends on its own previous values with some lag, plus a random term. The autoregressive model AR(p) is defined as:

y_t = c + φ_1 y_{t-1} + φ_2 y_{t-2} + ... + φ_p y_{t-p} + ε_t

where c is a constant, the φ_i are the model coefficients and ε_t is a white noise error term.
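As a toy numeric check (the coefficients below are hypothetical, not fitted values), an AR(2) one-step prediction is simply a weighted sum of the last two observations:

```python
import numpy as np

# AR(2): y_t = c + phi1 * y_{t-1} + phi2 * y_{t-2} + e_t
c = 10.0                      # constant term (hypothetical)
phi = np.array([0.5, 0.2])    # hypothetical AR coefficients
history = [100.0, 110.0]      # y_{t-2}, y_{t-1}

# Expected next value: the error term has zero mean, so it is dropped
y_hat = c + phi[0] * history[-1] + phi[1] * history[-2]
# y_hat = 10 + 0.5*110 + 0.2*100 = 85.0
```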

Differencing
A time series is called non-stationary when the values of the data are dependent on time. A non-stationary time series can be made stationary by differencing it. Differencing refers to subtracting past values of the data a number of times; the degree of differencing d refers to that number of times. Generally, a series might need to be differenced d times to attain stationarity [23]. A differencing of degree 1 is:

y'_t = y_t - y_{t-1}
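First-order differencing is a one-liner with NumPy (the series below is illustrative, not the paper's data):

```python
import numpy as np

demand = np.array([100.0, 104.0, 103.0, 110.0, 115.0])

# First difference: y'_t = y_t - y_{t-1}; a trending series becomes
# a series of changes, which is often closer to stationary
diff1 = np.diff(demand)        # [4., -1., 7., 5.]

# Differencing of degree d simply repeats the operation d times
diff2 = np.diff(demand, n=2)   # [-5., 8., -2.]
```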

Moving average model
To deal with non-stationarity, we characterize the time series as the sum of a non-constant mean value plus a random error variable [23]:

y_t = μ_t + ε_t

In smoothing methods, a variable is a function of some past observations, which means that the future value of the time series is the weighted average of some past observations. As such, the moving average MA(q) of order q can be written as the weighted average of the past q errors:

y_t = μ + ε_t + θ_1 ε_{t-1} + θ_2 ε_{t-2} + ... + θ_q ε_{t-q}

where μ is the mean of the series, the θ_i are the weights and the ε_t are the error terms.
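As a toy illustration (the mean, weights and error values below are hypothetical, not estimated from data), an MA(2) value is just the mean plus a weighted combination of recent errors:

```python
import numpy as np

# MA(2): y_t = mu + e_t + theta1 * e_{t-1} + theta2 * e_{t-2}
mu = 1000.0                    # series mean (hypothetical)
theta = np.array([0.6, 0.3])   # hypothetical MA weights
errors = [5.0, -10.0, 8.0]     # e_{t-2}, e_{t-1}, e_t

y_t = mu + errors[2] + theta[0] * errors[1] + theta[1] * errors[0]
# y_t = 1000 + 8 + 0.6*(-10) + 0.3*5 = 1003.5
```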

Autoregressive integrated moving average
ARIMA is a mixed model that combines the differenced autoregressive and moving average models. The final form of the model, in which the series, differenced to degree d, depends on its own p past values and on the q past values of white noise error terms [21], is:

y'_t = c + φ_1 y'_{t-1} + ... + φ_p y'_{t-p} + θ_1 ε_{t-1} + ... + θ_q ε_{t-q} + ε_t

where y'_t is the series differenced d times. This represents the ARIMA(p,d,q) model.

Machine learning based methods
Artificial neural networks are popular machine learning algorithms that simulate how the human brain behaves and learns. Neural refers to neurons, the cells of the human nervous system. Neurons are connected to one another by dendrites and axons, while synapses are the regions connecting these axons [24].
Artificial neural networks (ANN) are an artificial intelligence method inspired by the functioning of biological neural networks, characterized by a circuit of interconnected neurons organized in multiple interconnected layers: an input layer, one or more hidden layers and an output layer. ANNs are very powerful in that they can identify complex linear and nonlinear relationships between inputs and outputs. Neural networks are used in various fields, from classification to forecasting, approximation, diagnosis, image processing and recognition.
There are two types of neural networks:
- Feedforward neural networks, also called non-recurrent neural networks, where information flows in only one direction, from the input to the output layer through the hidden layer or layers. They have no ability to retain historical inputs, as they only consider the current inputs.
- Recurrent neural networks, where information cycles in a loop in both directions, from the input layer to the output layer and back. Such networks allow the persistence of information by taking into account the current input as well as previously received inputs.

Multi-layer perceptron (MLP)
This specific architecture of multi-layer neural networks is called a feed-forward neural network, because multiple "successive layers feed into one another in the forward direction from input to output" [24]. The multi-layer perceptron network therefore generates an output from given inputs. It consists of multiple layers of nodes (at least three): one input layer, some hidden layers, and one output layer. Every node except the input ones is called a neuron. Nonlinear activation functions are used to model the non-linearity of a given problem. The computation happens in the hidden layers.
First, the input layer sends the input data to the nodes in the hidden layer. These nodes combine the data with a set of coefficients, or weights, that either amplify or dampen each input, and the resulting products are summed. Finally, an activation function is applied to the sum. The activation function determines whether and how much the signal progresses through the network to affect the final result.
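The weighted-sum-plus-activation step above can be sketched in a few lines of Python (a minimal illustration with hypothetical weights, not the paper's implementation):

```python
import numpy as np

def relu(z):
    # ReLU activation: passes positive signals, zeroes out negative ones
    return np.maximum(0.0, z)

def hidden_layer(x, W, b):
    # Each hidden node: weighted sum of the inputs plus a bias, then activation
    return relu(W @ x + b)

# Hypothetical layer with 2 nodes and 3 inputs
x = np.array([1.0, 2.0, 3.0])
W = np.array([[0.5, -0.2, 0.1],
              [-1.0, 0.3, 0.2]])
b = np.array([0.1, -0.4])
out = hidden_layer(x, W, b)   # [0.5, 0.0]: the second node's sum is negative
```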

Long short-term memory (LSTM)
The LSTM network was introduced by Hochreiter and Schmidhuber in 1997 [25]. It is a recurrent neural network, specifically a gated network used in deep learning, meaning it has feedback connections, unlike feedforward neural networks. It can be used for classification and forecasting of time series and is especially effective at solving sequential problems like speech and handwriting recognition.
LSTMs are able to learn long-term dependencies; in other words, they excel at remembering information for long periods of time. An LSTM unit is composed of a cell (that has a self-loop), an input gate, an output gate and a forget gate. The cell remembers values over intervals of time while the three gates control the flow of information into and out of it. Each gate takes the form:

f_t = σ(W x_t + U h_{t-1} + b)

where σ is the sigmoid function, x_t is the current input, h_{t-1} is the previous output and W, U and b are the gate's weights and bias. The cell receives as input the current input x_t and the previous cell output h_{t-1}, and it outputs h_t.
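For reference, the full set of LSTM update equations in the standard formulation (not reproduced from the paper; subscripts f, i, o, c denote the forget, input and output gates and the cell candidate) can be written as:

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here ⊙ denotes element-wise multiplication; the cell state c_t carries long-term information, while h_t is the unit's output.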

Convolutional neural networks (CNN or ConvNet)
A convolutional neural network is a specific type of artificial neural network usually used for cognitive tasks such as image recognition, image processing and natural language processing, as well as for time series data. The CNN is the most popular deep learning model for image processing; it is simpler to train and more effective than traditional neural networks, since it can capture the temporal and spatial dependencies in an image by applying the appropriate filters, with independence from human effort. Indeed, the layers in a CNN architecture are arranged to cover the entire visual field, which helps avert the piecemeal image-processing problem of traditional neural networks.
Similar to the multilayer perceptron (MLP) structure of neural networks, the layers in a CNN are categorized into three types: an input layer, an output layer and hidden layers. There are two types of hidden layers [27]:
- Feature learning layers, performing three types of operations on the input data: convolution, pooling, and rectified linear unit (ReLU) activation;
- Classification layers, composed of fully connected layers and normalization layers.
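The three feature-learning operations can be reproduced on a one-dimensional input with plain NumPy (an illustrative sketch with a hypothetical filter, not the paper's code):

```python
import numpy as np

def conv1d(x, kernel):
    # Valid (no padding) 1-D convolution: slide the kernel over the input
    n = len(x) - len(kernel) + 1
    return np.array([np.dot(x[i:i + len(kernel)], kernel) for i in range(n)])

def relu(z):
    # ReLU: keep positive activations, zero out the rest
    return np.maximum(0.0, z)

def max_pool1d(x, size=2):
    # Non-overlapping max pooling: keep the largest value in each window
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

x = np.array([1.0, 2.0, -1.0, 3.0, 0.0, 2.0])
k = np.array([1.0, -1.0])          # hypothetical learned filter
features = relu(conv1d(x, k))      # convolution then ReLU: [0., 3., 0., 3., 0.]
pooled = max_pool1d(features, 2)   # pooling halves the feature map: [3., 3.]
```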

Autoregressive integrated moving average (ARIMA)
The proposed approach
The ARIMA(p,d,q) model is determined by choosing the three parameters p, d, and q. The plot of the data determines the order of differencing d, the ACF plot determines the lag q, and the PACF plot the lag p. Besides, the model is improved by using a grid search to choose the best p, d, q parameters. The models are evaluated with the RMSE (root mean squared error), the most frequently used metric of prediction performance, calculated as the square root of the mean squared error (MSE), where the MSE is the arithmetic mean of the squared differences between forecasts and observations.
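The evaluation metric can be computed in a few lines (the values below are illustrative, not the paper's data):

```python
import numpy as np

def rmse(actual, predicted):
    # Root mean squared error: square root of the mean squared difference
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

score = rmse([100, 120, 130], [100, 120, 136])  # sqrt((0 + 0 + 36) / 3)
```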

The ACF and PACF plots
In Figure 3, demand is presented as time series data with the day on the x-axis and the demand quantity on the y-axis. We notice that the time series might be stationary and so might not require differencing but, to be sure, at least a difference of order 1 will be applied. Next, let us look at the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots of the time series. We initially take the order of the AR term to be equal to the number of lags that cross the significance limit in the PACF plot. This suggests that a good start for the autoregressive portion of the model is AR(2), an autoregression model with two lag observations used as input. In the same way that we looked at the PACF plot for the order of the AR part of the model, we can look at the ACF plot for the number of MA terms. Only one lag is above the significance line, so let us fix the order as MA(1).
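For illustration, the sample autocorrelation at lag k behind the ACF plot can be computed directly (a minimal sketch on a stand-in series; in practice statsmodels' plot_acf and plot_pacf produce the plots):

```python
import numpy as np

def acf(series, nlags):
    # Sample autocorrelation: covariance at lag k divided by the variance
    y = np.asarray(series, dtype=float)
    y = y - y.mean()
    denom = np.dot(y, y)
    return np.array([np.dot(y[:len(y) - k], y[k:]) / denom for k in range(nlags + 1)])

demand = np.array([120.0, 135.0, 128.0, 150.0, 145.0, 160.0, 155.0, 170.0])
r = acf(demand, nlags=3)
# r[0] is always 1; lags whose values exceed the significance band
# suggest candidate AR/MA orders
```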

Fitting the ARIMA(2,1,1) model
Now that we have determined the values of p, d and q, we can fit the ARIMA model. Using the ARIMA(2,1,1) implementation of the statsmodels package in Python, we fit the model and obtain the following results. Figures 5 and 6 show that the residual errors have a near-zero mean with uniform variance. Next, we calculate the root mean squared error (RMSE) and plot the actual vs. predicted values. The fitted ARIMA(2,1,1) model results in an RMSE of 823.945.

Grid searching parameters
The process of training and evaluating ARIMA models on different combinations of model hyperparameters can be automated. This is called a grid search, or model tuning, in machine learning. We split the dataset in two: 66% for the training set and 34% for the test set. We chose the p values to range from 1 to 10, the d values from 0 to 3 and the q values from 0 to 3. The ARIMA model that yielded the best RMSE is ARIMA(6, 0, 0), with an RMSE of 485.690. This is shown in Figure 8. We notice that the lower the p value, the higher the RMSE.

Multi-layer perceptron (MLP)
The multi-layer perceptron neural network is used to forecast the demand quantity in a simple supply chain composed of one retailer and one supplier. The data belongs to a Moroccan supermarket dataset. Predicting the demand of a given day d involves the demand quantities of the three previous weeks (d-7, d-14 and d-21).
The neural network is composed of three hidden layers, each containing 10 neurons. It is trained using the Adam version of gradient descent, the MSE loss function and the ReLU activation function. This results in an RMSE of 0.140 on the test set after normalization, and 464.261 before normalization.
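A sketch of this setup with scikit-learn's MLPRegressor (the paper does not name its framework, and a synthetic weekly-seasonal series replaces the private supermarket data):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic daily demand with weekly seasonality as a stand-in
rng = np.random.default_rng(2)
days = np.arange(200)
demand = 1000 + 200 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 30, 200)

# Inputs: demand at d-7, d-14 and d-21; target: demand at day d
X = np.column_stack([demand[14:-7], demand[7:-14], demand[:-21]])
y = demand[21:]

scaler = MinMaxScaler()
Xs = scaler.fit_transform(X)

# Three hidden layers of 10 neurons, ReLU activation, Adam optimizer
mlp = MLPRegressor(hidden_layer_sizes=(10, 10, 10), activation="relu",
                   solver="adam", max_iter=500, random_state=0)
mlp.fit(Xs, y)
pred = mlp.predict(Xs)
```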

Long short-term memory (LSTM)
In a previous study [26], long short-term memory is used to forecast demand. In addition to the inputs proposed in the MLP model, the LSTM has a time dimension. Its input has a three-dimensional shape composed of the number of examples, the number of inputs, and the number of time steps [28].
In the proposed neural network, the demand of a given day d is predicted using the demand quantities of days d-6, d-12, and d-18. We therefore model the problem as a three-time-step prediction on one input, corresponding to the quantity of the day, for each output. This is instead of considering that the data is composed of three inputs, as we did in the MLP study.
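Shaping the series into the (samples, time steps, features) tensor a recurrent network expects can be sketched as follows (illustrative, with a synthetic stand-in series and a hypothetical `make_windows` helper):

```python
import numpy as np

def make_windows(series, lags=(18, 12, 6)):
    # For each day d, the inputs are the demands at d-18, d-12 and d-6,
    # ordered oldest to newest, and the target is the demand at day d
    max_lag = max(lags)
    X = np.column_stack([series[max_lag - lag:len(series) - lag] for lag in lags])
    y = series[max_lag:]
    # Reshape to (samples, time steps, features): three steps, one feature
    return X.reshape(-1, len(lags), 1), y

demand = np.arange(100, 200, dtype=float)  # stand-in series of 100 days
X, y = make_windows(demand)
# X.shape == (82, 3, 1); the first sample is [100, 106, 112] with target 118
```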
The LSTM network is composed of 50 neurons in the hidden layer and 1 neuron in the output layer. The root mean squared error loss and the efficient Adam version of stochastic gradient descent are used. This yielded an RMSE of 0.138 after normalization and 457.958 before normalization.

Convolutional neural network (CNN or ConvNet)
The convolutional neural network (CNN or ConvNet) is a special type of neural network model usually used for working with two-dimensional and three-dimensional data like images, but it can also be applied to one-dimensional data like time series. The proposed convolutional neural network is composed of the following layers:
- A one-dimensional convolutional layer with 64 filters, a kernel size of 2 and an input shape of (3,1); the activation function is the ReLU function;
- A one-dimensional max pooling layer with a filter of size 2;
- A flatten layer;
- A fully connected layer composed of 50 neurons with ReLU as the activation function.
We train the model using the Adam version of stochastic gradient descent and the MSE loss function. Training for 200 epochs results in an RMSE of 0.138 after normalization and 457.079 before normalization.
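Assuming a Keras implementation (the paper does not name its framework), the layer stack described above can be sketched as:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

# Architecture as described: Conv1D -> MaxPooling1D -> Flatten -> Dense
model = Sequential([
    Conv1D(filters=64, kernel_size=2, activation="relu", input_shape=(3, 1)),
    MaxPooling1D(pool_size=2),
    Flatten(),
    Dense(50, activation="relu"),
    Dense(1),  # single output: the predicted demand quantity
])
model.compile(optimizer="adam", loss="mse")
```

Training would then call `model.fit` on the windowed data for 200 epochs, as reported in the text.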

Discussion
In machine learning, prediction accuracy is used for discrete problems, i.e., classification. Since we are dealing with a regression problem, we evaluate the models using the root mean squared error (RMSE), in addition to some other criteria. The RMSE measures how close the predicted values are to the real ones, hence its use.
As shown in Table 1, the worst prediction results are obtained with the ARIMA model, which is understandable since it is a linear model. The proposed neural network models, on the other hand, are nonlinear models that can approximate any function, hence their better results. The CNN shows the best RMSE, slightly better than the LSTM model. This might be thanks to its ability to perform feature selection on its own.