Grid search of multilayer perceptron based on the walk-forward validation methodology

Received Jul 8, 2020; Revised Jul 27, 2020; Accepted Oct 15, 2020

The multilayer perceptron neural network is one of the widely used methods for load forecasting. Its hyperparameters determine the network structure and control how the model is trained. This paper proposes a framework for a grid search model based on the walk-forward validation methodology. The training process identifies the optimal models, namely those that minimize the accuracy scores root mean square error, mean absolute percentage error and mean absolute error. The testing process evaluates the optimal models alongside all other models. The results indicate that the accuracy scores of the optimal models are close to the minimum values. The US airline passenger and Ho Chi Minh City load demand data were used to verify the accuracy and reliability of the grid search framework.


INTRODUCTION
Load forecasting is extremely important for electrical systems: it is the key process used to determine the optimal schedule for electricity generation, it is pivotal to ensuring the operational reliability of the power grid, and it plays a leading role in energy trading and investment activities. However, load forecasting is a very challenging task, because the complicated and unpredictable nature of human behaviour makes future electricity demand difficult to predict. Therefore, more reliable models are required to achieve better forecasting results [1][2][3][4][5][6]. Accordingly, several models have been used for load forecasting, including multiple regression, exponential smoothing, stochastic time series, fuzzy logic, neural networks and knowledge-based expert systems [7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. Among these models, the multilayer perceptron (MLP), which belongs to the neural network family, has been widely used. The MLP architecture is similar to that of biological neural networks and consists of one input layer, one or more hidden layers and one output layer [15][16][17][18][19][20][21].
Several hyperparameters of the MLP structure and of MLP training influence the quality of the MLP model, such as the batch size, the number of training epochs, the optimization algorithm, the network weight initialization, the activation function in the hidden layer, and the number of neurons in the hidden layer. Thus, finding good hyperparameters is an important step when using an MLP model for load forecasting [22][23][24][25]. In addition, the common way to apply an MLP model is to split the data into training and test sets, which are used to build the forecast model and to evaluate the accuracy of the forecast values, respectively, with the test set fixed during the forecast operation. Unlike this traditional approach, the walk-forward validation (WFV) methodology allows the model to make the best forecast at each time step [25]. In the present paper, the WFV methodology for the MLP model is proposed and then applied to the grid search model of these hyperparameters. The US airline passenger and Ho Chi Minh City load demand data were analysed through training and testing processes to verify the accuracy and reliability of the grid search framework. The accuracy scores root mean square error (RMSE), mean absolute percentage error (MAPE) and mean absolute error (MAE) were used. In the training process, the optimal models, each with specific hyperparameters, were obtained by minimizing the accuracy scores. In the testing process, these optimal models are compared with the other models to confirm the grid search process. The models are implemented using the Keras library with TensorFlow as the backend in a Python environment on Google Colab, a free cloud service with GPU support for running large-scale machine learning projects [25][26][27].
This paper is organized as follows. Section 2 gives a short introduction to MLP modelling and the WFV methodology and proposes the grid search of the MLP WFV methodology. Section 3 presents the experiments and analyses the results. The conclusions are given in section 4.

RESEARCH METHOD
2.1. MLP neural network
Artificial neural networks in general, and the MLP network in particular, are built on mathematical models of the way brains are thought to work. A typical MLP network consisting of an input layer, one hidden layer and an output layer, and the first node of the hidden layer, are shown in Figures 1(a) and 1(b), respectively [18][19][20][25][26][27]. The output of node i in the hidden layer is given by (1), and the i-th output of the MLP is obtained in the same way by (2), where x_j, j = 1, …, N are the inputs; y_j, j = 1, …, M are the outputs; N is the dimension of the input; M is the dimension of the output; and K is the number of nodes in the hidden layer. w and b are the weight and bias terms of a node, and f is its activation function. The superscripts in the parameters w^(l), b^(l), a^(l), z^(l), f^(l), l = 1, 2, indicate the values for the first layer (hidden layer) and the second layer (output layer).
In the training process of the MLP network, data comprising inputs and outputs are fed into the network to construct an input-output mapping. The weights and biases are adjusted at each iteration by solving an optimization problem that minimizes a loss function, defined in (3) [22,28]. The error function E in (3) is usually defined by the difference between the actual and desired outputs, and is normally chosen to be the mean square error, as given by (4). The regularization term R in (3) introduces a bias to reduce overfitting on the training data. Gradient descent (GD) is an iterative optimization algorithm used to minimize the loss function. There are many gradient descent optimization algorithms, such as SGD, RMSprop and Adagrad.
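Equations (1) to (4) can be written out in a form consistent with the symbol definitions above. This is a sketch under stated assumptions rather than the authors' exact notation: the desired output t_i is a symbol introduced here, and the plain sum E + R in (3) is assumed, since the weighting of the regularization term is not specified in the text.

```latex
\begin{align}
z_i^{(1)} &= \sum_{j=1}^{N} w_{ij}^{(1)} x_j + b_i^{(1)}, \qquad
a_i^{(1)} = f^{(1)}\bigl(z_i^{(1)}\bigr), \qquad i = 1, \dots, K \tag{1} \\
z_i^{(2)} &= \sum_{j=1}^{K} w_{ij}^{(2)} a_j^{(1)} + b_i^{(2)}, \qquad
y_i = a_i^{(2)} = f^{(2)}\bigl(z_i^{(2)}\bigr), \qquad i = 1, \dots, M \tag{2} \\
L &= E + R \tag{3} \\
E &= \frac{1}{M} \sum_{i=1}^{M} \bigl(y_i - t_i\bigr)^2 \tag{4}
\end{align}
```

For a single-output forecasting network (M = 1), the error in (4) is in practice averaged over the training samples instead of over the output dimension.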

2.2. Walk-forward validation methodology
In load forecasting practice, the WFV methodology gives the forecasting model the best opportunity to make good forecasts at each time step. The sequential operation of the WFV methodology is shown in Table 1. First, using the historical data (Months) for training, the forecasting model makes a load forecast for the next month (Month1), which is then stored or evaluated against the known value. Next, the training data are expanded to include the known value (Months + Month1), the forecasting model is updated, and the following month (Month2) is forecast. The process is repeated to the end of the test period, so that the training data are updated with the known values at each step [25].
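The expanding-window loop described above can be sketched in a few lines of Python. The function names, the persistence forecaster and the sample values are illustrative assumptions, not part of the original study; any one-step forecaster, such as a retrained MLP, could take the place of `persistence`.

```python
from typing import Callable, List, Sequence


def walk_forward_validation(
    series: Sequence[float],
    n_test: int,
    forecast_one_step: Callable[[List[float]], float],
) -> List[float]:
    """Return one-step-ahead forecasts for the last n_test observations."""
    if not 0 < n_test < len(series):
        raise ValueError("n_test must be between 1 and len(series) - 1")
    history = list(series[:-n_test])  # initial training data
    test = list(series[-n_test:])     # values revealed one at a time
    forecasts = []
    for actual in test:
        # Forecast the next step from everything known so far.
        forecasts.append(forecast_one_step(history))
        # Append the known value (not the forecast) before the next step.
        history.append(actual)
    return forecasts


def persistence(history: List[float]) -> float:
    """Placeholder forecaster (assumption): repeat the last observed value."""
    return history[-1]


# Illustrative monthly values only.
monthly_values = [112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0]
forecasts = walk_forward_validation(monthly_values, n_test=3,
                                    forecast_one_step=persistence)
print(forecasts)  # [121.0, 135.0, 148.0]
```

Because the known value is appended after every step, each forecast is made from the most complete history available, which is the property that distinguishes WFV from a fixed train/test split.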

2.3. The grid search of MLP WFV methodology
2.3.1. Tuning hyperparameters of MLP
Many aspects of the MLP influence the quality of the network. They can be parameters of the data (such as the features of the input data, differencing of the input data, etc.), parameters of the MLP configuration (such as the number of hidden layers, the number of nodes in each layer, activation functions, etc.), or parameters of the training process (such as the optimization algorithm, batch size, number of epochs, etc.). In our study, we focus on tuning the following hyperparameters [25].
a. Number of inputs (i)
The number of inputs is pre-specified by the available data. For load forecasting, the data are usually considered in the form of a univariate electric load time series with N observations {Xt1, Xt2, · · ·, XtN}. The number of inputs i is the number of lag observations used as input, and it also defines the dimension of the input features. For example, for a monthly load time series, the input and output of an MLP model with a number of inputs i equal to 3 are described in Table 2. The period of an hourly time series is 24, the period of a weekly time series is 7, the period of a yearly time series is 12, and so on.
b. Number of nodes (n)
In most situations, there is no way to determine the best number of hidden nodes in advance: too few nodes may lead to high error, while too many nodes may lead to overfitting. In this study, we use an MLP network with one hidden layer and tune the number of nodes in this hidden layer.
c. Number of batch (b)
The batch size is the number of samples that are propagated through the network at a time. For example, if the data have 1000 samples and the batch size equals 100, the algorithm takes the first 100 samples from the data and trains the network. Next, it takes the second 100 samples and trains the network again. This procedure is repeated until the algorithm has propagated all samples through the network. Using batches requires less memory and speeds up training, because the weights are updated after each propagation.
d. Number of epochs (e)
The number of epochs is the number of times that the learning algorithm works through the entire training dataset. Typically, the learning algorithm runs for many epochs, until the weights and biases change only insignificantly. An epoch comprises one or more batches.
e. Optimizer algorithms (o)
In neural networks, the most common optimization algorithms are based on GD. The Keras library provides several optimization algorithms, including SGD (stochastic gradient descent), RMSProp (root mean square propagation), Adagrad (adaptive gradient algorithm), Adadelta, Adam, Adamax and Nadam.
f. Differencing data (d)
An MLP can handle raw data with little pre-processing such as scaling and differencing. Nevertheless, for time series data, differencing the series can sometimes make the problem easier to model. The differencing of time series data can be the first differencing or the seasonal differencing:
- The first differencing: dy(t) = y(t) - y(t-1)
- The seasonal differencing: dy(t) = y(t) - y(t-d), where d is the seasonal period of the time series data.
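Two of the data-side hyperparameters above, the number of inputs i and the differencing d, can be illustrated with a short Python sketch. The function names and the sample series are assumptions introduced for illustration; the lag-window layout follows the i = 3 example of Table 2, and the differencing follows the two formulas in item f.

```python
from typing import List, Sequence, Tuple


def series_to_supervised(
    series: Sequence[float], n_input: int
) -> Tuple[List[List[float]], List[float]]:
    """Build (lags, target) pairs: n_input consecutive values predict the next one."""
    X, y = [], []
    for t in range(n_input, len(series)):
        X.append(list(series[t - n_input:t]))
        y.append(series[t])
    return X, y


def difference(series: Sequence[float], d: int = 1) -> List[float]:
    """dy(t) = y(t) - y(t-d); d = 1 is first differencing, d = period is seasonal."""
    return [series[t] - series[t - d] for t in range(d, len(series))]


def invert_difference(history: Sequence[float], dy_forecast: float, d: int = 1) -> float:
    """Undo the differencing for a one-step forecast of the value after history[-1]."""
    return dy_forecast + history[-d]


# Illustrative values only.
series = [10, 12, 15, 19, 24, 30]
X, y = series_to_supervised(series, n_input=3)
print(X)  # [[10, 12, 15], [12, 15, 19], [15, 19, 24]]
print(y)  # [19, 24, 30]
print(difference(series, d=1))  # [2, 3, 4, 5, 6]
print(difference(series, d=2))  # [5, 7, 9, 11]
```

Note that differencing with lag d shortens the series by d values, so fewer training rows are available when a long seasonal period is combined with a large number of inputs.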

2.3.2. MLP WFV methodology
The framework of the MLP WFV methodology, which combines the MLP model and the WFV methodology, is shown in Figure 2 and can be expressed by the following steps:
Step 1: First, the data ([y1, y2, …, yn-h, yn-h+1, yn-h+2, …, yn]) are split into history data (y1, y2, …, yn-h) and testing data (yn-h+1, …, yn) using the split1 function. The testing and history data have lengths h and n-h, respectively, where n is the length of the original data.
Step 2: The history data are then passed to the diff function, which differences them according to the differencing order d. The operation of the diff function is shown in Figure 2.
Step 3: Next, the split2 function splits the input data into the input component X_train and the target component Y_train. The first dimension of X_train is the number of lag observations used as input, i, and the second dimension of X_train is the number of observations in the data.
Step 4: The definition, training and prediction procedures of the MLP are performed as shown in Figure 2. The MLP model is built with the Keras library and is defined as a sequential stack of layers. It consists of an input layer with input dimension i and one hidden layer with n nodes, and it is trained with the optimization algorithm o. The number of epochs and the batch size in the training process correspond to e and b. The output of the MLP is the forecast value.
Step 5: The forecast value from the MLP model must be inverted by the idiff function if the data were differenced in Step 2.
Step 6: The known value yn-h+1 is added to the history data. Then, the cycle from Step 2 to Step 5 is repeated until the last forecast value fh has been obtained.
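Steps 1 to 6 can be sketched as follows. The names split1, diff, split2 and idiff come from the text, but their bodies, the `mlp_wfv` driver, the `mean_of_lags` stand-in and the `keras_mlp_fit_predict` factory are assumptions written for illustration; the ReLU activation and the MSE loss in the Keras variant are likewise assumed, since the text does not state them here. X_train is laid out with one row per observation and i columns, which is the layout Keras expects.

```python
from typing import Callable, List, Sequence, Tuple

FitPredict = Callable[[List[List[float]], List[float], List[float]], float]


def split1(data: Sequence[float], h: int) -> Tuple[List[float], List[float]]:
    """Step 1: the first n-h values are history, the last h values are test data."""
    return list(data[:-h]), list(data[-h:])


def diff(history: Sequence[float], d: int) -> List[float]:
    """Step 2: difference with lag d; d = 0 leaves the data unchanged."""
    if d == 0:
        return list(history)
    return [history[t] - history[t - d] for t in range(d, len(history))]


def split2(data: Sequence[float], i: int) -> Tuple[List[List[float]], List[float]]:
    """Step 3: rows of i lagged values (X_train) and the value that follows (Y_train)."""
    X = [list(data[t - i:t]) for t in range(i, len(data))]
    y = [data[t] for t in range(i, len(data))]
    return X, y


def idiff(history: Sequence[float], forecast: float, d: int) -> float:
    """Step 5: invert the differencing of Step 2 for a one-step forecast."""
    return forecast if d == 0 else forecast + history[-d]


def mlp_wfv(data: Sequence[float], h: int, i: int, d: int,
            fit_predict: FitPredict) -> List[float]:
    """Run Steps 1 to 6 and return the h walk-forward forecasts."""
    history, test = split1(data, h)
    forecasts = []
    for actual in test:
        work = diff(history, d)
        X_train, y_train = split2(work, i)
        x_last = work[-i:]                            # most recent i lags
        yhat = fit_predict(X_train, y_train, x_last)  # Step 4
        forecasts.append(idiff(history, yhat, d))
        history.append(actual)                        # Step 6
    return forecasts


def mean_of_lags(X_train, y_train, x_last) -> float:
    """Stand-in for Step 4 (assumption): average of the latest lags, no training."""
    return sum(x_last) / len(x_last)


def keras_mlp_fit_predict(n: int, e: int, b: int, o: str = "adam") -> FitPredict:
    """Return a Step 4 callable backed by a one-hidden-layer Keras MLP."""
    def fit_predict(X_train, y_train, x_last):
        # Imported lazily so the rest of the sketch runs without TensorFlow.
        import numpy as np
        from tensorflow import keras
        model = keras.Sequential([
            keras.Input(shape=(len(x_last),)),
            keras.layers.Dense(n, activation="relu"),  # assumed activation
            keras.layers.Dense(1),
        ])
        model.compile(loss="mse", optimizer=o)         # assumed loss
        model.fit(np.asarray(X_train), np.asarray(y_train),
                  epochs=e, batch_size=b, verbose=0)
        return float(model.predict(np.asarray([x_last]), verbose=0)[0, 0])
    return fit_predict


data = [float(v) for v in range(1, 13)]  # illustrative linear trend
forecasts = mlp_wfv(data, h=3, i=2, d=1, fit_predict=mean_of_lags)
print(forecasts)  # [10.0, 11.0, 12.0]
```

To use the Keras variant, pass for example `fit_predict=keras_mlp_fit_predict(n=50, e=100, b=10)`; these values are placeholders. The model is rebuilt and retrained at every walk-forward step, so the run time grows with h.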

2.3.3. The grid search model of MLP WFV methodology
Based on the tuning hyperparameters and the model of the MLP WFV methodology, the grid search model of the MLP WFV methodology was established as shown in Figure 3. The training and testing processes use the same combinations of tuning hyperparameters cfg and the same number of testing data h. For the training data, we obtain the optimal models, which achieve the minimum accuracy scores (RMSEmin, MAPEmin, MAEmin). In the testing process, these optimal models are compared with all other models according to their accuracy scores in order to evaluate the reliability of the grid search model of the MLP WFV methodology.
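The selection logic of the grid search can be sketched as below: every combination cfg is scored with walk-forward validation, and the configuration with the minimum score is picked separately for RMSE, MAPE and MAE. This is a minimal illustration under stated assumptions: the grid values, the synthetic series and the averaging forecaster that stands in for the MLP are all hypothetical, and only a single scoring pass is shown instead of separate training and testing processes.

```python
from itertools import product
from math import sqrt
from typing import Callable, Dict, List, Sequence, Tuple

Config = Tuple[int, ...]
Scores = Dict[str, float]


def rmse(actual: Sequence[float], forecast: Sequence[float]) -> float:
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    # Percentage error; assumes no actual value is zero.
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mae(actual: Sequence[float], forecast: Sequence[float]) -> float:
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def make_evaluator(series: Sequence[float], n_test: int) -> Callable[[Config], Scores]:
    """Score a configuration (i, d) by walk-forward validation."""
    def evaluate(cfg: Config) -> Scores:
        i, d = cfg
        history, test = list(series[:-n_test]), list(series[-n_test:])
        forecasts: List[float] = []
        for actual in test:
            work = history if d == 0 else [
                history[t] - history[t - d] for t in range(d, len(history))]
            yhat = sum(work[-i:]) / i  # stand-in for the MLP forecast
            forecasts.append(yhat if d == 0 else yhat + history[-d])
            history.append(actual)
        return {"rmse": rmse(test, forecasts),
                "mape": mape(test, forecasts),
                "mae": mae(test, forecasts)}
    return evaluate


def grid_search(evaluate: Callable[[Config], Scores],
                grid: Dict[str, Sequence[int]]):
    """Score every combination and return all scores plus the best cfg per metric."""
    scores = {cfg: evaluate(cfg) for cfg in product(*grid.values())}
    best = {metric: min(scores, key=lambda cfg: scores[cfg][metric])
            for metric in ("rmse", "mape", "mae")}
    return scores, best


# Synthetic series with period 4 and a linear trend (illustrative only).
pattern = [10.0, 20.0, 30.0, 20.0]
series = [pattern[t % 4] + t for t in range(16)]
scores, best = grid_search(make_evaluator(series, n_test=4),
                           {"i": [1, 2], "d": [0, 1, 4]})
print(len(scores), best)
```

On this synthetic series, seasonal differencing with d = 4 removes both the pattern and the trend, so the configurations with d = 4 reach zero error; ties are resolved in favour of the first configuration in grid order. The three metrics need not agree on real data, in which case `best` holds a different configuration per metric.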

RESULTS AND DISCUSSIONS
3.1. Data description
In order to enhance the reliability of the experimental results, the US airline passenger data and the electricity demand data of Ho Chi Minh City, Vietnam were studied in our experiments. The US airline passenger dataset provides monthly totals, in thousands, of US airline passengers from 1949 to 1960 and has been used in many studies [34]. The Ho Chi Minh City load demand dataset provides the daily maximum load demand in MW for three months from October 2018 to January 2019. The US airline passenger and Ho Chi Minh City load demand data are presented in Figure 4.
The tuning hyperparameters are listed in Table 3. Because of the monthly seasonality of the US airline passenger data, there is a value of 12 for the number of testing data h, the number of inputs i and the differencing d. Likewise, there is a value of 7 for the Ho Chi Minh City load demand data because of its weekly seasonality. The combination of all tuning hyperparameters gives 224 cases, corresponding to 224 possible models of the MLP WFV methodology.
For the US airline passenger data, Table 4 shows the results of the training and testing processes. The minimum of the accuracy scores RMSE, MAPE and MAE gives the same optimal model in the training process. For the testing process, the column 'Optimal' shows the accuracy scores of the optimal model, and the columns 'Min', 'Average' and 'Max' show the minimum, average and maximum values over the 224 possible models. Figure 5(a) shows the forecast and testing series of the optimal model, while Figures 5(b) and 5(c) show the forecast and testing series of the models with the minimum and maximum accuracy scores in the testing process. Figure 6 shows the distribution of the accuracy scores in the testing process.
For the Ho Chi Minh City load demand data, Table 5 shows the results of the training and testing processes. Here, the minimum of RMSE, MAPE and MAE gives different optimal models in the training process. For the testing process, the column 'Optimal' shows the accuracy scores of the optimal model, while the columns 'Min', 'Average' and 'Max' show the minimum, average and maximum values over the 224 possible models. Figure 7(a) shows the forecast and testing series of the optimal model for RMSE, while Figures 7(b) and 7(c) show the forecast and testing series of the models with the minimum and maximum accuracy scores in the testing process. Figure 8 shows the distribution of the accuracy scores in the testing process. Figure 8(a) presents the box plot of the RMSE component, with the first column for the distribution over all 224 possible models and the second column for the optimal model. The other accuracy scores are plotted in the same way in Figures 8(b) and 8(c).
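The 'Optimal', 'Min', 'Average' and 'Max' columns described above can be derived from the per-model test scores as in the following sketch. The function name, the configuration labels and the score values are hypothetical and are not results from Tables 4 or 5; the added rank is an extra convenience, not a column of the original tables.

```python
from statistics import mean
from typing import Dict, Hashable


def summarise_scores(test_scores: Dict[Hashable, float],
                     optimal_cfg: Hashable) -> Dict[str, float]:
    """Summarise one accuracy score over all models and locate the optimal model."""
    values = list(test_scores.values())
    optimal = test_scores[optimal_cfg]
    return {
        "optimal": optimal,
        "min": min(values),
        "average": mean(values),
        "max": max(values),
        # Rank 1 means that no other model scored lower in testing.
        "rank": 1 + sum(v < optimal for v in values),
    }


# Hypothetical RMSE values for four configurations.
rmse_by_cfg = {"A": 20.0, "B": 18.0, "C": 35.0, "D": 27.0}
summary = summarise_scores(rmse_by_cfg, optimal_cfg="A")
print(summary)
```

A rank close to 1 together with an 'Optimal' value well below 'Average' corresponds to the situation shown in the box plots, where the optimal model sits near the lower end of the distribution.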

Evaluation
As described above, the optimal models in the training process were determined by minimizing the accuracy scores RMSE, MAPE and MAE. The results listed in Tables 4 and 5 indicate the existence of an optimal model that satisfies the criterion of minimum RMSE, MAPE and MAE. For the US airline passenger data, a unique model minimizes all three of RMSE, MAPE and MAE. For the Ho Chi Minh City load demand data, three different optimal models were found, minimizing RMSE, MAPE and MAE, respectively.
The optimal model obtained in the training process is not guaranteed to give the best results in the testing process. Consider Table 4 for the US airline passenger data. When the optimal model is used in the testing process, the accuracy scores of RMSE, MAPE […] 13.22%, 57.92 k, respectively), the accuracy scores of the optimal model are very small. In addition, the box plots of the accuracy scores in Figure 6 clearly indicate that the accuracy scores of the optimal model are nearly the same as the minimum values over all other models. Moreover, the forecast values shown in Figure 5(a) are consistent with the testing values for the optimal model. The same results are obtained for the Ho Chi Minh City load demand data. These results show that the optimal model obtained in the training process by applying the grid search model of the MLP WFV methodology gives good values in the testing process.
Note that the characteristics of the time series for the US airline passenger data and the Ho Chi Minh City load demand data differ from each other. The time axis and the value axis of the US airline passenger data are in months and thousands of passengers, respectively, whereas days and MW are used for the Ho Chi Minh City load demand data. In addition, the seasonal period is 12 for the monthly US airline passenger data and 7 for the daily Ho Chi Minh City load demand data. The results in both cases showed good values, which is promising for applying the grid search model of the MLP WFV methodology to other time series.

CONCLUSION
Based on the MLP structure and the walk-forward validation methodology, a framework for the grid search model of the MLP WFV methodology was proposed. In the training process, the minimum of the accuracy scores RMSE, MAPE and MAE was used to identify the optimal models. In the testing process, the accuracy scores were used to compare the optimal model with all other models. Both the US airline passenger and Ho Chi Minh City load demand data were used for the analysis. The results indicated the existence of an optimal model that satisfies the requirement of minimum accuracy scores. In the testing process, the accuracy scores of the optimal model were good, close to the minima and much smaller than the average values. In addition, the max values were comparable to those of all other models. In this regard, the positive results obtained in this study suggest an effective approach to load demand forecasting.