A comparison of activation functions in multilayer neural networks for predicting the production and consumption of electric power

ABSTRACT


INTRODUCTION
Electrical power is produced from different sources, some renewable and others non-renewable. Nowadays, consumers can become producers if they own power resources of their own, especially renewables such as solar or wind, and this makes the grid more difficult to control. Hence the need emerged for a stable system to manage this new behavior, one that accounts for what is available and what is required. Many organizations use information systems to support their work by storing data and using that data to aid future decisions. Planning by forecasting analyzes historical data in order to save energy [1]. These data are collected over time and are formatted minutely, hourly, daily, weekly, monthly, quarterly, or yearly. They may also be univariate or multivariate, depending on the input variables that affect electricity. Because prediction is a complicated task, several models have been proposed, including engineering, statistical, and artificial intelligence models [2]. The most widely used artificial intelligence model is the artificial neural network (ANN), which is excellent at prediction.
The aim of supervised learning is to predict one or more values from the input variables; regression, which relies on example pairs of data, is a form of supervised learning. A multilayer neural network consists of fully interconnected neurons arranged in layers: the input layer, the output layer, and at least one hidden layer [3]. The hidden layers connect the input and output layers; the input layer receives the data the neural network learns from, and the output layer provides the response to the input data. There is no exact rule for determining the number of hidden layers or the number of neurons in each hidden layer, and many studies have tried to find these numbers. The neural network learns via training algorithms; many exist, but the most commonly used is back-propagation of error.
The error in prediction is fed backwards through the network during training to adjust the weights and minimize the error. This step is repeated until the minimum error is achieved or a specified number of epochs is reached [4]. At the core of the network stands the activation function. In theory, any function can be used as an activation function; in practice, activation functions have either a linear or a nonlinear character. Nonlinearity is important for capturing the complex relationships that exist in the feature space [5]. Linear functions are typically used in output layers, especially for regression, whereas nonlinear activation functions are used in hidden layers; nonlinear functions can also be used in output layers, especially for classification [6]. Moreover, the derivative of the activation function is involved in the error calculation that adjusts the weights of the neuron connections.
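As a minimal illustration of how the derivative of the activation enters the weight update, the sketch below trains a single sigmoid neuron by gradient descent on squared error. This is not the paper's implementation; the weights, learning rate, and toy data are all illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(w, b, x, y, lr=0.1):
    z = np.dot(w, x) + b              # pre-activation
    a = sigmoid(z)                    # neuron output
    err = a - y                       # prediction error
    grad_z = err * sigmoid_prime(z)   # activation derivative enters the gradient here
    w = w - lr * grad_z * x           # weight update
    b = b - lr * grad_z               # bias update
    return w, b, err

w, b = np.array([0.5, -0.3]), 0.0     # illustrative initial weights
x, y = np.array([1.0, 2.0]), 1.0      # one toy training pair
for _ in range(300):
    w, b, err = backprop_step(w, b, x, y)
print(err)
```

Repeating the step shrinks the error, which is exactly the iterative behavior described above; when the sigmoid saturates, `sigmoid_prime` approaches zero and the updates become very small, illustrating the vanishing-gradient issue discussed later.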
The rest of this paper is organized as follows: Section 2 presents the activation functions in comparison. Section 3 describes the data sets, and Section 4 presents the methodology used in this work. Section 5 covers the experiments and their results. Finally, Section 6 concludes the paper.

ACTIVATION FUNCTIONS IN COMPARISON
The choice of activation function strongly influences the performance of neural networks. In this section, we describe the most used activation functions. Sigmoid [7] is a widely used activation function.
Its output values lie between zero and one. Its equation is:

f(x) = 1 / (1 + e^{-x})    (1)

Because its derivative is simple, Sigmoid is fast to compute, but its gradient approaches zero for large |x| and the learning of the network becomes difficult. The hyperbolic tangent function [8] is zero-centered, with output between -1 and 1:

f(x) = (e^{x} - e^{-x}) / (e^{x} + e^{-x})    (2)

SoftSign [8] is closely related to the hyperbolic tangent; SoftSign converges polynomially whereas the hyperbolic tangent converges exponentially. SoftSign is defined as:

f(x) = x / (1 + |x|)    (3)

The rectified linear unit (ReLU) [9] is defined as the positive part of x:

f(x) = max(0, x)    (4)

It is simple and fast, and it mitigates the vanishing gradient problem. The main drawback of ReLU is dead neurons: a neuron is never activated when x is less than zero. In [10], the authors defined the leaky rectified linear unit (LReLU) as:

f(x) = x if x > 0, ax otherwise    (5)
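For quick reference, this first group of functions can be written in a few lines of NumPy. The leak slope a = 0.01 for LReLU is an assumed common default, not a value stated in this paper.

```python
import numpy as np

def sigmoid(x):
    # Eq. (1): output in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

tanh = np.tanh  # Eq. (2): zero-centered, output in (-1, 1)

def softsign(x):
    # Eq. (3): polynomially converging relative of tanh
    return x / (1.0 + np.abs(x))

def relu(x):
    # Eq. (4): positive part of x
    return np.maximum(0.0, x)

def lrelu(x, a=0.01):
    # Eq. (5): small slope a for x < 0 (a = 0.01 is an assumed default)
    return np.where(x > 0, x, a * x)
```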

It is similar to ReLU; the only difference is that LReLU lets a small gradient flow when x < 0, which, from the authors' point of view, solves the dead-neuron problem of ReLU. SoftPlus [11] is defined as:

f(x) = ln(1 + e^{x})    (6)

SoftPlus is a smooth approximation to ReLU, but it is strictly positive. Figure 1 shows the differences among Sigmoid, Tanh, SoftSign, ReLU, Leaky ReLU and SoftPlus.

Figure 1. Sigmoid, Tanh, SoftSign, Leak ReLU, ReLU and SoftPlus activation functions

The Gaussian activation function [12] is defined as:

f(x) = e^{-x^2}    (7)

The exponential linear unit (ELU) [13] is like ReLU and leaky ReLU in avoiding a vanishing gradient through the identity for positive values:

f(x) = x if x > 0, α(e^{x} - 1) otherwise    (8)

where α = 1.0. The scaled exponential linear unit (SELU) [14] is a modified type of ELU with two fixed parameters. It is defined as:

f(x) = γx if x > 0, γα(e^{x} - 1) otherwise    (9)

where α ≈ 1.6733 and γ ≈ 1.0507. Google Brain introduced a new activation function called Swish [15]:

f(x) = x · sigmoid(βx)    (10)

Adjusted Swish (E-Swish) [16] is a modified type of Swish and is defined as:

f(x) = βx · sigmoid(x)    (11)

The difference between Swish and E-Swish is the position of the constant β, as Figure 2 shows; the two functions are equal when β = 1. According to [15, 16], they achieved good results in image classification but had never been examined before in regression. In this work we set β = 1 for Swish, the best value reported in the original paper, and β = 1.5 for E-Swish for the same reason.

Figure 2. ELU, SELU, Gaussian, Swish and E-Swish activation functions
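The second group can be sketched the same way, using the β values stated above (β = 1 for Swish, β = 1.5 for E-Swish) and the SELU constants α ≈ 1.6733, γ ≈ 1.0507:

```python
import numpy as np

def gaussian(x):
    # Eq. (7): bell-shaped, peaks at 1 for x = 0
    return np.exp(-x ** 2)

def elu(x, alpha=1.0):
    # Eq. (8): identity for x > 0, exponential below
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, alpha=1.6733, gamma=1.0507):
    # Eq. (9): gamma-scaled ELU with fixed constants
    return gamma * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swish(x, beta=1.0):
    # Eq. (10): x * sigmoid(beta * x)
    return x / (1.0 + np.exp(-beta * x))

def eswish(x, beta=1.5):
    # Eq. (11): beta * x * sigmoid(x); equals Swish when beta = 1
    return beta * x / (1.0 + np.exp(-x))
```

With β = 1 in both, `swish` and `eswish` return identical values, matching the equality noted above.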

Dataset description
In this paper we try to cover different types of data sets related to electricity. A variety of data sets is used: consumption and production, multivariate and univariate, and some that depend on date and time. Moreover, we consider the relationships between the input and output variables.

Combined cycle power plant
The combined cycle power plant data set consists of four input variables: temperature, ambient pressure, relative humidity and exhaust vacuum. These input variables are used to predict the electrical energy output. The data set was collected over six years (2006–2011) and contains 9568 data points [17]. There is no missing data in this dataset. Figure 3 shows the relationship between energy and its factors.

Energy efficiency dataset
The dataset consists of eight input attributes: relative compactness (Comp), surface area (SA), wall area (WA), roof area (RA), overall height (OH), orientation (Orie), glazing area (GA), and glazing area distribution (GAD), which are the building design parameters. The response outcomes are heating load (HL) and cooling load (CL), which represent power consumption [18].

Appliances energy prediction dataset
It consists of 19735 instances of 28 attributes related to temperature, humidity, pressure, wind speed, day of week and day status [19]. It does not contain any missing values.

Individual household electric power consumption dataset
This dataset is a time series that measures the electric power consumption of a single residential customer. It contains the power consumption recorded every minute over a period of four years, between December 2006 and November 2010. The original dataset contains nine attributes, but we focused on the datetime as a time series and on the household global minute-averaged active power. There are missing values, which we fill using the values that come before them [20]. Table 1 shows the properties of the data sets.
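This gap-filling step (carrying the last observed value forward into each missing slot) can be sketched as follows; the array below is illustrative, not taken from the dataset:

```python
import numpy as np

def forward_fill(x):
    # Replace each NaN with the most recent preceding observation
    x = x.copy()
    for i in range(1, len(x)):
        if np.isnan(x[i]):
            x[i] = x[i - 1]
    return x

a = np.array([1.0, np.nan, np.nan, 4.0])
print(forward_fill(a))   # [1. 1. 1. 4.]
```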

METHODOLOGY
We use neural networks of different depths to examine which activation function is most accurate. Many studies have tried to define the number of hidden layers and the number of neurons in each hidden layer, but no exact rule exists. In this paper we concentrate on the activation function, not on the structure of the neural network. We therefore defined four models of multilayer neural networks with different depths. The first model is simple, with one hidden layer whose number of neurons varies over the four datasets. We specified the number of neurons by the rule of [2]:

h = 2N + 1    (12)

where N is the number of input variables. The other models have more hidden layers. Model 2 consists of two hidden layers with 30 and 20 neurons. The third model is deeper, with 9 hidden layers of 240, 200, 160, 120, 80, 60, 40, 30, and 20 neurons respectively. The last model has 11 hidden layers with 320, 280, 240, 200, 160, 120, 80, 60, 40, 30, and 20 neurons. We went deeper to determine the ability of each activation function in deep neural networks. We use stochastic gradient descent (SGD) [21] as the optimizer, with enough epochs to reach the best solution. The initial learning rate is set to 0.1 and is reduced by multiplying it by 0.2 at epochs 40, 80, 120, and 160.
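To make one of these architectures concrete, here is a minimal NumPy sketch (not the actual training code) of Model 2's forward pass, with Glorot uniform initialization, ReLU in the hidden layers, and a linear output for regression. The choice of 4 inputs matches the combined cycle power plant dataset; the batch of ones is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def relu(x):
    return np.maximum(0.0, x)

def glorot_uniform(n_in, n_out):
    # Glorot uniform initialization: U(-limit, limit), limit = sqrt(6 / (n_in + n_out))
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def build_mlp(sizes):
    # sizes = [n_inputs, hidden sizes..., n_outputs]; returns (weights, biases) pairs
    return [(glorot_uniform(a, b), np.zeros(b)) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU in hidden layers, linear output layer for regression
    for W, b in params[:-1]:
        x = relu(x @ W + b)
    W, b = params[-1]
    return x @ W + b

# Model 2: two hidden layers with 30 and 20 neurons, 4 inputs, 1 output
params = build_mlp([4, 30, 20, 1])
y = forward(params, np.ones((5, 4)))
print(y.shape)   # (5, 1)
```

Swapping `relu` for any of the activation sketches above reproduces the comparisons run in the experiments, while the layer-size list encodes the depth of each model.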
Weights are initialized with Glorot uniform initialization [8]. We use no dropout and a momentum of 0.9. The parameters of the four models are identical for all data sets except the individual household electric power consumption data set, which has a huge amount of data and needs a long execution time; for that set the number of epochs is reduced to 40 and the Adam optimizer [22] is used, as it is considered faster than the others. We use the root mean squared error (RMSE):

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (X_{obs,i} - X_{model,i})^2 )    (13)

where X_obs is the observed value and X_model is the predicted value. To train the neural network efficiently [24], we preprocess the data to transform the input into a better form. We use Min-Max normalization [25], one of the most widely used techniques, which is accomplished by:

x' = (x - x_min) / (x_max - x_min)    (14)
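Both the error measure and the normalization can be expressed directly in NumPy; a minimal sketch:

```python
import numpy as np

def rmse(y_pred, y_true):
    # root mean squared error over all samples
    return np.sqrt(np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2))

def min_max(x):
    # Min-Max normalization: scale each column into [0, 1]
    x = np.asarray(x, dtype=float)
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn)

print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(min_max([[0.0], [2.0], [4.0]]))
```

Note that the min and max should be computed on the training set only and reused for the test set, so that no information from unseen data leaks into training; the sketch above normalizes a single array for brevity.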

EXPERIMENTS AND RESULT
To evaluate each activation function with the four models and datasets, all datasets are divided into a train set and a test set. Each model is trained on 75% of the data, and the rest is used for testing on unseen values. The test set is usually used to gauge the models, but the training set also shows the ability to learn. We compare the median of 9 runs for every model-dataset combination except the individual household electric power consumption dataset.
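The 75/25 split can be sketched as follows; the random seed and toy arrays are illustrative, and in the experiments this would be repeated over 9 runs with the median RMSE reported:

```python
import numpy as np

rng = np.random.default_rng(42)  # illustrative seed

def train_test_split(X, y, test_frac=0.25):
    # Shuffle indices, keep 75% for training and 25% for testing
    n = len(X)
    idx = rng.permutation(n)
    cut = int(n * (1.0 - test_frac))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], X[te], y[tr], y[te]

X = np.arange(100, dtype=float).reshape(100, 1)  # toy features
y = np.arange(100, dtype=float)                  # toy targets
Xtr, Xte, ytr, yte = train_test_split(X, y)
print(len(Xtr), len(Xte))   # 75 25
```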
As shown in Table 2, the ReLU activation achieves the best results in models II, III and IV for both the training and testing sets of the combined cycle power plant data set, while SELU performs best in model I. Table 3 shows that ReLU and Leaky ReLU give the best performance in all models for the energy efficiency dataset: ReLU obtains the minimum training errors in models II, III and IV and the minimum testing error in model II, and Leaky ReLU achieves the minimum errors in the rest. The performance of ReLU and Leaky ReLU also exceeds that of the other activation functions. Tables 3 and 4 present predictions for the same dataset; the only difference is that Table 3 reports the heating load response variable and Table 4 the cooling load response variable.
The correlation among variables in the appliances energy prediction dataset shows clearly that the relationships between its inputs and output are weak, so the error is large compared to the other datasets. In the original paper [19], a multiple linear regression model obtains RMSE = 93.21 for the training set and RMSE = 93.18 for the testing set; the results for this dataset are shown in Table 5. The individual household electric power consumption dataset needs a long time to execute due to its huge size. We therefore reduced the number of epochs to 40 instead of 200, the number of runs to three instead of nine, and tested only Models I and II for the same reason. Table 6 shows that the ReLU activation function outperforms its counterparts. Table 7 summarizes the results of all experiments: ReLU and LReLU outperform the other activation functions, and ReLU learns the training data better than the others do.

CONCLUSION
Prediction of power consumption and production is an important step in managing and controlling power utilities. It helps utilities estimate the quantity of electricity consumed and produced. Many models have been designed for that purpose, and nowadays artificial intelligence models such as feedforward neural networks have proved their efficiency. Building a strong prediction model with a multilayer neural network requires several important decisions, one of which is choosing a suitable activation function. We compared the most used activation functions on data sets related to electrical power and showed that ReLU and Leaky ReLU outperform the other activation functions. ReLU achieves the best result in 13 of 18 training sets and 8 of 18 testing sets, while Leaky ReLU achieves the best result in 3 of 18 training sets and 8 of 18 testing sets. ReLU achieves the smallest training error in all datasets, and it obtains the minimum test-set RMSE on the combined cycle power plant, energy efficiency (heating load), energy efficiency (cooling load), and individual household electric power consumption datasets. On the other hand, Leaky ReLU achieves the minimum test-set error on the appliances energy prediction dataset.