A novel wind power prediction model using graph attention networks and bi-directional deep learning long and short term memory

ABSTRACT


INTRODUCTION
Among the many clean and renewable energy sources, wind energy is significant. Its stochastic and unpredictable character, however, presents considerable difficulties for the dependable and secure operation of power systems. Precise and reliable wind energy forecasting is therefore essential for integrating wind power into the grid. Numerical weather prediction (NWP) approaches, which are based on mathematical equations, employ numerous atmospheric characteristics in weather forecasting [1]. This technique uses equations drawn from the principles of physics and thermodynamics to create predictions. While NWP is an effective solution for many forecasting problems, it still has limitations, such as requiring an excessive amount of computing power [2]. Additionally, because of the assumptions this technique makes about meteorological variables, NWP forecasts may be susceptible to noise in the observations of those variables [3]. Advancements in processing capacity now make it possible to train deep learning models, so deep learning-based models have been increasingly used as an alternative to NWP in weather forecasting tasks [4], [5]. In particular, approaches based on convolutional neural networks [3], graph convolutional neural networks [6], and restricted Boltzmann machines [7] have been used in research on weather prediction models.
Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6847-6854, ISSN: 2088-8708
Owing to the outstanding results obtained with data-driven approaches, several deep-learning algorithms for weather forecasting problems have been documented in the literature [3], [8]. One way to conceptualize meteorological data is as a time series of successive measurements. The challenges of time series forecasting have often been addressed by recurrent neural networks [9]. For these tasks, bi-directional deep learning long and short term memory (BiLSTM) models are chosen [10], although this may impair performance throughout the training period. In contrast, convolutional neural networks (CNNs) are popular and particularly effective for computer vision applications [11]. While vanilla CNNs can handle the geographic aspects of the data, they cannot effectively exploit the temporal properties that are crucial for these weather forecasting applications. Salman et al. [3] developed a 3D-CNN model for forecasting meteorological components. Lara-Benítez et al. [12] used vanilla CNNs in conjunction with temporal convolutional networks (TCNs), a kind of CNN that uses one-dimensional filters to concentrate on the temporal component of the data, for prediction work in the field of meteorology. With the integration of TCNs, CNNs can analyze spatiotemporal data.
For evaluating graph data, graph neural networks (GNNs) [13] have lately grown in prominence. Graph convolutional networks (GCNs) [14] are a kind of GNN that retains a convolutional architecture while operating on graph data rather than assuming that the data is Euclidean. Most GCN implementations focus largely on node classification tasks that need knowledge of the graph's nodes and edges, although GCN adaptations using a learnable adjacency matrix have been proposed in [6]. One benefit of this is that ordinary data can now be represented using an adjacency matrix, which is typically necessary for GNN implementations. However, to extract spatial information, it is desirable to learn the adjacency matrix during the training phase. Owing to the transformer design presented in [15], attention methods have lately become quite popular, and GNN frameworks have also integrated the attention mechanism. For example, graph attention networks (GAT) [16] use a self-attention mechanism similar to that of a transformer model to determine the attention coefficients between the vertices of the graph.
Researchers often ignore the impact of spatiotemporal connections between wind turbines and instead analyze and model only the past data of a single wind turbine. Therefore, the development of spatiotemporal models with high wind energy prediction accuracy is crucial. In this article, we propose a novel spatiotemporal wind energy prediction model founded on GAT followed by a BiLSTM algorithm to combine the temporal and geographical information in the data. GAT is a subtype of GNN created specifically to analyze graph data. Here it serves as a model for predicting wind power based on a variety of geographical data: it studies the geographical relationships between different nodes in the graph and how the characteristics of one node can influence those of another. It uses the attention mechanism to gather data from several nodes, while BiLSTM captures temporal correlations between meteorological data collected at distinct times. Our proposed improvement to the GAT-BiLSTM model makes it possible to calculate the node attention used by GAT in a more accurate and understandable way. In our research, past weather data from the wind farm of Tetouan City in Morocco is used to estimate wind speed values for many hours into the future.
This paper is structured as follows: section 2 describes the proposed model. Section 3 presents the numerical experiments, a discussion of the results, and a comparison with other models. Section 4 contains the conclusion.

METHOD
2.1. Graph attention networks (GAT)
Mainstream GNNs fall into two subcategories: spectral-domain GNNs and spatial-domain GNNs. The network topology of a spectral-domain GNN is defined by filters, whereas a spatial-domain GNN analyzes node correlations to construct the network topology. Currently, the Laplacian matrix is the only method used by spectral-domain GNNs to decompose the graph and recover its characteristic information.
Because the spectral-domain network only decomposes and extracts the characteristic information of the graph using the Laplacian matrix, it has disadvantages [17]: i) noise in the graph easily affects the eigenvalues, ii) it is challenging to extend the network since the filter it uses can only be applied to a certain structure, and iii) both the time and financial costs of feature computation and analysis are very high.
GAT introduced an attention mechanism to address the difficulties of GCN [18]. As a result, GAT can give higher weights to important nodes, as the neural network parameters and the attention weights are learned simultaneously [19]. At present, researchers build GAT by combining multiple attention mechanism networks. For the graph's first attention layer, the input is the feature set of the original nodes and the output is the transformed feature set [20]. The fundamental principle of GAT is to find the spatial relationship between nodes by calculating the weight coefficients of neighboring nodes. The model creates a shared weight matrix $W$ [21] and uses it to determine the attention coefficient:

$$e_{ij} = a(W h_i, W h_j) \qquad (1)$$

where $h$ denotes the input features, $W$ denotes the weight matrix, and $a$ denotes the attention mechanism.
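To make the attention coefficient described above concrete, the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the attention mechanism $a$ is a single vector applied to the concatenated transformed features, followed by a LeakyReLU with slope 0.2 and a softmax over all nodes, as in the usual GAT formulation. The function names, array shapes, and random inputs are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU with the negative slope commonly used in GAT."""
    return np.where(x > 0, x, slope * x)

def raw_coefficients(h, W, a):
    """Unnormalized coefficients e[i, j] = a . [z_i || z_j], with z = h @ W.

    h: (N, F) node features, W: (F, F_out) shared weight matrix,
    a: (2 * F_out,) attention vector (assumed form of the mechanism a).
    """
    z = h @ W                                  # (N, F_out) transformed features
    f_out = z.shape[1]
    src = z @ a[:f_out]                        # contribution of node i
    dst = z @ a[f_out:]                        # contribution of node j
    return src[:, None] + dst[None, :]         # (N, N)

def attention_scores(h, W, a):
    """Row-wise softmax of LeakyReLU(e) over all nodes (fully connected graph)."""
    e = leaky_relu(raw_coefficients(h, W, a))
    e = e - e.max(axis=1, keepdims=True)       # numerical stability
    w = np.exp(e)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical toy input: 4 nodes, 3 features, 2 transformed features.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))
e = raw_coefficients(h, W, a)
alpha = attention_scores(h, W, a)
```

Each row of `alpha` sums to one, so row `i` can be read as how strongly node `i` attends to every node, itself included.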

2.2. Bi-directional deep learning long and short-term memory (BiLSTM)
Hochreiter and Schmidhuber [22] proposed the long short-term memory (LSTM) form of the recurrent neural network (RNN) in 1997 to improve the ability to capture long-term dependencies in temporal sequence data. It gets around the vanishing/exploding gradient problem that standard RNNs experience. Because the memory of the LSTM helps to preserve the error that is back-propagated through the layers, recurrent networks can learn across many past time steps without gradient difficulties [23]. The LSTM's structure uses blocks to store values over varied durations. Each block has two parts that act as memories: the "cell state", denoted $c_t$, which can recall prior information and represents long-term memory, and the "hidden state", which only transmits information to the subsequent time step and represents short-term memory.
BiLSTM, which consists of two LSTM models, is a powerful and enhanced version of the LSTM network. In the first LSTM, the signal propagates forward, whereas in the second it propagates backward. This implies that when processing the data, the network concurrently considers both past and future context information [24]. Figure 1 illustrates the construction of the BiLSTM network and demonstrates that each memory block has two LSTM layers. In other words, the output $y_t$ is determined by combining the two opposite hidden states ($\overrightarrow{h}_t$ and $\overleftarrow{h}_t$) obtained by moving forward and backward through the layers. The corresponding formulas are [25]:

$$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1}) \qquad (2)$$

$$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1}) \qquad (3)$$

$$y_t = W_{\overrightarrow{h}} \overrightarrow{h}_t + W_{\overleftarrow{h}} \overleftarrow{h}_t + b_y \qquad (4)$$

where $f(\cdot)$ denotes the LSTM function, $x_t$ is the input at time step $t$, $b_y$ is the output layer bias, $y_t$ is the final output of the BiLSTM network, and $W_{\overrightarrow{h}}$ and $W_{\overleftarrow{h}}$ are, respectively, the weights of the forward and reverse LSTM.
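The forward and backward passes described above can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the gate layout inside `lstm_step`, the parameter dictionary keys, the zero initial states, and all dimensions are hypothetical choices, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, P):
    """One LSTM step. P holds 'W' (4H, D), 'U' (4H, H) and bias 'b' (4H,)."""
    H = h_prev.shape[0]
    g = P["W"] @ x_t + P["U"] @ h_prev + P["b"]
    i, f, o = sigmoid(g[:H]), sigmoid(g[H:2 * H]), sigmoid(g[2 * H:3 * H])
    c_t = f * c_prev + i * np.tanh(g[3 * H:])   # cell state (long-term memory)
    h_t = o * np.tanh(c_t)                      # hidden state (short-term memory)
    return h_t, c_t

def run_lstm(xs, P, H, reverse=False):
    """Run the LSTM over the sequence, forward or backward in time."""
    T = len(xs)
    order = range(T - 1, -1, -1) if reverse else range(T)
    h, c = np.zeros(H), np.zeros(H)
    out = [None] * T
    for t in order:
        h, c = lstm_step(xs[t], h, c, P)
        out[t] = h
    return np.stack(out)                        # (T, H), indexed by time step

def bilstm(xs, P_fwd, P_bwd, W_f, W_b, b_y, H):
    """Output y_t = W_f h_fwd_t + W_b h_bwd_t + b_y for every time step."""
    h_f = run_lstm(xs, P_fwd, H)
    h_b = run_lstm(xs, P_bwd, H, reverse=True)
    return h_f @ W_f.T + h_b @ W_b.T + b_y

def init_params(rng, D, H):
    return {"W": rng.normal(scale=0.1, size=(4 * H, D)),
            "U": rng.normal(scale=0.1, size=(4 * H, H)),
            "b": np.zeros(4 * H)}

# Hypothetical toy sequence: 5 time steps, 3 input features, 4 hidden units.
rng = np.random.default_rng(0)
T, D, H, O = 5, 3, 4, 1
xs = rng.normal(size=(T, D))
P_fwd, P_bwd = init_params(rng, D, H), init_params(rng, D, H)
W_f, W_b, b_y = rng.normal(size=(O, H)), rng.normal(size=(O, H)), np.zeros(O)
y = bilstm(xs, P_fwd, P_bwd, W_f, W_b, b_y, H)
```

Because the backward pass starts at the last time step, the output at the first step already depends on the final input, which is the "future context" the text refers to.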
Applying the LSTM twice, once in each direction, increases the amount of information passed through the network. As a result, learning twice as much long-term correlation information within the fundamental LSTM architecture improves prediction performance [10], and BiLSTM with its dual information capability outperforms the unidirectional LSTM in a variety of areas.
Wind power forecasting is the basic technique for controlling and distributing the energy of wind farms, but it is complicated by the unpredictable and non-linear variation of wind power data. In this article, therefore, we study wind energy forecasting in detail. The graph-based predictor, shown in Figure 2, makes up the majority of the proposed GAT framework.

Figure 1. Graph attention network layer
The spatial and temporal wind turbine datasets are pre-processed, and the raw wind power data is split into a training set and a testing set. The spatiotemporal wind power prediction model uses GAT followed by BiLSTM as the main predictor, and all of its weight parameters are determined using the training set. GAT aggregates the spatial node feature data from the raw datasets.
An adjacency matrix is a common way to represent a graph. If the architecture of the graph is unknown, however, the weights of its edges can be learned. The adjacency matrix has been made learnable in previous works [6], and this has been shown to enhance the capacity of the prediction model. It may also shed light on how the characteristics of one node are influenced by those of other nodes. The typical GAT model requires knowledge of the adjacency matrix of the graph to work [26]. In contrast to the model in [26], we make the adjacency matrix learnable, which gives the network the freedom to learn the spatial relationships between the nodes in the graph. This learned information is later used to calculate the attention scores for the GAT. During training, the adjacency matrix is subjected to the same processing as in (5) [6].
The identity matrix is added to the adjacency matrix $A$ to take into account the potential self-loop at each vertex. The updated adjacency matrix $\hat{A}$ is then normalized using the min-max method. Vertex feature values in the traditional GAT formulation [26] are treated as one-dimensional. GAT computes attention scores based on the vertices and their one-dimensional features, which are then used to create a new representation of each node in the graph [26]. We assume that the set of node features is given as $h = \{h_1, h_2, \ldots, h_N\}$, where $N$ is the number of nodes and $h_i \in \mathbb{R}^F$, with $F$ the number of features of each node. First, a shared weight matrix applies a linear transformation that maps the features to a specified number $F'$ of features. A self-attention mechanism is then used to calculate the attention score of each vertex relative to the others, and these scores are combined with the transformed feature values as a dot product. More precisely, the attention score $\alpha_{ij}$ for the pair formed by the $i$-th and $j$-th vertices is calculated as in (6).
In (6), the features of the $i$-th and $j$-th vertices are transformed into $F'$ dimensions and concatenated, i.e., $[W h_i \,\|\, W h_j] \in \mathbb{R}^{2F'}$. Here, $h_i \in \mathbb{R}^F$ and $h_j \in \mathbb{R}^F$ are the feature vectors of the $i$-th and $j$-th vertices, respectively, and $W \in \mathbb{R}^{F \times F'}$ is the associated weight matrix that linearly transforms the features of each vertex. In GAT, the self-attention mechanism is modeled by a single-layer feedforward neural network parameterized by $a \in \mathbb{R}^{2F'}$. The attention mechanism produces a scalar, to which the LeakyReLU nonlinearity with a negative slope of 0.2 is applied. The attention score is then normalized with a softmax. The graph adjacency matrix determines the neighborhood of the $i$-th vertex, that is, the vertices taken into account when applying the softmax. In contrast, in our case, the adjacency matrix is learned during training without restricting the connections between the graph's vertices. As a result, all of the nodes are linked, and the entries of the adjacency matrix determine the strength of the connections between nodes. The attention scores are multiplied by the linearly transformed features of the vertices, and the sum is then calculated over adjacent vertices. Thus, whereas in the traditional GAT formulation [26] the combination of the self-attention coefficients $a$ and the concatenated node features $[W h_i \,\|\, W h_j]$ produces a scalar, here it produces a vector of size $F$, the number of weather variables. The new representation of the $i$-th vertex is then computed as in (7), where $\alpha_{ij}$ is the attention score between the $i$-th and $j$-th nodes. The final representation of the nodes, obtained after applying $K$ independent attention mechanisms and concatenating their outputs, is given in (8).
The final representation of the $i$-th node is then determined as in (9).
Here, the diagonal matrix $\hat{D}$ is computed from the learned $\hat{A}$ in (5). The symmetric normalization, i.e., $\hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2}$, is then applied. To determine the new feature representation $\tilde{h}_i$, the features of the $i$-th node, given by $\hat{h}_i$, are multiplied by the resulting normalized matrix. After passing through a spatial block composed of GAT layers, the output is reshaped and fed into a BiLSTM network. The GAT layers described above extract spatial information from the meteorological data, while the BiLSTM enhances the model's capacity to identify temporal correlations in the data and how observations at different time steps affect one another.
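The adjacency processing described in this section (self-loops, min-max scaling, symmetric normalization) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function name `process_adjacency` is hypothetical, the min-max scaling is assumed to act on the whole matrix, the diagonal matrix is assumed to hold the row sums of the scaled matrix, and a fixed random matrix stands in for the learned adjacency.

```python
import numpy as np

def process_adjacency(A):
    """Add self-loops, min-max scale, then apply D^{-1/2} A_hat D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])              # identity adds self-loops
    lo, hi = A_hat.min(), A_hat.max()
    A_hat = (A_hat - lo) / (hi - lo)            # assumes hi > lo
    deg = A_hat.sum(axis=1)                     # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    return d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]

# Hypothetical stand-in for a learned adjacency over 4 nodes.
rng = np.random.default_rng(0)
A = rng.uniform(size=(4, 4))
A = (A + A.T) / 2                               # symmetric for this toy example
S = process_adjacency(A)

# Mix node representations (4 nodes, 3 features) with the normalized matrix.
h_hat = rng.normal(size=(4, 3))
h_tilde = S @ h_hat
```

In a trainable model the entries of `A` would be parameters updated by back-propagation, so the same processing would be applied at every training step.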

RESULTS AND DISCUSSION
In this section, we present the findings and discuss the numerical analysis. We used data from the wind farm in Tetouan, Morocco, collected on each wind turbine and consisting primarily of wind power, wind turbine status, and weather. The original meteorological data used in this article was pre-processed, mainly to include wind speed, wind direction, and ambient temperature. These three categories of meteorological data are essential for producing a precise spatiotemporal wind power prediction with the proposed GAT-BiLSTM wind forecast model.
Our simulation dataset was divided into training and test sets: we used 70% of the data to train the model and 30% to assess its effectiveness. The proposed approach is implemented in Python.
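A 70/30 split of a time-ordered dataset might look like the sketch below. It assumes a chronological split (the earliest 70% for training), which the text does not specify. The function name and the synthetic series are hypothetical.

```python
import numpy as np

def chronological_split(series, train_fraction=0.7):
    """Split a time-ordered array into leading train and trailing test parts."""
    cut = int(round(len(series) * train_fraction))
    return series[:cut], series[cut:]

# Hypothetical series of 100 hourly measurements.
data = np.arange(100, dtype=float)
train, test = chronological_split(data)
```

Keeping the test samples strictly after the training samples avoids leaking future observations into training, which matters for forecasting tasks.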
To measure the efficiency of the forecasting model, we conducted a comparative study for a winter month using three fundamental algorithms that are among the most often used in wind energy forecasting: back-propagation (BP) [27], support vector machine (SVM) [28], and extreme learning machine (ELM) [29]. A winter month was chosen to investigate the effectiveness of the proposed model and its capacity to perform well during this crucial season. We compared the predictions of the BP, SVM, and ELM algorithms and of the proposed model with the measured observations for a winter month; the prediction results are shown in Figures 3 to 6. According to these figures, the GAT-BiLSTM model provides accurate predictions for the winter month because its curve closely follows the measured one. Table 1 lists the forecasting outcomes for the winter month for the four models. The table reports the convergence time of each approach in seconds as well as the mean squared error (MSE), which is calculated as (10):

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \qquad (10)$$

where $N$ is the number of instances in the test set, $\hat{y}_i$ is the predicted output, and $y_i$ is the measured output. Table 1 highlights the performance of the proposed GAT-BiLSTM algorithm in terms of MSE. BP and SVM produced the largest MSEs, 0.9251 and 0.8689 respectively, reflecting the large gap between their predictions and the measured test values. The ELM error is about 0.3486. All of these errors are greater than the 0.1573 produced by GAT-BiLSTM.
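As a small illustration of the error metric, the sketch below computes the MSE on made-up values and then ranks the four MSE scores reported in the text. The `mse` helper and the toy arrays are hypothetical. Only the four reported scores come from the paper.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between measured and predicted outputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_pred - y_true) ** 2))

# Toy example with made-up numbers: one prediction is off by 2.
toy_error = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# MSE values reported in the text for the winter month.
reported = {"BP": 0.9251, "SVM": 0.8689, "ELM": 0.3486, "GAT-BiLSTM": 0.1573}
ranking = sorted(reported, key=reported.get)
```

With these numbers `ranking` places GAT-BiLSTM first and BP last, matching the ordering discussed in the text.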
In addition, BP and SVM have the largest convergence times: they take a long time to compute and need several iterations. ELM and the proposed model give the lowest values, with the proposed model still giving the smaller convergence time. Overall, the proposed GAT-BiLSTM method produced better predictions than the comparison models and converged quickly, which demonstrates the reliability of the approach.

Figure 3. Wind power predicted by BP
Figure 4. Wind power predicted by SVM

Table 1. Comparison between models for a month in winter