Multi-task learning using non-linear autoregressive models and recurrent neural networks for tide level forecasting

ABSTRACT


INTRODUCTION
Tide levels can have a significant impact on human life, and the study of ocean phenomena is an essential part of coastal engineering, coastal ecosystems, and human activity. In the field of coastal engineering, tide level data are valuable for the construction of ports, offshore structures and cross-sea bridges [1]-[4]. For coastal ecosystems, tide level data are crucial for studying sediment movement and for pollutant tracing and monitoring [1], [5]. In the domain of human activity, tide level data serve as important information for fishing, recreational activities [6], and the potential development of tidal energy [7], [8]. It is therefore important to model and forecast tide levels effectively. In doing so, tide level data are usually observed and recorded as time series. A classical way to forecast tide levels is the harmonic analysis method. Such a traditional approach can be ineffective if the data are incomplete (e.g., with some observations missing) [9]. Harmonic analysis methods also usually demand a substantial number of parameters, because they need to use not only astronomical but also non-astronomical features [2], [10]. To overcome these drawbacks, an alternative forecasting method is required.
Various algorithms and models have been proposed and explored to improve the accuracy of time series forecasting. One of the most popular approaches is the use of neural networks. Forecasting using artificial neural networks (ANNs), combined with other models, has attracted extensive attention. Specifically, the nonlinear autoregressive with exogenous input (NARX) model and the nonlinear autoregressive moving average with exogenous input (NARMAX) model have been widely applied to complex system identification, modelling and time series forecasting [11]. For example, Aguirre et al. [12] employed both nonlinear autoregressive (NAR) and multi-layer perceptron models to investigate two fundamental issues underlying periodic time series forecasting tasks (e.g., daily load forecasting), namely pattern mapping and dynamical prediction; the results are interesting and useful for designing more effective predictive approaches for short-term periodic time series forecasting. Wu et al. [13] proposed a combination of a genetic algorithm, backpropagation and NARX for tide level prediction. Muñoz and Acuña [14] developed NARX and NARMAX models combined with shallow and deep neural networks (DNNs) to forecast daily demand data and air quality conditions. The performance of NARMAX and radial basis function (RBF) neural networks was examined by Omri et al. [15] for water flow depth forecasting. Gu et al. [16] proposed a neural network enhanced NARMAX model to predict the disturbance storm time (Dst) index; the model showed better performance than either a NARX model or a typical neural network model alone. A novel cloud-NARX model was presented by Gu et al. [17] for auroral electrojet (AE) index forecasting and prediction uncertainty analysis. The aforementioned methods, combining ANNs and other models, demonstrate promising results for time series prediction. However, most of these methods are designed for single task learning (STL), where a model trained on a set of time series data can only be used to make predictions for the same time series process; it cannot be used for other time series predictions and therefore cannot benefit from sharing knowledge among related tasks [18], [19]. In many real scenarios, two or more different events or processes may be closely associated with each other; it is therefore desirable to design a new framework that can handle more than one different but similar dataset simultaneously.
To extend STL models to multi-task learning (MTL) cases, this paper proposes a new MTL framework by combining nonlinear autoregressive models and recurrent neural networks (RNNs). The proposed MTL framework performs multiple tasks simultaneously, allowing information to be shared between related tasks. For an RNN, an MTL scheme can be implemented by sharing the structure of the network models (e.g., the number of layers and the number of nodes in each layer) and their training process. Sharing information between multiple related tasks can boost learning efficiency and improve the generalization ability of the resulting models. These performance improvements are achieved through knowledge sharing between different tasks [19]-[21].
Several MTL RNN models have been proposed for forecasting purposes in the literature. In study [22], MTL and long short-term memory (LSTM) RNN models were combined for wind power forecasting; the prediction accuracy was increased by more than 23.13% in comparison with the existing STL forecasting models. In study [23], MTL and RNN were used to forecast urban traffic flow; the forecasting accuracy was improved by around 10% to 15% over baseline models. In study [18], another RNN variant, the gated recurrent unit (GRU), was combined with MTL for traffic flow and speed forecasting; the model not only improves the forecasting accuracy but also solves the problems caused by enlarging the dataset. In study [24], the performance of MTL with LSTM (MTL+LSTM) was compared with that of MTL+GRU for health assessment and remaining useful life forecasting; interestingly, LSTM gives a smaller loss and a simpler model, while GRU performs well with less training time. MTL based on bidirectional LSTM (BiLSTM) was proposed to forecast cooling, heating, and electric load [25]; the MTL+BiLSTM model is able to increase the accuracy significantly and improve the time efficiency of model training.
Over the past years, NARX and NARMAX models have been applied to STL, and RNNs have been introduced to solve multi-task learning problems, but relatively little work has been done to combine nonlinear autoregressive models with RNNs and apply them to MTL. This study aims to develop new models for tide level forecasting. The proposed approach is as follows. It first combines three RNN variants (LSTM, GRU, and BiLSTM) with NARX and NARMAX, respectively. Then, it compares the performance of all the resulting models. Finally, it determines the best models for tide level forecasting by evaluating the accuracy of these models, with and without the proposed MTL scheme.
The study conducts comprehensive comparisons between three types of recurrent neural networks (LSTM, GRU, and BiLSTM), paired with NARX and NARMAX models, and analyses their performance for forecasting tide levels at different locations. The main contributions of the work are summarized as follows: a. The proposal of a NARMAX and NARX modelling framework, combined with an MTL scheme, for forecasting tide levels at many different locations simultaneously. In MTL with hard parameter sharing, the hidden layers are shared across all tasks, while the output layers are allocated separately to the different tasks. The hard parameter sharing technique offers a lower chance of overfitting and a smaller loss or error, because the MTL model holds all the knowledge and information, sharing all the parameters and training all tasks jointly [26]-[28]. In soft parameter sharing, by contrast, each task has its own specific hidden layers with independent parameters. The distance between the model parameters of the different tasks is then regularized to encourage the parameters to be similar.
Although every task has its own model with its own settings, the distance between the model parameters of the dissimilar tasks is added to the joint objective function [26], [27]. Several approaches have been proposed to explore the sharing mechanism in MTL, for example, supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, active learning, online learning, parallel and distributed learning, and multi-view learning [29].
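The hard parameter sharing idea described above can be sketched in a few lines of plain Python. This is a minimal illustration (the class and all dimension names are our own, not the paper's): a single hidden layer whose weights are shared by every task, with a separate linear output head per task.

```python
import math
import random

# Minimal sketch of hard parameter sharing (class and dimension names are
# ours, not the paper's): one hidden layer whose weights are shared by all
# tasks, plus a separate linear output head per task.
class HardSharingMTL:
    def __init__(self, n_inputs, n_hidden, n_tasks, seed=0):
        rng = random.Random(seed)
        # Shared hidden-layer weights: trained jointly on all tasks.
        self.W_shared = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                         for _ in range(n_hidden)]
        # Task-specific output heads: one weight vector per task.
        self.heads = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                      for _ in range(n_tasks)]

    def forward(self, x, task):
        # Shared tanh hidden representation, identical for every task.
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
             for row in self.W_shared]
        # Only the output head differs between tasks.
        return sum(w * hi for w, hi in zip(self.heads[task], h))

model = HardSharingMTL(n_inputs=3, n_hidden=4, n_tasks=5)
preds = [model.forward([0.1, 0.2, 0.3], task=t) for t in range(5)]
```

During joint training, gradients from all five tasks would flow into `W_shared`, which is what makes the shared layer act as a regularizer across tasks.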

NARX and NARMAX models
NARX and NARMAX are models that utilize not only exogenous (external) input variables but also lagged versions of the system's own output, which enter the model structure through output delays [11]. Both models introduce nonlinear functions to learn the system input-output relationships, but the NARMAX model also includes prediction errors to improve the modelling performance [17]. The introduction of an "error variable" or "residual variable" makes NARMAX more powerful for nonlinear system identification [30]. The most commonly used basis functions in NARX and NARMAX modelling are polynomials, which usually lead to transparent, interpretable and parsimonious models [11], [31], [32]. Another widely used nonlinear representation applied with NARX and NARMAX is neural networks, which usually result in black box models. A modelling framework combining NARX or NARMAX with neural networks may be called grey box model identification [11], [33]. For single-input STL systems, the NARX model can be formulated as (1):

y(t) = f(y(t-1), ..., y(t-n_y), u(t-d), ..., u(t-d-n_u)) + e(t)   (1)

where y(t), u(t) and e(t) are the system output, input, and noise, respectively; n_y, n_u and n_e are the maximum lags in the output, input and noise; f(.) is a nonlinear function, and d is a time delay which is typically set to d=0 or d=1. For multi-task learning systems, NARX can be formulated as (2):

y_i(t) = f_i(y_1(t-1), ..., y_1(t-n_y), ..., y_m(t-1), ..., y_m(t-n_y), u_1(t-d), ..., u_1(t-d-n_u), ..., u_r(t-d), ..., u_r(t-d-n_u)) + e_i(t)   (2)

where i = 1, 2, ..., m, with m being the number of outputs and r the number of inputs. Correspondingly, for STL and MTL systems, the NARMAX models can be respectively formulated as (3) and (4):

y(t) = f(y(t-1), ..., y(t-n_y), u(t-d), ..., u(t-d-n_u), e(t-1), ..., e(t-n_e)) + e(t)   (3)

y_i(t) = f_i(y_1(t-1), ..., y_m(t-n_y), u_1(t-d), ..., u_r(t-d-n_u), e_i(t-1), ..., e_i(t-n_e)) + e_i(t)   (4)

In this work, three variants of RNN, that is, LSTM, BiLSTM and GRU, are used to implement the nonlinear functions f and f_i (i = 1, 2, ..., m), adopting the hard parameter sharing MTL framework.
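In practice, fitting a NARX model starts by assembling the lagged regressor vectors of equation (1). The helper below is an illustrative sketch (its name and default lag values are our own assumptions, not the paper's settings) for a single-input, single-output series with n_y = 2, n_u = 2 and d = 1.

```python
# Illustrative sketch (helper name and lag values are our own, not the
# paper's): build the lagged regressor vectors of equation (1) so that a
# nonlinear function f can be fitted to map X[k] -> y(t).
def narx_regressors(y, u, n_y=2, n_u=2, d=1):
    """Return (X, targets): X[k] = [y(t-1),...,y(t-n_y), u(t-d),...,u(t-d-n_u+1)]."""
    X, targets = [], []
    start = max(n_y, d + n_u - 1)  # first t with all lags available
    for t in range(start, len(y)):
        lagged_y = [y[t - j] for j in range(1, n_y + 1)]
        lagged_u = [u[t - d - j] for j in range(0, n_u)]
        X.append(lagged_y + lagged_u)
        targets.append(y[t])
    return X, targets

y = [0.0, 0.1, 0.3, 0.2, 0.5, 0.4]  # toy output series
u = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5]  # toy exogenous input series
X, targets = narx_regressors(y, u)
```

For the RNN implementations used in this paper, the same lagged windows would be presented to the network as input sequences rather than flat vectors.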

The NARX-RNN-MTL frameworks
The structure of the proposed NARX-RNN-MTL framework is presented in Figure 1, where ŷ_i(t) (i = 1, 2, ..., r) are the predicted values of the output, and n and r are the number of samples and the number of locations, respectively. The proposed model has an input layer that accepts the sequence data of tide levels measured at r locations; the next layer is built using an RNN. The RNN layer can be GRU, LSTM or BiLSTM, and the last layer, which is a fully connected dense layer, provides the predicted values of the tide levels.

The NARMAX-RNN-MTL frameworks
Similar to the NARX-RNN-MTL model, the NARMAX-RNN-MTL model is also implemented using GRU, LSTM or BiLSTM, and the forecasting results are produced by the dense layer. The difference between the two models is that the NARMAX architecture includes 'noise' as an input. Note that the noise cannot be measured, but it can be estimated using the model prediction error. The architecture of the NARMAX-RNN-MTL model is presented in Figure 2.
To identify a NARMAX model, we use the prediction errors from the NARX-RNN-MTL model as additional inputs. To differentiate between noise and prediction error, we use e(t) and ε(t) to represent the noise and the model prediction error, respectively. The model prediction error of the NARX-RNN-MTL model is (5):

ε_i(t) = y_i(t) - ŷ_i(t)   (5)

where ŷ_i(t) is the one-step ahead prediction calculated from the NARX-RNN-MTL model and y_i(t) is the corresponding actual signal. For an MTL system with r inputs and m outputs, the i-th output of the NARMAX-RNN-MTL model, calculated based on the associated NARX-RNN-MTL model, is (6):

y_i(t) = f_i(y_1(t-1), ..., y_m(t-n_y), u_1(t-d), ..., u_r(t-d-n_u), ε_i(t-1), ..., ε_i(t-n_e)) + e_i(t)   (6)
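The two-stage construction in equations (5)-(6) can be sketched as follows. This is an illustrative toy example (function names, lag settings and the first-stage predictions are our own assumptions): residuals of a hypothetical first-stage NARX predictor become extra lagged inputs for the second-stage NARMAX regressor.

```python
# Sketch of the NARMAX two-stage idea (names and numbers are ours):
# residuals of a first-stage NARX predictor are appended as extra lagged
# inputs, per equations (5) and (6).
def residuals(y_true, y_pred):
    # epsilon(t) = y(t) - yhat(t), equation (5)
    return [yt - yp for yt, yp in zip(y_true, y_pred)]

def narmax_row(y, u, eps, t, n_y=1, n_u=1, n_e=1, d=1):
    """Regressor for equation (6): lagged outputs, inputs and residuals."""
    return ([y[t - j] for j in range(1, n_y + 1)]
            + [u[t - d - j] for j in range(0, n_u)]
            + [eps[t - j] for j in range(1, n_e + 1)])

y_true = [0.0, 0.1, 0.3, 0.2]
y_narx = [0.0, 0.2, 0.2, 0.3]          # hypothetical first-stage predictions
u = [1.0, 0.9, 0.8, 0.7]
eps = residuals(y_true, y_narx)
row = narmax_row(y_true, u, eps, t=2)  # [y(1), u(1), eps(1)]
```

In the actual framework, the residual sequence would feed the NARMAX-RNN-MTL network alongside the measured inputs and outputs.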

EXPERIMENTS

The task
This study is concerned with a multi-task problem, that is, forecasting tide levels at five stations (5 tasks) at the same time, namely, Harwich, Lerwick, Millport, Portrush and Weymouth, using the proposed NARX-RNN-MTL and NARMAX-RNN-MTL models. The MTL model is designed for one-step ahead prediction of the tide level; here one step equals 15 minutes. The datasets measured at the five stations, from January 1, 2022 to May 31, 2022, with a sampling period of 15 minutes, are used for model building. In doing so, the data were split into three parts: 60% for training, 20% for validation and 20% for testing.
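For a time series task, the 60/20/20 split described above must preserve temporal order, so the data are split by position rather than shuffled. A minimal sketch (the function name is ours):

```python
# Minimal sketch of a chronological 60/20/20 split (function name is
# ours): time series data must not be shuffled, so we split by position.
def split_series(series, train=0.6, val=0.2):
    n = len(series)
    n_train = int(n * train)
    n_val = int(n * val)
    return (series[:n_train],                  # oldest 60%: training
            series[n_train:n_train + n_val],   # next 20%: validation
            series[n_train + n_val:])          # newest 20%: testing

data = list(range(100))  # stand-in for one station's tide level series
train, val, test = split_series(data)
```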
Experiments with larger numbers of tasks are also performed. Specifically, the proposed models are used to forecast tide levels at six more stations, namely, Aberdeen, Devonport, Fishguard, Holyhead, St Mary and Stornoway, to further assess the models' generalization performance when solving many tasks, ranging from 6 to 11. The datasets for these six stations were measured in the same period and with the same sampling period of 15 minutes. All datasets were extracted from the website of the British Oceanographic Data Centre (BODC) [34].

Experimental settings and performance metrics
To achieve good model performance, the network hyper-parameters were determined through simulations by testing the sets of parameters shown in Table 1. For a single task, for example, the prediction of a single individual time series y_i(t) (i = 1, 2, ..., m), without using shared information from any other signals, the loss function can be defined as (7):

RMSE_i = sqrt( (1/n) * sum_{t=1}^{n} (y_i(t) - ŷ_i(t))^2 )   (7)

where ŷ_i(t) is the predicted value, y_i(t) is the actual value and n is the number of observations.
Note that this study is concerned with dealing with multiple tasks simultaneously, so the loss function for model training should accommodate the losses of all the tasks. Keeping this in mind, we use the averaged root mean square error (aRMSE) as a joint loss, defined as (8):

aRMSE = (1/m) * sum_{i=1}^{m} RMSE_i   (8)
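The per-task RMSE of (7) and the joint aRMSE loss of (8) are straightforward to implement; a short sketch (function names are ours):

```python
import math

# Sketch of the per-task RMSE, equation (7), and the joint aRMSE loss,
# equation (8); function names are ours.
def rmse(actual, predicted):
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def joint_loss(actuals, predictions):
    # aRMSE: average of the per-task RMSE values over the m tasks.
    m = len(actuals)
    return sum(rmse(a, p) for a, p in zip(actuals, predictions)) / m

# Two toy tasks with constant error magnitudes 1.0 and 3.0.
a1, p1 = [0.0, 0.0], [1.0, -1.0]
a2, p2 = [0.0, 0.0], [3.0, -3.0]
loss = joint_loss([a1, a2], [p1, p2])  # (1.0 + 3.0) / 2 = 2.0
```

Averaging the per-task errors means each station contributes equally to the training objective, regardless of its tidal range.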

NARX and NARMAX combined with GRU, LSTM, and BiLSTM
We compare NARX and NARMAX using the three different RNN models, with a variety of time lags. Two optimizers, namely stochastic gradient descent with momentum (SGDM) and adaptive moment estimation (ADAM), are used to optimize the associated models. The joint loss, i.e., the average root mean square error (RMSE), was computed for the combinations of candidate parameters presented in Table 1, simultaneously over the datasets of the five stations (i.e., Harwich, Lerwick, Millport, Portrush and Weymouth), in order to find the optimal network parameters and determine the best model structure. The performance of the NARX-RNN-MTL and NARMAX-RNN-MTL models was then compared based on the RMSE values over the five tasks. Three RNN structures, namely LSTM, BiLSTM and GRU, were used for building the NARX-RNN-MTL and NARMAX-RNN-MTL models.
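The two optimizers compared here follow standard update rules. As a reference point, this is a minimal sketch of the SGD-with-momentum (SGDM) step applied to a toy quadratic loss L(w) = w^2, whose gradient is 2w (the function name and hyper-parameter values are ours, not the paper's settings):

```python
# Minimal sketch of the SGDM update rule on a toy quadratic loss
# L(w) = w^2 (gradient 2w); learning rate and momentum values are ours.
def sgdm_step(w, v, grad, lr=0.1, momentum=0.9):
    v = momentum * v - lr * grad   # velocity accumulates past gradients
    return w + v, v

w, v = 5.0, 0.0
for _ in range(100):
    w, v = sgdm_step(w, v, grad=2.0 * w)
# w has converged close to the minimum at 0
```

ADAM additionally keeps a running estimate of the squared gradients to scale each parameter's step size adaptively, which is why the two optimizers can rank models differently.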
We compare the performance of the three NARX-RNN-MTL models, with 16 specific lag settings and trained with SGDM and ADAM, on the five tasks. The comparison is based on the joint loss, i.e., the average RMSE, which indicates the performance of each model. The details of these experimental settings are shown in Table 2. From the comparison we have the following findings:
a. The two lowest average RMSE values are 0.10598 and 0.15848, produced by the NARX GRU model using SGDM and ADAM, respectively.
b. The spread of the average RMSE for NARX GRU and NARX LSTM using SGDM is relatively small compared with NARX GRU and NARX LSTM using ADAM. From Table 2, the minimum and maximum average RMSE values for NARX GRU and NARX LSTM using SGDM are 0.10598, 0.15848, 0.15774 and 0.21189, respectively, while for NARX GRU and NARX LSTM using ADAM, the values are 0.15848, 0.25016, 0.17758 and 0.35830, respectively.
c. NARX GRU using SGDM outperforms the following models for all the settings of time lags (ranging from 1 to 4): i) GRU using ADAM, ii) LSTM using SGDM and ADAM, and iii) BiLSTM using SGDM and ADAM.
d. For NARX BiLSTM, the relatively lower RMSE spread is obtained with the ADAM optimizer.

Baselines
In order to validate the overall performance and effectiveness, we tested and compared the proposed method against the following baseline methods, including STL and MTL without the NARX or NARMAX model: i) MTL implemented with GRU, LSTM and BiLSTM using the SGDM and ADAM optimizers, and ii) STL implemented with GRU, LSTM and BiLSTM using the SGDM and ADAM optimizers. The comparison results of the individual RMSE and the lowest joint loss RMSE are tabulated in Table 5, which shows that all the baseline models using SGDM deliver better results than those using ADAM; the SGDM optimizer has the best performance measured in either average RMSE or individual RMSE. Figure 3 provides a graphical illustration of the joint loss, measured as the average RMSE of the different models. The bar plots show that GRU networks display better performance than the other two RNN variants (LSTM and BiLSTM). To further evaluate the performance of the NARMAX-GRU models on more tasks, we conducted experiments where six more tasks of predicting tide levels at stations 6-11 were included and performed simultaneously with the original five tasks (for stations 1-5). The joint losses of the NARMAX-GRU models for these tasks are shown in Figure 4, where it can be noted that the prediction error increases with the number of tasks but remains at a stable level.

DISCUSSION
From the results and comparisons presented in section 4, it can be noticed that the NARX/NARMAX models paired with GRU showed better performance than those paired with LSTM or BiLSTM in all the experimental scenarios. Another noticeable observation is that the SGDM optimizer proved more effective than ADAM (in terms of RMSE) for both single-task and multi-task settings. The only occasion where SGDM showed poor performance was when BiLSTM was applied to a single-task problem, as shown in Table 5. The maximum lags for both the NARX and NARMAX models were limited to the range from 1 to 4. Models with smaller lags usually produced slightly better prediction performance. However, neural network models with a relatively smaller lag usually needed more hidden nodes, meaning that training the models took more time. It is also worth mentioning that the network models that include NARX or NARMAX as a sub-model have lower average RMSE values than models that do not.

CONCLUSION
This study proposes a new class of NARMAX-RNN models, namely NARMAX-LSTM, NARMAX-BiLSTM and NARMAX-GRU, combined with an MTL scheme for forecasting multiple tide levels simultaneously. Experimental results revealed that NARMAX-GRU trained with SGDM outperformed the other two RNN variants; the NARMAX-GRU model requires relatively small lags but may need a relatively larger number of hidden nodes. The optimal NARX-GRU structure involves 300 hidden nodes, a maximum input lag n_u = 1 and a maximum output lag n_y = 1, with an RMSE value of 0.10598. For the NARMAX-GRU model, the best setting is as follows: 100 hidden nodes, a maximum input lag n_u = 2 and a maximum output lag n_y = 1, with an RMSE value of 0.09961. The results showed that NARMAX-GRU outperformed its counterpart NARX-GRU. We also compared the model performances with and without the MTL scheme. It turned out that NARMAX-GRU has the lowest joint loss values, and the three RNN models without the MTL scheme displayed poor performance compared with the NARX/NARMAX models with the MTL scheme. One limitation of this work is that the proposed model still needs manual fine-tuning to find the best hyper-parameters, e.g., the time lags for each of the model variables and the number of hidden nodes, to build the best models. In future work, we will design MTL models that can better fine-tune the training process by using transfer learning. In addition, we will extend the data used for model training and forecasting from univariate to multivariate and multidimensional data.

Figure 4. Joint loss error of MTL NARMAX with many tasks

b. A comparison study of NARMAX and NARX, paired with three RNN model structures with MTL schemes, to improve tide level prediction performance.

Multi-task learning (MTL)
MTL is a transfer learning method where multiple related tasks are solved in parallel by sharing information between them. The task relatedness is defined based on the recognition of how each task is related, on the basis of which MTL models are designed, trained and implemented. The classic methods to perform MTL can be categorized into soft parameter sharing and hard parameter sharing.

Table 5, where the lowest average RMSE value is achieved by MTL using NARMAX and GRU with the SGDM optimizer. Based on the average RMSE value, this model outperforms all the baseline models. It can be noticed, however, that the lowest average value does not guarantee that the individual RMSE of each task is also the smallest. For example, for Tasks 1 and 5, the NARMAX-GRU-MTL model shows the best performance with the lowest individual RMSE values, while for Tasks 2, 3 and 4, GRU-STL, GRU-MTL and NARX-GRU-MTL show the best performance, respectively.

Table 4. The optimal settings of the NARX-RNN-MTL and NARMAX-RNN-MTL models and their performances

Table 5. Comparison of RMSE values of different models

Figure 3. Comparison of joint loss error of different model structures built based on three RNNs, namely, GRU, LSTM, and BiLSTM

Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 960-970