Enhanced transformer long short-term memory framework for datastream prediction

ABSTRACT


INTRODUCTION
In the era of big data, traditional machine learning methods can be computationally burdensome and complex, making them unsuitable for processing large-scale datasets. They often struggle with the challenges posed by big data, including its sheer volume, complexity, and high dimensionality [1]. Data-driven methods based on deep learning, on the other hand, have attracted interest because they can automatically perform statistical analysis and information extraction on large-scale, multi-source, and high-dimensional data, thereby overcoming the limitations of traditional prediction methods [2]. Recurrent neural networks (RNNs) are a kind of neural network that is particularly good at handling sequential data. In contrast to conventional feedforward neural networks, they have a feedback connection, which enables them to keep an internal memory of prior inputs. This memory enables RNNs to capture temporal dependencies and patterns in sequential data. However, traditional RNNs suffer from the "vanishing gradient" problem, where the gradient signal weakens as it is propagated back through time, making it difficult to capture long-term dependencies. To overcome this restriction, variants such as long short-term memory (LSTM) and gated recurrent unit (GRU) were developed. By incorporating gating mechanisms that selectively remember or forget information, these models are better able to capture and propagate pertinent information over longer sequences [3]. Additionally, in neural networks (NNs), the pre-assigned parameters define the network's topology and affect how computationally intensive training and prediction are. Therefore, optimizing the parameters is crucial for achieving excellent performance. However, like other deep learning (DL) networks [4], LSTM also faces the challenge of parameter selection, which often requires hand-engineered adjustments. Manual parameter adjustment is difficult, particularly when dealing with vast amounts of data and very deep network structures. To address this issue, a grid search (GS) is employed to look for the ideal settings for multi-processor long short-term memory (MPLSTM), which is then used to predict the datastream flow. This approach aims to build a suitable model structure and increase the MPLSTM's prediction accuracy. The main goal of this study is a multi-processor LSTM framework for real-time data stream processing. The study conducts a comprehensive analysis of MPLSTM using a real-life dataset, offering valuable insights into monitoring the parallel approach.
Int J Elec & Comp Eng ISSN: 2088-8708. Enhanced transformer long short-term memory framework for datastream prediction (Nada Adel Dief)

THE PROPOSED DATASTREAM MULTIPROCESSING LSTM FRAMEWORK
In this section, a framework for real datastream analysis is presented that harnesses the strengths of MPLSTM along with other techniques to attain superior accuracy in real-time data processing. Figure 1 shows the framework's overall architecture and the details of its operational procedures. The framework is made up of several parts, including a data input layer, a hidden layer, and an output layer.

Data acquisition and preprocessing
The pre-processing stage is essential for getting the data ready for input into MPLSTM. This phase involves several steps, including data cleaning, normalization, splitting, and reshaping [5]. First, data cleaning removes missing or inconsistent data that can adversely affect the model's performance. Second, the data is normalized, which scales it into a specified range to avoid bias in the model's performance, then split into training and test sets and reshaped into a form that is compatible with the LSTM unit [6].
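The four pre-processing steps can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the [0, 1] target range, the 80/20 split, and the single-feature (samples, timesteps, 1) layout are hypothetical choices, not values reported in the paper.

```python
import math


def drop_missing(values):
    """Cleaning: discard None and NaN entries from one series."""
    return [v for v in values if v is not None and not math.isnan(v)]


def min_max_normalize(values, low=0.0, high=1.0):
    """Normalization: rescale values linearly into [low, high]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant series carries no scale information.
        return [low for _ in values]
    scale = (high - low) / (vmax - vmin)
    return [low + (v - vmin) * scale for v in values]


def split_train_test(samples, train_fraction=0.8):
    """Splitting: keep the stream order, first part for training."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


def reshape_for_lstm(instances):
    """Reshaping: (samples, timesteps) -> (samples, timesteps, 1).

    Assumes each instance is a one-dimensional series, so every
    timestep holds a single feature.
    """
    return [[[v] for v in series] for series in instances]
```

For example, `min_max_normalize([2.0, 4.0, 6.0])` maps the series onto `[0.0, 0.5, 1.0]` before it is reshaped.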

The learning model
After the dataset is split into training and test sets, the training set is divided into chunks that are processed in parallel using a multiprocessing pool. The pool manages a group of worker processes, automatically assigning tasks to available workers and handling the communication between the main process and the worker processes. After that, the proposed MPLSTM framework is trained with various hyperparameters adjusted through multiple experiments until reaching a stable state, with a grid search algorithm used to select the best-performing configuration.

Parallelization using multiprocessing pool
Parallelizing LSTM-based models using multiprocessing [7] enables faster processing of input sequences, efficient resource utilization, scalability, and flexibility in model design and optimization. It can be particularly useful in scenarios where large sequences or computationally intensive models need to be processed within a reasonable time frame. To apply multiprocessing, the following steps [8] must be followed: i) split the data into smaller chunks that can be processed independently, ii) create a function that will be executed in parallel by multiple processes; this function takes a data chunk as input and performs LSTM processing on that chunk, iii) inside the function, create an instance of the LSTM model and train or predict on the input chunk, iv) set up a multiprocessing pool, which manages a group of worker processes that execute the parallel processing function, v) specify the number of worker processes to use, typically based on the available hardware resources, and vi) collect the results from the parallel processes. However, it is important to note that the level of parallelism achievable depends on factors such as the number of available CPU cores, memory capacity, and the size of the input sequence. A higher number of CPU cores can increase parallel execution by allowing more processes to run at the same time. Similarly, sufficient memory capacity is crucial to hold the data and processes running in parallel without memory-related constraints.
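The steps above can be sketched with Python's standard `multiprocessing` module. This is a minimal illustration, not the authors' implementation: `process_chunk` is a hypothetical stand-in that averages a chunk instead of building and running an LSTM, and the names `split_into_chunks`, `run_parallel`, and `pool_factory` are assumptions introduced here.

```python
import multiprocessing
import multiprocessing.pool


def split_into_chunks(data, n_chunks):
    """Split data into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Per-chunk work executed by one worker.

    Assumption: a real worker would create an LSTM instance here and
    train or predict on the chunk. The mean is only a placeholder.
    """
    return sum(chunk) / len(chunk) if chunk else 0.0


def run_parallel(data, n_workers=None, pool_factory=multiprocessing.Pool):
    """Process the chunks in a pool and collect results in chunk order.

    pool_factory defaults to a process pool; a thread pool with the
    same interface (multiprocessing.pool.ThreadPool) can be passed
    for quick checks.
    """
    n_workers = n_workers or multiprocessing.cpu_count()
    chunks = split_into_chunks(data, n_workers)
    with pool_factory(n_workers) as pool:
        return pool.map(process_chunk, chunks)
```

When the default process pool is used from a script, the call to `run_parallel` should sit under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.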

Dropout layer
To increase the network's speed and robustness, dropout regularization has been incorporated [9]. Dropout is a regularization method frequently employed in neural networks. It randomly deactivates a fraction of the neurons in the previous layer during each training iteration. This helps prevent overfitting [10] by reducing the network's reliance on specific neurons and encourages the learning of more robust and generalizable representations. During inference, the dropout layer is typically turned off, and the full network is used for making predictions. By incorporating dropout layers, the network becomes more resilient to overfitting and can improve its generalization performance on unseen data [11], [12].
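The behavior described above, random deactivation during training and a pass-through at inference, can be illustrated with a small sketch. It assumes the common "inverted dropout" convention, in which surviving activations are rescaled by 1/(1 - rate) during training; the paper does not state which convention its framework uses, and the function name and signature are hypothetical.

```python
import random


def dropout(activations, rate, training=True, rng=None):
    """Apply inverted dropout to a list of activations.

    rate is the fraction of units to deactivate. During inference
    (training=False) the input is returned unchanged.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Each unit survives with probability `keep` and is rescaled so the
    # expected activation matches the inference-time value.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```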

Dense layer
A dense (fully connected) layer is a fundamental component of a neural network. It consists of multiple nodes, or neurons, each of which is linked to every neuron in the preceding layer. Each neuron's output is determined by applying an activation function to the weighted sum of the inputs from the preceding layer plus a bias [13]. To maximize the network's performance on the specified task, these weights and biases are learned throughout the training phase [14].
The model consists of several dense layers that are fully connected to all the activations in the preceding layer. These dense layers combine the complicated feature maps to produce a flattened feature vector. The instances are then classified using the softmax [15] output probabilities produced by the last dense layer.
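As a concrete illustration of the dense-layer computation, the sketch below evaluates one layer's forward pass. The function name, the weight layout (`weights[j][i]` connects input `i` to neuron `j`), and the default `tanh` activation are assumptions made for the example.

```python
import math


def dense_forward(inputs, weights, biases, activation=math.tanh):
    """Forward pass of one dense layer.

    Each neuron j outputs activation(sum_i weights[j][i] * inputs[i] + biases[j]).
    """
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(activation(z))
    return outputs
```

With an identity activation, inputs `[1.0, 2.0]`, weights `[[0.5, 0.25], [1.0, -1.0]]`, and biases `[0.0, 1.0]`, the two neurons output `1.0` and `0.0`.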

Adam optimizer
This section highlights the importance of parameter optimization in improving the model's performance. MPLSTM was trained using the Adam optimizer, a popular optimization technique in deep learning [16]. It combines the advantages of the adaptive gradient algorithm (AdaGrad) and root mean square propagation (RMSProp) by adapting the learning rate for each parameter individually. This adaptive learning rate helps achieve faster convergence and improved performance during training. The Adam optimizer is used to minimize the categorical cross-entropy loss function with a learning rate of $10^{-4}$. By updating its weights and biases with the Adam optimizer, MPLSTM can reduce the loss function and boost its overall accuracy.
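To make the per-parameter adaptive learning rate concrete, the following sketch applies one step of the standard Adam update rule to a list of scalar parameters. Only the learning rate of 1e-4 comes from the text; the decay rates `beta1=0.9` and `beta2=0.999` and `eps=1e-8` are the commonly used defaults and are assumed here, as are the function and key names.

```python
import math


def init_adam_state(n_params):
    """Zero-initialized first and second moment estimates."""
    return {"t": 0, "m": [0.0] * n_params, "v": [0.0] * n_params}


def adam_step(params, grads, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated parameters after one Adam step; state is updated in place."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and its square.
        state["m"][i] = beta1 * state["m"][i] + (1.0 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization.
        m_hat = state["m"][i] / (1.0 - beta1 ** t)
        v_hat = state["v"][i] / (1.0 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params
```

Because each parameter is divided by the square root of its own second-moment estimate, the first step has a magnitude close to `lr` regardless of the gradient's scale.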

Prediction
In the prediction phase, the input data is fed forward through the network. The softmax function is applied in the output layer to create a probability distribution over the classes [17]. The predicted class label is normally the class with the highest probability. Softmax ensures that the predicted probabilities lie between 0 and 1 and add up to 1. Because of this, it is appropriate for multi-class classification problems in which each instance belongs to a single class.
$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
where $n$ is the total number of classes and $z_i$ is the raw output value for class $i$. By applying softmax, the neural network can provide a probability-based prediction, allowing for decision-making based on the highest probability class.
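A short sketch of the softmax prediction step follows. Subtracting the maximum raw output before exponentiating is a standard numerical-stability measure that does not change the result; the function names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class outputs into probabilities that sum to 1."""
    shift = max(logits)  # avoids overflow for large raw outputs
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Return the index of the class with the highest probability."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)
```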

Evaluation
Datastreams often exhibit regular changes in the class distribution of incoming instances. The metrics described below account for this evolving nature of the data stream and allow for timely adaptation and monitoring, providing a more comprehensive assessment of the model's performance. To evaluate the results, MPLSTM uses the following evaluation metrics. Classification accuracy compares the predictions of MPLSTM with the actual target values from the dataset [18]:
$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}}$$
where $N_{\text{correct}}$ is the number of instances for which MPLSTM correctly predicts the target value, and $N_{\text{total}}$ is the total number of instances for which MPLSTM made predictions. The result is a value between 0 and 1, representing the fraction of correct predictions made by MPLSTM.
Then, three error metrics, mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE), are used to assess the model's performance. These measures evaluate different facets of the model's predictive accuracy [19]:
$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}_i\right)^2$$
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}_i\right)^2}$$
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \bar{y}_i\right|$$
where $n$ is the number of samples, and $y_i$ and $\bar{y}_i$ are the predicted and actual values, respectively. RMSE measures the average magnitude of the prediction errors by taking the square root of the mean squared difference between the predicted values $y_i$ and the actual values $\bar{y}_i$. It indicates how accurately the model predicts the desired variable. MAE measures the average absolute difference between $y_i$ and $\bar{y}_i$.
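The evaluation metrics described above can be computed as in the following sketch. The function names are illustrative, and the inputs are assumed to be equal-length, non-empty lists of predicted and actual values.

```python
import math


def _check(predicted, actual):
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    _check(predicted, actual)
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def mse(predicted, actual):
    """Mean squared error."""
    _check(predicted, actual)
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(predicted, actual))


def mae(predicted, actual):
    """Mean absolute error."""
    _check(predicted, actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```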

LSTM enhancement
Inspired by the transformer model's innovations [20], we enhance LSTM by incorporating transformer principles. This fusion includes self-attention and cross-attention mechanisms [21] similar to those of transformers, improving LSTM's ability to capture complex data dependencies, especially in large datasets. The resulting TransLSTM architecture combines the strengths of LSTM and the transformer, making it adaptable and powerful for real-world applications and predictions. Figure 2 illustrates TransLSTM: input encoding converts tokens to continuous vectors, positional encodings provide context and positional information, transformer encoder blocks process sequences with multi-head self-attention and feedforward networks, LSTM integration captures sequential dependencies, an attention mechanism combines information from both sources, and the output layer produces the final predictions.
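The self-attention operation at the core of the transformer encoder blocks can be illustrated with a deliberately simplified sketch. It assumes a single head and uses the input vectors directly as queries, keys, and values, with no learned projection matrices, so it shows the mechanism only and is not the TransLSTM layer itself.

```python
import math


def _softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    x is a list of equal-length vectors, one per sequence position.
    Each output vector is a weighted average of all input vectors,
    weighted by similarity to the query position.
    """
    d = len(x[0])
    outputs = []
    for query in x:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in x
        ]
        weights = _softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, x)) for j in range(d)]
        )
    return outputs
```

Because every position attends to every other position in one step, dependencies between distant positions do not have to pass through a long chain of recurrent updates.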

RESULTS AND DISCUSSION
This section outlines the comparative study conducted to assess the performance of MPLSTM. The experimental procedure employs statistical analysis to evaluate the results obtained across all datasets, comparing MPLSTM with several state-of-the-art algorithms for data stream classification. The results provide important new information about the performance of the proposed MPLSTM framework and its competitive position among data stream classification techniques.

Dataset
A total of 29 time-series datasets from the publicly accessible UCR repository were used in this study [22]. Stream clustering [23], anomaly detection [24], and data stream density estimation are just a few of the applications for which these datasets have been used in past research. Each dataset comprises instances of a one-dimensional time series with a built-in grid structure.
The IMDB dataset [25], introduced by Maas et al. [26], is a prominent benchmark for sentiment classification. It comprises 25,000 reviews in each of the training and test sets, with at most 30 reviews per movie for diversity. This balanced dataset contains an equal number of positive and negative reviews, so random predictions would yield a 50% accuracy baseline.

Case study 1
The study explored a wide range of established techniques, aiming to cover all algorithm families proposed in the literature for the given problem. Table 1 provides an overview of the evaluated classifiers, organized by their respective families, and includes the abbreviations used throughout this paper [27], [28]. The results obtained from the conducted experiments are presented and discussed. Additionally, the processing time on each dataset is analyzed, considering the significance of speed-up in a data streaming scenario. Hyperparameter selection typically involves using rule-of-thumb parameters or proven combinations from previous studies. Here, however, a systematic approach, grid search (GS) [29], is employed for meticulous hyperparameter selection. Grid search is favored due to its simplicity, parallelizability, and effectiveness in low-dimensional spaces. It entails discretizing hyperparameter value ranges and systematically testing all possible combinations, and thereby explores diverse model configurations. Before training the final models, a validation run optimizes hyperparameters based on accuracy assessments. The training process ends when the maximum epoch limit is reached. MPLSTM configuration details are summarized in Table 2. In this paper, the Adam optimizer is chosen post-validation for its computational efficiency and slightly superior test results. A batch size of 32 is used for all models, and sparse categorical cross-entropy is employed as the loss function [30]. Applied to class indices, this loss function calculates the negative logarithm of the predicted probability for the true class index, reflecting the model's confidence in its class prediction:
$$L = -\sum_{i=1}^{n} y_i \log\left(\hat{y}_i\right)$$
where $n$ represents the number of classes, $y_i$ represents the true label or target value of the $i$th class, and $\hat{y}_i$ represents the predicted probability for the corresponding class.
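The grid search procedure used for hyperparameter selection, discretizing each range and testing every combination, can be sketched as follows. The hyperparameter names and candidate values in the usage note are hypothetical, and `evaluate` stands in for a validation run that returns an accuracy score for one configuration.

```python
import itertools


def grid_search(param_grid, evaluate):
    """Exhaustively score every combination in param_grid.

    param_grid maps a hyperparameter name to its candidate values.
    evaluate takes a dict of chosen values and returns a score where
    higher is better (for example, validation accuracy).
    Returns the best configuration and its score.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `grid_search({"units": [32, 64, 128], "dropout": [0.1, 0.2, 0.5]}, evaluate)` scores all nine combinations. The evaluations are independent of one another, which is what makes the method easy to parallelize.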

Performance evaluation
Batch sizes above 30 exhibit stable accuracy and processing times regardless of the number of batches, offering flexibility in parameter selection. However, batch sizes below 30 significantly degrade performance, hindering model adaptability to evolving data streams. Very small batch sizes focus too heavily on individual examples, preventing the model from learning changes in the overall data distribution. MPLSTM achieves high accuracy across various datasets, showcasing LSTM's suitability for time-series data streaming.
Convergence of the training and validation loss curves during model training is a positive indicator, signifying learning progress. MPLSTM reduces processing time significantly through parallel processing, enhancing accuracy and predictive capabilities. The trade-off with increased computational time should be considered based on application requirements. Figure 3 shows parallel processing consistently outperforming sequential processing across the 29 datasets, confirming MPLSTM's effectiveness.

Figure 3. A comparison between sequential and parallel execution processing times
The FacesUCR dataset exhibits a speedup of 2 times when processed in parallel, a significant improvement in processing time over sequential processing. Similarly, for the Pendigits dataset, the parallel model achieves a speedup of 1.7, further demonstrating the efficiency of parallel processing over the sequential approach. Several other datasets, such as PhalangesOutlinesCorrect and TwoPatterns, also exhibit notable speedups larger than 1.5 times when processed in parallel. These findings further emphasize the effectiveness of MPLSTM in reducing processing time. The observed speedup across multiple datasets, as shown in Figure 4, underscores the model's ability to leverage parallelism efficiently, resulting in faster dataset processing.
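Speedup here is taken to mean the usual ratio of sequential to parallel processing time. The sketch below computes it together with parallel efficiency (speedup divided by the number of workers), a related quantity that the paper does not report and that is included only as an assumption-labeled illustration. The timings in the usage note are made up.

```python
def speedup(sequential_seconds, parallel_seconds):
    """Ratio of sequential to parallel run time; above 1 means faster."""
    if parallel_seconds <= 0:
        raise ValueError("parallel_seconds must be positive")
    return sequential_seconds / parallel_seconds


def parallel_efficiency(sequential_seconds, parallel_seconds, n_workers):
    """Speedup per worker; 1.0 would be ideal linear scaling."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    return speedup(sequential_seconds, parallel_seconds) / n_workers
```

For instance, a hypothetical run that takes 10 s sequentially and 5 s in parallel has a speedup of 2, matching the magnitude reported for FacesUCR.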
By harnessing parallel processing, MPLSTM demonstrates its capability to significantly improve performance and expedite data analysis tasks. On closer analysis, individual datasets such as ECG5000 show notable improvements in processing time, as shown in Figure 5, although the magnitude of the speedup is modest. For the Pendigits dataset, however, with its larger size and increased complexity, the benefits of parallel processing become more pronounced, and the speedup achieved becomes substantially larger. Figure 6 displays the learning curves, which depict the accuracy improvement across each epoch for two datasets, ECG5000 and Pendigits. These curves show how the model's accuracy increases as training progresses and how well the model learns from the data. The consistent improvement in accuracy over the epochs implies that the model is not just memorizing the training data but also generalizing well to new data. This is encouraging for the model's ability to predict outcomes on fresh data.
Similarly, Figure 7 illustrates the decrease in loss across each epoch for the same datasets. These learning curves provide valuable insights into the model's performance and offer guidance for optimizing its architecture and training process. By addressing overfitting and considering early stopping, the model's accuracy can be further improved while maintaining good generalization capabilities. Table 3 reports the effectiveness of the proposed MPLSTM framework, assessed using MSE, RMSE, and MAE as evaluation metrics. Low values for these metrics are desirable, as they indicate better performance. In this study, the framework yielded promising results with low values; for example, it gives MSE=0.237 for ECG5000, RMSE=0.583 for PhalangesOutlinesCorrect, and MAE=0.074 for Pendigits. This implies that the predictions made by the MPLSTM model were close to the actual values. The model performed well because it was able to learn the patterns and relationships in the data, which led to accurate predictions and shows that the MPLSTM framework is an effective way to address this problem. Table 4 presents the accuracy results for MPLSTM on the UCR datasets. It provides an overview of the accuracy achieved by the framework in predicting the target variable, with a breakdown of the accuracy scores across different experimental configurations. Researchers and practitioners can refer to this table to assess the effectiveness and reliability of MPLSTM on the UCR datasets.

TransLSTM evaluation
The training history curves in this case study offer insights into the performance of two different models, LSTM and TransLSTM, across multiple epochs. In Figure 8, the first and the third curves represent training loss for LSTM and TransLSTM, while the second and fourth curves represent validation loss for LSTM and TransLSTM, respectively. These curves depict the evolution of the loss over epochs, showing a decreasing trend that indicates learning from the training data. TransLSTM consistently achieves lower training loss and outperforms LSTM in validation loss, indicating better generalization to new data. Similarly, the other figure displays training and validation accuracy curves, with the first and the third curves representing training accuracy for LSTM and TransLSTM, and the second and the fourth curves representing validation accuracy. Both models exhibit an increasing trend in training accuracy, demonstrating efficient learning from the training data as well as the capacity to generalize to new, unseen data. TransLSTM achieves higher training and validation accuracy, highlighting its superior data modeling capabilities.

[...] computational efficiency. Experimental evaluations, conducted on real-world datasets including the UCR dataset, validate the superior performance of MPLSTM compared to traditional regression models. The framework's ability to capture temporal dependencies and long-term patterns in streaming data is demonstrated through accurate predictions, as evidenced by accuracy measures and loss calculations. MPLSTM emerges as a promising approach for datastream prediction, showcasing improved performance and outperforming existing results in terms of accuracy and loss.

Figure 1. The details of the proposed MPLSTM framework

Figure 4. Dataset speedup curve when applying the MPLSTM framework

Figure 6. The accuracy curves of training and validation sets in two datasets, ECG5000 and Pendigits

Figure 7. The loss curves of training and validation sets of two datasets, ECG5000 and Pendigits

Figure 8. Comparison between LSTM and TransLSTM training and validation loss and accuracy

Table 1. Utilized models for case study 1

Table 2. Hyperparameters used in tuning MPLSTM framework

Table 4. Accuracy of the top 7 classifiers for the UCR datasets compared with the MPLSTM framework

In this study, LSTM is integrated with the Transformer model to create TransLSTM, a novel architecture. TransLSTM leverages the Transformer's success in handling sequential data and capturing long-range dependencies. This fusion enhances LSTM's ability to model complex relationships and temporal dependencies in sequential data by incorporating self-attention and cross-attention mechanisms from the Transformer. The investigation demonstrates how TransLSTM can address LSTM's limitations, potentially leading to more accurate and efficient predictions. This case study highlights the innovative potential of combining diverse neural architectures for enhanced predictive capabilities.