Stock market prediction of Bangladesh using multivariate long short-term memory with sentiment identification

ABSTRACT


INTRODUCTION
Stock market forecasting is crucial for understanding the state of a nation's economy due to its significant impact. Despite its complexity and challenges, efforts to predict stock market trends persist. However, the task is made difficult by the many external factors that influence the market [1]. The stock market in Bangladesh is no different. Advanced technology has made it easier for investors to access crucial stock market data. Analysts now use sophisticated tools and computers to predict the future trend of the stock market [2]. Every year new methods and algorithms for stock market prediction are introduced by researchers. The utilization of machine learning and data analysis as new resources is widespread in this field [3].
Stock market trends are influenced by a variety of technical and external factors. Technical factors are easy to obtain, but external factors are far harder to identify and quantify. Additionally, the stock markets in developing nations like Bangladesh may be influenced by variables that are harder to measure, such as news sentiment. As a result, relying only on technical factors to predict the stock market will not yield reliable results. This research takes into account technical parameters (open, high, low, close, and volume) as well as external ones (news sentiment score, inflation, gross domestic product (GDP), exchange rate, interest rate, and balance).
The main goals of this work are as follows: i) to predict the stock market price trend using technical and external factors and ii) to identify the correlation between these external factors and stock market price trend. The main objective of this paper is to predict the stock market of Bangladesh more accurately by

BACKGROUND AND RELATED WORKS
Technical indicator analysis has already contributed significantly to the field of stock market forecasting. Numerous publications on forecasting the Bangladesh stock market using technical factors are also available. Atsalakis and Valavanis [4] suggested a machine learning model that predicted the stock market with nearly 20% accuracy using only technical indicators as input. Kara et al. [5] used artificial neural networks and support vector machines with technical indicators to predict the Istanbul Stock Exchange trend. The stock market of Bangladesh has also been predicted using a linear classification model [6]; that paper recommended logistic regression for predicting the Bangladesh stock market.
The GDP or economic growth of a country may also have an impact on the stock market [7]-[11]. The relationship between a nation's stock market and its economic growth has been discussed in a number of articles.
A key external indicator for forecasting stock market patterns is economic growth or GDP [12]. The rate of inflation may also have an impact on stock market returns [13]-[16]. Stock market returns are positively correlated with the inflation rate.
News sentiment has a larger effect on stock market trends. Positive and negative news sentiment can play a vital role in shaping the stock market trend [17]-[20]. Kalra and Prasad [21] used a naive Bayes classifier to detect positive and negative news sentiment and achieved up to 91.2% accuracy. For the Bangladesh stock market, the relevant sentiment comes from Bangla financial news, so Bangla financial news articles from daily newspapers of Bangladesh can be used. Social media sentiment is also used for stock market trend prediction, but data collection is difficult due to informal writing styles. Billah et al. [22] used improved neural network algorithms with external features to forecast the Bangladesh stock market. Long short-term memory (LSTM)-based stock prediction is more effective than other time series-based methods. For Bangla language sentiment analysis, supervised machine learning with domain-specific lexicon data dictionaries (LDD) provides better results. Bhowmik et al. [23] proposed a rule-based Bangla text sentiment score (BTSC) algorithm that can classify the sentiment of Bangla text.
Figure 1 shows the process for stock market prediction of Bangladesh using multivariate LSTM with sentiment identification. It starts with data collection, preprocessing, and sentiment identification. Then, an LSTM model is created, trained, and evaluated, and finally, the output is generated.

Data collection
The multivariate LSTM neural network model uses 11 variables as inputs: technical variables (open, high, low, close, and volume), a news sentiment variable (news sentiment score), and economic variables (inflation, GDP, exchange rate, interest rate, and balance). The resulting dataset CSV file has 11 columns for these variables. The 11 variables can be grouped into three categories for data collection:
− Technical Dhaka Stock Exchange data (3/3/2014 to 12/29/2021).
− Financial news articles from The Daily Ittefaq (3/3/2014 to 12/29/2021) [24].
− Yearly economic data from the International Monetary Fund (IMF) database (2014 to 2021).

Technical stock data collection
Data for the historical DSE index was gathered from the Dhaka Stock Exchange's official website. The dataset includes daily numbers for open, high, low, close, and volume of DSE from 2014 to 2021. Variables were created for the daily open, high, low, close, and volume columns in the main dataset.

Financial news articles for sentiment analysis
The Beautiful Soup library (version 4.4.0) in Python was used to gather 7,695 financial news articles from The Daily Ittefaq. Beautiful Soup parses HTML pages and is widely used in web scraping to extract publicly available data from websites. All financial Bangla news articles from 3/3/2014 to 12/29/2021 were collected [24]. Several news reports were often published on the same day. The news articles were saved in CSV format. After all the articles were collected, a sentiment identification algorithm was applied to them, and the results were combined to create a sentiment score for each date [23]. These sentiment scores served as the foundation for the news sentiment column in the dataset.

Yearly economic data
Economic data for Bangladesh was obtained from the IMF's global economic database for the period 2014 to 2021. The dataset contained information on various economic indicators, including inflation, GDP, exchange rate, interest rate, and current balance. We used this data to populate the economic variable columns in our main dataset.
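The text does not spell out how yearly indicators are aligned with daily stock rows. One plausible reading, sketched below, is that every trading day inherits the indicator values of its calendar year. The function name `attach_yearly`, the column names, and all numeric values are hypothetical placeholders, not real IMF or DSE figures.

```python
from datetime import date

# Hypothetical placeholder values keyed by year (NOT real IMF data).
YEARLY = {
    2014: {"inflation": 1.0, "gdp": 2.0},
    2015: {"inflation": 1.5, "gdp": 2.5},
}

def attach_yearly(daily_rows, yearly_indicators):
    """Copy each year's indicators onto every daily row of that year."""
    merged = []
    for row in daily_rows:
        indicators = yearly_indicators[row["date"].year]
        # New dict so the input rows are left unmodified.
        merged.append({**row, **indicators})
    return merged

# Hypothetical daily rows with a single technical column.
daily_rows = [
    {"date": date(2014, 3, 3), "open": 100.0},
    {"date": date(2015, 1, 5), "open": 110.0},
]
merged = attach_yearly(daily_rows, YEARLY)
```

With this layout, the economic columns are constant within a year and change only at year boundaries.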

NEWS SENTIMENT IDENTIFICATION
The methodology for news sentiment identification came from Bhowmik et al. [23]. This work proposed the BTSC algorithm, a rule-based system for Bangla sentence-level news sentiment analysis. Some words may behave differently in sentiment analysis of the Bangla language depending on the relevant domain.
For example, in the phrase "শেয়ার ইস্যু করা" [issued shares], the word "ইস্যু" [issue] has a different meaning in the business articles domain than in everyday life. Therefore, a business domain-specific extended sentiment lexicon dataset was created in order to obtain accurate results in sentiment analysis of financial Bangla news items. The steps for Bangla news sentiment identification are as follows: i) a finance domain-specific weighted sentiment lexicon dictionary was constructed in the Bangla language; ii) a modified BTSC algorithm was used to compute sentiment scores from Bangla financial news articles [23]; iii) 7,695 financial Bangla news articles were collected from the years 2014 to 2021; iv) after collecting the data, sentiment analysis was used to obtain a sentiment score for each day from 2014 to 2021 [24]; and v) these date-wise news sentiment scores were merged with our existing technical indicator dataset.

Creation of financial lexicon data dictionary
The financial lexicon data dictionary is a list of words used to calculate the sentiment of financial news articles. Bangla words were collected from an online Bangla dictionary API and manually categorized into 6 weighted groups. To accurately determine the sentiment of sentences, a lexicon data dictionary is required. This project's lexicon data dictionary only contains Bangla words and includes words with positive sentiment and words with negative sentiment. Table 1 includes some examples from the financial LDD.

Bull words
This word collection is called bull words because, from a financial standpoint, they are considered to have positive connotations. These words are typically associated with upward market trends, increasing stock prices, and overall economic growth. In this sense, bull words are viewed as desirable and are often used by financial analysts and investors to convey optimism about the state of the economy.

Bear words
The bear word list is the counterpart of the bull word list in financial sentiment analysis. For the purpose of evaluating the sentiment of business news, every entry on this list is regarded as carrying negative sentiment. Bear word lists typically consist of words that are associated with downward trends in the stock market, such as recession, inflation, unemployment, and bankruptcy.

Negative words
The negative word list contains negation words such as "না", "নয়", and "নেই", which can make a full sentence negative in the Bangla language. These words can have a significant impact on the overall sentiment of a sentence, even if the other words in the sentence are positive. The negative word list is therefore a crucial tool for sentiment analysis in the Bangla language.

Coordinating conjunction words (Co con.)
In the Bangla language, conjunctions like "কিন্তু", "আদপে", "এবং", and "অথবা" play an important role in sentence construction. They should have their own weighted effect value in sentiment analysis [23]. Assigning weighted effect values to these conjunctions results in more accurate sentiment analysis.

Subordinating conjunctions (Sub con.)
This is another conjunction list, with words like "অধিকন্তু", "এমনকি", and "বিশেষত". These conjunctions are often used to indicate a shift in tone or emphasis in a sentence and can play a significant role in shaping the overall sentiment. Assigning weighted effect values to these conjunctions further refines the sentiment analysis of financial news.

Adjectives and adverbs (Adj.)
We listed some adjectives and adverbs like "সবচাইতে", "অধিক", and "সর্বাধিক", as they intensify the sentiment of a sentence more than other simple words. We categorized them into 3 weighted categories: high, medium, and low. Words with high weight have the greatest impact, words with medium weight have a moderate impact, and words with low weight have the least impact.

Data tokenization
To prepare the collected data for news sentiment analysis, data preprocessing was performed. The Bengali natural language processing toolkit (BLTK), a Python library available on PyPI under the Open Source Initiative (OSI)-approved MIT license, was used for this preprocessing [25]. Preprocessing made the data more suitable for the news sentiment identification model. The first step in data preparation was tokenization, which involved splitting each news article into individual sentences and then into words. Sentiment analysis was performed at the sentence level for all financial news articles.
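The two-level tokenization described above can be illustrated with a minimal plain-Python sketch. It is a stand-in for illustration only and does not reproduce the BLTK API; the function names and the sample article are assumptions, and the sentence terminators are assumed to be the Bangla danda "।", "?", and "!".

```python
import re

# Assumed sentence terminators: the Bangla danda plus "?" and "!".
SENTENCE_END = re.compile(r"[।?!]+")

def split_sentences(text):
    """Split an article into non-empty sentences on the terminators."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]

def split_words(sentence):
    """Split a sentence into whitespace-separated word tokens."""
    return sentence.split()

# Hypothetical two-sentence article.
article = "শেয়ারবাজারে সূচক বেড়েছে। লেনদেন কমেছে।"
sentences = split_sentences(article)
tokens = [split_words(s) for s in sentences]
```

Each inner list in `tokens` holds the words of one sentence, which is the unit at which sentiment is later scored.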

Data normalization
Data normalization is the process of removing characters that are not necessary for sentiment analysis. Characters like ",", "।", ";", "#", "!", "@", "%", "$" carry no meaning for sentiment scoring, so all special characters were removed from the tokenized datasets. Question marks were removed as well: unlike in the original BTSC algorithm, they were not taken into account in the sentiment scores [23].

Stop words
Various stop words were removed from the normalized data. Stop words are regarded as having no impact on sentiment. For example, words like "জন্য" [for], "পর্যন্ত" [up to], "নাগাদ" [by], "নিতে" [to take], and "হত" [would have been] have no impact on sentiment analysis. The BLTK stop word list was used to remove stop words from the dataset [25], using the package's "hard" level for stop word removal.
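Normalization and stop word removal can be sketched together in a few lines. This is a plain-Python illustration rather than the BLTK implementation: the character set is the one listed in the text plus the question mark, and the stop word set is a tiny hypothetical sample instead of BLTK's full list.

```python
# Special characters named in the text, plus "?"; all carry no sentiment.
SPECIAL_CHARS = ",।;#!@%$?"
STRIP_TABLE = str.maketrans("", "", SPECIAL_CHARS)

# Tiny illustrative stop word set (hypothetical sample, not BLTK's list).
STOP_WORDS = {"জন্য", "পর্যন্ত", "নাগাদ", "নিতে", "হত"}

def normalize(tokens):
    """Strip special characters and drop tokens that become empty."""
    cleaned = (t.translate(STRIP_TABLE) for t in tokens)
    return [t for t in cleaned if t]

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop word set."""
    return [t for t in tokens if t not in STOP_WORDS]

# Hypothetical token list for one sentence.
tokens = ["দাম", "বৃদ্ধির", "জন্য", "লেনদেন", "বেড়েছে", "!"]
result = remove_stop_words(normalize(tokens))
```

Here the lone "!" token disappears during normalization, and the stop word is filtered out afterwards.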

Data stemming
Data stemming means removing any suffix or prefix from a word to convert it to its root form. Suffixes and prefixes like "টি", "টার", "গুলো", and "গুলি" were removed from the words. For example, after stemming, "শেয়ারবাজারে" becomes "শেয়ারবাজার" and "বছরের" becomes "বছর". Stemming helped to match the input data with the LDD dataset words for sentiment analysis.
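A simple longest-suffix-first stripper is enough to reproduce the two examples above. The sketch below is an assumption-laden illustration, not the stemmer used in the paper: the suffix list is short and hypothetical, and the `min_root_len` guard is an added safeguard against stripping a word down to nothing.

```python
# Illustrative suffix list (hypothetical and incomplete), longest first so
# that a longer suffix is tried before any shorter one it contains.
SUFFIXES = sorted(["গুলো", "গুলি", "টার", "টি", "ের", "ে"], key=len, reverse=True)

def stem(word, min_root_len=2):
    """Strip the first matching suffix, keeping at least min_root_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root_len:
            return word[: -len(suffix)]
    return word

# The two examples from the text.
stemmed = [stem("শেয়ারবাজারে"), stem("বছরের")]
```

A word that ends in none of the listed suffixes is returned unchanged.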

Parts of speech tagger (POS tagger)
A POS tagger labels each word with its part of speech. Given that the LDD consists of lists of adjectives, adverbs, and conjunctions, it is important to identify the parts of speech accurately; detecting them is crucial for implementing the news sentiment analysis algorithm. In the Bangla language, there are 5 basic parts of speech: বিশেষ্য [noun], বিশেষণ [adjective], সর্বনাম [pronoun], ক্রিয়াপদ [verb], and অব্যয় [indeclinable] [23]. Among them, সর্বনাম has no impact on the sentiment score. Stop words were removed during the normalization stage.

Sentiment score counting
The BTSC algorithm has been modified for use in the financial Bangla news sentiment identification domain [23]. In the first step, a loop is run to read CSV news files from the dataset folder, which contains 7,695 news articles in CSV format. A variable "News_score" is declared and initialized to zero.
In the third step, a second loop is executed to tokenize each news article at the sentence level, splitting the entire news into sentences. The sentences are then normalized and stop words are removed. A variable "Sentence_score" is declared and initialized to zero. The sentences are further tokenized at the word level, with each word subjected to stemming and POS tagging functions [25].
The pseudocode is given in Table 2. In step 10, the LDD word lists are scanned. In step 11, if a word is identified as a bull word, a score of +1 is added to the sentence score. Bear words, conjunctions, adjectives, and adverbs are also checked against the LDD lists, and a score is assigned if a match is found. The coordinating conjunction score is +2 and the subordinating conjunction score is +1.5, while the scores for adjectives and adverbs (high, medium, and low) are +3, +2, and +0.5, respectively. In step 25, if a word is found in the negative words list, the existing sentence score is checked. If the score is positive, it is multiplied by -1. A score that is already negative is left unchanged, because multiplying it by -1 would produce a positive sentiment score, which is incorrect. The loop ends in step 28. The full news score is calculated by summing the sentiment scores of each sentence in step 29. The remaining for loops end in steps 30 and 31. The simulation of the modified BTSC algorithm is presented in Table 3. In that example, the news receives a negative score, indicating negative sentiment, and (*-1) denotes multiplying the existing sentence_score by -1.
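The scoring rules described above can be sketched as follows. This is not the pseudocode of Table 2 but a simplified reading of the prose, under several loudly stated assumptions: the lexicon entries are a hypothetical miniature sample, the bear word score of -1 is assumed (the text does not state it), the assignment of the three adjectives to high, medium, and low is arbitrary, and all matched weights are simply added to the running sentence score.

```python
# Hypothetical miniature lexicon; the real LDD is far larger.
LDD = {
    "bull": {"বৃদ্ধি", "লাভ"},
    "bear": {"পতন", "লোকসান"},
    "negative": {"না", "নয়", "নেই"},
    "co_con": {"কিন্তু", "এবং", "অথবা"},
    "sub_con": {"অধিকন্তু", "এমনকি", "বিশেষত"},
    "adj_high": {"সর্বাধিক"},
    "adj_medium": {"সবচাইতে"},
    "adj_low": {"অধিক"},
}

# Weights from the text, except "bear", which is an assumption.
WEIGHTS = {
    "bull": 1.0, "bear": -1.0, "co_con": 2.0, "sub_con": 1.5,
    "adj_high": 3.0, "adj_medium": 2.0, "adj_low": 0.5,
}

def sentence_score(words):
    """Score one preprocessed sentence (a list of stemmed words)."""
    score = 0.0
    for word in words:
        if word in LDD["negative"]:
            # Flip only a positive running score; a negative one stays negative.
            if score > 0:
                score *= -1
            continue
        for group, weight in WEIGHTS.items():
            if word in LDD[group]:
                score += weight  # additive combination is an assumption
                break
    return score

def news_score(sentences):
    """Sum the sentence scores of one article."""
    return sum(sentence_score(s) for s in sentences)

# Hypothetical article with two preprocessed sentences.
example = news_score([["লাভ", "বৃদ্ধি"], ["লাভ", "নেই"]])
```

The negation check depends on word order in this sketch: a negative word flips whatever score has accumulated before it in the sentence.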

Date wise news sentiment score counting
The scores of all the news articles for each day were tallied, and the total score was divided by the number of news articles on that day to obtain the average news sentiment score for each date. These daily averages were useful in identifying trends and patterns in news sentiment over time. We incorporated the sentiment results into our main dataset, organizing them by date.
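The date-wise averaging amounts to a group-by on the publication date. A minimal sketch, assuming each scored article is available as a (date string, score) pair; the function name and the sample scores are hypothetical.

```python
from collections import defaultdict

def daily_average(scored_news):
    """Average article scores per date; input is (date, score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # date -> [score sum, article count]
    for day, score in scored_news:
        totals[day][0] += score
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in sorted(totals.items())}

# Hypothetical scores: two articles on the first day, one on the second.
scored_news = [("2014-03-03", 2.0), ("2014-03-03", -1.0), ("2014-03-04", 1.5)]
daily = daily_average(scored_news)
```

For the first date, the two scores sum to 1.0 and are divided by 2 articles, giving an average of 0.5.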

MULTIVARIATE LSTM MODEL
Recurrent neural networks (RNNs) excel in learning patterns in time series data, making them effective for time series prediction. LSTM neural networks have an architecture that allows for more accurate prediction of time series patterns, both in the short and long term. A multivariate LSTM was chosen over a univariate LSTM, as it takes into account the correlation of multiple variables for more accurate predictions.

Time series analysis using LSTM in Python
The LSTM neural network for stock market prediction used the Dhaka Stock Exchange dataset from 2014 to 2021 for training. The trend was then predicted. The last date in the stock market data was December 29, 2021, and the stock price for December 30, 2021 was predicted using the LSTM model. The steps for the prediction with the multivariate LSTM neural network model are as follows: i) reading the dataset; ii) feature selection and scaling; iii) data cleaning and transforming; iv) training the LSTM neural network; and v) predicting the next day's price.

Prerequisites
The model building and prediction process utilized a Python 3 environment and necessary Python packages. The following standard packages were installed and used: pandas, NumPy, math, and matplotlib. Pandas was used for data frame operations and matplotlib for plotting graphs and results.

Dhaka Stock Exchange dataset
The CSV file contains the stock market data of the Dhaka Stock Exchange from 2014 to 2021. Technical variables, such as open, high, low, close, and volume, are considered, along with external variables, such as news sentiment score, inflation, GDP, exchange rate, interest rate, and balance. The "Open" variable serves as the output, as it allows for easy evaluation of the stock market trend through the daily opening price of the stock market. The remaining variables act as input variables. A preview of the dataset is shown in Table 4.
Table 4. Preview of DSE stock market dataset

Feature selection and scaling
The data must be cleaned and scaled to achieve accurate predictions from the model. Eleven input features (open, high, low, close, volume, news sentiment score, inflation, GDP, exchange rate, interest rate, and balance) are selected, and the opening price serves as the output prediction feature. To improve accuracy and reduce model training time, the dataset must be scaled. The MinMaxScaler approach from the scikit-learn package is used to rescale the input data to a range of 0 to 1. Finally, the scaling is inverted to recover values in the original units of the dataset.
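Min-max scaling maps each value x of a column to (x - min) / (max - min), and the inverse mapping recovers the original units. The plain-Python sketch below illustrates that arithmetic for a single column; it is not the scikit-learn implementation, and the helper names and the three opening prices are hypothetical.

```python
def fit_min_max(column):
    """Return the (min, max) pair that defines the scaling of one column."""
    return min(column), max(column)

def scale(column, lo, hi):
    """Map values to [0, 1]; a constant column maps to 0.0."""
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]

def unscale(scaled, lo, hi):
    """Invert the scaling to recover values in the original units."""
    return [s * (hi - lo) + lo for s in scaled]

# Hypothetical opening prices.
opens = [6700.0, 6750.0, 6800.0]
lo, hi = fit_min_max(opens)
scaled = scale(opens, lo, hi)
restored = unscale(scaled, lo, hi)
```

The same (min, max) pair fitted on the opening price column has to be reused later to convert a scaled prediction back into taka.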

Data cleaning and transforming
Data cleaning involves finding and filling missing values in the dataset. In this case, there are no missing values in the stock market dataset. The LSTM sliding windows approach was used to train and validate the dataset through a neural network. The data was split into two parts: 80% for training the model and 20% for testing.
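The sliding window approach turns the table of daily rows into (window, target) pairs: each window holds a fixed number of consecutive days, and the target is the opening price of the day that follows. A minimal sketch is given below, assuming a chronological 80/20 split without shuffling; the function names and the toy data are hypothetical, and whether the split is applied before or after windowing is not stated in the text.

```python
def make_windows(rows, seq_len, target_col):
    """Build (X, y): X[i] is seq_len consecutive rows, y[i] the next row's target."""
    X, y = [], []
    for end in range(seq_len, len(rows)):
        X.append(rows[end - seq_len:end])
        y.append(rows[end][target_col])
    return X, y

def chronological_split(rows, train_fraction=0.8):
    """Split rows in time order: the first part for training, the rest for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Toy data: 10 days, 2 features per day; column 0 plays the role of "open".
rows = [[float(i), float(i) * 2] for i in range(10)]
train_rows, test_rows = chronological_split(rows)
X, y = make_windows(rows, seq_len=3, target_col=0)
```

With the paper's settings, `seq_len` would be 50 and each row would hold 11 features, so every window is a 50 by 11 matrix.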

Training the LSTM neural network
The multivariate LSTM neural network model consists of 4 layers. The first layer is an LSTM layer; it takes the mini-batches from the sliding window process as input and returns the whole sequence. The second layer is also an LSTM layer, which takes the sequence returned by the first layer as input. The third layer is a dense layer that consists of 5 neurons. The last dense layer returns the predicted value.
The sequence length of the model is 50, resulting in a mini-batch matrix of 50 steps and 11 features from the sliding window process. The size of the first layer's input neurons is equal to the size of the mini-batch input data. The total input layer of the LSTM neural model comprises 550 neurons, calculated as (50*11). The epoch count is set at 50.
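To make the layer stack concrete, the sketch below traces how the shape of one input sample changes as it passes through the four layers. It only does shape bookkeeping in plain Python and trains nothing. Several points are assumptions rather than statements from the text: the 550 figure is read as the number of units in each LSTM layer, the second LSTM layer is assumed to return only its last time step, and the "5" of the third layer is read as the width of that dense layer.

```python
SEQ_LEN, N_FEATURES = 50, 11
N_UNITS = SEQ_LEN * N_FEATURES  # 550, following the (50*11) figure in the text

def lstm_shape(input_shape, units, return_sequences):
    """Output shape of an LSTM layer for one sample of shape (steps, features)."""
    steps, _ = input_shape
    return (steps, units) if return_sequences else (units,)

def dense_shape(input_shape, units):
    """A dense layer replaces the last dimension with its number of units."""
    return input_shape[:-1] + (units,)

shapes = [(SEQ_LEN, N_FEATURES)]                                         # one window
shapes.append(lstm_shape(shapes[-1], N_UNITS, return_sequences=True))    # whole sequence
shapes.append(lstm_shape(shapes[-1], N_UNITS, return_sequences=False))   # last step (assumed)
shapes.append(dense_shape(shapes[-1], 5))                                # dense, width 5 (assumed)
shapes.append(dense_shape(shapes[-1], 1))                                # predicted value
```

The final shape holds a single number per sample, which corresponds to the scaled opening price predicted for the next day.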

Predict next day's price
After training the multivariate LSTM neural network, the next day's opening price can be forecasted by processing the necessary data from the dataset and running the model. Because the sequence length is 50 time steps, the minimum required input for the model is the last 50 rows of the dataset from the sliding window process. The model outputs the predicted opening price for the day following the end of the input dataset.
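The forecasting step can be sketched as: take the last 50 scaled rows, pass them to the trained model, and convert the scaled output back to taka. In the sketch below the trained network is replaced by a hypothetical stand-in callable (`naive_model`, which just repeats the last scaled opening price), and the scaling bounds `lo` and `hi` are made-up values; only the control flow is meant to be illustrative.

```python
def predict_next_open(scaled_rows, model, seq_len, lo, hi):
    """Predict the next opening price from the last seq_len scaled rows."""
    if len(scaled_rows) < seq_len:
        raise ValueError(f"need at least {seq_len} rows, got {len(scaled_rows)}")
    window = scaled_rows[-seq_len:]      # most recent seq_len time steps
    scaled_prediction = model(window)    # model returns a value in [0, 1]
    return scaled_prediction * (hi - lo) + lo  # back to original units

def naive_model(window):
    """Hypothetical stand-in for the trained LSTM: repeat the last scaled open."""
    return window[-1][0]

# Toy scaled data: 60 days, one feature (the scaled opening price).
scaled_rows = [[i / 100] for i in range(60)]
prediction = predict_next_open(scaled_rows, naive_model, seq_len=50, lo=6000.0, hi=7000.0)
```

Passing fewer than `seq_len` rows raises an error, which mirrors the minimum-input requirement stated above.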

Model performance
The training and validation datasets were fed into the LSTM model, and the model was run on both for 50 epochs. A comparable run was performed without external variables. The resulting training and validation loss curve without the use of external factors is displayed in Figure 2. The training and validation loss curve using all 11 variables is displayed in Figure 3. We used different line patterns to distinguish the training and validation loss curves.

RESULT AND ANALYSIS
A comparison of the accuracy is shown in Table 5. The first model, which employs only technical indicators, achieved an accuracy of 62% using multivariate LSTM. In contrast, the second model, which utilized both technical and external features, achieved a significantly higher accuracy of 86% using multivariate LSTM. External factors thus raised the accuracy of LSTM-based stock market trend predictions by around 24 percentage points.
Figure 4 displays the results of a multivariate LSTM neural network model that uses historical technical data variables (open, high, low, close, and volume) as input, without external factors. The predictions of the LSTM model are plotted against the original opening price. The results indicate that the model, relying solely on technical variables, can capture a long-term upward trend but fails to accurately predict short-term trends, which are crucial in daily trading. Only the last few days are plotted in the figure for clearer visualization.
Figure 5 presents the results of an LSTM model that includes both technical data variables and external factors, with a total of 11 input variables. The predictions of the LSTM model are again plotted against the original opening price. The results demonstrate that incorporating external factors improves the accuracy of the model's predictions of the stock market trend.
The accuracy of the prediction was only measured graphically. Error metrics such as mean absolute percentage error (MAPE), mean absolute error (MAE), and root mean square error (RMSE) were intentionally avoided, as they would not accurately reflect the quality of the trend prediction. The model predicted the opening price for 30 December 2021, the day after the end of the training dataset, as 6721.46 taka. In reality, the opening price at the Dhaka Stock Exchange was 6731.15 taka.
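The gap between the single predicted and observed opening prices quoted above can be worked out directly. The short calculation below uses only those two figures; it describes one data point and is not an error metric over the test set.

```python
predicted = 6721.46  # model output for 30 December 2021 (taka)
actual = 6731.15     # observed opening price (taka)

abs_error = abs(actual - predicted)    # about 9.69 taka
pct_error = abs_error / actual * 100   # about 0.14 percent of the actual price

summary = f"off by {abs_error:.2f} taka ({pct_error:.2f}%)"
```

The prediction is therefore within roughly 10 taka of the observed opening price for that day.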

CONCLUSION AND FUTURE WORK
The aim of this paper was to predict the stock market in Bangladesh and determine the correlation between the market trend and external factors, such as news sentiment and other economic indicators, for improved prediction. News sentiment alone can capture many social factors that affect the stock market trend. In the future, a multivariate multilayer LSTM or bi-LSTM neural network may be used for better prediction. Public mood-driven asset allocation is also a possibility. To gather sentiment data, multiple news sources can be utilized, and social media sentiment can also play a crucial role. The combination of these external factors with financial news sentiment has the potential to produce more accurate results in multivariate LSTM stock market prediction.