Detecting emotions using a combination of bidirectional encoder representations from transformers embedding and bidirectional long short-term memory

ABSTRACT


INTRODUCTION
Emotion detection in text has recently become one of the challenging topics in natural language understanding (NLU) [1]. Understanding human emotions from text alone, without facial expressions, is a complex task, so research in this area is important to develop [2]. Meanwhile, social media is growing rapidly and has become a place where users convey their emotions in the form of text comments [3]. Therefore, a system that can understand the context of a sentence is needed so that emotions in social media comments can be detected automatically. In addition, emotion detection in text is being developed for various languages [4]-[6]. A text emotion detector built for one language cannot be applied automatically to another language because each language has a different structure [7]. Previous studies have developed emotion detection in texts. Several methods have been used, including multinomial naive Bayes (MNB) [8], a hybrid of support vector machine (SVM) and k-nearest neighbors (KNN) [9], and logistic regression (LR) [10]. These investigations implement machine learning techniques. Machine learning methods have the limitation that they require labeled data to build classification models, and their accuracy still needs to be improved. Meanwhile, research on deep learning is also developing. Several deep learning techniques have been used for text emotion detection, namely bidirectional encoder representations from transformers (BERT) [11], [12], bidirectional long short-term memory (BiLSTM) [13], [14], long short-term memory (LSTM) [15], and convolutional neural network (CNN) [16]. These studies are limited because each used only one deep learning method for emotion detection in text. Such methods can be combined to create a more precise deep learning model that leverages the benefits of each technique.
Combinations of deep learning methods have been studied previously, such as the BiLSTM-CNN combination [17] for English text classification. The combination of BERT and BiLSTM has been used for text emotion detection and sentiment analysis in English [18], [19]. These studies are limited in that they apply only to English. Indonesian has a different language structure from other languages such as English, so models developed for English cannot be applied directly to Indonesian. Developing emotion detection for Indonesian text requires an Indonesian dataset and Indonesian-specific pre-processing. Currently, no research combines BERT and BiLSTM to detect emotions in Indonesian texts. Emotion detection for Indonesian text using a combination of deep learning methods, namely BERT-BiLSTM, is therefore needed.
Text emotion detection uses text classification techniques to classify text into distinct emotions. Word embedding is one of the phases of text classification. Word embedding is the process of converting text into a vector representation, that is, a specific point in a vector space [20]. Word embedding techniques include BERT [21], Word2Vec [22], and GloVe [23]. Word embedding is crucial to text classification because it can affect the performance of the classification model [24]. Inappropriate use of word embeddings leads to increased computation time and suboptimal classification performance.
The main contribution of this research is the use of the BERT-BiLSTM model to identify emotions in Indonesian text. In addition, this research compares the word embeddings BERT, Word2Vec, and GloVe as used in text classification. This study also compares the deep learning classification models BiLSTM, LSTM, and CNN. Each word embedding is combined with a classification model to obtain a model for emotion detection in text. The purpose of this study is to find the combination of word embedding and classification model with the best performance for detecting emotions in Indonesian texts. The best combination is expected to improve classification performance. This study employs a dataset of Indonesian social media comments, namely Twitter reviews of Commuter Line and Transjakarta [11].

RESEARCH METHOD
This section describes the research methodology employed in this study. This study does not describe the data collection phase because it uses datasets from a prior study [11]. Using the previous dataset enables comparisons between the findings of this research and those of earlier investigations. The research methodology includes five steps: text pre-processing, word embedding, modeling, hyperparameter tuning, and performance evaluation. Figure 1 depicts the study methodology for detecting emotions in Indonesian texts.

Text pre-processing
Text pre-processing is the process of transforming text data into a more structured form so that the data is ready for the next stage of processing [25]. The stages of pre-processing Indonesian tweets include case folding, tokenizing, filtering, stop word removal, and stemming. Case folding converts capital letters to lowercase letters. In the case folding stage, characters other than letters and numbers are also deleted. Next, tokenizing is the process of breaking sentences into words or tokens. The natural language toolkit (NLTK) library is used for tokenizing. Then, filtering retains the important words from the tokenizing results and removes tweet attributes such as URLs, links, and mentions. The stop word removal stage removes common words that appear often but carry little meaning. The list of Indonesian stop words was built as part of this research. Finally, stemming maps each word to its standard (root) form. The stemming procedure in this study uses the Sastrawi Python library. This library implements the Nazief and Adriani algorithm [26], which is suitable for Indonesian.
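The pre-processing stages described above can be sketched as a small pipeline. This is a minimal illustration using only the Python standard library, not the code used in the study: the stop word list is a tiny hypothetical sample, whitespace splitting stands in for the NLTK tokenizer, and the stemmer is a placeholder where a Sastrawi stemmer would be plugged in. The sketch also assumes that URLs and mentions are removed before other non-alphanumeric characters are stripped, so that they can still be recognized.

```python
import re

# Hypothetical, tiny stop word list for illustration; the study built its own list.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}

URL_OR_MENTION = re.compile(r"(https?://\S+|www\.\S+|@\w+)")
NON_ALNUM = re.compile(r"[^a-z0-9\s]")


def identity_stem(token):
    """Placeholder stemmer; replace with a real Indonesian stemmer."""
    return token


def preprocess(tweet, stem=identity_stem, stop_words=STOP_WORDS):
    """Apply case folding, filtering, tokenizing, stop word removal, stemming."""
    text = tweet.lower()                     # case folding
    text = URL_OR_MENTION.sub(" ", text)     # filtering: drop URLs and mentions
    text = NON_ALNUM.sub(" ", text)          # drop characters other than letters and digits
    tokens = text.split()                    # tokenizing (whitespace stand-in for NLTK)
    tokens = [t for t in tokens if t not in stop_words]  # stop word removal
    return [stem(t) for t in tokens]         # stemming


print(preprocess("@CommuterLine Kereta ini TELAT lagi!! https://t.co/abc"))
```

Any callable that maps a token to its root form can be passed as `stem`, for instance a thin wrapper around a Sastrawi stemmer object.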

Word embedding
The procedure of converting text to vector format is known as word embedding [27]. Word embeddings allow words with similar meanings to have similar vector representations. In addition to BERT embedding, this study also uses two other word embeddings, namely Word2Vec and GloVe. BERT embedding is an efficient bidirectional transformer-based contextualized technique because it can be pre-trained and fine-tuned. In this study, the BERT base multilingual cased model (multi_cased_L-12_H-768_A-12) was used for BERT embedding. Meanwhile, Word2Vec is a statistical technique for efficiently learning a word embedding from a text corpus. This study uses the Gensim library [28] for Word2Vec word embedding. Finally, global vectors for word representation (GloVe) is an extension of the Word2Vec technique for efficiently learning word vectors. This research uses Stanford's GloVe 6B 100d word embeddings [29].
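The difference between a fixed lookup table, as in Word2Vec and GloVe, and a contextualized embedding such as BERT can be illustrated with a toy example. The sketch below is not BERT and does not use any of the pre-trained models named above: the two-dimensional vectors are made up, and the "contextual" function simply averages each token's vector with its neighbours' vectors, which is an assumed stand-in chosen only to make the output depend on context. The Indonesian word "bisa" can mean "can" or "venom" depending on the sentence.

```python
# Made-up 2-dimensional static vectors, for illustration only.
STATIC = {
    "ular": [1.0, 0.0],    # "snake"
    "bisa": [0.5, 0.5],    # "can" or "venom", depending on context
    "saya": [0.0, 1.0],    # "I"
    "datang": [0.0, 0.8],  # "come"
}


def static_embed(tokens):
    """Static lookup: one fixed vector per word, regardless of context."""
    return [STATIC[t] for t in tokens]


def toy_contextual_embed(tokens):
    """Toy stand-in for a contextual encoder (NOT BERT).

    Each token vector is averaged with the vectors of its immediate
    neighbours, so the same word gets different vectors in different sentences.
    """
    base = static_embed(tokens)
    out = []
    for i in range(len(base)):
        window = base[max(0, i - 1): i + 2]
        out.append([sum(v[d] for v in window) / len(window) for d in range(2)])
    return out


s1 = ["bisa", "ular"]             # "snake venom"
s2 = ["saya", "bisa", "datang"]   # "I can come"

# Static lookup: identical vectors for "bisa" in both sentences.
print(static_embed(s1)[0] == static_embed(s2)[1])
# Toy contextual version: the vectors for "bisa" differ.
print(toy_contextual_embed(s1)[0] == toy_contextual_embed(s2)[1])
```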

Modelling
The emotion detection model for Indonesian text is built at this stage. This research uses Jupyter Notebook, Python 3.10, a Core i7 processor, 16 GB of RAM with a speed of 3,600 MHz, and a CUDA-capable NVIDIA GPU for model development. The models were built with the high-level Keras application programming interface (API) model class and TensorFlow 2.0. The deep learning models employed for emotion detection are BiLSTM, LSTM, and CNN. Each model is combined with the word embeddings described in the previous stage. This yields nine scenarios that combine a word embedding with an emotion detection model, namely BERT-BiLSTM, BERT-LSTM, BERT-CNN, Word2Vec-BiLSTM, Word2Vec-LSTM, Word2Vec-CNN, GloVe-BiLSTM, GloVe-LSTM, and GloVe-CNN. The architectural design of one of these scenarios, the BERT-BiLSTM combination, is shown in Figure 2.
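The nine scenarios are the Cartesian product of the three word embeddings and the three classification models. A short sketch of how such an experiment grid could be enumerated is shown here; the variable names are illustrative and do not come from the study's code.

```python
from itertools import product

EMBEDDINGS = ["BERT", "Word2Vec", "GloVe"]
CLASSIFIERS = ["BiLSTM", "LSTM", "CNN"]

# One scenario name per (embedding, classifier) pair, embedding-major order.
scenarios = [f"{emb}-{clf}" for emb, clf in product(EMBEDDINGS, CLASSIFIERS)]

for name in scenarios:
    print(name)
```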
BERT-BiLSTM combines BERT as the upstream part, namely the word embedding stage, and BiLSTM as the downstream part, namely the classification model development stage. Figure 2 shows the BERT-BiLSTM architectural design, in which the Indonesian text input is first processed by the BERT embedding. The process then continues with model development using BiLSTM, which has two layers: a forward LSTM that reads the sequence from beginning to end, and a backward LSTM that reads it from end to beginning, so that the sequential linkage in both directions is captured. Finally, the output is the text classified by type of emotion, namely happiness, sadness, anger, fear, and surprise. The architectural designs of the other scenarios work in the same way, combining a word embedding (BERT, Word2Vec, or GloVe) with a deep learning classification model (BiLSTM, LSTM, or CNN).
In this investigation, the BiLSTM architecture employs a 256-unit bidirectional layer. It is followed by a 64-unit fully connected layer with the rectified linear unit (ReLU) activation function. The last layer is the output layer, with a single unit and a sigmoid activation function, which produces the emotion output for the text. The LSTM architecture is similar to the BiLSTM architecture and differs only in the LSTM layer.
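To give a sense of the size of the BiLSTM head described above, the number of trainable parameters can be worked out by hand. The sketch below uses the standard LSTM parameter count (four gates, each with input weights, recurrent weights, and a bias). It rests on several assumptions that the text does not state: that "256-unit" means 256 units per direction, that the two directions are concatenated, that the 768-dimensional BERT vectors are fed directly into the BiLSTM, and that the BERT weights themselves are not counted.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of one standard LSTM layer (four gates)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Trainable parameters of a fully connected layer (weights plus biases)."""
    return input_dim * units + units


EMB_DIM = 768   # hidden size of BERT base (the H-768 in the model name)
UNITS = 256     # assumption: 256 units in each direction

bilstm = 2 * lstm_params(EMB_DIM, UNITS)   # forward + backward LSTM
hidden = dense_params(2 * UNITS, 64)       # 64-unit ReLU layer on concatenated states
output = dense_params(64, 1)               # single sigmoid unit
total = bilstm + hidden + output

print(bilstm, hidden, output, total)
```

Under the same assumptions, the unidirectional LSTM variant has half as many recurrent-layer parameters, and the counts for the Word2Vec and GloVe scenarios would be smaller because their input vectors have fewer dimensions.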
In this study's basic CNN architecture, a convolution layer with ReLU activation followed by a max-pooling layer is repeated up to three times for feature detection. Then, a flattening step transforms the feature matrix into a vector that serves as the input to the fully connected layer. A dropout layer with a rate of 0.5 is added to prevent overfitting. A fully connected layer (FCL) with 256 neurons and the ReLU activation function follows. The final CNN layer employs sigmoid activation to generate the emotion detection output for Indonesian text.

Hyperparameter tuning
The hyperparameter tuning phase determines the parameters used in the experimental scenarios. This study sets the loss function, optimizer, batch size, model activation, and number of epochs. The loss function is binary_crossentropy, and the optimizer is Adam. The batch size is 1,024. Training runs for 12 epochs with a validation split of 0.1. For BERT embedding, the maximum sequence length is 128. The ratio between training data and test data is 9:1.
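The settings above can be collected in one configuration, and they imply a small number of gradient updates per epoch. The following sketch works through that arithmetic for the tweet counts reported for the two datasets in this study (20,395 and 10,649). The dictionary keys and the helper function are illustrative names, and the rounding rules (rounding down for both the 9:1 split and the validation split) are assumptions.

```python
import math

CONFIG = {
    "loss": "binary_crossentropy",
    "optimizer": "adam",
    "batch_size": 1024,
    "epochs": 12,
    "validation_split": 0.1,
    "max_sequence_length": 128,  # applies to BERT embedding only
}


def training_budget(n_samples, config=CONFIG):
    """Estimate split sizes and the number of gradient updates."""
    n_train = n_samples * 9 // 10                        # 9:1 train/test ratio (rounded down)
    n_val = int(n_train * config["validation_split"])    # held out from the training data
    n_fit = n_train - n_val                              # samples actually used for updates
    steps = math.ceil(n_fit / config["batch_size"])      # batches per epoch
    return {
        "train": n_train,
        "test": n_samples - n_train,
        "validation": n_val,
        "steps_per_epoch": steps,
        "total_updates": steps * config["epochs"],
    }


print(training_budget(20395))  # Commuter Line
print(training_budget(10649))  # Transjakarta
```

With a batch size of 1,024 and datasets of this size, each epoch consists of only a handful of updates, which is worth keeping in mind when interpreting the 12-epoch training schedule.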

Performance evaluation
In the model's evaluation phase, a confusion matrix is used to measure how well the system detects emotions in Indonesian text. The confusion matrix has four values, namely true positive (TP), false positive (FP), false negative (FN), and true negative (TN) [30]. As shown in (1)-(4), the accuracy, precision, recall, and F1-measure are calculated from these values during the evaluation phase.
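Equations (1)-(4) are not reproduced in this text. The standard definitions of the four metrics in terms of TP, FP, FN, and TN are given here, numbered on the assumption that they follow the order in which the metrics are listed above.

```latex
\begin{align}
\text{Accuracy}   &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1}\\
\text{Precision}  &= \frac{TP}{TP + FP} \tag{2}\\
\text{Recall}     &= \frac{TP}{TP + FN} \tag{3}\\
\text{F1-measure} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}
\end{align}
```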

RESULTS AND DISCUSSION
The dataset consists of Indonesian-language tweets from Transjakarta and Commuter Line users. This study employs the dataset used in previous research [11]. Table 1 summarizes the dataset used in this research. This study categorizes text into five types of emotions, namely happiness, anger, sadness, fear, and surprise. The data scenarios used in this study are Commuter Line, Transjakarta, and Commuter Line+Transjakarta. In the Commuter Line+Transjakarta data scenario, the two datasets are combined. The Commuter Line data contains 20,395 tweets, and the Transjakarta data contains 10,649 tweets.
In general, Figure 3 shows that the classification model with the best accuracy for emotion detection in Indonesian is BiLSTM, followed by LSTM, and finally CNN. For word embedding, BERT obtained the highest accuracy, followed by Word2Vec, and finally GloVe. Among the data scenarios, the combined Commuter Line+Transjakarta data produced the highest accuracy. This is because combining the two datasets provides more text features for emotion detection in Indonesian than using only one of them. The more text features available for detecting emotions in the data, the higher the accuracy.
Precision, recall, and F1-measure are also used to measure how well the emotion detection model works. These measurements indicate how effectively the model recognizes emotions in Indonesian. Table 2 presents the precision, recall, and F1-measure values in each data scenario. The highest values among the models are indicated in bold in the table.
Table 2 shows that the BERT-BiLSTM model generates the best precision, recall, and F1-measure values compared with competing models in the Commuter Line, Transjakarta, and Commuter Line+Transjakarta scenarios. In all data scenarios, BERT-BiLSTM, BERT-LSTM, and BERT-CNN are the models that produce the highest precision, recall, and F1-measure values. For Word2Vec embedding, the order from highest to lowest precision, recall, and F1-measure is Word2Vec-BiLSTM, Word2Vec-LSTM, and Word2Vec-CNN. For GloVe embedding, the GloVe-BiLSTM model obtained the highest precision, recall, and F1-measure values, followed by the GloVe-LSTM model, and finally the GloVe-CNN model.
In general, Table 2 shows that the model with the best precision, recall, and F1-measure values is BiLSTM, followed by LSTM and CNN. Meanwhile, the word embeddings that produce the best precision, recall, and F1-measure values are, in order, BERT, Word2Vec, and GloVe. The data scenario that obtains better precision, recall, and F1-measure values than the other data scenarios is the combined Commuter Line+Transjakarta data. The findings from the comparison of precision, recall, and F1-measure values are in line with the accuracy results discussed earlier.
The BERT-BiLSTM model produces superior evaluation results compared to the other models. This is because the BERT-BiLSTM model integrates BERT word embedding as the upstream part and BiLSTM as the downstream part, which serves as the classification model for emotion detection in text. BERT is able to capture the contextual relationships of each word and can read long texts better. Meanwhile, BiLSTM can build a model from sequential input data, such as text. BiLSTM has the advantage of being able to retain selected memory because it has two layers: one layer captures the linkage with the preceding sequential data, and the other captures the linkage with the following sequential data. Combining BERT and BiLSTM brings together the benefits of both so that the model can detect emotions in Indonesian texts more accurately than the other models.
This study uses three classification models, namely BiLSTM, LSTM, and CNN, to detect emotions in Indonesian texts. BiLSTM achieves the highest scores among the classification models because it processes the input both from the past to the future and from the future to the past, whereas LSTM only stores past information because it receives only past input. BiLSTM combines two hidden states and can store information from the past and the future at any time step. BiLSTM shows excellent results because it can capture more context in sequential data, which makes it perform more effectively than LSTM. BiLSTM and LSTM are types of recurrent neural network (RNN) models, which process sequential data and pay attention to the input order, so they are suitable for text data. Meanwhile, the CNN model is usually applied to image data because its convolution and pooling layers perform feature extraction on images well. The CNN model also shows good performance in text classification, although its performance is still below that of BiLSTM and LSTM.
Among the word embeddings used in this study, BERT achieves superior results to Word2Vec and GloVe. This is because BERT produces word representations that are dynamically informed by the surrounding words, both following and preceding, in a sentence. If the same word has multiple meanings (polysemy) in two separate sentences, BERT produces two different vectors for that word, one for each context. In order for BERT to generate more precise features and enhance model performance, the suggested modifications must be implemented. In contrast, each word in Word2Vec has a fixed representation irrespective of its context. Word2Vec generates the same single vector for a polysemous word in both sentences; therefore, it cannot accurately represent the sentence context. Word2Vec is superior to GloVe because the embedding it generates indicates whether words are used in similar contexts. Word2Vec can determine the linear relationship between adjacent words in the vector space. This maximizes the probability of any term being close to the target, thereby enhancing accuracy. GloVe embeddings, by contrast, are based on the probabilities that two words appear together in the corpus. GloVe is incapable of capturing the linear relationship between concepts in the vector space, which makes it less precise than Word2Vec.
The findings of this study were also compared with those of previous research [11]. Previous studies developed emotion classification systems using either BERT or BiLSTM. Table 3 compares the accuracy, precision, recall, and F1-measure values from previous studies with those of the model proposed in this study.
Table 3 shows that the BERT-BiLSTM model proposed in this study obtained better accuracy, precision, recall, and F1-measure values in all data scenarios than the other models. The models with the next highest performance metric values are BERT and BiLSTM, which were proposed in previous studies. The findings of this study therefore have the potential to enhance the performance of emotion detection systems compared to previous research, improving the accuracy, precision, recall, and F1-measure results for emotion detection in Indonesian texts.

CONCLUSION
This study successfully identified emotions in Indonesian tweets using the Commuter Line and Transjakarta review datasets. This study evaluates nine scenarios that combine word embeddings (BERT, Word2Vec, and GloVe) with emotion detection models (BiLSTM, LSTM, and CNN). The BERT-BiLSTM model generates the highest accuracy on the Commuter Line, Transjakarta, and Commuter Line+Transjakarta data, with values of 88.28%, 88.42%, and 89.20%, respectively. In general, the classification model that obtains the best accuracy is BiLSTM, followed by LSTM, and finally CNN. For word embedding, BERT obtained the highest accuracy, followed by Word2Vec, and finally GloVe. Precision, recall, and F1-measure are also used to evaluate the performance of the emotion detection model. The BERT-BiLSTM model generates the highest precision, recall, and F1-measure values compared to the other models in each data scenario. The results of this study show that BERT-BiLSTM can enhance classification performance compared to previous studies that used only BERT or only BiLSTM for emotion detection in Indonesian texts.
For future research, we suggest combining the BiLSTM model with other classification models, such as CNN, because combining two deep learning models is expected to improve performance in emotion detection in text. The dataset can also be expanded beyond social media reviews to include formal text from Indonesian-language documents, which would broaden the usefulness of the emotion detection model.

ISSN: 2088- 8708 Figure 1 .
Figure 1.The research methodology for detecting emotions in Indonesian texts

Figure 2. The architectural design of the BERT-BiLSTM model

Table 1. The dataset in this study

Figure 3 shows a comparison of the accuracy of emotion detection in Indonesian text.

Detecting emotions using a combination of bidirectional encoder representations … (Aji Prasetya Wibawa)

Table 2. The comparison of precision, recall, and F1-measure