Multi-label text classification of Indonesian customer reviews using bidirectional encoder representations from transformers language model

ABSTRACT


INTRODUCTION
Customer reviews are a critical resource for discovering useful information about user experiences with a particular product (or service). Such information is important for a company, helping it make good decisions about its products. A review text may contain the user's opinions about several aspects of a product, where each aspect may receive a different sentiment from the user. Here is an example of an Indonesian customer review of a hotel experience that contains different sentiments for different aspects of the hotel: "kamarnya nyaman dan bersih, tetapi TV nya terlalu tinggi jadi kamu tidak bisa nonton" ("The room is comfortable and clean, but the TV is too high, so you can't watch it"). In that review, the 'cleanliness' aspect has a positive sentiment, but the 'TV' aspect, as one of the hotel's facilities, has a negative sentiment.
Aspect category detection or aspect classification is one of the subtasks of aspect-based sentiment analysis (ABSA) [1]. For the aspect classification task, the aspects contained in a review text are identified, and the polarity of each aspect is then determined by sentiment classification. In this study, we address three research questions: i) How effective is our first model using IndoBERT embedding with a CNN-XGBoost classifier for multi-label text classification?; ii) How effective is our second model using the IndoBERT end-to-end model for multi-label text classification?; and iii) How different is the effectiveness of our models using the monolingual language model IndoBERT compared to those using the multilingual language model mBERT for multi-label text classification?

RESEARCH METHOD
The aspect classification task is formulated as multi-label text classification. This section explains the dataset, the multi-label classification task, the system components (BERT, CNN, and XGBoost), the research method, the evaluation method, and the experiment details. The general architecture of our two suggested strategies is also illustrated in this section.

Dataset
We use the Airy Rooms hotel review dataset in Indonesian from Azhar et al. [2]. We divided the dataset into train, validation, and test sets according to the standard split in the IndoNLU documentation [20]. The dataset consists of 2,854 reviews, divided as follows: 2,283 for the train set, 286 for the test set, and 285 for the validation set. Each review consists of a text and a set of assigned labels. Table 1 presents the label distribution of the Airy Rooms hotel review dataset.

Multi-label classification task
The classification task is generally described as: "Given a training set made up of pairs (x_i, y_i), discover a function f(x) that maps each attribute vector x_i to its associated label y_i, with i = 1, 2, 3, …, n, where n is the number of training examples" [23]. In multi-label classification, each input sample may correspond to more than one label. More specifically, for each input sample, there exists a set of labels M to which the input sample belongs [24]. Figure 1 presents an illustration of the multi-label text classification formulation for aspect classification. The aim of this study is to predict the aspect categories of customer reviews in the Indonesian hotel review dataset. Given a customer review, the text embedding for the review is first generated. The embedding results are then fed into the classifier, which produces the predicted aspect labels of the customer review. For example, for the hotel review displayed in Figure 1, two aspects are classified from the review: 'cleanliness' (kebersihan) and 'Wi-Fi'. A more detailed explanation is provided in the following subsections.
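This multi-label formulation can be sketched as a binary indicator vector over the aspect label set M. The aspect names below are an illustrative subset (the dataset has 10 aspects; only 'kebersihan' and 'Wi-Fi' are named in the text, the rest are hypothetical):

```python
# Minimal sketch of the multi-label formulation: each review maps to a
# 0/1 indicator vector over the aspect label set M.
ASPECTS = ["kebersihan", "wifi", "ac", "tv"]  # hypothetical subset of the 10 aspects

def encode_labels(assigned, aspects=ASPECTS):
    """Turn a set of assigned aspect labels into a binary indicator vector."""
    return [1 if a in assigned else 0 for a in aspects]

# The review "kamarnya nyaman dan bersih, tapi wifi kurang merata" would be
# annotated with two aspects, giving a multi-label target:
y = encode_labels({"kebersihan", "wifi"})
# y -> [1, 1, 0, 0]: the review belongs to two of the aspect categories
```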

System components

BERT
The current state of the art for many natural language processing (NLP) applications is BERT, which stands for bidirectional encoder representations from transformers [14]. BERT learns a word's context in both directions, using the tokens on its left and on its right within a sequence [16]. Various BERT models have been developed, such as m-BERT [14], distil-BERT [25], XLM-RoBERTa [26], IndoBERT (base, lite, and large) from Wilie et al. [20], and IndoBERT (base) and IndoBERTweet from Koto et al. [21], [27]. For a more detailed explanation of each of these models, please refer to the original papers. Table 2 summarizes the hyperparameter settings used in previous work to build all the pre-trained BERT language models utilized in this work. Subscripts W and K in the table denote that the respective model was trained by Wilie et al. [20] or Koto et al. [21], respectively.
Multilingual BERT (m-BERT) is a single language model pre-trained on the concatenation of monolingual Wikipedia corpora from 104 languages, including Indonesian [14]. Because m-BERT is pre-trained on a large number of languages, it is widely applicable, and researchers can use it to solve tasks in various languages. DistilBERT is a distilled variant of BERT; it is smaller and runs faster, while retaining 97% of BERT's language understanding ability [25].
XLM-RoBERTa is another multilingual model, trained on 100 different languages [26].
Recently, large Indonesian corpora were used to pre-train the BERT model, and the resulting models, called IndoBERT [20], [21], were made publicly available for research purposes. Much attention has been paid to exploring IndoBERT for several text processing tasks. The large-scale Indonesian dataset used by Wilie et al. [20] to train their IndoBERT was compiled from texts found on websites, news, blogs, and social media; it consists of around 4 billion words in around 250M sentences [20]. The IndoBERT of Koto et al. [21] was trained on 220 million words from an Indonesian web corpus, news articles, and Wikipedia. Koto et al. [27] also released IndoBERTweet, a BERT language model pre-trained on 409M word tokens from an Indonesian Twitter dataset.
In this study, we build our model for aspect classification using the IndoBERT pre-trained models developed by Wilie et al. [20] and Koto et al. [21], [27]. For our multi-label text classification task, we utilized IndoBERT in two strategies: i) using IndoBERT as feature representation only and ii) using IndoBERT as an end-to-end model (i.e., it serves as both feature representation and classifier). In addition, we further analyze how our models' results compare with those using multilingual BERT models, such as m-BERT, distil-BERT, and XLM-RoBERTa.

CNN
CNN has been shown to be one of the models that performs well for multi-label text classification, as investigated in [2], [11]-[13]. The type of convolution used in text-processing tasks is one-dimensional convolution. It involves mapping the input text into a set of embedding vectors corresponding to the text's word order [28]. To extract indicative information, the convolutional layer moves a sliding window of size k over the sequence of word embedding vectors, performing a linear transformation followed by a non-linear activation function. The pooling layer then selects, from each window, only the information that is useful for prediction [28]. In this study, we used a single-channel convolutional layer following [11], with the rectified linear unit (ReLU) activation function. The single-channel CNN model is detailed in Figure 2.
The length of the convolution window is determined by the kernel size. In this study, we used a kernel size of 2, so the kernel slides along the input embedding and examines two words at a time (this essentially corresponds to bigram features) [11].
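The sliding-window convolution with k=2, ReLU, and max pooling described above can be sketched as follows. This is a minimal illustration with toy 3-dimensional embeddings and a single filter, not the actual CNN architecture:

```python
# Sketch of one-dimensional convolution over word embeddings with kernel
# size k=2 (bigram windows), ReLU activation, and max pooling over the
# window positions. Embedding and filter values are toy numbers.

def conv1d_relu_maxpool(embeddings, weights, bias=0.0, k=2):
    """Slide a window of k word vectors, apply a linear filter + ReLU,
    then max-pool over all window positions."""
    outputs = []
    for i in range(len(embeddings) - k + 1):
        window = [v for vec in embeddings[i:i + k] for v in vec]  # concat k vectors
        score = sum(w * x for w, x in zip(weights, window)) + bias
        outputs.append(max(0.0, score))  # ReLU
    return max(outputs)  # max pooling keeps the strongest bigram signal

sentence = [[1.0, 0.0, 0.5], [0.2, 0.3, 0.0], [0.0, 1.0, 1.0]]  # 3 words, d=3
filt = [0.5, -0.1, 0.2, 0.1, 0.4, -0.3]  # one filter of size k*d = 6
feature = conv1d_relu_maxpool(sentence, filt)  # one scalar feature per filter
```

A real CNN layer applies many such filters in parallel, producing one pooled feature per filter.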

XGBoost
XGBoost, or extreme gradient boosting, is an extended variant of the gradient boosting ensemble method [29]. To create an efficient ensemble of decision trees, the gradient boosting technique sequentially combines the results of weak classifiers. The components of XGBoost are an optimization objective function, parameter adjustment, and a learning model. Objective function optimization and model complexity reduction are achieved by optimizing the penalty function and minimizing the loss function [30].

Research method
This section illustrates the overall architecture of our proposed strategies. We explain in more detail our two proposed strategies for identifying all aspects contained in each customer review by performing multi-label text classification. The frameworks of our first and second strategies are illustrated in Figures 3 and 4, respectively.

IndoBERT-CNN-XGBoost model
For the first strategy, we used IndoBERT as the embedding, CNN as the feature extraction method, and XGBoost as the classifier. Figure 3 depicts the research flow of this model. First, the dataset has to be transformed into the BERT input format. We preprocessed the input data using the BERT tokenizer: a [CLS] token is added at the beginning of every sentence and a [SEP] token at the end. To fit the maximum sequence length of 128 tokens, each sentence is truncated or padded. Besides that, we used the 'attention_mask', which consists of '0' (denoting a padding position) and '1' (denoting a real token). From this step we obtained the 'input_ids' and 'attention_mask' for each sentence.
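The input preparation above can be sketched as follows. A toy whitespace tokenizer and a hypothetical vocabulary stand in for the real BERT tokenizer; only the [CLS]/[SEP], truncation/padding, and attention-mask logic is illustrated:

```python
# Sketch of BERT input preparation: add [CLS]/[SEP], truncate or pad to a
# fixed length, and build the attention mask. TOY_VOCAB is illustrative.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "kamarnya": 10, "nyaman": 11, "dan": 12, "bersih": 13}

def encode(sentence, max_len=8, vocab=TOY_VOCAB):
    tokens = ["[CLS]"] + sentence.split()[: max_len - 2] + ["[SEP]"]
    input_ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(input_ids)        # 1 marks a real token
    pad = max_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad          # 0 pads up to max_len
    attention_mask += [0] * pad                  # 0 marks a padding position
    return input_ids, attention_mask

ids, mask = encode("kamarnya nyaman dan bersih")
# ids  -> [2, 10, 11, 12, 13, 3, 0, 0]
# mask -> [1, 1, 1, 1, 1, 1, 0, 0]
```

In our experiments the real tokenizer uses max_len=128 rather than the toy value of 8.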
Suppose that we use the IndoBERT embedding from Wilie et al. [20], which has a vocabulary size of 30,522 and a dimension of 768; this yields a 30,522×768 embedding matrix. Each input's 'input_ids' and 'attention_mask' are then embedded by the embedding layer. An embedding vector x_i ∈ R^d is used to represent each word, where d stands for the word embedding dimension. To represent the sentence as a whole, the word vector representations are concatenated. We can define the sentence representation as x_{1:n} = x_1 ⊕ x_2 ⊕ … ⊕ x_n, where n is defined as the input text's maximum length [11]. The input for a single sentence is thus represented as an n×d matrix [15].
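The embedding lookup can be sketched as row indexing into the embedding matrix. A tiny 4×3 matrix with made-up values stands in for the real 30,522×768 IndoBERT matrix:

```python
# Sketch of the embedding lookup: each input id indexes a row of the
# embedding matrix, and the per-word vectors stack into an n x d matrix.
EMB = [  # toy matrix: vocab_size=4, d=3
    [0.0, 0.0, 0.0],   # id 0: [PAD]
    [0.1, 0.2, 0.3],   # id 1
    [0.4, 0.5, 0.6],   # id 2
    [0.7, 0.8, 0.9],   # id 3
]

def embed(input_ids, emb=EMB):
    """Map a sequence of n token ids to an n x d sentence matrix."""
    return [emb[i] for i in input_ids]

sentence_matrix = embed([2, 1, 3, 0])  # n=4 rows, each of dimension d=3
```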
After obtaining a vector representation for each review, we used it as a feature to train the CNN model to get a refined version of the text features. To solve the multi-label text classification, we replace the CNN-trained model's output layer with XGBoost as the top-level classifier. The objective is to extract the refined text matrix produced by the CNN and utilize it as a feature to train the XGBoost classifier. In this work, we also experimented with a few other classifiers, such as the random forest and naïve Bayes algorithms. We use the classifier chain (CC) problem transformation method to perform multi-label classification with machine learning classifiers [8], [15]. In our preliminary experiments, the results of the CC strategy were better than those of the binary relevance (BR) strategy, which is consistent with previous work [5], [31]. The CC strategy is effective because it can overcome the problem of label dependency [6]. Therefore, in our experiments, we use the CC strategy to conduct the problem transformation of our data for multi-label classification.
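The CC transformation can be sketched with scikit-learn's ClassifierChain, in which each label's classifier also receives the previous labels' predictions as extra features. This is a minimal illustration: LogisticRegression stands in for the XGBoost base classifier used in the paper, and the toy features stand in for the CNN-refined text features:

```python
# Sketch of the classifier chain (CC) problem transformation: label j's
# classifier sees the input features plus the predictions for labels
# 1..j-1, which lets the chain model label dependencies.
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]   # toy refined features
Y = [[1, 1], [1, 1], [0, 0], [0, 0]]                   # two aspect labels

chain = ClassifierChain(LogisticRegression(), order=[0, 1])
chain.fit(X, Y)
pred = chain.predict([[0.85, 0.15]])   # one 0/1 row per input sample
```

In our actual pipeline the base estimator would be an XGBoost classifier rather than LogisticRegression.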

IndoBERT end-to-end model
For the second strategy, we used IndoBERT to build an end-to-end model for multi-label classification. Here, IndoBERT serves as the text embedding as well as the classifier. Two versions of IndoBERT were used: IndoBERT pretrained by Wilie et al. [20], and IndoBERT and IndoBERTweet pretrained by Koto et al. [21], [27]. In our experiment, we also compared our results with those using multilingual BERT models, such as m-BERT [14], distil-BERT [25], and XLM-RoBERTa [26], which were trained in previous work on multilingual corpora (the configurations of each model were explained earlier in the BERT subsection). The research process of the second strategy is illustrated in Figure 4. During the fine-tuning process, the [CLS] output from the final hidden layer, represented as a vector of dimension 768, is passed through a fully connected layer and then through the sigmoid activation function. The sigmoid outputs are values between 0 and 1, representing the probabilities of each of the 10 predicted aspect labels. In this study, the predicted label output is decided using a threshold value of 0.5 [32].
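The prediction head above can be sketched as a sigmoid over the 10 aspect logits followed by the 0.5 threshold. The logit values here are purely illustrative:

```python
import math

# Sketch of the prediction head: sigmoid over the 10 aspect logits from
# the fully connected layer, then a 0.5 threshold to decide which labels
# are assigned. sigmoid(z) >= 0.5 exactly when z >= 0.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_labels(logits, threshold=0.5):
    probs = [sigmoid(z) for z in logits]          # each in (0, 1)
    return [1 if p >= threshold else 0 for p in probs]

logits = [2.1, -3.0, 0.4, -1.2, 5.0, -0.1, 0.0, -4.2, 1.5, -2.8]  # 10 aspects
labels = predict_labels(logits)
# labels -> [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
```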

Evaluation method
In this study, we use micro F1-score, Hamming loss, and accuracy for model evaluation. The micro F1-score is calculated from the false negative (FN), true positive (TP), and false positive (FP) counts. The micro F1-score, obtained by aggregating the counts over all aspects, is defined in (1) [33]:

micro F1 = (2 Σ_{j=1}^{k} TP_j) / (2 Σ_{j=1}^{k} TP_j + Σ_{j=1}^{k} FP_j + Σ_{j=1}^{k} FN_j)   (1)

where k is the total number of aspects, j is the aspect index, and TP_j, FP_j, and FN_j are the TP, FP, and FN counts for the aspect with index j. The mismatches between the actual and the predicted aspect labels are measured using Hamming loss (HL), which is defined in (2) [15]:

HL = (1/(n·k)) Σ_{i=1}^{n} Σ_{j=1}^{k} [y_{i,j} ≠ ŷ_{i,j}]   (2)

where y_{i,j} is the actual aspect label, ŷ_{i,j} denotes its predicted aspect label, k is the total number of aspects, and n is the sample size.

Accuracy measures the agreement between the actual and the predicted label sets of each sample. The formula for accuracy in multi-label classification is given in (3) [6]:

Accuracy = (1/n) Σ_{i=1}^{n} |Y_i ∩ Ŷ_i| / |Y_i ∪ Ŷ_i|   (3)

where n is the total number of samples, Y_i is the actual aspect label set, and Ŷ_i is the predicted aspect label set.
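The three metrics can be computed directly from the binary label matrices. A minimal sketch, assuming labels are given as 0/1 lists with one row per sample:

```python
# Sketch of the three evaluation metrics: micro F1 from pooled TP/FP/FN
# counts, Hamming loss as the fraction of mismatched label positions, and
# multi-label accuracy as the mean intersection-over-union of the actual
# and predicted label sets.

def micro_f1(y_true, y_pred):
    pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)

def hamming_loss(y_true, y_pred):
    n, k = len(y_true), len(y_true[0])
    mismatches = sum(t != p for rt, rp in zip(y_true, y_pred)
                     for t, p in zip(rt, rp))
    return mismatches / (n * k)

def ml_accuracy(y_true, y_pred):
    total = 0.0
    for rt, rp in zip(y_true, y_pred):
        inter = sum(t == 1 and p == 1 for t, p in zip(rt, rp))
        union = sum(t == 1 or p == 1 for t, p in zip(rt, rp))
        total += inter / union if union else 1.0
    return total / len(y_true)

y_true = [[1, 0, 1], [0, 1, 0]]  # toy 2-sample, 3-aspect example
y_pred = [[1, 0, 0], [0, 1, 1]]
```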

Experiment details
In our experiment, we use several baseline methods: random forest, naïve Bayes, XGBoost, CNN, CNN-RandomForest, CNN-NaiveBayes, and CNN-XGBoost. For each of these models, we also experimented with two text embeddings: Word2Vec and IndoBERT. The CNN-XGBoost model with Word2Vec embedding was the best performing method in Azhar et al. [2], while the CNN model with IndoBERT embedding was the best performing method in Khasanah and Krisnadhi [11]. For our second model, we experimented with several versions of IndoBERT: IndoBERT (lite), IndoBERT (base), and IndoBERT (large) from Wilie et al. [20], and IndoBERT and IndoBERTweet from Koto et al. [21], [27]. Further, a comparison with multilingual BERT models, such as m-BERT, distil-BERT, and XLM-RoBERTa, was also performed. The configurations of each BERT model are summarized in Table 2.
The model architecture was created using Python 3.7 and the models were trained on a single NVIDIA Tesla T4 GPU. For the CNN hyperparameters, we follow the single-channel CNN architecture and settings from Khasanah and Krisnadhi [11]: the Adam optimizer, a batch size of 64, an input dimension of 512, a dropout rate of 0.5, a maximum of 70 epochs, a learning rate of 0.001, and a kernel size of 2. For the machine learning top-level classifier, we used random forest, XGBoost, and naïve Bayes with the CC strategy. For the BERT hyperparameter settings, we followed the recommendations of Devlin et al. [14]: a dropout probability of 0.1, a learning rate of 2e-5, and the Adam optimizer. We set the train and validation batch size to 32 and the maximum input length to 128. For the baseline methods using Word2Vec embedding, the training parameters are a window size of 5 and a vector size of 512.
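For reference, the hyperparameters listed above can be collected into one place. This is a plain configuration sketch; the key names are our own and do not correspond to any particular library's API:

```python
# Hyperparameter settings used in this study, gathered as plain dicts.
CNN_CONFIG = {
    "optimizer": "adam", "batch_size": 64, "input_dim": 512,
    "dropout": 0.5, "max_epochs": 70, "learning_rate": 1e-3, "kernel_size": 2,
}
BERT_CONFIG = {
    "optimizer": "adam", "dropout": 0.1, "learning_rate": 2e-5,
    "batch_size": 32, "max_seq_length": 128,
}
WORD2VEC_CONFIG = {"window_size": 5, "vector_size": 512}
```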

The results of our first model: IndoBERT-CNN-XGBoost
Based on Table 3, the results show that the deep learning model CNN is more effective than the machine learning models, i.e., random forest, naïve Bayes, and XGBoost, for the aspect classification task. Further combining the CNN with machine learning methods can increase the effectiveness of the model: the CNN-random forest, CNN-naïve Bayes, and CNN-XGBoost models outperform the plain CNN model. This finding is consistent with the one reported in [2]. It indicates that using the CNN as a feature extraction method results in more refined features that enable the machine learning classifier to classify the aspects of a review text more accurately.
Among all models, the CNN-XGBoost models consistently gain the best results in terms of micro F1-score, Hamming loss, and accuracy for each type of text embedding. Our CNN-XGBoost model using IndoBERT as text embedding significantly outperforms the CNN-XGBoost model of Azhar et al. [2] that uses Word2Vec embedding, achieving a micro F1-score of 0.8992, a Hamming loss of 0.0404, and an accuracy of 0.7228. This finding shows the effectiveness of our first model in using IndoBERT as the text representation in the CNN-XGBoost model. These results are also consistent with the IndoNLU benchmark results obtained by Wilie et al. [20], in which one of the findings is that contextualized pre-trained models significantly outperform static word-embedding-based models, illustrating the advantage of contextualized word embeddings over static word embeddings [20]. Note that the scores of the Word2Vec-CNN-XGBoost models presented in the original paper of Azhar et al. [2] are slightly different from those displayed in our table because they used a different data split from ours (here, we utilize the Airy Rooms dataset split provided by IndoNLU [20]). However, we have ensured to follow the hyperparameter settings used by Azhar et al. [2] to generate the results of all Word2Vec-CNN-machine learning variations in our table.

The results of our second model: end-to-end IndoBERT
We used five monolingual IndoBERT models for the multi-label text classification: IndoBERT (base, lite, and large) from Wilie et al. [20], and IndoBERT (base) [21] and IndoBERTweet [27] from Koto et al. Table 4 presents the findings of these models. Among all the monolingual IndoBERT models used in this experiment, the IndoBERT-large model by Wilie et al. [20] achieves the best micro F1-score (0.92828), Hamming loss (0.02953), and accuracy (0.76112). This can be explained by IndoBERT-large having a much larger number of parameters in its neural network architecture, almost three times more than IndoBERT-base, IndoBERT_K, and IndoBERTweet_K (Table 2). This makes IndoBERT-large more accurate in capturing the structure and semantics of the data. This result is an improvement in accuracy over the baseline Word2Vec-CNN-XGBoost model of up to 19.19%.
Besides that, we also conclude that, in general, the performance of our second model using the end-to-end IndoBERT model is significantly better than that of our first model presented in section 3.1. This indicates that using IndoBERT as an end-to-end model, by fine-tuning the initial pre-trained model on our specific task, is more effective than using it as a text embedding only. This result confirms the effectiveness of IndoBERT in directly solving various text processing tasks [21].

Besides using the monolingual BERT models, we also conducted a comparative experiment using multilingual BERT models; the results are presented in Table 5. The common multilingual models m-BERT [14], distil-BERT [25], and XLM-RoBERTa [26] were used.
Based on the findings in Table 5, m-BERT outperforms the other two multilingual pre-trained language models, with a micro F1-score of 0.90901, a Hamming loss of 0.03705, and an accuracy of 0.71768. From the findings of Tables 4 and 5, we can infer that the monolingual BERT models are still more effective than the multilingual ones. This is because a monolingual model is trained on a single language only, so the model can be more focused and accurate in learning the characteristics of the language in the training data. However, for languages in which a monolingual BERT model is not available, we argue that using a multilingual BERT model is advisable, especially m-BERT, since its results are still more effective than those of the machine learning and deep learning (CNN) models using Word2Vec displayed in Table 3.

Qualitative analysis
Based on the findings of the model evaluation conducted in the previous sections, the IndoBERT-large model gives the best performance. Therefore, we carried out a qualitative analysis using the evaluation results of that model. The qualitative analysis offers methods for assessing, analyzing, and interpreting the significant patterns in the data.
First, we analyze the results of the IndoBERT-large model using a confusion matrix. The multi-label text classification model was tested on 286 samples and the analysis was carried out for each aspect. Based on the confusion matrix in Figure 5, it can be seen that the TP and TN values are generally larger than the FP and FN values, as shown by the color difference in the confusion matrix, where the TP and TN cells generally have lighter colors than the FP and FN cells. From the confusion matrix in Figure 5, we can also obtain the accuracy and micro F1-score values for each aspect, which are shown in Table 6.
Based on Table 6, 'Wi-Fi' is the most accurately predicted aspect, with an accuracy of 0.9964 and a micro F1-score of 0.9875. This can happen because, when a review discusses the 'Wi-Fi' aspect, customers tend to explicitly mention the word 'Wi-Fi', so the model can better understand and more accurately predict that aspect. For example: "hotelnya bagus, makanan cukup, Wi-Fi kurang merata" ("the hotel is good, the food is enough, the Wi-Fi is uneven"). In that review, the 'Wi-Fi' aspect is mentioned explicitly. Furthermore, the least accurately predicted aspect is 'cleanliness', with an accuracy of 0.9257 and a micro F1-score of 0.9181. In some cases, reviews that discuss the 'cleanliness' aspect do not explicitly mention the word 'kebersihan' (cleanliness), and customers sometimes associate it with other aspects or elements. For example: "banyak nyamuk, selebihnya oke2 aja" ("lots of mosquitoes, the rest is quite good"). In that review, the 'cleanliness' aspect is not explicitly mentioned, but the review associates it with the presence of too many mosquitoes.
From Table 6, we can also see that in some cases the IndoBERT-large model still could not correctly predict all the aspects contained in a review. There are some reviews for which the model can only predict some of the aspects correctly. Based on our further analysis, 226 reviews were completely correctly classified and 60 reviews were not. The misclassifications are caused by the model failing to understand the meaning of the review. Table 7 shows two examples of reviews whose aspects could not be completely classified correctly. In test data (1), the model correctly predicts two of the three aspects contained in the review. In test data (2), the model cannot predict the only aspect contained in the review.
In test data (1), we found that the aspect 'kebersihan' (cleanliness) is not identified in the prediction results, which may happen because the model misunderstands the data: it cannot identify that the word 'kutu' (bedbugs) relates to the 'kebersihan' (cleanliness) aspect. Next, in test data (2), the 'kebersihan' (cleanliness) aspect category also cannot be detected in the prediction results. We analyzed that this is caused by the typo "kabersihan", which should be written "kebersihan"; this prevents the model from capturing the meaning of the review well.

CONCLUSION
In this study, we proposed two strategies using a monolingual pre-trained BERT language model for Indonesian (i.e., IndoBERT) for identifying aspects in a customer review dataset by performing multi-label text classification. First, we used IndoBERT as the text embedding for a CNN-XGBoost classifier. Second, we used IndoBERT as the text embedding as well as the classifier in an end-to-end model. Moreover, as part of an in-depth examination, multilingual BERT models were also exploited. According to the results of our studies, our proposed strategies significantly outperform some of the state-of-the-art baselines. The use of IndoBERT as the embedding for the CNN-XGBoost model gives an improvement over several machine learning and deep learning models, with a micro F1-score of 0.8992, a Hamming loss of 0.0404, and an accuracy of 0.7228. IndoBERT, as a contextualized pre-trained model, can give a better text representation than a context-independent word-embedding model like Word2Vec. Next, using IndoBERT as both embedding and classifier to solve multi-label text classification further significantly improves on our first model, which uses IndoBERT for text embedding only. According to the results, IndoBERT-large outperformed the other IndoBERT models, with a micro F1-score of 0.92828, a Hamming loss of 0.02953, and an accuracy of 0.76112. This approach improves the accuracy of the Word2Vec-CNN-XGBoost baseline by up to 19.19%. When we compared IndoBERT with the multilingual BERT models (m-BERT, distil-BERT, and XLM-RoBERTa), we found that monolingual BERT is slightly more accurate than multilingual BERT. Here, the best-performing monolingual model (i.e., IndoBERT-large) gains 6% higher accuracy than the best-performing multilingual model (i.e., m-BERT-base).
Some suggestions for future work include the use of alternative architectures, such as recurrent neural networks (RNNs) or other transformer-based architectures. In addition, the imbalanced label distribution in the dataset could be handled using a suitable synthetic oversampling technique.