Indonesian multilabel classification using IndoBERT embedding and MBERT classification

ABSTRACT


INTRODUCTION
Social media is internet-based media that refers to individual interactions and activities in sharing media to exchange information [1]. Rapid advances in information technology have led to increased social media activity, affecting the number of interactions on social media. Social media's openness and speed have unwittingly facilitated information dissemination [2]. The spread of information and the many interacting users are prone to triggering debates because of the ease with which open discussions form between social media users.
Social media users often use toxic comments when debating and cornering an individual or group, which frequently triggers heated fights on social media. These toxic comments use words that contain elements of hate speech, radicalism, pornography, and/or defamation. Such comments can spread easily, leaving some users vulnerable to mental disorders due to unhealthy and unfair debates on social media. With the rapid development of social media activity, keeping online interaction and communication conducive is essential for social media platforms. Building a model that automatically classifies toxic comments, such as hate speech, radicalism, pornography, and/or defamation, can make it easier to keep activities and interactions between users conducive. This can help reduce the negative impact of unhealthy debate on social media. Because the data is text, natural language processing (NLP) methods can be used to process it. NLP covers a wide variety of topics involving the computational processing and understanding of human language [3]. With advances in machine learning and deep learning, various neural networks have been widely used to solve NLP tasks. One example is using artificial neural networks (ANN), long short-term memory (LSTM), and convolutional neural networks (CNN) to analyze the sentiments conveyed by Twitter users [4], [5]. Apart from sentiment analysis, models such as CNN and LSTM, or machine learning models such as ANN and logistic regression, can also be used to classify toxic comments from social media [6].
However, neural network and deep learning models still struggle to simultaneously analyze the meaning and semantics of words in sentences. In addition, the resulting models tend to be prone to overfitting on small datasets [7]. To address this, pre-trained models can learn language representations as a whole. Bidirectional encoder representations from transformers (BERT) is one of the pre-trained models widely used for various NLP tasks. BERT uses the encoder architecture of the transformer and is designed to use a bidirectional representation [8]. An example of the application of BERT is the research conducted by D'Sa et al. [9], which classified toxic comments on Twitter. That study compared the FastText and BERT word embedding methods, classified with CNN and bidirectional LSTM (BiLSTM) models, as well as a configuration using BERT for both word embedding and classification. The optimal result was obtained using BERT for both word embedding and classification, with an F1 value of 84% in multi-class classification. Another study by Nabiilah et al. [10] also used pre-trained models with the BERT architecture, comparing several pre-trained models trained on Indonesian corpus data, namely multilingual BERT (MBERT), Indonesian BERT (IndoBERT), and Indonesian robustly optimized BERT pretraining approach (IndoRoBERTa) small. The optimal result of that study was obtained using IndoBERT, with an F1 value of 0.88978.
Previous research also aligns with recent developments in applying NLP to text classification on social media data, where pre-trained models can deliver better performance in recognizing text patterns. However, as more and more groups gain access to communication on social media, the text and language conveyed become increasingly unstructured and difficult to analyze. So, to improve the model's performance in analyzing the meaning and context of the conveyed text, the pre-trained model architecture needs to be developed further. Therefore, this research performs multilabel classification of toxic comments using BERT-based pre-trained models. The striking difference in this research is the use of different pre-trained models at different stages. In previous studies, a pre-trained model is usually used simultaneously for feature extraction and classification, or a feature extraction (word embedding) model such as FastText or GloVe is combined with a pre-trained model as the classifier. This research does not follow that approach: the feature extraction and classification processes are carried out with two different pre-trained models, IndoBERT for feature extraction and MBERT as the classification model. In this study, additional pre-processing steps were also carried out, namely translating emoticons and slang words. Because the data comes from social media, words and sentences often appear in non-standard forms that do not follow correct Indonesian spelling, making these two steps necessary.
Toxic comments are a form of spreading or expressing content that corners or harasses certain users or groups. Such comments are usually based on physical appearance, race, religion, or ethnicity [11]. Research on the classification of toxic comments conducted by Zhao et al. [12] compared several classification models such as BERT, a robustly optimized BERT pretraining approach (RoBERTa), and a cross-lingual language model (XLM). The optimal result of that study was obtained using BERT, with an F1 value of 0.8824. Other research on classifying toxic comments in Indonesian data was carried out by Azhar and Khodra [13]. The data used in that study are binary, compared across CNN, XGBoost, and BERT models. The optimal result was obtained using BERT, with an F1 value of 0.9765. In addition, Leite et al. [14] also used the BERT model, comparing monolingual BERT and multilingual BERT. The optimal result of that study was obtained using multilingual BERT, with an F1 value of 0.76.
The study conducted by Guillaume et al. [15] used a Reddit dataset to classify toxic comments with several classification models, such as HateBERT, BERTweet, and RoBERTa. The classification results of that study were an F1 value of 0.9519 using HateBERT, 0.9603 using BERTweet, and the optimal result of 0.9673 using RoBERTa. Saraiva et al. [16] also used the BR-BERT and MBERT models to classify toxic comments, obtaining an F1 value of 0.76 using BR-BERT and 0.75 using MBERT. Research conducted by Khan et al. [17] also used the MBERT model to analyze various product sentiments and user service reviews on social media, with an optimal F1 value of 81.49%.

RESEARCH METHOD
This section describes the dataset used and the proposed steps for performing multilabel classification of toxic comments by social media users in Indonesia. The proposed method uses Transformer-based pre-trained models, such as BERT, developed in several languages, especially Bahasa Indonesia. In addition, to process less structured social media data, this research also applies more complete data pre-processing with several additional steps, such as translating emoticons and translating slang words.

Dataset
The dataset used in this study consists of users' comments from various Indonesian social media platforms such as Instagram, Twitter, and Kaskus. This data was collected, processed, and labeled manually by Izzan et al. [18]. The dataset is multi-label, meaning each comment can be assigned one or more labels. Table 1 contains an example of the multi-label dataset.
− Pornography: comments that contain obscenity or sexual exploitation that violates societal norms of decency.
− Hate speech: comments based on identity sentiments concerning ancestry, race, religion, nationality, ethnicity, or specific groups.
− Radicalism: comments expressing an ideology that seeks political and social change or renewal through extreme attitudes, even violence.
− Defamation: comments attacking the honor or good name of a person or a particular group.
The dataset contains 7,773 samples. The data is then divided into training, validation, and test sets: 81% training data, 9% validation data, and 10% test data.
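As a hypothetical sketch, the 81/9/10 split described above could be done as follows; the comment texts and labels here are placeholders, not the actual dataset:

```python
import random

# Illustrative sketch: split a multi-label dataset of 7,773 comments into
# 81% train / 9% validation / 10% test, as described in the paper. Each
# example is (text, [pornography, hate_speech, radicalism, defamation]).
def split_dataset(examples, train_frac=0.81, val_frac=0.09, seed=42):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)           # shuffle before splitting
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]   # remainder (~10%) is the test set
    return train, val, test

data = [(f"comment {i}", [i % 2, 0, 0, 1]) for i in range(7773)]
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 6296 699 778
```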

Proposed method
This study uses a different approach, performing feature extraction and classification with different pre-trained models. The feature extraction stage uses the IndoBERT pre-trained model, while the classification stage uses multilingual BERT (MBERT). The focus of this research is multi-label datasets in Indonesian. Figure 1 contains the flow of the process carried out in this study.
The process carried out in this study uses pre-trained models that are developments of BERT, namely IndoBERT and MBERT. Both models have been trained on Indonesian. IndoBERT is a model specifically trained on the Indonesian language with a dataset called INDOLEM [19]. Meanwhile, MBERT is a multilingual version of BERT trained on many languages, one of which is Indonesian [20]. Both models can be used as a feature extraction model, as a classification model, or simultaneously as both. Using pre-trained models for feature extraction and classification is expected to help models better analyze and learn contextual relationships
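The two-stage design described above, one pre-trained model for feature extraction and another for classification, can be sketched as a simple pipeline. The functions below are stand-ins (not IndoBERT or MBERT themselves); only the data flow between the two stages mirrors the proposed method:

```python
import numpy as np

# Illustrative sketch (not the paper's actual code): the proposed pipeline
# separates feature extraction (IndoBERT) from classification (MBERT).
HIDDEN = 768  # BERT-base hidden size

def extract_features(texts, rng):
    # Stand-in for IndoBERT: produce one 768-d vector per input text.
    return rng.standard_normal((len(texts), HIDDEN))

def classify(features, weights, bias):
    # Stand-in for the MBERT classification head: 4 sigmoid outputs
    # (pornography, hate speech, radicalism, defamation).
    logits = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
texts = ["contoh komentar 1", "contoh komentar 2"]
feats = extract_features(texts, rng)
w = rng.standard_normal((HIDDEN, 4)) * 0.01
probs = classify(feats, w, np.zeros(4))
print(probs.shape)  # (2, 4)
```

Keeping the two stages behind separate functions makes it straightforward to swap either stage, which is exactly the flexibility the proposed method exploits.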

Preprocessing
Preprocessing is the initial stage of the text classification process. This stage aims to clean data that contains a lot of noise. The processed data is usually unstructured and contains many repeated words or characters that are unnecessary for the classification process [22]. The preprocessing stage consists of several steps, as shown in Figure 2.

IndoBERT embedding
Word embedding is the process of converting words into vectors. Traditional word embedding uses the bag-of-words or term frequency-inverse document frequency (TF-IDF) methods. Word embedding can also be done with static or contextual embeddings: Word2vec, GloVe, or fastText for static embedding, while contextual embedding uses pre-trained models such as BERT, IndoBERT, and ELMo [23]. IndoBERT is a BERT model trained using masked language modeling on an Indonesian language dataset [24]. IndoBERT has 12 hidden layers and has been trained on words from the Indonesian Wikipedia (74 million), news articles from Kompas, Tempo, and Liputan6 (55 million), as well as the Indonesian web corpus (90 million). In the embedding process, the first step is to add special tokens at the beginning and end of the sentence, namely the [CLS] and [SEP] tokens; these special tokens mark the start of the input and separate sentence pairs. Then tokenization is carried out, which converts sentences into words or tokens. The tokenization process uses the WordPiece method, which draws on the model's vocabulary and can handle words outside that vocabulary. Each pre-trained model has a fixed vector-representation dimension of 768. The tokenized words are then mapped to the vocabulary contained in the corpus. Figure 3 contains the flow of the embedding process using IndoBERT. The sentence is converted into a representation of 12 tokens, each with a 768-dimensional embedding. Before the feature extraction process, segment embedding and positional embedding are carried out so that the model can understand the meaning contextually. In segment embedding, only indices 0 and 1 occur as vector representations; if the entire input consists of only one sentence, the segment embedding is all index zero. Positional embedding applies a lookup table of size (n, 768), where n is the sentence length: the first row is the vector representation of each word in the first position, the second row represents each word in the second position, and so on. The combination of the three embedding processes is called the input embedding, which makes the pre-trained model adaptable to NLP tasks.
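The three embedding lookups described above can be illustrated numerically. The tables below are random stand-ins for IndoBERT's learned embeddings; only the shapes and the element-wise sum mirror the real process:

```python
import numpy as np

# Sketch of BERT-style input embeddings: token, segment, and position
# embeddings of size 768 are looked up and summed element-wise.
# The vocabulary and token ids here are illustrative, not IndoBERT's.
VOCAB, HIDDEN, MAX_LEN = 100, 768, 12

rng = np.random.default_rng(0)
token_table = rng.standard_normal((VOCAB, HIDDEN))
segment_table = rng.standard_normal((2, HIDDEN))         # only indices 0 and 1
position_table = rng.standard_normal((MAX_LEN, HIDDEN))  # lookup table (n, 768)

# [CLS] tok tok tok [SEP] as made-up ids; single sentence -> all segment 0
token_ids = [1, 7, 8, 9, 2]
segments = [0] * len(token_ids)
positions = list(range(len(token_ids)))

input_emb = (token_table[token_ids]
             + segment_table[segments]
             + position_table[positions])
print(input_emb.shape)  # (5, 768)
```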

MBERT classification
Multilingual BERT (MBERT) is a development of the monolingual BERT model, trained on monolingual corpora in 104 languages. MBERT can be fine-tuned on training data from one language and evaluated on a different language, making cross-lingual use possible; MBERT can even perform cross-lingual generalization tasks well [25]. Since the MBERT model has also been trained on Indonesian, it can be used for tasks in Indonesian.

RESULT AND DISCUSSION
The experiment was carried out using the PyTorch library with the Python programming language. Because the models are pre-trained, the experiments were run on the Google Colab Pro platform to meet the large resource and memory-allocation requirements. The number of epochs used in this study was 5, the batch size was 16, and the learning rate was 5e-5. All models trained in this study use the same number of epochs, batch size, and learning rate. The model classification results are shown in Figure 4, which consists of two types of classification results. Experiments on the training data show that every model continues to learn at each epoch, so the resulting classification scores keep increasing; the IndoBERT model even reaches a high F1 value of 0.976. Meanwhile, when tested on the validation data, the resulting F1 values fluctuate, but within a narrow range. In addition, all models tend to have an F1 value below 0.90, with IndoBERT_MBERT achieving the highest F1 value of 0.89. Across the training and validation processes, all models tend to have a stable F1 value, so overfitting can likely be avoided. The final results of the three models are shown in Table 3, which contains the model evaluation results on the test data. The proposed model in this study also has a better F1 value than the previous study conducted by [10], which reported an F1 value of 0.889 on test data. This study also uses the same dataset as the research conducted by [10].
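A minimal training-loop sketch with the hyperparameters reported above (5 epochs, batch size 16, learning rate 5e-5). The random 768-d features and the linear head below are placeholders for the IndoBERT/MBERT pipeline, not the actual models:

```python
import torch
from torch import nn

torch.manual_seed(0)
features = torch.randn(64, 768)                # placeholder embeddings
labels = torch.randint(0, 2, (64, 4)).float()  # 4 binary toxic labels

head = nn.Linear(768, 4)
optimizer = torch.optim.AdamW(head.parameters(), lr=5e-5)
loss_fn = nn.BCEWithLogitsLoss()  # standard choice for multilabel targets

for epoch in range(5):                          # 5 epochs
    for i in range(0, len(features), 16):       # batch size 16
        xb, yb = features[i:i + 16], labels[i:i + 16]
        optimizer.zero_grad()
        loss = loss_fn(head(xb), yb)
        loss.backward()
        optimizer.step()
print("final batch loss:", round(loss.item(), 4))
```

`BCEWithLogitsLoss` applies a sigmoid per label, which matches the multilabel setting where each comment can belong to several classes at once.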

CONCLUSION
Based on the experiments carried out, the proposed model can improve the ability to classify toxic comments from social media users in Indonesia. This is made possible by combining pre-trained models trained on Indonesian, such as IndoBERT for feature extraction and MBERT as the classification model. The proposed model also tends to have a stable F1 value and does not experience overfitting, compared to using a single pre-trained model for both feature extraction and classification. For further exploration, future research can apply combinations of pre-trained models such as IndoELECTRA and IndoRoBERTa for feature extraction with machine learning models such as k-nearest neighbors (KNN) and support vector machine (SVM), or deep learning models such as CNN and LSTM.


ISSN: 2088-8708, Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 1071-1078

between words. So, the model can better determine the pattern for classifying comments. The evaluation stage is carried out by calculating the model's classification performance in the form of an F1 score, derived from the counts of true positives, true negatives, false negatives, and false positives [21].
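As a small illustration, the F1 score can be computed directly from the confusion-matrix counts just mentioned (the counts below are made up for the example):

```python
# F1 score from confusion-matrix counts: the harmonic mean of
# precision (tp / (tp + fp)) and recall (tp / (tp + fn)).
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=90, fp=10, fn=10))  # 0.9
```

Note that true negatives do not enter the F1 formula itself, which is why F1 is preferred over accuracy on imbalanced multilabel data.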

− Noise removal: removes characters such as numbers, punctuation marks, or excess spaces.
− Case folding: changes capital letters to lowercase.
− Translating emoticons: translates emoticons into words.
− Tokenizing: splits sentences word by word.
− Translating slang words: changes non-standard words into standard words according to Indonesian spelling.
− Stemming: removes affixes so that the resulting words are base words.
− Stopword removal: removes words that appear frequently in sentences and do not have specific meanings.
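A minimal sketch of these steps in Python, assuming tiny stand-in emoticon, slang, and stopword dictionaries (a real system would use full Indonesian resources; stemming would need an Indonesian stemmer such as Sastrawi and is omitted here):

```python
import re

# Stand-in dictionaries for illustration only.
EMOTICONS = {":)": "senang", ":(": "sedih"}
SLANG = {"gk": "tidak", "bgt": "banget"}
STOPWORDS = {"yang", "dan", "di"}

def preprocess(text):
    for emo, word in EMOTICONS.items():          # translate emoticons first,
        text = text.replace(emo, " " + word + " ")  # before noise removal strips them
    text = re.sub(r"[^a-zA-Z\s]", " ", text)     # noise removal
    text = text.lower()                          # case folding
    tokens = text.split()                        # tokenizing
    tokens = [SLANG.get(t, t) for t in tokens]   # translate slang words
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return tokens

print(preprocess("Komentar yang gk bagus bgt!!! :("))
# ['komentar', 'tidak', 'bagus', 'banget', 'sedih']
```

Note the ordering: emoticons are translated before noise removal, since the regex would otherwise delete their punctuation characters.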

Figure 3. IndoBERT embedding process

Figure 4(a) contains the F1-score of the training data and Figure 4(b) contains the F1-score of the validation data.

Figure 4. Comparison of model classification results on (a) training data and (b) validation data

Table 1. Example of the dataset

Table 2 contains the number of data divisions.

Table 3. Evaluation results on testing data

From the results of the classification models on the test data, the proposed model in this study, namely the combination of IndoBERT feature extraction and the MBERT classification model, obtained the optimal result with an F1 value of 0.9032. The proposed model also tends to have a stable F1 value across the test, validation, and training data. From the resulting values, the proposed model is more stable and does not experience overfitting, because the difference between its F1 values on the training and test data is very small, around 0.01. In contrast, for the MBERT and IndoBERT models, the differences between the F1 values on the test and training data are 0.04 and 0.07, so those models tend to experience mild overfitting.