Text classification model for methamphetamine-related tweets in Southeast Asia using dual data preprocessing techniques

Received Aug 28, 2020; Revised Jan 4, 2021; Accepted Jan 18, 2021

Methamphetamine addiction is a prominent problem in Southeast Asia. Drug addicts often discuss illegal activities on popular social networking services, and some spread messages on social media as a means of buying and selling drugs online. This paper proposes a model, the “text classification model of methamphetamine tweets in Southeast Asia” (TMTA), to identify whether a tweet from Southeast Asia is related to methamphetamine abuse. The research addresses the weakness of bag of words (BoW) by introducing the BoW and Word2Vec feature selection (BWF) technique, a domain-based feature selection method performed using the BoW dataset and Word2Vec. The BWF dataset contained fewer features than the BoW and TF-IDF datasets. We experimented with three candidate classifiers: support vector machine (SVM), decision tree (J48) and naive Bayes (NB). We found that the J48 classifier with the BWF dataset provided the best performance for the TMTA in terms of accuracy (0.815), F-measure (0.818), Kappa (0.528) and Matthews correlation coefficient (0.529), together with a high area under the ROC curve (0.763). Moreover, the J48 classifier had its lowest runtime (3.480 seconds) with the BWF dataset.


INTRODUCTION
Southeast Asia is considered a centre of methamphetamine production because of many related arrests, which continue to rise annually after increasing four-fold from 1998 to 2014 [1]. Drug addicts often talk about activities related to methamphetamine on popular social networking services. Some tweets are published on social media for the purposes of buying and selling drugs online. However, little research has examined the development of text classification models for tweets relating to methamphetamine [2]. This study's objective is to propose a new data preprocessing technique for methamphetamine-related tweets in Southeast Asia.
For this purpose, we have introduced a model called the "text classification model of methamphetamine tweets in Southeast Asia using dual data preprocessing techniques" (TMTA). A critical process in the development of the TMTA was data preprocessing using the bag-of-words (BoW) model, a basic, classical and straightforward technique that is popular for data preprocessing in text classification. This method treats the frequency of each word as a classification feature, known as one-hot representation. Each word is represented by a sparse vector consisting of its index and frequency [3,4]. As features may potentially run indicating the importance of the word. TF-IDF determines the weight of a word (w) that appears in a document (d), as shown in (1) [10,11].
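The weighting idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the common variant tf(w, d) × log(N / df(w)), which may differ from the paper's equation (1), and the function name `tf_idf` and the toy tweets are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf(w, d) * log(N / df(w)); the paper's (1) may differ.
    """
    n_docs = len(docs)
    # df(w): number of documents that contain w at least once
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

# Hypothetical toy tweets, already tokenized
docs = [["meth", "lab", "police"], ["meth", "ice"], ["meth", "police", "report"]]
weights = tf_idf(docs)
```

In this toy corpus "meth" occurs in every tweet and so receives weight zero, while the rarer "ice" receives the largest weight.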
Tomas Mikolov developed Word2Vec as a tool for NLP. This tool employs shallow neural networks that learn word associations from a corpus and is used to create a pre-trained word embedding model. Word2Vec offers two algorithms: the Skip-gram model and continuous bag-of-words (CBOW). Both represent each word as a numeric vector. Synonymous words can be found by computing the cosine similarity between two word vectors [12].
Cosine similarity is a statistical technique used to measure the similarity between two documents (d1, d2) represented by numeric vectors in the projection space. A cosine similarity value closer to one suggests similar documents; a value closer to zero suggests dissimilar ones. Cosine similarity is calculated as shown in (2) [13].
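For reference, the standard definition of cosine similarity is given below. The symbols are assumptions made for illustration (the two document vectors are written d_1 and d_2, each with k components) and may differ from the notation of the paper's equation (2).

```latex
\cos(d_1, d_2)
  = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert}
  = \frac{\sum_{i=1}^{k} d_{1,i}\, d_{2,i}}
         {\sqrt{\sum_{i=1}^{k} d_{1,i}^{2}} \; \sqrt{\sum_{i=1}^{k} d_{2,i}^{2}}}
```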
Data classification is the process of creating machine learning models that capture the relationship between the features and classes of a dataset. Popular data classification algorithms are SVM [14], J48 [15] and NB [16]. SVM is a classification algorithm designed for binary-class problems. SVM classifiers create a decision boundary, a hyperplane that divides the data into two classes in the feature space, using a non-probabilistic binary linear function. The function determines the decision boundary that maximizes the margin between the support vectors. The functions defining the decision boundary can also be polynomial or radial basis functions. The advantage of the SVM classifier is that it is less prone to overfitting, a problem in which the model memorizes too much of the training set and therefore cannot classify the test dataset well [14]. In comparison, J48 is a Decision Tree classification algorithm. J48 classifiers select the feature with the highest information gain value, which is then used as the root node of the tree. The model is created using a top-down greedy search that selects features starting from the root node. The J48 classifier is suitable for large datasets because of its lower runtime [15]. Finally, NB classifiers use a conditional probability calculation. P(A | B) is the conditional probability that event A occurs given that event B has occurred. P(A ∩ B) is the joint probability, that is, the probability that event A and event B both occur. P(B) is the probability that event B occurs. The NB classifier makes it easy to train models on datasets with a large number of features, such as text datasets. The conditional probability calculation is shown in (3) [16].
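Using the three quantities defined above, the conditional probability takes its standard form. This is a reconstruction from those definitions and assumes P(B) > 0; the layout of the paper's equation (3) may differ.

```latex
P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0
```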
Performance measurements assess the quality of text classification models. This process may lead to revising the model and repeating the text mining process until the model is sufficiently accurate. Accuracy is the number of correct classifications over all classes divided by the total number of instances, as shown in (4) [17].
F-measure is a single value that combines precision and recall as their harmonic mean, as shown in (5) [18].
AUC is the area under the receiver operating characteristic (ROC) curve, a 2D graph whose x-axis represents the false positive (FP) rate and whose y-axis represents the true positive (TP) rate, as shown in (6) [19]. The Kappa coefficient is a statistic used to examine the consistency of classification results between two classes. It is a nonparametric statistic, so the dataset used in the experiment does not have to follow a normal distribution. Po is the observed probability of agreement, and Pe is the hypothetical expected probability of agreement, as shown in (7) [20].
MCC is a measure of the quality of classification results that is used with two-class datasets. The MCC value reflects the balance of the classification results; it lies between -1 and +1 and is calculated using TP, TN, FP and FN, as shown in (8) [21,22].
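For reference, the standard forms of accuracy, F-measure, Kappa and MCC for a two-class problem are written out below in terms of TP, TN, FP and FN. The notation is an assumption and may differ from the paper's equations (4), (5), (7) and (8); N denotes the total number of instances, and the expression for Pe is the usual chance-agreement term for two classes.

```latex
\begin{align*}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\[4pt]
\text{Precision} &= \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN} \\[4pt]
\text{F-measure} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \\[4pt]
\kappa &= \frac{P_o - P_e}{1 - P_e}, \qquad
P_e = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{N^2} \\[4pt]
\text{MCC} &= \frac{TP \cdot TN - FP \cdot FN}
                  {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align*}
```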
Runtime performance is calculated from the 3 components of the actual working time: train time, test time and model time [23].

PROPOSED ALGORITHM
The BWF algorithm was a domain-based feature selection technique performed using the BoW dataset and Word2Vec. This algorithm filtered the features of the BoW dataset to produce a new dataset for the creation of a text classification model. The advantage of this algorithm was that it created a BWF dataset smaller than the BoW dataset. The BWF algorithm included two steps. The first step involved creating the BoW dataset, a set of instances (bow) ranging from instance 1 to instance n, as shown in (9).
W was the set of features in the BoW dataset, ranging from feature 1 to feature w, as shown in (10).
The second step involved a domain-based feature selection technique, performed using BoW and Word2Vec. The domain-based feature selection technique used three steps.

First, Word2Vec was used to produce a pre-trained word embedding model from the methamphetamine tweet dataset. We used the Skip-gram model to generate the pre-trained word embedding model. Tomas Mikolov recommended this algorithm because it performs better for infrequent words, which here consisted of technical terms, slang names and synonyms. The Skip-gram model therefore produced higher-quality vectors for infrequent words than CBOW [12]. The pre-trained word embedding model consisted of 100-dimensional features represented by numeric vectors. We chose 100 dimensions to limit the runtime needed to create the pre-trained word embedding model from a large corpus.

Second, the set of domain-based features (SDBF) was created by measuring the cosine similarity between the domain keyword and the other words in the pre-trained word embedding model. Our research used the keyword "methamphetamine", the common name of the drug.
The SDBF was sorted by descending cosine similarity. If the cosine similarity was equal to or greater than 0.8, those features were selected for inclusion as filter features of the BoW dataset. The SDBF contained the set of features starting from feature 1 to feature w', as shown in (11).
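The selection rule described above can be sketched as follows. This is a minimal illustration, not the authors' R implementation: the function names `cosine` and `select_sdbf` and the 3-dimensional toy embeddings are hypothetical (the paper uses 100-dimensional Word2Vec vectors), and only the 0.8 threshold and the keyword "methamphetamine" come from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_sdbf(embeddings, keyword, threshold=0.8):
    """Return [(word, similarity)] with similarity >= threshold, sorted descending."""
    key_vec = embeddings[keyword]
    scored = [(w, cosine(vec, key_vec)) for w, vec in embeddings.items()]
    kept = [(w, s) for w, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real vectors would come from the trained model
embeddings = {
    "methamphetamine": [1.0, 0.0, 0.0],
    "meth": [0.9, 0.1, 0.0],
    "ice": [0.8, 0.5, 0.0],
    "weather": [0.0, 1.0, 0.0],
}
sdbf = select_sdbf(embeddings, "methamphetamine")
```

In this sketch the keyword itself is kept in the result, since its similarity to itself is 1; whether the original implementation does the same is not stated in the text.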
Third, the BoW dataset was filtered to keep only the features in the SDBF. Next, the summed frequency of each instance in the filtered dataset was computed. If the summed frequency of an instance was equal to zero, that instance was deleted from the BoW dataset. We implemented the BWF algorithm in R [24]. The proposed data preprocessing technique consisted of the BWF algorithm, as shown in Figure 1.

As shown in Figure 1, the result of the BWF algorithm was a new dataset, called the "BWF dataset", which used the same text representation as the BoW dataset. This dataset was used for text classification, with word frequency serving as the feature for training the classifier algorithm. However, the BWF dataset had fewer features and instances than the BoW dataset. The BWF dataset contained the set of vectorizations (bow'), ranging from instance 1 to instance m, as shown in (12).

Proof: Let w be the number of features in BoW. Let SDBF be the set of features derived from the cosine similarity using the threshold of 0.8, and let w' be the number of features in SDBF. The BWF dataset is derived from BoW by keeping only the features in SDBF. Thus, the number of features in the BWF dataset must be at most w'. Moreover, the BWF dataset is produced by removing the instances of BoW in which the sum of all feature frequencies is equal to 0. Therefore, the number of instances in the BWF dataset must be at most that of BoW.
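The filtering step can be sketched in Python as below. The authors implemented BWF in R; Python is used here purely for illustration, and the function name `build_bwf`, the list-of-lists matrix layout and the toy data are assumptions.

```python
def build_bwf(bow_rows, vocabulary, sdbf):
    """Filter a BoW matrix to the SDBF features and drop all-zero instances.

    bow_rows: list of frequency lists, one per instance, aligned with vocabulary.
    sdbf: set of feature names selected by cosine similarity.
    """
    keep = [i for i, word in enumerate(vocabulary) if word in sdbf]
    features = [vocabulary[i] for i in keep]
    filtered = [[row[i] for i in keep] for row in bow_rows]
    # Delete instances whose summed frequency over the kept features is zero
    bwf_rows = [row for row in filtered if sum(row) > 0]
    return features, bwf_rows

# Hypothetical toy BoW dataset with four features and three instances
vocabulary = ["meth", "ice", "weather", "police"]
bow_rows = [
    [1, 0, 0, 1],
    [0, 0, 1, 0],  # contains no SDBF feature, so it will be removed
    [0, 2, 1, 0],
]
features, bwf_rows = build_bwf(bow_rows, vocabulary, {"meth", "ice"})
```

The result has at most as many features as the SDBF and at most as many instances as the BoW dataset, in line with the proof above. In practice the class labels would need to be filtered with the same row mask.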

RESEARCH METHOD
This research had two objectives. The first was the development of the "BWF" dataset. The second was the development of the TMTA, which consisted of the following steps: tweet collection, data preprocessing, classification, performance testing and hypothesis testing, as shown in the overview of the research framework in Figure 2.

Tweet collection

Synonym identification
This procedure involved the identification of keywords related to methamphetamine consisting of the common name, slang name and street name. These were collected and identified by the UK police [25]. In addition, we used the common name of methamphetamine to measure cosine similarity with Google News vectors [26] to look for additional slang names that had not been collected and identified by the UK police.

Tweet retrieval
Tweet retrieval was the selection of short texts on Twitter related to methamphetamine that were posted by users in Southeast Asia, specifically Thailand, Indonesia and Myanmar.

Tweet labeling
Tweets were labeled by an expert from the Royal Thai Police Forensics Office into two classes: non-abuse or abuse. Non-abuse tweets mentioned the penalty for using methamphetamine or its use as a medicine. The abuse class contained tweets about the illegal use of methamphetamine, including tweets promoting its use, such as encouraging substance abuse to reduce obesity.

Methamphetamine tweet dataset (MTD)
We collected 2,899 tweets from online social media related to methamphetamine in Southeast Asia that an expert from the Royal Thai Police Forensics Office subsequently labeled. These data were divided into two classes: 2,170 instances of non-abuse and 729 instances of abuse, for a total of 23,175 words. The output of this step was MTD, whose properties are shown in Table 1.

Data preprocessing
This process consisted of corpus preparation, text representation and BWF.

Corpus preparation
Corpus preparation included stop word elimination and stemming. Stop word elimination removed words that were not important and did not need to be analyzed further. It consisted of converting all words to lowercase and removing markers, tabs, full stops and stop words such as "on", "in", "to" and "the". Stemming reduced words that shared the same stem but were written differently, such as "eat" and "eating", to a common form. Stemming reduced the number of features of the methamphetamine dataset [27].
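The corpus preparation steps can be sketched as follows. This is a toy illustration under stated assumptions: the stop word list contains only the four examples named above, and `naive_stem` is a hypothetical suffix stripper standing in for whatever stemmer the authors used, which the text does not name.

```python
import re

STOP_WORDS = {"on", "in", "to", "the"}  # only the examples named in the text
SUFFIXES = ("ing", "ed", "s")           # toy stemmer rules, an assumption

def naive_stem(word):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def prepare(tweet):
    """Lowercase, strip non-letters, drop stop words, then stem."""
    text = tweet.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits, symbols
    tokens = text.split()                  # also collapses tabs and extra spaces
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = prepare("Eating in the LAB,\tsmoking to the police!")
```

With these rules "eat" and "eating" map to the same token, which is how stemming reduces the number of distinct features.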

Text representation
Text representation was the part of NLP that converted text to vectors. Vectorization created a set of numeric vectors representing the tweets, which were used to create a text classification model on which the classifier could operate. We used three data preprocessing techniques, BoW, TF-IDF and BWF, with BoW, a popular text vectorization model, as the baseline. If a word appeared in a tweet, its frequency was counted as 1; otherwise, it was counted as 0 [3,4]. The TF-IDF algorithm, a data preprocessing technique that replaced the text with weight values, calculated the importance of each word used as a feature for each tweet. We determined that an important feature should not appear in every tweet. The TF-IDF method is widely used in text mining research [10,11]. BWF is the new data preprocessing technique that our research proposed. This algorithm performed domain-based feature selection on the BoW dataset.
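The presence-based BoW representation described above (1 if a word appears in a tweet, 0 otherwise) can be sketched as follows. The function name `binary_bow` and the toy tweets are hypothetical, and the alphabetical ordering of the vocabulary is a choice made only to keep the example deterministic.

```python
def binary_bow(tokenized_tweets):
    """Return (vocabulary, rows) where rows[i][j] is 1 if word j occurs in tweet i."""
    vocabulary = sorted({w for tweet in tokenized_tweets for w in tweet})
    index = {w: i for i, w in enumerate(vocabulary)}
    rows = []
    for tweet in tokenized_tweets:
        row = [0] * len(vocabulary)
        for w in tweet:
            row[index[w]] = 1  # presence only, repeated words still give 1
        rows.append(row)
    return vocabulary, rows

# Hypothetical tokenized tweets
tweets = [["meth", "lab", "meth"], ["ice", "police"]]
vocabulary, rows = binary_bow(tweets)
```

Each row has one column per vocabulary word, so the vectors are sparse and their length grows with the size of the corpus vocabulary.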

Classification
Classification was the process of creating text classification models. In this study, the classification algorithms SVM [14], J48 [15] and NB [16], classifiers found in the Weka software, were used to create the text classification models. The Weka version 3.9 program, which is open source and widely used in research for this purpose, was used to develop the text classification models [28,29].

Performance testing
We used 10-fold cross-validation to measure TMTA performance using various metrics: accuracy [17], F-measure [18], AUC [19], Kappa [20], MCC [21,22] and runtime [23]. The 10-fold cross-validation technique is a popular method to obtain reliable test results because all data points are used for both training and validation, and each data point is tested exactly once [28].
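The fold structure can be sketched in plain Python. This is an illustration of the general technique only: the function name `k_fold_indices` and the fixed seed are assumptions, and Weka's own cross-validation may differ in detail (for example, by stratifying folds by class).

```python
import random

def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# 2,446 is the number of instances in the BWF dataset
splits = list(k_fold_indices(2446))
```

Each of the ten iterations trains on nine folds and tests on the remaining one, so every instance contributes to the test results exactly once.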

Hypothesis testing
The Wilcoxon rank sum test was used to compare five performance measurements (accuracy, F-measure, AUC, Kappa and MCC) between the proposed model and the candidate models, and to determine whether the differences were significant at a significance level of 0.05 [30,31].

RESULTS AND DISCUSSION
This section describes and discusses the experimental results. It includes four subsections, presented according to the two objectives: the characteristics of the BWF dataset, information gain, classification performance and hypothesis testing.

Characteristics of BWF dataset
The feature reduction performance of the BWF algorithm was compared with two popular techniques: BoW and TF-IDF. As Table 2 shows, the BWF dataset had fewer features (969) and instances (2,446) than the BoW and TF-IDF datasets. The BWF algorithm was highly efficient at feature reduction: the BWF dataset included 969 features out of the total 23,175 features in the methamphetamine tweet dataset. Those features were obtained by filtering the BoW dataset using the SDBF. Therefore, the BWF algorithm was effective at handling words semantically associated with methamphetamine, such as its slang names or synonyms. This approach differs from BoW, which reduces features by removing infrequent words.

Information gain
Information gain was applied to measure the quality of the features used to create a Decision Tree. The information gain tests for the BWF dataset identified several important features, including "meth", "lab", "crystal", "ice", "smoke", "police", "news", "report", "sexy" and "fat". The words "meth" and "lab" were important features in the BWF dataset because they were used in tweets that mentioned laboratory-produced methamphetamine. The words "crystal" and "ice" are slang names for methamphetamine; both had high information gain, indicating their potential for predicting classes using the Decision Tree. Table 3 shows the ten most important features together with their information gain, which was used to test the feature quality of the BWF dataset. Features with high information gain had strong power in separating the classes in the Decision Tree. Information gain showed that "news", "police" and "report" were important features of the non-abuse class tweets; in contrast, "fat" and "sexy" were features of the abuse class tweets.
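Information gain for a single feature can be sketched as the drop in class entropy after splitting on that feature. This is the textbook definition and not Weka's implementation; the function names and the four-tweet toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG = H(class) - sum over values v of P(v) * H(class | feature = v)."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: four tweets, two per class
labels = ["abuse", "abuse", "non-abuse", "non-abuse"]
has_fat = [1, 1, 0, 0]   # separates the two classes perfectly
has_meth = [1, 1, 1, 1]  # appears in every tweet, so it carries no information

ig_fat = information_gain(has_fat, labels)
ig_meth = information_gain(has_meth, labels)
```

A feature that splits the classes cleanly reaches the maximum gain of one bit in this balanced toy example, whereas a feature present in every tweet has zero gain.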

Classification performance
The classification performance comparisons of the three preprocessing techniques used to produce the BoW, TF-IDF and BWF datasets are shown in Tables 4, 5 and 6. First, the SVM classifier with the BoW dataset had the highest accuracy (0.813), F-measure (0.803) and MCC (0.465). However, this classifier used with the BWF dataset had the highest AUC (0.720) and Kappa (0.461). Moreover, the BWF dataset had the lowest runtime (0.820 seconds) with the SVM classifier. Table 4 displays the classification performance comparisons of the three preprocessing techniques combined with SVM.
The decision tree using the J48 classifier with the BWF dataset had the highest scores across the three datasets in all measures, including accuracy (0.815), F-measure (0.818), AUC (0.763), Kappa (0.528) and MCC (0.529), and the lowest runtime (3.480 seconds). Table 5 presents the classification performance comparisons for this classifier.

The NB classifier with the BoW dataset had the highest accuracy (0.795), F-measure (0.789), Kappa (0.428) and MCC (0.430). However, when combined with the BWF dataset, this classifier had the highest AUC (0.819) and the lowest runtime (0.400 seconds). The classification performance comparisons are shown in Table 6.

Based on the classification performance comparisons in Tables 4, 5 and 6, the proposed model that combined the J48 classifier with the BWF dataset showed the best performance for the TMTA on four measures: accuracy, F-measure, Kappa and MCC. In comparison, the NB classifier with the BWF dataset provided the highest AUC (0.819) and the lowest runtime (0.400 seconds).

The highest accuracy reflects the correctness of the data classification using this model. The BWF dataset included 1,827 instances of non-abuse tweets and 619 instances of abuse tweets, of which the model correctly predicted 1,565 non-abuse tweets and 428 abuse tweets. Additionally, this model provided the highest F-measure value, showing that it accurately classified the class of interest, the abuse tweets. The AUC of J48 with the BWF dataset was relatively high (0.763), as shown in Table 5, indicating that the classification results of J48 with the BWF dataset had high true positive values.
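The reported accuracy, Kappa and MCC of J48 with the BWF dataset can be cross-checked from the counts given above. The sketch below is an illustration rather than the authors' procedure: it treats abuse as the positive class (an assumption) and uses the standard two-class formulas.

```python
import math

# Counts from the text: 1,827 non-abuse and 619 abuse instances,
# of which 1,565 and 428 were classified correctly. Abuse = positive class.
tp, fn = 428, 619 - 428
tn, fp = 1565, 1827 - 1565
n = tp + tn + fp + fn

accuracy = (tp + tn) / n
p_o = accuracy  # observed agreement
p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2  # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(round(accuracy, 3), round(kappa, 3), round(mcc, 3))
```

Rounded to three decimals, these values agree with the reported accuracy (0.815), Kappa (0.528) and MCC (0.529).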
The findings revealed that J48 with the BWF dataset classified the abuse class well (here, tweets inviting others to consume methamphetamine). Table 5 shows that the model generated using J48 with the BWF dataset had the highest Kappa and MCC values, suggesting high consistency in classification between the two classes (abuse or non-abuse).
The BWF dataset suited the J48 classifier because the features in the BWF dataset were semantically similar to the keyword "methamphetamine". Table 3 shows the features that had high information gain. Those features therefore served as split conditions in the Decision Tree built by the J48 classifier.

Hypothesis testing
As depicted in Table 7, the Wilcoxon rank sum test, conducted at a significance level of 0.05, compared the accuracy, F-measure, AUC, Kappa and MCC of the proposed model, based on the J48 classifier using the BWF dataset, with those of the eight candidate models. The five performance measurements of the proposed model were significantly higher than those of six candidate models, with a P-value of 0.043, that is, at a statistical confidence level of 95 percent. However, J48 with the BWF dataset yielded performance measurements that were not significantly higher than those of NB using the BoW and BWF datasets, with a P-value of 0.225. On this basis, the proposed model was presented as the TMTA.
Therefore, the J48 classifier using the BWF dataset was used in developing the TMTA because this model provided the highest values for four performance measurements (accuracy, F-measure, Kappa and MCC) and a low runtime, as shown in Table 5. Furthermore, this model provided significantly higher performance measurements than six candidate models, as shown in Table 7.
Previous research created text classification models using tweet data based on SVM, J48 and NB classifiers. Although SVM with TF-IDF is still widely used for the development of text classification models [6][7][8][9], we found that the TMTA, using J48 with the BWF dataset, provided higher values for performance measurements than SVM with TF-IDF. In particular, the TMTA using J48 with the BWF dataset had a lower runtime than such widely used techniques as BoW and TF-IDF.

CONCLUSION
We proposed a new model, called the TMTA, to identify whether a tweet was related to methamphetamine use or abuse based on data extracted from Twitter in Southeast Asia. A vital process in the TMTA is data preprocessing. This research addressed the weakness of BoW in terms of feature selection using the BoW dataset and Word2Vec. A novel data preprocessing technique, the BWF algorithm, used the same text vectorization method as the BoW dataset; however, it then applied feature selection to the BoW dataset to produce a BWF dataset. This approach resulted in fewer features than such widely used techniques as BoW and TF-IDF. The new dataset was used as the dataset for the TMTA. The development of the TMTA consisted of four steps. First, we collected data with keywords related to methamphetamine from the Twitter data stream. Second, data preprocessing techniques were applied, including corpus preparation and text representation consisting of BoW, TF-IDF and BWF. Third, we experimented with three candidate classifiers, SVM, J48 and NB, and proposed a text classification model. Lastly, we compared the performance of the text classification models created from these three classifiers using the three data preprocessing techniques. The performance measurements included accuracy, F-measure, AUC, Kappa, MCC and runtime. The TMTA was developed using the J48 classifier with the BWF dataset. This model produced the highest values for accuracy (0.815), F-measure (0.818), Kappa (0.528) and MCC (0.529), together with a high AUC (0.763) and a low runtime (3.480 seconds). These results showed that the proposed TMTA was well suited to the Twitter dataset collected in this study. The TMTA using J48 with the BWF dataset provided higher performance measurements than such traditional techniques as SVM with TF-IDF.
Consequently, the TMTA using the J48 classifier could be converted to an if-then rule-based decision tree. These rules might be implemented in prototype software to help the police of the narcotics control board identify short messages related to drug abuse.
The BWF algorithm can be used for data preparation in the development of text classification models for other domains, such as amphetamine use in Thailand or illegal advertisements for nutritional supplements. Police have found tens of thousands of amphetamine networks on social media that have lured juveniles into becoming members who distribute amphetamine. These networks offered promotions that were paid after drug transactions. Furthermore, illegal advertisements for nutritional supplements are a problem in Thailand, where such products are widely sold through social media. Most products (e.g. sexual enhancement products for men) exaggerate their properties. Both problems could be addressed in future investigations.