Complete agglomerative hierarchical document clustering based on Fuzzy Luhn's Gibbs latent Dirichlet allocation

ABSTRACT
Clustering groups documents with similar content without prior label information. This research combines Agglomerative Hierarchical Clustering (AHC) with a topic-based term weighting method, Fuzzy Luhn's Gibbs Latent Dirichlet Allocation, to cluster Indonesian news documents. Terms are selected using Luhn's Idea with upper and lower cut-offs, term features are extracted with Gibbs Sampling LDA initialized by term frequency, and the sampling process is accelerated with fuzzy Sugeno logic. The resulting feature values drive Single Link, Complete Link, and Average Link AHC, evaluated against a gold standard of five news categories with perplexity, Precision, Recall, F-Score, Normalized Mutual Information, and Adjusted Rand Index. The results show that Complete Link AHC with Fuzzy Luhn's Gibbs LDA and the lower cut-off produces the most consistent metric values and outperforms the Fuzzy Gibbs LDA baseline.


INTRODUCTION
Clustering is one of the tasks in data mining used to analyze large amounts of data; it can reveal hidden information that is very useful for decision making. Clustering is an unsupervised classification technique that groups data with similarities into a cluster [1]. Clustering techniques include partitioning, hierarchical, grid-based, and density-based methods [2]. These methods have been used in various applications, such as K-Means for SME risk analysis documents [3], K-Prototype for clustering big data based on MapReduce [4], K-Means for earthquake cluster analysis [5], Ward's linkage method for classifying languages [6], grid- and density-based methods for trajectory clustering [7], and DBSCAN for categorizing districts [8]. The hierarchical clustering method uses a tree concept and is divided into agglomerative and divisive approaches [9]. Agglomerative clustering is known as the bottom-up method, while divisive clustering is known as the top-down method [10]. In general, agglomerative algorithms have three variants: single link, complete link, and average link [11]. These algorithms differ in how they determine the distance between the data to be merged. The data to be merged can take diverse forms, such as text, images, sound, or video. Text data is processed first through several steps, such as tokenization, filtering, and lemmatization or stemming [12].
The results of text processing are used to generate index terms, a vocabulary extracted from the collection of texts, and to determine a weight for each term [13]. The terms and their weights are then used to determine the distance between the data to be merged in the agglomerative algorithm. There are several term weighting methods in text processing [14]. For the vector space model, Term Frequency-Inverse Document Frequency (TF-IDF) is commonly used [15], [16]. To overcome the weakness of TF-IDF in addressing synonymy and polysemy in natural language, Latent Semantic Indexing (LSI) was developed.
Research on text clustering with the Agglomerative Hierarchical Clustering (AHC) algorithm has been conducted with those term weighting schemes. The AHC algorithm with TF-IDF has been used to cluster web pages [17], construct taxonomies from a corpus of text documents [18], construct a multi-keyword ranked search scheme [19], perform context-aware document clustering [20], and build taxonomies automatically from keywords [21]. The AHC algorithm has also been combined with LSI for document clustering [22], clustering of news articles [23], and information retrieval [24]. The weakness of LSI is overcome by a topic-based term weighting method called Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model of a corpus, in which documents are represented as random mixtures over latent topics, and each topic is characterized by a distribution over words [25]. A document in a corpus is not identified with a single topic only, but can be identified with several topics, each with its own probability [26]-[28].
LDA has been extended with a hierarchy, known as hLDA [29], but this method is not able to capture the hierarchical relationships that are formed [30]. Therefore, research is needed to cluster documents hierarchically by using a hierarchical clustering method with LDA for term weighting. Research that integrates LDA into hierarchical clustering, especially agglomerative clustering, has already been done. X. Li, H. Wang, G. Yin, T. Wang, C. Yang, Y. Yu, D. Tang [31] used LDA for inducing a taxonomy from tags based on word clustering: an AHC framework determines how similar every two tags are, and LDA then captures thematic correlations among the tags produced by AHC. D. Tu, L. Chen, G. Chen [32] used LDA to extract the most typical words in every latent topic and applied a multi-way hierarchical agglomerative clustering algorithm (AHC and WordNet) to cluster candidate concept words. However, those papers addressed English text. Until now, the performance of the LDA method combined with agglomerative hierarchical clustering on Indonesian text has never been published. If both of these methods prove to perform well in clustering Indonesian texts, they can also be used for other text mining tasks, for example document summarization.
To overcome this problem, in this research AHC and LDA are used together to cluster documents: LDA is not used for clustering itself, but to generate the weights of the terms contained in the document text. This research differs from other related research on Indonesian text. First, the term weighting calculation uses Luhn's Idea to select terms by defining an upper cut-off and a lower cut-off, and then extracts term features using Gibbs Sampling LDA combined with term frequency values and fuzzy Sugeno logic; in contrast, P.M. Prihatini, I.K.G.D. Putra, I.A.D. Giriantari, M. Sudarma [26] used only TF-IDF for term weighting. Second, the calculation of the distance between documents for AHC is topic-based, because it uses the values produced by Fuzzy Luhn's Gibbs LDA. Third, document clustering with AHC uses three variants, single link, complete link, and average link, based on Fuzzy Luhn's Gibbs LDA, and the best AHC variant is identified with measurement metrics; in other research, Yuhefizar, B. Santosa, I.K. Eddy, Y.K. Suprapto [33] used Euclidean distance and single linkage for document clustering, and M.A.A. Riyadi, D.S. Pratiwi, A.R. Irawan, K. Fithriasari [34] used single link, complete link, and Ward's link based on autocorrelation distance. The rest of this paper is organized as follows. Section 2 discusses the research method. Section 3 discusses the results and their analysis. Section 4 presents the conclusion of this research.

RESEARCH METHOD
This research consists of several steps, namely document text processing, term weighting with Fuzzy Luhn's Gibbs LDA, document clustering with Fuzzy Luhn's Gibbs LDA, and evaluation, as shown in Figure 1.

Document text processing
In this research, the documents used are news text files obtained from Indonesian news websites. Each file is split into a collection of terms in the tokenization process. In the filtering process, each term is filtered using a stop-word list, resulting in a meaningful set of terms. Of the terms generated by the filtering process, some are already base words, while others still have affixes. To give all terms a uniform shape, every term is reduced to its base word through the stemming process. In this research, stemming uses an affix-removal method based on rules and a base-word dictionary. The stemming algorithm used is a modification of Nazief-Adriani [35].
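For concreteness, the following is a minimal sketch of this preprocessing pipeline in Python. It assumes the PySastrawi library, whose stemmer follows the Nazief-Adriani approach; the paper uses its own modified implementation [35], so this is an illustration rather than the authors' code.

```python
# Minimal preprocessing sketch (assumption: PySastrawi provides a
# Nazief-Adriani-style stemmer; the paper uses a modified version [35]).
import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stop_words = set(StopWordRemoverFactory().get_stop_words())

def preprocess(text):
    # Tokenization: split the document into lowercase word terms.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Filtering: drop stop words so only meaningful terms remain.
    filtered = [t for t in tokens if t not in stop_words]
    # Stemming: reduce each remaining term to its base word.
    return [stemmer.stem(t) for t in filtered]
```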

Term weighting with Fuzzy Luhn's Gibbs LDA
In this research, term weighting is done through term selection and feature extraction. Term selection is based on the concept of Luhn's Idea, where each term is scored by its relative frequency against all terms in the document text [36]. Luhn describes the relationship between the occurrence frequency of a term (term frequency) and the importance of that term in the document: terms with medium frequency are more important than terms with high or low frequency. Low-frequency terms fall in the lower cut-off, while high-frequency terms fall in the upper cut-off. Medium-frequency terms can be obtained by removing the upper and lower cut-offs. Terms in the upper cut-off can be eliminated by filtering terms against a stop-word list. For the lower cut-off, however, no research so far has established an effective way to determine its limit.
In this research, the elimination of terms in the upper cut-off is done twice: first by filtering terms against the stop-word list, and second by filtering the results again through the stemming process. The elimination of terms in the lower cut-off is based on the stemming result, with a different percentage removal value for each text document, as in (1). This is based on the idea that documents have different text lengths, so no single constant value fits all documents. Variable lcod (lower cut-off document) refers to the lower cut-off constant value for document d (a positive integer). Variable fsd (false-stemming document) refers to the number of terms in document d that could not be stemmed. Variable frd (filtering result document) refers to the number of terms in document d used for the stemming process. Variable tsd (true-stemming document) refers to the number of terms in document d that were stemmed successfully.
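Since equation (1) is not reproduced here, the sketch below is only one plausible reading of the variable definitions above, not the paper's actual formula; the function name and the proportional form are assumptions.

```python
# Hypothetical reading of (1): the lower cut-off grows with the share of
# terms that failed stemming. This is an assumption, not the paper's formula.
def lower_cutoff(fs_d, fr_d, ts_d):
    # fs_d: terms in document d that stemming failed to reduce
    # fr_d: terms in document d fed into the stemming process
    # ts_d: terms in document d stemmed successfully
    return max(1, round((fs_d / fr_d) * ts_d))  # positive integer, per the text
```

The resulting lcod value would then determine how many of the rarest terms are removed from document d.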
The term selection result is a collection of selected terms from each document that carry important meaning and are processed in feature extraction. In this research, feature extraction is done with the topic-based LDA method. LDA has several inference algorithms, one of which is Gibbs Sampling, which has proven effective for the topic sampling process [28]. In general, during initialization, Gibbs Sampling assigns the topic of each term randomly using a multinomial random function. However, this function cannot represent the presence of each term in a topic. Therefore, in this research, the topic of each term in the initialization process is determined by the highest occurrence frequency (tf) of the term across all topics, as in (2). Variable zt,k, like k, refers to the topic. Variable tft,k refers to the tf value of term t on topic k. The probability of each term in the sampling process is calculated as in (3). Variable pt,k refers to the sampling probability of term t on topic k. Variable nkw-1 refers to the value of the topic-term matrix, ignoring the current term. Variable V is the number of unique terms in all documents. Variable ndk-1 refers to the value of the document-topic matrix, ignoring the current term. Variable α determines the mixing proportion of topics in documents, while β determines the mixture of words in topics [37]. Variable K is the number of topics.
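The initialization rule (2) and the sampling probability (3) can be sketched as follows; the array names are illustrative, and the count matrices are assumed to already exclude the current term's assignment (the "-1" in the text).

```python
import numpy as np

def init_topic(tf_t):
    # (2): assign term t the topic k where its frequency tf_t[k] is highest,
    # instead of drawing the initial topic from a multinomial random function.
    return int(np.argmax(tf_t))

def sampling_probs(n_kw, n_k, n_dk, w, d, alpha, beta, V):
    # (3): unnormalized probability of each topic k for term w in document d.
    # n_kw: topic-term counts, n_k: per-topic totals, n_dk: document-topic
    # counts, all with the current term's assignment removed.
    return (n_kw[:, w] + beta) / (n_k + V * beta) * (n_dk[d, :] + alpha)
```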
In general, Gibbs Sampling in LDA requires several iterations of the sampling process to reach a convergent condition, which takes time and has high complexity. Adding fuzzy Tsukamoto logic to the sampling process can accelerate convergence while keeping good measurement values [26]. This research improves that fuzzy logic concept by using the Sugeno method to increase accuracy, considering that the fuzzy output needed for sampling is a constant value. In this research, the upper and lower limits of the fuzzy curve are determined from the tf value of each term. Fuzzification uses a triangular curve over the sampling probability value p of each term, as in (4). Variable u[t] refers to the degree of membership of term t. Variable a refers to the lower bound of the curve, b to its peak, and c to its upper bound. The implication function used is OR, because fuzzy logic here determines the probability value of a term for one topic, while all topics are handled in the sampling process. The rule composition generates the αp value as the maximum of all u[t], as in (5), and the value zo from the term probabilities across topics whose values are not equal to zero, as in (6). Variable t refers to the term probability of the sampling result. Variable zo refers to the composition output. Variable n refers to the number of topics whose term probability is not equal to zero. In the defuzzification, the final fuzzy output z is obtained by calculating the mean value, as in (7). The value of z is used as the probability value of term p for topic k in the next sampling process, until a convergent condition is reached. After convergence, the final value of z becomes the feature value of each term, ready for clustering.
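A compact sketch of this fuzzy Sugeno step is given below. Equations (5)-(7) are paraphrased from the prose, so the exact composition and defuzzification forms (in particular averaging alpha_p and z_o in the last line) are assumptions.

```python
import numpy as np

def triangular(t, a, b, c):
    # (4): triangular membership with lower bound a, peak b, upper bound c.
    if t <= a or t >= c:
        return 0.0
    return (t - a) / (b - a) if t <= b else (c - t) / (c - b)

def fuzzy_sugeno(probs, a, b, c):
    probs = np.asarray(probs, dtype=float)             # per-topic probabilities
    u = np.array([triangular(p, a, b, c) for p in probs])
    alpha_p = u.max()                                  # (5): OR rule -> maximum
    nonzero = probs[probs > 0]
    z_o = nonzero.mean() if len(nonzero) else 0.0      # (6): nonzero topics only
    return (alpha_p + z_o) / 2                         # (7): mean as final output z
```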

Documents clustering with Fuzzy Luhn's Gibbs LDA
The feature values obtained from feature extraction are used to calculate the distance between documents for the clustering process. In this research, distances are calculated using Cosine Similarity, as in (8). Variable |di-dj| refers to the distance between documents i and j. Variable di refers to document i, while dj refers to document j.
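The cosine measure in (8) is standard; a direct sketch, where each document is represented by its vector of Fuzzy Luhn's Gibbs LDA feature values:

```python
import numpy as np

def cosine_similarity(d_i, d_j):
    # (8): cosine of the angle between the feature vectors of documents i and j.
    return np.dot(d_i, d_j) / (np.linalg.norm(d_i) * np.linalg.norm(d_j))
```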
The distance between documents is used to cluster documents with three types of AHC algorithms. In the Single Link AHC algorithm, clusters are merged based on the smallest distance between pairs of documents, as in (9). In the Complete Link AHC algorithm, clusters are merged based on the largest distance between pairs of documents, as in (10). In the Average Link AHC algorithm, clusters are merged based on the average distance between pairs of documents, as in (11). Variable dij refers to the selected pair of documents i and j.
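The three linkage variants in (9)-(11) match SciPy's standard single, complete, and average methods, so a sketch can delegate to them; the random feature matrix and the choice of five clusters (one per news category) are placeholders, not the paper's setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

features = np.random.rand(20, 5)                 # stand-in for the LDA feature matrix
dist = pdist(features, metric="cosine")          # pairwise document distances
for method in ("single", "complete", "average"): # (9), (10), (11)
    Z = linkage(dist, method=method)
    labels = fcluster(Z, t=5, criterion="maxclust")  # e.g., five news categories
```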

Metrics evaluation
In this research, evaluation is done in two steps: evaluation of the feature extraction results and evaluation of the clustering results. The text documents used in this research have already been classified into five categories by Indonesian news media websites, so these categories can be used as the gold standard for the evaluation process.
Evaluation of the feature extraction results is done by comparing results with and without the lower cut-off. An evaluation was also performed to compare the feature extraction results of the Fuzzy Gibbs LDA method [26] and the Fuzzy Luhn's Gibbs LDA method used in this research. The evaluation uses two measurement metrics. First, perplexity measures the ability of the Fuzzy Luhn's Gibbs LDA feature extraction method to generalize to hidden data, as in (12) and (13) [25]; a smaller perplexity value indicates better performance. The perplexity is computed from M, the number of documents; V, the number of unique terms in all documents; nm,t, the number of occurrences of term t in document m; K, the number of topics; and the per-topic document and word counts.
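In the usual form of [25], perplexity exponentiates the negative average log-likelihood per token. A sketch, assuming theta (document-topic) and phi (topic-word) are the learned distributions and n_mt the term-count matrix; these names are illustrative:

```python
import numpy as np

def perplexity(n_mt, theta, phi):
    # n_mt: (M x V) counts of term t in document m; theta: (M x K); phi: (K x V).
    log_lik = np.sum(n_mt * np.log(theta @ phi))  # log-likelihood of the corpus
    return np.exp(-log_lik / n_mt.sum())          # (12)-(13): per-token perplexity
```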
Second, the Precision (P), Recall (R), and F-Score (F) metrics measure the ability of the Fuzzy Luhn's Gibbs LDA feature extraction method to find documents relevant to the gold standard, as in (14)-(16) [14]. Variable TP (true positive) refers to the number of relevant items retrieved. Variable FP (false positive) refers to the number of non-relevant items retrieved. Variable FN (false negative) refers to the number of relevant items that could not be retrieved. Greater values of P, R, and F indicate better performance.
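These three metrics follow directly from the TP, FP, and FN counts:

```python
def prf(tp, fp, fn):
    # (14)-(16): precision, recall, and their harmonic mean (F-score).
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)
```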
Evaluation of the clustering results is done by comparing the clustering results of Single Link, Complete Link, and Average Link AHC using the feature extraction results. In this research, the evaluation uses five measurement metrics. Precision, Recall, and F-Score (PRF) measure the ability of the methods to cluster documents in agreement with the gold standard, as in (14)-(16). The fourth metric is Normalized Mutual Information (NMI), as in (17)-(20) [38]. Variable I(Ω, C) refers to the mutual information between the classes Ω (gold standard) and the clusters C. Variable H(Ω) refers to the entropy of the classes, and H(C) to the entropy of the clusters. The computation also uses the number of documents belonging to each class k and the number of documents belonging to each cluster j.
The fifth metric is the Adjusted Rand Index (ARI), as in (21)-(24) [39]. Variable nij refers to the number of documents belonging to class i and cluster j. Variable ni refers to the number of documents belonging to class i, and variable n'j to the number of documents belonging to cluster j. Greater values of P, R, F, NMI, and ARI indicate better performance.
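Both NMI and ARI have standard implementations in scikit-learn, so an evaluation sketch can be as short as the following; the label lists are illustrative.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

gold = [0, 0, 1, 1, 2, 2]   # gold-standard category per document (illustrative)
pred = [0, 0, 1, 2, 2, 2]   # cluster label per document (illustrative)
nmi = normalized_mutual_info_score(gold, pred)   # (17)-(20)
ari = adjusted_rand_score(gold, pred)            # (21)-(24)
```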

RESULTS AND ANALYSIS

Fuzzy Luhn's Gibbs LDA
The evaluation results of the Fuzzy Luhn's Gibbs LDA feature extraction method can be seen in Table 1, compared against the Fuzzy Gibbs LDA method of P.M. Prihatini, I.K.G.D. Putra, I.A.D. Giriantari, M. Sudarma [26]. The Fuzzy Luhn's Gibbs LDA feature extraction method is run both with and without the lower cut-off in term selection, while Fuzzy Gibbs LDA did not use Luhn's concept. The evaluation results in Table 1 show that feature extraction with the lower cut-off of equation (1) gives evaluation values not much different from those without the lower cut-off. The difference in metric values between the two methods is very small, ranging from 0.0036 to 0.0091. This insignificant difference occurs because feature selection in this research already performs the upper cut-off in two steps, first at the filtering step with the stop-word list and then at the stemming step. These two steps have filtered out terms that appear very frequently as well as those that rarely appear, resulting in a list of meaningful terms for feature extraction. The lower cut-off, with its value adjusted to the length of the document, removes only a small portion of the meaningful terms in feature selection, so it does not significantly affect the feature extraction results.
The evaluation results in Table 1 also show that the Fuzzy Gibbs LDA method yielded a perplexity of 0.0376, while Fuzzy Luhn's Gibbs LDA in this research gives a perplexity of 0.0375 with the lower cut-off and 0.0339 without it. This indicates that the Fuzzy Luhn's Gibbs LDA algorithm performs as well as Fuzzy Gibbs LDA in generalizing to hidden data. The P, R, and F metrics, however, indicate that the Fuzzy Luhn's Gibbs LDA algorithm gives better results, ranging from 0.9280 to 0.9515, than the Fuzzy Gibbs LDA algorithm, which ranges from 0.8420 to 0.8975. The increase in the PRF metrics shows that determining the topic of each term for the initial sampling from the highest occurrence frequency (tf) of the term across all topics, together with Luhn's Idea and the fuzzy Sugeno method, is better able to find documents relevant to the gold standard. This indicates that the Fuzzy Luhn's Gibbs LDA algorithm is the better choice for feature extraction for clustering.

AHC with Fuzzy Luhn's Gibbs LDA
The evaluation results of the AHC algorithms based on Fuzzy Luhn's Gibbs LDA feature extraction can be seen in Table 2. The evaluation results in Table 2 show that feature selection with or without the lower cut-off does not affect the performance of the AHC algorithms in the clustering process: under both feature selection methods, the Complete Link AHC algorithm is the AHC clustering algorithm with the best metric values. The differences for the Complete Link AHC algorithm between the two feature selection methods range from 0.0003 to 0.0263. This shows that both feature selection methods are good choices for the clustering process with AHC. However, in terms of the consistency of the values produced by the five measurement metrics, Complete Link AHC with Fuzzy Luhn's Gibbs LDA and the lower cut-off has the most consistent metric values. This result can also be compared with the work of M.A.A. Riyadi, D.S. Pratiwi, A.R. Irawan, K. Fithriasari [34].
In their research, Complete Link AHC with autocorrelation distance yielded an accuracy of 0.8235. Therefore, Complete Link AHC with Fuzzy Luhn's Gibbs LDA and the lower cut-off is the more relevant choice as a clustering method for documents, especially Indonesian news text.

CONCLUSION
The Complete Link AHC and Fuzzy Luhn's Gibbs LDA with lower cut-off algorithm built in this research can improve the quality of the clusters generated in document clustering, especially for Indonesian news text. This is shown by the values of the evaluation metrics: Precision, Recall, F-Score, Perplexity, Normalized Mutual Information, and Adjusted Rand Index. The Precision, Recall, and F-Score values with the lower cut-off differ only slightly from those without it, which means both methods can be used in the term selection process. The Perplexity, Precision, Recall, and F-Score values of the Fuzzy Luhn's Gibbs LDA algorithm improved, which means it performed better than Fuzzy Gibbs LDA. The Precision, Recall, F-Score, Perplexity, Normalized Mutual Information, and Adjusted Rand Index values showed that Complete Link AHC with Fuzzy Luhn's Gibbs LDA is the best AHC clustering algorithm, with or without the lower cut-off. However, the Complete Link AHC algorithm with Fuzzy Luhn's Gibbs LDA and the lower cut-off produces more consistent values across the five measurement metrics, which means it is more relevant to the gold standard.