Clustering heterogeneous categorical data using enhanced mini batch K-means with entropy distance measure

Clustering methods in data mining aim to group a set of patterns based on their similarity. Survey data is typically heterogeneous, combining several types of data scales such as nominal, ordinal, binary, and Likert scales. Failing to treat heterogeneous data appropriately leads to loss of information and poor decision-making. Although many similarity measures have been established, solutions for heterogeneous data in clustering are still lacking. The recent entropy distance measure seems to provide good results for heterogeneous categorical data. However, it still requires many experiments and evaluations. This article proposes a framework for heterogeneous categorical data using mini batch k-means with an entropy measure (MBKEM) and investigates the effectiveness of the similarity measure when clustering heterogeneous categorical data. Secondary data from a public survey was used. The findings demonstrate that the proposed framework improves clustering quality. MBKEM outperformed other clustering algorithms, with accuracy at 0.88, v-measure (VM) at 0.82, adjusted rand index (ARI) at 0.87, and Fowlkes-Mallows index (FMI) at 0.94. The average minimum elapsed time observed across the varying numbers of clusters, k, was 0.26 s. In the future, the proposed solution would be beneficial for improving the quality of clustering for heterogeneous categorical data problems in many domains.


INTRODUCTION
The evolution of categorical data in data mining has been widely influenced by the need for more accurate and reliable techniques. Data mining is a procedure that seeks to understand data patterns through data exploration and extraction. One aspect of data mining is clustering, in which unsupervised learning algorithms play an important role. Unsupervised learning algorithms work on unlabeled data, that is, data with no target variable. In practice, much categorical data is obtained from questionnaires [1] and is heterogeneous. The categories are nominal, ordinal, binary, and Likert. Since data mining must also support heterogeneous categorical data, clustering algorithms must be scalable. There are various clustering methods, such as hierarchical, partitional, and density based. However, most clustering algorithms in use are designed for numerical data only, where data labeling techniques and distance computation can be applied directly through numerical operations [2].

RELATED WORKS
The increasing attention to the study of categorical data similarity has raised concerns about the quality of analysis performance. Nominal, binary, ordinal, and Likert data types are considered categorical [8]. It is a challenge to analyze and interpret these categories of data, especially the Likert scale. For instance, treating Likert data as ordinal features assumes that Likert items have the same meaning regardless of whether a response is neutral or undecided, which affects the outcome of real-world research and may also cause biases [9]. Several attempts have been made to rectify these issues. Among them is k-modes, which uses the simple matching distance metric to partition the dataset into groups. However, the results have shown low intra-cluster similarity, and the choice of starting points may lead to non-repeatable clustering patterns [10]. For example, k-modes using the Hamming distance measure failed to differentiate species because it created only one cluster [11]; the algorithm becomes unstable because it selects the highest-frequency category in the data. Categorical data approaches typically transform the data into numerical values by considering the relative frequency of each attribute value. A review of categorical data clustering implied that centroid initialization methods had demonstrated promising results [12]. However, each feature had a hard category value in a hard centroid, which was reflected in misclassified regions [13]. In general, the Euclidean distance metric does not fully capture the similarity between data objects [14], which reduces the precision of the result. To address this issue, this study uses a clustering algorithm with an entropy similarity measure to investigate the weights of features, including nominal, binary, ordinal, and Likert scale features.
A distance metric based on entropy can be used to determine the weight of each feature, since information entropy considers the uncertainty of all possible occurrences.
The concept of entropy is applied in many areas of study, for example in studying customer satisfaction. Surveys are used to gather data from a population. Individual replies to two or more questions are used to determine a survey's scale, and the survey responses are combined into a single score by using a scoring system. This scale and the entropy values can be used to obtain the expected information. Entropy is also associated with the second law of thermodynamics and with the measurement of uncertainty. Information entropy, often known as Shannon entropy, was initially proposed by Shannon in 1948 [15]. This measurement of information entropy is subject to error. The higher the information entropy is, the lower the usefulness of the information. Conversely, the lower the information entropy is, the lower the uncertainty and the higher the value of the information. Thus, this research aims to use the entropy distance measure with clustering algorithms, as it has the potential to improve the clustering solution, mainly for heterogeneous categorical data.

PRELIMINARIES
This section introduces the preliminary concepts of clustering required for this work. The choice of distance metric is crucial since it has a significant influence on the effectiveness of the clustering. The following subsections discuss the mini batch k-means algorithm and the capability of the entropy distance measure to provide solutions for heterogeneous data clustering.

Mini batch k-means (MBK) algorithm
The MBK clustering algorithm was adapted from [16] and is stated in algorithm 1. Mini batch k-means stores and updates the data incrementally, in a series of short batches that are randomly sampled from the dataset. In each batch, the clusters are updated using the batch data and the current prototype values. The learning rate of each cluster center decreases as the number of samples assigned to it grows, so the impact of new data decreases with the increasing iteration number. Clusters must go through many iterations before they converge. The greater the number of clusters is, the less similar the mini batch is to a larger batch. MBK has several advantages: faster computation time, simplicity as an unsupervised learning approach to clustering, and greater accuracy when working with mixed and large datasets [17]. However, MBK has not yet been tested on heterogeneous categorical data.
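The batch-wise update can be sketched in a few lines of plain Python. This is a minimal illustration of mini batch k-means on numerical toy data, not the authors' algorithm 1: the function names, the batch size, the iteration count, and the per-center learning rate of 1/count (the form commonly used for mini batch k-means) are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two numeric points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda j: squared_distance(point, centers[j]))

def mini_batch_kmeans(data, k, batch_size=10, iterations=100, seed=0):
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(data, k)]  # random initial centers
    counts = [0] * k                                  # samples seen per center
    for _ in range(iterations):
        batch = [rng.choice(data) for _ in range(batch_size)]
        # Cache the nearest center of every sample before any center moves.
        cached = [nearest_center(x, centers) for x in batch]
        for x, j in zip(batch, cached):
            counts[j] += 1
            eta = 1.0 / counts[j]  # assumed per-center learning rate, shrinks over time
            centers[j] = [(1 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers

# Two well-separated toy groups.
data = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
centers = mini_batch_kmeans(data, k=2)
labels = [nearest_center(p, centers) for p in data]
```

Because the learning rate is the inverse of the number of samples a center has absorbed, each center behaves like a running mean, which is why later batches move it less than earlier ones.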

Entropy distance measure
Clustering requires reasonable distances between the attributes to obtain meaningful clusters. By default, distance measures such as Euclidean distance and Hamming distance are used in clustering methods such as hierarchical clustering. They perform well on most homogeneous categorical data [18]. For heterogeneous data, entropy distances offer an alternative.

Euclidean and Hamming distance measures have a drawback. They can identify only one cluster at each iteration and often result in clusters with weak intra-cluster similarity [19]. The entropy distance measure has the potential to improve clustering results for heterogeneous categorical data, as supported by [7].
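One way to see why a simple matching measure can be too coarse for survey data is to apply it to Likert responses. The following sketch uses made-up responses on a five-point scale (1 = strongly disagree, 5 = strongly agree); the variable names and values are illustrative assumptions, not data from the study.

```python
def hamming_distance(a, b):
    """Number of positions at which two responses differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Three hypothetical respondents answering three Likert items.
r1 = [1, 3, 5]
r2 = [2, 3, 5]  # differs from r1 by one scale step on the first item
r3 = [5, 3, 5]  # differs from r1 by four scale steps on the first item

d12 = hamming_distance(r1, r2)
d13 = hamming_distance(r1, r3)
```

Both distances are equal, because Hamming distance only records whether two answers differ, not how far apart they are on the ordinal scale.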

PROPOSED CLUSTERING SOLUTION FOR HETEROGENEOUS CATEGORICAL DATA
This section highlights the proposed algorithm and the process flow for clustering heterogeneous categorical data using MBKEM. The entropy distance measure is applied to handle the heterogeneous information in categorical data. Without such handling, clustering heterogeneous categorical data leads to information loss and eventually results in insufficient information for decision-making.

MBKEM algorithm
In this research, we propose an enhancement of MBK with an embedded entropy distance measure to improve clustering quality on heterogeneous categorical data. The entropy distance measure is expected to help the clustering handle information loss in heterogeneous categorical information. The MBKEM algorithm is stated in algorithm 2. The algorithm starts by initializing the cluster node, then creates the clusters and initializes the number of clusters, as shown in steps 2 to 4. Steps 5 to 7 determine the reliability of the features. This covers the heterogeneous information provided by the nominal and ordinal attributes of the questionnaire. The next stage is the computation of the probabilities associated with each feature. The weight of each feature is identified in steps 8 to 10: the reliability of every attribute and the total reliability, including nominal and ordinal data, are determined in order to allocate weights to features. Next, steps 11 to 25 compute the distance between two individuals. The distance between the categories of each feature is calculated using the weights and entropies. The entropies are arranged in a dissimilarity matrix, and the distance matrix is generated from the entropy of the choices made by the respondents.
In step 27, the sample is randomly selected. Steps 28 to 30 determine the cluster center for each sample in a batch set; in step 29, the cluster center closest to the data sample is stored. Steps 31 to 36 synchronize each batch set with the cluster centers: step 32 obtains the cached center for the sample, step 33 determines the learning rate for each cluster center, and the gradient step updates the cluster center.
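The distance computation between two respondents can be illustrated with a small sketch. The exact equations of algorithm 2 are not reproduced in this text, so the form below is an assumption in the spirit of entropy-based distance metrics: for a nominal feature the distance between two different categories adds their two entropy contributions, and for an ordinal feature it adds the contributions of every level between them. All function names, the toy data, and the placeholder weights are hypothetical.

```python
import math
from collections import Counter

def entropy_terms(column):
    """Map each observed category to its entropy contribution -p * log(p)."""
    n = len(column)
    return {c: -(k / n) * math.log(k / n) for c, k in Counter(column).items()}

def category_distance(a, b, terms, ordered_levels=None):
    """Assumed entropy-based distance between two categories of one feature."""
    if a == b:
        return 0.0
    if ordered_levels is None:
        # Nominal feature: only the two categories involved contribute.
        return terms.get(a, 0.0) + terms.get(b, 0.0)
    # Ordinal feature: every level from a to b (inclusive) contributes,
    # so answers further apart on the scale get a larger distance.
    lo, hi = sorted((ordered_levels.index(a), ordered_levels.index(b)))
    return sum(terms.get(level, 0.0) for level in ordered_levels[lo:hi + 1])

def record_distance(x, y, columns, weights, levels):
    """Weighted combination of per-feature distances between two respondents."""
    total = 0.0
    for r, (a, b) in enumerate(zip(x, y)):
        d = category_distance(a, b, entropy_terms(columns[r]), levels[r])
        total += (weights[r] * d) ** 2
    return math.sqrt(total)

# Toy data: one Likert feature (levels 1..5) and one nominal feature.
likert = [1, 1, 2, 3, 5, 5]
region = ["north", "south", "south", "east", "north", "south"]
columns = [likert, region]
levels = [[1, 2, 3, 4, 5], None]  # None marks a nominal feature
weights = [0.5, 0.5]              # placeholder weights, not derived from the data

near = record_distance((1, "north"), (2, "north"), columns, weights, levels)
far = record_distance((1, "north"), (5, "north"), columns, weights, levels)
same = record_distance((1, "north"), (1, "north"), columns, weights, levels)
```

Under these assumptions, two respondents who are one Likert step apart end up closer than two who sit at opposite ends of the scale, which a simple matching distance cannot express.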

Process flow for clustering heterogeneous categorical data
The process flow provides a rough indication of the actions to take from the data pre-processing until the computational result. Figure 1 demonstrates the process flow for clustering heterogeneous categorical data using MBKEM. The process flow is divided into three phases: phase 1, phase 2, and phase 3. The detailed explanations of each phase are presented in the following subsections.

Phase 1
Phase 1 involves the pre-processing of the existing survey data. The data pre-processing step is one of the preliminary steps that can be performed during the cluster analysis process. It involves analyzing the data to transform it into an appropriate format for analysis. In this phase, the steps include imputing missing values, fixing data structure entry, and removing unwanted observations. The data collected from a questionnaire is usually stored as strings.
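The pre-processing steps can be sketched as follows. The text does not state how missing values were imputed or how strings were encoded, so mode imputation and integer coding are assumptions chosen for illustration, and the function names, missing-value markers, and sample answers are hypothetical.

```python
from collections import Counter

MISSING = {"", "NA", None}  # assumed markers for missing answers

def impute_mode(column):
    """Replace missing answers with the most frequent observed answer."""
    observed = [v for v in column if v not in MISSING]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v in MISSING else v for v in column]

def encode(column, levels=None):
    """Map string answers to integer codes; pass levels to keep an ordinal order."""
    if levels is None:
        levels = sorted(set(column))
    code = {level: i for i, level in enumerate(levels)}
    return [code[v] for v in column]

raw = ["agree", "", "disagree", "agree", "neutral", "NA"]
clean = impute_mode(raw)
codes = encode(clean, levels=["disagree", "neutral", "agree"])
```

Passing an explicit `levels` list keeps the order of an ordinal or Likert item, whereas a nominal item can rely on the default alphabetical coding because its codes carry no order.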

Phase 2
Phase 2 involves the construction of MBKEM in six stages: initializing the number of clusters, separating categorical data, calculating probability, determining entropy values, identifying weights, and determining the cluster center for each batch. The categorical features are separated and grouped by category type. The occurrence probability of a category value in a feature is defined in (1).
In (1), the count term is the number of data objects in the dataset whose value for feature Fr equals the given category. Shannon's entropy is used to evaluate the information entropy. The entropy identifies the starting point of the mini batch k-means clustering, is used to determine the weights, and supports deciding among a collection of options. Shannon entropy is a straightforward measure of uncertainty in a dataset. The entropy value of the categories in feature Fr is written as (2).
In (2), the probability term is the occurrence probability of the category value. The smaller the entropy value is, the more typical the behavior and the lower the uncertainty. Weighting is a crucial component of entropy information. According to information theory, the greater the use-value of an individual feature as quantified by its entropy value is, the more relevant or information-rich the judgment will be [20]. The ability to comprehend decisions between alternatives (e.g., Likert scales) is based on more important information and provides a more convincing result. Consequently, the significance of a feature's weight is proportional to the amount of information provided. The weighting factor used to determine the relative relevance is given in (3).
As the number of categories increases, a feature may generate longer distances and contribute more than necessary; hence, it must be appropriately weighted. The weighting scale for features is given in (4).
In (4), the maximum-entropy term is the entropy of a feature in which each category is equally likely to occur. As mentioned earlier, the combination of the two weights is denoted by (3). Its element is defined in (5).
The weight for the magnitude and scale of features is written in (6). The reliability of features is given in (7), in which the reliability value indicates the proportion of the maximum volume of information stored in a feature. The greater the reliability value is, the more convincing the distance is.
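The probability, entropy, reliability, and weighting stages can be illustrated with a short sketch. Since equations (1) to (7) are not reproduced in this text, the forms used here are assumptions: the occurrence probability is a relative frequency, the entropy is the standard Shannon entropy, the reliability is the entropy divided by the maximum entropy log(m) of a feature with m equally likely categories, and the weights are reliabilities normalized to sum to one. Function names and the toy columns are hypothetical.

```python
import math
from collections import Counter

def occurrence_probabilities(column):
    """Relative frequency of every category observed in a feature."""
    n = len(column)
    return {c: k / n for c, k in Counter(column).items()}

def shannon_entropy(column):
    """Standard Shannon entropy of a feature, using the natural logarithm."""
    return -sum(p * math.log(p) for p in occurrence_probabilities(column).values())

def reliability(column, n_categories):
    """Assumed form: share of the maximum possible entropy log(n_categories)."""
    max_entropy = math.log(n_categories)
    return shannon_entropy(column) / max_entropy if max_entropy > 0 else 0.0

def feature_weights(columns, category_counts):
    """Assumed form: reliabilities normalized so that the weights sum to one."""
    scores = [reliability(col, m) for col, m in zip(columns, category_counts)]
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

binary = ["yes", "no", "yes", "no"]  # both categories equally likely
likert = [1, 1, 1, 5]                # five-point scale, answers concentrated on 1
weights = feature_weights([binary, likert], category_counts=[2, 5])
```

Dividing by the maximum entropy keeps a feature with many categories from dominating purely because it has more categories, which is the concern the text raises about longer distances.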

Phase 3
In phase 3, a series of experiments on MBKEM is performed using clustering validation performance measures. Clustering validation refers to finding the optimal clusters that match the natural partition of the data without the need for class information. Six measures of clustering quality are used, together with elapsed time: clustering accuracy (CA); external validation using completeness (COM), VM, ARI, and FMI; and internal validation using the silhouette index (SI). CA quantifies the proportion of data objects that are successfully clustered and conveys the precision with which the results are obtained. CA values closer to one imply improved clustering capability and precision [21]. The dataset is partitioned into a set of k clusters, and CA is estimated as in (8), where k is the desired number of clusters and the count of objects in the clusters is divided by the number of objects in the dataset.
External validation is the process of evaluating the performance of clustering using prior knowledge such as class labels, for example with COM, VM, and ARI [22]. A clustering is complete if all data points that belong to a given class are assigned to the same cluster. The score lies between 0.0 and 1.0, and a score of 1.0 indicates perfect labeling. V-measure can be used to ascertain the degree of agreement between two independent clusterings of the same dataset. Completeness is given in (9), in which the ratio term is the number of samples in a cluster that share the same class label divided by the total number of samples.
Meanwhile, the ARI is a more sophisticated form of the rand index (RI) that assesses the agreement between the true and predicted labels, corrected for their expected agreement. RI determines the degree of similarity between two clusterings by evaluating all pairs of samples and counting those assigned to the same or to different clusters in both the predicted and actual clusterings.
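The pair counting behind RI and its chance-corrected form can be sketched as follows. The adjusted version uses the standard contingency-table formulation of ARI; the function names and toy labels are illustrative assumptions and do not come from the study.

```python
from collections import Counter
from itertools import combinations
from math import comb

def rand_index(true_labels, pred_labels):
    """Share of sample pairs on which the two partitions agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def adjusted_rand_index(true_labels, pred_labels):
    """Rand index corrected for the agreement expected by chance."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # class/cluster contingency counts
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

truth = [0, 0, 0, 1, 1, 1]
swapped = [1, 1, 1, 0, 0, 0]  # same grouping with different label names
partial = [0, 0, 1, 1, 1, 1]  # one sample placed in the wrong group

ari_swapped = adjusted_rand_index(truth, swapped)
ari_partial = adjusted_rand_index(truth, partial)
ri_partial = rand_index(truth, partial)
```

Both measures depend only on which samples are grouped together, so renaming the clusters leaves the score unchanged.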
ARI equals 1 when the two clusterings are identical and is close to 0 for random labeling, regardless of the number of clusters. The larger the ARI value is, the better the grouping. The formula of ARI is given in (10).
In (10), given a dataset of objects, two partitions represent the original classes and the clustering results respectively. The formula uses the number of objects in each class, the number of objects in each cluster, and the number of objects shared by a given class and a given cluster. FMI quantifies the performance of a clustering technique by comparing its clusters with a reference clustering. A score close to zero indicates largely independent labeling, whereas a value close to one reflects clustering agreement. The formula for FMI is defined in (11).
In (11), the first term counts pairs of observations that are in the same cluster in both the actual and predicted clusterings, the second counts pairs that are in the same actual cluster but in different predicted clusters, and the third counts pairs that are not in the same actual cluster but are in the same predicted cluster. Internal validation examines the clusters generated by the clustering algorithm using only the data itself. The silhouette method is a well-known internal measure that estimates the consistency of the clusters. It quantifies how similar items are to their own cluster (cohesion) compared with other clusters (separation). The optimum value is one. Near-zero values indicate the presence of overlapping clusters. Negative values frequently suggest that a sample is incorrectly assigned to a cluster because another cluster is more similar to the sample [23]. The formula of SI is defined in (12).
In (12), the first term is the mean distance from a sample to the other points in its own cluster, and the second is the minimum average distance from the sample to the points in another cluster.
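FMI and SI can be illustrated with a short sketch using their standard definitions: FMI as the geometric mean of pairwise precision and recall, and the silhouette of a sample as (b - a) / max(a, b), where a is the mean intra-cluster distance and b the smallest mean distance to another cluster. The function names, the convention of scoring singleton clusters as zero, and the toy data are assumptions for the example.

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows(true_labels, pred_labels):
    """FMI from pair counts: TP / sqrt((TP + FP) * (TP + FN))."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        tp += same_true and same_pred
        fp += (not same_true) and same_pred
        fn += same_true and (not same_pred)
    return tp / sqrt((tp + fp) * (tp + fn)) if tp else 0.0

def silhouette_index(points, labels, distance):
    """Mean silhouette over all samples; needs at least two clusters."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        def mean_distance_to(cluster):
            members = [q for j, q in enumerate(points) if labels[j] == cluster and j != i]
            return sum(distance(p, q) for q in members) / len(members) if members else None
        a = mean_distance_to(labels[i])
        if a is None:  # singleton cluster: score taken as 0 by convention
            scores.append(0.0)
            continue
        b = min(mean_distance_to(c) for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

truth = [0, 0, 0, 1, 1, 1]
partial = [0, 0, 1, 1, 1, 1]
fmi_perfect = fowlkes_mallows(truth, truth)
fmi_partial = fowlkes_mallows(truth, partial)

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
si = silhouette_index(points, truth, lambda x, y: abs(x - y))
```

The silhouette function takes the distance as an argument, so the same code could be used with a categorical distance in place of the absolute difference used here.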

RESULTS AND DISCUSSION
This research aims to evaluate the impact of the proposed technique compared with existing conventional clustering techniques. An entropy measure is proposed as the distance metric to minimize the distance values within a cluster, which offers a way to improve the quality of clustering tasks. The primary aim of this research is to enhance the quality of clustering by combining conventional clustering techniques with entropy measures. This section presents the results and discussion of the computational experiments that measure clustering quality against existing methods and set the ideal number of clusters.

Dataset
For the experiments, a heterogeneous dataset was taken from the secondary data of a survey on public timber utilization. The public were the end-users of timber. The unlabeled public dataset comprised 2,407 respondents. The variables used to analyze the public perception of timber usage were 74 qualitative features: five nominal features, 30 binary features, four ordinal features, and 35 Likert scale features. Binary features were treated as nominal, and Likert scales were treated as ordinal. Data pre-processing and cleaning procedures such as removing unwanted observations, fixing the data structure, and imputing missing data were applied.

Computational result
The method was developed using the Python programming language. A computational comparison is carried out in this section to validate the impact of the proposed solution against existing methods. CA; external validation using COM, VM, ARI, and FMI; internal validation using SI; and elapsed time are considered for evaluation. The following subsections provide concise results on the clustering quality of the proposed and existing clustering methods. Figure 2 shows the comparison of clustering accuracy for MBKEM. The clustering accuracy is 88.1%, which indicates that the proposed algorithm is more accurate and capable of convergence.

External evaluation
This section explains the performance measures for external evaluation. Figure 3 shows the comparative performance of external evaluation with varying values of k using the MBKEM algorithm in Figure 3(a), the k-means algorithm in Figure 3(b), the agglomerative algorithm in Figure 3(c), the DBSCAN algorithm in Figure 3(d), affinity propagation in Figure 3(e), and MBK in Figure 3(f). The figure demonstrates that the proposed MBKEM shows the highest performance in VM, COM, ARI, and FMI. VM is at 0.82, COM is at 0.81, ARI is at 0.87, and FMI is at 0.94 at k = 2, indicating that the two partitions are nearly aligned and that the clustering algorithm is successful.
Usually, data categorization is influenced by the choice of unsupervised clustering algorithm. Unlike the other clustering algorithms, DBSCAN and affinity propagation do not require the number of clusters as a parameter. Their external evaluation performance is instead influenced by their own parameters, such as the preference parameter and damping factor of affinity propagation. However, DBSCAN and affinity propagation have a significant
Figure 4 shows the comparison of internal evaluation using the SI at different numbers of clusters (k). Based on Figure 4, the proposed MBKEM shows the highest performance for SI compared to the other clustering algorithms. SI is one of the best indicators for estimating the formation of clusters. The result shows that all samples are assigned to the right cluster since all values of SI are positive. Figure 4 indicates that the proposed MBKEM gains the best result at k = 2, where the silhouette value is at its highest. As the number of clusters increases, the value of SI decreases. An SI value near 1 indicates good clustering.

Computational elapsed time
The elapsed time is defined as the time required to cluster the data. The less time an algorithm takes, the better it is considered to be. For different numbers of clusters (k), the elapsed times of MBKEM, k-means, agglomerative, DBSCAN, affinity propagation, and MBK are reported in Table 1. The elapsed time of the MBK algorithm is lower than that of the others over seven iterations. Hence, MBK is the clustering algorithm with the minimum computational time. However, the proposed MBKEM still shows lower computational time than k-means, agglomerative, DBSCAN, and affinity propagation.
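Elapsed time of this kind can be measured in Python with a monotonic timer. The sketch below is an assumption about how such a measurement might be set up, not the authors' benchmarking code: the helper names are hypothetical, and the placeholder workload stands in for a clustering call.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def minimum_elapsed(fn, repeats, *args, **kwargs):
    """Smallest elapsed time over several repeated runs."""
    return min(timed(fn, *args, **kwargs)[1] for _ in range(repeats))

# Placeholder workload standing in for a clustering run.
result, elapsed = timed(sum, range(1000))
best = minimum_elapsed(sum, 7, range(1000))
```

Taking the minimum over repeated runs reduces the influence of background activity on the machine; reporting the mean instead is an equally reasonable choice.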

Discussion
Most of the experiments conducted have shown that MBKEM provides better results than k-means, agglomerative, DBSCAN, affinity propagation, and MBK. Using the entropy distance measure in MBKEM has brought a significant improvement in accuracy. The entropy computation can examine the degree of harmony or consistency in a data group. Each feature in each category is treated differently by the entropy measure. The entropy value itself indicates the information stored. Entropy is associated with the weight of thinking cost and decision-making. The entropy weight of each feature indicates the probability of a choice and supports deciding among a set of alternatives. The higher the weight is, the higher the variation is. Previous studies also support that the entropy distance measure provides better performance and higher accuracy for categorical datasets because it produces a weighted class for each data item in a dataset [24], and that the entropy weight technique has been effective for evaluating decision-making [25]. The choice of thinking cost or decision-making implies the distance between two categorical data items. The smaller the entropy value is, the less information is stored in the choice and the higher the probability of selection will be.
The distance measure for the other clustering algorithms employed in this study is the Euclidean distance, which is the standard distance measure and the most widely used in clustering. Focusing exclusively on the scale of similarity between samples may ignore some essential features of the data. The Euclidean distance measure cannot represent the structure of ordinal features. Therefore, the dynamic core structure of information in the data cannot be represented, leading to non-optimal results that affect performance. The clustering process must be accurate with low complexity to guarantee efficiency. A limitation of MBKEM is that its computational time is not as fast as that of the traditional MBK. This could be because each feature is treated in depth and handled batch by batch. There are pros and cons to this. Overall, MBKEM still incurs less computational cost than k-means, agglomerative, DBSCAN, and affinity propagation. The proposed MBKEM that embeds the entropy distance measure could be a new clustering method variant, especially for heterogeneous categorical data. Nevertheless, further enhancement of the clustering algorithm is still required to improve the solution's effectiveness, and it should be tested in many problem domains.

CONCLUSION
Clustering categorical data is complex since the values have no inherent order. The entropy distance metric is a good metric for analyzing a questionnaire. This research has demonstrated a novel framework for unsupervised clustering of heterogeneous categorical data that uses the entropy distance measure as the similarity measure. Using entropy as the similarity measure enables the information in each data feature to be considered. The MBKEM framework has been proposed and evaluated using heterogeneous categorical data. The performance evaluation metrics and time complexity have been investigated, and comparisons with other algorithms have been made to demonstrate the effectiveness of the proposed method. MBKEM has mostly outperformed the other unsupervised clustering methods in terms of CA, VM, COM, ARI, FMI, and SI at k=2. The execution time of MBKEM is a bit higher than that of conventional MBK since each feature is comprehensively treated. Hence, MBKEM could be improved with a swarm intelligence method to increase the grouping performance on heterogeneous categorical data. We believe this framework can be a valuable basis for other relevant research.