A space-structure based dissimilarity measure for categorical data

ABSTRACT


INTRODUCTION
The growth of available datasets provides the research community with new resources to achieve scientific discoveries, optimize industrial processes, and find relations or characteristic patterns in data [1]. However, open issues remain; for example, determining a dissimilarity measure is one of the most attractive and recent challenges in data mining. This is because the performance of many algorithms for clustering, classification, dimensionality reduction, and outlier detection depends on the metric used to measure similarity/dissimilarity among the data [2]. For this reason, it is convenient to establish an appropriate distance measure for a given dataset instead of using an arbitrary metric.
Choosing a metric for quantitative (continuous) data is relatively simple, since several well-established metrics exist, such as Euclidean, Cosine, and Manhattan. Moreover, with this type of data, standard machine learning methods can be used directly, performing numeric calculations without drawbacks or limitations [3,4]. In contrast, choosing a metric for categorical (or nominal) data is more complex, because there is no intrinsic similarity/dissimilarity measure established for categorical objects [5]. In addition, standard machine learning algorithms cannot be applied directly to categorical data, since it is not appropriate to compute statistical descriptors (mean, standard deviation, etc.) over a dataset with nominal or qualitative variables as if they were quantitative [6]. Besides, categorical data is highly overlapped. These observations have motivated several researchers to work with categorical data, as in the case of [7], who carried out a study on distances for heterogeneous data (databases mixing quantitative and qualitative variables) based on a supervised learning approach, where each sample carries additional information about the class to which it belongs; this approach can, however, be extended to the unsupervised learning paradigm. Other works, such as [8][9][10][11][12], proposed binarizing the categorical information, assigning each sample a 1 or a 0 as indicated by its original qualitative value, to later use similarity/dissimilarity measures for binary data in clustering algorithms. Nevertheless, far from being a reliable solution, these methods have a problem: they are only applicable to categorical databases whose variables have exactly two possible states, which in general is not the case. Further, these algorithms must handle a large number of binary attributes when the datasets have features with many categories, classes, or groups, which increases the computational cost and memory footprint of the algorithm [13].
Alternatively, in the work developed in [14], the authors evaluate the performance of a variety of similarity measures from the literature, such as overlap, inverse occurrence frequency, and occurrence frequency. This is done in the context of outlier detection, but their experiments showed very poor performance and unstable results with increased standard deviation; they suggest that there is no single best-performing similarity measure, and that it is necessary to understand how a similarity measure handles the different attributes of categorical datasets [15]. In a recent work, [16] proposed a new metric to measure the distance between categorical objects, based on the frequency probability of each attribute value in the whole dataset and the degree of dependence among different attributes. However, the probability distribution of the attributes must be taken into account, and the inherent structure of the data in the feature space is not considered [14,17].
Given the above, in this paper we propose a new approach to determine the similarity/dissimilarity between qualitative data based on the number of possible states of a categorical variable, assigning each attribute a degree of relevance, or degree of contribution, to the compact cluster structure. We call our method weighted pairing (W-P) based on feature space structure. The effectiveness of the proposed W-P metric is demonstrated by experiments on real categorical databases obtained from the public UCI machine learning repository [18]. We compare our proposal with the distance metric method (DM3) and the Hamming metric with support-based initialization (H-SBI). We analyze the performance of the W-P distance metric by embedding it into the framework of the K-modes algorithm, the most popular distance-based clustering method for purely categorical data [16,19], using the centroid initialization method proposed in [20].

WEIGHTED PAIRING DISTANCE (W-P)
In this section, we introduce the dissimilarity measure for categorical data in the unsupervised learning paradigm as a weighted pairing distance learned according to the attribute compactness within the data structure. Let X = {x_i : i ∈ [1, n]} be a set of objects, each expressed as the vector x_i = [x_i1, …, x_il, …, x_im], where m stands for the number of attributes. For categorical data, the object x_i at attribute l, x_il ∈ A_l, takes one value from the unordered, discrete set A_l = {a_t : t ∈ [1, |A_l|]} of possible values [21]. In order to establish a similarity measure between categorical objects, the simple matching function aggregates the number of matching values [3]:

s(x_i, x_j) = (1/m) ∑_{l=1}^m δ(x_il, x_jl),    (1)

where δ(x_il, x_jl) is the delta function that equals 1 if x_il = x_jl and 0 otherwise. Equation (1) determines how many attribute values a couple of samples have in common, and 1/m is a normalization factor. However, simple matching lacks the attribute ranking required for understanding categorical data [15,22].
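As an illustration, the simple matching similarity of (1) reduces to a per-attribute equality test; the following minimal Python sketch uses illustrative mushroom attribute values:

```python
def simple_matching(x, y):
    """Simple matching similarity: fraction of attributes two objects share.

    Implements s(x, y) = (1/m) * sum_l delta(x_l, y_l), where delta is 1
    when the attribute values are equal and 0 otherwise.
    """
    assert len(x) == len(y), "objects must have the same number of attributes"
    m = len(x)
    return sum(1 for a, b in zip(x, y) if a == b) / m

# Two specimens described by (cap-shape, odor, gill-color):
s = simple_matching(("convex", "almond", "white"), ("convex", "foul", "white"))
# 2 of the 3 attributes match
```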
Aiming to overcome the above issue, we propose a dissimilarity measure that considers the relevance of each attribute, termed the weighted pairing (W-P) distance:

d(x_i, x_j) = ∑_{l=1}^m w_l (1 − δ(x_il, x_jl)),    (2)

with the normalized relevance weights w_l ∈ [0, 1] satisfying ∑_{l=1}^m w_l = 1, and d(x_i, x_j) ∈ [0, 1]. To determine the attribute relevance, we assume that the more possible values an attribute can take, the more dispersed the objects in the feature space, as Figure 1 illustrates. Then, we account for the data compactness in the relevance weights in terms of the attribute cardinality as:

w_l = (1/|A_l|) / ∑_{l'=1}^m (1/|A_l'|),    (3)

where the operator |•| determines the number of elements within a set. Therefore, the most relevant attributes are those contributing the most to data compactness, since the attribute relevance becomes inversely proportional to its number of possible values.
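A minimal Python sketch of the weight learning and the W-P dissimilarity follows, assuming, per the description above, that each weight is the inverse attribute cardinality normalized to sum to one, and that the distance accumulates the weights of mismatched attributes; the toy data and variable names are illustrative:

```python
def learn_weights(data):
    """Relevance weight per attribute: inversely proportional to the
    attribute's cardinality (number of distinct observed values) and
    normalized so the weights sum to 1.
    `data` is a list of equal-length tuples of categorical values."""
    m = len(data[0])
    cardinality = [len({obj[l] for obj in data}) for l in range(m)]
    inv = [1.0 / c for c in cardinality]
    total = sum(inv)
    return [v / total for v in inv]

def wp_distance(x, y, w):
    """Weighted pairing dissimilarity: sum of weights over mismatched attributes."""
    return sum(wl for wl, a, b in zip(w, x, y) if a != b)

data = [("red", "round"), ("red", "oval"), ("blue", "oval"), ("green", "round")]
w = learn_weights(data)  # colour takes 3 values, shape takes 2 -> shape weighs more
d = wp_distance(data[0], data[1], w)  # objects differ only in shape
```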

Weighted pairing distance properties
If the attribute weights satisfy (3), the dissimilarity function in (2) becomes a distance.
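Under the mismatch-weighted form assumed above, symmetry and d(x, x) = 0 hold by construction, and the triangle inequality can be spot-checked numerically over random triples (a sketch with illustrative values, not a proof):

```python
import itertools
import random

random.seed(0)
VALUES = ["a", "b", "c"]
w = [0.5, 0.3, 0.2]  # any non-negative weights summing to 1

def wp(x, y):
    """W-P dissimilarity: sum of weights over mismatched attributes."""
    return sum(wl for wl, a, b in zip(w, x, y) if a != b)

objs = [tuple(random.choice(VALUES) for _ in range(3)) for _ in range(30)]

# Triangle inequality holds per attribute, since a mismatch between x and z
# implies a mismatch between x and y or between y and z.
triangle_ok = all(
    wp(x, z) <= wp(x, y) + wp(y, z) + 1e-12
    for x, y, z in itertools.product(objs, repeat=3)
)
```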

Distance implementation
The computation of the proposed W-P distance is a two-step process. Firstly, we learn the weights w given a dataset X, as shown in Algorithm 1. Secondly, given the weights and an object pair, we compute the W-P distance following Algorithm 2, where x is a 1-by-m vector containing a single observation, C is a k-by-m matrix containing all k centroids, and D is an n-by-k matrix of distances between all n observations and all k centroids.
According to Algorithm 1, W-P demands the computation of m cardinality values and a scaling factor, which yields a time cost of ~O(4m). Besides, Algorithm 2 verifies an attribute matching m times per object pair, with a complexity of ~O(3m). Therefore, the computational complexity of W-P is linear in the number of attributes, O(m).
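Although Algorithms 1 and 2 are stated per object pair, the n-by-k distance matrix D described above can be computed in one vectorized step; the sketch below assumes categorical values have been integer-coded, and the arrays are illustrative:

```python
import numpy as np

def wp_distance_matrix(X, C, w):
    """n-by-k matrix of W-P distances between n observations and k centroids.

    X : (n, m) integer-coded categorical observations
    C : (k, m) integer-coded centroids
    w : (m,)   attribute relevance weights summing to 1
    """
    mismatch = X[:, None, :] != C[None, :, :]  # (n, k, m) boolean mismatches
    return (mismatch * w).sum(axis=2)          # weighted mismatch totals

X = np.array([[0, 1], [0, 0], [2, 1]])  # 3 observations, 2 attributes
C = np.array([[0, 1], [2, 0]])          # 2 centroids
w = np.array([0.4, 0.6])
D = wp_distance_matrix(X, C, w)         # shape (3, 2)
```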

EXPERIMENTAL SETUP
We analyze the performance of the W-P distance in clustering tasks by embedding it into the K-modes algorithm [23], with cluster centroids initialized by the method in [20]. Table 1 summarizes the details of the six UCI machine learning datasets considered for evaluating the proposed distance [18]. The first dataset, Congressional Voting Records, includes 16 key votes identified by the Congressional Quarterly Almanac of the US House of Representatives, grouped into Democrat or Republican. The Breast Cancer Wisconsin (Original) collection holds 699 samples periodically collected from 1989 to 1991 as the clinical cases of Dr. Wolberg, labeled as benign or malignant. The Mushroom dataset corresponds to 23 species of gilled mushrooms in the Agaricus and Lepiota families; it describes each specimen in terms of its physical characteristics and classifies it as poisonous or edible. Soybean (Small) contains 35 categorical attributes, both nominal and ordered, and four disease classes. The Car Evaluation dataset, derived from a hierarchical decision model, labels 1728 cars according to six different aspects into unacceptable, acceptable, good, and very good quality. Lastly, the Zoo dataset contains sixteen boolean-valued attributes to group specimens into seven different animal classes. For each dataset, we assess the clustering performance of W-P in a 50-fold bootstrap scheme and report the average and standard deviation of the following cluster quality indexes: the average intracluster/intercluster distance, the cluster discrimination index (CDI), the Rand index (RI), and the normalized mutual information (NMI). We compare our proposal with two state-of-the-art methods, the distance metric (DM3) and the Hamming metric with support-based initialization (H-SBI), proposed in [16] and [20], respectively.
Cluster discrimination index (CDI): Given clusters C_k ⊂ X with ⋃_k C_k = X and C_k ∩ C_k' = ∅ for k ≠ k', the CDI computes the performance according to the average intracluster distances (AID) [16], where |C_k| stands for the cardinality of the k-th cluster, c_k denotes the centroid of the k-th cluster resulting from the K-modes algorithm, and ∆_k stands for the average distance between c_k and all the k-th cluster objects. Therefore, CDI ≥ 0, and the smaller its value, the more distant the clusters and the closer the objects within each cluster.
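The per-cluster average intracluster distance ∆_k used by the CDI can be sketched as follows; the hamming helper stands in for any categorical distance, and the toy cluster is illustrative:

```python
def average_intracluster_distance(cluster, centroid, dist):
    """Average distance between a cluster's centroid and its members
    (the quantity Delta_k aggregated by the CDI)."""
    return sum(dist(x, centroid) for x in cluster) / len(cluster)

def hamming(x, y):
    # Plain mismatch count, standing in for any categorical distance.
    return sum(a != b for a, b in zip(x, y))

cluster = [("a", "x"), ("a", "y"), ("b", "x")]
centroid = ("a", "x")  # the cluster mode, as produced by K-modes
aid = average_intracluster_distance(cluster, centroid, hamming)
```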
Rand index (RI): a similarity measure based on the overlap in class agreement, compared to the class disagreement, defined as [24] RI = (1/n) tr{M}, where tr{•} denotes the trace operator and M ∈ [0, n]^{K×K} corresponds to the permuted confusion matrix between the cluster algorithm output and the gold standard labels.
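For reference, a common pair-counting formulation of the Rand index (agreement over object pairs, invariant to cluster relabeling) can be sketched as:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index: fraction of object pairs on which the clustering and the
    ground truth agree (both grouped together or both kept apart)."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)

ri = rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # same partition, relabelled
```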
Normalized mutual information (NMI): the NMI score relies on shared object membership and is a symmetric measure of the degree of dependency between the clustering output and the ground truth. Unlike correlation, mutual information also takes higher-order dependencies into account [25]. Here, n_u, n_v, and n_uv denote the number of objects with the u-th ground-truth label, the number of objects in the v-th cluster, and the number of objects grouped in the v-th cluster that belong to the u-th ground-truth label, respectively. NMI is a positive value with a maximum of 1, achieved when the ground truth and the resulting clustering perfectly match.
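A sketch of the NMI computation from the label counts n_u, n_v, and n_uv, assuming the common normalization by the geometric mean of the two entropies:

```python
from collections import Counter
from math import log

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings:
    NMI = I(U;V) / sqrt(H(U) * H(V)), computed from label counts."""
    n = len(labels_true)
    cu, cv = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))  # counts n_uv
    mi = sum(nuv / n * log(n * nuv / (cu[u] * cv[v]))
             for (u, v), nuv in joint.items())
    hu = -sum(c / n * log(c / n) for c in cu.values())
    hv = -sum(c / n * log(c / n) for c in cv.values())
    return mi / (hu * hv) ** 0.5 if hu and hv else 0.0

score = nmi([0, 0, 1, 1], [1, 1, 0, 0])  # perfect match up to relabelling
```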

RESULTS AND DISCUSSION
In order to determine the best-performing metric among DM3, H-SBI, and W-P, we validate our proposed distance in terms of intracluster/intercluster distance, CDI, RI, and NMI in the unsupervised learning framework. The first evaluated index was the CDI; a smaller CDI value indicates better discrimination of the cluster structure of the dataset. In Table 2, we compare our W-P metric with DM3 and H-SBI. For three of the six datasets we obtain better CDI values, which may be a consequence of the weight allocation, giving greater relevance to the attributes that make the data structure more compact. For distance-based clustering on categorical data, the K-modes algorithm with the proposed distance metric has a competitive advantage in terms of the clustering Rand index: in Table 3, the W-P metric obtains better results in four of six datasets compared with DM3 and H-SBI; in the Voting and Mushroom datasets, in particular, the RI increases drastically. To make an exhaustive evaluation of the W-P metric within the K-modes clustering algorithm with support-based initialization, the NMI index was also evaluated. As can be seen in Table 4, we obtained better results than DM3 and H-SBI, exceeding them in four of the six datasets; these results indicate that our proposed distance metric is more appropriate for unsupervised categorical data analysis.
The average intracluster distance of each cluster and the average intercluster distance between each pair of clusters are presented in Tables 5 to 10. For the Voting, WBDC, and Car datasets, the W-P intercluster distance increases in comparison with the DM3 results, while for the Mushroom, Soybean, and Zoo datasets the average W-P intracluster distance decreases in comparison with DM3. This is reasonable because, from the weight normalization in (3), we can deduce that for datasets with a larger number of attributes the weights become small, so the distances between samples are short. Moreover, in the Voting and Car datasets, the difference between the average intercluster and intracluster distances with the W-P metric is greater than with DM3; in the remaining datasets, the difference between the average intercluster and intracluster distances is very similar.

CONCLUSION
In this work, we introduced a new similarity/dissimilarity measure for categorical data based on the feature space structure. This distance metric is a weighted variation of pairing matching, which we call weighted pairing (W-P) based on feature space structure. The weights are determined by the number of states each feature has, indicating which attributes contribute more to the clusters' compact structure. The performance of the W-P metric was evaluated in terms of intracluster/intercluster distance, CDI, RI, and NMI within a K-modes algorithm with support-based initialization in the unsupervised learning framework, and compared with the distance metric (DM3) and H-SBI methods. The obtained results showed a better performance for W-P than for DM3 and H-SBI. We demonstrated that this way of computing a distance is effective in recovering the inherent clustering structures from categorical data when such structures exist, which can be attributed to the fact that our approach is space-structure based.