A Preference Model on Adaptive Affinity Propagation



INTRODUCTION
In the big data era, the analysis of large amounts of data has become an essential area of Computer Science. Data mining methods, among others clustering methods and classification methods, are needed to extract, or mine, knowledge from large amounts of data. Grouping data according to similarities based on their multiple characteristics is known as clustering [1].
In recent years, two new data clustering algorithms have been proposed. One of them is Affinity Propagation (AP), proposed by Brendan J. Frey and Delbert Dueck (2007) [2]. Unlike earlier clustering methods such as k-means, which take random data points as the first potential exemplars, AP considers all data points as potential cluster centers [3,4]. AP takes as input the similarities between data points and simultaneously considers all data points as potential cluster centers, called exemplars, by iteratively calculating the responsibility r and the availability a from the similarities until convergence. After the points converge, AP found clusters with far less error than k-means, in less than one-hundredth of the time [3].
AP has several advantages over other clustering methods because it considers all data points as potential exemplars, whereas most clustering methods find exemplars by recording and tracing a fixed set of data points and iteratively correcting it [3]. Consequently, most clustering methods do not change this set much and simply keep tracking the particular sets. Furthermore, AP supports non-symmetric similarities, and it does not depend on the initialization found in other clustering algorithms. Because of these advantages, it has been successfully applied in many disciplines, such as image clustering [3] and Chinese calligraphy clustering [5]. This paper continues our previous works [6,7]. In [6] we surveyed and investigated the performance of various AP approaches, i.e. Adaptive AP, Partition AP, Soft Constraint AP, and Fuzzy Statistic AP, and found that i) Partition AP (PAP) is the fastest of the four approaches, ii) Adaptive AP (AAP) can remove oscillations and is more stable than the others, and iii) Fuzzy Statistic AP (FSAP) can yield a smaller number of clusters than the other approaches, because its preferences are generated iteratively using fuzzy-statistical methods. In [7] we investigated the time complexity of various AP approaches, i.e. AAP, PAP, Landmark AP (L-AP), and K-AP, and found that Landmark AP has the most efficient computational cost and the fastest running time, although its number of clusters differs considerably from that of AP.
Although AP has been proven to be faster than k-means, and it has shown much success in data clustering, AP still has a limitation: it is not easy to determine the value of the parameter "preference" p that yields an optimal clustering solution [8]. The goal of this paper is to overcome this limitation by proposing a new model of the parameter "preference" p, based on the distribution of the similarities. The model is explained in subsection 3.1. It is then applied to Adaptive AP (AAP), because AAP has better accuracy than the other approaches.
In the experiments, random non-partition datasets, random partition datasets, and real datasets from the UCI repository [9] are used. For the partitioned datasets, the k-means algorithm [10] is applied to generate four groups of data points. The results of our experiments are shown in section 4.

Input Preference
Suppose we have a set of data points X = {x_1, x_2, x_3, ..., x_n}. AP takes as input a similarity matrix (SM) s between data points, where each similarity s(i, j) indicates how well data point x_j is suited to be an exemplar for x_i. Similarities of any type can be accepted, e.g. the negative Euclidean distance for real-valued data, or the Jaccard coefficient for non-metric data, so AP can be applied widely in different areas [7].
Instead of requiring that the number of clusters be predetermined, AP takes as input a real number s(j, j) for each data point j, such that data points with larger values of s(j, j) are more likely to be chosen as exemplars. These values are referred to as "preferences", and they affect the number of clusters produced. The preference values can be the median of the similarities or their minimum:

p = median(s(:)), or p = min(s(:)) (1)
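As a concrete illustration of the Eq. (1)-style preferences, the following minimal Python sketch builds a similarity matrix from negative Euclidean distances and takes the median or minimum of the off-diagonal similarities (the experiments in this paper use MATLAB; the function names here are illustrative, not the authors' code):

```python
import numpy as np

def similarity_matrix(X):
    """Pairwise similarity s(i, j) as the negative Euclidean distance."""
    diff = X[:, None, :] - X[None, :, :]
    return -np.sqrt(np.sum(diff ** 2, axis=-1))

def preference(S, mode="median"):
    """Common preference choices of Eq. (1): median or minimum of the
    off-diagonal similarities (the diagonal is reserved for the preferences)."""
    off_diag = S[~np.eye(len(S), dtype=bool)]
    return np.median(off_diag) if mode == "median" else off_diag.min()
```

Since min(s(:)) is never larger than the median, the minimum-based preference tends to produce fewer clusters.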

Message passing
Suppose we have similarities s(i, j), (i, j = 1, 2, ..., n). AP attempts to find the exemplars that maximize the net similarity, i.e. the overall sum of similarities between all exemplars and their member data points. The process in AP can be viewed as passing messages between data points. Two kinds of values are passed between data points: responsibility and availability. The responsibility r(i, j) shows how well-suited point j is to serve as the exemplar for point i, taking into account other potential exemplars for point i. The availability a(i, j) reflects the accumulated evidence for how appropriate it would be for point i to choose point j as its exemplar, taking into account the support from other points that point j should be an exemplar. Figure 1 shows how availability and responsibility work in AP. Availabilities a(i, j) are transmitted from candidate exemplars to data points to state how available each candidate exemplar is to serve as a cluster center. Responsibilities r(i, j) are transmitted from data points to candidate exemplars and state how strongly each data point favors the candidate exemplar over other candidate exemplars. This message passing continues until convergence is met or the iteration reaches a certain number.
Initially all r(i, j) and a(i, j) are set to 0, and their values are iteratively updated as follows until convergence is achieved:

r(i, j) = s(i, j) − max_{j' ≠ j} {a(i, j') + s(i, j')} (2)

a(i, j) = min{0, r(j, j) + Σ_{i' ∉ {i, j}} max{0, r(i', j)}}, for i ≠ j (3)

a(j, j) = Σ_{i' ≠ j} max{0, r(i', j)} (4)

After calculating both the responsibility and the availability, their values are damped with the previous iterates:

r(i, j) = λ r_o(i, j) + (1 − λ) r(i, j), a(i, j) = λ a_o(i, j) + (1 − λ) a(i, j) (5)

where λ is a damping factor introduced to reduce numerical oscillations, and r_o(i, j) and a_o(i, j) are the previous values of the responsibility and availability. λ should be greater than or equal to 0.5 and less than 1, i.e. 0.5 ≤ λ < 1. A high value of λ may avoid numerical oscillations, although this is not guaranteed, while a low value of λ will make AP run slowly [11].
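The damping step alone can be illustrated in a few lines: the retained message is a convex combination of the previous value and the new update (a minimal Python sketch; the helper name `damp` is ours):

```python
import numpy as np

def damp(old, new, lam=0.9):
    """Keep a fraction lam of the previous message and (1 - lam) of the update."""
    assert 0.5 <= lam < 1, "damping factor should satisfy 0.5 <= lambda < 1"
    return lam * old + (1 - lam) * new
```

With lam = 0.9, each iteration moves the messages only one tenth of the way toward their newly computed values, which suppresses oscillation at the cost of slower movement per iteration.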

Exemplar decision
For a data point x_i, if x_j maximizes r(i, j) + a(i, j), then either i) x_j is the most suitable exemplar for x_i, or ii) x_i is itself an exemplar (when j = i). The exemplar for x_i is selected by the following formula:

c_i = argmax_j {r(i, j) + a(i, j)} (6)

where c_i is the exemplar for x_i.
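Putting the pieces together, the message-passing updates, the damping step, and the exemplar decision can be sketched as follows (a vectorized Python illustration of the standard AP updates, not the authors' MATLAB code):

```python
import numpy as np

def affinity_propagation(S, preference, damping=0.9, max_iter=200):
    """Plain AP: iterate the responsibility/availability updates with damping,
    then pick each point's exemplar as the argmax of r + a."""
    n = S.shape[0]
    S = S.copy()
    np.fill_diagonal(S, preference)          # s(j, j) holds the preferences
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(max_iter):
        # Responsibility: r(i,j) = s(i,j) - max_{j' != j} [a(i,j') + s(i,j')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf       # mask the max to find the runner-up
        second_max = np.max(AS, axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = damping * R + (1 - damping) * R_new
        # Availability: a(i,j) = min(0, r(j,j) + sum_{i' not in {i,j}} max(0, r(i',j)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))      # self-responsibility is not clipped
        col_sums = Rp.sum(axis=0)
        A_new = np.minimum(0, col_sums[None, :] - Rp)
        A_new[np.arange(n), np.arange(n)] = col_sums - np.diag(Rp)  # a(j,j)
        A = damping * A + (1 - damping) * A_new
    # Exemplar decision: c_i = argmax_j [r(i,j) + a(i,j)]
    return np.argmax(R + A, axis=1)
```

On two well-separated groups of points with a median-based preference, this sketch assigns every member of a group to the same exemplar.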

Adaptive Affinity Propagation
AP has many extensions. One of them is Adaptive Affinity Propagation (AAP). AAP is designed to address two limitations of AP: we cannot know in advance which preference value yields the best clustering solution, and if oscillations occur, they cannot be eliminated automatically. To solve these problems, AAP adapts to the needs of the dataset by configuring the values of the preference and the damping factor.
As in [11], let C(i) denote the number of clusters in iteration i. We summarize the AAP algorithm as follows:
As proposed in [11], the AAP algorithm stops when the value of C(i) falls below 2. AAP has been shown to produce clustering results of better, or at least equal, quality compared with AP and to find optimal solutions on different kinds of datasets [11]. AAP can process several types of data, such as gene expression [11], travel routes [11], image clustering [11,12], mixed numerical and categorical datasets [13], text documents [14], and zoogeographical regions [15].

PROPOSED WORK

3.1. Proposed Preference Model
From a set of data points X = {x_1, x_2, ..., x_n}, suppose we take two data points x_i and x_j at random. If the distance from x_i to the other points is larger than that from x_j, then x_i has a lower possibility than x_j of being the dataset center. On the basis of this assumption, the preference for each data point can be computed as follows.
For a given data point x_i, the similarities from x_i to the other data points are summed up as:

S(i) = Σ_{j=1, j≠i}^{n} s(i, j) (7)

This sum is then normalized, and finally the preference of each data point is defined by Equation (9), where Const can be any non-zero real number or min(s(:)). The preferences of Equation (9) represent and reflect the distribution of the dataset, and we expect them to lead to better clustering results, as shown in the results section. This model is then applied to the Modified Adaptive Affinity Propagation (MAAP) algorithm, as explained in subsection 3.2. Our model is simple and easy to apply compared with the model proposed by Ping Li et al. (2017) [16].
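The normalization step and the exact form of Equation (9) did not survive typesetting; one plausible instantiation, under the assumption that the per-point sums are normalized so they add up to 1, is sketched below (the function name and the normalization choice are ours, not the paper's):

```python
import numpy as np

def similarity_based_preferences(S, const=None):
    """A plausible instantiation of the proposed preference model: sum each
    point's similarities to the others, normalize the sums so they add to 1,
    and scale by Const. The normalization is an assumption, not the paper's
    exact Eq. (9)."""
    n = S.shape[0]
    mask = ~np.eye(n, dtype=bool)
    sums = np.where(mask, S, 0.0).sum(axis=1)  # S(i) = sum_{j != i} s(i, j)
    normalized = sums / sums.sum()             # assumed normalization
    if const is None:
        const = S[mask].min()                  # the Const = min(s(:)) option
    return const * normalized
```

With negative-Euclidean similarities, points near the center of the data have sums closer to zero and so receive a less negative (larger) preference, matching the assumption that points far from the others are poor candidates for the cluster center.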

Modified Adaptive Affinity Propagation (MAAP)
We modify Adaptive AP in the following manner:
Because the proposed preferences p are set directly in the algorithm, step 6 of Algorithm 1 can be omitted.

Cluster Evaluation
Silhouette validation index and Fowlkes-Mallows index [17] are used to evaluate the quality of a clustering process.
For a given dataset X = {x_1, x_2, ..., x_n}, x_i ∈ R^m, the Silhouette index of x_i can be defined as

Sil(x_i) = (b(x_i) − a(x_i)) / max{a(x_i), b(x_i)} (10)

where a(x_i) is the mean distance from x_i to the other data points in its own cluster K_c, and

b(x_i) = min_{c' ≠ c} d(x_i, K_{c'}) (11)

where d(x_i, K_{c'}) is the mean distance from x_i to all data points in cluster K_{c'}, c' = 1, 2, ..., C, and C is the number of clusters.

IJECE ISSN: 2088-8708 1809
For the whole dataset X = {x_1, x_2, ..., x_n}, the Silhouette index of X is the average over all data points:

Sil(X) = (Σ_{i=1}^{n} Sil(x_i)) / n (12)

The Fowlkes-Mallows Index (FMI) is an external criterion [17], defined as

FMI = a / sqrt((a + b)(a + c)) (13)

where a is the number of pairs of data points with the same label that are placed in the same cluster, b is the number of pairs with the same label that are placed in different clusters, and c is the number of pairs with different labels that are placed in the same cluster.
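The FMI can be computed directly from its pair-count definition; the following sketch counts the pairs a, b, and c described above (an illustrative Python helper, not code from the paper):

```python
from itertools import combinations

def fowlkes_mallows(labels_true, labels_pred):
    """FMI = a / sqrt((a+b)(a+c)), counting pairs of points:
    a: same label, same cluster; b: same label, different clusters;
    c: different labels, same cluster."""
    a = b = c = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_label = labels_true[i] == labels_true[j]
        same_cluster = labels_pred[i] == labels_pred[j]
        if same_label and same_cluster:
            a += 1
        elif same_label:
            b += 1
        elif same_cluster:
            c += 1
    return a / ((a + b) * (a + c)) ** 0.5
```

A perfect clustering yields FMI = 1 even when the cluster indices are permuted relative to the true labels, since only the pairwise groupings matter.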

RESULTS AND DISCUSSION
The MAAP algorithm was written and run in MATLAB R2014b. The tests were carried out on a machine with 4 GB RAM and an Intel(R) Core(TM) i5-2670QM 2.20 GHz processor. We test the MAAP algorithm with: • two-dimensional random non-partition data point sets of sizes 100, 500, 1000, 1500, and 2000, to examine scalability; • two-dimensional random partition data point sets of sizes 100, 300, 500, 800, and 1000. The random non-partition data points are generated from a uniform distribution on [0, 1]. The random partition data points are divided into four groups generated using the k-means algorithm.
• real datasets, as shown in Table 1; these datasets can be downloaded from the UCI repository [9]. For the similarity, we use the negative Euclidean distance between data points: for points x_i and x_j, s(i, j) = −||x_i − x_j||.

Experiments on Random Non-Partition Dataset
The clustering results on the random non-partition datasets are presented in Tables 2 and 3, and Figure 2 shows an example result with N = 1000 data points. From those tables and Figure 2, although the numbers of clusters N_C produced by the MAAP-DDP algorithm are almost the same as those of the original AP with p = min(s), the MAAP-DDP algorithm is slower than the original AP with both p = median(s) and p = min(s). This makes sense, because the MAAP-DDP algorithm adaptively searches for the λ-value in order to eliminate oscillations.
For various values of N, the Silhouette index (Sil) of both algorithms is almost the same, ranging from 0.3 to 0.45. Interestingly, the Sil value of MAAP-DDP looks more constant, at around 0.325, which means that the clusters produced by MAAP-DDP are more stable than those of the original AP. The FMI cannot be calculated because the random non-partition datasets have no true labels.

Experiments on Random Partition Dataset
The clustering results on the random 4-partition datasets are presented in Tables 4 and 5, and Figure 3 shows an example result with N = 1000 data points. From those tables and Figure 3, MAAP-DDP has succeeded in identifying the clusters according to the number of the dataset's true labels: the number of true labels is 4, and the number of clusters (N_C) produced by MAAP-DDP is also 4 for all values of N.
The speed of MAAP-DDP is comparable with that of the original AP, i.e. the execution time of MAAP-DDP is not slower than that of the original AP, although the Sil values of MAAP-DDP are smaller than

CONCLUSIONS AND FUTURE RESEARCH
From the above results it can be concluded that: (i) the proposed algorithm, MAAP-DDP, is slower than the original AP on random non-partition datasets; (ii) on random 4-partition datasets and real datasets, the proposed algorithm succeeds in identifying the clusters according to the number of the dataset's true labels, with execution times comparable to those of the original AP.
Besides that, the MAAP-DDP algorithm is more feasible and effective than the original AAP procedure. The original AAP algorithm stops when the value of C(i) falls below 2, while the MAAP-DDP algorithm terminates after the best clusters are found. The key parameter of the MAAP-DDP algorithm is Const in Equation (9); the algorithm is designed to find the Const-value adaptively in order to obtain the best clustering solution.
In the future, to verify the algorithm further, we will test the MAAP-DDP algorithm on other kinds of datasets, e.g. synthetic datasets, face-image datasets, and so on.

ACKNOWLEDGMENT
The authors gratefully acknowledge Gunadarma University for providing research funding and for permission to use the research facilities.