Improved optimization of numerical association rule mining using hybrid particle swarm optimization and Cauchy distribution

Particle Swarm Optimization (PSO) has been applied to solve optimization problems in various fields, such as Association Rule Mining (ARM) of numerical problems. However, PSO often becomes trapped in local optima, so its results do not represent the overall optimum solutions. To address this limitation, this study combines PSO with the Cauchy distribution (PARCD), which is expected to improve the global optimum found by expanding the search space. Furthermore, this study uses multiple objective functions, i


INTRODUCTION
ARM, or association analysis, is used to find associations or relationships between variables that often occur together in a dataset [1]. In other words, association analysis builds rules over several variables in a dataset, and each rule can be separated into an antecedent and a consequent. The Apriori and Frequent Pattern (FP) growth methods are widely employed in association analysis. These methods are suitable for categorical or binary data, such as gender data, where males can be represented by 0 and females by 1 [2]. If the data are numeric, such as age, weight or length, these methods first transform the numerical data into categorical data (i.e., a discretization process). This transformation requires more time and can lose a significant amount of important information because it does not preserve the meaning of the original data [3], [4], [5]. For example, if the age of a 35-year-old person is transformed to 1, the original meaning of the age information is obscured. In addition, both methods require manual intervention to determine the minimum support (attribute coverage) and confidence (accuracy) values. This step is subjective in some cases; thus, the results will not be optimal [6], [7].
To resolve this problem, some researchers have proposed solutions that employ optimization approaches, e.g., particle swarm optimization (PSO) [4], fuzzy logic [8], and genetic algorithms (GA) [3], [7]. The PSO approach uses multiple objective functions to solve association analysis of numerical data without a discretization process. It produced better results than previous optimization methods, and it obtains the optimum value automatically without requiring the minimum support and minimum confidence to be specified. However, this method can also become trapped in local optima.
As the number of iterations tends toward infinity, the velocity of a particle approaches 0 (the inertia weight in the velocity function is between 0 and 1). The search then effectively terminates because the PSO method cannot find a better value when the velocity is 0. Thus, PSO often fails to find the overall optimal value [4], [9], [10].
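The stagnation argument can be made concrete with a short worked case. It assumes the usual inertia-weight form of the PSO velocity update and a particle that already sits at both its personal best and the global best, so that both attraction terms vanish; the symbols v(t) and ω follow standard PSO notation.

```latex
v(t+1) = \omega\, v(t)
\;\Longrightarrow\;
v(t) = \omega^{t}\, v(0),
\qquad
\lim_{t \to \infty} v(t) = 0 \quad \text{for } 0 < \omega < 1 .
```

Once v(t) is numerically zero the position no longer changes, so the particle cannot leave the point it has converged to, even if that point is only a local optimum.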
We propose a method that addresses premature convergence as well as the limitations of traditional methods, because it does not use a discretization process. In other words, the original data are processed directly using the concept of the Michigan or Pittsburgh approaches. Furthermore, support and confidence threshold values are determined automatically using the Pareto optimality concept. Premature convergence is addressed by combining PSO with the Cauchy distribution. This combination increases the size of the search space and is expected to produce a better optimal value. Yao et al. (1999) reported that combining a function with the Cauchy distribution results in a wider coverage area; thus, when the Cauchy distribution is combined with the function of the PSO method, the optimal value is expected to improve [10].
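The Pareto optimality concept mentioned above can be illustrated with a minimal Python sketch. It is not the selection procedure of the proposed method; the function names `dominates` and `pareto_front` and the candidate values are hypothetical, and all objectives are assumed to be maximized.

```python
def dominates(a, b):
    """Return True if a is no worse than b on every objective and better on one.

    Both arguments are tuples of objective values to be maximized.
    """
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (support, confidence) pairs for four candidate rules.
candidates = [(0.50, 0.90), (0.60, 0.80), (0.40, 0.70), (0.30, 0.95)]
front = pareto_front(candidates)
```

Here the candidate (0.40, 0.70) is dropped because (0.50, 0.90) is better on both objectives, while the other three remain because each is best on at least one trade-off. No support or confidence threshold has to be chosen by hand.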
Therefore, the purpose of this study is to find the optimal value for numerical data in association analysis problems by combining PSO with the Cauchy distribution (PARCD). Furthermore, we compute several objective functions, i.e., support, confidence, comprehensibility, interestingness, and amplitude, as parameters to evaluate the performance of the proposed method.
Problems in numerical data association analysis are generally solved using several approaches, including discretization, distribution and optimization. Discretization is performed using partitioning and combining, clustering [11], [12] and fuzzy [8] methods, and the optimization approach uses the optimized association rule [13], differential evolution [14], GA [3], [7] and PSO [4], [15], as shown in Figure 1. We focus on solving the association analysis of numerical data by optimization. An earlier optimization approach is the GAR method, which attempts to find the optimal item set with the best support value without using a discretization process [13]. The differential evolution approach includes the generation of the initial population, as well as mutation, crossover and selection operations, and its multi-objective functions are optimized using Pareto optimality theory. This method is known as MODENAR [14]. Furthermore, a study of numerical association rule mining using the genetic algorithm approach (ARMGA) successfully solved association analysis of numerical data without determining the minimum support or minimum confidence manually. In addition, this method can extract the best rule, i.e., the rule with the best relationship between the support and confidence values [7]. Another GA-based study used the MOGAR method. It reported that MOGAR was faster than conventional methods, such as the Apriori and FP-growth algorithms, because the time complexity of MOGAR is simpler and grows quadratically, whereas that of the Apriori algorithm grows exponentially and therefore requires more computation time [3]. PSO has also been used to solve the numerical ARM problem.
Several authors have applied PSO to ARM. One study used ARM to investigate the association of frequent and repeated dysfunction in a production process and found that PSO resulted in a faster and more effective optimization process than the other optimization methods [16]. In addition, the PSO approach was used to improve the computational efficiency of ARM problems such that appropriate support and confidence values could be determined automatically [17]. In 2012, PSO for ARM problems was extended by weighting the item set. This weighting is very important for very large data because such data often contain important information that appears infrequently. For example, in medical data, the rule {stiff neck, fever, aversion to light} → {meningitis} rarely appears, but it is very important because, in practice, this condition often occurs [18]. In 2013, Sarath and Ravi introduced binary PSO (BPSO) to generate association rules in a transaction database. This method is similar to the Apriori and FP-growth algorithms; however, BPSO can determine optimum rules without specifying the minimum support and confidence values [19]. In 2014, Beiranvand et al. studied numerical data association analysis using the PSO method. They stated that the method could effectively analyze numerical data association problems without using a discretization process. Their research employs four objective functions, i.e., support, confidence, comprehensibility and interestingness, and the method is referred to as MOPAR [4]. Also in 2014, Indira and Kanmani conducted research using a PSO approach in which they attempted to improve results and analysis time using an adaptive process to determine various parameters, such as the constants and the weight value in the velocity equation.
They developed the Apriori algorithm using a PSO approach (APSO), and the results demonstrated that this approach was faster and better than using only the Apriori method [15]. In addition, a combination of PSO and GSA has been used to solve the optimal reactive power dispatch problem in power systems. The problem was solved successfully with an efficient and reliable technique, and the results were found to be satisfactory to a large extent compared with those reported earlier [20]. Verma and Lakhwani examined ARM problems by combining PSO and a GA. The results showed better accuracy and consistency than PSO or a GA individually [21].
The PSO method has been developed in many directions, e.g., "the implementation of PSO in distributed generation sizing" [22], "improved canny edges using cellular based PSO technique in digital images" [23], and hybrid methods. One such hybrid method combines PSO with the Cauchy distribution [24]. This method provides better results than using only PSO. In 2011, this combined method was retested for SVM parameter selection [25], [26], [27]. The combined approach was also used to improve performance weaknesses in a process to identify a watermark image based on the discrete cosine transform (DCT). The results demonstrated that combining PSO with the Cauchy distribution outperforms the compared method [28]. In 2014, an empirical study showed that PSO with the Cauchy distribution outperformed PSO alone [29].
To the best of our knowledge, combining PSO with the Cauchy distribution has not been applied to ARM problems that involve numerical data. This research therefore makes an important contribution to the optimization approach for the numerical ARM problem.
The remainder of this paper is organized as follows. The research method is discussed in Section 2, which describes the design of the multiple objective functions and the development of the proposed PARCD method. Section 3 presents the experimental results and a discussion of the proposed method, which was tested using benchmark datasets. This section also provides a comparison of the results obtained by the proposed PARCD method and existing methods. Conclusions and suggestions for future work are provided in Section 4.

RESEARCH METHOD

Objective Design
This study uses multiple objective functions, i.e., support, confidence, comprehensibility, interestingness and amplitude. First, the support criterion determines the ratio of transactions containing item X to the total number of transactions (D), i.e., Support(X) = |X| / |D|. Let A be the antecedent (the precondition) and C be the consequent (the conclusion) of a rule over the transaction dataset. The support value of if A then C (A → C) is computed as follows:
$$\mathrm{Support}(A \rightarrow C) = \frac{|A \cup C|}{|D|} \quad (1)$$
where |A ∪ C| is the number of transactions that contain both A and C. The minimum support value is closely linked to the number of items covered to determine the referenced rule. If the threshold value is low, the support covers many items and vice versa. The support measurement is used to determine the confidence criterion, i.e., the criterion used to measure the quality or accuracy of the rule derived from the total transactions. Such rules are often developed for each transaction to better demonstrate quality or accuracy [4]. Confidence can be expressed as follows:
$$\mathrm{Confidence}(A \rightarrow C) = \frac{\mathrm{Support}(A \rightarrow C)}{\mathrm{Support}(A)} \quad (2)$$
However, these criteria are not guaranteed to produce appropriate rules. Thus, for a given rule to be considered reliable and to provide overall coverage, the result must also satisfy the comprehensibility and interestingness criteria. Ghosh and Nath (2004) stated that a rule with fewer attributes in its antecedent is more comprehensible [30]. The comprehensibility criterion can be expressed as follows:
$$\mathrm{Comprehensibility}(A \rightarrow C) = \frac{\log(1 + |C|)}{\log(1 + |A \cup C|)} \quad (3)$$
where, in this equation, |C| is the number of attributes in the consequent and |A ∪ C| is the number of attributes in the whole rule if A then C (A → C). Next, the interestingness criterion is used to uncover hidden information by extracting interesting or unique rules. This criterion is based on the support value and is expressed as follows:
$$\mathrm{Interestingness}(A \rightarrow C) = \frac{\mathrm{Support}(A \rightarrow C)}{\mathrm{Support}(A)} \times \frac{\mathrm{Support}(A \rightarrow C)}{\mathrm{Support}(C)} \times \left(1 - \frac{|A \cup C|}{|D|}\right) \quad (4)$$
The right side of Eq. (4) consists of three components. The first component shows the generation probability of the rule based on the antecedent attributes, the second is based on the consequent attributes, and the third is based on the total dataset. There is a negative correlation between interestingness and support. When the support value is high, the interestingness value is low because the number of frequent items covered is small [4].
The last criterion is the amplitude interval. The amplitude interval is a minimization function and thus differs from the support, confidence and comprehensibility measures, which are maximization functions. The amplitude interval is expressed as follows:
$$\mathrm{Amplitude}(A \rightarrow C) = \frac{1}{m} \sum_{i=1}^{m} \frac{u_i - l_i}{\max(A_i) - \min(A_i)} \quad (5)$$
Here, m is the number of attributes in the item set (|A ∪ C|), u_i and l_i are the upper and lower bounds encoded in the item set for attribute i, and max(A_i) and min(A_i) are the allowable limits of the interval for attribute i. Thus, rules with smaller intervals are intended to be generated [14].
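The five criteria can be sketched in Python for a single interval rule. This is a minimal sketch, not the authors' MATLAB implementation: the formulas are the forms commonly used for these measures and are assumptions of this sketch, as are the names `rule_objectives`, `rows` and `bounds` and the toy data. Antecedent and consequent are assumed to use different attributes.

```python
import math


def rule_objectives(rows, antecedent, consequent, bounds):
    """Score the interval rule antecedent -> consequent on numeric rows.

    rows: list of dicts mapping attribute name -> numeric value.
    antecedent, consequent: dicts mapping attribute name -> (lower, upper).
    bounds: dict mapping attribute name -> (min, max) over the dataset.
    """
    def covers(row, intervals):
        return all(lo <= row[name] <= hi for name, (lo, hi) in intervals.items())

    n = len(rows)
    n_a = sum(1 for r in rows if covers(r, antecedent))
    n_c = sum(1 for r in rows if covers(r, consequent))
    n_ac = sum(1 for r in rows if covers(r, antecedent) and covers(r, consequent))

    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Comprehensibility counts attributes, not transactions.
    n_attrs = len(antecedent) + len(consequent)
    comprehensibility = math.log(1 + len(consequent)) / math.log(1 + n_attrs)
    if n_a and n_c:
        interestingness = (n_ac / n_a) * (n_ac / n_c) * (1 - n_ac / n)
    else:
        interestingness = 0.0
    # Amplitude: mean interval width, normalized by each attribute's range.
    intervals = {**antecedent, **consequent}
    amplitude = sum(
        (hi - lo) / (bounds[name][1] - bounds[name][0])
        for name, (lo, hi) in intervals.items()
    ) / n_attrs
    return {
        "support": support,
        "confidence": confidence,
        "comprehensibility": comprehensibility,
        "interestingness": interestingness,
        "amplitude": amplitude,
    }


# Illustrative toy data only; these are not values from the benchmark datasets.
rows = [
    {"age": 25, "weight": 60},
    {"age": 35, "weight": 80},
    {"age": 45, "weight": 85},
    {"age": 55, "weight": 70},
]
bounds = {"age": (25, 55), "weight": (60, 85)}
scores = rule_objectives(rows, {"age": (30, 50)}, {"weight": (75, 85)}, bounds)
```

For this toy rule, two of the four rows satisfy both sides, so the support is 0.5 and the confidence is 1.0. The intervals are used directly, without discretizing age or weight.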

PSO
PSO, which was first introduced by Kennedy and Eberhart (1995), is an evolutionary method inspired by animal behavior, e.g., flocks of birds, schools of fish, or swarms of bees [31]. PSO begins with a set of random particles. Then, a search process attempts to find the optimal value by repeatedly updating the generation. During each iteration, each particle is updated by following two best values. The first is the best solution (fitness) that the particle itself has achieved to this point. This value is called pBest. The other best value tracked by the particle swarm optimizer is the best value obtained so far by any particle in the population. This value is called gBest. After finding pBest and gBest, each particle's velocity and corresponding position are updated [15].
Each particle i at iteration t has a position x(t) and a velocity v(t). The best position of each particle (pBest) and the best global position (gBest) are stored in memory. The velocity and position are updated using Eqs. 6 and 7, respectively [15].
$$V_{i,new} = \omega V_{i,old} + C_1 \, rand() \, (pBest - X_i) + C_2 \, rand() \, (gBest - X_i) \quad (6)$$
$$X_{i,new} = X_i + V_{i,new} \quad (7)$$
Here ω is the inertia weight; V_{i,old} is the velocity of the i-th particle before updating; V_{i,new} is the velocity of the i-th particle after updating; X_i is the i-th, or current, particle and X_{i,new} is its position after updating; i is the index of the particle; rand() is a random number in the range (0, 1); C1 is the cognitive component; C2 is the social component; pBest is the particle best, i.e., the local optimum found over the iterations of a run; gBest is the global best, i.e., the global optimum found over the iterations of a run. Particle velocities in each dimension are restricted to a maximum velocity V_max [32].
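The velocity and position update of Eqs. 6 and 7 can be sketched in Python for one particle. The inertia-weight form of the update, the function name `pso_step` and the default values of `w`, `c1`, `c2` and `v_max` are assumptions of this sketch and are not taken from the paper.

```python
import random


def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5, v_max=1.0, rng=random):
    """One velocity and position update for a single particle.

    x, v, p_best and g_best are lists of floats of equal length.
    Returns the new position and the new velocity.
    """
    new_x, new_v = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        # Inertia term plus cognitive (pBest) and social (gBest) attraction.
        vel = w * vi + c1 * rng.random() * (pi - xi) + c2 * rng.random() * (gi - xi)
        # Restrict the velocity in each dimension to [-v_max, v_max].
        vel = max(-v_max, min(v_max, vel))
        new_v.append(vel)
        new_x.append(xi + vel)
    return new_x, new_v
```

A particle that already sits at both pBest and gBest with zero velocity does not move under this update, which is the stagnation case that motivates the Cauchy modification.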

Cauchy Distribution
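Yao et al. (1999) used a Cauchy distribution to implement a wider mutation scale [10]. A general formula for the probability density function is expressed as follows:
$$f(x) = \frac{1}{s \pi \left[ 1 + \left( \frac{x - t}{s} \right)^2 \right]} \quad (8)$$
where t is the location parameter and s > 0 is the scale parameter; the standard Cauchy distribution has t = 0 and s = 1.
A Cauchy random variable is calculated as follows. For any random variable X with a continuous distribution function F, the random variable Y = F(X) has a uniform distribution in the range [0,1). Consequently, if F is inverted, a uniform random variable can be used to simulate the random variable X because X = F^{-1}(Y). The cumulative distribution function of the standard Cauchy distribution is expressed as follows:
$$F(x) = \frac{1}{\pi} \arctan(x) + 0.5 \quad (9)$$
Therefore, if
$$y = \frac{1}{\pi} \arctan(x) + 0.5 \quad (10)$$
then, by inverting this function, the Cauchy random variable can be expressed as follows:
$$x = \tan\left(\pi \, (y - 0.5)\right) \quad (11)$$
This function can be expressed by Eq. (12) because y has a uniform distribution in the range (0,1]. Thus, we obtain the following:
$$x = \tan\left(\frac{\pi}{2} \cdot rand[0, 1)\right) \quad (12)$$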
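The inverse-transform construction can be sketched in Python. The function names are hypothetical. `cauchy_from_uniform` inverts the cumulative distribution function y = arctan(x)/π + 0.5, while `half_cauchy_sample` uses the tan(π/2 · rand[0, 1)) form, which yields only non-negative values.

```python
import math
import random


def cauchy_cdf(x):
    """Cumulative distribution function of the standard Cauchy distribution."""
    return math.atan(x) / math.pi + 0.5


def cauchy_from_uniform(y):
    """Invert the CDF: map a uniform value y in (0, 1) to a Cauchy value."""
    return math.tan(math.pi * (y - 0.5))


def half_cauchy_sample(rng=random):
    """Draw tan(pi/2 * u) with u uniform in [0, 1); the result is >= 0."""
    return math.tan(0.5 * math.pi * rng.random())
```

Because the tangent grows without bound as its argument approaches π/2, occasional draws are very large, which is what gives the Cauchy-scaled step its long reach compared with a bounded or Gaussian step.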

PSO for Numerical Association Rule Mining with Cauchy Distribution
PARCD is an extension of the MOPAR method that combines PSO and the Cauchy distribution to solve problems that occur in the association analysis of numerical data [33]. The goal is to find the optimal value and avoid being trapped in local optima. Essentially, this method uses the concept of PSO but modifies the velocity equation by including the Cauchy distribution. The velocity function is expressed as follows:
$$V_i(t+1) = \omega V_i(t) + C_1 \, rand() \, (pBest - X_i(t)) + C_2 \, rand() \, (gBest - X_i(t)) \quad (13)$$
The next step normalizes the V_i(t + 1) value from Eq. (13), which makes the vector length 1:
$$U_i(t+1) = \frac{V_i(t+1)}{\lVert V_i(t+1) \rVert} \quad (14)$$
The variance of the Cauchy distribution is infinite and the objective function scale is 1 [10].
The result of the normalization process is multiplied by the Cauchy random variable as follows:
$$S_i(t+1) = U_i(t+1) \cdot \tan\left(\frac{\pi}{2} \cdot rand[0, 1)\right) \quad (15)$$
Then, the result of Eq. (15), which is a combination of the velocity value and the Cauchy distribution, is used to determine the new position of a particle.
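The normalization and Cauchy scaling leading to Eq. (15) can be sketched in Python. The function names are hypothetical, and adding the step S to the current position is an assumption of this sketch, since the text only states that the result of Eq. (15) is used to determine the new position.

```python
import math
import random


def cauchy_scaled_step(velocity, rng=random):
    """Normalize the velocity to unit length, then scale it by tan(pi/2 * u)."""
    norm = math.sqrt(sum(v * v for v in velocity))
    if norm == 0.0:
        # A zero velocity has no direction, so there is nothing to scale.
        return [0.0] * len(velocity)
    scale = math.tan(0.5 * math.pi * rng.random())
    return [(v / norm) * scale for v in velocity]


def move(position, velocity, rng=random):
    """Assumed position update: add the Cauchy-scaled step to the position."""
    step = cauchy_scaled_step(velocity, rng)
    return [x + s for x, s in zip(position, step)]
```

The direction of the step is still given by the PSO velocity; only its length is replaced by a heavy-tailed random value, so most steps are short while a few are long enough to leave the neighbourhood of a local optimum.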

PARCD Pseudocode and Flowchart
The PARCD pseudocode (Figure 2) and flowchart (Figure 3) show that the algorithm begins by randomly initializing the velocity and position vectors. The algorithm then calculates the multi-objective functions as the current fitness. Finally, it iterates, updating pBest, until it finds the gBest value as the optimal solution.
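The loop described above can be sketched in Python as follows. This is a simplified, hypothetical sketch rather than the pseudocode of Figure 2: a single scalar `fitness` to be maximized stands in for the multi-objective evaluation, and the name `parcd_like_search`, the parameter defaults and the clipping of positions to `[lo, hi]` are assumptions.

```python
import math
import random


def parcd_like_search(fitness, dim, n_particles=20, n_iter=100,
                      w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0, seed=0):
    """PSO loop whose step length is drawn from tan(pi/2 * u), u in [0, 1)."""
    rng = random.Random(seed)
    # Random initialization of positions and velocities.
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_particles)]
    p_best = [x[:] for x in xs]
    p_fit = [fitness(x) for x in xs]
    g_idx = max(range(n_particles), key=p_fit.__getitem__)
    g_best, g_fit = p_best[g_idx][:], p_fit[g_idx]

    for _ in range(n_iter):
        for i in range(n_particles):
            # Standard velocity update toward pBest and gBest.
            vs[i] = [w * v + c1 * rng.random() * (p - x) + c2 * rng.random() * (g - x)
                     for x, v, p, g in zip(xs[i], vs[i], p_best[i], g_best)]
            norm = math.sqrt(sum(v * v for v in vs[i]))
            if norm > 0.0:
                # Unit direction times a Cauchy-distributed step length.
                scale = math.tan(0.5 * math.pi * rng.random()) / norm
                xs[i] = [min(hi, max(lo, x + v * scale))
                         for x, v in zip(xs[i], vs[i])]
            f = fitness(xs[i])
            if f > p_fit[i]:
                p_best[i], p_fit[i] = xs[i][:], f
                if f > g_fit:
                    g_best, g_fit = xs[i][:], f
    return g_best, g_fit
```

A call such as `parcd_like_search(lambda x: -sum(v * v for v in x), dim=2)` searches for the maximum of a simple concave test function. In the multi-objective setting, the scalar comparison `f > p_fit[i]` would have to be replaced by a comparison over several objectives.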

Experimental Setup
We conducted an experiment using the Quake, Basketball, Body fat, Pollution, and Bolt benchmark datasets listed in Table 1, which were taken from the Bilkent University Function Approximation Repository. The experiment was performed using a computer with an Intel Core i5 processor and 8 GB of main memory running Windows 7. The algorithms were implemented using MATLAB.

Experiments
Association rule analysis comprises two steps. The first step is to determine the frequent itemset that includes the antecedents or consequents of each attribute. The second step is to implement the proposed algorithm.

Output Rules of the PARCD Results
This experiment reports the 20th run, where each run contains 2000 rules. We present the output rules for three datasets, i.e., the Body fat, Bolt, and Pollution datasets. Table 3 shows the results obtained with the Body fat dataset. For Rule 1, there are eight antecedent attributes and three consequent attributes. For Rule 2, the numbers of antecedent and consequent attributes are the same as for Rule 1. For the last rule, the numbers of antecedent and consequent attributes are six and two, respectively.
The antecedent attributes of Rule 1 are case number, percent body fat (Siri's equation), density, age, adiposity index, chest circumference, abdomen circumference, and thigh circumference. The consequent attributes are percent body fat (Brozek's equation), height, and hip circumference. For Rule 2, the antecedent and consequent attributes are the same as for Rule 1. Thus, Rules 1 and 2 can be expressed as follows: if (att1, att3, att4, att5, att8, att11, att12, att14) then (att2, att7, att13). For Rule 2000, the antecedent attributes are percent body fat using Brozek's equation, percent body fat using Siri's equation, density, height, neck circumference and knee circumference, and the consequent attributes are case number and weight. Therefore, Rule 2000 is if (att2, att3, att4, att7, att10, att15) then (att1, att6).
Table 4 shows the results obtained with the Bolt dataset, which has eight attributes (run, speed, total, speed2, number2, Sens, time and T20Bolt). As can be seen, the first two rules have the same antecedent and consequent attributes. The antecedent attributes are total and time, and the consequent attributes are run and speed1. Therefore, the rule is if (total, time) then (run, speed1). Rule 2000 shows that the antecedent attributes are run and speed2; however, the consequent attribute is unknown. Thus, this rule cannot be stated clearly because it does not have a conclusion.
Table 5 shows the rule results for the Pollution dataset obtained using the proposed particle representation PARCD method. The results for the first and second rules are the same.
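The way the rules above are written out, and the reason Rule 2000 of the Bolt dataset cannot be stated, can be illustrated with a small Python helper. The function `format_rule` is hypothetical and is not part of the authors' implementation.

```python
def format_rule(antecedent, consequent):
    """Render a rule as 'if (...) then (...)'; both sides must be non-empty."""
    if not antecedent or not consequent:
        raise ValueError("a rule needs a non-empty antecedent and consequent")
    return f"if ({', '.join(antecedent)}) then ({', '.join(consequent)})"


# The first Bolt rule reported in the text.
bolt_rule = format_rule(["total", "time"], ["run", "speed1"])
```

Here `bolt_rule` is the string `if (total, time) then (run, speed1)`, whereas calling `format_rule(["run", "speed2"], [])` raises `ValueError`, mirroring a rule that has no conclusion.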
Here, the attributes of the Body fat dataset are:
Att6: Weight (lbs)
Att7: Height (inches) (target)
Att8: Adiposity index
Att9: Fat Free Weight
Att10: Neck circumference (cm)
Att11: Chest circumference (cm)
Att12: Abdomen circumference (cm)
Att13: Hip circumference (cm)
Att14: Thigh circumference (cm)
Att15: Knee circumference (cm)
Att16: Ankle circumference (cm)
Att17: Extended biceps circumference (cm)
Att18: Forearm circumference (cm)
Att19: Wrist circumference (cm)
and the attributes of the Pollution dataset are:
Att9: NONW, non-white population in urbanized areas, 1960
Att10: WWDRK, employed in white collar occupations
Att11: POOR, poor families with income < USD 3000
Att12: HC, relative hydrocarbon pollution potential
Att13: NOX, same as nitric oxides
Att14: SO@, same as sulphur dioxide
Att15: HUMID, annual average relative humidity at 1 pm
Att16: MORT, total age-adjusted mortality rate per 100,000

Output of multi-objective function and correlation of PARCD methods
The basic concept of association analysis comprises two steps: the first step determines the rules, each of which contains an antecedent and a consequent, and the second step implements the algorithm (i.e., the proposed method). The method begins with an initialization process, which determines the multi-objective function values and calculates the velocity and position of each particle i. Then, an iterative process is performed to search for pBest and gBest as the optimal solution.
Table 6 shows the results of the multi-objective functions of the PARCD method. Four parameters are reported, i.e., support, confidence, comprehensibility and interestingness, and the method is examined using five datasets, i.e., Quake, Basketball, Body fat, Bolt, and Pollution. Generally, the Bolt dataset is the dominant dataset and has the highest value for each parameter (except comprehensibility). Conversely, the least dominant dataset is Quake (with the exception of the confidence parameter). The first parameter, i.e., support, showed the highest value with the Bolt dataset (250.84%) and the lowest with the Quake dataset (22.97%). The average was approximately 90%. As with support, the highest confidence value was obtained with the Bolt dataset (96.88%), with a deviation of approximately 10. The lowest confidence value was obtained with the Pollution dataset (34.96%), with a very high deviation of just under 45. The average confidence value was approximately 80%. The highest comprehensibility value was obtained with the Quake dataset (approximately 785). The lowest comprehensibility value was obtained with the Pollution dataset (approximately 110, with a deviation well over 165). The average comprehensibility value was approximately 400.
The final parameter, i.e., interestingness, obtained the highest value with the bolt dataset (approximately 43% with a deviation of just under 40). The lowest interestingness value was obtained with the quake dataset (2.34% with a deviation of just under 10). The average interestingness value was approximately 15%. This demonstrates that the support and confidence values, i.e., 90% and 80% respectively, were satisfactory. Moreover, the comprehensibility value was four times better; however, the interestingness value was not satisfactory (approximately 15%).
The correlation values between each pair of objective functions are shown in Table 7 and Figure 4. The results show significant associations, either positive or negative, between the objective functions. The correlation value of all objective functions to amplitude was always close to zero. In other words, the correlation to the amplitude function was low. This supports the observation of Alatas et al. (2008), i.e., the amplitude function differs from the other functions because it is minimized while the other functions are maximized.
Table 8 shows a comparison of the support value obtained by the proposed PARCD method and five previous methods (i.e., the MOPAR, MODENAR, GAR, MOGAR, and RPSOA methods). Generally, the support percentage obtained by the PARCD method was better than that obtained by the other methods. With the Quake dataset, the support value obtained by the PARCD method was the lowest (22.97%), and the highest value was obtained by the MOPAR method (46.26%). The support value of the remaining methods was just over 35% on average. With the Basketball and Body fat datasets, the support values obtained by the PARCD method were the highest, i.e., 61.04% and 73.94%, respectively. The second highest support value was obtained by the MOGAR method with the Basketball dataset (50.82%). The average support value of all other methods was well over 35%. The lowest support value for the Body fat dataset was obtained by the MOPAR method (22.95%), and the average value was approximately 65%.
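The kind of pairwise correlation reported in Table 7 can be computed as in the following Python sketch. It assumes the Pearson coefficient, which the text does not name explicitly, and the function name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)
```

Applied to the per-rule values of two objective functions, a result near +1 or -1 indicates a strong positive or negative association, and a result near 0, as reported for amplitude, indicates little linear association.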
The comparison of the number of rules and confidence values is shown in Table 9. The proposed PARCD method produces a number of rules similar to that of the other methods. The greatest number of rules obtained with the Quake dataset was achieved by the MODENAR method (55 rules). The PARCD method obtained the greatest number of rules with the Basketball dataset (78 rules); however, with the Body fat dataset, the PARCD method obtained the lowest number of rules (32). The MOGAR method obtained the greatest number of rules with the Basketball dataset. The confidence values obtained by the PARCD, MOPAR, and MOGAR methods were approximately the same (just over 80%). Generally, the MOPAR method showed the highest confidence value with all datasets, with the exception of the Body fat dataset, with which the MOGAR method obtained the highest confidence value. The PARCD method ranked second. Tables 8 and 9 show that the support and confidence values were correlated with the number of rules, i.e., a significant negative correlation was observed. If the support and confidence values were high, then the number of rules was low (and vice versa). This occurs because high support and confidence values filter the rules more selectively.
Table 10 shows the size value and amplitude percentage obtained by the proposed PARCD and existing methods. Generally, the size value of the Body fat dataset was the highest with all methods, e.g., the GAR method obtained a size value of approximately 7.5. On the other hand, the size value of the Quake dataset with the MODENAR method was the lowest. The PARCD method obtained its best amplitude value with the Basketball dataset (approximately 2%), while its worst value, approximately 65%, was obtained with the Quake dataset. The amplitude value obtained by the MOPAR method was fairly good. With the Body fat dataset, the MOPAR method obtained approximately 4%, and with the Quake dataset it obtained just over 50%, which was less than that obtained by the PARCD method. In addition, the MODENAR, MOGAR, and GAR methods outperformed both the PARCD and MOPAR methods. Their amplitude results were approximately 17% to 29% for all datasets. The overall results indicate that the proposed PARCD method can search a wider space than the existing methods when searching for an optimal value. These results also indicate that the proposed method may be robust for problems in other fields, such as the numerical association rule mining optimization problem.

CONCLUSION
This study has shown that combining PSO with the Cauchy distribution can solve the numerical ARM problem. The problems of local minima and premature convergence with large datasets can be addressed using the proposed method. The experimental results demonstrate that the proposed PARCD method outperforms existing methods (i.e., MOPAR, MODENAR, GAR, and RPSOA) relative to all multi-objective functions, such as the support, confidence, comprehensibility, interestingness and amplitude functions. In future work, the numerical ARM problem can be further improved by developing or combining other methods, such as time series or deep learning methods.