Association rule hiding using integer linear programming

Received Sep 5, 2020; Revised Dec 12, 2020; Accepted Jan 19, 2021

Privacy preserving data mining has become the focus of attention of government statistical agencies and database security research community who are concerned with preventing privacy disclosure during data mining. Repositories of large datasets include sensitive rules that need to be concealed from unauthorized access. Hence, association rule hiding emerged as one of the powerful techniques for hiding sensitive knowledge that exists in data before it is published. In this paper, we present a constraint-based optimization approach for hiding a set of sensitive association rules, using a well-structured integer linear program formulation. The proposed approach reduces the database sanitization problem to an instance of the integer linear programming problem. The solution of the integer linear program determines the transactions that need to be sanitized in order to conceal the sensitive rules while minimizing the impact of sanitization on the non-sensitive rules. We also present a heuristic sanitization algorithm that performs hiding by reducing the support or the confidence of the sensitive rules. The results of the experimental evaluation of the proposed approach on real-life datasets indicate the promising performance of the approach in terms of side effects on the original database.


INTRODUCTION
Data mining aims to explore and analyze huge volumes of data, and data mining systems are categorized according to the types of knowledge they discover. However, the knowledge discovered through various data mining algorithms may contain sensitive information about an individual or a business. Disclosure of such sensitive information may pose a threat to security. Hence, comprehensive sanitization of the database is essential when data is shared with a third party.
Privacy preserving data mining (PPDM) [1,2] has evolved as an interesting problem in database security applications due to the conflicting requirements of data sharing, proprietary data disclosure, privacy concerns and knowledge discovery. The objective of PPDM is to develop data sanitization algorithms that modify the data such that the sensitive knowledge remains hidden even after mining algorithms are applied. Verykios et al. [3] analysed the state of the art and presented a classification hierarchy and clustering of different privacy preserving data mining techniques. Bertino et al. [4] presented an approach for evaluating different attributes of a privacy preserving algorithm. Knowledge hiding, a subfield of PPDM, can be achieved by a process known as data sanitization [5]. The process of knowledge hiding modifies the sensitive data before delivering it to the third party [6] in order to ensure data privacy.
In this paper, we focus on the knowledge hiding process in the context of association rule mining (ARM). The sensitive rule hiding problem is common in collaborative ARM applications, where the organization must reveal only part of the information found in the data and must conceal the strategic knowledge that can be inferred from the sensitive rules. Hence, sensitive patterns should no longer be extractable from the database, while database utility should be well preserved. The technique that we propose is formulated as an integer linear programming (ILP) problem. The objective of the proposed ILP formulation is to minimize the total weight of the sensitive transactions selected for sanitization while achieving zero hiding failure. The solution obtained from the ILP determines the transactions that need to be sanitized in order to hide the sensitive rules.
The paper is organized as follows: Section 2 presents a brief overview of previous works on sensitive association rule hiding. Section 3 provides a formalization of the problem, and the proposed methodology is described in Section 4. Section 5 discusses performance results of the proposed methodology. The final section of this paper presents concluding remarks and future extensions.

RELATED WORK
Data sanitization approaches are categorized into heuristic, border, exact and evolutionary algorithms. The heuristic approach uses blocking or distortion techniques to determine suitable sensitive items and transactions for modification. Two fundamental heuristic approaches were presented in [7] to prevent sensitive rules from disclosure. The first method hides the frequent itemsets from which sensitive rules are derived, thereby preventing sensitive rules from being generated. The second method reduces the relevance of sensitive rules by bringing their confidence below the given minimum threshold. Oliveira et al. [8,9] implemented an index schema and a transaction retrieval engine to speed up the process of sanitization. The Aggregate, Hybrid and Disaggregate algorithms presented in [10] perform better than the SWA algorithm [9] in terms of data utility but suffer from high computational complexity. The item grouping algorithm (IGA) presented in [8] is improved in [11] to decrease the number of modifications. Hong et al. [12] proposed a greedy SIF-IDF algorithm that uses the TF-IDF measure from information retrieval to compute the correspondence between sensitive itemsets and transactions. Cheng et al. [13] proposed an algorithm that reduces the degree of data distortion by modifying the least number of transactions needed to conceal a sensitive rule. Le et al. [14] proposed a distortion-based approach that modifies the minimum number of transactions during the hiding process. Pang et al. [15] devised a sensitive association rule hiding algorithm for outsourced data uploaded from multiple data owners in a twin cloud architecture using a homomorphic cryptosystem. Shaoxin et al. [16] proposed a database reconstruction-based technique for hiding frequent itemsets that achieves a high degree of privacy and reasonable data utility of the synthetic database.
The main drawback of the heuristic approach is that in the majority of cases, it fails to deliver an optimal solution to the sanitization problem.
The border approach focuses on reducing side effects on non-sensitive itemsets during the process of database sanitization. The sanitization process utilizes border theory [17] to reduce the impact of the sanitization process on low-support non-sensitive itemsets. The border approach presented by Sun and Yu [18] uses the positive border to keep track of the impact of transaction sanitization. Telikani et al. [19] devised the DCR algorithm, which combines heuristic and border-based approaches in order to minimize the impact on non-sensitive rules while hiding sensitive rules. The greedy algorithm presented in [20] uses border theory to provide an optimal solution for hiding sensitive frequent itemsets.
The exact approach formulates the sanitization problem as a constraint satisfaction problem (CSP). Menon et al. [21] utilized ILP to formulate a CSP that determines the least number of transaction sanitizations needed to conceal sensitive itemsets. Divanis and Verykios [22] defined a CSP to select candidate itemsets for modification. Their sanitization algorithm determines the frequent itemsets that belong to the positive and negative borders. The first phase of the sanitization process terminates when all sensitive itemsets are concealed with zero side effects. Otherwise, the second phase is executed until a feasible solution to the CSP is found. CSP-based approaches maintain data accuracy well but require high computation time.
Evolutionary algorithms encode the sanitization problem into a population of binary solutions. The Cuckoo Optimization method proposed in [23] conceals sensitive association rules while minimizing the number of cycles and accesses. The GA-based algorithms proposed in [24] and the PSO-based algorithms devised in [25] are deletion-based approaches that define a fitness function to evaluate each chromosome and thereby determine the side effects of sanitization. Each solution consists of a transaction set, which is used for the chromosome encoding. Wu et al. [26] presented ACS2DT, an algorithm based on the ant colony system, to reduce side effects. The genetic algorithm approach proposed in [27] formulates an objective function that computes the side effect on non-sensitive rules. The ABC4ARH rule hiding algorithm presented in [28] selects sensitive transactions by using an improved discrete binary artificial bee colony algorithm. Genetic algorithm-based approaches provide strategies only for identifying transactions to be removed from or added to the database.

PROBLEM STATEMENT
Let $I = \{i_1, i_2, \ldots, i_m\}$ be the finite set of $m$ items. An itemset $I_k$ is a nonempty subset of $I$, i.e. $I_k \subseteq I$, and a $k$-itemset is an itemset containing $k$ items. Let $D = \{T_1, T_2, \ldots, T_n\}$ with $\forall i,\ 1 \le i \le n : T_i \subseteq I$ be a tuple of transactions over $I$. This tuple is called the transaction database $D$. A transaction $T_i \in D$ supports an itemset $\alpha$ iff $\alpha \subseteq T_i$. The support count of $\alpha$ is the number of transactions containing $\alpha$, denoted as $|\alpha|$. The itemset $\alpha$ is frequent if its support is greater than or equal to the given minimum support threshold.
An association rule is defined as an implication expression α→β where α, β ⊆ I and α ∩ β = ∅. A rule α→β is said to hold with support σ in the database, where σ is the fraction of transactions that cover both α and β. A rule α→β is said to have confidence δ in the set of transactions, where δ measures how frequently itemset β appears in the transactions that cover itemset α. Writing |α→β| for the number of transactions that contain both α and β, the support σ and confidence δ are mathematically formulated by (1) and (2).
\[ \sigma(\alpha \rightarrow \beta) = \frac{|\alpha \rightarrow \beta|}{|D|} \tag{1} \]
\[ \delta(\alpha \rightarrow \beta) = \frac{|\alpha \rightarrow \beta|}{|\alpha|} \tag{2} \]
Let σmin and δmin be the user-specified minimum support threshold and minimum confidence threshold. A rule is strong if it satisfies both the support and the confidence thresholds. The ARM algorithm finds all strong association rules. To determine the strong rules, the rule mining algorithm first finds all the itemsets in D that are frequent enough to be considered important, i.e. those with support ≥ σmin (frequent itemsets), and subsequently derives the rules that are strong enough to be considered interesting. Sensitive rules are strong association rules that the data owner wants to hide. The association rule hiding problem aims to prevent these sensitive rules from being disclosed.
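The definitions of support, confidence and strong rules can be illustrated with a minimal Python sketch. The function names and the five-transaction toy database are assumptions made for illustration and do not come from the paper. Exact fractions are used so that comparisons against the thresholds are not affected by floating-point rounding.

```python
from fractions import Fraction


def support_count(itemset, db):
    """Number of transactions in db that contain every item of itemset."""
    items = frozenset(itemset)
    return sum(1 for transaction in db if items <= transaction)


def rule_support(antecedent, consequent, db):
    """Fraction of transactions that cover both antecedent and consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return Fraction(support_count(both, db), len(db))


def rule_confidence(antecedent, consequent, db):
    """Share of transactions covering the antecedent that also cover the consequent."""
    covered = support_count(antecedent, db)
    if covered == 0:
        return Fraction(0)  # convention assumed here for an unsupported antecedent
    both = frozenset(antecedent) | frozenset(consequent)
    return Fraction(support_count(both, db), covered)


def is_strong(antecedent, consequent, db, sigma_min, delta_min):
    """A rule is strong if it meets both the support and confidence thresholds."""
    return (rule_support(antecedent, consequent, db) >= sigma_min
            and rule_confidence(antecedent, consequent, db) >= delta_min)


# Hypothetical toy database: each string is one transaction of single-letter items.
TOY_DB = [frozenset(t) for t in ("abc", "ab", "ac", "bc", "abc")]

print(rule_support("a", "b", TOY_DB))     # 3/5
print(rule_confidence("a", "b", TOY_DB))  # 3/4
```

With thresholds of 1/2 for support and 7/10 for confidence, the rule a→b is strong in this toy database; raising the confidence threshold to 4/5 makes it fail the test.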
The sensitive association rule hiding problem addressed in this paper is stated as follows: Given a transactional database D, minimum support threshold σmin, minimum confidence threshold δmin, a set of strong rules R mined from D and a set of sensitive rules S⊆R to be hidden, modify the original D into a transformed database D′ to hide sensitive rules S from being disclosed, while minimally influencing nonsensitive rules in the set R-S. In this paper, we propose an approach to reduce the support or confidence of the sensitive rules below the user specified minimum threshold by sanitizing selected transactions of D such that no sensitive rule is discovered from D′. The proposed approach conceals sensitive association rules while maintaining data utility.

PROPOSED SOLUTION
This section presents the proposed ILP-based strategy for hiding sensitive rules. A sensitive rule α→β can be hidden using one of the following methods:
Method 1: removing an item j∈α or β from the selected transactions until support(α→β) < σmin.
Method 2: adding all items j∈α to the selected transactions until confidence(α→β) < δmin.
Method 3: removing an item j∈α from the selected transactions until support(α→β) < σmin or confidence(α→β) < δmin.
The insertion or deletion of any item may lead to side effects, namely ghost rules and lost rules: i) ghost rules are new non-sensitive rules that are discovered from the transformed database D′ but are not present in the input database D; and ii) lost rules are non-sensitive rules that are discovered from the input database D but are lost in the transformed database D′ during the hiding process. The solution to the hiding problem is split into three phases: pre-processing, ILP formulation and the hiding process.
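The side effects defined above reduce to set differences between the rules mined before and after sanitization. The following sketch assumes that rules are represented as plain strings and that the rule sets have already been mined; the function name and the example rules are hypothetical.

```python
def side_effects(rules_before, rules_after, sensitive):
    """Compare the strong rules mined from D and from D'.

    rules_before: strong rules mined from the original database D
    rules_after:  strong rules mined from the sanitized database D'
    sensitive:    the subset of rules_before that must be hidden
    """
    rules_before = set(rules_before)
    rules_after = set(rules_after)
    sensitive = set(sensitive)
    non_sensitive = rules_before - sensitive
    return {
        # non-sensitive rules that can no longer be mined from D'
        "lost": non_sensitive - rules_after,
        # rules that appear in D' although they were not strong in D
        "ghost": rules_after - rules_before,
        # sensitive rules that can still be mined from D'
        "hiding_failures": sensitive & rules_after,
    }


# Hypothetical rule sets, written as "antecedent->consequent" strings.
EXAMPLE = side_effects(
    rules_before={"a->b", "b->c", "a->c"},
    rules_after={"b->c", "c->a"},
    sensitive={"a->b"},
)
print(EXAMPLE)
```

In this example the sensitive rule is hidden, one non-sensitive rule (a->c) is lost and one ghost rule (c->a) appears.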

Pre-processing
To find the solution to the hiding problem, we employ the item deletion strategy of method 3, as it has more utility in hiding sensitive association rules. One of the key issues that needs to be resolved for the sanitization is identifying suitable transactions in the database for modification. If an item that belongs to the consequent itemset of a sensitive rule is removed from a selected supporting transaction, both the support of the generating itemset and the confidence of the sensitive rule are reduced, while the support of the antecedent remains unaffected. In contrast, if an item of the antecedent itemset of a sensitive rule is removed from a supporting transaction, the supports of both the antecedent and the generating itemset are reduced. This technique therefore decreases the confidence more slowly than the former technique. To optimize and speed up the hiding process, a pre-processing phase is implemented to find the database D1 of all sensitive transactions, i.e. transactions that completely support one or more sensitive rules. The pre-processing phase also finds the non-sensitive rules that contain no items of any sensitive rule and deletes them from the set of non-sensitive rules S′, because database sanitization has no impact on such non-sensitive rules.
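The pre-processing phase can be sketched as two filters. This is an illustrative reading of the description above, not the authors' implementation: rules are assumed to be (antecedent, consequent) pairs of frozensets, and the function names and toy data are hypothetical.

```python
def rule_items(rule):
    """All items of a rule given as an (antecedent, consequent) pair."""
    antecedent, consequent = rule
    return frozenset(antecedent) | frozenset(consequent)


def preprocess(db, sensitive_rules, non_sensitive_rules):
    """Return (D1, reduced S') as described in the pre-processing phase."""
    sensitive_itemsets = [rule_items(r) for r in sensitive_rules]
    # D1: transactions (with their index in D) that fully support
    # at least one sensitive rule.
    d1 = [(tid, t) for tid, t in enumerate(db)
          if any(s <= t for s in sensitive_itemsets)]
    # Non-sensitive rules that share no item with any sensitive rule
    # cannot be affected by the sanitization, so they are dropped.
    sensitive_items = frozenset().union(*sensitive_itemsets)
    affected = [r for r in non_sensitive_rules
                if rule_items(r) & sensitive_items]
    return d1, affected


# Hypothetical toy input.
DB = [frozenset("abc"), frozenset("ab"), frozenset("cd"), frozenset("bc")]
SENSITIVE = [(frozenset("a"), frozenset("b"))]
NON_SENSITIVE = [(frozenset("b"), frozenset("c")),
                 (frozenset("c"), frozenset("d"))]

D1, KEPT = preprocess(DB, SENSITIVE, NON_SENSITIVE)
print([tid for tid, _ in D1])  # [0, 1]
```

Only the first two transactions support a→b, and only b→c shares an item with the sensitive rule, so c→d is removed from further consideration.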

ILP formulation
The solution to the rule hiding problem is modelled with the ILP shown in (3), (4) and (5).
\[ \min \sum_{T_i \in D_1} c_i v_i \tag{3} \]
\[ \text{subject to} \quad \sum_{T_i \in D_1 :\ T_i \text{ supports } S_j} v_i \ \ge\ n_{min} \qquad \forall S_j \in S \tag{4} \]
\[ v_i \in \{0, 1\} \qquad \forall T_i \in D_1 \tag{5} \]
Each variable vi and coefficient ci corresponds to a transaction Ti in the pre-processed database D1, and each constraint corresponds to a sensitive association rule Sj in S. A constraint contains the variable vi if the corresponding sensitive rule is supported by the transaction Ti.
The linear system has one variable for each sensitive transaction and |S| constraints. The objective function of the ILP shown in (3) aims to minimize database side effects while achieving zero hiding failure. In order to conceal the sensitive rule Sj, the constraints given in (4) ensure that at least nmin transactions that support Sj are selected for sanitization, where nmin is computed for each Sj. Constraint (5) represents the selection or rejection of a transaction Ti and forces each variable vi to be zero or one. The solution generated by the linear program indicates the set of transactions that need to be selected for sanitization. The coefficient assigned to each sensitive transaction can have a significant effect on the collection of transactions identified for modification and hence on the quality of the transformed database D′. In order to compute the minimum number of transactions nmin, the following properties are used.
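The structure of the selection problem can be made concrete with a small sketch. The paper does not state which ILP solver is used, so the code below swaps in exhaustive enumeration over all binary vectors, which is only practical for a handful of transactions. The coefficients, rule names and thresholds are hypothetical.

```python
from itertools import product


def solve_selection(coeffs, supporters, n_min):
    """Brute-force stand-in for an ILP solver on tiny instances.

    coeffs:     list of coefficients c_i, one per sensitive transaction
    supporters: dict mapping each sensitive rule to the set of indices i
                of the transactions that support it
    n_min:      dict mapping each sensitive rule to the minimum number
                of its supporting transactions that must be selected
    Returns (v, cost) for a cheapest feasible 0/1 vector, or (None, None).
    """
    best_v, best_cost = None, None
    for v in product((0, 1), repeat=len(coeffs)):
        # One covering constraint per sensitive rule.
        feasible = all(sum(v[i] for i in supporters[rule]) >= n_min[rule]
                       for rule in supporters)
        if not feasible:
            continue
        cost = sum(c * x for c, x in zip(coeffs, v))
        if best_cost is None or cost < best_cost:
            best_v, best_cost = v, cost
    return best_v, best_cost


# Hypothetical instance: four sensitive transactions, two sensitive rules.
COEFFS = [3, 1, 2, 5]
SUPPORTERS = {"S1": {0, 1, 2}, "S2": {1, 3}}
N_MIN = {"S1": 2, "S2": 1}

SELECTION, COST = solve_selection(COEFFS, SUPPORTERS, N_MIN)
print(SELECTION, COST)  # (0, 1, 1, 0) 3
```

Transaction 1 is cheap and supports both rules, so it is selected first; transaction 2 is the cheapest way to reach two selections for S1. For realistic database sizes an actual ILP solver would be needed in place of the enumeration.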
Property 1: Let Ts be the set of transactions supporting the sensitive rule α→β. To reduce the confidence of the sensitive rule below δmin, the least number of transactions to be sanitized in Ts is n1=⌈|α→β|-|α|*δmin⌉+1. Proof: If an item of the consequent itemset of a sensitive rule Sj is deleted from a transaction in Ts, then the support count of Sj decreases by 1. Assume that n1 is the least number of transactions in Ts that must be sanitized to decrease the rule's confidence below δmin; then we have (|α→β|-n1)/|α|<δmin. Therefore n1>|α→β|-|α|*δmin. Since n1 is the least integer, we can derive n1=⌈|α→β|-|α|*δmin⌉+1.
Property 2: Let Ts be the set of transactions supporting the sensitive rule α→β. To reduce the support below σmin, the least number of transactions to be sanitized in Ts is n2=⌈|α→β|-σmin*|D|⌉+1. Proof: If an item of the consequent itemset of a sensitive rule Sj is deleted from a transaction in Ts, then the support count of Sj decreases by 1. Assume that n2 is the least number of transactions in Ts that must be sanitized to reduce the support of Sj below σmin; then we have (|α→β|-n2)/|D|<σmin. Therefore n2>|α→β|-|D|*σmin. Since n2 is the least integer, we can derive n2=⌈|α→β|-σmin*|D|⌉+1.
From properties 1 and 2, it can be deduced that the least number of transactions that require sanitization is nmin=min(n1, n2) to suppress the sensitive rule α→β. Since decreasing the support of some sensitive rule A→B may have an impact on the support of antecedent itemset of the sensitive rule α→β, n1 cannot be calculated in advance. Therefore nmin=n2=⌈|α→β|-σmin*|D|⌉+1.
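The two bounds can be evaluated directly. The sketch below implements the formulas exactly as stated in Properties 1 and 2; the function names and the numeric inputs are hypothetical, and thresholds are passed as exact fractions to avoid floating-point rounding inside the ceiling.

```python
from fractions import Fraction
from math import ceil


def n1_confidence(rule_count, antecedent_count, delta_min):
    """n1 = ceil(|a->b| - |a| * delta_min) + 1, as stated in Property 1."""
    return ceil(rule_count - Fraction(delta_min) * antecedent_count) + 1


def n2_support(rule_count, sigma_min, db_size):
    """n2 = ceil(|a->b| - sigma_min * |D|) + 1, as stated in Property 2."""
    return ceil(rule_count - Fraction(sigma_min) * db_size) + 1


# Hypothetical counts: |a->b| = 6, |a| = 8, |D| = 10.
N1 = n1_confidence(6, 8, Fraction(1, 2))
N2 = n2_support(6, Fraction(2, 5), 10)
print(N1, N2)  # 3 3

# After sanitizing N2 transactions the rule falls below the support threshold.
print(Fraction(6 - N2, 10) < Fraction(2, 5))  # True
```

In this example two sanitizations would leave the support at exactly 4/10, which is not below the threshold, so three are required. When the quantity inside the ceiling is not an integer, the formula as written can select one transaction more than the strict inequality requires; the sketch follows the text rather than tightening it.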
The coefficient for each transaction that is included in the constraint matrix is computed using the coefficient computing algorithm shown in Figure 1. The constraint matrix is created by considering the transactions that support one or more sensitive rules. The objective function is devised such that the binary variables that indicate the selection or rejection of a transaction are multiplied by pre-calculated coefficients that reflect how likely the sanitization of that transaction is to cause side effects. We assign a weight to each item present in the consequent itemset of the sensitive rules based on the number of antecedent and consequent itemsets of the sensitive rules in which it appears.
Furthermore, a small constant μ is added to the denominator to prevent the possibility of division by zero. The impact of deleting the maximum weight item on non-sensitive transactions is used as the coefficient of a sensitive transaction. The value of the transaction coefficient indicates the risk of over-concealing non-sensitive rules when the transaction is selected for sanitization. A sensitive transaction containing fewer non-sensitive rules with large support yields a low coefficient. On the other hand, a transaction that contains more non-sensitive rules with support closer to σmin yields a high coefficient and is less likely to be selected for sanitization. The impact ek of deleting the item k on a non-sensitive rule S′j is calculated using (6), where |S′j|1 and |S′j| are the supports of the non-sensitive rule S′j in D1 and D, respectively. Two transactions Ti and Tj that contain the same number of non-sensitive rules are not necessarily equally likely to introduce side effects when an item is deleted from them.

Hiding process
The sanitization algorithm ILPARH shown in Figure 2 hides each sensitive rule α→β by deleting an item from the consequent itemset β until its support is below σmin or its confidence is below δmin. The number of item deletions required for a sensitive rule α→β is given by nmin=n2. The algorithm computes the weight wj for each item j∈Ti as described in the coefficient computing algorithm, and the item with the maximum weight is selected as the victim item for deletion. If two or more items have the same maximum weight, then the item contained in the fewest antecedents of sensitive rules is selected. If a tie arises again, then the item with the highest support is selected for deletion. If two or more items have the same maximum support, then the sanitization algorithm picks the victim item randomly. The sensitive rules containing the victim item are then removed from Si. The supports of the affected frequent itemsets and sensitive rules, and the confidences of the affected sensitive rules, are updated. This procedure is repeated until Si is empty.
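The tie-breaking order for the victim item can be written as a chain of filters. The sketch below takes the item weights as given inputs, since the coefficient computing algorithm of Figure 1 is not reproduced in the text; the function name, parameter names and example values are hypothetical.

```python
import random


def pick_victim(candidates, weight, antecedent_count, support, rng=random):
    """Select the victim item among the candidate items of one transaction.

    Order of criteria: maximum weight, then fewest sensitive-rule
    antecedents containing the item, then highest support, then random.
    """
    best_weight = max(weight[j] for j in candidates)
    tied = [j for j in candidates if weight[j] == best_weight]
    if len(tied) > 1:
        fewest = min(antecedent_count[j] for j in tied)
        tied = [j for j in tied if antecedent_count[j] == fewest]
    if len(tied) > 1:
        highest = max(support[j] for j in tied)
        tied = [j for j in tied if support[j] == highest]
    # Any remaining tie is broken randomly.
    return rng.choice(tied)


# Hypothetical inputs for one sensitive transaction.
WEIGHT = {"b": 2, "c": 2, "d": 1}
ANTECEDENTS = {"b": 1, "c": 0, "d": 0}
SUPPORT = {"b": 5, "c": 4, "d": 9}

VICTIM = pick_victim(["b", "c", "d"], WEIGHT, ANTECEDENTS, SUPPORT)
print(VICTIM)  # c
```

Items b and c share the maximum weight, and c appears in fewer antecedents of sensitive rules, so it is chosen without reaching the support or random steps. Passing a seeded `random.Random` instance as `rng` makes the final random step reproducible.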

PERFORMANCE EVALUATION AND RESULTS DISCUSSION
This section presents the results of experimental evaluations carried out on different real-world datasets. We evaluated our proposed algorithm and compared its results with those of the DCR algorithm [19]. A set of experiments was conducted to measure the performance of the algorithms in terms of side effects and execution time. The ILPARH and DCR algorithms were implemented in R and executed on a 2.50 GHz Intel Pentium 4 machine with 4 GB of RAM running the Windows 10 operating system.

Datasets
We examined the proposed algorithm using three different transaction datasets that are publicly accessible through the FIMI repository: mushroom, chess and BMS-1. These datasets exhibit different characteristics with regard to the maximum size of an itemset, the number of transactions and the average transaction size. The configurations of the datasets are given in Table 1, where |I|, |D| and AvgSize respectively indicate the maximum size of an itemset, the number of transactions and the average size of the transactions. The parameters σmin and δmin were set to ensure that the ARM algorithm produces an adequate number of strong association rules.

Experimental results
In order to demonstrate the efficiency of the proposed algorithm, several experiments were conducted on real-life datasets. First, frequent itemsets are generated by the association rule mining algorithm with the threshold parameter σmin. Then, association rules are discovered with the threshold parameter δmin. Table 2 shows the number of association rules generated for the Chess, Mushroom and BMS-1 datasets. Some of these rules are selected as the sensitive association rules S. The major performance criterion of the sanitization algorithm is the side effects it incurs on the data. We measure the side effects by summing the number of lost rules and the number of ghost rules introduced. Figure 3 depicts the relationship between the number of lost rules and the number of sensitive rules. Figure 4 depicts the relationship between the number of new rules generated and the number of sensitive rules. The results show that the proposed method generates fewer side effects than the DCR algorithm. The reason is that the ILPARH algorithm utilizes ILP to determine the candidate transactions for modification and thereby leads to higher-quality data. Figure 5 shows the relationship between the running time of the algorithm and the number of sensitive rules. As illustrated in Figure 5, the running time is higher than that of the DCR algorithm. The reason is that our algorithm requires additional computation to calculate the coefficient of each sensitive transaction in the ILP formulation.

CONCLUSION
In this paper, we presented ILPARH, a privacy-preserving algorithm for protecting sensitive association rules. The degree of the side effects on non-sensitive rules is used as the coefficient of each sensitive transaction in the ILP formulation. We exploit the characteristics of the objective function to utilize the partial results of the CSP and derive the solution for hiding sensitive rules. The results of the experiments show that our approach reduces both the number of concealed non-sensitive rules and the number of ghost rules discovered. In our future work, we intend to employ an evolutionary framework to identify the candidate transactions for modification during the sanitization process. In addition, the evolutionary approach can be combined with our victim item determining technique to reduce side effects with improved algorithmic efficiency.