Performance analysis of perturbation-based privacy preserving techniques: an experimental perspective

ABSTRACT


INTRODUCTION
Mining datasets spread across numerous organizations without disclosing personal information has recently gained importance. In many businesses, protecting data privacy is now a major concern. People worry that their private information may be exposed and used for improper purposes, and they expect that information to be protected [1], [2]. Safeguards for personal data protection should therefore be in place. Privacy-preserving data mining tools have been developed and put into practice to address this issue [3]. Security assurance solutions built on a number of perturbation approaches are being combined with a variety of data mining techniques. Privacy-preserving data mining (PPDM) helps to safeguard private and sensitive information about individuals. The architectural design of privacy preservation in data mining is shown in Figure 1; Figure 1(a) shows the stages of the PPDM process. To ensure that information is protected, this paper applies perturbation methods [4]. In these methods, all sensitive values are replaced with alternative values taken from records whose non-sensitive data are comparable. The replacement can use either the distribution of the sensitive data given specific non-sensitive data or the mean of the sensitive data [5], [6]. These transformations and conversions typically involve numerical data. Additionally, some procedures entail simple changes [7]. Perturbation techniques were created to ensure secrecy by transforming user input into an improbable and unpredictable form [8]. The data can only be modified by authorized individuals, as illustrated in Figure 1(b). The owner then makes the data available to analysts for the data mining process [9]. The privacy and security of datasets have been the subject of extensive investigation, and PPDM strategies have been proposed and used in a variety of ways.
However, the majority of these strategies do not work in all situations. Mehta and Rao [10] identified existing methods from the field of natural language processing (NLP) for transforming unstructured data into a structured form. Mariammal [11] presented a perturbation-based method for protecting privacy in data mining. Its basis is the additive rotation strategy, and the author calculated the privacy level using the variance of the initial dataset. Banu and Nagaveni [12] provided a data modification-based method to protect confidentiality in the data mining process. Its foundation is the random rotation technique, and they also calculated the privacy level using the variance of the initial dataset. Rao et al. [13] demonstrated that this strategy is more efficient and accessible. They contrasted their algorithm with the perturbation strategy and reported that it preserved 100% of data utility. Mary [14] asserts that the random projection (RP) strategy offers a higher level of privacy than the other approaches and that images can be preserved very well using RP, so the method protects data better and increases privacy. Ghosh et al. [15] gave a thorough analysis of the currently employed PPDM approaches and categorized the different data modification techniques. Javid et al. [16] provided a practical hybrid method for safeguarding a dataset's privacy. For numerical data they employ geometric data perturbation, and for categorical data they use the k-anonymization technique. In their approach, they performed perturbation with randomization (intervals). Pika [17] investigated a number of data perturbation techniques in healthcare. In data perturbation methods, the data values of records are changed. Their research indicates that the perturbation approach is utilized to protect the confidentiality of the original values.
The perturbation approach involves altering the structure of the original dataset or introducing a small amount of noise into the data. By transforming user input into an improbable and unpredictable form, data perturbation can be used to implement PPDM efficiently. It is one of the most frequently used techniques for protecting privacy [18]. There are several ways to perform perturbation. A range of methods, including noise addition, data hiding, swapping, and many more, may be used to change the information in datasets [19]. The probability distribution method and the value distortion method are two perturbation strategies [20], [21]. In the first technique, the data are replaced by the distribution, whereas in the value distortion method the data are altered directly, either by introducing noise or by using another randomization process. Perturbation can be of three types: projection, geometric, and random perturbation [22]. In projection perturbation, modification is accomplished by changing the dimensions: the data are randomly projected from a high-dimensional to a low-dimensional space [23]. In the geometric perturbation approach, perturbation is performed using a combination of several techniques, including rotation transformation, translation transformation, and the addition of random values [24]-[26].
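The geometric perturbation described above (rotation, translation, and added random values) can be illustrated with a minimal sketch. It assumes two-dimensional numerical records and Gaussian noise; the function name geometric_perturb and its parameters are hypothetical and are not taken from the cited works.

```python
import math
import random

def geometric_perturb(points, angle_rad, shift, noise_sd, seed=0):
    """Rotate, translate, and add Gaussian noise to 2-D records (illustrative only)."""
    rng = random.Random(seed)
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    perturbed_points = []
    for x, y in points:
        # Rotation transformation: orthogonal, so pairwise distances are kept.
        rx = cos_a * x - sin_a * y
        ry = sin_a * x + cos_a * y
        # Translation transformation plus an additive random value per coordinate.
        perturbed_points.append((rx + shift[0] + rng.gauss(0.0, noise_sd),
                                 ry + shift[1] + rng.gauss(0.0, noise_sd)))
    return perturbed_points

# Assumed toy records; noise is switched off so the geometry is easy to check.
records = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
perturbed = geometric_perturb(records, math.pi / 2, (10.0, -5.0), noise_sd=0.0)
```

With the noise term switched off, the distances between records are unchanged, which is why distance-based mining can still work on rotated and translated data. A non-zero noise_sd trades some of that utility for additional protection.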

When publishing data, the data owner may employ a variety of privacy-preserving techniques and can use various perturbation strengths to change the datasets [27]. This research offers a comparison of approaches based on random projection and principal component analysis that improve data classification accuracy while reducing a high-dimensional dataset to a low dimension in order to safeguard its privacy [28], [29]. The performance of two perturbation-based privacy-preserving methods is examined, and healthcare datasets are used to test this analysis. The main contributions of this paper are as follows: i) The present research provides an exhaustive analysis of PPDM techniques based on perturbation; ii) An experimental and comparative analysis of two perturbation-based privacy-preserving methods, namely improved random projection perturbation (IRPP) and the enhanced principal component analysis-based technique (EPCAT), is described; and iii) The impact of hybrid privacy-preserving approaches is analyzed.
The remainder of the paper is organized as follows. Section 2 presents the research method. Section 3 describes the experimental results and compares them with existing work. Section 4 concludes the paper.

METHOD
Numerous PPDM-related methods and techniques have previously been created and used [30]. In this article, two hybrid techniques, the enhanced principal component analysis-based technique and improved random projection perturbation, are discussed. The efficiency of perturbation-based dataset transformation is also investigated through comparative analysis. These techniques are as follows:

Enhanced principal component analysis-based technique
It is a principal component analysis (PCA) and classification-based approach that protects privacy. In this method, the initial stage involves pre-processing the original data using a data filter. The filtered data is then subjected to a PCA-based modification after the data pre-processing stage. Finally, the modified data is subjected to a classification approach for data mining.
The full structure of this technique comprises the following two phases:
a. Phase 1: This phase focuses on preserving individual privacy in the datasets and consists of two modules: i) the classification filter module (CFM), which is the most crucial component for improving the precision and speed of the classification approach and is applied to the original dataset before the PCA modifications; and ii) the perturbation module, in which the filtered dataset is perturbed once more using PCA-based transformations. The accuracy of the perturbed dataset is also examined and compared with that of the original dataset.
b. Phase 2: The perturbed dataset is mined after the two aforementioned modules. The naive Bayes approach is used as the classification method in this instance. Accuracy is also calculated on the original datasets and compared with the accuracy of the perturbed dataset. Figure 2 shows the functional flow diagram for this model.
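The PCA-based transformation in the perturbation module can be sketched as below. This is plain PCA, not the authors' EPCAT: the classification filter module is not specified in enough detail here to reproduce, so it is omitted. The helper names (center, covariance, top_components, pca_perturb) are hypothetical, and power iteration with deflation is simply one stdlib-only way to obtain the leading principal components.

```python
import random

def center(rows):
    """Subtract the column means so that every attribute has zero mean."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def top_components(cov, k, iters=500, seed=0):
    """Leading k eigenvectors of cov via power iteration with deflation."""
    rng = random.Random(seed)
    d = len(cov)
    work = [row[:] for row in cov]            # copy, because deflation mutates it
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        start = sum(x * x for x in v) ** 0.5
        v = [x / start for x in v]
        for _ in range(iters):
            w = matvec(work, v)
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0.0:                   # no variance left to capture
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(work, v)))
        components.append(v)
        # Deflation: remove the direction just found before the next pass.
        for a in range(d):
            for b in range(d):
                work[a][b] -= eigenvalue * v[a] * v[b]
    return components

def pca_perturb(rows, k):
    """Replace each record by its scores on the first k principal components."""
    centered = center(rows)
    components = top_components(covariance(centered), k)
    return [[sum(c[j] * r[j] for j in range(len(r))) for c in components]
            for r in centered]

# Assumed toy data: four records with two perfectly correlated attributes.
toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores = pca_perturb(toy, k=1)
```

The released values are component scores rather than the original attribute values, which is the sense in which the transformation perturbs the data. The scores would then be passed to the classifier in Phase 2.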

Improved random projection perturbation
Random projection is a powerful method that uses a random k × d matrix to project the original high d-dimensional data onto a smaller k-dimensional subspace [31]. Figure 3 provides a general overview of the design of the improved random projection perturbation method. The technique's overall structure is split into two phases:
a. Phase 1: This phase focuses on preserving individual privacy in the datasets and consists of two parts: i) Feature selection: this module chooses features in order to improve the accuracy of the classification technique. A PCA-based feature selection method is applied to the original dataset before the random projection process and the classification phase; and ii) Random projection: the dataset is modified once more in this module through dimension reduction, which is accomplished with the random projection method and distorts the datasets. The accuracy of the perturbed dataset is also examined and compared with that of the original dataset.
b. Phase 2: The perturbed datasets are mined using a classification technique; the naïve Bayes classifier is used in this instance. Several metrics are also computed on the original datasets, and their accuracy is compared with that of the perturbed dataset.
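The random projection step can be sketched as follows. The sketch assumes a Gaussian random matrix scaled by 1/sqrt(k), which is one common choice; the source does not state which distribution IRPP uses, and the function name random_projection and the toy data are hypothetical.

```python
import math
import random

def random_projection(rows, k, seed=0):
    """Project d-dimensional rows onto k dimensions with a random k x d matrix."""
    rng = random.Random(seed)
    d = len(rows[0])
    # Entries are drawn from N(0, 1); the 1/sqrt(k) factor keeps squared
    # lengths unchanged in expectation.
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]
    return [[sum(m_row[j] * row[j] for j in range(d)) for m_row in matrix]
            for row in rows]

# Assumed toy data: five records with 13 numerical attributes each.
data = [[float(i + j) for j in range(13)] for i in range(5)]
projected = random_projection(data, k=4)
```

Because k is smaller than d, the projection cannot be inverted exactly even by someone who knows the matrix, and keeping the seed (and hence the matrix) private adds a further layer of protection.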

Simulation
WEKA and RStudio are used to carry out the performance analysis of both techniques [32]. In these experiments, the chosen methods are applied to the original datasets to create the transformed datasets [33]. On both datasets, numerous metrics are calculated, including correctly classified instances, incorrectly classified instances, true positive (TP) rate, F-measures, and model building time. These metrics are used to assess how well the chosen algorithms performed on the projected dataset. The naive Bayes (NB) classification method is used in the implementation of the algorithms in order to evaluate the efficacy of the strategies. In this performance evaluation, classification accuracy and time usage are the two key points of emphasis. The performance results are compared with the results obtained from the NB classifier. Two datasets were perturbed by IRPP and EPCAT. 10-fold cross-validation was used throughout: each dataset was divided into ten samples, and the validation process was repeated ten times. In each run, nine samples were used for training and the remaining sample was used to evaluate the effectiveness of the method. The average accuracy of the 10 iterations was then used to represent the overall performance for a particular dataset. Figure 4 depicts snapshots of the principal components of the hypothyroid dataset after performing PCA-based transformations on the training set and test set. Figure 5 depicts the transformed datasets after perturbation, for the cardiovascular dataset in Figure 5(a) and for the hypothyroid dataset in Figure 5(b). The modified datasets are shown to be more secure than the original datasets since it is challenging to access the changed data, so the confidentiality of the datasets is maintained.
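The 10-fold cross-validation procedure with a naive Bayes classifier can be sketched as below. This is a stdlib-only illustration, not the WEKA implementation used in the experiments: it assumes numerical features modelled with per-class Gaussians and a plain shuffled split, and the function names and synthetic data are hypothetical.

```python
import math
import random

def fit_gnb(X, y):
    """Per-class prior, feature means, and feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n, d = len(rows), len(rows[0])
        means = [sum(r[j] for r in rows) / n for j in range(d)]
        # A small floor keeps the variance positive for constant features.
        varis = [sum((r[j] - means[j]) ** 2 for r in rows) / n + 1e-9
                 for j in range(d)]
        model[label] = (n / len(y), means, varis)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log-posterior for record x."""
    def log_posterior(params):
        prior, means, varis = params
        total = math.log(prior)
        for xj, m, v in zip(x, means, varis):
            total += -0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
        return total
    return max(model, key=lambda label: log_posterior(model[label]))

def cross_validate(X, y, folds=10, seed=0):
    """Average accuracy over `folds` runs, each holding out one fold for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]                  # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        model = fit_gnb([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict_gnb(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / folds

# Assumed synthetic data: two well-separated Gaussian clusters.
rng = random.Random(1)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(100)]
     + [[rng.gauss(8, 1), rng.gauss(8, 1)] for _ in range(100)])
y = [0] * 100 + [1] * 100
mean_accuracy = cross_validate(X, y)
```

Running the same function on an original dataset and on its perturbed counterpart gives the pair of averaged accuracies that the comparison in this study relies on.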

Evaluation metrics
The effectiveness of the techniques is assessed using various classification measures. These include F-measures, TP rate, accuracy, runtime, and false positive (FP) rate.
-Accuracy: Accuracy is one measure that can be used to evaluate and rate a classification model. It is the proportion of predictions that the model got right. The formal definition of accuracy is given in (2).

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \tag{2}$$

-FP rate: The false positive rate is another statistic used to evaluate machine learning algorithms. It is the proportion of actual negative instances that are incorrectly classified as positive, where FP is the number of false positives and TN is the number of true negatives. The false positive rate is determined as (4).

$$\text{FP rate} = \frac{FP}{FP + TN} \tag{4}$$

-F-measures: It is a combined measure of the precision and recall metrics. It provides a single score that balances the concerns of precision and recall in one number. It is calculated as (5).

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}$$
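The measures listed in this section can all be computed from the four confusion-matrix counts of a binary classifier. The sketch below uses the standard definitions; the function name and the example counts are illustrative assumptions rather than results from this study, and every denominator is assumed to be non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # correct / total predictions
    tp_rate = tp / (tp + fn)                     # also called recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return {
        "accuracy": accuracy,
        "tp_rate": tp_rate,
        "fp_rate": fp_rate,
        "precision": precision,
        "f_measure": f_measure,
    }

# Made-up counts, used only to exercise the formulas.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

A higher TP rate and F-measure indicate better performance, whereas a lower FP rate is better, which matters when reading the comparison tables.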

RESULTS AND DISCUSSION
This research analyses the impact of hybrid perturbation-based PPDM techniques on the data mining process. For this purpose, IRPP and EPCAT are selected. Several tests were run on datasets of two different sizes, and the associated outcomes were observed. The results of the experiments demonstrate that both hybrid techniques, IRPP and EPCAT, perform better in terms of accuracy, TP rate, FP rate, F-measures, and run time.
-Datasets: This research is implemented and tested on two datasets, the cardiovascular dataset and the hypothyroid dataset. The cardiovascular dataset consists of 70,000 instances and 13 attributes. The hypothyroid dataset consists of 7,200 instances and 21 attributes [34]. The effectiveness of both methods on the cardiovascular dataset is shown in Table 1. On the provided training datasets, the metrics accuracy, TP rate, FP rate, F-measures, and run time are computed. Table 1 shows that the strategies produced better results than the conventional classification model in all respects. Figure 6 compares the performance of the IRPP and EPCAT privacy-preserving algorithms, that is, the random projection-based and PCA-based methods, with that of the conventional classification algorithms on the cardiovascular dataset. As shown in the figure, the IRPP method has better accuracy measures than both the conventional classification algorithms and the PCA-based privacy-preserving method. The effectiveness of both methods relative to the conventional classification model on the hypothyroid dataset is shown in Table 2.
On the provided training datasets, the metrics accuracy, TP rate, FP rate, F-measures, and run time are calculated. The table shows that the IRPP approach yields better results overall than the conventional classification model. Figure 7 compares the performance of the privacy-preserving algorithms IRPP and EPCAT, that is, the random projection-based and PCA-based methods, with that of the traditional classification techniques on the hypothyroid dataset. As depicted in Figure 7, the IRPP approach outperforms both the PCA-based privacy-preserving method and the traditional classification algorithms in terms of accuracy measurements.

CONCLUSION
Developing algorithms that can conceal or provide privacy to sensitive information is the fundamental goal of privacy preservation in data mining operations. PPDM techniques are essential in order to stop profiteers from gaining unwanted access. However, data mining accuracy and privacy are in conflict. In this context, this paper has analyzed the impact of perturbation-based PPDM techniques on datasets. It provides a brief overview of two privacy techniques, namely PCA-based perturbation and random projection-based perturbation, and analyzes their capabilities and differences in different scenarios. The effectiveness of these hybrid techniques has been tested with the naive Bayes classifier. For implementation purposes, two datasets, the cardiovascular and hypothyroid datasets, were selected. It has been found that the IRPP privacy-preserving approach and the enhanced principal component analysis-based technique (EPCAT) are more effective than the traditional technique, and the perturbed datasets preserve privacy better. For the cardiovascular dataset, the perturbed datasets outperform the original dataset in terms of runtime, accuracy, TP rate, and F-measures. For the hypothyroid dataset, the results on all measurements are better than or almost identical to those of the conventional model. Therefore, the datasets that are altered using hybrid privacy-preserving approaches are found to be more secure and efficient than the original datasets.