http://ijece.iaescore.com Security aware information classification in health care big

Info 2021 E-healthcare systems are becoming popular for treating patients in remote locations, so large amounts of health care data, such as the patient's name, location, contact number, and physical condition, are collected remotely to treat patients. Such large volumes of data gathered from different sources are termed big data. This patient data contains sensitive information, such as systolic BP, pulse, temperature, current physical condition, and contact number, that must be identified and classified properly to protect it from misuse. This article presents a weight-based similarity (WBS) technique to classify health care big data into two categories: sensitive data and normal data. The proposed method uses a training dataset to classify the data and consists of three main steps: data extraction, mapping of the data with the help of the training dataset, and evaluation of the weight of the input data against a threshold value to classify the data. The proposed technique gives better results on evaluation parameters such as precision, recall, and F1 score, with an accuracy of 92% in categorizing the big data. The Weka tool is used to compare WBS with other existing classification techniques.


INTRODUCTION
Nowadays, data in fields such as government organizations, health care systems, the military, and the banking sector is growing exponentially. Because the volume is so large, the data needs to be stored in digital form. In earlier years, regardless of the size of the data, the sources from which information arrived and the structure of that information were limited. Today the situation has changed: large amounts of data can be fetched from a wide range of sources and in a variety of formats. Tools such as the Hadoop distributed file system (HDFS) and MapReduce are used to store and process such large amounts of data [1], [2]. Big data is commonly held in NoSQL databases to achieve higher performance and accuracy than conventional databases. It is challenging to handle such large amounts of data in a row-and-column format when the input data is unorganized, and NoSQL databases handle such unstructured data effectively [3]. One well-known source of big data is the logs generated by web and desktop applications. The data coming from these sources may be well organized, unorganized, or unstructured. Big data passes through different phases, termed the lifecycle phases of big data [4]. The first phase is the collection of input data from legal and authorized sources, which makes it available as input to the next phase. The second phase is the storage of the collected data using trusted functions; this is the phase in which the risk of an attacker gaining access to the big data is highest. Big data analysis and processing is the next phase, during which the integrity of the data being processed must be maintained. The last phase is the generation of meaningful knowledge from the analyzed data. This knowledge helps organizations make decisions about their business models. The generated knowledge is treated as sensitive data, and various security tools and techniques can be used to secure it.
In reality, the problem of analyzing big data has existed for many years, because creating data is simpler than extracting useful information from it. Even though fast processing equipment and processors are available on the market, processing huge volumes of data is still a tedious task [5]. Since conventional techniques cannot handle the processing load of big data, algorithms are designed to be compatible with frameworks such as MapReduce and Sailfish [6]. In some areas, such as transport systems, the knowledge produced by processing big data is used directly: to improve the performance of such systems, organizations and governments need processed data for decision-making. The accuracy of decision-making increases proportionally with the number of training datasets supplied as input [7]. Along with suitable platforms, the system hardware must be modernized to analyze big data and extract meaningful knowledge from it. Finding the optimal match of hardware and software to analyze big data within a specific time is one of the challenges. Selecting the right platform increases the scalability and performance of the system, which leads to accurate results. The generated results are also termed sensitive data, and they can be used as input to other systems for further evaluation [8].
Big data requires new mining strategies for processing. The 3Vs definition of big data implies that the data is huge in volume, is created at high velocity, and exists in a variety of types captured from different sources [9], [10]. One of the trusted sources of input data in the modern era is intelligent objects connected through the internet of things (IoT). These devices work together over the internet and act as mediators between users and other gadgets or appliances. The data generated during this communication can be used as big data [11]. Healthcare organizations generate large quantities of information from record keeping, compliance, and patient-related statistics [12].
A cryptographic system requires multiple keys for the encryption process. If, instead of encrypting everything, a classification approach is used to separate the data into normal attributes and sensitive attributes, then encryption techniques need to be applied only to the sensitive data [13]. Big data is analyzed not only on on-premises servers but also on cloud platforms, where data can be stored and analyzed using the frameworks provided [14]. In the digital world, data needs to be digitized to improve the healthcare system by reducing costs and analyzing many records successfully within a short period [15]. Dealing with such huge data requires an advanced approach. Health care data usually contains sensitive facts, which need to be protected from any unanticipated operations or record retrievals [16], [17]. Moreover, analyzing personal information or datasets is problematic because many related facts may also contain sensitive items [18], [19].
Health care data is costly to lose because it carries sensitive patient information. According to the Ponemon Institute report, the average cost of a breached healthcare record is almost $380. In many cases, data may be lost due to a lack of secure data structures, machines, and data storage strategies. Permitting employees to keep data records on their own devices is never easy. Data privacy is considered a primary issue in big data analysis [20], [21]. The healthcare domain is widely used as a trusted source of datasets. A biosensor device can be used to collect health information from the patient: the sensor is applied externally to the body, and the generated report is stored in the available databases. The results generated by the sensor form mixed-format input data, which can be classified into normal and sensitive data. Classifying big data requires proper classification techniques that give accurate results in less time.
Many classification techniques have been introduced in recent years. Some of them are described below:
• Naive Bayes classifier: The naive Bayes classifier is used to classify input data from a health care dataset, such as information related to blood pressure and pulse rate. Its limitation is that precision decreases when the dataset is small [22].
• Improved support vector machine (SVM) classifier: The improved SVM classifies big data by adding extra functionality, such as a weighted Euclidean distance and a radial integral kernel function, to the existing SVM. It is suitable for samples with many duplicates and for large amounts of data [23].
• Bagging-based naive Bayes trees: This technique classifies big data by combining two approaches: i) the bagging ensemble and ii) the naive Bayes and decision tree approach. It can classify big data in less time [24].

The WBS classification technique is designed particularly for applications concerned with the information security of big data given as input. Investigating the proposed algorithm is essential, because an accurate classification of datasets into two categories, sensitive and non-sensitive, is needed for the security of data in various domains.

RELATED WORK
Pham and Prakash [24] proposed the bagging-based naive Bayes trees (BAGNBT) approach and used it for landslide vulnerability classification in Viet Nam. The method was validated using the Chi-square test and statistical indexes, and several other models were selected for comparison. The results show that the novel model, with an area under the receiver operating characteristic curve (AUC) [25] of 0.834, performed better than rotation forest-based naive Bayes trees (RFNBT), with an AUC of 0.830. This indicates that the technique is a promising alternative for landslide vulnerability assessment. The AUC is used as a standard for validating models, and it is plotted using the "sensitivity" and "100-specificity" [26] parameters. A model with an AUC of 1 is considered perfect, and models with a higher AUC are rated as better. Further performance measures, namely root mean squared error (RMSE), Kappa (κ), and accuracy (ACC), are used to validate the model [27]. The landslide conditioning factors selected for the vulnerability evaluation were slope, distance to faults, curvature, road density, profile curvature, aspect, plan curvature, river density, lithology, elevation, distance to roads, distance to rivers, rainfall, fault density, and land use. Maps of these factors were created for the analysis. The BAGNBT approach includes four main steps:
a. Dataset generation: Two datasets were created in this step, a training dataset and a validation dataset. The training dataset used 70% of the landslide and non-landslide data, and the validation dataset used the remaining 30%. The Boolean values 1 and 0 were used for landslide and non-landslide, respectively.
b. Novel model training: The training dataset was used to train the hybrid model. In this phase, the BAG ensemble was used to optimize the dataset for classification, and 11 iterations were found to give the highest accuracy for the BAGNBT model. A naive Bayes tree (NBT) classifier was then used to label the landslide and non-landslide classes for the spatial prediction of landslides using the optimized training data. In the last step, BAG was used to combine the generated naive Bayes tree classifiers into a new model.
c. Model validation: The novel model was validated using different methods such as the AUC and the Chi-square test.
d. Developing landslide vulnerability maps: Landslide vulnerability maps were created using the NBT, SVM, RFNBT, and BAGNBT models.

Landslide vulnerability was assessed at highly landslide-prone locations in Viet Nam using the proposed combination of the naive Bayes tree classifier with the BAG ensemble. Statistical indexes, the AUC, and the Chi-square test were used for testing. The evaluation and comparison results show that the BAGNBT model performs well for landslide vulnerability analysis, with an AUC of 0.834 compared with 0.830 for RFNBT. It classifies data moderately well. Because it needs multiple iterations to classify data, its weakness is that it is somewhat time-consuming.
Lakshmanaprabu et al. [28] introduced a method for classifying data with a random forest (RF) classifier. In that paper, the RF classifier is used to analyze big data collected from various sources. Figure 1 shows the data collection process from the various data sources. In particular, they used patient data covering many health problem parameters. To classify the data taken from the healthcare database properly, the authors used the improved dragonfly algorithm [29]. With the help of the optimal features selected, an RF classifier (RFC) [30] is applied to classify the healthcare data. The precision reported from their experiments is at most 94.2. Different measures are used and compared with existing approaches to check the effectiveness of the technique. The authors used actual health care data as test data and web information as training data. The work has different phases: data collection, monitoring, splitting, digitization, control, and support. After data collection, the relevant features are extracted with the help of the dragonfly algorithm and passed to the RF classifier, which classifies the data according to the user's requirements.
The RF classifier performs well for classifying data because it is not much affected by disorder in big data or by variations in huge amounts of data. It is based on a tree generation approach in which many trees are generated and their results are combined to choose the best estimate. RF draws a random sample of the data and identifies a key set of features for building a decision tree. Figure 2 demonstrates the RF structure. RF takes a sample of the data and checks it against the many decision trees generated earlier to arrive at the optimal solution. Every decision tree is generated from a bootstrap sample, drawn with replacement, of the training dataset. While a decision tree is grown, at each node split a random subset of a few variables is chosen from the original variable set, and the best split based on these few variables is used. The RF classifier mainly refers to three key parameters for the classification of big data:
• Node size, which is chosen differently than for single decision trees.
• The number of trees; up to 500 trees is commonly a suitable choice.
• The number of predictors sampled at each split, which appears to be a key parameter affecting how well RF performs.
For error cases, the following (1) is used.
where r1 and r2 are random values and v1 and v2 are vectors. The strength and correlation parameters measure the accuracy of each classifier and the dependency between classifiers. The number of features used at the base classifier stage is the number of attributes randomly picked from the original attribute set at each node of the base decision tree. The maximum likelihood method is used to obtain the best result in each tree, which is then used in the RF classification. In this research, the reported accuracy exceeds 90% and 95%. The weakness of the algorithm is that it is computationally slow because of the huge database.

In the existing systems, classification techniques such as SVM, RF, and BAGNBT are used for data classification, and the results are analyzed in terms of precision, recall, F1 score, and accuracy. In the proposed system, a weight is calculated and compared with a threshold value to decide between normal data and sensitive data. The results generated for the proposed algorithm and the existing systems are compared based on kappa parameters, which show superior results for the proposed system.

PROPOSED METHOD
The proposed algorithm takes as input multiple attributes from network packet instances of a health care data system (step 1). Background knowledge is used to classify the data into two categories, sensitive data and normal data (step 2). The proposed technique is instance-based, so it fetches the attributes of each instance generated from the network. Background knowledge is then applied to each attribute, and every attribute is tokenized. With the help of the background knowledge, each input data item is mapped (step 3). A weight is calculated for each mapping and compared with a threshold (step 4). If the resulting weight is greater than or equal to the threshold, the attribute is considered a sensitive attribute; otherwise, it is considered a normal attribute (step 5). Finally, two lists are created, one for sensitive data and one for normal data.
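The five steps above can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: the background terms, the `overlap_weight` function (a stand-in for the paper's similarity-based weight), the sample record, and the threshold value are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical background knowledge: terms assumed to indicate sensitive attributes.
BACKGROUND_TERMS: Set[str] = {"systolic", "diastolic", "bp", "heart", "rate", "contact", "number"}


def tokenize(text: str) -> List[str]:
    """Lower-case an attribute name and split it into word tokens."""
    return text.lower().replace("_", " ").split()


def overlap_weight(tokens: List[str]) -> float:
    """Toy weight (an assumption): fraction of tokens found in the background terms."""
    if not tokens:
        return 0.0
    return sum(t in BACKGROUND_TERMS for t in tokens) / len(tokens)


def classify_record(
    record: Dict[str, object],
    threshold: float,
    weight_fn: Callable[[List[str]], float] = overlap_weight,
) -> Tuple[List[str], List[str]]:
    """Split the attributes of one record into (sensitive, normal) lists."""
    sensitive: List[str] = []
    normal: List[str] = []
    for name in record:                 # step 1: take each attribute of the instance
        tokens = tokenize(name)         # steps 2-3: tokenize and map to background knowledge
        weight = weight_fn(tokens)      # step 4: compute the weight of the mapping
        if weight >= threshold:         # step 5: compare the weight with the threshold
            sensitive.append(name)
        else:
            normal.append(name)
    return sensitive, normal


# Hypothetical record and threshold, for illustration only.
record = {"systolic BP": 128, "heart rate": 72, "ward": "B2", "visit id": 17}
sensitive, normal = classify_record(record, threshold=0.5)
print(sensitive, normal)
```

Passing the weight function as a parameter keeps the threshold rule separate from the way the weight is computed, so a different similarity measure can be substituted without changing the classification loop.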

Algorithm
The proposed WBS technique collects data from the health care system with the help of different types of sensors attached to the patient's body to measure different health conditions. To categorize the data, the patient database, which contains multiple records, is taken as input to the system. Each record has multiple patient attributes, such as patient name, address, mobile number, and blood pressure. In the algorithm, R[i] denotes a patient record and A[j] denotes an attribute of a patient record.
In this system, T denotes a training dataset with E examples, against which patient data are compared to identify sensitive and non-sensitive attributes, as shown in (2).
Step 2: Calculate the weight of each attribute by checking the similarity of the attribute with given training set examples.

weight(A[j]) = similarityDistance(A[j], T)
The similarity distance is calculated using the cosine similarity measure, which finds the similarity between two objects or vectors. For two vectors $\mathbf{x}$ and $\mathbf{y}$, it is calculated using the formula

$$\cos\theta = \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}$$

Based on the comparison of this weight with the threshold, the algorithm classifies each attribute as either a sensitive attribute or a non-sensitive attribute. Figure 3 shows the flow of the WBS technique for classifying data into the two categories, sensitive data and non-sensitive data.
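As a minimal sketch of the weight computation, the cosine similarity can be evaluated on token-count vectors. The vector representation (word counts), the choice of taking the highest similarity over the training examples in T, and the sample strings are assumptions made for illustration; the paper does not specify them.

```python
import math
from collections import Counter
from typing import Iterable


def to_vector(text: str) -> Counter:
    """Represent a string as a sparse vector of lower-cased token counts (assumed encoding)."""
    return Counter(text.lower().split())


def cosine_similarity(x: Counter, y: Counter) -> float:
    """cos(theta) = (x . y) / (|x| * |y|); returns 0.0 if either vector is empty."""
    dot = sum(x[t] * y[t] for t in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def weight(attribute: str, training_examples: Iterable[str]) -> float:
    """Assumption: the weight is the highest cosine similarity to any training example."""
    vec = to_vector(attribute)
    return max((cosine_similarity(vec, to_vector(e)) for e in training_examples), default=0.0)


# Hypothetical training set T of sensitive attribute examples.
T = ["systolic blood pressure", "heart rate", "contact number"]

w_bp = weight("blood pressure", T)   # shares two tokens with the first example
w_ward = weight("ward", T)           # shares no tokens with any example
print(round(w_bp, 4), w_ward)
```

With these hypothetical inputs, "blood pressure" receives a weight of 2/√6 (about 0.82) and "ward" receives 0, so any threshold between the two would mark the first attribute as sensitive and the second as normal.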

RESULTS AND DISCUSSION
The proposed work was implemented on a 64-bit Windows 10 operating system using Java, NetBeans as the integrated development environment, and the Weka tool to compare kappa parameters. The hardware used was an Intel i7 processor with 8 GB of RAM. To generate results, we considered one lakh (100,000) record instances collected from real-time data generated by different sensor devices at the hospital. The data is used for training and result generation, and the attribute values are real-time readings of systolic BP, diastolic BP, heart rate, total cholesterol, Cholesterol_LDL, Cholesterol_HDL, stress, Random Sugar, QT_Interval, PR Interval, oxy_saturation, and HB. Part of the sample patient dataset used as input for classification is displayed in Figure 4. Figure 5 shows the data classified into two categories, normal data and sensitive data.
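The proposed technique is compared with existing classification techniques such as support vector machine (SVM), random forest (RF), and bagging-based naive Bayes trees (BAGNBT). The comparison parameters are: i) precision, ii) recall, iii) F1 score, and iv) accuracy. These parameters are defined in (4)-(7), respectively, where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative outcomes.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$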
Here, the comparison parameters are defined based on a positive and negative class strategy. The positive class refers to the set of sensitive data items from a patient record, and the negative class refers to the set of normal data items from a patient record. A true positive is an outcome where the classification algorithm correctly predicts the positive class. A true negative is an outcome where the algorithm correctly predicts the negative class. A false positive is an outcome where the classification approach wrongly predicts the positive class. A false negative is an outcome where the classification technique wrongly predicts the negative class. Figure 6 is a graphical representation of precision for the existing techniques and the proposed algorithm (WBS), where WBS has a higher precision (0.92) than the existing techniques. Figure 7 is a graphical representation of recall, where WBS has a higher recall (0.91) than the existing techniques. Figure 8 is a graphical representation of the F1 score, where WBS has a higher F1 score (0.89) than the existing techniques. Figure 9 is a graphical representation of accuracy, where WBS has a higher accuracy (92%) than the existing techniques.
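The four comparison parameters can be computed from the counts of true and false positives and negatives, as in the short sketch below. The function name and the counts are hypothetical and chosen only to illustrate the arithmetic; they are not the experimental results reported in this section.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute precision, recall, F1 score, and accuracy from confusion-matrix counts.

    Positive class = sensitive data, negative class = normal data.
    Each ratio falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for 100 classified attributes.
m = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print({name: round(value, 4) for name, value in m.items()})
```

For these counts, precision is 40/50 = 0.8, recall is 40/45 ≈ 0.889, the F1 score is 80/95 ≈ 0.842, and accuracy is 85/100 = 0.85.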

CONCLUSION
This article proposed a weight-based technique for the classification of health care big data. It has three main steps: data extraction, mapping of the input data with background knowledge, and evaluation of the weight of the input data against a threshold to classify the data. In this technique, the health care dataset is classified using the background knowledge provided for the data. The experimental results are evaluated using the Weka tool. The proposed technique is compared with several existing techniques using comparison parameters such as precision, recall, F1 score, and accuracy. It performs better, with an accuracy of 92% for classifying sensitive data, than the existing techniques NB (82.76%), RF (91.20%), and SVM (77.30%). Because the proposed algorithm classifies medical data more accurately, health care applications can categorize and store sensitive data more securely.