MapReduce-iterative support vector machine classifier: novel fraud detection systems in healthcare insurance industry

Fraud in healthcare insurance claims is a significant research challenge that affects the growth of healthcare services. Healthcare fraud is committed by subscribers, companies, and providers. A decision support system is developed to automate the analysis of claim data from service providers and to offset the challenges faced by patients. This paper proposes a novel technique that hybridizes big data and statistical machine learning, named the MapReduce-based iterative support vector machine (MR-ISVM), which provides a set of steps for the automatic detection of fraudulent claims in health insurance databases. The experimental results show that the MR-ISVM classifier performs better in classification and detection than other support vector machine (SVM) kernel classifiers. The results also show a reduction in the computational time for processing healthcare insurance claims without compromising classification accuracy. The proposed MR-ISVM classifier achieves 87.73% accuracy, compared with 75.3% for the linear kernel and 79.98% for the radial basis function kernel.


INTRODUCTION
The recent advancements made in communication and digital technologies have revolutionized the modern world. They have created a highly connected environment among communicating entities. Different types of networks, such as social platforms, e-commerce, blogs, industrial trading, banking and insurance networks, are growing along with the development of communication technologies. A tremendous volume of data is being generated from these networks. A billion transactions are carried out in a fraction of a second. A vast array of information is easily accessible to fraudsters (or attackers) who create anonymous platforms [1]. With the growth of anomalous networks, fraudsters have found several opportunities to manipulate data without the user's knowledge. Many organizations employ preventive measures to secure their networks and data from internal and external threats with the help of digital technologies. Special consideration is given to the interactions and activities performed among inter-network entities [2]-[5].
A wide range of machine learning (ML) algorithms is continually being explored in different fields of real-time applications. In recent years, ML has gained prominence due to the popularity of big data [6]-[8]. ML problems are characterized as learning from experience with respect to some tasks and performance measures. ML helps users uncover the structure of the data.

There have been a myriad of studies on detecting fraudulent claims in the healthcare industry. This review analyzes them from two aspects: class imbalance and feature representation, which are classical problems in machine learning algorithms [13]. Due to the improper definition of features, an imbalance occurs between the classes, i.e. one class has many data samples whereas the other class has few. Finding abnormal claims is a challenging task because of these issues.
The performance of ML techniques in detecting financial fraud has been surveyed in [14]. The general techniques involved in machine learning are descriptive techniques, predictive techniques, artificial intelligence techniques and hybrid techniques. The analysis stated that hybrid techniques, the genetic algorithm and the support vector machine (SVM) outperformed the other techniques [15]-[17]. Solving financial fraud remains a never-ending task because of class imbalance. Fraud and abuse are the two factors that raise healthcare costs. Class imbalance strongly affects fraud detection in the Brazilian healthcare market [18]. Service providers were asked to find the link between fraud and abuse in the claim authorization process. Using cross-validation over the treatment methods, machine learning algorithms such as SVM, C4.5, random forest and naive Bayes were analyzed. The results stated that random forest was not affected by class imbalance, whereas the other ML algorithms were affected when the class distribution changed.
Insurance frauds have become more complex due to the accumulation of large volumes of data, which is addressed by big data predictive modelling [19]. Distributed ML algorithms were tested with parallel computing tools to differentiate the fraud records. Fraudulent patterns change over time, and the resulting imbalance between the detected patterns creates trouble for detection approaches. Concept drift [20] is a domain that deals with dynamic data, i.e. data that changes over time. Labelling of unsupervised data requires concept learning. The authors dealt with automatic labelling of unsupervised data using the concept drift approach. A permutation test was conducted over each statistic, and the p-value determined the class of the data. Since the approach follows one fixed algorithm, the effects of class imbalance are high in noisy data.
Decision support tools require intensive SME analysis when it comes to prepayment and post-payment control models. The presence of outliers in medical data lowered the accuracy of the detection framework. Outlier detection techniques [21] were explored to find the misclassified patterns, i.e. the false positive rate (FPR). They were tested on Medicaid data covering 650,000 healthcare claims and 369 dentists of one state. Improper cluster formation increased the FPR, and the estimated clusters also implied an improper class distribution. The comprehensive services provided by healthcare sectors have become more portable by adopting Android technologies [22]. Tracking claimed benefits requires timely prediction, and the complex data granularity lowered the accuracy of the framework. Thus, methods such as semi-supervised isomap (SSIsomap) activity clustering, simple local outlier factor (SimLOF) outlier detection, and Dempster-Shafer theory-based evidence aggregation were studied on a real-world dataset. The behavior profile pattern also alters when the data size increases, which strongly influences the estimated frequent itemsets.
The provider-consumer model incurs a considerable expense for healthcare systems. Thus, anomalies were detected from the provider and consumer models [23]. Brazilian healthcare records from 2008 to 2015 were collected and evaluated using bipartite graphs and the k-nearest neighbors (k-NN) algorithm. The bipartite graphs were employed to find the relationships between the two models. The detected similar patterns were then classified into potential provider and anomaly classes using k-NN. The approach was validated using cost and effectiveness as performance measures. Instead of the number of hospitals, the available cities and consumer scores were used for effectiveness estimation. Therefore, feature representation is essential in graph-based approaches.
Several researchers have explored Medicaid to discover fraud in medical data beyond the transaction level [24]. A multidimensional data analysis was designed for fraud classification using sparrow's insights. The fraud patterns were also discovered from unsupervised data. The data was classified into six classes based on the levels at which fraudulent data patterns were identified. It was concluded that insufficient training data had lowered the performance of supervised ML techniques. With the above as a base, an ML model was designed to detect frauds committed by physicians [25]. In billing procedures, frauds may be external or internal. Irrespective of the claimant, physicians were also misusing billing procedures, which is challenging to detect. Hence, a multinomial naïve Bayes algorithm was designed to resolve multi-class classification using 5-fold cross-validation. The classification was done by interchanging features such as field experts, specialty, and provider types. Based on the F-score, the fraud levels of procedures performed by physicians were reported. The model built an association among different levels of physicians when handling the claim data.
Association rule mining is also employed to recognize fraudulent patterns. It is used to construct associations or correlations between features. Initially, the transaction data was transformed into a set of clusters, and then some standard association rules [26] were framed. Based on the lift and confidence values, the data samples were classified into fraudulent and non-fraudulent claims. The analysis of claim data concentrated on the feature extraction phase rather than the classification phase. However, some features are discarded in terms of big data analytics. The involvement of varied actors and commodities [27] in healthcare insurance claims has imposed different challenges on ML techniques. Therefore, an interactive framework for unsupervised data analysis was required, using pairwise computational models such as analytical hierarchical processing (AHP) and expectation-maximization (EM). Data from CGM Turkey for private insurance companies was validated using the area under the curve (AUC). It was stated that the independent analysis of actors and commodities reduced the time needed for predicting fraud. The fragmented nature of feature representation has brought significant changes to the fact-finding process of institutions.
The patient rule induction method (PRIM) [28] was designed to extract anomalous patterns in a big data context. It was applied to the Center for Medicare Services (CMS) 2014 dataset, where it improved the feature space. While partitioning the feature space, however, an in-depth analysis of the different classes is not done. Since it computes conditional probabilities on features, the activities of physicians are not traced. Heuristic approaches to defining optimal fraud indicators are not feasible due to the high accumulation of false claims. Fake billing frauds are more common than other frauds, especially in auto/vehicle insurance claims [29]. Comparison models were designed using random forest, naive Bayes and decision tree and evaluated with confusion matrix measures. They were implemented on a synthetic dataset, and it was concluded that random forest outperformed the other two models. Feature modelling plays a significant part in designing classifiers that reduce the false positive and false negative rates.

Analyzing camouflage behaviors [30] is a troublesome task for classification approaches because such behavior is sustained only for a short period. Patient cluster divergence-based healthcare insurance fraudster detection (PCDHIFD) was designed to classify the fraud caused by camouflage behaviors [30]. With the help of patient admission date features, the correlation values between the patients, hospitals and providers were computed using a graph-based density peak clustering approach. Then, a divergence cluster value was used to detect the fraudulent patients. The F-measure improved by 15% over other classification models. Interpreting the medical admission-oriented features affects the classifiers in the analysis of camouflage behaviors.

Int J Elec & Comp Eng, ISSN: 2088-8708

This research study proposes a novel fraud detection model by hybridizing the strengths of big data and machine learning approaches to solve the insurance claim classification problem.
It reduces the effects of class imbalance over voluminous data that has multiple classes. The insurance claim data is preprocessed using the MapReduce framework, which scales up the efficiency of data processing. Deploying the iterative support vector machine (ISVM) classifier on the processed data helps to classify the fraud providers by executing the specified iterative conditions. The proposed MapReduce-based iterative support vector machine (MR-ISVM) classifier achieves the objective of classification accuracy with less computational time.

METHOD
Class imbalance and feature modelling are mutually dependent in supervised ML techniques. Multi-class learning (MCL) is a challenging domain between ML and big data analytics, and it has received less research attention than single-class learning (SCL). MCL is defined as the problem of associating an instance with more than one class, even for binary labels. Conventional methods do not support MCL because it reduces the prediction accuracy of the application framework. MCL requires a systematic approach to handle medical data effectively and to enhance cost saving and detection efficiency. The feature selection technique (FST) has to handle categorical, continuous, and high-dimensional data to advance the MCL domain. Let us define the problem in vector form. The instances in the database are represented as $A = \{a_1, \ldots, a_p\}$, where $p$ indexes the final instance. Each data instance is obtained from the domain $D = \{d_1, \ldots, d_q\}$. Then, each instance $a$ is associated with a set of class labels.

This research aims to frame a fraud detection model that detects the mishandling of the claiming process using machine learning algorithms. The proposed method is designed to discover provider abuse by analyzing the variables used in treatment, disease and claim. The steps of the proposed process are explained in detail below. Figure 3 presents the block diagram of the proposed research. The proposed research comprises five phases, explained in brief:
a) Data acquisition: the foremost step, which describes the datasets.
b) Data preprocessing: the second step, which describes the organization of the collected datasets.
c) Feature selection: the third step, which describes the selection of features used for constructing the training classifier.
d) Classification: the fourth step, which presents the workflow of the proposed classifier.
e) Detection: the final step, which is applied to the testing data.

Data acquisition
The dataset is collected from a well-known public repository, "Healthcare provider fraud detection analysis" [31]. Provider fraud is one of the biggest scams prevailing in the healthcare industry. When physicians mishandle disease and treatment details, the providers increase the medical costs. The metadata of the dataset is presented in section 3. The collected dataset determines the success rate of the research objectives.

Data preprocessing
The task of this step is to organize the data in the datasets efficiently. This is achieved by eliminating missing values and duplicate data and by developing efficient data partitioning. The examination of missing values and duplicate data is described in the next section. The data partitioning is developed using a novel MapReduce technique. It is found that multiple claim IDs are generated with various providers, differentiated by disease. Owing to this, the MapReduce technique is applied over the 'inpatient and outpatient' tables, and a new table is created based on the disorders. As the name suggests, the MapReduce technique consists of mapper and reducer functions. The mapper function is expressed as,

$$\text{map}: (k_1, v_1) \rightarrow \text{list}(k_2, v_2)$$

The reducer function is expressed as,

$$\text{reduce}: (k_2, \text{list}(v_2)) \rightarrow \text{list}(v_2)$$

where $k_1$ and $k_2$ are the input key and the output key, and $v_1$ and $v_2$ are the input value and the output value.
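The partitioning step can be illustrated with a minimal in-memory sketch of the mapper and reducer roles. It is not the authors' implementation: the record fields (`ClaimID`, `Provider`, `DiagnosisCode`, `Amount`), the function names, and the sample claims are hypothetical, and a real deployment would run on a distributed MapReduce engine rather than a single Python process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical claim record; the field names are assumptions for illustration.
Claim = Dict[str, object]
Value = Tuple[str, str, float]


def mapper(claim: Claim) -> Iterator[Tuple[str, Value]]:
    """Emit (k2, v2): the diagnosis code as key, the claim details as value."""
    yield str(claim["DiagnosisCode"]), (
        str(claim["ClaimID"]),
        str(claim["Provider"]),
        float(claim["Amount"]),
    )


def reducer(key: str, values: List[Value]) -> Tuple[str, Dict[str, object]]:
    """Collapse all claims that share a diagnosis code into one table row."""
    return key, {
        "claims": sorted(v[0] for v in values),
        "providers": sorted({v[1] for v in values}),
        "total_amount": sum(v[2] for v in values),
    }


def map_reduce(records: Iterable[Claim]) -> Dict[str, Dict[str, object]]:
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups: Dict[str, List[Value]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up sample claims, not taken from the dataset.
claims = [
    {"ClaimID": "C1", "Provider": "P1", "DiagnosisCode": "D10", "Amount": 100.0},
    {"ClaimID": "C2", "Provider": "P2", "DiagnosisCode": "D10", "Amount": 50.0},
    {"ClaimID": "C3", "Provider": "P1", "DiagnosisCode": "D20", "Amount": 70.0},
]
table = map_reduce(claims)
```

Here `table` plays the role of the new disease-based table: one row per diagnosis code, listing the claims and providers that share it.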

Feature extraction and selection
Feature selection is the third step, which deals with extracting the features required to build an efficient training classifier. The data table contains a large set of features, and thus the importance of each feature is studied to eliminate the irrelevant ones. Linear discriminant analysis (LDA) is performed over the data table. The objective of LDA is to find the linear combination of features that separates two or more classes of objects. The most desired features are obtained by reducing the dimensionality before building the classifier. LDA is the most suitable model for preserving multiple classes with reduced dimensions. The claiming procedure depends on different aspects of the patients' medical reports. It is found that a beneficiary holds multiple claiming strategies for multiple diseases. The amount is claimed based on the disease code, the treatment code and the total amount. Hence, there are three types of variables: claiming variables, disease variables and treatment variables. In this step, we intend to find the 'confidence score' of the bills given by the provider. The estimated confidence score helps verify the attributes taken for creating, validating and verifying the statements provided by the provider. The confidence score function is calculated as:

$$Z = \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_d x_d$$

$$S(\beta) = \frac{\beta^T \mu_1 - \beta^T \mu_2}{\beta^T C \beta}$$

For the given score function, the aim is to estimate the linear coefficients of the variables that maximize the score, which are given as:

$$\beta = C^{-1}(\mu_1 - \mu_2), \qquad C = \frac{1}{n_1 + n_2}\left(n_1 C_1 + n_2 C_2\right)$$

where $\beta$: coefficients of the linear model, $C_1$ and $C_2$: covariance matrices, $\mu_1$ and $\mu_2$: mean vectors, and $n_1$ and $n_2$: numbers of samples in the two classes. The discriminant can be assessed by computing the Mahalanobis distance between the two groups.

At last, a new data point is classified into C1 (default) or C2 (not-default) by applying the condition

$$\beta^T\left(x - \frac{\mu_1 + \mu_2}{2}\right) > -\log\frac{p(C_1)}{p(C_2)}$$

where $\beta$: coefficient vector, $x$: data vector, $\frac{\mu_1 + \mu_2}{2}$: mean vector, and $\frac{p(C_1)}{p(C_2)}$: ratio of the class probabilities. The point is assigned to C1 when the condition holds and to C2 otherwise. Depending on the obtained scores, the relevant features are extracted and selected for classification.
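The discriminant rule can be sketched for the two-class, two-feature case in plain Python. This follows the textbook form of LDA (coefficients equal to the inverse pooled covariance times the difference of the class means, with a decision threshold set by the class priors); it is an illustration under stated assumptions, not the authors' code. The function names and toy data are hypothetical, the covariances are population covariances (divided by the class size), and the priors are estimated from the class sizes.

```python
import math
from typing import List, Sequence, Tuple

Vec = Sequence[float]


def mean_vec(rows: Sequence[Vec]) -> List[float]:
    """Mean vector of a class (two features per row)."""
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(2)]


def cov2(rows: Sequence[Vec], mu: Vec) -> List[List[float]]:
    """2x2 population covariance matrix of a class (assumption: divide by n)."""
    n = len(rows)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - mu[0], r[1] - mu[1])
        for i in range(2):
            for j in range(2):
                c[i][j] += d[i] * d[j] / n
    return c


def fit_lda(class1: Sequence[Vec], class2: Sequence[Vec]) -> Tuple[List[float], List[float], float]:
    """Return (beta, midpoint, threshold) for the two-class discriminant."""
    n1, n2 = len(class1), len(class2)
    mu1, mu2 = mean_vec(class1), mean_vec(class2)
    c1, c2 = cov2(class1, mu1), cov2(class2, mu2)
    # Pooled covariance C = (n1*C1 + n2*C2) / (n1 + n2)
    c = [[(n1 * c1[i][j] + n2 * c2[i][j]) / (n1 + n2) for j in range(2)] for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    if det == 0:
        raise ValueError("pooled covariance matrix is singular")
    inv = [[c[1][1] / det, -c[0][1] / det], [-c[1][0] / det, c[0][0] / det]]
    diff = (mu1[0] - mu2[0], mu1[1] - mu2[1])
    beta = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    midpoint = [(mu1[i] + mu2[i]) / 2 for i in range(2)]
    # Priors estimated from class sizes, so p(C1)/p(C2) = n1/n2 (assumption).
    threshold = -math.log(n1 / n2)
    return beta, midpoint, threshold


def classify(x: Vec, beta: Vec, midpoint: Vec, threshold: float) -> str:
    """Assign C1 when beta^T (x - midpoint) exceeds the threshold, else C2."""
    score = sum(b * (xi - m) for b, xi, m in zip(beta, x, midpoint))
    return "C1" if score > threshold else "C2"


# Made-up, well-separated toy classes.
class1 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class2 = [(6.0, 7.0), (7.0, 6.0), (7.0, 8.0), (8.0, 7.0)]
beta, midpoint, threshold = fit_lda(class1, class2)
label_a = classify((2.0, 2.0), beta, midpoint, threshold)
label_b = classify((7.0, 6.0), beta, midpoint, threshold)
```

The score computed in `classify` is one way to read the 'confidence score': the further it lies from the threshold, the more clearly the point belongs to one class.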

Classification
The iterative support vector machine (ISVM) is employed to ease the classification tasks with minimized computational effort. It extracts the provider data via feedback loops in an iterative manner. Initially, a hyperplane data cube is created by combining the source data tables and their principal components. Then, a general SVM is applied to the hyperplane data cube to generate the classification map. The MapReduce framework is employed to retrieve the required information from the SVM-based classification map. The output obtained from the applied preprocessing technique is combined with the other hyperplane data cube for the next iteration. Likewise, the iterative process continues until the stopping criterion is met.
h) A stopping rule is defined for terminating the iteration, i.e. the feedback process, which is explained in the next section.
i) If the classification map of the $j$th iteration satisfies the stopping rule, then the ISVM is stopped. At last, the final classification map is declared.
j) Else, the process continues from step (d), with j = j + 1.
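The feedback loop can be outlined as a small skeleton. This is only a sketch of the control flow: the classifier and the stopping rule are passed in as callables, `toy_classify` is a hypothetical threshold rule standing in for the SVM, and appending each label as an extra feature is an assumed, simplified form of the feedback step.

```python
from typing import Callable, List, Optional

Cube = List[List[float]]
ClassMap = List[int]


def iterative_classify(
    cube: Cube,
    classify: Callable[[Cube], ClassMap],
    stop: Callable[[ClassMap, ClassMap], bool],
    max_iter: int = 20,
) -> ClassMap:
    """Classify, feed the map back into the cube, and repeat until stop() holds."""
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    previous: Optional[ClassMap] = None
    for _ in range(max_iter):
        current = classify(cube)
        # Compare the maps of the j-th and (j-1)-th iterations.
        if previous is not None and stop(current, previous):
            return current
        # Feedback (assumed form): append each row's label as an extra feature.
        cube = [list(row) + [float(label)] for row, label in zip(cube, current)]
        previous = current
    return previous


def toy_classify(cube: Cube) -> ClassMap:
    """Hypothetical stand-in for the SVM: label 1 when the row mean exceeds 0.5."""
    return [1 if sum(row) / len(row) > 0.5 else 0 for row in cube]


def same_map(current: ClassMap, previous: ClassMap) -> bool:
    """Simplest possible stopping rule: the map no longer changes."""
    return current == previous


result = iterative_classify(
    [[0.9, 0.8], [0.1, 0.2], [0.7, 0.6]], toy_classify, same_map
)
```

Keeping the classifier and the stopping rule as parameters means an actual SVM and a similarity-based stopping rule can be substituted without changing the loop.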

Framing of stopping rule for ISVM
The main idea behind the stopping rule of the ISVM is to compare the classification maps obtained from the $j$th and $(j-1)$th iterations. The Tanimoto index (TI) is employed to decide when to stop, based on the generated classification maps. It is given as,

$$TI(M_j, M_{j-1}) = \frac{|M_j \cap M_{j-1}|}{|M_j \cup M_{j-1}|}$$

where $M_j$ and $M_{j-1}$ are the classification maps. TI ranges over [0, 1], and a threshold value is defined. If the TI of the obtained classification maps exceeds the given threshold, the iteration is stopped. Figure 5 represents the functionality of the ISVM.
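For binary classification maps, the Tanimoto index reduces to the overlap of the positions labeled 1 divided by their union. The sketch below shows one way to compute it and apply the threshold test. The function names, the threshold of 0.9, and the sample maps are hypothetical, and returning 1.0 for two all-zero maps is a convention chosen here, not something the paper specifies.

```python
from typing import Sequence


def tanimoto_index(map_a: Sequence[int], map_b: Sequence[int]) -> float:
    """Tanimoto index of two binary classification maps of equal length."""
    if len(map_a) != len(map_b):
        raise ValueError("classification maps must have the same length")
    both = sum(1 for a, b in zip(map_a, map_b) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(map_a, map_b) if a == 1 or b == 1)
    # Convention (assumption): two maps with no positive labels are identical.
    return 1.0 if either == 0 else both / either


def should_stop(current: Sequence[int], previous: Sequence[int], threshold: float) -> bool:
    """Stop iterating once the similarity of consecutive maps exceeds the threshold."""
    return tanimoto_index(current, previous) > threshold


previous_map = [1, 0, 1, 1, 0]
current_map = [1, 0, 1, 0, 0]
ti = tanimoto_index(current_map, previous_map)  # 2 shared positives / 3 in the union
keep_going = not should_stop(current_map, previous_map, threshold=0.9)
converged = should_stop(current_map, current_map, threshold=0.9)
```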

RESULTS AND DISCUSSION
The proposed framework is applied to real-world insurance provider data obtained for medical provider fraud detection. The previous model is compared with the proposed model using institution-level variables. From the Medicare data warehouse, the beneficiary, inpatient and outpatient data details are preserved in different tables. Table 1 presents the database tables and their feature details. It is to be noted that the distinguishing feature between inpatient and outpatient data is the absence of the diagnosis code. The data is preprocessed using the MapReduce technique, which cleans the medical treatment records and claiming records by removing missing values and fixing errors. As a result, we have used 5,000 records for modeling. The Provider ID and the Beneficiary ID are the primary keys, and claim details such as the reimbursement and the deductible amount are taken as the secondary key values of this study. A beneficiary can hold both inpatient and outpatient data, and thus the data is organized using the MapReduce framework. Table 2 presents the sample records organized using the MapReduce framework. The key idea is to recognize the "clean and organized" data that can reuse the previous results, i.e., the framework splits the input data into smaller volumes quickly and stably. These smaller data volumes may ensure that more of the small data volumes are clean. However, very small data volumes increase the overhead, and thus the designed MapReduce framework, as a preprocessor, must assure stability and speed. To address the data imbalance issue, the MapReduce framework examines the cardinality of the majority and minority classes. Compared to the synthetic minority oversampling technique (SMOTE), the proposed MapReduce technique modifies the intrinsic way the data learning process works.
The developed Java-based decision support engine is connected to MySQL using Java server pages (JSP) scripts. The feature extraction process on the preprocessed data involves claims cost validation. A new data table, 'Unbundled date', is created and linked with the proposed ISVM classifier. The claims are split into two groups: i) claims with approved costs within each diagnostic related group and ii) claims with disapproved costs within each diagnostic related group. LDA is used to extract a subset of nominal attributes that targets the probability distribution of the data classes. Using these attributes, the separated classes stay close to the original class data distribution. A new data table is constructed for the estimated 'confidence score' of the bills given by the provider. The features chosen based on LDA are attendance data, hospital code, diagnostic related group, claim bill, and drug bill. The dataset is split into 70% for training and 30% for testing the ISVM. The approved claims are then fed into the ISVM training classifier. The data that best meet LDA's confidence score criteria are classified first. Each instance of this dataset is labeled as "Fraud provider" or "Legal provider". The proposed iterative conditions fed into the ISVM classifier to detect the fraud providers are:
i) The count of total BeneID is compared with the total ClaimID for each provider. If the count of BeneID is greater than the count of ClaimID, the provider is labeled as a fraud provider.
ii) claimStartDate and ClaimEndDate are

After each classification, the confusion matrix is displayed. The matrix is composed of the counts of true legal, true fraud, false legal, and false fraud:
a) True legal provider: the count of 'approved costs' correctly classified as "True legal provider" by the ISVM classifier.
b) True fraud provider: the count of 'disapproved costs' correctly classified as "True fraud provider" by the ISVM classifier.
c) False legal provider: the count of 'disapproved costs' incorrectly classified as legal providers by the ISVM classifier, even though they are not.
d) False fraud provider: the count of 'approved costs' incorrectly classified as fraud providers by the ISVM classifier, even though they are not.

Figure 6 presents the proposed implementation framework. The performance metrics are employed to evaluate the MR-ISVM classifiers.
a) Accuracy: the proportion of correctly recognized classes to the total number of data samples.
The accuracy metric is most meaningful on balanced datasets and is expressed as,

$$\text{Accuracy} = \frac{TL + TF}{TL + TF + FL + FF}$$

where $TL$, $TF$, $FL$ and $FF$ are the counts of true legal, true fraud, false legal and false fraud providers in the confusion matrix.

Table 3 and Figure 7 represent the number of fraud data available in the testing datasets. The accuracy of the MR-ISVM classifier is evaluated from its ability to classify and detect fraudulent providers. The proposed MR-ISVM classifier is tested with 10-fold cross-validation over the hyperparameters (C, $\gamma$). A random search is performed on the ISVM training parameters until the optimal claims data samples are classified. Sample screenshots of the proposed framework are shown in Figure 8.

Table 3. Fraud provider types based on data volume
Fraud provider types            Sample data size
                                1,000   2,000   3,000   4,000   5,000
Identity-wise analysis          6       8       45      34      22
Date-wise analysis              0       56      34      12      98
Diagnosis code-wise analysis    45      23

Table 4 and Figures 9 to 11 represent the statistics of the SVM classifiers for each sample data size. The confusion matrix, also known as the error matrix, helps to visualize the performance of the iterative SVM classifier. As the sample data size increases under the given iterative conditions, the classification and detection of fraud claims increase exponentially.

Table 5 represents the average performance of the different versions of the SVM classifier. It is observed that the MR-ISVM classifier performs the best classification, with an accuracy of 87.73%, a precision of 88.98% and a recall of 79.95%. Compared to the radial basis function and linear kernels, the MR-ISVM performs better at classifying and detecting the fraud providers. Figure 12 represents the analysis of the computational time of the MR-ISVM classifier against the linear and radial basis function kernels. Along with the classification, the time required to process the sample datasets is significant in this study. It is understood from the above analysis that the computational time increases with the volume of the sample dataset.
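The metrics reported above can be derived from the four confusion matrix counts, as in the following sketch. It assumes that the fraud provider is the positive class for precision and recall, which the text does not state explicitly; the function name and the example counts are hypothetical and are not results from the study.

```python
from typing import Dict


def evaluate(true_legal: int, true_fraud: int, false_legal: int, false_fraud: int) -> Dict[str, float]:
    """Accuracy, precision and recall, with fraud as the positive class (assumption)."""
    total = true_legal + true_fraud + false_legal + false_fraud
    if total == 0:
        raise ValueError("the confusion matrix is empty")
    predicted_fraud = true_fraud + false_fraud  # everything labeled as fraud
    actual_fraud = true_fraud + false_legal     # everything that really is fraud
    return {
        "accuracy": (true_legal + true_fraud) / total,
        "precision": true_fraud / predicted_fraud if predicted_fraud else 0.0,
        "recall": true_fraud / actual_fraud if actual_fraud else 0.0,
    }


# Made-up counts for illustration only.
metrics = evaluate(true_legal=40, true_fraud=30, false_legal=20, false_fraud=10)
```

Reporting precision and recall alongside accuracy matters here because accuracy alone can look high on imbalanced claim data even when few fraud providers are detected.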
Figure 13 presents the comparative analysis between the existing and proposed techniques. The proposed MR-ISVM classifier takes less computational time than the linear and radial basis function kernels. The variation in time is owing to the training dataset being prepared using the MapReduce framework. Data has been growing widely and rapidly in recent times. Thus, the growing demand for computational resources calls for proper and accurate machine learning approaches.

CONCLUSION
The healthcare industry generates a tremendous amount of data from heterogeneous data sources such as medical reports, hospital devices, and billing systems. Healthcare data transactions are too complex and voluminous to be processed by conventional methods. Fraud detection is one of the major research areas that need to be scaled up to real-time scenarios. It is a kind of risk management control activity. Class imbalance and feature modeling are the major issues that degrade the performance of machine learning approaches on healthcare data. This research work introduces a novel fraud detection model by hybridizing the qualities of big data and machine learning approaches. The collected insurance claims data is preprocessed using the MapReduce framework, which categorizes the voluminous claims data. The required features related to the disease, treatment, and total amount are modeled using LDA. The ISVM approach is explored due to its strength in separating the claims data into legal and fraud providers. The soft margin function enables the separation of claims data, which is done through iterative conditions. Thus, the fraud detection system combines the two approaches and achieves higher fraud detection accuracy. The implementation analysis has demonstrated that the MR-ISVM classifier performs better in classification and detection than other SVM kernel classifiers. The achieved results show a positive impact in reducing the computational time for processing healthcare insurance claims without compromising classification accuracy. The proposed MR-ISVM classifier achieves 87.73% accuracy, compared with 75.3% for the linear kernel and 79.98% for the radial basis function kernel.