Predicting automobile insurance fraud using classical and machine learning models

ABSTRACT


INTRODUCTION
The insurance industry has been a fundamental pillar of the modern world for many years. It is relevant to society because it offers financial security in the event of an accident or illness. However, some individuals file false insurance claims to obtain compensation or benefits, a practice known as insurance fraud. Fraudulent claims are a serious offense: they not only threaten the profitability of insurers and policyholders but also harm the insurance industry and the wider social and economic systems.
The insurance fraud claims problem has remained unresolved since the beginning of the insurance industry due to a lack of resources, research, documentation, and technology. However, in the late nineteenth century, growing systematic data collection allowed the use of pattern recognition techniques such as regression, neural networks (NN), and fuzzy clustering [1]. By the late twentieth century, data played a central role in the insurance industry, and today most insurance companies have more access to data than ever before. As databases grow in size, the traditional approach may overlook a significant portion of fraud because large databases are difficult to review manually. With emerging technology, this issue can be addressed by using machine learning models to identify and predict fraudulent claims [2]-[4].

ISSN: 2088-8708 
Predicting automobile insurance fraud using classical and machine … (Shareh-Zulhelmi Shareh Nordin)

The performance of machine learning models depends on many factors, such as data quality and model settings. The novelty of this paper is the evaluation of machine learning classification techniques for predicting automobile insurance fraud. We also emphasize the importance of the data preparation stage for ensuring data quality, and of statistical tests for identifying the relevant independent variables. The aim of this paper is to compare the performance of a classical statistical model (logistic regression) with machine learning models (NN, SVM, decision tree, random forest, and the AdaBoost algorithm) and a semi-naïve Bayesian learning model (TAN). All models were evaluated using various performance metrics. The remainder of the paper is organized as follows: section 2 describes the empirical data and the data preparation phase. Section 3 discusses the classical statistical model and machine learning models, as well as the evaluation criteria used to assess model performance. Results are discussed in section 4, and section 5 concludes the paper.

DATA DESCRIPTION AND DATA PREPARATION PHASE
The car insurance claim data, consisting of a dependent variable, fraudulent insurance claim (denoted claim_flag), and 23 independent variables with a total of 10,299 cases, were retrieved from the Kaggle website [27]. The independent variables fall into four major groups. Drivers' demographic factors: age of the driver (age), gender (gender), maximum education level (education), job category (occupation), income (income), marital status (mstatus), single parent (parent1), number of children (homekids), years on the job (yoj), home value (home_val), number of driving children (kidsdriv), distance to work (travtime), and home/work area (urbancity). Vehicle factors: vehicle age (car_age), value of the vehicle (bluebook), type of car (car_type), vehicle use (car_use), and red car (red_car). Insurance factors: number of claims in the past 5 years (claim_freq), total claims in the past 5 years (oldclaims), and time in force (tif). License/charges factors: license revoked in the past 7 years (revoked) and motor vehicle record points (mvr_points).
As data preparation is one of the most important stages in the data mining process, data cleaning and transformation, such as removing illogical data, data reclassification, data encoding, data imputation, and binning of continuous variables, were carried out via IBM SPSS Modeler 18. After removing illogical values of age, a total of 10,191 cases were used for the analysis. We observed that the proportion of non-fraudulent claims (73.59%) is much higher than that of fraudulent claims (26.41%), indicating that the claim_flag data is imbalanced. The twenty-three independent variables comprise ten continuous and thirteen categorical variables. During data cleaning and transformation, three categorical variables were reclassified (homekids, kidsdriv, and claim_freq), eight variables were encoded (gender, education, mstatus, parent1, urbancity, car_type, car_use, and revoked), six variables with missing values were imputed (age, occupation, income, yoj, home_val, and car_age), and four continuous variables were binned (income, home_val, bluebook, and oldclaims). We then carried out chi-squared tests to determine the significance of the relationship between the dependent variable (claim_flag) and each independent variable. The results show that the value of the vehicle (bluebook) and red car (red_car) were not significant at the 5% level. These two variables were therefore omitted, and the remaining twenty-one independent variables were used for developing the machine learning models. A complete "clean" dataset is available upon request.
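The chi-squared screening step above can be sketched as follows. This is an illustrative stand-in (the paper used IBM SPSS Modeler); the toy data and column names are made up, and only the test itself mirrors the paper's procedure of retaining variables significant at the 5% level.

```python
# Hypothetical sketch of the chi-squared screening step: test whether a
# categorical predictor (an illustrative "urbancity" column) is associated
# with the fraud flag at the 5% level. Data are made up for demonstration.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "claim_flag": [0, 0, 1, 1, 0, 1, 0, 1, 0, 0],
    "urbancity":  ["urban", "rural", "urban", "urban", "rural",
                   "urban", "rural", "urban", "rural", "urban"],
})

# Cross-tabulate the predictor against the fraud flag
table = pd.crosstab(df["urbancity"], df["claim_flag"])
chi2, p_value, dof, _ = chi2_contingency(table)
keep = p_value < 0.05  # retain the variable only if significant at 5%
```

In the paper's workflow, the same test would be run for each of the 23 independent variables, dropping those (bluebook and red_car) whose p-values exceed 0.05.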

RESEARCH METHOD
Many machine learning models have been used in insurance fraud detection. Among them, logistic regression ranks as the most common model, followed by NNs, Bayesian belief networks, decision trees, and naïve Bayes [24]. As the logistic regression model is easy to implement, this classification model is still widely used in many recent applications [28].
Due to the complexity of the data, machine learning models have shown their potential in classification problems [29]. In this paper, we evaluate the empirical classification performance of seven predictive models: logistic regression, NN, SVM, TAN, decision tree, random forest, and AdaBoost.
a. Logistic regression is a common model for predicting a dichotomous dependent variable based on one or more independent variables. Different model selection settings, including enter, forward, and backward stepwise, are considered.
b. NN is a useful machine learning model for categorical and continuous dependent variables. An NN has three or more interconnected layers, and each hidden layer consists of a number of neurons. These neurons pass data to the deeper layers, which in turn send the final output to the output layer. Through the backpropagation method, each time the output is labeled as an error during the supervised training phase, the information is sent backward to update the weights until the error is minimized [30].
c. SVM [31] is a classification technique that finds the optimum separating hyperplane between the target classes (fraud or not fraud) in a multidimensional space, maximizing the separation between the two classes. The performance of an SVM depends greatly on the choice of kernel function; the kernels considered here are sigmoid, radial basis function (RBF), linear, and polynomial [31], [32].
d. The naïve Bayes model is based on Bayes' theorem with an assumption of independence among the independent variables X. It calculates the posterior probability P(Y|X) from P(Y), P(X|Y), and P(X), where Y is the dependent variable; continuous features are converted to categorical types [32]. TAN is a semi-naïve Bayesian learning model that relaxes the naïve Bayes attribute-independence assumption by employing a tree structure in which each attribute depends only on the class and one other attribute [33].
e. The decision tree is a non-parametric supervised machine learning algorithm that predicts the target variable by building a tree based on a splitting criterion. If the dependent variable is continuous, the model is called a regression tree, while a classification tree involves a categorical dependent variable. The tree structure begins with the parent (or root) node, splits into child nodes, and ends with the leaf nodes. Each node of a classification tree is split based on a splitting criterion, which is either the Gini index, entropy, or chi-square. In IBM SPSS Modeler, the classification and regression tree (CART) algorithm uses the Gini criterion, while C5 and chi-squared automatic interaction detection (CHAID) use entropy and chi-square, respectively. In the classification tree, each leaf node represents a decision rule to predict the class of the dependent variable [32].
f. Random forest is an extension of the decision tree model. It builds a forest of trees in a random manner, and the large number of trees provides an ensemble model for prediction. It increases predictive accuracy by using bootstrap samples of the training data and random feature selection during tree induction. Each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors, where m ≈ √p. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble [34].
g. The adaptive boosting algorithm, AdaBoost [18], is a machine learning model for regression and classification problems that produces a predictive model in the form of an ensemble of weak predictive models. It constructs the model in a stage-wise manner, as other boosting methods do. The outputs of the weak learners are combined into a weighted sum that represents the final output of the weighted classifier. AdaBoost is adaptive in the sense that subsequent weak learners are adjusted in favor of samples that were misclassified by the previous classifier. The steps involved in the boosting process can be found in [18], [35].
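The model families above can be sketched with scikit-learn stand-ins (the paper itself used IBM SPSS Modeler, and TAN has no scikit-learn implementation, so it is omitted). The synthetic data, class weights, and settings below are illustrative only:

```python
# Illustrative sketch: fit scikit-learn counterparts of several of the
# listed classifiers on synthetic imbalanced data and compare test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# Roughly mimic the paper's 74/26 class imbalance on toy data
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.74, 0.26], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic":      LogisticRegression(max_iter=1000),
    "mlp":           MLPClassifier(max_iter=1000, random_state=0),
    "svm_linear":    SVC(kernel="linear"),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "adaboost":      AdaBoostClassifier(random_state=0),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

Each classifier would, in practice, be tuned over the settings the paper lists (kernel choice for SVM, number of trees for random forest, and so on) rather than left at defaults.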
To assess the performance of these predictive models, accuracy is the most commonly used performance metric [36]. However, this metric does not fully capture the effectiveness of a classifier on imbalanced data, and many other performance metrics have been proposed in terms of error and fitness [37]. We evaluate the predictive performance using the following seven metrics, where TP is true positive (fraud predicted as fraud), TN is true negative (non-fraud predicted as non-fraud), FP is false positive (non-fraud predicted as fraud), and FN is false negative (fraud predicted as non-fraud).
a. Accuracy = (TP + TN) / (TP + TN + FP + FN).
b. Sensitivity = TP / (TP + FN).
c. Specificity = TN / (TN + FP).
d. Precision = TP / (TP + FP).
e. Misclassification (error) rate = (FP + FN) / (TP + TN + FP + FN).
f. Area under the ROC curve (AUC). The closer the AUC value is to one, the better the model.
g. Gini coefficient, which lies between 0 and 1. The higher the Gini coefficient, the better the model.
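The seven metrics can be computed directly from confusion-matrix counts. In the sketch below the counts and the AUC value are made up for illustration; the Gini coefficient is derived from the AUC via the standard relation Gini = 2·AUC − 1:

```python
# Sketch of the seven evaluation metrics from confusion-matrix counts.
# The counts and the AUC value below are illustrative only.
TP, TN, FP, FN = 300, 1900, 150, 432

accuracy    = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)          # recall on the fraud class
specificity = TN / (TN + FP)          # recall on the non-fraud class
precision   = TP / (TP + FP)
error_rate  = 1 - accuracy            # misclassification rate

auc  = 0.80                           # assumed AUC for illustration
gini = 2 * auc - 1                    # Gini coefficient, approx. 0.60
```

On imbalanced data such as this fraud dataset, a classifier can achieve high accuracy and specificity while its sensitivity stays low, which is why the paper reports all seven metrics rather than accuracy alone.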

RESULTS AND DISCUSSION
A total of 10,191 cases were divided into two groups: 90% of the dataset (9,171 cases) was used in the training and testing stages, and the remaining 10% (1,020 cases) was reserved for the deployment stage. We then partitioned the 9,171 cases into 6,389 cases (70%) for training the models and 2,782 cases (30%) for testing their performance.
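The two-step split above can be sketched as follows. The feature matrix is a placeholder, and the exact sub-split counts depend on the tool's rounding, so they may differ slightly from the paper's 6,389/2,782 partition:

```python
# Sketch of the paper's two-step split: hold out 10% for deployment,
# then split the remaining 90% into 70% training / 30% testing.
# The feature matrix is a random placeholder with the paper's 10,191 rows.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
data = rng.normal(size=(10191, 5))        # placeholder feature matrix

work, deploy = train_test_split(data, test_size=0.10, random_state=0)
train, test  = train_test_split(work, test_size=0.30, random_state=0)
```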
Six machine learning models, namely logistic regression (enter, forward, and backward), NN (MLP and RBF), SVM (sigmoid, RBF, linear, and polynomial), TAN, random forest (50 and 100 trees), and decision tree (CHAID, CART, and C5), together with a boosting algorithm (AdaBoost with CHAID, CART, and C5), were evaluated using several performance metrics. Each model was evaluated under different settings, and the results are shown in Table 1, which reports the seven performance metrics of the various models at the training and testing stages. For the logistic regression model, all variables except age, income, parent1, home_val, and car_age are significant at the 5% level and are therefore useful for forming the model, regardless of the selection method. Accordingly, the performance metrics are nearly identical for the training and testing stages across the selection methods. Figure 1 shows that the ROC curves of the enter, forward, and backward stepwise selections are identical for both stages. As all three settings also provide identical AUC and Gini values in both stages, the logistic regression model with the enter setting was selected for comparison with the other models. For the neural network model, urbancity, mvr_points, and travtime are ranked as the top three important variables by the MLP, while mvr_points, age, and travtime are ranked top three by the RBF. The NN model based on the MLP performed better than the RBF in terms of accuracy, sensitivity, precision, misclassification rate, AUC, and Gini in both stages. Although the RBF model has a slightly higher specificity, the overall performance of the MLP is still better. In addition, the ROC curves in Figure 2 show that the MLP (light blue line) model has a higher curve than the RBF (maroon line) model. As a result, the NN model based on the MLP was selected for comparison with the other models.
The performance metrics for the SVM model were then evaluated using four different kernels. All variables are equally important for the SVM (sigmoid). The top three important variables are education, occupation, and homekids for the SVM (RBF); homekids, car_type, and mstatus for the SVM (polynomial); and education, mstatus, and occupation for the SVM (linear). Among these kernels, the linear kernel generally outperforms the others in the testing stage, with the best testing accuracy (79.26%), precision (66.97%), misclassification rate (20.74%), AUC (0.81), and Gini (0.616). The sigmoid kernel performs the worst in terms of sensitivity and precision (0.00%); although it has a higher specificity (99.95%), its very low sensitivity and precision indicate that the SVM (sigmoid) is unable to predict true positives effectively. The results also show an overfitting issue for the SVM (polynomial), as its performance metrics are much lower in the testing stage. In Figure 3, the ROC curves show that the linear (maroon line) kernel has a consistent curve across both stages compared with the RBF (light blue line), sigmoid (blue line), and polynomial (green line) models. Hence, the SVM with the linear kernel was selected for comparison with the other models. The random forest models with 50 and 100 trees show that the top three important variables are age, travtime, and yoj. These models provide perfect accuracy, sensitivity, specificity, and precision at the training stage. However, these metrics decrease at the testing stage regardless of the number of trees, indicating that the model is overfitting. The ROC curves in Figure 5 show that the curve of each model is inconsistent between the two stages, and the AUC values for the two stages likewise disagree. As a result, random forest is not an ideal model for this insurance fraud data.
For the decision tree models using CHAID, CART, and C5, C5 produces the highest number of decision rules (170). The most important variables are occupation, claim_freq, and urbancity for CHAID; claim_freq, revoked, and urbancity for CART; and urbancity, gender, and mstatus for C5. From Table 1, the seven performance metrics of CHAID and CART are consistent across both stages, while those of C5 are inconsistent. In the testing stage, the accuracy (76.74%), specificity (93.18%), precision (61.22%), and misclassification rate (23.26%) of CART are slightly better than those of CHAID, and the AUC and Gini values of CART are also slightly better than those of the other settings. However, the ROC curves in Figure 6 show that CHAID (maroon line) has slightly better curves than the CART (light blue line) and C5 (blue line) models. As a result, the decision tree with CHAID was selected for comparison with the other models.
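The overfitting symptom described for the random forest (perfect training metrics, noticeably worse testing metrics) can be illustrated by comparing stage accuracies. The data, noise level, and settings below are assumptions chosen to reproduce the symptom, not the paper's setup:

```python
# Sketch of the overfitting symptom: an unpruned random forest can score
# near-perfectly on training data while dropping on held-out data.
# Synthetic data with label noise (flip_y) makes the gap visible.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20,
                           flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_tr, y_tr)

train_acc = rf.score(X_tr, y_tr)
test_acc  = rf.score(X_te, y_te)
gap = train_acc - test_acc   # a large train-test gap signals overfitting
```

Comparing training- and testing-stage metrics in this way is exactly how Table 1 exposes the random forest's (and the polynomial SVM's) overfitting.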
Finally, we compared the performances of all selected models: logistic regression (backward), NN (MLP), SVM (linear), TAN, random forest (100 trees), and CART_boost. The logistic regression (backward), SVM (linear), and TAN share similar and higher accuracies (79.19%, 79.26%, and 79.35%, respectively) compared with CART_boost, NN (MLP), and random forest (77.75%, 78.07%, and 78.15%, respectively). Overall, the TAN model has the highest testing accuracy (79.35%) and sensitivity (44.70%). Thus, the TAN model was selected for deployment using the 10% holdout data (1,020 cases). Results in Table 2 show that the performance of the model at the deployment stage is quite similar to that at the testing stage.

CONCLUSION
Machine learning models are useful tools for classification and prediction problems. The performance of seven machine learning models, namely logistic regression, decision tree, NN, TAN, SVM, random forest, and AdaBoost for decision trees, with different model settings, was compared using a publicly available automobile insurance fraudulent claims dataset. This study found that the TAN model has better classification performance than the other models. The result concurs with other studies that random forests are prone to overfitting. In addition, the results show that the AdaBoost algorithm can improve the classification performance of the decision tree. In this study, the sensitivity of all machine learning models was less than 50%, which could be due to the dataset being imbalanced, with only 26.41% fraudulent cases. Future work should consider sampling techniques such as the synthetic minority oversampling technique (SMOTE) to balance the data before applying machine learning models.
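The minority oversampling suggested for future work can be sketched as follows. This is a minimal hand-rolled illustration of the SMOTE idea (interpolating between minority samples and their nearest minority neighbours); a library such as imbalanced-learn provides a production implementation. The sample counts and dimensions below are arbitrary:

```python
# Minimal SMOTE-style sketch: synthesize new minority-class points by
# interpolating between each minority sample and one of its nearest
# minority neighbours. Toy data; counts are illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(26, 2))            # minority (fraud) samples
n_new = 48                                  # synthetic points to create

nn = NearestNeighbors(n_neighbors=3).fit(X_min)
_, idx = nn.kneighbors(X_min)               # idx[:, 0] is the point itself

synthetic = []
for _ in range(n_new):
    i = rng.integers(len(X_min))            # pick a minority sample
    j = rng.choice(idx[i][1:])              # one of its 2 nearest neighbours
    lam = rng.random()                      # interpolation factor in [0, 1)
    synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))

X_syn = np.array(synthetic)                 # new minority samples
```

Balancing the 26.41% fraudulent class in this way before training could raise the models' sensitivity, which is the weakness this study identified.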

Figure 1. ROC curves of the logistic regression models

Figure 2. ROC curves of the NN models based on MLP and RBF

Figure 3. ROC curves of the SVM models

Figure 4. ROC curves of the TAN model

Figure 5. ROC curves of the random forest models

Table 1. Performance metrics of various models for the training and testing stages. Notes: i) the values of accuracy, sensitivity, specificity, precision, and error are given in percentages; ii) values in parentheses are the results for the testing stage.

Table 2. Performance metrics of the TAN model for the deployment stage