Comparative Study of Classification Method on Customer Candidate Data to Predict its Potential Risk

ABSTRACT


INTRODUCTION
Data mining is a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions [1]-[3]. The process extracts or recognizes important patterns from the data stored in a database. Many data mining techniques are available, among them C4.5, Naive Bayes, Apriori, K-NN, and many others.
Bank credit risk assessment is widely used at banks around the world. Bank risks include: credit risk, the risk that a loan will not be repaid on time or at all; liquidity risk, the risk that too many deposits will be withdrawn too quickly, leaving the bank short on immediate cash; and interest rate risk, the risk that the interest rates priced on bank loans will be too low to earn the bank adequate money [4]. Because credit risk evaluation is crucial, a variety of techniques are used to calculate the risk level. In addition, credit risk assessment is one of the main functions of the banking community. Banks classify clients according to their profiles. During classification, the financial background of the customers and subjective factors related to them are evaluated [5]. To help a company process large volumes of data, a system is needed that produces a decision on a prospective customer's risk. One option is to use data mining techniques so that the data can be used optimally. Exploiting these data is expected to help identify customer candidates who are predicted to have payment problems in the future, and to help determine the creditworthiness of prospective customers.
A study published under the title "C4.5 Algorithm to Predict the Impact of the Earthquake" notes that the time of an earthquake cannot be predicted, but its expected impact can be predicted from historical seismic data. One of the methods used to mine information from historical data is the C4.5 data mining algorithm. The output of C4.5 in predicting the impact of a quake is divided into three classes: no impact/minor damage, severe damage, and damage with tsunami. Predicting the implications of an earthquake is expected to help minimize its impact. The study uses the attributes epicenter, distance from the beach, depth, scale, duration, and effect. The resulting rules are as follows. If the scale is low, there is no effect. If the scale is medium and the duration is short, there is no effect. If the scale is medium and the duration is long, damage results. If the scale is high and the earthquake occurs within a certain distance from the coast or on land, damage also results. If the scale is high and the distance from the coast is very far, damage and a tsunami result. If the scale is high, the distance from the coast is far, and the epicenter is in the sea, damage and a tsunami result [6].
Another study that uses C4.5 is presented in [7]. It uses rainfall, soil, and climate datasets to predict crop production. The datasets are preprocessed to remove unwanted and null data. A feature extraction method derives a subset of new features from the datasets through functional mapping while preserving the information. For feature selection, a genetic algorithm is used to select the optimal features, since it provides the opportunity to discover the optimum solution. An enhanced ANFIS classifier is then used. This classifier is an improvement of the C4.5 classifier in the hidden layer, used to generate the rules that predict the yield. With this enhancement of C4.5, the experimental results of the proposed work show a better accuracy, 92.50%, than the existing classifier. A comparative study of the performance of decision tree variants for information mining in forest burned areas was conducted by Putri et al. and published in [8]. The study compared three decision tree variants, i.e. the CART, C5.0, and C4.5 algorithms. Of these three, the C5.0 algorithm is the most suitable for spatial data of forest burned areas. It outperforms the others, with an accuracy of 99.79%.
In [9] the authors use a Naive Bayes classifier to predict hypertension in patients. Hypertension is a significant health problem, and patients may not recognize the disease for years. At the same time, it is still difficult to answer complex queries such as "Given patient records, predict the probability of patients getting hypertension". Clinical decisions are often made based on doctors' intuition and experience rather than on the knowledge-rich data hidden in the database. In that study, the Naive Bayes algorithm is employed to build a model with predictive capabilities, which provides new ways of exploring and understanding knowledge. The attributes used are sex, chest pain, exam, age, systolic BP, diastolic BP, cholesterol, fasting blood sugar, thalach, old peak, and the risk of hypertension. The Naive Bayes experiments in the study give a recall of 83.70%, a precision of 83.60%, and an accuracy of 83.67%. Another interesting application of Naïve Bayes for classification is presented in [10], where the author reports a Zakah receiver classification experiment that uses the Naïve Bayes classifier. According to the experiment results, the classifier provides good accuracy, i.e. 85%. An application of Naïve Bayes classifiers in the social media mining domain is discussed in [11]. That study explored the Multinomial Naïve Bayes classifier for mining the sentiment opinion pattern of GSM based on customers' Twitter accounts. Using 1665 features of the dataset, the technique provides an accuracy of 73.15%.
In this work we perform an experimental study of the Naive Bayes and C4.5 algorithms applied to the customer data history of a leasing company. The purpose of the study is to evaluate the performance of both algorithms in assisting the leasing company to decide whether to approve customer candidates who apply for leasing. Such a study is critical in the local Indonesian context, since financial technology is currently growing quickly while the information technology environment, especially software/applications, is still in its initial phase. To the authors' knowledge, there are very few publications on the application of Artificial Intelligence or Machine Learning to this domain for Indonesian cases.

MATERIAL AND METHOD

Classification
Classification is a data mining technique that analyzes a given dataset and assigns each instance to a particular class such that the classification error is minimized. It is used to extract models that accurately describe the important data classes within the dataset. Classification is a two-step process. In the first step, a model is created by applying a classification algorithm to a training dataset. In the second step, the extracted model is tested against a predefined test dataset to measure the performance and accuracy of the trained model. Classification is thus the process of assigning a class label to data whose class label is unknown [9].

C4.5 Algorithm
The C4.5 algorithm is used to construct a decision tree [12], a powerful and well-known classification and prediction method. The decision tree method transforms a very large set of facts into a tree that represents rules. A decision tree is also useful for exploring the data to find the relationship between the input variables and a certain output/target variable. In general, the C4.5 algorithm constructs a decision tree as follows:
a. Select an attribute as the root.
b. Create a branch for each value of that attribute.
c. Distribute the cases among the branches.
d. Repeat the process for each branch until all cases in a branch have the same class.
The attribute selected as the root is the one with the highest gain value among the existing attributes. The gain is calculated as follows:
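As an illustration, a commonly used definition of information gain, assumed here rather than taken from the study, is $Gain(S, A) = Entropy(S) - \sum_{i} \frac{|S_i|}{|S|} Entropy(S_i)$ with $Entropy(S) = -\sum_{j} p_j \log_2 p_j$, where $S_i$ is the subset of cases taking the $i$-th value of attribute $A$ and $p_j$ is the proportion of class $j$. The Python sketch below applies this definition to the tree construction steps listed above. It is a simplified, ID3-style illustration: a full C4.5 implementation additionally uses the gain ratio, handles continuous attributes, and prunes the tree. The function names and the toy records are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy of each partition."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(part) / len(rows) * entropy(part)
                    for part in partitions.values())
    return base - remainder


def build_tree(rows, attributes, target):
    """Recursively pick the highest-gain attribute until a branch is pure."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    root = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != root]
    branches = {}
    for value in {row[root] for row in rows}:
        subset = [row for row in rows if row[root] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {root: branches}


# Hypothetical toy records; these are not taken from the study's dataset.
toy_rows = [
    {"salary": "high", "marital_status": "married", "worthiness": "likely"},
    {"salary": "high", "marital_status": "single", "worthiness": "likely"},
    {"salary": "low", "marital_status": "married", "worthiness": "unlikely"},
    {"salary": "low", "marital_status": "single", "worthiness": "unlikely"},
]
toy_tree = build_tree(toy_rows, ["salary", "marital_status"], "worthiness")
```

In this toy example the salary attribute separates the classes perfectly, so it has the highest gain and becomes the root.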

Naive Bayes Algorithm
The Naive Bayes algorithm studies the events in database records by calculating how the analyzed variables relate to other variables [13]. As a result, we can predict, for example, whether a person belongs to a certain group based on the variables attached to that person. Naive Bayes can also indicate, in the form of probabilities, which variables are most influential. Naive Bayes is a simple probability-based prediction technique that applies Bayes' theorem under a strong independence assumption. The stages of the Naive Bayes process are:
a. Count the number of classes/labels.
b. Count the number of cases per class.
c. Multiply the probabilities of all variables for each class.
d. Compare the results per class.
The formula of the Naive Bayes algorithm is as follows:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

where $P(c \mid x)$ is the probability of class $c$ given the attribute values $x$, $P(c)$ is the prior probability of class $c$, $P(x \mid c)$ is the probability of $x$ given class $c$, and $P(x)$ is the probability of $x$.
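The stages above can be sketched in Python as a counting-based classifier for categorical attributes. This is a minimal sketch under stated assumptions: it applies no smoothing, so an unseen attribute value gives a class a score of zero, and the function names and toy records are hypothetical. The scores are the unnormalized products $P(c) \prod_i P(x_i \mid c)$, which is sufficient for comparing classes because $P(x)$ is the same for every class.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, target):
    """Count cases per class and attribute values per (class, attribute)."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)
    for row in rows:
        for attribute, value in row.items():
            if attribute != target:
                value_counts[(row[target], attribute)][value] += 1
    return class_counts, value_counts


def predict(model, record):
    """Return the highest-scoring class and the unnormalized score per class."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(c)
        for attribute, value in record.items():
            # conditional P(x_i | c), estimated from relative frequencies
            score *= value_counts[(label, attribute)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical toy records; these are not taken from the study's dataset.
nb_rows = [
    {"salary": "high", "other_installments": "no", "worthiness": "likely"},
    {"salary": "high", "other_installments": "yes", "worthiness": "likely"},
    {"salary": "low", "other_installments": "yes", "worthiness": "unlikely"},
    {"salary": "low", "other_installments": "no", "worthiness": "likely"},
]
nb_model = train_naive_bayes(nb_rows, "worthiness")
nb_label, nb_scores = predict(nb_model, {"salary": "low", "other_installments": "yes"})
```

Practical implementations usually add Laplace smoothing so that a single unseen value does not eliminate a class outright.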

Weka Tools
Weka is a collection of machine learning algorithms for data mining tasks. Weka stands for Waikato Environment for Knowledge Analysis. It was developed at the University of Waikato, New Zealand. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization [14]. The Weka workflow is shown in Figure 1.

Data Set
The data used in this research were collected from a leasing company located in the Cikupa-Tangerang area, Banten Province. A total of 560 records were collected. Each instance contains 5 attributes, namely age, marital status, salary, other installments, and worthiness, as presented in Table 1. The worthiness attribute is the target variable/label. Some sample instances are shown in Table 2. Two of the four input attributes, age and salary, can take values over a wide range, which makes computation difficult. To deal with this problem, we categorize the values of both attributes as presented in Table 3. Table 4 shows example data.
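The categorization of wide-range attributes described above can be sketched as a simple binning function. The cut points and category names below are hypothetical placeholders chosen only for illustration; the bins actually used in the study are those of Table 3.

```python
from bisect import bisect_right

# Hypothetical cut points and labels; the study's real bins are in Table 3.
AGE_EDGES = [30, 45]
AGE_LABELS = ["young", "middle", "senior"]


def categorize(value, edges, labels):
    """Map a numeric value to the label of the interval it falls into.

    A value equal to an edge is placed in the upper interval.
    """
    return labels[bisect_right(edges, value)]


age_categories = [categorize(age, AGE_EDGES, AGE_LABELS) for age in (22, 30, 44, 45, 60)]
```

The same function could be applied to salary with its own list of edges and labels.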

Experiment Scenario
The experiment scenario consists of two main steps. The first step is to obtain the best model from each algorithm, and the second is to compare the two best models obtained. The experiment stages and scenario are illustrated in Figure 2. The collected data are not yet ready to be processed by the algorithms because they contain biased or ambiguous entries, so data preprocessing is needed. In this step we clean the data by discarding incomplete records. The next preprocessing step is data transformation, which converts the data into a format compatible with the Weka tools. The data are then split into two parts: training data and testing data. In this case, we use 80% of the data for training and the rest for testing. The same training data are used to train both algorithms, and the resulting models are tested with the same testing data. For each algorithm, we perform twenty experiment runs to obtain its best model. The two best models are then compared to evaluate their performance and to determine which of C4.5 and Naive Bayes gives the best model.
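The percentage split described above can be illustrated with a short Python sketch. The study performs the split with the Weka tools; the function below is only an assumed stand-in, and its name, the fixed seed, and the use of integer record identifiers are hypothetical choices.

```python
import random


def split_data(records, train_fraction=0.8, seed=0):
    """Shuffle the records reproducibly and split them into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer identifiers stand in for the records here.
train_part, test_part = split_data(range(560), train_fraction=0.8)
```

Using the same seed for both algorithms keeps their training and testing data identical, which is what makes the comparison between the two models fair.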

Data Preprocessing
Data preprocessing is required to improve the quality of the data by removing unwanted data from the original data [15]. Preprocessing is important because raw data containing missing values, noise, and inconsistencies yield unqualified data. In this study, we preprocess the data as follows:
a. Data Cleaning. Data cleaning removes noise in the form of missing values, inconsistent data, and redundant data. All of the attributes above are checked so that they contain relevant values, with no missing values and no redundancy. These three requirements are prerequisites for obtaining a clean dataset for the data mining stage. One record with a missing value was found in this dataset, and this record was deleted.
b. Data Transformation. At this stage the data are converted into a form suitable for data mining. In this study the data are converted from Microsoft Excel into a CSV (Comma Separated Values) file, which can be processed by the Weka tools.
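The two preprocessing steps above can be sketched with the Python standard library. This is an illustrative sketch only: the study converts the data from Microsoft Excel, whereas the example assumes the records are already available as dictionaries, and the field names and values are hypothetical.

```python
import csv
import io


def drop_incomplete(rows):
    """Keep only records in which every field has a non-empty value."""
    return [row for row in rows
            if all(v is not None and str(v).strip() != "" for v in row.values())]


def to_csv(rows, fieldnames):
    """Serialize the records as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records; the second one has a missing salary and is dropped.
raw_rows = [
    {"age": "31", "salary": "5000000"},
    {"age": "40", "salary": ""},
]
clean_rows = drop_incomplete(raw_rows)
csv_text = to_csv(clean_rows, ["age", "salary"])
```

Deleting incomplete records is reasonable when, as here, only one record is affected; with many missing values, imputation would usually be considered instead.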

Evaluation
To evaluate the performance of both algorithms, we use the common data mining criteria, i.e. precision, recall, and accuracy. These parameters are calculated from a confusion matrix, which contains information about the actual and predicted classes produced by a classification system [16]. The correct classifications, which lie along the diagonal from the north-west corner to the south-east corner, are called True Positives (TP) and True Negatives (TN), while the other cells are the False Positives (FP) and False Negatives (FN) [17]. In this study, the likely cases are considered the positive cases, while the unlikely and probable cases are the negative cases. The parameters are defined as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$
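The standard confusion-matrix definitions of accuracy, precision, and recall can be computed as in the sketch below, which treats "likely" as the positive class and every other label as negative, as stated above. The function names and the toy label lists are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, TN, FN, treating every label other than `positive` as negative."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Return accuracy, precision, and recall from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels, for illustration only.
actual = ["likely", "likely", "unlikely", "probable", "likely"]
predicted = ["likely", "unlikely", "unlikely", "likely", "likely"]
counts = confusion_counts(actual, predicted, positive="likely")
accuracy, precision, recall = evaluate(*counts)
```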

RESULTS AND DISCUSSION
This section presents the experimental results and analysis of this study, which uses two classifiers, C4.5 and Naive Bayes. Three experiment scenarios based on percentage data splitting are performed for each algorithm. The first experiment uses 60% of the data for training and 40% for testing, the second uses 70% for training and 30% for testing, and the third uses 80% for training and 20% for testing. The experiment that provides the highest performance values for each method is used as the model to find the best method by re-testing on the provided testing data. Table 5 presents the average performance values of each experiment scenario for C4.5 in the model testing stage, while Table 6 shows the results for Naive Bayes. Based on the achieved value of accuracy criteria, the first experiment

The next stage of the experiment compares the best models obtained from the experiment scenarios run for both algorithms. These two models are applied to the provided testing data to determine which algorithm is suitable for the case study. The results of this comparison are presented in Table 7. Table 7 shows that the C4.5 algorithm is superior to the Naive Bayes algorithm, with an accuracy of 83.33% against 80.67% for Naive Bayes. To validate this result, we perform a further experiment based on cross-validation. Three different k-fold settings are used, i.e. 5-fold, 10-fold, and 20-fold, and each is applied to both C4.5 and Naïve Bayes. The results are presented in Table 8 and Table 9. Table 8 presents the C4.5 performance, whereas Table 9 presents the Naïve Bayes performance. The cross-validation experiment confirms that, in this case, C4.5 achieves better performance than Naïve Bayes. For every k-fold setting applied, C4.5 gives better accuracy than Naïve Bayes.
The results also show that the two algorithms follow different performance patterns. C4.5 gives better accuracy for smaller values of k, whereas Naïve Bayes gives better accuracy for larger values of k. The superiority of C4.5 over Naive Bayes can be understood from the fact that all of the input variables are independent of each other, so C4.5 is more suitable for data with this characteristic. The Naive Bayes algorithm, on the other hand, is based on the conditional probability of the input variables, so in this case its advantages are of less use. Another implication of the results is that the customer leasing application tends to fall into the category of recommender applications rather than classification.
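The k-fold cross-validation scenario used for validation can be illustrated with the following index-splitting sketch. It is a plain, unstratified variant written for illustration; the function name is hypothetical, and it is not the procedure implemented by the Weka tools used in the study.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs; fold sizes differ by at most one."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Every sample is used for testing exactly once across the k folds.
folds = list(k_fold_indices(559, 10))
```

With a larger k, each model is trained on a larger share of the data and tested on a smaller one, which is one reason the accuracy of a classifier can vary with the choice of k.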

CONCLUSION AND FUTURE STUDY
In this study, the C4.5 and Naive Bayes algorithms were implemented on a customer credit dataset to predict potential future risk. Based on the results of two types of experiment scenarios, the C4.5 algorithm achieves better performance. The results indicate that a recommender-style system, as characterized by C4.5, is more suitable than Naive Bayes, which works on the conditional probability of the input variables. In the C4.5 algorithm, salary is the most influential attribute, as shown by its significantly higher entropy gain compared with the other input variables. The dominant influence of the salary attribute also appears in every experiment scenario, where it is always selected as the root node of the tree. In future studies, we will explore opportunities to apply other techniques in this domain. We will also investigate other real applications that remain open to exploration, such as customer care, sales recommendation, and microfinance, which is growing quickly.