Artificial neural network technique for improving prediction of credit card default: A stacked sparse autoencoder approach

Received Jun 21, 2020. Revised Mar 16, 2021. Accepted Mar 26, 2021.

Presently, the use of credit cards has become an integral part of the contemporary banking and financial system. Predicting potential credit card defaulters or debtors is a crucial business opportunity for financial institutions. To date, several machine learning methods have been applied to this task. However, given the dynamic and imbalanced nature of credit card default data, it is challenging for classical machine learning algorithms to produce robust models with optimal performance. Research has shown that the performance of machine learning algorithms can be significantly improved when they are provided with optimal features. In this paper, we propose an unsupervised feature learning method to improve the performance of various classifiers using a stacked sparse autoencoder (SSAE). The SSAE was optimized to achieve improved performance, and it learned feature representations that were then used to train the classifiers. The performance of the proposed approach is compared with an instance where the classifiers were trained on the raw data. A comparison is also made with previous scholarly works, and the proposed approach showed superior performance over other methods.


INTRODUCTION
In artificial intelligence and machine learning tasks such as classification and clustering, the input data tends to influence the performance of the algorithms, and optimal performance is obtained when algorithms are given suitable data. To this end, some machine learning methods focus on processing high dimensional data, including linear dimensionality reduction methods such as linear discriminant analysis, principal component analysis, and multidimensional scaling, and nonlinear dimensionality reduction methods such as isometric feature mapping, locally linear embedding, and Laplacian eigenmaps. Meanwhile, feature engineering and representation learning are the two main methods used to obtain a representation from raw data. Recent research has focused on the latter since feature engineering methods are usually dependent on domain knowledge, labour-intensive, and time-consuming [1]. Furthermore, representation learning methods tend to learn a representation from data automatically, which can then be used for classification. An autoencoder (AE) is a type of unsupervised representation learning. Autoencoders are unsupervised neural networks with three layers: an input layer, a hidden layer, and an output layer, where the output layer can be considered the reconstruction layer [2]. The structure of a basic autoencoder is shown in Figure 1. Autoencoders tend to learn a representation of the input data, usually for dimensionality reduction, by training the network to ignore noise. Together with the reduction side, a reconstructing side is learned, where the AE attempts to recreate the original input [3], [4]. There are different types of autoencoders, including sparse, denoising, contractive, variational, and convolutional autoencoders [5]. Credit card default/fraud detection is a crucial problem that has received the attention of machine learning researchers, and a significant number of approaches have been proposed [6]-[9].
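As a minimal illustration of the three-layer encoder/decoder structure described above, the sketch below (a NumPy toy example; all dimensions and weights are illustrative, not the proposed model) passes an input through a sigmoid encoder and a sigmoid reconstruction layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, k = 8, 3                          # toy input and hidden dimensions
x = rng.random(d)                    # toy input vector

# Encoder: map the d-dimensional input to a k-dimensional code
W, b = 0.1 * rng.standard_normal((k, d)), np.zeros(k)
h = sigmoid(W @ x + b)

# Decoder (reconstruction layer): map the code back to d dimensions
W_dec, b_dec = 0.1 * rng.standard_normal((d, k)), np.zeros(d)
x_hat = sigmoid(W_dec @ h + b_dec)

print(h.shape, x_hat.shape)          # (3,) (8,)
```

Training then adjusts the weights so that x_hat approaches x, forcing the k-dimensional code h to capture the essential structure of the input.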
However, the problem remains challenging since most credit card data suffer from class imbalance: non-fraud transactions overwhelmingly outnumber fraud transactions, making it difficult for many machine learning algorithms to achieve good performance. Meanwhile, a good feature representation can be obtained from the dataset, which can enhance the classification performance of the algorithms. Representation learning is a possible solution to the challenge of credit card default and fraud prediction because of its remarkable feature learning ability on large and imbalanced datasets. While basic autoencoders aim to learn a representation, or encoding, of data by training the network to ignore noise and reconstruct the data as closely as possible to the input, training the network in a way that encourages sparsity can result in better feature learning. Sparsity-induced neural networks have been extensively applied in image recognition and several other applications, resulting in state-of-the-art performance [10]-[12].
In this paper, an approach is proposed to improve the classification performance of classifiers by using the unsupervised feature learning capability of autoencoders. During the training of the autoencoder, sparsity is encouraged, and the model is optimized using the AdaMax algorithm [13] instead of conventional stochastic gradient descent. To ensure accurate feature representation, we stack two sparse autoencoders to obtain the final model. Also, to further prevent overfitting and to enhance the performance, speed, and stability of the network, we introduced the batch normalization technique [14]. The low-dimensional features are then used to train various classifiers, including logistic regression (LR), classification and regression tree (CART), k-nearest neighbor (KNN), support vector machine (SVM), and linear discriminant analysis (LDA). The performance of the proposed method is compared with an instance where the classifiers were trained on the raw data. Further comparison is made with other scholarly works, and our proposed method shows better performance. The main contributions of this study can be summarized as follows:
- To construct an effective artificial neural network for feature learning using multiple layers of sparse autoencoders.
- To improve the classification performance of various classifiers using the proposed stacked sparse autoencoder.
- To demonstrate the effectiveness of the proposed method by applying it to a popular credit card dataset.
The rest of the paper is organized as follows. Section 2 briefly reviews previous related works that utilized different types of autoencoders. Section 3 presents the proposed method, and section 4 provides a brief case study of credit card default prediction models. The obtained results are presented and discussed in section 5. Lastly, section 6 concludes the paper and highlights some future research directions.

RELATED WORKS
Recently, autoencoders have been applied to several tasks and have achieved state-of-the-art performance. In this section, we discuss some previous works that utilized various autoencoders and lay the foundation for the proposed stacked sparse autoencoder network. Sun et al. [15] proposed a method for fault diagnosis that applies a sparse stacked denoising autoencoder for feature extraction; its robustness and data reconstruction capability improved the diagnostic accuracy. The autoencoder was used together with an optimized transfer learning algorithm. Similarly, Zhu et al. [16] proposed a novel stacked pruning sparse denoising autoencoder for intelligent fault diagnosis of rolling bearings. The method comprised a fully connected autoencoder network, connecting the optimal features extracted from previous layers to subsequent layers. To train the autoencoder effectively, a pruning operation was added to the model to restrict non-superior units from participating in subsequent layers. When compared with other fault diagnostic models, their approach showed superior performance.
Furthermore, Sankaran et al. [17] proposed a feature extraction method using an autoencoder network in which ℓ2,1-norm based regularization was used to achieve sparsity. The authors noted that, due to the presence of many training parameters, several feature learning models are susceptible to overfitting, and different regularization approaches have been studied in the literature to mitigate overfitting in deep learning models. The performance of their model was evaluated on publicly available latent fingerprint datasets, where it gave improved performance. Chen et al. [18] proposed a method to address the challenges of learning efficiency and computational complexity in deep neural networks. The technique used a deep sparse autoencoder network to learn facial features, and softmax regression was applied to classify expression features; the softmax regression was aimed at handling the extensive data in the output of the autoencoder network. Also, to overcome local extrema and the challenge of gradient diffusion during training, the network weights were fine-tuned, which improved the performance of the architecture.
Most approaches used to implement autoencoders depend on a single autoencoder model, which presents a problem when learning different characteristics of data. Yang et al. [19] proposed a method to solve this problem by implementing a feature learning framework using serial autoencoders. The technique achieved superior representation learning by serially connecting two different types of autoencoders, incorporating two encoding stages: a marginalized denoising autoencoder and a stacked robust autoencoder via graph regularization. When compared to baseline methods, the proposed approach showed significant improvement. Meanwhile, Al-Hmouz et al. [20] introduced a logic-driven autoencoder, whereby the network structure was achieved using fuzzy logic operations; the autoencoder was also optimized using gradient-based learning. Lastly, sparse autoencoder networks have achieved remarkable performance in representation learning [21], [22]. However, better representation learning can be obtained when multiple sparse autoencoders are stacked and optimized effectively, which is the focus of this research.

PROPOSED METHOD
This section describes the method applied in developing the proposed autoencoder. An autoencoder consists of two functions, an encoder and a decoder: the former maps the d-dimensional input data to a hidden representation, and the latter maps the hidden representation back to a d-dimensional vector that is as close as possible to the encoder input [23]. Assuming the original input is x, the autoencoder encodes it into a hidden layer h to reduce the input dimension, which is then decoded at the output. The input vector is encoded according to:

h = f(Wx + b) (1)

where f represents the activation function (in this case, the sigmoid), W is the weight matrix, and b is a bias vector. The hidden representation is decoded to obtain data as close as possible to the input using:

\hat{x} = f(W'h + b') (2)

where W' is the decoder weight matrix and b' represents the decoder bias vector [24]. The sigmoid activation function is described as:

f(z) = \frac{1}{1 + e^{-z}} (3)

The disparity between the original input x and the reconstructed input \hat{x} is called the reconstruction error. To optimize the parameters W, W', b, b', the mean squared error (MSE) is used as the reconstruction error function:

J(W, b) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 (4)

The average activation of neuron j in the hidden layer is represented as:

\hat{\rho}_j = \frac{1}{n} \sum_{i=1}^{n} h_j(x_i) (5)

To induce sparsity in the autoencoder, we impose \hat{\rho}_j = \rho, where \rho is the sparsity proportion, usually a small positive number near 0. We therefore minimize the Kullback-Leibler (KL) divergence between \rho and \hat{\rho}_j according to:

KL(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} (6)

Also, to ensure better feature representation and, by extension, enhance the performance of the classifiers, multiple sparse autoencoders are stacked. A stacked sparse autoencoder (SSAE) can comprise numerous sparse autoencoders in which the outputs of each layer are connected to the inputs of the next layer [25]. The SSAE is based on research conducted by Hinton and Salakhutdinov [26], who proposed a deep neural network with layer-by-layer initialization.
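The sparsity penalty can be sketched in NumPy; the batch, layer sizes, and randomly initialized weights below are illustrative, with the sparsity proportion set to 0.05 as in the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_sparsity(rho, rho_hat):
    # Sum of KL(rho || rho_hat_j) over the hidden units
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

rng = np.random.default_rng(1)
X = rng.random((100, 8))                       # 100 toy samples, 8 features
W, b = 0.1 * rng.standard_normal((3, 8)), np.zeros(3)

H = sigmoid(X @ W.T + b)                       # hidden activations for the batch
rho_hat = H.mean(axis=0)                       # average activation per hidden neuron
penalty = kl_sparsity(0.05, rho_hat)           # sparsity proportion rho = 0.05
print(penalty > 0)                             # True
```

The penalty is zero only when every neuron's average activation equals the target 0.05, so minimizing it drives most hidden units toward inactivity.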
The error function of the SSAE is expressed as:

J_{weight}(W, b) = \frac{1}{n} \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^2 + \frac{\lambda}{2} \sum_{l=1}^{L} \sum_{i=1}^{r} \sum_{j=1}^{c} \left( W_{ij}^{(l)} \right)^2 (7)

where n and L represent the number of samples and the number of layers, respectively, x_i is the original input, and \hat{x}_i denotes the corresponding reconstruction target. The regularization coefficient is represented by \lambda, and r and c denote the rows and columns of the weight matrix W^{(l)} [27]. By adding the sparsity term to (7), the overall cost function of the SSAE becomes:

J_{sparse}(W, b) = J_{weight}(W, b) + \beta \sum_{j=1}^{S} KL(\rho \,\|\, \hat{\rho}_j) (8)

where S represents the total number of neurons in a layer and \beta is the sparsity regularization parameter, which weights the sparsity penalty term. We now have three optimization hyperparameters, \beta, \lambda, and \rho, and we set their values to 3, 0.0001, and 0.05, respectively. In the sparse autoencoder network, a neuron is said to be active if its output is close to 1 and inactive if its output is close to 0 [8]. Algorithm 1 shows the proposed sparse autoencoder procedure, and Figure 2 shows the structure of the proposed stacked sparse autoencoder (SSAE); for simplicity, the decoder parts of the SAEs are not shown. The output of the SSAE is then used to train the various classifiers.
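A minimal sketch of the stacking step, using the 100- and 85-neuron hidden layer sizes reported in the results section; the 23-feature input and the random (untrained) weights are toy assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(X, W, b):
    # One sparse-autoencoder encoding layer (decoder discarded after training)
    return sigmoid(X @ W.T + b)

rng = np.random.default_rng(2)
X = rng.random((200, 23))                       # toy stand-in for the input features

# Two stacked encoders: 23 -> 100 -> 85
W1, b1 = 0.05 * rng.standard_normal((100, 23)), np.zeros(100)
W2, b2 = 0.05 * rng.standard_normal((85, 100)), np.zeros(85)

H1 = encode(X, W1, b1)                          # output of the first SAE
H2 = encode(H1, W2, b2)                         # final features fed to the classifiers
print(H2.shape)                                 # (200, 85)
```

In the trained model, H2 is the learned low-dimensional representation used in place of the raw data.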

Algorithm 1. Proposed method of the SSAE
Input: training set x
Process:
Start
Initialize W, b, W', b'
Obtain the cost function according to (4)
Apply the weight penalty to the cost function according to (7)
Add the sparsity regularizer to the cost function according to (8)
End

The greedy layer-wise training strategy proposed by Bengio et al. [28] is employed to train each layer of the SSAE successively and obtain the weight and bias parameters of the network. The network is then fine-tuned using the backpropagation algorithm to obtain the best parameter settings. The AdaMax algorithm [13], a variant of the adaptive moment estimation (Adam) algorithm that uses the infinity norm, was applied to optimize the autoencoder network. Lastly, we introduced the batch normalization technique [14] to prevent overfitting and enhance the performance, speed, and stability of the network.
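A sketch of a single AdaMax parameter update as described by Kingma and Ba [13]; the learning rate, toy objective, and step count below are illustrative choices, not the paper's training setup:

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, lr=0.002, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaMax update: Adam with the infinity norm in place of the L2 norm."""
    m = b1 * m + (1 - b1) * grad            # biased first-moment estimate
    u = np.maximum(b2 * u, np.abs(grad))    # exponentially weighted infinity norm
    theta = theta - (lr / (1 - b1 ** t)) * m / (u + eps)
    return theta, m, u

# Minimize the toy objective f(theta) = theta^2, whose gradient is 2*theta
theta, m, u = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 5001):
    theta, m, u = adamax_step(theta, 2.0 * theta, m, u, t)

assert abs(float(theta[0])) < 0.1           # converged close to the minimum at 0
```

Because the step size is bounded by the learning rate regardless of gradient scale, AdaMax is less sensitive to occasional large gradients than plain stochastic gradient descent.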

CASE STUDY OF CREDIT CARD DEFAULTING PREDICTION MODELS
Credit risk plays a crucial role in the financial industry. Most financial institutions grant loans, mortgages, and credit cards, among many other services. Due to the rising number of credit card clients, these institutions have faced an increasing default rate and are thereby resorting to machine learning methods to automate the application process and predict the probability of a client's future default. Several machine learning methods with varying performance have been developed in the literature. A major limitation to achieving optimal performance in credit card default prediction is that the datasets are highly imbalanced, i.e., the instances where clients do not default far outnumber the defaulting cases.
Certain studies have used the default of credit card clients dataset [29] and achieved good performance. For example, Prusti and Rath [30] used various algorithms such as decision tree, KNN, SVM, and multilayer perceptron to make predictions on the dataset. Additionally, they proposed a method that hybridized decision tree, SVM, and KNN, which gave improved performance compared to the stand-alone algorithms. Sayjadah et al. [31] conducted a performance evaluation of credit card default prediction using logistic regression, random forest, and decision tree. The experimental results showed that random forest achieved superior performance with an accuracy of about 82%.
Furthermore, because the dataset is imbalanced, Subasi and Cankurt [32] proposed a method to tackle the problem using the synthetic minority over-sampling technique (SMOTE). Using the SMOTE method together with seven other algorithms, the random forest algorithm achieved the best performance with an accuracy of 89.01% and an F1-score of 89%. Lastly, Hsu et al. [33] and Chishti and Awan [34] also proposed models to predict the default of credit card clients and achieved comparable performance. However, we aim to improve on what has been done by applying our proposed method to the same dataset.

RESULTS AND DISCUSSION
In this work, the default of credit card clients dataset [29] is used. The dataset was obtained from the University of California Irvine (UCI) machine learning repository, and it contains 30,000 instances and 25 attributes, including demographic and financial records. The dataset was established to predict customers in Taiwan who are likely to default on payments. Out of the 30,000 instances, 23,364 are non-default and 6,636 are default cases. The rationale behind the dataset is for financial institutions to be able to identify customers who are likely to default on their credit card payments and decline such applications. We use a 70-30% train-test split. The SSAE is trained on the training set in an unsupervised fashion, and the test set is passed through the learned SSAE model to obtain the low-dimensional data. The classifiers are then trained using the low-dimensional training set, and their performance is tested using the low-dimensional test set. The numbers of neurons in the first and second hidden layers were set to 100 and 85, respectively.
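The split bookkeeping can be sketched as follows; the feature matrix and labels here are random stand-ins with the same shape and class ratio as the dataset, purely to illustrate the 70-30% partition:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 30_000
X = rng.random((n, 23))                        # stand-in for the feature columns
y = (rng.random(n) < 6_636 / n).astype(int)    # ~22% default rate, as in the dataset

# 70-30% train-test split on a shuffled index
idx = rng.permutation(n)
cut = int(0.7 * n)
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]

print(len(X_train), len(X_test))               # 21000 9000
```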
To evaluate the performance of our approach, we utilize accuracy, sensitivity, precision, and F1 score. Accuracy is the ratio of the number of correct predictions to the total number of predictions made, sensitivity is the ratio of the number of correct positive predictions to the total number of actual positives, precision is the number of correct positive predictions divided by the total number of positive predictions, and F1 score is the harmonic mean of precision and sensitivity. Let TP, TN, FP, and FN stand for the number of true positives, true negatives, false positives, and false negatives, respectively. Mathematically, the performance metrics can be represented as:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} (9)

Sensitivity = \frac{TP}{TP + FN} (10)

Precision = \frac{TP}{TP + FP} (11)

F1 = \frac{2 \times Precision \times Sensitivity}{Precision + Sensitivity} (12)

Substituting (10) and (11) into (12), finding the lowest common denominator, and cancelling like terms, the F1 score can be represented as (13):

F1 = \frac{2TP^2}{(TP+FP)(TP+FN)} \div \frac{TP(2TP+FP+FN)}{(TP+FP)(TP+FN)} = \frac{2TP}{2TP + FP + FN} (13)

All the experiments were carried out on a computer with the following specifications: Intel Core i5-6300U, 2.40 GHz, 16 GB RAM; the Python programming language was used for the computations. To show the effectiveness of the proposed approach, we conduct a comparative study with five base classifiers: CART, LR, KNN, SVM, and LDA. We first show the performance of these classifiers on the raw dataset; the results are given in Table 1. Table 2 summarizes the results obtained when the classifiers are trained using the features learned by the stacked sparse autoencoder. It can be seen that the learned features significantly improve the performance of the classifiers, which demonstrates the ability of the proposed SSAE to learn a good representation of the data. To further show the effectiveness of the proposed method, the best performing model from our experiments, the SSAE-based LDA, is compared with well-performing methods proposed in the recent studies discussed in section 4. To give a fair comparison, we focused on studies that used the same dataset. The comparison is shown in Table 3, and it can be seen that our method outperforms those in the stated literature. Also, the receiver operating characteristic (ROC) curve, a graphical plot that shows the prediction performance of a binary classifier, is employed to compare the SSAE-based LDA with the LDA trained on the raw dataset. From the ROC curve shown in Figure 3, it can be seen that the proposed method performed better than the conventional LDA.
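All four metrics can be computed directly from the confusion counts; the counts in the sketch below are illustrative, and the F1 computation uses the simplified form in (13):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, precision, and F1 score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)            # recall of the positive class
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)        # simplified harmonic mean, eq. (13)
    return accuracy, sensitivity, precision, f1

acc, sen, pre, f1 = classification_metrics(tp=80, tn=90, fp=10, fn=20)
print(round(acc, 3), round(sen, 3), round(pre, 3), round(f1, 3))
# 0.85 0.8 0.889 0.842
```

Note that the simplified F1 agrees with the harmonic mean of the precision and sensitivity returned alongside it.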
From the above results, we can see that our proposed approach achieved better performance than the other methods. The improved performance can be attributed to the proposed SSAE, which was able to learn a good representation of the original input data. The results also show the capability of deep learning to achieve exceptional performance in different tasks, including feature representation. Lastly, this study has demonstrated the importance of training machine learning algorithms with suitable data, and that improved performance can be obtained not only by hyper-parameter tuning but also, and more efficiently, by effective feature learning.

Table 3. Comparison with other methods
Literature                 Accuracy (%)   Precision (%)   Sensitivity (%)   F1 score (%)
Prusti and Rath [30]       82.58          96.83           83.57             89.71
Sayjadah et al. [31]       81.81          -               -                 -
Subasi and Cankurt [32]    89.01          -               -                 89
Hsu et al. [33]            80.2           -               -                 -
Chishti and Awan [34]      82             -               -                 -

CONCLUSION
Conventional machine learning algorithms are often ineffective at performing classification on large datasets such as most credit card datasets. Hence, in this paper, a stacked sparse autoencoder is proposed to achieve optimal feature learning. In the proposed autoencoder network, we introduced a batch normalization technique to enhance the performance, speed, and stability of the model and to further prevent overfitting. Also, the model was optimized using the AdaMax algorithm. The learned features were then used to train five shallow machine learning algorithms, and their performance was tested. When compared with the case where the algorithms were trained on the raw data, our proposed method showed superior performance. Furthermore, the results were compared with methods in the literature that used the same dataset, and the proposed approach again showed significant improvement. Future research will focus on studying the effect of different optimizers, stacking diverse autoencoders, and observing the resulting impact on the feature learning process. Future research will also consider comparing the feature learning capability of the stacked sparse autoencoder with other feature learning and feature engineering methods.