Concept drift and machine learning model for detecting fraudulent transactions in streaming environment

ABSTRACT


INTRODUCTION
The growth of the internet has significantly aided many organizations and fields, such as social media. In recent years, credit card fraud has been increasing along with the usage of the internet [1]. Many people now use their credit cards for various kinds of transactions, and many fall victim to scammers, which leads to fraud cases [2]. Credit card fraud refers to a scammer using the user's credit card number and personal identification number (PIN), or the user's stolen credit card, to make financial transactions from the user's account without their knowledge. Credit card scams fall under identity theft and have become increasingly common [3], [4]. Credit card information is stolen in various ways, for example skimming, dumpster diving, hacking, and phishing [5]. In addition, many scammers employ Twitter bots to convince misguided users to send money to compromised PayPal and Venmo accounts. The bots appear to be launched whenever a genuine user asks another user for their payment details, probably finding such tweets through a query for terms like PayPal, Venmo, or other providers. By stealing the other user's profile photo and creating an identical-looking username, the bots pass themselves off as that user while asking the original tweeter for money. Moreover, in the past few years, Twitter spam has become progressively worse. Twitter's massive user base and the volume of information exchanged there both contribute significantly to the rapid spread of spam [6]. Twitter and the research community have created several spam detection systems based on various machine learning approaches to protect users [7]. However, a recent study found that because the features of spam tweets change over time (data imbalance and concept drift), current machine learning-based detection methods are unable to identify spam accurately.
Due to data imbalance and concept drift [8], [9], fraud detection and spam detection are particularly difficult tasks. Additionally, fraud detection and spam detection models differ from conventional classification in that, in the initial phase, human investigators supply only a small set of supervised samples and have time to evaluate only a small number of alerts. The vast majority of transaction labels do not become available until a few days later, after customers may have detected the fraud. When learning in a concept-drifting environment, it is important to carefully account for the delay in obtaining precise labels and for the interaction between alerts and supervised information. Because of these problems, a model is required that can detect spam and fraud in data affected by imbalance and concept drift. Hence, in this paper, we propose a model that addresses these problems. The significance of the research is as follows: i) the continuous stream drift-identification (CSDI) model employs an effective cross-validation scheme for selecting meaningful features during the training of the predictive model, and ii) the CSDI-based attack detection model achieves much better accuracy, recall, and F-measure performance than the existing ensemble-based classification model.

LITERATURE SURVEY
This section reviews research that addresses concept drift and class imbalance, both of which affect the overall accuracy of classification or prediction. Bayram et al. [10] presented a review of the concept drift methods used over the years. They discussed the work done to resolve the issue of concept drift and how machine learning (ML) can help to reduce it. They also classified the various models based on their performance and described the challenges and possible research directions when working with concept drift. Mayaki and Riveill [11] proposed a detection model for concept drift built on an autoregressive model, known as the autoregressive-based drift detection method (ADDM). In [12], a method called EIStream was proposed to detect concept drift by applying ensemble and traditional ML techniques to both artificial and real-time data. EIStream uses majority voting to decide which classifier is the best and allows only the best classifier to cast a vote. In [13], the authors first proposed a drift detection strategy that uses principal component analysis (PCA) to analyze changes in the variance of the features throughout intrusion detection data streams, in order to detect data and concept drift. To resolve these drift issues, they presented an online deep neural network (DNN) that automatically adjusts the size of its hidden layers based on the hedge weighting technique, which enables the method to learn intrusions and adapt to new ones.
Jayaratne et al. [14] reviewed the existing concept drift models and concluded that the existing concept drift detection techniques depend on the classifier and need labeled data. However, cyber-physical system (CPS) data streams are dynamic, unstructured, and unlabeled. Jayaratne et al. [14] therefore proposed an unsupervised ML model that continuously detects concept drift in industrial cyber-physical systems. Priya and Uthra [15] proposed a model, CIDD-ADODNN, for handling class imbalance and detecting concept drift by employing ADADELTA-optimizer-based deep neural networks (ADODNN). This model preprocesses the streaming data, handles the class imbalance, detects the concept drift, and finally classifies severely imbalanced streaming data.
The ADO-based hyperparameter tuning procedure [16] is used to find the DNN model's ideal parameters and improve classifier performance. Liu et al. [17] demonstrated that current data augmentation techniques neglect either the distribution of the data or the spatial relationships among the features. To overcome this problem, they suggested a network anomaly detection method using a convolutional neural network (CNN) that is based on data augmentation and feature representation (NADS-RA). The method relies on a least-squares generative adversarial network, which mitigates the impact of an unbalanced training set and prevents over-fitting. In addition, an image-based augmentation method creates augmented images that follow the distribution pattern of rare network anomaly images. NADS-RA is then applied to the CNN classification model. Zhou et al. [18] proposed an algorithm for intrusion detection systems, correlation feature selection using the bat algorithm (CFS-BA), which is based on ensemble and feature selection methods. In its initial stage, the algorithm chooses the best feature subset based on the correlation among the features in order to reduce dimensionality. Yotsawat et al. [19] proposed a novel ensemble method known as the cost-sensitive neural network ensemble (CS-NNE) for creating a credit scoring model based on a cost-sensitive neural network. The multi-base neural networks can take unbalanced classes into account because multiple class weights are applied to the original training data. A novel ensemble architecture was presented in [20]. The suggested method relies on ranking the detection capabilities of multiple base classifiers to recognize distinct sorts of attacks. The rank matrix for the various attack categories is computed using the F1-score of each algorithm.
Only the output of the algorithms with the greatest F1-score in the rank matrix for a given attack category is taken into account for the final prediction. The voting strategy, in contrast, bases the final classification on the votes of all classifiers in the ensemble, regardless of whether an algorithm is effective enough to identify that attack.

METHOD
In this section, we discuss the standard XGBoost algorithm and its cross-validation. We then present a concept drift-aware machine learning framework and, finally, the proposed continuous stream drift identification model. The standard XGBoost model [21] includes an improved cross-validation technique that is used to select only the useful features. The standard XGBoost model is described first.

Standard XGBoost model
The proposed model builds on the standard XGBoost model proposed in [21]. Here, the standard XGBoost model is trained to classify whether a scam or fraud is occurring during a transaction. In the standard XGBoost technique, each new tree h is chosen greedily so that the loss shown in (1) is minimal.
In (1), the loss is expanded around the previous prediction using its first-order and second-order gradients. Hence, the tree h of the XGBoost model is obtained by minimizing (1). The loss function for the tree is evaluated using (2).
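For orientation, the second-order approximation of the boosting objective in the widely used XGBoost formulation is given below. The notation is an assumption taken from that common formulation and may differ from the symbols used in (1) and (2): f_t denotes the tree added at round t, g_i and h_i are the first-order and second-order gradients of the loss l with respect to the previous prediction, and Ω is the regularization term.

```latex
\[
\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}
  \Big[ l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \Big]
  + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^{2} l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^{2}}
\]
```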
In (2), the bias function describes the feature imbalance. Cross-validation is used to optimize for this imbalance. The cross-validation scheme is constructed from several sets of folds instead of a single set of folds. The cross-validation error for a single fold is evaluated using (3).
Equation (3) fails to provide a good result when the data contain imbalanced values. Hence, to reduce the cross-validation error, Shahapurkar and Rodd [21] proposed a cross-validation scheme with two layers. The layers contain the main features of the dataset and the features selected using the first layer, and both layers are used to construct the prediction model. The two-layer cross-validation technique is evaluated using (4).
From the cross-validation, (5) is used to optimize the parameters and select an optimal value.
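The idea of building cross-validation from several sets of folds rather than a single set can be illustrated with a small index generator. This is a minimal sketch under stated assumptions: the function name `repeated_kfold` and its parameters are hypothetical, and the code does not reproduce the two-layer scheme of [21] or equations (3) to (5); it only shows how several independent fold partitions can be produced.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for several k-fold partitions.

    Hypothetical helper: each repeat reshuffles the sample indices, so the
    folds of one repeat differ from the folds of the next.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)                  # a fresh shuffle gives a new set of folds
        for f in range(k):
            test_idx = order[f::k]          # every k-th shuffled index, offset f
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield r, f, train_idx, test_idx


if __name__ == "__main__":
    for r, f, train_idx, test_idx in repeated_kfold(10, k=5, repeats=2):
        print(r, f, sorted(test_idx))
```

Averaging a validation error over all yielded splits gives an estimate that depends less on one particular partition, which is the motivation for using more than one set of folds.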

Concept drift-aware machine learning framework
This work uses the drift detection model presented in [21] to design an improved drift detection mechanism for a streaming environment. The framework of the proposed concept drift-aware machine learning approach for detecting fraudulent transactions in streaming environments is given in algorithm 1. In algorithm 1, the XGB classifier is first trained to detect whether a given transaction is normal or malicious. Next, K-L divergence is used with the XGB classifier to measure how the distribution differs between streams (i.e., between the present and past data distributions). Then, continuous stream drift-identification (CSDI) determines whether the current data stream differs from past data streams; if drift is detected, the drift period is established. The streams after the drift period are then used to update the classifier. Finally, the updated classifier is tested on new streams to validate the model.
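The K-L divergence step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that each stream is summarized by a list of classifier scores, that the scores are binned into a histogram, and that a small smoothing constant avoids empty bins. The function names, the bin count, and the smoothing constant are all hypothetical choices.

```python
import math


def binned_distribution(values, lo, hi, bins=10, eps=1e-6):
    """Histogram of `values` on [lo, hi], smoothed so that no bin is exactly zero."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)    # clamp values that fall on the upper edge
        counts[idx] += 1
    total = len(values) + eps * bins
    return [(c + eps) / total for c in counts]


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def stream_divergence(past_scores, present_scores, bins=10):
    """K-L divergence of the present score distribution from the past one."""
    lo = min(min(past_scores), min(present_scores))
    hi = max(max(past_scores), max(present_scores))
    if hi == lo:                            # all scores identical: no divergence
        return 0.0
    p = binned_distribution(present_scores, lo, hi, bins)
    q = binned_distribution(past_scores, lo, hi, bins)
    return kl_divergence(p, q)


if __name__ == "__main__":
    past = [0.1] * 50 + [0.2] * 50
    present = [0.8] * 50 + [0.9] * 50
    print(stream_divergence(past, past), stream_divergence(past, present))
```

A divergence near zero suggests that the present stream resembles the past one, while a large value flags a candidate for the drift test; the threshold separating the two cases would have to be chosen empirically.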

Continuous stream drift identification model
This section presents the continuous stream drift identification model. The algorithm to identify drift on continuous stream data is given in algorithm 2. To obtain a sub-window X, an undersampling process is executed on the static window. After that, the continuous stream drift identification process is applied to the sub-window and the test window W using algorithm 3. If the outcome is negative, the present drift pointer is considered static and is added to the static window.
Algorithm 2. Continuous stream drift identification
Select the recent n detection samples to obtain the test window W; the remaining samples from S are used to obtain the static window.
Sampling to obtain sub-window T
Step 6. Execute the continuous stream drift identification test on X and W on the total session stream using algorithm 3.
Execute the continuous stream drift identification test on the sub-session stream using algorithm 4.
Privilege the drift time w*
Step 11. Request to update the model after w*
Step 12. E=∅, go to step 2

Nonetheless, if the outcome of algorithm 3 is positive, then the test window W is further analyzed with continuous stream drift identification using algorithm 4 to localize the drift point. If the drift point is identified, the proposed CSDI methodology is optimized with data from the drift point to the current instance. Further, when the model is expected to overflow, half of the points in S are discarded.
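The window handling described for algorithm 2 can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the stream S is held in a Python list, the sub-window is drawn by uniform random sampling without replacement, and on overflow the older half of S is the half that is discarded (the text does not say which half is dropped).

```python
import random


def split_windows(stream, n):
    """Return (static_window, test_window); the last n samples form the test window W."""
    if not 0 < n < len(stream):
        raise ValueError("n must be between 1 and len(stream) - 1")
    return stream[:-n], stream[-n:]


def undersample(static_window, size, seed=0):
    """Randomly draw `size` samples from the static window without replacement."""
    rng = random.Random(seed)
    return rng.sample(static_window, min(size, len(static_window)))


def trim_on_overflow(stream, max_size):
    """Discard the older half of the buffer once it exceeds max_size (assumed policy)."""
    if len(stream) > max_size:
        return stream[len(stream) // 2:]
    return stream


if __name__ == "__main__":
    S = list(range(20))
    static, W = split_windows(S, 5)
    X = undersample(static, len(W))         # sub-window with the same size as W
    print(W, X, trim_on_overflow(S, 10))
```

Drawing the sub-window with the same size as W keeps the two samples balanced for the statistical test that follows.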
Every group of streams builds at least one drift indicator Sw at a given session instance, computed as the mean accuracy of all nw data streams in the group. Once the drift identifier S is obtained, it is divided into Ŝ and W, and streams are randomly selected from Ŝ in an iterative manner to obtain X until |X| = |W|. In general, a larger test window size |W| provides greater significance, assuring that X is representative of Ŝ. However, a larger size also induces a higher delay for the accumulation of streams; thus, this paper introduces a sub-session-based drift identification test to establish drift points accurately. This work further assumes that the absolute difference between the averages of X and W is greater than 0. If the outcome is positive, certain drift points may exist in the test window. Thus, sub-session-based drift identification must then be performed to localize the exact drift point. To improve computational efficiency, algorithm 4 is used to divide the test window into static and non-static parts. Let W(1) : {w1, w2, …} be independent streams with mean μ1 and variance σ².

Algorithm 3. Total session-based drift identification
Input. The optimized window X, the test window W, and the window size n
Output. Whether there exists drift within W
Step 1. Start
Step 2. Estimate the data stream average and difference of X and W
Step 3. Construct the two-tailed statistic
Step 4. If t1 ≥ tα(2n − 2), return 1; else return 0
Step 5. Stop
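The test in algorithm 3 can be sketched as a two-sample comparison of window averages. This is a hedged illustration, not the authors' code: it assumes a pooled-variance t statistic with 2n − 2 degrees of freedom when both windows hold n values, and it takes the critical value as a parameter. The default of 1.96 is the large-sample two-tailed value at the 5% level and is only an assumption; for small windows the exact t critical value for 2n − 2 degrees of freedom should be supplied instead. The function names are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, w):
    """Two-sample t statistic with pooled variance (df = len(x) + len(w) - 2)."""
    nx, nw = len(x), len(w)
    pooled_var = ((nx - 1) * variance(x) + (nw - 1) * variance(w)) / (nx + nw - 2)
    if pooled_var == 0:                     # both windows constant
        return 0.0 if mean(x) == mean(w) else math.inf
    return (mean(x) - mean(w)) / math.sqrt(pooled_var * (1 / nx + 1 / nw))


def total_session_drift(x, w, critical_value=1.96):
    """Return 1 if |t| reaches the two-tailed critical value, else 0."""
    return 1 if abs(pooled_t_statistic(x, w)) >= critical_value else 0


if __name__ == "__main__":
    stable = [0.90, 0.91, 0.89, 0.90, 0.92, 0.88]
    drifted = [0.70, 0.72, 0.69, 0.71, 0.68, 0.70]
    print(total_session_drift(stable, stable), total_session_drift(stable, drifted))
```

Each window needs at least two values, since the sample variance is undefined for a single observation.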

Algorithm 4. Sub-session-based drift identification
Input. Test window W and window size n
Output. Ideal drift point
Step 1. Start
Step 2. Select the recent m drift points within W to obtain W(2); the leftover points are used to obtain W(1)
Step 3. Estimate the data stream average and difference of W(1) and W(2)
Step 4. Construct the two-tailed statistic

The work assumes that the last points in W are the ideal drift points. The streams X and W are expected to be different if there exists substantial variation between the two parts W(1) and W(2) of W. The motivation is that total session-based drift identification can extract only a minimum quantity of drift points, and if not enough points are established, it is difficult to get a positive outcome. Thus, if no variance between X and W can be claimed, it is statistically impossible to have a significant difference between W(1) and W(2). Based on this assumption, algorithm 4 efficiently establishes the ideal drift point.
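One way to localize a drift point inside the test window is sketched below. This is an illustrative stand-in rather than algorithm 4 itself, whose remaining steps are not available here: the sketch scans every admissible number m of recent points, splits W into a head part and a tail part, and keeps the split with the largest absolute pooled t statistic, provided it reaches the critical value. The scan over m, the function names, and the default critical value of 1.96 are all assumptions.

```python
import math
from statistics import mean, variance


def t_stat(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:                         # both parts constant
        return 0.0 if mean(a) == mean(b) else math.inf
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def localize_drift(w, min_points=2, critical_value=1.96):
    """Return the index in w where the tail part starts at the most significant
    split, or None if no split reaches the critical value."""
    best_index, best_score = None, critical_value
    for m in range(min_points, len(w) - min_points + 1):
        head, tail = w[:-m], w[-m:]         # leftover points vs. the m recent points
        score = abs(t_stat(head, tail))
        if score >= best_score:
            best_index, best_score = len(w) - m, score
    return best_index


if __name__ == "__main__":
    window = [0.90, 0.91, 0.89, 0.90, 0.70, 0.71, 0.69, 0.70]
    print(localize_drift(window))
```

The returned index marks the first sample of the drifted part, so data from that index to the current instance could be used to update the model.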

RESULTS AND DISCUSSION
In this section, we discuss the metrics used to calculate the results. For the credit card dataset, we evaluate the accuracy, precision, recall, F1-score, and false positive rate (FPR). For comparison, we use the standard machine learning models support vector machine (SVM) [17], random forest (RF) [17], and decision tree (DT) [17], together with network anomaly detection scheme-representation and augmentation (NADS-RA) [17], generative adversarial networks (GAN) [22], standard XGBoost, and the proposed model. Similarly, for the Twitter spam dataset, we evaluate the accuracy, recall, and F1-score. To compare the results with the existing models, we use the standard machine learning models RF [23], k-nearest neighbor (KNN) [23], and SVM [23], together with XGBoost [24], data imbalance aware XGBoost (DIA-XGBoost) [24], and the proposed CSDI model. After this, we evaluate the drift time detection of our model. We further discuss the data imbalance and drift problems in the credit card dataset [25] and the Twitter spam dataset [26].

Performance metrics
For the detection of fraud in the credit card dataset and spam in the Twitter dataset, we use the following classification metrics: accuracy, precision, recall, F1-score, and FPR. The accuracy of the model is calculated using (6), where TP is true positive, TN is true negative, FP is false positive, and FN is false negative. The recall of the model is calculated using (7), and the F1-score using (8). Using these equations, we evaluate the results presented in the following sections.
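The metrics named above can be computed from the four confusion-matrix counts as sketched below. The formulas are the standard definitions: accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = 2 · precision · recall / (precision + recall), and FPR = FP / (FP + TN). That (6) to (8) take exactly these forms is an assumption, and the function name and the example counts are hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard confusion-matrix metrics; each ratio is 0.0 when its denominator is 0."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


if __name__ == "__main__":
    # Illustrative counts only, not results from the paper.
    print(classification_metrics(tp=80, tn=900, fp=10, fn=10))
```

On heavily imbalanced data such as fraud records, accuracy alone can look high even when few frauds are caught, which is why recall, F1-score, and FPR are reported alongside it.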

Experiment on credit card dataset
In this section, the accuracy, recall, and F-measure of the proposed model are compared with those of the existing models on the credit card fraud dataset. The results are shown in Figure 1.

Drift time
In this section, the drift time for the Twitter spam dataset is discussed. The accuracy, recall, and F-measure are shown in Figures 3, 4, and 5, respectively. The results are compared with the MDDT+XGB [24] model, which mainly focuses on drift and handles it using an ensemble model that also uses XGBoost. As seen in Figure 3, the accuracy of MDDT+XGB fluctuates from day to day. The accuracy of the proposed model is stable by comparison, decreasing gradually as the number of days increases, with a drop after the 5th day as the drift in the data increases. Figure 4 shows the recall over 5 days. As with accuracy, the recall of MDDT+XGB fluctuates each day, whereas the recall of the proposed model is stable and decreases as the number of days increases. Figure 5 shows the F-measure, where the existing model increases on the 2nd day but drops after the 3rd day. The F-measure of the proposed model remains constant throughout the 5 days. Hence, the proposed model handles drift better than the existing model.

CONCLUSION
In this paper, we first studied how class imbalance and concept drift arise in data and how they affect the overall system during classification and detection. We then reviewed the models that have been presented to handle class imbalance and detect concept drift. After this, we presented a model that addresses the data imbalance and drift problems during detection. We evaluated the model on the credit card fraud dataset and the Twitter spam dataset. The results show that the proposed model attains higher accuracy than the existing systems on both datasets. The proposed model therefore offers a way to address class imbalance and drift in a given dataset and predicts better than the existing models on the credit card fraud and Twitter spam datasets. In future work, we will try to improve the model to attain higher performance metrics on other datasets.