Optimization of network traffic anomaly detection using machine learning

Received Sep 6, 2020; Revised Dec 9, 2020; Accepted Dec 15, 2020

In this paper, to optimize the process of detecting cyber-attacks, we propose two main optimization solutions: optimizing the detection method and optimizing the features. Both solutions aim to increase accuracy and reduce the time needed for analysis and detection. For the detection method, we recommend the Random Forest supervised classification algorithm. The experimental results in section 4.1 support this choice: the Random Forest algorithm gives better results than the other detection algorithms we compared on all accuracy measures. For feature optimization, we propose to use data dimension reduction techniques, namely information gain, principal component analysis, and the correlation coefficient method. Our results show that optimizing the cyber-attack detection process does not require advanced algorithms with complex and cumbersome computational requirements. Instead, the feature extraction and optimization algorithm and the attack classification and detection algorithm should be chosen to suit the monitoring data.


INTRODUCTION
Cyber-attacks are a dangerous form of attack that has increased rapidly in both the number of recorded incidents and the extent of the damage to organizations and businesses. The studies [1][2][3] classify cyber-attack techniques into two main types: passive attacks and active attacks. According to the report [4], in 2019 cyber-attack techniques were ranked among the most dangerous attack techniques. The statistics on security vulnerabilities [5] that attackers often exploit show the scale and danger of current cyber-attacks for organizations, governments, and businesses. Therefore, detecting cyber-attack campaigns and giving early warning of them is essential today.
The studies [2,3] presented the differences between cyber-attacks and other attack techniques, which make detection and warning difficult. Currently, there are two main methods for detecting cyber-attacks: the signature-based method, which relies on rule sets, and the anomaly-based method, which relies on data analysis and statistics to find abnormal characteristics in the network [1][2][3][6]. The signature-based method detects known attacks quickly and accurately but cannot detect new attack techniques [1]. The anomaly-based method can detect both attacks and other abnormal behaviors, but it requires complex calculation and processing and tends to have lower accuracy. The anomaly-based method usually relies on two main techniques, machine learning and deep learning, to classify abnormal and normal behavior [1,2]. In this paper, we propose a cyber-attack detection method using the random forest (RF) machine learning algorithm. The studies [1,3,6,7,8] report the RF algorithm as the best current algorithm for classification. The studies [1,2] listed and analyzed some datasets commonly used for cyber-attack detection, such as DARPA/KDD Cup99, CAIDA, NSL-KDD, ISCX 2012, and UNSW-NB15. Among these datasets, UNSW-NB15 was built to reflect real network systems relatively closely [1,9]. Therefore, in this paper, we use the UNSW-NB15 dataset to experiment with cyber-attack detection methods.
To optimize the process of detecting and alerting on cyber-attacks with machine learning techniques, recent studies and recommendations often attempt to find new detection methods and techniques. However, we observe that these new approaches are usually suitable only for existing datasets. When they are applied in practice, they often are not very effective because the datasets used to build the model are incompatible with the monitoring data. Therefore, instead of trying to develop new detection methods, we look for ways to analyze and build experimental datasets so that they best suit real network monitoring systems. In this paper, to optimize the anomaly detection process based on the UNSW-NB15 dataset, we propose methods of evaluating and selecting features. The methods we use are information gain, principal component analysis, and the correlation coefficient method.
The rest of this paper is organized as follows. Section 1 presents the urgency of the research problem. Section 2 surveys and evaluates related work. Section 3 presents the algorithms used for attack classification and feature dimension reduction. Section 4 presents the experimental results: section 4.1 covers the cyber-attack detection experiments, in which we evaluate our proposed method and compare it with some other studies, and section 4.2 evaluates and compares the efficiency of the feature dimension reduction methods. Section 5 gives the conclusion and evaluation.
The practical and scientific contributions of our paper include:
- Applying the RF machine learning algorithm to the UNSW-NB15 dataset to detect abnormal behavior in the network. In the studies that we surveyed (see section 2.1), the authors used different machine learning methods to compare and evaluate the effectiveness of each algorithm. However, none of the studies we surveyed applied the RF algorithm to detect anomalies based on the UNSW-NB15 dataset, although some studies indicate that it is the best current algorithm for classification. Our experimental results in section 4.1 show the effectiveness of the RF algorithm in detecting anomalies and show that building anomaly detection systems does not require overly cumbersome and complicated algorithms. In addition, based on the results of our experimental scenarios, we show how to select the dataset and the parameters of the algorithm so that they suit the detection model.
- Proposing methods of evaluating and selecting features. In this paper, we use several methods and techniques to evaluate and select the best features. In addition, we reassess the detection model based on the selected features with two criteria: accuracy and processing time. The results in section 4.2 extend the studies presented in section 2.2 and address their shortcomings.

RELATED WORK
2.1. Cyber-attacks detection based on UNSW-NB15 dataset
In the study [10], Kumar et al. proposed a method to classify cyber-attack techniques in UNSW-NB15 by using different rule sets. However, the applicability of this approach is limited because the coverage and the number of rule sets are not large enough. Moustafa et al. [11] proposed a geometric area analysis technique that detects cyber-attacks by using trapezoidal area estimation. To evaluate the proposed method, the authors conducted experiments on the UNSW-NB15 and NSL-KDD datasets. Their experimental results showed the superiority of the UNSW-NB15 dataset over the NSL-KDD dataset. The research [12] presents a technique for building an effective anomaly detection system based on two datasets, NSL-KDD and UNSW-NB15. This technique requires three modules: a capturing and logging module, a pre-processing module, and a Dirichlet mixture model, which is a novel statistical decision engine for anomaly detection. The first module scans and gathers network data. The second module then analyzes and filters these data in order to improve the efficiency of [...] Bagui et al. [13] proposed a cyber-attack detection method based on the Naïve Bayes and decision tree (J48) algorithms. In their experimental section, the research team [13] used these algorithms in turn to classify different cyber-attack components in the UNSW-NB15 dataset. In the study [14], the authors proposed a model to detect cyber-attacks using stacking techniques. In the training process of their model, the authors use the K-Nearest Neighbors, Decision Tree, and Logistic Regression machine learning algorithms to build a model based on the UNSW-NB15 and UGR'16 datasets. The study [15] evaluated the effectiveness of 8 machine learning algorithms (consisting of 2-layer and 3-layer algorithms) for network intrusion detection.
This is a good idea, but applying it in practice requires the Microsoft Azure Machine Learning Studio system. In this research, we distinguish between attack and normal traffic using pure machine learning algorithms and Apache Spark technology. Our results are similar to those of the method proposed in [15], but our implementation and experimental configuration are much simpler than those in [15].
In addition, other studies also presented methods to detect attack components in the network using machine learning algorithms. The study [16] presented a method of detecting DDoS attacks using a technique that comprehensively simulates DDoS attacks. In their study [17], Narender et al. proposed a method to detect DDoS attacks using machine learning algorithms such as Logistic Regression, Decision Tree, and K-Nearest Neighbors. This is a relatively classic approach, and these classification algorithms are often not as effective as the RF algorithm [7]. Jafar et al. [18] proposed a method to classify DoS, Probe, U2R, and R2L attack techniques by using Neural Network, Genetic, and Decision Tree algorithms. However, the KDD 99 dataset used in that study is dated, because current cyber-attack data is much more abundant and diverse.

The problem of optimizing the anomaly detection feature on the network based on the UNSW-NB15 dataset
In the study [19], the authors proposed using Pearson's correlation coefficient and the gain ratio technique to evaluate features. However, the authors did not conduct experiments to evaluate the accuracy of each feature dimension reduction method. In this paper, we not only evaluate features to select the important ones but also evaluate the anomaly detection model built on the selected features. The study [20] proposed the information gain method to reduce the feature dimension in the training process of a botnet detection model. However, the authors did not specify which redundant features were removed. The study [10] described the information gain algorithm for reducing the feature dimension. However, in the experimental part, the authors did not compare the effectiveness of the detection method with and without the feature dimension reduction technique. Bagui et al. [13] proposed feature selection methods using the K-means Clustering and Correlation-based Feature Selection algorithms. In the study [21], the authors proposed a deep learning model combining a Convolutional Neural Network and a long short-term memory network (LSTM) to extract features and classify cyber-attacks using the CICIDS2017 dataset. Their experimental results show an overall accuracy of 98.67% and an accuracy of over 99.50% for each attack type. However, this approach requires a lot of time and a cumbersome computing system. Thus, this method is mainly suitable for research and is difficult to apply in practice.

ANOMALY CLASSIFICATION AND ITS OPTIMIZATION USING MACHINE LEARNING
3.1. Experimental data
The dataset used for experiments is UNSW-NB15. This dataset was built by using the IXIA PerfectStorm tool to generate a mixture of attack operations in the network. More than 100 GB of raw network traffic was captured with the Tcpdump tool and processed by Argus, Bro-IDS, and twelve algorithms written in the C# language to extract 43 features, which were saved in CSV format [9,10,12,13]. The selected features are divided into six groups:
- Flow features: features used to identify a network flow, such as IP address, port number, and protocol.

Anomaly classification using random forest machine learning algorithm
The study [7] surveyed and evaluated several supervised learning algorithms for the cyber-attack detection problem and indicated that the RF algorithm is the best current classification technique. Therefore, in this paper, we use the RF algorithm to detect anomalies in the network based on the UNSW-NB15 dataset. RF is an ensemble classification method [22]: it combines the predictions of an ensemble of classifiers, normally decision trees, to make the final prediction [23]. The theoretical foundation of this algorithm is Jensen's inequality [23]. Applied to classification problems, Jensen's inequality shows that a combination of many models may produce a lower error rate than each individual model.
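The claim that combining many models can lower the error rate can be illustrated with a small calculation. The sketch below is not the RF algorithm itself and is not taken from the paper: it assumes a hypothetical ensemble of classifiers that err independently, each correct with probability p, and computes how often a majority vote is correct. Trees in a real random forest are correlated, so the actual gain is smaller than this idealized figure.

```python
from math import comb


def majority_vote_accuracy(p: float, n_voters: int) -> float:
    """Probability that a majority of n_voters independent classifiers,
    each correct with probability p, votes for the correct class.

    Independence between voters is an idealizing assumption.
    """
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


if __name__ == "__main__":
    # Accuracy of the vote for ensembles of increasing size.
    for n in (1, 3, 11, 41):
        print(n, round(majority_vote_accuracy(0.7, n), 4))
```

Under these assumptions the vote improves on a single voter only when p is above 0.5; at p = 0.5 the ensemble stays at chance level regardless of its size.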

Feature evaluation and selection
In fact, not all of the extracted features are useful for building a training model that makes the necessary predictions. Some features can even reduce prediction accuracy and lengthen the time needed to build a model. Therefore, feature selection plays a very important role in building anomaly detection systems. Selecting good features not only improves the accuracy of attack prediction but also reduces feature extraction time. In this paper, we evaluate and select features with several different methods in order to assess the effectiveness of each method on the UNSW-NB15 dataset.

Feature optimization using correlation coefficient method
The correlation coefficient is a statistical index that measures the strength of the relationship between two variables. There are many different kinds of correlation coefficients. In this paper, we use the Pearson correlation coefficient. The Pearson correlation coefficient between two variables X and Y, observed as n pairs $(x_i, y_i)$ with means $\bar{x}$ and $\bar{y}$, is calculated by the formula [24]:
$$r_{XY} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
The correlation coefficient has a value between -1 and 1. A negative coefficient indicates that the two variables have a negative (inverse) correlation, which is perfect when the value is -1. A positive coefficient indicates a positive correlation, which is perfect when the value is 1. The correlation coefficient is zero if the two variables are independent of each other. Two features whose correlation coefficient is large in absolute value are nearly linearly dependent, so they have almost the same effect on the dependent variable. We can therefore remove one of the two features.
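The selection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the column names and values are hypothetical, the 0.9 default threshold is only an example value, and the choice to keep the first feature of each highly correlated pair is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat constant columns as uncorrelated
    return cov / sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Return the names of the features to keep.

    columns maps a feature name to its list of values. A feature is
    dropped when |r| >= threshold against any feature already kept.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical toy columns, not UNSW-NB15 data.
    data = {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
        "c": [5, 4, 3, 2, 1],   # perfectly anti-correlated with "a"
        "d": [2, 1, 4, 3, 5],   # r = 0.8 with "a", below the threshold
    }
    print(drop_correlated(data))
```

Which member of a correlated pair is removed depends on the iteration order, so a real pipeline would need an explicit tie-breaking rule.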

Feature optimization using information gain method
Information gain (IG) is a feature evaluation method based on the entropy function and is widely used in machine learning [25]. Information gain measures the amount of information gained about a class from a feature and is calculated from entropy [23]. The entropy function is defined as follows [23]: consider a probability distribution of a discrete variable $x$ that can take $n$ different values $\{x_1, x_2, \dots, x_n\}$. Suppose that the probabilities of these values are $p_i = p(x = x_i)$ with $0 \le p_i \le 1$ and $\sum_{i=1}^{n} p_i = 1$. This distribution is denoted $\mathbf{p} = (p_1, p_2, \dots, p_n)$. The entropy of this distribution is defined by formula (1):
$$H(\mathbf{p}) = -\sum_{i=1}^{n} p_i \log p_i \quad (1)$$
From the formula of entropy, we formulate the calculation of information gain as follows:
Step 1: Consider a problem with $C$ different classes. Suppose that we work with a non-leaf node whose data points form the set $S$ with $|S| = N$ elements. Suppose further that among these $N$ data points, $N_c$ points (with $c = 1, 2, \dots, C$) belong to class $c$. The probability that a data point belongs to class $c$ is approximately $N_c / N$ (maximum likelihood estimation). Thus, the entropy at this node is calculated as follows:
$$H(S) = -\sum_{c=1}^{C} \frac{N_c}{N} \log \frac{N_c}{N} \quad (2)$$
Step 2: Assume that the dataset is divided into subsets according to a feature $x$. Based on $x$, the data points in $S$ are divided into $K$ child nodes $S_1, S_2, \dots, S_K$ with $m_1, m_2, \dots, m_K$ points in each child node. We define formula (3) as the weighted sum of the entropy of each child node. The weighting is important because the nodes often have different numbers of points.
$$H(x, S) = \sum_{k=1}^{K} \frac{m_k}{N} H(S_k) \quad (3)$$
Step 3: Calculate the information gain value based on feature $x$:
$$G(x, S) = H(S) - H(x, S) \quad (4)$$
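The three steps above can be sketched as follows for a categorical feature. This is an illustrative sketch, not the authors' code: the feature values and labels are hypothetical, the logarithm is taken in base 2 (the base only rescales the result), and continuous features would first have to be discretized, which is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of the empirical class distribution at a node (Step 1)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Node entropy minus the weighted entropy of the child nodes
    obtained by splitting on the feature (Steps 2 and 3)."""
    n = len(labels)
    children = {}
    for value, label in zip(feature_values, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(c) / n * entropy(c) for c in children.values())
    return entropy(labels) - weighted


if __name__ == "__main__":
    # Hypothetical toy records, not UNSW-NB15 data.
    labels = ["attack", "attack", "normal", "normal"]
    separating = ["tcp", "tcp", "udp", "udp"]  # splits the classes perfectly
    useless = ["x", "y", "x", "y"]             # carries no class information
    print(information_gain(separating, labels))
    print(information_gain(useless, labels))
```

A feature whose gain falls below a chosen threshold is a candidate for removal; the threshold itself is a tuning decision.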

Feature optimization using principal component analysis method
Principal component analysis (PCA) is a method of finding a new basis in which the information in the data is concentrated in a few coordinates, while the remaining coordinates contain only a small amount of information. To simplify the calculation, PCA looks for an orthonormal basis in which the most important components lie in the first few coordinates [26]. The steps for implementing PCA on $N$ data points $\mathbf{x}_1, \dots, \mathbf{x}_N$ are as follows [26,27]:
Step 1: Calculate the mean vector of all data:
$$\bar{\mathbf{x}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \quad (5)$$
Step 2: Subtract the mean vector from each data point:
$$\hat{\mathbf{x}}_n = \mathbf{x}_n - \bar{\mathbf{x}} \quad (6)$$
Step 3: Calculate the covariance matrix, where $\hat{\mathbf{X}} = [\hat{\mathbf{x}}_1, \dots, \hat{\mathbf{x}}_N]$ is the matrix of normalized data:
$$\mathbf{S} = \frac{1}{N} \hat{\mathbf{X}} \hat{\mathbf{X}}^T \quad (7)$$
Step 4: Calculate the eigenvalues and the eigenvectors with norm equal to 1 of this matrix, and arrange them in descending order of eigenvalue.
Step 5: Select the $K$ eigenvectors with the $K$ highest eigenvalues to build the matrix $\mathbf{U}_K$, whose columns form an orthonormal system. These $K$ vectors are also called principal components, and they span a subspace close to the distribution of the normalized original data.
Step 6: Project the normalized original data $\hat{\mathbf{X}}$ onto the found subspace.
Step 7: Calculate the coordinates of the new data. The new data is the coordinates of the data points in the new space according to formula (8):
$$\mathbf{Z} = \mathbf{U}_K^T \hat{\mathbf{X}} \quad (8)$$
The original data can be approximated from the new data as in formula (9):
$$\mathbf{x} \approx \mathbf{U}_K \mathbf{z} + \bar{\mathbf{x}} \quad (9)$$
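The seven steps can be traced on a tiny example in plain Python. This is a sketch under stated assumptions rather than a production implementation: power iteration with deflation stands in for a full eigendecomposition in Steps 4 and 5, the start vector is assumed not to be orthogonal to the dominant eigenvector, and the sample points are hypothetical. Data points are stored as rows here, whereas the usual notation stacks them as columns.

```python
from math import sqrt


def mat_vec(m, v):
    """Product of a square matrix (list of rows) with a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def top_eigenpair(m, iterations=200):
    """Dominant eigenvalue and unit eigenvector of a symmetric positive
    semi-definite matrix, found by power iteration."""
    d = len(m)
    v = [1.0 / sqrt(d)] * d  # start vector (see assumption in the lead-in)
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


def pca(rows, k):
    """Return (mean, components, z) for data points given as rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]                   # Step 1
    centered = [[x - m for x, m in zip(r, mean)] for r in rows]   # Step 2
    cov = [[sum(r[i] * r[j] for r in centered) / n                # Step 3
            for j in range(d)] for i in range(d)]
    components = []
    for _ in range(k):                                            # Steps 4-5
        lam, u = top_eigenpair(cov)
        components.append(u)
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[i][j] - lam * u[i] * u[j] for j in range(d)]
               for i in range(d)]
    z = [[sum(a * b for a, b in zip(u, r)) for u in components]   # Steps 6-7
         for r in centered]
    return mean, components, z


def reconstruct(mean, components, z_row):
    """Approximate an original point from its new coordinates."""
    return [m + sum(zk * u[i] for zk, u in zip(z_row, components))
            for i, m in enumerate(mean)]


if __name__ == "__main__":
    points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # hypothetical, lie on a line
    mean, components, z = pca(points, k=1)
    print(z)
    print([reconstruct(mean, components, row) for row in z])
```

Because the sample points lie on one line, a single component reproduces them without loss; on real traffic features the reconstruction would only be approximate.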

EXPERIMENTS AND EVALUATIONS
4.1. Experiment and evaluation of abnormal detection method
4.1.1. Experimental scenarios
The experimental dataset in our paper includes 2,540,047 records, consisting of 2,218,764 normal records and 321,283 attack records. We divide this dataset into three experimental datasets, A, B, and C, which differ in the balance between normal and attack records: dataset A is the most balanced and dataset C the most imbalanced. On each dataset, the RF algorithm is tested with the number of decision trees set to {10, 40, 60, 80, 100}. Besides, we also conduct experiments to compare the RF algorithm with algorithms used in other studies, namely the decision tree (J48) [9,21] and LSTM [21,28] algorithms. The study [15] showed that the KNN and logistic regression algorithms are both less effective than the decision tree algorithm, so to assess the effectiveness of the RF algorithm, we compare it only with the decision tree and LSTM algorithms.

Evaluation criteria
In this paper, abnormal records are labeled as positive, and normal records are labeled as negative. The metrics used to evaluate the effectiveness of the abnormal detection method in our paper include:
- Accuracy: the ratio between the number of points correctly predicted and the total number of points in the test dataset.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (10)$$
- Precision: the ratio of the number of true positive points to the number of points classified as positive (TP+FP). A high precision value means that the points reported as positive are mostly correct.
$$Precision = \frac{TP}{TP + FP} \quad (11)$$
- Recall: the ratio of the number of true positive points to the number of points that are actually positive (TP+FN). A high recall value means that the true positive rate (TPR) is high, that is, few actual positive points are missed.
$$Recall = \frac{TP}{TP + FN} \quad (12)$$
Here, true positive (TP) is the number of abnormal records correctly predicted as abnormal; false positive (FP) is the number of normal records incorrectly predicted as abnormal; true negative (TN) is the number of normal records correctly predicted as normal; false negative (FN) is the number of abnormal records incorrectly predicted as normal.
- Confusion matrix: this matrix shows how many data points actually belong to each class and how many data points are predicted to belong to each class. In addition, the TPR, FNR, FPR, and TNR criteria (R stands for rate) are calculated from the normalized confusion matrix. Table 1 describes the calculation formulas of these parameters.
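As a quick check of these definitions, the metrics can be computed directly from the four confusion-matrix counts. The sketch below uses the standard rate definitions for TPR, FNR, FPR, and TNR, and the counts in the example are hypothetical, not results from the paper.

```python
def detection_metrics(tp, fp, tn, fn):
    """Evaluation metrics from confusion-matrix counts, with abnormal
    records as the positive class."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # identical to the true positive rate
        "fnr": fn / (tp + fn),     # share of attacks that were missed
        "fpr": fp / (fp + tn),     # share of normal records flagged
        "tnr": tn / (fp + tn),
    }


if __name__ == "__main__":
    # Hypothetical counts for an imbalanced test set of 100 records.
    for name, value in detection_metrics(tp=8, fp=2, tn=85, fn=5).items():
        print(f"{name}: {value:.4f}")
```

On imbalanced data such as this example, accuracy stays high even though a noticeable share of attacks is missed, which is why precision and recall are reported alongside it.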

Experimental results
a. Experimental results with dataset A
From Table 2, we can see that when the number of decision trees is 40, the algorithm has the highest accuracy and precision, which are 99.299% and 98.619% respectively. Besides, when the number of decision trees changes from 10 to 100, the accuracy of the algorithm does not change much. This shows that on a dataset with a balanced ratio of normal and abnormal records, the RF algorithm detects well and steadily. However, when the number of decision trees increases, training and testing time also increases. Table 3 shows the confusion matrix when the number of decision trees is 40. From Table 3, we can see that the prediction model achieved very high accuracy in both normal and anomaly predictions.
b. Experimental results with dataset B
From Table 2 and Table 4, we can see that the accuracy and precision on dataset B are lower than on dataset A. However, the recall values do not change much. In addition, the training time on dataset B is 1.3 to 1.5 times higher than on dataset A. For the RF algorithm on dataset B, the highest accuracy (98.944%) is achieved when the number of decision trees is 60 and 80. However, the highest precision is 95.965%. Table 5 shows the confusion matrix when the number of decision trees is 40.
c. Experimental results with dataset C
When the numbers of abnormal and normal records differ the most, all experimental values are poorer than in the other scenarios. This is reasonable given the nature of the classification process: if the disparity in the dataset is too large, the classification model tends to overfit. From Table 6, it can be seen that with 10 decision trees, the highest accuracy and precision are 99.016% and 94.825% respectively. The confusion matrix values are shown in Table 7.
From Tables 2, 4, and 6, we can see that the RF algorithm gave good and stable classification results even though the datasets differ greatly. The lower the imbalance of the dataset, the higher the measures of the correct detection rate. Besides, for the J48 [9,21] and LSTM [21,28] algorithms, the detection results and detection time also change when the dataset changes. The J48 algorithm has the lowest detection and classification time because it uses only one tree for evaluation. However, its accuracy on all measures is lower than that of the RF algorithm. With the LSTM algorithm [21,28], the detection efficiency improves, but the processing time is much longer than for the other algorithms. Hence, we can see that the LSTM algorithm is not really suitable for datasets without time parameters. These results give cyber-attack detection systems some criteria for balancing detection performance against time cost.

Evaluation of feature optimization methods
From Table 6, we select the configuration with 80 decision trees to conduct experiments and evaluate the feature optimization methods. We chose this scenario because dataset C with 80 decision trees gives the lowest accuracy and precision.

Feature selection using correlation coefficient method
a. Experimental results of feature dimension reduction
According to the selection rule of the correlation coefficient method, if two features have a large correlation coefficient, one of the two should be removed, because keeping both adds little information. Accordingly, from Figure 1, we specify that if two features have a correlation coefficient greater than or equal to 0.9, or less than or equal to -0.9, one of the two features is removed. By doing this, we removed 12 features, including sloss, ct_state_ttl, synack, ct_dst_src_ltm, Dpkts, dwin, ackdat, ct_srv_dst, Ltime, dloss, and ct_src_ltm. So from Figure 1, the number of remaining features is 31.
The experimental results of dataset C with the 31 remaining features are presented in Table 8. Comparing Table 8 with Table 6, we see that the important metrics (accuracy, precision, and training time) are all improved: the accuracy value increased by 0.015%, the precision value increased by 0.146%, and the training time was reduced by 52.954 seconds.
Figure 2 shows the importance of each feature when using the IG evaluation method. Features with low importance scores (less than 0.01) are removed to reduce the number of features. By doing this, we removed 15 features: dloss, dwin, stcpb, dtcpb, trans_depth, res_bdy_len, Sjit, Djit, Stime, Ltime, is_sm_ips_ports, ct_flw_http_mthd, is_ftp_login, ct_ftp_cmd, ct_src_ltm.
b. Result of classification using IG method
The experimental results of dataset C with 28 selected features are presented in Table 9. Comparing Table 9 with Tables 8 and 6, we see that the important metrics (accuracy, precision, and training time) are all much better: the accuracy value increased by 0.193%, the precision value increased by 2.333%, and the training time was reduced by 29.699 seconds.
The experimental results of dataset C after dimension reduction with the PCA method are presented in Table 10. Compared with the initial feature set, this experimental scenario also has better accuracy, precision, and training time values. Furthermore, reducing the feature dimension with the PCA method gives higher accuracy and precision than feature selection with the correlation coefficient method, but its training time is 34.377 seconds longer. Comparing the experimental results in Table 10 with Table 9, the PCA method is not as effective as the IG method. The reason is that the PCA method compresses the data, which can lose important features, whereas the IG method evaluates the weight of each feature in order to select features. Therefore, we expect the PCA method to be more effective when the dataset is larger.

Discussion
The experimental results in Tables 8-10 show that the feature dimension reduction algorithms were effective on both counts: improving the efficiency of the detection process and reducing the time for detection and warning. However, because the feature dimension reduction methods differ in efficiency, cyber-attack monitoring and detection systems need a trade-off between detection efficiency and detection time. The IG and correlation coefficient algorithms can give better results in terms of detection time and efficiency if we continue to adjust the thresholds to reduce the dimension further. However, removing too many features leads to the loss of data characteristics. Besides, these algorithms are only suitable for small and medium datasets. For large datasets, it is necessary to use the PCA method. Therefore, we think that monitoring systems need to update and reevaluate the training model continually, so that changes in the values and roles of features are captured and all useful features are used.

CONCLUSION
Cyber-attack techniques have always been and will always be major challenges for intrusion monitoring and detection systems. With the goal of optimizing the cyber-attack detection process, our research addressed two main problems: optimizing the attack detection method by using the RF supervised learning algorithm, and optimizing features with feature dimension reduction techniques. The experimental results on detecting cyber-attacks show that the RF algorithm is effective both in accurately detecting attacks and in limiting false detections when the experimental dataset has a large difference between normal data and cyber-attack data. For the feature optimization process, the feature dimension reduction methods removed many features. In particular, the correlation coefficient method reduced the number of features by 26%, IG by 32%, and PCA by 43%. Although the number of features is reduced, the detection method still maintains its accuracy as well as its detection time. This shows that the dimension reduction methods accurately selected and eliminated redundant features. With these results, our paper has provided network attack monitoring and detection systems with criteria for balancing the time and efficiency of the detection process. It has also shown that optimizing the detection of cyber-attacks does not require advanced algorithms with complex and cumbersome computational requirements. Instead, the feature extraction and optimization algorithm and the attack detection algorithm should be chosen to suit the monitoring data. In the future, we will continue this research and apply our approach to other experimental cyber-attack datasets such as IDS 2018 and CTU 13. Besides, we will improve the data dimension reduction solutions based on information representation methods of features or on graph theory.