Comparison study of machine learning classifiers to detect anomalies

Received Nov 29, 2019; Revised Apr 3, 2020; Accepted Apr 13, 2020
In this era of the Internet, ensuring the confidentiality, authentication and integrity of any resource exchanged over the network is imperative. Intrusion prevention techniques such as strong passwords and firewalls are not sufficient to monitor such voluminous network traffic, as they can be breached easily. Existing signature-based detection techniques such as antivirus software only offer protection against known attacks whose signatures are stored in the database. Thus, there is a need for real-time detection of aberrations. Machine learning classifiers are implemented here to learn how the values of various fields in a network packet, such as source bytes and destination bytes, decide whether the packet is compromised. Finally, the detection accuracy of the classifiers is compared to choose the one best suited for this purpose. The outcome may be useful for real-time detection while exchanging sensitive information such as credit card details.


INTRODUCTION
The increasing rate of cybercrime is a grave concern nowadays. Owing to the increased usage of the Internet in all areas of life, privacy and security have become the need of the hour. Any manipulation of a resource by an unauthorized entity with the intention of causing harm is termed an intrusion. An intrusion detection system (IDS) is a defense system that automatically screens the activities in a computer system or network to detect breaches and subsequently notifies the user of any violations [1].
There are mainly four categories of attacks [2]. In a denial-of-service (DoS) attack, attackers prevent legitimate users from accessing a service for a period of time. Bank websites, VTU sites etc. are prone to this kind of attack. In remote to user (R2L), a remote person poses a threat by attempting to gain control of a target resource. Social engineering is one such attack. In user to root (U2R), a person with local privileges abuses the system's vulnerabilities to gain superuser rights. Buffer overflow errors and errors caused by irregularities in environmental assumptions are some common examples. In probing, the attacker examines the system to find its vulnerabilities, which are then used to abuse the system.
Based on location, two commonly used types of IDS are [3] the network-based intrusion detection system (NIDS), which examines traffic flowing in the network, and the host-based intrusion detection system (HIDS), which scrutinizes traffic originating from or destined to a particular host. Based on detection technique, IDS can be categorized as [4] misuse detection and anomaly detection. In misuse detection, signatures of all known attacks are stored and compared against observed activity. Although this technique provides a high detection rate, it is very time consuming and is effective only for known attacks. In anomaly detection, any deviation from the normal expected behavior is flagged as an attack without any prior knowledge of attacks. This method yields a higher false alarm rate.
Problem Definition. The increase in Internet crime exemplifies the need for a competent intrusion detection system. Every sector of society is computerized, so a large volume of important information such as personal profiles and credit card information is entered, edited and transferred across the network daily. This shift from centralized computing to a networked environment has created a need to improve network security. Existing defenses are not completely effective against all attackers: firewalls suffer from faulty packet filtering, antivirus software from the generality problem, and application gateways from high cost and a performance bottleneck that slows down the network. Machine learning classifiers are implemented here to learn how the values of various fields in a network packet, such as source bytes and destination bytes, decide whether the packet is compromised. This research points out the need for a proficient intrusion detection system (IDS) that exposes malicious packets effectively even when a broad range of intrusions is encountered, and that cannot be tampered with.
Significance of Proposed Research. The first reason for choosing this research is that the Internet is now part of the everyday routine of most people, encompassing everything from online shopping to social media. Hence it is necessary to ensure that only authorized people have access to private information while preserving its integrity. Secondly, signature-based methods, despite having low false positive rates, are ineffective in providing defense against unknown attacks. Statistical anomaly-based detection looks for discrepancies of traffic characteristics from normal in terms of volume. It fails when the attacker is crafty enough to keep the deviation below certain levels. Finally, machine learning algorithms are chosen as they have proven to be an effective solution for identifying abnormalities immediately without being susceptible to manipulation by attackers.

LITERATURE REVIEW
U. Cavusoglu [5] employed various machine learning algorithms to evaluate which classifier gave better detection for each attack type. Data preprocessing and the feature reduction methods CfsSubsetEval and WrapperSubsetEval were used. The method can be extended to find a single classifier that gives optimum detection for all categories of attacks.
Kang et al. [6] demonstrated intrusion detection at the cluster head by employing SNORT and MySQL databases. The cluster head receives aggregated information from the entire network, making detection quicker. The presented research is most suitable for organizations having large amounts of data. However, the technique is implemented only on a static network, and SNORT, although it offers good detection for known attacks, fails for anomaly detection.
Baykara and Das [7] incorporated a honeypot-based approach for real-time intrusion detection. The proposed system reduces the false positive level and provides protection against attacks such as zero-day attacks. However, this approach is costly in terms of configuration, installation and management of honeypots when compared to machine learning classifiers. If the attack signatures collected from the honeypot log files are not regularly updated in the database, the detection rate suffers.
Zhao et al. [8] used principal component analysis to reduce the dimensions of a large dataset to make it suitable for IoT devices. The accuracy of softmax and KNN classifiers is compared, where softmax regression shows better time performance. Unsupervised learning algorithms could be used so that a broader range of attacks can be discovered. Also, since the algorithm is to be deployed on IoT devices, memory-saving techniques should be applied.
Singh et al. [9] proposed a four-tier architecture with data preprocessing in the first tier, feature extraction in the second tier, classification in the third tier and the user interface in the fourth tier. Generalized discriminant analysis was used for extracting features from the KDD Cup 99 data set. C4.5 offered better detection for the normal and probe classes, iSVM detected normal and DoS attacks, and the hybrid C4.5-iSVM detected U2R and R2L attacks. Although the individual classifiers offered good accuracy, there is room for improvement in the detection of U2R and R2L attacks.
Hoque et al. [10] used a genetic algorithm to detect various types of attacks. The fitness of a chromosome was computed using the standard deviation method, which can be improved by using heuristic approaches. Lin et al. [11] developed an approach that combines log file analysis technology and BP neural network technology. Even though this technique detected both misuse and anomaly data, the log files used are monitored by daemons, making it less trustworthy. Leu and Lin [12] employed the Chi-Square method to detect the variation in packet statistics that usually occurs during attacks. However, to clearly establish the normal distribution, huge amounts of data must be worked through, which is time consuming. Seo [13] implemented Multiple Support Vector Machines (MSVM), in which every hyperplane is trained to detect a specific attack, thereby decreasing the false positive rate. However, since MSVM has a bigger margin than the classical SVM, even normal packets are sometimes classified as attack packets. Mukkamala et al. [14] compared the accuracy of SVM and neural network (NN) on the DARPA dataset. SVM was observed to perform better than NN for the selected 13 features. However, the SVM was limited to binary classification, and the method could be extended to detect more variants of attacks.

PROPOSED METHODOLOGY
Figure 1 highlights the methodology followed in the paper; each step is described below.
a. Data set
The KDD99 dataset [15] embodies 41 attributes and the 'class' attribute [16], which specifies whether a given case is normal or an attack, as shown in Figure 2.
b. Pre-processing
The data is noisy, redundant and incomplete, and contains different data types. Without standardization, the process of classification will be hampered. Various R preprocessing packages are applied to eliminate records having incomplete data and to bring the data into a uniform form.
c. Principal component analysis
Most of the features in the NSL-KDD dataset do not account for much of the variance in the results. Therefore, principal component analysis (PCA) [17,18] is used to obtain a more concise dataset with fewer features that account for most of the variance in the data. PCA, developed by Karl Pearson in 1901, uses an orthogonal transformation to convert possibly correlated data into linearly uncorrelated components sorted by their contribution to the variance, such that the first component explains more variance than the next, and so on. The variance explained by each component is obtained by squaring its standard deviation, which equals the corresponding eigenvalue of the covariance matrix. The first k components can then be selected such that they explain most of the variance in the data, so that only k features are retained without much loss in the variance explained. Reducing the dimensions of the data in this way helps data visualization and also mitigates some of the high-variance problems caused by excess features having little or no contribution to the results.
d. Categorize the packets
1) Neural network: The pre-processed data is divided into three sets (training, validation and test) in a 60:20:20 ratio. The model is trained using the above method [19] on data from the training set for different values of the hidden layer, and accuracy is subsequently tested using the validation set. The classification error E = M - Y is calculated using the validation set, where M is the expected output vector taken from the validation set and Y is the computed output resulting from the classification (Y = W*X), with weight W and input X. When the observed error is low, the training phase ends. The entire process is repeated k times (k-fold cross validation) [20] for different randomly selected data samples to find the optimum value for the hidden layer, ensuring that the model does not overfit. The models with the highest validation accuracy are then tested with the test set.
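The 60:20:20 split and the k-fold selection of the hidden layer value described above can be sketched as follows. This is a minimal illustration in Python, not the implementation used in this study: the function names, the choice of the training and validation portions as the pool for the folds, and the placeholder scoring function are all assumptions made for the example. In practice, `train_and_score` would train the neural network on the given indices and return its accuracy on the held-out indices.

```python
import random
from statistics import mean


def split_60_20_20(n_samples, seed=0):
    """Shuffle sample indices and split them 60:20:20 into train/validation/test."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = n_samples * 60 // 100
    n_val = n_samples * 20 // 100
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def k_fold_indices(indices, k):
    """Yield (train_part, held_out) pairs so that each fold is held out once."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train_part = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_part, held_out


def select_hidden_value(candidates, indices, k, train_and_score):
    """Return the candidate with the highest mean held-out accuracy, plus all means."""
    scores = {}
    for h in candidates:
        fold_scores = [train_and_score(h, tr, ho) for tr, ho in k_fold_indices(indices, k)]
        scores[h] = mean(fold_scores)
    best = max(scores, key=scores.get)
    return best, scores


def placeholder_score(hidden_value, train_idx, held_out_idx):
    # Placeholder only (hypothetical): NOT a real model and not a result from
    # the paper. It returns an artificial score so that the sketch can run.
    return 1.0 / (1 + abs(hidden_value - 4))


train_idx, val_idx, test_idx = split_60_20_20(100)
best, scores = select_hidden_value(
    [2, 4, 6], train_idx + val_idx, k=5, train_and_score=placeholder_score
)
print(best, scores)
```

The test indices are never passed to the selection routine, so the final accuracy is measured on data that played no part in choosing the hidden layer value.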
Again, k-fold cross validation is applied to find the optimum value.
2) Bagging: When the model is bagged [21][22][23], several resamples of the data are taken iteratively and the model is trained on each of these samples. The predictions are then averaged over the samples. This method is particularly useful when the model has a high variance, as it helps reduce the variance of the model while having little effect on the bias. This is done with random forests as well as with the linear SVM model to compare results. Cross validation is included in order to find the optimum value [24][25][26].
3) Additionally, simple decision trees with different complexity parameters (cp), selected by cross validation, are also used to compare results with the above models.
e. Accuracy calculation and comparison: The optimum classifier is selected in this step.

RESULTS AND ANALYSIS
Table 1 to Table 11 show the results obtained for the classifiers used.
a. Neural network
Table 1 shows that hidden layer = 2 and hidden layer = 6 were chosen as the parameters having the highest cross-validation accuracy, and the models were tested again with the test set. Table 2 shows that hidden layer = 2 is ideal for this dataset. Final accuracy (test set): 96.26% using hidden layer = 2.
Finally, comparing and analyzing the accuracy of various classifiers and finding the most suitable classifier (subject to the environment in which the system is deployed, the cost and computation constraints, and the security level necessary) contributes considerably to theory building for upgraded system design.
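The accuracy comparison used to pick the most suitable classifier can be illustrated with a short Python sketch. The labels, predictions and classifier names below are made-up toy values chosen only to show the calculation; they are not data or results from this study, and the helper names are assumptions for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of packets whose predicted label matches the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def best_classifier(y_true, predictions):
    """Return the name of the most accurate classifier and all accuracies."""
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy, made-up labels purely for illustration (hypothetical classifiers).
y_true = ["normal", "attack", "attack", "normal", "attack"]
predictions = {
    "classifier_a": ["normal", "attack", "normal", "normal", "attack"],
    "classifier_b": ["normal", "attack", "attack", "normal", "attack"],
}
best, scores = best_classifier(y_true, predictions)
print(best, scores)
```

Accuracy alone treats a missed attack and a false alarm as equally costly; where the deployment environment weighs them differently, the same comparison could be made on another metric.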

CONCLUSION
With the profusion in the usage of the Internet for applications such as e-commerce web sites and online banking, protection of crucial information travelling over the network or residing in host machines becomes essential. The effectiveness of any detection technique depends on the type and behavior of the data in the system, the environment in which the system is deployed, the type of anomalies and attacks that the system encounters, the cost and computation limitations assigned for the particular operation, and the security level required. Firewalls act as a fence around the organization's network but do not provide protection from insider attacks. User authentication methods are costlier in terms of equipment and fail if the secret key that authenticates the person is leaked. Thus, there is a pressing need for a detection system that can categorize any packet accurately as normal or intrusive in real time, without having to rely on any database and without being tampered with by any attacker. Hybrid methods combining signature-based and anomaly-based detection can be implemented in future to offer real-time detection with a good detection rate.