Botnet detection using ensemble classifiers of network flow

Received Mar 19, 2019; Revised Oct 26, 2019; Accepted Nov 22, 2019
Recently, botnets have become a common tool for implementing and transferring malicious code over the Internet. This code can be used to carry out many malicious activities, including DDoS attacks, spam, click fraud, and data theft. It is therefore necessary to use modern technologies to differentiate botnet traffic from normal network traffic so that such attacks can be reduced and prevented in advance. In this work, ensemble classifier algorithms are used to identify such damaging botnet traffic. We experimented with different ensemble algorithms to compare and analyze their ability to separate botnet traffic from normal traffic by selecting distinguishing features of the network traffic. Botnet detection offers a reliable and inexpensive way to ensure transfer integrity and to warn of risks before they occur.


INTRODUCTION
Dependence on the Internet grows day by day, particularly in important fields such as educational organizations, communication companies, government facilities, banking, and e-commerce. This creates many difficulties in managing the web and using its applications, for example in protecting user data, integrity, privacy, and availability [1]. These developments have shifted attackers' attention toward financial gain, and attackers use diverse malware to accomplish their objectives. Among the different sorts of malware, the botnet is one of the most serious means of committing crime online [2], and financial benefit is the main aim of attackers who build botnets [3]. McAfee's Threat Report for the first quarter of 2019 showed that the number of newly discovered malware threats had reached more than 60 million, and total malware was estimated to reach more than 800 million before the end of 2018 [4]. Moreover, statistics released by CenturyLink in the first half of 2019 showed an average of 3.8 million unique threats per month and identified the United States, Spain, India, Indonesia, and Turkey as the top five countries suspected of botnet attack traffic [5]. The botnets behind this huge number of malware threats are carefully planned, and each one is becoming more resilient, dangerous, and smart. Fortunately, botnet detection methods have also developed, employing different approaches such as traffic analysis [6][7][8], DNS-based methods [9], and machine learning techniques such as decision trees [10], neural networks [11], and clustering [12].
The botnet detection model in this study focuses on network traffic analysis, based on the behavioral characteristic that flows generated by bots differ from normal flows. Given this characteristic, machine learning (ensemble classifier algorithms) can be used to classify flows according to their behavior with the highest possible accuracy. It is important to select the essential features by using some
This communication channel is called the command and control (C&C) channel. The C&C channel is the main feature that distinguishes botnets from other types of malware [21]. Botnets may be categorized by their C&C mechanism into two major types: centralized and decentralized [22]. In centralized botnets, the attacker, or botmaster, usually uses the C&C server to send commands to the bots, as illustrated in Figure 1(a). Because of its simplicity, the centralized botnet is widely used by numerous botnet groups. IRC-based and HTTP botnets are among the best-known botnet approaches. However, the C&C server is a single point of failure, which is the major weakness of a centralized botnet: shutting down the C&C server can cut off communication between the bots and the botmaster [23]. In the next generation of botnets, attackers have started to build botnets on a decentralized architecture such as the peer-to-peer (P2P) botnet [24], which has been adopted by many botnets, for example Waledac, Storm, and Conficker [25]. A P2P botnet adopts a decentralized architecture to avoid having any single point of failure. In a P2P botnet, as illustrated in Figure 1

ENSEMBLE CLASSIFIER FRAMEWORK
An ensemble method constructs a set of classifiers (base learners) from training data and combines them to classify new examples by taking a weighted or unweighted vote of their decisions [27]. The main idea behind ensemble learning is to employ several individual classifiers and combine their predictions to obtain a classifier that performs better than any one of them [28]. In this research, the three most common ensemble approaches, Bagging, Boosting, and Random Forest, have been used, as shown in Figure 2 [29].
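The voting step can be illustrated with a minimal Python sketch. The function name weighted_vote, the class labels, and the weights are illustrative assumptions and are not part of the WEKA setup used in this study.

```python
from collections import defaultdict


def weighted_vote(predictions, weights=None):
    """Combine the labels predicted by several base learners for one flow.

    predictions: one class label per base learner.
    weights: optional per-learner weights; None gives an unweighted vote.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per prediction is required")

    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight

    # The label with the largest accumulated weight wins.
    return max(totals, key=totals.get)


majority = weighted_vote(["botnet", "normal", "botnet"])
weighted = weighted_vote(["botnet", "normal", "normal"], [0.9, 0.3, 0.3])
```

In the second call a single confident learner outweighs two weaker ones, which is the difference between the weighted and unweighted schemes.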

Bagging
Bagging, or bootstrap aggregating, is a method for obtaining multiple learners in which the training set for each learner is produced by uniform random sampling with replacement from the original data set [30]. Bagging consists of two parts: bootstrap and aggregation. Combining independent base learners can significantly reduce error, so it is essential to keep the base learners as independent as possible. Bagging uses the bootstrap distribution to generate diverse base learners: it applies bootstrap sampling [31], that is, random sampling with replacement, to create several replicates of the original dataset. Each replicate is then used to train a new learner, and the resulting set of learners forms the ensemble. To aggregate the outputs of the base learners, bagging uses voting for classification and averaging for regression.
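The two parts of bagging, bootstrap and aggregation, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the base learner is a hypothetical one-feature threshold rule (train_stump) rather than JRip, Naïve Bayes, or REPTree, and the six (value, label) pairs are made up.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) examples uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def train_stump(rows):
    """Hypothetical base learner: a single threshold on one numeric feature.

    rows are (value, label) pairs; returns a function value -> label.
    """
    labels = sorted({label for _, label in rows})
    best = None
    for threshold in sorted({value for value, _ in rows}):
        for low, high in ((a, b) for a in labels for b in labels):
            errors = sum((low if v <= threshold else high) != y for v, y in rows)
            if best is None or errors < best[0]:
                best = (errors, threshold, low, high)
    _, threshold, low, high = best
    return lambda value: low if value <= threshold else high


def bagging_fit(rows, n_learners=10, seed=0):
    """Bootstrap: train each base learner on its own resampled replicate."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(rows, rng)) for _ in range(n_learners)]


def bagging_predict(learners, value):
    """Aggregation: unweighted majority vote over the base learners."""
    votes = Counter(learner(value) for learner in learners)
    return votes.most_common(1)[0][0]


# Made-up flows described by a single numeric feature.
data = [(1, "normal"), (2, "normal"), (3, "normal"),
        (10, "botnet"), (11, "botnet"), (12, "botnet")]
learners = bagging_fit(data, n_learners=25, seed=1)
```

For a regression task the vote in bagging_predict would be replaced by an average of the learners' outputs.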

Boosting
The boosting technique is also called ARCing, "Adaptive Resampling and Combining" [28]. It refers to algorithms that can convert weak learners into strong learners. Generally, a weak learner can be defined as a learner that is slightly better than random guessing, whereas a strong learner is very close to perfect performance. Boosting is a common method for improving the performance of a learning method. The concept behind boosting is that a weak learner can be boosted into a strong learner; Schapire [32] proposed the boosting technique for that purpose. Boosting is considered a forward additive model, and it uses the whole dataset at each stage. The technique merges the outputs of various classifiers with the aim of producing an effective classifier [33].
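How weak learners are boosted into a stronger one can be sketched with a small AdaBoost-style loop in Python. This is a minimal sketch under assumptions: the weak learner is a hypothetical one-feature threshold rule, the labels are encoded as +1 and -1, and the six data points are made up so that no single threshold separates them.

```python
import math


def train_weighted_stump(xs, ys, weights):
    """Weak learner: the threshold and polarity with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_fit(xs, ys, rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n  # every example starts with the same weight
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this learner
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight, correct ones lose weight.
        weights = [
            w * math.exp(-alpha * y * (polarity if x <= threshold else -polarity))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    score = sum(alpha * (polarity if x <= threshold else -polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Made-up data: +1 on both ends, -1 in the middle.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
single = adaboost_fit(xs, ys, rounds=1)
boosted = adaboost_fit(xs, ys, rounds=3)
```

On this toy data one threshold rule misclassifies two points, while the weighted combination of three rules classifies all six correctly.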

Random forest
The random forest belongs to the family of ensemble approaches. It grows many decision trees by randomly partitioning the training data and features, where each tree is built from the values of an independent set of random vectors drawn from the training dataset. These random vectors are generated from a fixed probability distribution, unlike adaptive approaches such as boosting, in which the probability distribution is varied to concentrate on instances that are difficult to classify [34]. The randomization helps reduce the correlation among the decision trees, which improves the generalization error of the ensemble [30].
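The two sources of randomness, resampled training rows and randomly chosen features, can be sketched as follows in Python. This is a deliberately simplified illustration: each "tree" is a depth-1 split (best_stump, a hypothetical helper), so the random feature subset is drawn once per tree, whereas a full random forest grows deeper trees and draws a new subset at every split. The two-feature rows are made up.

```python
import random
from collections import Counter


def best_stump(rows, labels, feature_ids):
    """Best single-threshold split among the allowed features (a depth-1 tree)."""
    classes = sorted(set(labels))
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            for low in classes:
                for high in classes:
                    errors = sum((low if row[f] <= threshold else high) != y
                                 for row, y in zip(rows, labels))
                    if best is None or errors < best[0]:
                        best = (errors, f, threshold, low, high)
    return best[1:]  # (feature, threshold, label_if_low, label_if_high)


def forest_fit(rows, labels, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # bootstrap rows
        feats = rng.sample(range(len(rows[0])), n_features)   # random features
        forest.append(best_stump([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats))
    return forest


def forest_predict(forest, row):
    votes = Counter(low if row[f] <= t else high for f, t, low, high in forest)
    return votes.most_common(1)[0][0]


# Made-up flows with two numeric features each.
rows = [(1, 10), (2, 12), (3, 11), (20, 90), (21, 95), (22, 93)]
labels = ["normal", "normal", "normal", "botnet", "botnet", "botnet"]
forest = forest_fit(rows, labels, n_trees=25, n_features=1, seed=3)
```

Because each tree sees different rows and a different feature, the trees make less correlated errors, which is the effect the randomization is meant to achieve.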

PROPOSED MODEL
In the proposed botnet detection system, network traffic is classified by applying three different ensemble classifier algorithms: Bagging, Boosting, and Random Forest. The detection methods were verified on the CTU-13 dataset, and 10-fold cross-validation was adopted to evaluate the performance of the proposed model. The framework of our system is shown in Figure 3.
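The 10-fold cross-validation split can be sketched in Python as follows. The helper k_fold_indices is a hypothetical name; the study itself relied on WEKA's built-in cross-validation, so this only illustrates the principle.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random assignment to folds
    folds = [indices[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each example appears in exactly one test fold and in the training set of the other nine folds, so every part of the data is used equally often for training and for testing.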

Dataset
The CTU-13 dataset [35] is one of the largest available NetFlow datasets that contains labeled botnet, normal, and background traffic. The data were collected at the Czech Technical University (CTU) in 2011. CTU-13 comprises 13 datasets (called scenarios) of different botnet samples. Each scenario has been recorded in a separate NetFlow file in CSV notation. These NetFlow files include the following attributes: Start Time, Duration, Source IP address, Source Port, Direction, Destination IP address, Destination Port, Protocol (e.g., UDP, TCP), State, SToS (Type of Service), Total Packets (exchanged between source and destination), Total Bytes, and Label (e.g., background, normal, and botnet).
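Reading such a CSV NetFlow file can be sketched in Python as below. The column header names and the three sample rows are assumptions made for illustration (the addresses come from documentation-only IP ranges); the actual headers and label strings should be checked against the downloaded scenario file.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; the header names are an assumption.
SAMPLE_CSV = """StartTime,Dur,Proto,SrcAddr,Sport,Dir,DstAddr,Dport,State,sTos,TotPkts,TotBytes,Label
2011/01/01 00:00:01,1.02,tcp,192.0.2.10,1025,->,198.51.100.5,80,FSPA_FSPA,0,12,1840,botnet
2011/01/01 00:00:02,0.00,udp,192.0.2.11,5353,<->,198.51.100.9,53,CON,0,2,214,normal
2011/01/01 00:00:03,3.51,tcp,192.0.2.12,4431,->,198.51.100.7,443,FSPA_FSPA,0,30,9120,background
"""


def load_flows(text):
    """Parse CSV text into a list of flow dictionaries with numeric fields."""
    flows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["Dur"] = float(row["Dur"])
        row["TotPkts"] = int(row["TotPkts"])
        row["TotBytes"] = int(row["TotBytes"])
        flows.append(row)
    return flows


def label_counts(flows):
    """Count how many flows carry each label."""
    return Counter(flow["Label"] for flow in flows)


flows = load_flows(SAMPLE_CSV)
counts = label_counts(flows)
```

Checking the label counts first is a cheap way to see how unbalanced the botnet, normal, and background classes are before any classifier is trained.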

Feature selection
Feature extraction is one of the essential parts of botnet detection. Experiments show that not all features contribute equally to the result; some are more significant and pertinent than others to the learning and analysis process. Redundant features may reduce accuracy. To rank the features in this paper, the information gain measure (1) has been used [36]:
$$IG(S, A) = H(S) - \sum_{i} \frac{|S_i|}{|S|} H(S_i) \qquad (1)$$
where H(S) is the entropy of the given training set S and H(S_i) is the entropy of the i-th subset of the training set obtained when the attribute A is observed. Information gain is used to rank attributes in machine learning: an attribute with a higher IG is ranked above the others because it has stronger power to classify the data. Figure 4 shows the 12 attributes of the CTU-13 dataset sorted in descending order of information gain. After ranking, the best attributes are selected; the top 8 attributes by importance value are considered in this work. The selected attributes are: Source IP, Destination IP, Start Time, Duration, IP protocol, protocol state, total number of packets, and total bytes exchanged.
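The information gain ranking can be illustrated with a short Python sketch. It assumes discrete attribute values, so a continuous attribute such as duration would first have to be discretized; the function names and the four toy flows are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """H(S): Shannon entropy (in bits) of the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """H(S) minus the size-weighted entropy of the subsets induced by A."""
    subsets = defaultdict(list)
    for value, label in zip(values, labels):
        subsets[value].append(label)
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Made-up example: the protocol separates the classes, the direction does not.
labels = ["botnet", "botnet", "normal", "normal"]
protocol = ["tcp", "tcp", "udp", "udp"]
direction = ["->", "<->", "->", "<->"]

ranking = sorted(
    {"protocol": information_gain(protocol, labels),
     "direction": information_gain(direction, labels)}.items(),
    key=lambda item: item[1],
    reverse=True,
)
```

Sorting attributes by this score in descending order and keeping the first few mirrors the ranking-then-selection procedure described in this section.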

Detection methods
This research applies three ensemble methods to distinguish botnet traffic from normal traffic by classifying the corresponding flows: Bagging, AdaBoost, and Random Forest. The machine learning algorithms JRip, Naïve Bayes, and REPTree have been deployed as base classifiers within the ensemble methods.

EXPERIMENTAL RESULTS
In our experiments, we used the CTU botnet dataset (Scenario 11), which already contains labeled bidirectional NetFlows. The attributes selected by information gain are Source IP, Destination IP, Start Time, Duration, IP protocol, protocol state, total number of packets, and total bytes exchanged, as shown in Figure 4. The data mining software WEKA was used to apply the ensemble algorithms to this dataset. WEKA is a collection of machine learning algorithms for data mining tasks; the algorithms can either be applied directly through the GUI or called from Java code. Because the downloaded data is too large to be processed by the available PCs, a small part of the data that the available machines could handle was selected at random. The sample was drawn entirely at random so that the results of the analysis are not biased by the selection process.
Five measures were used to evaluate the performance of the proposed method: Accuracy, False Positive Rate, Precision, Recall, and F-measure. Ten-fold cross-validation was adopted to estimate the accuracy of the proposed method; the dataset is split at random into mutually exclusive, equal-sized subsets. Cross-validation also guarantees that every part of the original dataset is used the same number of times for training and testing.
The results obtained using the ensemble methods with the three base classifiers (JRip, Naïve Bayes, and REPTree) are given in Table 1. Table 1 presents the comparison of the ensemble algorithms under 10-fold cross-validation with respect to the different measures. The JRip classifier achieves the highest classification accuracy (99.84%) in both AdaBoost and Bagging, compared with Naïve Bayes (98.12%) and REPTree (85.48%) in AdaBoost and with Naïve Bayes (99.1%) and REPTree (85.48%) in Bagging. Furthermore, Table 1 shows that the JRip classifier gives the lowest false positive rate (0.002) in both AdaBoost and Bagging, while REPTree gives the highest false positive rate (0.307) along with low accuracy. Random Forest also achieves high detection accuracy (95.11%) and a low false positive rate (0.103). The ensemble with JRip classifiers model has been compared with five different methods: clustering, Neural Network, Recurrent Neural Network [37,38], K-medoids, K-means [12], Long Short-Term Memory (LSTM) [11], and decision trees [10]. The comparison in Table 2 shows that our proposed ensemble with JRip classifiers model achieves better detection accuracy than the existing systems for botnet detection.
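The five evaluation measures can be computed from the entries of a binary confusion matrix, as the following Python sketch shows. It assumes that "botnet" is the positive class; the counts in the example are made up and are not taken from Table 1, and WEKA may additionally report class-weighted averages that this sketch does not reproduce.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fp, tn, fn):
    """Compute the five measures from confusion-matrix counts.

    tp: botnet flows flagged as botnet    fp: normal flows flagged as botnet
    tn: normal flows passed as normal     fn: botnet flows passed as normal
    """
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "false_positive_rate": safe_div(fp, fp + tn),
        "precision": precision,
        "recall": recall,
        "f_measure": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up counts for illustration.
scores = evaluate(tp=90, fp=10, tn=80, fn=20)
```

Reporting the false positive rate next to accuracy matters for this task, because a detector that raises many alarms on normal traffic is costly in practice even when its accuracy looks high.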

CONCLUSION
In this paper, we have presented an approach to the botnet detection problem, which is considered a serious and critical threat to Internet security. One way to handle this problem is to recognize botnet actions and infected devices so that vital safety measures can be taken. The proposed model is based on ensemble classifier methods, which achieve better performance by combining multiple algorithms in the process of botnet analysis. In addition, through the feature selection process, the most significant features were extracted for analysis in order to increase accuracy and reduce time and resource consumption. To evaluate the proposed methodology, we performed experimental assessments on the CTU botnet dataset, and the performance of the proposed model was assessed using 10-fold cross-validation. The results showed that the proposed model is effective and promising.