http://ijece.iaescore.com
Forging a deep learning neural network intrusion detection framework to curb the distributed denial of service

Article Info 2020. Today's popularity of the internet has proven it an effective and efficient means of information sharing. However, this has also advanced the proliferation of adversaries who seek unauthorized access to information shared over the internet. Such access is achieved via various means, one of which is the distributed denial of service (DDoS) attack, which has become a major threat to the electronic society. These are carefully crafted attacks of large magnitude that possess the capability to wreak havoc on national infrastructures. This study posits intelligent systems, via the use of machine learning frameworks, to detect such attacks. We employ the deep learning approach to distinguish benign exchange of data from malicious attacks in data traffic. Results show the consequent success of the deep learning neural network in effectively differentiating between acceptable and non-acceptable (intrusion) data packets on a network.


INTRODUCTION
The rapid advancement in technology over the years has been geared towards effectively improving the way we live, in a bid to meet specific targeted human needs. Technology seeks to advance our living existence unto higher levels of sophistication with ease. Tremendous adoption and integration of the Internet has significantly advanced the use of data sharing programs that seek to effectively disseminate data from one user to another [1]. This adoption and integration has been attributed to its ease of use, ubiquitous nature, low cost of transaction and trust in the communication channel, all of which continue to advance its popularity, ease of adoption and usage. This growth has equally attracted spam [2, 3], an organized business aimed at making money via messages sent without the consent of users. Its services are unsolicited adverts, phishing and malware distribution, collectively called spam: unsolicited/unwanted messages sent to users. With spam on the rise, it has proven a great concern to security experts [4-7].
Such compromises, designed to evade security, obscure data privacy and weaken network infrastructure, have become a great concern with negative impacts on the adoption of technology. They include (but are not limited to) attacks on data, theft of private data, intrusion, denial of service and outage [8]. Reports continue to stress intrusions into networks that effectively attack any given target at any given time [9-11]. The exponential rate of attacks is as broad as the range of constructive technology itself. Wu et al. [35] used a decision tree that detects attacks using 15 parameters to monitor packet and flag rates in and out of a system, describing a traffic flow pattern. It detects abnormal traffic flow via a match scheme that identifies traffic flows similar to an attack flow, and traces back the origin of an attack based on this similarity. Lee et al. [36] used cluster analysis on the DARPA 2000 dataset. Their results showed that each attack was partitioned, and that their method can effectively detect precursors of a DDoS attack as well as the attack itself. Ojugo et al. [37] used a signature-based memetic algorithm to detect attacks as a classification problem. It uses seven parameters to monitor packet rate and traffic pattern, and a match method to group traffic flows into classes and trace them back to an attack's origin via the similarity. Karinmazad and Faraahi [38] used anomaly-based detection with packet features, analyzed via a radial basis function (RBF) network applied to edge routers on a victim network. It uses seven features to train the RBF network and classifies data into normal and attack classes. If the model recognizes incoming traffic as an attack, its source packets are sent to a filter/attack-alarm routine for further action; else, the traffic is sent to its destination.
Moore [26] proposed a detection scheme where each router detects traffic anomalies using profiles of normal traffic constructed via stream sampling algorithms. Their results indicate that: (i) normal traffic can be profiled reasonably accurately; (ii) anomalies can be identified with low false-positive and false-negative rates; and (iii) the scheme is cost effective in memory consumption and per-packet computation. The routers also exchange data with each other to increase confidence in their detections. Ojugo et al. [15] extended Ojugo et al. [37] via a genetic algorithm signature rule-based model, with 10 features to monitor in/out packet rates. Jalili et al. [39] advanced this position using SPUNNID, an unsupervised neural net that extracts traffic features, then analyses and classifies traffic patterns as either normal or a DDoS attack.
Chen and Delis [40] used a distributed change point (DCP) detection scheme that adopts change aggregation trees (CATs). This non-parametric model describes the distribution of traffic before and after a change. When a DDoS flood attack is launched, the cumulative deviation is noticeably higher than random fluctuations. The CAT is designed so that a router detects abrupt changes in traffic, while a domain server uses the traffic change patterns detected at attack-transit routers to construct CATs. The scheme works in inline mode to inspect and manipulate ongoing traffic in real time, continuously monitoring both attack and legitimate traffic by inspecting packets and correlating events among different sessions, and it proactively terminates a session when it detects an attack.
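The cumulative-deviation idea behind such change-point schemes can be illustrated with a minimal CUSUM-style sketch. This is an assumption for illustration only, not the authors' CAT implementation: normal traffic produces only small deviations from a baseline packet rate, while a flood pushes the cumulative deviation past a threshold.

```python
def cusum_flags(rates, baseline, drift=0.0, threshold=50.0):
    """Flag indices where the cumulative positive deviation of the
    packet rate from a baseline exceeds a threshold."""
    s, flags = 0.0, []
    for i, r in enumerate(rates):
        s = max(0.0, s + (r - baseline - drift))  # accumulate positive deviation
        if s > threshold:
            flags.append(i)   # raise an alarm at this packet-rate sample
            s = 0.0           # reset the statistic after the alarm
    return flags

# Normal traffic fluctuates around 100 pkt/s; a flood pushes it far higher.
normal = [100, 103, 97, 101, 99, 102]
flood = [400, 420, 410, 405]
alarms = cusum_flags(normal + flood, baseline=100.0, threshold=50.0)
```

Small random fluctuations never accumulate past the threshold, so the first alarm fires only once the flood begins.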
Intrusion detection schemes have been devised to minimize the havoc caused by intrusion activities [41, 42]; on networks, some behavior always accompanies an external event. The architecture of intrusion detection systems (IDS) seeks to unmask malicious processes either via the signature of such attacks, or via an anomaly in the network traffic. An IDS' goal is to secure network resources and grant a user system confidentiality, data integrity and resource availability [15, 16]. An IDS can retrieve data from a network section for analysis, to unveil intrusion-affected component(s) using various techniques. These techniques are characterized as depending on three main aspects [15, 25, 43, 44], as in Figure 1. Formulating an effective detection scheme has its setbacks: malicious traffic is designed to evade filters, whose performance is hindered by the limited size of characters, the non-availability of malicious traffic data, etc., creating impediments in selecting parameters for training, and ultimately leading to both poor learning and poor classification by the learning algorithm.
To overcome these shortfalls in detecting malicious traffic, we adopt a deep neural network that reduces noise via pre-processing of traffic packets and fine-tunes the requests sent to/from a server, to enhance adequate classification.

RESEARCH METHOD
2.1. Deep neural networks (DNN)
A DNN uses deep learning to adapt usefully selected features of interest and parameters, carefully constructing a multi-layer network from vast amounts of data. Its deep architecture, at its input, hidden and output layers, helps to improve its prediction accuracy. Each hidden layer transforms non-linearly the output of the previous layer to feed the next [21, 45]. Proposed by Hinton et al. [46], a DNN is trained via two phases: a pre-training and a fine-tuning process [21, 47].
The auto-encoder is an unsupervised multi-layered neural network consisting of both an encoder and a decoder network. Its encoder seeks to transform input data points from a high- to a low-dimensional space via an encoding function f_encoder as in (1), where x_m is a data point and h_m is the encoding vector obtained. Conversely, its decoder network seeks to reconstruct the input using f_decoder as in (2), with x̂_m as the decoding vector obtained from h_m; it thus reverts the operations of the encoder [48]. Ojugo and Eboka [21] and Glorot and Bengio [47] detail specific algorithms for the encoding and decoding functions respectively.
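From the definitions above (encoding vector h_m obtained from data point x_m, and the decoder reverting the encoder), equations (1) and (2) take the standard auto-encoder form; the weights W, W', biases b, b' and activation σ below are the usual auto-encoder parameters and are assumptions, not taken from the source:

```latex
% Hedged reconstruction of (1) and (2) from the surrounding definitions.
h_m = f_{encoder}(x_m) = \sigma(W x_m + b) \qquad (1)

\hat{x}_m = f_{decoder}(h_m) = \sigma(W' h_m + b') \qquad (2)
```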
At the pre-training phase, N auto-encoders can be stacked onto an N-hidden-layer network so that, with input accepted, the input layer and first hidden layer act as the encoder of the first auto-encoder. They are trained as such, to minimize the reconstruction error. The trained parameters of this encoder are used to initialize the first hidden layer before proceeding to the second hidden layer. There, the first and second hidden layers are selected as the encoder and, as in the earlier stage, the second hidden layer is initialized by the second trained auto-encoder. This process continues till the N-th auto-encoder is trained and initializes the final hidden layer. With all hidden layers stacked in auto-encoders and each trained in turn, they are thus regarded as pre-trained. This approach has proven to be significantly better than random initialization, and also achieves better generalization [20, 21, 46, 49].
Fine-tuning is a supervised phase that seeks to optimize a DNN's performance by retraining the network on labeled training data. It computes the error as the difference between real and predicted values via back-propagated stochastic gradient descent (SGD), which randomly selects data and iteratively updates the weight parameters along the gradient direction. A merit of SGD is that it converges faster and does not require the entire dataset, making it suitable for complex neural networks, as given in (3) with E as the loss function, y the label and t the output of the network [20, 21]. The gradient of the weight w is obtained as the derivative of the error equation, so that the SGD update is given by (4), with η the step-size and h the number of hidden layers [20, 21]. This process is optimized via the weights and threshold based on correctly labelled data. Thus, a DNN can learn accurately at its final output and task all network parameters to perform correct classifications [20, 21].
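Consistent with the definitions above (loss E over label y and network output t, and a weight update with step-size η), equations (3) and (4) can be written in the common squared-error/SGD form; the exact loss the authors used is not recoverable here, so the squared-error form is an assumption:

```latex
% Hedged reconstruction of (3) and (4); squared-error loss assumed.
E = \frac{1}{2}\sum_{m}\left(y_m - t_m\right)^2 \qquad (3)

w_h \leftarrow w_h - \eta \, \frac{\partial E}{\partial w_h}, \quad h = 1,\dots,N \qquad (4)
```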

The deep learning framework/algorithm
Deep learning solves tasks by: (a) dividing training data into clusters and computing center points from each cluster, (b) training and scaling each cluster so that each DNN learns the various attributes of each subset, (c) applying the previous cluster centers to the test data as its first step, so that outliers are detected by the pre-trained DNNs, and (d) aggregating the output of each DNN for the final result data/outliers [7, 20, 21]. The proposed solution is divided into three steps [10, 20, 21, 50]:
− Step 1 divides the data into train and test clusters or partitions. The DNN stores the computed cluster centers, used as initialization centers to generate the test datasets. Dataset attributes are formatted as data points for the selected parameters, and the data points in the training dataset are aligned into groups of the same class. To improve the performance of the DNN, the model revises the number of clusters (between 2 and 6) and the sigma values (i.e. 0.1 to 1.0). The minimum distance from a data point to each cluster center is measured, and a data point is assigned to the class of its nearest cluster. The training sets generated by the clusters are taken up as input to the DNNs; for training, the number of DNNs should equal the number of clusters. The DNN architecture consists of five layers: an input, two hidden, a softmax and an output layer respectively. The hidden layers learn features from each training subset, and the top layer is a five-dimensional output vector. Each training subset generated from the k-th cluster center is regarded as input data fed into the k-th DNN respectively. The trained sub-DNN models are marked sub-DNN 1 to k [20, 21, 50].
− Step 2 uses the test dataset to generate k datasets with the previous cluster centers obtained from the clusters in Step 1. The test sub-datasets are denoted test 1 through test k [20, 21, 50].

− Step 3 feeds the k test data subsets into the k sub-DNNs, which were completed by the k training data subsets in Step 1. The output of each sub-DNN is integrated as the final output and employed to analyse positive detection rates. Then, a confusion matrix is used to analyse the mining performance of the generated rules [20, 21, 50]. The proposed DNN classifies data via back-propagation learning that maps input signals to a low-dimensional space, seeking to discover patterns in the datasets. The algorithm follows [20, 21, 50-56].
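The three steps above can be sketched minimally as follows. The cluster centers and the per-cluster "sub-models" here are toy stand-ins (assumptions for illustration), not the trained sub-DNNs of the proposed scheme; the point is the routing of each test point to the sub-model of its nearest cluster and the aggregation of their outputs.

```python
import numpy as np

def nearest_center(x, centers):
    # Steps 1/2: assign a data point to its nearest cluster center
    d = np.linalg.norm(centers - x, axis=1)
    return int(np.argmin(d))

# Toy stand-in for Step 1: two cluster centers "learned" from training data.
centers = np.array([[0.0, 0.0], [10.0, 10.0]])

# Toy stand-in for the k trained sub-DNNs: each sub-model is just a fixed rule.
sub_models = [
    lambda x: 0,  # sub-DNN 1: points near the origin are benign (class 0)
    lambda x: 1,  # sub-DNN 2: points near (10, 10) are malicious (class 1)
]

def classify(x):
    # Step 3: route the test point to the sub-model of its nearest cluster
    k = nearest_center(np.asarray(x, dtype=float), centers)
    return sub_models[k](x)
```

For instance, a point near the origin is routed to the first sub-model and labeled benign, while a point near (10, 10) is routed to the second and labeled malicious.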

Model optimization
A major issue in ML implementation is fine-tuning features that live outside the model. These features often influence the model's behavior, with rippling effects across hidden elements, and are called hyper-parameters. Hyper-parameters are critical settings that can be tuned to control a model's behavior; they are parameters specific to the type of learning model we wish to optimize [7]. If a model seeks to learn these settings directly from a training dataset, there is the likelihood that the model will try to maximize these parameters, which will lead to over-fitting and thus result in poor generalization [57, 58]. Major hyper-parameter criteria are [59-61]:
− Learning rate is a hyper-parameter that controls how much the weights of our network are adjusted with respect to the gradient of the loss. The lower the value, the slower we travel along the downward slope. The learning rate connotes how quickly a network abandons old beliefs for new ones, and applies to both unsupervised and supervised learning. It also shapes how quickly the network differentiates between important features and otherwise; a higher learning rate means the network can change and adapt more flexibly and easily. The model must be able to adequately adjust its learning rate to avoid over-fitting and over-training.
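The effect of the learning rate can be seen on a one-dimensional toy loss (w − 3)²; this is purely illustrative and not part of the proposed model: too small a rate barely moves in the given budget of steps, while too large a rate overshoots and diverges.

```python
def gd(lr, steps=50, w=0.0, target=3.0):
    # plain gradient descent on the toy loss (w - target)^2
    for _ in range(steps):
        grad = 2.0 * (w - target)  # derivative of the loss at w
        w -= lr * grad             # learning-rate-scaled weight update
    return w

good = gd(0.1)    # converges close to the minimum at w = 3
slow = gd(0.001)  # still far from 3 after 50 steps
wild = gd(1.1)    # overshoots on every step and diverges
```

The same tension governs a deep network's weights: the rate must be large enough to learn in reasonable time, yet small enough that updates do not oscillate or blow up.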

RESULTS AND DISCUSSIONS
3.1. Data sampling
A major challenge is to get a dataset properly formatted for the task at hand; the dataset used for training (to fit the model) must be in the same format as that used to evaluate the model. Here, we adopt the Hochschule Coburg IDS dataset (CIDDS-2017), a labeled anomaly-based IDS dataset, split into training (70%) and testing (30%) [20, 21]. We then adopt 8 parameters to adjust weights and coefficients in minimizing errors, as in Table 1.
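The 70/30 split can be sketched as a simple shuffled index partition. This is a generic sketch; the authors' exact sampling procedure and the CIDDS record layout are not reproduced here.

```python
import random

def split_70_30(n_rows, seed=42):
    # shuffle the row indices, then take the first 70% for training
    # and the remaining 30% for testing
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # seeded for reproducible splits
    cut = int(0.7 * n_rows)
    return idx[:cut], idx[cut:]

train_idx, test_idx = split_70_30(1000)
```

Fixing the seed makes the partition reproducible across runs, which matters when comparing configurations as in the tuning experiments below.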

Encoding schemes used
Unclassified and unformatted data are often ambiguous, incomplete, riddled with noise, imprecise and inconsistent. Encoding seeks to filter the dataset, mapping it onto the required format the model can easily understand. To encode the selected features, we transform our dataset using the features of interest in Table 1. This mode seeks to transform the raw data into the required dataset, so that data gathered from varying sources is adequate for analysis. We employ the data types in the Pandas library, as displayed in the Listing 1 algorithm [20, 21].
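A minimal Pandas encoding sketch in the spirit of Listing 1; the column names and value mappings below are illustrative assumptions, not the actual CIDDS schema or the authors' exact listing.

```python
import pandas as pd

# Hypothetical flow records; the columns are illustrative, not CIDDS fields.
df = pd.DataFrame({
    "proto": ["TCP", "UDP", "TCP"],
    "bytes": ["100", "250", "4000"],   # numeric values arriving as strings
    "label": ["normal", "attack", "normal"],
})

# Coerce numeric strings to numbers and map categorical fields to codes,
# so every feature reaches the model in a numeric format.
df["bytes"] = pd.to_numeric(df["bytes"])
df["proto"] = df["proto"].astype("category").cat.codes
df["label"] = df["label"].map({"normal": 0, "attack": 1})
```

After this pass, every column carries a consistent numeric dtype, which is what the downstream network expects.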

Parameter tuning
We modeled the network using 8 neurons at the input layer (a neuron for each feature) and 2 neurons at the output layer (a neuron for each possible class). The parameters for the deep learning are the number of epochs, the activation function, the learning rate and the hidden layer topology. We employed the rectified linear unit (ReLU) activation function with 500 epochs (though optimal values were reached at 100, 300 and 500 epochs, taking into account accuracy and the time to train the model). There is no best practice in selecting the number of hidden layers or the neurons therein, and using more hidden layers grants the model greater capability to perform more complex functions on the data [1, 2]. We seek the minimum training error that will also result in the best fit; the number of hidden layers (and neurons for each layer) was established via a trial-and-error method and examination of the results. The best possible number of layers was determined by running tests on a single layer with 1 to 20 neurons at the first instance, selecting the count that yielded the greatest f-score with the least (constant) amount of training loss and time. Addition of a second hidden layer of 1 to 20 neurons yielded higher f-scores. Finally, the addition of a third hidden layer using the best possible number of neurons produced the greatest f-score and was thus selected as the overall best possible hidden layer configuration.

Results of the first hidden layer are seen in Table 2, which shows a configuration of 9 neurons yielding an f-score of 92% at the 18th iteration with a training loss of 1.140. The f-score shows the accuracy of each run, since we used an unbalanced dataset to train/test the model, with more records in the normal class than in the malicious class. Table 3 shows the first layer fixed at 9 neurons and the second layer varying from 1 to 20 neurons, with the hidden layers of 9 and 11 neurons yielding an f-score of 93% and a training loss of 0.39. The second hidden layer is favored as it yields a greater f-score.
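The trial-and-error layer search described above can be sketched as a loop over candidate widths. Here `f_score_for` is a hypothetical stand-in (an assumption) for actually training the 8-input, 2-output ReLU network with a given topology and computing its f-score; its hard-coded scores merely echo the Table 2-4 values for illustration.

```python
def f_score_for(layers):
    # Hypothetical stand-in: in practice this would train the network
    # with the given hidden-layer widths and return its measured f-score.
    return {(9,): 0.92, (9, 11): 0.93, (9, 11, 14): 0.92}.get(tuple(layers), 0.80)

def best_width(fixed, candidates=range(1, 21)):
    # Grow the topology one layer at a time: try each candidate width
    # on top of the layers already fixed, and keep the best scorer.
    scored = [(f_score_for(list(fixed) + [n]), n) for n in candidates]
    return max(scored)[1]

first = best_width([])        # search widths 1..20 for a single hidden layer
second = best_width([first])  # then search the second layer on top of it
```

With real training in place of the stub, the same loop reproduces the procedure behind Tables 2-4: fix what has been found so far, sweep the next layer's width, keep the best.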
Table 4 shows the third configuration, with the first and second layers having 9 and 11 neurons and the third hidden layer varying. The best configuration is 9-11-14 neurons, yielding an f-score of 92% with a training loss of 0.560.

Model evaluation
We use the accuracy, recall and error rate(s) to evaluate model performance as in (5). On evaluating our parameters, the result of the model is given in a confusion matrix and a classification report, shown in Tables 5 and 6 respectively.

Table 6 shows that 11,410 instances of the dataset were correctly classified. That is, results on the test dataset (with 12,500 points) show 11,441 benign instances in the first class (label 0). The model identified 11,410 of these correctly as true positives, but 31 incorrectly identified benign instances were marked as false positives. Similarly, on the second row, there were 1,059 malicious instances in the second class (label 1), of which 776 incorrectly identified malicious instances were marked as false negatives and 283 correctly identified malicious instances were marked as true negatives. These are further explained as: (i) for a true positive, the model predicted positive and it was true; (ii) for a true negative, the model predicted negative and it was true; (iii) for a false positive, the model predicted positive and it was false; and (iv) for a false negative, the model predicted negative and it was false (as agreed by [62-64]). Thus, we can say that the model predicts whether traffic is normal or a DDoS attack with 92% accuracy, using the total value of the f-score.

The neural network in Python may have difficulty converging before the maximum number of iterations allowed if the data is not standardized, so for a more meaningful result we decided to scale our test data. There are many different methods for standardizing data; we use the built-in StandardScaler. The result obtained after this process is given in Table 7.

From Table 8 (the confusion matrix of the predicted results), using our test data with 12,500 points we have 11,449 benign instances in the first class (label 0). The model identified all 11,449 of these correctly as true positives, with no incorrectly identified benign instances marked as false positives. Similarly, on the second row, there were 1,051 malicious instances in the second class (label 1), of which 24 incorrectly identified malicious instances were marked as false negatives and 1,027 correctly identified malicious instances were marked as true negatives. Thus, we can say the model predicts whether traffic is normal or a DDoS attack with 99% accuracy, using the total value of the f-score. In turn, this resulted in predicting 1,027 points as malicious samples and 11,449 points as normal samples from our test data. Furthermore, standardizing our test data proved more effective than the previous, non-standardized test run.
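A minimal numpy sketch of what the scaling and evaluation steps do: `standardize` mirrors the behavior of scikit-learn's built-in StandardScaler (statistics fit on the training data, then applied to the test data), and `confusion` builds the 2x2 matrix read off in the tables above. The toy arrays are illustrative, not the CIDDS results.

```python
import numpy as np

def standardize(train, test):
    # zero-mean, unit-variance scaling, fit on the training data only,
    # then applied unchanged to the test data
    mu, sigma = train.mean(axis=0), train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma

def confusion(y_true, y_pred):
    # 2x2 confusion matrix: rows are the actual class, columns the predicted
    m = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
test = np.array([[3.0, 4.0]])
train_s, test_s = standardize(train, test)
cm = confusion([0, 0, 1, 1], [0, 0, 0, 1])
```

Fitting the scaler on the training split only, as here, avoids leaking test statistics into training, which is the standard practice StandardScaler encourages.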

CONCLUSION
Our DNN model solution has a total of 56 rules, with the top rules found to have classification accuracy in the range [0.8, 0.96]. This implies that an estimated 80% or more of the rules can adequately classify the dataset. Achieving a set of good rules is much better than a single optimum rule: it increases the chances of detecting malicious data packets and also improves the generality of the rules, providing the ability for new datasets and their corresponding generated rules to be added to the knowledgebase. The impact of DDoS attacks on users requires a concerted effort to detect intrusion. Detection schemes simply filter through network requests, analyze them to decide which clients are compromised and which are not, and ultimately mete out the intended safety measures for further action. Their performance can be hindered, as premised on the error rate of incorrectly classified and unidentified data points that the scheme/model generates. An ideal scheme will correctly classify all requests and packets with almost zero false-positive/negative error rates, through tradeoffs between the number of false positives and false negatives.