Forecasting the number of vulnerabilities using a long short-term memory neural network

Cyber-attacks are launched through the exploitation of existing vulnerabilities in software, hardware, systems and/or networks. Machine learning algorithms can be used to forecast the number of post-release vulnerabilities. Traditional neural networks work like a black box; hence it is unclear how past data points are reasoned about when inferring subsequent data points. However, the long short-term memory network (LSTM), a variant of the recurrent neural network, is able to address this limitation by introducing recurrent loops and memory cells into its network to retain and utilize past data points in future calculations. Building on previous findings, we further improve the prediction of the number of vulnerabilities by developing a time series-based sequential model using a long short-term memory neural network. Specifically, this study developed a supervised, non-linear sequential time series forecasting model with a long short-term memory neural network to predict the number of vulnerabilities for the three vendors with the highest number of vulnerabilities published in the national vulnerability database (NVD), namely Microsoft, IBM and Oracle. Our proposed model outperforms the existing models with a prediction root mean squared error (RMSE) as low as 0.072.


INTRODUCTION
Vulnerability reflects the weakness of a system which leads to most security breaches. In some cases of successful cyber-attacks, the hackers would usually exploit zero-day vulnerabilities before their existence is noticed by the security administrator [1], [2]. The importance of security focus has dramatically increased, and therefore past vulnerability trends and reviews become more valuable to consider in the acquisition of a new software system [3]. Machine learning [4] is a subfield of computer science and an application of artificial intelligence (AI) that "gives computers the ability to learn without being explicitly programmed" [5]. Data modelling for predicting vulnerabilities, which thereby creates deployable machine learning models, can be used as a proxy to predict the number of future vulnerabilities or to predict which vulnerabilities are most likely to be exploitable. The authors in [5] and [6] showed that addressing security issues during the software development phase is an arduous task since project managers would usually focus on cost and the timely delivery of products, thus resulting in a high probability of the existence of vulnerabilities.
Researchers have used various forms of vulnerability databases in developing models for vulnerability disclosure trends. The aim of most of these researchers is to find techniques for developing models that could predict the number of future product vulnerabilities from historical data through regression and time series analysis [7]-[15]. However, these models suffer to various extents from limitations stemming from their underlying assumptions, the machine learning algorithms used and high expectations for accuracy. Linear algorithms are unable to capture the underlying non-linear aspects of the data. Neural network models, for example, traditionally follow a black box approach that is not mathematically tractable and cannot be easily interpreted. Besides that, in non-linear models other than the long short-term memory network, the number of inputs has to be selected beforehand, thus making learning impossible for functions that depend on historical input that took place a long time ago. Since security activities for software and systems are highly resource intensive, the models are expected by vendors, end users and businesses to achieve high vulnerability prediction accuracy [16]. In this research, the prediction of the number of future vulnerabilities is taken as a supervised sequential time series forecasting problem, and makes the following contributions:
- Creation of novel, larger datasets by grouping vulnerability records on their published date, in contrast to [15] where the dataset was prepared based on a monthly aggregation and has approximately 252 examples for each. To the best of our knowledge, these are the first datasets with such a feature, created by grouping on the published date, to be utilized for a vulnerability prediction model.
- Utilization of a deep learning algorithm called the long short-term memory (LSTM) network, a variant of the recurrent neural network (RNN), known for its unique capability of retaining past information of a series when calculating the activation functions and weights, to forecast future values with higher accuracy. To the best of our knowledge, this research is the first to utilize the long short-term memory neural network with the national vulnerability dataset in developing a machine learning model to forecast the number of future vulnerabilities.
- The proposed model in this research has achieved an accuracy (root mean squared error) value of as low as 0.072, which outperforms the existing models in terms of prediction accuracy and in following distribution trends.
The rest of the paper is organized as follows: Section 2 describes the related work, Section 3 describes the dataset and method used in the analysis, Section 4 describes the advantages of the LSTM model in time series prediction, Section 5 presents the results and analysis, and finally Section 6 describes the conclusion and future works.

RELATED RESEARCH
Lyu and Lyu [9] surveyed software defect detection processes using software reliability growth models. Anderson proposed the Anderson Thermodynamic (AT) time-based vulnerability discovery model, which is considered a pioneer in such research [16]. Alhazmi and Malaiya proposed a time-based application of software reliability growth modelling (SRGM) for predicting the number of vulnerabilities, and later also proposed a logistic regression model for Windows 98 and NT 4.0 for predicting the number of undiscovered vulnerabilities [17]. Rescorla proposed two time-based trend models, namely the linear model (RL) and the exponential model (RE), to estimate future vulnerabilities [18]. Kim [19] proposed a new Weibull distribution-based vulnerability discovery model (VDM) which was compared with [20], and found that their model performed better in many cases. Several other studies have worked further on the existing VDMs for various software packages with the aim of improving the vulnerability discovery rate and the prediction of future vulnerability counts [21]-[28].
Movahedi et al. [29] developed nine common vulnerability discovery models (VDMs) which were compared with a nonlinear neural network model (NNM) over a prediction period of three years. The common VDMs are the NHPP power-law gamma-based VDM, Weibull-based VDM, AML VDM, normal-based VDM, Rescorla exponential (RE), Rescorla quadratic (RQ), Younis folded (YF) and linear model (LM). These models use the NVD dataset; the feedforward NNM with a single hidden layer was used for forecasting vulnerabilities in four well-known OSs and four well-known web browsers, and the models were assessed in terms of prediction accuracy and prediction bias. The results showed that in terms of prediction accuracy, the NNM outperformed the VDMs in all cases, while regarding the overall magnitude of bias, the NNM provided a smaller value than seven of the eight common VDMs.
In recent times, Pokhrel [15] proposed a vulnerability prediction model based on a non-linear approach using time series analysis. They utilized the NVD database with the autoregressive integrated moving average (ARIMA), artificial neural network (ANN) and support vector machine (SVM) algorithms to develop the prediction models, and selected the operating systems Windows 7, Mac OS X and Linux Kernel for the experiments. For example, the results show that the best model for Windows 7 was produced by the SVM, with symmetric mean absolute percentage error (SMAPE), mean absolute error (MAE) and root mean squared error (RMSE) values of 0.12, 3.15 and 3.58 respectively. However, since nonlinearity is a common trend in vulnerability disclosure, traditional time series-based modelling may always have limited capability in attaining high prediction accuracy. In our study, the number of vulnerabilities is modelled with newly-created datasets using the sequential LSTM network, and the prediction accuracy was found to have improved in comparison to other work such as the recent study by [15].

METHOD
Vulnerability data were extracted using a custom-coded web scraper to dump the data for the top-three vendors from MITRE's CVE website [16], spanning 1997 to 2019. The numbers of vulnerabilities for Microsoft, Oracle and IBM are 6814, 6115 and 4679 respectively, as shown in Figures 1 and 2. A new dataset was then created for each vendor by grouping it according to the "Published Date" attribute. Another attribute, the "CVE ID", was aggregated as the size of each group, and the dataset was then organized chronologically (in ascending order) in preparation for the time series. Therefore, these datasets contain many more examples to train the deep learning algorithms compared to the previous works. The whole research method consists of three phases. The first phase deals with the collection of vulnerability data for the top-three vendors by number of vulnerabilities from the open-source national vulnerability database with a custom-built web scraper, in CSV format. The second phase involves data preparation, cleaning and engineering, which covers the bulk of the activities common to a usual data modelling project. Major activities include the analysis of distribution characteristics such as stationarity, seasonality and linearity, cleaning of data, feature engineering, as well as data wrangling and aggregation in creating the larger novel training datasets. The third phase involves the experimentation and result analysis, where the existing best performing model [16] was set as the baseline for comparison and improvement purposes. The long short-term memory network was utilized for the modelling, and parameter optimization was performed through grid search. Figure 3 shows the overall view of the processes of this research:
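The grouping and aggregation step described above can be sketched with pandas; the column names follow the "Published Date" and "CVE ID" attributes named in the text, while the sample rows are hypothetical stand-ins for scraped records.

```python
import pandas as pd

# Hypothetical rows standing in for records scraped from the CVE website:
# one row per CVE entry with its published date.
raw = pd.DataFrame({
    "CVE ID": ["CVE-1999-0001", "CVE-1999-0002", "CVE-1999-0003"],
    "Published Date": ["1999-01-04", "1999-01-04", "1999-02-10"],
})
raw["Published Date"] = pd.to_datetime(raw["Published Date"])

# Group by published date, aggregate "CVE ID" as the size of each group,
# and sort chronologically to form the time series.
series = raw.groupby("Published Date")["CVE ID"].size().sort_index()
print(series.tolist())  # [2, 1]
```

Grouping by the exact published date, rather than by month, is what yields the longer sequences the text refers to.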

ADVANTAGES OF THE LONG SHORT-TERM MEMORY MODEL IN TIME SERIES FORECASTING
A special type of ANN called the recurrent neural network (RNN) is designed to utilize loops to persist information based on previously learnt knowledge. These looped networks work on each node in a sequence by performing the same processing technique and are hence termed recurrent. The RNNs maintain memory cells to capture information from past data points in the sequence, as shown in Figure 4. In terms of modelling dependencies between two points in a sequence, the RNN models would usually perform better than NNMs. When applying the NNMs, in most cases, the length of the input (number of inputs) must be chosen beforehand. Thus, the algorithm is unable to learn a function that depends on data points from long ago in the sequence. The RNN can avoid this drawback since it is able to store information from data points far back in the sequence. However, one significant drawback of the RNNs is that as the data sequence grows, the network tends to lose historical context over time. A variant of the RNN, called the long short-term memory network (LSTM), can solve this problem since it maintains cells that keep information from previous nodes irrespective of the sequence size. An LSTM network maintains [32] three or four gates, of which the input, output and forget gates are common, as shown in Figure 5.
The notations t, c and h indicate one step in time, the cell state and the hidden state respectively. The gates i (input), f (forget) and o (output), which are most usually modelled with a sigmoid layer (values ranging between 0 and 1), help in protecting and controlling the addition and removal of information in the cell state. The LSTMs therefore try to overcome the shortcomings of the RNN models regarding the handling of long-term dependencies by mitigating the issue of the weight matrix of the neurons becoming too small (which may lead to vanishing gradients) or too large (which may cause the gradients to explode). The following stages show the mathematical functions [33], [34] of a long short-term memory neural network for typical usage, such as in this research experiment.

Stage 1. For a forward pass in an LSTM block, if x_t is considered as an input vector at time t, and N and M are the LSTM block size and input size respectively, an LSTM layer has four sets of weights:

Input weights: W_z, W_i, W_f, W_o ∈ R^(N×M)
Peephole weights: p_i, p_f, p_o ∈ R^N
Recurrent weights: R_z, R_i, R_f, R_o ∈ R^(N×N)
Bias weights: b_z, b_i, b_f, b_o ∈ R^N

With reference to the weights above, the vector formulas for a forward pass of an LSTM layer are as follows (σ, g and h are activation functions, and ⊙ denotes element-wise multiplication):

z̄_t = W_z x_t + R_z y_(t-1) + b_z,                 z_t = g(z̄_t)    (block input)
ī_t = W_i x_t + R_i y_(t-1) + p_i ⊙ c_(t-1) + b_i,  i_t = σ(ī_t)    (input gate)
f̄_t = W_f x_t + R_f y_(t-1) + p_f ⊙ c_(t-1) + b_f,  f_t = σ(f̄_t)   (forget gate)
c_t = z_t ⊙ i_t + c_(t-1) ⊙ f_t                                      (cell state)
ō_t = W_o x_t + R_o y_(t-1) + p_o ⊙ c_t + b_o,      o_t = σ(ō_t)    (output gate)
y_t = h(c_t) ⊙ o_t                                                    (block output)

Typically, the logistic sigmoid σ is the activation function at the LSTM gates, while the hyperbolic tangent is the activation function g and h at the block input and output.
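A single forward step of Stage 1 can be sketched in NumPy; the weight shapes and peephole connections follow the definitions above, with randomly initialized values standing in for trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3  # LSTM block size and input size

sigma = lambda v: 1.0 / (1.0 + np.exp(-v))  # gate activation (logistic sigmoid)
g = h = np.tanh                             # block input/output activation

# Weights as defined in Stage 1, randomly initialized for this sketch.
Wz, Wi, Wf, Wo = (rng.standard_normal((N, M)) for _ in range(4))
Rz, Ri, Rf, Ro = (rng.standard_normal((N, N)) for _ in range(4))
bz, bi, bf, bo = (rng.standard_normal(N) for _ in range(4))
pi, pf, po = (rng.standard_normal(N) for _ in range(3))

x_t = rng.standard_normal(M)  # input vector at time t
y_prev = np.zeros(N)          # previous block output y_(t-1)
c_prev = np.zeros(N)          # previous cell state c_(t-1)

z_t = g(Wz @ x_t + Rz @ y_prev + bz)                    # block input
i_t = sigma(Wi @ x_t + Ri @ y_prev + pi * c_prev + bi)  # input gate
f_t = sigma(Wf @ x_t + Rf @ y_prev + pf * c_prev + bf)  # forget gate
c_t = z_t * i_t + c_prev * f_t                          # new cell state
o_t = sigma(Wo @ x_t + Ro @ y_prev + po * c_t + bo)     # output gate
y_t = h(c_t) * o_t                                      # block output
print(y_t.shape)  # (4,)
```

Since the output is a tanh of the cell state scaled by a sigmoid gate, every component of y_t lies strictly between -1 and 1.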
Stage 2. After the forward pass functions are computed, the following delta functions are computed inside the LSTM block for backpropagation through time:

δy_t = Δ_t + R_z^T δz̄_(t+1) + R_i^T δī_(t+1) + R_f^T δf̄_(t+1) + R_o^T δō_(t+1)
δō_t = δy_t ⊙ h(c_t) ⊙ σ'(ō_t)
δc_t = δy_t ⊙ o_t ⊙ h'(c_t) + p_o ⊙ δō_t + p_i ⊙ δī_(t+1) + p_f ⊙ δf̄_(t+1) + δc_(t+1) ⊙ f_(t+1)
δf̄_t = δc_t ⊙ c_(t-1) ⊙ σ'(f̄_t)
δī_t = δc_t ⊙ z_t ⊙ σ'(ī_t)
δz̄_t = δc_t ⊙ i_t ⊙ g'(z̄_t)

Here Δ_t denotes the vector of deltas propagated from the layer above; when E is the loss function, it corresponds to ∂E/∂y_t. Input deltas are only required when the layer below the input requires training, in which case the delta for the input is:

δx_t = W_z^T δz̄_t + W_i^T δī_t + W_f^T δf̄_t + W_o^T δō_t

At the final stage, the gradients used to adjust the weights are computed with the following functions, where ⋆ denotes any of z, i, f and o, and ⟨⋆1, ⋆2⟩ is the outer product of the two vectors:

δW_⋆ = Σ_(t=0..T) ⟨δ⋆̄_t, x_t⟩          δp_i = Σ_(t=0..T-1) c_t ⊙ δī_(t+1)
δR_⋆ = Σ_(t=0..T-1) ⟨δ⋆̄_(t+1), y_t⟩    δp_f = Σ_(t=0..T-1) c_t ⊙ δf̄_(t+1)
δb_⋆ = Σ_(t=0..T) δ⋆̄_t                 δp_o = Σ_(t=0..T) c_t ⊙ δō_t

The RNNs, unlike the ARIMA, have the capability to learn nonlinearities in a data sequence, and specialized nodes such as the LSTM nodes can handle this even more efficiently. One requirement of the LSTM network is that a long sequence is needed for learning past dependencies, and for this reason this research aims to create a long sequence as described in the previous section.

RESULTS AND ANALYSIS
Prior to applying a time series dataset to a machine learning algorithm, it is important to analyse its stationarity. The initial null hypothesis is that the time series at hand is non-stationary. Figures 6-8 show the rolling stats plots (red lines), where it can be observed that the number of vulnerabilities is low and stable at the beginning but shows spikes and dips as time grows. Generally, there is an upward trend in the number of vulnerabilities over the course of time. A sequence which does not demonstrate such rises and falls with a clear trend or seasonality is a stationary sequence. The augmented Dickey-Fuller (ADF) test with log transformation shows that the test statistic values, as shown in Figures 9-11, are smaller (more negative) than the critical values, thus leading to the rejection of the null hypothesis (that the data distribution is not stationary) in all cases. Although the LSTM (RNN) is less brittle with respect to stationarity than traditional algorithms such as the ARIMA, stationarity of the series is still beneficial for its performance. Furthermore, if the series is not stationary, it is much safer to perform a transformation to make the data stationary and then train the LSTM. Contrary to feedforward neural networks, the RNNs maintain memory cells to remember the context of the sequence and then use these cells to process the output. The TimeDistributed layer wrapper, provided by Keras, is utilized over the dense layer to model the data as a time series sequence; it applies the same transformation to each time step in the sequence and provides the output after each step. Therefore, it fulfils the requirement of obtaining the processed output after each input instead of waiting for the whole process to be completed to obtain the output at the end [34].
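The transformation and rolling-stats checks described above can be sketched as follows; the series is synthetic, standing in for a vendor's vulnerability counts, and a formal stationarity check would additionally apply `statsmodels.tsa.stattools.adfuller`, which is assumed rather than shown here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a vendor's vulnerability counts with an upward trend.
rng = np.random.default_rng(1)
counts = pd.Series(np.arange(1, 121) + rng.integers(0, 10, 120).astype(float))

logged = np.log(counts)               # log transform stabilizes the variance
roll_mean = logged.rolling(12).mean() # rolling statistics as in Figures 6-8
roll_std = logged.rolling(12).std()

# First differencing removes the trend; a roughly constant rolling mean/std
# after the transformation suggests the series has been made stationary.
diffed = logged.diff().dropna()
print(len(diffed))  # 119
```

In practice the log-transformed (and, if needed, differenced) series is what would be fed to the ADF test and then to the LSTM.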
In this way, the RNN algorithm applies the same transformation to every element of the time series, where the output values depend on the previous values. The LSTM model was created with parameter optimization, resulting in 4 hidden units and 150 epochs. Figures 12-14 show the forecast plots for the different datasets. The proposed model has outperformed the previous models in terms of accuracy. Table 1 shows the results of the accuracy performance (RMSE) of the model, together with the comparison with the previous best performing models [15].
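The model described here can be sketched with Keras; the toy input sequence below is hypothetical, and the epoch count is reduced from the 150 used in the experiments to keep the sketch quick.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, LSTM, TimeDistributed, Dense

# Hypothetical toy sequence standing in for a scaled vulnerability-count series.
steps, features = 10, 1
X = np.sin(np.linspace(0, 3, steps)).reshape(1, steps, features)
y = np.roll(X, -1, axis=1)  # next-step targets

# 4 LSTM hidden units with a TimeDistributed dense output, as in the text,
# so a prediction is emitted at every time step rather than only at the end.
model = Sequential([
    Input(shape=(steps, features)),
    LSTM(4, return_sequences=True),
    TimeDistributed(Dense(1)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

pred = model.predict(X, verbose=0)
print(pred.shape)  # (1, 10, 1)
```

With `return_sequences=True` the LSTM hands its per-step outputs to the TimeDistributed wrapper, matching the per-step forecasting requirement described above.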
The forecast plots in Figures 12-14 show a promising picture. The grey, green and blue lines show the true values, the training fit and the testing forecast respectively. It can be observed that the training fit is nearly perfect for the IBM vendor vulnerability dataset, which achieved the highest accuracy (root mean squared error value of 0.072). The testing performance, or the forecast, also showed a decent performance. Although the results for the Microsoft and Oracle vulnerability datasets (RMSE values of 0.17 and 0.24 respectively) are not as good as for the IBM dataset, they are much better compared to the recent best research in [15], as shown in Table 1.
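The RMSE metric used throughout these comparisons can be computed directly; the values below are illustrative, not the paper's actual series.

```python
import numpy as np

# Illustrative scaled actual vs forecast values (not the paper's data).
actual = np.array([0.10, 0.25, 0.40, 0.35])
forecast = np.array([0.12, 0.20, 0.45, 0.30])

# Root mean squared error: square root of the mean of squared residuals.
rmse = np.sqrt(np.mean((actual - forecast) ** 2))
print(round(float(rmse), 3))  # 0.044
```

Because the series are min-max scaled before training, RMSE values such as 0.072 are on this same 0-1 scale.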
The performance variation for the Microsoft and Oracle datasets is due to the presence of high spikes in data points in recent times, since a greater number of vulnerabilities have been discovered in recent years (2016-2019) in the NVD database [35], in contrast to the absence of such sudden spikes in the past data points; thus, the models were unable to learn adequately. As more data are added to the national vulnerability database, these models will eventually have the capability to learn adequately for improved performance. Even though the forecast deviates from the actual values in certain places, the overall performance of both in terms of RMSE is promising. Through the use of the TimeDistributed layer wrapper, these models not only show better performance but are also simpler, reducing computational costs in comparison with the regression approach of some of the previous research. Unlike the previous research, the larger novel dataset of this research has effectively addressed the requirement of large training data for obtaining meaningful results from the LSTM network.

CONCLUSION
In this research, the ability of the long short-term memory network (LSTM) to learn on its own and use that knowledge to determine the impact of the context and the quantity of past information has been utilized effectively in forecasting the number of post-release vulnerabilities for the top-three vendors, namely Microsoft, Oracle and IBM. In the process, we have created a novel dataset for the top-three vendors with the highest number of vulnerabilities in the NVD that contains far more historical data points than in the previous research for training the LSTM network, which prefers a larger dataset for learning to demonstrate better accuracy. To the best of our knowledge, this work is the first to utilize such a dataset feature and the LSTM deep learning network to develop a vulnerability discovery model (VDM) for this problem. Our model demonstrated better accuracy (root mean squared error) than other existing such models. Developers, the user community and individual organizations can utilize our model according to their respective requirements in managing cyber threats in a cost-effective way. However, an inherent limitation of the LSTM network is that it is not effective with small datasets. Hence, there is an ongoing need to create even larger datasets than the one utilized in this work for future improvements where necessary. Another limitation is that since the LSTM network uses many loops and memory nodes, it requires longer computation time. However, this drawback can be addressed with the use of powerful machines in a production scenario. Since the long short-term memory (LSTM) recurrent neural network is fully capable of handling multiple input variables in predicting future values, we plan to extract and generate more novel features from the NVD database to develop a multivariate time series model for this problem as our future work.
We also plan to develop a model with the 'Published Date' feature as the target label and to integrate it with the current model to provide a novel vulnerability forecasting service.