A new procedure for misbehavior detection in vehicular ad-hoc networks using machine learning

Received Sep 4, 2020; Revised Oct 6, 2020; Accepted Dec 5, 2020
Misbehavior detection in vehicular ad hoc networks (VANETs) is performed to improve traffic safety and driving accuracy. All the nodes in a VANET communicate with each other through message logs. Malicious nodes can cause dangerous situations by sending message logs with tampered values. In this work, various machine learning algorithms are used to detect five main types of attacks, namely constant attack, constant offset attack, random attack, random offset attack, and eventual attack. First, each attack is detected by different machine learning algorithms using binary classification. Then, a new procedure is created to perform multi-class classification of the attacks with the algorithm that performs best among the different machine learning techniques. In binary classification, the highest accuracy is obtained with Naïve Bayes (100%), decision tree (100%), and random forest (100%) for the type1 attack, decision tree (100%) for the type2 attack, and random forest (98.03%, 95.56%, and 95.55%) for the type4, type8, and type16 attacks respectively. In the new procedure for multi-class classification, the highest accuracy is obtained with the random forest technique (97.62%). The VeReMi dataset, a public repository for malicious node detection in VANETs, is used for this work.


INTRODUCTION
Vehicular ad hoc networks (VANETs) are similar to mobile ad hoc networks (MANETs) [1] and are built by applying the principles of MANETs. The nodes of a VANET are short lived and communicate with each other through message logs [2]. All nodes share the same radio channel and exchange data with other nodes [3]. Message logs consist of several features such as sending time, sender ID, message ID, position, noise in the position, speed, and noise in the speed. Small packets are repeatedly exchanged with other nodes in the neighborhood to maximize driving safety [4]. Traditional wired networks are protected by mechanisms such as gateways and firewalls. Wireless networks, however, are exposed to security attacks that target the whole network from different directions. Misbehaviors such as spamming, bluffing, and identity faking give rise to malicious nodes that pass incorrect or inaccurate messages to neighboring nodes; this degrades the performance of the VANET, reduces road safety, and can increase the number of road accidents. Passenger safety can be enhanced by means of inter-vehicle communication [5]. For example, when a node notices a critical event such as a road accident, it sends safety alert packets over the VANET so that vehicles moving towards that site are warned; in this way road accidents can be minimized [6]. The duty of honest nodes is therefore to forward each accepted risk-free packet to the nodes in their transmission range.
VANET applications support real-time communication and mainly deal with life-critical information [7]. To achieve correctness and effectiveness, they should meet the security requirements of integrity, non-repudiation, privacy, confidentiality, and authentication to shield against attackers and malicious nodes [8]. To come up with preventive measures, it is important to observe such malicious nodes and unusual activities in the network. With the rapid worldwide growth of road traffic, it becomes crucial to use current technologies to make driving safer and easier.
In this paper, five types of attacks are considered: constant attack (type1), constant offset attack (type2), random attack (type4), random offset attack (type8), and eventual attack (type16). The constant attacker transmits a fixed, pre-configured position; the constant offset attacker transmits its actual position plus a fixed, pre-configured offset; the random attacker sends a random position; the random offset attacker transmits a random position within a pre-configured rectangle around the vehicle; and the eventual attacker behaves normally for some time and then transmits the same position repeatedly.
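As a rough illustration of how these five position-falsification attacks differ, the following sketch maps an attack type to the position a node would broadcast. The function name `falsify_position`, every constant, the square region used for the random offset, and the rule that the eventual attacker freezes its position after a fixed number of steps are assumptions made for this example; they are not taken from the paper or from the dataset.

```python
import random

# Hypothetical parameters, chosen only for illustration.
CONSTANT_POS = (1000.0, 1000.0)    # position always reported by the constant attacker
CONSTANT_OFFSET = (250.0, -150.0)  # offset added by the constant offset attacker
AREA = (5000.0, 5000.0)            # size of the simulated area for the random attacker
MAX_OFFSET = 300.0                 # half-width of the region around the vehicle
STOP_AFTER = 5                     # steps before the eventual attacker freezes


def falsify_position(attack_type, true_pos, step, rng, state):
    """Return the position that a node of the given attack type broadcasts."""
    x, y = true_pos
    if attack_type == 1:    # constant: fixed, pre-configured position
        return CONSTANT_POS
    if attack_type == 2:    # constant offset: true position plus a fixed offset
        return (x + CONSTANT_OFFSET[0], y + CONSTANT_OFFSET[1])
    if attack_type == 4:    # random: any position inside the area
        return (rng.uniform(0, AREA[0]), rng.uniform(0, AREA[1]))
    if attack_type == 8:    # random offset: random point near the vehicle
        return (x + rng.uniform(-MAX_OFFSET, MAX_OFFSET),
                y + rng.uniform(-MAX_OFFSET, MAX_OFFSET))
    if attack_type == 16:   # eventual: honest at first, then repeats one position
        if step < STOP_AFTER:
            state["frozen"] = true_pos  # remember the last honest position
            return true_pos
        return state.setdefault("frozen", true_pos)
    return true_pos         # any other type is treated as an honest node


rng = random.Random(42)
state = {}
for step in range(8):
    true_pos = (100.0 + 10 * step, 200.0)
    print(step, falsify_position(16, true_pos, step, rng, state))
```

Only the broadcast position is modelled here; a real message log also carries speed, noise values, and identifiers.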
The present work proposes a new procedure for misbehavior detection in VANETs using machine learning techniques. Machine learning is used to classify the message logs sent from a node as honest or malicious. For the classification, various features are extracted from the nodes; these features are calculated with the help of nearby nodes. After calculation, the observer nodes exchange their observations with the other nodes in their neighborhood. Two types of machine learning classification are used in this paper: binary classification and multi-class classification. The accuracy of a machine learning model depends firstly on the algorithm used to generate the classifier and secondly on the features used to represent the instances. Different inducers and features give different performance for each classifier [3]. To address this, a new system is created in which the best algorithm is chosen automatically according to the dataset; when a new message log is sent from a node, the system detects whether the message log carries any type of attack. If the message log is found to be malicious, the node that sent it is also considered malicious and is thereby detected.
Most existing approaches to detecting misbehavior in VANETs are simulation based. In recent years the use of ad hoc networks has risen tremendously [9]. Automating misbehavior detection in VANETs will help to detect misbehavior in live environments. Some of the works that have been done to improve node detection systems in VANETs are discussed here.
Grover [10]. Muthukumar and Karthick presented work on identifying misbehaving nodes using trust management in VANETs and introduced research on misbehavior prevention in location privacy-enhanced VANETs. As future work, they intend to improve the detection rate of the proposed system and to evaluate the performance of the proposed scheme under different vehicle densities and average velocities [11]. Barnwal and Ghosh presented a survey on the detection of misbehaving nodes in vehicular ad hoc networks and concluded that hybrid techniques should be adopted for misbehavior detection [12].
Sedjelmaci et al. presented a scheme to predict and prevent misbehaving intruders in heterogeneous vehicular networks, with the aim of preventing the most dangerous attacks that target HetVNet. They analyzed the performance and demonstrated the efficiency of the proposed scheme using NS-3, which showed that it exhibits a high prediction accuracy, a low detection time, and a low communication overhead [13]. Mohammadi et al. conducted a survey on misbehaving node detection in vehicular ad hoc networks. Among the SVM-based, Dempster-Shafer-based, and averaging-based detection techniques compared, the SVM classifier gave the highest accuracy [14]. Tiwari and Gupta conducted a survey on the security enhancement of misbehaving nodes in vehicular ad hoc networks using a hash function; the algorithms used were J-48, RF, IBK, Naïve Bayes, and AdaBoost1, of which J-48 and RF gave the best results [15].
Only a few works have addressed malicious node detection in VANETs using machine learning. This study is also intended to support future researchers who want to work in this field. The aim of this paper is to provide a procedure for selecting the best algorithm from a set of algorithms for misbehavior detection in VANETs.

RESEARCH METHOD
The vehicular reference misbehavior (VeReMi) dataset is a dataset for the evaluation of misbehavior detection in VANETs. It is composed of two types of files: ground truth files and the message logs generated from the simulation environment. It was published as part of a recent paper [16] and was generated by simulation using LuST and VEINS. It primarily covers five attacks, namely constant attack (type1 attack), constant offset attack (type2 attack), random attack (type4 attack), random offset attack (type8 attack), and eventual attack (type16 attack). Its primary purpose is to serve as a baseline for assessing how misbehavior detection mechanisms operate on a city scale. The data used here consists of five files, one per attack type, with 960, 1,056, 4,438, 21,638, and 20,483 initial instances for the type1, type2, type4, type8, and type16 attacks respectively. The combined dataset used for multi-class classification has 48,575 instances. The total size of the dataset is approximately 5.3 MB.
The research is carried out in two phases. The first phase analyzes the algorithms on the different attacks, and the second phase designs a new procedure for detecting attacks using different machine learning classification algorithms: Naïve Bayes, K-nearest neighbor (KNN), stochastic gradient descent (SGD) classifier, decision tree (DT), and random forest. In the first phase, each algorithm is applied to each individual attack and its accuracy is evaluated. The algorithms are described one by one below.

Naïve Bayes
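The Naïve Bayes algorithm grew out of advances in Bayesian theory. It is a supervised machine learning algorithm based on Bayes' theorem [17]. For a class $y$ and a feature vector $(x_1, \dots, x_n)$ whose features are assumed to be conditionally independent given the class, Bayes' theorem is given as (1):
$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)} \quad (1)$$
Since the denominator $P(x_1, \dots, x_n)$ in (1) is constant for a given input and only adds extra computation, it is removed from the formula, which gives (2):
$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y) \quad (2)$$
Equation (2) gives the result of the Naïve Bayes classifier.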
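The decision rule referred to as (2) can be illustrated with a short sketch. The code below assumes Gaussian class-conditional likelihoods for continuous features, which is one common choice and not necessarily the variant used in this work; the function names and the toy feature values are hypothetical. Log probabilities are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means and feature variances of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps every variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the class maximising log P(y) + sum_i log P(x_i | y).

    The constant evidence term is dropped, as in the decision rule.
    """
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Toy data: [position noise, speed]; label 0 = honest, 1 = malicious.
X = [[0.1, 30.0], [0.2, 28.0], [0.15, 31.0], [5.0, 29.0], [6.0, 30.5], [5.5, 27.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [0.12, 29.5]))
```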

K-nearest neighbor
K-nearest neighbor is called an instance-based learner because it stores the training instances for classification. The K-nearest neighbor classifier works on the principle of majority voting, where K is the number of nearest neighbors to be considered. The distance of each stored instance from the query point is calculated, and the classes of the K nearest neighbors are identified. The query point is then classified by majority vote. The algorithm is also known as a lazy learning algorithm because it does no work beyond storing the training instances until a query point arrives [18]. The distance to the nearest neighbors is calculated with the Euclidean distance (3):
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \quad (3)$$
where $p$ is the query point, $q$ is a stored instance, and $n$ is the number of features. In this work, the Euclidean distance (3) is used for the distance calculation.
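A minimal sketch of this procedure, using the Euclidean distance and majority voting, is given below. The function names and the toy data are hypothetical, and no tie-breaking rule beyond the default behavior of `Counter.most_common` is assumed.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Lazy" learning: all distance computations happen at query time.
    neighbours = sorted(zip(X_train, y_train),
                        key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data: label 0 = honest, 1 = malicious.
X_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
y_train = [0, 0, 0, 1, 1, 1]
print(knn_predict(X_train, y_train, [0.5, 0.5], k=3))
```

Sorting all training points costs O(n log n) per query, which is acceptable for a sketch; tree-based indexes are the usual choice for larger datasets.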

Stochastic gradient descent
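Stochastic gradient descent is not itself a supervised machine learning algorithm; it is an optimization technique. It is efficient for fitting linear models such as support vector machines and logistic regression. In this work it is used for the convex optimization of a support vector machine. It is widely used because of its efficiency and ease of implementation [19]. In contrast to batch gradient descent, stochastic gradient descent performs a parameter update for each training example $(x^{(i)}, y^{(i)})$, as given by (4):
$$\theta = \theta - \eta \, \nabla_{\theta} J(\theta; x^{(i)}, y^{(i)}) \quad (4)$$
where $\theta$ denotes the model parameters, $\eta$ the learning rate, and $J$ the objective function.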
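To make the per-example update concrete, the sketch below trains a linear support vector machine with stochastic gradient descent on the L2-regularised hinge loss. The objective, the hyperparameter values, and the function names are assumptions chosen for illustration; the paper does not state the exact settings it used. Labels are encoded as -1 and +1.

```python
import random


def sgd_linear_svm(X, y, lr=0.01, reg=0.01, epochs=200, seed=0):
    """Fit weights w and bias b with one (sub)gradient step per example."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:
                # Hinge loss is active: step along -(reg * w - y * x).
                w = [wj - lr * (reg * wj - y[i] * xj) for wj, xj in zip(w, X[i])]
                b += lr * y[i]
            else:
                # Only the regularisation term contributes to the gradient.
                w = [wj - lr * reg * wj for wj in w]
    return w, b


def svm_predict(w, b, x):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy, linearly separable data.
X = [[-2.0, -1.0], [-1.0, -2.0], [-1.5, -1.5], [2.0, 1.0], [1.0, 2.0], [1.5, 1.5]]
y = [-1, -1, -1, 1, 1, 1]
w, b = sgd_linear_svm(X, y)
print(w, b, svm_predict(w, b, [3.0, 3.0]))
```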

Decision tree
A decision tree reaches a conclusion by following a set of rules drawn from the tree. It consists of two kinds of nodes: i) decision nodes and ii) leaf nodes. A decision node specifies which attribute is to be tested, and a leaf node gives the class. Decision trees use a top-down approach to produce results [21]. The first node of a decision tree, itself a decision node, is called the root node [22]. Each node of the decision tree is selected on the basis of the information gain method:

Information gain method
The two important formulas used in this method are: i) entropy and ii) information gain. The entropy of the sample data $S$ is calculated as
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
where $p_i$ is the proportion of samples in $S$ that belong to class $i$ and $c$ is the number of classes. After the entropy is calculated, the information gain is calculated for each attribute to decide the decision node.
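The two quantities can be sketched in a few lines. The information gain below follows the standard definition, the entropy of the whole sample minus the size-weighted entropy of the subsets produced by splitting on an attribute; the attribute names and values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting `rows` on `attribute`.

    `rows` is a list of dicts that map attribute names to categorical values.
    """
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(subset) / total * entropy(subset)
                    for subset in subsets.values())
    return entropy(labels) - remainder


# Toy data: "pos_noise" separates the classes, "lane" does not.
rows = [
    {"pos_noise": "low", "lane": "a"},
    {"pos_noise": "low", "lane": "b"},
    {"pos_noise": "high", "lane": "a"},
    {"pos_noise": "high", "lane": "b"},
]
labels = [0, 0, 1, 1]
for name in ("pos_noise", "lane"):
    print(name, information_gain(rows, labels, name))
```

The attribute with the largest gain would be chosen as the decision node, so `pos_noise` would be selected in this toy case.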

Random forest classifier
Random forest is an ensemble technique. It is based on bootstrap aggregation (bagging), in which the result is the majority vote of the base models on the test data [23]. As a bagging technique, random forest feeds data to the base models by row sampling with replacement and lets each of them predict the classes. Usually, a decision tree is used as the base model. Random forest applies both feature sampling and row sampling with replacement. Figure 1 shows an example of random forest classification. Suppose a training dataset for binary classification (classes 0 and 1) is given to different decision tree models using feature sampling and row sampling with replacement; the results of the decision trees are then as shown in Figure 1. When a test dataset is passed, the results of the decision trees are aggregated by majority voting to predict the final class [24]. A single decision tree grown to its maximum depth has low bias and high variance. Feature sampling and row sampling across different decision tree models are used to reduce this variance.
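The combination of row sampling with replacement, feature sampling, and majority voting can be sketched as follows. To keep the example short, each base model is a one-level decision stump rather than a full decision tree, which is a deliberate simplification; the function names, the number of trees, and the toy data are assumptions.

```python
import random
from collections import Counter


def fit_stump(X, y, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_cls = Counter(left).most_common(1)[0][0]
            r_cls = Counter(right).most_common(1)[0][0] if right else l_cls
            errors = (sum(lab != l_cls for lab in left)
                      + sum(lab != r_cls for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_cls, r_cls)
    return best[1:]  # (feature, threshold, left class, right class)


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train each stump on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]        # rows, with replacement
        feats = rng.sample(range(len(X[0])), n_features)  # feature sampling
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, x):
    """Aggregate the predictions of all stumps by majority vote."""
    votes = Counter(l if x[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]


X = [[0.1, 1.0], [0.2, 1.2], [0.3, 0.9], [0.15, 1.1],
     [5.0, 9.0], [5.5, 8.5], [6.0, 9.5], [5.2, 9.1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.05, 0.5]), predict_forest(forest, [7.0, 10.0]))
```

Because every stump sees a different bootstrap sample and feature subset, individual stumps may disagree; the vote averages out this variance.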

New procedure design
After completing the first phase, a new procedure is designed for detecting attacks in the message logs sent from any node, as shown in Figure 2. The new procedure includes the following steps:
- Selecting the ground truth data from the VeReMi dataset
- Loading the dataset into the environment
In the pre-processing of the ground truth data, unnecessary columns are removed and the data is balanced. After balancing, feature selection is performed to select the important features. The data is then split into training and testing sets, each model is trained, and its accuracy is calculated. The accuracy is saved for all models, and the best model is selected on the basis of the highest accuracy. Once the best model is obtained, whenever a new message log arrives with suitable features, the model detects which type of attack it contains. When an attack is found in a message log, the node that sent the message log is also identified as malicious. Hence the malicious node is detected.
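The model-selection step of this procedure could be sketched roughly as below. The function `select_best_model` and the two toy classifiers are hypothetical stand-ins and are not the authors' implementation; any object that offers `fit(X, y)` and `predict(X)` can be passed as a candidate, and pre-processing, balancing, and feature selection are assumed to have been done beforehand.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best_model(candidates, X_train, y_train, X_test, y_test):
    """Train every candidate, record its test accuracy, return the best one."""
    scores = {}
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy(y_test, model.predict(X_test))
    best_name = max(scores, key=scores.get)
    return best_name, candidates[best_name], scores


class MajorityClass:
    """Toy baseline: always predicts the most frequent training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroid:
    """Toy classifier: assigns each row to the class with the closest mean."""
    def fit(self, X, y):
        groups = {}
        for row, label in zip(X, y):
            groups.setdefault(label, []).append(row)
        self.centroids_ = {lab: [sum(col) / len(rows) for col in zip(*rows)]
                           for lab, rows in groups.items()}
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids_, key=lambda lab: dist2(self.centroids_[lab], row))
                for row in X]


# Toy multi-class data: 0 = honest, 1 and 2 = two attack types.
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [10, 0], [10, 1]]
y_train = [0, 0, 0, 1, 1, 2, 2]
X_test = [[0.5, 0.5], [5, 6], [9, 0], [1, 1]]
y_test = [0, 1, 2, 0]
candidates = {"majority": MajorityClass(), "nearest_centroid": NearestCentroid()}
best_name, best_model, scores = select_best_model(
    candidates, X_train, y_train, X_test, y_test)
print(best_name, scores)
print(best_model.predict([[9.5, 0.5]]))  # classify a new message log
```

Estimators from scikit-learn such as `GaussianNB`, `KNeighborsClassifier`, `SGDClassifier`, `DecisionTreeClassifier`, and `RandomForestClassifier` expose the same `fit` and `predict` methods, so they could be supplied in the `candidates` dictionary in place of the toy classes.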

RESULTS AND DISCUSSION
The VeReMi dataset contains several folders of different versions. In this work, a single file is selected for each individual attack and is analyzed with the five algorithms, namely Naïve Bayes, KNN, stochastic gradient descent, decision tree, and random forest. The procedure for checking the individual attacks is the same for every algorithm. The accuracy is calculated from the confusion matrix [25] using (5):
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$
where $TP$, $TN$, $FP$, and $FN$ are the numbers of true positives, true negatives, false positives, and false negatives. A classification report containing the precision, recall, F1 score, and support is also given for every algorithm. The precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The recall, also called sensitivity, is the ratio of correctly predicted positive observations to the total actual positive observations. The F1 score is the harmonic mean of precision and recall. The results for each attack are discussed one by one below.
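The accuracy referred to as (5) and the quantities in the classification report can all be derived from the four counts of a binary confusion matrix, as in the sketch below. The function name and the example counts are hypothetical, and the F1 score is computed as the harmonic mean of precision and recall.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and support from raw counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,  # number of actual positive observations
    }


# Hypothetical counts for a malicious-vs-honest classifier.
for name, value in binary_metrics(tp=40, fp=10, fn=10, tn=40).items():
    print(name, value)
```

The guards against empty denominators return 0.0 when a class is never predicted or never present, which is a convention chosen for this sketch.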

Type1 attack (constant attack)
The dataset of the constant attack (type1 attack) is taken, and pre-processing and feature selection are performed to create the models. The models are then evaluated and their accuracies are calculated:

Naïve Bayes
The confusion matrix and ROC curve for the Naïve Bayes algorithm are shown in Figure 3. The accuracy is 100.00%. The classification report is shown in Figure 4. Similarly, the other algorithms, KNN, SGD, decision tree, and random forest, are evaluated with accuracies of 99.10%, 97.60%, 100%, and 100% respectively.

Type2 attack (constant offset attack)
The constant offset attack is the type2 attack; for the evaluation, its data is divided into 70% for training and 30% for testing. The results of the algorithms are discussed below.

Decision tree
The confusion matrix and ROC curve for the DT algorithm are shown in Figure 5. The accuracy is 100.00%. The classification report is shown in Figure 6. Similarly, the other algorithms, Naïve Bayes, KNN, SGD, and random forest, are evaluated with accuracies of 77.84%, 95.64%, 76.32%, and 99.24% respectively.
Figure 6. Classification report of DT for constant offset attack

Type4 attack (random attack)
The random attack is classified after splitting the data into training and test sets. The attack is identified with the help of the following algorithms:

Random forest
The confusion matrix and ROC curve for the random forest algorithm are shown in Figure 7. The accuracy is 98.03%. The classification report is shown in Figure 8. Similarly, the other algorithms, Naïve Bayes, KNN, SGD, and DT, are evaluated with accuracies of 62.08%, 86.61%, 52.20%, and 96.70% respectively.
Figure 8. Classification report of random forest for random attack

Type8 attack (random offset attack)
The type8 attack (random offset attack) is evaluated by splitting the data into a 70% training set and a 30% testing set. The algorithms give different results on the same dataset because of their different learning functions.

Random forest
The confusion matrix and ROC curve for the random forest algorithm are shown in Figure 9. The accuracy calculated using (5) is 95.56%. The classification report is shown in Figure 10. Similarly, the other algorithms, Naïve Bayes, KNN, SGD, and DT, are evaluated with accuracies of 58.83%, 82.58%, 49.03%, and 95.40% respectively.
Figure 9. Confusion matrix and ROC-curve of random forest for random offset attack
Figure 10. Classification report of random forest for random offset attack

Type16 attack (eventual attack)
The eventual attack is detected by creating a model from the data for the eventual attack. The accuracies of the algorithms used to create the models are calculated below.

Random forest
The confusion matrix and ROC curve for the random forest algorithm are shown in Figure 11. The accuracy is 95.55%. The classification report is shown in Figure 12.
Table 1.
For the analysis of the new procedure, the datasets of the type1, type2, type4, type8, and type16 attacks are combined, and a function is written to select the best algorithm for the dataset; the selected algorithm is then used to predict the result for new data.
The algorithms used for individual attack detection are also used in the new procedure, and their accuracies are calculated on the combined dataset. Random forest gives the highest accuracy on this dataset, 97.62%, while Naïve Bayes, KNN, SGD, and decision tree achieve 70.38%, 88.66%, 67.78%, and 95.64% respectively. The whole procedure can be applied to any dataset by changing the feature names and the data supplied to the algorithm to predict the attack. The confusion matrix for random forest, the algorithm with the highest accuracy in the new procedure, is shown in Figure 13. The ROC curve plots the true positive rate against the false positive rate and is therefore defined for binary problems; drawing it for multi-class classification is tedious. However, for multi label classification it is possible to do so. The classification report of the random forest algorithm for the new procedure is shown in Figure 14.
The results of machine learning algorithms vary with the features, the dataset, and the algorithm used. According to the no free lunch theorem [26], no single algorithm is the best fit for every problem. Gahleb et al. used the NGSIM dataset to study misbehavior in VANETs using artificial neural networks (a feed forward neural network with back propagation). They used binary classification to detect misbehavior in every vehicle separately. The features used in their work are overlapped areas, interval to loss of received information, average prediction error, distance to the sender, average vehicle occurring distance, and vehicle uncertainty. The total accuracy achieved is 99.74% [27]. A comparison of their results with the proposed work is shown in Table 2.
Bidgoli et al. used the KDDCUP99 dataset for intrusion detection with a decision tree algorithm on a reduced feature space. Their study covers 41 features and 24 attack types belonging to the denial of service (DoS), remote to user (R2L), user to root (U2R), and probing classes [28]. The comparison with the proposed work is shown in Table 3. Two algorithms should be compared on the same dataset and in the same environment; only then can it be said that one algorithm is the best in that scenario for the corresponding dataset. The results of the two papers above differ somewhat from those of the proposed work because of the datasets chosen for the experiments. Hence each set of results is valid and best within its own scenario.

CONCLUSION
VANETs have gained a lot of attention because they have greatly improved road safety and driving conditions. Misbehavior in VANETs can be detected to determine whether a node is malicious. In this paper, five attacks are detected by five different algorithms, and the accuracy is calculated for each algorithm separately. In addition, a new procedure is formed for the multi-class detection of the attacks using the best available algorithm on the combined dataset. This new procedure can also serve as a general mechanism for malicious node detection. The approach is suitable for detecting misbehavior in VANETs by choosing the best algorithm, and it reduces the effort of writing code for the different algorithms separately and analyzing each one to choose the best. In the type1 attack, Naïve Bayes, decision tree, and random forest each obtain 100% accuracy. In the type2 attack, decision tree obtains 100% accuracy. Random forest obtains 98.03% accuracy in the type4 attack, 95.56% in the type8 attack, and 95.55% in the type16 attack. The new procedure selects random forest as the best algorithm, with 97.62% accuracy. Hence the new procedure achieves its goal of finding the best algorithm for the detection of misbehavior in VANETs. This work can be advanced by applying hybrid machine learning techniques. Different scenarios and different attacks can also be considered in future work on misbehavior detection.