Data Mining Approach of Accident Occurrences Identification with Effective Methodology and Implementation

ABSTRACT


INTRODUCTION
Data mining is a prominent technology for prediction and analytics in a given domain. Traffic accidents occur in different places across the globe, and their causes vary from place to place, so we need to examine the factors most closely related to the likelihood of an accident. We survey machine learning classification algorithms that have been applied to data sets collected from different regions, in order to decide which classification rule or association rule to use for our own data set. We implemented an SVM classifier and a CNB classifier with the WEKA tool on a publicly available data set. The goal is to identify which classification algorithm is better suited to mining the actual data and gives better predictions. The motivation for this article is the growing number of accident cases recorded by regional hospitals, together with their consequences: injuries, damage to vehicles and so on. Traffic accidents are the main cause of deaths on the road [1], and they result from not following traffic rules, overtaking in the wrong way, over speeding, and not following road safety measures. As per the WHO (World Health Organization), over 4 million cases are recorded each year worldwide because of traffic and road accidents. The main reasons the WHO states are not following traffic rules, not following safety measures such as wearing a seat belt or helmet, over speeding, wrong crossing, driving by minors, lack of literacy about traffic and road safety rules, and drunk driving. Such accidents can be reduced with small measures that have been discussed by other researchers [2]. Data mining is mainly used to identify the severity of accidents on roads [3].
DMDW (Data Mining and Data Warehousing) [4] provides the techniques needed to predict or identify the severity of accidents on roads. DM is used to extract semantics from a data set, that is, a meaningful extract of the available data [5]. Techniques such as clustering, anomaly detection, and classification rules [6] are used for most DM operations on road accident data. In this article we share a literature survey of previous work on different data sets, along with the current research we carry out on a different data set related to road accidents and their severity. The next section gives a short literature survey, followed by the proposed approach, the experimental results, the resources, and finally the conclusion.

LITERATURE SURVEY
Because the literature review relies on the basics of support vector machines (SVMs) and CNB classifiers, we first summarize SVMs, which are central to the scope of this research. In machine learning, SVMs are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. When the data is not labeled, supervised learning is not possible and an unsupervised learning approach is required, which attempts to find a natural clustering of the data into groups and then maps new data to these groups. The clustering algorithm that provides this extension of SVMs is called support vector clustering; it is used in industrial applications either when the data is not labeled or when only some of the data is labeled, as a pre-processing step for a classification pass.
Classifying data is a common task in ML. Suppose some given data points each belong to one of two classes, and the goal is to decide which class a new data point will be in. In the case of SVMs, a data point is viewed as a p-dimensional vector (a list of p numbers), and we want to know whether we can separate such points with a (p-1)-dimensional hyperplane. This is called a linear classifier. There are many hyperplanes that might classify the data. One reasonable choice for the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So we choose the hyperplane such that the distance from it to the nearest data point on each side is maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane, and the linear classifier it defines is known as a maximum-margin classifier or, equivalently, the perceptron of optimal stability. More generally, an SVM constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification, regression, or other tasks such as outlier detection. Intuitively, a good separation is achieved by the hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier.

Figure 1 shows a simple example of an SVM classifier, i.e., a classifier that separates a set of objects into their respective groups (GREEN, RED) with a separating line. Most classification tasks, however, are not that simple, and a more complex structure is often needed to make an optimal separation, i.e., to correctly classify new objects (test cases) on the basis of the examples that are available (training cases). This situation is depicted in the figure below. Compared with the previous schematic, it is clear that a full separation of the GREEN and RED objects would require a curve (which is more complex than a line). Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as hyperplane classifiers, as shown in Figure 2. Support Vector Machines are particularly suited to handle such tasks.
Figure 3 below illustrates the basic idea behind SVMs. Here the original objects (red part of the schematic) are mapped, i.e., rearranged, using a set of mathematical functions known as kernels. The process of rearranging the objects is known as mapping. Note that in this new setting the mapped objects (green part of the schematic) are linearly separable and, thus, instead of constructing the complex curve (left schematic), we only have to find an optimal line that can separate the GREEN and the RED objects.
SVM is perhaps one of the most popular and most discussed machine learning algorithms. SVMs were extremely popular around the time they were developed in the 1990s and continue to be a go-to method for a high-performing algorithm with little tuning. The aspects of SVM relevant here are the representation used by the model, how a learned SVM model can be used to make predictions for new data, and how an SVM model is learned from training data. SVM is an exciting algorithm and its concepts are relatively simple; the description that follows assumes little or no background in statistics and linear algebra.
The maximal-margin classifier is a hypothetical classifier that best explains how SVM works in practice. The numeric input variables (x) in the data (the columns) form an n-dimensional space. For example, two input variables form a two-dimensional space. A hyperplane is a line that splits the input variable space. In SVM, a hyperplane is selected to best separate the points in the input variable space by their class, either class 0 or class 1. In two dimensions this can be visualized as a line, and we assume that our data points can be completely separated by this line. For example: B0 + (B1 * X1) + (B2 * X2) = 0, where the coefficients (B1 and B2) that determine the slope of the line and the intercept (B0) are found by the learning algorithm, and X1 and X2 are the two input variables. Classifications can be made using this line. By plugging input values into the line equation, we can calculate whether a new point is above or below the line. Above the line, the equation returns a value greater than 0 and the point belongs to the first class (class 0). Below the line, the equation returns a value less than 0 and the point belongs to the second class (class 1). A point close to the line returns a value close to zero and may be difficult to classify. If the magnitude of the value is large, the model may have more confidence in the prediction. The distance between the line and the closest data points is referred to as the margin. The best or optimal line that can separate the two classes is the line that has the largest margin. This is known as the maximal-margin hyperplane. The margin is calculated as the perpendicular distance from the line to only the closest points. Only these points are relevant in defining the line and in the construction of the classifier. These points are known as the support vectors. They support or define the hyperplane.
The hyperplane is learned from training data using an optimization procedure that maximizes the margin.
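As a minimal sketch of the decision rule described above, the following Python snippet evaluates a line of the form B0 + (B1 * X1) + (B2 * X2) for a point, assigns a class from the sign of the result, and measures the perpendicular distance to the line. The coefficient values and the function names are illustrative assumptions; in practice the coefficients are found by the learning algorithm, and nothing here is taken from the accident data.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
B0, B1, B2 = -4.0, 1.0, 1.0


def line_value(x1, x2):
    """Value of B0 + (B1 * X1) + (B2 * X2) for the point (x1, x2)."""
    return B0 + B1 * x1 + B2 * x2


def predict_class(x1, x2):
    """Class 0 when the value is greater than 0 (above the line), else class 1."""
    return 0 if line_value(x1, x2) > 0 else 1


def distance_to_line(x1, x2):
    """Perpendicular distance from the point to the line."""
    return abs(line_value(x1, x2)) / math.hypot(B1, B2)


def margin(points):
    """Smallest perpendicular distance from the line to any given point."""
    return min(distance_to_line(x1, x2) for x1, x2 in points)
```

The points that attain the smallest distance play the role of the support vectors. A value of `line_value` near zero corresponds to a point that is hard to classify.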
In practice, real data is messy and cannot be separated perfectly with a hyperplane. The constraint of maximizing the margin of the line that separates the classes must be relaxed. This is often called the soft-margin classifier. This change allows some points in the training data to violate the separating line. An additional set of coefficients is introduced that gives the margin wiggle room in each dimension. These coefficients are sometimes called slack variables. This increases the complexity of the model, as there are more parameters for the model to fit to the data. A tuning parameter, called simply C, is introduced that defines the magnitude of the wiggle allowed across all dimensions. The C parameter defines the amount of violation of the margin allowed. C = 0 means no violation, and we are back to the inflexible maximal-margin classifier described above. The larger the value of C, the more violations of the hyperplane are permitted. During the learning of the hyperplane from data, all training instances that lie within the distance of the margin affect the placement of the hyperplane and are referred to as support vectors. As C affects the number of instances that are allowed to fall within the margin, C also influences the number of support vectors used by the model.
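The role of the slack variables and of C can be illustrated with a short sketch. It rests on several assumptions that are not stated in the text: the two classes are encoded as +1 and -1 rather than 0 and 1, the slack of a point is computed in the common hinge form max(0, 1 - y * f(x)), and C is treated as a budget on the total slack, so that C = 0 permits no violation. Some libraries define C the other way round, as a penalty on violations, so this sketch should not be read as the behaviour of any particular tool.

```python
def slack(value, label):
    """Slack of one point: 0 when it lies on the correct side of the margin."""
    return max(0.0, 1.0 - label * value)


def total_slack(values, labels):
    """Sum of the slack over all training points."""
    return sum(slack(v, y) for v, y in zip(values, labels))


def within_budget(values, labels, c):
    """True when the total margin violation does not exceed the budget c."""
    return total_slack(values, labels) <= c


def support_vector_indices(values, labels):
    """Indices of points lying on or inside the margin."""
    return [i for i, (v, y) in enumerate(zip(values, labels)) if y * v <= 1.0]


# Hypothetical line values f(x) and labels for four training points.
values = [2.0, 0.5, -1.5, 0.3]
labels = [1, 1, -1, -1]
```

With these values the second and fourth points violate the margin, so a budget of C = 0 is exceeded while a larger budget accepts the same line.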
This short literature survey discusses approaches worked out by researchers across the globe. Machine learning is the base concept behind mining the severity of accidents. As discussed previously, over 4 million cases are recorded as road accidents every year. Some machine learning algorithms, such as clustering, are used as unsupervised learning techniques. We need to form clusters for a specific function in the data set, where a function may be a reason for an accident. For example, over speeding might be one reason, so it is considered one of the functions.
ANNs (Artificial Neural Networks) [7] help in analyzing road accidents with different parameters. Tree-based analysis is another approach [8]. LCC (Latent Class Clustering) is faster and more accurate than k-NN for some functions of the data set [9]-[13]. Let us take a short review of the data mining techniques used in different research domains by researchers across the globe. The reason for looking at other research domains is to understand the main functionality of each technique. One of the fundamental operations in data mining is to split the data set into clusters. Clustering is unsupervised learning, in which there is no specific predicted output; based on the available and past data we perform the operations and obtain the prediction results [14], [15]. In clustering we split the data set to identify records that share the same category of functions. For accident severity, there may be different functions to consider, and in some cases we need to consider a combination of functions from the data set. As an example of clustering the data set, consider the sample Table 2 below, whose records have some things in common.
From Table 1 we can say that most of the accidents happen to car riders; the reasons may be over speeding, drunk driving, etc. We need to form the clusters based on the most heavily weighted reason for the accident.
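The idea of forming groups around the most heavily weighted reason can be sketched as follows. The records below are hypothetical and are not taken from the tables in this article, and grouping by a recorded attribute is a simple stand-in for a clustering algorithm such as K-Means, used here only to show the kind of output the grouping step is after.

```python
from collections import Counter, defaultdict

# Hypothetical (vehicle, reason) records, for illustration only.
records = [
    ("car", "over speed"),
    ("car", "drunk driving"),
    ("car", "over speed"),
    ("lorry", "over speed"),
    ("bike", "no helmet"),
    ("bike", "no helmet"),
    ("bike", "over speed"),
]


def dominant_reason_by_vehicle(rows):
    """Most frequent accident reason for each vehicle type."""
    counts = defaultdict(Counter)
    for vehicle, reason in rows:
        counts[vehicle][reason] += 1
    return {v: c.most_common(1)[0][0] for v, c in counts.items()}


def group_by_reason(rows):
    """Vehicles involved in accidents, grouped by the recorded reason."""
    groups = defaultdict(list)
    for vehicle, reason in rows:
        groups[reason].append(vehicle)
    return dict(groups)
```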

PROPOSED APPROACH
We have seen some of the classification algorithms [16]-[19] and rules based on the latest machine learning techniques. Clustering is based on unsupervised learning, and K-Means [20] is likewise an unsupervised technique, whereas K-NN is a supervised one. Let us now run the same data sets with supervised learning. SVM (Support Vector Machines) and the CNB classifier are the two classification algorithms explained in this article. We explain our work on accident severity in terms of three categories: BOW (Bag of Words), word frequency, and word ranking. The BOW consists of a set of pre-defined words that are most used to describe the research component in the application. Suppose we have a data set with words like helmet, seatbelt, speed, etc.; those words are considered the bag of words.
First we need to pre-process the data set. We need to identify the missing values in the data set and substitute them with related values, for example the mean or median of the values of that function or object. The sample data in Table 2 illustrates what might be available in the data set; it is pre-processed with machine learning tooling, for example in Python or R. In this process we need to eliminate or handle the missing values. While handling the missing values we also need to identify the text values and convert them to a numerical format before applying a prediction or data mining classification algorithm, because the algorithms we use cannot always handle string values in the data set.
There is a sequence to follow to predict the main reason behind these accidents and to measure the accuracy of the prediction. Figure 4 shows the flow. First we load the data set to be processed. Next we perform the pre-processing steps, such as handling the missing values by substituting valid information like the mean or median of the data. Then we select the classification algorithm to apply. The cleaned data set must be separated into a training set and a test set. The training set is used to train the classification algorithm; the test set is used to compare the predictions with the required result. We test the values of the test set against the model built from the training set and compare the outcome with previous work or with the training data set [21]-[23].
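The pre-processing sequence described above (fill the missing values, convert text to numbers, split into training and test sets) can be sketched with the Python standard library. The column names, the sample rows, and the helper names are hypothetical and are not the contents of Table 2; mean imputation is shown, and the median could be substituted in the same way.

```python
import random
from statistics import mean

# Hypothetical records; None marks a missing value.
rows = [
    {"speed": 80.0, "reason": "over speed"},
    {"speed": None, "reason": "drunk driving"},
    {"speed": 60.0, "reason": "over speed"},
    {"speed": 100.0, "reason": "no helmet"},
]


def fill_missing_with_mean(data, column):
    """Replace missing values in a numeric column with the column mean."""
    fill = mean(r[column] for r in data if r[column] is not None)
    return [dict(r, **{column: fill if r[column] is None else r[column]})
            for r in data]


def encode_text(data, column):
    """Map each distinct text value to an integer code, in order of appearance."""
    codes = {}
    encoded = []
    for r in data:
        code = codes.setdefault(r[column], len(codes))
        encoded.append(dict(r, **{column: code}))
    return encoded, codes


def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle the rows and split them into (training set, test set)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


filled = fill_missing_with_mean(rows, "speed")
encoded, codes = encode_text(filled, "reason")
train, test = train_test_split(encoded)
```

The resulting training set would then be passed to the chosen classifier, and the test set held back for comparison with the predictions.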
After selecting the classification algorithm, if we select the SVM algorithm, we need to select how many columns or rows to use for the test set and then submit the values. The result comes in three types: BOW collection, word count, and word frequency. Based on the word frequency we can estimate the main reason behind the severe road accidents. The same steps apply to the CNB classifier; the difference lies in how the sample count of columns and rows is given: the classifier takes the entire data set, with the missing values handled, and the WEKA tool is applied to it to produce the estimated result. Later we discuss the experimental results for the sample data set used for processing. To be precise, the experiment yields the three types of results described above.
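The three result types (BOW collection, word count, and word frequency) can be illustrated with a small sketch. The vocabulary and the report texts are hypothetical, and the snippet only shows the counting step; it does not reproduce the WEKA workflow or either classifier.

```python
from collections import Counter

# Hypothetical pre-defined bag of words and accident report texts.
BAG_OF_WORDS = {"helmet", "seatbelt", "speed", "drunk"}

reports = [
    "rider without helmet at high speed",
    "driver drunk and over speed",
    "no seatbelt and over speed",
]


def word_counts(texts, vocabulary):
    """Count how often each vocabulary word occurs across all texts."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            if word in vocabulary:
                counts[word] += 1
    return counts


def word_frequencies(counts):
    """Relative frequency of each counted word."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def ranked_words(counts):
    """Words ordered from most to least frequent."""
    return [word for word, _ in counts.most_common()]


counts = word_counts(reports, BAG_OF_WORDS)
frequencies = word_frequencies(counts)
```

In this toy input the highest-ranked word points to speed as the most frequently reported factor, which is the kind of reading the word frequency result is meant to support.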
Having discussed the proposed approach for identifying accident severity using two classification algorithms, it is worth asking whether these two fully satisfy our requirement or whether anything needs to be added. An advantage of the two approaches is that we need not include every function in the algorithm or model; a limited set of functions is enough. These two also give results more quickly than other algorithms. On the other hand, as these are among the oldest algorithms and classification models, the results may vary from what we predicted, and because we use a limited number of functions we cannot get a complete analysis of what is to be predicted.
A better way to address accident severity is to also make use of algorithms such as clustering, K-Means, ANN, etc., so that we can obtain the predicted results we require.

EXPERIMENTAL RESULTS
The results we acquire are of three types, the first of which is the bag of words (BOW) collection. Based on the number of values we assign, we can calculate the accuracy of the algorithm. Figure 5 shows the graph of predicted results, which describes the main reason for the accidents in those areas. The accuracy depends on the time taken and the number of rows or columns processed with the given classification algorithm using data mining or machine learning [24]-[26]. From this graph we can predict the main reason for the severity of accidents in different locations. Classification problems are closely tied to machine learning, in which we train the machine with an algorithm [27]. Using ML, the result we obtained is classified by function. Let the function be the reason behind the accident. Suppose City 1 has 200 cases, of which 100 are drunk driving and the remaining are over speeding, and City 2 has 300 cases, of which 150 are drunk driving and the remaining are over speeding, not following traffic rules, etc. [28], [29]. We can therefore conclude that drunk driving is the major function common to both.
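The two-city example can be checked with a few lines of arithmetic. The sketch below uses only the totals and drunk driving counts given in the example; how the remaining City 2 cases divide among the other reasons is not stated, so it is left out. The dictionary layout and function names are illustrative assumptions.

```python
# Case counts taken from the two-city example in the text.
cases = {
    "City 1": {"total": 200, "drunk driving": 100},
    "City 2": {"total": 300, "drunk driving": 150},
}


def reason_share(city_cases, reason):
    """Fraction of a city's cases attributed to the given reason."""
    return city_cases[reason] / city_cases["total"]


def overall_share(all_cases, reason):
    """Fraction of all cases, across cities, attributed to the given reason."""
    total = sum(c["total"] for c in all_cases.values())
    with_reason = sum(c[reason] for c in all_cases.values())
    return with_reason / total


shares = {city: reason_share(c, "drunk driving") for city, c in cases.items()}
```

Drunk driving accounts for half of the cases in each city and half of the 500 cases overall, which is the basis for calling it the common major function.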
Decision trees [29] and ANNs from the machine learning community [30] are needed for better prediction models in this domain of research. An ANN may be used here to predict the future cause of accidents and to identify the ratio of accidents attributable to a specific reason. That is, we need to predict which reason may cause accidents in the future and what share of the accidents in a specific region that cause accounts for.
In this research we plan to implement some advanced algorithms such as ANN, decision trees, and regression algorithms like SVR (Support Vector Regression) to design a better prediction algorithm with the available data sets. We collected a public data set from a government research web site, which gives brief information about the different reasons behind accidents and the number of cases recorded region-wise over a span of years. The data gives a clear picture that not following traffic rules and over speeding are the main reasons for accident severity in every region. Figure 6 shows a sample of the coefficient and standard deviation levels in our algorithm related to the domain of research.
For a better understanding of decision trees, decision algorithms, and data mining techniques, we can take a health care example such as cancer [31]. We apply data mining knowledge to predict the cancer percentage, the functional lifetime of the patient, and the severity of the disease [32]-[34].
Data mining and machine learning are the two areas used for further research in domains such as predicting accident-prone areas and the types of reasons based on locality in the future. The future of data mining is machine learning.
Figure 6. Coefficients and the Standard Error explanation
Figure 7 explains the total count of accidents in one location, whether a city or a state. These are the total accidents that occurred in one month, and we can conclude that more accidents happen because of lorries, whether due to over speeding or drunk driving. The combination of those reasons can be seen in Figure 5, which gives the combination of accident reasons in one state for one month.

CONCLUSION
Data mining and machine learning are the tools to consider for extracting insight from unprocessed data sets. In this article we implemented SVM and CNB classifiers to predict the main reason for the severity of accidents, both per region and overall. For example, we can consider each state in India and predict both the main reason for accidents in an individual state and the main reason in the country overall. In some cases SVM shows the higher accuracy, at 97%, and in some cases CNB shows an accuracy of 98%. With the obtained results, both algorithms work well under all the conditions considered.