Adaptive traffic lights based on traffic flow prediction using machine learning models

ABSTRACT


INTRODUCTION
The intelligent transportation system is an essential part of the smart city [1]-[3]. A thorough understanding of how these systems work and the development of effective traffic management and control strategies are required to optimize their management. Traffic prediction is crucial in traffic systems: an accurate and efficient traffic flow prediction system is needed to achieve intelligent transport systems (ITS) [4], [5]. The traffic flow prediction system is used to provide accurate road status information [6]. However, traffic congestion is a growing global problem as the population increases, so each traffic intersection is likely to gather many vehicles, and peak periods become more congested than usual. Consequently, numerous research studies have been carried out to explore and manage traffic flow in order to reduce traffic congestion. In [7], traffic flow data collected from the Tokyo Expressway were used as input to a predictive model. In [8], the model predicted the traffic speed on each road section at the next time stamp using data collected from Beijing. By adding a multi-task layer at the end of a deep belief network (DBN), the authors of [9] concurrently predicted traffic speed and flow. In [10], to forecast traffic for the upcoming six timestamps simultaneously, the authors used both the iteration approach and the auto-correlation coefficient method. Regression models are easy to implement and suited to traffic prediction tasks on a simple traffic network. According to [11], the mathematical model between inputs and outputs and the associated parameters are predetermined, and the relationship between each parameter and the input data is relatively certain. The experimental results also show that the linear regression model is efficient and can provide satisfactory prediction results.
There are many types of classification methods for machine learning models based on different perspectives [12]-[17]. In addition, the studies [18], [19] demonstrate the robust performance of linear regression models. The authors provide a forecasting method that combines linear regression with time correlation coefficients. Linear regression deals with the spatio-temporal relationship of road networks and can also use this information to make predictions. Hobeika and Kim [20] provided prediction models, each combined with different traffic variables such as upstream traffic, current traffic, and the historical average. Kwon et al. [21] combined linear regression with a stepwise variable selection method and a tree-based method. To avoid the intersection problem, the authors used cross-validation (CV). According to [22], the k-nearest neighbor (KNN) model is a data-based non-parametric regression method. The KNN model finds the k nearest neighbors that match the current variable value and uses those k data points to predict the next period's value. Furthermore, in [23], four main challenges were pointed out: i) how to define an appropriate state vector, ii) how to define a suitable distance metric, iii) how to generate the forecast, and iv) how to manage the potential neighbor database. In [24], the KNN model was used for traffic flow prediction at different prediction intervals (15, 30, 45, 60 min). To improve the k-nearest-neighbors selection process, Zheng and Su [25] chose the correlation coefficient distance as the distance metric to select more related neighbors. The authors then introduced a linearly sewing principal component method (LSPC), which turns the final prediction into a minimization problem. A rather different approach was taken in [26], in which a MapReduce-based KNN was introduced for traffic flow prediction. The authors also employed distance-weighted voting to mitigate the impact of the choice of k.
Various techniques have been evaluated in the literature. The authors of [27], [28] chose the support vector machine (SVM) model as the kernel of an online traffic prediction system. SVMs have modest computational requirements; however, SVM has difficulties dealing with large traffic network problems. Furthermore, when using grid search, the cost of finding appropriate hyper-parameters is relatively high compared to KNN or regression models [28]. However, methods such as continuous ant colony optimization (CACO) [29] and genetic particle swarm optimization (GPSO) [30] are used to quickly determine the hyper-parameter values. According to [23], [31], when compared to the traditional linear regression model, the non-parametric method can better capture the non-linear features of the traffic pattern, which greatly increases the algorithm's precision. However, the non-parametric method also comes with drawbacks [32]: it imposes high demands on the quantity and quality of historical data, and training costs for non-parametric models can be high compared to other types of methods. Recurrent neural networks (RNN) were introduced to address this problem; RNNs can better describe the impact of a series of traffic records [33]-[35]. RNNs also have drawbacks: if the input series is too long, the optimization process can suffer from gradient explosion or gradient vanishing. To avoid this problem, Hochreiter and Schmidhuber [36] proposed the long short-term memory (LSTM) model to solve the long-term dependency problem. As highlighted by [37], LSTM is still limited when the input time series is very long. Fu et al. [38] compared the gated recurrent unit (GRU) and LSTM using traffic data collected by 30 random sensors from the performance management system (PeMS) dataset.
A common strategy for managing traffic is to use traffic lights [39] with fixed cycle times for each road, but this cannot be considered a good solution because congestion will keep increasing on roads with high vehicle traffic.
Furthermore, the use of cameras is complemented by modern technologies such as computer graphics, automated video analysis for traffic monitoring, state-of-the-art cameras, and high-end computing power [40]. In Delica et al. [41], a simulation program for an intelligent traffic light system was created using the virtual instrumentation laboratory environment (LabVIEW). It measures traffic density by examining the number of vehicles in each lane. Kareem and Hoomod [42] present an integrated three-part module for intelligent traffic signal systems. This paper presents five machine learning models: linear regression, random forest regressor, decision tree regressor, gradient boosting regressor, and K-neighbors regressor, compared on the task of predicting traffic flow at intersections, with the purpose of applying them to an adaptive traffic light system. This method allows for better traffic flow without completely changing the traffic light system structure. Experiments showed that all models have good vehicular flow prediction capabilities and can be used effectively to develop our traffic management system. The rest of this paper is organized as follows: section 2 explores the machine learning algorithms that have been employed in this work; section 3 discusses the experimentation and evaluation; finally, section 4 presents the conclusion and perspectives.

METHOD AND IMPLEMENTATION
The goal of this study is to reduce traffic congestion. The measure for traffic congestion is based on the density of vehicles on the roadways being monitored. Density is a commonly used measure of traffic congestion, and it refers to the number of vehicles present on a particular stretch of road at a given time. The proposed architecture is as follows: we first collect traffic data from cameras and sensors, then deploy the cross-industry standard process for data mining (CRISP-DM) model to process the collected data and apply machine learning algorithms. After implementing the ML algorithms, we apply the grid search method to find the best hyper-parameters. Next, a density reduction function is applied based on changes to the green-light and red-light durations. Finally, we feed the obtained results to the traffic light controller. This model is illustrated in Figure 1. In this paper, the UK National Road Traffic Estimation Statistics publication's road dataset was used to train, validate, and test the proposed system. This is a public dataset for traffic prediction derived from a variety of traffic sensors. It is important to note that only a few public transport datasets are available.
Beginning the analysis with some form of exploratory data analysis is always a wise decision. This way, we can determine dataset properties such as data format, number of samples, and data types. The dataset contains a total of 495,562 data points with 32 attributes. A data sample is shown in Figure 2. The dataset's variables were categorized as either categorical or numerical depending on their characteristics. Handling a large number of features may have an impact on the model's performance and increase the risk of overfitting, as training time grows rapidly with the number of features.
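The exploratory checks described above (shape, types, missing-value counts) can be sketched with pandas. The column names and values below are illustrative stand-ins, not the publication's actual schema:

```python
import pandas as pd

# Hypothetical stand-in for a slice of the road traffic dataset;
# the real dataset has 495,562 rows and 32 attributes.
df = pd.DataFrame({
    "road_name": ["A1", "A1", "M25"],
    "direction_of_travel": ["N", "S", "E"],
    "all_motor_vehicles": [1200, 950, 4300],
})

# Basic exploratory checks: dimensions, data types, per-column null counts.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
```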
Using a measure of centrality is the easiest and quickest approach to imputing missing values [43]. Such measures represent the most common value in a variable's distribution and are therefore a logical option for this method. Multiple measures of centrality are available, such as the mean and the median; the most appropriate measure depends on the distribution of the variable. Figure 3 shows that the data follows a normal distribution, with the data points closely gathered around the mean. The mean is therefore the optimal choice to impute the missing values.
The second step is to explore the relationships between variables. This analysis enables us to identify the most correlated pairs; pairs that are highly correlated represent the same variance in the data, which allows us to investigate them further to understand which attributes matter most for constructing our models. The correlation matrix in Figure 4 shows a strong correlation between the start_junction_road_name, end_junction_road_name, day, month, year, road_name, red_time, and green_time attributes.
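A correlation matrix such as the one in Figure 4 can be computed directly with pandas. The example below uses two synthetic attributes (illustrative names, not the dataset's columns) that are negatively correlated by construction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic signal-timing attributes: red_time is roughly 90s minus green_time,
# so the two columns are strongly negatively correlated by construction.
green_time = rng.uniform(20, 60, size=200)
red_time = 90 - green_time + rng.normal(0, 2, size=200)

df = pd.DataFrame({"green_time": green_time, "red_time": red_time})

# Pairwise Pearson correlation matrix, as visualised in Figure 4.
corr = df.corr()
print(corr)
```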

Data cleaning
A common practice during this task is to correct or delete erroneous values. Dealing with missing data is an important issue, as such data cannot simply be ignored in a machine learning analysis. It can be addressed with imputation techniques, which produce an "artificial value" to replace the missing value and generate approximately unbiased estimates. For numeric fields we chose mean imputation, which replaces null values with the average of the whole column; for categorical fields we used the most frequent value (mode). We applied two functions that go through all the quantitative and qualitative data and replace missing values using the imputer from the sklearn library.
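The two imputation strategies described above can be sketched with scikit-learn's SimpleImputer. The toy columns below are illustrative, not the dataset's actual fields:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with one numeric and one categorical column containing gaps.
df = pd.DataFrame({
    "traffic_flow": [120.0, np.nan, 95.0, 110.0],
    "road_name": ["A1", "A1", np.nan, "M25"],
})

# Mean imputation for the quantitative column.
num_imputer = SimpleImputer(strategy="mean")
df[["traffic_flow"]] = num_imputer.fit_transform(df[["traffic_flow"]])

# Mode (most frequent value) imputation for the qualitative column.
cat_imputer = SimpleImputer(strategy="most_frequent")
df[["road_name"]] = cat_imputer.fit_transform(df[["road_name"]])

print(df)
```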

Encoding of categorical characteristics
This is an important preprocessing step for structured data sets in supervised learning, since most machine learning algorithms only operate on numerical data. Our dataset contains four categorical variables that must be encoded: direction_of_travel, start_junction_road_name, road_name, and end_junction_road_name. These attributes describe roads in qualitative terms, so we must encode them into a numerical, machine-readable representation. This is accomplished using the LabelEncoder from the sklearn library.
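The encoding step can be sketched as follows; the direction values are illustrative:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values for one categorical attribute (direction_of_travel).
directions = ["North", "South", "North", "East"]

# LabelEncoder maps each distinct category to an integer,
# with classes_ sorted alphabetically.
encoder = LabelEncoder()
encoded = encoder.fit_transform(directions)

print(list(encoder.classes_))  # sorted category names
print(list(encoded))           # integer codes per sample
```

The integer codes are arbitrary ordinals, which tree-based models such as random forests and gradient boosting handle reasonably well.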

Machine learning methods
Five regression models were implemented in Python using the scikit-learn library [44]: linear regression, random forest regression, decision tree regression, gradient boosting regression, and K-neighbors regression, all initially with default parameters. Most machine learning algorithms expose a large number of parameters that control the modeling process. These are called hyper-parameters and must be tuned so that the model can obtain optimum results. The grid search method is used, which is simply an exhaustive search over a manually specified subset of the hyper-parameter space of a learning algorithm: a list of candidate values is established for each hyper-parameter, and the combination that gives the highest accuracy is chosen.
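The grid search procedure can be sketched with scikit-learn's GridSearchCV. The synthetic data and the candidate value lists below are illustrative only; the paper's actual grids are given in Table 1:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 4))            # synthetic features
y = X[:, 0] * 10 + rng.normal(0, 0.1, 200)      # synthetic traffic flow target

# Candidate values per hyper-parameter; grid search tries every combination
# with cross-validation and keeps the best-scoring one.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10],
}

search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=3, scoring="r2")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```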

RESULTS AND SIMULATION

Obtained results
To evaluate the performance of the machine learning algorithms, we relied on metrics from the scikit-learn library [44]: mean absolute error (MAE) [45], [46], root mean squared error (RMSE) [47], [48], and R-squared (R²) [49], [50]. They are defined as (1)-(3):

MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|   (1)

RMSE = √((1/n) Σᵢ (yᵢ − ŷᵢ)²)   (2)

R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²   (3)

where n is the number of samples, yᵢ is the observed traffic flow, ŷᵢ is the predicted traffic flow, and ȳ is the mean of the observed values. MAE and RMSE measure absolute prediction error; smaller values indicate higher prediction performance for these two metrics. The value of R² ranges from 0 to 1, with values closer to 1 indicating a better fit of the regression model.
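These three metrics can be computed directly with scikit-learn; a small worked example with illustrative numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 120.0, 90.0, 110.0])   # observed flow (illustrative)
y_pred = np.array([ 98.0, 125.0, 92.0, 108.0])   # predicted flow

mae = mean_absolute_error(y_true, y_pred)        # mean |y - y_hat|
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)                    # 1 - SS_res / SS_tot

print(mae, rmse, r2)   # 2.75, ~3.04, 0.926
```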
Once the data is prepared for model training, machine learning algorithms usually have many adjustable values or parameters, referred to as hyper-parameters, that can influence the model's performance. Therefore, it is necessary to tune these hyper-parameters to optimize the model's ability to solve the machine learning problem. To achieve this, we utilized the grid search method, which involves exhaustively searching through a manually specified subset of the hyper-parameter space of a learning algorithm. In this approach, a set of values is established for each hyper-parameter, and the one that yields the highest accuracy is chosen. Table 1 presents the various hyper-parameters used for each model. Table 2 lists the performance metrics of each machine learning model. The results show that the random forest regressor obtains the best results (R-squared of 0.98) but takes the longest time to train, followed by the decision tree regressor (score of 0.97) with good results and less training time, then the gradient boosting regressor (score of 0.97) but with more training time, and finally the K-neighbors regressor and linear regression. The random forest regressor is an ensemble learning method that combines multiple decision trees to generate predictions and is known for its high accuracy. In order to evaluate the cost-benefit of implementing the machine learning models proposed in this study, we conducted experiments to determine the average training time for each model. These experiments were carried out using the Google Colaboratory platform [51], which allowed us to perform the necessary computations in a cloud-based environment. For each of the five machine learning models tested in this study, we recorded the time it took to train the model on the UK National Road Traffic Estimation Statistics publication's road dataset, which was used for both training and testing.
The training time was measured in seconds, and the average training time for each model was calculated by taking the mean of the training times across all experiments. Figure 5 presents the results of our experiments in the form of a bar chart, with each bar representing the average training time for a particular machine learning model. As can be seen from Figure 5, the training time varies widely across the different models, with the decision tree model having the shortest average training time and the random forest model having the longest. Figure 6 shows a comparison between the actual data and the predicted data using the five machine learning models: Figure 6(a) shows the results of linear regression, Figure 6(b) the random forest regressor, Figure 6(c) the decision tree regressor, Figure 6(d) the gradient boosting regressor, and Figure 6(e) the K-neighbors regressor.
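The timing procedure described above (fit each model repeatedly and average the wall-clock durations) can be sketched as follows, on synthetic data and with two of the five models for brevity:

```python
import time
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(5000, 4))                      # synthetic features
y = X @ np.array([3.0, -1.0, 2.0, 0.5]) + rng.normal(0, 0.1, 5000)

models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
}

# Time fit() over several runs and keep the mean, as in the paper's experiments.
avg_times = {}
for name, model in models.items():
    runs = []
    for _ in range(3):
        start = time.perf_counter()
        model.fit(X, y)
        runs.append(time.perf_counter() - start)
    avg_times[name] = sum(runs) / len(runs)

print(avg_times)
```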
In each subfigure, the orange line represents the actual traffic flow data, while the blue line represents the predicted traffic flow data. The x-axis represents the time period, while the y-axis represents the density values for each segment of the intersection. The results show that the random forest regressor and gradient boosting regressor models outperformed the other models, with the lowest prediction error and the highest accuracy. This comparison was based on the evaluation metrics defined earlier: mean absolute error (MAE), root mean squared error (RMSE), and the coefficient of determination (R-squared). These metrics were computed by comparing the actual data with the predicted data on the test split (20%) of the dataset, and the results were plotted in Figure 6 to provide a visual representation of the performance of each model.

The evaluation of any model is a critical step in determining its effectiveness in solving the targeted problem. In the case of our proposed system, aimed at reducing traffic congestion, the evaluation process is crucial to understand its performance and efficiency. To conduct this evaluation, we analyze the predicted density values for each segment of an intersection before and after the application of the reduction function. By comparing these values, we can determine the extent to which our proposed system has been successful in reducing traffic congestion. Table 3 showcases the results of this analysis, highlighting the effectiveness of our reduction function in minimizing traffic congestion at intersections.

Traffic simulation
Simulation is a valuable tool for comprehending real-world scenarios, and traffic simulation is widely employed to analyze road traffic congestion. Numerous studies have employed simulation models to simulate traffic congestion, with diverse traffic models available, predominantly comprising microscopic, macroscopic, and mesoscopic models. Microscopic models consider the interaction of individual vehicles, encompassing parameters such as driver behavior, vehicle position, velocity, acceleration, and more. Conversely, macroscopic models analyze the collective behavior of traffic flows, while mesoscopic models integrate features of both macroscopic and microscopic models, describing the impact of nearby vehicles and approximating the cumulative temporal and spatial traffic behavior. In this paper, we designed and implemented a traffic simulation in Python employing Pygame and a microscopic module (as illustrated in Figure 8) to assess the duration of red and green lights at an intersection. At each time step, the real-time simulation module defines the state of the traffic light phase (red, green) and the duration of each phase. Figure 7 illustrates the microscopic model, in which each vehicle is assigned a number i. The i-th vehicle follows the (i−1)-th vehicle. For the i-th vehicle, let xᵢ be its position on the road, vᵢ its velocity, and lᵢ its length; the same holds for all vehicles.
The vehicle-to-vehicle (V2V) distance is denoted sᵢ, and Δvᵢ is the velocity difference between the i-th vehicle and the vehicle in front of it (vehicle i−1). Figure 8 shows the resulting simulation. This simulation evaluates the optimal timing for the green and red signals at an intersection by using timers on each light to indicate when the light will change. Various types of vehicles are generated, and their movements are influenced by the signals and surrounding vehicles. The control system applies the prediction and density reduction functions to determine signal plan values, which are then sent to the traffic controller to adjust the traffic lights accordingly. The results of this simulation demonstrate the effectiveness of the proposed system in significantly reducing traffic congestion.
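One update step of a car-following model in this notation can be sketched as below. This is a minimal follow-the-leader rule, not the paper's exact model: each vehicle's speed is capped so the gap sᵢ to its leader never drops below a safe minimum within one time step (assuming the leader holds its speed). All constants and values are illustrative:

```python
DT = 0.5        # time step (s)
MIN_GAP = 2.0   # minimum bumper-to-bumper gap s_i (m)
V_MAX = 15.0    # free-flow speed (m/s)

# Lists ordered front to back: index 0 is the lead vehicle.
positions = [100.0, 80.0, 55.0]    # x_i (m)
velocities = [12.0, 14.0, 15.0]    # v_i (m/s)
lengths = [4.0, 4.0, 4.0]          # l_i (m)

def step(positions, velocities, lengths):
    """Advance every vehicle by one time step DT."""
    new_v = []
    for i, v in enumerate(velocities):
        if i == 0:
            new_v.append(min(v, V_MAX))  # the leader drives freely
            continue
        # Gap s_i to vehicle i-1, accounting for its length l_{i-1}.
        gap = positions[i - 1] - lengths[i - 1] - positions[i]
        # Largest speed that keeps gap >= MIN_GAP after one step,
        # if the leader keeps its current speed.
        allowed = velocities[i - 1] + (gap - MIN_GAP) / DT
        new_v.append(max(0.0, min(v, V_MAX, allowed)))
    new_p = [p + v * DT for p, v in zip(positions, new_v)]
    return new_p, new_v

positions, velocities = step(positions, velocities, lengths)
print(positions, velocities)
```

In the full simulation this update would also react to the traffic light phase, e.g. by treating a red light as a stationary leader at the stop line.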

CONCLUSION
The focus of this paper was to develop a machine learning-based system that can predict traffic flow at intersections. To achieve this, we implemented several machine learning models and compared their performance. The results showed that the random forest regressor model outperformed the others with an R-squared and EV score of 0.98. The random forest regressor is a tree-based ensemble algorithm that constructs a multitude of decision trees and outputs the mean prediction of the individual trees. This algorithm is known for its high accuracy, but it also takes more processing time due to its complexity. The decision tree regressor had a score of 0.97, slightly lower than the random forest regressor, but it provided a good balance between prediction quality and training time. The gradient boosting regressor showed similar performance to the decision tree regressor but took more processing time. The K-neighbors regressor and linear regression models had relatively lower scores, but their performance did not differ greatly from the others. To validate the effectiveness of the proposed system, we simulated the traffic flow at intersections using a microscopic model. The model used a traffic density reduction function to adjust the green- and red-light durations, which helped reduce traffic congestion significantly. The efficiency of the suggested system in decreasing traffic congestion justifies its deployment at key intersections. However, to enhance the system further, we aim to incorporate additional technologies. One potential area of improvement involves the data used to train the models. Although the dataset used in this study was obtained from various traffic sensors, we intend to explore the use of other data sources, such as traffic camera images, to improve the models' precision.
Additionally, we plan to integrate real-time data feeds into the system, enabling it to make real-time decisions, further enhancing its efficacy in managing traffic flow at intersections. In conclusion, the proposed system demonstrates promising results in forecasting traffic flow at intersections and reducing traffic congestion. Although the random forest regressor model proved to be the most precise, the decision tree regressor offered a good balance between accuracy and training time. This system has the potential to be a valuable tool for managing traffic at major intersections, and we anticipate further development and improvement in the future.