Comparative analysis of multiple classification models to improve PM10 prediction performance

Received Jul 31, 2020; Revised Sep 22, 2020; Accepted Oct 14, 2020

With the increasing demand for highly accurate particulate matter prediction, various attempts have been made to improve prediction accuracy by applying machine learning algorithms. However, the characteristics of particulate matter and the uneven occurrence rate across concentration levels make it difficult to train prediction models, resulting in poor prediction. To solve this problem, we propose multiple classification models that predict particulate matter concentrations by dividing them into AQI-based classes. We designed the classification models using logistic regression, decision tree, SVM, and ensemble algorithms. A comparison of the four classification models through error matrices confirmed an f-score of 0.82 or higher for all models other than the logistic regression model.


INTRODUCTION
Particulate matter consists of particles of various sizes, shapes, and compositions. Particulate matter, divided into PM10 and PM2.5 according to particle sizes of 10 μm or less and 2.5 μm or less, affects our health by causing cardiovascular, respiratory, and cerebrovascular diseases. Accordingly, particulate matter has been classified as a dangerous substance, and it is regarded as a cause of decreased vitality among members of society [1][2][3][4][5][6][7]. To avoid these harmful effects as much as possible, it has become routine practice to check the information provided based on the air quality index (AQI), which is divided into four categories: 'good', 'moderate', 'bad', and 'very bad'. Korea's particulate matter prediction accuracy was approximately 60% in 2015, and the Korea Meteorological Administration's prediction process has an annual prediction accuracy of approximately 80%. However, this figure reflects the weather forecaster's experience, and the actual particulate matter prediction model shows an accuracy of approximately 50% [8,9].
Therefore, various attempts have been made to improve the prediction accuracy of particulate matter by applying machine learning prediction algorithms along with conventional statistical techniques [10,11]. However, the characteristics of particulate matter, which arise from various external factors, and the uneven occurrence rate across concentration levels make it difficult to train prediction models effectively. To improve the prediction accuracy of particulate matter concentrations, K. W. Cho proposed a prediction model that separated the concentrations at a specific threshold and predicted each range separately. By dividing low and high concentrations at a particulate matter concentration of 81 μg/m³, they compared the prediction performance through a deep neural network-based prediction model. The results confirmed that the prediction performance improved for both the low and high concentrations, with an improvement of 20.62% for the high concentrations in particular [9]. The study by K. Kaya et al. proposed a solution to the class imbalance problem in order to address the poor prediction of regression models caused by the variation in the occurrence rate by particulate matter concentration. They confirmed an accuracy of approximately 80% on the entire data set by equalizing the number of samples per class in the unbalanced data through the proposed up-sampling and down-sampling [12].
In this paper, we propose data classification models by concentration to improve the performance of a particulate matter concentration prediction model. Of the machine learning classification models, we use the logistic regression, decision tree, support vector machine (SVM), and ensemble models. Based on the AQI, we configure multiple classification models by dividing particulate matter concentrations into 4 classes. In order to apply the optimal parameters to the models, we design the models by performing parameter search through grid search cross validation. We perform model evaluation using the error matrix.
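The division into four classes can be illustrated with a short sketch. The cut points below (30, 80, and 150 μg/m³ as the upper bounds of 'good', 'moderate', and 'bad') are an assumption made for illustration and are not stated in this paper; the function name `pm10_to_class` is likewise hypothetical.

```python
from bisect import bisect_left

# Assumed upper bounds (ug/m3) of 'good', 'moderate', and 'bad' for PM10.
# These cut points are an illustration, not values taken from the paper.
PM10_UPPER_BOUNDS = [30, 80, 150]
AQI_CLASSES = ["good", "moderate", "bad", "very bad"]


def pm10_to_class(concentration):
    """Map a PM10 concentration (ug/m3) to one of the four AQI classes."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # bisect_left keeps a value equal to an upper bound in the lower class,
    # e.g. 80 stays 'moderate' and 81 becomes 'bad'.
    return AQI_CLASSES[bisect_left(PM10_UPPER_BOUNDS, concentration)]


labels = [pm10_to_class(c) for c in (12, 30, 31, 80, 81, 150, 151)]
```

With these assumed bounds, the class labels become the dependent variable of the classification models, while the measured pollutant and weather values serve as the independent variables.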

DATA COLLECTION AND CONFIGURATION

Data collection and preprocessing
Particulate matter is affected by various factors. Air pollutants and meteorological elements are typical, and they are commonly applied in studies on predicting particulate matter concentrations [13][14][15][16]. Based on these studies, we selected the major data shown in Table 1.

Table 1 (excerpt): the average particulate matter (< 10 μm) of the previous 1 hour; the average humidity of the previous 1 hour; Wind Speed, the average wind speed of the previous 1 hour; Wind Direction, the most frequent wind direction of the previous 1 hour.

According to the selected data, we collected the final confirmed data measured at hourly intervals for 10 years, from 2009 to 2018, at the measurement station around Cheonan in Korea. The air pollution data are composed of PM10, PM10h, O3, CO, NO2, and SO2, and the meteorological elements consist of temperature, humidity, wind speed, and wind direction. Since some data were missing due to power outages and maintenance of the measurement equipment, we removed all data for any hour in which a value was missing. Among the meteorological elements, wind direction is expressed in azimuth, and the values 0° and 360°, which were often used interchangeably, were unified to 360°.
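The two cleaning steps described above (dropping every hour that has a missing value, and unifying wind directions of 0° to 360°) can be sketched as follows. The field names and the sample rows are hypothetical; the paper does not describe the storage format of the raw data.

```python
def clean_records(records):
    """Drop hours with any missing value and unify wind direction 0 -> 360."""
    cleaned = []
    for row in records:
        # Remove the whole hourly record if any measurement is missing.
        if any(value is None for value in row.values()):
            continue
        row = dict(row)  # copy so the input records are left untouched
        if row["wind_direction"] == 0:
            row["wind_direction"] = 360
        cleaned.append(row)
    return cleaned


# Hypothetical hourly records; None marks a missing measurement.
sample = [
    {"pm10": 42, "humidity": 55, "wind_speed": 1.2, "wind_direction": 0},
    {"pm10": None, "humidity": 60, "wind_speed": 0.8, "wind_direction": 90},
    {"pm10": 95, "humidity": 48, "wind_speed": 2.5, "wind_direction": 270},
]
cleaned = clean_records(sample)
```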
There is a need for data preprocessing to perform classification through machine learning algorithms using the collected data. Since we used the classification algorithm based on supervised learning, we performed classification by separately dividing the data corresponding to independent variables and the data corresponding to dependent variables. The independent variable data, which includes

Data configuration
The data used in the supervised learning model is mainly composed of a training set for learning and a test set for evaluating the trained model. The training set used to train the model is subdivided into a train set and a validation set because of the need to verify whether training is well completed. In this paper, we configure a training set of 75% and a test set of 25% with the preprocessed data. The training set is composed of a train set of 80% and a validation set of 20%. Figure 1 shows the structure of the final data used in the model, and Table 3 shows the configuration of the data set.
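As a rough illustration of the split described above, the sketch below divides sample indices into 75% training and 25% test data, and then divides the training portion into 80% train and 20% validation data. Random shuffling and the helper name `split_indices` are assumptions; the paper does not state how samples were assigned to each set.

```python
import random


def split_indices(n_samples, seed=0):
    """Split indices into train / validation / test (60% / 15% / 25% overall)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment

    n_test = round(n_samples * 0.25)      # 25% test set
    test = indices[:n_test]
    training = indices[n_test:]           # 75% training set

    n_val = round(len(training) * 0.20)   # 20% of the training set
    validation = training[:n_val]
    train = training[n_val:]              # 80% of the training set
    return train, validation, test


train, validation, test = split_indices(1000)
print(len(train), len(validation), len(test))  # 600 150 250
```

In terms of the whole data set, the train set is 0.75 × 0.80 = 60%, the validation set is 0.75 × 0.20 = 15%, and the test set is 25%.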

CLASSIFICATION MODEL DESIGN

Logistic regression model design
Logistic regression is an algorithm used to predict the likelihood of an event using a linear combination of independent variables. As in general regression analysis, it is used in future prediction models by deriving a specific function through the relationship between dependent and independent variables. However, unlike linear regression, since the prediction result is classified as a specific category when the dependent variable is categorical data, it is used as a classification technique rather than a regression technique. It is divided into binomial or multinomial depending on the category characteristics of the dependent variable. The dependent variable for training the classification model of particulate matter concentrations has four categories. Accordingly, we built a model by applying a multinomial logistic regression method.
A prediction model may overfit or underfit depending on how closely it is fitted to the training data. An overfitted model has difficulty predicting new data because it focuses only on the training data. An underfitted model fails to predict most of the data because its training is too simple to capture the characteristics of the data. To address these problems, logistic regression uses L2 regularization by default, controlled by the regularization parameter c [17].
Therefore, for better prediction performance, we searched for the c value that best fits the model using grid search cross validation. We set the c values to be searched to 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, and 1000. In order to select parameters with high generalization performance, we set the cv parameter of k-fold cross validation to 5. Accordingly, each c value was evaluated in turn, and its scores were compared on the held-out fold of each of the 5 training and validation rounds. So that each validation fold was preprocessed correctly during cross validation, we ran the search on a pipeline consisting of a min-max scaler and the model. Table 4 shows the mean test score and c values of the top 3 rankings in the cross validation results. The mean test score was highest, at 0.808958, when the c value was 10.0; thus we selected 10.0 as the c value to be applied to logistic regression.
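A minimal sketch of such a search is given below, assuming scikit-learn as the library (the paper does not name it, although terms such as cv and mean test score match its API). The synthetic four-class data stand in for the real measurements, and `max_iter=1000` is an added assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic four-class data used only as a stand-in for the real data set.
X, y = make_classification(n_samples=400, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

# Putting the scaler inside the pipeline means it is re-fitted on the
# training folds only, so the validation fold never leaks into scaling.
pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Candidate values of the regularization parameter, as listed in the text.
param_grid = {"model__C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}

search = GridSearchCV(pipeline, param_grid, cv=5)  # 5-fold cross validation
search.fit(X, y)

best_c = search.best_params_["model__C"]
mean_scores = search.cv_results_["mean_test_score"]
```

The same pattern would apply to the other models by swapping the estimator and the parameter grid, for example `max_depth` for a decision tree.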

Decision tree model design
Decision tree is a widely used model for classification and regression. It is basically an algorithm that learns by continuously answering questions to approach a specific decision. With the increase in the number of leaf nodes, the accuracy of the training set increases but overfitting may occur [18]. One of the methods used to prevent overfitting is to stop the growth of the tree when the depth of the tree reaches a certain level. The parameter that limits the depth of a decision tree is max_depth, and we are able to improve the performance of the model by adjusting the depth.
Therefore, we searched for the optimal max_depth value using grid search cross validation. We set the range of max_depth values to be searched to 1 through 24, and performed the search with the cv parameter of k-fold cross validation set to 5. Additionally, so that each validation fold was preprocessed correctly during cross validation, we ran the search on a pipeline consisting of a min-max scaler and the model. Table 5 shows the mean test score and max_depth values of the top 3 rankings in the cross validation results. The mean test score was highest, at 0.85936013, when the max_depth value was 4; thus we selected 4 as the max_depth value to be applied to the decision tree.

SVM model design
SVM is a supervised machine learning model for pattern recognition and data analysis. It is mainly used for classification and regression analysis. Given a set of data in which each sample belongs to one of two categories, it generates a non-probabilistic binary linear classification model that determines the category to which new data belongs. The generated classification model is expressed as a boundary in the space onto which the data is mapped, and the algorithm finds the boundary with the largest margin [19][20][21]. In other words, SVM defines a baseline for classification between categories, which is expressed as a decision boundary.
In SVM, the difference in performance is determined depending on how the decision boundary is defined, and it is crucial to find the optimal decision boundary. The parameters applied to find the optimal decision boundary are c and gamma. C is a parameter that adjusts the allowable range of outliers by controlling the margin of the decision boundary, and gamma is a parameter that prevents overfitting of the model by controlling the flexibility of the decision boundary.
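The role of gamma can be illustrated with the radial basis function (RBF) kernel, k(x, z) = exp(-gamma * ||x - z||^2). The use of an RBF kernel is an assumption here, since the paper does not name the kernel; the sketch only shows how gamma changes the similarity between two fixed points.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel value between two points given as equal-length sequences."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = (0.0, 0.0), (1.0, 1.0)
# Small gamma: distant points still look similar (smoother boundary).
# Large gamma: similarity vanishes quickly (more flexible boundary).
similarities = {gamma: rbf_kernel(x, z, gamma) for gamma in (0.01, 1, 100)}
```

Under this assumption, a small gamma lets each training point influence a wide region, while a large gamma restricts its influence to its immediate neighborhood, which is why large values tend to overfit.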
We performed the search for the optimal c and gamma values to find the optimal decision boundary using grid search cross validation. We set the range of c and gamma values to be searched to 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, and 1000, and performed the search by setting the cv parameter of k-fold cross validation to 5. Additionally, for preprocessing of validation fold during cross validation, we searched the c and gamma values by building the pipeline of min max scaler and the model. Table 6 shows the mean test score and relevant c and gamma values of the top 3 rankings in the cross validation results. The cross validation results showed that the mean test score was highest with 0.859238 when the c and gamma values were 1000 and 0.01, respectively, thus we selected the c value and the gamma value to be applied to SVM as 1000 and 0.01, respectively.

Ensemble model design
Ensemble is a technique that generates a powerful model by combining multiple models to achieve better prediction performance than an individual machine learning model. When multiple models are combined, the amount of calculation generally increases, yet the combination prevents overfitting more effectively than an individual model, and it has the advantage of performing better than an individual model when the performance of that model is poor [22][23][24]. Ensemble methods are mainly divided into a collection methodology and a boosting methodology. The collection methodology uses a predetermined set of models, whereas the boosting methodology gradually increases the number of models used. In this study, we built an ensemble model by combining the previously designed logistic regression, decision tree, and SVM models, which corresponds to the collection methodology. Figure 2 shows the structure of the ensemble model. The training set data are used as input to the combined logistic regression, decision tree, and SVM models, and each individual model outputs its predicted results. The final prediction results are generated by voting on these outputs [25]. Voting is divided into hard and soft voting. Hard voting selects the final prediction as the class predicted by the majority of the individual models. The ensemble model designed in this study uses soft voting, which selects the final prediction based on the sum of the conditional probabilities output by the individual models.
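The difference between the two voting schemes can be seen in a small sketch. The probability values below are invented for illustration and do not come from the trained models; the function names are hypothetical.

```python
from collections import Counter

CLASSES = ["good", "moderate", "bad", "very bad"]


def hard_vote(probabilities):
    """Majority vote over the class each model considers most likely."""
    labels = [CLASSES[p.index(max(p))] for p in probabilities]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities):
    """Pick the class with the largest summed probability across models."""
    summed = [sum(column) for column in zip(*probabilities)]
    return CLASSES[summed.index(max(summed))]


# Invented class probabilities for one sample from three models.
model_probabilities = [
    [0.05, 0.50, 0.45, 0.00],  # model 1 narrowly prefers 'moderate'
    [0.05, 0.50, 0.45, 0.00],  # model 2 narrowly prefers 'moderate'
    [0.00, 0.05, 0.90, 0.05],  # model 3 strongly prefers 'bad'
]

hard_result = hard_vote(model_probabilities)  # 'moderate' (2 votes to 1)
soft_result = soft_vote(model_probabilities)  # 'bad' (summed 1.80 vs 1.05)
```

Soft voting lets one confident model outweigh two uncertain ones, which hard voting cannot do.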

PERFORMANCE EVALUATION
We evaluated classification performance using the previously configured data set and the designed classification models. For performance evaluation, we used precision, recall, and f-score based on the error matrix. Figure 3 shows the error matrices created from the classification results of the trained models. Table 7 shows the performance evaluation of the classification models calculated from the error matrices.
For the logistic regression model, the precision was highest, at 0.8685, for predictions of 'good', and the recall was highest, at 0.9341, for input data of 'moderate'. On the other hand, the classification did not work well for 'bad' and 'very bad'. In particular, no prediction was made at all for 'very bad'. For the decision tree model, the precision was highest, at 0.8977, for predictions of 'moderate', and the recall was highest, at 0.9023, for input data of 'moderate'. On the other hand, the precision and recall for 'bad' and 'very bad' were relatively low compared to 'good' and 'moderate'. The SVM model showed its highest precision and recall, 0.8997 and 0.8997, respectively, for 'moderate'. As in the decision tree model, the precision and recall for 'bad' and 'very bad' were relatively low compared to 'good' and 'moderate'. The ensemble model showed its highest precision and recall, 0.8997 and 0.8997, for 'moderate'. However, the precision and recall for 'bad' and 'very bad' were relatively low compared to 'good' and 'moderate', which indicates difficulty in classifying these classes. Overall, the precision and recall of 'good' and 'moderate' were relatively higher than those of 'bad' and 'very bad'. When analyzed through the error matrices in Figure 3, it was confirmed that, of the input data, the proportion of
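The three metrics can be derived from an error matrix as in the sketch below. It assumes that rows are true classes and columns are predicted classes, that the f-score is the harmonic mean of precision and recall (F1), and that a metric with a zero denominator is reported as 0; the counts are invented and are not the values in Figure 3.

```python
def per_class_metrics(matrix):
    """Return (precision, recall, f_score) per class.

    matrix[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(matrix)
    results = []
    for k in range(n):
        true_positive = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        results.append((precision, recall, f_score))
    return results


def macro_f_score(matrix):
    """Unweighted mean of per-class f-scores, so every class counts equally."""
    metrics = per_class_metrics(matrix)
    return sum(f for _, _, f in metrics) / len(metrics)


# Invented counts, ordered 'good', 'moderate', 'bad', 'very bad'.
toy_matrix = [
    [80, 20, 0, 0],
    [10, 180, 10, 0],
    [0, 15, 30, 0],
    [0, 0, 5, 0],   # 'very bad' is never predicted
]
metrics = per_class_metrics(toy_matrix)
macro = macro_f_score(toy_matrix)
```

Because the macro average weights every class equally, a class that is rarely or never predicted pulls the overall score down even when it makes up a small share of the data.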

CONCLUSION
The characteristics of particulate matter make it difficult to train particulate matter concentration prediction models. To solve this problem, various studies have been conducted, such as performing prediction by dividing particulate matter concentrations at a specific threshold. In this paper, to improve the performance of the particulate matter concentration prediction model, we proposed multiple classification models that classify particulate matter concentrations into four classes based on the AQI. To this end, we configured data sets by selecting air pollutant data and meteorological elements collected at hourly intervals for 10 years around Cheonan. As the classification models in this study, we used the logistic regression, decision tree, SVM, and ensemble models. In order to apply optimal parameters to each model, we searched the parameters through grid search cross validation. We built the ensemble model by combining the logistic regression, decision tree, and SVM models into one. We used error matrices to evaluate the performance of the four classification models. Logistic regression showed poor precision, recall, and f-score compared to the other classification models. The decision tree, SVM, and ensemble models all showed a precision and recall of 0.85 or higher for 'good' and 'moderate' based on the AQI, whereas they showed 0.75 to 0.79 for 'bad' and 'very bad'. We confirmed that this was because the particulate matter data used in the classification models were unbalanced, with a high proportion of a specific class. Accordingly, we verified the scores of the models by weighting all classes equally, and found that the models other than the logistic regression model showed a score of 0.82 or higher. Of these models, the SVM model showed the best classification performance, with 0.8277.
In the future, in order to address the problem of unbalanced data, we plan to compare classification performance across changes to the algorithms of the classification models and to design a particulate matter concentration prediction model based on the improved classification models.