Prediction model of algal blooms using logistic regression and confusion matrix

Received Jul 31, 2020; revised Sep 22, 2020; accepted Oct 8, 2020.
Algal bloom data are collected and refined as experimental data for algal bloom prediction. The refined algal blooms dataset is analyzed by logistic regression, and statistical tests and regularization are performed to find the marine environmental factors affecting algal blooms. The predicted value of algal bloom occurrence is obtained through logistic regression analysis using these marine environmental factors. The actual values and the predicted values of the algal blooms dataset are applied to the confusion matrix. By improving the decision boundary of the existing logistic regression, the accuracy, sensitivity and precision of algal bloom prediction are improved. In this paper, the algal blooms prediction model is established by the ensemble method using logistic regression and confusion matrix. Algal bloom prediction is improved, and this is verified through big data analysis.


INTRODUCTION
Logistic regression is a special case of the generalized linear model and is similar to linear regression, but it differs in the relationship between the dependent and independent variables. The dependent variable of logistic regression can be binary or continuous, and it is used as a model for classification or prediction when the dependent variable is binary [1,2]. If the dependent variable is binary, its value is limited to two outcomes and the conditional probability follows the Bernoulli distribution. Logistic regression keeps the value of the dependent variable between 0 and 1 regardless of the range of the independent variables, so for given input data it can assign the result to a specific class and predict the likelihood of an event occurring [3][4][5].
In logistic regression with a binary dependent variable, the predicted value can be calculated using a linear combination of the independent variables. However, since the value of the dependent variable is classified as pass or fail around the decision boundary, predictions close to the decision boundary may be less accurate [6][7][8]. In binary logistic regression, the actual value of the dependent variable is available and the predicted value can be calculated, so the two can be compared in a confusion matrix [9,10]. Sensitivity and precision can be obtained from the confusion matrix using the actual and predicted values of the logistic regression, and applying this to algal blooms yields a summary of indicators such as accuracy, sensitivity, and precision [11][12][13].
Sensitivity and precision are as important as accuracy in predicting algal bloom occurrence, because high sensitivity and precision provide indicators that can help prevent massive property damage [14][15][16][17]. The elements of the marine environment that cause algal blooms are generally known, but we found no study that analyzes the influence of each element on algal blooms and uses it to predict them. In this study, the predicted value of logistic regression is calculated by machine learning. The actual value used in the logistic regression analysis and the predicted value calculated through machine learning are applied to the confusion matrix to create a prediction model for algal blooms.
This paper is organized as follows. Logistic regression and the confusion matrix, the background theory of this study, are described in section 2. In section 3, we describe the proposed algal blooms prediction model, which uses the ensemble method of logistic regression and confusion matrix. There we describe the process of extracting marine environmental elements using logistic regression, obtaining red tide prediction values, applying the improved decision boundary to logistic regression, and improving accuracy, sensitivity and precision through the confusion matrix. In section 4, we verify the proposed algal blooms prediction model using the algal blooms dataset, and conclusions are given in section 5.

LOGISTIC REGRESSION AND CONFUSION MATRIX
Logistic regression
Linear regression is a model that estimates regression coefficients that linearly express the relationship between independent variables X and a continuous dependent variable Y. If the dependent variable Y is binary, logistic regression is used because linear regression cannot be applied directly. Some regression algorithms can be used for classification, and logistic regression is widely used to estimate the probability that a sample belongs to a particular class. Used as a binary classifier, it predicts that the sample belongs to the class if the estimated probability exceeds 0.5 and that it does not if the probability is less than 0.5 [18,19]. To estimate the probability, logistic regression calculates the weighted sum of the input features, but instead of outputting the result directly as linear regression does, it outputs the logistic of the result. The logistic is a sigmoid function that outputs a value between 0 and 1 [20]. The logistic function limits the output to a value between 0 and 1 for any numerical value x, and it is defined as follows.
$$p = \frac{1}{1 + e^{-f(x)}} \tag{1}$$
In (1), $f(x)$ can be either a simple linear function or a multiple linear function. For classification problems with two categories, a sample with $f(x) > 0$ is classified as $y \to 1$ and a sample with $f(x) < 0$ is classified as $y \to 0$. The decision boundary of the logistic regression model is the hyperplane $f(x) = 0$, where $p = 0.5$. Errors in prediction usually occur around the decision boundary [21,22].
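The relationship between the linear score, the logistic function, and the 0.5 decision boundary can be illustrated with a minimal Python sketch. The function names (`logistic`, `linear_score`, `classify`) and the example weights are hypothetical and are not taken from the paper.

```python
import math


def logistic(z: float) -> float:
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    # Two branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def linear_score(weights, bias, x):
    """Weighted sum of the input features plus an intercept."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def classify(weights, bias, x, threshold=0.5):
    """Return (label, probability); label is 1 when probability > threshold."""
    p = logistic(linear_score(weights, bias, x))
    return (1 if p > threshold else 0), p


# Hypothetical one-feature model: a positive score gives class 1, a negative score class 0.
label_pos, p_pos = classify([1.0], 0.0, [2.0])
label_neg, p_neg = classify([1.0], 0.0, [-2.0])
```

A score of exactly zero gives a probability of 0.5, which is why samples whose score is near zero are the ones most easily misclassified.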

Confusion matrix
The confusion matrix is a tool that shows the performance of a classifier easily and effectively, and its results are easy to interpret. A confusion matrix can be used to evaluate the performance of any classification model or algorithm. As shown in Table 1, the rows of the confusion matrix represent the predicted class and the columns represent the actual class. Each cell is one of the possible combinations of prediction and actuality. In the 2×2 confusion matrix, the cells are true positive (TP), false positive (FP), false negative (FN), and true negative (TN) [23].
A perfect model has values only on the diagonal, and all other cells are zero; a poor model has its counts spread evenly across all cells. The confusion matrix also shows in what way a bad model is bad, because the value of each cell identifies a pattern of misclassification [24]. Methods for summarizing the results of the confusion matrix include accuracy, precision, and recall (sensitivity).
The accuracy is obtained by dividing the number of correct predictions (TP+TN) by the total number of samples, and is represented by (2). Among the methods for summarizing the results of the confusion matrix, precision and sensitivity are represented by (3) and (4), respectively.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
Precision is the positive predictive value: it measures how many of the samples predicted to be positive (TP+FP) are true positives (TP). Precision is used as a performance indicator when the goal is to reduce the number of false positives (FP).
$$\text{Precision} = \frac{TP}{TP + FP} \tag{3}$$
Sensitivity measures how many of all positive samples (TP+FN) are classified as positive (TP).
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{4}$$
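As a small illustration of these three summaries, the following Python sketch counts the four cells of a 2×2 confusion matrix from actual and predicted labels and then derives accuracy, precision, and sensitivity. The function names and the toy label lists are hypothetical and are not data from the study.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels (1 = bloom, 0 = no bloom)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1
        elif a == 0 and p == 1:
            fp += 1
        elif a == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Accuracy, precision and sensitivity; a ratio with an empty denominator is reported as 0.0."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels, made up for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]
counts = confusion_counts(actual, predicted)
metrics = summarize(*counts)
```

In this toy case the accuracy (5 of 8 correct) is higher than the sensitivity (2 of 4 positives found), which shows how accuracy alone can hide missed positives.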

PREDICTION MODEL
After collecting the algal blooms dataset from the National Institute of Fisheries Science, it was cleaned and refined. The first multiple logistic regression analysis was performed on the refined dataset, and some attributes were removed through a statistical test. A second multiple logistic regression analysis was performed without the removed attributes, and then regularization was applied. After applying the regularization, a third multiple logistic regression analysis was performed and the results were applied to the confusion matrix. Figure 1 shows this process.
The multiple linear function of $n$ independent variables is $f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$ [25]. Therefore, for multiple independent variables that affect harmful algal blooms, the multiple logistic function that keeps the dependent variable in the range [0, 1] is as shown in (5), where $p$ is the probability of occurrence.
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n)}} \tag{5}$$
Equation (5) calculates the effect of each element of the ocean observation data, which is an independent variable, on the occurrence of harmful algal blooms, which is the dependent variable. This is the basic model for estimating the probability of occurrence of harmful algal blooms.
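Maximum likelihood estimation is used to estimate the parameters $\beta$ in the regression expression $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$ obtained by the logit transformation. The log-likelihood function can be obtained from the likelihood function [26], expressed as the product of Bernoulli probability functions, and is expressed as (6). The parameters that maximize the log-likelihood function in (6) are determined from the multiple independent variables that affect harmful algal blooms.
$$\ln L = \sum_{i} y_i \left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_n x_{in}\right) - \sum_{i} \ln\left(1 + e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_n x_{in}}\right) \tag{6}$$
The L1 regularization [27] used to eliminate low-impact independent variables among the multiple independent variables that affect harmful algal blooms is shown in (7).
$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{i} \Big( y_i - \sum_{j} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j} |\beta_j| \right\} \tag{7}$$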
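The Bernoulli log-likelihood that maximum likelihood estimation maximizes can be evaluated directly, as in the following minimal Python sketch. The function name `log_likelihood`, the convention that `beta[0]` is the intercept, and the two-sample toy dataset are assumptions made for illustration; they are not the paper's data or code.

```python
import math


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood of a logistic model.

    beta[0] is the intercept and beta[1:] are the coefficients (an assumed layout).
    For each sample: y_i * eta_i - ln(1 + exp(eta_i)), with eta_i the linear score.
    """
    total = 0.0
    for x_i, y_i in zip(X, y):
        eta = beta[0] + sum(b * v for b, v in zip(beta[1:], x_i))
        # Numerically stable ln(1 + exp(eta)).
        softplus = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += y_i * eta - softplus
    return total


# Toy data, made up for illustration: one feature, two samples.
X = [[-1.0], [1.0]]
y = [0, 1]
ll_null = log_likelihood([0.0, 0.0], X, y)  # all-zero parameters: p = 0.5 for every sample
ll_fit = log_likelihood([0.0, 2.0], X, y)   # a slope that separates the two samples
```

With all parameters at zero every sample has probability 0.5, so the log-likelihood is the number of samples times ln 0.5; parameters that fit the labels better give a larger (less negative) value, which is what the estimation searches for.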
The attributes of the marine environment observation dataset are shown in Table 2 and are used as independent variables in logistic regression. Equation (8) is obtained by applying seven independent variables (water temperature, salinity, dissolved oxygen, phosphate phosphorus, nitrite nitrogen, nitrate nitrogen, and silicate silicon) to the basic multiple logistic regression model (5).
The p-value is used to determine whether each independent variable is statistically significant in the results of the multiple logistic regression analysis on the training dataset, and independent variables with a p-value of 0.05 or higher are excluded. The parameters for the statistically significant independent variables are as shown in (9).
The regularization that sets coefficients close to zero to exactly zero, and thereby removes the corresponding independent variables, is as shown in (7). Applying (7) to the result of (9) gives the result shown in (10).
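How an L1 penalty sets small coefficients to exactly zero can be sketched in a few lines of Python. The first function evaluates an objective of the L1-penalized squared-error form; the second is the soft-thresholding operator, a standard building block of L1 solvers that the paper does not mention and that is used here only as an illustration. The coefficient values and the penalty weight `lam` are hypothetical.

```python
def lasso_objective(beta, X, y, lam):
    """Squared error plus lam times the sum of absolute coefficients (no separate intercept)."""
    squared_error = sum(
        (y_i - sum(b * v for b, v in zip(beta, x_i))) ** 2
        for x_i, y_i in zip(X, y)
    )
    return squared_error + lam * sum(abs(b) for b in beta)


def soft_threshold(b, lam):
    """Shrink b toward zero by lam; values with |b| <= lam become exactly 0.0."""
    magnitude = abs(b) - lam
    if magnitude <= 0.0:
        return 0.0
    return magnitude if b > 0 else -magnitude


# Hypothetical coefficients: two are large, two are close to zero.
coefficients = [1.8, -0.03, -0.4, 0.02]
shrunk = [soft_threshold(b, 0.05) for b in coefficients]
kept = [i for i, b in enumerate(shrunk) if b != 0.0]  # indices of surviving variables
```

Here the two coefficients whose magnitude is below the penalty weight are set to zero, so only the first and third variables would remain in the model.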
Equation (11) is the logistic regression model for algal blooms prediction obtained by applying the above process to the algal blooms dataset.
The normalization process from (8) to (11) corresponds to Step 2 through Step 5 of the algorithm in Table 3, respectively. The algal blooms prediction model was normalized while performing experiments based on the algorithm in Table 3. The detailed experimental process is described in section 4. The equation for obtaining a decision boundary that increases the sensitivity and precision is defined as shown in (12). Table 3 summarizes the whole procedure: preparing the training dataset, performing multiple regression analysis with statistical tests, removing low-weight independent variables, setting up the final logistic regression model, and finally finding the decision boundary.
Table 3. Algorithm for establishing algal blooms prediction model
| Step | Statements |
|---|---|
| 1 | Extraction, transformation and loading from the collected dataset. Prepare the training dataset. |
| 2 | Perform multiple regression analysis using (1) on the training dataset. Output regression coefficients and statistical tests. |
| 3 | Perform a statistical significance test: attributes with p-value ≥ 0.05 are excluded from the training dataset. |
| 4 | Perform multiple regression analysis on the training dataset with the attributes whose p-value < 0.05. Output regression coefficients and statistical tests for those attributes. |
| 5 | Regularize the regression coefficients from step 4 using $\arg\min_{\beta} \{ \sum_{i} ( y_i - \sum_{j} \beta_j x_{ij} )^2 + \lambda \sum_{j} \lvert\beta_j\rvert \}$. Perform multiple regression analysis using the regularized regression formula. Output the test dataset. |
| 6 | Input the test dataset from step 5. Predict the probability using the decision boundary $\lvert 0.5 \pm \tfrac{1}{2}(\ldots) \rvert$ of (12), based on the confusion matrix. |

EXPERIMENT
Multiple logistic regression analysis (8) is performed on the training dataset to obtain the results shown in Table 4. In Table 4, the p-value is used to determine whether each independent variable is statistically significant, and independent variables with a p-value of 0.05 or higher are excluded. Parameters are determined for the statistically significant independent variables in Table 4, and applying L1 regularization gives the results shown in Table 5. The logistic regression model for algal blooms prediction (11) is obtained from the coefficients in Table 5. Predicting the occurrence of algal blooms from (11) gives 91.84% accuracy.
Accuracy alone may not be sufficient to assess the prediction performance for algal blooms, because it does not reveal the false negatives or false positives, so we use the confusion matrix. The confusion matrix for algal blooms shown in Table 6 is obtained from the algal blooms dataset. Table 7 shows the sensitivity, specificity, and precision obtained at the decision boundary 0.5 using the values of the confusion matrix in Table 6. The sensitivity is 27.04%, so most actual algal blooms are missed (false negatives), and the precision is 51.99%, so nearly half of the predicted algal blooms are false positives.
Since the sensitivity and precision are low when the decision boundary is 0.5, we apply the proposed decision boundary (12) to solve these problems, and the results are shown in Table 8. When the decision boundary proposed in this paper is applied, the decision boundary becomes |0.5 ± 0.25|. With this decision boundary, TP=494, TN=9026, FN=327, and FP=7, so the sensitivity is 60.17% and the precision is 98.6%, as shown in Table 8.
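The reported sensitivity and precision can be checked by substituting the reported counts (TP=494, TN=9026, FN=327, FP=7) into the definitions of sensitivity and precision given in section 2. The accuracy line is not a figure quoted in the text above; it is only what the same four counts imply under the usual definition of accuracy.

$$\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{494}{494 + 327} = \frac{494}{821} \approx 0.6017$$

$$\text{Precision} = \frac{TP}{TP + FP} = \frac{494}{494 + 7} = \frac{494}{501} \approx 0.9860$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{494 + 9026}{9854} = \frac{9520}{9854} \approx 0.9661$$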

CONCLUSION
In this paper, logistic regression and the confusion matrix were used to predict the occurrence of algal blooms. Algal blooms datasets were collected and refined for experimental analysis of algal blooms prediction. Logistic regression analysis was performed on the refined algal blooms dataset, and the main marine environmental factors affecting algal blooms were found through statistical tests and regularization. Logistic regression was then performed using the marine environmental factors that influence algal blooms, and the accuracy of predicting algal bloom occurrence was obtained. The values of the confusion matrix were obtained using the dataset for algal blooms prediction and the predicted values obtained from logistic regression. Although the sensitivity and precision for the occurrence of algal blooms can be obtained from the values of the confusion matrix, they were low with the existing decision boundary of 0.5. Sensitivity and precision were improved by using the decision boundary proposed in this study. In this paper, the algal blooms prediction model was established by the ensemble method using logistic regression analysis and confusion matrix. The accuracy, sensitivity, and precision of algal blooms prediction were improved, and this was verified through big data analysis.