Feature selection for multiple water quality status: integrated bootstrapping and SMOTE approach in imbalance classes

Received Nov 24, 2019; Revised Feb 17, 2020; Accepted Feb 25, 2020

STORET is a method for determining river water quality and classifying it into four classes (very good, good, medium and bad) based on the measured value of each water attribute, or feature. The success of a pattern recognition model depends heavily on the quality of the data. This research addresses two issues: the data are disproportionately distributed among the classes (imbalance class), and the attributes contain noise. To handle the imbalance class problem, this research integrates the SMOTE technique with bootstrapping. To eliminate noise in the attributes, experiments are conducted with several filter-based feature selection algorithms (information gain, rule, derivation, correlation and chi square). The research has six stages: data understanding, pre-processing, imbalance class, feature selection, classification and performance evaluation. Testing with 10-fold cross validation shows that the SMOTE-bootstrapping technique increases the accuracy from 83.3% to 98.8%. Eliminating noise in the data attributes increases the accuracy further, to 99.5% (using the feature subset produced by the information gain algorithm and the decision tree classification algorithm).


INTRODUCTION
STORET is a method used by the Minister of Environment to determine the water quality status of a river or water body [1]. The STORET method works by comparing the data obtained from water sampling with the water quality standard for the relevant class, based on the attributes used. Using more parameters may incur higher costs for laboratory handling and measurement, because observation and analysis are carried out in the laboratory for each water sample at each sampling point. The volume of data to be analyzed calls for automation in determining the water quality status, which in turn requires a pattern recognition model that can classify the water quality status. In general, several methods can be used to measure water quality status: (a) a water quality index, as in [2], which proposes the West Java Water Quality Index (WJWQI) to measure water quality in West Java province, and [3,4]; (b) a community-based approach suggested by [5]; (c) the Water Pollution Index [6]; (d) the STORET index [7].
Water has many parameters that can be measured to determine its quality status, and the status can be classified from the values of a selected set of attributes. In pattern recognition, one of the most important factors in the success of the classification process is a suitable set of features [8]. Two processes relate to features: feature extraction [8] and feature selection [9]. Feature selection is important in pattern recognition for several reasons: it improves the performance of the pattern recognition model (a simpler, faster model obtained by eliminating irrelevant data) [10], it helps to visualize the data during model selection, and it reduces the dimension of, and the noise in, the data [11]. After feature extraction, two issues need attention: data whose amount is imbalanced among the classes, and noise in the data attributes.
Two approaches are commonly used to handle the imbalance class case: oversampling and under-sampling. One technique that can be applied to both cases is SMOTE [12]. In oversampling, data in the minority class are duplicated. In under-sampling, some samples are removed from the majority class. The two can also be combined, which is usually called the hybrid technique [10]. All of these techniques share the same aim: to obtain a learning dataset in which the classes have the same, or almost the same, number of samples (balance). SMOTE has been used to address the imbalance class case in several studies, including attack detection data [13], medical data [14][15][16] and e-commerce data [17]. Besides SMOTE, bootstrapping is another technique that can be used to resample data. Resampling addresses the small size of the minority class by changing the distribution of the class that is underrepresented during training of the machine learning algorithm, and it is a known solution to the imbalance class case in learning datasets. The method is also suitable for large-scale data, where it is used to reduce the number of training samples so that training can use a smaller amount of data that still represents the actual data.
Noise in the attribute data affects classification performance. If the data have a very large number of attributes, parameters or features, the computing process is affected as well [11]. Feature selection is therefore required. In general, three approaches can be used to select attributes or features: the filter approach [18], the wrapper approach [19] and the embedded approach [19]. In the filter approach, feature selection and learning are carried out in series, whereas in the wrapper approach they are carried out in parallel. In the filter approach, the feature subset is selected beforehand based on the weight of each attribute or feature. Each attribute or feature is weighted so that the attributes can be ranked against a predetermined threshold value [18].
The classification stage is carried out once the selected features have been obtained. Several learning algorithms are available for classification: decision tree (DT) [18], naïve Bayes [17], K-nearest neighbors (KNN) [20], random forest [21], artificial neural network [22] and support vector machine [23]. Naïve Bayes is a simple classification model whose learning process takes little time compared with other classification models, and it is recognized as having good prediction accuracy. It is easy and convenient to use because it does not require complicated parameter estimation, and it is reliable on large datasets [24]. DT is one of the most widely implemented classification algorithms in machine learning. The aim of DT is to build a model that predicts the value of a target class for unseen test instances based on several input features [17,25]. The advantages of DT are that it is simple, easy to understand and easy to implement, requires little prior knowledge, works with both numeric and categorical data, and can handle large datasets [26,27].
Previous research has not produced a model that integrates the bootstrapping resampling technique with the SMOTE technique to handle the imbalance class case in a multi-class setting. In addition, feature selection with the filter approach is carried out to handle noise in the data attributes. Five algorithms (information gain, rule, chi square, correlation and derivation) are used, based on the weight values they produce, and their performance is then compared. For classification, four algorithms are used (decision tree, K-nearest neighbors, naïve Bayes and random forest). [...] free chlorine, oil and fat, Cd, Zn, Cu, Pb, total coliform and faecal coliform. The data used comprise 120 samples with 22 parameters. The research has six stages: data understanding, pre-processing, imbalance class, feature selection, classification and performance evaluation, as shown in Figure 1.

RESEARCH METHOD
The collected dataset has 120 rows and 22 columns. Each row holds the data taken at one river water sampling location, and each column holds one water attribute/feature/parameter used to determine the river water quality status. The STORET method is used to determine the river water quality status based on [1]. This method compares field measurement data with the water quality standard for the corresponding water class. The Brantas River falls into the class 2 category. Before the water quality status is determined, the following steps are carried out:

Manual feature sorting
Some of the collected data could not be completed for all features. There are several causes, one of which is that a feature/attribute was not detected by the measuring instrument because its value fell below or above the instrument's threshold. Of the 22 features, 13 are selected, and the remaining features are not used to determine the river water quality status.

Determination of status of water quality of Brantas River using STORET method
The status class of the river water quality is determined from the thirteen selected features. There are four classes of river water quality: A (excellent), B (good), C (intermediate) and D (bad). The STORET method works by comparing the data obtained from water sampling with the water quality standard for the relevant class, based on the parameters used. Here, the Brantas River falls into the second class category of the quality standard. The resulting classification of river water quality status is imbalanced, with the following class sizes: A=10, B=16, C=80 and D=14.
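The degree of imbalance can be quantified directly from the class counts reported above. The short sketch below is only an illustration; the variable names are assumptions introduced here and are not part of the original study.

```python
# Class counts reported for the STORET-labelled Brantas River dataset.
class_counts = {"A": 10, "B": 16, "C": 80, "D": 14}

total = sum(class_counts.values())  # 120 samples in total

# Share of each class in the dataset, as a percentage.
class_share = {label: 100 * n / total for label, n in class_counts.items()}

# Imbalance ratio: size of the largest class divided by the smallest.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in class_share.items():
    print(f"class {label}: {class_counts[label]:3d} samples ({share:.1f}%)")
print(f"imbalance ratio (majority/minority): {imbalance_ratio:.1f}")
```

Class C holds two thirds of the samples, which is why a classifier trained on the raw data can score well by favouring C.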

Preprocessing

Missing data elimination
Before the best features are selected, zero/empty data must be eliminated so that they do not disturb the performance of the algorithms applied in later stages. Empty data can be filled in several ways: with the average, minimum or maximum value of the feature, or with zero. This research uses the average value of the feature.
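Replacing missing values with the feature average can be sketched as follows. This is a minimal illustration, assuming missing entries are represented as `None`; the function name `impute_column_mean` and the toy data are hypothetical and do not come from the study.

```python
def impute_column_mean(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed value at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Hypothetical toy data: three samples, two features, None = missing.
samples = [[7.0, None], [8.0, 4.0], [None, 6.0]]
filled = impute_column_mean(samples)
print(filled)
```

The mean is computed from the observed values only, so a missing entry never influences the value that replaces it.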

Data transformation
The data to be processed need to be normalized so that all attributes stay within the same range of values. There are several formulations for normalizing data. This research uses the proportion transformation, which aims to express the values of each attribute proportionally.
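The text does not give the formula for the proportion transformation. The sketch below assumes the common definition in which each value is divided by the sum of its attribute (column), so that every column sums to one; this definition, the function name and the toy data are assumptions rather than details confirmed by the study.

```python
def proportion_transform(rows):
    """Divide each value by the sum of its column (assumed definition)."""
    n_cols = len(rows[0])
    totals = [sum(row[j] for row in rows) for j in range(n_cols)]
    # Columns that sum to zero are left unchanged to avoid division by zero.
    return [[v / totals[j] if totals[j] != 0 else v for j, v in enumerate(row)]
            for row in rows]

# Hypothetical toy data: two samples, two features on different scales.
samples = [[2.0, 10.0], [6.0, 30.0]]
scaled = proportion_transform(samples)
print(scaled)
```

After the transformation both columns carry the same proportions even though their original scales differ by a factor of five.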

Imbalance class

SMOTE (synthetic minority over-sampling technique)
Determining the Brantas River water quality status with the STORET method produces an imbalance class case, so the SMOTE technique is applied [12]. Two approaches can be taken, random over-sampling (ROS) and random under-sampling (RUS). Because the dataset used to search for the best model of river water quality status is not large, the ROS approach is selected. The river water data in categories A, B and D are very few, so synthetic data, taken randomly from the same feature, are added to give the minority classes the same amount of data as the majority class.

Bootstrapping sampling method
After the dataset has been balanced with SMOTE, specifically with the random over-sampling technique, samples are selected randomly from the training data so that the data used are smaller in size.
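Bootstrapping draws samples at random with replacement. The sketch below shows one way to draw a smaller training subset from a balanced dataset; the sample size, the seed and the toy data are assumptions made for illustration, since the text does not state the settings used.

```python
import random

def bootstrap_sample(rows, n_samples, seed=0):
    """Draw n_samples rows uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(rows) for _ in range(n_samples)]

# Hypothetical balanced training set: (feature vector, class label) pairs.
training = [([0.1 * i, 0.2 * i], "ABCD"[i % 4]) for i in range(20)]
subset = bootstrap_sample(training, n_samples=10)
print(len(subset), "rows drawn from", len(training))
```

Because sampling is done with replacement, the same row can appear more than once in the subset.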

Feature selection
The aim of feature selection is to eliminate features that do not contribute strongly to determining the water quality status. This affects the dimension of both the training data and the testing data. In general, there are four approaches to selecting a feature subset: filters, in which features are evaluated independently of the learning process; wrappers, in which the feature subset is selected based on the evaluation result of the learning process; embedded methods, in which feature selection is carried out during the learning process; and simple filters, which assume the features are independent (this approach is usually used on data with many features, such as text classification). In this research, feature selection uses the filter approach, which separates the evaluation of the best feature subset from the learning process. The best subset is determined from the score or weight produced by each feature subset. The stages of the filter approach are shown in Figure 2. This research uses five algorithms in the filter category to obtain the weight values: derivation, information gain, chi square, rule and correlation.
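The filter pattern described above (weight each feature, rank, keep those above a threshold) can be sketched with information gain as the weighting function. This is a minimal illustration on already-discretized toy data; the threshold, the function names and the data are hypothetical, and continuous measurements would first need to be binned.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def select_by_weight(weights, threshold):
    """Keep features whose weight is at least `threshold`, best first."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, w in ranked if w >= threshold]

# Hypothetical, already-discretized toy data (not the study's measurements).
labels = ["A", "A", "C", "C"]
features = {
    "BOD": ["low", "low", "high", "high"],  # separates the classes perfectly
    "pH": ["low", "high", "low", "high"],   # carries no class information
}
weights = {name: information_gain(col, labels) for name, col in features.items()}
selected = select_by_weight(weights, threshold=0.5)
print(weights, selected)
```

The selection step never consults a classifier, which is what distinguishes a filter from a wrapper.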

Classification
The classification stage measures how well the classification model determines the river water quality status from the river data for each attribute. Four algorithms are used in this research: decision tree (DT) [18], naïve Bayes [17], K-nearest neighbors (KNN) [20] and random forest [21].
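Of the four classifiers, K-nearest neighbors is the simplest to write out in full. The sketch below shows the basic majority-vote rule with Euclidean distance; the value of `k`, the distance measure and the toy data are assumptions, as the text does not report the settings used.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical normalized samples with two features each.
train_x = [[0.1, 0.1], [0.2, 0.1], [0.1, 0.2], [0.9, 0.8], [0.8, 0.9], [0.9, 0.9]]
train_y = ["A", "A", "A", "D", "D", "D"]
prediction = knn_predict(train_x, train_y, [0.15, 0.15], k=3)
print(prediction)
```

Because the vote depends on distances, K-nearest neighbors is sensitive to feature scale, which is one reason normalization precedes classification.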

Performance evaluation
Testing in this research uses k-fold cross validation, which divides the dataset into two parts, training data and testing data, alternating ten times. The parameters used to compare the performance of the models are accuracy, precision and recall [28].
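The evaluation procedure can be sketched as a fold splitter plus the three metrics. This is an illustration under simplifying assumptions: the folds are neither shuffled nor stratified, precision and recall are computed for one class at a time, and all names and toy predictions are hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treated as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical predictions for six samples.
y_true = ["A", "A", "C", "C", "C", "D"]
y_pred = ["A", "C", "C", "C", "D", "D"]
acc = accuracy(y_true, y_pred)
prec_c, rec_c = precision_recall(y_true, y_pred, "C")
splits = list(k_fold_indices(120, k=10))
print(acc, prec_c, rec_c, len(splits))
```

With 120 samples and ten folds, each round trains on 108 samples and tests on 12. On imbalanced data, per-class precision and recall expose errors on the minority classes that overall accuracy can hide.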

RESULTS OF RESEARCH

Data understanding
After manual feature selection, the STORET method is used to classify the river water quality status based on the selected features. There are four classes: A, B, C and D. An example of the data used in this research is shown in Table 1. There are thirteen selected features (temperature, pH, DHL, DO, BOD, COD, TSS, NO3-N, NO2-N, PO4-P, detergent, total coliform and faecal coliform), each given in the unit of the quality standard. For example, the fourth column holds a river water sample from one observation point, with values for eleven features, in quality status category A. The values of the total coliform and faecal coliform features are not detected.

Pre-processing
In this stage, several steps are carried out to prepare a dataset that is free of empty data and to normalize the data, so that the classification process gives good results. Several pre-processing methods are tested with the decision tree classifier using five-fold cross validation with stratified sampling. The results are compared with a t-test to find the best pre-processing method, as shown in Table 2. Column B shows the accuracy on data without normalization, 79.2%. Column C shows the accuracy on data whose missing values have been replaced, 82.5%. Column D shows the accuracy on normalized data, 79.2%, and column E shows the accuracy on data that have been both normalized and had their missing values replaced, 83.3%. The t-test comparison shows that pre-processing (normalizing the data and replacing the missing values) increases classification performance.
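The text does not specify which form of t-test is used. One common choice for comparing two configurations over the same folds is the paired t-test on per-fold accuracies, sketched below. The per-fold values are hypothetical placeholders and are NOT the study's results; only the procedure is illustrated.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold difference (paired, n - 1 degrees of freedom)."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff / (std_diff / math.sqrt(len(diffs)))

# Hypothetical per-fold accuracies (five folds); NOT the study's results.
raw_folds = [0.78, 0.80, 0.79, 0.81, 0.78]
preprocessed_folds = [0.82, 0.85, 0.82, 0.84, 0.83]
t_value = paired_t_statistic(raw_folds, preprocessed_folds)
print(f"t = {t_value:.3f} with {len(raw_folds) - 1} degrees of freedom")
```

With five folds there are four degrees of freedom, for which the two-sided 5% critical value is about 2.776; a larger absolute t value would indicate a significant difference between the two configurations.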

Imbalance class
The dataset obtained in the data understanding stage is imbalanced across the classes, which affects the training process. Three scenarios are therefore tested at this stage: SMOTE, bootstrapping, and the integration of SMOTE and bootstrapping. Training and testing are carried out with the decision tree method using 10-fold cross validation. The t-test comparison of the methods for handling the imbalance class case is shown in Table 3. Column B shows the SMOTE method, column C shows the bootstrapping method, and column D shows the integration of SMOTE and bootstrapping. Table 3 shows that the SMOTE method increases the training accuracy to 96.5%, and that integrating SMOTE with bootstrapping increases it further, to 98.8%.

Table 3. t-test comparison of methods for handling the imbalance class case

Feature selection
The target of this stage is to find the best features for determining the river water quality status. Five filter-based feature selection algorithms are used: information gain, chi square, derivation, correlation and rule. Each feature is first given a code: F1=Temperature; F2=pH; F3=DHL; F4=DO; F5=BOD; F6=COD; F7=TSS; F8=NO3-N; F9=NO2-N; F10=PO4-P; F11=Detergent; F12=Total Coliform and F13=Faecal Coliform. The attributes and features selected by each feature selection algorithm are shown in Table 4, which lists the best-scoring feature subset found by each of the five filter algorithms. For example, the second row of entry number 1 shows the eight features produced by the rule algorithm: BOD, TSS, NO3-N, NO2-N, DHL, COD, DO and Total Coliform. The following rows of Table 4 show the feature subsets selected by chi square (pH, COD, DO, BOD, NO3-N, Faecal Coliform, Temperature, Total Coliform), information gain (BOD, Faecal Coliform, Total Coliform, COD, DHL, detergent, TSS, NO3-N, NO2-N, temperature), correlation (pH, faecal coliform, total coliform, COD, detergent, DHL, DHL, NO3-N) and derivation (PO4-P, detergent, TSS, NO2-N). A learning process is then run on each feature subset using the decision tree method to measure performance, and the result is shown in Table 5. Columns B to F show the classification results for the feature subsets produced by the feature selection algorithms (chi square, derivation, information gain, correlation and rule). The t-test result shows that the feature subset produced by the information gain algorithm gives the highest accuracy, 99.5%.
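One way to read the five subsets together is to count how many algorithms selected each feature. The sketch below does this using the subsets exactly as listed in the text; the counting itself is an added illustration and not an analysis reported by the study.

```python
from collections import Counter

# Feature subsets as listed in the text for each filter algorithm.
subsets = {
    "rule": ["BOD", "TSS", "NO3-N", "NO2-N", "DHL", "COD", "DO", "Total Coliform"],
    "chi square": ["pH", "COD", "DO", "BOD", "NO3-N", "Faecal Coliform",
                   "Temperature", "Total Coliform"],
    "information gain": ["BOD", "Faecal Coliform", "Total Coliform", "COD", "DHL",
                         "Detergent", "TSS", "NO3-N", "NO2-N", "Temperature"],
    # The text lists DHL twice for correlation; set() below collapses the repeat.
    "correlation": ["pH", "Faecal Coliform", "Total Coliform", "COD", "Detergent",
                    "DHL", "DHL", "NO3-N"],
    "derivation": ["PO4-P", "Detergent", "TSS", "NO2-N"],
}

# Number of algorithms that selected each feature.
votes = Counter(f for features in subsets.values() for f in set(features))
for feature, n in votes.most_common():
    print(f"{feature:16s} selected by {n} of {len(subsets)} algorithms")
```

COD, NO3-N and Total Coliform are each chosen by four of the five algorithms, while PO4-P is chosen only by derivation.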

Classification
After the best feature subsets have been obtained from the feature selection algorithms, classification is carried out with four classification algorithms, and the average accuracy is calculated for each selected feature subset. The classification algorithms used are decision tree, k-NN, naïve Bayes and random forest. Table 6 and Figure 3 show that, with the eight features produced by the chi square algorithm, the highest accuracy is obtained by the decision tree algorithm, 98.50%, with an average accuracy over the four classification algorithms of 96.29%. With the four features produced by the derivation algorithm, the highest accuracy is obtained by the random forest algorithm, 98.49%, with an average accuracy over the four classification algorithms of 91.86%. The ten features produced by the information gain algorithm give the highest accuracy, 99.50%, with both the decision tree and random forest classification algorithms, and an average of 96.92%. A different result is shown by the remaining two algorithms, correlation and rule. Both reach their best accuracy with the same classification algorithm, random forest, at 97.99% and 99.50% respectively. In general, it can be concluded that the feature subsets produced by the information gain and rule algorithms give an accuracy level above 96.5%.

Performance evaluation
In general, the pattern recognition model for classifying river water quality status from several water feature subsets has the following sub-stages: without pre-processing, pre-processing, the SMOTE technique and bootstrapping to handle the imbalance class, and feature selection. Each sub-process is compared using the decision tree algorithm for classification. The average accuracy obtained with 10-fold cross validation is shown in Figure 4.
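The contribution of each stage can be read off as the difference between consecutive accuracies. The sketch below uses the decision tree accuracies quoted in the text rather than values read from Figure 4. Note that the pre-processing figure of 83.3% was reported in the Pre-processing subsection under five-fold cross validation, so the differences are indicative only.

```python
# Decision-tree accuracies (%) quoted in the text for each stage.
stage_accuracy = [
    ("pre-processing", 83.3),
    ("SMOTE", 96.5),
    ("SMOTE + bootstrapping", 98.8),
    ("feature selection (information gain)", 99.5),
]

gains = []
previous = None
for stage, acc in stage_accuracy:
    if previous is not None:
        # Gain in percentage points over the preceding stage.
        gains.append((stage, round(acc - previous, 1)))
    previous = acc
    print(f"{stage:40s} {acc:5.1f}%")

total_gain = round(stage_accuracy[-1][1] - stage_accuracy[0][1], 1)
print(f"total gain: {total_gain} percentage points")
```

Most of the improvement comes from handling the imbalance class case; feature selection adds a further 0.7 percentage points.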

CONCLUSION
An imbalanced amount of data in each class is shown to affect the learning process of the pattern recognition system. The SMOTE technique and bootstrapping are shown to handle the imbalance class case, with a significant increase in accuracy from 83.3% to 98.8%. To reduce noise in the attributes, experiments were conducted with five feature selection algorithms (chi square, correlation, derivation, information gain and rule). In terms of average accuracy, the features produced by the rule algorithm and the information gain algorithm give the best values, 97.87% and 96.92% respectively. Using the features selected by information gain with the decision tree classification algorithm increases the accuracy to 99.5%.