Oversampling techniques for student performance classification in an engineering course

Received Jul 31, 2020; Revised Dec 24, 2020; Accepted Jan 18, 2021

The first year is important for engineering students to plan their academic path properly, and all first-year subjects are essential foundations for engineering study. Student performance prediction helps students improve: students can check their predicted performance by themselves, and if they are aware that it is low, they can act early to improve it. This research focuses on combining oversampling of minority-class data with various classifier models. The oversampling techniques are SMOTE, Borderline-SMOTE, SVMSMOTE, and ADASYN, and four classifiers are applied: MLP, gradient boosting, AdaBoost, and random forest. The results show that Borderline-SMOTE gives the best minority-class prediction with several classifiers.


INTRODUCTION
Nowadays, educational data mining (EDM) is widely used to help students and teachers manage a proper learning environment. Student performance prediction, grade prediction, and student dropout prediction have been developed in several case studies. Some of these works focus on the classification model and feature selection, while others address imbalanced data, which degrades classification accuracy on the minority class. This paper focuses on the first-year subjects of an engineering course that all engineering students must take and pass.
One problem of student performance prediction in this research is the imbalanced data across three classes (low, medium, and high). We use the synthetic minority oversampling technique (SMOTE), Borderline-SMOTE, SVMSMOTE, and ADASYN to generate synthetic instances, then apply the oversampled data with various classifier techniques to improve classification performance on the minority classes. Moreover, a student can take corrective action if the predicted class is low. SMOTE is an oversampling technique that resamples and balances the dataset. Resampled data are created by interpolation within the minority class, so the algorithm relies on the relations among data values. Synthetic data are created from minority-class instances, based on the distance to nearest neighbors randomly selected from the same class. First, a minority-class sample, the number of nearest neighbors k, and the amount of oversampling are set up. Second, one of the k nearest neighbors of a minority sample is randomly selected and a new instance is generated by interpolation [1]. In [1], the authors reviewed and analyzed various oversampling techniques on many imbalanced datasets using four classification measures of imbalanced learning; the four best-performing oversampling techniques were polynomial-fit-SMOTE, SMOTE-IPF, ProWSyn, and Lee.
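To make the interpolation step concrete, the following is a minimal sketch of SMOTE-style synthesis in Python; it is not the implementation used in this paper, and the function and parameter names are illustrative. In practice, the SMOTE class of the imbalanced-learn library implements the full algorithm.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_interpolate(X_min, k=5, n_new=100, seed=0):
    """Generate n_new synthetic samples from minority-class data X_min by
    interpolating between a sample and one of its k minority-class neighbors."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))   # pick a minority sample at random
        j = rng.choice(idx[i][1:])     # pick one of its k neighbors
        gap = rng.random()             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```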
In [2], the authors reviewed several sampling techniques for dealing with imbalanced datasets. Not only the prediction accuracy but also the learning time was used to evaluate the models on the balanced dataset. SMOTE and SMOTEBoost were presented as oversampling techniques in this research. The receiver operating characteristic (ROC) curve was used to summarize classifier performance, and the area under the curve (AUC) was used to evaluate classifier performance from the ROC curve.
The paper [3] used a wrapper approach to reduce the dimension of an imbalanced student dataset, with the true positive (TP) rate as the metric. In [4, 5], the researchers handled the unbalanced dataset with a hybrid resampling technique: the combination of SMOTE and DBSCAN was applied to 12 oversampled datasets, and the results showed the highest performance for DBSM by AUC and F-measure.
In [6], the authors used SMOTE and bootstrapping to handle unbalanced data. They applied several feature selection methods with decision tree, k-NN, and Bayes classification models, and found that SMOTE and bootstrapping increased classification accuracy.
Adaptive synthetic (ADASYN) sampling is an oversampling technique that generates minority-class instances based on the data density distribution. The technique emphasizes difficult-to-learn samples and assigns different weights to different minority instances to compensate for the skewed distribution [7]; the number of synthetic instances generated for each minority sample depends on the local density distribution.
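As a brief illustration, ADASYN can be applied with the imbalanced-learn library as sketched below; the dataset here is a synthetic stand-in, not the student data.

```python
from collections import Counter
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

# Toy three-class imbalanced dataset standing in for the student data.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
# ADASYN allocates more synthetic points to minority samples that lie in
# neighborhoods dominated by the majority class (harder-to-learn regions).
X_res, y_res = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```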
The researchers presented the ADASYN sampling technique on an imbalanced dataset and used a decision tree as the classification model. The approach showed the efficiency of ADASYN on five evaluation metrics, with ROC used as the base evaluation metric [8].
The paper [9] extended ADASYN with adaptive synthetic-nominal (ADASYN-N) and ADASYN-kNN (k-nearest neighbors) for multiclass imbalanced cancer data in Indonesia. ADASYN-N generates instances with nominal data types, while ADASYN-kNN generates instances by voting on the attributes of the nearest neighbors. The results showed that ADASYN-kNN gave the highest performance.
In [10], the ADASYN and Borderline-SMOTE algorithms were combined into the BASMOTE algorithm and applied to stock market data; the results showed that BASMOTE outperformed the traditional oversampling methods. The research in [11, 12] dealt with imbalanced data in customer churn prediction, using ADASYN to oversample minority-class instances and backpropagation to classify churned customers. The results showed that ADASYN increased F1-score performance at a correlation threshold of 0.01.
SVM is widely used in regression analysis and classification. The approach constructs a hyperplane that separates two or more example sets, performs well on nonlinear classification, and finds the maximum-margin hyperplane between different classes [10]. The SVMSMOTE technique uses a support vector machine to synthesize new minority instances near the border between the minority and majority classes [13].
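A minimal usage sketch with the imbalanced-learn SVMSMOTE class follows; the data and parameters are illustrative.

```python
from imblearn.over_sampling import SVMSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
# An internal SVM locates the support vectors of the minority class, and new
# points are synthesized along the decision border they define.
X_res, y_res = SVMSMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
```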
The research in [14] applied SMOTE-CSVM and OS-SVM as oversampling techniques to three open human activity datasets, using a soft-margin SVM as the classifier. The results showed that SMOTE-CSVM and OS-SVM improved the prediction accuracy on the imbalanced human activity datasets.
In [15], the researchers applied resampling SVM to ten imbalanced datasets from the UCI machine learning repository, evaluating with measures such as F-measure and AUC. The experiments showed high performance of the SVM algorithms on imbalanced data.
Borderline-SMOTE uses the borderline of the minority-class examples to generate oversampled instances. First, the borderline minority examples are identified; then, synthetic instances are generated from the borderline data and added to the original training dataset [10, 16].
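The corresponding imbalanced-learn class can be used as sketched below (parameters illustrative).

```python
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
# Only minority samples whose neighborhoods are dominated by the majority
# class (the "in danger" borderline points) are used to seed new instances.
sampler = BorderlineSMOTE(kind="borderline-1", k_neighbors=5, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
```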
The paper [17] compared several oversampling algorithms (SMOTE, Borderline-SMOTE, safe-level SMOTE, and ADASYN) on UCI datasets, using the F-measure to evaluate the experiments with various classifiers such as nearest neighbor, Naïve Bayes, and SVM. The results showed that safe-level SMOTE outperformed the other algorithms.
Credit data were used to predict credit risk with a K-XGBoost model and Borderline-SMOTE. The researchers applied SMOTE, safe-level SMOTE, and Borderline-SMOTE to three-class unbalanced credit data, then classified credit risk with the XGBoost model. The experimental results indicated that Borderline-SMOTE was the highest-performing algorithm [18].
The survey in [19] explored various oversampling techniques (SMOTE, ADASYN, SPO, INOS, DataBoost, and Borderline-SMOTE) for improving classification accuracy on imbalanced data, and showed that INOS outperformed the other oversampling methods on imbalanced time series data.

Classification method
a) Multi-layer perceptron (MLP) neural network
MLP is a classifier arranged in layers of connected computing units (neurons). In brief, the first layer is the input layer, which passes weighted information to the hidden layers for computation; the information in each layer is then passed through and forwarded to the output layer.
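A minimal scikit-learn sketch of such a classifier follows; the paper does not report its network settings, so the architecture below is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# One hidden layer of 64 units: inputs are weighted, passed through the
# hidden layer, and forwarded to the output layer as described above.
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
print(mlp.fit(X_tr, y_tr).score(X_te, y_te))
```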
DoS flooding attack detection is an interesting issue. The research in [20] compared an SVM classifier with other methods such as the MLP neural network, k-NN, Naïve Bayes, decision tree, random forest, and logistic regression for a detection system. The comparison showed that the MLP neural network was the classifier with the highest accuracy.
In [21], an MLP without a hidden layer was used as the classifier on a glass dataset and a pregnancy dataset, combined with a mutual information augmentation component. The results showed that the proposed method improved generalization performance.
b) AdaBoost
Adaptive boosting (AdaBoost) is a sequential ensemble method that combines multiple weak learners into a strong learner, increasing the final prediction performance. The main idea is reweighting, which focuses on misclassified data points: in each round, the weights of misclassified points are increased and the weights of correctly classified points are decreased, and the reweighted data are used to train a new weak learner. The rounds proceed sequentially, with the weights adjusted every round.
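The round-by-round nature of the reweighting can be observed with scikit-learn's AdaBoostClassifier, as in this illustrative sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
# staged_score reports accuracy after each reweighting round, showing the
# sequential improvement of the ensemble described above.
for rnd, acc in enumerate(ada.staged_score(X_te, y_te), start=1):
    if rnd % 10 == 0:
        print(f"round {rnd}: accuracy {acc:.3f}")
```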
The research in [22] proposed BSO-AdaBoost-kNN to deal with imbalanced-class classification, using AUCarea as the evaluation metric. Applied to oil-bearing reservoir recognition, the proposed method achieved a precision of 99%.
In [23], resampling methods on imbalanced datasets were used to improve classifier performance. The researchers proposed several ensemble methods with classifiers, and the results showed that their proposed methods performed well. AdaBoost was also used as the classifier to detect malicious URLs in [24], where it gave higher accuracy than the other algorithms.
c) Gradient boosting
Gradient boosting trains a sequence of decision trees in which each tree learns from the errors of the previous trees, resulting in greater prediction accuracy. The trees continue learning until sufficient depth is reached, and the model stops learning when no error patterns remain from the previous trees [25].
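An illustrative scikit-learn sketch follows; the depth and learning rate are not the paper's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# Each new tree is fitted to the residual errors of the trees before it.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=3, random_state=0)
print(gb.fit(X_tr, y_tr).score(X_te, y_te))
```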
Ensemble methods were applied to bank direct marketing in [26], one of them being gradient boosting; the researchers used the ROC curve as the evaluation metric for neural network, logistic regression, and gradient boosting classifiers.
d) Random forest
The principle of the random forest is to build multiple decision tree models, each trained on a different subset of the full dataset. When making a prediction, every decision tree casts its own prediction, and the final output is the class with the most votes among the trees (or the mean of the tree outputs, for regression).
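A short illustrative sketch with scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# Each of the 100 trees is trained on a bootstrap subset of the data; the
# forest's prediction aggregates the votes of the individual trees.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te), len(rf.estimators_))
```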
In [27], k-NN, C4.5, random forest, Naïve Bayes, ANN, and AdaBoost were used for customer churn prediction, with AUC as the evaluation metric; random forest was the best classifier in the experiment. The research in [28] used random forest with feature selection methods, and the results showed that random forest gave better accuracy than the other techniques.
Performance metric
For imbalanced data, classification accuracy is not a good indicator of classification performance. We therefore used several performance measures, including the confusion matrix, precision, recall, F1-measure, and area under the curve (AUC).
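These measures can be computed with scikit-learn as in the following sketch (data and classifier are illustrative stand-ins).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_classes=3, n_informative=4,
                           weights=[0.02, 0.85, 0.13], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
print(confusion_matrix(y_te, y_pred))                  # rows: true, cols: predicted
print(classification_report(y_te, y_pred, digits=3))   # per-class precision/recall/F1
# One-vs-rest AUC for the three classes from predicted probabilities.
print(roc_auc_score(y_te, clf.predict_proba(X_te), multi_class="ovr"))
```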

RESEARCH METHOD
In this section, student performance classification with oversampling of imbalanced data is presented. The procedure consists of: i) data gathering; ii) data pre-processing; iii) oversampling the imbalanced data; and iv) student performance classification.

Data gathering
With permission from the Faculty of Engineering of Rajamangala University of Technology, Thailand, our research collected 463,956 first-year student data records. The data were then grouped into 6,882 records by student.

Data cleaning
The example set was cleaned by detecting outliers and removing noisy data. Special characters such as $ or # and missing values were removed from the grade data. For retaken subjects, only the best grade in the same subject was retained.
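A pandas sketch of this cleaning step follows; the paper does not give the data schema, so the file and column names ("student_grades.csv", "student_id", "subject", "grade") are hypothetical.

```python
import pandas as pd

df = pd.read_csv("student_grades.csv")  # hypothetical input file
# Strip special characters such as $ or #, coerce non-numeric grades to NaN,
# and drop records with missing grades.
df["grade"] = pd.to_numeric(
    df["grade"].astype(str).str.replace(r"[$#]", "", regex=True),
    errors="coerce")
df = df.dropna(subset=["grade"])
# For retaken subjects, keep only the best grade per student and subject.
df = (df.sort_values("grade", ascending=False)
        .drop_duplicates(subset=["student_id", "subject"]))
```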

Data discretization
The example set was separated into three classes (High, Medium, Low) by cumulative GPA: the High range was 3.00-4.00 (860 instances), the Medium range was 2.00-2.99 (5,908 instances), and the Low range was 0.00-1.99 (114 instances). The imbalanced distribution of student performance is shown in Figure 1.
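This discretization can be expressed with pandas as below; the column names and toy values are illustrative.

```python
import pandas as pd

bins = [0.00, 2.00, 3.00, 4.01]   # [0, 2) Low, [2, 3) Medium, [3, 4.01) High
labels = ["Low", "Medium", "High"]
df = pd.DataFrame({"gpa": [1.75, 2.50, 3.20, 4.00]})  # toy GPA values
df["performance"] = pd.cut(df["gpa"], bins=bins, labels=labels, right=False)
print(df)
```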

Oversampling imbalanced dataset
We generated synthetic minority-class data with the SMOTE method. The instance distribution of the oversampled data is shown in Figure 2.

Student performance classifier
We used four classification techniques, consisting of the neural network (MLP), AdaBoost, gradient boosting, and random forest, to classify student performance from the oversampled imbalanced datasets.
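The overall experimental grid (four oversamplers crossed with four classifiers) can be sketched as follows; this is an illustrative reconstruction using synthetic data with the paper's class proportions (114/5,908/860 of 6,882), not the authors' code or settings.

```python
from collections import Counter
from imblearn.over_sampling import ADASYN, BorderlineSMOTE, SMOTE, SVMSMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=6882, n_classes=3, n_informative=4,
                           weights=[0.017, 0.858, 0.125], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {"SMOTE": SMOTE(random_state=0),
            "Borderline-SMOTE": BorderlineSMOTE(random_state=0),
            "SVMSMOTE": SVMSMOTE(random_state=0),
            "ADASYN": ADASYN(random_state=0)}
models = {"MLP": MLPClassifier(max_iter=500, random_state=0),
          "AdaBoost": AdaBoostClassifier(random_state=0),
          "Gradient boosting": GradientBoostingClassifier(random_state=0),
          "Random forest": RandomForestClassifier(random_state=0)}

for s_name, sampler in samplers.items():
    # Oversample the training split only; the test split keeps the original
    # imbalance so the evaluation stays honest.
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    for m_name, model in models.items():
        pred = model.fit(X_res, y_res).predict(X_te)
        print(s_name, m_name, f1_score(y_te, pred, average=None))  # per-class F1
```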

RESULTS AND DISCUSSION
In the minority-class oversampling step, we use the four methods to generate new instances in the low and high classes of student performance. Table 1 shows the amounts of original and oversampled data. In the classification step, we apply the four classification techniques to the four oversampled datasets. Figure 3 demonstrates an example confusion matrix of the MLP classifier with the SMOTE technique. Table 2 shows the performance improvement of minority-class prediction: SVMSMOTE is the best method for the minority classes (low and high) in recall, F1, and AUC. In Table 3, gradient boosting provides high precision, recall, F1, and AUC, with nearly the same values across the four oversampling methods. In Table 4, Borderline-SMOTE and SVMSMOTE are suited to the low class, SMOTE is best for the high class, and overall the best method with AdaBoost is SVMSMOTE. Table 5 presents the performance of the oversampling methods with random forest: for the low minority class, Borderline-SMOTE performs best, and for the high minority class, SVMSMOTE performs best. Overall, the experimental results across the four classifiers show that Borderline-SMOTE achieves the highest minority-class performance. In Table 3, because of the classifier itself, the choice of oversampling method is not significant for minority-class classification.

CONCLUSION
Educational data mining problems such as student dropout, grade prediction, and student performance prediction suffer from the imbalanced-class data problem. Researchers develop algorithms at both the algorithm level and the data level to improve classification performance. This research emphasizes oversampling methods to address the imbalanced-class problem in educational data mining, as well as the overfitting problem.
From the experimental results, the best-performing oversampling method is Borderline-SMOTE. Students and instructors can use the classification model with oversampling to improve student performance. In future work, we will use more significant features and more datasets, and we will extend the oversampling or under-sampling algorithms to further improve student performance classification.