An efficient stacking based NSGA-II approach for predicting type 2 diabetes

Diabetes is a well-known risk factor for renal and cardiovascular disorders, including stroke, and causes considerable morbidity in society. Reducing the prevalence of the disease would provide substantial benefits to the community and lessen the burden on the public health-care system. Numerous data mining approaches have been used to detect the disease; today, incorporating machine learning is conducive to building faster, more accurate, and more reliable models. Several methods based on ensemble classifiers have been used by researchers to predict diabetes. The proposed framework for predicting diabetes mellitus employs a stacking-based ensemble combined with the non-dominated sorting genetic algorithm (NSGA-II). The primary objective of the work is to develop a more accurate prediction model that reduces the lead time, i.e., the time between the onset of diabetes and clinical diagnosis. The proposed NSGA-II stacking approach has been compared with boosting, bagging, random forest, and the random subspace method; the performance of the stacking approach eclipsed these conventional ensemble methods. It was also noted that k-nearest neighbors (KNN) gives a better performance than a decision tree as the stacking combiner.


INTRODUCTION
Diabetes mellitus comprises a group of metabolic diseases predominantly characterized by high blood glucose levels. Type 2 diabetes mellitus (T2DM) is among the most common chronic disorders of modern life, often complicating pre-existing diseases. It is caused by varying degrees of insulin resistance, decreased insulin secretion, and increased hepatic glucose production. The vast majority of diabetes cases in the world are type 2 DM. A diagnosis of DM has significant ramifications for an individual's health as well as financial status.
Early diagnosis of type 2 diabetes is still a formidable task for the medical service sector. Newer, innovative models are required for the timely detection of diabetes; this in turn will reduce the dreaded complications as well as the burden on the health-care system. In today's high-tech world, techniques like machine learning have a boundless role to play in predicting the disease. Data on 1,133 patients admitted to local hospitals, obtained with their due consent, were gathered and collated into the second dataset. The collated data has a significant advantage over Pima (the benchmark data) in that it removes the selection bias present in the latter: Pima contains only tuples of females more than twenty-one years of age, and it has only two labels, diabetic and healthy. In the collated data a pre-diabetic class has also been introduced, since lifestyle modification in pre-diabetics can curb the progression of the disorder. A total of thirty-three attributes were gathered from every individual: gender, weight, height, age, waist circumference, body mass index (BMI), systolic and diastolic blood pressure (BP), hemoglobin A1c (HbA1c) level, high-density lipoprotein (HDL) and low-density lipoprotein (LDL) cholesterol, very-low-density lipoprotein (VLDL), serum creatinine, triglyceride level, fasting blood glucose (FPG), post-prandial glucose (PPG), family history, medication for high BP, physical activity/exercise (minimum thirty minutes daily), daily vegetable and fruit consumption, excess appetite, smoking status, drinking, excess thirst, frequent urination, increased fatigue, itchy skin, frequent infection, depression and stress, poor wound healing, and hazy vision.

PROPOSED METHOD
The proposed methods for important-feature selection and the prediction model are discussed below. The flow diagram of the suggested NSGA-II stacking model for T2DM prediction is shown in Figure 1. NSGA-II is an evolutionary algorithm developed by Deb et al. [15] that can obtain multiple Pareto-optimal solutions in a single run.

Feature selection
Binary chromosomes were employed to represent the features in this method (1 = feature selected, 0 = feature not selected). NSGA-II is utilized to pick the important features and rule out the irrelevant ones. It is a multi-objective optimization technique that can be set up to minimize the number of features while maximizing accuracy. The attribute-selection process is depicted in Figure 1. The algorithm was found to be highly effective in the literature review. The technique emphasizes non-dominated sorting; notably, the elitism present in the algorithm does not permit an already converged Pareto-optimal solution to be removed. The population is initialized randomly and sorted on the basis of non-domination. An offspring population is created by applying binary tournament selection, recombination, and mutation to the parent population. A joint population is then formed from the parents and offspring and is again sorted by non-domination. To preserve elitism (ensuring that the best solutions obtained so far are retained), a new population is created by including the solutions of the first front and then of subsequent fronts until the population limit is reached. Every individual is assigned a rank according to the front to which it belongs: individuals in the first front receive fitness rank 1, those in the second front rank 2, and so on to the last front.
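The non-dominated sorting step described above can be sketched as follows. This is a minimal plain-Python illustration (assuming two minimized objectives, such as classification error and feature count), not the authors' MATLAB implementation:

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives are minimized here)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Fast non-dominated sorting: returns a list of fronts, each a list
    of indices into `objs`; front 0 holds the rank-1 individuals."""
    n = len(objs)
    dominated_by = [[] for _ in range(n)]   # solutions that i dominates
    dom_count = [0] * n                     # how many solutions dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(objs[i], objs[j]):
                dominated_by[i].append(j)
            elif dominates(objs[j], objs[i]):
                dom_count[i] += 1
        if dom_count[i] == 0:
            fronts[0].append(i)             # non-dominated: first front
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominated_by[i]:
                dom_count[j] -= 1           # remove the current front
                if dom_count[j] == 0:
                    nxt.append(j)           # j joins the next front
        fronts.append(nxt)
        k += 1
    return fronts[:-1]                      # drop the trailing empty front
```

For example, with objective pairs (error, feature count), the mutually non-dominated solutions land in front 0 and each dominated solution is pushed into a later front.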
Diversity is preserved with the help of crowding-distance comparison. The crowding distance d(j) of every individual is computed using (1):

d(j) = d(j) + [f_i(j+1) - f_i(j-1)] / (f_i^max - f_i^min)   (1)

where f_i^min is the minimum value of the ith objective function, f_i^max is the maximum value of the ith objective function (used for normalization), i indexes the objective functions, and f_i(j+1) and f_i(j-1) are the ith-objective values of the two neighbours of individual j in the same front. d(j) is initialized to 0. The crowding distance measures the closeness of one individual to the others in the population; if the average crowding distance is large, the population diversity is better. To compute the crowding distance of an individual, the contributions with respect to its neighbouring individuals in the same front are accumulated over all objective functions (dimensions) using (1). Binary tournament selection then chooses parents from the population on the basis of their crowding distance and rank. For the choice of non-dominated solutions, the partial order in (2) is adopted:

i is preferred over j if rank(i) < rank(j), or if rank(i) = rank(j) and d(i) > d(j)   (2)

That is, under rank-based selection the individual with the smaller rank is selected, and when two solutions belong to the same Pareto front, the one with the larger crowding distance is selected. In the crossover and mutation processes, offspring are produced from the selected population. The combined current population and offspring are again sorted by non-domination, and only the P best individuals are retained, where P is the size of the population. Selection is thus carried out on the basis of rank and crowding distance. The pseudocode of the NSGA-II feature-selection method is as follows:
1.-4. For each selection round over the parent population P: select an individual X and perform mutation to obtain the offspring population Q; end for.
5. Combine offspring Q and parents P.
6. Use Pareto-dominance sorting to rank the individuals.
7. Compute the crowding distance of the individuals in every front using (1).
8. Select the best N individuals, based on the calculated ranks from (2) and maximum crowding distance, to form the next generation.
9. Extract the best Pareto fronts to present the results.
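The crowding-distance computation of (1) and the crowded-comparison order of (2) can be sketched as below; this is a minimal plain-Python illustration of the standard NSGA-II operators, not the authors' implementation:

```python
import math

def crowding_distance(front_objs):
    """Crowding distance as in (1) for one front: for every objective,
    boundary solutions get infinity; interior solutions accumulate the
    normalized gap between their two neighbours along that objective."""
    n = len(front_objs)
    dist = [0.0] * n                         # d(j) initialized to 0
    n_obj = len(front_objs[0])
    for m in range(n_obj):
        order = sorted(range(n), key=lambda j: front_objs[j][m])
        f_min = front_objs[order[0]][m]
        f_max = front_objs[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = math.inf   # boundary solutions
        if f_max == f_min:
            continue                         # degenerate objective: skip
        for k in range(1, n - 1):
            j = order[k]
            dist[j] += (front_objs[order[k + 1]][m]
                        - front_objs[order[k - 1]][m]) / (f_max - f_min)
    return dist

def crowded_better(rank_i, dist_i, rank_j, dist_j):
    """Partial order of (2): the lower rank wins; within the same front,
    the solution with the larger crowding distance wins."""
    return rank_i < rank_j or (rank_i == rank_j and dist_i > dist_j)
```

In binary tournament selection, `crowded_better` decides which of two randomly drawn individuals becomes a parent.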

Prediction model
The ensemble stacking approach (SA) is a technique that combines different prediction approaches into a single framework, operating in levels or layers. The method builds on the meta-learning concept and intends to reduce generalization error by decreasing the bias of the generalizers. Initially, the training data are used to train the base-learner models. In the next stage, a meta-algorithm, or combiner, is trained to make the final prediction from the outputs of the base-learner models. Such stacked ensembles aim to perform better than any individual base learner. The process is as follows: Step 1: Learn the first-level classifiers on the original training dataset.
Step 2: Create a new dataset from the outputs of the base learners. The predictions of the first-level learners become the new features, and the actual outputs are kept as the class labels of the new dataset.
Step 3: Train a second-level learner on the newly created dataset; the training algorithm for the second-level learner is finalized on the basis of accuracy. The flow diagram of the suggested T2DM prediction model is depicted in Figure 2. The dataset is read and first pre-processed to fill in missing values. Once the data are pre-processed, the NSGA-II parameters are initialized, and NSGA-II produces the Pareto-optimal solution set (the significant features). The dataset of selected features is split into a training set (60%), a validation set (20%), and a testing set (20%). The training set is used to construct the base learners. Diverse base learners, namely linear SVM, radial basis function (RBF) SVM, Gaussian RBF kernel, KNN-1, KNN-3, KNN-5, and decision tree, are harnessed to increase the performance. The predictions on the validation set, together with the actual labels, form the level-1 data, which is subsequently used to train the meta learner. Four types of meta learner, namely bagging, decision tree, linear SVM, and KNN, were tried, and the base learners were combined with each meta learner to identify the best one from the results obtained. It was noted that KNN performed relatively better and showed encouraging accuracy, so KNN was fixed as the stacking aggregator. KNN classifies records based on the closest training samples in the feature space, with majority voting among the k nearest neighbors deciding the class. After exhaustive trials, k was set to 3. Finally, the testing set is fed to the trained meta classifier (KNN) to produce the final predictions.
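The three stacking steps can be sketched as follows. This is a minimal plain-Python stand-in: the base learners here are k-NN classifiers for several k (the paper also uses SVMs and a decision tree), the meta learner is 3-NN as in the paper, and the toy data stands in for the 60/20/20 split of the real dataset:

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Plain k-NN: majority vote among the k nearest training samples."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: sum((a - b) ** 2
                                       for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

def fit_stacking(train, val, base_ks=(1, 3, 5), meta_k=3):
    """Step 1: train base learners (k-NN for several k) on the training set.
    Step 2: their predictions on the validation set become level-1 features.
    Step 3: the meta learner (3-NN) is trained on that level-1 dataset."""
    tX, ty = train
    vX, vy = val
    level1_X = [[knn_predict(tX, ty, x, k) for k in base_ks] for x in vX]
    def predict(x):
        feats = [knn_predict(tX, ty, x, k) for k in base_ks]  # base outputs
        return knn_predict(level1_X, vy, feats, meta_k)       # meta vote
    return predict

# Toy two-cluster data playing the role of training and validation sets.
tX = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
ty = [0, 0, 0, 1, 1, 1]
vX = [(1, 1), (6, 6), (0.5, 0), (5.5, 5.5)]
vy = [0, 1, 0, 1]
model = fit_stacking((tX, ty), (vX, vy))
```

Unseen points near the first cluster are classified 0 and points near the second cluster 1, with the meta learner arbitrating among the base learners' votes.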

EXPERIMENTAL DETAILS
In this study, 25 runs of the developed technique (the NSGA-II stacking approach) were performed using MATLAB; the figure of 25 runs was adopted from the previous experimental work of the reference paper [9]. An extensive analysis was carried out, comparing the accuracy, specificity, sensitivity, and error rate of the proposed model with those of other approaches. The generalization performance of the NSGA-II stacking-based predictive model of T2DM is included in this section. Figures 3 and 4 show the trade-off among the Pareto-optimal solutions between error value and number of selected features for the Pima dataset and the collected dataset, respectively. The graphs are drawn for a single run, with the Y-axis denoting the error value and the X-axis the number of features selected; each solution corresponds to a feature count obtained by maintaining a balanced trade-off between error value and number of features. The error was minimal when 4 of the 8 features were selected in the Pima dataset, while for the collected dataset the classification error was minimal when 20 of the 33 features were chosen. In the experiment performed, the proposed NSGA-II stacking approach was compared with boosting, bagging, random forest, and the random subspace method; the stacking approach eclipsed these conventional ensemble methods, and KNN again gave a better performance than the decision tree as the stacking combiner. Nine performance measures were computed to analyze the classification performance of the NSGA-II stacking model: classification accuracy, precision, error, specificity, sensitivity, false positive rate, F1-score, Matthews correlation coefficient, and kappa coefficient. These performance indicators are computed as listed in Table 1, where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively.
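Assuming Table 1 uses the conventional definitions of these nine measures (the table itself is not reproduced in this text), they can be computed from the confusion counts as follows:

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """The Table-1-style measures from confusion counts TP, FP, TN, FN,
    using the standard textbook definitions."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    error = 1 - accuracy
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)              # recall / true positive rate
    specificity = tn / (tn + fp)
    fpr = fp / (fp + tn)                      # false positive rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Cohen's kappa: agreement corrected for chance agreement p_e
    p_e = (((tp + fp) / total) * ((tp + fn) / total)
           + ((tn + fn) / total) * ((tn + fp) / total))
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "error": error, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity,
            "fpr": fpr, "f1": f1, "mcc": mcc, "kappa": kappa}
```

For instance, TP = 40, FP = 10, TN = 45, FN = 5 gives an accuracy of 0.85 and a precision of 0.8.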
The average performance of the NSGA-II stacking model is presented in Table 2, and its performance on the Pima and collected data is depicted in Figure 5. The results of the suggested method are compared with results obtained by other researchers over the last five years. Since Pima is the benchmark dataset used for comparison by other researchers, the proposed model was validated on both the collected dataset and the Pima dataset. Selected approaches from the current literature are compared with the suggested model, on the basis of prediction accuracy, in Table 3.
The comparison of the predictive accuracy of the proposed work with existing methods over Pima is shown in Table 3. Although the accuracy obtained over the Pima dataset was 81.9%, the predictive model performed better on the collected dataset, where the accuracy obtained was 88.18%. Over the twenty-five runs of the experimental study the chosen attributes varied, but the attributes most frequently selected were weight, waist circumference, body mass index, HbA1c level, HDL and LDL cholesterol, VLDL, serum creatinine, triglyceride level, fasting blood glucose (FPG), post-prandial glucose (PPG), family history, excess hunger, excess thirst, frequent urination, infection, and poor wound healing.

CONCLUSION
NSGA-II was utilized for feature selection, and a stacking approach was implemented for the prediction of T2DM. Stacking exploits the capabilities of a number of high-performing classification models and produces better results than any single model in the ensemble. Extensive experimentation was carried out on the collected and benchmark datasets to verify the efficacy of the developed models in detecting the prediabetic stage. When compared with existing models, the accuracy obtained increased markedly. The developed model can detect the illness early (at the prediabetic stage), so that physician and patient can work towards prevention and mitigation of the complications caused by T2DM. The proposed approach can also be beneficial in myriad fields such as text mining, bioinformatics, image processing, and fault diagnosis, and is of significance to the medical and data-mining communities. The models' accuracy over Pima could be further improved by implementing outlier-removal techniques.