A comprehensive study on disease risk predictions in machine learning

Received Oct 4, 2019; Revised Feb 22, 2020; Accepted Feb 29, 2020

Over recent years, multiple disease risk prediction models have been developed. These models use various patient characteristics to estimate the probability of outcomes over a certain period of time and hold the potential to improve decision making and individualize care. Discovering hidden patterns and interactions in medical databases, together with growing evaluation of disease prediction models, has become crucial. Traditional clinical findings require many trials, which can complicate disease prediction. This paper presents a comprehensive study of the different strategies used to predict disease. Applying these techniques to healthcare data has improved risk prediction models for identifying patients who would benefit from disease management programs that reduce hospital readmission and healthcare cost, but the results of these endeavors have been mixed.


INTRODUCTION
Machine learning is one of the most prevalent methods in computer engineering and is commonly used in natural language processing, image processing, pattern recognition, cyber security, and many other areas. One of the most active research areas for machine learning is the healthcare industry [1]. As healthcare organizations collect patient records, estimates show that there are about 1 trillion bytes of information, and the volume is increasing every day. These data need to be properly mined to obtain valuable information [2].
Patients sometimes fail to describe their medical problems correctly, and laboratory results can carry some degree of error. Specialists find it difficult to make decisions about illnesses because they may not have expertise in all areas. To address this issue, it is necessary to develop a disease prediction system that combines medical knowledge with an integrated system to produce the best results and help society [3].
Several earlier investigations tried to use patient laboratory tests [4][5][6] and drugs [7] to predict the occurrence of disease. Such models were also used to identify unknown risk factors, often while improving the sensitivity and specificity of detection at the same time. Recent studies have been effective in predicting disease through multiple methods, including support vector machines [8][9][10], logistic regression [11], random forests [3,12], neural networks [4,13], and time series modeling techniques [14].
In summary, the paper focuses on various disease prediction models of machine learning. The paper is structured in the following way. Section 2 explains the basic concept of machine learning, prediction models and types of prediction models. Section 3 explains the survey of disease prediction models. Section 4 explains the comparative study of heart diseases. Section 5 summarizes a comparative study of breast cancer. Section 6 gives the conclusion and future enhancement.

INTRODUCTION TO MACHINE LEARNING
Machine learning (ML) is a branch of artificial intelligence (AI) that analyzes the structure of data and fits the data to models. It is a field of computer science that differs from traditional computing in that computers are trained on data inputs and use statistical analysis to produce outputs. For this reason, ML is used in automated decision-making applications such as facial recognition, recommendation engines, optical character recognition (OCR) and self-driving cars.
Machine learning methods fall into three categories according to the training process used: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, data samples with category labels are used in training. Classification and regression models are examples of supervised learning, and the algorithms used in this approach include decision trees and naïve Bayes. In unsupervised learning, the data samples are used in training directly, without category labels. Clustering techniques and encoders are basic examples of the unsupervised approach. Reinforcement learning is sometimes described as lying between the two approaches: an agent learns, from feedback on its actions, which action achieves the overall goal of the application [15].

PREDICTION MODELS
A prediction model is a model that provides a way to estimate an individual patient's risk of a disease outcome. As such models multiply, the question arises of when, which and how to use them. These models can be retrained over time to respond to new data or insights, as the needs of the organization require.
Two types of prediction models exist: classification models, which predict class outcomes, and regression models, which predict the relationship between a response variable Y and a predictor variable X. Various basic and advanced algorithms, listed in Figure 1 and Figure 2, perform data analysis and statistical analysis and identify trends and patterns in the data. While machine learning and predictive analytics can offer opportunities for any organization, deploying them haphazardly, without considering how they fit into everyday operations, will drastically impede their ability to provide the insight the organization needs. To make the most of predictive analytics and machine learning, organizations need to ensure that they have the architecture in place to support these solutions, as well as high-quality data to feed them and help them learn [16].
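The two model types can be illustrated with a minimal sketch. It assumes a single predictor X and fits the regression by ordinary least squares; the toy data (age against systolic blood pressure), the function names and the threshold are hypothetical and are not taken from any study cited here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def classify(score, threshold):
    """Simplest possible classification model: threshold a numeric score."""
    return 1 if score >= threshold else 0


# Hypothetical toy data: X = age in years, Y = systolic blood pressure.
ages = [30, 40, 50, 60]
pressures = [110, 120, 130, 140]

intercept, slope = fit_line(ages, pressures)   # regression: relate Y to X
predicted = intercept + slope * 55             # predicted Y for a new patient
label = classify(predicted, threshold=130)     # classification: class outcome
```

The regression returns a continuous estimate, while the classifier maps a score to a discrete class; real clinical models would use many predictors and a validated threshold.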

EXISTING TECHNIQUES FOR DISEASE PREDICTION
Mingyu Pak and Miyoung Shin [17] explored a few environmental variables linked to type-2 diabetes and selected some of them to develop an analytical disease prediction model. To choose the important variables, they first pre-processed all external environmental factors into numerical data and then estimated the maximum/minimum probability ratios of all the sorted exogenous variables. The top-n ranked variables were then chosen as input variables for the prediction model. Using Ansan/Ansung cohort 2 data collected from the Korean National Institute of Health (KNIH), the disease risk factor prediction model was built with an SVM. Their prediction model reached 65.97 percent precision, and the model using the selected environmental variables performed very similarly to a model using only genetic factors [17]. The research in [18] focuses on disease forecasting from medical information supplied by New York's Presbyterian Hospital. Because these are clinical records, automated prediction from them is a different and generally simpler problem than prediction from free-text user input. Nicolae Dragu et al. discussed forecasting serious contagious diseases from web sources, another specific setting in which open clinical terms are used [19]. Many attempts have been made to forecast selected diseases [20,21]. For instance, the authors in [20] deal with predicting coronary heart disease by content mining. A considerable amount of research has also been carried out on healthcare discussions. Lav petrov used NLP to evaluate and break down user remarks in order to predict disease and extract uncommon responses to drugs [22].
Ryan McDonald et al. use a terminology-laden interface (i.e. users need to explore a long list of hints). This is an awkward job from the user's point of view, and the operation is tedious. Also, if users do not find a certain symptom, they are forced to omit it, which is not at all desirable [23]. Other frameworks accept user-directed text input [24,25]. In every case they rely on a minimal symptom-disease relationship system [26,27] and use a full-text database. These frameworks search each record in the database for an exact term match with the user's input. For example, if a user enters a symptom whose equivalent is not specified in the database, the input cannot be matched correctly. If the user input contains more semi-technical terms than anticipated, performance degrades. Such systems are especially rigid and restricted to particular data types.
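The limitation of exact term matching can be seen in a small sketch. The symptom table and function name below are hypothetical and purely illustrative; they do not reproduce any of the cited frameworks.

```python
# Hypothetical symptom-to-disease table; entries are illustrative only.
SYMPTOM_DB = {
    "fever": {"influenza", "malaria"},
    "cough": {"influenza", "bronchitis"},
    "chest pain": {"angina"},
}


def match_exact(user_terms):
    """Return candidate diseases and the input terms that had no exact match."""
    candidates = set()
    unmatched = []
    for term in user_terms:
        key = term.strip().lower()
        if key in SYMPTOM_DB:
            candidates |= SYMPTOM_DB[key]
        else:
            # A paraphrase of a known symptom is lost entirely.
            unmatched.append(term)
    return candidates, unmatched


candidates, unmatched = match_exact(["Fever", "pain in the chest"])
```

Here "pain in the chest" is dropped although "chest pain" is in the table, so angina never becomes a candidate; this is the failure mode described above.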
Xiaoyan Wang et al. proposed an automated system for disease prediction based on user-driven input. Their framework requires input from the user, such as the names of symptoms and certain significant parameters, and provides a list of likely illnesses (the top-ranked illnesses being the most likely). In evaluation, the automated disease prediction system (ADPS) achieved on average 14.35 percent higher accuracy than the existing approach [28]. S Manimekala et al. suggested an automatic disease prediction technique that determines the most probable disease from the patient's input, which facilitates early detection. The model uses the apriori-frequent pattern (A-FP) data mining algorithm to identify illnesses by mining health data based on the input symptoms. The method mines medical datasets to create association rules. The goal is to identify relevant and frequent illnesses in the health dataset [29].
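Association rule mining of this kind rests on two quantities, support and confidence. The following is a minimal Apriori-style sketch in plain Python, not the A-FP algorithm of [29]; the transactions and function names are hypothetical, and the candidate generation simply joins frequent itemsets of the previous level.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, keeping frequent ones."""
    items = {item for t in transactions for item in t}
    current = [frozenset([item]) for item in items]
    frequent = {}
    while current:
        size = len(current[0]) + 1
        kept = [s for s in current if support(s, transactions) >= min_support]
        for s in kept:
            frequent[s] = support(s, transactions)
        # Join step: unions of two frequent sets that are exactly one item larger.
        current = list({a | b for a, b in combinations(kept, 2) if len(a | b) == size})
    return frequent


# Hypothetical patient records, each a set of observed symptoms or diagnoses.
records = [
    frozenset({"chest pain", "shortness of breath", "heart disease"}),
    frozenset({"chest pain", "heart disease"}),
    frozenset({"chest pain", "shortness of breath"}),
    frozenset({"fever", "cough"}),
]

frequent = frequent_itemsets(records, min_support=0.5)
rule_conf = confidence(frozenset({"chest pain"}), frozenset({"heart disease"}), records)
```

A rule such as "chest pain implies heart disease" is kept only if both its support and its confidence exceed user-chosen thresholds.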
In the overall context of computing frameworks on multicore processor systems, Cristinel Ababei et al. discussed the most common prediction and classification methods. They introduced prediction systems ranging from simple to more complicated, while highlighting one common theme: each of these systems exploits the prior history of the variable of interest [30]. Ankita Dewan et al. suggested an efficient genetic algorithm combined with the back-propagation method for predicting heart disease and created a model that could identify and extract unknown knowledge (patterns and relationships) linked to coronary heart disease from archived heart disease records [31]. Table 1 shows the different methods and their advantages and disadvantages. The merits and demerits are specified in terms of efficiency, prediction accuracy and how easily a model could be implemented.
P. K. Anooj [32] suggested a decision support system based on fuzzy rules to assess the level of risk of heart disease. In his method, he first pre-processed the data for missing values. He then generated fuzzy sets and built a fuzzy decision-making system. The technique selects the required attributes based on the number of occurrences in the database to create weighted fuzzy rules. These weighted fuzzy rules are then used to build the decision support system. He tested the proposed system on three distinct kinds of dataset collected from the V.A. Medical Centre Cleveland data, with 303 cases split into 202 training records and 101 test records. P. K. Anooj compared his model with models based on neural networks and obtained higher precision [32].
Tsipouras M.G. et al. [33] suggested an automated system based on fuzzy modeling and data mining to predict heart disease. Their model includes the following steps: inducing a decision tree from the data, extracting rules, formulating a crisp model, transforming the crisp model into a fuzzy model, and optimizing it. The data were gathered from the Invasive Cardiology Department of the University Hospital of Ioannina. The technique provides 80% sensitivity and 65% specificity [33].
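Sensitivity and specificity, two of the measures used to compare models in this survey, follow directly from the confusion matrix. The sketch below assumes binary labels with 1 for diseased and 0 for healthy; the label vectors are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = diseased."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # diseased, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # diseased, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical example: 5 diseased and 5 healthy patients.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Sensitivity is the share of diseased patients the model detects, and specificity is the share of healthy patients it correctly clears; a model can score well on one and poorly on the other.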
Chaitrali S. Dangare et al. [34] used data mining methods for heart disease prediction. They first pre-processed the missing values using the mean mode technique and used a multilayer perceptron model to map the data. Naïve Bayes, neural networks and decision trees were used to analyse the heart disease database. They took 303 records from the Cleveland heart disease repository as the training set and 270 records from the Statlog heart disease repository as the test set. The dataset consists of input, key and predictable attributes. Their model reports 100% accuracy for neural networks, 99.62% for the decision tree and 90.74% for Naïve Bayes [34].
Mai Shouman et al. [35] suggested a technique for diagnosing patients with heart disease that integrates a decision tree with k-means clustering. The method applies a centroid selection technique for k-means, and the decision tree is used to determine the clusters. Thirteen distinct attributes are taken from the Cleveland Clinic Foundation heart disease dataset. Compared to the existing decision tree, the integrated model of k-means and decision tree obtained a better result of 83.9% [35].
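The clustering step of such a hybrid can be sketched with plain k-means (Lloyd's algorithm). The sketch assumes that the caller supplies the initial centroids, which is where a centroid selection technique would plug in; it does not reproduce the integration with a decision tree described in [35], and the two-dimensional points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of floats, starting from the given centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(tuple(sum(v) / len(cluster) for v in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


# Hypothetical two-feature patient data and caller-chosen initial centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
final_centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
```

Because k-means converges to a local optimum that depends on the starting centroids, the choice of initial centroids can change the clusters that a downstream classifier receives.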
S. Pal et al. [36] suggested using data mining to predict heart disease. They surveyed three distinct classifiers: CART, ID3 and decision tree. On the Cleveland Clinic Foundation dataset, the classification and regression tree (CART) reached 83.49 percent precision, considerably better than the decision tree and ID3 (Iterative Dichotomiser 3) [36]. H. A. Huijer et al. [37] designed a decision support service to detect agitation transition through a decision trust measure, a trust-based SVM and a trust-based multilevel SVM. 240 samples are gathered via body sensors. The patients complete the trait scale of the State-Trait Anxiety Inventory (T-STAI), which is used to measure anxiety in adults. The technique provides 91.4 percent precision, compared to 90.9 percent for a traditional support vector machine [37].
Latha Parthiban et al. [38] outlined an intelligent technique for heart disease prediction. The method is implemented using a coactive neuro-fuzzy inference system (CANFIS), a neural network, and a genetic algorithm. The dataset is taken from UCI and the prototype is simulated using Neuro Solution software. The mean square error of CANFIS was 0.000842 [38]. N. Deepika et al. [39] suggested a classification model for heart attack patients. They pre-processed the datasets for missing values and then applied equal-width interval binning. Numerical parameters are then transformed into categorical parameters, and frequent patterns linked to heart disease are mined with a pruning-classification rule algorithm. Their model gave an effective prediction of the particular class label [39].
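Equal-width interval binning, the discretization step described for [39], turns a numerical attribute into a categorical one by cutting its range into intervals of the same width. The sketch below is a generic illustration; the ages and the number of bins are hypothetical and are not taken from the cited study.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical patient ages, discretized into three categories.
ages = [29, 41, 54, 62, 77]
age_bins = equal_width_bins(ages, n_bins=3)
```

With a range of 29 to 77 and three bins, each interval is 16 years wide, so the ages fall into bins 0, 0, 1, 2 and 2. Equal-width bins are simple but can be unevenly populated when the attribute is skewed.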
T. Turner [40] proposed diagnosing heart disease by combining naïve Bayes and k-means clustering with different choices of centroids. The dataset was collected from the Cleveland Clinic Foundation. The precision of the combined k-means and naïve Bayes model is 84.5 percent, compared to the traditional algorithm [40]. Uzma Ansari et al. [41] used a weighted associative classifier to develop a model for predicting heart attacks. The dataset is taken from the machine learning repository of the University of California, Irvine (UCI). The model uses two class labels, one for "No heart disease" and the other for "Heart disease", rather than five class labels with one for no heart disease and four for the four kinds of heart disease. They used a confidence value of 80% and a support value of 25%. The proposed prototype achieves a precision of 81.51 percent. They conclude that the weighted associative classifier is the easiest way to extract significant patterns from heart disease data [41]. Table 2 below shows the comparative survey of the heart disease techniques discussed above.

Description of existing techniques for breast cancer prediction
A fuzzy model was created by Yassi et al. [42] to differentiate between benign and malignant breast cancer. The technique introduced chaos into hierarchical cluster-based multispecies particle swarm optimization, leading to chaotic hierarchical cluster-based multispecies particle swarm optimization (CHCMSPSE). CHCMSPSE helps to distinguish the type of breast cancer and to optimize the fuzzy rules. The model also discovers fuzzy rules very accurately. The dataset is taken from the machine learning repository of the University of California, Irvine (UCI). For global search capability, the technique utilizes 11 chaotic maps. Among those maps, the sinusoidal chaotic map achieved 99 percent precision because it suited the nature of the problem. The model achieves more than 90% precision [42].
Lundin M et al. [43] presented a model for assessing the precision of ANNs in predicting 5-, 10- and 15-year breast cancer specific survival. The data were collected from the City Hospital of Turku and Turku University Central Hospital and comprise 951 instances, divided into a training set of 651 instances and a validation set of 300 instances. The study compares the outcomes of an artificial neural network and logistic regression. The precision for breast cancer specific survival was reported as 0.909 for 5 years, 0.886 for 10 years and 0.883 for 15 years [43].
Delen D et al. [44] compared three data mining techniques, logistic regression, decision tree, and artificial neural network, for predicting breast cancer development. The model uses over 200,000 cases from a very large data repository. The precision is 89.2% for logistic regression, 93.6% for the decision tree (C5) and 91.2% for the artificial neural network. 10-fold cross-validation was used for testing, to obtain an unbiased estimate of the prediction performance of the three techniques. The research indicates that the decision tree is the best predictor of breast cancer development relative to the artificial neural network and logistic regression [44].
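The k-fold procedure can be sketched as an index split in which every sample is used for testing exactly once. This is a generic illustration rather than the setup of [44]; the function name and sample count are hypothetical, and in practice the indices would normally be shuffled or stratified by class before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Hypothetical dataset of 25 samples evaluated with 10-fold cross-validation.
folds = list(k_fold_indices(25, 10))
fold_test_sizes = [len(test) for _, test in folds]
```

A model is trained on each training split and scored on the matching test split, and the ten scores are averaged; because no sample is tested by a model that saw it during training, the average is a less biased estimate than a score on the training data.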
Bellaachia Abdelghani et al. [45] used data mining techniques to present a model for predicting breast cancer development. The pre-classification process uses three fields: vital status recode, survival time recode and cause of death. Three machine learning techniques were compared for classification performance: a back-propagated neural network, naïve Bayes and C4.5. The dataset is taken from the National Cancer Institute (NCI) Surveillance, Epidemiology, and End Results. There are 151,886 records in the dataset, with 16 attributes. The model provides roughly 87% precision [45].
H. Koyuncu et al. [46] implemented a biomedical pattern classifier based on a rotation forest and artificial neural network (RF-ANN). A multilayer perceptron was used as the base classifier and the RF algorithm as the ensemble classifier. Different feature sets are derived from the dataset using principal component analysis. The precision of RF-ANN is 98.05 percent [46].
S. M. Jamarani H et al. [47] proposed a method for recognizing the disease and helping radiologists predict breast cancer. The model combines an artificial neural network (ANN) with multiwavelet-based sub-band image decomposition. The technique is studied using the mammographic database of the Mammographic Image Analysis Society (MIAS). Among the various kinds of multiwavelet, the biorthogonal Geronimo, Hardin and Massopust multiwavelet 2 (BiGHM2) gave the best output. BiGHM2 achieved an area under the receiver operating characteristic (ROC) curve of around 0.96 [47].
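A receiver operating characteristic score of this kind is commonly summarized as the area under the ROC curve (AUC), which equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are hypothetical and unrelated to the results in [47].

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half a win."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores for three malignant (1) and three benign (0) cases.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(y_true, scores)
```

In this example eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9. The pairwise form is quadratic in the number of samples, which is fine for illustration but slow for large datasets.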
T. Nguyen et al. [48] suggested an automatic technique for classifying medical data that combines wavelets with type-2 fuzzy logic. They ran experiments on two medical datasets from the UCI machine learning repository: Cleveland heart disease and Wisconsin breast cancer. The results show that the interval type-2 fuzzy logic system performs better than other machine learning methods [48]. Z. Mahmud et al. [49] suggested a method that uses age, marital status and therapy to study cervical cancer among Malaysian women. The records of patients with cervical cancer are taken from the medical center of Kebangsaan Malaysia University (UKM). The study has four phases and uses 444 records of patients with cervical cancer to relate age and marital status to the treatment received. They found that 46-year-old women are more likely to develop cervical cancer. It is therefore suggested that Malaysian women undergo testing before the age of 45. They also found that Chinese women under the age of 57 are more likely to receive radiotherapy in the initial stage of cervical cancer [49].
M. Seera et al. [50] suggested a hybrid intelligent system for classifying medical data to predict cancer. The model combines a fuzzy min-max neural network, a classification and regression tree, and a random forest. The fuzzy min-max network is used for learning, the classification and regression tree is used to extract rules, and the random forest technique is used to create an ensemble of classification and regression tree models. The precision of this model for cancer prediction was 98.84% [50].
W. Kuo et al. [51] suggested a novel technique for breast tumour prediction in clinical ultrasonic images using the decision tree. The model concentrated on ultrasound (US) images. The decision tree  [51].
Seon-Hak [52] constructed a prototype using rough set structures for hierarchical classification. The model is based on a hierarchical granulation framework for discovering classification rules, and a rule discovery method is proposed accordingly. The technique is validated against the Wisconsin breast cancer (WBC) dataset. The model still achieves excellent performance when built from simple rules and brief conditionals. Thus, by creating minimal classification rules, the model was effective in decreasing the number of dimensions, which makes the information system simpler to analyze [52]. Table 3 below shows the comparative study of the breast cancer techniques discussed above.

CHALLENGES AND RESEARCH OPPORTUNITIES
Data mining methods, classification methods, intelligent methods and feature selection for disease prediction were reviewed here. Because feature selection removes unnecessary data, high-dimensional data can be reduced without losing information, which improves the effectiveness of classification. However, selecting an attribute subset is difficult, because it depends on complex interdependencies among a wide range of factors. In the future, rules and feature selection could be incorporated into the classifiers for better results. Additionally, new feature selection methods such as ant colony optimization could be tested to improve quality, and algorithms could be evaluated on medical datasets with distinct characteristics such as noisy data, sparsity and missing values in order to enhance model accuracy.

CONCLUSION
This paper's primary focus is to discuss various prediction models and techniques used for predicting heart disease and breast cancer. The paper also sheds light on the significance of various classification techniques for disease prediction on medical datasets. The datasets discussed across the surveyed methods relate to heart disease and breast cancer. Different machine learning methods are used as classifiers to construct cost-effective models for predicting disease. The exhaustive study therefore shows that extracting the necessary data from clinical repositories helps to promote well-informed testing and decisions.