Analysis of WEKA data mining algorithms Bayes net, random forest, MLP and SMO for heart disease prediction system: A case study in Iraq

Received Feb 17, 2020; Revised May 29, 2021; Accepted Jun 11, 2021

Data mining is the search through large amounts of data for valuable information. Association rules, classification, clustering, prediction, and sequence modeling are among the most essential and general data mining strategies. Data processing plays a major role in disease detection in the healthcare industry. Diagnosing a patient usually requires a variety of examinations. Data mining strategies can reduce the number of examinations, and this reduction matters in terms of both time and results. Heart disease is a life-threatening disorder. Health problems are now immense because of the prevalence of health issues and the combination of various conditions. Hidden information is therefore important for decision-making in the healthcare industry. In this analysis, the Weka 3.8.3 tool is used to predict cardiovascular problems with the data mining algorithms sequential minimal optimization (SMO), multilayer perceptron (MLP), random forest and Bayes net. The collected results combine the prediction accuracy, the receiver operating characteristic (ROC) curve, and the precision-recall curve (PRC) value. Bayes net (94.5%) and random forest (94%) give the best performance, ahead of the sequential minimal optimization (SMO) and multilayer perceptron (MLP) methods.


INTRODUCTION
Heart attack and its complications are among the world's leading causes of death, and early diagnosis helps to prevent attacks. Physicians produce a large amount of information that is rich in hidden knowledge but is used inefficiently for prediction. Data mining strategies therefore help to turn unused data into a useful data set. Several signs have not been taken into account, which has led to patient deaths. Medical professionals should predict heart disease before it occurs in a patient [1]. Many characteristics may increase the likelihood of heart disease [2]: i) Smoking: damages the lining of the arteries and leads to a build-up of fatty material (atheroma) that narrows the arteries and can trigger a heart attack, ii) High cholesterol: cholesterol is a waxy material found in the fatty plaques of blood vessels. High cholesterol prevents sufficient blood from entering the lungs, causing heart disease, iii) Inappropriate diet: eating too much unhealthy food raises blood pressure and cholesterol, which can cause heart disease, iv) Lack of physical activity: An increase in the levels of

PROBLEM STATEMENT
The use of machine learning methods for the classification and prediction of heart disease has been investigated in previous research. Those studies offer models that predict heart disease in order to diagnose its incidence. This analysis additionally aims to determine the best classification method for assessing the risk of heart disease in a given case. The research is supported by a comparative study and observation using four classification techniques, i.e. sequential minimal optimization (SMO), multilayer perceptron (MLP), random forest and Bayes net. The evaluations are carried out at various levels. Although these machine learning methods are widely used, the prediction of heart disease is a critical task that requires the highest possible accuracy, compared with [11]-[17]. The four algorithms are therefore tested with a number of assessment levels and types. This gives medical researchers and physicians a better understanding and helps them find the best way to prevent cardiac disease. The proposed framework uses the WEKA software tool to evaluate the heart disease data. This paper's key contributions are:
a. Extracting the classification accuracy, which is important for the prediction of heart disease.
b. Using the electronic cardiac diagnosis databases of the Ibn al-Bitar Hospital for Cardiac Surgery and Baghdad Medical City, together with the collected real data, for training and testing the system.
c. Investigating intelligent classification strategies to reach the highest classification accuracy, obtained with the Bayes net (94.5%) and random forest (94%) methods.
d. Evaluating the classification results of the suggested classifiers and checking their performance by comparing them with the classifiers of other works.
e. Evaluating the best results of the suggested classifiers in the WEKA software.
f. Comparing various data mining algorithms on the cardiovascular disease data set.
g. Ranking the best algorithms for the prediction of heart disease based on the results.

LITERATURE REVIEW AND RELATED WORKS
Dwivedi [18] applied six machine learning classification techniques to a heart disease data set, using tenfold cross-validation for evaluation and eleven performance measures for comparison. Gharehchopogh et al. [11] used the medical records of 40 people, with blood pressure, gender, age and tobacco use as the attributes for detection. The model correctly predicted 85% of cases. Using a multilayer perceptron (MLP) on heart disease data sets in the WEKA software reached an accuracy of 80.89%. Ramotra et al. [12] proposed a machine learning model that uses the WEKA tool to predict cardiovascular disease. The data contained 303 instances and 76 attributes; after preprocessing and removal of missing values, 297 instances with 13 input features were used for analysis. The authors report an accuracy of 80.89%. Purushottam et al. [17] introduced an efficient heart disease prediction system based on data mining that can help doctors make parameter-based decisions effectively. The system was trained and tested with a 10-fold method, and an accuracy of 86.3% in testing and 87.3% in training was reported. The authors noted that the overall accuracy of the multilayer perceptron (MLP) classification was 74.85%.
Jothikumar et al. [18] proposed a model that uses a learning method to assess medical history, applying the naive Bayes algorithm in RapidMiner to 295 samples with 13 attributes. Other reported metrics are a Kappa of 0.499, an absolute error of 0.247%, an RMSE of 0.378, and a relative error of 24.19%. Sarangam Kodati et al. [19] reported, for heart disease data, a precision of 77.9% and a recall of 73.4% in Orange, and a precision of 81.8% and a recall of 81.9% in WEKA. Comparing the two tools, WEKA gave the better precision and recall.
The sequential minimal optimization (SMO) method was introduced by Platt [20] in 1998 as a fast method for solving the quadratic programming optimization problem. SMO is used to train support vector classifiers with a polynomial or RBF kernel. The implementation replaces all missing values and transforms nominal attributes into binary ones. Aung et al. [15] proposed a machine learning approach for predicting heart conditions with the WEKA tool, a design that uses sequential minimal optimization together with a lazy classification strategy. The WEKA data mining tool was used to predict heart disease, with 66 percent of the data set used for training and 34 percent for testing.
To evaluate heart disease, Mirmozaffari et al. [16] proposed a method for comparing various data mining classification methods. They developed a particular model with different filters and evaluation methods. Multi-layer pre-processing filtering is used to find the superior approach and to build more precise clinical decision support systems for the diagnosis of diseases. The UCI data set information is routinely viewed in a database or in a report. This work uses the Waikato Environment for Knowledge Analysis (WEKA). The data sets must be in the attribute-relation file format (ARFF) to be used with the WEKA tool, which is also used to pre-process the data set. Only the major attributes, 13 in this case, are taken into account in the evaluation, which provides better and clearer results; unimportant attributes are discarded. The 13th attribute is essentially the class attribute to be predicted. By extensively analyzing the various decision tree algorithms in WEKA and the choices they make, the system helps to predict the probable presence of cardiac disease in a patient, so that the disease can be diagnosed well in advance and treated in good time. Some of the standard challenges of machine learning and data mining lie in the following areas:
a. Extracting valuable information and developing scientific decision-making capability for disease diagnosis and treatment.
b. Classifying the development of effective medical treatments for various ailments.
c. Too many attributes are available for decision-making, so those that best predict heart disease must be determined.
d. With the help of computerisation, voluminous real data (text, graphs, and images) are now being processed, but they remain difficult to collect.
e. Handling noisy data (containing errors or outliers), inconsistent data (containing code or name discrepancies) and missing attributes, which must be pre-processed in medical data problems.
f. Determining the best tools and algorithms for analyzing the data sets with WEKA; future work will try to use MATLAB to develop the work further.
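Since WEKA reads its input in the attribute-relation file format (ARFF), a small illustration of that layout may help. The following is a minimal sketch, not the preprocessing used in this study: the helper `to_arff`, the relation name `heart`, and the three attributes with their values are hypothetical and chosen only to show the `@relation`, `@attribute` and `@data` sections and the `?` marker for missing values.

```python
def to_arff(relation, attributes, rows):
    """Render a small data set as ARFF text.

    attributes: list of (name, kind) pairs, where kind is the string
                "numeric" or a list of allowed nominal values.
    rows:       list of value lists, one per instance, in attribute order.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, str):
            # Numeric (or other built-in) attribute type.
            lines.append(f"@attribute {name} {kind}")
        else:
            # Nominal attribute: allowed values are listed in braces.
            lines.append(f"@attribute {name} {{{','.join(kind)}}}")
    lines += ["", "@data"]
    for row in rows:
        # ARFF marks a missing value with a question mark.
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical example: attribute names and values are made up.
attributes = [
    ("age", "numeric"),
    ("sex", ["0", "1"]),
    ("class", ["normal", "disease"]),
]
rows = [[63, 0, "disease"], [41, 1, "normal"], [57, None, "disease"]]
arff_text = to_arff("heart", attributes, rows)
print(arff_text)
```

By WEKA convention the class attribute is usually placed last, which matches the description of the final attribute as the class to be predicted.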

RESEARCH METHOD
The purpose of this study is to predict possible heart attacks from compiled medical data. A model has been developed that uses prediction algorithms to evaluate cardiac disease from selected attributes. Data mining is used in this work to create class-predictive models based on the selected features. The Waikato Environment for Knowledge Analysis (WEKA) has been used for prediction because of its ability to discover, analyze, and forecast patterns. The entire process can typically be divided into 6 stages:

Description of the algorithms
Heart disease is a term used to describe a wide variety of health conditions associated with the heart. These conditions cover pathological diseases of the heart and of all its parts. Heart disease is a substantial health concern, and the number of people who have it has increased over the years [20]. Several studies have focused on the management of heart disease. Various data mining techniques have been applied for diagnosis, and results of varying accuracy have been obtained. Many studies are being conducted to assess the efficiency of the MLP, Bayes net, SMO and random forest algorithms. There are several possible strategies for predicting heart disease [21]:
a. MLP: Multilayer perceptron algorithms support both regression and classification problems. They are also called artificial neural networks or, for short, neural networks. Neural networks are challenging to use for predictive modeling because they have many configuration parameters that can be tuned effectively only through observation and a good deal of trial and error [20].
b. Random forest: A random forest is an ensemble of randomized decision tree classifiers that makes predictions by combining the predictions of the individual trees. Randomness can be incorporated into the tree construction process in various ways. A random forest can be used for classification or regression tasks. Random forests are among the best predictive methods [22].
c. Sequential minimal optimization (SMO): An algorithm for solving the quadratic programming (QP) problem that arises during the training of support-vector machines (SVM). It is widely used for training support vector machines and is implemented by the popular LIBSVM tool. The publication of the SMO algorithm in 1998 generated a lot of excitement in the SVM community, because the previously available techniques for SVM training were much more complicated and required costly third-party QP solvers [23].
d. Bayes net: A Bayesian network combines probability theory with graphical models. It is widely applicable in machine learning, data mining, and diagnostics because it provides solid evidence-based inference that is familiar to human intuition [24].

The standard data set [25] was obtained from Iraqi hospitals under the oversight of the national Ministry of Health and includes 200 samples. To detect heart disease with a high degree of accuracy, a wide range of relevant inputs must be considered. The physician relies on all the recorded symptoms, the patient's answers to questions, medical tests and laboratory results. The data were collected from the Ibn al-Bitar Hospital and Baghdad Medical City on the basis of these medical factors, to provide appropriate medical criteria for the detection of heart disease. Some medical variables proved considerably difficult to collect (maximum heart rate, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, and the number of major vessels colored by fluoroscopy). On the advice of cardiologists, these variables were therefore replaced with heart rate, family history, smoking, hyperkinesia on echo, and a previous angina attack [26]. The adapted medical variables take into account the causal factor, the family medical history, the echo findings and the likelihood of prior angina, so as to provide adequate medical grounds; the data include four classes of heart disease in addition to the normal class. Table 1 shows the distribution of the five classes and includes the 13 medical features required for cardiovascular treatment. To build a diagnostic system, these factors are converted into a simplified numerical form [27].

Attribute description
a. Age: the age in years, as a numeric value.
b. Sex: represented in binary (0=male, 1=female).
c. Cp type: abbreviation of chest pain type, coded as follows: Value 1: typical burning sensation in the heart. Value 2: acute stabbing-like pain.

Performance metrics
The metrics used in the analysis are defined in this section [25].

Precision
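Precision1: the fraction of the instances predicted as positive that are actually positive. The precision equation is:

\[ \mathrm{Precision1} = \frac{\mathrm{TP1}}{\mathrm{TP1} + \mathrm{FP1}} \tag{1} \]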

Recall
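The fraction of the actually positive instances that are correctly retrieved, out of the total number of positive instances. The recall equation is:

\[ \mathrm{Recall} = \frac{\mathrm{TP1}}{\mathrm{TP1} + \mathrm{FN1}} \tag{2} \]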

F-Measure
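The F-measure is computed as twice the product of precision and recall divided by the sum of precision and recall [28]. The F-Measure equation is provided in (3).

\[ \mathrm{F\text{-}Measure} = \frac{2 \times \mathrm{Precision1} \times \mathrm{Recall}}{\mathrm{Precision1} + \mathrm{Recall}} \tag{3} \]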
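The three measures above can be computed directly from confusion matrix counts. The sketch below uses the standard definitions; the function names and the example counts (90 true positives, 10 false positives, 5 false negatives) are hypothetical and are not results from this study.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the classifier found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 90, 10, 5
print(f"precision = {precision(tp, fp):.3f}")
print(f"recall    = {recall(tp, fn):.3f}")
print(f"F-measure = {f_measure(tp, fp, fn):.3f}")
```

The guards against a zero denominator are a design choice of this sketch; they return 0.0 when a class is never predicted or never present.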

Area of ROC
ROC curves are commonly used to show graphically the relationship and trade-off between clinical sensitivity and specificity for every possible cutoff of a test or a combination of tests.
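The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The following minimal sketch computes the area from that pairwise definition; the function name `roc_auc` and the example labels and scores are hypothetical, and this is not necessarily how WEKA computes its ROC area.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive (diseased) case, 0 for a negative case.
    scores: classifier score for the positive class, higher = more likely.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(f"ROC area = {roc_auc(labels, scores):.3f}")
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a data set of a few hundred samples.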

Area of PRC
Precision-recall curves are not affected by the addition of patients without disease and with low test results. It is therefore strongly recommended to use precision-recall curves as a supplement to ROC curves, to obtain the full picture when evaluating and comparing tests. The outcomes of the classification model [26] are shown in Table 2.
a. True positive (TP1): Patients correctly predicted as positive (patients who are likely to have heart failure and to need cardiac catheterization).
b. False positive (FP1): Patients predicted as positive who are actually negative and are not supposed to have a cardiac catheterization. The model is ideal when TP1 and TN1 are approximately 100 percent.
c. True negative (TN1): Healthy people who are correctly classified as healthy.
d. False negative (FN1): Heart disease patients who are incorrectly classified as healthy [28].
e. Correctly classified cases (CCC): The proportion of patients, both those who need heart surgery and those who do not, who are diagnosed correctly. This is also known as accuracy [29] and is given in (4).

\[ \mathrm{Accuracy} = \frac{\mathrm{TP1} + \mathrm{TN1}}{\mathrm{TP1} + \mathrm{TN1} + \mathrm{FP1} + \mathrm{FN1}} \tag{4} \]

f. Mean absolute error (MAE): A measure of how close the predictions are to the actual outcomes. It may be calculated as 1-ACC. A strong system has a very low mean absolute error [30].
g. Kappa: Kappa checks the agreement between the predictions and the correct classes. The kappa statistic is a score in the 0-1 range. A value greater than 0 means that the classifier is better than chance [31].
h. Root mean squared error (RMSE1): The root mean squared error measures the difference between the predicted values and the observed values [32].

\[ \mathrm{RMSE1} = \sqrt{\frac{1}{n} \sum_{j=1}^{n} \left( P_j - T_j \right)^2} \]

where $P_j$ is the value predicted for case $j$, $T_j$ is the target value for case $j$, and $n$ is the number of cases.
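The remaining measures in the list can be sketched in the same way. The code below uses the standard definitions of accuracy, Cohen's kappa, MAE and RMSE for a two-class problem. It is an illustration under stated assumptions: the function names, the confusion matrix counts and the predicted values are hypothetical, and WEKA itself computes MAE and RMSE over the predicted class probability distributions, which this simplified version does not reproduce.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Proportion of correctly classified cases."""
    return (tp + tn) / (tp + tn + fp + fn)


def kappa(tp, tn, fp, fn):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals of the matrix.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


def mae(predicted, target):
    """Mean absolute difference between predicted and target values."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def rmse(predicted, target):
    """Square root of the mean squared difference."""
    squared = sum((p - t) ** 2 for p, t in zip(predicted, target))
    return math.sqrt(squared / len(target))


# Hypothetical confusion matrix for 200 cases, for illustration only.
tp, tn, fp, fn = 90, 95, 10, 5
print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"kappa    = {kappa(tp, tn, fp, fn):.3f}")

# Hypothetical predicted probabilities of disease and true labels.
predicted = [0.9, 0.2, 0.8, 0.4]
target = [1, 0, 1, 0]
print(f"MAE  = {mae(predicted, target):.3f}")
print(f"RMSE = {rmse(predicted, target):.3f}")
```

Because the squared term penalizes large errors more heavily, RMSE is never smaller than MAE on the same predictions.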

Proposed strategies
This research aims to predict the possibility of heart disease through early automatic diagnosis within a short time. This will help healthcare professionals to treat their patients early on the basis of accurate decision-making. The proposal also has a crucial role in healthcare organizations, especially for practitioners with less knowledge and skill. Achieving accurate results is regarded as the major limitation of existing methodologies. The proposal uses data mining techniques and the machine learning algorithms SMO, MLP, random forest and Bayes net, with k-fold cross-validation, to predict the occurrence of heart disease. Many medical attributes are used to identify whether or not a patient has heart disease, such as blood pressure, cholesterol, age, blood sugar, sex, and heart rate. The data set was analyzed and processed with the WEKA software. WEKA is open-source software that includes a set of machine learning algorithms for data mining tasks and is implemented in Java. WEKA contains several tools that are important in data mining tasks: data preprocessing, regression, clustering, classification, association and visualization. The WEKA analysis methodology is shown in Figure 1.

RESULTS AND PERFORMANCE COMPARISONS
The Weka data mining tool has been used for clinical prediction. Part of the data set is used for learning and the remainder for testing. The classification results are shown in Table 3. It can also be seen that multilayer perceptron (MLP) and SMO are almost identical in recall but differ in precision. Figure 2(c) shows that Bayes net performs best, giving the maximum values of F-measure (0.885), ROC (0.971) and PRC (0.864). Figure 2(d) shows that, for random forest, the MAE (0.088) is much smaller than that of sequential minimal optimization (SMO) (0.2492), the RMSE (0.1992) is the lowest, and the Kappa (0.8431) is the highest. Figure 2(e) shows the root relative squared error (RRSE) of Bayes net (54.84%), which contributes to better prediction results.

To evaluate the efficiency of the classification strategies for class prediction and their accuracy, each algorithm is applied to the data set with stratified 10-fold cross-validation. The resulting confusion matrix is used to calculate accuracy, sensitivity, and specificity. The matrix tabulates the samples that are labelled correctly and those that are labelled incorrectly. The confusion matrix evaluation for sequential minimal optimization (SMO), multilayer perceptron (MLP), random forest and Bayes net covers 200 instances with the positive causal factor for a heart attack. The results strongly suggest that data mining techniques can predict a diagnostic class. The confusion matrix shows the classification accuracy in detail and confirms the model's performance.
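Stratified 10-fold cross-validation splits the data into ten folds that each preserve the class proportions of the whole set; every fold is used once for testing while the other nine are used for training. The following sketch shows one simple way to build such folds. It is not WEKA's exact procedure: the function name `stratified_folds`, the fixed seed, and the 120/80 class split of 200 cases are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split case indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    folds = [[] for _ in range(k)]
    position = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        # Deal the cases of each class round-robin across the folds.
        for idx in members:
            folds[position % k].append(idx)
            position += 1
    return folds


# Hypothetical labels: 120 normal (0) and 80 diseased (1) cases.
labels = [0] * 120 + [1] * 80
folds = stratified_folds(labels, k=10)

for i, test_idx in enumerate(folds):
    # Every fold is the test set once; the rest form the training set.
    train_idx = [j for f in folds if f is not test_idx for j in f]
    positives = sum(labels[j] for j in test_idx)
    print(f"fold {i}: train={len(train_idx)} test={len(test_idx)} diseased={positives}")
```

A classifier would be trained on `train_idx` and evaluated on `test_idx` in each iteration, and the ten confusion matrices summed to give the overall result.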
We compared the suggested classifiers with the heart disease classification approaches discussed in the related works section. Table 4 and Figure 3 contrast the classification results of the suggested WEKA-based method with other research findings. In comparison with the existing methods and experimental tests, we consider our proposed system better than the other models at predicting and diagnosing heart disease. The classification accuracy of current models is therefore improved.

Figure 3. Comparison between heart disease prediction system using different techniques and other works

CONCLUSION
In this analysis, we have presented an effective method for heart disease prediction based on data mining and tested the accuracy of heart disease prediction with a group of classifiers. The heart disease database used for training and testing was collected from the Ibn al-Bitar Hospital and Baghdad Medical City. The system will assist physicians in making accurate, parameter-based decisions. The research was carried out with several data mining classification techniques (SMO, MLP, Bayes net and random forest) using tenfold cross-validation, and the Bayes net algorithm was found to give higher accuracy (94.5%) than the other algorithms on the supplied data set. The approach can also be used with many other classification techniques.