Predictive model for acute myocardial infarction in working-age population: a machine learning approach

ABSTRACT


INTRODUCTION
Cardiovascular disease (CVD) is the leading cause of morbidity and mortality worldwide, affecting millions of people every year. CVD is largely preceded by atherosclerosis, which is a major precursor to the three most common vascular diseases: acute myocardial infarction (AMI), stroke, and peripheral arterial disease (PAD). Together, these diseases represent approximately one-third of all deaths worldwide [1], [2].
AMI is defined as myocardial necrosis in a clinical setting compatible with acute myocardial ischemia, which is commonly secondary to thrombotic occlusion of a coronary artery. Pain is the dominant characteristic symptom of this condition. The diagnosis of AMI is also based on the presence of myocardial damage detected by the elevation of cardiac biomarkers in the context of evidence of acute myocardial ischemia [3]. AMI accounts for more than 80% of cases of ischemic heart disease due to atherosclerosis and is more frequent in men than in women. Additional risk factors that can cause AMI include obesity, a high-calorie diet, smoking, type 2 diabetes mellitus, hypertension, and dyslipidemia, among others [4]. Atherosclerotic risk can be predicted by measuring atherogenic indices based on the lipid profile, which consist of mathematical ratios between the levels of total cholesterol, triglycerides, high-density lipoprotein (HDL), or low-density lipoprotein (LDL) [5]. However, these indices have not been explored in the working population of the region, a population that is necessary for the economic development of the territory. Therefore, it is necessary to include the work environment in this study to obtain more accurate results on the risk of developing AMI.
It is important to highlight that people with this pathology or a history of it have significantly reduced quality of life and life expectancy due to premature death and years of life lost due to disability. This represents a high health cost. Specifically, in individuals of working age, significant economic burdens arise from work disability, through either absenteeism or loss of productivity. Therefore, it is necessary to create mechanisms that predict the probability of AMI in working individuals, so that strategies can be designed to mitigate its effects within the work environment [6].
The use of autonomous learning algorithms or machine learning algorithms applied to the field of medicine is a novel topic. Currently, several studies in the literature report the use of algorithms such as artificial neural networks, case-based reasoning, Bayesian networks, decision trees, or k-means to support the diagnosis of diseases such as breast cancer, prostate cancer, cardiovascular diseases, hypertension, Parkinson's, infarctions, and rheumatoid arthritis, among others, and the prediction of mortality or survival after cardiovascular events [7]. These techniques are a fundamental support for healthcare personnel, since some pathologies could be prevented and adequately treated before they present major complications, improving the survival chances of patients. They are also applied in hospital emergency departments for patient triage, reducing assessment times and assigning the correct care priority [8]. Several studies related to this topic are summarized in this document.
For pancreatic cancer, the study by [9] used neural networks to predict individual long-term survival of patients undergoing radical surgery for this type of cancer, with a performance of 79%. On the other hand, the research by [10] described the application of neural networks to recognize complex patterns for the prediction of advanced bladder cancer in patients undergoing radical cystectomy, breast cancer, and also prediction of survival after hepatic resection for colorectal cancer. The implemented model resulted in a disease prediction rate of 90.5% and individual survival prediction of 72% [11]-[13].
The research by [14] shows how classification algorithms such as logistic model tree (LMT), Bayesian networks, naive Bayes, J48, and naive Bayes simple were used for the diagnosis of pathologies of the spine, to decide which is the best algorithm for the diagnosis of this disease. The results obtained during classification show that the LMT decision algorithm obtained a success rate of 85.48%. The Bayesian networks algorithm had a success rate of 80%; the naive Bayes algorithm correctly classified 248 instances with an absolute error of 2%; and finally, the naive Bayes simple algorithm correctly classified 241 of the 310 instances. The authors concluded that the best decision algorithm is the LMT.
In [15], important results were obtained for the diagnosis of rheumatoid arthritis, as well as its categorization and potential application in personalized medicine for individuals affected by this disease. Computational models were designed for classification, among them artificial neural networks, which, using 5 variables, obtained a sensitivity of 92.3% with a specificity of 86.66%, and Bayesian networks, which achieved a sensitivity of 92.3% and a specificity of 93.33%. Using artificial neural networks, [16] obtained a model capable of recognizing 3 types of values (cirrhosis, non-cirrhosis, and non-identifiable) with a success rate of almost 90%.
For the prognosis of bladder cancer mortality, [17] used seven learning methods to predict mortality at five years after radical cystectomy, including neural networks, radial basis function networks, extreme learning machine (ELM), regularized ELM (RELM), support vector machine (SVM), and the nearest neighbor classifier (K-NN). The results indicate that RELM achieves the highest prediction accuracy with 80%. In patients with stroke (ischemia and hemorrhage), for the prognosis of mortality 10 days after the event, the research by [18] applied neural networks to obtain a predictive model and achieved a sensitivity and accuracy of 87.8% for the hemorrhagic group and specificity of 75.9%, sensitivity of 85.9%, and accuracy of 80.9% for the ischemic group.
On the other hand, [19] show the training and testing of several neural networks with different architectures for the diagnosis of myocardial infarction, based on the data from the Braunwald angina probability rating scale [20]. Forty networks were generated and tested in 5 experiments, of which the diagnostic accuracy was higher with the model of 5 electrocardiographic inputs plus troponin. Several of the networks designed for this case had a sensitivity and specificity close to 99%. In turn, [21] present different machine learning algorithms for the classification of breast cysts through thermographic images using artificial neural networks. Their results indicate a sensitivity of 78% and specificity of 88%. The overall efficiency of the system was 83%.
In other areas of the medical field, artificial neural networks have been proposed to predict prolonged hospital stay for elderly patients in the emergency department, as well as prolonged stay in the intensive care unit and mortality, achieving a sensitivity of 62.5% and specificity of 96.6% for hospital stay and 82% for mortality prediction [22]-[24].
Previous research has shown that the use of autonomous learning algorithms is an effective tool for the diagnosis and prediction of disease onset, allowing the medical team to provide timely treatment and thus improving the patient's quality of life. Within the literature of the last five years consulted by the authors of this study, there is no evidence of predictive models focused specifically on the probability of suffering from AMI related to atherogenic indices, anthropometric measures, or paraclinical variables for the working population.

METHOD
This is an observational, descriptive, cross-sectional study whose database was analyzed starting in January 2022. It was conducted in a population of 427 workers aged ≥40 years in the city of Popayán, from which a sample of 202 individuals was screened, considering a confidence level of 95% and a margin of error of 5%. Subsequently, epidemiological, clinical, and paraclinical data were collected, the latter from a peripheral blood sample after obtaining informed consent. All the questionnaires, procedures, and protocols were reviewed and approved by the Ethics Committee for Scientific Research at the University of Cauca; the guidelines used in the review were based on the bioethical principles established in the 1975 Declaration of Helsinki and the parameters outlined in Resolution 8430 of 1993 of the Colombian Ministry of Health.
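The text does not state which formula produced the sample of 202. As a minimal sketch, assuming the standard finite-population formula for estimating a proportion, with z = 1.96 for 95% confidence and the conservative p = 0.5 (both assumptions, as is the function name), the reported figure can be reproduced:

```python
def finite_population_sample_size(population, z=1.96, p=0.5, margin=0.05):
    """Sample size for a proportion in a finite population.

    z, p and margin are assumed defaults; the paper only reports
    a 95% confidence level and a 5% error.
    """
    variance_term = z ** 2 * p * (1 - p)
    numerator = population * variance_term
    denominator = margin ** 2 * (population - 1) + variance_term
    return numerator / denominator


n = finite_population_sample_size(427)
print(round(n))  # 202
```

Under these assumptions the formula gives about 202.5, which is consistent with the 202 individuals reported.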
Atherogenic indices (AI) were calculated: total-cholesterol-high-density-lipoprotein ratio (TC/HDL), low-density-lipoprotein-high-density-lipoprotein ratio (LDL/HDL), (TC-HDL)/HDL, TC-HDL, logarithmic triglycerides-to-high-density-lipoprotein (LOG(TG/HDL)), and TG/HDL. Cardiovascular risk was measured using the Framingham scale, using variables such as age, sex, systolic blood pressure, diabetes, smoking status, total cholesterol levels, and HDL. The atherogenic indices were then correlated with the percentage of cardiovascular risk, estimating the risk of developing atherosclerosis [5] and the coronary risk according to the Framingham adjusted function recommended by the Colombian guide [24]. The Framingham-adjusted scale (Framingham cardiovascular risk × 0.75), proposed recently by the clinical practice guide for the prevention, early detection, diagnosis, treatment, and follow-up of dyslipidemia in Colombia [25], [26], was used.
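The indices listed above are simple arithmetic on the lipid profile. The following sketch illustrates them under two assumptions not stated in the text: the logarithm is taken in base 10, and all lipid values are in the same unit. The function names and the example lipid values are hypothetical.

```python
import math


def atherogenic_indices(tc, hdl, ldl, tg):
    """Return the six atherogenic indices from a lipid profile.

    All inputs must share one unit; base-10 logarithm is an assumption.
    """
    return {
        "TC/HDL": tc / hdl,
        "LDL/HDL": ldl / hdl,
        "(TC-HDL)/HDL": (tc - hdl) / hdl,
        "TC-HDL": tc - hdl,
        "LOG(TG/HDL)": math.log10(tg / hdl),
        "TG/HDL": tg / hdl,
    }


def adjusted_framingham(risk_percent, factor=0.75):
    """Framingham risk multiplied by the 0.75 adjustment given in the text."""
    return risk_percent * factor


# Hypothetical lipid profile, for illustration only.
example = atherogenic_indices(tc=200.0, hdl=50.0, ldl=120.0, tg=150.0)
adjusted = adjusted_framingham(20.0)
```

Note that the value of the logarithmic index depends on the unit used for triglycerides and HDL, so the unit convention should be fixed before comparing results across studies.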
According to the purposes of this study, the occurrence of AMI was taken as the dependent variable, which is categorical and binary, with output values of "yes" or "no." The remaining variables that may be related to it are quantitative (cholesterol, triglycerides, glucose, HDL, LDL, very-low-density lipoprotein (VLDL), blood pressure, peripheral arterial disease (PAD), body mass index (BMI), abdominal perimeter (AP), age, ICI_CT/cHDL, cholesterol-high-density-lipoprotein CHOL-HDL/HDL) and some are categorical (sex, smoking, physical activity, marital status, education, race, origin, occupation, type of contract).
The analysis began by determining the relationship between the different variables; given the binary nature of the dependent variable, a logistic regression was performed. Different combinations of variables were tested using the JASP statistical software until the most favorable result was obtained, in which a relationship was found between the occurrence of acute myocardial infarction and the variables summarized in Table 1.
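The study fitted its logistic regression in JASP. Purely as an illustration of what such a model estimates, the sketch below fits a one-predictor logistic regression by gradient ascent on the mean log-likelihood. The data, the single predictor, the learning rate, and the function names are all hypothetical and are not taken from the study.

```python
import math


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit P(y = 1 | x) = sigmoid(b + w * x) by gradient ascent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [y - sigmoid(b + w * x) for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w += lr * grad_w
        b += lr * grad_b
    return w, b


# Hypothetical toy data: one predictor, binary outcome (1 = "yes").
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
w, b = fit_logistic(xs, ys)
p_low = sigmoid(b + w * 0)
p_high = sigmoid(b + w * 7)
```

A positive coefficient `w` means that the estimated probability of the outcome rises with the predictor, which is the kind of relationship the variable selection step looks for.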

Logistic regression
According to the Akaike information criterion (AIC) and Bayesian information criterion (BIC) metrics, the alternative hypothesis (H1) has the lowest values, suggesting a significant relationship between the output and predictor variables, as shown in Table 2. According to the confusion matrix shown in Table 3, of the 202 records entered, 104 that should have been classified as NO were correctly classified, and 77 that should have been classified as YES were correctly classified, showing an accuracy of 92.8%. The sensitivity is 83.7% and the specificity is 94.5%, as seen in Table 4. These results show a good performance of the algorithm, which also eliminates subjectivity and the need for highly trained medical personnel.
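For reference, the quantities discussed in this section follow from standard definitions. The sketch below computes accuracy, sensitivity, specificity, and precision from confusion-matrix counts, and AIC and BIC from a log-likelihood. The counts and the log-likelihood are hypothetical values chosen for illustration; they are not the contents of Tables 2 to 4.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from binary confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),    # positive predictive value
    }


def aic(log_likelihood, k):
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical counts and log-likelihood, for illustration only.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
aic_value = aic(log_likelihood=-100.0, k=3)
bic_value = bic(log_likelihood=-100.0, k=3, n=202)
```

For both criteria, lower values indicate a better trade-off between fit and model complexity, which is why the model with the lowest AIC and BIC is preferred.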

Machine learning algorithms
Given the results obtained in the logistic regression, it was possible to clearly identify the variables that form part of the predictive model to be generated. For the construction of the model, the machine learning classification module was used, which offers several methods, including boosting, decision tree, k-nearest neighbors, random forest, and support vector machine [27]. Comparing the different metrics of these classification methods shows that they perform similarly. The average accuracy for boosting, decision tree, and random forest is 87.5%, and their precision is 88.1%, while for k-nearest neighbors and support vector machine, the average accuracy is 85%, and their precision is 85.1%. With these differences, although not significant, the selection was reduced to boosting, random forest, and decision tree. The first two are based on decision tree ensembles, which may have an advantage over a single decision tree since, according to the literature, a single tree is not as robust: a small change in the data can cause large changes in the final estimated tree [28]. The boosting and random forest methods differ in their training approach: in random forest, each tree is trained individually with a slightly different random sample of the training data generated by bootstrapping, while in boosting, the trees are trained sequentially, so that each new tree tries to improve on the errors of the previous trees [29]. Given that the results obtained are similar among the tree-based ensemble techniques, and due to its ease of interpretation, the random forest technique was chosen to generate the predictive model.
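The bootstrapping idea behind random forest can be sketched in a few lines. The example below only illustrates the resampling and voting steps, not a full forest: it draws bootstrap samples (sampling with replacement) and combines hypothetical per-tree predictions by majority vote. The function names, the toy rows, and the number of samples are assumptions made for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement, as done for each tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Combine per-tree class predictions into one ensemble prediction."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # stand-in for 10 training records
samples = [bootstrap_sample(rows, rng) for _ in range(5)]

# Hypothetical predictions from three trees for one individual.
vote = majority_vote(["YES", "NO", "YES"])
```

Boosting is not shown here; it would replace the independent resampling with a sequential loop in which each new tree is fitted to the errors of the trees before it.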

Random forest algorithm
The database was loaded into the JASP software, and the random forest method of the machine learning classification module was applied. With 14 trees, each with 3 predictors per split, and taking 129 training records (64%), 33 validation records (16%), and 40 evaluation records (20%), an accuracy of 87.5% is obtained for both YES and NO values. Precision is 92.9% for YES values and 84.6% for NO values, as shown in Tables 5 and 6, which are very satisfactory results. Figure 1 shows the increase in accuracy for training data and validation data with respect to the number of trees. From 14 trees onwards, the accuracy for validation is 90.9% and slightly lower for the training data (87.5%). Finally, the ROC curves shown in Figure 2 indicate that, for the classification of both YES and NO, their shapes approach the desired one, with an area under the curve (AUC) of 0.863; that is, a randomly chosen case of one class receives a higher predicted score for that class than a randomly chosen case of the other class with a probability of 86.3%.
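Two steps in this section can be illustrated with a short sketch: the partition of the 202 records into the reported 129/33/40 split, and the pairwise interpretation of the AUC. This is not the JASP procedure; the shuffling seed, the function names, and the example scores are assumptions made for illustration.

```python
import random


def split_indices(n, n_train, n_valid, n_test, seed=0):
    """Shuffle record indices and cut them into three disjoint sets."""
    if n_train + n_valid + n_test != n:
        raise ValueError("split sizes must add up to n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


def auc_from_scores(pos_scores, neg_scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))


train, valid, test = split_indices(202, 129, 33, 40)

# Hypothetical predicted scores for YES and NO cases.
auc = auc_from_scores([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
```

In the toy example, 8 of the 9 possible YES/NO pairs are ranked correctly, so the AUC is 8/9; an AUC of 0.5 would correspond to random ranking.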

CONCLUSION
For the first time in the municipality of Popayán, a model has been established that predicts the probability of suffering an acute myocardial infarction in an occupationally active population beyond the fourth decade of life. The model achieves a prediction accuracy of 95%. This result may help reduce the incidence and mortality rates of the study population, improving their quality of life.
The predictive model was generated in two computational tools: the statistical software JASP and the programming language Python, where the latter performed better, with a 2% difference in accuracy. The obtained classification model showed satisfactory accuracy: in the training phase, this value was 90.2%, while in the validation phase, it increased to 95%. In the future, a cardiovascular risk calculator could be generated to be used in a simple and non-invasive way in clinical practice.

Figure 1. Relationship of accuracy with respect to the number of trees

Figure 2. ROC curves of the generated classification model

Predictive model for acute myocardial infarction in working-age … (Astrid Lorena Urbano-Cano)

Table 1. The performance of risk variables related to AMI

Regarding Table 2: the McFadden R2 value is 0.339, indicating a good model, as its value falls within the range of 0.2-0.4. The p-value is 0.001, which, being less than 0.05, shows the good performance of the model. The Nagelkerke, Tjur, and Cox and Snell R2 values are very useful when comparing different models for the same data, where the model with the highest R2 values is considered the most appropriate. However, this is not the purpose of this analysis [27].

Table 2. The performance of logistic regression

Table 3. The performance of the confusion matrix obtained with logistic regression

Table 4. The performance of metrics

Table 5. Summary of the classification model generated

Table 6. Confusion matrix of generated classification model