Breast cancer detection using machine learning approaches: a comparative study

ABSTRACT


INTRODUCTION
The rapid spread of breast cancer is clearly noticeable, especially in some developed countries. It is considered one of the leading causes of death among females [1], [2]. Among women affected by cancer, breast cancer is the most frequent type [3] and the primary cause of death [4]. In 2018, about two million new cases were reported [5]. The complexity of the breast cancer diagnosis process and the late discovery of the disease are the reasons behind the low survival rate. Early detection and treatment prevent the progression of the disease and thus improve the survival rate [6]-[10]. However, the absence of symptoms at the early stage of the disease makes its early discovery harder [11]. Therefore, computer-aided diagnosis (CAD) methods are highly desirable, as they are very effective in the prediction process. CAD methods have the potential to be valuable tools to help radiologists [12].
In comparison with other types of cancer, a recent report has shown that breast cancer accounts for the highest number of new cases among females and is the leading cause of death [13], as Figures 1 and 2 show. Moreover, other sources indicate that breast cancer in women is the second largest cause of death worldwide [14]; therefore, computer-aided detection techniques that help radiologists detect abnormalities are much needed to improve the survival rate through early detection of this deadly disease. Early detection and treatment are important for patients with breast cancer. It is possible for 95% of patients with early breast cancer to be cured completely [15].
Figure 1. The death rate of some common cancers among females worldwide for the year 2018 [13]
Moreover, no method to prevent the occurrence of breast cancer has yet been presented, whereas a tumor in its early stages can be treated effectively [16], [17]. This reflects the need to develop adequate methods to support the early diagnosis of this disease. Developing such methods relies mainly on advanced computing techniques to enrich the diagnostic capabilities.
Generally, CAD assists radiologists in screening patient images and therefore increases detection accuracy. It has been widely used to diagnose different kinds of disease [18]. Interested readers are referred to the study in [19], which sheds light on CAD methods that use mammograms to diagnose or detect breast cancer.
Figure 2. The rate of new cases of some common cancers among females worldwide for the year 2018 [13]
In this paper, an enhanced version of the Wisconsin breast cancer diagnosis (WBDC) dataset is used to train and test eight popular machine learning models that are commonly used in medical image classification. One of these classifiers is an ensemble classifier based on the stacking technique. It takes the outputs of all the other classifiers to form an ensemble classification model. The dataset is improved by applying five feature selection methods: Chi-square (X2), ReliefF, analysis of variance (ANOVA), Gini index, and Gain ratio. The purpose of applying these feature selection methods is to reduce the data dimensionality, discarding unweighted features and keeping only the features that have considerable impact on the classification process. When a model learns from irrelevant features, its accuracy suffers, so a key to success is to keep only the relevant features and ignore the others. Using the relevant features reduces the computational cost and improves model performance. This paper also gives a thorough review of the common classification models used to predict breast cancer occurrence. These classifiers are carefully selected, taking into consideration their capability to attain high prediction accuracy. They include two classifiers that are based on an ensemble approach; one uses the bagging technique and the other uses the stacking technique. These variations allow our work to rest on different techniques and scenarios, which enriches our findings.
These classifiers are trained and tested on an enhanced version of a common dataset (which is another contribution of this work) in order to identify the classifiers that are capable of achieving high detection accuracy; this is the main aim of this paper. Several phases of enhancement have been applied to produce trusted and reliable findings, and these phases are illustrated throughout this paper.
The paper is organized as follows. The first section discusses the spread of breast cancer worldwide and gives supporting statistics. It also sheds light on computer-aided techniques and their important role in detection and diagnosis. The following section provides a review of relevant classification models that are commonly used to predict breast cancer, including their limitations and achievements. This section focuses mainly on eight classification methods that have proven capable of obtaining high performance in predicting breast cancer tumors. Our proposed method is given in section 3, which describes the method and highlights its effectiveness in achieving the expected goals. This section also details our enhanced dataset, the applied feature selection methods, the implementation tool, and the validation techniques. Section 4 presents the experimental work and discusses the results obtained. Finally, the conclusion is summarized in section 5.

RELATED WORK
It is noteworthy that computer-aided techniques such as machine learning and deep learning methods are becoming common in the medical field. The way patient data are examined and evaluated is one of the most important factors influencing the diagnostic process, so automating this process can greatly help to obtain accurate results. Machine learning classification models have a great influence in minimizing the errors that can be caused by inexperienced physicians, and they are capable of giving accurate results in a short time. However, problems may arise with machine learning techniques due to inefficient validation techniques, inadequate classification models, and redundant unweighted features. In addition, due to the complexity of breast tissue, classification and prediction of breast cancer in medical imaging is considered a critical task [20]. Therefore, machine learning and deep learning techniques are a vital part of improving the diagnosis process, provided that they are implemented carefully across all phases of the workflow, from using efficient tools for image preprocessing and selecting proper features up to applying adequate validation methods. The rest of this section reviews the classifiers that are selected for testing in this work.
Gao et al. [21] pointed out the potential solutions that CAD techniques can offer compared to traditional methods. Generally, the development of computer techniques in machine learning, data mining, and deep learning has played a huge role in improving clinical care systems and supporting early diagnosis of many diseases, and therefore in improving the survival rate.
Segmentation is one of the essential CAD system components [22] and plays an important role in classification accuracy. A comparative analysis of three segmentation techniques, spatial fuzzy c-means (SFCM) with level set, selective level set, and spatial neutrosophic distance regularized level set (SNDRLS), was conducted in [23]. The performance of these techniques was evaluated on a dataset of breast cancer images. The results showed that SFCM with level set extracts cancerous cells more effectively than the other techniques. Kaushal and Singla [24] proposed an automated segmentation technique for breast cancer images. The authors listed some advantages of this technique that enhance the segmentation process, so that the proposed technique can identify cancerous cells correctly. The proposed technique was evaluated and shown to be effective.
An experiment using the logistic regression (LR) classifier model has been carried out. The dataset used in this experiment is Wisconsin diagnosis breast cancer (WDBC), and the authors highlight that the proper selection of a feature combination is important to improve classification accuracy. Their findings show that the accuracy of the logistic classifier can reach up to 96.5% when maximum perimeter and maximum texture are selected as the two features [25]. Another study compared LR and decision tree (DT) classifiers and found that the DT algorithm performed slightly better than LR; however, both classifiers obtained a high accuracy rate [26].
A study was conducted to compare two classifiers, multi-layer perceptron (MLP) and radial basis function (RBF), both of which are neural network classifiers. The results show that MLP outperformed RBF by attaining higher accuracy [27]. Zheng et al. [28] noted that neural network classifiers have become a popular method for classifying cancer data. In addition, another study compared three classifiers, MLP, naïve Bayes (NB), and the C4.5 tree, and found that MLP is the most effective, although it has poor interpretability [29]. The NB classification model is built on Bayes' theorem. Despite the violation of one of its assumptions, NB attained good performance with good interpretability [29].
Karabatak [30] indicates that NB is one of the most powerful classification models; however, it has some drawbacks. To overcome these drawbacks, Karabatak proposed a new NB classifier (weighted NB). The author reported that, in the conducted experiments, weighted NB obtained better accuracy than regular NB. The author also stated some drawbacks of this new classifier to be addressed in future research.
Showrov et al. [31] conducted a comparative study of three classifiers, artificial neural network (ANN), support vector machine (SVM), and NB, using a WDBC dataset of nine features for breast cancer prediction. In terms of classification accuracy, their results showed that linear SVM topped the other classifiers, followed by the radial basis function neural network (RBF NN) and then Gaussian naïve Bayes. Moreover, another study has shown that the support vector machine has a precise diagnosis capacity [32].
Using features extracted from mammography images after applying some image processing techniques, Al-Hadidi et al. [33] trained and tested two classification algorithms using MATLAB software and reported their results. The two classifiers are logistic regression (LR) and back propagation neural network (BPNN). They observed that a good regression value exceeding 93% was obtained using BPNN with no more than 240 features.
The random forest (RF) classifier is an ensemble approach which contains multiple algorithms. Each one can be implemented in a decision tree. The combined result of these multiple decision trees leads to enhanced classification accuracy [34]. In other words, the RF is a combination of multiple decision trees that represent an ensemble classifier to promote performance and prediction accuracy. A classifier that is based on RF was developed in [35]. Their model has been trained using two different datasets and has obtained promising results with high accuracy.
Using the WDBC, a comparison study has been carried out for three classifiers: NB, RF, and k-nearest-neighbor (KNN). These classifiers were trained and tested to examine their accuracy in predicting breast cancer tumors on this dataset. The authors observed that KNN outperformed the other classifiers, obtaining higher accuracy than NB and RF, and that all classifiers obtained detection accuracy rates above 94%. The KNN classifier does not just obtain the best accuracy; it also outperforms the others in terms of precision and F1 score [36]. Price and Lindqvist indicate that the ANN classifier performed well when feature selection methods were applied, compared to SVM, NB, and decision tree classifiers. Its performance improved considerably, by up to 51% [37].
Using a small dataset of 275 instances, the authors in [38] compared the performance of two machine learning classifier models, extreme gradient boosting (XGBoost) and RF. Their results show that RF has performed better than XGBoost in terms of detection accuracy for breast cancer; however, using a small dataset may reflect unreliable results, so the authors stated that a large dataset is required to support their findings. A recent study compared nine classification models that include LR, Gaussian NB, RBF SVM, linear SVM, DT, RF, XGBoost, KNN and gradient boosting. These models were trained and tested using Wisconsin diagnosis cancer dataset. The obtained results indicate that KNN for supervised learning and LR for semi-supervised learning achieved the highest accuracy [39].
The ensemble learning approach is one of the most practical ways to offer a trade-off between variance and bias. Many studies show that combining single classifiers into an aggregated classification model can improve classification performance compared to the performance obtained by any one of these classifiers. The three basic techniques of ensemble classification are stacking, boosting, and bagging. In the stacking approach, the outputs from different classification models are aggregated into a new dataset [40]. Readers interested in more information are referred to [41]-[46].
Hosni et al. [47], based on their review study, highlighted that the classifiers most frequently used to build ensemble classification models are artificial neural networks, support vector machines, and trees. In addition, the dataset most frequently used by researchers is WBCD. The results of this thorough review have motivated us to base our work on WBCD, which is a dataset trusted by the research community.
Based on the survey above, seven classification models have been chosen for investigation in this work. Additionally, an ensemble classifier based on the stacking technique is used to combine all seven classifiers. Therefore, our study includes eight classifiers: RF, SVM, LR, DT, KNN, MLP, NB, and stack.

RESEARCH METHOD
Eight classification models have been carefully selected for testing in this work. These classifiers are LR, KNN, RF, SVM, MLP, NB, DT, and stack. Two of them, the RF classifier and the stack classifier, are ensemble classifiers based on different ensemble techniques. RF depends on the bagging technique; it is a collection of tree-structured classifiers. The stack classifier is based on the stacking technique; it takes the outputs of different classification models. Choosing classifiers that are based on different techniques and scenarios strengthens our findings and therefore leads to trusted results. Each classifier has been trained and tested on four different train-test sets taken from the enhanced dataset described below.
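The set of classifiers described above can be sketched in code. The experiments in this paper were run in Orange, so the following is only an illustrative approximation in scikit-learn, not the configuration actually used: the hyperparameters, the feature scaling, the logistic-regression meta-learner in the stack, and the helper names `build_classifiers` and `evaluate` are all assumptions. It also uses the copy of the Wisconsin diagnostic data bundled with scikit-learn (569 instances, all 30 features) rather than the 17-feature enhanced dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers(seed=0):
    """Return seven base models plus a stacking ensemble built on them.

    All settings here are assumed defaults, not the paper's configuration.
    """
    base = {
        "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
        "SVM": make_pipeline(StandardScaler(), SVC(random_state=seed)),
        "MLP": make_pipeline(
            StandardScaler(), MLPClassifier(max_iter=1000, random_state=seed)
        ),
        "NB": GaussianNB(),
        "DT": DecisionTreeClassifier(random_state=seed),
        # RF is the bagging-style ensemble: many trees, each on a bootstrap sample.
        "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    # The stack feeds the outputs of the seven base models to a meta-learner.
    stack = StackingClassifier(
        estimators=list(base.items()),
        final_estimator=LogisticRegression(max_iter=1000),
    )
    return {**base, "Stack": stack}


def evaluate(models, test_size, seed=0):
    """Fit each model on one random train/test split and return test accuracy."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }
```

Calling `evaluate(build_classifiers(), test_size=0.2)` would return one accuracy per classifier for an 80%-20% split. The numbers it produces should not be expected to match the results reported in this paper, since the tool, features, and settings differ.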

Dataset
In this work, an enhanced dataset has been used to train and test the classifiers. The enhanced dataset has been developed based on the WBCD, the well-known dataset that is available on the UCI machine learning repository website [48]. The WBDC includes 569 instances with 30 features. The enhancement process applied to this dataset has reduced the number of features to 17. The features have been reduced by applying five feature selection techniques; the top 17 features with considerable weight have been selected, and the other features (redundant and unweighted features) have been discarded. This process is a major contribution of this paper: it improves the classification accuracy and at the same time reduces the overhead of the classification process. Figure 3 shows the revised dataset with 17 features.
Four different sets of our revised dataset have been used to train and test all classifiers. These sets are: set 1, 60% for training and 40% for testing; set 2, 70% for training and 30% for testing; set 3, 80% for training and 20% for testing; and set 4, 90% for training and 10% for testing. Varying the training size makes the experiments more thorough and the results more trustworthy.
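As a rough illustration of the four train-test sets, the sketch below draws random index splits for each training fraction using only the Python standard library. The function names, the fixed seed, and the rounding of the training size are assumptions made for the example; the ten repetitions follow the random sampling procedure described under validation methods.

```python
import random

TRAIN_FRACTIONS = (0.6, 0.7, 0.8, 0.9)  # sets 1 to 4
N_REPEATS = 10  # random sampling is repeated 10 times per model


def random_split(n_samples, train_fraction, rng):
    """Shuffle the instance indices once and cut them into train and test parts."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_train = round(n_samples * train_fraction)  # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]


def repeated_splits(n_samples, train_fraction, repeats=N_REPEATS, seed=0):
    """Yield `repeats` independent random train/test index splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        yield random_split(n_samples, train_fraction, rng)


if __name__ == "__main__":
    for fraction in TRAIN_FRACTIONS:
        train_idx, test_idx = next(repeated_splits(569, fraction))
        print(f"{fraction:.0%} train: {len(train_idx)} train, {len(test_idx)} test")
```

With 569 instances, this rounding gives 341, 398, 455, and 512 training instances for sets 1 to 4. Averaging a classifier's accuracy over the ten splits of a set gives one figure per classifier and set.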

The feature selection approach
Using an adequate feature selection approach is an important step in improving classification accuracy. This process reduces the number of features by keeping the features that carry weight in the classification process and ignoring those that do not. It has two benefits: it boosts classifier predictability and reduces computational overhead. Many feature selection methods are commonly known to have a great impact on classification performance. In this paper, five selection methods are applied to our dataset to keep only the features that received considerable weights from these methods. These feature selection methods are Chi-square (X2) [49]-[51], ReliefF [52], [53], ANOVA [54], Gini index [55], [56], and Gain ratio [57]. These methods were chosen because they use different metrics to select optimal features. For example, Chi-square (X2) computes the chi-score to rank the features, while information gain ranks the features according to their information gain values.
Using these five feature selection methods, the weight of each feature has been calculated, and the features have been ranked accordingly. A feature that received a high rank from all feature selection methods is selected automatically, while for features that received varying ranks, the average of the weights is calculated and only those with a considerable average weight are selected. In other words, the average is used only when the feature selection methods disagree. Under this scheme, 17 features received the best ranking from the five feature selection methods; the other features received poor weights and have therefore been discarded. Table 1 shows a part of the calculated weights given by the five feature selection methods for two features of the dataset. As a result of this phase of enhancement, the following 17 features were selected, as they received the top weights compared to the rest of the features: texture_worst, radius_worst, perimeter_worst, perimeter_mean, radius_mean, concave points_worst, concave points_mean, area_worst, area_mean, concavity_mean, concavity_worst, radius_se, area_se, perimeter_se, compactness_mean, compactness_worst, texture_mean. This method enhances our dataset and therefore contributes to a high degree of detection accuracy.
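The selection rule described above can be sketched as follows. This is one possible reading of the procedure, not the exact implementation used in this work: the text does not say how weights on different scales are combined or what counts as a "high rank", so the min-max rescaling, the use of each method's own top k as the high-rank criterion, and the names `min_max` and `select_features` are assumptions.

```python
def min_max(scores):
    """Rescale one method's weights to [0, 1] so that methods are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all weights are equal
    return {feature: (s - lo) / span for feature, s in scores.items()}


def select_features(weights_by_method, k):
    """Select k features from per-method weights.

    weights_by_method maps a method name to a {feature: weight} dict.
    """
    normalized = {m: min_max(w) for m, w in weights_by_method.items()}
    features = list(next(iter(normalized.values())))

    # Features that every method ranks inside its own top k are taken directly.
    top_sets = [
        set(sorted(w, key=w.get, reverse=True)[:k]) for w in normalized.values()
    ]
    unanimous = [f for f in features if all(f in s for s in top_sets)]

    # The remaining slots go to the features with the best average weight.
    mean_weight = {
        f: sum(w[f] for w in normalized.values()) / len(normalized)
        for f in features
    }
    rest = sorted(
        (f for f in features if f not in unanimous),
        key=mean_weight.get,
        reverse=True,
    )
    return unanimous + rest[: k - len(unanimous)]
```

For the dataset used here, `weights_by_method` would hold five entries (one per feature selection method), each covering the 30 original features, and `k` would be 17.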

Validation methods
Validation is an essential phase for any model to gain acceptance. In our experiments, to attain realistic and reliable results, the random sampling validation technique has been repeated 10 times for each classification model. The classification accuracy of each model has been calculated using the common accuracy measure in (1):
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
where a true positive (TP) refers to a positive instance that is predicted correctly by a classification model, a true negative (TN) refers to a negative instance that is predicted correctly, a false positive (FP) refers to a negative instance that is predicted incorrectly, and a false negative (FN) refers to a positive instance that is predicted incorrectly. The confusion matrix has then been used to evaluate the performance of our models. The confusion matrix measures the classification error in terms of false negatives (FN) and false positives (FP).
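As a small worked illustration of these definitions, the sketch below counts TP, TN, FP, and FN from a list of true and predicted labels and computes the accuracy as the share of correct predictions among all predictions. The function names and the choice of `1` as the positive label are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # positive instance predicted correctly
            else:
                fp += 1  # negative instance predicted as positive
        else:
            if truth == positive:
                fn += 1  # positive instance predicted as negative
            else:
                tn += 1  # negative instance predicted correctly
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of correctly classified instances among all instances."""
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with true labels `[1, 1, 1, 0, 0, 0, 0, 1]` and predictions `[1, 1, 0, 0, 0, 1, 0, 1]`, the counts are TP = 3, TN = 3, FP = 1, FN = 1, which gives an accuracy of 6/8 = 0.75.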

Models implementation tool
Orange data mining software has been chosen to conduct all the experiments in this work. It is a powerful open-source tool that visualizes data efficiently and is well suited to implementing different machine learning classification algorithms.

RESULTS AND DISCUSSION
After a thorough investigation of eight classifiers that are widely used in the classification of medical images, this paper identifies two classifiers as the best in diagnosing breast cancer: SVM and MLP, with SVM dominant. SVM attained the highest classification accuracy, followed by MLP. Table 2 shows the accuracy of each classifier, measured under four different training-testing sets. SVM outperformed all classifiers under all testing scenarios, obtaining classification accuracies of 97.7%, 97.5%, 97%, and 97% over the four sets 80%-20%, 90%-10%, 60%-40%, and 70%-30% respectively. The highest accuracy is obtained when the training size is 80%. SVM outperforms both ensemble classifiers, RF and stack, which use different ensemble techniques, bagging and stacking respectively. It is noteworthy that SVM outperformed the stack classifier even though the stack classifier takes the outputs of all the other classifiers under investigation, including SVM, to enhance prediction accuracy. MLP and stack come next, close to each other in second position after SVM.
Moreover, Table 3 and Figure 4 clearly reflect that. Based on our findings and this thorough evaluation, the paper concludes that SVM ranks at the top; however, MLP is the closest competitor. SVM outperformed even the stack classifier, which is an ensemble classifier based on the stacking technique.
Figure 4. The classification performance in terms of accuracy obtained by each classifier over four training sets: 60%, 70%, 80% and 90%

CONCLUSION
It is noteworthy that machine learning techniques have proven their efficiency in discovering and defining patterns in large amounts of medical images, which helps to classify and sort these images accordingly. In turn, this greatly enriches the detection process. However, choosing a trusted dataset, an adequate machine learning approach, a proper feature selection method, and an accurate validation technique is key to introducing a reliable and efficient detection scheme. This paper applies substantial enhancements to strengthen its findings. Several phases of improvement have been implemented, including but not limited to: i) using an enhanced version of a publicly trusted dataset, obtained by applying five different feature selection methods so that the features with the greatest influence on the prediction process are selected and the features with no weighted influence are discarded, which both enriches classification accuracy and, by decreasing the number of features, reduces the model's computational cost; ii) applying four sets of training-test scenarios to boost trustworthiness; and iii) using various evaluation methods. The main contribution of this paper is ranking SVM as the best classifier. It obtained a classification accuracy reaching 97.7% with the lowest classification errors, 0.029 false negatives (FN) and 0.019 false positives (FP). It was followed by the MLP and stack classifiers, where stack is an ensemble classifier based on the stacking technique. The paper also presented a comparative study that sorts the classifiers into three groups based on their performance.
Excluding the stack classifier, since it is an ensemble classifier built on all the other classifiers, the top group includes SVM and MLP, followed by a second group that includes LR and RF and a last group that includes DT, KNN, and NB. The paper also contributes an enhanced version of the dataset, improved by applying five feature selection methods to raise prediction accuracy and reduce computational cost. Overall, SVM topped all classifiers, even the stack classifier, which is an ensemble of all the classifiers investigated in this paper.