Breast cancer classification with histopathological images based on machine learning

ABSTRACT


INTRODUCTION
Breast cancer has a high death rate [1]. Nowadays, however, the chances of cure are good thanks to continuing advances in treatment and equipment. Modern breast-conserving surgery achieves therapeutic results comparable to major resection and modified radical surgery; it not only preserves the structural integrity of the breast but also reduces physical and psychological trauma [2]. Breast cancer identified early has a higher survival rate than cancer detected at the middle or late stages. Because the cancer cells have not yet disseminated in the early stages, the disease is amenable to local surgery, radiation, chemotherapy, hormone therapy, and other comprehensive treatments, with a very high cure rate.
Medical technology and pharmaceuticals for treating breast cancer have advanced significantly in comparison to the past, as has the rate of early identification of breast cancer. Early diagnosis can therefore boost the success rate of therapy and help decrease the death rate of breast cancer patients [3]. This work applies pre-trained convolutional neural network (CNN) models to the classification of breast cancer histological images, using the BreakHis dataset. It also demonstrates how CNN feature extraction can be combined with machine learning methods for classification. Finally, state-of-the-art results are reported for categorizing breast cancer as benign or malignant, and the performances of all the pre-trained models employed are compared.

METHOD

Convolutional neural network models
CNNs [19] are deep learning algorithms for image recognition and processing. A CNN is based on the same principles as a multilayer perceptron but is optimized for lower processing requirements. It consists of an input layer, an output layer, and hidden layers comprising several convolutional, pooling, fully connected, and normalizing layers. Figure 1 illustrates a conventional CNN architecture. A brief explanation of each layer is given in the following sections.

Convolutional layer
The convolutional layer is the CNN's essential component and bears the lion's share of the network's computational cost. This layer performs a dot product between two matrices: one holds the kernel's set of learnable parameters, the other the restricted area of the receptive field. Kernels are spatially smaller than the image but extend through its full depth. So, for an image with three red-green-blue (RGB) channels, the kernel's height and width are small relative to the image's spatial dimensions, but its depth covers all three channels.
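The dot-product view of convolution described above can be sketched in plain NumPy. This is a minimal single-channel illustration (no padding, stride 1), not the paper's implementation:

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid 2-D convolution as repeated dot products: each output element
    is the sum of the element-wise product between the kernel and the
    receptive-field patch it currently covers."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]   # restricted receptive field
            out[i, j] = np.sum(patch * kernel)  # dot product with the kernel
    return out

# A 3x3 kernel sliding over a 5x5 image yields a 3x3 feature map.
image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]])
feature_map = conv2d_single(image, kernel)
print(feature_map.shape)  # (3, 3)
```

With the identity-like kernel above, the output simply reproduces the central 3×3 region of the image; a learned kernel would instead respond to edges, colors, or textures.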

Pooling layer
A pooling layer plays an important role in CNNs by reducing the complexity of the feature maps produced by the convolutional layer. This is typically achieved through a downsampling process that summarizes the local features of the filter's output. The pooling operation slides a small window over the input data and selects a pre-defined value (commonly the average or maximum) in each window to create a simplified feature map. Figure 2 depicts the two most common pooling operations used in CNNs: maximum pooling (left) and average pooling (right). Maximum pooling selects the largest value in each window (2×2 pool size), emphasizing the most salient feature, while average pooling computes the mean value in each window to reduce the impact of small variations in the input data. Both reduce the resolution of the output feature map by a factor of the pooling window size.
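The two pooling variants can be demonstrated on a small array; this sketch uses a NumPy reshape trick for non-overlapping windows and is illustrative only:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping 2-D pooling with a size x size window.
    "max" keeps the most salient activation per window; "avg" takes the
    mean, smoothing out small variations."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # drop ragged edges, if any
    windows = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

x = np.array([[1., 3., 2., 4.],
              [5., 7., 6., 8.],
              [9., 2., 1., 0.],
              [4., 6., 3., 2.]])
max_pooled = pool2d(x, 2, "max")  # [[7, 8], [9, 3]]
avg_pooled = pool2d(x, 2, "avg")  # [[4, 5], [5.25, 1.5]]
```

Both outputs are 2×2, i.e., the 4×4 input resolution is reduced by the window size, exactly as the text describes.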

Fully-connected layer (FC layer)
As in feedforward neural networks, neurons in the fully-connected layers are connected to all activations in the previous layer. FC layers are the last to appear in a network. This layer usually receives, as input, the flattened output of the pooling layer once the image's features have been extracted; its output is then sent to the layer responsible for image classification.
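The flatten-then-classify step can be sketched numerically. The pooled feature-map shape and weight values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled feature map coming out of the last pooling layer.
pooled = rng.standard_normal((7, 7, 512))

# Flatten to a vector; in the FC layer every output neuron depends on
# every activation of the previous layer.
flat = pooled.reshape(-1)                       # shape (25088,)
W = rng.standard_normal((2, flat.size)) * 0.01  # 2 classes: benign/malignant
b = np.zeros(2)
logits = W @ flat + b

# A softmax output turns the logits into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(flat.shape)  # (25088,)
```

The probabilities sum to one, so the larger entry of `probs` is the predicted class.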

Proposed models
The pre-trained CNN models applied to the BreakHis dataset in this study are ResNet-50, VGG-19, AlexNet, and Inception-v3. In addition, ResNet-50 is used as a feature extractor to obtain features from the BreakHis images, and the extracted features are fed into a random forest (RF) and a k-nearest neighbors (k-NN) classifier. The architectures of the adopted pre-trained CNN models are described in the subsequent sections.

ResNet-50
The ResNet-50 has 50 layers. The input stage of the ResNet network consists of a large convolution kernel with a stride of 2 followed by a 3×3 maximum pooling with a stride of 2. In this phase, the input image must be resized to 224×224×3. The network contains five convolutional stages for producing feature maps, rectified linear unit (ReLU) activation layers, an FC layer, and a softmax activation function for categorizing the images of the breast cancer dataset into two categories (benign or malignant). The first convolutional stage captures low-level properties such as color, edges, and gradients, while the deeper convolutional stages deliver high-level characteristics of the input data. The input is first processed by a convolutional layer of 64 filters of size 7×7 with a stride of 2. The feature map is then reduced to 112×112 and passed through a block of 1×1 with 64 filters, 3×3 with 64 filters, and 1×1 with 256 filters, repeated 3 times. The map is then scaled down to 56×56 and sent through the next block of 1×1 with 128 filters, 3×3 with 128 filters, and 1×1 with 512 filters, repeated 4 times. The dimension is then reduced to 28×28 and passed through the next block of 1×1 with 256 filters, 3×3 with 256 filters, and 1×1 with 1,024 filters, repeated 6 times. Finally, the map is reduced to 14×14 and passed through the last block of 1×1 with 512 filters, 3×3 with 512 filters, and 1×1 with 2,048 filters, repeated 3 times.
In the implementation phase, only the final block of layers is made trainable, and the learning rate is set to 0.00001 to compile the model with its frozen layers and unfrozen top block. Training runs for 15 epochs with a batch size of 16. The ResNet-50 architecture is shown in Figure 3 and the configuration of the model is given in Table 1.
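The freeze-all-but-the-last-block setup can be sketched in Keras. This is a hedged reconstruction, not the paper's code: the optimizer choice (Adam) and classification head are assumptions, and `weights=None` stands in for the paper's ImageNet weights only so the sketch builds without a download:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Base network; the paper uses weights="imagenet" here.
base = tf.keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(224, 224, 3))

# Freeze everything except the final block (Keras names it conv5_*),
# matching "only the final block of layers is made trainable".
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith("conv5_")

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),  # benign vs. malignant
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])

# Training with the paper's reported settings would then be:
# model.fit(train_x, train_y, validation_data=(val_x, val_y),
#           epochs=15, batch_size=16)
```

Only the `conv5_*` layers and the new head receive gradient updates; the earlier, frozen stages act as a fixed feature extractor.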

ResNet-50 with machine learning classifier
The ResNet-50 is combined with a machine learning model to create a hybrid comprising the two blocks shown in Figure 4. In the first block, the pre-trained ResNet-50 weights learned on ImageNet are used to extract features from the input images. The deep features extracted are then passed to the second block, which uses a machine learning algorithm, either the random forest or k-NN, as a classifier to categorize the deep feature maps.
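The second block can be sketched with scikit-learn. Random arrays stand in for the ResNet-50 deep features here (in the real pipeline they would come from something like `base.predict(images)`); classifier hyperparameters are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in for pooled deep features from the frozen ResNet-50 block.
features = rng.standard_normal((200, 2048))
labels = rng.integers(0, 2, size=200)  # 0 = benign, 1 = malignant

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.4, random_state=0)

# Second block: classical classifiers trained on the deep features.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

print("RF accuracy:", rf.score(X_te, y_te))
print("k-NN accuracy:", knn.score(X_te, y_te))
```

Because the CNN block is frozen, the two classifiers can be retrained cheaply without touching the feature extractor.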

VGG-19
The VGG-19 receives input images of size 224×224×3. The convolutional layers use 3×3 filters with a stride and padding of 1, interleaved with five max-pooling layers. Max-pooling is performed over a 2×2 window with a stride of 2. For the fully connected layers, the first contains 256 neurons, followed by a second layer of 256 neurons. Classification is performed by a softmax output layer.
During the implementation phase, a learning rate of 0.001 is used and the model is fine-tuned with every layer set trainable. Training runs for 20 epochs, and callback functions are used to control the learning rate and apply early stopping. Figure 5 depicts the VGG-19 design, while Table 2 lists the VGG-19 model's configuration.
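The callbacks described above can be expressed in Keras as follows. The monitor and patience values are illustrative assumptions, since the paper does not report them:

```python
import tensorflow as tf

# One callback shrinks the learning rate when validation loss plateaus,
# the other stops training early and restores the best weights.
callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                         factor=0.1, patience=3),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
]

# These would be passed to training with the paper's epoch setting:
# model.fit(..., epochs=20, callbacks=callbacks)
```

Together they let a 20-epoch budget terminate sooner if validation performance stops improving.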

Inception V3
The required input image size for the Inception-V3 network is 224×224×3. This CNN architecture is 42 layers deep. It forms a simple yet robust deeper network that keeps computational expense to a minimum. The network design is summarized in Figure 6.
The inception module factorizes n×n convolutions into smaller operations; the output vector is formed by merging the outputs of all its parallel layers, and the result is passed on to the next layer. The inception layer thus helps the network identify the proper filter size. The stochastic gradient descent (SGD) optimizer is used during implementation, and all the original layers are frozen so that only the added layers are trained; the model is fine-tuned by retraining only a few of the final layers of the Inception-V3 model. The SGD learning rate is set to 0.001, the decay to 0.00001, and Nesterov momentum is enabled. The experiment runs for 20 epochs with a batch size of 20.
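The SGD configuration above can be sketched in current Keras. The paper's per-step `decay=0.00001` corresponds to the legacy schedule lr/(1 + decay·step), which `InverseTimeDecay` reproduces; the momentum value of 0.9 is an assumption, as the paper does not state it for this model:

```python
import tensorflow as tf

# Reproduce the legacy per-step decay lr / (1 + 1e-5 * step).
schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.001, decay_steps=1, decay_rate=1e-5)

# SGD with Nesterov momentum, as configured in the text.
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule,
                                    momentum=0.9, nesterov=True)
```

The optimizer would then be passed to `model.compile(...)` before the 20-epoch run.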

AlexNet
The AlexNet is composed of eight learnable layers: five convolutional and three fully connected. The model is fed images of dimensions 224×224×3. The first layer is a convolutional layer of 96 filters with 11×11×3-pixel widths and a stride of 4 pixels, followed by a pooling layer and a batch normalization layer; a ReLU activation is applied to the convolutional output. The resulting feature map is 27×27×96. The second layer is likewise convolutional, containing 256 filters with 11×11 filter sizes and 1-pixel strides. The third layer is also convolutional, with 384 filters of size 3×3 and an output feature map of 6×6×384. The fourth layer is a convolutional layer as well, with 384 filters of size 3×3 and an output feature map of 4×4×384. Similarly, the fifth convolutional layer consists of 256 filters of size 3×3; a layer of overlapping pooling and local response normalization follows, and the output feature map is 1×1×256. Of the fully connected layers, the sixth and seventh have 4,096 neurons each, with dropout and ReLU activation. Classification is then performed by an output layer with a softmax activation function. For training, the top two layers are unfrozen while the first eight layers remain frozen. SGD is employed as the optimizer, with a learning rate of 0.001, a momentum of 0.9, and a decay of 0.05. Figure 7 shows the summary architecture of the AlexNet pre-trained model.

EXPERIMENTAL SETTINGS
The BreakHis dataset [23] is used in this study. It consists of 9,109 samples of breast tumor tissue acquired from 82 patients at magnification factors of 40×, 100×, 200×, and 400×. The database currently contains 5,429 malignant samples, shown in Figure 8, and 2,480 benign samples, shown in Figure 9. Each image is 700×460 pixels, three-channel RGB, with eight-bit depth per channel. Details of the samples in the dataset are listed in Table 3. The BreakHis dataset is organized as shown in Table 3 for the training, validation, and testing phases. First, each image from the benign and malignant folders is loaded and resized from 700×460 pixels to 224×224 pixels so that it is compatible with the input size of the pre-trained CNN models used in the next phase. Figures 10 and 11 illustrate the resized images in RGB format. The benign and malignant data are loaded separately and assigned to train and test variables. The benign train and test variables are labelled as class zero, while the malignant ones are labelled as class one. The test data for benign and malignant are then combined and shuffled into a single testing set used in the testing phase. Following the labelling and merging of the data, the train-test split function is used to divide the data into 5,931 images for training, 4,745 for validation, and 3,164 for testing, as shown in Table 4. The splitting ratio is set as 75:60:40 for training, validation, and testing, respectively. After the dataset construction is complete, training and testing of the proposed models begin.
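The labelling, merging, and splitting steps can be sketched as below. Placeholder arrays stand in for the resized BreakHis images, the class counts are illustrative, and the split fractions are assumptions chosen only to show the mechanics (the paper's 75:60:40 ratio does not map directly onto a single `train_test_split` call):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders for images already resized to 224x224x3
# (e.g., via cv2.resize in the real pipeline).
benign = np.zeros((100, 224, 224, 3), dtype=np.uint8)     # class 0
malignant = np.zeros((200, 224, 224, 3), dtype=np.uint8)  # class 1

# Label and merge the two classes into one dataset.
X = np.concatenate([benign, malignant])
y = np.concatenate([np.zeros(len(benign)), np.ones(len(malignant))])

# Hold out a shuffled test set first, then carve a validation set
# from the remaining training data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Stratifying both splits keeps the benign/malignant proportions consistent across the training, validation, and testing sets.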

RESULT AND DISCUSSION
In this section, the experimental results of utilizing transfer learning are presented and discussed. Transfer learning is a powerful technique that leverages knowledge (weights, biases, and other learned features) from previously trained CNN models to solve problems in a new domain. The experiments were conducted on a Dell G15 Special Edition 5521 laptop equipped with an NVIDIA GeForce RTX 3050-Ti laptop GPU. Python 3.9.12 and TensorFlow-GPU 2.5.0 were used for implementation, allowing the GPU's powerful computational capabilities to accelerate the experiments. A series of experiments and an evaluation of the results based on various metrics, including accuracy, precision, recall, and F1-score, are reported in this section. Figure 12 shows the accuracy of all the pre-trained models for binary classification of breast cancer from the BreakHis database. It demonstrates that the ResNet-50 network scored the highest test accuracy at 97.60%, followed by the VGG-19 model at 95.4%. The accuracy of the lightweight AlexNet model was the lowest of the pre-trained models, at 81.42%. This suggests that increasing the number of convolutional layers in a model may dramatically enhance its classification performance.

Figure 12. Accuracy performance of each pre-trained model

Intelligent systems often require a substantial quantity of training data to achieve human-level diagnostic ability. For some illnesses, the quantity of data available for model training may be insufficient, leading to limited generalization of commonly used deep learning models. CNNs learn image representations by training convolutional filters to recognize image features via backpropagation; frequently, these features are passed to a classifier such as a softmax layer [24].
In this research, however, instead of training a softmax layer via backpropagation for classification, a deep forest is used: an ensemble of decision trees that outperforms deep neural networks (DNNs) in a broad variety of applications. ResNet-50 therefore extracts features from the breast cancer images before they are categorized by the deep forest. This technique was chosen because the deep forest has proved particularly efficient when only small-scale training data is available. Furthermore, because the deep forest chooses its own model complexity, it tackles the dataset imbalance observed in this case. In addition, the standard k-NN [25], a straightforward alternative for small-class prediction, was also employed.
The accuracies in Figure 12 show that ResNet-50 with RF outperforms ResNet-50 with k-NN, achieving 89.22% against 73.67%. Figure 13 shows the actual and predicted results. Table 5 details the area under the curve (AUC), recall, accuracy, and F1-score obtained for all models. The ResNet-50 network achieved the highest performance, with an AUC of 97%, precision of 97% and 98%, recall of 95% and 98%, and F1-score of 96% and 98% for benign and malignant, respectively. Figure 14 shows the confusion matrix for the ResNet-50 model, which achieved the best accuracy among the adopted models. This confusion matrix was created using the 75% training and 40% testing scheme. During the testing stage, the true positive rate is 68.14%, the true negative rate is 29.27%, the false positive rate is 1.55%, and the false negative rate is 1.04%.
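The standard metrics can be derived directly from the confusion-matrix cells. The counts below are back-calculated from the percentages reported above over the 3,164 test images and are therefore approximate, not the paper's exact cells:

```python
# Approximate cell counts: 68.14%, 29.27%, 1.55%, 1.04% of 3,164 images.
tp, tn, fp, fn = 2156, 926, 49, 33
total = tp + tn + fp + fn

accuracy = (tp + tn) / total          # fraction of correct predictions
precision = tp / (tp + fp)            # of predicted positives, how many real
recall = tp / (tp + fn)               # of real positives, how many found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

The resulting accuracy of roughly 97.4% is consistent (within rounding of the reported rates) with the 97.60% test accuracy cited for ResNet-50.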

CONCLUSION
Overall, this research has presented five distinct pre-trained CNN classification pipelines that may be used to categorize benign and malignant breast cancer histopathology images. The models are ResNet-50, VGG-19, Inception-V3, and AlexNet, with ResNet-50 additionally serving as a feature extractor whose features are passed to machine learning algorithms, in this case an RF and a k-NN, for classification. The experiments of this study were based on the publicly available BreakHis dataset.
In conclusion, ResNet-50 achieves the highest accuracy for the classification of these medical color images. Because accurate cancer diagnosis is critical for protecting a person's life, future work should aim to improve this accuracy as much as possible. The architecture proposed in this study could be applied to real-world medical imaging in real time, although more testing in real-world scenarios is required before clinical application, and the system should be strengthened. Because of their speed, such systems may allow clinicians to consult with more patients, and since the suggested approach is more accurate in cancer classification, the death rate from breast cancer might be greatly lowered.