Enhanced convolutional neural network for non-small cell lung cancer classification

Lung cancer is a common type of cancer that causes death if not detected early enough. Doctors use computed tomography (CT) images to diagnose lung cancer, and the accuracy of the diagnosis relies heavily on the doctor's expertise. Recently, clinical decision support systems based on deep learning have provided valuable recommendations to doctors in their diagnoses. In this paper, we present several deep learning models to detect non-small cell lung cancer in CT images and differentiate its main subtypes, namely adenocarcinoma, large cell carcinoma, and squamous cell carcinoma. We adopt standard convolutional neural networks (CNN), visual geometry group-16 (VGG16), and VGG19. In addition, we introduce a variant of the CNN that is augmented with convolutional block attention modules (CBAM), which extract informative features by combining cross-channel and spatial information. We also propose variants of VGG16 and VGG19 that use a support vector machine (SVM) at the classification layer instead of SoftMax. We validated all models in this study through extensive experiments on a CT lung cancer dataset. Experimental results show that supplementing CNN with CBAM leads to consistent improvements over the vanilla CNN, and that the VGG variants using the SVM classifier outperform the original VGGs by a significant margin.

It is crucial to provide an accurate distinction between the NSCLC subtypes for deciding on the optimum therapeutic modality [12]. Different NSCLC subtypes exhibit varying sensitivity and resistance to different chemotherapies [1]. For example, Pemetrexed (a multiple-enzyme inhibitor) is substantially more effective in the treatment of adenocarcinoma patients than in squamous cell carcinoma patients [13]. Moreover, basing the treatment protocol on a misidentified NSCLC subtype can even be life-threatening. For instance, patients with adenocarcinoma respond well to Bevacizumab (a type of medication), whereas patients with squamous cell carcinoma can suffer from life-threatening hemoptysis (i.e., coughing up blood) when treated with the same medication [14].
However, it is challenging and time-consuming for radiologists and pathologists to give a proper diagnosis. This is due to the high heterogeneity of tumor tissues [15] and the subtle morphological differences between NSCLC subtypes [16]. The visual assessment of CT images of lung tissues is one of the main methods used by pathologists to determine the stage and subtype of lung cancer. However, the manual assessment of CT scans has been shown to be subjective and inaccurate in certain diagnoses of tumor subtypes and stages [15]. A recent study shows that the overall agreement for classifying adenocarcinoma and squamous cell carcinoma is unsatisfactory even among peer-certified expert lung pathologists [17]. Furthermore, the number of pathologists qualified to perform a visual-only assessment of CT images may not be enough to meet the rising demand.
In recent years, deep neural networks (DNNs), especially convolutional neural networks (CNNs), have become the dominant method for classifying and discriminating lung nodules in medical images (e.g., CT scans). These models are capable of capturing complex patterns, recognizing the subtle differences between the subtypes of lung cancer, and predicting the stage with impressive speed and accuracy. Combined with regular screening of at-risk populations (e.g., smokers), deep learning-based diagnosis systems can substantially improve early-stage detection of the disease, thus increasing survival rates.
In this paper, we investigate the effectiveness of several deep neural models, including a basic CNN, visual geometry group-16 (VGG16), and VGG19, in detecting NSCLC and identifying its main histological subclass. CNN, VGG16, and VGG19 are neural architectures composed of feature extraction layers and a classification layer. The feature extraction layers stack several convolutional layers, pooling layers, and nonlinearities on top of each other to obtain a deep architecture capable of learning a strong representation of the lung CT scan. These representations capture subtle differences between the CT scans of different cancer types that cannot easily be noticed by the naked eye. The classification layer typically implements a SoftMax function to produce a probability distribution over the four classes (i.e., normal lung, adenocarcinoma, large cell carcinoma, and squamous cell carcinoma). VGG16 and VGG19 use deeper neural architectures than the CNN, thus generating better representations. We also propose variants of VGG16 and VGG19 that employ a support vector machine (SVM) [18] to perform classification as an alternative to the SoftMax function. Moreover, we empower the CNN by combining it with convolutional block attention modules (CBAM) [19]. The added attention mechanism enables the CNN to focus on the most informative spatial parts of the CT images, enhancing its capability to extract more powerful representations from the images and hence improving the classification performance.
The contributions of this paper are:
− Implementing and comparing three deep learning algorithms, namely CNN, VGG16, and VGG19, to classify lung CT images into one of four classes: normal lung (no cancer), adenocarcinoma, squamous cell carcinoma, and large cell carcinoma;
− Developing modified versions of VGG16 and VGG19 that use SVM for classification instead of the typical SoftMax function;
− Enhancing the basic CNN architecture by incorporating convolutional block attention modules to help the CNN focus on the important regions of the lung CT scan, which is critical to obtaining a correct classification;
− Presenting experimental results that demonstrate the effectiveness of the deep learning approaches explored in this study in identifying NSCLC subtypes. All models are trained on a multi-class NSCLC dataset publicly available on Kaggle. In addition, we show experimentally the consistent advantage that SVM adds to the VGGs, as well as the superior performance of the attention-enhanced CNN in comparison with the basic CNN.
Ahmed and Kashmola [28] provided an approach for the classification of malignant skin diseases. They compared the prediction precision for two malignant skin diseases, basal cell carcinoma and melanoma, each predicted separately against photos of nevus. Their proposed deep learning architecture is based on convolutional neural network technology, which relies on a stack of layers. Bedeir et al. [29] compared dermatologists with current deep learning techniques for the classification of skin cancer. The performance of two pre-trained CNNs was assessed, and the integration of their results yielded the best method for skin cancer classification on the HAM10000 dataset, with a precision of 94.14%.
Other research has shown the efficiency of data mining for classification in other diseases. Panda et al. [30] introduced an efficient, highly accurate technique to detect diabetes. The authors utilized the K-nearest neighbor method to decrease processing time and a support vector machine to assign each data sample to its class. They also utilized feature selection to build their machine learning model. Overall, the authors combined four techniques that together can easily detect whether or not a person will suffer from diabetes. Nugroho et al. [31] focused on efficient modeling for detecting coronary artery disease (CAD), utilizing feature selection to deal with high-dimensional data and resampling to deal with unbalanced data. Hyperparameter tuning to locate the best combination of SVM parameters was also performed.
Regarding lung cancer in particular, deep learning algorithms, especially CNN and its variants, have shown exceptionally good performance in detecting lung cancer [32], identifying its subtypes [15], [16], [27] and stages [33], providing treatment assessment [7], [34], and performing lung field segmentation [35]. With the availability of digital histopathology images and the unprecedented advances in computational resources, deep neural models can be trained on millions of histopathology images and effectively capture the distinctive histopathology patterns of cancer cells. Khosravi et al. [15] demonstrate the power of deep learning approaches for identifying NSCLC subtypes. They experimented with a basic CNN architecture, Google's Inception, and an ensemble of Inception and ResNet to distinguish adenocarcinoma from squamous cell carcinoma. Moitra and Mandal [17] utilized CNN to identify non-small cell lung tumor regions from adjacent dense benign tissues and to identify the major subtypes adenocarcinoma and squamous cell carcinoma. Another work, proposed by Guo et al. [36], presents and assesses the performance of two novel CNN-based architectures trained on a dataset of CT images to differentiate between lung adenocarcinoma, squamous cell carcinoma, and small cell lung cancers. Han et al. [37] evaluated several learning algorithms combined with several feature selection methods in differentiating the histological subtypes of NSCLC. The recent availability of digital histopathology whole-slide images (WSIs) has led to the development of neural models that can learn from high-resolution WSIs. Dealing with WSIs directly is considered challenging due to their high spatial resolution; usually, they are divided into multiple patches and receive detailed annotations before being fed to models. Chen et al. [16] propose a slide-level diagnosis model trained on WSIs to detect adenocarcinoma and squamous cell carcinoma.
The proposed model incorporates the unified memory (UM) mechanism and several GPU memory optimization techniques to train CNNs on WSIs using slide-level labels, which alleviate the need for laborious multiple-patch annotation by pathologists.
A similar study is carried out by Coudray et al. [27], who trained Inception-v3, a deep CNN-based model, on histopathology images to classify whole-slide pathology images of lung tissues into normal, adenocarcinoma, or squamous cell carcinoma. Kanavati et al. [38] built a CNN model based on the EfficientNet-B3 architecture, using transfer learning and weakly-supervised learning, to predict carcinoma in whole-slide images (WSIs) using partially labeled training data. Another pertinent line of research is tracking the changes that happen to lung cancer patients after receiving treatment. Xu et al. [7] use deep learning models, namely CNNs and RNNs (recurrent neural networks), to predict survival and other clinical outcomes: time-series CT images of patients with locally advanced NSCLC are analyzed by incorporating both pretreatment and follow-up CT images. Chaunzwa et al. [34] proposed a radiomics approach to predicting NSCLC tumor histology using trained convolutional neural networks (CNNs). They used a CT scan dataset of early-stage NSCLC patients who received surgical treatment at Massachusetts General Hospital (MGH). The study focuses on the two most common NSCLC types: adenocarcinoma (ADC) and squamous cell carcinoma (SCC). They compared CNN against two machine learning approaches (SVM and k-nearest neighbors) and found that both showed comparable performance.
In another related line of research, and as a pretreatment step in the diagnosis of lung diseases, authors have proposed methods for pathological lung segmentation (e.g., [39]-[41]). Wang et al. [40] introduced a data-driven model, called central focused convolutional neural networks (CF-CNN), to segment lung nodules from heterogeneous CT images. In the same line, Aresta et al. [41] proposed iW-Net, a deep learning model that allows for both automatic and interactive segmentation of lung nodules in computed tomography images. A summary of the related work is presented in Table 1.

MATERIALS AND METHODS
Deep learning techniques have become one of the most promising and widely used approaches in medical image analysis. Convolutional neural networks (CNNs) are the most popular and regularly used deep learning algorithms [42], and they represent a huge breakthrough in image classification. In general, CNN-based architectures are composed of an input layer that reads the image, hidden layers that work together for feature extraction, and a classification layer that outputs the predicted class. The hidden layers are stacked blocks of convolutional layers, pooling, and nonlinear activation functions. Convolutional layers apply a convolution operation to the input images and pass the resulting feature maps to the next layer. The nonlinear activation functions (e.g., rectified linear unit (ReLU)) add nonlinearity to the network and increase its ability to learn very complex structures from the images. Pooling layers reduce the representation size and allow the network to achieve spatial invariance; two pooling methods are commonly applied: max-pooling (MP) and average pooling (AP). This section describes the main CNN-based architectures used in this work, including the basic CNN, VGG16, and VGG19. We also describe the architectural modifications we have added to the CNN and VGG models to improve their predictive performance. We additionally describe the dataset used in this work and explain the evaluation metric and the training process of our deep models. The methodology of our research project is illustrated in Figure 2.

CNN model
The architecture of the CNN model we use for NSCLC subtype classification is shown in Figure 3. It consists of five convolutional layers and three max-pooling layers placed after the third, fourth, and fifth convolutions. We apply the ReLU nonlinear activation function after each convolution operation. All the convolutional layers use filters of size 3×3. The first convolutional layer uses 256 filters, and each of the following convolutions uses half the filter count of the previous layer with the same filter size (i.e., the numbers of filters are 256, 128, 64, 32, and 16). The max-pooling size is 2×2. To perform classification, we use three fully connected (FC) layers followed by the SoftMax function in the output layer. The SoftMax function gives a 4D (four-dimensional) vector representing the probability distribution over the four classes used in this study, and the most probable class (the class with the highest probability value) is selected. We place a flatten layer right before the first fully connected layer to convert the 3D feature map extracted by the last convolutional layer into a 1D vector before it is fed to the first fully connected layer.
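For concreteness, the architecture just described can be sketched in Keras as follows. This is a minimal sketch under stated assumptions: the text does not specify the padding mode or the widths of the first two fully connected layers, so `padding="same"` and the dense sizes 128 and 64 are placeholders, not the paper's values.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(64, 64, 3), num_classes=4):
    """Five 3x3 convolutions (256, 128, 64, 32, 16 filters), ReLU after each,
    2x2 max-pooling after the third, fourth, and fifth convolutions, then
    flatten -> three fully connected layers -> softmax over four classes."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),  # hidden FC sizes are assumptions
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```

With a 64×64 input, the three pooling stages reduce the spatial map to 8×8×16 before flattening.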

CNN with attention (CNN+CBAM)
This section describes our approach to enhancing the basic CNN by combining it with the convolutional block attention module (CBAM), originally proposed by Woo et al. [19]. The idea of augmenting CNNs with attention mechanisms is not new; it has previously been applied to a variety of natural language processing and computer vision tasks such as sentence pair modeling [43], relation extraction [44], sentence classification [45], and image classification [46]. The main intuition is that CBAM allows the CNN to emphasize the informative regions in the CT images and filter out the meaningless background regions that do not differentiate the classes. The framework of our augmented CNN model (CNN+CBAM) is the same as the CNN model, with three CBAM blocks inserted after the third, fourth, and fifth convolutional layers. The architecture of the CNN+CBAM model is presented in Figure 4. In the following, we describe the convolutional block attention, using the same notation as the original work [19]. CBAM aims to emphasize informative features and suppress non-useful information using channel and spatial attention modules, as illustrated in Figure 5. Whereas the channel module focuses on what to attend to, the spatial module concentrates on where to attend. Our intuition behind using CBAM is that the channel attention module could assist in differentiating abnormal spots in the lung from normal tissue, while the spatial attention module helps focus on the location of the tumor, which is essential in deciding the cancer type, if any, as different types tend to affect different parts of the lung. Each CBAM block takes the feature map from the previous convolutional layer as input and outputs a refined feature map that is passed to the next layer. Let F ∈ R^(C×H×W) be the input feature map to the CBAM block, where C, H, and W are the number of channels, the height, and the width of F, respectively.
F is first passed to the channel attention module, which outputs a channel attention map M_c ∈ R^(C×1×1). The first refined feature map is computed as F′ = M_c(F) ⊗ F (1), where ⊗ denotes element-wise multiplication. F′ is assumed to summarize the important information of the image. After that, F′ is passed to the spatial attention module, which produces a spatial attention map M_s ∈ R^(1×H×W) and outputs the final refined feature map, computed as F′′ = M_s(F′) ⊗ F′ (2).
F ′′ (the output of the CBAM block) is assumed to capture the important information located within the image. To learn about the details of the channel and spatial attention modules, please refer to [19].
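A minimal sketch of one CBAM block in Keras, implementing the channel and spatial attention described above. It follows the defaults of Woo et al. [19] (a shared two-layer MLP with a reduction ratio, and a 7×7 convolution for the spatial map); note that Keras uses channels-last tensors of shape (B, H, W, C) rather than the C×H×W notation in the text, and the reduction ratio of 8 is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam_block(feature_map, reduction=8):
    """Refine `feature_map` with CBAM: channel attention, then spatial attention."""
    channels = feature_map.shape[-1]

    # Channel attention: shared MLP over average- and max-pooled descriptors.
    shared_dense1 = layers.Dense(channels // reduction, activation="relu")
    shared_dense2 = layers.Dense(channels)
    avg = shared_dense2(shared_dense1(layers.GlobalAveragePooling2D()(feature_map)))
    mx = shared_dense2(shared_dense1(layers.GlobalMaxPooling2D()(feature_map)))
    m_c = tf.sigmoid(avg + mx)[:, None, None, :]   # shape (B, 1, 1, C)
    f_prime = feature_map * m_c                    # F' = M_c(F) ⊗ F

    # Spatial attention: 7x7 conv over channel-wise average and max maps.
    avg_sp = tf.reduce_mean(f_prime, axis=-1, keepdims=True)
    max_sp = tf.reduce_max(f_prime, axis=-1, keepdims=True)
    m_s = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        tf.concat([avg_sp, max_sp], axis=-1))      # shape (B, H, W, 1)
    return f_prime * m_s                           # F'' = M_s(F') ⊗ F'
```

The refined map has the same shape as the input, so the block can be dropped in after any convolutional layer.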

VGG16 and VGG19
VGGNet [47] was invented by the visual geometry group (VGG) at the University of Oxford. It was the runner-up in the localization and classification tracks of the ImageNet large scale visual recognition challenge (ILSVRC-2014) competition. VGGNet reveals that increasing the depth of the network while using small convolutional filters (e.g., 3×3) contributes to learning more complex representations without increasing the number of parameters to be learned. VGG16 and VGG19 are the two versions of VGGNet that were most successful in the ILSVRC competition. They have gained wide popularity and have been applied to a variety of tasks [48].
VGG16 consists of 13 convolutional layers and three dense fully connected layers with 4,096; 4,096; and 1,000 neurons, followed by a SoftMax layer on top to perform classification. VGG19 differs from VGG16 in that it has 16 convolutional layers instead of 13. Both networks use stacks of small convolutional filters of size 3×3 with stride 1, followed by multiple max-pooling layers and ReLU activation functions to achieve nonlinearity, as shown in Figure 6. We have chosen VGGNets in this study as they have proven to work well in practice with small-size image datasets [49].
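A VGG16 backbone with a four-class head can be sketched with the stock Keras implementation as follows. This is illustrative only: whether the paper initializes from ImageNet weights is not stated here, so `weights=None` is used to keep the sketch self-contained, and the 64×64 input size matches the preprocessing described later in the training section.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# 13-conv-layer VGG16 feature extractor; the original 1,000-way ImageNet
# head is dropped (include_top=False) and replaced with a 4-class softmax.
base = VGG16(weights=None, include_top=False, input_shape=(64, 64, 3))
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(4096, activation="relu"),
    layers.Dense(4, activation="softmax"),  # normal + three NSCLC subtypes
])
```

Swapping `VGG16` for `VGG19` gives the 16-convolution variant with an otherwise identical head.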

Modified VGG 16 and VGG19
Classical VGG models employ the SoftMax activation function for prediction, and training is performed by minimizing the cross-entropy loss function. Tang [50] demonstrated that replacing the SoftMax layer with a linear multi-class support vector machine yields a small yet consistent improvement in the predictive performance of convolutional neural networks, which is in concordance with the findings of [48].
Inspired by that, we propose modified versions of VGG16 and VGG19 that use SVM for classification instead of SoftMax. We refer to the modified VGGNets as VGG16-SVM and VGG19-SVM. As presented in Figure 7, the modified VGGNets replace the three fully connected (dense) layers in the original models with only two dense layers of 256 and 128 neurons. In addition, learning is carried out by minimizing a margin-based loss instead of the cross-entropy loss. The 3D feature map extracted by the last convolutional layer is mapped to a 1D vector by a flatten layer before being fed to the first fully connected layer.
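A hedged sketch of the VGG16-SVM variant: the two dense layers (256 and 128 neurons) follow the description above, and the SoftMax is replaced by a linear output trained with a squared hinge loss, in the spirit of Tang [50]. The L2 penalty on the output layer stands in for the SVM's margin term; its strength, and the use of random rather than pre-trained weights, are assumptions of this sketch.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights=None, include_top=False, input_shape=(64, 64, 3))
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),  # two dense layers replace the original three
    layers.Dense(128, activation="relu"),
    # Linear layer acting as a multi-class SVM; L2 regularization plays
    # the role of the SVM margin penalty (coefficient is an assumption).
    layers.Dense(4, activation="linear",
                 kernel_regularizer=tf.keras.regularizers.l2(1e-3)),
])
# squared_hinge expects targets encoded as +1/-1 rather than one-hot 0/1.
model.compile(optimizer="adam", loss="squared_hinge")
```

At inference time, the predicted class is simply the argmax of the four linear scores, so no probability calibration is required for classification.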

Dataset
The dataset we use in this work is a freely available Kaggle dataset. It was developed and released to encourage the data science and medical informatics communities to develop effective lung cancer detection algorithms. The dataset consists of 1,000 CT slices from different cases, divided between CT scans of normal lungs and NSCLC-infected lungs. Each CT scan is a whole-slide image of size 256×256, annotated with one of the classes: normal, adenocarcinoma, squamous cell carcinoma, or large cell carcinoma. The scans are in JPG or PNG format (low-resolution single-slice images).
The dataset is divided into training/testing/validation sets. We use the training set to fit the models while the testing set is used to evaluate the performance of the final models. The validation set is used to choose between the possible design options and provide an unbiased evaluation of a model trained on the training dataset while tuning model hyperparameters. The class distribution of the dataset is presented in Table 2. To the best of our knowledge, there is no other research article that reported results on this dataset before.

Training and evaluation
To select the most appropriate architecture for the task and classification aim, we carried out several experiments for each model by varying the hyper-parameters. For every model we trained, we tuned all the hyper-parameters using the validation set before the final evaluation on the testing set; hyper-parameter optimization was explored iteratively. The hyper-parameters we tuned are the learning rate, the batch size, the optimizer, the number of epochs, and the use of dropout (trying several values ranging from 0.1 to 0.5) versus not using it. The predictive performance of the models was evaluated with accuracy (A), the percentage of images classified correctly by a model. Before being fed to the models, all images are scaled to 64×64 pixels for both the CNN models (CNN and CNN+CBAM) and the VGG models. We use the Adam optimizer [51] with the learning rate set to 0.001 for CNN and CNN+CBAM and 0.0001 for the VGGNets (VGG16, VGG19, VGG16-SVM, and VGG19-SVM). All models are trained for 25 epochs with early stopping, as we noticed performance degradation on the validation set after epoch 25. The batch size was set to 32. Tables 3 to 5 show the best hyper-parameters of our deep models for the classification task.
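The training configuration above can be sketched as follows. The learning rates, batch size, and epoch budget come from the text; the early-stopping patience and monitored quantity are assumptions, as the text only states that training runs for 25 epochs with early stopping.

```python
import tensorflow as tf

def compile_and_fit(model, x_train, y_train, x_val, y_val, lr=1e-3, epochs=25):
    """Train with the reported settings: Adam (lr=0.001 for the CNNs,
    0.0001 for the VGG variants), batch size 32, up to 25 epochs with
    early stopping on validation loss (patience value is an assumption)."""
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                            restore_best_weights=True)
    return model.fit(x_train, y_train, validation_data=(x_val, y_val),
                     epochs=epochs, batch_size=32, callbacks=[stop])
```

For the SVM variants, the loss would be `squared_hinge` with ±1-encoded targets instead of the categorical cross-entropy used here.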

RESULTS AND DISCUSSION
In this section, we present the experimental results of the CNN-based neural architectures utilized in this study. Figure 8 shows the testing accuracy of the basic CNN, CNN+CBAM, VGG16, VGG19, VGG16+SVM, and VGG19+SVM. It can easily be noticed that VGG16+SVM detects NSCLC and its subtypes with highly reliable accuracy (83.49%) and outperforms all other models by a significant margin. Table 6 shows the precision, recall, F1-score, support, macro average (AVG), weighted AVG, and accuracy of each model for the classes adenocarcinoma, large cell carcinoma, normal, and squamous cell carcinoma. VGG16 with SVM also outperforms all other models in precision, recall, F1-score, macro AVG, and weighted AVG, as shown in Table 6. Notably, the attention method outperforms the CNN model, reaching 70.16%, 72.09%, and 70.59% in macro AVG and 74.49%, 72.38%, and 72.96% in weighted AVG of precision, recall, and F1-score, respectively. It also improves the accuracy by 7.3%. Recently, there has been some discussion about whether data is the new oil. The number of training images needed depends on the variation typically found within a class: if the images in a category are similar, fewer images might be acceptable, and usually about 100 images are enough to train a class. NSCLC is a leading cause of cancer mortality worldwide, and its early detection and accurate diagnosis are essential to improve survival rates and to set customized treatment protocols. Our study used a small dataset to classify lung cancer and obtained good results, as shown in Figure 8. We therefore encourage studies that work on small datasets, and we agree that data is the new oil. For example, small cell lung cancer constitutes a small percentage of all cases, but it carries a high risk.
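The per-class metrics and the macro vs. weighted averages reported in Table 6 can be computed from model predictions with scikit-learn; the labels below are dummies for illustration only, not the paper's predictions.

```python
from sklearn.metrics import classification_report

classes = ["adenocarcinoma", "large cell carcinoma",
           "normal", "squamous cell carcinoma"]
y_true = [0, 0, 1, 1, 2, 2, 3, 3]  # dummy ground-truth labels
y_pred = [0, 1, 1, 1, 2, 2, 3, 0]  # dummy model predictions

# Prints per-class precision/recall/F1/support plus accuracy,
# macro average, and support-weighted average.
print(classification_report(y_true, y_pred, target_names=classes))
```

The macro average weights every class equally, while the weighted average weights each class by its support, which matters when the class distribution (Table 2) is unbalanced.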
The advantages that SVM brings to the VGGNets and that CBAM brings to the CNN suggest that combining CBAM- and VGGNet-based feature extraction with an SVM classifier could lead to further performance gains, which we leave for future work.

CNN+CBAM vs CNN
The superior performance of CNN+CBAM (72.38%) over CNN (65.08%) shown in Figure 8 demonstrates the ability of CNN+CBAM to learn higher-level structural patterns typical of the subclasses of NSCLC. We believe that the channel attention modules help in differentiating abnormal cancer cells from normal cells, which is important to distinguish the normal lung from the cancer-infected lung. On the other hand, the spatial attention module assists in capturing the location of the tumor, which is crucial in differentiating the subtypes, as they usually affect different locations within the lung. For instance, adenocarcinoma tends to grow in peripheral areas of the lung, whereas squamous cell carcinoma tends to appear in the central airways. Table 7 reports the per-subtype accuracy achieved by CNN and CNN+CBAM. The results show a consistent gain from incorporating CBAM for all classes.

VGGNets vs the modified VGGNets
As we can see in Figure 8, the results suggest that replacing the SoftMax function with SVM increases the prediction accuracy. We believe the performance gain is largely due to the fact that SVMs can draw the optimal hyperplane separating the different classes in the dataset [18]: the hyperplane with the greatest possible distance (maximum margin) to the closest data points of the different classes. As long as the relevant support vectors (the data points closest to the margin) are contained in the training set, the optimization will always result in the same classifier, even with considerably small training data [52]. Thus, SVMs are much more robust at learning generalizable models from small amounts of training data, such as the dataset we use in this study. Contrary to our expectations, VGG16+SVM outperforms VGG19+SVM by 3.5%, as shown in Figure 8, even though VGG19 is deeper than VGG16 and is supposed to extract a stronger representation of the CT scans. This might be due to the relatively small training set used in this work, which could make the more complex architecture suffer from overfitting, as more parameters need to be learned. Figures 9(a) and (b) plot the training and validation accuracy and loss of VGG16 and VGG16-SVM against the epoch number. The curves reveal the consistent advantage of incorporating SVM for classification starting from the early epochs (from epoch 4 onward). This in turn means that the modified VGG16-SVM model can achieve performance comparable to VGG16 in a shorter training time. It is also remarkable that the validation accuracy keeps improving slightly even after the training accuracy has stabilized (at epoch 5). We think this is because the quality of the margin that the SVM learns keeps improving even after the best possible training accuracy is achieved.
However, the validation accuracy and loss curves of VGG16-SVM fluctuate more than those of VGG16, as shown in Figure 9(b), which means that the number of epochs needs to be tuned more carefully with VGG16-SVM to get the best possible model. The modified VGG16 with SVM is better than the VGG16 model in both accuracy and loss: in terms of the gap between training and validation, the modified VGG16-SVM model reaches an accuracy of 99.94%, while the VGG16 model reaches 98.91% in the last epoch.

CONCLUSION
Non-small cell lung cancer is a leading cause of cancer mortality worldwide. The early detection and accurate diagnosis of NSCLC are essential to improve survival rates and to set customized treatment protocols. In this study, we build several deep learning models, including a basic five-layer CNN, VGG16, and VGG19, to differentiate normal lungs from lungs infected by one of the following subtypes of NSCLC: adenocarcinoma, squamous cell carcinoma, and large cell carcinoma. In addition, we propose an extension to the CNN that incorporates CBAM attention blocks to enable the network to focus on the informative parts of the image and obtain a correct classification. Furthermore, we propose VGG16-SVM and VGG19-SVM, variants of the VGG16 and VGG19 models that employ SVM in the classification layer instead of the SoftMax function; they are trained by optimizing a max-margin objective function instead of the categorical cross-entropy.
The results demonstrated the utility of CNN and VGGNets in detecting and classifying NSCLC. The results also show that the attention-augmented CNN (CNN+CBAM) outperforms the basic CNN by 7.3%, and the experiments manifest the consistent advantage that SVM brings to the VGGNets in comparison to the SoftMax function. The modified VGG16 with SVM outperforms all other models with an accuracy of 83.49%. Despite the small size of the dataset, we obtained good results by using the attention method and by enhancing the pre-trained VGGNet models with an SVM classifier instead of SoftMax, both of which help the models extract more informative features from the images.
In the future, we will try to collect a dataset of suitable size to obtain better results by training the models on many different images, improving their ability to identify the type of lung cancer accurately. We will also attempt to build a model that outperforms the well-known deep learning models, and we will continue to use attention methods to obtain better results. Moreover, we will try new techniques that may yield better results, such as ensembles or generative adversarial networks (GANs).