A transfer learning with deep neural network approach for diabetic retinopathy classification

ABSTRACT


INTRODUCTION
Diabetic retinopathy is an eye disease caused by high blood sugar and pressure, which damage the blood vessels in the back of the eye. According to [1], around 40.3% of United States adults 40 years and older suffer from retinopathy, and 8.2% have vision-threatening retinopathy. Diabetic retinopathy is the root cause of more than 1% of blindness worldwide. People with diabetic retinopathy are at great risk of developing other eye diseases such as glaucoma and cataracts. The disease is progressive, meaning that it advances from one stage to a more serious stage if it is not treated properly. Early detection combined with effective treatment of diabetic retinopathy can reduce vision loss by 90% [2].
To detect diabetic retinopathy early and efficiently, we have developed a deep learning model that can determine, with high accuracy, whether an eye suffers from diabetic retinopathy. If it does, our model also detects the severity level of the disease, which can help prevent the disease from progressing. Machine learning, and deep learning in particular, has improved drastically during the past decade [3]. Deep learning algorithms have advanced many research fields such as speech recognition, decision making, and image processing.
The convolutional neural network (CNN) is a deep neural network model that is widely used in computer vision and image classification. A CNN consists of three main components: i) one or more convolutional blocks, which are the central component of a CNN, ii) sampling layers (pooling layers) such as max-sampling and mean-sampling, and iii) a number of fully connected layers. Image classification can be defined as the process of labeling images with a category from a predefined set of categories. The process consists of many phases: collecting a dataset of images, labeling them, preprocessing the images, image segmentation, feature extraction, and finally, object classification using a deep learning model [4]. Many researchers have built deep learning architectures based on CNNs, such as LeNet-5 [5], AlexNet [6], ZFNet [7], VGGNet [8], GoogLeNet [9], ResNet [10], Inception V2 [11], Inception V3 [12], InceptionResNet (Inception V4) [13], DenseNet [14], GapNet [15], SNet [16], Xception [17], and EfficientNet [18]. These architectures are used for building various deep learning models.
In this research, we have designed and developed six different transfer learning techniques to detect the severity level of diabetic retinopathy and help stop blindness before it is too late. The pre-trained models are: i) ResNet [10], ii) Inception V3 [12], iii) InceptionResNet (Inception V4) [13], iv) DenseNet [14], v) Xception [17], and vi) EfficientNet [18]. The proposed techniques were trained and evaluated on real-world medical images [19]. Each image in the training dataset was manually labeled with its severity level by a clinician.
To summarize, this paper makes the following contributions: i) Innovative transfer learning model: we have leveraged various state-of-the-art CNN architectures to build several transfer learning-based models. The CNN architectures serve as pre-trained models for our models. CNN-based models have been used successfully in image classification tasks. ii) Theory: we show that leveraging transfer learning improves the performance of deep learning models and increases their detection accuracy. iii) Experiments: we have conducted several experiments on a large medical image dataset. Our experimental results show the high ability of our model to detect the severity level of diabetic retinopathy.
In addition, we compared the performance of six different transfer learning-based models. The remainder of this paper is organized as follows. Section 2 provides an overview of the related work in the medical image processing field. Section 3 describes our method for designing and developing a deep learning model to detect the severity of diabetic retinopathy in an eye. Section 4 presents our experimental results and findings, which we discuss in Section 5. Finally, Section 6 concludes the paper with avenues for future work.

RELATED WORK
Medical image classification is one of the most important image processing tasks. Its main goal is to classify medical images into different categories to help physicians and clinicians diagnose patients faster [20]. Physicians rely on their practical experience as well as on manually spotting various features in an image to determine the underlying medical condition. Such a process is error-prone and tedious. Therefore, medical image classification emerged to help physicians classify medical images faster and more conveniently. To keep the paper concise and readable, we compare and contrast our work only with related research efforts on detecting diabetic retinopathy using machine learning and deep learning methods.
Diabetic retinopathy is one of the eye diseases that are a root cause of blindness around the world. Detecting the severity level of diabetic retinopathy early is crucial for preventing possible advancement of the disease. Due to the importance of this problem, many researchers have developed machine learning techniques for detecting diabetic retinopathy, including [21][22][23][24][25][26][27][28][29][30][31].
Our research shares with previous research efforts the goal of detecting diabetic retinopathy, but it is significantly different. For example, our dataset contains 3,562 original images, whereas many previous studies trained their models on small datasets with fewer than 500 images, such as [21][22][23][24][25][26][27]. Other studies trained their models on bigger datasets, such as [28] and [29] with 1,200 images and [30][31][32] with around 35,000 images. Nevertheless, our research work improves on these efforts in several ways. For example, [28] and [29] used traditional machine learning methods such as SVM and AdaBoost, while [30] and [31] used a single CNN model. Similarly, [32] used mainly two different models, and their best model achieved a kappa score of 0.72. In contrast, in this research we have utilized six different state-of-the-art deep learning models. Finally, we have developed a transfer deep learning model, and our results outperformed those of the previous research efforts.

RESEARCH METHOD

The dataset
To train and evaluate our deep learning model, we have utilized the dataset available for the APTOS 2019 Blindness Detection Kaggle competition [19]. The dataset is a real-world dataset obtained from multiple clinics in India using different cameras over a period of time. The images are labeled by experts. However, both the images and the labels might contain some noise. The clinics labeled the images with the severity level of diabetic retinopathy, ranging from a normal eye to proliferative diabetic retinopathy. Table 1 shows the labels of the images in the dataset along with the number of images that belong to each label. The table shows that the dataset has two problems. First, the dataset is unbalanced: the images that belong to the "NO DR" class make up more than half of the dataset. Therefore, a classifier trained on this dataset will be biased toward this class. Second, the dataset is relatively small for deep learning tasks. To solve the first problem, imbalanced data, we leveraged a data over-sampling technique. To overcome the second problem, the small dataset, we used a data augmentation technique. The next section describes these two techniques in more detail. Figure 1 shows an illustrative example of each diabetic retinopathy severity level. The images have been resized to fit the page.

Data preprocessing
This section describes the data pre-processing techniques we leveraged to normalize the images as well as to enlarge the dataset to make it ready for deep learning tasks.

Image normalization
The images in our dataset are color images with red, green, and blue (RGB) channels, and their size varies from one image to another. Therefore, to standardize the size of the images, we resized them to 512 x 512 pixels. We chose this size because it is efficient to process on our computers while retaining enough features for the model to learn from the images.
Image normalization is a crucial step in deep learning that allows the gradient descent algorithm to converge faster and hence improves the performance of the deep learning model. There are several methods to normalize images after converting them to integer vectors, such as: i) dividing each pixel in an image by the mean of that image vector, ii) subtracting the per-channel mean calculated over all images in the dataset, or iii) for photographic image datasets, dividing each pixel by 255, which is a simple and efficient technique. The third approach was used to normalize the images in our dataset.
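As a minimal sketch of the third normalization approach, the following Python snippet scales 8-bit pixel values to the range [0, 1]. The function name `normalize_image` and the randomly generated image are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np


def normalize_image(image_uint8: np.ndarray) -> np.ndarray:
    """Scale 8-bit pixel values from [0, 255] to [0.0, 1.0]."""
    return image_uint8.astype(np.float32) / 255.0


# Hypothetical 512 x 512 RGB image filled with random 8-bit values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)
normalized = normalize_image(image)
```

Converting to a floating-point type before dividing avoids integer truncation and keeps the result in a range that gradient-based optimizers typically handle well.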

Data over-sampling
Over-sampling is a group of techniques for solving the imbalanced data problem. Such techniques try to balance the dataset so that each class has an equal number of instances. The over-sampling technique used in this research simply duplicates random records from the minority classes. Figure 2 compares the dataset before applying the over-sampling technique, Figure 2(a), and after applying it, Figure 2(b).
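Random over-sampling by duplication can be sketched as follows. This is an illustrative version under simple assumptions (every class is grown to the size of the largest class); the function name `random_oversample` and the toy data are hypothetical and not taken from the original implementation.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw duplicates (with replacement) to fill the gap to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: image identifiers with imbalanced labels.
samples = ["img%d" % i for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

Because the added records are exact copies, this approach adds no new information; it only changes how often each class is seen during training.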

Data augmentation
Training deep learning models such as DenseNet, ResNet, or EfficientNet requires a large dataset to produce stable models. Small datasets produce models that overfit the training dataset, and hence their results cannot be generalized. To avoid this problem, we applied data augmentation to enlarge the dataset. Data augmentation is the process of generating (manufacturing) data from the existing data to increase the diversity and the number of instances in the dataset while maintaining the label of the original image. Data augmentation techniques perform various operations on images, including image scaling, geometric transformation, adding noise, changing the lighting conditions, and flipping.
As depicted in Figure 3, we performed various data augmentation operations on the dataset, including flipping the image horizontally, flipping the image vertically, scaling the image, rotating the image, shearing the image, and applying elastic and perspective transformations, which project an object in an image onto a different point of view. To augment our images, we utilized ImgAug [33], a Python library for image augmentation. For each input image, we generated 64 different images other than the original one. After the data augmentation step, we ended up with 515,775 different labeled images.
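To illustrate the idea of label-preserving augmentation, the sketch below applies a small subset of the operations listed above (flips and right-angle rotations) using plain NumPy. It is not the ImgAug pipeline used in this work; the function name `simple_augmentations` and the tiny test image are assumptions for illustration only.

```python
import numpy as np


def simple_augmentations(image: np.ndarray) -> list:
    """Return a few label-preserving variants of an H x W x C image."""
    return [
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # rotate by 90 degrees counter-clockwise
        np.rot90(image, k=2),  # rotate by 180 degrees
    ]


# Hypothetical 4 x 4 RGB image and its severity label.
image = np.arange(4 * 4 * 3, dtype=np.uint8).reshape(4, 4, 3)
label = 2
# Every augmented variant keeps the label of the original image.
augmented = [(variant, label) for variant in simple_augmentations(image)]
```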

Performing pseudo-label
Following the work of [34], pseudo-labeling was implemented in this research to enhance the model performance. Pseudo-labeling is a simple and efficient semi-supervised learning technique for improving the performance of deep neural network models. A model that uses pseudo-labels is trained in a supervised manner with labeled and unlabeled (test) data at the same time. First, the model is trained using the labeled data. Then, the trained model is used to predict labels for the unlabeled (test) data. Finally, we re-train the same model in a supervised manner using both the labeled and the predicted data, and make a new prediction on the test data. This simple approach improved the performance of a state-of-the-art neural network model [34].
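The three pseudo-labeling steps (train, predict, re-train) can be sketched in Python as follows. To keep the example small and runnable, a nearest-centroid classifier on synthetic 2-D features stands in for the deep neural network; the helper names `fit_centroids` and `predict` and all data are hypothetical.

```python
import numpy as np


def fit_centroids(features, labels, n_classes):
    """Stand-in 'model': one mean feature vector per class."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(n_classes)])


def predict(centroids, features):
    """Assign each sample to the class of its nearest centroid."""
    distances = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


rng = np.random.default_rng(0)
n_classes = 2
# Hypothetical 2-D features: class 0 near (0, 0), class 1 near (5, 5).
x_labeled = np.concatenate([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y_labeled = np.array([0] * 20 + [1] * 20)
x_unlabeled = np.concatenate([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])

# Step 1: train on the labeled data only.
model = fit_centroids(x_labeled, y_labeled, n_classes)
# Step 2: predict pseudo-labels for the unlabeled (test) data.
pseudo_labels = predict(model, x_unlabeled)
# Step 3: re-train on labeled plus pseudo-labeled data, then predict again.
x_all = np.concatenate([x_labeled, x_unlabeled])
y_all = np.concatenate([y_labeled, pseudo_labels])
model = fit_centroids(x_all, y_all, n_classes)
final_predictions = predict(model, x_unlabeled)
```

With a deep network, the same loop applies, with `fit_centroids` replaced by a full training run and `predict` by the network's forward pass.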

Transfer learning technique
Instead of training a deep learning model, mainly a CNN, from scratch, many researchers, especially in medical image processing, leverage transfer learning to generate efficient models [35]. In this research, we developed a transfer-learning-based model by fine-tuning a CNN model that was pre-trained on a different set of images and using that pre-trained model as the input stage of our model. The leveraged CNN models are pre-trained on the ImageNet dataset. The ImageNet dataset contains a massive number of images, and the pre-trained CNN models are publicly available. This approach greatly improved the performance of our model. In addition, an ensemble of the best-performing models was implemented.

Pre-trained models
In order to detect the severity level of diabetic retinopathy images and to compare various pre-trained models, we have leveraged six state-of-the-art CNN models. The pre-trained models are: i) ResNet [10], ii) Inception V3 [12], iii) InceptionResNet (Inception V4) [13], iv) DenseNet [14], v) Xception [17], and vi) EfficientNet [18].
... classifier [36]. Moreover, global average pooling (GAP) has shown an ability to act as an attention layer by discriminating regions of interest in the image and retaining them up to the final layer of the model [37]. Therefore, the GAP layer is placed after the output layer of the pre-trained model to transfer attentive knowledge to the second part of the model. Batch normalization [11] and dropout [38] regularization techniques are then used to reduce overfitting and increase the learning capabilities of the classifier. The next layer in our deep neural network is the Dense layer, a fully connected layer with 1,024 neurons. Next, the output of the Dense layer is fed to a rectified linear unit (ReLU) activation function. These last five steps are repeated three times in our classifier, denoted as X3 in Figure 4, before their output goes to the next level. The deeper the network, the more the gradients tend to vanish. Therefore, a ReLU layer is added at the end of each of the three blocks to minimize the effect of the vanishing gradient problem, in which the gradients become progressively smaller layer after layer and are not back-propagated to the earlier network layers, preventing the network from learning low-level details of the images [10,14,39]. After repeating the previous steps three times, our classifier applies batch normalization and dropout again, followed by another Dense layer with a ReLU activation function. Finally, the output of the previous layer is fed to a SoftMax function for the final classification.
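The core of the classifier head, global average pooling followed by a Dense layer with ReLU and a final SoftMax, can be sketched in NumPy as a forward pass. This is a simplified illustration: batch normalization and dropout are omitted, the block is applied once rather than three times, the weights are random, and the feature-map shape and the five output classes are assumptions.

```python
import numpy as np


def global_average_pooling(feature_maps):
    """Average each channel over its spatial dimensions: (B, H, W, C) -> (B, C)."""
    return feature_maps.mean(axis=(1, 2))


def relu(x):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(x, 0.0)


def softmax(logits):
    """Convert each row of logits into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
# Hypothetical backbone output: batch of 2, 16 x 16 spatial grid, 64 channels.
feature_maps = rng.normal(size=(2, 16, 16, 64))
w_hidden = rng.normal(scale=0.1, size=(64, 1024))  # Dense layer with 1,024 units
w_out = rng.normal(scale=0.1, size=(1024, 5))      # 5 severity classes (assumed)

pooled = global_average_pooling(feature_maps)  # shape (2, 64)
hidden = relu(pooled @ w_hidden)               # shape (2, 1024)
probabilities = softmax(hidden @ w_out)        # shape (2, 5)
```

GAP reduces each feature map to a single value, so the number of inputs to the Dense layer depends only on the number of channels, not on the image resolution.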

EXPERIMENTATION AND EVALUATION
This section discusses the experimentation setup and the evaluation procedure for the proposed models. The rest of this section is organized as follows: subsection 4.1 discusses the experimentation setup and the parameters used to train the proposed techniques, subsection 4.2 discusses the evaluation measure used to evaluate them, and subsection 4.3 presents the evaluation results of the models.

Experimentation setup
The proposed techniques in this research were trained using the provided dataset (see section 3.1). All the transfer learning-based models were first trained without pseudo-label learning. Second, the trained models were used to predict the labels of the testing images for pseudo-label learning. The predicted labels and their associated images were then added to the training dataset and used to train the proposed models. Table 2 presents the hyper-parameters used in training the proposed models, where BS stands for batch size and LR stands for learning rate. All the models were trained with a learning rate of 1e-4 for a maximum of 75 epochs. All the models were trained with an image size of 512x512 pixels, except for EfficientNet-B4, which was trained using 380x380 pixels. All the experiments were conducted using the Kaggle kernel of the challenge. Small batch sizes were used to train the models due to the limited resources provided by the kernel and the high complexity of the models.

Evaluation measure
The quadratic weighted kappa (QWK) [40] is used to evaluate the performance of the models. The QWK evaluates the level of agreement between the target label of an image and its predicted severity level. The QWK is computed using (1):

\[ \kappa = 1 - \frac{\sum_{i,j} w_{i,j} \, O_{i,j}}{\sum_{i,j} w_{i,j} \, E_{i,j}} \tag{1} \]

where $i$ represents the target label, $j$ represents the predicted label, $O_{i,j}$ is an $N \times N$ matrix of the observed counts of target and predicted labels, $E_{i,j}$ is an $N \times N$ matrix of the expected counts of target and predicted labels, and $w_{i,j}$ is an $N \times N$ matrix of weights calculated from the difference between the target and predicted labels. $w_{i,j}$ is computed using (2):

\[ w_{i,j} = \frac{(i - j)^2}{(N - 1)^2} \tag{2} \]

where $N$ represents the number of severity classes, $i$ represents the target label, and $j$ represents the predicted label.

Table 3 shows the results achieved by our transfer learning models. The best result was scored by the Inception-V3+GAP-based classifier with QWK=82.0%. The next best score was achieved by the DenseNet-169+GAP-based classifier with QWK=81.8%, and the third best by the Xception+GAP-based classifier with QWK=80.9%. On the other hand, the ResNet-50+GAP-based classifier scored the worst result among the trained models, with QWK=77.6%.
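A minimal pure-Python sketch of the quadratic weighted kappa is given below. It assumes integer labels in the range 0 to n_classes - 1, weights equal to the squared label difference divided by (n_classes - 1) squared, and an expected matrix built from the outer product of the target and prediction histograms, scaled to the same total as the observed matrix. The function name `quadratic_weighted_kappa` is illustrative.

```python
def quadratic_weighted_kappa(targets, predictions, n_classes):
    """QWK between two lists of integer labels in range(n_classes).

    Undefined (division by zero) when all targets and all predictions
    fall into one and the same class.
    """
    n = len(targets)
    # Observed matrix: observed[i][j] counts samples with target i, prediction j.
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(targets, predictions):
        observed[t][p] += 1.0
    target_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            # Expected count if targets and predictions were independent.
            expected = target_hist[i] * pred_hist[j] / n
            numerator += weight * observed[i][j]
            denominator += weight * expected
    return 1.0 - numerator / denominator


perfect = quadratic_weighted_kappa([0, 1, 2, 3, 4], [0, 1, 2, 3, 4], n_classes=5)
reversed_score = quadratic_weighted_kappa([0, 1, 2, 3, 4], [4, 3, 2, 1, 0], n_classes=5)
```

Perfect agreement yields a score of 1, while systematically reversed predictions yield a negative score, because the quadratic weights penalize large disagreements more heavily than small ones.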

Results
As this research is based on a Kaggle challenge, and in order to achieve a high score and rank, an ensemble based on a simple average of the predictions of the top three performing models (i.e., DenseNet-169, Inception-V3, and Xception) was computed. The ensemble model outperforms the best performing single model (i.e., Inception-V3) by 0.4%, with a QWK of 82.4%. Although we finished the challenge with a rank of 71 (team name: Data_Science@JUST), the first-ranked team ("[ods.ai] topcoders") achieved a score of QWK=85.6%, only 3.2% above our result (see https://www.kaggle.com/c/aptos2019-blindnessdetection/leaderboard). It is worth noting that 2,931 teams participated in this Kaggle challenge.
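A simple-average ensemble can be sketched as follows. The sketch assumes that each model outputs one row of class probabilities per image and that the final label is the class with the highest averaged probability; the function name `average_ensemble` and the numbers are hypothetical and do not come from the reported experiments.

```python
import numpy as np


def average_ensemble(probabilities_per_model):
    """Average class probabilities across models, then pick the top class."""
    mean_probabilities = np.mean(np.stack(probabilities_per_model), axis=0)
    return mean_probabilities.argmax(axis=1)


# Hypothetical predictions from three models for two images and three classes.
model_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
model_b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
model_c = np.array([[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]])
ensemble_labels = average_ensemble([model_a, model_b, model_c])
```

In this toy example, the first image is assigned class 1 even though two of the three models individually prefer class 0, because averaging takes each model's confidence into account.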

DISCUSSION
One of the risks of training transfer learning models with deep neural networks is overfitting. Therefore, the "EarlyStopping" callback function from Keras callbacks (https://keras.io/callbacks/) is used to stop model training when the computed validation loss no longer improves. As depicted in Figure 5, there is no noticeable gap between the training loss and the validation loss over the training epochs. This indicates that the model was trained without overfitting. Although the maximum number of training epochs was set to 75, the model stopped training after epoch 40 to prevent overfitting, as shown in Figure 5. Both loss values declined together during model training due to the model layers responsible for regularization (i.e., batch normalization and dropout in Figure 4). As discussed earlier, batch normalization [11] and dropout [38] are regularization techniques used to reduce overfitting and increase the learning capabilities of the classifier.
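The logic behind patience-based early stopping can be sketched in a few lines of Python. This is a simplified illustration of the idea rather than the Keras implementation; the function name `early_stopping_epoch`, the patience value, and the loss values are assumptions, since the settings used in the experiments are not stated here.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if
    the validation loss keeps improving often enough."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
```

Here training stops at epoch 6, after three consecutive epochs without improvement, so the later drop at epoch 7 is never reached; a larger patience trades longer training for robustness against such temporary plateaus.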
In order to show the strength of the proposed model architecture and the importance of the proposed image preprocessing techniques (see section 3), an ablation analysis was conducted on the best performing model (the Inception-V3+GAP-based classifier). As presented in Table 4, when the Inception-V3 model was trained alone without transfer learning, it scored the lowest result of QWK=63.8%, a decline of 13.7% compared with the result achieved by transfer learning with the Inception-V3 model (i.e., the Inception-V3+GAP-based classifier with QWK=82.0%). This finding shows the significance of the proposed transfer learning architecture. Moreover, this finding is in line with findings reported in the literature on the value of using transfer learning with deep neural networks in general [41] and for medical image classification in particular [42].
The second highest impact came from ablating the pseudo-label technique: the model scored QWK=70.3% without pseudo-labels, a decline of 11.7%. This finding shows how effective the pseudo-label technique was in improving the performance of the proposed model. The same finding was reported for the technique published in [34]. The ablation of image augmentation resulted in a decline of 3.4%. Previous research has demonstrated the effectiveness of image data augmentation in enhancing the classification performance of models [43]. Although image data augmentation was expected to have a higher impact on the results of the proposed model, the data augmentation techniques used were mainly traditional and simple (i.e., rotating, flipping, and cropping of input images). Finally, the ablation of the over-sampling technique had the lowest impact, with a decline in the model score of only 0.7%. As discussed earlier, a random over-sampling technique was implemented to solve the imbalance problem of the original dataset. However, the ablation analysis shows that removing this technique has a very small effect on the results achieved by the model. Although random over-sampling can be useful for avoiding the negative effect of imbalanced data on model results, it causes the model to overfit during training [44]. Because we used the "EarlyStopping" callback function from Keras callbacks to prevent overfitting, the effect of random over-sampling on the classification results was very low.

CONCLUSION
Diabetic retinopathy is a progressive eye disease caused by high blood sugar or pressure. If not detected and treated properly at an early stage, the disease can cause vision loss. To that end, we have proposed a transfer learning approach for accurately detecting the severity level of diabetic retinopathy. Our model is a deep learning model based on the global average pooling (GAP) technique with various pre-trained CNN models. We have utilized six state-of-the-art CNN models as pre-trained models for our GAP-based model and compared them. Our best model, the Inception-V3+GAP-based classifier, achieved a QWK of 82.0%. Improving the performance of our models and applying our transfer learning models to other medical problems are interesting avenues for future work.