Decomposition of color wavelet with higher order statistical texture and convolutional neural network features set based classification of colorectal polyps from video endoscopy

Gastrointestinal cancer is one of the leading causes of death worldwide. Gastrointestinal polyps are considered precursors of this malignancy, so early detection and removal of colorectal polyps can reduce the probability of cancer. The most widely used diagnostic modality for colorectal polyps is video endoscopy, but the accuracy of diagnosis depends largely on the doctor's experience, which is crucial for detecting polyps in many cases. Computer-aided polyp detection is a promising way to reduce the polyp miss rate and thus improve diagnostic accuracy. The proposed method first distinguishes polyp from non-polyp frames and then performs automatic polyp classification from endoscopic video using color wavelet features with higher-order statistical texture features and Convolutional Neural Network (CNN) features. The Gray Level Run Length Matrix (GLRLM) is used to compute higher-order statistical texture features in four directions (θ = 0°, 45°, 90°, 135°). The features are fed into a linear support vector machine (SVM) to train the classifier. The experimental results demonstrate that the proposed approach is promising and effective with a residual network architecture, achieving the best performance, with accuracy, sensitivity, and specificity of 98.83%, 97.87%, and 99.13% respectively, for classification of colorectal polyps on standard public endoscopic video databases.


INTRODUCTION
Colorectal (bowel) cancer is the second most common cancer in women and the third most common in men [1]. Although colorectal polyps are precursors of colorectal cancer, they take several years to transform into cancer [2], which may develop if the polyps are left untreated. The major types of colorectal polyps are adenomas, hyperplastic, and serrated. If these polyps can be detected and classified early, they can be removed before this transformation occurs. Several tests are recommended in all colon cancer screening guidelines, particularly video endoscopy and fecal occult blood testing. Within the United States, video endoscopy is the most commonly used test and should be performed every 10 years in average-risk persons. During an endoscopy, a long, flexible tube (colonoscope) is inserted into the body. A tiny video camera at the tip of the tube allows the doctor to view the inside of the entire colon to detect and remove polyps. Distinguishing high-risk from low-risk colorectal polyps, through detection and histopathological characterization, is an important part of colorectal cancer screening. A typical endoscopy video runs for a long time. In back-to-back endoscopies, it is difficult for an endoscopist to examine the video with sufficient attentiveness over such a long period, as it is an operator-dependent procedure. The accuracy of such a challenging diagnosis depends on a qualified physician. There is a large degree of variability in how physicians characterize and diagnose these polyps, as not all polyps have the same malignant potential.
As an example, serrated polyps can potentially develop more aggressively into colorectal cancer than other colorectal polyps, because of the serrated pathway in tumorigenesis [3]. The only reliable prevailing method for diagnosing serrated polyps is histopathological characterization, because other screening methods designed to detect premalignant lesions (such as fecal blood, fecal DNA, or virtual colonoscopy) are not well suited to differentiating serrated polyps from other polyps [4]. A challenging task for a doctor is to differentiate between serrated polyps and hyperplastic polyps. This is because hyperplastic polyps often lack the dysplastic nuclear changes that characterize conventional adenomas polyps, so their histopathological diagnosis is based entirely on morphological features such as serration, dilatation, and branching [5].
In the field of artificial intelligence, convolutional neural network models have been proposed and applied to computer-aided polyp detection and classification systems. Color wavelet covariance (CWC) features, based on the covariance of second-order textural measures across different color bands, have been used to detect tumors in colonoscopic video with a specificity and sensitivity of 97% and 90% respectively [6]. In subsequent work by the same authors, SVMs and color-texture analysis were proposed for automatic detection of gastrointestinal adenomas in video endoscopy, with an accuracy of 94% [7]. An accuracy of 94.20% was obtained by combining color and shape features to recognize intestinal polyps in capsule endoscopy, using a multilayer perceptron (MLP) as the classifier [8]. Y. Zou et al. [9] considered a deep convolutional neural network-based classification of digestive organs in wireless capsule endoscopy images. R. Zhu et al. [10] used a trainable feature extractor based on a convolutional neural network for lesion detection in endoscopy images. In [11], CNN features were proposed for the automated classification of colonic mucosa for colon polyp staging in the context of colon cancer screening, with a sensitivity of 95.16% and a specificity of 74.19%. The CNN-derived features show greater invariance to viewing angles and image quality factors than the eigenimage model [12]. Color wavelet (CW) features and convolutional neural network features of video frames were extracted and combined to train a linear support vector machine, giving an accuracy of 98.65%, a sensitivity of 98.79%, and a specificity of 98.52% [13]. The authors of [14] enhanced automatic polyp detection accuracy by using feature fusion (wavelet, local binary pattern, and Gabor features) and multiple classifier techniques.
They achieved an 80% true positive rate by incorporating local binary pattern and wavelet features. The authors of [15] rely on a faster region-based convolutional neural network (Faster R-CNN) model for polyp detection in endoscopic videos. Tajbakhsh et al. [16] focused on a novel vote accumulation scheme for detecting colonic polyps that enables polyp detection from partially identified polyp boundaries. Although several novel technologies are emerging within the field of endoluminal imaging and confocal laser endomicroscopy for imaging the colorectal area, the number of computer-aided decision support systems (CADSS) for colorectal polyp classification is still limited. Hence, the main contribution of this paper is to use machine learning effectively to detect and classify colorectal polyps present in endoscopy video. The other contributions of this paper include the following:
• A novel intelligent approach that applies higher-order statistical texture features on the color wavelet domain, together with convolutional neural network features learned from a video endoscopy dataset.
• The proposed approach is favorable and effective, achieving better accuracy than a method based on CNN features alone.
• A drastic reduction in the time a qualified physician needs to examine an entire endoscopy video, by indicating colorectal polyps and/or classifying them.
The rest of the paper is organized as follows: section 2 describes the architecture and methods of the proposed system. Section 3 deals with the analysis of the experimental results. Finally, the discussion of the results and the conclusions of this study are presented in sections 4 and 5 respectively.

SYSTEM ARCHITECTURE
The proposed system is implemented in MATLAB 2017b. It accepts standard video files in formats such as AVI, MP4, and WMV as input and produces as output the video frame sequence with classified polyps. When an endoscopy video is fed into the proposed system, it extracts color wavelet features with higher-order statistical texture and convolutional neural network features. All the features are fused and passed to SVMs to achieve improved detection and classification.

Acquisition of video endoscopy
The availability of endoscopy datasets is an important issue for this purpose. Because the performance of any computer-aided diagnosis (CAD) system depends on the training dataset, the proposed system uses more than 86 standard endoscopy videos from different sources. Samples of the different polyp classes and of normal lesions are shown in Figure 2. Most of the endoscopy videos contain colorectal polyps, and only a small part show the normal colon. Important sources of the database include the Department of Electronics, University of Alcala (http://www.depeca.uah.es/colonoscopy_dataset/) [17] and the Endoscopic Vision Challenge (https://polyp.grand-challenge.org/databases/) [18]. Table 1 shows the dataset collected from video endoscopy. For our experiment, we extracted 4,216 frames from endoscopy video, consisting of 3,162 colorectal polyp frames and 1,054 normal frames.

Pre-processing module
The endoscopy video is loaded into the proposed intelligent system and converted into a running sequence of still images, called frames, in which possible categories of colorectal polyps are then sought. Superfluous regions of the original video frame, shown in Figure 3(a), are discarded, resulting in the pre-processed frame shown in Figure 3.

Feature extraction module
A sliding window of user-defined size and sliding step is moved over the pre-processed video frames to generate small images, called windows, as shown in Figure 4. Depending on the size, dimensions, and sliding step, each window produces a number of feature vectors.
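The windowing step can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation; the function name `sliding_windows`, the parameters `win` and `step`, and the toy frame are hypothetical, and windows that would extend past the frame border are simply skipped, which is an assumption.

```python
def sliding_windows(frame, win, step):
    """Yield (row, col, window) for every full window that fits in the frame.

    frame is a list of equal-length rows (a 2-D grid of pixel values);
    win and step stand in for the user-defined window size and sliding step.
    """
    height, width = len(frame), len(frame[0])
    for r in range(0, height - win + 1, step):
        for c in range(0, width - win + 1, step):
            # Cut a win x win block whose top-left corner is (r, c).
            window = [row[c:c + win] for row in frame[r:r + win]]
            yield r, c, window


# Toy 6x8 "frame" whose pixel value encodes its position (10 * row + col).
frame = [[10 * r + c for c in range(8)] for r in range(6)]
windows = list(sliding_windows(frame, win=4, step=2))
positions = [(r, c) for r, c, _ in windows]
```

With a smaller step the windows overlap more, so the same polyp region contributes to more feature vectors at a higher computational cost.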

Higher order statistics on the wavelet domain as gray scale texture features
Because polyp size varies, the proposed framework performs multiresolution analysis of the image using a discrete wavelet transform. Texture provides information about the spatial arrangement of colors or intensities in an image, which helps image segmentation and classification, and wavelets perform well for texture analysis. Wavelet decomposition provides a spatial/frequency representation of the original image in which every sub-image preserves both local and global information at a specific scale and orientation. As the resolution decreases in the spatial domain, it increases in the frequency domain, providing zooming capabilities and local characterization of the image [6]. Since the major texture information is not contained in the low-frequency image, and the most relevant texture information often appears in the middle-frequency channels [19], our proposed approach uses a discrete wavelet transform (DWT) to decompose the image in the frequency domain.
A two-dimensional (2-D) discrete wavelet transform is obtained by applying the filtering consecutively along the vertical and horizontal directions (separable filter bank). The filtering is performed by convolving the image with a high-pass filter H and a low-pass filter L, which produces a low-resolution image $B_0(k)$ at scale k and detail images $D_j(k)$, j = 1, 2, 3, at scale k, as described by the recursion (1) of [19], where the arrow (↓) denotes the sub-sampling procedure, the asterisk (*) is the convolution operator, and H and L are the two filters, for all k = 1, 2, 3, ..., n.
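For reference, a commonly used form of this recursion for a separable filter bank can be sketched with the same symbols. The subscripts x and y on the filters (marking the direction of filtering), the sub-sampling subscripts, and the assignment of j = 1, 2, 3 to particular filter combinations are assumptions and may differ from the form given in [19].

```latex
\begin{aligned}
B_0(k) &= \big[ L_x * [ L_y * B_0(k-1) ]_{\downarrow 2,1} \big]_{\downarrow 1,2} \\
D_1(k) &= \big[ H_x * [ L_y * B_0(k-1) ]_{\downarrow 2,1} \big]_{\downarrow 1,2} \\
D_2(k) &= \big[ L_x * [ H_y * B_0(k-1) ]_{\downarrow 2,1} \big]_{\downarrow 1,2} \\
D_3(k) &= \big[ H_x * [ H_y * B_0(k-1) ]_{\downarrow 2,1} \big]_{\downarrow 1,2}
\end{aligned}
```

In this sketch $B_0(0)$ is the input image, and $\downarrow 2,1$ and $\downarrow 1,2$ are taken to mean sub-sampling by two along one image axis and then the other.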
In this paper, the gray level run length matrix is used to describe higher-order statistical texture features within the decomposed sub-images. A run-length matrix $P(i, j)$ for a given image is defined by specifying a direction (0°, 45°, 90°, or 135°) and then counting the occurrences of runs for each gray level i and run length j in that direction. We consider only four run-length matrix features, namely Short Run Low Gray-Level Emphasis (SRLGE), Short Run High Gray-Level Emphasis (SRHGE), Long Run Low Gray-Level Emphasis (LRLGE), and Long Run High Gray-Level Emphasis (LRHGE) [20]:

$$\mathrm{SRLGE} = \frac{1}{n_r} \sum_{i} \sum_{j} \frac{P(i,j)}{i^2 j^2}, \qquad \mathrm{SRHGE} = \frac{1}{n_r} \sum_{i} \sum_{j} \frac{P(i,j)\, i^2}{j^2},$$

$$\mathrm{LRLGE} = \frac{1}{n_r} \sum_{i} \sum_{j} \frac{P(i,j)\, j^2}{i^2}, \qquad \mathrm{LRHGE} = \frac{1}{n_r} \sum_{i} \sum_{j} P(i,j)\, i^2 j^2,$$

where $P(i, j)$ is the run-length matrix and $n_r$ is the total number of runs.
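As an illustration of how a run-length matrix and the four emphasis features can be computed, the following Python sketch uses the standard definitions of SRLGE, SRHGE, LRLGE, and LRHGE. The function names, the direction encoding as (row step, column step), and the 1-based gray-level index used to avoid division by zero are assumptions; the paper does not state how the sub-image values are quantized to gray levels.

```python
def glrlm(image, levels, direction):
    """Build P, where P[g][j - 1] counts runs of gray level g and length j.

    image is a 2-D list of integers in 0..levels-1.
    direction is a (d_row, d_col) step: (0, 1) for 0 degrees, (-1, 1) for 45,
    (-1, 0) for 90 and (-1, -1) for 135.
    """
    rows, cols = len(image), len(image[0])
    dr, dc = direction
    P = [[0] * max(rows, cols) for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            # A run starts where the previous pixel along the direction is
            # outside the image or has a different gray level.
            pr, pc = r - dr, c - dc
            if 0 <= pr < rows and 0 <= pc < cols and image[pr][pc] == image[r][c]:
                continue
            length, nr, nc = 1, r + dr, c + dc
            while 0 <= nr < rows and 0 <= nc < cols and image[nr][nc] == image[r][c]:
                length, nr, nc = length + 1, nr + dr, nc + dc
            P[image[r][c]][length - 1] += 1
    return P


def run_length_features(P):
    """SRLGE, SRHGE, LRLGE, LRHGE with gray-level index i and run length j from 1."""
    n_r = sum(sum(row) for row in P)  # total number of runs
    feats = {"SRLGE": 0.0, "SRHGE": 0.0, "LRLGE": 0.0, "LRHGE": 0.0}
    for i, row in enumerate(P, start=1):
        for j, count in enumerate(row, start=1):
            feats["SRLGE"] += count / (i * i * j * j)
            feats["SRHGE"] += count * i * i / (j * j)
            feats["LRLGE"] += count * j * j / (i * i)
            feats["LRHGE"] += count * i * i * j * j
    return {name: value / n_r for name, value in feats.items()}


image = [[0, 0, 1],
         [0, 0, 1],
         [2, 2, 2]]
P0 = glrlm(image, levels=3, direction=(0, 1))     # horizontal runs
P90 = glrlm(image, levels=3, direction=(-1, 0))   # vertical runs
features0 = run_length_features(P0)
```

Repeating the call for the four directions gives four matrices per sub-image, each reduced to four numbers.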

Higher order color wavelet covariance features
In our proposed approach, color texture features are obtained by applying the DWT to the color image and estimating GLRLM features on the resulting sub-images. The input image is decomposed into three color channels $I_{C_i}$, i = 1, 2, 3. In this work, the frames are decomposed at level 3 using the Daubechies 2 (db2) wavelet. A three-level two-dimensional discrete wavelet transform is applied to each color channel $I_{C_i}$, producing a low-resolution image $B_i(k)$ at scale k and the detail images $D_i(k)$, where i = 1, 2, 3 and k = 1, 2, ..., 9, according to the wavelet decomposition of (1). Therefore, we have the set $\{B_i(k), D_i(k)\}$, i = 1, 2, 3 and k = 1, 2, 3, ..., 9.
Here k indexes the sub-images produced by the decomposition. As already noted, the most significant texture information lies in the middle wavelet detail channels, so we consider only the detail images for k = 4, 5, 6. This gives nine different sub-images (three detail images for each of the three color channels). To extract higher-order statistical texture information, we compute run-length matrices over these nine sub-images. These matrices reflect the spatial relationship between more than two pixels in a definite direction. Computing them in the four directions 0°, 45°, 90°, and 135° yields 36 matrices. The four run-length features are computed from each matrix, giving 36 × 4 = 144 features, where m denotes the respective run-length matrix feature.
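The three-level decomposition of one color channel can be sketched in plain Python as follows. This is a minimal illustration, not the MATLAB routine used by the authors: it assumes periodic boundary handling and even image dimensions at every level, filter orientation and boundary conventions differ between wavelet libraries, and all function names are hypothetical.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Daubechies 2 (db2) low-pass coefficients and the matching high-pass filter.
LOW = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
HIGH = [LOW[3], -LOW[2], LOW[1], -LOW[0]]


def analyze_1d(signal, taps):
    """Filter a 1-D signal and keep every second sample (periodic extension)."""
    n = len(signal)
    return [sum(t * signal[(2 * m + k) % n] for k, t in enumerate(taps))
            for m in range(n // 2)]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def filter_rows(grid, taps):
    return [analyze_1d(row, taps) for row in grid]


def dwt2_level(image):
    """One level of a separable 2-D DWT: returns (approx, [d1, d2, d3])."""
    low = transpose(filter_rows(image, LOW))    # filter along rows, then flip axes
    high = transpose(filter_rows(image, HIGH))
    approx = transpose(filter_rows(low, LOW))   # filter along the other axis
    details = [transpose(filter_rows(low, HIGH)),
               transpose(filter_rows(high, LOW)),
               transpose(filter_rows(high, HIGH))]
    return approx, details


def wavelet_details(channel, levels=3):
    """Return the 3 * levels detail images, finest level first."""
    out, approx = [], channel
    for _ in range(levels):
        approx, details = dwt2_level(approx)
        out.extend(details)
    return out


# Toy 16x16 channel standing in for one color band of a window.
channel = [[(3 * r + 5 * c) % 17 for c in range(16)] for r in range(16)]
details = wavelet_details(channel)
middle = details[3:6]   # detail images 4, 5 and 6 (second level)
```

Running the same decomposition on each of the three color channels and keeping `middle` gives the nine sub-images on which the run-length matrices are computed.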

Convolutional neural network features
Convolutional neural network features are extracted from each window of size 227 × 227. A convolutional neural network, a hierarchical neural network from the field of deep learning, can be composed of convolution layers, pooling layers, rectified linear unit (ReLU) layers, fully connected layers, and loss layers. In a simple CNN architecture, each convolution layer is followed by a ReLU layer and then a max-pooling layer. Finally, one or more fully connected layers follow the convolution layers. An important characteristic that distinguishes CNNs from traditional multilayer perceptron (MLP) models is that they take the structure of the images into account while processing them. Because of the full connectivity between layers, MLP models suffer from the curse of dimensionality, so they do not scale well to high-resolution images and cope less well with positional changes. The structure of the CNN used in this paper is inspired by [21] and is shown in Figure 5. Table 2 describes the implementation of the neural network architecture.
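The convolution, ReLU, and max-pooling sequence described above can be illustrated on a toy input. The sketch below is not the network of Table 2: the 5 × 5 image, the 2 × 2 kernel, and the function names are hypothetical, and a real CNN applies many learned filters over multiple channels.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation, as used by CNN convolution layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    """Replace negative responses with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; rows or columns that do not fill a block are dropped."""
    return [[max(fmap[r + a][c + b] for a in range(size) for b in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]


image = [[1, 2, 0, 1, 3],
         [0, 1, 2, 3, 1],
         [2, 0, 1, 0, 2],
         [1, 3, 0, 2, 1],
         [0, 1, 2, 1, 0]]
kernel = [[1, -1],
          [-1, 1]]   # hypothetical 2x2 filter, not a learned weight

features = max_pool(relu(conv2d_valid(image, kernel)))
```

The same small kernel is reused at every image position, which is what lets a CNN exploit image structure with far fewer parameters than a fully connected layer.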

Classification module
This module classifies the feature vectors into one of four classes: hyperplastic, adenomas, serrated, and normal. For computer-aided histopathological classification systems, many classifiers have been developed, such as linear discriminant analysis (LDA) [6], neural networks [8], adaptive neurofuzzy inference systems [22], binary classifiers [23], and support vector machines (SVM) [13,24,25]. In the proposed system, an SVM is used because of its better performance with a large number of features and with sparse training data. Frames of the different types of colorectal polyps are extracted from video endoscopy, and the module operates in two modes: training and testing. In training mode, color wavelet GLRLM features and CNN features are extracted from hyperplastic, adenomas, serrated, and normal video frames. The input feature vector consists of 144 color wavelet features and 4096 CNN features, which are combined for SVM training. In testing mode, new samples extracted from unknown video frames are classified using the knowledge gained from the training samples. If the unknown sample is classified as a hyperplastic, adenomas, or serrated polyp, it goes to the post-processing module; otherwise, the next unidentified video frame is passed to the feature extraction module.
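The feature fusion and the testing-mode routing can be summarized in a short Python sketch. The function names and the string labels are hypothetical, and the SVM itself is omitted; only the vector sizes and the routing rule come from the text.

```python
POLYP_CLASSES = {"hyperplastic", "adenomas", "serrated"}


def fuse_features(cw_features, cnn_features):
    """Concatenate the two descriptors into one SVM input vector."""
    if len(cw_features) != 144 or len(cnn_features) != 4096:
        raise ValueError("expected 144 color wavelet and 4096 CNN features")
    return list(cw_features) + list(cnn_features)


def route(label):
    """Decide where a classified sample goes next in testing mode."""
    return "post-processing" if label in POLYP_CLASSES else "next-frame"


# 3 color channels x 3 middle detail images x 4 directions x 4 run-length features
n_cw = 3 * 3 * 4 * 4
vector = fuse_features([0.0] * n_cw, [0.0] * 4096)
```

Each window therefore yields a 4240-dimensional vector for the linear SVM.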

Post-processing module
The outcome of the classification module is used to produce a new video frame on which the possible types of colorectal polyps are appropriately labeled. An illustration of this technique is shown in Figure 6.

[Figure 6 panel labels: Normal, Hp, Sr, Ad]

RESULTS
To evaluate the performance of the proposed method, we examined three metrics per class: sensitivity (Sen), specificity (Spe), and accuracy (Acc). Sensitivity, also called the true positive rate, measures the proportion of actual positives that are correctly identified as such. Specificity, the true negative rate, measures the proportion of actual negatives that are correctly identified as such. Accuracy is defined as the sum of true positives and true negatives divided by the total number of evaluated cases, that is, the sum of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Twelve performance values were therefore computed (Sen, Spe, and Acc for each of the 4 classes under study). The final confusion matrices can be seen in Table 3, and the performance analysis with respect to Sen, Spe, and Acc scores across the four classes is shown in Table 4. As Table 4 shows, the proposed method (Color wavelet + CNN + GLRLM + SVM) achieves competitive performance, not only for distinguishing polyp from non-polyp frames but also for classifying colorectal polyps, compared with the CNN feature-based method.
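The per-class metrics can be computed from a confusion matrix in a one-vs-rest fashion, as in the following Python sketch. The function name and the counts are hypothetical and serve only as an illustration; they are not the values of Table 3.

```python
def per_class_metrics(confusion, labels):
    """One-vs-rest Sen, Spe and Acc for each class.

    confusion[a][p] is the number of samples of actual class a predicted as p.
    """
    total = sum(sum(row) for row in confusion)
    metrics = {}
    for k, name in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # class k predicted as another class
        fp = sum(row[k] for row in confusion) - tp  # other classes predicted as k
        tn = total - tp - fn - fp
        metrics[name] = {
            "Sen": tp / (tp + fn),
            "Spe": tn / (tn + fp),
            "Acc": (tp + tn) / total,
        }
    return metrics


labels = ["Hp", "Ad", "Sr", "N"]
# Hypothetical counts for illustration only; these are not the values of Table 3.
confusion = [[48, 1, 1, 0],
             [2, 46, 2, 0],
             [1, 1, 47, 1],
             [0, 0, 0, 50]]
metrics = per_class_metrics(confusion, labels)
```

For the hypothetical Hp row, TP = 48, FN = 2, FP = 3, and TN = 147, so Sen = 48/50, Spe = 147/150, and Acc = 195/200.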

Comparison with existing methods
Comparison with existing approaches is extremely difficult for several reasons. First, the widely accepted research papers on related topics address a different problem, namely polyp detection [10,12,13], commonly using convolutional neural network features. Second, even where the same problem is addressed, the code is not openly accessible. We therefore compare with methods focused on the classification of gastrointestinal lesions in video endoscopy, as shown in Table 5. In addition, the proposed system is evaluated on standard datasets.

Comparison with human experts
We considered the diagnostic efficacy of human experts in order to compare their performance with that of our proposed approach. Although the human experts' experience in performing endoscopies ranges from 8 to 40 years, several human factors lead to polyp misdetection and misclassification. The lesions wrongly classified by all the humans are shown in Figure 7. The lesions correctly classified by all humans and the best machine learning model, the ones wrongly classified by the best model and correctly by all humans, and the lesions correctly classified by the best model and wrongly classified by all humans are displayed in Figures 8, 9 and 10, respectively. Text included in the figures shows the lesion name ('Hp', 'Ad', 'Sr', and 'N' refer to hyperplastic, adenomas, serrated, and normal lesions respectively).

DISCUSSION
In this paper, we developed and presented a set of algorithms for a computer-aided automatic colorectal polyp detection and classification system for video endoscopy. Polyp detection and classification is a challenging task because of numerous factors, such as the presence of debris, liquids, and bubbles, vignetting, and the variety of polyp shapes. We attempt to overcome this problem by incorporating both higher-order statistical texture features and convolutional neural network features. We used a linear SVM for classification, instead of a conventional softmax loss layer, for its stable results and faster training. Here, the input to the fully connected layer is used as the input to the SVM classifier. Of the several kinds of important low-level features, we consider texture features the most suitable for this work. We used higher-order statistical texture descriptors for feature extraction instead of second-order statistics because higher-order statistics capture the relationship between three or more pixels, whereas second-order statistics examine the relationship between two pixels. Therefore, we considered DWT and GLRLM features for texture recognition. We consider CNN features another suitable choice for feature extraction, as a CNN is an end-to-end learning process: given the dataset and the corresponding labels, the entire process of feature engineering is carried out by the network, which contains both feature extraction and machine learning. The proposed system for colorectal polyp classification showed that the use of color wavelet features with GLRLM and CNN features achieves high sensitivity (97.87%) and, equally importantly, high specificity (99.13%).

CONCLUSIONS AND FUTURE WORKS
In this paper, we presented a complete pipeline for a computer-aided automatic colorectal polyp detection and classification system capable of supporting the decisions of medical personnel. It aims to enhance the ability of an endoscopist to locate colorectal polyps that might otherwise go undetected and evolve into malignant tumors. The system exploits higher-order statistical texture features computed over the wavelet transform of the different color bands, together with convolutional neural network features. Classification was performed using a linear SVM classifier.
The results of the extensive experimental study lead us to conclude that the use of higher-order statistical texture features in the wavelet domain results in higher classification accuracy in terms of specificity and sensitivity. The proposed GLRLM features perform significantly better than Gray Level Co-Occurrence Matrix (GLCM) features for polyp detection, as the Gray Level Run Length Matrix provides information about the length of connected runs of pixels of a particular gray level in a definite direction. Convolutional neural networks are a special architecture of artificial neural networks that has been widely used in automatic image classification systems; they reduce learning complexity by sharing weights. The use of a linear SVM kernel positively affects the discrimination of normal and abnormal samples.
Finally, we conclude that the experimental results show enhanced performance of the proposed detector and classifier compared with other state-of-the-art detectors. In future, we would like to develop a more robust classification scheme based on video endoscopy frames and wireless capsule endoscopy (WCE), with a broader range of images containing colorectal polyps of different qualities. We would also like to add functionality that enhances video endoscopy by applying super-resolution techniques.