Improving contrast-distorted image quality assessment based on convolutional neural networks

ABSTRACT


INTRODUCTION
Image quality can be degraded by various types of distortion such as noise, blurring, fast fading, blocking artifacts and contrast distortion. These distortions may occur during operations such as acquisition, compression, storage, transmission, display and post-processing. Contrast distortion is among the most common and fundamental distortions. A contrast-distorted image (CDI) is an image with a low grayscale range, as shown in Figure 1. Contrast distortion may be caused by poor lighting conditions or a poor-quality image acquisition device.
Many image quality assessment algorithms (IQAs) have been developed during the past decade. However, most of them are designed for images distorted by compression, noise and blurring. Such distortions cause structural changes in the image [1], which do not occur in contrast distortion. Hence, the above-mentioned IQAs are not suitable for assessing contrast-distorted images (CDIs). Very few IQAs are designed specifically for CDI. The first IQA for CDI is a Reduced-Reference IQA (RR-IQA) called RIQMC [2]. The disadvantage of RIQMC is that it requires partial access to the reference image, which is impractical in real-life applications. Unlike distortion caused by image compression, where the original image can serve as the reference image, contrast distortion is caused by poor image acquisition conditions such as poor lighting or a poor device, so the original image itself is distorted and a reference image is practically unavailable.
The first practical solution, proposed by Yaming et al., is the No-Reference IQA for CDI (NR-IQA-CDI) [3]. It is developed based on the principles of Natural Scene Statistics (NSS), which hold that there are certain regularities in the statistics of natural scenes that may be missing from the statistics of distorted images. The five features used in NR-IQA-CDI are global spatial statistics of an image: the mean, standard deviation, entropy, kurtosis and skewness. Unfortunately, the performance of NR-IQA-CDI is not encouraging on two of the three test image databases, TID2013 and CSIQ, where the Pearson Linear Correlation Coefficients are only around 0.57 and 0.76, respectively.
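As an illustration, the five global statistics can be computed from a grayscale image as follows. This is a minimal sketch using common definitions (histogram-based Shannon entropy, standardized third and fourth central moments); the exact formulations in [3] may differ in detail.

```python
import numpy as np

def global_statistics(img):
    """Compute the five global NSS features used by NR-IQA-CDI:
    mean, standard deviation, entropy, kurtosis and skewness.
    `img` is a 2-D grayscale array with values in [0, 255]."""
    x = img.astype(np.float64).ravel()
    mean = x.mean()
    std = x.std()
    # Shannon entropy of the 256-bin grayscale histogram (in bits)
    hist, _ = np.histogram(x, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    ent = -np.sum(p * np.log2(p))
    # Standardized central moments (population definitions)
    z = (x - mean) / std
    skewness = np.mean(z ** 3)
    kurt = np.mean(z ** 4)
    return mean, std, ent, skewness, kurt
```

For a high-contrast two-level image the entropy is 1 bit and the standard deviation is large; a severely contrast-distorted image would show a much smaller standard deviation and entropy.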
The existing NR-IQA-CDI relies on features designed by humans, or handcrafted features; a considerable level of skill, domain expertise and effort is required to design good handcrafted features. Recently, there has been great advancement in machine learning with the introduction of deep learning through Convolutional Neural Networks (CNNs), which enable machines to learn good features from raw images automatically, without human intervention. It is therefore tempting to explore ways to transform the existing NR-IQA-CDI from handcrafted features to machine-crafted features learned through deep learning, specifically CNNs.
The evaluation results indicate that NR-IQA-CDI based on machine-crafted features generally performs better than NR-IQA-CDI based on handcrafted features, while enjoying the advantage of requiring zero human intervention in identifying the features. The remainder of this paper is organized as follows. Section 2 gives a brief overview of CNNs for NR-IQA. Section 3 describes the design of NR-IQA-CDI based on non-pre-trained CNN models. Section 4 presents the performance evaluation, and Section 5 concludes the work.

CONVOLUTIONAL NEURAL NETWORKS (CNNs) FOR NR-IQA
Artificial neural networks have been among the most popular tools for machine learning [4] and, in a more general sense, for deep learning. Among the several deep learning architectures, stacked denoising autoencoders [5], deep belief networks [6][7] and convolutional neural networks [8][9][10][11][12][13] are three of the most popular, utilized for different types of applications. Convolutional neural networks (CNNs) are a special kind of deep learning method; they run much faster on GPUs, such as NVIDIA's Tesla K80 processor, and have achieved state-of-the-art performance on various computer vision tasks such as object detection, recognition, retrieval, annotation, image classification and segmentation [14][15][16].
The fundamental difference between a convolutional neural network (CNN) and conventional machine learning is that, rather than relying on hand-crafted features such as SIFT [17] and HoG, a CNN can automatically learn features from data (images) and produce scores at its output [18]. Figure 2 shows the difference between machine learning and deep learning. Generally, a CNN architecture comprises different layer types such as convolutional layers, Rectified Linear Unit (ReLU) layers, cross-channel normalization layers, pooling layers, fully connected layers, dropout layers, SoftMax layers, and an output classification layer. Each layer receives data from the previous layer, transforms it, and passes it to the subsequent layer. CNN architectures vary in terms of the number of outputs per layer, the size and type of spatial pooling, the number of layers, and the size of the convolutional filters. In general, CNNs are trained in a supervised manner using standard backpropagation [19]. Figure 3 shows the typical architecture of a CNN model. The application of CNNs to IQA was first proposed by Kang et al. [20]. They treated image patches as input and employed a CNN to predict image quality. As a result, the CNN could accurately predict quality scores on small image patches. Moreover, instead of using handcrafted features, it merges feature learning and regression into a single optimization process.
To eliminate the need for manual feature extraction, deep learning is used to learn features from raw data (images) automatically; for example, NR-IQA can learn important features directly from raw images [21]. Most conventional NR-IQA methods depend on two main steps, feature extraction and score prediction, whereas in NR-IQA based on machine-crafted features, feature extraction and learning are integrated into a single step.

DESIGNING NR-IQA-CDI BASED ON NON-PRE-TRAINED CNN MODELS
The proposed NR-IQA-CDI based on a non-pre-trained CNN (NR-IQA-CDI-NonPreCNN), trained from scratch, is illustrated in Figure 4. The details of the network architecture and the training procedure are presented in the following two sections. During training, each image patch is treated as an independent image sample and is labeled with the quality score of its source image.
There is no golden rule for designing a CNN model in terms of the number of layers and the size of the filters. This work started with a network of three convolutional layers containing 96, 256 and 384 filters of size 12x12, 5x5 and 3x3, respectively. Various settings of the number of layers and filter sizes were also tested; the results are presented in Section 4, Figures 7 and 8. ReLU (Rectified Linear Unit) activation functions were used after the convolutional layers. The first two convolutional layers were followed by 2x2 max pooling, while no pooling was applied to the third layer. The three convolutional layers were attached to a fully-connected layer containing 9600 hidden units and a linear regression layer for image quality score prediction.
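Assuming stride-1 convolutions with no padding and non-overlapping 2x2 max pooling (hyperparameters not stated explicitly above), the feature-map sizes through the three convolutional layers can be traced with simple arithmetic:

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution (floor formula)."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of non-overlapping max pooling."""
    return (size - kernel) // stride + 1

def trace_shapes(input_size, layers):
    """Trace the feature-map side length through the network.
    `layers` is a list of ('conv', k) / ('pool', k) tuples."""
    sizes = [input_size]
    for kind, k in layers:
        s = sizes[-1]
        sizes.append(conv_out(s, k) if kind == 'conv' else pool_out(s, k))
    return sizes
```

For a 64x64 input patch, `trace_shapes(64, [('conv', 12), ('pool', 2), ('conv', 5), ('pool', 2), ('conv', 3)])` gives side lengths 64, 53, 26, 22, 11 and 9, so under these assumptions the third convolutional layer outputs 9x9 feature maps across its 384 filters.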

Training the CNN
Since the input size (image size) of a CNN is fixed, the input images for all databases were resized to 512x512. During the training phase, the quality label of the whole image was assigned to all patches of that image. The proposed networks were trained by performing backpropagation over several epochs. Here, one epoch is defined as the period in which each sample from the training set has been used once. With the learning rate fixed at 0.0001, all models were trained for 150 epochs. For each training image given as input, the forward propagation step (consisting of convolution, ReLU and pooling operations followed by the fully connected layer) was performed to compute the network output. A laptop (Intel(R) Core(TM) 2 Duo CPU, 8 GB RAM and an NVIDIA GTX 950M GPU, running MATLAB R2017a) was used to perform the experiments.
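The patch labeling described above can be sketched as follows: every non-overlapping patch extracted from a resized image simply inherits the subjective score of that image (the non-overlapping stride is an illustrative assumption).

```python
import numpy as np

def extract_labeled_patches(img, score, patch=64):
    """Split a (resized) grayscale image into non-overlapping patches
    and label every patch with the quality score of its source image."""
    h, w = img.shape[:2]
    patches, labels = [], []
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            patches.append(img[r:r + patch, c:c + patch])
            labels.append(score)
    return np.stack(patches), np.array(labels)
```

For a 512x512 image and 64x64 patches this yields (512 // 64)^2 = 64 training samples per image, all carrying the same MOS/DMOS label.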

PERFORMANCE EVALUATION
In this section, the performance evaluation of the proposed NR-IQA-CDI based on a non-pre-trained CNN (NR-IQA-CDI-NonPreCNN), trained from scratch, is presented. The presentation begins with the evaluation methodology, including the test image databases, performance metrics and evaluation procedures. This is followed by discussions of the evaluation results and the conclusions drawn from the performance evaluation.

Evaluation methodology
For a fair comparison, the test image databases used for the evaluation are the same as those used to evaluate the existing NR-IQA-CDI: the CSIQ database [22], the TID2013 database [23] and the CID2013 database [2]. Only the contrast-distorted images in these three databases were used. Subjective scores are represented by either the mean opinion score (MOS) or the differential mean opinion score (DMOS).
Cross-validation was used in the performance evaluation. It is a model validation method that evaluates how well the performance of a model generalizes to an independent data set. k-fold cross-validation (k-fold CV) was chosen for this work, as it allows performance evaluation with many different combinations of the data set to minimize bias. In this method, the data are divided into k subsets and the evaluation is repeated k times; in each iteration, one of the k subsets is used for testing while the rest are used for training. The final evaluation result is the average of the k evaluations.
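A minimal sketch of this k-fold splitting (the shuffling and fold assignment here are illustrative choices):

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation:
    the data are split into k subsets; each subset serves as the test
    set exactly once while the remaining k-1 subsets form the train set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    folds = np.array_split(order, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test
```

The per-fold results are then averaged to give the final score for that value of k.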
It is well known that the performance of a machine learning model tends to improve as the amount of training data increases; therefore, k-fold CV with a higher k tends to show better performance. In this work, k-fold CV was repeated for k ranging from 2 to 10 to reduce bias due to the size of the training set. However, only 10 train-test iterations were conducted, as training a CNN is very time-consuming.
The performance of the IQAs was evaluated using standard performance metrics such as the Pearson Linear Correlation Coefficient (PLCC).
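As a sketch, the PLCC mentioned in the introduction, together with the rank-based SRCC commonly reported alongside it in IQA studies (the exact metric set used here is not fully listed above), can be computed as:

```python
import numpy as np

def plcc(pred, mos):
    """Pearson Linear Correlation Coefficient between predicted
    scores and subjective scores (MOS/DMOS)."""
    p = np.asarray(pred, float) - np.mean(pred)
    m = np.asarray(mos, float) - np.mean(mos)
    return float(np.sum(p * m) / np.sqrt(np.sum(p ** 2) * np.sum(m ** 2)))

def srcc(pred, mos):
    """Spearman Rank-order Correlation Coefficient: the PLCC of the
    ranks (simple version without tie handling)."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return plcc(rank(np.asarray(pred)), rank(np.asarray(mos)))
```

A PLCC near 1 indicates strong linear agreement with the subjective scores, while SRCC measures monotonic agreement regardless of scale.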

Evaluation of NR-IQA-CDI-NonPreCNN
The evaluation was repeated with various patch sizes M x N, namely 256 x 256, 128 x 128, 64 x 64 and 32 x 32, as well as without patches (the image was resized before input to the CNN), to determine the best setting. Tables 1-5 show the average assessment results using patch sizes 256 x 256, 128 x 128, 64 x 64 and 32 x 32 and without patches, respectively. It is apparent that increasing the number of training samples by dividing the image into patches improves the performance of CNN-based NR-IQA-CDI. For example, there are only 400 images in the CID2013 database, but about 2400 image patches when a patch size of 256 x 256 is used and about 153600 image patches when a patch size of 32 x 32 is used, as shown in Table 6. The average results of k-fold cross-validation (k = 2 to 10) are summarized in Table 7, where the patch size ranges from 32 to 256 pixels. The proposed method shows better performance as the number of patches increases, as shown in Figure 6. However, the performance of NR-IQA-CDI-NonPreCNN at the small patch size of 32 x 32 was not improved enough over that at 64 x 64 to justify the exponential increase in the number of training samples. Therefore, 64 x 64 was chosen as the optimal patch size.

Figure 7 shows the variation of performance with respect to the number of convolution kernels. In general, using more kernels leads to better performance. In the case of CSIQ, the kernel numbers of the three convolution layers in settings 1-4 are (8, 16, 32), (16, 32, 64), (32, 64, 128) and (96, 256, 384), respectively. It can be concluded that performance can be improved by increasing the number of kernels; a similar conclusion also applies to the TID2013 and CID2013 databases. Generally, higher layers can extract better features; however, high-level features are not necessarily better than low-level ones.
From Figure 8, performance increased as the number of convolution layers was increased from one to three. However, performance started to decrease when the number of layers was increased further. Similar results were found for TID2013 and CID2013. Therefore, a CNN with three convolution layers was optimal for NR-IQA-CDI.

Cross-database evaluation
A cross-database evaluation, used to assess the generalization ability of an NR-IQA-CDI across different databases, was also conducted in the current work. In this evaluation, all images in one database are used for training while the images in the other two databases are used for testing. For consistency, the MOS and DMOS scores were rescaled to the range 0-10. The results of the cross-database test are reported in Table 8. It can be observed that the performance of our method when trained on CSIQ is lower than when trained on CID2013; a larger training set therefore leads to better generalization. Hence, the generalization capability of a deep neural network depends on the size and diversity of the training set.
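The 0-10 rescaling can be sketched as a simple min-max mapping. Note that DMOS runs opposite to MOS (higher DMOS means worse quality), so a full pipeline might additionally invert DMOS-based scores; that detail is not specified above.

```python
import numpy as np

def rescale_scores(scores, lo=0.0, hi=10.0):
    """Linearly map subjective scores (MOS or DMOS) onto a common
    [0, 10] range so that databases with different native scales
    can be compared in the cross-database test."""
    s = np.asarray(scores, float)
    return lo + (hi - lo) * (s - s.min()) / (s.max() - s.min())
```

After this mapping, the minimum score in each database becomes 0 and the maximum becomes 10, regardless of the database's original scale.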

NR-IQA-CDI-NonPreCNN vs NR-IQA-CDI based on handcrafted features
The proposed NR-IQA-CDI-NonPreCNN is compared against NR-IQA-CDI based on handcrafted features. The comparison results are shown in Table 9, with the best score for each performance metric and database highlighted in bold. The results show that NR-IQA-CDI-NonPreCNN significantly outperforms the methods based on handcrafted features. In addition to delivering the best performance, NR-IQA-CDI-NonPreCNN enjoys the advantage of requiring zero human intervention in feature design, making it the most attractive solution for NR-IQA-CDI.

CONCLUSION
In this paper, a study on transforming the existing NR-IQA-CDI to use machine-crafted features based on a CNN has been presented. The proposed method accurately predicts the quality score of an image and integrates feature learning and regression into one optimization process. It requires neither a reference image nor handcrafted features, and directly learns discriminative features from raw image pixels to achieve much better performance. The evaluation results indicate that NR-IQA-CDI based on machine-crafted features generally performs better than NR-IQA-CDI based on handcrafted features, while enjoying the advantage of requiring zero human intervention in identifying the features.