Copy-move forgery detection using convolutional neural network and K-mean clustering

ABSTRACT


INTRODUCTION
Digital image processing is used in a wide range of applications. Today's image processing tools make editing or manipulating digital images easy and fast without leaving obvious traces. The recent growth in image-manipulation software has made it harder to trust images used as prominent documents or as evidence [1]. Tampered or manipulated digital images can be used for various purposes, such as deluding the public, changing political views, and causing public disturbance [2]. Image forgery detection algorithms have therefore been proposed. In terms of prior knowledge of the image, these algorithms can be divided into two categories: active and passive detection [3]. Active detection methods are based on digital watermarking or digital signatures. Passive detection includes two types of approaches: forgery type-independent and forgery type-dependent. In type-dependent detection, forgery is detected according to its type, while in type-independent detection, the effects of image compression or repetitive patterns are analyzed. Image forgery detection techniques include copy-move detection, segmentation-based algorithms, passive detection, and splicing detection [4]. Splicing forgery is a method in which several regions copied from different images are pasted into one image [5], while copy-move forgery is done by pasting one or more copied parts of an image into the same image. Copy-move forgery is often used to hide unwanted region(s) of an image. Copied contents are often selected from a textured region of the image so that they are invisible to the naked eye. This type of forgery is more popular than the others mentioned because regions copied from the same image are more likely to be similar in texture, content, and illumination.
Int J Elec & Comp Eng, ISSN: 2088-8708. Copy-move forgery detection using convolutional neural network and K-mean clustering (Ava Pourkashani)
Figure 1 presents an image taken from the MICC-Fx dataset series with a copy-move forgery attack. As shown in Figure 1, detecting the copied regions of the image with the naked eye is hard. In addition, the copied regions may be further attacked: they can be noised, scaled, rotated, compressed, or blurred, which makes detection harder for state-of-the-art algorithms than on images without such attacks. Several studies have been conducted on copy-move forgery detection (CMFD). In terms of their operating mechanism, CMFD algorithms are classified into block-based and feature-based detection algorithms. In block-based algorithms, an image is split into several non-overlapped blocks, and the similarity of the blocks is then compared [6]. In feature-based algorithms, feature extractors such as scale-invariant feature transform (SIFT) [7], speeded-up robust features (SURF), and local binary pattern (LBP) are applied to the image and the results are analyzed. One common feature extraction-based method uses Zernike moments or blur invariants [8], and it has provided good results. However, an effective CMFD algorithm that overcomes the mentioned challenges, especially compression (JPEG), is still an open research topic.
In this research, we propose a CMFD algorithm based on feature extraction. The proposed approach includes three steps. First, the Harris corner detection technique is applied to an image. Second, patches are extracted around the corners and matched using convolutional neural networks (CNN) [5,9]. We use a method inspired by Siamese networks [10]. Because two matched patches alone are not good evidence of forgery, in the third step k-means clustering is used to match several patches together. Our experimental results show that the proposed algorithm outperforms the state-of-the-art approaches, even in multiple forgeries.
The remainder of the paper is structured as follows. Related work is reviewed in section 2. The proposed algorithm is introduced and discussed in section 3. Experimental results are presented in section 4, where the proposed method is also compared with several state-of-the-art approaches [11] in terms of precision, recall, and F1-score. Finally, conclusions are given in section 5.

RELATED WORK
There are many approaches to CMFD based on blocking and feature extraction. Some of these algorithms are introduced in the following. First, we present some feature extractors, such as the local binary pattern (LBP) textural descriptor and Zernike moments, which are used for CMFD. LBP properties such as invariance to illumination and image transformations, together with the statistical information it captures about the textural structure of an image, make it an efficient feature for CMFD algorithms. In addition, multi-resolution LBP, one of the LBP extensions, was implemented for CMFD [12]. By adding a k-d tree algorithm to the LBP, the authors in [12] showed that this approach could recognize copy-move forgery under the various distortions mentioned before. The Zernike moments, the shape [8,11]. Zernike moments and locality-sensitive hashing (LSH) have been used for copy-move forgery detection [8]. Owing to locality-sensitive hashing, this feature achieved better performance under moderate scaling, additive white Gaussian noise, JPEG compression, and blurring [8]. Speeded-up robust features (SURF) and scale-invariant feature transform (SIFT) are two common approaches to copy-move forgery detection. Researchers have combined the SIFT method with other approaches to improve the performance of CMFD. The SURF and SIFT methods have conventionally been used to detect similar regions in an image under typical challenges such as noise, scaling, and blurring. However, a match found by these algorithms is not by itself evidence of forgery. To address this, the authors in [13] performed hierarchical clustering after running the SIFT algorithm, so that clusters of matched points are detected rather than single matched points. Random sample consensus (RANSAC) is another algorithm, which estimates the homography matrix and matches the clusters.
The authors in [14] showed that SIFT-based algorithms are suitable for CMFD. A combination of the discrete wavelet transform (DWT) and SIFT produced better results for CMFD: following DWT theory, the LL sub-band of the DWT was used instead of the raw image as the input to SIFT [15].
The dyadic wavelet transform (DyWT) has also been applied to CMFD. Unlike traditional wavelet transform tools, the coefficients are not reduced at each decomposition level. Wavelet and scaling coefficients were compared for each block to detect similar blocks. After dividing an image into overlapped blocks, the LL1 and HH1 sub-bands were compared with each other. To make a decision in the last step, the Euclidean distance between matched blocks was calculated. The authors in [16] used the SURF algorithm for CMFD. The experimental results showed that SURF can detect forgery under viewpoint changes and in textured scenes, which is still challenging for many algorithms. The authors in [17] applied singular value decomposition (SVD) to the regions of an image after quantization of the discrete cosine transform (DCT). With this method, CMFD gained advantages such as resistance to Gaussian noise and blur attacks and the ability to detect multiple copy-move forgeries. Moreover, CMFD has been implemented in the spatial domain while remaining resistant to rotation attacks. First, the image was split into n×n overlapped blocks, and features were extracted from the blocks by four nested frames. The k-means clustering algorithm was used to group the overlapped blocks. Using radix sort, each block group was lexicographically ordered. After that, the distance between nearby blocks was calculated to determine the overall similarity. Because of its translation- and scale-invariant properties, the Fourier-Mellin transform (FMT) was selected for CMFD. After splitting the image into several overlapped blocks, the FMT was applied for feature extraction. After that, counting Bloom filtering was applied with hashing. The low complexity of Bloom filtering compared with other methods such as lexicographic sorting was the main reason for using it.
Since finding matched blocks alone is not sufficient evidence of forgery, the authors also considered the distance of the matched blocks to an assumptive array when making a decision.

PROPOSED ALGORITHM
The proposed algorithm for CMFD includes three main steps: corner detection, keypoint extraction and matching, and decision making.

Corner detection
The main part of an image may be attacked by scaling manipulation, and there is no prior information about where the cloned region starts or how it was scaled. To cope with these problems, an image pyramid representation is proposed, as illustrated in Figure 2. In Figure 2, level 1 is the input image, which is scaled to produce the images shown in levels 0, 2, and 3. Using an image pyramid makes the proposed approach robust against scale attacks. In addition, using scaled images to train the convolutional neural network (CNN) makes the proposed approach more resistant to scale attacks. Corners are extracted from the image at each pyramid level. We split the image at each pyramid level into m×n patches, where the centre of each patch is a corner. Here, m and n are the width and height of the patch, respectively, which are set according to the CNN input size. A modified version of the Harris corner detector is used for corner detection. The Harris corner detector for a given image I is defined as (1). Blurring causes edges to be smoothed. Since corners can be defined as points of the image that have multi-directional edges, blurring reduces edge intensities and consequently reduces corner intensity or removes corners altogether. To cope with corners missed because of blurring, we use sharpening techniques.
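The modified detector and equation (1) are not reproduced in this text. As a reference point only, the following is a minimal sketch of the standard Harris response, R = det(M) - k * trace(M)^2, where M is the structure tensor summed over a small window. The plain-list image representation, the box window, k = 0.04, and the function names `harris_response` and `strongest_corner` are assumptions made for illustration, not the authors' implementation.

```python
def harris_response(img, k=0.04, radius=1):
    """Standard Harris response R = det(M) - k * trace(M)^2 for each pixel.

    img is a list of rows of floats (grayscale). Pixels closer than
    `radius` to the border keep a response of 0.
    """
    h, w = len(img), len(img[0])
    # Central-difference gradients; the outermost ring of pixels stays 0.
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0

    resp = [[0.0] * w for _ in range(h)]
    for y in range(radius, h - radius):
        for x in range(radius, w - radius):
            sxx = sxy = syy = 0.0
            # Structure tensor M accumulated over a square box window.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    gx, gy = ix[y + dy][x + dx], iy[y + dy][x + dx]
                    sxx += gx * gx
                    sxy += gx * gy
                    syy += gy * gy
            det = sxx * syy - sxy * sxy
            trace = sxx + syy
            resp[y][x] = det - k * trace * trace
    return resp


def strongest_corner(resp):
    """Return (row, col) of the largest response value."""
    best = max((v, y, x) for y, row in enumerate(resp) for x, v in enumerate(row))
    return best[1], best[2]


# Synthetic 9x9 image: a bright square filling the lower-right quadrant,
# so the only interior corner of the square is at row 4, column 4.
image = [[1.0 if (y >= 4 and x >= 4) else 0.0 for x in range(9)] for y in range(9)]
response = harris_response(image)
corner = strongest_corner(response)
```

On this synthetic image the response is positive at the corner of the square, negative along its straight edges, and zero in flat regions, which is the behaviour that allows corners to be selected by thresholding R.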

Figure 2. Image pyramid representation for the CMFD
There are several approaches to image sharpening (e.g., Laplacian [18]). However, the input image may not be blurred, in which case simple approaches may strengthen unwanted edges and thereby increase the number of corners. Although the unwanted corners may reduce the overall performance, the next steps of the algorithm reject them as much as possible. To overcome these problems, an iterative sharpening (IS) approach is used, which is defined for input image I as in (2):

$$H = I - (I * D) \tag{2}$$

where D is an edge-smoothing filter such as an averaging or Gaussian filter, and H is the difference between the input image I and the blurred image (I * D). In (2), '*' denotes the convolution operator. H is thus a high-frequency image, which is gamma corrected and then added to the blurred image as shown in (3) to (6).
where I_n(u, v) is the resulting image after n iterations in the frequency domain, and H*(u, v), D(u, v)^n, and I(u, v) are the Fourier transforms of H*, D, and I, respectively. The difference between this algorithm and Laplacian sharpening is depicted in Figure 3, which shows the effect of sharpening on both a normal image and an already sharpened image. After three and four iterations, the image produced using LoG sharpening contains considerable noise, while this effect is not seen with IS sharpening. White Gaussian noise was applied to the input image. The results are presented in the 4th iteration of LoG. Robustness to sharpening is important for CMFD because the input image may be sharpened manually as an attack or may be naturally sharp. In this case, simple sharpening methods may add considerable noise, as shown in the 4th iteration of the LoG.
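Equations (3) to (6) are not reproduced in this text, so the exact iteration cannot be restated. The sketch below only illustrates the step described in words: blur the image with D, take the high-frequency residual H = I - (I * D), gamma-correct H, and add it back to the blurred image. The 3x3 averaging filter, the signed power-law form of the gamma correction, the clamping to [0, 1], the spatial-domain repetition in `iterative_sharpen`, and all function names are assumptions.

```python
import math


def box_blur(img):
    """3x3 averaging filter (one choice of D), replicating edge pixels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def sharpen_step(img, gamma=0.5):
    """One step: blurred + gamma-corrected high-frequency residual.

    Intensities are assumed to lie in [0, 1], so |H| < 1 and gamma < 1
    amplifies H. The sign of H is preserved (assumed form of the correction).
    """
    blurred = box_blur(img)
    out = []
    for row_i, row_b in zip(img, blurred):
        new_row = []
        for i, b in zip(row_i, row_b):
            hf = i - b                                    # H = I - (I * D)
            hf_star = math.copysign(abs(hf) ** gamma, hf)  # gamma-corrected H
            new_row.append(min(max(b + hf_star, 0.0), 1.0))
        out.append(new_row)
    return out


def iterative_sharpen(img, iterations=3, gamma=0.5):
    """Repeat the step in the spatial domain (assumed iteration scheme)."""
    for _ in range(iterations):
        img = sharpen_step(img, gamma)
    return img


# A vertical edge between 0.2 and 0.8, identical on every row.
soft_edge = [[0.2, 0.2, 0.2, 0.8, 0.8, 0.8] for _ in range(5)]
one_step = sharpen_step(soft_edge, gamma=0.5)
identity = sharpen_step(soft_edge, gamma=1.0)   # gamma = 1 returns the input
three_steps = iterative_sharpen(soft_edge, iterations=3)
```

With gamma = 1 the step reduces to I itself, which is a quick sanity check that the decomposition into blurred and high-frequency parts is lossless; with gamma < 1 the contrast across the edge increases.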

Matching
For matching two blocks, we use a non-conventional convolutional neural network (CNN) architecture. Conventional image matching methods use hand-crafted features such as histogram of oriented gradients, Zernike or Hu moments, and local binary patterns. Instead of using these features, we leave feature extraction to the CNN. Figure 4 presents the CNN architecture used for matching two patches of the image. To obtain a pre-trained feature extractor network, the dense layer (the fully connected or support network) is removed. To train this network, we crop patches from images in the ImageNet dataset (fall 2011 release). We randomly select 100 images from each category. Each image is segmented into m×n non-overlapped blocks, where m and n are the width and height of the network input. In the training phase, we divide these blocks into two classes: similar and non-similar. Similar blocks are also augmented using conventional attacks and image manipulations, including added noise (such as salt-and-pepper and additive white Gaussian noise), rotation, scaling, and brightness and contrast changes. In addition, we augment image patches to avoid network sensitivity to shift translations. We choose the VGG16 network as a baseline for selecting the best pre-trained network, and also test VGG19, ResNet, and AlexNet. Among these networks, AlexNet was the best for finding image patch pairs. The network was trained using stochastic gradient descent with momentum. Dropout was used to avoid overfitting and to keep network connections as simple as possible. The learning rate was set to 0.001, and the mini-batch size was experimentally set to 128.

Decision making
Finding two similar patches is not sufficient evidence for CMFD because images sometimes contain naturally repetitive patches. To avoid this problem, we make a decision by matching several patches. To match several patches together, we apply k-means clustering. The main idea is that instead of matching separate patches of the image, clusters of patches should be matched. Each patch in a cluster should be close to the other patches in terms of pixel distance. Figure 5 illustrates the locations of corners and the corresponding matches.
This figure depicts the locations of the patches. We use k-means clustering to group them. One problem of k-means clustering is estimating the number of clusters. To obtain an optimal number of clusters, the Davies-Bouldin criterion (DBC) is used. We test different numbers of clusters and select the one with the minimum DBC. Figure 6 illustrates various numbers of clusters and the corresponding DBC. As shown in Figure 6, the DBC is minimal for 8 clusters. Therefore, we use 8 clusters. Figure 7 demonstrates the result of k-means clustering with k=8. Next, we represent the relations between clusters as a weighted graph. Figure 8 illustrates the weighted graph derived from the clustering result shown in Figure 7. The vertices are the clusters, labelled by their legends, and the weight of each edge is the number of matched image patches between the two clusters. To simplify the graph, nodes (and the corresponding clusters) with few patches and edges with small weights are removed. The results show that two parts of the image are cloned.
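The selection of the number of clusters described above can be sketched as follows: run k-means on the patch locations for several candidate values of k and keep the k with the smallest Davies-Bouldin value. This is a minimal pure-Python sketch; the deterministic farthest-point initialisation, the toy coordinates, and the function names are assumptions, and the original MATLAB implementation may differ.

```python
import math


def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all centres chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Index of the nearest centre for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c])) for p in points]


def kmeans(points, k, max_iter=100):
    centers = farthest_point_init(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centre if a cluster becomes empty.
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centers[c])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers


def davies_bouldin(points, labels, centers):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j) ratio."""
    k = len(centers)
    scatter = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            return math.inf
        scatter.append(sum(math.dist(p, centers[c]) for p in members) / len(members))
    total = 0.0
    for i in range(k):
        worst = 0.0
        for j in range(k):
            if i != j:
                d = math.dist(centers[i], centers[j])
                if d == 0.0:
                    return math.inf
                worst = max(worst, (scatter[i] + scatter[j]) / d)
        total += worst
    return total / k


def select_k(points, candidates):
    scores = {}
    for k in candidates:
        labels, centers = kmeans(points, k)
        scores[k] = davies_bouldin(points, labels, centers)
    return min(scores, key=scores.get), scores


# Toy patch locations: three compact groups of four points each.
patches = [(0, 0), (1, 0), (0, 1), (1, 1),
           (10, 0), (11, 0), (10, 1), (11, 1),
           (0, 10), (1, 10), (0, 11), (1, 11)]
chosen_k, scores = select_k(patches, range(2, 6))
```

On this toy data the criterion is smallest for three clusters. In the paper's example the same procedure yields 8 clusters, and the resulting labels are then used to count matched patches between pairs of clusters.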

RESULTS AND DISCUSSION
In this section, experimental results are presented. First, we define how the basic evaluation criteria, namely true positive rate and false positive rate, are computed. Then, the experimental platform and datasets are introduced and, finally, the implementation results are presented.

Criteria
True positive rate (TPR) and false positive rate (FPR) are defined according to the Jaccard index (7):

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{7}$$

where J(A, B) is the Jaccard index between the sets A and B, and ∩ and ∪ are the intersection and union operators, respectively. This index is also known as intersection over union (IoU). Since the output of our approach cannot be delivered in the same form as the ground truth, we compare the two using bounding boxes. More precisely, the Jaccard index between the delivered bounding box and the bounding box of the ground truth is used. When this index exceeds 0.5, the detection is counted as a true positive; otherwise, as a false positive.
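For axis-aligned bounding boxes the Jaccard index reduces to a few area computations. The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max) tuples; this representation, the example coordinates, and the function names are illustrative assumptions.

```python
def box_area(box):
    """Area of an (x_min, y_min, x_max, y_max) box; 0 for degenerate boxes."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(a, b):
    """Jaccard index (intersection over union) of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detected, ground_truth, threshold=0.5):
    """A detection counts as a true positive when the index exceeds 0.5."""
    return box_iou(detected, ground_truth) > threshold


ground_truth = (20, 20, 60, 60)
loose_detection = (10, 10, 50, 50)   # intersection 900, union 2300
tight_detection = (22, 22, 62, 62)   # intersection 1444, union 1756
```

Here `loose_detection` has an index of 900/2300 and is counted as a false positive, while `tight_detection` has an index of 1444/1756 and is counted as a true positive.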

Environmental platform and database
The experiments were run on a laptop with a Core i7 processor, 12 GB of memory, and a GeForce GTX 980 Ti series graphics card, running the Windows 10 operating system. The proposed algorithm was implemented in MATLAB 2018b. The proposed approach was evaluated on the MICC-F8multi, MICC-F600, and MICC-F2000 public databases, which include 8, 600, and 2000 images, respectively.

Results of implementation
At the image level, the important measures are the number of correctly detected forged images (TP), the number of images erroneously detected as forged (FP), and the number of forged images that were missed (FN). Using these quantities, we compute precision (p) and recall (r) [4], which are defined as (8) and (9), respectively:

$$p = \frac{TP}{TP + FP} \tag{8}$$

$$r = \frac{TP}{TP + FN} \tag{9}$$

where p denotes the probability that a detected forgery is truly a forgery and r shows the probability that a forged image is detected. In Table 1, we also give the F1-score, a measure that combines recall and precision in a single value.
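As a worked illustration of these measures, the sketch below computes precision, recall, and the F1-score (the harmonic mean of the two) from image-level counts. The counts in the example are hypothetical and are not results from this paper; the function name is an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from image-level counts.

    precision = TP / (TP + FP), recall = TP / (TP + FN),
    F1 = 2 * p * r / (p + r). Ratios with a zero denominator are set to 0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 forged images detected, 10 false alarms, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these hypothetical counts, p = 0.9 and r = 0.75, and the F1-score is 2 * 0.9 * 0.75 / 1.65, about 0.818.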

CONCLUSION
Given the importance of copy-move forgery, a common type of image tampering, we proposed an algorithm for copy-move forgery detection (CMFD) based on feature extraction. To find identical patches or similar regions of an image, Harris corner detection is used. A convolutional neural network (CNN) is used for the matching process. To achieve the best result, we use a pre-trained network. We also use k-means clustering to reduce the false positive rate. The experimental results on the considered datasets showed that our algorithm outperforms others in terms of detection rate. In addition, experiments show that the proposed method can detect multiple forgeries. The ease of using a CNN as a feature extractor makes it a good candidate solution for CMFD. As future work, the CNN architecture should be analyzed further, especially with regard to Siamese networks.