Development of stereo matching algorithm based on sum of absolute RGB color differences and gradient matching

This paper proposes a new stereo matching algorithm based on a local method. The Sum of Absolute Differences (SAD) algorithm produces accurate disparity maps in textured regions. However, it is sensitive to low-texture areas and to noise in image pairs with large differences in brightness and contrast. To overcome these problems, the proposed algorithm combines SAD computed over the RGB color channels with gradient matching, which improves accuracy on images with large brightness and contrast differences. In addition, an edge-preserving filter, the Bilateral Filter (BF), is used at the second stage. The BF handles low-texture areas, reduces noise and keeps object edges sharp. It is also robust against distortions caused by brightness and contrast differences. The proposed algorithm produces accurate results and performs better than some established algorithms. The comparison is based on the standard quantitative measurements of the Middlebury stereo benchmark evaluation.


INTRODUCTION
Computer vision is an interdisciplinary field that comprises methods for acquiring, processing, analyzing and understanding digital images or videos. It is a branch of artificial intelligence that aims to mimic the human visual system. Stereo vision is one part of this field, and the process used to obtain information for tasks such as object detection, recognition and depth estimation is called stereo matching. The process finds, for each point on the reference image, the corresponding point on the target image. Two or more images can be used. In this article, the input images come from a stereo camera and are referred to as stereo images. The matching algorithm produces a disparity map. This map carries depth information, which is valuable for many applications such as virtual reality [1], 3D surface reconstruction [2], face recognition [3] and robotics automation [4][5]. The stereo camera can be set up with a wide or a short baseline [6], depending on the application. To estimate range or distance, triangulation is applied to each pixel of the disparity map. Therefore, accurate depth or distance estimation depends on the matching process, which is complex and challenging, and each function in the proposed framework must be precise. Fundamentally, a matching algorithm consists of multiple stages, as proposed by Szeliski and Scharstein [7]. In the first stage, the matching cost computation produces the preliminary matching points of the stereo images. In the second stage, filtering is used to reduce the noise left by the first stage. Then, the disparity selection and optimization stage normalizes the disparity value of each pixel of the image. The last stage refines the final result and is also known as the disparity map post-processing step.
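The triangulation step mentioned above can be illustrated with the standard relation for a rectified stereo pair, Z = f·B/d, where f is the focal length in pixels, B is the baseline and d is the disparity. The following Python sketch is only an illustration under that assumption. The function name `disparity_to_depth` and the numeric values in the usage line are hypothetical and do not come from this article.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified pair; zero disparity maps to infinity."""
    disparity = np.asarray(disparity, dtype=float)
    depth = np.full(disparity.shape, np.inf)
    positive = disparity > 0          # only positive disparities can be triangulated
    depth[positive] = focal_px * baseline_m / disparity[positive]
    return depth

# Hypothetical camera: 700 px focal length, 0.5 m baseline.
depth_map = disparity_to_depth([[10.0, 0.0]], focal_px=700.0, baseline_m=0.5)
```

Because depth is inversely proportional to disparity, a one-pixel disparity error matters far more for distant objects than for near ones.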
In stereo matching development, there are two major approaches to building the algorithm framework: local methods, as published in [8][9][10], and global methods [11]. Local methods mostly use local properties or local content through window-based techniques such as fixed windows [12][13], adaptive windows [14], convolutional neural networks [15] and multiple windows [16]. Commonly, the Winner-Takes-All (WTA) strategy is applied for local optimization because of its low computational complexity and fast execution time [17][18][19]. The local method implemented in [20] uses a plane fitting technique to increase the accuracy at the final stage. This technique, known as RANSAC, works efficiently on low-texture areas. However, errors still occur at object edges. The method also requires several iterations of plane fitting, and an unsuitable number of iterations affects the results. In general, local methods run fast but have low accuracy at edges due to improper selection of window sizes. Hence, obtaining accurate results with a local approach remains a challenge for researchers.
The other approach to producing the disparity map is global optimization. Fundamentally, this approach uses an energy-based function known as a Markov Random Field (MRF). Global optimization methods such as Belief Propagation (BP) [21] and Graph Cut (GC) [22] produce accurate results. The calculation for each pixel of interest requires the energy of all pixels in the disparity map. Neighboring pixels are evaluated using maximum flow, and the selection is made based on the minimum cut energy on the disparity map. Algorithms based on global optimization normally have high computational requirements because the energy of all pixels must be calculated and absorbed. Global methods are also iterative, which increases the execution time of each disparity map reconstruction.
This article aims to produce accurate results that are competitive with some established methods. The first stage is implemented using an improved Sum of Absolute Differences (SAD) [23] with gradient matching. The second stage uses an edge-preserving filter known as the Bilateral Filter (BF) [24], which removes noise while preserving object edges. The third stage is optimization based on the WTA strategy. At the last stage, the BF is applied once again to remove unwanted or remaining invalid pixels and to increase the accuracy at object boundaries.

RESEARCH METHOD
The diagram of the proposed work is displayed in Figure 1. The stereo matching algorithm starts with STEP 1, which produces the preliminary disparity map. An improved SAD is proposed in which a weighting technique is applied to the block matching process. The combination of the improved SAD with gradient matching is intended to increase the effectiveness and accuracy of the correspondence process. Then, at STEP 2, the BF is used to reduce noise and preserve object edges. The BF efficiently removes noise in low-texture regions and sharpens object boundaries. At STEP 3, the optimization uses the WTA strategy, which normalizes the floating-point numbers and selects, for each pixel, the disparity with the minimum cost. The final stage, STEP 4, also uses the BF, this time applied to the disparity values. This filter is a nonlinear filter and is capable of improving the final disparity map.

Matching cost computation
The first stage of the proposed framework uses the weighted SAD and produces the preliminary disparity map. Hence, a robust function must be used to increase the effectiveness of the disparity map. Matching problems in low-texture regions must be kept to a minimum at this stage. A weight is therefore introduced in the SAD to improve the values in low-texture regions. A consistent weight in these regions makes the matching process more accurate and reduces mismatched or invalid pixels. The RGB values are used with the weighted sum of intensity differences between the right image I_r and the left image I_l, which is given by (1), where (x, y) are the coordinates of the pixel of interest, d is the disparity value, W is the proposed weight, i indexes the RGB channels and w is the kernel (window) of the SAD algorithm. The second part is the gradient matching component, which contains the gradient magnitude differences between the two images. The gradient is calculated in two directions. The horizontal direction G_x and the vertical direction G_y are given by (2) and (3) respectively, where I_m is the input image and * represents the convolution operation. The gradient magnitude for m, computed from G_x and G_y, is given by (4), and (5) is the gradient matching kernel G(x, y, d).
The matching cost function at this stage is given by (6), in which the input volumes SAD(x, y, d) and G(x, y, d) are combined.
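The following Python sketch illustrates how a color SAD term and a gradient matching term can be combined into a single cost volume. It is a minimal sketch and not the implementation used in this article. It assumes that the two terms are simply added, that the weight is a single scalar `w_sad`, that gradients are taken with central differences (`np.gradient`) on the channel mean instead of a specific convolution kernel, and that both terms are summed over the same square window. All function and parameter names are hypothetical.

```python
import numpy as np

def box_sum(img, radius):
    """Sum over a (2r+1)x(2r+1) window, using edge padding and an integral image."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    integral = np.pad(padded, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    return (integral[k:, k:] - integral[:-k, k:]
            - integral[k:, :-k] + integral[:-k, :-k])

def shift_right(img, d):
    """Shift columns right by d pixels, replicating the first column."""
    if d == 0:
        return img.copy()
    out = np.empty_like(img)
    out[:, d:] = img[:, :-d]
    out[:, :d] = img[:, :1]
    return out

def gradient_magnitude(gray):
    """Gradient magnitude from central differences (assumed gradient operator)."""
    gy, gx = np.gradient(gray)
    return np.sqrt(gx ** 2 + gy ** 2)

def matching_cost(left, right, max_disp, radius=4, w_sad=1.0):
    """left, right: float RGB images (H, W, 3). Returns a cost volume (H, W, max_disp)."""
    h, w, _ = left.shape
    g_left = gradient_magnitude(left.mean(axis=2))
    g_right = gradient_magnitude(right.mean(axis=2))
    cost = np.empty((h, w, max_disp))
    for d in range(max_disp):
        # Align right[x - d] with left[x] for the candidate disparity d.
        shifted = shift_right(right, d)
        sad = np.abs(left - shifted).sum(axis=2)           # sum over R, G, B
        grad = np.abs(g_left - shift_right(g_right, d))    # gradient magnitude difference
        cost[:, :, d] = box_sum(w_sad * sad + grad, radius)
    return cost
```

With `radius=4` the window is 9×9, which matches the SAD kernel size w reported in the experiments. The gradient term depends on local intensity changes and is unaffected by a constant brightness offset between the two views, which is the usual motivation for adding it to an intensity-based cost.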

Cost aggregation
The second stage filters the preliminary disparity map from stage one. The preliminary disparity map normally contains a high level of noise, which must be removed. Some invalid and uncertain pixels are produced during the matching process. Hence, the filter at this stage must be robust, able to remove the noise of invalid pixels while preserving object boundaries. The BF is used because it strongly preserves object edges and at the same time efficiently removes noise, especially in plain-color and low-texture regions. (7) is the BF function used in this article.
where p is the location of the pixel of interest at (x, y), and w_B and q are the window size of the BF and the neighboring pixels, respectively. σ_s denotes the spatial adjustment factor and σ_c is the similarity factor for color. The term |p − q| is the spatial Euclidean distance and |I_p − I_q| denotes the Euclidean distance in color space. Hence, (8) is the cost aggregation function of the BF with the matching cost computation as input.
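The aggregation described above can be sketched in Python as a bilateral filter applied to every disparity slice of the cost volume, with the weights computed from the reference image. This is a sketch under assumptions and not the exact form of (7) and (8). It assumes Gaussian weights of the form exp(−distance² / (2σ²)) for both the spatial and the color term, a normalized weighted average, and a guide image with values in [0, 1]. The function name `bilateral_aggregate` is hypothetical.

```python
import numpy as np

def bilateral_aggregate(cost, guide, radius=5, sigma_s=17.0, sigma_c=0.4):
    """cost: (H, W, D) cost volume; guide: (H, W, 3) float reference image."""
    h, w, _ = cost.shape
    pad = ((radius, radius), (radius, radius), (0, 0))
    cost_p = np.pad(cost, pad, mode="edge")
    guide_p = np.pad(guide, pad, mode="edge")
    num = np.zeros_like(cost, dtype=float)
    den = np.zeros((h, w, 1))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ys = slice(radius + dy, radius + dy + h)
            xs = slice(radius + dx, radius + dx + w)
            # Spatial weight depends only on the offset between p and q.
            spatial = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            # Color weight depends on the squared color distance between p and q.
            colour_dist2 = ((guide - guide_p[ys, xs]) ** 2).sum(axis=2, keepdims=True)
            weight = spatial * np.exp(-colour_dist2 / (2.0 * sigma_c ** 2))
            num += weight * cost_p[ys, xs]
            den += weight
    return num / den
```

The defaults `radius=5`, `sigma_s=17.0` and `sigma_c=0.4` correspond to the 11×11 window and the σ_s and σ_c values reported in the experiments. Neighbors whose color differs strongly from the pixel of interest receive a weight close to zero, so costs are not averaged across object edges.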

Disparity optimization
This stage optimizes the disparity values of the disparity map. The normalization selects, for each pixel, the disparity with the minimum floating-point cost, for which the WTA strategy is adopted in this article. The WTA is commonly used in local methods due to its fast implementation. The WTA function is given by (9):
$$d_{x,y} = \arg\min_{d \in D} C(p, d) \qquad (9)$$
where D represents the set of valid disparity values for an image and C(p, d) denotes the aggregated cost from the second stage. After this stage, the disparity map still contains noise or invalid pixels. Thus, the map needs to be improved, and the last stage removes the remaining noise.
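In code, the WTA selection in (9) reduces to an argmin along the disparity axis of the aggregated cost volume. The short Python sketch below assumes that the volume is stored as an array of shape (H, W, D) whose last axis indexes the candidate disparities. The function name is hypothetical.

```python
import numpy as np

def winner_takes_all(cost_volume):
    """Return, per pixel, the disparity index with the minimum aggregated cost."""
    return np.argmin(cost_volume, axis=2)

# Toy volume with one row, two pixels and three candidate disparities.
toy = np.array([[[3.0, 1.0, 2.0],
                 [0.5, 4.0, 4.0]]])
disparity = winner_takes_all(toy)
```

Each pixel is decided independently of its neighbors, which explains both the speed of WTA and the residual noise that the refinement stage has to remove.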

Disparity refinement
The last stage of the algorithm framework is the refinement or post-processing stage. It consists of several consecutive processes: handling the occluded regions, filling the invalid pixels and filtering the final disparity map. A left-right consistency check is conducted to identify occluded areas and invalid pixels. These invalid pixels are then replaced with valid pixel values through the filling process. Remaining artifacts and unwanted pixels are removed using the BF, which at the same time preserves the object boundaries. The BF smooths the final disparity map as indicated by (7).
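The consistency check and the filling process can be sketched in Python as follows. This is an illustrative sketch, not the procedure used in this article. It assumes integer disparity maps computed for both views, a tolerance `tol` of one disparity level, and a filling rule that copies the nearest valid disparity on the same row. The function names and the tolerance are hypothetical, and the final BF smoothing is not included.

```python
import numpy as np

def left_right_check(disp_left, disp_right, tol=1):
    """Return True where the left and right disparities agree within tol."""
    h, w = disp_left.shape
    xs = np.arange(w)[None, :].repeat(h, axis=0)
    x_right = xs - disp_left                  # matching column in the right view
    inside = x_right >= 0                     # matches outside the image are invalid
    x_clipped = np.clip(x_right, 0, w - 1)
    rows = np.arange(h)[:, None]
    diff = np.abs(disp_left - disp_right[rows, x_clipped])
    return inside & (diff <= tol)

def fill_invalid(disp, valid):
    """Replace invalid pixels with the nearest valid disparity on the same row."""
    out = disp.copy()
    for y in range(out.shape[0]):
        good = np.flatnonzero(valid[y])
        if good.size == 0:
            continue                          # no reliable pixel on this row
        bad = np.flatnonzero(~valid[y])
        nearest = good[np.abs(bad[:, None] - good[None, :]).argmin(axis=1)]
        out[y, bad] = disp[y, nearest]
    return out

disp_l = np.array([[2, 2, 2, 0, 2, 2]])
disp_r = np.array([[2, 2, 2, 2, 2, 2]])
valid = left_right_check(disp_l, disp_r)
filled = fill_invalid(disp_l, valid)
```

In the toy row, the first two pixels have no counterpart inside the right view and the fourth pixel disagrees with the right disparity map, so all three are marked invalid and then filled from valid neighbors.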

RESULT AND ANALYSIS
This section presents the disparity map results, which are represented using a color-scale intensity. The different color tones show that the respective objects are mapped according to their disparity values and their distance from the sensor (i.e., the stereo camera). Most probably, a lighter intensity indicates that the object is closer to the sensor. The experiments were executed on a personal computer with Windows 10, a 3.2 GHz processor and 8 GB of RAM. The input images are from the Middlebury stereo evaluation dataset [24], which contains 15 standard images, and the results must be submitted online. These images are very complex, and each image has different characteristics and properties such as light settings, object depths, incoherent regions, different resolutions and low-texture areas. The values of {w, σ_s, σ_c, w_B} are {9×9, 17, 0.4, 11×11}. Figure 2 shows a sample Jadeplant image pair (i.e., left and right) from the Middlebury training dataset with different brightness and high contrast. Due to the brightness difference, these input images are very challenging to match, because the same corresponding point has different pixel values in the two images. However, the proposed algorithm correctly recovers the disparity locations. The disparity contour levels are precisely assigned and the object distances are well recognized. Figure 3 shows the final disparity maps of the 15 training images from the Middlebury dataset. The accuracy attributes for error evaluation are the nonocc (non-occluded) error and the all error. The nonocc error is evaluated on the non-occluded regions of the disparity map, while the all error is evaluated on all pixels of the disparity map. Among these 15 images, Pipes and Jadeplant are the most difficult to match, as they comprise several piping lines and leaves of different sizes, respectively. Yet, the proposed algorithm reconstructs almost accurate disparity maps with clear discontinuity regions.
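The two error attributes can be illustrated with a bad-pixel rate, that is, the percentage of evaluated pixels whose disparity differs from the ground truth by more than a threshold. The Python sketch below is an assumption about the form of the metric and is not the Middlebury evaluation code. The threshold value of 2.0 and the function name are hypothetical. Passing a non-occlusion mask gives a nonocc-style error, and passing no mask gives an all-style error.

```python
import numpy as np

def bad_pixel_rate(disp, gt, mask=None, threshold=2.0):
    """Percentage of evaluated pixels whose absolute disparity error exceeds threshold."""
    if mask is None:
        mask = np.ones(gt.shape, dtype=bool)   # evaluate every pixel ("all")
    errors = np.abs(disp - gt)[mask]
    return 100.0 * np.mean(errors > threshold)

disp = np.array([[10.0, 12.0, 20.0, 7.0]])
gt = np.array([[10.0, 10.0, 10.0, 10.0]])
nonocc = np.array([[True, True, True, False]])  # last pixel treated as occluded
all_error = bad_pixel_rate(disp, gt)
nonocc_error = bad_pixel_rate(disp, gt, nonocc)
```

In this toy row the all error is 50% and the nonocc error is about 33.3%, because the occluded pixel with a wrong disparity is excluded from the second evaluation.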
Fundamentally, it is difficult and very challenging to obtain accurate corresponding points on the real images from the Middlebury dataset. The dataset was developed to test the robustness of an algorithm in cases where the same corresponding point may have different pixel values. Additionally, each image has different characteristics such as plain-color objects, shadows, discontinuity regions and occluded areas.
Referring to Figure 3, the disparity maps of low-texture surfaces such as Motorcycle, MotorcycleP, Playtable and PlaytableP are well reconstructed with distinct depth and disparity contours. Other regions that are difficult to match are plain-color objects and shadows, as in the ArtL, Recycle, Piano and PianoL images. These regions consist of similar pixel values, so the probability of wrong matches is very high. The disparity maps from the proposed work display almost accurate matching for these images. This shows that the proposed work obtains correct matching pixels in these regions and is robust against plain-color areas. The quantitative measurements from the Middlebury online results are given in Tables 1 and 2. These results are produced by the Middlebury online benchmarking evaluation system with the two error attributes explained above. Some established methods are also included in these tables to show the competitiveness of the proposed work. An average error is used to rank the results. In Table 1, the proposed work is ranked at the top compared to [15][16][17][19][25][26] for the nonocc error, with a weighted average error of 6.11%, and the Jadepl, Playrm and Vintge images produce the lowest errors. For the all error attribute in Table 2, the proposed work achieves 9.15%, which is the lowest average error. This shows that the proposed work is competitive with some established methods.
To verify the potential of the proposed algorithm, images from the KITTI dataset [27] are also tested. These images are more difficult and challenging to match. They contain complex edges and structures such as shadows, plain-color surfaces, areas with large contrast and brightness differences, and large untextured regions. The experimental results are shown in Figure 4. The disparity maps, shown in grayscale, display accurate disparity value estimation.
For reference, the signage, a cyclist, trees and cars are well reconstructed with correct disparity levels. This shows that the proposed work is capable of handling difficult stereo images from real environments.

CONCLUSION
In this work, the combination of the SAD algorithm based on RGB color with gradient matching produces accurate results. An edge-preserving filter is used at the second stage: the BF at the aggregation stage filters high noise and preserves the object boundaries of the preliminary disparity map. The WTA strategy is implemented at the optimization stage to normalize the floating-point numbers into disparity values. A second edge-preserving filter, the same BF, is used at the last stage of the proposed work. This nonlinear filter removes the remaining noise and improves the final disparity map. Overall, the edge-preserving filters used in the proposed framework remove noise, especially in low-texture regions, and preserve the object edges, as shown in Figure 2. The quantitative measurements from the standard Middlebury benchmark also demonstrate low average errors of 6.11% and 9.15% for the non-occluded and all-pixel errors, respectively. The results on the training images are shown in Figure 3. On real images from KITTI, the proposed work also demonstrates accurate results.