Compressor based approximate multiplier architectures for media processing applications

Received Jul 14, 2020; Revised Dec 19, 2020; Accepted Jan 12, 2021

Approximate computing is an attractive technique to gain substantial improvement in area, speed, and power in applications where exact computation is not required. This paper proposes two improved multiplier designs based on a new 4:2 approximate compressor circuit that simplifies the hardware at the partial product reduction stage. The proposed multiplier designs are targeted towards error-tolerant applications. Exhaustive error and hardware analysis has been carried out on the existing and proposed multiplier designs. The results show that the proposed approximate multiplier architecture performs better than the existing architectures without significant compromise on quality metrics. Experimental results show that die-area and power consumption are reduced by up to 28% and 25.29%, respectively, in comparison with the existing designs.


INTRODUCTION
Applications involving image and video processing have inherent error resilience and hence can tolerate computation errors [1][2][3]. Since the final output is interpreted by human sensory systems, particularly in image and video processing, inaccuracy in the final result can be tolerated up to a certain limit [4]. Most of these applications require a power-hungry multiplier [5,6] to carry out computations. Approximating this multiplier improves area, delay, and power without significant compromise on accuracy [7].
Multiplication is a ubiquitous arithmetic operation, and hence reducing its energy consumption leads to substantial improvements at the system level [8][9][10][11]. The criteria used to measure the performance of multipliers in digital systems are die-area, speed of operation, and power consumption. Thus, various approximate multiplier design techniques have been proposed at the logic level, which try to leverage error resilience to improve area, latency, and power [12][13][14].
Various types of approximate multiplier architectures have been reported in the literature to reduce computational complexity. The work in [15] proposed an under-designed multiplier by altering the Karnaugh map of the 2*2 sub-multiplier at the partial product generation stage. Mahdiani et al. [16] proposed a broken-array multiplier (BAM) by pruning either vertical or horizontal partial product columns. An approximate recursive Wallace tree multiplier with simple carry-in prediction was proposed in [17].
Momeni et al. [18] reported a multiplier with approximation at the compressor level to achieve lower area and power; however, it suffered from low accuracy. Yang et al. [19] reduced the error in the [...] [20] modified the compressor designed by Yang and incorporated a simple error correction circuit to improve accuracy further. Suganti et al. [21] simplified the partial product reduction (PPR) stage using an approximate half adder, full adder, and compressor circuit. Xilin et al. [22] simplified the PPR stage by deploying approximate compressor circuits. However, the designs in [20][21][22] tend to occupy more area and hence consume more power. This paper proposes two approximate multiplier designs using an efficient compressor circuit with the intent to minimize computational complexity without significant compromise on accuracy. The objectives of this work are summarized as follows:
- An improved approximate compressor is proposed that reduces the die-area and power of the multiplier designs.
- Two multiplier architectures based on the proposed approximate compressor are implemented.
The remainder of the paper is organized as follows. A description of the proposed approximate multiplier designs is presented in section 2. Section 3 provides the exhaustive error analysis and hardware synthesis of the proposed and existing designs. The impact of the proposed and existing multiplier architectures on image processing applications is investigated in section 4, while conclusions are provided in section 5.

PROPOSED WORK
The implementation of any multiplier involves three steps [23]. Partial products are generated in the first step, they are reduced to two rows in the second step, and these two rows are added to form the final product in the third step. Among these, the partial product reduction (PPR) step consumes more power than the other steps because it is a computationally intensive stage. The compressor is the essential module in the partial product reduction structure, and hence optimizing it results in a simpler reduction structure. A new approximate 4:2 compressor is presented in this work that, when deployed in multiplier architectures, results in improved area and performance.

Figure 1(a) shows an exact 4:2 compressor [24], which comprises five inputs and three outputs. Intuitively, an exact 4:2 compressor is obtained by cascading two 3:2 full adders, as shown in Figure 1(b). However, the exact compressor tends to consume more area and power. Since Cin and C1 contribute little to the final result, the associated circuit can be removed. In this work, this idea is used to design a 4:2 approximate compressor. Since Cin and C1 are ignored, the remaining four inputs B1, B2, B3, and B4 produce two outputs, Sum and C0. The efficacy of the proposed design can be assessed by evaluating the compressor output for all possible input combinations. Accordingly, Table 1 presents the statistics about the correctness of Sum and C0 for the different inputs. The proposed design produces the correct output in all cases but one, as indicated by a difference of zero in the last column of Table 1. The only incorrect result occurs for the "1111" input combination, which is highlighted in grey in Table 1. Weighting each input combination by its probability of occurrence, the error rate of the proposed compressor is 1/256, i.e., 0.0039.
Table 1

Inputs      Exact value   Probability   Sum   C0   Approximate value   Difference
0 0 0 0     0             81/256        0     0    0                   0
0 0 0 1     1             27/256        1     0    1                   0
0 0 1 0     1             27/256        1     0    1                   0
...
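The behaviour described above can be checked with a small behavioural model. The following is a minimal sketch, not the authors' circuit: `exact_compressor` cascades two full adders as in Figure 1(b), and `approx_compressor` is a hypothetical stand-in that is exact whenever at most three inputs are '1'. The value it returns for "1111" (Sum = 1, C0 = 1) is an assumption, since the Boolean equations of the proposed design are not reproduced here. The assignment of the names C0 and C1 to the two full-adder carries is likewise assumed.

```python
from itertools import product


def full_adder(a, b, c):
    """Exact 3:2 full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def exact_compressor(b1, b2, b3, b4, cin):
    """Exact 4:2 compressor built from two cascaded full adders.

    Returns (sum, c0, c1); the encoded value is sum + 2 * (c0 + c1).
    Which carry is called C0 and which C1 is an assumption.
    """
    s1, c1 = full_adder(b1, b2, b3)
    s, c0 = full_adder(s1, b4, cin)
    return s, c0, c1


def approx_compressor(b1, b2, b3, b4, saturated=(1, 1)):
    """Hypothetical behavioural stand-in for the approximate compressor.

    Returns (sum, c0). It is exact for zero to three ones; for "1111" it
    returns `saturated`, an assumed value, because two output bits with
    weights 1 and 2 cannot encode the exact value 4.
    """
    ones = b1 + b2 + b3 + b4
    if ones == 4:
        return saturated
    return ones & 1, ones >> 1


# Collect every input combination whose encoded output differs from the
# true number of ones among B1..B4.
wrong_inputs = []
for bits in product((0, 1), repeat=4):
    s, c0 = approx_compressor(*bits)
    if s + 2 * c0 != sum(bits):
        wrong_inputs.append(bits)

print(wrong_inputs)  # [(1, 1, 1, 1)]
```

Whatever value is chosen for `saturated`, the "1111" case stays erroneous, because Sum + 2*C0 is at most 3.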

Proposed 4:2 approximate compressor
The error probability P(e) of the proposed approximate 4:2 compressor is calculated as follows. Let p1, p2, p3, and p4 be the probabilities that the input bits B1, B2, B3, and B4 of the 4:2 compressor are '1'. Since the only erroneous input combination is "1111", the probability of an error occurring in the proposed approximate 4:2 compressor is written as (1).

P(e) = p1 * p2 * p3 * p4 (1)
The partial products generated using AND gates form the inputs B1, B2, B3, and B4. The probability that an individual bit of the multiplier or multiplicand is '1' is 0.5. Therefore, the probability that an input bit to the compressor is '1' is 0.25. The probability of an error in the proposed approximate 4:2 compressor is then P(e) = 0.25*0.25*0.25*0.25, i.e., 0.0039. The logic circuit of the proposed 4:2 compressor, derived from the truth table, is given in Figure 2 and is described by the Boolean expressions (2) and (3).
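The probability argument can be reproduced with exact fractions. The sketch below assumes, as the text does, that the four compressor inputs are independent and that each is '1' with probability 0.25; the helper name `pattern_probability` is introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product

# An AND-gate partial product is 1 only when both operand bits are 1,
# each of which happens with probability 0.5.
P_ONE = Fraction(1, 2) * Fraction(1, 2)


def pattern_probability(bits, p_one=P_ONE):
    """Probability of one input pattern, assuming independent input bits."""
    prob = Fraction(1)
    for bit in bits:
        prob *= p_one if bit else 1 - p_one
    return prob


probabilities = {bits: pattern_probability(bits)
                 for bits in product((0, 1), repeat=4)}

# The only erroneous pattern is "1111", so P(e) is its probability.
error_probability = probabilities[(1, 1, 1, 1)]

print(probabilities[(0, 0, 0, 0)])  # 81/256
print(probabilities[(0, 0, 0, 1)])  # 27/256
print(error_probability)            # 1/256
```

The first two printed values match the probability entries 81/256 and 27/256 that appear in Table 1.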

Proposed approximate multiplier architecture (D1)
The architecture of the proposed 8*8 approximate multiplier design, namely D1, is depicted in Figure 3, where the generated partial products are shown as solid dots. The PPR tree requires two levels (Level 1 and Level 2) and is divided into two regions: a four-bit truncation region and an 11-bit approximate region. The approximate region consists of the proposed 4:2 approximate compressors together with full and half adders. The partial product tree is reduced to two rows using the compressor logic, and these rows are added to form the final product using a ripple carry adder (RCA).

Another variant of the multiplier, namely D2, is obtained by reducing the number of approximate columns in the reduction tree. In D1, truncation or approximation is applied to every partial product column in the reduction tree, whereas in D2 truncation is applied to the four least significant columns and approximation to the next N-4 columns of an N-bit multiplier. The remaining partial products in the most significant columns are compressed by exact logic.
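The column structure described above can be illustrated with a short sketch. It only groups the AND-gate partial products by column weight and models truncation of the four least significant columns; the reduction tree of Figure 3 and the approximate compressors are not modelled, so `truncated_product` is a simplified stand-in rather than D1 or D2. All function names are hypothetical.

```python
N = 8          # operand width of the 8*8 multiplier
TRUNCATED = 4  # number of least significant columns that are truncated


def partial_product_columns(a, b, n=N):
    """Group the partial products a_i AND b_j by column weight i + j."""
    columns = [[] for _ in range(2 * n - 1)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return columns


def truncated_product(a, b, n=N, truncated=TRUNCATED):
    """Sum all columns exactly, except the truncated ones, which are dropped."""
    columns = partial_product_columns(a, b, n)
    return sum(sum(column) << weight
               for weight, column in enumerate(columns)
               if weight >= truncated)


# Column heights of the 8*8 partial product array.
heights = [len(column) for column in partial_product_columns(0, 0)]
print(heights)       # [1, 2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2, 1]
print(len(heights))  # 15 columns: 4 truncated + 11 remaining
```

With `truncated=0` the function returns the exact product, which gives a quick sanity check of the column grouping.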

Error analysis
Detailed error analysis of various multiplier architectures, including the proposed designs, was carried out in MATLAB for all input combinations of the 8-bit operands (65,536 cases), and the computed results are tabulated in Table 2. Quality metrics such as mean relative error distance (MRED) and normalized mean error distance (NMED) are used to quantify the accuracy of the existing and proposed approximate designs. The error distance (ED) is the absolute difference between the exact and approximate results, as given in (4).

ED = |R'-R| (4)
where R' is the approximate result, and R is the accurate result [25]. The average of all EDs is the mean error distance (MED), while MRED is the mean of all relative error distances (REDs), where RED is calculated using (5).

RED = ED/R (5)
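The metrics in (4) and (5) can be evaluated exhaustively in a few lines. The sketch below rests on stated assumptions: NMED is taken as MED divided by the maximum exact product, (2^n - 1)^2, which is a common convention but is not defined in the text; cases with R = 0 are skipped when averaging RED, since RED is undefined there; and `truncating_multiplier` is a hypothetical stand-in, not one of the proposed designs.

```python
def error_metrics(approx_mul, n=8):
    """Exhaustively compute MED, NMED and MRED for an n-bit multiplier model."""
    max_product = (2 ** n - 1) ** 2
    total_ed = 0
    total_red = 0.0
    nonzero_cases = 0
    cases = 0
    for a in range(2 ** n):
        for b in range(2 ** n):
            exact = a * b
            ed = abs(approx_mul(a, b) - exact)  # ED = |R' - R|, equation (4)
            total_ed += ed
            cases += 1
            if exact != 0:                      # RED = ED / R, equation (5)
                total_red += ed / exact
                nonzero_cases += 1
    med = total_ed / cases
    return {
        "cases": cases,
        "MED": med,
        "NMED": med / max_product,          # assumed normalisation
        "MRED": total_red / nonzero_cases,  # averaged over R != 0 only
    }


def truncating_multiplier(a, b):
    """Hypothetical stand-in: exact product with its four low bits cleared."""
    return (a * b) & ~0xF


metrics = error_metrics(truncating_multiplier)
print(metrics["cases"])  # 65536
print(metrics["NMED"], metrics["MRED"])
```

Any behavioural model of an approximate multiplier with the signature `f(a, b)` can be passed in place of the stand-in.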
From Table 2, it is evident that the proposed design D2 is more accurate, since it has better NMED and MRED than the existing designs. D2 achieves better accuracy because approximation and truncation are confined to the least significant 8 columns of the partial product reduction tree. D2 achieves an improvement of up to 93.5% in MRED compared to the existing designs. Similarly, the NMED of D2 is up to 98.9% better than that of the existing designs.

Synthesis results
For the sake of fair analysis, all the 8-bit multiplier schemes [18][19][20][21][22], including the proposed ones, have been modeled structurally in the Verilog hardware description language. Simulation has been performed using the Cadence Incisive Unified Simulator v6.1. Hardware synthesis of the approximate multiplier designs has been carried out using Cadence RTL Compiler v7.1 with the TSMC 180 nm process node (slow-normal library) to compute area, delay, and power.
It can be observed from Table 3 that the proposed designs D1 and D2 are faster than the existing designs except for Minho. D1 consumes less area than all the existing designs because approximation is introduced in all the columns of the partial product reduction stage. Further, D1 achieves lower power consumption than the existing designs except for Multiplier1. The improvement in area and power of D1 is up to 28% and 25.29%, respectively, compared to the existing designs. Similarly, D2 takes less area than the existing designs except for Minho and Momeni. Also, D2 has lower power consumption than the existing designs except for Multiplier1 [21], Xilin [22], and Momeni [18]. Though a few existing designs [18,21,22] perform marginally better in area and power than D2, they suffer from a lower peak signal to noise ratio, as shown in Table 4. The improvement in area and power of D2 is up to 17.21% and 14.15%, respectively, compared to the existing designs.

Figure 4 presents the power-delay product (PDP) of the various multipliers, including D1 and D2. It can be observed that D1 and D2 have a lower PDP than all the existing multipliers except for Multiplier1. The improvement of D1 in PDP is up to 29.69% compared to the existing designs. Similarly, D2 shows up to 19.2% improvement in PDP in comparison with the existing designs.
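The arithmetic behind the PDP and percentage figures is simple enough to sketch. The numeric values below are placeholders chosen for illustration only; they are not taken from Table 3 or Figure 4, and the function names are hypothetical.

```python
def power_delay_product(power, delay):
    """PDP is the product of power consumption and critical-path delay."""
    return power * delay


def percent_improvement(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Placeholder values (NOT measured data): power in uW, delay in ns.
baseline_pdp = power_delay_product(power=100.0, delay=4.0)
proposed_pdp = power_delay_product(power=80.0, delay=3.5)

print(percent_improvement(baseline_pdp, proposed_pdp))  # 30.0
```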

PERFORMANCE OF APPROXIMATE MULTIPLIERS IN IMAGE PROCESSING APPLICATIONS
As discussed in the preceding subsection, the proposed designs offer significant benefits in terms of accuracy, area, and power. In this section, proposed designs D1 and D2 are evaluated on real-time image sharpening [26] and image multiplication [27] applications.

Image multiplication
In this section, the proposed designs D1 and D2 are evaluated using image multiplication. Image multiplication is a natural candidate for validating the multipliers because it uses direct multiplication. It accepts two images and generates an output image. For example, let P1 and P2 be the input images; then the output image X is expressed as (6).
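A minimal sketch of image multiplication follows. It assumes the usual pixel-wise form, X(x,y) = P1(x,y) * P2(x,y), for equally sized 8-bit greyscale images stored as nested lists; the `mul` parameter is a hypothetical hook through which a behavioural model of an approximate multiplier can be supplied.

```python
def multiply_images(p1, p2, mul=lambda a, b: a * b):
    """Pixel-wise product of two equally sized greyscale images.

    `mul` defaults to exact multiplication; pass an approximate multiplier
    model with the same (a, b) signature to emulate an approximate design.
    """
    if len(p1) != len(p2) or any(len(r1) != len(r2) for r1, r2 in zip(p1, p2)):
        raise ValueError("images must have the same dimensions")
    return [[mul(a, b) for a, b in zip(row1, row2)]
            for row1, row2 in zip(p1, p2)]


image1 = [[1, 2], [3, 4]]
image2 = [[5, 6], [7, 8]]
print(multiply_images(image1, image2))  # [[5, 12], [21, 32]]
```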
To quantify the results and measure the performance of the approximate multipliers, the structural similarity index (SSIM) and peak signal to noise ratio (PSNR) are used. Table 4 presents the PSNR and SSIM values of the different approximate multipliers. From Table 4, it is evident that D2 achieves better PSNR and SSIM than the existing multipliers due to its low NMED and MRED. D1 achieves better PSNR than Multiplier1. Existing designs [18][19][20][21][22] have better PSNR and SSIM than D1; however, they require more area, power, and delay. Figure 5 shows the multiplication of images processed using the exact and proposed multipliers. The images obtained using the exact multiplier, D1, and D2 look almost identical.

Figure 5. Images obtained by using exact and proposed multipliers, (a) Image1 [27], (b) Image2 [27], (c) Exact output, (d) D1 output, (e) D2 output
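PSNR can be computed from the mean squared error between the exact and approximate output images. The sketch below uses the standard definition, PSNR = 10 log10(MAX^2 / MSE); the peak value `max_value` is left as a parameter because the output scaling used for Table 4 is not stated in the text. SSIM is omitted for brevity.

```python
import math


def psnr(reference, test, max_value):
    """Peak signal to noise ratio in dB between two equally sized images."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for ref_pixel, test_pixel in zip(ref_row, test_row):
            squared_error += (ref_pixel - test_pixel) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


exact_output = [[0, 255], [128, 64]]
approx_output = [[0, 254], [128, 64]]
print(psnr(exact_output, exact_output, max_value=255))   # inf
print(psnr(exact_output, approx_output, max_value=255))
```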

Image sharpening
Image sharpening improves the visual quality of an image. The sharpening algorithm accepts an image P and processes it using 5*5 kernels to produce a high-quality image [26]. Let P be the input image; then the output image Z can be expressed as (7).

Z(x,y) = P(x,y) - R (7)

where: [...] and H is a matrix defined as: [...]

Figure 6 shows the comparison results for the 8*8 approximate multipliers, obtained by considering MRED, NMED, SSIM, and PSNR with respect to the power-delay product (PDP). From Figures 7(a) and 7(b) (see appendix), it is evident that D2 has a lower PDP with better MRED and NMED than the existing designs. Similarly, from Figures 7(c) and 7(d) (see appendix), D2 has a lower PDP and better SSIM and PSNR than the existing designs. Finally, it can be concluded that D2 obtains better PSNR, SSIM, MRED, and NMED with a lower PDP than the existing designs except for Multiplier1, while D1 has a lower PDP than the existing designs except for Multiplier1, with good SSIM and PSNR.

CONCLUSION
Approximate computing aims to leverage the error tolerance of image and video processing applications. In this paper, two multiplier designs are presented that reduce computational complexity. The approximate multiplier designs comprise an approximate compressor that simplifies the hardware at the PPR stage, thereby improving the delay and power consumption compared to existing works. Comprehensive error analysis and synthesis results demonstrate the efficacy of the proposed designs in comparison with existing designs. Experimental results show that die-area and power consumption are reduced by up to 28% and 25.29%, respectively, in comparison with the existing designs. Finally, the impact of the proposed designs on image processing applications is investigated, and the results show that the proposed multiplier designs achieve a better trade-off between computation quality and effort than existing designs.