Design of efficient reversible floating-point arithmetic unit on field programmable gate array platform and its performance analysis

Reversible logic gates are used to reduce power dissipation in modern computing applications. Floating-point numbers combined with reversible logic offer an added advantage when performing complex algorithms that require high-performance computation. This manuscript implements an efficient reversible floating-point arithmetic (RFPA) unit and realizes its performance metrics in detail. The RFP adder/subtractor (A/S), RFP multiplier, and RFP divider units are designed as parts of the RFPA unit, which is built from basic reversible gates. The mantissa part of the RFP multiplier is created using a 24x24 Wallace tree multiplier, while the reciprocal unit of the RFP divider is designed using the Newton-Raphson method. The RFPA unit and its submodules operate in parallel, each using one clock cycle. The RFPA unit and its submodules are synthesized separately in the Vivado IDE environment, and implementation results are obtained on an Artix-7 field programmable gate array (FPGA). The RFPA unit utilizes only 18.44% of the slice look-up tables (LUTs) and consumes 0.891 W of total power on the Artix-7 FPGA. The RFPA unit submodules are compared with existing approaches and show better performance metrics and chip resource utilization.


INTRODUCTION
Low-power design is the prime factor when designing a high-performance very large-scale integrated (VLSI) system. Speed and dynamic range must also be considered while creating low-power designs that fit portable devices. High integration density and high speed are combined to achieve high-throughput computation in portable devices with a significant reduction in heat dissipation. The semiconductor industry has made enormous progress in improving chip area, power, and timing by optimizing the electronic architecture. One effective way to avoid power loss in VLSI circuits is to use reversible gates. A reversible logic gate maps each input vector to a unique output vector, so the inputs can be recovered from the outputs and vice versa. Reversible logic gates are used in quantum computing, communications, low-power applications, digital signal processing (DSP), and many other areas to reduce power dissipation. The power reduction achieved with reversible logic depends on the number of reversible gates used (gate count, GC), the quantum cost (QC), the number of constant inputs (CI), and the number of garbage outputs (GO) generated [1], [2].
Nachtigal et al. [18] elaborate on an RFP multiplier architecture using an operand decomposition mechanism. The design uses a Wallace tree multiplier for mantissa calculation with the help of 4:2 compressor units. The RFP multiplier achieves a delay of 912 and a QC of 6,957. Jenath and Nagarajan [19] present an RFP multiplier module on FPGA. The work uses a standard conventional 24x24 multiplier for mantissa calculation. The designed multiplier is compared with the existing approach and shows improved QC and delay. Malathi et al. [20] present a cost-effective single-precision RFP multiplier. The design uses a TSG gate for cost optimization and adopts it for the 24x24 multiplier calculation. Arunachalam et al. [21] present an RFP multiplier with new gates. The work introduces three reversible gates for the sub-module designs used in the RFP multiplier and achieves a lower QC than the existing approach. Jain et al. [22] describe single-precision and double-precision RFP multipliers. Reversible 4:2 compressors are used in the multiplier design for mantissa calculation. The work realizes the performance metrics for both design modules and reports improvements over existing works.
Kamaraj and Marichamy [23] describe an RFP division architecture with fault-tolerant features. The design uses KMD gates, which support many logic functions with optimization. The restoring and non-restoring division units are designed as per the IEEE 754 standard. The work analyzes the performance metrics and reports improvements over existing approaches. Muñoz et al. [24] present an FP library for arithmetic units on FPGA. Arithmetic units such as the FP adder/subtractor, FP multiplier, FP divider, and FP square root are designed and realized on FPGA. The FP arithmetic core is analyzed for chip area, power, and trade-off dependence parameters on FPGA. Jamal and Babu [25] present an RFP divider using high-speed division array modules. The work realizes performance metrics such as QC, GO, and CI for 2-bit, 4-bit, 8-bit, and 16-bit divider modules. Gayathri et al. [26] elaborate on a single-precision RFP division unit with T-count optimization. The work uses restoring and non-restoring algorithms for the division unit design based on the quantum Clifford plus T gate set. In addition, an RFP division module using the Goldschmidt algorithm is designed. The results of the restoring, non-restoring, and Goldschmidt division algorithms are discussed in detail. Przybyl [27] discusses a scalable fixed-point arithmetic unit on FPGA for embedded applications. Real-number calculation is made available to real-time processing applications, providing faster and more efficient processing. Reversible logic gates are used extensively in processors for high-speed computation [28]-[31] and also in image processing [32] and video processing applications [33] for high-resolution outcomes.

REVERSIBLE GATES
Reversible gates can implement more than one operation and are modeled using the quantum gate library. The reversible gates and a few modules used in the RFP arithmetic unit design are illustrated in Figure 1 (in appendix). The Feynman gate (FG) is a 2x2 gate with a QC of 1, represented in Figure 1(a). The Fredkin gate (FR) is a 3x3 gate with a QC of 5, illustrated in Figure 1(b). The Peres gate (PG) is a 3x3 gate with a QC of 4, represented in Figure 1(c). The multiple controlled Fredkin (MCF) gate [1] is a 3x3 gate used to construct the reversible AND and OR gates, represented in Figures 1(d) and 1(e), respectively. The reversible AND and OR gates each have a QC of 3 with one CI and two GOs. The modified Fredkin gate (MFG) is a 3x3 gate with a QC of 4, represented in Figure 1(f). The reversible multiplexer (MUX) gate is designed using the MFG and is illustrated in Figure 1(g). The MUX gate produces two GOs with a QC of 4. The reversible half adder/subtractor (RHAS) is designed using two FGs and one PG. The RHAS uses one CI and two GOs with a QC of 6, represented in Figure 1(h). The reversible full adder/subtractor (RFAS) is designed using two FGs and two PGs. The RFAS uses one CI and three GOs with a QC of 10, represented in Figure 1(i).
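The defining property of these gates, a one-to-one mapping between input and output vectors, can be checked exhaustively in software. The following is a minimal Python sketch using the textbook truth functions of the Feynman, Fredkin, and Peres gates; the function names are illustrative, and the half adder shown is the plain Peres-gate construction with one constant input, not the RHAS of Figure 1(h).

```python
from itertools import product

def feynman(a, b):
    """Feynman gate (FG), 2x2: P = A, Q = A xor B."""
    return a, a ^ b

def fredkin(a, b, c):
    """Fredkin gate (FR), 3x3: A controls a swap of B and C."""
    return (a, c, b) if a else (a, b, c)

def peres(a, b, c):
    """Peres gate (PG), 3x3: P = A, Q = A xor B, R = (A and B) xor C."""
    return a, a ^ b, (a & b) ^ c

def is_reversible(gate, width):
    """A gate is reversible if distinct inputs always give distinct outputs."""
    inputs = list(product((0, 1), repeat=width))
    outputs = {gate(*bits) for bits in inputs}
    return len(outputs) == len(inputs)

def peres_half_adder(a, b):
    """Half adder from one Peres gate with constant input C = 0 (illustrative)."""
    _garbage, total, carry = peres(a, b, 0)
    return total, carry
```

With the constant input tied to 0, the third Peres output reduces to A and B, which is the carry, while the second output is the sum; the first output is a garbage output.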

REVERSIBLE FLOATING-POINT ARITHMETIC (RFPA) UNIT
The single-precision 32-bit floating-point (FP) number representation, as per the IEEE 754 format, is illustrated in Figure 2. It contains a 1-bit sign (bit 31), an 8-bit exponent (bits 30:23), and a 23-bit mantissa (bits 22:0). This representation is used throughout the RFP arithmetic unit and its sub-modules.
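The field layout can be made concrete with a short Python sketch that splits a single-precision value into its sign, exponent, and mantissa fields. The helper name `fp32_fields` is an illustrative assumption and is not part of the design.

```python
import struct

def fp32_fields(x):
    """Return (sign, exponent, mantissa) of x encoded as IEEE 754 single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # raw 32-bit pattern
    sign = bits >> 31                 # bit 31
    exponent = (bits >> 23) & 0xFF    # bits 30:23, biased by 127
    mantissa = bits & 0x7FFFFF        # bits 22:0, implicit leading bit not stored
    return sign, exponent, mantissa
```

For example, -6.5 equals -1.625 x 2^2, so its fields are a sign of 1, a biased exponent of 129, and a mantissa of 0x500000.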
The 32-bit RFP arithmetic unit architecture is illustrated in Figure 3. The RFPA unit consists of a 32-bit reversible multiplexer (MUX), RFP adder, RFP subtractor, RFP multiplier, and RFP divider units. The 2-bit mode is used as the select line of the MUX unit to choose the corresponding unit. The RFP adder output is chosen if the mode is "00". Similarly, the RFP subtractor output is selected for "01", the RFP multiplier output for "10", and the RFP divider output for "11". The submodules of the RFP arithmetic unit are described in detail in the following sections.
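The mode selection can be illustrated with a behavioural Python model. This is only a functional sketch, not the gate-level reversible design: it relies on the host's floating-point arithmetic rounded to single precision, the names `to_fp32` and `rfpa_model` are hypothetical, and mapping division by zero to NaN is an assumption made for simplicity.

```python
import struct

def to_fp32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

def rfpa_model(a, b, mode):
    """Behavioural model: mode 0 = add, 1 = subtract, 2 = multiply, 3 = divide."""
    a, b = to_fp32(a), to_fp32(b)
    # All four results are formed first, mirroring units that run in parallel;
    # the 2-bit mode then acts as the multiplexer select line.
    # Division by zero is mapped to NaN as a placeholder (assumption).
    # Results are assumed to stay within the single-precision range.
    outputs = (a + b, a - b, a * b, a / b if b != 0.0 else float("nan"))
    return to_fp32(outputs[mode & 0b11])
```

For instance, `rfpa_model(1.5, 2.25, 0b10)` selects the multiplier path and returns 3.375.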

32-bit RFP Adder/Subtractor
The 32-bit RFP adder/subtractor architecture is illustrated in Figure 4. The module uses reversible logic gates to build sub-modules such as an adder, subtractor, multiplexers, barrel shifters (left and right), and comparators. The 32-bit RFP adder/subtractor architecture is constructed using the following steps:
a. First [...]
e. If all the bits of EA and EB are zero, then the number is denormalized, and the implicit bit of the corresponding mantissa (MA or MB) is set to zero.
f. Perform the exponent difference calculation and shift the mantissa MA or MB according to the difference (EA > EB).
g. If the exponents are the same (EA == EB), then perform 24-bit addition using the mantissa data and the barrel-shifted data.
h. Finally, normalize and remove the implicit bit to form the final 23-bit mantissa output of the addition.
i. Find the maximum exponent value using EA and EB to generate the final 8-bit exponent.
j. Concatenate the final sign bit with the final 8-bit exponent and the final 23-bit mantissa to form the 32-bit RFP adder output.
k. For the subtraction operation, take the 2's complement of the right-shifted output and then add it to the mantissa data.
l. The sum is normalized and then shifted left using the exponent (EA) value.
m. The final 8-bit exponent and 23-bit mantissa values for the subtraction operation are generated after the normalization and shifting operations.
n. Finally, concatenate the final sign bit with the final 8-bit exponent and the final 23-bit mantissa to form the 32-bit RFP subtraction output.
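The addition path of these steps (swap, exponent difference, alignment, 24-bit addition, normalization, and concatenation) can be sketched in Python at the bit-field level. This is a simplified illustration under stated assumptions: both operands are normalized and share the same sign, the result is truncated rather than rounded, and infinity, NaN, and exponent overflow are not handled. All function names are hypothetical.

```python
import struct

def f2b(x):
    """Single-precision bit pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def b2f(bits):
    """Value of a single-precision bit pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def fp32_add_same_sign(a_bits, b_bits):
    """Add two normalized same-sign operands given as 32-bit patterns (truncating)."""
    # Swap so that A holds the operand with the larger magnitude.
    if (a_bits & 0x7FFFFFFF) < (b_bits & 0x7FFFFFFF):
        a_bits, b_bits = b_bits, a_bits
    sign = a_bits >> 31
    ea, eb = (a_bits >> 23) & 0xFF, (b_bits >> 23) & 0xFF
    # Restore the implicit leading 1 to obtain 24-bit mantissas.
    ma = (a_bits & 0x7FFFFF) | (1 << 23)
    mb = (b_bits & 0x7FFFFF) | (1 << 23)
    # Align the smaller operand by right-shifting by the exponent difference.
    mb_aligned = mb >> (ea - eb)
    total = ma + mb_aligned      # at most 25 bits
    exponent = ea                # maximum of EA and EB after the swap
    if total >> 24:              # carry out: shift right once and bump the exponent
        total >>= 1
        exponent += 1
    # Drop the implicit bit and concatenate sign, exponent, and mantissa.
    return (sign << 31) | (exponent << 23) | (total & 0x7FFFFF)
```

As a usage example, `b2f(fp32_add_same_sign(f2b(1.5), f2b(2.25)))` aligns 1.5 to the exponent of 2.25 and yields 3.75.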
In the RFP adder/subtractor module, the 31-bit comparator unit is designed using NOT gates, FGs, and PGs to generate the A less than B (A<B) signal. The swapping unit is built from two reversible 32-bit MUXs with A<B as the select line. Denormalized-number detection and implicit-bit insertion are performed using two reversible 8-bit OR reduction units and two 24-bit MUXs. The exponent difference calculation is performed using 8-bit reversible adder and subtractor units. The right-shifting operation is performed using a 24-bit reversible barrel shifter. The maximum exponent value is calculated using an 8-bit reversible adder and an

32-bit RFP multiplier
The 32-bit RFP multiplier architecture, as per the IEEE 754 standard, is illustrated in Figure 5. The 32-bit RFP multiplier module contains a reversible 24x24 Wallace tree multiplier, 8-bit adder/subtractor units, 24-bit adders, and a normalization unit. The 32-bit RFP multiplier architecture is constructed using the following steps:
a. First, check whether all the bits of exponents EA and EB are one; if yes, the number is either infinity or not a number (NaN), i.e., an exception.
b. If all the bits of EA and EB are zero, then the number is denormalized, and the implicit bit of the corresponding mantissa (MA or MB) is set to zero.
c. Perform 24-bit reversible Wallace tree multiplication of MA and MB to generate the 48-bit mantissa product.
d. The MSB of the mantissa product (47th bit) acts as the select line to determine the round-off bit.
− If the MSB of the mantissa product is 1, then the mantissa product is already normalized, and the next 23 bits after the MSB are taken.
− If the MSB of the mantissa product is 0, the next bit is always 1, so the 23 bits after that bit are taken. Hence, no shifting operation is needed.
− Add the round-off bit to the normalized result to generate the final 24-bit mantissa.

The 24x24 reversible Wallace tree multiplier uses nine 8-bit reversible Wallace tree multipliers and eight 48-bit reversible adder units. A single 8-bit reversible Wallace tree multiplier uses reversible AND gates, reversible half adders, and reversible full adders. The round-off bit is generated using two 24-bit reversible OR reduction operations and one multiplexer unit. The normalization unit is constructed using a 24-bit reversible MUX unit and one 24-bit reversible adder to generate the 24-bit mantissa. The EA+EB-127 operation uses an 8-bit reversible adder and a 9-bit reversible adder to generate the final 8-bit exponent.
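The decomposition of the 24x24 product into nine 8x8 partial products summed by eight additions, followed by the MSB-controlled mantissa selection, can be sketched in Python as follows. The sketch works on integers rather than reversible gates, truncates instead of adding the round-off bit, and assumes normalized operands. Incrementing the exponent when the product MSB is 1 is standard floating-point practice and is an assumption here, since the text only states the EA+EB-127 operation. Function names are hypothetical.

```python
def mul24_by_blocks(ma, mb):
    """24x24 product built from nine 8x8 partial products and eight additions."""
    a_blocks = [(ma >> (8 * i)) & 0xFF for i in range(3)]
    b_blocks = [(mb >> (8 * j)) & 0xFF for j in range(3)]
    # Each 8x8 product is weighted by the position of its two blocks.
    partials = [(a_blocks[i] * b_blocks[j]) << (8 * (i + j))
                for i in range(3) for j in range(3)]
    product = partials[0]
    for p in partials[1:]:   # eight additions on values up to 48 bits wide
        product += p
    return product

def select_mantissa(product, ea, eb):
    """Pick the 23 stored mantissa bits and the biased exponent (truncating)."""
    msb = (product >> 47) & 1
    if msb:
        mantissa = (product >> 24) & 0x7FFFFF   # bits 46:24 follow the leading 1
    else:
        mantissa = (product >> 23) & 0x7FFFFF   # leading 1 is bit 46; take bits 45:23
    exponent = ea + eb - 127 + msb              # +msb is an assumed adjustment
    return mantissa, exponent
```

For 1.5 x 2.5, the 24-bit mantissas are 0xC00000 and 0xA00000 with exponents 127 and 128; the product MSB is 0, and the selected fields (0x700000, 128) encode 3.75.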

32-bit RFP divider
The 32-bit RFP divider is illustrated in Figure 6. The reciprocal unit of the RFP divider is designed using the Newton-Raphson method [28]. The divider mainly consists of the exponent difference calculation, which uses 8-bit reversible subtractor and adder units, followed by the reciprocal unit, which uses RFP adders and RFP multipliers. The 32-bit RFP divider is constructed using the following steps:
a. First, check whether all the bits of EA and EB are one; if yes, the number is either infinity or not a number (NaN), i.e., an exception.
b. Perform an XOR operation on the MSBs to generate the final sign bit.
c. Calculate the exponent difference (EA-EB).
− Compute the iterations I1, I2, and I3 for the reciprocal unit (for single precision, the number of iterations is fixed at 3). The successive iterate Ii+1 is calculated from Ii using (2), where i = 0, 1, and 2.
− Lastly, calculate the quotient (Q) by multiplying the dividend (Dn) by the reciprocal of the divisor (Dd), as given in (3).

The performance parameters of the RFP arithmetic unit's sub-modules, namely GC, CI, GO, and QC, are summarized in Table 1. The normalization and shift unit and the Wallace tree multiplier unit account for most of the resources of the RFP subtractor and RFP multiplier, respectively. The 24-bit barrel shifter and 2's complementor are part of the normalization unit in the RFP A/S unit.

Figure 6. Representation of 32-bit RFP divider
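The reciprocal-based division can be illustrated with a small Python sketch. Because the equations themselves are not reproduced in the text, the sketch assumes the textbook Newton-Raphson reciprocal iteration, I(i+1) = I(i) x (2 - Dd x I(i)), and the commonly used linear initial estimate 48/17 - (32/17) x Dd for a divisor scaled into [0.5, 1). Both choices are assumptions and may differ from the authors' design. The sketch runs in double precision on the host, not on reversible hardware, and the function names are hypothetical.

```python
import math

def reciprocal_nr(d, iterations=3):
    """Approximate 1/d for d in [0.5, 1) with Newton-Raphson iterations."""
    x = 48.0 / 17.0 - (32.0 / 17.0) * d   # assumed linear initial estimate I0
    for _ in range(iterations):
        x = x * (2.0 - d * x)             # I(i+1) = I(i) * (2 - d * I(i))
    return x

def divide_nr(dn, dd):
    """Quotient Q = Dn * (1/Dd) using the reciprocal approximation."""
    if dd == 0.0:
        raise ZeroDivisionError("divisor is zero")
    m, e = math.frexp(abs(dd))            # abs(dd) = m * 2**e with m in [0.5, 1)
    q = math.ldexp(dn * reciprocal_nr(m), -e)
    return -q if dd < 0 else q
```

With this initial estimate the relative error starts at no more than 1/17 and is squared by every iteration, so three iterations bring it to about 1.4e-10, below the resolution of a 24-bit mantissa. This is consistent with fixing the iteration count at 3 for single precision.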

RESULTS AND DISCUSSION
This section presents and analyzes the results of the RFP arithmetic unit and its sub-modules. The simulation results, synthesis results, performance metrics, and comparative analysis are discussed. The RFPA unit is synthesized and implemented on an Artix-7 FPGA (XC7A100T-3CSG324) using the Vivado IDE environment. Performance metrics such as CI, GO, and QC are realized for each sub-module of the RFPA unit. The synthesis results of the RFPA unit, including slice LUTs, total power consumption, and latency (clock cycles), are tabulated and discussed. The performance metrics and synthesis results are compared with similar existing approaches and show improvements.
The simulation results of the RFPA unit are illustrated in Figure 7. The two 32-bit inputs (a and b) and the 2-bit selection mode (op) are defined, and the 32-bit output (RFPA unit) is obtained as per the design. The functional simulation results are verified against theoretical results. If the mode (op) is 0, the unit performs RFP addition; 1 selects RFP subtraction, 2 selects RFP multiplication, and 3 selects RFP division.
The RFP subtractor unit includes the 2's complement operation and the normalization and shift process, which consume more GC and QC. The RFP multiplier uses a 24x24 Wallace multiplier for mantissa calculation, which consumes additional GC and QC. The reciprocal unit in the RFP divider module is designed using four RFP adders and four RFP multipliers, which use more GC and QC. The resource utilization of the RFPA unit and its sub-modules on the Artix-7 FPGA is tabulated in Table 3. The RFP adder/subtractor module utilizes 562 LUTs and consumes 0.14 W of total power. The RFP multiplier module uses 1,085 LUTs and consumes 0.157 W of total power. The RFP divider module utilizes 9,803 LUTs and consumes 0.841 W of total power. The final RFPA unit utilizes 11,693 LUTs and consumes 0.891 W of total power after implementation on the Artix-7 FPGA. The RFPA unit and its submodules are designed using reversible gates and operate in parallel. The RFPA unit and its submodules are simulated individually, and each executes in a single clock cycle. Overall, the RFPA unit consumes 18.44% of the LUTs on the Artix-7 FPGA.
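The reported utilization can be cross-checked with simple arithmetic. The sketch below assumes that the XC7A100T device provides 63,400 slice LUTs, a figure taken from the device family data sheet rather than from this text; the LUT counts are those reported above, and the variable names are illustrative.

```python
LUTS_AVAILABLE = 63_400  # slice LUTs on the XC7A100T (assumed from the data sheet)

luts_used = {
    "adder_subtractor": 562,
    "multiplier": 1085,
    "divider": 9803,
    "rfpa_unit": 11693,
}

def utilization_percent(used, available=LUTS_AVAILABLE):
    """LUT utilization as a percentage of the available device LUTs."""
    return 100.0 * used / available

rfpa_utilization = round(utilization_percent(luts_used["rfpa_unit"]), 2)

# LUTs in the top-level unit beyond the three submodules taken separately.
top_level_extra = luts_used["rfpa_unit"] - (
    luts_used["adder_subtractor"] + luts_used["multiplier"] + luts_used["divider"]
)
```

Under this assumption the full unit occupies 18.44% of the device, matching the reported figure, and the top-level unit uses 243 LUTs more than the sum of the three submodules.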
The performance metrics of the RFP sub-modules are compared with existing approaches in Table 4. The performance parameters GC, CI, GO, and QC are compared with those of existing approaches [11]-[16], [18]-[22]. The proposed RFP adder improves CI by 92.7%, GO by 66.74%, and QC by 68.78% over the existing adder [11]. Similarly, the proposed RFP adder improves CI by 74.5% and QC by 52.24% over the existing adder [12]. The proposed RFP adder improves CI by 65.03% and QC by 29.79% over the existing adder [13]. The proposed RFP adder improves GC by 3.04%, CI by 64.84%, and QC by 34.73% over the existing adder [14]. The proposed RFP adder improves GC by 33.35%, CI by 75.19%, and QC by 60.02% over the existing adder [15]. The proposed RFP subtractor improves QC by 11.78% over the existing subtractor [16]. The proposed RFP multiplier improves QC by 35.47%, 31.48%, 33.92%, 53.10%, and 24.05% over the existing multipliers [18]-[22], respectively. The RFP adders [11]-[15] use a conditional right shifter for the normalization process. In contrast, the proposed RFP adder does not use any conditional shifter during normalization, which reduces GC and QC by 20% in the design. The RFP subtractor [16] uses TR gates to construct a full adder/subtractor, which results in a higher QC than in the proposed subtractor. The Wallace tree multiplier is used in most RFP multipliers [18]-[22] for partial product generation (PPG) and is constructed using compressor-tree units. PPG using a compressor tree increases the hardware complexity and QC.
The resources of the RFP units are compared with existing approaches on different FPGAs in Table 5. The proposed RFP A/S module uses 80.81% fewer LUTs, 65.93% less power, and 50% less latency than the existing RFP adder [16]. The proposed RFP A/S module uses 57.04% fewer LUTs, 78.09% less power, and 80% less latency than the existing RFP A/S [17]. The proposed RFP A/S module uses 19.48% fewer LUTs, 65.85% less power, and 50% less latency than the existing RFP subtractor [16]. The proposed RFP divider module consumes 17.95% less power and 90% less latency than the existing RFP divider [24]. The proposed RFP divider module consumes 15.64% less power and 85.71% less latency than the existing RFP divider [16]. The existing RFP adders/subtractors [16], [17] are sequential designs and consume more area and latency. The existing RFP dividers [16], [24] are designed using Goldschmidt's algorithm and execute sequentially. These dividers consume more power and latency during the reciprocal calculation than the proposed divider.

CONCLUSION
In this manuscript, an efficient reversible floating-point arithmetic (RFPA) unit is designed and synthesized on an Artix-7 FPGA using the Vivado IDE environment. The RFPA unit consists of RFP adder/subtractor, RFP multiplier, and RFP divider modules, which are designed as per the IEEE 754 standard. The output of the module chosen by the selection mode becomes the final RFPA unit output. All the RFPA unit sub-modules are constructed using basic reversible gates. The simulation results of the RFPA unit are verified against theoretical values. The synthesis results, namely chip area and power consumption of the RFPA unit and its submodules, are tabulated for the Artix-7 FPGA platform. The RFPA unit and its sub-modules each execute in only one clock cycle. The RFPA unit utilizes 18.44% of the LUTs and consumes 0.891 W of power on the Artix-7 FPGA. Performance metrics such as GC, CI, GO, and QC are realized in detail for the RFPA unit and its submodules. The proposed designs are compared with similar existing works and show improvements in resource utilization (chip area and power) and performance metrics.

BIOGRAPHIES OF AUTHORS
Girija Sanjeevaiah has completed her graduation in Electronics and Communication Engineering from Bangalore University and master's specialization in computer science from Visvesvaraya Technological University, Karnataka. She is currently working as Assistant Professor in the Electronics and Communication Engineering department of Dr. Ambedkar Institute of Technology. Her areas of interest are Reversible logic, embedded systems, and data computation. She can be contacted at email: girija.pari@gmail.com.

Sangeetha Bhandari Gajanan is an Assistant Professor in the Electronics and Communication department at RNS Institute of Technology. She received her Doctorate and M. Tech degrees from Visvesvaraya Technological University and her B.E from UBDTCE. Her areas of research interest are low power VLSI design and thin films. She has more than 15 years of teaching and research experience. She can be contacted at email: sangeethabg@gmail.com.