Optimized architecture for SNOW 3G

ABSTRACT


INTRODUCTION
Securing records is important in systems that handle personal and financial matters. Hiding information from unauthorized users is essential in such systems and services. Cryptography is one of the most widely used techniques for protecting information from eavesdroppers, and many researchers are working in the area of information security. To maintain advanced network security, the network architecture concerned must move from traditional security to advanced security. This may be achieved by sealing holes in the security wall.
Cryptographic algorithms and their associated keys are more secure when they are implemented on a hardware platform [1], although side-channel attacks and fault attacks may still exist. The developed algorithms must also be fast enough to support autonomous protocols, which use a different encryption algorithm for each session. Many recent autonomous protocols, such as secure sockets layer (SSL) and internet protocol security (IPsec), use different ciphers for different sessions.
Hardware implementation of cryptographic algorithms on FPGA devices is an attractive solution because FPGAs are reconfigurable [2][3][4][5][6][7][8]. This property provides flexibility for dynamic system development and makes FPGAs capable of implementing a wide range of functions, architectures, and algorithms. It is therefore worthwhile to emphasize FPGA-based implementations of cryptographic algorithms, especially high-throughput architectures [9]. The SNOW 3G algorithm is the core of the 3rd generation partnership project (3GPP) algorithms UEA2 and UIA2. The 3GPP is a joint effort between telecommunication associations (TG) to make globally applicable specifications for long term evolution (LTE) mobile phone systems [10,11].

= [0] ‖ [1] ‖ [2] ‖ … ‖ [31]    (1)

[Figure labels: K0 K1 K2 K3; IV0 IV1 IV2 IV3; Initial Operations; S15 to S0]

SNOW 3G works in two modes of operation: initialization mode and keystream mode. At the start of initialization mode, the system should reset the LFSR and the FSM using terminals Rst1 and Rst2, respectively. In the first clock cycle, the values calculated in the initialization mode are loaded into the sixteen stages of the LFSR, while the FSM registers remain in the reset state. In the second clock cycle, Rst2 = 0 and the LFSR is clocked. At each clock, the 32-bit output F of the FSM is combined with S0, S2, and S11 in the feedback path by selecting mode = 0 on the multiplexer select line, and the result is applied to S15 as the intermediate signal v. The following equation gives the intermediate signal v in initialization mode [10].
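For reference, the initialization-mode feedback word can be written as follows. This is a sketch in the notation of the 3GPP specification [10], not a reproduction of the authors' own equation: s_{i,0} to s_{i,3} are assumed to denote the four bytes of stage S_i from most to least significant, and MUL_α and DIV_α are the byte-to-word mappings defined in that specification.

```latex
v = \left(s_{0,1} \,\|\, s_{0,2} \,\|\, s_{0,3} \,\|\, \texttt{0x00}\right)
    \oplus \mathrm{MUL}_{\alpha}(s_{0,0})
    \oplus s_{2}
    \oplus \left(\texttt{0x00} \,\|\, s_{11,0} \,\|\, s_{11,1} \,\|\, s_{11,2}\right)
    \oplus \mathrm{DIV}_{\alpha}(s_{11,3})
    \oplus F
```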
After 32 clock cycles, SNOW 3G enters keystream mode. Operation in this mode is the same as in initialization mode, except that the output F of the FSM is not combined in the feedback path; this is done by setting mode = 1 on the multiplexer select line. The intermediate signal in keystream mode is given by the following equation [10].
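The difference between the two modes can be sketched in software. The following minimal Python model assumes the feedback definition of the 3GPP specification [10] (the constants 0xA9 and the exponents should be checked against that document); all function names are hypothetical and the code models behaviour only, not the hardware described here.

```python
MASK32 = 0xFFFFFFFF

def mulx(v: int, c: int) -> int:
    """Multiply byte v by x in GF(2^8), reducing with constant c."""
    v <<= 1
    return (v ^ c) & 0xFF if v & 0x100 else v

def mulx_pow(v: int, i: int, c: int) -> int:
    """Apply mulx i times, i.e. multiply byte v by x^i."""
    for _ in range(i):
        v = mulx(v, c)
    return v

def mul_alpha(c: int) -> int:
    """Byte-to-word mapping MUL_alpha (constants assumed from [10])."""
    return (mulx_pow(c, 23, 0xA9) << 24 | mulx_pow(c, 245, 0xA9) << 16
            | mulx_pow(c, 48, 0xA9) << 8 | mulx_pow(c, 239, 0xA9))

def div_alpha(c: int) -> int:
    """Byte-to-word mapping DIV_alpha (constants assumed from [10])."""
    return (mulx_pow(c, 16, 0xA9) << 24 | mulx_pow(c, 39, 0xA9) << 16
            | mulx_pow(c, 6, 0xA9) << 8 | mulx_pow(c, 64, 0xA9))

def times_alpha(w: int) -> int:
    """Multiply a 32-bit stage by alpha: shift left one byte, fold top byte."""
    return ((w << 8) & MASK32) ^ mul_alpha(w >> 24)

def times_alpha_inv(w: int) -> int:
    """Multiply a 32-bit stage by alpha^-1: shift right one byte, fold low byte."""
    return (w >> 8) ^ div_alpha(w & 0xFF)

def feedback(s0: int, s2: int, s11: int, f: int, mode: int) -> int:
    """Word applied to S15: mode 0 = initialization, mode 1 = keystream."""
    v = times_alpha(s0) ^ s2 ^ times_alpha_inv(s11)
    return v ^ f if mode == 0 else v   # F joins the feedback only in mode 0
```

The multiplexer select line corresponds to the `mode` argument: the two feedback words differ only by the XOR with F.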

RELATED WORK
The study of existing SNOW 3G architectures revealed two challenges: minimizing the propagation delay of the modulo 2^32 adders and minimizing the chip area of the S-boxes. Kitsos et al. [12] realized the S-boxes using 8 lookup tables. Each lookup table consumes 1 KB of memory, so the memory used for S-box realization is 8 KB. Jairaj et al. used the symmetry of the S-box lookup tables to minimize the cache requirement in a software implementation of SNOW 3G [21]. Kitsos et al. [12] used a conventional carry-lookahead adder (CLA) for the modulo adder implementation, whereas Pai and Chen [22] presented a modified CLA design to minimize the propagation delay. Traboulsi et al. [23] implemented SNOW 3G on an embedded platform with the aim of minimizing the memory required for the S-box implementation. They used 2 lookup tables in place of 8 to implement the 2 S-boxes, using eight-bit shifting with cache memory to reduce the memory requirement.

PRESENTED SNOW 3G ARCHITECTURE
Considering the challenges of existing FSM designs, the proposed implementation uses the following refinements to improve the performance of the SNOW 3G algorithm.
- Use of a novel modulo CLA architecture over 2^32 to minimize the propagation delay in the FSM, which determines the critical delay of the algorithm
- Use of a novel S-box architecture to minimize chip area

Novel modulo CLA architecture over 2^32
Modulo adders are usually implemented using ripple-carry adders, but this technique increases the propagation delay of the critical path. The propagation delay of an n-bit ripple-carry adder is (2n+1) gate delays. A modulo 2^32 adder implemented with ripple-carry adders therefore has a delay of 2*32+1 = 65 gate delays; assuming an average gate delay of 10 ns, the total delay of one modulo adder is 65*10 = 650 ns. The FSM contains two such adders, so the total modulo adder delay for a single computation is 1300 ns.
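The delay arithmetic above can be reproduced with a few lines of Python. This is only a worked check of the numbers in the text; the constant and function names are hypothetical, and the 10 ns gate delay is the average assumed above.

```python
GATE_DELAY_NS = 10   # average gate delay assumed in the text
ADDERS_IN_FSM = 2    # the FSM contains two modulo adders

def ripple_carry_gate_delays(n_bits: int) -> int:
    """Gate delays of an n-bit ripple-carry adder, using the (2n + 1) rule."""
    return 2 * n_bits + 1

gate_delays = ripple_carry_gate_delays(32)          # 65
adder_delay_ns = gate_delays * GATE_DELAY_NS        # 650
fsm_adder_delay_ns = ADDERS_IN_FSM * adder_delay_ns  # 1300

print(gate_delays, adder_delay_ns, fsm_adder_delay_ns)  # 65 650 1300
```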
The propagation delay of the modulo adders can be minimized by implementing them as CLAs. Existing CLAs are realized using basic gates, i.e., AND, XOR, and OR gates, but Pai and Chen realized a CLA using universal gates, i.e., NAND or NOR gates [22]. That design reduces the gate count compared with existing architectures. At the same time, these CLA designs [22,24] are faster than conventional CLA architectures. The adder architecture [25] developed for the LILI-II cipher uses a different approach for addition.
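The idea of building an adder from universal gates alone can be illustrated with a one-bit full adder made only of NAND gates. The sketch below is the generic textbook nine-NAND construction, not the design of [22]; the function names are hypothetical.

```python
def nand(a: int, b: int) -> int:
    """Two-input NAND on single bits."""
    return 1 - (a & b)

def full_adder_nand(a: int, b: int, c_in: int) -> tuple[int, int]:
    """One-bit full adder built from nine NAND gates; returns (sum, carry)."""
    t1 = nand(a, b)                          # NOT(a AND b), i.e. inverted generate
    p = nand(nand(a, t1), nand(b, t1))       # propagate: a XOR b
    t2 = nand(p, c_in)                       # NOT(p AND c_in)
    s = nand(nand(p, t2), nand(c_in, t2))    # sum: p XOR c_in
    c_out = nand(t1, t2)                     # carry: (a AND b) OR (p AND c_in)
    return s, c_out

print(full_adder_nand(1, 1, 0))  # (0, 1)
```

The carry term `nand(t1, t2)` is the NAND-NAND form of the usual generate/propagate carry equation, which is what a lookahead stage evaluates for several bit positions at once.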
Further reduction in propagation delay and chip area is possible in existing architectures [12][13][14][15][16][22], so the presented work uses universal gates for the CLA implementation, together with other techniques to minimize the number of gates required. The novel modulo CLA architecture over 2^32 uses the following three architectures in a multilevel CLA design for performance improvement:
- 4-bit CLA at the LSB (to calculate S0 to S3)
- 4-bit CLA at the middle stages (to calculate S4 to S27)
- 4-bit CLA at the MSB (to calculate S28 to S31)
Using the above CLA architectures, the novel architecture for the modulo CLA over 2^32 was designed as shown in Figure 2. The presented modulo adder architecture is efficient in area, propagation delay, and energy compared with existing modulo CLA architectures.
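A bit-level software model helps to show why the three block types differ. The sketch below is an assumption-laden illustration, not the circuit of Figure 2: it uses one generic 4-bit lookahead block for all eight positions, passes the carry between blocks directly rather than through a second lookahead level, and uses hypothetical function names. The LSB block always receives a carry-in of 0, and the carry-out of the MSB block is discarded, which is what makes the addition modulo 2^32.

```python
def cla4(a: int, b: int, c_in: int) -> tuple[int, int]:
    """4-bit carry-lookahead block: returns (4-bit sum, carry out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]    # generate bits
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate bits
    # Every carry is computed directly from g, p and c_in (no rippling).
    c1 = g[0] | (p[0] & c_in)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c_in)
    c3 = (g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0])
          | (p[2] & p[1] & p[0] & c_in))
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c_in))
    carries = [c_in, c1, c2, c3]
    s = 0
    for i in range(4):
        s |= (p[i] ^ carries[i]) << i                  # sum bit = p XOR carry in
    return s, c4

def add_mod_2_32(a: int, b: int) -> int:
    """Add two 32-bit words modulo 2^32 using eight 4-bit CLA blocks."""
    result, carry = 0, 0             # LSB block: carry-in is always 0
    for block in range(8):
        shift = 4 * block
        s, carry = cla4((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        result |= s << shift
    return result                    # MSB block: carry-out is dropped

print(hex(add_mod_2_32(0xFFFFFFFF, 0x00000002)))  # 0x1
```

Because the LSB block never sees a carry-in and the MSB block never needs a carry-out, both can be built with fewer gates than a middle block, which is one plausible reading of the three variants listed above.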

Novel S-Box architecture
Two S-boxes, S1 and S2, are used in the SNOW 3G architecture, and each requires 4 KB of memory. The lookup table of S1 is taken from the Rijndael substitution box, and the lookup table of S2 is based on a Dickson polynomial over GF(2^8). As per the design specification, each S-box (S1 or S2) is implemented using 4 lookup tables, and each lookup table has 256 entries of 4 bytes each. Each lookup table therefore requires 256 x 4 = 1024 bytes of memory, and each S-box, with its 4 lookup tables, requires 4 x 1024 bytes = 4 KB. The total memory needed for the realization of the two S-boxes is 8 KB. Existing implementations [10][11][12][13][14][15][16][26][27][28][29] use the S-box architecture shown in Figure 3.

Figure 3. Existing S-boxes architecture (S-box S1 implementation with lookup tables S1_T0 to S1_T3; S-box S2 implementation)

The four lookup tables of S1, i.e., S1_T0 to S1_T3 in Figure 3, have the same content but in 8-bit shifted form. The same holds for S-box S2. The presented novel S-box architectures use a single lookup table to implement each S-box (S1 or S2). Two architectures are presented. The first, shown in Figure 4, consumes the fewest resources but is suitable for low-frequency applications only. The second, shown in Figure 5, consumes fewer resources than existing architectures but more than novel S-box architecture-1.

The presented designs require 2 KB of memory for the realization of S-boxes S1 and S2, a saving of 6 KB compared with existing designs. S-box architecture-1 saves 6 KB of memory at the cost of some additional hardware (a single 2-bit counter, two 4-input multiplexers, and four 32-bit latches). This architecture is 4 times slower than conventional architectures and is useful for low-frequency applications.
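The memory budget and the single-table idea can be checked with a short Python sketch. The table contents below are a hypothetical stand-in (not the real S1 or S2 values), and the rotation direction is an assumption, since the text only says the four tables hold the same content in 8-bit shifted form.

```python
ENTRIES, ENTRY_BYTES, TABLES_PER_SBOX, SBOXES = 256, 4, 4, 2

table_bytes = ENTRIES * ENTRY_BYTES                      # 1024
existing_bytes = SBOXES * TABLES_PER_SBOX * table_bytes  # 8192 (8 KB)
presented_bytes = SBOXES * 1 * table_bytes               # 2048 (2 KB)
print(existing_bytes, presented_bytes, existing_bytes - presented_bytes)
# 8192 2048 6144

def rotl32(w: int, bits: int) -> int:
    """Rotate a 32-bit word left by bits (0 <= bits < 32)."""
    return ((w << bits) | (w >> (32 - bits))) & 0xFFFFFFFF

# Hypothetical base table standing in for the one stored lookup table.
T0 = [(x * 0x9E3779B1) & 0xFFFFFFFF for x in range(ENTRIES)]

# Existing approach: four stored tables, each a byte-rotated copy of T0.
STORED = [[rotl32(e, 8 * k) for e in T0] for k in range(4)]

def sbox_four_tables(word: int) -> int:
    """XOR of one lookup in each of the four stored tables."""
    out = 0
    for k in range(4):
        out ^= STORED[k][(word >> (8 * k)) & 0xFF]
    return out

def sbox_single_table(word: int) -> int:
    """Same result from T0 alone, rotating each looked-up entry."""
    out = 0
    for k in range(4):
        out ^= rotl32(T0[(word >> (8 * k)) & 0xFF], 8 * k)
    return out
```

Since a fixed rotation is only wiring in hardware, replacing three stored tables with rotations of the first trades 3 KB of memory per S-box for routing and selection logic.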
S-box architecture-2 has the same speed as conventional architectures but uses 4 additional 256:1 multiplexers. It can be used for low-speed and high-speed applications, depending on cost and speed tradeoffs.

Figure 5. Novel S-Box architecture 2
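The speed difference between the two architectures can be modelled behaviourally. The sketch below is one possible interpretation of the hardware lists given above, not the authors' circuit: the serial model uses a 2-bit counter to read the single table once per clock into one of four latches, while the parallel model performs all four reads in the same clock. The class and function names, the table contents, and the rotation direction are all assumptions.

```python
def rotl32(w: int, bits: int) -> int:
    """Rotate a 32-bit word left by bits (0 <= bits < 32)."""
    return ((w << bits) | (w >> (32 - bits))) & 0xFFFFFFFF

# Hypothetical single lookup table (not the real S-box contents).
TABLE = [(x * 0x9E3779B1) & 0xFFFFFFFF for x in range(256)]

def sbox_parallel(word: int) -> int:
    """Architecture-2 style: four table reads in one clock."""
    out = 0
    for k in range(4):
        out ^= rotl32(TABLE[(word >> (8 * k)) & 0xFF], 8 * k)
    return out

class SerialSBox:
    """Architecture-1 style: one table read per clock, result every 4th clock."""

    def __init__(self) -> None:
        self.counter = 0          # 2-bit counter
        self.latches = [0] * 4    # four 32-bit latches

    def clock(self, word: int):
        k = self.counter
        byte = (word >> (8 * k)) & 0xFF           # input multiplexer
        self.latches[k] = rotl32(TABLE[byte], 8 * k)
        self.counter = (k + 1) & 0x3              # counter wraps after 3
        if self.counter == 0:                     # all four latches are valid
            l0, l1, l2, l3 = self.latches
            return l0 ^ l1 ^ l2 ^ l3
        return None

serial = SerialSBox()
outputs = [serial.clock(0x01020304) for _ in range(4)]
print(outputs[:3], outputs[3] == sbox_parallel(0x01020304))
# [None, None, None] True
```

In this model the serial version needs four clocks per S-box output, which matches the factor of 4 stated for architecture-1.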

Optimized SNOW 3G architecture
The optimized SNOW 3G architecture, shown in Figure 6, is designed using the novel modulo CLA architecture and the novel S-box architecture discussed in the previous sections. The SNOW 3G architecture designed with S-box architecture-1 is intended for low-frequency applications and needs a two-clock arrangement, whereas the architecture designed with S-box architecture-2 is intended for high-frequency applications and needs a single clock. The internal block diagram of the optimized SNOW 3G architecture is shown in Figure 7.

[Figure labels: Divide KEY and IV; 128; K0 K1 K2 K3; IV0 IV1 IV2 IV3; Initial Operations; S15 to S0; S11]

The FSM of the SNOW 3G architecture consists of two modulo adders and two S-boxes. The two modulo adders determine the speed of the algorithm, and the two S-boxes determine its hardware utilization. The novel modulo CLA over 2^32 minimizes the propagation delay, and the novel S-box architecture minimizes hardware utilization. These refinements improve the performance of the SNOW 3G algorithm in terms of throughput and area.
The optimized SNOW 3G architecture is coded in VHDL and implemented on the Xilinx Virtex xc5vfx100e FPGA device [30]. The presented architecture achieved a maximum frequency of 254.9 MHz and a throughput of 7.2235 Gbps. Table 2 gives particulars of the technology used. Figure 8 and Figure 9 show the RTL schematic and the output waveform of the presented architecture, respectively.

RESULT AND DISCUSSIONS
This section discusses the results in terms of area, propagation delay, throughput, and memory utilized for the presented SNOW 3G architecture.
Area

Novel modulo CLA architecture over 2^32
The presented novel modulo CLAs are used as the modulo 2^32 adders in the optimized SNOW 3G architecture. A comparison of the device utilization of the existing [13,22] and presented architectures is shown in Figure 10.

Novel S-Box architecture
The optimized SNOW 3G architecture uses the novel S-box architecture to avoid redundancy among the lookup tables. Novel S-box architecture-1 is suitable for low-frequency applications, and novel S-box architecture-2 is suitable for high-frequency applications. The use of these novel architectures minimizes the hardware requirement, as shown in Figure 11. The comparison shows that the presented architectures use fewer hardware resources than the existing architectures [12][13][14][15][16]. The reduction in area is possible because each S-box is designed using one lookup table in place of four.

Figure 11. Comparisons of hardware utilization for S-box

Optimized SNOW 3G architecture
The optimized SNOW 3G architecture uses the refined modulo CLA over 2^32 and the refined S-box for performance improvement. The hardware resources used by the optimized SNOW 3G architecture are presented in Table 3, and Table 4 compares the hardware resources used by the optimized SNOW 3G and existing architectures [12][13][14][15][16].
The comparison shows that the optimized SNOW 3G architecture uses fewer resources than the architectures presented by Kitsos et al. [13], Madani and Tanougast [15], and Madani et al. [16]. The architecture presented by Kitsos et al. [12] is an ASIC design, so a direct comparison is difficult. The architecture presented by Zhang et al. [14] uses less hardware than the proposed refined architecture because only one mode is implemented in hardware.

[Table column headers: Existing architectures [12,13,14,15,16]; No. of slices used; Number of 4 input LUTs]

The propagation delay comparison of the proposed refined CLA and existing CLA architectures [13,22] is shown in Figure 12. The evaluation shows that the delay of the presented novel modulo CLA architecture is lower than that of existing CLA architectures. The presented CLA architecture therefore helps to improve the throughput of the optimized SNOW 3G architecture.

Novel S-Box architecture
The combinational path delay comparison of the proposed refined S-box architectures and existing S-box architectures [12][13][14][15][16] is shown in Table 5. The comparison shows that the proposed low-frequency architecture (architecture 1) has a higher propagation delay than the other architectures but uses less hardware. The proposed high-frequency architecture (architecture 2) has a lower propagation delay than the other architectures, with moderate hardware utilization. The path delay of existing architectures is thus lower than that of presented architecture 1 but higher than that of architecture 2. The existing architectures use more hardware resources than either presented architecture.

Throughput and memory
A comparison of the throughput achieved and the memory used for S-box realization by the optimized SNOW 3G and existing SNOW 3G architectures [12][13][14][15][16] is shown in Figure 13. The comparison shows that the throughput of the optimized SNOW 3G is higher than that of the architecture presented by Kitsos et al. [13] and close to that of Kitsos et al. [12], but lower than that of the architectures presented by Zhang et al. [14], Madani and Tanougast [15], and Madani et al. [16]. This may be due to the use of more hardware resources.
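The relation between clock frequency and throughput can be sketched as a one-line calculation. It assumes that one 32-bit keystream word is produced per clock cycle, which the text does not state explicitly; the function name is hypothetical.

```python
def throughput_gbps(clock_mhz: float, bits_per_cycle: int = 32) -> float:
    """Throughput in Gbps for a given clock, assuming bits_per_cycle per clock."""
    return clock_mhz * 1e6 * bits_per_cycle / 1e9

print(round(throughput_gbps(226.562), 3))  # 7.25
```

Under this assumption, the 226.562 MHz clock quoted in the conclusion gives about 7.25 Gbps, close to the 7.2463 Gbps reported there.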

CONCLUSION
The optimized SNOW 3G architecture presented in this paper uses a novel modulo CLA and a novel S-box architecture. The novel CLA reduces the hardware required for the modulo adders and reduces the propagation delay compared with existing architectures. The novel S-box architecture saves 6 KB of memory: the presented architecture uses 2 KB, whereas existing architectures use 8 KB for the same purpose. The presented SNOW 3G architecture attained a throughput of 7.2463 Gbps at a clock frequency of 226.562 MHz. This throughput is higher than that of the architecture in [13] and close to that of the ASIC implementation [12].
The throughput of some existing architectures is higher than that of the presented architecture. This may be because: (1) the S-boxes in these architectures use 8 KB of memory; (2) the architecture uses a software platform, which helps to minimize hardware and increase throughput; or (3) the architecture is an ASIC realization, and ASIC designs are typically faster than FPGA realizations.