High-performance AES-128 algorithm implementation by FPGA-based SoC for 5G communications

In this research work, a fast and lightweight AES-128 cypher based on the Xilinx ZCU102 FPGA board is presented, suitable for 5G communications. In particular, both the encryption and decryption algorithms have been developed using a pipelined approach, thus enabling the simultaneous processing of the rounds on multiple data packets at each clock cycle. Both the encryption and decryption systems support an operating frequency of up to 220 MHz, reaching a maximum data throughput of 28.16 Gbit/s; moreover, the encryption and decryption phases each last only ten clock periods. To guarantee the interoperability of the developed encryption/decryption system with the other sections of the 5G communication apparatus, synchronization and control signals have been integrated. The encryption system uses only 1631 CLBs, whereas the decryption system uses 3464 CLBs, a difference ascribable mainly to the Inverse MixColumns step. The developed cypher shows higher efficiency (8.63 Mbps/slice) than similar solutions in the literature.

The key expansion step lasts just 174.55 ns to derive the 44 words from the encryption key. Besides, the proposed encryption/decryption system, written in VHSIC (very high-speed integrated circuit) hardware description language (VHDL), includes several control signals for synchronizing it with the data generator block and the modulator/demodulator, respectively. Different blocks have also been added for testing the developed encryption/decryption block; they place an error into the incoming data packet in a deterministic way and check the correctness of the resulting encrypted/decrypted packet, which is reported by a dedicated error signal. A mechanism has been developed to substitute the encryption key at any instant of the encryption process, at the cost of just three lost data packets for each substitution. The simulation results and on-field tests have demonstrated the proper operation of both the encryption and decryption blocks, and a higher efficiency in the utilization of hardware resources (i.e. 8.63 Mbps/slice) than similar implementations in the scientific literature. The rest of the paper is organized as follows: section 2 reports the design and implementation of the proposed encryption and decryption systems and all the VHDL sections developed to verify their operation. Section 3 presents the results of the post-implementation and post-synthesis simulations carried out on the cascade of the cypher and the decryptor. Finally, in the fourth section, the performance of the proposed encryption/decryption system is discussed and compared with similar implementations reported in the literature.

RESEARCH METHOD
The Xilinx Vivado Design Suite has been used to develop the proposed encryption/decryption system, exploiting the wide range of tools it offers to designers. The block diagram of the encryption system is shown in Figure 1, along with all the blocks for testing its correct operation; the AXI Stream bus carries the 128-bit plaintext data packets into the cypher and the ciphered packets out of it. The function of each implemented block is described below:
− Insert Error block: inserts an error into a packet included in the internal table of the Data Generator block (blue box in Figure 1).
− Clock generator block: generates the 220 MHz system clock for all blocks (purple box in Figure 1).
− Key_Generator block: provides the encryption key to the Key_to_write block, which loads the key into the memory registers via the AXI Lite bus; this block also changes the encryption key.
− Data Generator block: provides the 128-bit plaintext data packets to the AES_AXIS_KEY encryption section (yellow box in Figure 1).
− Key_to_write block: loads the encryption key into the memory registers through the AXI Lite bus, implementing all the controls required for this operation (orange box in Figure 1).
− AES_AXIS_KEY block: encrypts the incoming plaintext data packets, taking as input the 128-bit plaintext data packets, the encryption key, the clock signal, and the synchronization signals used to manage the loading of the new encryption key into the first in first out (FIFO) registers (highlighted in red in Figure 1).
− Pattern Verificator block: checks the compliance of the encrypted packets with the input plaintext data, raising an error signal if an error is detected (grey box in Figure 1).
The first step performed by the AES_AXIS_KEY block is the key expansion, which derives the 11 subkeys used in the 10 rounds constituting the AES algorithm.
The developed implementation uses a 16 × 16 Sbox made up of 32-bit elements, unlike the 8-bit elements of the classic implementation, thus speeding up the operations that involve it at the cost of higher resource utilization. A LUT-based implementation has been preferred over combinatorial solutions for the Sbox, since the main goal of the proposed encryption/decryption implementation is data throughput rather than hardware resource utilization, given the large memory capabilities offered by the FPGA platform; in fact, LUT-based Sbox solutions offer shorter processing times at the expense of area occupation. The key expansion phase starts with the validation of the new key, generating the expansion_key_start signal once the new key has been correctly loaded from the registers; the whole key expansion operation takes 174.55 ns to obtain the 44 words from the encryption key. According to the pseudo-code of the key_expansion step, this operation consists of 11 iterations; in each of them the key is first validated, and then the four words of each sub-key are obtained by combining the previous sub-keys with the current key transformed by the Sbox; the rcon term is a round constant added to the result of the two sub-functions. Once the 11 sub-keys are obtained, the implemented algorithm generates the encrypted data by carrying out the ten iterations (the ten rounds) required by the AES-128 standard. In the initial step (round_0), the plaintext data packets (in_plain_data) are combined through an xor operation with the cipher_key_table, which contains the key to be used before its expansion (red box in Figure 2(a)). After the intermediate data of round_0 has been obtained, the following nine rounds are carried out, in each of which the SubstituteBytes, ShiftRow, MixColumns, and AddRoundKey operations are performed. The code section that invokes the nine rounds, developed in the AES_AXIS_KEY block, is shown in Figure 2(b).
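The key expansion described above can be illustrated with a minimal Python sketch of the standard AES-128 key schedule as defined in FIPS-197. This is not the authors' VHDL: the function names (gf_mul, build_sbox, key_expansion) are hypothetical, the S-box is derived arithmetically rather than stored as a LUT, and the per-iteration key validation mentioned in the text is not modelled.

```python
# Sketch of the standard AES-128 key schedule (FIPS-197), for illustration only.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox() -> list[int]:
    """Derive the 256-entry S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):          # x^254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        y = 0x63
        for shift in range(5):            # inv ^ rotl1 ^ rotl2 ^ rotl3 ^ rotl4 ^ 0x63
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y)
    return sbox

SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def rot_word(w: int) -> int:
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) & 0xFFFFFFFF) | (w >> 24)

def sub_word(w: int) -> int:
    """Apply the S-box to each of the four bytes of a word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")

def key_expansion(key: bytes) -> list[int]:
    """Expand a 16-byte key into 44 words (11 sub-keys of four words each)."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:                    # first word of each new sub-key
            temp = sub_word(rot_word(temp)) ^ (RCON[i // 4 - 1] << 24)
        w.append(w[i - 4] ^ temp)
    return w

words = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(len(words), f"{words[4]:08x}", f"{words[43]:08x}")
```

The key used in the last two lines is the example key from FIPS-197 Appendix A.1, which makes the sketch easy to check against the published intermediate words.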
Each round operates on the words produced by the previous round. Thus, the intermediate data is updated up to the ninth round (round 9), which provides the data for round 10, where the last AddRoundKey function is carried out; Figure 3(a) shows the operations involved in each round of the AES-128 algorithm. After the ninth round, the Add_round_key function is applied to the last intermediate data, called intermediate_data(9) (Figure 3). The Pattern_Verificator block has been equipped with a second table, allowing another key to be used; when a different key is used for the encryption, this event is signalled by the two signals key_1 and key_2, which indicate to the Pattern_verificator which table to refer to. These two signals are generated by the Key_Generator block and supplied directly to the Pattern_verificator block. The latter is synchronized with the encryption block by a pulse on the m00_axis_tvalid pin, provided at the end of the encryption, signalling the availability of the next data packet.
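As a software reference for the round sequence described above, the following minimal Python sketch encrypts one 128-bit packet following FIPS-197: an initial AddRoundKey (round_0), nine full rounds, and a final round without MixColumns. Note that in the standard the final round still applies SubBytes and ShiftRows before the last AddRoundKey. All function names are hypothetical, and the code is byte-oriented, so it does not reflect the 32-bit LUT organisation or the pipelining of the hardware design.

```python
# Byte-oriented AES-128 encryption of one block (FIPS-197), for illustration only.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox() -> list[int]:
    sbox = []
    for x in range(256):
        # Brute-force search for the field inverse (0 maps to 0).
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        y = 0x63
        for shift in range(5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y)
    return sbox

SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def expand_key(key: bytes) -> list[list[int]]:
    """Return the 11 round keys, each as a list of 16 bytes."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[b] for b in t[1:] + t[:1]]   # RotWord then SubWord
            t[0] ^= RCON[i // 4 - 1]
        w.append([a ^ b for a, b in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(11)]

def add_round_key(s: list[int], k: list[int]) -> list[int]:
    return [a ^ b for a, b in zip(s, k)]

def sub_bytes(s: list[int]) -> list[int]:
    return [SBOX[b] for b in s]

def shift_rows(s: list[int]) -> list[int]:
    # Byte i sits at row i % 4, column i // 4; row r rotates left by r columns.
    return [s[(i % 4) + 4 * ((i // 4 + i % 4) % 4)] for i in range(16)]

def mix_columns(s: list[int]) -> list[int]:
    out = []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        out += [gf_mul(a[r], 2) ^ gf_mul(a[(r + 1) % 4], 3)
                ^ a[(r + 2) % 4] ^ a[(r + 3) % 4] for r in range(4)]
    return out

def encrypt_block(plaintext: bytes, key: bytes) -> bytes:
    rk = expand_key(key)
    state = add_round_key(list(plaintext), rk[0])                 # round_0
    for r in range(1, 10):                                        # rounds 1 to 9
        state = add_round_key(mix_columns(shift_rows(sub_bytes(state))), rk[r])
    state = add_round_key(shift_rows(sub_bytes(state)), rk[10])   # round 10
    return bytes(state)

ciphertext = encrypt_block(bytes.fromhex("00112233445566778899aabbccddeeff"),
                           bytes(range(16)))
print(ciphertext.hex())
```

The plaintext and key in the last lines are the AES-128 example vector of FIPS-197 Appendix C.1, so the printed value can be compared with the ciphertext published there.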
Also, two flag signals have been added, namely a synchronization flag and a signal indicating which packet of the Data_Generator table has been provided to the AES_AXIS_KEY block; these allow the Pattern_Verificator to track the incoming packets and check them against the corresponding entries in its internal table. In addition, as mentioned above, after writing to the register and resetting the key_valid bit, the algorithm performs the key expansion operation, which lasts 174.55 ns; during this phase, the change of the sub-keys causes the loss of three packets. These incorrect packets are reported externally through a pin called invalid_packets, which goes high when wrongly encrypted packets occur at the AES_AXIS_KEY block output and returns low when the packets are encrypted correctly. An error_sig signal, provided by Pattern_verificator, indicates errors in the encrypted data packets through a low level in correspondence with the wrong data packet.
The key substitution system is an essential feature of every communication system, because periodic key substitution is needed to ensure the security of the transmitted data. The AES_AXIS_KEY block receives from the Data_Generator a synchronization signal, called s00_axis_tvalid, which signals the presence of the next data packet on the AXI bus; the encryption block also receives another synchronization signal from the Pattern_verificator, called m00_axis_tready, which indicates when the latter is available to receive a new data packet. Moreover, the s00_axis_tready signal has been configured to indicate when the encryption block is ready to accept a new plaintext data packet; this signal is reset only when the m00_axis_tready signal is reset. If the Pattern_verificator notifies its unavailability to load a new encrypted data packet by bringing the m00_axis_tready signal to a low logic level, the encryption block communicates to the Data_Generator its unavailability to accept new plaintext data packets by bringing the s00_axis_tready signal to a low logic level. One of the primary contributions of the proposed framework is the set of control and synchronization mechanisms described above, which are essential for the operation of the entire encryption/decryption system and guarantee its compatibility with the other functional blocks of the communication system. By storing the intermediate results obtained from each round (i.e. intermediate_data(i), i = 1, ..., 9), a pipelined strategy has been implemented that performs the ten rounds on consecutive data packets at the same time, so that a new packet can be accepted without waiting for the processing of the previous packets to conclude. In this way, parallel elaboration of successive packets is obtained, improving the usage of hardware resources and enabling a higher data throughput.
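The interplay between the pipelined rounds and the ready signals can be sketched with a small cycle-level Python model. It rests on assumptions not stated in the text: the whole ten-stage pipeline is taken to stall when m00_axis_tready is low, s00_axis_tready is modelled as a direct copy of m00_axis_tready, and the source is assumed to always have a packet available. The function name simulate_pipeline and its parameters are hypothetical, and packets are represented only by their sequence numbers.

```python
from collections import deque
from typing import Callable, Iterable

def simulate_pipeline(packets: Iterable[int],
                      sink_ready: Callable[[int], bool],
                      stages: int = 10) -> list[tuple[int, int]]:
    """Return (cycle, packet) pairs for every packet leaving the pipeline."""
    pipe = deque([None] * stages, maxlen=stages)   # pipe[0] = first stage, pipe[-1] = last
    pending = deque(packets)
    delivered = []
    cycle = 0
    while pending or any(p is not None for p in pipe):
        m_tready = sink_ready(cycle)   # driven by the downstream block
        s_tready = m_tready            # assumption: back-pressure is forwarded unchanged
        if s_tready:
            if pipe[-1] is not None:
                delivered.append((cycle, pipe[-1]))
            # Shift every stage by one; maxlen drops the packet that just left.
            pipe.appendleft(pending.popleft() if pending else None)
        cycle += 1
    return delivered

# Sink always ready: ten cycles of latency, then one packet per clock cycle.
always_ready = simulate_pipeline(range(5), lambda cycle: True)

# Sink not ready during cycles 3 and 4: every delivery slips by two cycles.
stalled = simulate_pipeline(range(5), lambda cycle: cycle not in (3, 4))

print(always_ready)
print(stalled)
```

In this model the first packet emerges ten cycles after it was accepted and subsequent packets follow on consecutive cycles, which is the behaviour the text attributes to the pipelined design; a stall of the sink simply delays all packets in flight by the same number of cycles.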
As mentioned above, in the proposed AES implementation each round takes only one clock cycle, so that an encrypted data packet is provided every clock period. The AES-128 decryption algorithm has been implemented similarly to the encryption system, parallelizing many logical operations in each clock period; specifically, a 16 × 16 State matrix was employed in this case too, containing 32-bit elements rather than the 8-bit elements of the standard implementation. The proposed implementation starts with the key expansion, carried out with the same Sbox employed for the encryption process. The decryption process consists of ten rounds, involving the inverse operations of those used for encryption, viz. InvShiftRows, InvSubBytes, and InvMixColumns. The inverse functions are all obtained using matrix implementations represented by four 16 × 16 matrices with 32-bit elements (named sbox_decoding_0, sbox_decoding_1, sbox_decoding_2, and sbox_decoding_3), whose outputs are combined by an xor operation to derive the intermediate data of the different rounds. This solution reduces the duration of the decryption operation to just ten clock cycles; nevertheless, it requires more FPGA hardware resources. To test the decryption implementation, several VHDL blocks have been employed, with functionalities similar to those used for the encryption system.
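Four 256-entry tables of 32-bit words combined by xor match the well-known "T-table" technique, in which each table entry stores the InvMixColumns contribution of one inverse-substituted byte. The Python sketch below shows this technique for a single state column and checks it against a direct computation. Whether sbox_decoding_0 to sbox_decoding_3 are built exactly this way is an assumption; the names TD0 to TD3, inv_column_lut and inv_column_reference are hypothetical, and InvShiftRows and round-key handling are left out.

```python
# T-table sketch: InvSubBytes + InvMixColumns on one column via four 32-bit LUTs.

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox() -> list[int]:
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        y = 0x63
        for shift in range(5):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y)
    return sbox

SBOX = build_sbox()
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x

def pack(b0: int, b1: int, b2: int, b3: int) -> int:
    """Pack four bytes (rows 0 to 3 of a column) into one 32-bit word."""
    return (b0 << 24) | (b1 << 16) | (b2 << 8) | b3

def ror8(w: int) -> int:
    """Rotate a 32-bit word right by one byte."""
    return ((w >> 8) | (w << 24)) & 0xFFFFFFFF

# TD0[x] = first column of the InvMixColumns matrix times InvSubBytes(x).
TD0 = [pack(gf_mul(s, 0x0E), gf_mul(s, 0x09), gf_mul(s, 0x0D), gf_mul(s, 0x0B))
       for s in INV_SBOX]
TD1 = [ror8(w) for w in TD0]   # second matrix column
TD2 = [ror8(w) for w in TD1]   # third matrix column
TD3 = [ror8(w) for w in TD2]   # fourth matrix column

def inv_column_lut(col: list[int]) -> int:
    """Four table look-ups and three xors replace both inverse operations."""
    return TD0[col[0]] ^ TD1[col[1]] ^ TD2[col[2]] ^ TD3[col[3]]

def inv_column_reference(col: list[int]) -> int:
    """Direct computation: InvSubBytes, then the InvMixColumns matrix product."""
    m = [[0x0E, 0x0B, 0x0D, 0x09], [0x09, 0x0E, 0x0B, 0x0D],
         [0x0D, 0x09, 0x0E, 0x0B], [0x0B, 0x0D, 0x09, 0x0E]]
    s = [INV_SBOX[b] for b in col]
    out = [0, 0, 0, 0]
    for r in range(4):
        for c in range(4):
            out[r] ^= gf_mul(s[c], m[r][c])
    return pack(*out)

sample = [0x12, 0xAB, 0x00, 0xFF]
print(f"{inv_column_lut(sample):08x}", f"{inv_column_reference(sample):08x}")
```

The trade-off is the one noted in the text: each look-up table holds 256 words of 32 bits instead of 256 bytes, so the critical path per round shrinks to a few look-ups and xors at the cost of roughly four times the table storage per table, and four tables instead of one.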

Behavioral simulations of the cascade system including the encryption and decryption sections
Although the Xilinx ZCU102 platform supports a maximum operating frequency of 350 MHz, the post-implementation simulations produced a negative worst negative slack (WNS) for clock frequencies higher than 220 MHz, indicating clock signal propagation issues. To overcome this problem, the clock frequency has been reduced to 220 MHz and the Explore strategy offered by the Vivado tool has been employed, resulting in a WNS value of 0.005 ns for the encryption task and 0.008 ns for the decryption one. The behavioural simulations of the system composed of the encryption and decryption sections connected in cascade have been performed using a 220 MHz clock frequency, as shown in Figures 4 and 5. Figure 4 shows the waveforms of the encryption/decryption process, obtained by feeding the encryption system with a plaintext packet every 40.86 ns, corresponding to a data rate of about 3 Gbit/s (128 bit / 40.86 ns = 3.13 Gbit/s). As evident, the encrypted packets are available at the encryption block output after ten clock periods (i.e. 10 × 4.54 ns = 45.4 ns) and are loaded by the decryption section on the following rising clock edge; each packet is then decrypted and provided at the output of the decryption system after only ten clock cycles. Hence, the whole encryption/decryption process lasts only 20 clock periods, equal to 90.8 ns at a 220 MHz clock frequency. As evident, the control and synchronization signals have been implemented to ensure the interoperability between the developed encryption/decryption system and the different components integrated into the communication system. Afterwards, the behavioural simulation has been repeated, providing a plaintext data packet to the input of the encryption block on every rising edge of the clock signal. The time interval required for the encryption and decryption remains 90.8 ns.
The combined system can process and provide encrypted data on every rising edge of the synchronization signal, obtaining a data rate of 28.16 Gbit/s (220 MHz × 128 bit = 28.16 Gbit/s); this is indicated by the status of the s00_axis_tready signal, generated by the AES_AXIS_KEY block, which remains at a high level for the whole operating time of the system (blue box in Figure 5), indicating the continuous availability of the encryption block to receive a new plaintext data packet. Also, the error_sig signal, provided by the Pattern Verificator, indicates the correct operation of the designed encryption/decryption section, since it remains at a low level for the whole operating time, indicating that the decrypted data packets are identical to the corresponding ones provided to the encryptor input (yellow box in Figures 4 and 5).
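The timing and throughput figures quoted above follow from simple arithmetic, reproduced in the short Python sketch below. It uses only numbers given in the text; the variable names are illustrative.

```python
# Figures taken from the text; the variable names are illustrative only.
CLOCK_HZ = 220e6        # system clock frequency
BLOCK_BITS = 128        # one AES block, i.e. one data packet
PIPELINE_CYCLES = 10    # latency of the encryptor, and likewise of the decryptor

period_ns = 1e9 / CLOCK_HZ                     # clock period
peak_gbps = CLOCK_HZ * BLOCK_BITS / 1e9        # one packet per clock cycle
paced_gbps = BLOCK_BITS / 40.86                # one packet every 40.86 ns (bit/ns = Gbit/s)
round_trip_ns = 2 * PIPELINE_CYCLES * period_ns  # encryption followed by decryption

print(f"clock period          : {period_ns:.3f} ns")
print(f"peak throughput       : {peak_gbps:.2f} Gbit/s")
print(f"paced throughput      : {paced_gbps:.2f} Gbit/s")
print(f"encrypt + decrypt time: {round_trip_ns:.1f} ns")
```

The exact clock period at 220 MHz is about 4.545 ns; the text truncates it to 4.54 ns, which is why it reports 45.4 ns and 90.8 ns rather than about 45.5 ns and 90.9 ns.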

Post-synthesis and post-implementation simulations of the encryption and decryption systems
In this section, the actual resource utilization of the ZCU102 FPGA device by the developed AES-128 implementation is derived from post-synthesis and post-implementation simulations. The post-synthesis simulations have been carried out on both the encryption and decryption sections, providing a data packet on each clock period at a 220 MHz clock frequency; the resulting FPGA area utilization is summarized in Table 1. Subsequently, the simulation was repeated providing a data packet to the encryption/decryption block input every 40.86 ns, obtaining the same resource utilization. Besides, the post-synthesis simulation was performed on the encryption section after the removal of the blocks added to test the developed implementation, leaving just the blocks required by the encryption process. In this case, the resources needed for the system synthesis are 4.76% of the lookup tables (LUTs) and 0.71% of the flip-flops (FFs) with respect to the overall resources of the Zynq Ultrascale+ XCZU9EG-2FFVB1156E MPSoC, corresponding to 1631 configurable logic blocks (CLBs); therefore, a reduction of 0.72% in LUT utilization and 0.07% in FF utilization has been obtained compared to the complete scheme. The post-synthesis simulations of the decryption system were carried out providing the encrypted data packets both on every rising clock edge and every 40.86 ns, obtaining the same resource utilization, as shown in Table 1.
Similarly, the simulation has been repeated after the removal of all the blocks not involved in the decryption; in this condition, 10.11% of the LUTs, 0.71% of the FFs, and 0.25% of the global buffers (BUFGs) have been used, corresponding to 3464 CLBs, thus obtaining a reduction of 0.53% for LUTs and 0.08% for FFs compared to the complete decryption system. It is evident that the decipherer requires more FPGA resources than the cypher, which is attributable to the four matrix implementations of the Inverse SubBytes, Inverse ShiftRows, and Inverse MixColumns functions. In particular, Inverse MixColumns accounts for most of the used hardware resources, owing to the inverse matrix elements and their LUT-based implementation [45].
Besides, the post-implementation simulations were carried out for the encryption system, the decryption system, and the cascade of both blocks, to check that the parameters resulting from the simulation have acceptable values for the regular operation of the algorithm. The simulations demonstrated that, for clock frequencies higher than 220 MHz, a positive worst negative slack (WNS) value could not be obtained, thus indicating clock routing issues. Specifically, at a 220 MHz operating frequency and with the Explore implementation strategy, the resulting WNS value was equal to 0.005 ns and 0.008 ns for the encryption and decryption sections, respectively. The encryption and decryption sections arranged in a cascade configuration were also tested by post-implementation simulation.
In particular, the occupation of the FPGA resources turned out to be 15% of LUTs, 1% of FFs, 1% of I/O ports, and 1% of BUFGs; a further 25% area utilization was ascribable to the Clocking Wizard IP used for generating the main clock, as shown in Figure 6(a). The WNS value of the combined system was equal to 0.056 ns (dashed blue box in Figure 6(b)). The total on-chip power, defined as the sum of the device and design power consumptions, for the encryption and decryption systems arranged in a cascade configuration and fed with a data packet at each clock cycle, is equal to 1.768 W (red dashed box in Figure 6(c)), with a junction temperature of 26.7 °C, ensuring 73.3 °C of thermal margin. Compared with another pipelined AES-128 implementation, reported in [46] and realized on a comparable FPGA platform, our solution shows a higher efficiency (8.63 Mbps/slice against 2.99 Mbps/slice), indicating better exploitation of the hardware resources to obtain a given data throughput. Besides, compared with the high-throughput AES implementation reported in [47], our encryption system achieves a slightly lower maximum data throughput (-5.3%) but uses fewer FPGA resources (-39.7%), thus reaching a higher hardware resource utilization efficiency (+56.9%). Compared with the implementation reported in [48], our encryption/decryption system shows a considerably higher efficiency (+92.9%).
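The efficiency comparisons above can be cross-checked with a few lines of Python. The sketch uses only figures quoted in the text; the definition of efficiency as throughput divided by the number of slices is inferred from the Mbps/slice unit, and the variable names are illustrative. The last two lines are an observation rather than a statement from the source: the reported efficiency corresponds to roughly twice as many slices as the 1631 CLBs quoted for the encryptor, which would be consistent with counting two slices per CLB, but this conversion is an assumption.

```python
# Cross-check of the reported efficiency figures; names are illustrative only.
PEAK_MBPS = 28160.0          # 28.16 Gbit/s maximum throughput
EFFICIENCY = 8.63            # Mbps/slice, proposed encryptor
EFFICIENCY_REF46 = 2.99      # Mbps/slice, implementation in [46]

# Efficiency scales as throughput / resources, so the relative gain over [47]
# follows from the two relative differences quoted in the text.
throughput_ratio = 1 - 0.053     # -5.3 % throughput
resource_ratio = 1 - 0.397       # -39.7 % resources
gain_vs_ref47 = throughput_ratio / resource_ratio - 1

gain_vs_ref46 = EFFICIENCY / EFFICIENCY_REF46

# Slice count implied by the reported efficiency (assumption: efficiency = Mbps / slices).
implied_slices = PEAK_MBPS / EFFICIENCY
slices_per_clb = implied_slices / 1631

print(f"gain over [47]     : {gain_vs_ref47:+.1%}")
print(f"ratio to [46]      : {gain_vs_ref46:.2f}x")
print(f"implied slice count: {implied_slices:.0f} ({slices_per_clb:.2f} per CLB)")
```

The gain over [47] computed this way is about +57%, in line with the +56.9% reported, the small difference being due to rounding of the quoted percentages.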
Afterwards, the bitstream file of the designed system, consisting of the encryption and decryption sections arranged in a cascade configuration, was generated and then loaded onto the ZCU102 FPGA board. In this configuration, the encrypted data packets provided at the output of the encryption block are immediately loaded by the decryption block, which processes them in only ten clock periods; the whole encryption/decryption operation lasts only 20 clock periods. Also, the integrated logic analyzer (ILA) IP has been added to the Block Design to monitor the signals of interest. The system was successfully tested in both the aforementioned operating conditions, namely providing a plaintext data packet every 40.86 ns (i.e. 3.13 Gbit/s) and every clock cycle (i.e. 28.16 Gbit/s).

CONCLUSION
The proposed research work presents a high-speed and lightweight implementation of the AES-128 cypher for 5G communication systems on a Xilinx ZCU102 FPGA platform. A pipelined framework has been employed for both the encryption and decryption tasks, enabling the simultaneous processing of several data packets in the same clock cycle. A maximum working frequency of 220 MHz was obtained while ensuring a positive WNS value in the post-implementation simulations, thus reaching a maximum data rate of 28.16 Gbit/s (i.e. 220 MHz × 128 bit); the encryption and decryption each take just ten clock periods. Control and synchronization signals have been implemented to ensure the interoperability of the proposed encryption/decryption section with the other sections of the communication system. The encryption system uses only 1631 CLBs, whereas the decryption system employs 3464 CLBs, mainly because of the LUT-based Inverse MixColumns operation. The encryption system shows a higher efficiency (8.63 Mbps/slice) than similar implementations in the scientific literature.