An FPGA-based network system with service-uninterrupted remote functional update

ABSTRACT


INTRODUCTION
The advancement of 5G allows mass deployment of wireless sensors for IoT applications. This directly contributes to a tremendous increase in data volume in communication networks, which translates into a higher network processing throughput requirement. Meanwhile, datapath flexibility becomes a significant factor in supporting functional updates, as some application requirements are unknown at design time [1] or may change from time to time [2]. Hence, functional update is a vital feature for supporting newly emerging network applications and new network protocols [2], and for coping with data concept drift [3,4].
Service availability, or uptime, is another critical factor, especially for network monitoring and data collection applications, where service interruptions that impose blackout periods on applications can negatively impact analytics [5] due to missing data samples. In network and communication applications, most intermediary devices, including middleboxes, are required to remain active to maintain end-to-end node connectivity and to process network packets continuously to prevent data loss. The impact of service interruptions is particularly significant when the system is deployed for mission-critical applications or on the gateway between internet service providers (ISPs) [6]. Besides, service interruptions due to frequent functional updates can prevent applications from achieving the five-nines availability (99.999%) requirement. Field-programmable gate array (FPGA) systems have emerged as a feasible solution because they provide a balance between processing throughput and datapath reconfigurability. As a result, FPGA devices have been widely adopted in many applications, ranging from specialized time-critical applications in niche domains [7][8][9][10][11] to server applications in the cloud [12,13]. With the dynamic partial reconfiguration feature, FPGA systems exhibit a higher degree of datapath flexibility: the sub-circuitry being updated is unavailable for a few milliseconds, while the rest of the circuitry is not affected. Remote functional updates can be achieved by transferring a new partial bitstream and loading it into the FPGA device. Hence, an architecture and design that allow remote functional update by utilizing the dynamic partial reconfiguration feature are important.
Int J Elec & Comp Eng ISSN: 2088-8708  An FPGA-based network system with service-uninterrupted remote functional update (Tze Hon Tan) 3223
In this work, the architecture of a service-uninterrupted FPGA-based network and communication processing system with remote functional update capability is proposed, and it is implemented on the NetFPGA CML development board for experimental testing. To achieve functional update without service interruption, the application datapath is duplicated for redundancy, because the datapath being reconfigured (updated) is temporarily disabled during the dynamic partial reconfiguration process. The reconfiguration throughput for dynamic partial reconfiguration is >3.19 Gbps, and the size of the largest partial bitstream is 3.49 MB, which would cause approximately 9.17 ms of service downtime for each functional update if the proposed architecture with redundancy were not adopted. For an 8 Gbps processing module, 9.17 ms of service downtime can cause up to 8.74 MB of data loss at full bandwidth utilization. Hence, the proposed architecture is targeted at FPGA-based systems that have little or no tolerance for service interruption.
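The downtime and data-loss figures quoted above can be approximated with a short calculation. The sketch below is illustrative only: it uses the largest partial bitstream size (3,666,884 bytes) and the measured throughput (3.19874 Gbps) reported in the evaluation, assumes the downtime is dominated by the bitstream transfer, and assumes that "MB" denotes 2^20 bytes, which is consistent with the quoted 3.49 MB. The function names are hypothetical.

```python
# Figures taken from the paper; function names are illustrative.
BITSTREAM_BYTES = 3_666_884           # largest partial bitstream (RP_0)
RECONFIG_THROUGHPUT_BPS = 3.19874e9   # measured reconfiguration throughput
DATAPATH_RATE_BPS = 8e9               # rate of the processing module


def reconfig_downtime_s(bitstream_bytes, throughput_bps):
    """Time needed to transfer a bitstream at the given throughput."""
    return bitstream_bytes * 8 / throughput_bps


def data_loss_bytes(downtime_s, line_rate_bps):
    """Bytes arriving at full line rate while the datapath is unavailable."""
    return downtime_s * line_rate_bps / 8


downtime = reconfig_downtime_s(BITSTREAM_BYTES, RECONFIG_THROUGHPUT_BPS)
loss = data_loss_bytes(downtime, DATAPATH_RATE_BPS)

print(f"bitstream size: {BITSTREAM_BYTES / 2**20:.3f} MB")
print(f"downtime:       {downtime * 1e3:.3f} ms")
print(f"data loss:      {loss / 2**20:.3f} MB")
```

The result lands close to the quoted 9.17 ms and 8.74 MB; small differences are expected because the paper derives its data-loss figure from a measured clock-cycle count rather than from the throughput.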

RELATED WORKS
Application-specific integrated circuits (ASICs) have commonly been used to implement packet processing functions to achieve high throughput. However, any functional update to an ASIC-based system involves swapping the electronic circuit board. This introduces maintenance problems, because physical access to the facilities is required and the maintenance incurs a significant amount of service downtime [6]. A software approach is preferred for improving system flexibility, but a full software system has limited processing throughput, i.e., less than several hundred Mbps [6]. Furthermore, applying functional updates to either system results in service interruption and downtime for critical computation infrastructure.
In general, there are two major approaches to service-uninterrupted functional update in FPGA-based systems [6]: switching-based and buffering-based. The switching-based approach adopts a redundancy mechanism with context switching to select between the updating datapath and the operating datapath. Hence, this approach doubles the logic resources required for implementation. On the other hand, the buffering-based approach relies on buffers to store packets during the functional update. Although the logic resources required by this approach are half of those required by the switching-based approach, most FPGA devices have insufficient internal random-access memory (RAM) to buffer packets at gigabit rates. Hence, this approach can result in packet drops when the packet buffer is full.
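A rough sizing check illustrates why on-chip buffering is difficult at gigabit rates. The sketch below uses the 8 Gbps datapath rate and the 9.17 ms outage quoted in the introduction, together with the 445 BRAMs reported for the target device. The 36 Kb capacity per BRAM is an assumption based on the 7 Series block RAM size, and the function names are hypothetical.

```python
def buffer_bytes_needed(line_rate_bps, outage_s):
    """Bytes that must be buffered to ride through an outage at line rate."""
    return line_rate_bps * outage_s / 8


def onchip_ram_bytes(num_brams, bits_per_bram=36 * 1024):
    """Total block RAM capacity; 36 Kb per BRAM is an assumed figure."""
    return num_brams * bits_per_bram // 8


needed = buffer_bytes_needed(8e9, 9.17e-3)
available = onchip_ram_bytes(445)

print(f"buffer needed:   {needed / 1e6:.2f} MB")
print(f"on-chip RAM:     {available / 1e6:.2f} MB")
print(f"shortfall ratio: {needed / available:.1f}x")
```

Under these assumptions, the buffer requirement exceeds the total block RAM of the device several times over, even before any BRAM is reserved for the application itself.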
Katayama et al. [6] proposed the buffering-based approach to functional update to avoid service interruption. To prevent packet drops during a functional update, a multi-context type FPGA is used in the implementation. Zhou et al. [14] included a mechanism to realize functional update without causing service interruption in their OpenFlow switch on an SoC platform. This mechanism [14] applies the switching-based approach to the flow table, where a dual-port RAM is used to allow concurrent read and write for context switching during a functional update. However, the mechanism in [14] is applicable to a single flow table in dual-port RAM rather than to a customized datapath in an FPGA. Hence, an architecture and mechanism that utilize dynamic partial reconfiguration in an FPGA to enable service-uninterrupted remote functional update on the datapath are the primary focus of this work.
NetFPGA [15,16] is a family of FPGA-based development boards that is widely used for prototyping networking devices. To date, there are four variants of the NetFPGA development board: NetFPGA 1G, NetFPGA 10G, NetFPGA CML, and NetFPGA SUME. NetFPGA CML and NetFPGA SUME are high-end development boards that come with Xilinx 7 Series FPGAs. Besides a large number of logic resources, Xilinx 7 Series FPGAs support dynamic partial reconfiguration. Except for NetFPGA 1G, the NetFPGA development boards support stand-alone mode, where attachment to a host PC can be avoided as  [16] for network application development and basic open-source reference designs [15] for packet forwarding.
With the dynamic partial reconfiguration feature, a pre-defined region of circuitry can be dynamically reconfigured without impacting the other functional blocks. This feature is managed by the reconfiguration controller, which is implemented either internally within the FPGA device or with an external device. A single-chip solution is made possible by implementing the reconfiguration controller internally with logic resources available within the FPGA device. Fundamentally, dynamic partial reconfiguration enables update-in-the-field [17,18] to deal with dynamic requirements in application accelerators, where the datapath can be updated without impacting the other functional modules. To utilize this feature, an architecture with a mechanism capable of handling the dynamic partial reconfiguration process is needed.

PROPOSED ARCHITECTURE
In our previous work [19], a standalone, remotely and dynamically reconfigurable middlebox was developed. However, the application services in this middlebox are interrupted during every remote functional update. Building on the architecture from [19] shown in Figure 1, this work presents the proposed high-level architecture in Figure 2, with major updates that enable remote functional update without causing service interruptions. The major updates over [19] are:
− The reconfigurable partition (RP) and its internal modules are duplicated.
− The arbiter is replaced by an allocator so that the processing modules in both reconfigurable partitions can function together in the usual operating mode.
In Figures 1 and 2, the dotted boxes denote regions that can be dynamically reconfigured, while the shaded regions represent the application plane modules. In general, the proposed architecture consists of an application plane and a management plane. The modules in the management plane handle management-related tasks, including dynamic partial reconfiguration and module coordination. On the other hand, the application plane focuses on network packet and application processing. By adopting a dual modular redundancy (DMR) approach, the application modules (each containing an application datapath) residing in the respective reconfigurable partitions can cover for each other while dynamic partial reconfiguration is in progress, thus allowing uninterrupted services.

IMPLEMENTATION
The proposed architecture is implemented on the NetFPGA CML board according to the functional block diagram shown in Figure 3. The FPGA implementation flow consists of describing the function blocks and modules behaviorally in Verilog HDL, verifying their functionality with ModelSim waveform simulation, and testing the implementation experimentally on the NetFPGA CML board. The NetFPGA CML board operates in standalone mode, where power is supplied externally so that attachment to a PC through PCIe can be avoided. This is useful for embedded system deployment and can improve overall scalability.
Based on the NetFPGA CML development environment setup, the modelled Verilog HDL sources are synthesized with Xilinx ISE 14.6. All synthesized netlists are then used in Xilinx PlanAhead for the subsequent flow, which includes map, place and route, timing analysis, and bitstream generation. Another reason for using Xilinx PlanAhead is to simplify the dynamic partial reconfiguration flow: the scripts for execution are generated automatically, and their execution can be managed internally. Additionally, Xilinx PlanAhead provides a GUI that eases the definition of the location and size of the dynamically reconfigurable area.

Management plane
The Packet Dispatcher dispatches packets to either the management plane or the application plane based on the packet header identifier. Meanwhile, the Plane Arbiter is responsible for arbitrating packets between the management plane and the application plane. For management-type packets, the payload is extracted by the Management Plane Packet Handler.
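The dispatch decision described above can be modelled in software as a simple header check. The sketch below is a behavioural illustration only, not the hardware implementation: the paper does not specify the header identifier, so the model assumes an untagged Ethernet frame and uses the IEEE 802 local experimental EtherType 0x88B5 as a hypothetical stand-in. All names in the code are hypothetical.

```python
import struct

# Hypothetical identifier: 0x88B5 is an IEEE 802 local experimental EtherType,
# used here only as a stand-in for the paper's unspecified header identifier.
MGMT_IDENTIFIER = 0x88B5
ETH_HEADER_LEN = 14  # destination MAC (6) + source MAC (6) + EtherType (2)


def dispatch(frame: bytes) -> str:
    """Return the plane that should receive this frame."""
    if len(frame) < ETH_HEADER_LEN:
        raise ValueError("frame is shorter than an Ethernet header")
    (identifier,) = struct.unpack_from("!H", frame, 12)
    return "management" if identifier == MGMT_IDENTIFIER else "application"


def extract_payload(frame: bytes) -> bytes:
    """Strip the Ethernet header, as a management packet handler might."""
    return frame[ETH_HEADER_LEN:]


def make_frame(identifier: int, payload: bytes) -> bytes:
    """Build an untagged frame with zeroed MAC addresses (demo helper)."""
    return bytes(12) + struct.pack("!H", identifier) + payload


mgmt = make_frame(MGMT_IDENTIFIER, b"bitstream-chunk")
data = make_frame(0x0800, b"ipv4-packet")
print(dispatch(mgmt), dispatch(data))
```

In this model, only frames carrying the management identifier reach the payload extraction step; everything else is forwarded to the application plane unchanged.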
The Reconfiguration Handler stores the extracted partial bitstream in SRAM through the SRAM Interface. Once the entire partial bitstream has arrived at the FPGA device, the Reconfiguration Handler asserts a flag to stop the packet flow to the modules in the respective partition. Once the modules transition into the idle state, the Reconfiguration Handler loads the partial bitstream into the configuration memory through the internal configuration access port (ICAP, a Xilinx FPGA primitive). After the final piece of the partial bitstream is loaded into the configuration memory, a readback sequence is used to retrieve the status of the dynamic partial reconfiguration. Subsequently, the modules are initialized, and the flag that allows packets to flow through is asserted.
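The sequence above can be summarised as a small behavioural model. The sketch below is a software illustration under simplifying assumptions, not the Verilog implementation: SRAM and configuration memory are modelled as Python lists, the ICAP transfer and status readback are reduced to a copy and a comparison, and failure handling is omitted because the paper does not describe it. The class and method names are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    STORING = auto()   # bitstream chunks are being written to SRAM
    DRAINING = auto()  # stop flag asserted; waiting for modules to go idle
    ACTIVE = auto()    # reconfiguration finished; packets flow again


class ReconfigurationHandlerModel:
    def __init__(self, expected_chunks: int):
        self.expected_chunks = expected_chunks
        self.sram = []            # stand-in for the external SRAM
        self.config_memory = []   # stand-in for FPGA configuration memory
        self.packets_allowed = True
        self.state = State.STORING
        self.log = []

    def store_chunk(self, chunk: bytes) -> None:
        """Store one extracted bitstream chunk; stop traffic on the last."""
        if self.state is not State.STORING:
            raise RuntimeError("not accepting bitstream chunks")
        self.sram.append(chunk)
        if len(self.sram) == self.expected_chunks:
            self.packets_allowed = False
            self.state = State.DRAINING
            self.log.append("stop packet flow")

    def modules_idle(self) -> bool:
        """Run load, readback, and initialization once the partition is idle."""
        if self.state is not State.DRAINING:
            raise RuntimeError("bitstream not fully received")
        self.config_memory = list(self.sram)         # models ICAP writes
        self.log.append("load via ICAP")
        status_ok = self.config_memory == self.sram  # models status readback
        self.log.append("readback")
        self.log.append("initialize modules")
        # Failure handling is not described in the paper, so none is modelled.
        self.packets_allowed = True
        self.state = State.ACTIVE
        self.log.append("allow packet flow")
        return status_ok


handler = ReconfigurationHandlerModel(expected_chunks=2)
handler.store_chunk(b"\x00\x01")
handler.store_chunk(b"\x02\x03")
blocked_during_update = not handler.packets_allowed
ok = handler.modules_idle()
print(handler.log)
```

The point of the model is the ordering: packet flow to the partition stops only after the whole bitstream is in SRAM, so the window in which the partition is unavailable covers the ICAP load and initialization but not the network transfer.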

Application plane
The application plane consists of the Ingress Allocator, the network packet processing module (the application module containing the application datapath), and the Egress Allocator. The Ingress Allocator redirects packets to a network packet processing module that is in the idle state for processing. Similarly, the Egress Allocator redirects packets to the Plane Arbiter whenever a network packet processing module flags a request.
There are two network packet processing modules, one in each reconfigurable partition, namely RP_0 and RP_1. This implementation allows the network packet processing modules to cover for each other, especially when one of them is being dynamically reconfigured. When both network packet processing modules are activated, the Ingress Allocator redirects packets based on their availability, so both modules can service packets in parallel. A network packet processing module is deactivated for reconfiguration based on the flag in the reconfiguration packet header. As the application plane datapath depends on the network algorithmics, its architecture is not discussed in this paper.
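The availability-based redirection can be illustrated with a minimal software model. The sketch below assumes a fixed-priority choice (the first module that is both activated and idle); the paper states only that packets are redirected based on availability, so the selection policy and all names here are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProcessingModule:
    name: str
    active: bool = True  # False while its partition is being reconfigured
    busy: bool = False   # True while it is processing a packet


def allocate(modules: List[ProcessingModule]) -> Optional[ProcessingModule]:
    """Return the first activated, idle module, or None if none is available."""
    for module in modules:
        if module.active and not module.busy:
            return module
    return None


rp0 = ProcessingModule("RP_0")
rp1 = ProcessingModule("RP_1")
modules = [rp0, rp1]

first = allocate(modules)          # both idle: RP_0 is chosen
first.busy = True
second = allocate(modules)         # RP_0 busy: RP_1 serves in parallel

rp0.busy = False
rp0.active = False                 # RP_0 deactivated for reconfiguration
during_update = allocate(modules)  # RP_1 covers for RP_0

print(first.name, second.name, during_update.name)
```

With both modules activated, packets are spread across the two partitions; with one deactivated, all traffic falls to the other, which is the behaviour that keeps the service available during an update.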

EVALUATION
In the NetFPGA CML Kintex 7 device (XC7K325T-1FFG676), there are 50,950 slices and 445 BRAMs available for implementation. Table 1 lists the logic resources required for implementation. The modules in each reconfigurable partition (RP) utilize approximately 1,006 slices and 17 BRAMs for a learning content addressable memory (CAM) [20] switch application.
For functional verification and experimental testing, the reconfigurable modules are dynamically reconfigured to expand them with deep packet inspection (DPI) blocks. Table 2 lists the logic resources required for implementation after the DPI block expansion. Network packets are injected from another PC into the NetFPGA and analyzed using the Wireshark packet analyzer, where the behaviour observed in the captured packets is used to verify that the reconfiguration process succeeded.
Based on the experimental evaluation, the reconfiguration throughput is higher than 3.19874 Gbps, where the maximum reconfiguration throughput at a 100 MHz clock frequency and 32-bit bus width is 3.20000 Gbps. The sizes of the partial bitstreams are 3,666,884 bytes and 2,241,540 bytes for RP_0 and RP_1, respectively. Loading these partial bitstreams to the configuration memory through ICAP and modules initialization  Table 3 shows the comparison of the acquired reconfiguration throughput with other similar works. The clock cycles used for dynamic reconfiguration and module initialization also constitute a service downtime period. For a datapath at full bandwidth utilization, where the data bus width is 64 bits, there would be a data loss of 9,169,432 bytes (1,146,179 * 64 bits) if the proposed architecture and mechanism were not adopted. For a maximum transmission unit (MTU) of 1,500 bytes, this amounts to a loss of more than 6,112 full-sized packets. This impact is significant for applications with little to no tolerance for such service interruption or when the deployment is near the network core. Figure 4 shows the floor plan for RP_0 and RP_1 in Xilinx PlanAhead.

Table 3. Comparison of reconfiguration throughput with other similar works
Design                            Throughput (Gbps)   Storage
[21]                              3.05600             DRAM
DPR Manager [22]                  3.07432             SD Flash
FlashCAP [23]                     3.08000             BRAM
Intelligent ICAP Controller [24]  3.19832             SRAM
ICAP Controller [25]              3.19840             DDR SDRAM
BRAM_HWICAP [26]                  2.97120             BRAM
AC_ICAP [27]                      3.04824             BRAM
Proposed                          3.19874             SRAM
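The evaluation figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text, with one exception: the 125 MHz datapath clock is an assumption inferred from the 8 Gbps module rate and the 64-bit data bus, and is not stated in the paper.

```python
# Configuration port figures from the text.
ICAP_CLOCK_HZ = 100_000_000
ICAP_WIDTH_BITS = 32
MEASURED_THROUGHPUT_BPS = 3.19874e9

# Datapath figures from the text.
DOWNTIME_CYCLES = 1_146_179
BUS_WIDTH_BITS = 64
MTU_BYTES = 1_500

# Assumption: 8 Gbps / 64 bits implies a 125 MHz datapath clock.
DATAPATH_CLOCK_HZ = 125_000_000

peak_bps = ICAP_CLOCK_HZ * ICAP_WIDTH_BITS         # theoretical maximum
efficiency = MEASURED_THROUGHPUT_BPS / peak_bps    # fraction of the maximum

lost_bytes = DOWNTIME_CYCLES * BUS_WIDTH_BITS // 8
full_packets_lost = lost_bytes // MTU_BYTES
downtime_s = DOWNTIME_CYCLES / DATAPATH_CLOCK_HZ

print(f"peak throughput: {peak_bps / 1e9:.5f} Gbps")
print(f"efficiency:      {efficiency:.4%}")
print(f"lost bytes:      {lost_bytes:,}")
print(f"full packets:    {full_packets_lost:,}")
print(f"downtime:        {downtime_s * 1e3:.3f} ms")
```

Under the assumed clock, the cycle count corresponds to roughly 9.17 ms, which agrees with the downtime quoted elsewhere in the paper, and the measured throughput sits within a small fraction of a percent of the theoretical maximum of the configuration port.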

CONCLUSION
This paper presented a standalone FPGA-based system architecture that allows remote functional update without causing service interruption by adopting a redundancy mechanism in the application datapath and utilizing the dynamic partial reconfiguration feature. Significantly, service interruption is no longer triggered by remote functional updates, and the processing throughput is doubled except during dynamic partial reconfiguration (9.17 ms). Hence, the proposed architecture is well-suited to FPGA-based systems that have limited tolerance for service interruption. Such applications include IoT sensors in monitoring and data collection applications, which operate continuously (24/7) to avoid data loss. Wireless sensors can be deployed pervasively with the emergence of the 5G network, which further strengthens the significance of service-uninterrupted remote functional update in FPGA-based systems. Future work will focus on architecture exploration for flexibility improvement and data analytics integration for deployment.