High performance modified bit-vector based packet classification module on low-cost FPGA

ABSTRACT

has a recursive flow method, Lucent and aggregated bit-vector algorithms, and a cross-producting method. Lastly, the hardware-based methods include bitmap intersection and ternary content addressable memory (TCAM) approaches [2]-[4].
The PC module is mostly used for extensive networks, multi-field classification, and large rule sets, all of which increase its complexity. The TCAM is a hardware-based approach that classifies packets at high speed with low complexity and without consuming more energy [5], [6]. The decision tree based on the binary search on level (BSOL) engages the replication control method to reduce the memory utilization of BSOL. The decision tree-based BSOL supports dynamic updates, improves speed, and uses less memory than HiCuts approaches [7]. Non-partitioning algorithms such as exhaustive search-based approaches take more classification time and support fewer table searches. The cross-producting method supports more table searches and takes less classification time. Partition-based approaches such as decision-tree, tuple space, and hash-based approaches support a moderate number of search tables and take average classification time [8].
The proposed modified bit vector (MBV) based packet classification (PC) module is designed and implemented on FPGA. The PC module offers scalable and memory-efficient features, which suit real-time network applications. Section 1 reviews existing packet classification work using different approaches. The proposed MBV based packet classification with detailed hardware architecture is explained in section 2. Section 3 elaborates on the results and discussion of the proposed work, with different design constraints on FPGA and a comparison with existing similar PC methods. Section 4 concludes the outcome of the proposed work.
This section discusses the existing packet classification (PC) approaches from different network application viewpoints. Qu and Prasanna [9] discuss a high-performance packet classification module on FPGA, which combines dynamic updates, clustering, striding, power gating, and a dual-port memory mechanism in a single PC engine to improve system performance. The 2-dimensional pipelined BV based PC is designed with a scalable architecture to overcome existing PC challenges. Dynamic updates such as modification, insertion, and deletion in the ruleset improve the optimization of PC. The latency, energy, and throughput are analyzed for different rules, with better results. Zhou et al. [10] present decomposition-based PC algorithms with a multi-core processor implementation. Decomposition methods such as linear search, range search, linear BV, and range BV are incorporated in the PC. Using these decomposition methods, latency and throughput are analyzed on different processors, such as AMD and Intel, with respect to the number of rules. The work also analyzes the search and merge latency, cache performance, and threads per core. Linan et al. [11] describe an improved cutting-based PC method with multidimensional features. The improved version of HyperCut offers a better tradeoff between throughput and memory while creating the decision tree. The HyperCut algorithm covers the filter set, cutting methods, functional evaluation, decision tree building, and searching with updating methods to classify the packets.
Wang et al. [12] present a TCAM based PC to improve the packet forwarding rate. This work provides a memory-efficient architecture for TCAM based PC by compressing the memory space used for identical data in different rulesets. The global and block mask registers used in the TCAM improve the packet forwarding rate. Meshram et al. [13] explain the field split BV (FSBV) algorithm for PC with a modular architecture. This work uses the BV approach to classify the packets and improves latency and throughput by providing the proper ruleset. Its memory requirement is lower than that of the existing FSBV approach. Khan et al. [14] explain a high-throughput PC module that uses simple XNOR gates to classify the packets based on the ruleset. The work offers lower memory utilization and lower latency than the stride BV based PC approach for the same rules. Mohan et al. [15] present different multi-match approaches such as hyper-cut (HC) based, multi-match based discriminator (MMD), and B2PC, along with pipelined and distributed hash table (PDHT) based packet classification approaches. The hyper-cut approach offers high speed and cutting freedom from the ruleset, whereas MMD, which is the same as the TCAM approach, offers more processing speed and adopts a parallel search approach. The B2PC provides significant determination of rule set membership. The PDHT provides efficient, high-throughput classification without using TCAM.
Zheng et al. [16] present a total prefix-length (TPL) cluster-based PC algorithm that decomposes the rules into different clusters with the same prefix length, together with an area-based quadtree approach with the highest priority. The work offers better space utilization, search speed, and dynamic updates to improve PC performance. Yingchareonthawornchai et al. [17] present a fast and dynamic PC with a sorted-partitioning method. The ruleset sortability function explains PC and the usefulness of field order comparison in overcoming PC's challenges. The multi-dimensional interval tree approach offers search, deletion, and insertion for the sortable ruleset. The online and offline sortability partitioning of the ruleset offers priority optimization and successful searches. Huang et al. [18] explain the geometric space partition and hash (GSPH) based hybrid PC algorithm. This work provides better classification speed with the same accuracy by using a hash table with a parallel structure, and it matches large packet sets for classification. The ruleset is split into subsets and hashed, followed by a decision tree structure with a packet strategy to classify the packets.
Packet classification is increasingly used because multiple fields have to be matched against many rule sets. Multi-field matching is always challenging when the same performance, large throughput, and efficient memory usage must be maintained. As noted from the existing works in the above section, most classification approaches consume more chip area and latency and obtain low throughput on FPGA and ASIC platforms. The proposed PC module provides a low-latency, high-throughput pipelined architecture with less on-chip resource utilization, which helps perform better classification of incoming packets from networks. The next section explains the proposed MBV based PC with detailed hardware architecture.

PROPOSED PACKET CLASSIFICATION MODULE
Packet classification partitions packets into various flows; the group of packets that matches a predefined rule comes under one flow. Packet classification aims to identify the highest-priority rule that matches the incoming packet and to perform the specific action associated with that rule. The packet classifier module matches the input packet's fields against the rules in the database. The rule set, or classifier, is made up of a particular set of rules, and these rules use the header fields of the input packets as the search key. The overview of the proposed modified bit vector (MBV) based packet classification module is represented in Figure 1.
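To make the classification goal concrete, the following is a minimal software reference, not the proposed hardware: a linear search that returns the index of the highest-priority matching rule. The `Rule` layout (value/mask pairs for addresses, inclusive port ranges), the convention that index 0 is the highest priority, and the sample rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Rule:
    # Hypothetical 4-field rule: masked addresses plus inclusive port ranges.
    sa_value: int
    sa_mask: int
    da_value: int
    da_mask: int
    sp_range: Tuple[int, int]
    dp_range: Tuple[int, int]
    action: str


def matches(rule: Rule, sa: int, da: int, sp: int, dp: int) -> bool:
    """Return True when all four header fields satisfy the rule."""
    return (
        (sa & rule.sa_mask) == (rule.sa_value & rule.sa_mask)
        and (da & rule.da_mask) == (rule.da_value & rule.da_mask)
        and rule.sp_range[0] <= sp <= rule.sp_range[1]
        and rule.dp_range[0] <= dp <= rule.dp_range[1]
    )


def classify(rules: List[Rule], sa: int, da: int, sp: int, dp: int) -> Optional[int]:
    """Linear search; rules are assumed sorted with index 0 as highest priority."""
    for index, rule in enumerate(rules):
        if matches(rule, sa, da, sp, dp):
            return index
    return None


# Illustrative ruleset: one specific rule followed by a catch-all rule.
RULES = [
    Rule(0xC0A80000, 0xFFFF0000, 0x0A000000, 0xFF000000, (0, 65535), (80, 80), "allow"),
    Rule(0x00000000, 0x00000000, 0x00000000, 0x00000000, (0, 65535), (0, 65535), "drop"),
]
```

A bit-vector classifier is expected to return the same rule index as this linear search for every packet, which makes the sketch usable as a golden model when testing a hardware design.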

Packet generation module (PGM)
The packet generation module (PGM) receives the incoming packets from the network system or the PC user and processes them further. The PGM marks the first packet as the start of packet (SOP) to begin the PC process. Each packet contains 1 byte (8 bits) of data. The PGM validates each incoming packet and forwards only the valid packets. The PGM also checks for error packets; if an error packet is received, it is discarded immediately and not processed further.

Header extractor module (HEM)
The HEM is designed based on the internet protocol version 4 (IPv4) specification. The received packets are framed into different header formats for PC. The packet header fields are used in the classification process to determine whether the packet matches the rule set. The HEM mainly extracts the header field information from the received packets. The extracted headers are classified as the Ethernet header, the internet protocol (IP) header, and the transmission control protocol (TCP) header. Only the IP and TCP headers are considered for the MBV based PC. The IP header provides the source and destination addresses. The TCP header provides the source and destination ports.
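The field extraction performed by the HEM can be sketched in software as follows. This is only an illustration using the standard IPv4 and TCP header layouts (addresses at byte offsets 12 and 16 of the IP header, ports in the first four bytes of the TCP header); the function names `extract_fields` and `build_sample_packet` are hypothetical, and the sketch assumes the Ethernet header has already been stripped.

```python
import struct
from typing import Tuple

TCP_PROTOCOL_NUMBER = 6


def extract_fields(ip_packet: bytes) -> Tuple[int, int, int, int]:
    """Return (SA, DA, SP, DP) from an IPv4 packet that carries TCP."""
    if len(ip_packet) < 20:
        raise ValueError("truncated IPv4 header")
    version = ip_packet[0] >> 4
    header_length = (ip_packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_length < 20:
        raise ValueError("not a valid IPv4 header")
    if ip_packet[9] != TCP_PROTOCOL_NUMBER:
        raise ValueError("payload is not TCP")
    if len(ip_packet) < header_length + 4:
        raise ValueError("truncated TCP header")
    # Source and destination addresses are 32-bit big-endian words.
    sa, da = struct.unpack_from("!II", ip_packet, 12)
    # Source and destination ports are the first two 16-bit TCP fields.
    sp, dp = struct.unpack_from("!HH", ip_packet, header_length)
    return sa, da, sp, dp


def build_sample_packet() -> bytes:
    """Build a 40-byte IPv4+TCP packet with illustrative field values."""
    ip_header = struct.pack(
        "!BBHHHBBHII",
        0x45, 0, 40, 0, 0, 64, TCP_PROTOCOL_NUMBER, 0,
        0xC0A80101, 0x0A000002,
    )
    tcp_header = struct.pack("!HHIIHHHH", 1234, 80, 0, 0, 0x5002, 8192, 0, 0)
    return ip_header + tcp_header
```

The checksum fields are left at zero in the sample packet because the sketch does not verify them.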

Modified bit vector packet classification (MBV-PC) module
The MBV algorithm plays a significant role in the packet classification process and is mainly used to match the source and destination IP addresses along with the other fields. The rule set is first converted into bit vectors (BV), and the header fields are then matched against them to find the matching rule. The BVs are generated first, and the classification process then continues to find the highest-priority match against the ruleset. The pipelined architecture of the MBV based packet classification module is represented in Figure 2. The PC module has the IP header fields, TCP header fields, and BV fields as inputs, and the highest-priority match as the classified output. The IP header field is decomposed into source address (SA) and destination address (DA) as header inputs for the MBV process. Similarly, the TCP header field is decomposed into source port (SP) and destination port (DP) as header inputs for the range search process. The MBV based SA and DA are processed in a pipelined manner. In this design, 4 stages of the MBV based SA and DA processes and the range search based SP and DP processes are performed. The MBV process of a single stage of the pipelined architecture is represented in Figure 3. Here, 'W' is the header field width (in bits), 'N' is the number of rules in the ruleset, and 'k' is the sub-field length (in bits). The MBV algorithm working flow is as follows:
• Each 'W'-bit header field rule is divided into sub-fields of length 'k' bits. Each sub-field is known as a stage or stride.
• There are W/k sub-fields, each 'k' bits wide. Each 'k' bits of the rule are matched with the corresponding 'k' bits of the header field (SA/DA).
• The same process is continued to map each sub-field to 2^k 'N'-bit vectors.
• The corresponding rule is matched with the input header to get the match results. Each 'k' bits of the header field access the corresponding BV memory to generate a single 'N'-bit vector. The same process is continued until the 'N'-bit vectors of all stages are obtained.
• Perform the bitwise AND operation of this BV memory output and the BV input field to obtain the matching output of the whole ruleset for the particular field.
• The obtained BV output indicates whether each rule in the rule set matches the header field value.
The MBV algorithm processes the matching for each sub-field independently with the corresponding header field bits. The 'W'-bit rule is divided into sub-fields of 'k' bits, so the total number of sub-fields is W/k. W/k equals the number of searches for the whole PC, and it can be achieved by using W/k pipelined stages. In this design, W=16 and k=4, so W/k=4 pipelined stages are considered.
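The working flow above can be sketched in software as a generic stride-based bit-vector lookup. This is a minimal sketch under stated assumptions, not the authors' hardware design: rules are given as hypothetical (value, mask) pairs where a mask bit of 1 means the bit must match, bit i of each vector stands for rule i, and the function names `build_bv_memory` and `match_field` are illustrative. W=16 and k=4 follow the design parameters given in the text.

```python
from typing import List, Tuple

W = 16           # header field width in bits (from the text)
K = 4            # sub-field (stride) length in bits (from the text)
STAGES = W // K  # number of pipelined stages, W/k = 4


def build_bv_memory(rules: List[Tuple[int, int]]) -> List[List[int]]:
    """Build one table per stage holding 2^k N-bit vectors (as Python ints)."""
    memory = [[0] * (1 << K) for _ in range(STAGES)]
    for stage in range(STAGES):
        shift = W - K * (stage + 1)  # stage 0 covers the most significant bits
        for rule_index, (value, mask) in enumerate(rules):
            sub_value = (value >> shift) & ((1 << K) - 1)
            sub_mask = (mask >> shift) & ((1 << K) - 1)
            for candidate in range(1 << K):
                # Set bit rule_index if this k-bit pattern satisfies the rule.
                if (candidate & sub_mask) == (sub_value & sub_mask):
                    memory[stage][candidate] |= 1 << rule_index
    return memory


def match_field(memory: List[List[int]], header: int, num_rules: int) -> int:
    """AND the vectors selected by each k-bit slice of the header field."""
    bit_vector = (1 << num_rules) - 1  # BV input: every rule still possible
    for stage in range(STAGES):
        shift = W - K * (stage + 1)
        stride = (header >> shift) & ((1 << K) - 1)
        bit_vector &= memory[stage][stride]
    return bit_vector


# Illustrative rules: exact match, 8-bit prefix, and full wildcard.
RULES = [(0xC0A8, 0xFFFF), (0xC000, 0xFF00), (0x0000, 0x0000)]
MEMORY = build_bv_memory(RULES)
```

With these rules, a header value of 0xC0A8 matches all three rules, 0xC0FF matches rules 1 and 2, and 0x1234 matches only the wildcard rule 2. In hardware, each loop iteration of `match_field` corresponds to one pipeline stage, so a new header can enter the pipeline every clock cycle.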

Range search module
In the final stage, the MBV matched outputs, namely the SA and DA outputs, are input to the range search module as bit vector inputs for the corresponding SP and DP, respectively. The packet classifier ruleset contains particular ranges for the corresponding fields, which must be matched. The range search therefore finds the ranges in the ruleset that contain the input header field value. The range search module works in the same pipelined manner as the MBV process, with the same number of pipeline stages, W/k = 4. The range search is performed in each stage for one sub-field. The range search module for each stage of the pipelined architecture is represented in Figure 4. The lower bound (LB) and upper bound (UB) range values are predefined for the header field. The header input field is either the SP or the DP data. The header field input is compared (≥) with the LB values and generates the corresponding bit vector value, either '0' or '1'. Similarly, the same header field is compared (≤) with the UB values and generates the corresponding bit vector value, either '0' or '1'. The process continues up to bit-vector position N-1 and generates the BV upper bound (BV_Hout) and lower bound (BV_Lout) outputs for one stage. The same range search process is continued for the other header fields and generates the corresponding BV lower and upper bound outputs. Finally, an AND operation is performed over all the stage BV lower and upper bound outputs to generate the final range search BV lower and upper bound outputs for SP and DP individually.
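The lower and upper bound comparisons can be illustrated with a simplified sketch. It compares the whole port value against each rule's bounds in one step and therefore omits the per-stage, sub-field decomposition of the pipelined design; the function name `range_search`, the bit ordering (bit i stands for rule i), and the sample bounds are assumptions for illustration.

```python
from typing import List, Tuple


def range_search(header: int, bounds: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (BV_Lout, BV_Hout) for one port field.

    Bit i of BV_Lout is 1 when header >= LB of rule i.
    Bit i of BV_Hout is 1 when header <= UB of rule i.
    """
    bv_lout = 0
    bv_hout = 0
    for rule_index, (lower_bound, upper_bound) in enumerate(bounds):
        if header >= lower_bound:
            bv_lout |= 1 << rule_index
        if header <= upper_bound:
            bv_hout |= 1 << rule_index
    return bv_lout, bv_hout


# Illustrative destination-port ranges: exactly 80, 1024..65535, and any port.
DP_BOUNDS = [(80, 80), (1024, 65535), (0, 65535)]
```

A rule's range contains the header value only when both of its bits are set, so ANDing the two vectors gives the per-rule range match. For a destination port of 80 the two vectors are 0b101 and 0b111, which leaves rules 0 and 2 as matches.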

Network aggregator and priority encoder
The network aggregator performs the AND operation on the BV source address output (BV_SA_out), the BV destination address output (BV_DA_out), the BV source port lower and upper bound outputs (BV_SP_Lout and BV_SP_Hout), and the BV destination port lower and upper bound outputs (BV_DP_Lout and BV_DP_Hout). The network aggregator output is input to the priority encoder. The priority encoder collects the final BV outputs and extracts the single highest-priority matching rule. It examines the BV output bitwise and generates the highest-priority rule among all the matching rules as the classified output.
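The aggregation and priority encoding steps can be sketched as follows. The sketch assumes that bit 0 of each vector corresponds to the highest-priority rule, which the text does not specify; the function names `aggregate` and `priority_encode` are illustrative.

```python
from functools import reduce
from operator import and_
from typing import Optional


def aggregate(*bit_vectors: int) -> int:
    """Bitwise AND of all per-field bit vectors (SA, DA, SP and DP bounds)."""
    return reduce(and_, bit_vectors)


def priority_encode(bit_vector: int) -> Optional[int]:
    """Return the index of the lowest set bit, assumed to be the highest priority."""
    if bit_vector == 0:
        return None  # no rule matches every field
    lowest_set_bit = bit_vector & -bit_vector  # isolates the lowest 1 bit
    return lowest_set_bit.bit_length() - 1


# Illustrative inputs in the order BV_SA_out, BV_DA_out, BV_SP_Lout,
# BV_SP_Hout, BV_DP_Lout, BV_DP_Hout for a three-rule ruleset.
FINAL_BV = aggregate(0b111, 0b110, 0b111, 0b110, 0b111, 0b111)
MATCHED_RULE = priority_encode(FINAL_BV)
```

Here only rules 1 and 2 survive the AND across all six vectors, so the encoder reports rule 1 as the classified output.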

RESULTS AND DISCUSSION
This section elaborates on the proposed packet classification (PC) results using the modified bit vector (MBV) approach and its comparative analysis with existing approaches. The high-performance PC using the MBV approach is designed and synthesized on Artix-7 FPGA. The resources utilized on the Artix-7 (XC7A100T) FPGA for the PC using the MBV approach are tabulated in Table 1 [25]. The latency and throughput (Mpps) of the MBV based PC are improved by around 20% and 21.6%, respectively, compared with the BV based PC [25]. The proposed MBV based PC utilizes fewer slices, operates at a suitable frequency, and works at a better speed than the existing PC approaches. The efficiency parameter comparisons of the proposed MBV_PC module with existing PC modules are tabulated in Table 3. The TCAM based PC [23] is implemented on Kintex-7 FPGA, works with a throughput of 3.4 Gbps, does not support the range-to-prefix feature, and considers 23.5 bytes/rule in memory. The fast PC [24] is implemented on Virtex-7 FPGA, gives a throughput of 51.76 Gbps, does not support the range-to-prefix feature, and considers 38.9 bytes/rule in memory. The stride BV based PC [26] is implemented on Virtex-7 FPGA and works at 111 Gbps with a latency of 31 clock cycles. The stride BV considers 52 bytes/rule in memory; rule set dependency and range-to-prefix features are not supported. The emulated TCAM [27] is implemented on Stratix-IV and has a latency of only 1 clock cycle with a throughput of 64 Gbps. The emulated TCAM [27] supports the range-to-prefix feature with fewer rule set dependencies and considers 24 bytes/rule in memory. The BV TCAM [28] is implemented on Virtex-2 FPGA and has a latency of 11 clock cycles with a throughput of 75 Gbps. The BV TCAM [28] does not support the range-to-prefix feature, has more rule set dependencies, and considers 154 bytes/rule in memory. The DCFL [29] works at 19 Gbps throughput with a latency of 5 clock cycles.
The DCFL [29] does not support the range-to-prefix feature and considers 90 bytes/rule in memory. The proposed MBV_PC module works at 74.95 Gbps with a latency of 4 clock cycles and considers 16 bytes/rule in memory. The MBV_PC module does not support the range-to-prefix and rule set dependency features.

CONCLUSION
In this manuscript, an efficient MBV based packet classification module is implemented on FPGA, which offers high speed, low latency, and better memory utilization. The PC mainly contains a packet generation module, a header extractor module, MBV based source and destination address modules, range search based source and destination port units, and a priority encoder to classify the highest-priority match output. The MBV based PC utilizes 2% of slices and 3% of LUTs, works at 493.1 MHz, and consumes 0.1 total power on Artix-7 FPGA. The MBV based PC uses only 4 clock cycles with a throughput of 74.95 Gbps and utilizes 16 bytes/rule in memory. The proposed MBV based PC improves on our previous BV based PC by 14.46% in slices, 2.91% in total power, 20% in latency, and 17.35% in throughput. The proposed PC is also compared with different PC engines and shows improvement in different design constraints.