Area efficient parallel LFSR for cyclic redundancy check

Received Jun 12, 2019; Revised Oct 23, 2019; Accepted Oct 30, 2019

Cyclic Redundancy Check (CRC), a code for error detection, finds many applications in digital communication, data storage, control systems and data compression. The CRC encoding operation is carried out using a Linear Feedback Shift Register (LFSR). A serial implementation of CRC requires a number of clock cycles equal to the data message length plus the generator polynomial degree, whereas a parallel implementation requires one clock cycle if the whole data message is applied at a time. In previous work on parallel LFSRs, the hardware complexity of the architecture was reduced using a technique named state space transformation. This paper presents a detailed explanation of the search algorithm implementation and a new technique to find the number of XOR gates required for different CRC algorithms. The comparison between the proposed and previous architectures shows that the number of XOR gates is reduced for the CRC algorithms, which improves hardware efficiency. The search algorithm and all matrix computations have been performed using MATLAB simulations.


INTRODUCTION
CRC (Cyclic Redundancy Check) is the most popular error detection algorithm and is used in digital communication, data storage and data compression. Digital communication standards that use CRC codes include Asynchronous Transfer Mode (ATM), WiMAX (IEEE 802.16) and Wi-Fi (IEEE 802.11). The CRC algorithm is based on cyclic codes, which have two properties: linearity and cyclicity. An LFSR is used to perform the CRC operation [1]. The LFSR is an important circuit in communication and is used in encoders, decoders, cryptography, CDMA and test pattern generators. The two implementation styles of the LFSR are Fibonacci and Galois [2]. Power dissipation is a challenging problem in the VLSI field, and various algorithms have been proposed to reduce the power consumption of the LFSR. In [3], the concept of reducing transitions in the generated test patterns was used for the LFSR. A gated clock approach was used to reduce the power consumption in [4]. The switching and testing time of the LFSR was reduced in [5] to lower power consumption. The conventional serial LFSR uses the Galois implementation for an nth-order generator polynomial, as shown in Figure 1.
There are different types of CRC algorithms available. Each CRC algorithm has a predefined generator polynomial which is used to generate the CRC code. Figure 1 is drawn for the general case of the generator polynomial given by $G(x) = 1 + \sum_{i=1}^{n-1} g_i x^i + x^n$, where n is the generator polynomial degree. The generator polynomial is different for different types of CRC algorithms. The coefficients of the generator polynomial are g₀, g₁, g₂, …, gₙ, in which g₀ and gₙ are always equal to one. The authors in [6] gave the theory behind CRC and designed different hardware and software methods for CRC implementation. Parallel CRC implementations based upon mathematical deductions have been proposed in [7]. In [8][9][10][11][12][13], the state space representation of serial and parallel LFSRs is given. All of these architectures process a data message of length 'm' which is divided into a number of blocks of length 'v' each, where 'v' is the same as the generator polynomial degree (n). In [10], a state-space similarity transformation was proposed whose transformation matrix depends on a vector, and an exhaustive search was applied to find that vector.
In [12], a parallel architecture for the LFSR was proposed on the basis of the transposed serial LFSR, which reduces hardware complexity as well as critical path delay (CPD). The authors in [13] gave an improved way to find the transformation matrix, together with an improved search algorithm that obtains the best vector for reducing hardware complexity. Parhi [14] and Zhang and Parhi [15] proposed high-speed architectures for BCH encoders. New approaches for eliminating the fan-out bottleneck have been proposed in [14,16]. To reduce the effect of fan-out, the authors in [17] proposed a two-step method. Pipelining methods have been proposed for high-speed CRC hardware implementation in [18]. In [19,20], approaches based on software CRC computation have been presented.
A matrix-based formulation has been used to compute the CRC for the specific case in which the generator polynomial degree is less than the degree of parallelism [16]. New methods for the transformation from serial to parallel have been described in [21] and were first applied to CRC calculation by Patel [22]. A look-ahead technique was used to parallelize the computation in [23]. A parallel implementation of CRC was achieved using a cascading approach in [24]. The performance of CRC polynomials generated with genetic algorithms has been evaluated in [25]. The large fan-out effect has been eliminated for long BCH codes using a new technique based on the Chinese Remainder Theorem [26]. In [13], an improved search algorithm was proposed which finds the best vector, that is, the one resulting in less hardware. This paper presents the implementation and results of this search algorithm using MATLAB. Functions for calculating the number of ones, generating kids from a parent node, and calculating the total number of ones (TN) for each child node in a level are used to implement the search algorithm.
The main requirement is to reduce the hardware complexity of the design. To address this, an improved technique is given in this paper to find the number of XOR gates for different CRC algorithms. The method is based on a sharing technique across multiple rows, which results in a more hardware-efficient architecture than the previous one. The remainder of the paper is organized as follows. Section 2 represents the serial and parallel linear feedback shift registers using the state space method, summarizes the technique and constraints for obtaining the transformation (T) matrix presented in [13], details the implementation of the search algorithm, and explains the new technique to find the number of XOR gates. Section 3 compares the proposed architecture with previously reported parallel architectures, and the conclusion is given in Section 4.

RESEARCH DESIGN
State-space representation of LFSR
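The state-space equation of the conventional serial LFSR implementation is given as:

$$X_{n \times 1}(t+1) = A\,X_{n \times 1}(t) + b\,u(t) \qquad (1)$$

where, in equation (1), $X_{n \times 1}(t)$ is a vector of order (n x 1), u(t) represents the single-bit input at time 't' and b is the (n x 1) input vector. Matrix A is expressed in terms of the generator polynomial coefficients and is the same as the transition matrix in the Galois implementation of the LFSR.
The v-parallel form of the system can be given as:

$$X_{n \times 1}(t+v) = A^{v}\,X_{n \times 1}(t) + B_v\,u_v(t) \qquad (3)$$

where the matrix $B_v$ has order n x v and $u_v(t)$ is the block of v input bits. A linear transformation is applied to $X_{n \times 1}(t)$ to reduce the complexity. The transformation matrix used must be a nonsingular matrix:

$$X_{n \times 1}(t) = T_{n \times n}\,X^{T}_{n \times 1}(t)$$

The transformed state space equation for (3) can be written as:

$$X^{T}_{n \times 1}(t+v) = A_{vT}\,X^{T}_{n \times 1}(t) + B_{vT}\,u_v(t) \qquad (4)$$

where $A_{vT} = T^{-1} A^{v} T$ and $B_{vT} = T^{-1} B_v$. Table 1 lists the generator polynomials for the different CRC algorithms. Figure 2 shows the block diagram of the parallel LFSR based on the state space transformation described by (4).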
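The equivalence between v serial steps and one v-parallel step can be checked numerically over GF(2). The following is a minimal sketch, not the authors' MATLAB code. It assumes one common Galois convention in which A has ones on the subdiagonal and the coefficients g₀ … gₙ₋₁ in its last column, the input vector b equals [g₀ … gₙ₋₁]ᵀ, and B_v = [b, Ab, …, A^(v-1)b]. The function names, the initial state and the example polynomial 1 + x + x⁴ are illustrative choices.

```python
def companion_matrix(g):
    """Assumed layout: subdiagonal ones, last column g = [g0, ..., g_{n-1}]."""
    n = len(g)
    A = [[0] * n for _ in range(n)]
    for i in range(1, n):
        A[i][i - 1] = 1
    for i in range(n):
        A[i][n - 1] = g[i]
    return A


def mat_vec(M, x):
    """Matrix-vector product over GF(2)."""
    return [sum(m * xi for m, xi in zip(row, x)) % 2 for row in M]


def serial_step(A, b, x, u):
    """One serial clock: X(t+1) = A X(t) + b u(t) over GF(2)."""
    return [(ax + bi * u) % 2 for ax, bi in zip(mat_vec(A, x), b)]


def parallel_matrices(A, b, v):
    """Return (A^v, B_v) where the columns of B_v are b, Ab, ..., A^(v-1) b."""
    n = len(b)
    columns = [b]
    for _ in range(v - 1):
        columns.append(mat_vec(A, columns[-1]))
    B_v = [[col[i] for col in columns] for i in range(n)]
    # Column j of A^v is A applied v times to the j-th unit vector.
    power_cols = []
    for j in range(n):
        col = [1 if i == j else 0 for i in range(n)]
        for _ in range(v):
            col = mat_vec(A, col)
        power_cols.append(col)
    A_v = [[power_cols[j][i] for j in range(n)] for i in range(n)]
    return A_v, B_v


def parallel_step(A_v, B_v, x, block):
    """block lists u(t), ..., u(t+v-1); u_v stacks them newest first."""
    u_v = list(reversed(block))
    return [(p + q) % 2 for p, q in zip(mat_vec(A_v, x), mat_vec(B_v, u_v))]


g = [1, 1, 0, 0]            # G(x) = 1 + x + x^4 (illustrative)
A, b = companion_matrix(g), list(g)
x0 = [1, 0, 1, 0]           # arbitrary initial state
block = [1, 0, 1, 1]        # v = n = 4 input bits

x_serial = x0
for u in block:
    x_serial = serial_step(A, b, x_serial, u)

A_v, B_v = parallel_matrices(A, b, len(block))
x_parallel = parallel_step(A_v, B_v, x0, block)
print(x_serial == x_parallel)
```

The same check applies for any nonsingular T: transforming the state, stepping with T⁻¹AᵛT and T⁻¹B_v, and transforming back should give the same result.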

Method to find transformation matrix
There are many methods to find the best transformation matrix. In [10], the transformation matrix was selected so as to simplify the feedback path, but the circuit complexity outside the feedback loop increased, which resulted in higher hardware cost. In [13], a new method was presented to find an improved transformation matrix, subject to the following constraints.
- The transformation matrix T must be a non-singular matrix so that the state space transformation is valid.
- The coupling matrices AvT and BvT and the matrix T should contain as few ones as possible, because the number of ones directly determines the hardware complexity of the circuit in terms of XOR gates.
- The method used to search for the transformation matrix must be efficient.
For the transformation matrix, a vector V of length n is used. V cannot be the zero vector: its MSB must be equal to 1 and its other elements can be 0 or 1. Vector V is defined as V = [1, v₁, v₂, v₃, …, vₙ₋₁]. The T matrix is constructed from vector V following the method of [13].
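The constraints on V fix the size of the search space: with the MSB pinned to 1, there are 2^(n-1) candidate vectors. The short sketch below (the helper name `candidate_vectors` is illustrative, not taken from the paper) enumerates them for a small n and indicates why an exhaustive search becomes costly as n grows.

```python
from itertools import product


def candidate_vectors(n):
    """Yield every length-n binary vector whose first (MSB) element is 1."""
    for tail in product((0, 1), repeat=n - 1):
        yield (1,) + tail


vectors = list(candidate_vectors(4))
print(len(vectors))      # 8 candidates for n = 4
print(2 ** (16 - 1))     # 32768 candidates for a degree-16 polynomial
```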

Search algorithm implementation
This algorithm searches for the best vector, that is, the one which gives the smallest TN. The total number of levels in the search tree is equal to the number of bits in the vector, which is the generator polynomial degree n. The root node at level 1 has the value [1 0 0 … 0]. Child nodes at level k are obtained by replacing a 0 with a 1 in a parent node of level (k-1), so the number of child nodes of a given parent node is equal to the number of zeros in that parent node. Each node at level k contains k ones.
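The tree construction described above can be sketched as follows. This is an illustrative Python version, not the authors' C or MATLAB code. It assumes that only distinct nodes are counted at each level, in which case level k holds C(n-1, k-1) nodes; the function names are hypothetical.

```python
from math import comb


def generate_kids(parent):
    """Return every child obtained by changing exactly one 0 of parent to 1."""
    return [parent[:i] + (1,) + parent[i + 1:]
            for i, bit in enumerate(parent) if bit == 0]


def build_levels(n):
    """Level k (1-based) holds the distinct nodes that contain k ones."""
    root = (1,) + (0,) * (n - 1)
    levels = [[root]]
    for _ in range(n - 1):
        kids = {kid for node in levels[-1] for kid in generate_kids(node)}
        levels.append(sorted(kids, reverse=True))
    return levels


def nodes_per_level(n):
    """Closed form: MSB is fixed, so choose k-1 further ones from n-1 bits."""
    return [comb(n - 1, k - 1) for k in range(1, n + 1)]


def to_decimal(node):
    """Decimal value of a node, reading the first element as the MSB."""
    return int("".join(str(bit) for bit in node), 2)


for level, nodes in enumerate(build_levels(3), start=1):
    print(level, [to_decimal(node) for node in nodes])
# 1 [4]
# 2 [6, 5]
# 3 [7]
```

Knowing the per-level counts before running the search gives an early estimate of the time and memory the TN calculation will need.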
A C program was written to compute the total number of nodes at each level, since this number grows rapidly as the number of bits increases. It is important to know the total number of nodes at each level in advance; otherwise the CPU may take a long time to complete the TN calculation and the memory requirement also increases. The first step is therefore to compute the total number of nodes at each level, and the next step is to find the values of the nodes at each level. For readability, node values are shown in decimal format and converted to binary internally for the computation of TN. Since the output window of the C application has limited size, the output is shown for n=3 in Figure 3 and for n=5 in Figure 4.
The C application was implemented first to find the total number of nodes and the node values at each level; the search algorithm was then implemented in MATLAB, where saving and handling matrix data is easier than in C. The steps of the search algorithm are described by the flowchart in Figure 5, and the MATLAB code structure for the search algorithm is shown in Figure 6. A description of each function and of the main code is given below. One function converts the kid array into the tn array using the tncalc function; it is required to compute the TN value of each child node, which helps in finding the optimal node at each level. Finding the optimal node involves calculating the lowest TN value at each level. These optimal values are saved in a matrix, and at the end of the code the optimal values are sorted to obtain the lowest one. The location of the final optimal value is identified in the code, which gives the corresponding level and node.
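The level-by-level selection can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the MATLAB code of Figure 6: it assumes that only the lowest-TN node of each level is expanded into kids, and it takes the TN computation as a caller-supplied function `tn_calc`, since the real TN requires building T, AvT and BvT. The cost function `toy_tn` is a hypothetical stand-in used only to make the example runnable.

```python
def generate_kids(parent):
    """Return every child obtained by changing exactly one 0 of parent to 1."""
    return [parent[:i] + (1,) + parent[i + 1:]
            for i, bit in enumerate(parent) if bit == 0]


def greedy_search(n, tn_calc):
    """Return (best TN, level, node) found by expanding the best node per level."""
    node = (1,) + (0,) * (n - 1)           # root node at level 1
    level = 1
    best_per_level = [(tn_calc(node), level, node)]
    while 0 in node:
        level += 1
        kids = generate_kids(node)
        tn = [tn_calc(kid) for kid in kids]  # "kid array" -> "tn array"
        best = min(range(len(kids)), key=tn.__getitem__)
        node = kids[best]
        best_per_level.append((tn[best], level, node))
    # Lowest of the per-level optimal values, with its level and node.
    return min(best_per_level, key=lambda entry: entry[0])


def toy_tn(vector):
    """Hypothetical cost, NOT the real TN: favours two ones and a zero in bit 1."""
    return abs(sum(vector) - 2) + vector[1]


print(greedy_search(4, toy_tn))  # (0, 2, (1, 0, 1, 0))
```

Under this assumption the search evaluates at most n(n-1)/2 + 1 nodes instead of all 2^(n-1) candidates, at the price of possibly missing the global optimum.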
After finding the best vector, which has the smallest TN, the next way to improve hardware efficiency is to find the number of XOR gates in the matrices of the different CRC algorithms. If the number of ones in a row is equal to k and k is greater than 1, the number of XOR gates needed to perform the modulo-2 addition is (k-1) [10]. If only a single one is present in a row, no XOR gate is required. The XOR gates are counted based on the subexpression sharing technique. In [13], the sharing technique is applied between two rows, whereas in the proposed method it is applied across multiple rows.
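The effect of sharing on the XOR count can be illustrated with a small sketch. The first function applies the (k-1) rule row by row. The second uses a generic greedy common-subexpression heuristic: it repeatedly extracts the pair of signals that appears together in the most rows. That heuristic is an assumption chosen for illustration and is not claimed to be the exact procedure of this paper or of [13]; the example matrix is also made up.

```python
from collections import Counter
from itertools import combinations


def xor_count_no_sharing(matrix):
    """A row with k ones needs k - 1 two-input XOR gates (0 when k <= 1)."""
    return sum(max(sum(row) - 1, 0) for row in matrix)


def xor_count_with_sharing(matrix):
    """Greedy pair extraction: one XOR per shared pair, reused by all rows."""
    # Each row becomes the set of signal indices that it XORs together.
    rows = [{j for j, bit in enumerate(row) if bit} for row in matrix]
    next_signal = max((len(row) for row in matrix), default=0)
    shared_gates = 0
    while True:
        pair_counts = Counter(
            pair for row in rows for pair in combinations(sorted(row), 2)
        )
        best = pair_counts.most_common(1)
        if not best or best[0][1] < 2:
            break                          # no pair is shared by 2+ rows
        (a, b), _ = best[0]
        for row in rows:
            if a in row and b in row:
                row.difference_update((a, b))
                row.add(next_signal)       # reuse the shared XOR output
        next_signal += 1
        shared_gates += 1
    return shared_gates + sum(max(len(row) - 1, 0) for row in rows)


example = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 1],
]
print(xor_count_no_sharing(example))    # 7
print(xor_count_with_sharing(example))  # 4
```

In the example, columns 1 and 2 are common to all three rows, so their XOR is built once and reused, which is the multiple-row effect that a two-row scheme can only partly capture.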
The technique to find the number of XOR gates is explained below. The matrices AvT, BvT and T for all CRC algorithms are computed using MATLAB, and the number of ones present in each row and column is then observed. If ones are present in the same columns in multiple rows, the XOR gates required for one row are shared by the other rows. With this multiple-row sharing technique, fewer XOR gates are required, which further reduces hardware complexity compared with the previous architecture [13].
Table 2 shows the vector V used, the total number of ones, and the XOR gates required for the T, AvT and BvT matrices for all CRC algorithms. The comparison is given for the previous model [13] and the proposed method. The number of XOR gates is reduced from 53 to 35, 97 to 74, 82 to 77, 90 to 82 and 461 to 255 for CRC-12, SDLC, CRC-16 Reverse, SDLC Reverse and CRC-32 respectively, and the percentage reduction in XOR gates is 33.96%, 23.71%, 6.09%, 8.88% and 44.68% for these CRC types. The percentage reduction is shown in the rightmost column of Table 2. In the proposed method, the number of XOR gates is reduced for every type of CRC except CRC-16. The percentage reduction for CRC-32, 44.68%, is notably high, which means that a very large reduction in hardware complexity is achieved.
Finally, the proposed and previous architectures are compared on the basis of AT for the frequently referenced generator polynomials given in Table 1. The Area Time Product (AT) depends on the number of delay elements (DE), the total number of XOR gates and the CPD. AT is calculated as:

$$AT = (1.5 \cdot DE + XOR) \cdot CPD$$

Table 3 gives the comparison between the various architectures and the proposed one. For the proposed method, the number of XOR gates is reduced for every CRC generator polynomial except CRC-16, so the AT value is reduced for those polynomials. The relative value of AT is given in the rightmost column, normalized to the proposed architecture. The relative value is more than one for all previous architectures except for CRC-16.
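The percentage reduction and the AT figure of merit are simple arithmetic, sketched below. The XOR counts are the ones quoted in the text; the DE and CPD values passed to `area_time_product` are hypothetical placeholders, since the entries of Table 3 are not reproduced here. The computed percentages agree with the quoted ones to within 0.01, the difference coming from how the second decimal is rounded.

```python
def xor_reduction_percent(previous, proposed):
    """Percentage reduction in XOR gates relative to the previous design."""
    return 100.0 * (previous - proposed) / previous


def area_time_product(delay_elements, xor_gates, cpd):
    """AT = (1.5 * DE + XOR) * CPD."""
    return (1.5 * delay_elements + xor_gates) * cpd


# (previous, proposed) XOR counts quoted in the text.
xor_counts = {
    "CRC-12": (53, 35),
    "SDLC": (97, 74),
    "CRC-16 Reverse": (82, 77),
    "SDLC Reverse": (90, 82),
    "CRC-32": (461, 255),
}

reductions = {
    name: xor_reduction_percent(previous, proposed)
    for name, (previous, proposed) in xor_counts.items()
}
for name, value in reductions.items():
    print(f"{name}: {value:.2f}%")

# Hypothetical DE and CPD values, for illustration only.
example_at = area_time_product(delay_elements=12, xor_gates=35, cpd=2)
print(example_at)
```

Because AT is linear in the XOR count for fixed DE and CPD, any reduction in XOR gates lowers AT in direct proportion to the gates removed.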
All computations are based on the assumption that the level of parallelism 'v' is the same as the generator polynomial degree.

CONCLUSION
This paper presented a new method to find the number of XOR gates in the transformation and coupling matrices for the generator polynomials of different types of CRC. Implementation details of the search algorithm are given. Based on this improved method of counting XOR gates, the proposed method achieves a smaller AT than previous architectures. A reduced AT value means that a parallel architecture of lower hardware complexity is obtained.