Design and implementation of dual-core MIPS processor for LU decomposition based on FPGA

Rusul Saad Khalil, Safaa S. Omran
Department of Computer Engineering Techniques, Electrical and Electronic Technical College, Middle Technical University, Iraq

ABSTRACT

Many systems like the control systems and in communication systems, there is usually a demand for matrix inversion solution. This solution requires many operations, which makes it not possible or very hard to meet the needs for real-time constraints. Methods were exists to solve this kind of problems, one of these methods by using the LU decomposition of matrix which is a good alternative to matrix inversion. The LU matrices are two matrices, the L matrix, which is a lower triangular matrix, and the U matrix, which is an upper triangular matrix. In this paper, a design of dual-core processor is used as the hardware of the work and certain software was written to enable the two cores of the dual-core processor to work simultaneously in computing the value of the L matrix and U matrix. The result of this work are compared with other works that using single-core processor, and the results found that the time required in the cores of the dual-core is more less than using single-core. The designed dual-core processor is invoked using the VHDL language.

Keywords:
Dual core
Field programmable gate array
LU decomposition
MIPS processor
Single core
VHDL

1. INTRODUCTION

Many different systems require solving of matrix inversion, these systems like control or communication systems. The required time for solving the matrix inversion increases on the size of the matrix is become bigger. Hence, an alternative method were required in order to work in real-time, one of these methods is the LU decomposition [1].

In LU decomposition method the coefficient matrix \([A]\) of the given system of equation \([A][X] = [B]\) is written as a product of a Lower triangular matrix (L) and an upper triangular matrix (U), such that \([A] = [L][U]\) where the elements of \(L = (l_{ij} = 0 \text{ for } i < j)\) and the elements of \(U = (u_{ij} = 0 \text{ for } i > j)\) that is, the matrices \([L]\) and \([U]\) look like [2, 3]. Following are set of equations for a 4x4 matrix.

\[
[A] = [L][U]
\]  
(1)

\[
\begin{bmatrix}
A_{11} & A_{12} & A_{13} & A_{14} \\
A_{21} & A_{22} & A_{23} & A_{24} \\
A_{31} & A_{32} & A_{33} & A_{34} \\
A_{41} & A_{42} & A_{43} & A_{44}
\end{bmatrix} =
\begin{bmatrix}
1 & 0 & 0 & 0 \\
l_{21} & 1 & 0 & 0 \\
l_{31} & l_{32} & 1 & 0 \\
l_{41} & l_{42} & l_{43} & 1
\end{bmatrix}
\begin{bmatrix}
u_{11} & u_{12} & u_{13} & u_{14} \\
u_{22} & u_{23} & u_{24} \\
u_{33} & u_{34} \\
u_{44}
\end{bmatrix}
\]  
(2)
where,

\[ u_{11} = A_{11}, u_{12} = A_{12}, u_{13} = A_{13} \text{ and } u_{14} = A_{14} \] (3)

\[ l_{11} = \frac{A_{11}}{u_{11}} \text{ where } l_{21} = \frac{A_{21}}{u_{11}}, l_{31} = \frac{A_{31}}{u_{11}} \text{ and } l_{41} = \frac{A_{41}}{u_{11}} \] (4)

\[ u_{22} = A_{22} - l_{21} \times u_{12}, u_{23} = A_{23} - l_{21} \times u_{13}, \text{ and } u_{24} = A_{24} - l_{21} \times u_{14} \] (5)

While

\[ A_{32} = u_{12}x_{13} + u_{22}x_{32} \] (6)

\[ A_{42} = u_{12}x_{14} + u_{22}x_{42} \] (7)

\[ A_{33} = l_{31}u_{13} + l_{21}u_{14} + u_{33} \] (8)

\[ A_{34} = l_{31}u_{14} + l_{21}u_{24} + u_{34} \] (9)

\[ A_{43} = l_{41}u_{13} + l_{21}u_{23} + l_{31}u_{33} \] (10)

\[ A_{44} = l_{41}u_{14} + l_{21}u_{24} + l_{31}u_{34} + u_{44} \] (11)

If one has a system of equations in the form of \([A][X] = [B]\), then the method of using the LU decomposition will make the solution easier by using the triangular matrices. After computing the LU matrices as shown in the next equations [4-7]:

\[ [A][X] = [B] \leftrightarrow [L][U][X] = [B] \] (12)

\[ [U][X] = [Y] \] (13)

\[ [L][Y] = [B] \] (14)

The objective of this paper is to program and build a 32-bit MIPS processor to perform the LU decomposition. Then designing and implementing a dual core MIPS processor, the results will be compared for the two designs system, each system been designed and implemented in VHDL [8-10].

2. MIPS PROCESSOR

It is a reduced instruction set computer (RISC) processor developed by MIPS technologies in the early 1980s which can fully implement instructions in single clock cycle. Therefore the slowest instructions can limit session time. In this paper a single core and dual core MIPS processors will be designed and implemented to perform mathematical requirements for the application of LU decomposition [8].

2.1. MIPS instruction set architecture (ISA)

32-bits MIPS Architecture been covered in this paper where transactions are either register or memory locations as shown in Table 1, Processor, to get to the word uses byte addressable [9, 11, 12].

2.2. Instruction formats

The MIPS has three different formats, which they are the R-type, I-type and J-type. Table 2 shows the different instructions formats for the MIPS processor [13-16].

<table>
<thead>
<tr>
<th>Name</th>
<th>Register number</th>
<th>Usage</th>
<th>Preserved on call?</th>
</tr>
</thead>
<tbody>
<tr>
<td>$zero</td>
<td>0</td>
<td>The constant value 0</td>
<td>n.a.</td>
</tr>
<tr>
<td>$v0-$v1</td>
<td>2-3</td>
<td>Values for results and expression evaluation</td>
<td>no</td>
</tr>
<tr>
<td>$a0-$a3</td>
<td>4-7</td>
<td>Arguments</td>
<td>no</td>
</tr>
<tr>
<td>$t0-$t7</td>
<td>8-15</td>
<td>Temporaries</td>
<td>no</td>
</tr>
<tr>
<td>$s0-$s7</td>
<td>16-23</td>
<td>Saved</td>
<td>yes</td>
</tr>
<tr>
<td>$t8-$t9</td>
<td>24-25</td>
<td>More Temporaries</td>
<td>no</td>
</tr>
<tr>
<td>$gp</td>
<td>28</td>
<td>Global pointer</td>
<td>yes</td>
</tr>
<tr>
<td>$sp</td>
<td>29</td>
<td>Stack pointer</td>
<td>yes</td>
</tr>
<tr>
<td>$fp</td>
<td>30</td>
<td>Frame pointer</td>
<td>yes</td>
</tr>
<tr>
<td>$ra</td>
<td>31</td>
<td>Return address</td>
<td>yes</td>
</tr>
</tbody>
</table>

Table 1. Processor registers

Design and implementation of dual-core MIPS processor for LU decomposition ... (Rusul Saad)
Table 2. Formats of processor instructions

<table>
<thead>
<tr>
<th>Field size</th>
<th>6-Bits</th>
<th>5-Bits</th>
<th>5-Bits</th>
<th>5-Bits</th>
<th>5-Bits</th>
<th>6-Bits</th>
</tr>
</thead>
<tbody>
<tr>
<td>Register</td>
<td>op code</td>
<td>Source register</td>
<td>Target register</td>
<td>Source register</td>
<td>Target register</td>
<td>Register destination</td>
</tr>
<tr>
<td>Immediate</td>
<td>operation code</td>
<td>ordination code</td>
<td>Target register</td>
<td>function</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Jump</td>
<td>26-bits address</td>
<td>16-bits Imm</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

2.3. Single-core MIPS processor design

The MIPS processor is 32-bits processor which has 32 different registers each with size of 32-bits [17-23]. The main part in the MIPS processor is the control unit (CU). This unit consists of some registers and the arithmetic logic unit (ALU). Certain instructions where required for calculating the LU decomposition were designed and implemented [24-26]. Table 3 shows these different instructions. The design instructions set of the processor is suitable to perform LUD as shown in Table 4. Figure 1 shows the internal architecture of the control unit and Figure 2 shows the schematic design circuits that required in implementing the LU decomposition for single-core processor.

Table 3. Alu control for both processors

<table>
<thead>
<tr>
<th>Code</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>000</td>
<td>Mul</td>
</tr>
<tr>
<td>001</td>
<td>Div</td>
</tr>
<tr>
<td>010</td>
<td>Add</td>
</tr>
<tr>
<td>011</td>
<td>not used</td>
</tr>
<tr>
<td>100</td>
<td>not used</td>
</tr>
<tr>
<td>101</td>
<td>not used</td>
</tr>
<tr>
<td>110</td>
<td>Sub</td>
</tr>
</tbody>
</table>

Table 4. Instruction set

<table>
<thead>
<tr>
<th>Instructions</th>
<th>SW</th>
<th>LW</th>
<th>ADD</th>
<th>ADDs</th>
<th>SUB</th>
<th>MUL</th>
<th>DIV</th>
</tr>
</thead>
<tbody>
<tr>
<td>Opcode</td>
<td>101011</td>
<td>100011</td>
<td>000000</td>
<td>001000</td>
<td>000000</td>
<td>000000</td>
<td>000000</td>
</tr>
<tr>
<td>Regwrite</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>Regdst</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>ALUSRC</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>ZERO</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>MEMWrite</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>MEMtoRegister</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>ALUopcode</td>
<td>0</td>
<td>0</td>
<td>10</td>
<td>00</td>
<td>10</td>
<td>10</td>
<td>10</td>
</tr>
<tr>
<td>Function</td>
<td>x</td>
<td>x</td>
<td>100000</td>
<td>100010</td>
<td>100100</td>
<td>100101</td>
<td></td>
</tr>
<tr>
<td>type</td>
<td>Immediate</td>
<td>Immediate</td>
<td>Register</td>
<td>Immediate</td>
<td>Register</td>
<td>Register</td>
<td></td>
</tr>
<tr>
<td>Description</td>
<td>Save word</td>
<td>Load word</td>
<td>Addition operation</td>
<td>Addition immediate operation</td>
<td>Subtraction operation</td>
<td>Multiplication operation</td>
<td></td>
</tr>
</tbody>
</table>

Figure 1. RTL for control unit internal architecture

Figure 2. RTL for single core MIPS processor
2.4. Dual-core MIPS processor design

Dual-core consists of two cores and each one is responsible for specific function, both cores shared same data memory. Each core has their own instruction memory, register file and control unit, first core will be used to perform the lower (L) matrix while the second core will perform the upper (U) matrix depending on LU decomposition (factorization) [13, 27]. Figure 3 shows the designed Dual-core MIPS processor, the Lower core is used to compute the (L) matrix while the Upper core is used to compute the (U) matrix. So that, both cores were working simultaneously to compute LU matrices in less time than single-core, which gives a high level of parallelism.

![Figure 3. RTL for dual core MIPS processor](image)

3. DATA REPRESENTATION

The fixed-point data representation is chosen in this paper, which is easier in the design consideration. Other method in data representation is floating which is excluded in this work because it requires a very large hardware component [28-30]. Figure 4 shows the format for 32-bits of data.

![Figure 4. Format for the used data](image)

4. SIMULATION RESULT OF SINGLE-CORE

Single-core processor is implemented using FPGA development board Spartan-6 the simulation results which have been gotten from the Xilinx ISim simulator. Executing a set of instructions to compute LUD, both matrix and LUD is shown in (15) for a 4x4 matrix which also can lead into a 6x6 matrix, the time required to perform LU decomposition is 3070 ns (3.07 µs) at frequency 50 MHz. The results are found identical to the theoretical results when applied for the 4x4 matrix. Figure 5 and Figure 6 show the test-bench of waveform simulation for matrix A and it's LUD and Figure 7 shows the resources needed for the executed design.

\[
\begin{align*}
A &= \begin{bmatrix} 2 & 3 & 1 & 5 \\ 6 & 13 & 5 & 19 \\ 2 & 19 & 10 & 23 \\ 4 & 10 & 11 & 31 \end{bmatrix} = L \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ 2 & 1 & 7 & 1 \end{bmatrix} U \begin{bmatrix} 2 & 3 & 1 & 5 \\ 0 & 4 & 2 & 4 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 3 \end{bmatrix} \\
&= \begin{bmatrix} 2 & 3 & 1 & 5 \\ 0 & 4 & 2 & 4 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 3 \end{bmatrix}
\end{align*}
\]
5. SIMULATION RESULT OF DUAL-CORE

The proposed design of dual core processor has been coded by using VHDL, XILINX Spartan 6 with sets of instructions that compute LU decomposition, a testbench was created to implement same 4x4
matrix as shown in Figure 8, Figure 9 and Figure 10 with resource required as shown in Figure 11. The time required to perform L decomposition in dual core processor is 850 ns (0.85 µs) at frequency 50 MHz with number of instruction 41. As shown in Table 5, and the time required to perform U decomposition is 1170 ns (1.17 µs) for the same frequency with 57 instructions that has been used as shown in Figure 12.

Table 5. Single core and dual core comparisons

<table>
<thead>
<tr>
<th>Processor</th>
<th>Time (ns)</th>
<th>Instructions used</th>
<th>Clock period (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Single Core</td>
<td>3070</td>
<td>142</td>
<td>20</td>
</tr>
<tr>
<td>Dual Core (First core)</td>
<td>850</td>
<td>41</td>
<td>20</td>
</tr>
<tr>
<td>Dual Core (Second core)</td>
<td>1170</td>
<td>57</td>
<td>20</td>
</tr>
</tbody>
</table>

Figure 8. Dual core processor test bench of register file 1

Figure 9. Dual core processor test bench of register file 2
6. CONCLUSION

A single core and dual core were designed to perform LU 4x4 matrix calculation for the purpose of teaching studies of the MIPS architecture course for master student. Designing and implementing single core and dual core processors with the required instructions for each processor sufficient to implement the LU decomposition using decomposition process. The time of single core processor to perform the LU 4x4 matrices was 3.07 µs at frequency 50 MHz while designing dual core processor where the first core of the processor used to compute the L matrix and the second core of the processor used to compute U matrix. This design can achieve high performance with timing of 1.17 µs. The most consuming processor is the Dual core processor. However, it gives higher performance.
REFERENCES


---

Design and implementation of dual-core MIPS processor for LU decomposition ... (Rusul Saad)
BIOGRAPHIES OF AUTHORS

**Rusul Saad Khalil** was born in Baghdad, Iraq in 1992. She graduated from AlMamon University college in 2014, and now studying master at Electrical Engineering College/Middle Technical University, Baghdad, Iraq, her main interest in Computer Architecture Design, Computer engineering, embedded system and design.

**Safaa S. Omran** was born in Baghdad, Iraq in 1956. He graduated from University of Baghdad in 1978, and then he got the MSc from the same University in 1984. He is now a professor working at the Electrical Engineering College/Middle Technical University, Baghdad, Iraq. His main interest working researches are in the field of microprocessor design for embedded systems, Image processing and cryptography system design.