Hybrid branch prediction for pipelined MIPS processor

Received May 10, 2019 Revised Nov 15, 2019 Accepted Jan 11, 2020
In modern microprocessors designed with pipeline stages, performance suffers when branch instructions are executed, because they introduce stalls in the pipeline. These stalls increase the cycles per instruction (CPI) of the processor. When executing a branch instruction, the processor needs extra clock cycles to determine whether the branch will happen (Taken) or not (Not Taken), and it must also calculate the target address when the branch is Taken. Predicting whether a branch is Taken (T) or Not Taken (NT) is therefore an important step in enhancing processor performance. In this research more than one branch prediction method is used (hybrid), and the designed circuit chooses among different prediction algorithms depending on the type of the branch. Some of the methods used are static while the others are dynamic. All circuits were built in hardware and examined by running different programs on the designed predictor to measure the performance of the processor.


INTRODUCTION
Branch predictors are now considered one of the basic units in modern microprocessors that use pipeline stages in their design. The branch prediction (BP) unit predicts, for each branch instruction, whether the branch will be Taken or Not Taken. Earlier processors, designed without a branch prediction unit, required more clock cycles: the pipeline was delayed at every branch instruction until it was known whether the branch was Taken or Not Taken and, if Taken, until the target address had been calculated [1,2].
In general, about 20% of the instructions in a program are branch instructions; that is, roughly one in every five instructions is a branch [3]. Hence, predicting the behaviour of the branch (Taken or Not Taken) is very important and affects the performance of the processor. The penalty associated with mispredicted branches in modern pipelined processors has a great effect on performance, and it grows as pipelines deepen and as the number of mispredicted branches increases. For example, the AMD Athlon processor has 10 stages in the integer pipeline [4], while the Intel NetBurst microarchitecture used in the Pentium 4 processor is hyper-pipelined with a 20-stage branch prediction penalty [5]. The rest of this work is organized as follows. The next section gives a theoretical review of some static and dynamic branch prediction methods. The following section presents the designed branch prediction circuit, and the remaining sections present the results and conclusions respectively.

THEORETICAL BACKGROUND
There are two kinds of branch prediction, static and dynamic. This section gives a brief explanation of some branch prediction methods of each kind.

Static branch prediction
If the prediction made by the branch prediction circuit is fixed and does not depend on the run-time behavior of the branch, the technique is known as static branch prediction [6,7]. If the prediction changes during run time, it is called dynamic branch prediction. For example, the i486 processor used a static branch prediction algorithm in which every branch is predicted Not Taken [8]. Most branches, however, are taken, especially loop branches, where the branch is taken on every iteration except the last. For example, in a loop of 100 iterations the branch is taken 99 times and not taken only on the last one. Hence, another static branch prediction technique, Backward Taken/Forward Not Taken (BTFNT), is used in the Pentium 4 processor. It compares the target address with the current address: if the target address is less than the current address the prediction is Backward Taken, and if it is greater than the current address the prediction is Forward Not Taken [9,10]. The advantage of static branch prediction algorithms is that they are very easy to implement and need only a simple hardware circuit to be added to the designed processor.
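The two static schemes described above can be compared on the 100-iteration loop example with a minimal Python sketch. The function names and the two addresses are hypothetical and chosen only for illustration; the sketch models the prediction rule, not any particular processor's circuit.

```python
def predict_always_not_taken(branch_pc: int, target_pc: int) -> bool:
    """Static rule described for the i486: every branch is predicted Not Taken."""
    return False


def predict_btfnt(branch_pc: int, target_pc: int) -> bool:
    """BTFNT: predict Taken only when the target lies before the branch."""
    return target_pc < branch_pc


def loop_accuracy(predict, iterations: int = 100) -> float:
    """Accuracy on a loop-closing branch: taken on every iteration but the last."""
    # Hypothetical addresses: the target is behind the branch (a backward branch).
    branch_pc, target_pc = 0x00400020, 0x00400000
    outcomes = [True] * (iterations - 1) + [False]
    correct = sum(predict(branch_pc, target_pc) == taken for taken in outcomes)
    return correct / iterations


if __name__ == "__main__":
    print("always not taken:", loop_accuracy(predict_always_not_taken))
    print("BTFNT:           ", loop_accuracy(predict_btfnt))
```

Under these assumptions the always-Not-Taken rule is right only on the final iteration, while BTFNT is wrong only on the final iteration.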

Dynamic branch prediction
This type of branch prediction technique takes advantage of the information about branch behavior that becomes available at run time. The main idea in dynamic branch prediction is to take into account the state of the branch as the program runs, which gives better prediction than static branch prediction [11,12]. One of the earliest dynamic branch prediction methods is the algorithm presented by [13] (also known as the Bimodal Predictor), which is shown in Figure 1. In this algorithm the predictor consists of a table that records, for each branch, whether it was previously taken or not taken. The counter used in this technique is 2 bits wide, which performs better than a 1-bit counter. The MSB of the counter indicates the prediction state while the LSB of the counter indicates the past branch state. Increasing the counter to 3 bits improves the predictor only very slightly, so a 2-bit counter, which needs less hardware than a larger counter, is usually preferred [1,14].
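A table of 2-bit saturating counters of this kind can be sketched in a few lines of Python. This is an illustrative model only: the class name, the table size of 1024, the initial counter value of 3, and the choice to index with the PC shifted right by two bits are assumptions made for the sketch.

```python
class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by the branch address."""

    def __init__(self, entries: int = 1024, initial: int = 3) -> None:
        self.entries = entries
        self.counters = [initial] * entries  # each counter holds 0..3

    def _index(self, pc: int) -> int:
        # Assumption: instructions are word aligned, so the two low PC bits are dropped.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        # The MSB of the counter (values 2 and 3) means "predict Taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


def count_mispredictions(predictor, pc: int, outcomes) -> int:
    """Predict, compare with the real outcome, then train on that outcome."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


if __name__ == "__main__":
    one_loop_run = [True] * 99 + [False]  # 100-iteration loop branch
    misses = count_mispredictions(BimodalPredictor(), 0x00400020, one_loop_run * 3)
    print("mispredictions over three runs of the loop:", misses)
```

With a 2-bit counter, the single Not Taken outcome at the loop exit only moves the counter from 3 to 2, so the branch is still predicted Taken when the loop is entered again. A 1-bit scheme would flip its prediction and miss a second time on re-entry.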
Some researchers use the two-level predictor, which keeps a history of the most recent branch outcomes. These outcomes are stored in a Branch History Register (BHR), a shift register into which the outcome of each branch is shifted while the oldest outcome is shifted out and discarded [13,15,16]. Figure 2 shows the structure of the two-level branch prediction algorithm with global history. The history is used to index a table of saturating counters, called the Pattern History Table (PHT). In this algorithm the outcomes of the 4 most recent branches are combined with 2 bits from the branch address to form a 6-bit index that selects one of the 64 counters in the PHT. Another type of two-level predictor is the Local History two-level predictor [17][18][19], shown in Figure 3. It uses a Branch History Table (BHT) whose entries are BHRs. The branch address selects one of the entries of the BHT, with the number of address bits used depending on the number of entries, and the selected BHR provides the local history. The contents of the chosen BHR are combined with the PC to index one of the counters in the PHT.
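The global-history indexing described above (4 history bits plus 2 address bits selecting one of 64 counters) can be sketched as follows. The text does not state how the two fields are ordered in the index, which address bits are used, or how the counters are initialized, so the concatenation order, the use of PC bits 3:2, and the initial value of 3 are assumptions of this sketch.

```python
class GlobalTwoLevelPredictor:
    """Two-level predictor: a global BHR plus a PHT of 2-bit saturating counters."""

    HISTORY_BITS = 4  # outcomes of the 4 most recent branches
    ADDRESS_BITS = 2  # bits taken from the branch address

    def __init__(self, initial: int = 3) -> None:
        self.bhr = 0  # global Branch History Register
        self.pht = [initial] * (1 << (self.HISTORY_BITS + self.ADDRESS_BITS))

    def _index(self, pc: int) -> int:
        # Assumption: history forms the high bits, PC bits 3:2 form the low bits.
        addr = (pc >> 2) & ((1 << self.ADDRESS_BITS) - 1)
        return (self.bhr << self.ADDRESS_BITS) | addr

    def predict(self, pc: int) -> bool:
        return self.pht[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        # Shift the new outcome in; the oldest outcome falls off the top.
        mask = (1 << self.HISTORY_BITS) - 1
        self.bhr = ((self.bhr << 1) | int(taken)) & mask


def misprediction_trace(predictor, pc: int, outcomes) -> list:
    """Return one flag per branch execution: True where the prediction was wrong."""
    trace = []
    for taken in outcomes:
        trace.append(predictor.predict(pc) != taken)
        predictor.update(pc, taken)
    return trace


if __name__ == "__main__":
    alternating = [True, False] * 100
    trace = misprediction_trace(GlobalTwoLevelPredictor(), 0x00400020, alternating)
    print("mispredictions in the last 100 executions:", sum(trace[-100:]))
```

A branch that strictly alternates between Taken and Not Taken is a useful test case: a single 2-bit counter cannot follow it, while the history-indexed PHT separates the two situations into different counters and, after a short warm-up, predicts both correctly. A local-history variant would replace the single `bhr` with a table of history registers selected by the branch address.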

Hybrid branch prediction
Because programs contain different types of branches, these types may be correlated with different types of history. Some branches are better predicted with the global history algorithm, while others are better predicted with the local history algorithm or with another algorithm related to it. This difference between prediction algorithms has led some researchers to use Hybrid Branch Prediction (HBP) [20][21][22][23][24][25]. One of the earliest uses of HBP is [17], which proposes the Tournament Predictor shown in Figure 4. As the figure shows, the Meta predictor (M) consists of a table of 2-bit counters indexed by the two lower-order bits of the branch address. According to the content of the selected counter, the multiplexer selects predictor P0 when the MSB is 0 and predictor P1 when the MSB is 1. The Meta predictor thus predicts which of the two predictors, P0 or P1, will be correct.
When the branch outcome is available, the predictors P0 and P1 are updated according to their respective update rules, while the Meta predictor is updated according to different rules. The 2-bit counters used in the predictors are finite state machines (FSMs) whose inputs are typically the branch outcome and the previous state of the FSM. For the Meta predictor, the inputs are C0, C1 and the previous FSM state, where Ci is one if Pi predicted correctly. Table 1 lists the state transitions.
When the prediction of P1 was correct and P0 mispredicted, the corresponding counter in M is incremented, saturating at a maximum value of 3. When P1 mispredicts and P0 predicts correctly, the counter is decremented, saturating at zero. If both predictors are correct, or both mispredict, the counter in M is left unmodified. The prediction lookups in P0 and P1 and the state lookup in M are all performed in parallel. When the lookups in the three predictors are done, the Meta predictor is used to choose one of the multiplexer inputs, P0 or P1. The Compaq Alpha 21264 processor [26,27] used the HBP algorithm as shown in Figure 5.
Figure 5. Tournament hybrid for Compaq Alpha 21264
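The selection and update rules of the Meta predictor described above can be written as a short Python sketch. The class and function names are hypothetical, and two details are assumptions: the table is indexed with PC bits 3:2 (the two lowest bits that vary for word-aligned instructions), and the counters start at 1.

```python
class MetaPredictor:
    """Table of 2-bit counters that chooses between component predictors P0 and P1."""

    def __init__(self, index_bits: int = 2, initial: int = 1) -> None:
        self.mask = (1 << index_bits) - 1
        self.counters = [initial] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Assumption: word-aligned instructions, so index with PC bits 3:2.
        return (pc >> 2) & self.mask

    def select(self, pc: int) -> int:
        """Return 0 to use P0 or 1 to use P1: the MSB of the 2-bit counter."""
        return self.counters[self._index(pc)] >> 1

    def update(self, pc: int, c0: bool, c1: bool) -> None:
        """c0 and c1 say whether P0 and P1 predicted the resolved branch correctly."""
        i = self._index(pc)
        if c1 and not c0:
            self.counters[i] = min(3, self.counters[i] + 1)  # favour P1
        elif c0 and not c1:
            self.counters[i] = max(0, self.counters[i] - 1)  # favour P0
        # Both correct or both wrong: the counter is left unmodified.


def tournament_predict(meta: MetaPredictor, pc: int, p0_pred: bool, p1_pred: bool) -> bool:
    """Multiplexer: forward the prediction of whichever predictor the meta table picks."""
    return p1_pred if meta.select(pc) == 1 else p0_pred


if __name__ == "__main__":
    meta = MetaPredictor()
    pc = 0x00400020
    print("initially selects P%d" % meta.select(pc))
    meta.update(pc, c0=False, c1=True)  # P1 was right, P0 was wrong
    print("after one P1-only success selects P%d" % meta.select(pc))
```

In hardware the three lookups happen in parallel; the sketch models only the final selection and the counter update.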

SYSTEM DESIGN AND IMPLEMENTATION
In this research the hybrid prediction method is used in the design of the processor. The processor is a MIPS (Microprocessor without Interlocked Pipelined Stages) pipelined processor with five pipeline stages. The design of this processor is part of the work in the subject Advanced Computer Technology of the MSc course. The design was synthesized using the Xilinx ISE (Integrated Software Environment) Design Suite 14.7 and implemented on the Virtex-4 kit with an operating frequency of 50 MHz. The branch prediction algorithm for the MIPS processor is designed as follows:
- For Unconditional branches and Call/Return, a static Always Taken algorithm is used, because of its simple design and its low misprediction penalty.
- For Conditional branches, a dynamic algorithm, the Two-Level algorithm shown in Figure 2, is used.
- For branches of type Loop, the dynamic Bimodal predictor shown in Figure 1 is used.
- For both dynamic predictors (two-level and bimodal), 1024 2-bit counters are used, which makes the effect of aliasing negligible in this case. At start-up all the 2-bit counters are saturated (their value is 11). For the two-level predictor, 32 Branch History Registers are used in the Branch History Table.
- A selector chooses the prediction algorithm from the three designed algorithms P0, P1, and P2 according to Table 2.
The Prediction Type Circuit (PTC) shown in Figure 6 decides the type of the prediction algorithm according to the branch instruction being executed. Figure 6 shows the designed hybrid algorithm. As Figure 6 shows, the input to the Prediction Type Circuit is bits 0:5 and bits 26:31 of the branch instruction, known as the function field and the Op Code respectively.
The Prediction Type Circuit examines these two sets of bits and decides the type of the branch instruction, so its output depends on the type of the branch. The selector then selects one of the predictors according to Table 2.
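The decoding performed by the Prediction Type Circuit can be illustrated with a Python sketch that extracts the Op Code (bits 26:31) and function field (bits 0:5) from a 32-bit MIPS instruction word. Several points here are assumptions rather than the designed circuit: the text does not say how a Loop branch is recognized, so the sketch treats a conditional branch with a negative offset as a loop; only `beq`, `bne`, `blez` and `bgtz` are treated as conditional branches; and the returned names are hypothetical labels, since the assignment to P0, P1 and P2 is defined in Table 2.

```python
# Standard MIPS32 encodings for the instructions handled by this sketch.
OP_SPECIAL, OP_J, OP_JAL = 0x00, 0x02, 0x03
OP_CONDITIONAL = {0x04, 0x05, 0x06, 0x07}  # beq, bne, blez, bgtz
FUNCT_JR, FUNCT_JALR = 0x08, 0x09

# Branch type -> prediction algorithm, following the design rules in the text.
PREDICTOR_FOR_TYPE = {
    "unconditional": "always_taken",
    "call_return": "always_taken",
    "conditional": "two_level",
    "loop": "bimodal",
}


def classify_branch(instruction: int) -> str:
    """Classify a 32-bit MIPS instruction word by its opcode and funct fields."""
    opcode = (instruction >> 26) & 0x3F  # bits 26:31
    funct = instruction & 0x3F           # bits 0:5
    if opcode == OP_J:
        return "unconditional"
    if opcode == OP_JAL:
        return "call_return"
    if opcode == OP_SPECIAL and funct in (FUNCT_JR, FUNCT_JALR):
        return "call_return"
    if opcode in OP_CONDITIONAL:
        # Assumption: a negative 16-bit offset (sign bit set) marks a loop branch.
        return "loop" if instruction & 0x8000 else "conditional"
    return "not_a_branch"


def select_predictor(instruction: int):
    """Return the name of the prediction algorithm, or None for non-branches."""
    return PREDICTOR_FOR_TYPE.get(classify_branch(instruction))


if __name__ == "__main__":
    samples = {
        "j": 0x08100000,
        "jal": 0x0C100000,
        "jr $ra": 0x03E00008,
        "beq, forward": 0x11090004,
        "bne, backward": 0x1500FFFD,
        "add": 0x01095020,
    }
    for name, word in samples.items():
        print(f"{name:14s} -> {classify_branch(word):14s} {select_predictor(word)}")
```

Keeping the classification separate from the predictor mapping mirrors the split between the Prediction Type Circuit and the selector.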

RESULTS
To test the designed hybrid branch predictor and to compare it with other branch prediction algorithms, three different programs were written and executed on the MIPS pipelined processor with the following cases:
- Without using any BP algorithm.
Table 3 shows the recorded results. It is clear from Table 3 that the hybrid branch prediction algorithm gives the best results, because it uses more than one algorithm, each suited to certain types of branch instructions. It is also clear that using a BP algorithm of any type (static or dynamic) gives better results than using no branch prediction algorithm.

CONCLUSION
There are different kinds of branch instructions, and some branch prediction algorithms suit certain kinds of branches while other kinds are better served by other algorithms. Hence, three different predictors were combined in this work in a single structure, known as a Hybrid Branch Predictor. The predictor was tested with a designed 32-bit pipelined MIPS processor on the Xilinx Virtex-4 kit. Different test programs were written to test the designed hybrid branch predictor with the MIPS processor, the results were compared with other types of prediction algorithms, and the HBP was found to give the best results.