FPGA configuration of an alloyed correlated branch predictor used with RISC processor for educational purposes

ABSTRACT


INTRODUCTION
The direction of a conditional branch depends on the outcome of its condition test which be either true or false. When it is true, the branch is taken and the instruction at the branch target address is fetched. Otherwise, the branch is untaken and the next instruction in ordinary sequence is fetched [1]. This repetitive pattern of a conditional branch has been applied in dynamic branch prediction to keep a local history for each conditional branch from which its future direction can be predicted [2].
In the case of nested conditional branches, the conditions may be related to each other, so the outcome of the running condition test will also depend on that of the previous conditions which in turn will cause the direction of current branch to be affected by the directions of preceding branches [3]. Consequently, each prediction is based on the histories of past branches rather than just its individual history. This global history besides the local history should be recorded by an effective branch prediction unit and exploited in revealing correlation among branches, or high-performance processors will continue to experience pipeline stalls due to correlated branches [4,5]. For instance, Pentium Pro uses the result from the last two branches to select one of the four sets of branch history table (BHT) bits [6], while VIA Nano processor uses a prediction technique that amalgamates the local and global principles of prediction [7]. According to its parallelism which is also considered as one of the major factors that affect processor performance, field programmable gate array (FPGA) is recently adopted by many researchers for the branch prediction implementation [8][9][10][11][12][13]. Where in [8], an accurate and fast branch predictor that uses few resources on FPGA was implemented for general purpose, pipelined, single core soft processor, whereas [9] implemented a configurable VHDL model of a branch predictor unit which is composed of a branch direction predictor and a branch target buffer. A new TAGE branch predictor was proposed by [12], it can avoid being the critical path of the processor while keeping the same prediction accuracy as TAGE.
Later sections of this research are ordered as follows; section 2 expresses a theoretical review of how some dynamic branch predictors utilize the global history of previous branches in predicting current branch direction. Then the proposed VHDL design of the alloyed-correlated branch predictor and how it is implemented in FPGA is described at section 3. Section 4 reveals how the required results which confirmed the efficiency of the design are obtained. Finally, the conclusion is demonstrated in section 5.

THEORETICAL FEATURES OF CORRELATED PREDICTION
In order to reduce the branch misprediction penalty raised due to branch correlating, two levels of history should be used to guess the branch direction. The first level holds the patterns of former branches, while the second level catches the branch directions associated with the last occurrence of a particular pattern in the first level [14,15]. Here is a summarized illustration of how the feature of two history levels is employed in designing some types of branch prediction schemes.

Correlated branch predictor
This technique takes benefit of the correlation between conditional branches therefore it has improved branch prediction accuracy remarkably. Branch prediction process in correlated predictor passed through two levels of history as Figure 1 shows [16]. The first level which is an m-bit global history register (GHR) records the behavior of the last m branches. As for the second level, the pattern history table (PHT) holds the last direction taken by each branch when a certain pattern in the first level occurs [17][18][19]. The lower n bits of the branch address are used to select one of the 2 n entries of the PHT, where each entry consists of 2 m sets. one of these sets is chosen by the GHR. Each set holds a 2-bit saturating counter with saturation levels ranged from (00) as the minimum value to (11) as the maximum value, and the predicted path is specified by the MSB (Most Significant Bit) of the counter [1,20].
Because this predictor only records correlation between m+1consecutive branches, the GHR may not catch any bit belonging to the same branch especially if there are many irrelevant branches in between. This may lead to branch misprediction which in turn will increase the branch penalty [21].

Alloyed branch predictor
This dynamic scheme which also consists of two levels of history has been suggested to address the issue associated with the correlated predictor to some extent. Where a single k-bit GHR, a shift register, is used to record outcomes of the most recently executed k branches. This register which represents the first level is then concatenated with the lower n bits of the branch address to produce an m-bit index which is used to pick one of the 2 m counters of the second level performed by the pattern history table (PHT) as in Figure 2 [22][23][24]. The main drawback of the alloyed predictor lies in the number of history bits in the PHT index which might be insufficient to include all previous branches correlated with the branch being executed [25].

ALLOYED-CORRELATED BRANCH PREDICTOR IN VHDL
This research aims to increase the speedup of the processor by designing a dynamic branch predictor with higher accuracy and lower branch penalty. This predictor which is modelled in VHDL using Xilinx ISE (integrated synthesis environment) Design Suite 14.7 merges the prediction attitudes of both the alloyed predictor and the correlated predictor.
The proposed design of the alloyed-correlated branch predictor shown in Figure 3 is placed within a predesigned 32-bit MIPS processor that is taught as a main subject in the microprocessor architecture course, and its predictive action depends on two levels of history:  The first level is a 7-bit GHR that records the actual direction of the most recent seven branches.  The second level is a PHT with 1024 entries. Each entry has 4 sets, and each set contains a 2-bit saturating counter that is initialized to a default value of "11". During the fetch stage, the seltF (9:0) address is created, as in the alloyed predictor, by concatenating the GHR upper five bits (GHRF (6:2)) with the lower five bits (pcF (6:2)) of the Program Counter (PC) which holds the branch address. Subsequently, the seltF is used to select one of the 1024 entries of the PHT. Here the proposed predictor acts as a correlated predictor and uses the two lower bits GHRF (1:0) of the GHR to choose only one set from the four sets of the indexed entry according to Table 1. The upper bit (predictF (1)) of the selected set content (predict (1:0)) provides the predicted direction as taken or not taken whenever it is '1' or '0' respectively.
Later in the decode stage as shown in Figure 3, whenever the branch is really taken (pcsrcD=1), the corresponding saturating counter is incremented until the maximum value (11) is reached, otherwise it is decremented until the minimum value (00) is reached. Every saturating counter is modelled as a mealy finite state machine (FSM) in which the predfsmop (1:0) output depends on the pcsrcD input besides the current state (predictD (1:0)) as in Figure 4. The output (predfsmop (1:0)) and the next state are generated at the next state logic which is a combinational logic, whereas the current state is held by the current state register which is modeled as a D flip-flop.   After completing the VHDL implementation of the alloyed-correlated branch predictor module, a schematic representation of its components has been created using the RTL and Technology Schematic viewer of Xilinx ISE Design Suite 14.7. As illustrated in Figure 5, this schematic represents the branch predictor design by generic symbols to each of the PHT, the GHR and the 2-bit saturating counter.

RESULTS AND DISCUSSIONS
The alloyed-correlated branch predictor gathers the design aspects of both alloyed and correlated predictors, and to prove effectiveness of the designed predictor, its performance was compared with that of each of alloyed and correlated predictors in terms of program execution time, predictor performance, clock per instruction (CPI) and processor speedup. In this section, the selection sort algorithm that consists of conditional branches as 20% out of its all instructions was used to sort 100 integer numbers of three different distributions: sorted, uniform and normal. Since this algorithm is in-place comparison algorithm, using different types of input data would make the correlation between branch conditions uneven, this made it possible to demonstrate the impact of the proposed design. It can be clearly seen from Table 2 that the alloyed-correlated branch predictor achieved the best results by having the least execution time across all of the three groups of input numbers which in turn means it took the least number of clock cycles to execute the sorting program. The reciprocal relation between the predictor performance and its CPI is presented in Figure 6, that is, the higher the performance, the less the CPI and vice versa. Where the highest performance for the alloyed-correlated branch predictor (225.989) was recorded in part (a) of Figure 6 during the arrangement of the sorted numbers group, which in part (b) corresponding to the lowest CPI value (1.516) for arranging the same input group. The effect of the three prediction schemes on MIPS processor speedup are revealed in Figure 7. It can be deduced that processor speedup rates of the first two predictors (alloyed and correlated) were almost close for all combinations of input data, then these values increased noticeably when using the alloyed-correlated predictor.

CONCLUSION
In this research, a hardware model for a proposed dynamic branch predictor design has been configured in VHDL and synthesized using Xilinx ISE design suite 14.7 tool. This design decreases the penalty of branch misprediction caused by correlated branches by amalgamating the prediction mechanisms of two predictors (the alloyed predictor and the correlated predictor). After completing the design, it has been included inside the fetch stage of a 5 stages pipelined MIPS processor then the selection sort algorithm was executed in order to arranging input numbers of different combinations ascendingly. The obtained results showed the effectiveness of the alloyed-correlated predictor design.
At the end, it should be mentioned that this research is intended to be used in a computer architecture course where students will be able to design, implement, and debug a dynamic branch predictor and study its efficacy in minimizing the effect of nested conditional branches on prediction accuracy.