Lecture 6 - Branch Prediction

Looking for Instruction Level Parallelism (ILP)

Identify and exploit ILP - instructions that can potentially be executed at the same time
Branches are 15-20% of instructions
- Implications?
  - 20% of control hazards
  - …
Can only keep pipeline full if we can consistently keep fetching well past unresolved branches
Can only exploit high level of parallelism if we consistently have multiple basic blocks in the processor at once

Importance of Branch Prediction

Timeline:

MIPs scalar, branch hazard of 1 cycle, 1 instruction issued per cycle
- delayed branch
next gen - 2-3 cycle hazard, 1-2 instructions insued per cycle
- cost of branch misprediction goes up
even worse - lots of instructions missed

CPI?

Assume (for simplicity) a scalar pipeline, with a branch hazard (distance from branch resolve/ex to fetch) of 10 cycles.
What is the best case CPI if the branch mispredict rate is 15%, assuming 20% of instructions are branches?
Note 1 – it’s worse on a superscalar pipeline, and potentially even worse on out-of-order
Note 2 – modern pipelines can have 500+ instructions after a branch in the pipeline!

Branch Prediction

Easiest (static prediction)
- always not taken, always taken
- forward not taken, backward always taken
- compiler predicted (branch likely, branch not likely)
What’s the problem with static prediction?
- if you don’t have a branch that is really biased it’s not going to do well
- it’s performance is capped by the max of taken or not taken count
Dynamic branch prediction
- next easiest (1-bit dynamic)
  - remember last taken/not taken per branch (1 bit)
  - per branch approximated
    - per 1 cache line
    - use part of address
  - what happens on a loop? Note: mispredict at N and the T after it. 60% predicts correctly, 1st is trivial case
- 2-bit branch prediction
  - Often referred to as a “saturating counter”
  - Provides some hysteresis, doesn’t over-react to one miss
  - Now 80% correct, miss on N
  - Note: Branch predictor doesn’t reset

Bimodal Predictor

limited size
2 bits by N (eg. 4k)
uses low bits of branch address to choose entry, the volatile bits of nearby addresses
What happens when the table too small?
- branch aliasing (conflicting keys with different branches but not enough space)
What about even/odd branch?
0% if start at weakly Not Taken, bouncing back and forth with neighboring states

Patterns

Can we do better than just history in dynamic branch prediction

We can look at patterns (local predictor) for a particular branch
Pattern History table 1 = branch taken, 0 = branch NOT taken
earliest = left, latest = right, think of it like a queue
right most bit value determines whether to take or not take
- depending on the correctness change BHT in response. TODO review

Correlating Branch Predictors

Can we do even better?!

Correlating Branch Predictors also look at other branches for clues

Often use two indices
- Global history register (GHR) → history of last m branches (e.g. 0100011)
  - branch address
  - virtually every important modern predictor has a GHR (or other globabl path history) in it somewhere

Global history register is a shift register that records the last n conditional branches (of any address) encountered by the processor

If the combining function is xor, this is called the gshare predictor

Combining branch predictors / tournament predictors

Use multiple schemes and a voter to decide which one typically does better for that branch

How does prediction work?

In what stage do we need to do the prediction to avoid any control hazards on a correct prediction?
- Prediction in the instruction fetch stage for next cycle
  - But we don’t know if the instruction is a branch or not yet
  - Don’t know the target address either to move to if branch
A taken/not taken prediction only helps us if..?
- correct

Branch Target Buffers

predict the location of branches in the instruction stream
predict the destination of branches
How?
- use PC (all bits) for lookup
  - match implies this is a branch
- if match and predict bits → taken, set PC to predicted PC
- if branch predict wrong, must recover (same as branch hazards we’ve already seen)
  - but what about dynamically scheduled (out of order) processor??
- if decode indicates branch when no BTB match, two choices:
  - look up prediction now and act on it
  - just predict not taken
- when branch resolved, update BTB (at least prediction bits, maybe more) Performance
Two things can go wrong
- didn’t predict the presence of branch (misfetch)
- mispredicted a branch (mispredict)
Can use both BTB and branch predictor
- have no prediction bits in BTB (why is this a good idea?)
- presence of PC in BTB indicates a lookup in branch predictor to predict whether the branch will go to destination address in BTB
- Modern processors always use both

What about indirect jumps/returns?

Branch predictor does really well with conditional jumps
BTB does really well with unconditional jumps (jump, jal, etc)
Indirect jumps often jump to different destinations, even from the same instruction.
- Indirect jumps often used for returns
- Sometimes for case or switch
- Heavily used in object-oriented languages
Return easily handled by a stack
- jal → push PC+4
- return → predict jump to address on top of stack, pop stack
- Most processors include a hardware return address stack (RAS), used only for prediction
  - procedure calls are typically predicable unless used in weird ways

Summary BTB

Branch Target Buffer allows us to pedict the existence of branches in the fetch stage (so we know to predict condition) and also predict the target
The BTB cannot predict the destination of branches that have multiple targets. For that we need another predictor the indirect branch predictor

Tage Predictor

Different branches require different amount of history

For branches that are not hard to predict, more history = more noise
For really hard branches can benefit from really long history
problem could lead to huge tables
so only use long history for those who need it and less history for those who don’t

Path History Register

Captures global history (like GHR)
unlike GHR, updated only on taken branches
For each taken branch, shift PHR, then xor in some bits of branch PC and some bits of branch target.
For largest Intel predictors, they shift 2, use up to 16 bits of branch PC, up to 6 bits of target, and record up to 388 bits (194 branches) of history!
Both Intel and AMD use PHR. It looks like AMD also uses a small GHR as well, that gets hashed in.

Branch Prediction Key Points

Significant improvement in predictor accuracy overshadowed by increases in pipeline depth and superscalar width
2-bit predictors capture tendencies well
correlating predictors improve accuracy, particularly when combined with 2-bit predictors
accurate branch prediction does no good if we don’t know there was a branch to predict
BTB identifies branches in (or before) IF stage
BTB combined with branch prediction table identifies branches to predict and predicts them well
Modern codes can create significant aliasing in branch predictor tables

Aaron's Digital Garden 🪴

Recent Writing

Computer Arch Crash Course

The Missing Readme - consolidated by new grad

Caching Crash Course

OS Crash Course

Recent Notes

Dist OS

HW Disaggregation

Table of Contents