YAGS Branch Prediction Scheme (A. N. Eden and T. Mudge)

What’s the big problem they are trying to solve?

  • conflicts among predictions: with a limited number of hashed address bits, different branches map to the same predictor entry
  • What does aliasing do to the predictor?
  • It muddies the counters, because branches with very different behavior can alias to the same entry
    • Neutral → when the aliased branches go the same direction
    • Destructive → when they disagree

Aliasing

  • Neutral
  • Destructive
  • What are two obvious ways, then, to reduce the impact of aliasing?
    • make the tables larger, so fewer branches collide
    • map branches that behave the same way to the same predictor entry, so the aliasing that remains is neutral (a toy collision example follows this list)
  • Unintentional aliasing: indexing with only the lower bits of the address (or mixing them with history) lets unrelated branches collide
  • Intentional aliasing from history onto the BHT: global schemes deliberately share entries among branches with the same history
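
A toy illustration of neutral vs. destructive aliasing; the table size and the branch addresses (0x40, 0x140) are made up for the example:

```python
PHT_BITS = 4                      # only 16 counters -> collisions are common
pht = [1] * (1 << PHT_BITS)       # 2-bit counters, initialized weakly not-taken

def index(pc):
    return pc & ((1 << PHT_BITS) - 1)   # low address bits only

def predict(pc):
    return pht[index(pc)] >= 2          # counter >= 2 means predict taken

def update(pc, taken):
    i = index(pc)
    pht[i] = min(pht[i] + 1, 3) if taken else max(pht[i] - 1, 0)

# 0x40 and 0x140 share index 0. If both branches usually went the same
# way, the sharing would be neutral; here they go opposite ways, so each
# update fights the other (destructive aliasing).
for _ in range(10):
    update(0x40, True)    # branch A: always taken
    update(0x140, False)  # branch B: always not-taken
print(predict(0x40), predict(0x140))  # same prediction, wrong for one of them
```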

Anti-Aliasing Predictors

  • Gshare still suffers from aliasing: XORing address with history to form the key makes the index essentially random, so colliding branches are unrelated and the counter values become noisy (see the index sketch after this list)
  • Agree predictor
    • assigns a biasing bit to each branch, stored in the Branch Target Buffer (BTB). PHT entries are reinterpreted as “agree” or “disagree” with the biasing bit. The prediction is taken when the PHT agrees with a taken bias, or disagrees with a not-taken bias; since most branches follow their bias, aliased branches tend to interfere neutrally rather than destructively
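
A minimal sketch of the gshare index and the agree reinterpretation; the 10-bit index width and the helper names are assumptions for illustration, not the paper's parameters:

```python
INDEX_BITS = 10
MASK = (1 << INDEX_BITS) - 1

def gshare_index(pc, ghr):
    # ghr is the global history packed into an integer; XOR spreads
    # indices, but unrelated (pc, history) pairs can still collide
    return (pc ^ ghr) & MASK

def agree_prediction(pht_agrees, bias_taken):
    # the PHT stores agree/disagree with the per-branch bias bit
    # instead of taken/not-taken
    return bias_taken if pht_agrees else not bias_taken
```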

Bi-Mode Predictor

Splits the PHT into two halves, one for taken-biased and one for not-taken-biased branches; a bimodal choice PHT, indexed by the branch address alone, picks which half produces the prediction. Branches sharing a half mostly share a bias, so the aliasing inside each half tends to be neutral (sketch below).
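
A rough sketch of the bi-mode lookup under assumed table sizes; the real design's sizing and update rules are omitted:

```python
SIZE = 1024
choice = [1] * SIZE              # 2-bit choice counters, indexed by PC alone
taken_pht = [2] * SIZE           # direction PHT for taken-biased branches
not_taken_pht = [1] * SIZE       # direction PHT for not-taken-biased branches

def bimode_predict(pc, ghr):
    i = (pc ^ ghr) % SIZE                          # gshare-style direction index
    bank = taken_pht if choice[pc % SIZE] >= 2 else not_taken_pht
    return bank[i] >= 2                            # counter >= 2 means taken
```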

Skew

Distributes aliasing across several banks, each indexed by a different hash, and combines them by majority vote to reduce its effects (sketch below).
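
A sketch of the skewed idea with made-up hash functions (the real design specifies particular skewing functions); the point is only that a branch pair colliding in one bank rarely collides in all three, and the majority vote hides the odd one out:

```python
SIZE = 1024
banks = [[1] * SIZE for _ in range(3)]   # three banks of 2-bit counters

def hashes(pc, ghr):
    # three different (hypothetical) index functions over the same inputs
    x = pc ^ ghr
    return [x % SIZE, (x >> 3) % SIZE, ((pc << 2) ^ ghr) % SIZE]

def skew_predict(pc, ghr):
    votes = sum(bank[i] >= 2 for bank, i in zip(banks, hashes(pc, ghr)))
    return votes >= 2    # majority vote over the three banks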

Filter

Don’t spend main-predictor capacity on easy (strongly biased) branches: detect them (e.g., with a counter alongside the BTB entry), predict them directly, and reserve the PHT for the hard branches (sketch below).
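
One plausible reading of the filter idea, sketched with a hypothetical saturating counter and threshold; the actual mechanism's details differ:

```python
FILTER_MAX = 63   # hypothetical threshold for "strongly biased"

class BTBEntry:
    def __init__(self):
        self.direction = None   # last observed outcome
        self.count = 0          # consecutive same-direction outcomes

def filtered(entry):
    # easy branch: predict entry.direction and bypass the main PHT
    return entry.count >= FILTER_MAX

def update_filter(entry, taken):
    if taken == entry.direction:
        entry.count = min(entry.count + 1, FILTER_MAX)
    else:
        entry.direction, entry.count = taken, 0
```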

YAGS

Idea: enumerate the exceptions. A bimodal choice PHT supplies each branch’s bias, and small tagged caches store only the cases that contradict it, giving bi-mode-like behavior with fewer table entries.

  • YAGS adds tags to the PHT (pattern history table) entries
    • tags are 6-8 bits holding the least significant bits of the branch address; they virtually eliminate aliasing between two consecutive branches (see the sketch after this list)
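
A lookup-only sketch of YAGS with assumed sizes; updates (allocating exception entries on mispredictions) are omitted:

```python
SIZE, TAG_BITS = 1024, 6
choice = [2] * SIZE      # bimodal bias counters, indexed by PC
t_cache = {}             # "taken" cache: index -> (tag, 2-bit counter)
nt_cache = {}            # "not taken" cache: exceptions to a taken bias

def yags_predict(pc, ghr):
    bias_taken = choice[pc % SIZE] >= 2
    i = (pc ^ ghr) % SIZE                   # gshare-style cache index
    tag = pc & ((1 << TAG_BITS) - 1)        # low address bits as the tag
    # consult the cache that records exceptions to this branch's bias
    cache = nt_cache if bias_taken else t_cache
    if i in cache and cache[i][0] == tag:
        return cache[i][1] >= 2             # tagged exception entry hits
    return bias_taken                       # otherwise follow the bias
```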

Isn’t branch prediction really a machine learning problem?

Probably. But only one class of ML-based predictors has really made an impact.

Perceptron Branch Predictor (I)

  • Idea: use a perceptron to learn the correlation between branch history register bits and branch outcomes
  • A perceptron learns a target Boolean function of N inputs
    • Each branch is associated with a perceptron
      • A perceptron contains a set of weights wi
        • Each weight corresponds to a bit in the GHR
        • The weight captures how strongly that bit is correlated with the direction of the branch
        • Positive correlation: large + weight
        • Negative correlation: large - weight
    • Prediction (see the sketch after this list):
      • Express GHR bits as 1 (T) and -1 (NT)
      • Take the dot product of the GHR and the weights
      • If the output > 0, predict taken
  • Advantages:
    • more sophisticated learning mechanism → better accuracy
  • Disadvantages:
    • hard to implement (the dot product adds latency and hardware cost)
    • can learn only linearly-separable functions
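
A minimal sketch of the prediction and training rules just described; the table size, history length, and threshold formula are the commonly cited ones from the perceptron-predictor literature, not parameters from these notes:

```python
HISTORY_LEN = 16          # bits of global history per perceptron
NUM_PERCEPTRONS = 1024    # perceptron table size, indexed by branch PC
THETA = int(1.93 * HISTORY_LEN + 14)   # usual training threshold

# each perceptron holds HISTORY_LEN + 1 signed weights (w0 is a bias weight)
weights = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_PERCEPTRONS)]
ghr = [1] * HISTORY_LEN   # global history bits as +1 (taken) / -1 (not taken)

def predict(pc):
    w = weights[pc % NUM_PERCEPTRONS]
    # dot product of history bits and weights, plus the bias weight
    y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], ghr))
    return y, (y >= 0)    # non-negative output predicts taken (ties -> taken)

def train(pc, y, taken):
    w = weights[pc % NUM_PERCEPTRONS]
    t = 1 if taken else -1
    # train on a misprediction, or when the output is not yet confident
    if (y >= 0) != taken or abs(y) <= THETA:
        w[0] += t
        for i, xi in enumerate(ghr):
            w[i + 1] += t * xi
    # shift the actual outcome into the global history register
    ghr.pop()
    ghr.insert(0, t)

# usage: predict first, then train with the actual outcome
for outcome in [True, True, True, False] * 50:
    y, guess = predict(0x400)
    train(0x400, y, outcome)
```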

What is ILP

The characteristic of a program whereby certain instructions are independent of one another and can potentially be executed in parallel

  • Any mechanism that creates, identifies, or exploits the independence of instructions, allowing them to be executed in parallel

Where do we find ILP?

  • In basic blocks?
    • virtually none: 15-20% of (dynamic) instructions are branches in typical code, so a basic block is only a handful of instructions long
  • Across basic blocks?
    • lots: the farther apart two instructions are, the more likely they are to be independent
    • so look across branches, across control flow (see the example below)
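
Following the numbers above: if 15-20% of dynamic instructions are branches, a branch arrives roughly every 5-7 instructions, which is all a basic block can hold. A contrived snippet showing where the independence sits:

```python
x, y = 3, 4
a = x + 1        # |
b = a * 2        # | dependent chain: must execute in order
c = x * y        # independent of a and b: can issue in parallel with the chain
if b > c:        # the branch ends this basic block
    d = c + 1    # data-independent of b, but control-dependent on the branch
else:
    d = c - 1
print(d)
```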

How do we expose ILP?

  • by moving instructions around (sketch below)
  • How?
    • software: the compiler reorders instructions statically (instruction scheduling)
    • hardware: the processor reorders instructions dynamically (out-of-order execution)
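
A contrived before/after sketch of what "moving instructions around" buys: hoisting the independent computation lets it overlap the dependent chain, which is the same transformation a compiler scheduler performs statically and an out-of-order core performs dynamically:

```python
def unscheduled(x, y):
    a = x + 1
    b = a * 2          # waits on a
    c = b - 3          # waits on b
    d = y * y          # independent of the chain, but issued last here
    return c + d

def scheduled(x, y):
    d = y * y          # hoisted: overlaps with the a -> b -> c chain
    a = x + 1
    b = a * 2
    c = b - 3
    return c + d

assert unscheduled(2, 5) == scheduled(2, 5)   # same result, more overlap
```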