Lecture 11 - ILP more

Pipeline Perf.

CPI = base CPI + BSPI + FPSPI + LdSPI

Where:

BSPI - branch stalls per instruction
FPSPI - floating point stalls per instruction
LdSPI - load stalls per instruction

As BSPI, FPSPI, and LdSPI approach 0, CPI is bounded below by the base CPI, which is 1 for a scalar pipeline.
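
The stall terms can be read as additive contributions on top of the base CPI. Below is a minimal sketch of that reading, assuming the usual additive model (base CPI plus each stall-per-instruction term). The function name `effective_cpi` and the stall rates are hypothetical, chosen only to make the arithmetic visible.

```python
def effective_cpi(base_cpi, bspi, fpspi, ldspi):
    """Assumed additive model: CPI = base CPI + BSPI + FPSPI + LdSPI."""
    return base_cpi + bspi + fpspi + ldspi


# Hypothetical stall rates for a scalar pipeline (base CPI of 1).
print(effective_cpi(1.0, 0.25, 0.125, 0.125))  # 1.5

# With every stall term at 0, only the base CPI remains.
print(effective_cpi(1.0, 0.0, 0.0, 0.0))  # 1.0
```

Under this model, no amount of stall removal gets a scalar pipeline below CPI 1, which is what motivates multiple issue.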

Multiple Instruction Issue

Getting CPI < 1

Issuing multiple instructions

Superscalar
- Variable number of instructions issued each cycle
- Parallelism detected in hardware
Very Long Instruction Words (VLIW)
- Fixed number of instructions issued each cycle
- Parallelism scheduled by the compiler
- Intel IA-64 (Itanium)
- Popular for embedded processors, DSPs (digital signal processors)

Early Superscalar

Early attempt - superscalar MIPS: fetch 2 instructions per cycle, 1 FP and 1 anything else, in order
- Fetch 64 bits per clock cycle; Int on left, FP on right
- Can only issue the 2nd instruction if the 1st instruction issues
- If the instruction mix isn't just right, you issue one at a time
Implications:
- More ports on the FP register file, to do an FP load and an FP op as a pair
- The 1-cycle load delay expands to 3 instructions in SS
  - The instruction in the right half of the same pair can't use the loaded value, nor can the instructions in the next issue slot
  - This amplifies data hazards: I0's memory result can't be forwarded to a consumer before I4
- The branch hazard also expands to 3 instructions
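
The pairing restriction can be made concrete with a small issue-cycle counter. This is a sketch under a simplified rule that is an assumption, not the exact hardware behaviour: a non-FP instruction (`I`) immediately followed by an FP instruction (`F`) issues as a pair, anything else issues alone, and hazards are ignored. The names `dual_issue_cycles` and `cpi` are hypothetical.

```python
def dual_issue_cycles(stream):
    """Count issue cycles for an in-order int+FP dual-issue machine.

    stream: string of 'I' (integer/load/store/branch) and 'F' (FP op).
    Assumed rule: an 'I' directly followed by an 'F' issues together;
    anything else issues alone. Data and branch hazards are ignored.
    """
    cycles = 0
    i = 0
    while i < len(stream):
        if stream[i] == "I" and i + 1 < len(stream) and stream[i + 1] == "F":
            i += 2  # both slots filled this cycle
        else:
            i += 1  # only one instruction issues this cycle
        cycles += 1
    return cycles


def cpi(stream):
    return dual_issue_cycles(stream) / len(stream)


print(cpi("IFIFIFIF"))  # 0.5  (perfect 50% FP mix, ideal order)
print(cpi("IIIIIIII"))  # 1.0  (no FP work, second slot always empty)
print(cpi("IIFIIFFF"))  # 0.75 (mix and order are not just right)
```

Even without hazards, only the perfectly interleaved stream reaches the ideal CPI of 0.5.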

Superscalar Flexibility

Designers quickly grew frustrated with superscalar machines that limited which instructions could be scheduled together
Thus, modern 4-wide superscalar → more flexibility
OoO execution makes this much easier, because the instruction grouping in each cycle is determined by the hardware scheduler, not by the fetch unit (i.e., the order in which the compiler laid out the code)

Dynamic Scheduling and Superscalar

Dependencies stop instruction dispatch in in-order SS
Code compiled for a scalar pipeline will run poorly on SS
- May want the code to vary with how superscalar the target is (i.e., recompile for each target pipeline)
Simple approach: combine dynamic scheduling (such as Tomasulo's algorithm) with the ability to fetch and issue multiple instructions simultaneously
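
To see why a dependency hurts in-order issue more than dynamic scheduling, here is a deliberately idealized sketch. It is not Tomasulo's algorithm: it assumes unit latency, no structural limits beyond issue width, and a scheduling window that covers the whole program. The function `schedule` and the dependency encoding are hypothetical.

```python
def schedule(deps, width, in_order):
    """Return the number of cycles needed to issue every instruction.

    deps[i] is the set of earlier instruction indices that i depends on.
    Unit latency: a result is usable the cycle after its producer issues.
    in_order=True: the first stalled instruction blocks everything behind it.
    in_order=False: any ready instruction may issue (idealized window).
    """
    issued_in = {}  # instruction index -> cycle in which it issued
    cycle = 0
    while len(issued_in) < len(deps):
        cycle += 1
        this_cycle = []
        for i in range(len(deps)):
            if i in issued_in:
                continue
            ready = all(d in issued_in for d in deps[i])
            if ready and len(this_cycle) < width:
                this_cycle.append(i)
            elif in_order:
                break  # in-order issue cannot skip past a stalled instruction
        for i in this_cycle:
            issued_in[i] = cycle
    return cycle


# Two independent chains: i1 depends on i0, i3 depends on i2.
deps = [set(), {0}, set(), {2}]
print(schedule(deps, width=2, in_order=True))   # 3
print(schedule(deps, width=2, in_order=False))  # 2
```

In the in-order case the stalled `i1` blocks the independent `i2` in the first cycle; the out-of-order model pairs `i0` with `i2` and then `i1` with `i3`.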

Limits of Superscalar

While the integer/FP split is simple for HW, we get a CPI of 0.5 only for programs with:
- Exactly 50% FP operations
- No hazards
The more instructions issue at the same time, the harder decode and issue become
In a modern superscalar processor, the register renaming and forwarding logic grows quadratically with issue width
The register file is complex (4-issue requires 8 read ports and 4 write ports)
Out-of-order instruction wakeup and scheduling are also very complex
These are the reasons superscalar width essentially stopped at 4 a couple of decades ago, though recent designs have gone up to roughly 6 to 8
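
The port count and the quadratic growth can be checked with simple arithmetic. This sketch assumes each instruction reads 2 source registers and writes 1 destination, and that dependence checking within an issue group compares every source against the destination of every earlier instruction in the same group. Both function names are hypothetical.

```python
def regfile_ports(issue_width, reads_per_inst=2, writes_per_inst=1):
    """Return (read ports, write ports) needed to feed issue_width instructions."""
    return issue_width * reads_per_inst, issue_width * writes_per_inst


def intra_group_comparators(issue_width, sources_per_inst=2):
    """Comparators for checking dependences inside one issue group.

    Instruction k (0-based) compares each source against the k earlier
    destinations, so the total is sources * (0 + 1 + ... + (w - 1)).
    """
    return sources_per_inst * issue_width * (issue_width - 1) // 2


for w in (2, 4, 8):
    print(w, regfile_ports(w), intra_group_comparators(w))
# 2 (4, 2) 2
# 4 (8, 4) 12
# 8 (16, 8) 56
```

Doubling the width from 4 to 8 doubles the ports but more than quadruples the comparator count in this model.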

Superscalar Key Points

Only way to get CPI < 1 is multiple instruction issue
SS requires duplicated hardware, more dependence checking
Without duplication of functional units, will see limited improvement
SS combined with dynamic scheduling can be powerful

Very Long Instruction Words (VLIW)

Fixed number of instructions issued each cycle
Parallelism scheduled by the compiler
What is it?
Very Long Instruction Word, an attempt at a high-performance ISA
An N-wide VLIW processor issues a packet of N instructions simultaneously
- Compiler guarantees independence of those N instructions
The compiler tries to fill every slot of the instruction format (5 slots here); unfilled slots become NOPs, and filling them requires compiler techniques such as loop unrolling
Loop unrolling needs lots of registers: each unrolled iteration needs its own copy of the loop's registers, so that the iterations stay isolated from one another
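
The effect of loop unrolling on NOPs can be sketched with a greedy bundle packer. This is an illustration, not a real VLIW compiler: it assumes unit latency, no per-slot functional-unit restrictions, and that unrolled iterations are fully independent (each uses its own registers). The names `pack_bundles`, `unrolled`, and `nop_count` are hypothetical.

```python
def pack_bundles(ops, deps, width):
    """Greedily pack ops into fixed-width bundles, padding with 'nop'.

    ops: op names in program order.
    deps: dict mapping an op name to the set of op names it depends on.
    Ops in the same bundle are always independent of each other.
    """
    placed = set()
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) < width and deps.get(op, set()) <= placed:
                bundle.append(op)
        placed.update(bundle)
        remaining = [op for op in remaining if op not in placed]
        bundles.append(bundle + ["nop"] * (width - len(bundle)))
    return bundles


def unrolled(n):
    """Build n independent copies of a load -> add -> store loop body."""
    ops, deps = [], {}
    for k in range(n):
        ld, add, st = f"ld{k}", f"add{k}", f"st{k}"
        ops += [ld, add, st]
        deps[add] = {ld}
        deps[st] = {add}
    return ops, deps


def nop_count(bundles):
    return sum(b.count("nop") for b in bundles)


for n in (1, 5):
    bundles = pack_bundles(*unrolled(n), width=5)
    print(n, len(bundles), nop_count(bundles))
# 1 3 12
# 5 3 0
```

A single iteration leaves 12 of 15 slots as NOPs; unrolling 5 times fills every slot in the same 3 bundles, at the cost of 5 sets of registers.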

Traditional VLIW vs Intel IA64

Traditional VLIW (Multiflow, Transmeta)
- Hard to fill large VLIW
  - Lack of ILP
  - Poor match of instruction stream vs VLIW slots
- Lots of NOPs
Intel IA64 (Itanium)
- More flexible groupings of instructions
- Bundles are only 3 wide, but multiple bundles can issue per cycle
  - so up to 6 ops per cycle for the initial Itanium
- Can collapse multiple groups into one 3-op bundle whose ops are not all parallel, so fewer NOPs

Superscalar vs VLIW

Superscalar Positives
- Less stress on compilers
VLIW Positives

Pipeline case study

Intel Pentium 4
- (at one point) the apex of OoO processors, subsequent processors back…
- …

Aaron's Digital Garden 🪴
