Software/Compiler Optimization
Big idea
- Instruction cache
- Reorder procedures in memory to reduce misses
- Use profiling to see which procedures conflict
- Data cache
- Merging Arrays
- Improve spatial locality by single array of compound elements vs 2 arrays
- Example:
- a key/value table can be represented with two parallel arrays; bundling each key with its value into one struct provides spatial locality (see the sketch below)
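A minimal sketch of the idea in C (names and sizes are illustrative, not from the source):

```c
#include <stddef.h>

/* Two parallel arrays: key[i] and value[i] live in different cache lines,
   so touching a pair costs two lines. */
int key[1000];
int value[1000];

/* Merged: each key travels with its value, so one access usually brings
   both into the cache (better spatial locality). */
struct kv {
    int key;
    int value;
};
struct kv merged[1000];

/* Illustrative lookup: once the key matches, the value is already in the
   same cache line -- no second miss as with the two-array version. */
int lookup(const struct kv *table, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].key == k)
            return table[i].value;
    return -1;
}
```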
- Loop interchange
- Change nesting of loops to access data in order stored in memory
- Example
- Nested-loop code whose inner loop strides down a column of the matrix jumps a full row of memory on every access. Better locality is achieved by interchanging to row-column order, so the inner loop steps through consecutive elements of a row, i.e., sequential access in memory (sketch below).
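A minimal sketch, assuming a row-major C array:

```c
#define N 1024
static double x[N][N];   /* row-major, as in C */

/* Column-wise inner loop: consecutive iterations are N doubles apart in
   memory, so nearly every access touches a new cache line. */
double sum_column_major(void)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += x[i][j];
    return s;
}

/* Interchanged loops: the inner loop walks sequentially through memory,
   so each cache line fetched is fully used before it is evicted. */
double sum_row_major(void)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += x[i][j];
    return s;
}
```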
- Loop fusion
- Combine two independent loops that have the same loop bounds and access some of the same variables
- Example:
- Below (sketch after this bullet), we get locality between a[i][j] and c[i][j] when fused. Otherwise the a[i][j] and c[i][j] values would already be gone from the cache by the time the second loop needs them, since it has to start from the beginning again: 2 misses per element before fusion vs. 1 miss after.
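A sketch close to the classic textbook example (array names and bounds are assumptions):

```c
#define N 512
static double a[N][N], b[N][N], c[N][N], d[N][N];

/* Two separate loops: by the time the second loop runs, the a[i][j] and
   c[i][j] values it needs have likely been evicted -> 2 misses per
   element instead of 1. */
void unfused(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0 / b[i][j] * c[i][j];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            d[i][j] = a[i][j] + c[i][j];
}

/* Fused: a[i][j] and c[i][j] are reused immediately, while they are
   still in the cache (or even in registers). */
void fused(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0 / b[i][j] * c[i][j];
            d[i][j] = a[i][j] + c[i][j];
        }
}
```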
- Blocking
- Improve temporal locality by accessing “blocks” of data repeatedly vs going down whole columns or rows
- Example:
- The idea is to partition the big matrix into smaller blocks that fit in the cache, so the x[i][j] values (and the tiles they depend on) are reused while still resident, maintaining temporal locality (see the sketch below)
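A sketch of blocked matrix multiply, assuming a blocking factor B chosen so the working tiles fit in the cache (all sizes are illustrative, and x is assumed to start zeroed):

```c
#define N 1024
#define B 32   /* blocking factor: assumed small enough that the tiles fit in cache */
static double x[N][N], y[N][N], z[N][N];

/* Naive: for each x[i][j], the inner loop streams a whole row of y and a
   whole column of z; for large N they no longer fit in cache, so the
   same data is re-fetched for every (i, j). */
void matmul_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double r = 0.0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] += r;
        }
}

/* Blocked: operate on B x B submatrices so the tiles of y and z are
   reused many times before being evicted (temporal locality). */
void matmul_blocked(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```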
Reduce the miss penalty
Read priority over write
- loads > stores
- general principle: don't stall the CPU for stores
- Compiler → register allocation → once the store is issued, the value is already in a register, so nothing waits on the store completing
- Easiest method to resolve RAW hazards through memory: send loads and stores out in program order
- Problem: loads get stuck behind stores
- Solution: prioritize loads, let loads pass stores in pipelines
- As long as write buffer not full (structural hazard)
- Let store buffer empty out as bandwidth allows
- Need to identify whether a load depends on a buffered store (address match against the write buffer; see the sketch after this list)
- How do write-back caches handle this?
- Read miss may require write of dirty block
- Copy dirty block to a write buffer, read, then write
- Fewer CPU stalls: the read doesn't have to wait for the write to complete at the next level of the memory hierarchy
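The dependence check above boils down to an address comparison against the write buffer. A conceptual sketch only, not real hardware; all names and sizes are illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64
#define WB_ENTRIES 4

/* A small write buffer holding dirty blocks (or pending stores) on
   their way to the next level of the hierarchy. */
struct wb_entry {
    bool     valid;
    uint64_t block_addr;
    uint8_t  data[LINE_BYTES];
};

static struct wb_entry write_buffer[WB_ENTRIES];

/* Giving reads priority over writes only works if a load that matches a
   buffered (not yet written) block gets its data from the buffer or
   waits -- otherwise it would read stale data from the next level. */
bool load_matches_write_buffer(uint64_t addr)
{
    uint64_t block = addr & ~(uint64_t)(LINE_BYTES - 1);
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block_addr == block)
            return true;
    return false;
}
```

The same buffer is what lets a write-back cache park a dirty victim on a read miss: the read is serviced first, and the buffered write drains later as bandwidth allows.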
Non-blocking Caches to reduce stalls on misses
- Non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
- "hit under miss" reduces the effective miss penalty by continuing to service CPU requests during a miss instead of ignoring them
- requires more metadata to know whether a new access depends on the data that is still being fetched (see the sketch below)
- "hit under multiple miss" or "miss under miss" further lowers the effective miss penalty by overlapping multiple misses
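The extra metadata is commonly organized as miss status holding registers (MSHRs). A rough conceptual sketch, with all sizes and names as illustrative assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHR    8
#define MAX_TARGETS 4

/* One MSHR per outstanding miss: which block is being fetched, and which
   requests are waiting on it, so later hits can still be serviced and
   later misses to the same block can be merged. */
struct mshr {
    bool     valid;
    uint64_t block_addr;                /* block being fetched from memory */
    int      num_targets;               /* requests waiting on this block  */
    uint64_t target_addr[MAX_TARGETS];
};

static struct mshr mshrs[NUM_MSHR];

/* A new access only has to stall if it misses AND every MSHR is busy
   (or the matching MSHR's target list is full); hits proceed normally. */
bool must_stall_on_miss(uint64_t block_addr)
{
    for (int i = 0; i < NUM_MSHR; i++)
        if (mshrs[i].valid && mshrs[i].block_addr == block_addr)
            return mshrs[i].num_targets >= MAX_TARGETS; /* miss under miss: merge if room */
    for (int i = 0; i < NUM_MSHR; i++)
        if (!mshrs[i].valid)
            return false;               /* free MSHR: allocate and keep running */
    return true;                        /* all MSHRs busy: structural stall */
}
```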
Multi-level caches
Primary way to reduce miss penalty
- Second level cache

- Local vs Global cache miss rates
- Local miss rate = misses in this cache / total number of memory accesses to this cache (for L2, this is Miss Rate_L2)
- Global miss rate = misses in this cache / total number of memory accesses generated by the CPU (for L2, this is Miss Rate_L1 × Miss Rate_L2)
- Why is a 10% miss rate for the L1 cache and a 40% (local) miss rate for the L2 cache reasonable?
- L1 captures the "easy" locality. The accesses that reach L2 are exactly the ones L1 already missed, i.e., the references with poor locality, so L2's local miss rate is naturally much higher
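- Quick sanity check (assuming the 10% and 40% above are local miss rates): global L2 miss rate = Miss Rate_L1 × Miss Rate_L2 = 0.10 × 0.40 = 0.04
- so only about 4% of all CPU memory references go past L2 to main memory, even though L2's local miss rate looks high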
- L1’s highest priority is fast hit time
- L2’s highest priority is low miss rate
- L1 caches: low associativity → fast hit time
- L2 caches high associativity → lower miss rate
- Block size in L1 vs L2? → keep it the same, easier to reason about
- in practice nobody changes block size between levels
Policies
- Inclusion policy
- Inclusive
- Everything in L1 is also in L2
- not the other way around
- Better for read heavy?
- Exclusive
- Nothing in L1 is also in L2
- more effective utilization of total capacity (no duplicated lines), which helps performance
- performance overhead in communication: every line evicted from L1 has to be moved into L2, i.e., every L1 cache line is effectively treated as dirty
- Better for write heavy?
- Non-inclusive
- Some (likely most, but not all) lines in L1 are in L2
- Let caches do whatever they want
- Common case now
- cache coherency issues
- Dynamic scheduling??
Reducing Miss Penalty Summary
- Three techniques
- Read priority over write on miss
- Non-blocking Caches (Hit Under Miss)
- Multi-level Caches
Reduce the time to hit in the cache
Small and simple L1 caches
- I-cache and D-cache used to be direct mapped and kept on chip for speed
- out-of-order execution hides load latency, which made a huge difference here
- a load feeding an add on an in-order machine makes load hit time critical, but with OOO not so much anymore, since other independent instructions can execute while the load completes
Way Prediction
In a direct-mapped cache the tag check and data access happen in parallel, since there is only one candidate block per set
- in an N-way set-associative cache, selecting the right way after the tag compare turns set lookup and tag lookup into a serialized process (or forces reading all N ways in parallel)
- Way prediction guesses which way will hit, to mimic direct-mapped hit performance: find the set, read the predicted way, match the tag (sketch below)
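A conceptual sketch of the predicted-way lookup (structure and names are illustrative, not any real design):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 64
#define NUM_WAYS 4

struct way {
    bool     valid;
    uint64_t tag;
};

static struct way    cache[NUM_SETS][NUM_WAYS];
static unsigned char predicted_way[NUM_SETS];   /* per-set prediction bits */

/* Returns true on a hit; *extra_cycle reports whether the slower
   fallback path (probing the remaining ways) was needed. */
bool lookup(uint64_t tag, unsigned set, bool *extra_cycle)
{
    unsigned p = predicted_way[set];
    *extra_cycle = false;

    if (cache[set][p].valid && cache[set][p].tag == tag)
        return true;                               /* fast path: like direct mapped */

    *extra_cycle = true;                           /* mispredict: check the other ways */
    for (unsigned w = 0; w < NUM_WAYS; w++) {
        if (w == p)
            continue;
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            predicted_way[set] = (unsigned char)w; /* train the predictor */
            return true;
        }
    }
    return false;                                  /* genuine miss */
}
```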
Virtual Caches
TLB is the bottleneck on the critical path of virtual → physical
- TLB - fast cache of page table, HW

Why don’t we just send virtual address to cache?
Also called virtually addressed cache or virtual cache
Issues:
- Every time the process is switched, the cache logically must be flushed (invalidated), since the same virtual addresses now refer to different data
cost = time to flush + "compulsory" misses from empty cache
- Aliasing problem with virtual addresses, say two different processes sharing a physical page
- two different virtual addresses map to the same physical address
- the same physical block can then sit in two different cache locations; the conflicting copies cause consistency problems (or keep knocking each other out)
- I/O must interact with cache…
- Solutions to aliases
- HW that guarantees that every cache block has a unique physical address
- SW guarantee (page coloring): the lower n bits of the virtual and physical address must be identical; as long as those bits cover the index field and the cache is direct mapped, aliases land in the same cache location, so only one copy can exist (see the sketch after this list)
- not used much anymore, due to the direct-mapped constraint
- Solution to cache flush
- Add a process-identifier tag that identifies the process as well as the address within the process; a lookup can't hit if the process is wrong, so the cache doesn't have to be flushed on a switch
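A minimal sketch of the page-coloring constraint, assuming 4 KB pages and a cache that indexes with the low 14 address bits (e.g., a 16 KB direct-mapped cache with 64 B lines); both numbers are assumptions, not from the source:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12                   /* 4 KB pages                            */
#define INDEX_BITS 14                   /* address bits used for index + offset  */
#define COLOR_BITS (INDEX_BITS - PAGE_SHIFT)
#define COLOR_MASK (((uint64_t)1 << COLOR_BITS) - 1)

/* The "color" is the few address bits between the page offset and the
   top of the cache index. Bits below the page offset are identical for
   any alias anyway, so matching colors means the full index matches. */
static uint64_t color_of(uint64_t addr)
{
    return (addr >> PAGE_SHIFT) & COLOR_MASK;
}

/* The OS only maps a virtual page to a physical frame of the same color;
   then every alias of a physical block indexes the same cache location. */
bool mapping_is_legal(uint64_t vaddr, uint64_t paddr)
{
    return color_of(vaddr) == color_of(paddr);
}
```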
Current Implementation - Virtually indexed physically tagged!
Note: If the index and block offset fit within the page offset, TLB translation can be done in parallel, because translation only affects the high bits (the virtual page number). The cache can already select the set and start checking for a hit or miss, although the tag compare still has to wait for the physical address from the TLB

- Tag check based off of physical address
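- Worked example (parameters are assumptions, not from the source): 4 KB pages → 12-bit page offset; a 32 KB, 8-way L1 with 64 B blocks has 32 KB / (8 × 64 B) = 64 sets → 6 index bits + 6 block-offset bits = 12 bits
- those 12 bits sit entirely inside the page offset, so the index is untranslated and the set lookup can start in parallel with the TLB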
Trace Caches
- Handles fetch bottleneck - cannot execute instructions faster than you can fetch them into the processor
- For wide superscalars, keeping the CPU fed can be a challenge
- too many branches
- Cannot typically fetch more than about one taken branch per cycle
- Trace cache is an instruction cache that stores instructions in dynamic execution order rather than program/address order
- The stored order follows the dynamic path actually executed, which can vary from run to run (sketch below)
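A conceptual sketch of what a trace cache entry might track (all field names and sizes are illustrative assumptions):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_TRACE_INSTRS 16
#define MAX_BRANCHES      3

/* A trace is identified by its starting PC plus the outcomes of the
   branches inside it, and holds instructions in the dynamic order they
   executed, not in program/address order. */
struct trace_entry {
    bool     valid;
    uint64_t start_pc;                  /* PC where the trace begins           */
    uint8_t  num_branches;              /* branches embedded in the trace      */
    uint8_t  branch_outcomes;           /* taken/not-taken bit per branch      */
    uint8_t  num_instrs;
    uint32_t instrs[MAX_TRACE_INSTRS];  /* instructions in execution order     */
};

/* Fetch hits only if it starts at the same PC *and* the branch predictor
   currently expects the same outcomes that were baked into the trace. */
bool trace_hit(const struct trace_entry *e, uint64_t pc,
               uint8_t predicted_outcomes)
{
    uint8_t mask = (uint8_t)((1u << e->num_branches) - 1);
    return e->valid && e->start_pc == pc &&
           ((e->branch_outcomes ^ predicted_outcomes) & mask) == 0;
}
```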
