Cache Updates
Write hits
- Write through: The information is written to both the block in the cache and to the block in the lower-level memory.
- Write back: The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
- Just record the fact that this block is dirty and still needs to update the lower-level memory (and other caches)
Write misses
- write-allocate - make room for the cache line in the cache, fetch rest of line from memory. Then perform write into block
- no-write-allocate (write-around) - write to lower levels of memory hierarchy, ignoring this cache
- Decision is ultimately based on locality: is there locality between writes and later loads? (the sketch after this list walks through the policy combinations)
- are you reading the data you’re writing
- processors are much more sensitive to the latency of loads than of stores; you typically don't need to wait for stores to finish
- loads = reads
- stores = writes
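A minimal sketch of how the write-hit and write-miss policies combine, using a toy one-word, one-block cache (the single-block simplification and all names are my own assumptions, not something from the notes):

```c
/* Toy one-block, one-word cache illustrating write-hit policy
   (write-back vs. write-through) and write-miss policy
   (write-allocate vs. write-around). Real caches have many sets/ways. */
#include <stdio.h>
#include <stdbool.h>

#define MEM_WORDS 16
static int memory[MEM_WORDS];

static struct { int addr; int data; bool valid; bool dirty; } block;

/* Policy knobs (assumptions for the sketch). */
static bool write_back     = true;   /* false = write-through */
static bool write_allocate = true;   /* false = write-around  */

static void cache_write(int addr, int value) {
    if (block.valid && block.addr == addr) {        /* write hit            */
        block.data = value;
        if (write_back)
            block.dirty = true;                     /* defer memory update  */
        else
            memory[addr] = value;                   /* write through        */
    } else if (write_allocate) {                    /* write miss, allocate */
        if (block.valid && block.dirty)
            memory[block.addr] = block.data;        /* write back victim    */
        block.addr = addr;                          /* "fetch" the block    */
        block.data = value;                         /* then perform write   */
        block.valid = true;
        block.dirty = write_back;
        if (!write_back)
            memory[addr] = value;
    } else {                                        /* write miss, around   */
        memory[addr] = value;                       /* bypass this cache    */
    }
}

int main(void) {
    cache_write(3, 42);   /* miss */
    cache_write(3, 43);   /* hit (because of write-allocate) */
    printf("memory[3]=%d, cached data=%d, dirty=%d\n",
           memory[3], block.data, block.dirty);
    return 0;
}
```

With write-back plus write-allocate the output shows memory[3] still 0 and the cached copy dirty, i.e. main memory is only updated when the block is eventually replaced.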
Separate Caches for Instructions and Data
Unified cache vs separate instruction and data caches?
This is the norm:
- separated instruction and data cache
- L2 unified
- L3 unified
Why this design is performant:
- Instruction fetch is in a different place in the processor than the load/store unit
- Place I cache by the instruction fetch
- Place data cache by the load store unit
- Smaller caches are faster! (make each cache half the size by separating them)
- less physical distance
- Multiported caches are inefficient compared to single-ported caches
- A port is a hardware access path (separate address lines)
- A cache is multiported when it can service more than one independent read/write request per cycle
Cache Performance
- CPI = Base CPI + Memory stall cycles per instruction
- includes Hit Time as part of CPI
- typically will be called “Memory stalls (cycles) per instruction”, MCPI
- or MCPI = Misses per instruction × Miss penalty
- or MCPI = Memory accesses per instruction × Miss rate × Miss penalty
- MCPI = (I-cache accesses per instruction × I-cache miss rate + D-cache accesses per instruction × D-cache miss rate) × Miss penalty
- considering both instruction cache and data cache
Example:
- Instruction Cache miss rate 4%
- Data cache miss rate 9%
- Base CPI 1.1
- 20% of instructions are loads and stores
- Miss penalty = 12 cycles
- CPI?
CPI = 1.1 + (1 × 0.04 × 12) + (0.20 × 0.09 × 12) = 1.1 + 0.48 + 0.216 = 1.796
- Assuming scalar machine, 1 access per instruction
- Superscalar, 1 access per x instructions
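The example arithmetic above, as a small runnable check (variable names are illustrative):

```c
/* CPI = Base CPI + I-cache stalls/instr + D-cache stalls/instr,
   using the miss rates and penalty from the example above. */
#include <stdio.h>

int main(void) {
    double base_cpi      = 1.1;
    double icache_miss   = 0.04;   /* instruction cache miss rate   */
    double dcache_miss   = 0.09;   /* data cache miss rate          */
    double mem_per_instr = 0.20;   /* fraction of loads and stores  */
    double miss_penalty  = 12.0;   /* cycles                        */

    /* Scalar machine: 1 instruction fetch per instruction. */
    double i_stalls = 1.0 * icache_miss * miss_penalty;            /* 0.48  */
    double d_stalls = mem_per_instr * dcache_miss * miss_penalty;  /* 0.216 */

    printf("CPI = %.3f + %.3f + %.3f = %.3f\n",
           base_cpi, i_stalls, d_stalls, base_cpi + i_stalls + d_stalls);
    return 0;
}
```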
Improving Cache Performance
AMAT = Hit time + Miss rate × Miss penalty
How are we going to improve cache performance:
- Reduce hit time
- Reduce miss rate
- Reduce miss penalty
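A quick sketch of how each lever moves AMAT; the baseline parameters (1-cycle hit time, 5% miss rate, 20-cycle miss penalty) are assumptions chosen only for illustration:

```c
/* AMAT = Hit time + Miss rate * Miss penalty, evaluated for the three
   improvement levers with made-up parameter values. */
#include <stdio.h>

static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    printf("baseline:            %.2f cycles\n", amat(1.0, 0.05, 20.0));
    printf("reduce hit time:     %.2f cycles\n", amat(0.8, 0.05, 20.0));
    printf("reduce miss rate:    %.2f cycles\n", amat(1.0, 0.03, 20.0));
    printf("reduce miss penalty: %.2f cycles\n", amat(1.0, 0.05, 12.0));
    return 0;
}
```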
Classifying Misses
- Compulsory - first access to a block not in cache
- also called cold start misses or first reference misses
- Capacity - if C is the size of the cache (in blocks), a miss is a capacity miss when more than C unique cache blocks have been accessed since this block was last accessed
- Conflict - Any miss that is not a compulsory or capacity miss must be a byproduct of the cache mapping algorithm. A conflict miss occurs because too many active blocks are mapped to the same cache set
- e.g. accesses are concentrated on one set, leading to misses while other sets are not fully utilized; that makes these conflict misses rather than capacity or compulsory misses (classified programmatically in the sketch below)
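A sketch of classifying misses in the 3C model: a block's miss is compulsory if the block has never been seen, capacity if it would also miss in an equal-size fully-associative LRU cache, and conflict otherwise. The cache size, trace, and helper names are made-up assumptions:

```c
/* 3C miss classification on a tiny block-address trace.
   Compulsory: block never referenced before.
   Capacity:   also misses in a fully-associative LRU cache of equal size
               (more than N unique blocks accessed since its last use).
   Conflict:   misses here but would hit in the fully-associative cache. */
#include <stdio.h>
#include <stdbool.h>

#define N 4                     /* cache size in blocks           */
#define MAX_SEEN 1024

static int  dm[N];              /* direct-mapped: one tag per set */
static bool dm_valid[N];
static int  fa[N];              /* fully-associative LRU tags     */
static long fa_last[N];         /* last-use timestamps            */
static bool fa_valid[N];
static int  seen[MAX_SEEN];     /* every block referenced so far  */
static int  nseen;

static bool seen_before(int b) {
    for (int i = 0; i < nseen; i++) if (seen[i] == b) return true;
    return false;
}

/* Returns true on hit in the fully-associative LRU cache; updates it. */
static bool fa_access(int b, long now) {
    int victim = 0;
    for (int i = 0; i < N; i++) {
        if (fa_valid[i] && fa[i] == b) { fa_last[i] = now; return true; }
        if (!fa_valid[i]) victim = i;
    }
    if (fa_valid[victim])                       /* no empty slot: true LRU */
        for (int i = 0; i < N; i++)
            if (fa_last[i] < fa_last[victim]) victim = i;
    fa[victim] = b; fa_valid[victim] = true; fa_last[victim] = now;
    return false;
}

int main(void) {
    int trace[] = {0, 4, 0, 4, 1, 2, 3, 5, 0};   /* block addresses */
    for (long t = 0; t < (long)(sizeof trace / sizeof trace[0]); t++) {
        int b = trace[t], set = b % N;
        bool dm_hit = dm_valid[set] && dm[set] == b;
        bool fa_hit = fa_access(b, t);
        bool is_new = !seen_before(b);
        if (is_new) seen[nseen++] = b;
        if (dm_hit)        printf("block %d: hit\n", b);
        else if (is_new)   printf("block %d: compulsory miss\n", b);
        else if (!fa_hit)  printf("block %d: capacity miss\n", b);
        else               printf("block %d: conflict miss\n", b);
        dm[set] = b; dm_valid[set] = true;
    }
    return 0;
}
```

On this trace, blocks 0 and 4 fight over one direct-mapped set (conflict misses), while the final access to block 0 comes after more than N unique blocks (capacity miss).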
Reducing Misses?
- Reducing compulsory misses?
- Increasing the cache line size → other nearby addresses within the cache line get loaded in → more hits
- Reducing capacity misses?
- increasing cache size
- Reducing conflict misses?
- increasing associativity
- increasing cache size, spreads out cache blocks
- if there are 2× the cache sets, blocks that previously all mapped to one set are now spread across two sets
- What can the compiler do? (one standard answer is sketched below)
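The notes leave “What can the compiler do?” open; one standard answer is loop transformations that improve locality, e.g. loop interchange, sketched here assuming a row-major C array:

```c
/* Loop interchange: a standard compiler (or programmer) transformation
   that reduces misses by matching traversal order to the row-major
   layout of C arrays. The array size is an arbitrary assumption. */
#include <stdio.h>

#define N 512
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Column-major traversal: consecutive accesses are N*8 bytes apart,
       so nearly every access touches a new cache line. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    /* Interchanged (row-major) traversal: consecutive accesses fall in
       the same cache line, so spatial locality turns misses into hits. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}
```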

- Avg. Memory access time vs Miss rate
- Increasing Associativity and Cache sizes decreases miss rate
- But there are limits to Associativity (adds latency to cache) and cache sizes
Hardware Prefetching
Instruction Prefetching
- Alpha 21064 fetches 2 blocks on a miss
- Extra block placed in stream buffer
- placed in the stream buffer instead of the cache; when you get a hit in the stream buffer and move that block into the cache, you know to put the next sequential block into the stream buffer (see the sketch below)
- On miss check stream buffer
- Works with data blocks too
- Prefetching relies on extra memory bandwidth that can be used without penalty
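A sketch of the single-entry stream buffer described above: on a miss, fetch the block and place block+1 in the stream buffer; on a stream-buffer hit, promote the block into the cache and refill the buffer with the next block. The direct-mapped cache, block-number tags, and sizes are simplifying assumptions:

```c
/* One-entry stream buffer sketch in the style described above. */
#include <stdio.h>
#include <stdbool.h>

#define SETS 8

static int  cache[SETS];         /* direct-mapped, tags = block numbers */
static bool cache_valid[SETS];
static int  stream_buf;          /* single-entry stream buffer          */
static bool stream_valid;

static void access_block(int b) {
    int set = b % SETS;
    if (cache_valid[set] && cache[set] == b) {
        printf("block %d: cache hit\n", b);
        return;
    }
    if (stream_valid && stream_buf == b) {
        /* Stream-buffer hit: promote to cache, prefetch the next block. */
        printf("block %d: stream-buffer hit, prefetching %d\n", b, b + 1);
        cache[set] = b; cache_valid[set] = true;
        stream_buf = b + 1;                       /* keep the stream going */
        return;
    }
    /* True miss: fetch b into the cache and b+1 into the stream buffer. */
    printf("block %d: miss, fetching %d into stream buffer\n", b, b + 1);
    cache[set] = b; cache_valid[set] = true;
    stream_buf = b + 1; stream_valid = true;
}

int main(void) {
    /* Sequential instruction-fetch pattern benefits the most. */
    for (int b = 100; b < 106; b++) access_block(b);
    return 0;
}
```

Running it over sequential block numbers shows one initial miss followed by stream-buffer hits, which is exactly the instruction-fetch pattern this helps.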
Strategies:
- Next line prefetching
- Prefetch the next line on each access
- “Miss on A bring in A + 1”
- Next N lines
- Tagged next line (in cache)
- tag prefetched lines, so that you do a next-line prefetch when you access a prefetched line
- If hit on a prefetch line (A+1), you can prefetch A+2
- tag: usefulness bit - “Last time we missed on line X, prefetching X+1 helped”
- Stream Buffers
- offset to next line to access
- Stride-based stream buffers
- what most modern prefetchers use
- offset to any line to access, learned dynamically at runtime (see the sketch after this list)
- Pointer based prefetchers
- when a value is loaded from memory, if it looks like a pointer, prefetch the line it points to into the cache
- you can tell it's probably a pointer if the high bits of the address and the high bits of the data value are the same
- regions of a process are partitioned by ranges in high bits of address space
- Heap region has distinct high bits, if a value lands in that range and is aligned, probably a pointer
- High bits of pointer address ~= high bits of value (data address), probably a pointer
- Markov prefetchers
- create a graph of typical address transitions (which miss addresses tend to follow which) and prefetch likely successors
- Fetch-directed instruction prefetchers
- instead of stalling the pipeline, decouple branch predictor and let it run by itself
- creates queue of addresses to prefetch
- dominant prefetcher in Intel processors
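A sketch of a stride-based prefetcher that learns its stride at runtime. A real design would track one entry per load PC; this sketch keeps a single global entry, and all names are illustrative:

```c
/* Stride prefetcher sketch: remember the last address and last stride;
   once the same stride is seen twice in a row, prefetch addr + stride. */
#include <stdio.h>
#include <stdbool.h>

static long last_addr;
static long last_stride;
static bool trained;     /* same non-zero stride observed twice in a row */
static bool have_last;

static void on_access(long addr) {
    if (have_last) {
        long stride = addr - last_addr;
        trained = (stride != 0 && stride == last_stride);
        last_stride = stride;
    }
    if (trained)
        printf("access %ld -> prefetch %ld (stride %ld)\n",
               addr, addr + last_stride, last_stride);
    else
        printf("access %ld -> no prefetch yet\n", addr);
    last_addr = addr;
    have_last = true;
}

int main(void) {
    /* Strided pattern, e.g. walking an array of 64-byte structs. */
    for (long a = 0; a < 8 * 64; a += 64) on_access(a);
    return 0;
}
```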
Software Prefetching
Different from HW prefetching in that the prefetch is embedded into the code
- instruction in the ISA that tells the HW to prefetch (see the sketch at the end of this section)
- System needs to know the difference between loads and prefetches
Examples:
- Data Prefetch
- Cache Prefetch: (usually) load into cache
- Special prefetching instructions cannot cause faults; a form of speculative execution
- Issuing Prefetch Instructions (including address calculation) takes time
- Is cost of prefetch issues < savings in reduced misses?
- Helper thread prefetching (speculative precomputation)
- Another thread that runs ahead of main thread to do speculative precomputation
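A concrete sketch of the prefetch-instruction approach from the list above, using GCC/Clang's __builtin_prefetch; the prefetch distance of 16 elements is an arbitrary assumption that would need tuning per machine:

```c
/* Software prefetching sketch: prefetch a[i + DIST] while summing a[i].
   __builtin_prefetch emits a non-faulting prefetch instruction where the
   target ISA has one; it is only a hint and may be dropped. */
#include <stdio.h>
#include <stdlib.h>

#define N    (1 << 20)
#define DIST 16              /* prefetch this many elements ahead (tuning knob) */

int main(void) {
    int *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = i & 0xff;

    long sum = 0;
    for (int i = 0; i < N; i++) {
        if (i + DIST < N)
            __builtin_prefetch(&a[i + DIST], 0 /* read */, 1 /* low temporal locality */);
        sum += a[i];
    }
    printf("sum = %ld\n", sum);
    free(a);
    return 0;
}
```

The prefetch is only a hint: it cannot fault and the hardware may drop it, which is what makes this a form of speculative execution.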