LLM Inference
LLMs generate text one token at a time: each step predicts a probability distribution over the vocabulary, from which the next token is chosen.
Prefill vs Decode
Prefill processes the whole prompt in one parallel pass; decode is autoregressive, producing one token per forward pass.
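A minimal sketch of the prefill/decode split, with a toy stub standing in for a real model forward pass:

```python
import numpy as np

def toy_model(tokens: list[int]) -> np.ndarray:
    """Hypothetical stub for an LLM forward pass: returns logits over a toy vocabulary."""
    rng = np.random.default_rng(seed=len(tokens))
    return rng.normal(size=16)  # toy vocab size of 16

def generate(prompt: list[int], max_new_tokens: int = 8) -> list[int]:
    tokens = list(prompt)
    # Prefill: the whole prompt is processed in one (parallelizable) pass.
    logits = toy_model(tokens)
    # Decode: autoregressive loop, one forward pass per generated token.
    for _ in range(max_new_tokens):
        tokens.append(int(np.argmax(logits)))  # greedy choice here for simplicity
        logits = toy_model(tokens)
    return tokens

print(generate([1, 2, 3]))
```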
Compute vs Memory Bound
The roofline model plots arithmetic intensity (FLOPs per byte moved from memory) against achievable throughput; kernels below the ridge point are memory-bandwidth-bound, kernels above it are compute-bound.
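A rough roofline-style check with assumed, illustrative hardware numbers, showing why batch-1 decode tends to land on the memory-bound side:

```python
# Illustrative numbers only, not measurements of any particular accelerator.
peak_flops = 300e12        # assumed peak compute, FLOP/s
mem_bandwidth = 1.5e12     # assumed memory bandwidth, bytes/s
ridge_point = peak_flops / mem_bandwidth   # FLOP/byte needed to become compute-bound

# Decoding one token at batch size 1 streams every weight but does few FLOPs per byte.
params = 7e9               # 7B-parameter model
bytes_moved = params * 2   # fp16 weights
flops = params * 2         # ~2 FLOPs per parameter per token
intensity = flops / bytes_moved
print(f"ridge point ~{ridge_point:.0f} FLOP/byte, decode intensity ~{intensity:.0f} FLOP/byte")
# Decode intensity (~1) sits far below the ridge point (~200) -> memory-bandwidth-bound.
```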
Approximation Model
- The model should be order(s) of magnitude smaller than the LLM it speculates for
- E.g., a 7B LLM paired with a 68M approximation model
- Achieved via fewer layers, a smaller hidden dimension, and/or fewer attention heads (see the config sketch after this list)
- Also called “small specialized model (SSM)” or “draft model”
- Same vocabulary as the LLM (also called “target model”)
- Ideally trained on the same data
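A sketch of how a draft configuration might shrink relative to the target; the specific numbers are illustrative assumptions, and only the vocabulary is required to match:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int
    n_layers: int
    hidden_dim: int
    n_heads: int

# Target model, roughly 7B-scale (illustrative values).
target = ModelConfig(vocab_size=32000, n_layers=32, hidden_dim=4096, n_heads=32)

# Draft model: same vocabulary, but fewer layers, a smaller hidden dim, fewer heads.
draft = ModelConfig(vocab_size=32000, n_layers=2, hidden_dim=512, n_heads=8)

# The vocabularies must match so that drafted token ids are valid for the target.
assert draft.vocab_size == target.vocab_size
```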
Speculative Decoding
- Drafting
- The draft model's per-token overhead is orders of magnitude smaller than the target's, so the system can tolerate it performance-wise
- Verification
- Decide how many draft tokens to accept and where to reject (see the sketch after this list)
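A minimal sketch of one draft-then-verify step under greedy verification; `draft_model` and `target_model` are hypothetical stubs standing in for real forward passes:

```python
import numpy as np

VOCAB = 16

def draft_model(tokens):
    """Hypothetical stub for the cheap draft model's logits."""
    return np.random.default_rng(len(tokens)).normal(size=VOCAB)

def target_model(prefixes):
    """Hypothetical stub: target logits for a batch of prefixes, scored in one pass."""
    return [np.random.default_rng(len(p) + 1).normal(size=VOCAB) for p in prefixes]

def speculative_step(tokens, k=4):
    # Drafting: the small model proposes k tokens autoregressively (cheap).
    drafted, ctx = [], list(tokens)
    for _ in range(k):
        t = int(np.argmax(draft_model(ctx)))
        drafted.append(t)
        ctx.append(t)
    # Verification: a single target pass scores every drafted position in parallel.
    prefixes = [tokens + drafted[:i] for i in range(k)]
    target_logits = target_model(prefixes)
    accepted = []
    for i, t in enumerate(drafted):
        target_choice = int(np.argmax(target_logits[i]))
        if target_choice == t:
            accepted.append(t)              # target agrees: accept the drafted token
        else:
            accepted.append(target_choice)  # first disagreement: take the target's token, stop
            break
    return tokens + accepted

print(speculative_step([1, 2, 3]))
```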
Results
Cost of Speculation
Notation: p is the target model's distribution, q is the draft model's distribution
- As speculative depth increases, so does the drafting overhead (see the cost sketch below)
- If the token acceptance rate (TAR) is too low, this overhead can dominate
- Instead of a single chain, we can sample a tree of candidate tokens from the draft model
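A rough cost model in the spirit of the standard speculative decoding analysis, assuming an i.i.d. token acceptance rate alpha, speculative depth k, and a per-token draft cost c relative to one target pass:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per verification step, assuming i.i.d. acceptance rate alpha.
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def speedup(alpha: float, k: int, c: float) -> float:
    # Tokens per step divided by the relative cost of k draft passes + 1 target pass.
    return expected_tokens_per_step(alpha, k) / (k * c + 1)

# High acceptance rate: deeper speculation pays off.
print(speedup(alpha=0.8, k=4, c=0.05))   # ~2.8x
# Low acceptance rate: the drafting overhead starts to dominate.
print(speedup(alpha=0.3, k=4, c=0.05))   # ~1.2x
```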
Sampling
Greedy decoding selects the token with the highest probability.
Stochastic decoding samples the next token at random from the model's probability distribution (e.g., with temperature scaling).
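A small sketch contrasting the two; the temperature value is an arbitrary choice for illustration:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max())
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])

# Greedy decoding: always pick the highest-probability token.
greedy_token = int(np.argmax(logits))

# Stochastic decoding: sample from the (temperature-scaled) distribution.
temperature = 0.8
probs = softmax(logits / temperature)
sampled_token = int(np.random.default_rng(0).choice(len(probs), p=probs))

print(greedy_token, sampled_token)
```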
Tree Attention
- Sample multiple candidate tokens per position from the draft model
- Construct a token tree from the candidates
- Linearize the tree and construct the corresponding attention mask
- Verify all branches in a single target pass using the mask (see the mask sketch below)
Stems from beam search (TODO: look up)
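A sketch of linearizing a small token tree and building the attention mask used for verification; the tree shape here is an arbitrary example:

```python
import numpy as np

# Linearized token tree; parent[i] is the index of node i's parent
# (-1 = a root candidate that attends only to the original context).
# Example: two root candidates (0, 1); node 0 has children 2 and 3; node 1 has child 4.
parent = [-1, -1, 0, 0, 1]
n = len(parent)

# Tree attention mask: position i may attend to position j only if j is i or an ancestor of i.
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    mask[i, i] = True
    j = parent[i]
    while j != -1:
        mask[i, j] = True
        j = parent[j]

print(mask.astype(int))
# With this mask, every branch of the tree is verified in one target forward pass.
```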
Stochastic Decoding
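Under stochastic decoding, verification typically uses the modified rejection-sampling rule from the speculative decoding literature: accept a drafted token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized residual max(p - q, 0), which preserves the target distribution. A minimal sketch for a single position:

```python
import numpy as np

def verify_token(p: np.ndarray, q: np.ndarray, x: int, rng: np.random.Generator) -> int:
    """p: target distribution, q: draft distribution, x: token drafted from q."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x                                 # accept the drafted token
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))   # reject: resample from the residual

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # target probabilities
q = np.array([0.2, 0.6, 0.2])   # draft probabilities
print(verify_token(p, q, x=1, rng=rng))
```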
Results
Tradeoffs
EAGLE
- Currently used in practice
- Token-level data (at the LM head) is high-uncertainty and not as rich as the hidden-feature level
- Instead, train an autoregressive head on the features right before the LM head (sketched after this list)
- 1 fully-connected linear layer
- 1 transformer decoder layer
- Drafting overhead is negligible
- Reduced kernel launch overhead
- Autoregressive head << draft model size
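A sketch of what such an autoregressive head might look like, using standard PyTorch layers; the dimensions, the feature-fusion input, and the encoder layer standing in for the single decoder block are all illustrative assumptions:

```python
import torch
from torch import nn

class EagleStyleDraftHead(nn.Module):
    """Sketch of an EAGLE-style autoregressive head: it autoregresses over the target
    model's last hidden features and reuses the frozen target LM head for logits."""

    def __init__(self, hidden_dim: int = 4096, n_heads: int = 32):
        super().__init__()
        # 1 fully-connected layer: fuse the previous feature with the sampled token's embedding.
        self.fc = nn.Linear(2 * hidden_dim, hidden_dim)
        # 1 transformer decoder block (self-attention + FFN); an encoder layer is used
        # here as a stand-in for a GPT-style decoder block.
        self.block = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=n_heads,
                                                batch_first=True)

    def forward(self, prev_features: torch.Tensor, token_embeds: torch.Tensor,
                lm_head: nn.Linear):
        x = self.fc(torch.cat([prev_features, token_embeds], dim=-1))
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        features = self.block(x, src_mask=causal)   # predicted next hidden features
        return features, lm_head(features)          # draft logits from the frozen LM head
```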