Components of Transformer-Based LLMs
- Each layer contains two modules:
  - Attention
  - Feed-forward networks (FFNs)

Mixture-of-Experts (MoE)
- Replace the single FFN with multiple FFNs, each called an "expert" that specializes in a domain
- Each token is routed to only its top-k experts (see the routing sketch below)
- Increases model capacity while keeping per-token computation sparse
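A minimal sketch of top-k routing in an MoE layer, written in PyTorch. The expert count, hidden sizes, and the simple linear router are illustrative assumptions, not any particular model's configuration.

```python
# Minimal sketch of top-k token routing in a Mixture-of-Experts layer.
# Expert count, dimensions, and the linear router are illustrative assumptions.
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)       # scores each token per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)                      # each expert is a plain FFN
        )

    def forward(self, x):                                    # x: [num_tokens, d_model]
        scores = self.router(x)                              # [num_tokens, num_experts]
        weights, idx = torch.topk(scores, self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)             # normalize over the chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue                                     # this expert received no tokens
            w = weights[token_ids, slot].unsqueeze(-1)       # gate weight of each routed token
            out[token_ids] += w * expert(x[token_ids])
        return out

tokens = torch.randn(16, 1024)
print(MoELayer()(tokens).shape)                              # torch.Size([16, 1024])
```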

Latest Large-Scale LLMs
- Doubao-Seed-1.6
- DeepSeek v3.1
- GPT-OSS
- Gemini 2.5 Pro
All use MoE

Roofline Model
- Operational intensity: the number of operations performed per byte of memory traffic
- Memory-bound vs. compute-bound: a kernel is memory-bound if its operational intensity is below the hardware's ratio of peak compute to memory bandwidth, and compute-bound otherwise (see the sketch below)
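A small numeric sketch of the roofline classification. The peak FLOP/s and memory bandwidth below are assumed values, not a specific accelerator's specification.

```python
# Roofline check: compare a kernel's operational intensity (FLOPs per byte moved)
# against the machine balance (peak FLOP/s divided by memory bandwidth).
# Hardware numbers below are illustrative assumptions, not a specific GPU spec.

PEAK_FLOPS = 300e12        # 300 TFLOP/s (assumed)
MEM_BW     = 2e12          # 2 TB/s     (assumed)
MACHINE_BALANCE = PEAK_FLOPS / MEM_BW   # FLOPs per byte the hardware can sustain

def attainable_flops(op_intensity):
    """Roofline: performance is capped by either memory bandwidth or peak compute."""
    return min(PEAK_FLOPS, op_intensity * MEM_BW)

def classify(op_intensity):
    return "compute-bound" if op_intensity >= MACHINE_BALANCE else "memory-bound"

for oi in (1, 50, 150, 1000):
    print(f"OI={oi:>5} FLOP/B -> {classify(oi):14s} "
          f"attainable {attainable_flops(oi) / 1e12:.0f} TFLOP/s")
```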

Characteristics of the Two Phases in LLM Serving
- Prefill: processes all prompt tokens in parallel with large matrix multiplies; high operational intensity, typically compute-bound
- Decode: generates one token per request per step, reloading the model weights (and KV cache) each time; low operational intensity, typically memory-bound (compared numerically below)
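A back-of-the-envelope comparison of the operational intensity of a single weight matrix multiply during prefill versus decode. Hidden dimensions, token counts, and fp16 storage are illustrative assumptions.

```python
# Operational intensity of y = x @ W (fp16) during prefill vs. decode.
# Dimensions and token counts are illustrative assumptions.

def gemm_intensity(batch_tokens, d_in=4096, d_out=4096, bytes_per_elem=2):
    flops = 2 * batch_tokens * d_in * d_out                    # multiply-adds
    bytes_moved = bytes_per_elem * (d_in * d_out               # weights
                                    + batch_tokens * d_in      # input activations
                                    + batch_tokens * d_out)    # output activations
    return flops / bytes_moved

print(f"prefill (2048 tokens): {gemm_intensity(2048):7.1f} FLOP/B")  # high -> compute-bound
print(f"decode  (1 token):     {gemm_intensity(1):7.1f} FLOP/B")     # ~1   -> memory-bound
```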

Disaggregate Attention and FFN
- Independent scaling: aggregating tokens from multiple attention instances into larger FFN batches improves the FFN's computational efficiency (see the sketch after this list)
- Heterogeneous deployment: adopt more cost-effective hardware for each module
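A rough sketch of the independent-scaling argument: merging tokens from several attention instances into one FFN batch raises the FFN GEMM's operational intensity. The instance count and batch sizes are assumptions.

```python
# Effect of aggregating decode tokens from M attention instances into one FFN batch.
# All sizes below are illustrative assumptions.

def ffn_gemm_intensity(batch_tokens, d_in=4096, d_out=4096, bytes_per_elem=2):
    flops = 2 * batch_tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + batch_tokens * (d_in + d_out))
    return flops / bytes_moved

per_attention_batch = 8        # decode tokens per attention instance (assumed)
m_attention_instances = 16     # attention instances feeding one FFN group (assumed)

print(f"without aggregation: {ffn_gemm_intensity(per_attention_batch):6.1f} FLOP/B")
print(f"with aggregation:    "
      f"{ffn_gemm_intensity(per_attention_batch * m_attention_instances):6.1f} FLOP/B")
```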

Challenge 1: Idle Resource Due to Dependencies
- Attention and FFN depend on each other's outputs, so computing a batch sequentially leaves only a portion of the resources utilized at any given time

Challenge 2: Requirement of High-Performance M2N Communication
- Existing communication libraries incur high overhead and suffer from performance instability

MegaScale-Infer
- Disaggregated expert parallelism
- Ping-pong pipeline parallelism (sketched below)
- High-performance M2N communication library
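A minimal sketch of the ping-pong idea under simplified assumptions (equal attention and FFN step times, communication ignored): two micro-batches alternate between the disaggregated attention and FFN pools so that, apart from pipeline fill and drain, both pools stay busy. This illustrates the scheduling principle only, not MegaScale-Infer's actual scheduler.

```python
# Utilization of disaggregated attention/FFN pools with and without ping-pong scheduling.
# Per-step times and layer count are assumed; communication cost is ignored.

LAYERS = 4        # transformer layers (assumed)
ATTN_TIME = 1.0   # time for one attention step of a micro-batch (assumed units)
FFN_TIME = 1.0    # time for one FFN/expert step of a micro-batch (assumed units)

def sequential_utilization():
    """One batch runs attention -> FFN layer by layer; only one pool works at a time."""
    total = LAYERS * (ATTN_TIME + FFN_TIME)
    return LAYERS * ATTN_TIME / total, LAYERS * FFN_TIME / total

def ping_pong_utilization():
    """Two micro-batches alternate: while A is in the FFN pool, B is in attention."""
    step = max(ATTN_TIME, FFN_TIME)
    total = (2 * LAYERS + 1) * step        # pipeline fill/drain adds one extra step
    busy_attn = 2 * LAYERS * ATTN_TIME     # each micro-batch does LAYERS attention steps
    busy_ffn = 2 * LAYERS * FFN_TIME       # and LAYERS FFN steps
    return busy_attn / total, busy_ffn / total

for name, (attn_u, ffn_u) in [("sequential", sequential_utilization()),
                              ("ping-pong ", ping_pong_utilization())]:
    print(f"{name}: attention pool {attn_u:.0%} busy, FFN pool {ffn_u:.0%} busy")
```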
