Components of Transformer-Based LLMs

  • Each layer contains two modules (see the sketch below)
    • Attention
    • Feed-forward network (FFN)
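
A minimal PyTorch sketch of this two-module structure, assuming a standard pre-norm layer; the names and sizes here are illustrative, not those of any specific model:

```python
# Minimal sketch: one Transformer layer = attention module + FFN module,
# each wrapped in a residual connection (sizes are illustrative assumptions).
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    def __init__(self, d_model=1024, n_heads=16, d_ff=4096):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn_norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(               # the dense FFN that MoE later replaces
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h)        # self-attention
        x = x + attn_out                        # residual
        x = x + self.ffn(self.ffn_norm(x))      # FFN + residual
        return x

x = torch.randn(2, 8, 1024)                     # (batch, seq_len, d_model)
print(TransformerLayer()(x).shape)              # torch.Size([2, 8, 1024])
```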

Mixture-of-Experts (MoE)

  • Replace the single FFN with multiple FFNs, each called an “expert” (intuitively, an expert in some domain)
  • Each token is sent only to its top-k experts
  • Increases sparsity: only a fraction of the parameters is activated per token (see the sketch below)
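
A minimal sketch of top-k routing, assuming a softmax router; the expert count, sizes, and the class name `MoELayer` are illustrative assumptions:

```python
# Minimal sketch of a top-k MoE layer: a router scores experts per token and
# only the top-k experts process each token (gating details are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)            # token -> expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                       # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)          # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize gate weights
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (idx == e).nonzero(as_tuple=True) # tokens routed to expert e
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return out

x = torch.randn(16, 1024)                                       # 16 tokens
print(MoELayer()(x).shape)                                      # torch.Size([16, 1024])
```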

Latest Large-Scale LLMs

  • Doubao-Seed-1.6
  • DeepSeek v3.1
  • GPT-OSS
  • Gemini 2.5 Pro
  • All of the above use MoE

Roofline Model

  • Operational Intensity: the number of operations per byte of memory traffic
  • Memory-bound vs. compute-bound: a kernel is memory-bound when its operational intensity is below the hardware’s ridge point (peak FLOPs / peak memory bandwidth), and compute-bound otherwise (see the sketch below)
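
A minimal sketch of the roofline classification; the peak-FLOPs and bandwidth figures are placeholder assumptions, not a specific GPU’s spec sheet:

```python
# Roofline: attainable performance = min(peak compute, OI * memory bandwidth).
PEAK_FLOPS = 300e12                  # peak compute, FLOP/s (assumed)
PEAK_BW    = 2e12                    # peak memory bandwidth, byte/s (assumed)
RIDGE      = PEAK_FLOPS / PEAK_BW    # ops/byte where the kernel becomes compute-bound

def attainable_flops(oi):
    """Attainable performance for a kernel with operational intensity `oi`."""
    return min(PEAK_FLOPS, oi * PEAK_BW)

for name, oi in [("low-OI kernel", 4.0), ("high-OI kernel", 600.0)]:
    bound = "memory-bound" if oi < RIDGE else "compute-bound"
    print(f"{name}: OI={oi} ops/byte -> {bound}, "
          f"attainable {attainable_flops(oi)/1e12:.1f} TFLOP/s (ridge {RIDGE:.0f} ops/byte)")
```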

Characteristics of two phases in LLM serving

  • Prefill: processes all prompt tokens in parallel; large matrix-matrix multiplies with high operational intensity, typically compute-bound
  • Decode: generates one token per step per request; matrix-vector-like multiplies reuse each weight byte only a few times, so it is typically memory-bound (see the sketch below)
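
A minimal sketch of why the two phases differ, comparing the operational intensity of one FFN weight multiply for a prefill-sized token batch versus a decode-sized one; FP16 bytes, with sizes that are illustrative assumptions:

```python
# Operational intensity of (tokens x d_model) @ (d_model x d_ff) in FP16.
D_MODEL, D_FF, BYTES = 4096, 16384, 2

def gemm_oi(n_tokens):
    flops = 2 * n_tokens * D_MODEL * D_FF                  # multiply-adds
    traffic = BYTES * (n_tokens * D_MODEL                  # read activations
                       + D_MODEL * D_FF                    # read weights
                       + n_tokens * D_FF)                  # write outputs
    return flops / traffic

print(f"prefill-like (2048 tokens):            OI = {gemm_oi(2048):.1f} ops/byte")
print(f"decode-like  (1 token/request, batch 8): OI = {gemm_oi(8):.2f} ops/byte")
```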

Disaggregate Attention and FFN

  • Independent scaling: the two modules can be provisioned separately, and aggregating tokens from multiple attention nodes into one batch improves the computational efficiency of the FFN (see the sketch below)
  • Heterogeneous deployment: each module can run on the hardware that is most cost-effective for it
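
A minimal sketch of the scaling argument: aggregating decode tokens from M attention nodes raises the FFN GEMM’s operational intensity and thus its attainable fraction of peak compute. All numbers are illustrative assumptions:

```python
# How many attention nodes must feed one FFN node before its GEMM leaves the
# memory-bound regime (all hardware and model numbers are assumed, not measured).
D_MODEL, D_FF, BYTES = 4096, 16384, 2
PEAK_FLOPS, PEAK_BW = 300e12, 2e12
TOKENS_PER_ATTN_NODE = 8              # decode tokens in flight per attention node (assumed)

def ffn_oi(n_tokens):
    flops = 2 * n_tokens * D_MODEL * D_FF
    traffic = BYTES * (n_tokens * D_MODEL + D_MODEL * D_FF + n_tokens * D_FF)
    return flops / traffic

for m in (1, 4, 16, 64):
    oi = ffn_oi(m * TOKENS_PER_ATTN_NODE)
    util = min(PEAK_FLOPS, oi * PEAK_BW) / PEAK_FLOPS
    print(f"M={m:2d} attention nodes -> FFN OI {oi:7.1f} ops/byte, "
          f"~{util:5.1%} of peak compute attainable")
```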

Challenge 1: Idle Resources Due to Dependencies

  • Because the attention and FFN nodes depend on each other’s output, computing a single batch sequentially leaves only a portion of the resources utilized at any given time (see the sketch below)
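
A minimal back-of-the-envelope sketch of this idle time with a single batch, under assumed per-layer times (not measurements):

```python
# With one batch, each layer is: attention -> transfer -> FFN -> transfer back,
# so each node only works for part of the layer time (numbers are assumed).
T_ATTN, T_FFN, T_COMM = 1.0, 1.0, 0.2        # ms per layer, per node, plus each transfer

layer_time = T_ATTN + T_COMM + T_FFN + T_COMM
attn_util = T_ATTN / layer_time              # fraction of time the attention node is busy
ffn_util  = T_FFN / layer_time               # fraction of time the FFN node is busy
print(f"attention-node utilization: {attn_util:.0%}, FFN-node utilization: {ffn_util:.0%}")
```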

Challenge 2: Requirement of High-Performance M2N Communication

  • Sending activations from M attention nodes to N FFN (expert) nodes with existing communication libraries incurs high overhead and unstable latency

MegaScale-Infer

  • Disaggregated expert parallelism
  • Ping-pong pipeline parallelism (see the sketch below)
  • High-performance M2N communication library
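
A minimal schedule simulation of ping-pong pipeline parallelism under assumed timings: two micro-batches alternate between an attention node and an FFN node, so each node works on one micro-batch while the other is away. This is an illustrative sketch, not MegaScale-Infer’s actual scheduler:

```python
# Greedy simulation of a two-micro-batch ping-pong schedule across a disaggregated
# attention node and FFN node (per-micro-batch times and layer count are assumed).
T_ATTN, T_FFN, T_COMM = 0.5, 0.5, 0.1        # per micro-batch (half of a full batch)
N_LAYERS, N_MICRO = 4, 2

attn_free = ffn_free = 0.0                   # earliest time each node is next available
ready = [0.0] * N_MICRO                      # when each micro-batch is ready for its next stage
busy_attn = busy_ffn = 0.0

for _ in range(N_LAYERS):
    for mb in range(N_MICRO):
        start = max(ready[mb], attn_free)    # attention stage for this micro-batch
        attn_free = start + T_ATTN
        busy_attn += T_ATTN
        ready[mb] = attn_free + T_COMM       # ship activations to the FFN node
        start = max(ready[mb], ffn_free)     # FFN stage
        ffn_free = start + T_FFN
        busy_ffn += T_FFN
        ready[mb] = ffn_free + T_COMM        # ship results back to the attention node

total = max(attn_free, ffn_free)
print(f"attention-node utilization: {busy_attn/total:.0%}, "
      f"FFN-node utilization: {busy_ffn/total:.0%}")
```

Under these assumed timings, both nodes stay busy for most of the run, compared with the single-batch schedule sketched under Challenge 1 where each node idles while the other works.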