source: https://arxiv.org/abs/2504.02263v1
Summary
This paper presents MegaScale-Infer, an efficient and cost-effective system for serving mixture-of-experts (MoE) models. In MoE models, the FFN part of a transformer is replaced with many possible FFNs (experts), each with its own weights. This shifts the FFNs from compute-intensive to memory-intensive during inference, which lowers GPU utilization. MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and pipelines them between the attention and FFN steps.
Questions
- Why can the disaggregation of attention and FFN improve the GPU utilization during MoE serving?
Attention scales differently from FFN: attention is latency-bound and sequential, while the expert FFNs are compute-bound and parallel. Giving each workload its own scaling and parallelism strategy therefore improves GPU utilization.
- Does the ping-pong pipeline impose any inherent constraints on the model architecture? If yes, what problems will arise when such constraints are not satisfied?
Yes. The ping-pong pipeline requires that attention and FFN computation times per micro-batch be roughly balanced, and that communication time be smaller than compute time so transfers can be hidden behind computation. If batch sizes are too small or the two workloads are imbalanced, one side of the pipeline becomes idle, reducing utilization and increasing latency.
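A back-of-the-envelope feasibility check for these constraints; this is not the paper's exact formulation, and the names (t_attn_ms, t_ffn_ms, t_comm_ms) and the 20% balance tolerance are illustrative assumptions:

```python
def can_hide_communication(t_attn_ms: float, t_ffn_ms: float, t_comm_ms: float) -> bool:
    """Rough check of the ping-pong pipeline constraints (illustrative only).

    t_attn_ms : attention compute time per micro-batch on an A-node
    t_ffn_ms  : expert FFN compute time per micro-batch on an E-node
    t_comm_ms : one-way activation transfer time per micro-batch
    """
    # Neither side should wait on the other: compute times roughly balanced
    # (the 20% tolerance is an arbitrary assumption for this sketch).
    balanced = abs(t_attn_ms - t_ffn_ms) <= 0.2 * max(t_attn_ms, t_ffn_ms)
    # Each transfer must be shorter than the compute it overlaps with.
    comm_hidden = t_comm_ms < min(t_attn_ms, t_ffn_ms)
    return balanced and comm_hidden

# Example: 3 ms attention, 2.8 ms experts, 1 ms transfer per micro-batch -> feasible
print(can_hide_communication(3.0, 2.8, 1.0))  # True
```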
Abstract
- Mixture-of-Experts (MoE) has huge potential to scale LLMs with enhanced performance and reduced computational complexity
- Issue:
- Sparsely activated architecture shifts feed-forward networks from compute-intensive to memory-intensive during inference
- Paper presents MegaScale-Infer, an efficient and cost-effective system to serve MoE models
- Disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules
- Introduces ping-pong pipeline parallelism - partitions request batch into micro-batches and shuttles them between attention and FFNs for inference
Mixture-of-Experts (MoE)
LLMs have multiple transformer layers and each transformer layer consists of two key parts:
- Attention → mixes information between tokens (context)
- Feed-Forward Network (FFN) → transforms each token independently
Attention is about relationships between tokens; FFN is about interpreting each token's new meaning.
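A minimal PyTorch sketch of a standard dense transformer layer, just to make the attention-vs-FFN split concrete; the dimensions and layer choices here are arbitrary and not taken from the paper:

```python
import torch
import torch.nn as nn

class DenseTransformerLayer(nn.Module):
    """Minimal dense (non-MoE) transformer layer sketch."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        # Attention: lets every token look at every other token (context mixing).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # FFN: applied to each token position independently.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, d_model]
        attn_out, _ = self.attn(x, x, x)        # token-to-token interaction
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ffn(x))      # per-token transformation
```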
MoE
- replaces the FFN part of a transformer with many possible FFNs (experts), each with its own weights
Attention (same as usual)
↓
Router → chooses top-k experts (e.g. top-2)
↓
Send token’s hidden vector to those experts’ FFNs
↓
Combine results (weighted by the router's softmax scores; a minimal routing sketch follows below)
Because of the many possible FFNs → memory bottleneck
- Storage: more FFN weights (one set per expert) must be kept in GPU memory
- Communication overhead: tokens must be routed to their chosen experts
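A minimal top-k routing sketch in PyTorch; the linear router, per-expert loop, class, and parameter names are all illustrative assumptions, not MegaScale-Infer's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFFN(nn.Module):
    """Top-k mixture-of-experts FFN sketch (illustrative only)."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)    # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model], tokens flattened across batch and sequence
        logits = self.router(x)                          # [num_tokens, num_experts]
        weights, expert_ids = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, k] == e             # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Each expert keeps its own weight matrices, so memory grows with num_experts, while each expert processes only the tokens routed to it; this is the storage and communication cost noted in the list above.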
Problem
MoE is inefficient when attention and FFN share the same GPU pool
- Attention is latency-bound and sequential
- FFN experts are compute-bound and parallel
During the decoding phase (which dominates the LLM inference process):
- GPU utilization of attention modules remains low because they must access the intermediate states (KV cache) of all previous tokens
- Conversely, FFN modules achieve high GPU utilization as the number of tokens involved in computation increases.
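A rough arithmetic-intensity estimate for one decode step makes this concrete; the FLOP and byte counts below are standard back-of-the-envelope approximations, and the function name and example numbers are illustrative, not figures from the paper:

```python
def decode_arithmetic_intensity(batch: int, d_model: int, ctx_len: int, d_ff: int,
                                bytes_per_elem: int = 2):
    """Approximate FLOPs-per-byte for attention vs. a dense FFN in one decode step."""
    # Attention: each request scores its query against K and mixes V,
    # which requires streaming the whole KV cache from memory.
    attn_flops = batch * 4 * ctx_len * d_model                    # QK^T plus attn*V
    attn_bytes = batch * 2 * ctx_len * d_model * bytes_per_elem   # read K and V caches
    # FFN: two GEMMs whose weights are read once and reused for every token.
    ffn_flops = batch * 4 * d_model * d_ff                        # up- and down-projection
    ffn_bytes = 2 * d_model * d_ff * bytes_per_elem               # weight reads dominate
    return attn_flops / attn_bytes, ffn_flops / ffn_bytes

# Attention stays at ~1 FLOP/byte no matter the batch size (memory-bound),
# while the FFN's intensity grows with the number of tokens it processes.
print(decode_arithmetic_intensity(batch=64, d_model=4096, ctx_len=2048, d_ff=14336))
# -> (1.0, 64.0)
```

In an MoE layer each expert sees only the tokens routed to it, so the per-expert batch is even smaller, which is why the expert FFNs become memory-intensive unless enough tokens are aggregated per expert.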
Solution: Disaggregated MoE Serving
Disaggregated expert parallelism - disaggregates the attention and expert modules, assigning them to separate GPUs.
MegaScale-Infer splits the work into two dedicated clusters:
- A-nodes: handle all attention computation
- E-nodes: host the experts (FFNs)
Tokens ping-pong between these clusters:
- A-nodes run attention for a micro-batch
- Send activations to the E-nodes
- E-nodes execute expert FFNs
- Send results back to the A-nodes for the next layer
While the A-nodes are computing on one micro-batch, the E-nodes are computing or communicating on another, so neither side sits idle (a toy sketch follows).
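A toy single-layer simulation of the ping-pong idea using Python threads and queues; in the real system the transfer is a network send and micro-batches bounce back to the A-nodes for every subsequent layer, and all names and sleep times here are illustrative:

```python
import queue
import threading
import time

# While the A-node works on micro-batch i+1, the E-node is already busy with
# micro-batch i, so the two stages overlap instead of idling.

def a_node(micro_batches, to_e):
    for mb in micro_batches:
        time.sleep(0.01)                     # stand-in for attention compute
        print(f"A-node finished attention for micro-batch {mb}")
        to_e.put(mb)                         # "send activations" (a network transfer in practice)
    to_e.put(None)                           # sentinel: no more micro-batches

def e_node(to_e, done):
    while (mb := to_e.get()) is not None:
        time.sleep(0.01)                     # stand-in for expert FFN compute
        print(f"E-node finished expert FFNs for micro-batch {mb}")
        done.append(mb)

to_e, done = queue.Queue(), []
t_a = threading.Thread(target=a_node, args=(range(4), to_e))
t_e = threading.Thread(target=e_node, args=(to_e, done))
t_a.start(); t_e.start(); t_a.join(); t_e.join()
print("completed micro-batches:", done)
```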