Summary
The authors present DeepSpeed-Ulysses, a novel, portable, and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence lengths. Longer sequences increase memory and communication costs in attention. To keep training efficient, the method partitions the long sequence across GPUs. Right before the attention computation, it employs an all-to-all communication collective on the partitioned queries, keys, and values so that each GPU receives the full sequence, but only for a non-overlapping subset of attention heads. This lets the GPUs compute attention for different attention heads in parallel. Finally, a second all-to-all gathers the results along the attention heads while re-partitioning along the sequence dimension.
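To make the sequence/head swap concrete, here is a minimal shape-bookkeeping sketch (the dimension names and numbers N, P, H, d are illustrative, not taken from the paper):

```python
# Per-GPU tensor shapes before and after the first all-to-all
# (illustrative numbers; not DeepSpeed's actual code).
N, P, H, d = 8192, 4, 32, 128        # sequence length, GPUs, attention heads, head dim

shape_before = (N // P, H, d)        # local sequence shard, all heads
shape_after = (N, H // P, d)         # full sequence, local subset of heads

# the all-to-all only changes which axis is partitioned;
# the number of elements held per GPU is unchanged
assert (N // P) * H * d == N * (H // P) * d
```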
Questions
- Can DeepSpeed-Ulysses work when the number of GPUs is higher than the number of heads? Why or why not?
Yes, but the extra GPUs won’t be doing anything. DeepSpeed-Ulysses parallelizes over non-overlapping subsets of attention heads, so no two GPUs work on the same head. Any extra GPUs sit idle because the maximum useful degree of parallelism is limited by the number of attention heads.
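As an illustration of that cap (the helper name and numbers are mine, not from the paper):

```python
# The sequence-parallel degree cannot usefully exceed the head count:
# after the all-to-all, every GPU must own at least one whole head.
def heads_per_gpu(num_heads: int, parallel_degree: int) -> int:
    if parallel_degree > num_heads:
        raise ValueError("more GPUs than heads: the extra GPUs get no head to work on")
    return num_heads // parallel_degree

print(heads_per_gpu(32, 8))   # -> 4 heads per GPU
```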
- What is the latency impact for one transformer layer caused by DeepSpeed-Ulysses?
DeepSpeed-Ulysses adds two all-to-all operations per transformer layer: the first, for the QKV projections, has an aggregate message size of 3Nh; the second, for the output context projection, has a size of Nh. The resulting communication volume per link is 4Nh/P, where N is the sequence length, h is the hidden size, and P is the degree of sequence parallelism.
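A quick back-of-the-envelope check of those numbers (counting activation elements per link; the function name is mine, not DeepSpeed's API):

```python
def ulysses_comm_per_link(N: int, h: int, P: int) -> float:
    """Per-link all-to-all volume for one transformer layer, in elements."""
    qkv_all_to_all = 3 * N * h / P    # first all-to-all: Q, K, V projections
    out_all_to_all = N * h / P        # second all-to-all: output context projection
    return qkv_all_to_all + out_all_to_all   # = 4*N*h / P

# Doubling the sequence length and the parallel degree together keeps
# the per-link traffic constant, which is what makes the approach scalable.
assert ulysses_comm_per_link(8192, 4096, 8) == ulysses_comm_per_link(16384, 4096, 16)
```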
Abstract
Computation in a Transformer model is characterized by:
- batch size
- hidden dimension
- number of layers
- sequence length
Previous work has focused on the first three dimensions:
- data parallelism for batch size
- tensor parallelism for hidden size
- pipeline parallelism for model depth (layers)
These approaches are not targeted at or optimized for long-sequence Transformer models.
The authors introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training with extremely long sequence length.
Problem
Longer sequences increase memory and communication costs in attention because every token attends to every other token.
Traditional parallelism limits:
- Tensor/model parallelism: splits matrices → fine for big models, not for long sequences
- Pipeline parallelism: splits layers → no help for long context
- Megatron sequence parallelism: splits tokens across GPUs, but each GPU still needs every token’s keys/values for attention, which causes large all-gathers
What is DeepSpeed Ulysses?
DeepSpeed-Ulysses asks: “What if we shard the sequence dimension while keeping communication scalable?”
The big idea is:
- Partition the sequence among GPUs
- Right before the attention computation, employ an all-to-all communication collective on the partitioned queries, keys, and values so that each GPU receives the full sequence, but only for a non-overlapping subset of attention heads
- Each GPU then computes attention for its own subset of attention heads in parallel
- Finally, a second all-to-all gathers the results along the attention heads while re-partitioning along the sequence dimension (a minimal sketch of this pattern follows below)
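A minimal sketch of the two all-to-all steps, assuming a PyTorch process group `sp_group` spanning the P sequence-parallel ranks; the function names, shapes, and layout below are illustrative, not DeepSpeed's actual implementation:

```python
import torch
import torch.distributed as dist

def seq_to_heads(x: torch.Tensor, sp_group) -> torch.Tensor:
    # x: [N/P, H, d] -- this rank's sequence shard, all heads (Q, K, or V)
    P = dist.get_world_size(sp_group)
    n_local, H, d = x.shape
    # group the heads into P blocks; block p is sent to rank p
    x = x.reshape(n_local, P, H // P, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)
    # out[r] is rank r's sequence shard for this rank's head block
    return out.reshape(P * n_local, H // P, d)   # [N, H/P, d]

def heads_to_seq(x: torch.Tensor, sp_group) -> torch.Tensor:
    # x: [N, H/P, d] -- full sequence, this rank's head block (attention output)
    P = dist.get_world_size(sp_group)
    N, h_local, d = x.shape
    # split the sequence into P blocks; block r is sent back to rank r
    x = x.reshape(P, N // P, h_local, d).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=sp_group)
    # out[p] is head block p for this rank's sequence shard
    return out.permute(1, 0, 2, 3).reshape(N // P, P * h_local, d)   # [N/P, H, d]
```

Between the two calls, each rank would run an ordinary attention kernel (e.g. FlashAttention) over its H/P heads, which is why the attention computation itself is unchanged.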
Attention works exactly as before, but communication traffic per link now scales as 4Nh/P rather than with the full sequence, so it stays constant when sequence length and GPU count grow together.