Background

Data Parallelism vs. Model Parallelism
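The distinction can be shown on a toy linear layer y = x @ W. This is a minimal single-machine sketch (array slices standing in for devices), not a distributed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))   # batch of 8 examples, feature dim 16
W = rng.standard_normal((16, 4))   # layer weights

# Data parallelism: every "device" holds a full copy of W
# and processes its own slice of the batch.
y_dp = np.concatenate([x[:4] @ W, x[4:] @ W], axis=0)

# Model parallelism: every "device" holds a slice of W's columns
# and processes the full batch; outputs are concatenated feature-wise.
y_mp = np.concatenate([x @ W[:, :2], x @ W[:, 2:]], axis=1)

# Both recover the unsharded result.
assert np.allclose(y_dp, x @ W)
assert np.allclose(y_mp, x @ W)
```

Data parallelism scales with batch size but replicates the weights; model parallelism shards the weights, which matters once a model no longer fits on one device.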

How to serve long context in a systems sense

Within a single request, parallelizing the context
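One simple form of within-request context parallelism splits the query positions of a long sequence into chunks, each attending to the full key/value cache; chunk outputs concatenate back into the full result. This is a minimal sketch of that idea (real systems also shard K/V, e.g. ring or blockwise attention):

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: q (Tq, d), k/v (Tk, d)."""
    s = q @ k.T / np.sqrt(k.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ v

rng = np.random.default_rng(1)
T, d = 12, 8
q = rng.standard_normal((T, d))
k = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))

# Each "worker" takes a contiguous chunk of query positions and
# attends over the full context; results are position-independent
# across chunks, so concatenation recovers the unsharded output.
chunks = [attention(q[i:i + 4], k, v) for i in range(0, T, 4)]
assert np.allclose(np.concatenate(chunks), attention(q, k, v))
```

The query split is embarrassingly parallel because softmax normalizes over keys, not queries; sharding K/V instead requires communicating partial softmax statistics between workers.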

Self attention
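A minimal numpy sketch of scaled dot-product self-attention (the function name and weight arguments are illustrative, not from the source):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence x of shape (T, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (T, T) pairwise scores
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                     # softmax over key positions
    return w @ v                                      # each output is a convex mix of values
```

The (T, T) score matrix is what makes attention quadratic in sequence length, which is the core systems problem when serving long context.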

Multi-headed Attention
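Multi-head attention runs several smaller attention heads in parallel and concatenates their outputs. A hedged numpy sketch, assuming the model dimension divides evenly by the head count:

```python
import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention: x is (T, d), all W are (d, d)."""
    T, d = x.shape
    dh = d // n_heads                                        # per-head dimension
    # Project, then split the feature dim into heads: (H, T, dh).
    q = (x @ Wq).reshape(T, n_heads, dh).transpose(1, 0, 2)
    k = (x @ Wk).reshape(T, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(T, n_heads, dh).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)          # (H, T, T)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                            # softmax per head
    out = (w @ v).transpose(1, 0, 2).reshape(T, d)           # concat heads
    return out @ Wo                                          # output projection
```

Because heads are independent until the final projection, they are also a natural unit for model parallelism: different devices can own different heads.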