Overview of Preble: Efficient Distributed Prompt Scheduling for LLM Serving
It’s all about prompting
Prompt example
Complexity of change vs task accuracy
How we can increase the quality of our results through prompting
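As a concrete illustration (not taken from the talk), here is a minimal sketch of how richer prompting, such as adding instructions and few-shot examples, yields long prompts that share most of their tokens across requests; the tutor prompt and the `build_prompt` helper are hypothetical.

```python
# Hypothetical example: many requests share a long instruction + few-shot
# prefix and differ only in the final question, so higher-quality prompting
# also means longer, highly shared prompts.

SHARED_PREFIX = (
    "You are a careful math tutor. Show your reasoning step by step.\n"
    "Q: 12 + 7 = ?  A: 12 + 7 = 19.\n"
    "Q: 9 * 6 = ?   A: 9 * 6 = 54.\n"
)

def build_prompt(question: str) -> str:
    """Append the per-request question to the shared few-shot prefix."""
    return SHARED_PREFIX + f"Q: {question}  A:"

if __name__ == "__main__":
    for q in ["15 - 4 = ?", "8 * 8 = ?"]:
        print(build_prompt(q))  # only the last line differs between requests
```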

LLM Caching vs Traditional Caching
| Traditional Caching | LLM Serving Systems |
|---|---|
| Separation of compute and storage (e.g., DRAM and disk) | Co-location of compute and state |
| Caching can be useful for any chunk and any type of data | Sharing is only useful for the prefix: due to the autoregressive nature, a request can only reuse tokens that match from the beginning of the prompt (see the sketch after this table) |
| Computation and memory can be predictable | Computation and memory are unknown before execution |
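To make the prefix-only restriction concrete, here is a minimal sketch (not Preble's code): each token's KV-cache entries depend on all earlier tokens, so a cached entry is only valid when every preceding token also matches, which limits reuse to the longest common prefix of token IDs.

```python
# Minimal sketch: attention for token i depends on tokens 0..i, so a cached
# KV entry is reusable only if every earlier token matches as well.

def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Number of leading tokens whose KV cache can be reused."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

# The two prompts also agree at position 4, but that match is NOT reusable
# because position 3 differs; only the 3-token prefix counts.
print(reusable_prefix_len([1, 2, 3, 9, 5], [1, 2, 3, 4, 5]))  # -> 3
```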
Example Prompt Tree Cache

What are real-world prompts like?
High sharing degree

Single GPU example of Prefix Tree
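A minimal single-GPU prompt-tree sketch, assuming a plain token trie rather than the compressed radix tree that real serving systems such as SGLang and Preble use; `PrefixTreeCache` and its methods are illustrative names, not the actual API.

```python
# Single-GPU prompt tree sketch: insert records which token sequences are
# cached on this GPU; match reports how many leading tokens a new request
# can reuse instead of recomputing.

class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}

class PrefixTreeCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        """Record a prompt's tokens as cached on this GPU."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            n += 1
        return n

cache = PrefixTreeCache()
cache.insert([1, 2, 3, 4])        # first request computes everything
print(cache.match([1, 2, 3, 7]))  # -> 3: next request recomputes only from token 4 on
```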

Scheduling comparisons

Exploration

Preble Takeaways
- LLM serving is getting more expensive as prompting grows more complex
- Workloads are longer and shared
- Preble (ICLR ’25) lets both the prefix cache and GPU load be utilized effectively for performance
- It uses an E2 (exploitation + exploration) scheduler and a fair waiting queue, sketched below
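A rough sketch of an E2-style exploit-vs-explore routing decision, under stated assumptions rather than Preble's exact policy: if the best cached prefix covers most of the prompt, route the request to the GPU holding that prefix (exploitation); otherwise route to the least-loaded GPU (exploration). The 50% threshold, the `route`/`prefix_hit` helpers, and the simple per-GPU load counter are illustrative placeholders.

```python
# E2-style routing sketch (exploitation vs. exploration); threshold and load
# metric are placeholders, not Preble's actual implementation.

def prefix_hit(cached: list[int], prompt: list[int]) -> int:
    """Length of the common token prefix between a cached prompt and a new one."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def route(prompt: list[int],
          gpu_caches: dict[str, list[list[int]]],
          gpu_loads: dict[str, int]) -> str:
    """Pick a GPU given each GPU's cached prompts and a per-GPU load counter."""
    # Exploitation candidate: the GPU holding the longest matching prefix.
    best_gpu, best_hit = max(
        ((gpu, max((prefix_hit(c, prompt) for c in cached), default=0))
         for gpu, cached in gpu_caches.items()),
        key=lambda item: item[1],
    )
    if best_hit > len(prompt) // 2:
        return best_gpu                        # exploit: cache hit dominates the prompt
    return min(gpu_loads, key=gpu_loads.get)   # explore: fall back to the least-loaded GPU

# Example: GPU "A" already cached a long shared prefix, GPU "B" is idle.
caches = {"A": [[1, 2, 3, 4, 5, 6]], "B": []}
loads = {"A": 5, "B": 1}
print(route([1, 2, 3, 4, 9, 9], caches, loads))  # -> "A" (exploit the cached prefix)
print(route([7, 8, 9, 9, 9, 9], caches, loads))  # -> "B" (no useful prefix; balance load)
```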