Overview of Preble: Efficient Distributed Prompt Scheduling for LLM Serving

It’s all about prompting

  • Prompt example (see the toy prompt sketch below)
  • Complexity of change vs. task accuracy
  • How we can increase the quality of our results through prompting
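For concreteness, here is a toy few-shot prompt (the wording and function name are invented for illustration, not taken from the slides or the paper). The instructions and in-context examples form a long prefix that is identical across many requests, while only a short user-specific suffix changes:

```python
# Illustrative only: the instructions and few-shot examples form a long,
# shared prefix; only the final line differs between requests.
SHARED_PREFIX = (
    "You are a sentiment classifier. Answer with 'positive' or 'negative'.\n"
    "Review: 'I loved every minute of it.' -> positive\n"
    "Review: 'Total waste of money.' -> negative\n"
)

def build_prompt(user_review: str) -> str:
    # Only this short suffix is unique per request; the shared prefix
    # (and, in a serving system, its KV cache) can be reused.
    return SHARED_PREFIX + f"Review: '{user_review}' -> "

print(build_prompt("The plot dragged but the acting was great."))
```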

LLM Caching vs Traditional Caching

Traditional caching:
  • Separation of compute and storage (e.g., DRAM and disk)
  • Caching can be useful for any chunk of data (any type of data)
  • Computation and memory can be predictable

LLM serving systems:
  • Co-location of compute and state
  • Sharing is only useful for the prefix: due to the autoregressive nature of generation, a new prompt can only match a cached one from the beginning (see the sketch below)
  • Computation and memory are unknown before execution
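A minimal sketch of why only prefix matches help (the function name and token IDs below are made up for illustration): because attention is causal, the KV entries for position i depend on all tokens 0..i, so a new request can reuse cached state only up to the first position where it diverges from a cached request.

```python
from itertools import takewhile

def shared_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    """Length of the longest common prefix of two token-ID sequences."""
    return sum(1 for _ in takewhile(lambda p: p[0] == p[1],
                                    zip(cached_tokens, new_tokens)))

# Only the leading matched span of a new request can reuse cached KV state;
# everything after the first mismatch must be recomputed.
cached = [101, 7592, 2088, 1012, 2003]
new    = [101, 7592, 2088, 1012, 2183, 2017]
print(shared_prefix_len(cached, new))  # -> 4 reusable tokens
```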

Example: Prompt Tree Cache

What are real-world prompts like?

High sharing degree

Single GPU example of Prefix Tree
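Below is a toy, single-GPU prefix tree in Python, under my own simplifications: one trie node per token and no eviction or GPU memory management, whereas real prefix caches typically use a radix tree over blocks of tokens tied to KV-cache storage. It only shows the core operations the slide refers to: inserting a served prompt and finding how much of a new prompt is already cached.

```python
from dataclasses import dataclass, field

@dataclass
class TrieNode:
    # In a real system each node would reference KV-cache blocks on the GPU;
    # here we only keep the tree structure.
    children: dict[int, "TrieNode"] = field(default_factory=dict)

class PrefixTreeCache:
    """Toy per-GPU prefix tree keyed by token IDs."""
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_len(self, tokens: list[int]) -> int:
        """Number of leading tokens already present in the cache."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixTreeCache()
cache.insert([1, 2, 3, 4, 5])        # first request populates the tree
print(cache.match_len([1, 2, 3, 9])) # -> 3: only the prefix can be reused
```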

Scheduling comparisons

Exploration

Preble Takeaways

  • LLM serving is getting more expensive as workloads use more complex prompting
  • Workloads have longer prompts with a high degree of sharing
  • Preble (ICLR ’25) schedules requests so that both the KV cache and GPU load are effectively utilized for performance
    • Uses the E2 scheduler (exploitation + exploration) and a fair waiting queue; a minimal sketch of the exploit-vs-explore decision follows below
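The sketch below illustrates the exploit-vs-explore routing idea under my own simplifications (a fixed matched-fraction threshold and a single scalar load metric); it is not Preble's exact E2 policy or its fairness queue, just the shape of the decision.

```python
def pick_gpu(prompt_tokens, gpu_prefix_match, gpu_load, threshold=0.5):
    """Route one request across GPUs.

    gpu_prefix_match[i]: tokens of this prompt already cached on GPU i
    gpu_load[i]:         pending work on GPU i (e.g., queued tokens)

    Exploitation: if some GPU already caches a large share of the prompt,
    send the request there to reuse its KV cache.
    Exploration:  otherwise send it to the least-loaded GPU, so load stays
    balanced and the prefix gets cached somewhere new.
    """
    best_gpu = max(gpu_prefix_match, key=gpu_prefix_match.get)
    if gpu_prefix_match[best_gpu] / len(prompt_tokens) >= threshold:
        return best_gpu                      # exploit the cached prefix
    return min(gpu_load, key=gpu_load.get)   # explore / balance load

prompt = list(range(100))
print(pick_gpu(prompt, {0: 80, 1: 5}, {0: 900, 1: 100}))  # -> 0 (exploit)
print(pick_gpu(prompt, {0: 10, 1: 5}, {0: 900, 1: 100}))  # -> 1 (explore)
```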