KV Cache
- Memory space to store intermediate vector representations of tokens
- Working set rather than a “cache”
- The size of KV Cache dynamically grows and shrinks
- A new token is appended in each step
- Tokens are deleted once the sequence finishes
- Additional info: https://medium.com/@joaolages/kv-caching-explained-276520203249
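A minimal sketch (hypothetical structure, not any engine's actual data layout) of the "working set" view: each decode step appends the new token's K/V vectors for its sequence, and the whole set is released when the sequence finishes.

```python
# Hypothetical sketch: the KV "cache" is a per-sequence working set that grows by
# one token per decode step and is freed when the sequence finishes.
kv_cache = {}  # seq_id -> list of (key_vector, value_vector), one entry per token

def append_token_kv(seq_id, key_vec, value_vec):
    """Each decode step appends the new token's K/V so later steps can attend to it."""
    kv_cache.setdefault(seq_id, []).append((key_vec, value_vec))

def finish_sequence(seq_id):
    """Once the sequence finishes, its entire working set is released at once."""
    del kv_cache[seq_id]
```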
Key Insight
Efficient management of KV cache is crucial for high-throughput LLM serving
Memory Waste
- Reservation: slots not used at the current step but reserved for tokens that will be generated later
- Internal fragmentation: over-allocation because the output length is unknown in advance
- External fragmentation: gaps left between allocations for sequences of different lengths
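To make the three kinds of waste concrete, here is a back-of-the-envelope example with made-up numbers, assuming a naive allocator that reserves one contiguous max-length region per request.

```python
# Illustrative numbers only: a naive allocator reserves one contiguous region of
# max_len slots per request, because the output length is unknown at admission time.
max_len    = 2048   # worst-case length assumed up front
prompt_len = 500
decoded    = 100    # tokens generated so far
final_len  = 700    # prompt + full output, known only when the sequence finishes

used_now      = prompt_len + decoded
reservation   = final_len - used_now   # empty now, but filled by future tokens
internal_frag = max_len - final_len    # over-allocated slots that are never used
# External fragmentation: the differently sized regions left behind by finished
# sequences create gaps too small or too scattered for the next contiguous request.
print(used_now, reservation, internal_frag)   # 600 100 1348
```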
KV Block Table vs. Classical Page Table
- A classical page table has multiple levels; the block table is a single level
- For the GPU KV cache, a sequence's KV entries are contiguous and only grow in one direction: tokens are only appended and nothing is freed mid-sequence, so unlike a classical page table the KV cache can use a flat, one-level lookup table
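A minimal sketch of such a one-level block table (block size and names are assumptions, loosely in the spirit of paged KV-cache designs): because entries are only appended, translation is a single indexed lookup.

```python
# Assumed block size and names; loosely in the spirit of a paged KV cache.
BLOCK_SIZE  = 16
free_blocks = list(range(1024))   # pool of physical KV blocks in GPU memory
block_table = {}                  # seq_id -> flat list of physical block ids

def on_new_token(seq_id, num_tokens_before):
    """Grab a new physical block only when the sequence crosses a block boundary."""
    if num_tokens_before % BLOCK_SIZE == 0:
        block_table.setdefault(seq_id, []).append(free_blocks.pop())

def lookup(seq_id, token_index):
    """One-level translation: logical token index -> (physical block, offset)."""
    return block_table[seq_id][token_index // BLOCK_SIZE], token_index % BLOCK_SIZE
```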
What We Have Learned about LLM Scheduling: It Is All Tradeoffs
- Scheduling overhead can be significant and dominate e2e performance
- Absolute scheduling overhead grows with both input size and task complexity
- Input: vLLM's scheduling overhead grows with token and request counts (hint: implement it in a native language)
- Complexity: SGLang's overhead grows with prefix caching (especially with a big prefix tree)
- Relative scheduling overhead is higher when the other parts of the e2e pipeline are faster (see the numeric sketch after this list)
- Faster model forwarding, less frontend processing, etc.
- Chunked prefill makes scheduling overhead higher (both absolute and relative)
- Multi-step scheduling lowers overall scheduling overhead but has its tradeoffs
- A faster CPU and a slower GPU both reduce scheduling overhead (compare the results on our servers with A6000 GPUs)
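A back-of-the-envelope example (made-up timings) of why relative overhead rises when everything else speeds up, even though the absolute scheduling cost per step is unchanged:

```python
# Made-up timings: the same 0.5 ms of scheduling per step is a small share of a
# slow forward pass but a large share of a fast one.
def relative_overhead(schedule_ms, forward_ms):
    return schedule_ms / (schedule_ms + forward_ms)

print(relative_overhead(0.5, 30.0))  # ~1.6% of the step on a slow GPU / big model
print(relative_overhead(0.5, 5.0))   # ~9.1% of the step on a fast GPU / small model
```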
Ideas with No Tradeoffs for the System
- Rewrite in a faster language
- only "no tradeoff" if you don't count engineer ramp-up and added complexity
- Async scheduling
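A structural sketch (assumed design, not any particular engine's implementation) of async scheduling: while the GPU executes step N, the CPU schedules the batch for step N+1, so scheduling time is hidden behind model forwarding rather than eliminated.

```python
# Structural sketch only: overlap CPU-side scheduling of step N+1 with the GPU
# forward pass of step N.
import concurrent.futures

def schedule_batch(step):
    # CPU work: choose requests, build block tables and attention metadata
    return f"batch_{step}"

def run_forward(batch):
    # GPU work: model forward for the already-scheduled batch
    pass

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as scheduler:
    batch = schedule_batch(0)
    for step in range(1, 9):
        next_batch = scheduler.submit(schedule_batch, step)  # runs while the GPU is busy
        run_forward(batch)
        batch = next_batch.result()
```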