source: https://arxiv.org/abs/2402.01869
Summary
The paper introduces InferCept, an efficient interception-support system for augmented LLMs. LLMs are increasingly augmented with external tools whose calls pause the LLM’s decoding during inference. When that happens, the serving system has three options for the paused request’s context: discard it, preserve it in GPU memory, or swap it to the CPU. Each option trades GPU time against GPU memory, and InferCept dynamically chooses among them to minimize total GPU memory waste while maintaining throughput. It also optimizes the individual options themselves to reduce their waste.
Questions
- What are the three ways of dealing with KV cache when a model calls an API?
The three ways are discard, preserve, and swap. Discard drops the KV cache before handing the task to the augmenting entity, which requires recomputing the context afterwards. Preserve keeps the KV cache in GPU memory, which increases memory usage and therefore reduces throughput. Swap moves the KV cache to the CPU, but the limited GPU-CPU bandwidth can bottleneck the transfer.
- Why does InferCept consider running requests that are not currently calling tools when calculating GPU memory waste?
InferCept considers running requests because discarding a request’s KV cache means its context must be recomputed when it resumes, and that recomputation stalls the other running requests in the batch. Those requests sit idle while still occupying GPU memory, so their memory is wasted for the duration of the stall. InferCept avoids the full stall by chunking the recomputation and interleaving it with decoding, which lowers the GPU memory waste (a minimal sketch follows).
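A minimal, self-contained sketch of that idea (illustrative names, not InferCept’s code): instead of one long prefill, the discarded context is re-fed in fixed-size chunks that ride along with normal decode iterations.

```python
from dataclasses import dataclass

CHUNK_TOKENS = 256  # illustrative per-iteration recomputation budget


@dataclass
class ResumingRequest:
    context: list        # tokens whose KV cache was discarded at interception
    recomputed: int = 0  # how many of them have been re-prefilled so far


def build_iteration_batch(decoding_requests, resuming=None):
    """One serving step: every active request contributes its single decode
    token; a resuming (previously discarded) request contributes only a small
    chunk of its old context, so no step stalls on a full-length prefill."""
    batch = [(req, 1) for req in decoding_requests]  # (request, tokens this step)
    if resuming is not None and resuming.recomputed < len(resuming.context):
        chunk = min(CHUNK_TOKENS, len(resuming.context) - resuming.recomputed)
        batch.append((resuming, chunk))
        resuming.recomputed += chunk
    return batch
```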
Motivation
- LLMs are increasingly augmented with tools like calculators, retrievers, or image generators
- During inference, these interactions pause the LLM’s decoding (waiting for the tool result), causing repeated recomputation when decoding resumes
- a tool request can also depend on human events (e.g., waiting on a human response)
- Current systems like vLLM treat each pause as a new request, wasting up to 40% of total compute time recomputing already-processed tokens.
Key Problem
When an LLM pauses mid-generation for a tool call, the serving system has three options:
- Discard: drop old context, losing KV cache → relegate task to augmenting entity → recomputation overhead
- Preserve: keep KV cache in GPU memory (alternative to Discard) → memory waste, reduced throughput
- Swap: move context to CPU → slow due to bandwidth limits
Each option wastes either GPU time or memory (a rough sketch of these waste terms follows).
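To make the tradeoff concrete, here is a rough, back-of-the-envelope version of the waste accounting; the symbols and exact forms are ours, not the paper’s Eqs. (1)–(5):

```latex
% Illustrative only: waste measured as (GPU memory) x (time); notation is ours, not the paper's.
\[
\begin{aligned}
W_{\text{preserve}} &\approx T_{\text{intercept}} \cdot M_{\text{ctx}}
  && \text{idle KV cache held for the whole interception}\\
W_{\text{discard}}  &\approx T_{\text{recompute}} \cdot M_{\text{batch}}
  && \text{the batch's memory idles during the re-prefill stall}\\
W_{\text{swap}}     &\approx (T_{\text{out}} + T_{\text{in}}) \cdot M_{\text{batch}},
  \quad T_{\text{out}} \approx T_{\text{in}} \approx \frac{M_{\text{ctx}}}{BW_{\text{GPU-CPU}}}
  && \text{memory idles while transfers occupy the link}
\end{aligned}
\]
```

The scheduler picks whichever option has the smallest estimated waste for each intercepted request; the swap term shrinks when transfers are overlapped with computation, which is what the pipelining and chunking optimizations described below exploit.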
INFERCEPT’s core idea
min-waste interception - dynamically choose among discard, preserve, or swap to minimize total GPU memory waste while maintaining throughput
It achieves this via:
- Waste models for all three strategies. Example: intercept duration × context size × per-token memory.
- Dynamic scheduling using these waste estimates.
- Optimized recomputation and swapping:
- Swap pipelining: overlap GPU–CPU transfers with layer-forward passes.
- Swap chunking: split context swaps into chunks that fit the GPU–CPU link budget (a sketch follows this list).
- Recomputation chunking: overlap partial recomputation with decoding to balance compute/memory.
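A simplified sketch of the swap-budget idea (illustrative constants and function names; the real system additionally pipelines transfers with per-layer forward passes): move only as many KV-cache bytes per iteration as the GPU–CPU link can hide behind one forward pass.

```python
# Illustrative swap chunking: only transfer as many KV-cache bytes per iteration
# as the GPU-CPU link can move in the shadow of one forward pass.

PCIE_BANDWIDTH = 24e9   # bytes/s, illustrative effective GPU-CPU bandwidth
ITERATION_TIME = 0.030  # seconds per decoding iteration, illustrative
SWAP_BUDGET = int(PCIE_BANDWIDTH * ITERATION_TIME)  # bytes hideable per iteration


def plan_swaps(pending_swaps):
    """pending_swaps: list of (request_id, remaining_bytes). Returns the chunk of
    each context to transfer this iteration, never exceeding the link budget."""
    plan, budget = [], SWAP_BUDGET
    for req_id, remaining in pending_swaps:
        if budget == 0:
            break
        chunk = min(remaining, budget)
        plan.append((req_id, chunk))  # this transfer overlaps with the forward pass
        budget -= chunk
    return plan
```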
Scheduling Algorithm
At each iteration (a code sketch follows this list):
- Compute per-request waste using equations (1)–(5).
- Sort intercepted requests by potential waste.
- Assign a swap budget to the most wasteful ones.
- For others, choose preserve or discard adaptively.
- Resume swapped/discarded requests using FCFS fairness while ensuring GPU saturation
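Putting these steps together, here is a compressed sketch of the per-iteration decision; the waste_* callables are stand-ins for the paper’s Eqs. (1)–(5), and the real scheduler additionally handles FCFS resumption and GPU saturation:

```python
# Sketch of the min-waste decision made each iteration. The waste_*() callables
# stand in for per-request waste estimates; kv_bytes() returns a request's KV-cache size.

def schedule_intercepted(intercepted, swap_budget_bytes,
                         waste_preserve, waste_swap, waste_discard, kv_bytes):
    """Pick a strategy for every currently intercepted request."""
    decisions = {}
    # Handle the most wasteful requests first so the limited swap budget goes
    # where it saves the most memory-time.
    for req in sorted(intercepted, key=waste_preserve, reverse=True):
        if (waste_swap(req) <= min(waste_preserve(req), waste_discard(req))
                and kv_bytes(req) <= swap_budget_bytes):
            decisions[req] = "swap"
            swap_budget_bytes -= kv_bytes(req)
        elif waste_preserve(req) <= waste_discard(req):
            decisions[req] = "preserve"
        else:
            decisions[req] = "discard"
    return decisions
```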