In Large Language Models, generating text is autoregressive. This means the model predicts one token at a time, and each new token depends on all previous tokens.
Without a cache, the model would have to re-calculate the "Key" and "Value" vectors for every single word in the sentence every time it wants to generate the next word. This is highly inefficient.
Imagine generating the sentence: "The cat sat". To generate the word "on", the model processes "The", "cat", and "sat". To then generate "the", it processes "The", "cat", "sat", and "on" all over again.
Every one of those repeated passes is wasted GPU work. We already knew the mathematical representations of "The" and "cat", yet we recalculated them.
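To make the redundancy concrete, here is a tiny sketch in plain Python. The token lists come from the example above, and the full forward pass is only represented by a print statement, so this is purely illustrative:

```python
# Without a KV cache, every generation step re-processes the entire prefix.
prompt = ["The", "cat", "sat"]
generated = ["on", "the", "mat"]

sequence = list(prompt)
for step, next_token in enumerate(generated, start=1):
    # A real model would run a full forward pass here, recomputing K and V
    # for ALL tokens in `sequence`, even though only the last one is new.
    print(f"Step {step}: recomputing K/V for {sequence}")
    sequence.append(next_token)
```

With a cache, each step would only process the single newest token; the work for the older tokens is done once and reused.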
In the Transformer Self-Attention mechanism, each token has three vectors: Query (Q), Key (K), and Value (V). To calculate the next word, we only need the Query of the newest token, but we need the Keys and Values of all previous tokens.
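A minimal NumPy sketch of one decoding step under these assumptions (toy dimensions, random weights, and hypothetical names like `k_cache` and `v_cache` standing in for the stored Keys and Values of the previous tokens):

```python
import numpy as np

d = 64                                  # head dimension (toy size)
k_cache = np.random.randn(10, d)        # Keys of the 10 previous tokens
v_cache = np.random.randn(10, d)        # Values of the 10 previous tokens

x_new = np.random.randn(d)              # hidden state of the newest token
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

q_new = x_new @ W_q                     # only the newest Query is needed
k_new = x_new @ W_k                     # its Key and Value get appended to the cache
v_new = x_new @ W_v
K = np.vstack([k_cache, k_new])
V = np.vstack([v_cache, v_new])

scores = q_new @ K.T / np.sqrt(d)       # attend over ALL Keys (cached + new)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax
output = weights @ V                    # weighted sum over ALL Values
```

Notice that the old Keys and Values are simply read from memory; only the three projections for the newest token are computed.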
When you send a prompt, the model processes all of its tokens at once (the prefill phase) and computes their K and V vectors. These are stored in GPU memory (VRAM).
For every new word generated (see the sketch after this list):

1. Compute Q, K, and V for the new token only.
2. Append the new K and V to the cache.
3. Attend with the new Query over all cached Keys and Values to produce the next token.
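A toy end-to-end sketch of these two phases, again in NumPy with a single layer and head, random matrices standing in for the model's weights and hidden states:

```python
import numpy as np

d = 64
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)

def kv(x):
    """Project hidden states x of shape (n, d) to Keys and Values."""
    return x @ W_k, x @ W_v

# --- Prefill: the whole prompt is processed in one pass ---
prompt_hidden = np.random.randn(5, d)          # hidden states for 5 prompt tokens
k_cache, v_cache = kv(prompt_hidden)           # kept in VRAM on a real GPU

# --- Decode: each new token adds exactly one row to the cache ---
for step in range(3):
    new_hidden = np.random.randn(1, d)         # hidden state of the token just generated
    k_new, v_new = kv(new_hidden)
    k_cache = np.vstack([k_cache, k_new])      # append; old rows are never recomputed
    v_cache = np.vstack([v_cache, v_new])
    print(f"step {step}: cache holds {k_cache.shape[0]} tokens")
```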
While KV Caching makes generation much faster, it consumes a significant amount of VRAM. This is one reason models enforce "context limits" (e.g., 32k tokens): the cache grows linearly with the length of the conversation.
Its size is roughly 2 * num_layers * num_heads * head_dim * precision_bytes per token, where the factor of 2 accounts for storing both the Key and the Value.
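Plugging in hypothetical numbers (illustrative only, not taken from any specific model): 32 layers, 32 attention heads, a head dimension of 128, and fp16 precision (2 bytes per value):

```python
layers, heads, head_dim = 32, 32, 128       # hypothetical model configuration
precision_bytes = 2                          # fp16 / bf16

bytes_per_token = 2 * layers * heads * head_dim * precision_bytes
print(bytes_per_token)                       # 524288 bytes = 512 KiB per token

context = 32_000                             # tokens in the context window
print(bytes_per_token * context / 2**30)     # ~15.6 GiB of VRAM just for the cache
```

At that scale the cache alone can rival the memory footprint of the model weights, which is why long-context serving is so memory-hungry.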