What Is a KV Cache?

In Large Language Models, generating text is autoregressive. This means the model predicts one token at a time, and each new token depends on all previous tokens.

Without a cache, the model would have to re-compute the "Key" and "Value" vectors for every token in the sequence each time it generates the next token. This is highly inefficient.

1. The Problem: Redundant Computation

Imagine generating the sentence: "The cat sat". To generate the word "on", the model processes "The", "cat", and "sat". To then generate "the", it processes "The", "cat", "sat", and "on" all over again.
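The waste is quadratic: generating N tokens without a cache touches roughly N²/2 tokens in total, while a cache touches only N. A minimal sketch of that count (the function name is illustrative, not from any library):

```python
# Count how many per-token K/V computations a naive (cache-free) decoder
# performs versus one with a KV cache, for a toy 5-token generation.

def tokens_processed(num_steps: int, use_cache: bool) -> int:
    """Total K/V computations across all generation steps."""
    if use_cache:
        # Each step computes K/V only for the single new token.
        return num_steps
    # Each step re-computes K/V for every token seen so far.
    return sum(step for step in range(1, num_steps + 1))

print(tokens_processed(5, use_cache=False))  # 15  (1 + 2 + 3 + 4 + 5)
print(tokens_processed(5, use_cache=True))   # 5
```

At 5 tokens the gap is 15 vs. 5; at 1,000 tokens it is roughly 500,000 vs. 1,000.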

[Diagram: generating "The cat sat" step by step. "The" predicts "cat"; "The cat" predicts "sat". At every step, the previous tokens are re-calculated.]

These repeated computations are wasted GPU cycles: we already knew the mathematical representations of "The" and "cat", yet we re-calculated them.

2. The Solution: KV Cache

In the Transformer self-attention mechanism, each token is projected into three vectors: Query (Q), Key (K), and Value (V). To generate the next token, we only need the Query of the newest token, but we need the Keys and Values of all previous tokens. So instead of re-computing them, we store them once and reuse them.
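That asymmetry is visible in the attention formula itself: a single Query is scored against every cached Key, and the scores weight the cached Values. A minimal NumPy sketch (toy shapes, not a real model):

```python
import numpy as np

# Scaled dot-product attention for the newest token only:
# one Query vector against all cached Keys and Values.

def attend(q, K, V):
    """Attention output for a single query vector q over caches K, V."""
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per past token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over past tokens
    return weights @ V                      # weighted sum of Values

rng = np.random.default_rng(0)
K = rng.normal(size=(3, 4))   # cached Keys for "The", "cat", "sat"
V = rng.normal(size=(3, 4))   # cached Values for the same tokens
q = rng.normal(size=(4,))     # Query of the newest token only

out = attend(q, K, V)
print(out.shape)  # (4,)
```

Note that `q` is the only quantity computed fresh each step; `K` and `V` are read from the cache.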

[Diagram: the GPU computes Q, K, and V for the current token "sat" only. The new K and V are stored in the KV Cache (VRAM), and the past K and V for "The" and "cat" are fetched from it.]

3. How it Works (Step-by-Step)

Step 1: Prefill

When you send a prompt, the model processes all prompt tokens in parallel and calculates the initial K and V vectors for each of them. These are stored in GPU memory (VRAM) as the starting cache.
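Because the prompt is known up front, prefill can compute every token's K and V in one batched matrix multiply. A sketch of that idea (the weights `W_k` and `W_v` are hypothetical projection matrices, stand-ins for one attention layer's learned parameters):

```python
import numpy as np

# Prefill sketch: project all prompt tokens to K and V at once,
# then keep the results as the initial KV cache.

rng = np.random.default_rng(1)
d_model = 8
prompt_embeddings = rng.normal(size=(5, d_model))  # 5 prompt tokens
W_k = rng.normal(size=(d_model, d_model))          # hypothetical K projection
W_v = rng.normal(size=(d_model, d_model))          # hypothetical V projection

kv_cache = {
    "K": prompt_embeddings @ W_k,  # (5, d_model), kept in VRAM
    "V": prompt_embeddings @ W_v,  # (5, d_model), kept in VRAM
}
print(kv_cache["K"].shape)  # (5, 8)
```

In a real model there is one such cache per layer (and per attention head), but the shape logic is the same.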

Step 2: Decoding

For every new token generated:

- Compute Q, K, and V for the new token only.
- Append the new K and V to the cache.
- Score the new token's Q against all cached Keys, and use the resulting weights to combine the cached Values.
- Predict the next token from the attention output, then repeat.
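The decoding loop can be sketched in a few lines of NumPy. Shapes and projection weights are illustrative only; the point is that each iteration does one token's worth of Q/K/V work and grows the cache by one row:

```python
import numpy as np

# Decoding sketch: per step, compute Q/K/V for the new token only,
# append the new K and V to the cache, attend over the full cache.

rng = np.random.default_rng(2)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

for step in range(4):                      # generate 4 tokens
    x = rng.normal(size=(d,))              # embedding of the newest token
    q, k, v = x @ W_q, x @ W_k, x @ W_v    # one token's worth of work
    K_cache = np.vstack([K_cache, k])      # cache grows by one row
    V_cache = np.vstack([V_cache, v])
    scores = K_cache @ q / np.sqrt(d)      # Q vs. all cached Keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                           # softmax
    out = w @ V_cache                      # attention output this step

print(K_cache.shape)  # (4, 8)
```

Real inference engines preallocate the cache rather than growing it with `vstack`, but the data flow is the same.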

4. The Trade-off: Memory vs. Speed

While KV caching makes generation much faster, it consumes a significant amount of VRAM, and the cache grows linearly with the length of the conversation. This is a key reason long context windows (e.g., 32k tokens) are expensive to serve: every additional token in the context adds another set of Keys and Values to store.
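A back-of-envelope estimate makes the cost concrete. The cache holds one K and one V vector per token, per layer, per KV head; the model dimensions below are hypothetical, chosen to resemble a mid-size transformer rather than any specific model's real configuration:

```python
# KV cache size = 2 (K and V) * layers * kv_heads * head_dim
#                 * seq_len * bytes_per_element

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of VRAM the KV cache needs (bytes_per_elem=2 -> fp16)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 32-layer model with 32 KV heads of dim 128, 32k context:
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=32_000)
print(f"{size / 2**30:.1f} GiB")  # 15.6 GiB
```

Roughly 15.6 GiB for a single 32k-token sequence at fp16, before counting the model weights themselves, which is why techniques like grouped-query attention (fewer KV heads) and cache quantization exist.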