In Large Language Models, generating text is autoregressive. This means the model predicts one token at a time, and each new token depends on all previous tokens.
Without a cache, the model would have to re-calculate the "Key" and "Value" vectors for every single word in the sentence every time it wants to generate the next word. This is highly inefficient.
Imagine generating the sentence: "The cat sat". To generate the word "on", the model processes "The", "cat", and "sat". To then generate "the", it processes "The", "cat", "sat", and "on" all over again.
Every one of those repeated passes is wasted GPU work. We already knew the mathematical representations of "The" and "cat", yet we recalculated them.
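To make the redundancy concrete, here is a tiny sketch in plain Python. The token lists come from the example above, and the full forward pass is only represented by a print statement, so this is purely illustrative:

```python
# Without a KV cache, every generation step re-processes the entire prefix.
prompt = ["The", "cat", "sat"]
generated = ["on", "the", "mat"]

sequence = list(prompt)
for step, next_token in enumerate(generated, start=1):
    # A real model would run a full forward pass here, recomputing K and V
    # for ALL tokens in `sequence`, even though only the last one is new.
    print(f"Step {step}: recomputing K/V for {sequence}")
    sequence.append(next_token)
```

With a cache, each step would only process the single newest token; the work for the older tokens is done once and reused.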
In the Transformer Self-Attention mechanism, each token has three vectors: Query (Q), Key (K), and Value (V). To calculate the next word, we only need the Query of the newest token, but we need the Keys and Values of all previous tokens.
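A minimal NumPy sketch of one decoding step under these assumptions (toy dimensions, random weights, and hypothetical names like `k_cache` and `v_cache` standing in for the stored Keys and Values of the previous tokens):

```python
import numpy as np

d = 64                                  # head dimension (toy size)
k_cache = np.random.randn(10, d)        # Keys of the 10 previous tokens
v_cache = np.random.randn(10, d)        # Values of the 10 previous tokens

x_new = np.random.randn(d)              # hidden state of the newest token
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

q_new = x_new @ W_q                     # only the newest Query is needed
k_new = x_new @ W_k                     # its Key and Value get appended to the cache
v_new = x_new @ W_v
K = np.vstack([k_cache, k_new])
V = np.vstack([v_cache, v_new])

scores = q_new @ K.T / np.sqrt(d)       # attend over ALL Keys (cached + new)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                # softmax
output = weights @ V                    # weighted sum over ALL Values
```

Notice that the old Keys and Values are simply read from memory; only the three projections for the newest token are computed.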
When you send a prompt, the model processes all of its tokens at once (the prefill phase) and computes their K and V vectors. These are stored in GPU memory (VRAM).
For every new word generated (see the sketch after this list):

1. Compute Q, K, and V for the new token only.
2. Append the new K and V to the cache.
3. Attend with the new Query over all cached Keys and Values to produce the next token.
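A toy end-to-end sketch of these two phases, again in NumPy with a single layer and head, random matrices standing in for the model's weights and hidden states:

```python
import numpy as np

d = 64
W_k, W_v = np.random.randn(d, d), np.random.randn(d, d)

def kv(x):
    """Project hidden states x of shape (n, d) to Keys and Values."""
    return x @ W_k, x @ W_v

# --- Prefill: the whole prompt is processed in one pass ---
prompt_hidden = np.random.randn(5, d)          # hidden states for 5 prompt tokens
k_cache, v_cache = kv(prompt_hidden)           # kept in VRAM on a real GPU

# --- Decode: each new token adds exactly one row to the cache ---
for step in range(3):
    new_hidden = np.random.randn(1, d)         # hidden state of the token just generated
    k_new, v_new = kv(new_hidden)
    k_cache = np.vstack([k_cache, k_new])      # append; old rows are never recomputed
    v_cache = np.vstack([v_cache, v_new])
    print(f"step {step}: cache holds {k_cache.shape[0]} tokens")
```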
While KV Caching makes generation much faster, it consumes a significant amount of VRAM. This is one reason models enforce "context limits" (e.g., 32k tokens): the cache grows linearly with the length of the conversation.
Its size is roughly 2 * num_layers * num_heads * head_dim * precision_bytes per token, where the factor of 2 accounts for storing both the Key and the Value.
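Plugging in hypothetical numbers (illustrative only, not taken from any specific model): 32 layers, 32 attention heads, a head dimension of 128, and fp16 precision (2 bytes per value):

```python
layers, heads, head_dim = 32, 32, 128       # hypothetical model configuration
precision_bytes = 2                          # fp16 / bf16

bytes_per_token = 2 * layers * heads * head_dim * precision_bytes
print(bytes_per_token)                       # 524288 bytes = 512 KiB per token

context = 32_000                             # tokens in the context window
print(bytes_per_token * context / 2**30)     # ~15.6 GiB of VRAM just for the cache
```

At that scale the cache alone can rival the memory footprint of the model weights, which is why long-context serving is so memory-hungry.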