KV Caching: Why Long Prompts Get Expensive
Large language models feel fast on short prompts, then slow down dramatically on long text. KV caching is the key trick that makes long context usable in practice. This essay opens the box: what is cached, why reuse helps, and where the memory wall shows up.
The illusion of speed
Large language models feel fast on short prompts. But give them a 100,000-token document, and latency increases dramatically.
Now consider two common situations:
Multiple questions over the same large document
You load a long document once, then ask ten questions about it. Intuitively, it feels like we should be able to process the document once, preserve internal state, and reuse it. If we could, total compute and therefore energy could drop by nearly 10×.
A long chat with a single user
A conversation grows turn after turn. Each new message depends on everything that came before. Intuitively, we should be able to preserve the internal state of the conversation and continue from there, instead of reprocessing the entire growing history every time.
So why don’t we just preserve the internal state so we never have to recompute anything?
To answer that, we need to open the box. We need to understand how the next token is generated step by step, what internal matrices must be carried forward, and why keeping that state becomes expensive and impractical at scale.
1. How the model processes text
From text to vectors
A large language model does not read raw text. It first tokenizes the input. For simplicity, you can think of a token as roughly a word. In practice, tokens are subword units, so a word may be split into multiple tokens.
Suppose:
- The document contains 100,000 tokens
- The model embedding size is dmodel ≈ 8,000
dmodel depends on the architecture. For example:
- GPT-2 Large: dmodel = 1,280
- LLaMA-2 7B: dmodel = 4,096
- LLaMA-2 70B: dmodel = 8,192
- GPT-3 (175B): dmodel = 12,288
Each token is represented as a vector in ℝ^dmodel. These dimensions are learned during training and have no predefined meaning. What matters is the geometry of the full vector space, not any single dimension.
The input to the model is a matrix of token vectors:

X ∈ ℝ^(T × dmodel)

This is already large: 100,000 × 8,000 = 800,000,000 numbers.
Inside one Transformer layer
Each layer has two main parts: self attention and a feed forward network (MLP).
Self attention (Q, K, V)
The model computes three learned projections:

Q = X W_Q,  K = X W_K,  V = X W_V

Intuitively, Q is what a token queries, K is what a token makes available, and V is what flows forward. In practice, Q, K, and V are learned linear projections of the same token representation.

Attention is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d) V

where d is the dimension of the Key vectors.
The score matrix QKᵀ is conceptually T × T. For 100,000 tokens, that is 10,000,000,000 entries. In float16, storing it would be about 20 GB, but modern kernels compute it in blocks and do not materialize the full matrix.
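The blocked computation can be sketched in a few lines of NumPy. This is a minimal illustration of the online-softmax idea behind FlashAttention-style kernels (block size and shapes are illustrative; real kernels do this fused and tiled on-chip):

```python
import numpy as np

def attention_full(Q, K, V):
    """Reference: materializes the full T x T score matrix."""
    scores = (Q @ K.T) / np.sqrt(Q.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def attention_blocked(Q, K, V, block=64):
    """Same result, but streams over K/V blocks with an online softmax,
    so only a T x block tile of scores exists at any moment."""
    T, d = Q.shape
    out = np.zeros((T, V.shape[1]))
    m = np.full(T, -np.inf)   # running row-wise max of scores
    s = np.zeros(T)           # running softmax denominator
    for i in range(0, K.shape[0], block):
        Kb, Vb = K[i:i + block], V[i:i + block]
        scores = (Q @ Kb.T) / np.sqrt(d)
        m_new = np.maximum(m, scores.max(axis=1))
        corr = np.exp(m - m_new)          # rescale earlier partial sums
        p = np.exp(scores - m_new[:, None])
        s = s * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vb
        m = m_new
    return out / s[:, None]
```

The blocked version returns the same result as the reference up to floating-point error, while never holding more than one T × block tile of scores.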
However, the Keys and Values must be stored for reuse during generation. For each layer:

K ∈ ℝ^(T × dmodel),  V ∈ ℝ^(T × dmodel)

That is 800,000,000 values per tensor. In float16, this is about 1.6 GB for K and 1.6 GB for V, so about 3.2 GB per layer.
Notice that we do not cache Q. Queries are only needed at the moment a token is processed. During autoregressive generation, each new token produces its own Q vector, which is immediately compared against all previously stored K vectors. Once that comparison is done, the Query has served its purpose and is never reused.
Keys and Values are different. They must remain available because every future token will attend to them. In other words:
- Q is a temporary request.
- K and V form persistent memory.
This asymmetry explains why the cache is called a KV cache and not a QKV cache.
Feed forward network (MLP)
After attention, each token is transformed independently by a feed-forward network. This MLP expands the representation to a larger dimension, applies a non-linear activation such as GELU, then projects back.
GELU stands for Gaussian Error Linear Unit. It is non-linear, meaning it does not satisfy f(ax + by) = a f(x) + b f(y). Without a non-linear activation, stacking layers would collapse into a single linear mapping.
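The non-linearity is easy to check numerically. A small sketch using the exact GELU, x·Φ(x) with Φ the standard normal CDF (the inputs a, b, x, y are arbitrary):

```python
import math

def gelu(x):
    # exact GELU: x * Phi(x), with Phi the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# A linear map would satisfy f(ax + by) = a f(x) + b f(y); GELU does not:
a, b, x, y = 2.0, 3.0, 0.5, -1.0
lhs = gelu(a * x + b * y)
rhs = a * gelu(x) + b * gelu(y)
print(abs(lhs - rhs) > 1e-6)  # True
```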
This layer pattern repeats many times. For example, GPT-3 (175B) uses 96 layers, while LLaMA-2 70B uses 80 layers.
2. What caching means
When a Transformer processes a document, it produces internal state at each layer. The most useful state to reuse is the pair of tensors K and V. If we store them after processing a document, we can reuse them when answering multiple questions about the same document.
Without caching
[Document] + [Question]
↓
X ∈ ℝ^((C+Q) × dmodel)
X ──> [Layer 1] ──> [Layer 2] ──> ... ──> [Layer L]
↑
recomputed every time
With KV caching
Step 1: Prefill (done once)
[Document]
↓
Xdoc ∈ ℝ^(C × dmodel)
Xdoc ──> [Layer 1] ──> ... ──> [Layer L]
             │                     │
        store K₁,V₁          store K_L,V_L
Step 2: Question and Generation (what happens to Q, K, V)
Two regimes exist:
A) Prefill (batched processing)
- When you process an entire sequence at once (document or question),
you compute projections for all tokens in parallel:
Q, K, V ∈ ℝ^(T × dmodel)
Example:
- Document prefill: T = C = 100,000 → Q ∈ ℝ^(100,000 × dmodel)
- Question prefill: T = Q = 50 → Q ∈ ℝ^(50 × dmodel)
B) Decoding (autoregressive generation)
- When generating the next token, you process only ONE new token at a time:
New token tn
↓
Compute Qn ∈ ℝ^(1 × dmodel) (transient)
Compute Kn, Vn ∈ ℝ^(1 × dmodel) (appended to the cache)
Append first, so the new token can attend to itself:
Kcache ← [K₁ ... Kn-1 ; Kn]
Vcache ← [V₁ ... Vn-1 ; Vn]
Compare against all stored Keys:
Qn × [K₁ ... Kn]ᵀ
Weights = softmax(...)
Combine stored Values:
Weights × [V₁ ... Vn]
→ produce output for token tn
Key idea:
- Past Q vectors are never reused.
- K and V persist because future tokens must attend to them.
During generation, the model appends one new row of K and V per layer for every new token. This is what makes decoding fast, and what makes long context memory heavy.
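The decode loop can be sketched in a few lines of NumPy. This is a toy single-layer, single-head model with random stand-in projection matrices (Wq, Wk, Wv are illustrative, not real weights); it shows Q being used once per step while K and V accumulate:

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
# Stand-ins for the learned projection matrices
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))   # persistent memory: one row per past token
V_cache = np.empty((0, d))

def decode_step(x, K_cache, V_cache):
    """x is the 1 x d representation of the newest token."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    K_cache = np.vstack([K_cache, k])   # K and V are appended and kept
    V_cache = np.vstack([V_cache, v])
    scores = (q @ K_cache.T) / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    out = w @ V_cache                   # q is used once, then discarded
    return out, K_cache, V_cache

for t in range(5):
    x = rng.normal(size=(1, d))         # stand-in for the new token's embedding
    out, K_cache, V_cache = decode_step(x, K_cache, V_cache)

print(K_cache.shape)  # one cached K row per generated token
```

After five steps the cache holds five rows per tensor: the Queries are gone, the Keys and Values remain.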
3. The performance win
Let:
- C = document length in tokens
- Q = question length in tokens
- N = number of questions
Without shared caching, total cost is roughly proportional to: N(C + Q)
With a shared KV cache, total cost is roughly: C + NQ
Gain factor: N(C + Q) / (C + NQ)
With C = 100,000, Q = 100, N = 10, compute drops by nearly 10×. If energy roughly tracks compute, energy drops by roughly 10× too.
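The arithmetic behind that claim, as a one-liner (variable names are illustrative; Q_len is the question length, to avoid clashing with the Query matrix):

```python
C, Q_len, N = 100_000, 100, 10           # document, question, number of questions
without_cache = N * (C + Q_len)           # reprocess everything per question
with_cache = C + N * Q_len                # prefill once, then questions only
gain = without_cache / with_cache
print(round(gain, 2))                     # 9.91, i.e. nearly 10x
```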
The win is real. The question is whether the cache is affordable at scale.
4. The memory cost
For each layer, KV caching stores K and V for every token in the context:
- K ∈ ℝ^(T × dmodel)
- V ∈ ℝ^(T × dmodel)
Total KV memory is roughly:
Total KV memory ≈ 2 × C × L × dmodel × bytes
Caching is not compression. It is materialized internal state.
5. Order of magnitude example
Assume:
- C = 100,000
- L = 80
- dmodel = 8,192
- fp16 precision (2 bytes)
Per token per layer (for K and V):
2 × 8192 × 2 bytes = 32 KB
Per layer:
100,000 × 32 KB ≈ 3.2 GB
Total for 80 layers:
≈ 256 GB
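The same estimate as a script (variable names are illustrative):

```python
C, L, d_model, fp16_bytes = 100_000, 80, 8192, 2

per_token_per_layer = 2 * d_model * fp16_bytes   # one K row + one V row
per_layer = C * per_token_per_layer
total = L * per_layer

print(per_token_per_layer)   # 32768 bytes = 32 KB per token per layer
print(per_layer / 1e9)       # about 3.3 GB per layer
print(total / 1e9)           # about 260 GB across 80 layers
```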
The raw document may be a few megabytes. The KV cache is hundreds of gigabytes.
6. Modern attention variants (multi head, GQA)
In practice, dmodel = nheads × dhead. Modern models often use Grouped Query Attention (GQA), where multiple query heads share the same K and V heads.
Instead of learning a distinct Key and Value projection for every attention head, the model learns fewer K and V projections that are shared across groups of Query heads. Empirically, reducing the number of independent K and V heads results in little to no degradation in model quality, suggesting substantial redundancy in context representation.
The impact on memory is dramatic.
- Without GQA (full multi-head): KV cache ≈ 256 GB
- With GQA reducing KV heads by 8×: KV cache ≈ 32 GB
The memory footprint scales linearly with the number of KV heads. Reducing KV heads by a factor of 8 reduces KV cache size by the same factor.
A simple way to think about Grouped Query Attention (GQA) is this: we keep the Query projection at full width, but we reduce the width of the Key and Value projections.
In other words, Q stays full size, while K and V become 8× narrower. This directly reduces the size of the KV cache by the same factor, because the cache stores K and V for every token.
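A quick check of the scaling, using head counts as in LLaMA-2 70B (64 query heads sharing 8 KV heads, d_head = 128):

```python
C, L, fp16_bytes = 100_000, 80, 2
n_heads, n_kv_heads, d_head = 64, 8, 128          # d_model = 64 * 128 = 8192

full_mha = 2 * C * L * n_heads * d_head * fp16_bytes
gqa      = 2 * C * L * n_kv_heads * d_head * fp16_bytes

print(full_mha / gqa)   # 8.0 -- cache shrinks by n_heads / n_kv_heads
print(gqa / 1e9)        # about 33 GB instead of about 260 GB
```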
7. The Memory Wall: Fragmentation and Paged Attention
At scale, another issue appears: fragmentation. Requests have different context lengths, generate different token counts, and finish at different times.
KV cache requires large contiguous memory blocks. When many requests run in parallel, memory becomes fragmented. You may have enough total GPU memory, but not enough contiguous space to serve a new request. Throughput collapses.
Systems such as vLLM solve this using Paged Attention. Instead of allocating one large contiguous KV block, memory is divided into fixed-size pages, dynamically assigned and reused — similar to virtual memory in operating systems.
Paged Attention does not reduce the total size of the KV cache. It makes large-scale deployment feasible by eliminating fragmentation.
8. Chunking: compute less
Instead of optimizing how we compute over C, we reduce C.
- Split the document into chunks
- Compute an embedding per chunk
- Store embeddings
- Retrieve only relevant chunks
Instead of 100,000 tokens, you might inject 2,000. Everything scales down: KV size, memory bandwidth, and latency.
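A minimal retrieval sketch of those four steps. The embed function here is a hash-based stand-in for a real embedding model, and the chunk contents are placeholders; everything is illustrative:

```python
import hashlib
import numpy as np

def embed(text, dim=64):
    """Stand-in embedding; a real system would call an embedding model."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)            # unit vector

# Split once, embed once, store the embeddings.
chunks = [f"chunk {i}: ..." for i in range(50)]   # ~2,000 tokens each in practice
index = np.stack([embed(c) for c in chunks])

def retrieve(question, k=3):
    """Return only the k most similar chunks to inject into the prompt."""
    scores = index @ embed(question)        # cosine similarity (unit vectors)
    top = np.argsort(scores)[-k:][::-1]
    return [chunks[i] for i in top]

selected = retrieve("what does the contract say about refunds?")
print(len(selected))   # 3 chunks instead of the whole document
```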
9. Context caching APIs
Commercial systems such as Google Gemini (context caching) and Anthropic Claude (prompt caching) now expose this mechanism directly. A large document can be processed once, its internal state materialized, and subsequent queries reuse that state at much lower marginal cost. This operationalizes the equation C + NQ instead of N(C + Q).
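A hedged sketch of what such a request can look like, in the style of the Anthropic Messages API; the field names and model identifier are illustrative and may differ across API versions:

```python
# Sketch only: the large document is marked as a cacheable prefix, so its
# prefill state can be reused by later requests that share the same prefix.
request = {
    "model": "claude-example",                       # hypothetical model name
    "system": [
        {
            "type": "text",
            "text": "<the 100,000-token document goes here>",
            "cache_control": {"type": "ephemeral"},  # mark prefix as cacheable
        }
    ],
    "messages": [
        {"role": "user", "content": "First question about the document"}
    ],
}
# The first call pays the full prefill cost (roughly C); later calls sharing
# the cached prefix pay roughly only the new-question cost (Q).
```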
Conclusion
KV caching is what makes autoregressive decoding fast and practical. It eliminates the quadratic recomputation of past tokens and turns generation into a linear incremental process. Paged attention makes it deployable at scale. Chunking changes the problem by shrinking context itself.
But the theoretical compute win hides a structural tradeoff.
Caching replaces repeated computation with persistent memory. For a single long document reused many times, the gain can approach an order of magnitude. Yet the memory required to materialize that gain scales linearly with context length and layer count.
In large models with 80 layers and thousands of hidden dimensions, every additional token becomes permanent state. That state must remain electrically alive in high bandwidth memory. At 100,000 tokens, even optimized KV representations can require tens of gigabytes per session.
The implication is architectural.
A shared cache for a specific document can be economically justified. A universal per user session cache cannot.
Generalizing persistent KV state across thousands of concurrent users would multiply memory requirements into the terabyte range of high bandwidth memory. Even with Grouped Query Attention and paging strategies, the aggregate footprint becomes prohibitive.
KV caching scales in compute efficiency. It does not scale in memory economics.
This is why context caching appears in controlled APIs for document reuse, but not as an infinite session memory for everyday chat. Keeping every user’s conversational state fully materialized across dozens of layers would dramatically reduce throughput and sharply increase infrastructure cost.
The structure of modern LLMs makes this tradeoff unavoidable. Eighty layers of matrix multiplications do not disappear. They are merely shifted from recomputation into stored state.
Long context inference is therefore not just a mathematical problem. It is constrained by memory bandwidth, GPU capacity, and the economics of silicon.