KV cache, context windows, and why latency matters

Engineering notes for building responsive on-prem legal assistants.

KV cache in practice: prefill vs decode, prefix caching, and VRAM/RAM/disk tiers

This technical guide explains how KV cache works during transformer inference, why prefill and decode behave differently, how prefix caching compares to non-prefix reuse, and how to size cache tiers in VRAM, RAM, and disk for production systems.

1) What KV cache stores

During inference, transformer layers produce key and value tensors for every token. KV cache stores these tensors so the model can reuse them instead of recomputing the entire prompt at each new token. This reduces repeated work and improves latency, especially during long, iterative conversations. The core idea is widely used in production serving stacks and is fundamental to efficient LLM inference.

1.1 Matrix view (how the cache grows)

At decode step $t$, the cache holds keys and values for all previous tokens, and we append the new key/value for the latest token. The query vector is computed only for the newest token:

$$ K_{cache}^{(t)} = \begin{bmatrix} k_1 \\ k_2 \\ \vdots \\ k_t \end{bmatrix} \qquad V_{cache}^{(t)} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_t \end{bmatrix} \qquad q_t \in \mathbb{R}^{d_{head}} $$

The attention for the new token uses the full cached keys/values:

$$ \mathrm{Attention}(q_t, K_{cache}^{(t)}, V_{cache}^{(t)}) = \mathrm{softmax}\left(\frac{q_t (K_{cache}^{(t)})^{\top}}{\sqrt{d_{head}}}\right) V_{cache}^{(t)} $$

Figure: decode-time matrix flow

$$ \begin{array}{c} \textbf{Decode step } t \\ q_t \; (1 \times d_{head}) \; \cdot \; (K_{cache}^{(t)})^\top \; (d_{head} \times t) \; \Rightarrow \; s_t \; (1 \times t) \\ \mathrm{softmax}(s_t) \; (1 \times t) \; \cdot \; V_{cache}^{(t)} \; (t \times d_{head}) \; \Rightarrow \; o_t \; (1 \times d_{head}) \end{array} $$
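
To make the matrix flow concrete, here is a minimal NumPy sketch of a single KV-cached decode step for one attention head. The function and variable names are illustrative, not from any particular serving framework; the shapes follow the equations above.

import numpy as np

def decode_step(q_t, k_t, v_t, K_cache, V_cache):
    # q_t, k_t, v_t: (d_head,) projections of the newest token.
    # K_cache, V_cache: (t-1, d_head) keys/values of all previous tokens.
    d_head = q_t.shape[-1]

    # Append the newest key/value instead of recomputing the whole prompt.
    K_cache = np.vstack([K_cache, k_t[None, :]])      # (t, d_head)
    V_cache = np.vstack([V_cache, v_t[None, :]])      # (t, d_head)

    # s_t = q_t K^T / sqrt(d_head): one (1 x t) row of attention scores.
    scores = K_cache @ q_t / np.sqrt(d_head)          # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over cached positions

    # o_t = softmax(s_t) V: the output for the new token.
    return weights @ V_cache, K_cache, V_cache

# Example: head dimension 128 with 10 previously cached tokens.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((10, 128)), rng.standard_normal((10, 128))
q, k, v = rng.standard_normal((3, 128))
o_t, K, V = decode_step(q, k, v, K, V)                # caches grow to (11, 128)

In a full model this happens per layer and per KV head, which is why the cache grows linearly with context length.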

2) Prefill vs decode (and why they feel different)

Prefill is the phase where the model processes the whole input prompt to build the KV cache and generate the first output token. It is typically more compute-heavy and can be batched efficiently.

Decode is the phase where the model generates subsequent tokens one by one. Decode is sequential per request and is often memory-bound because it must read the KV cache for every step. This is why caching has a bigger impact on decode latency and throughput.

Modern serving systems explicitly separate or tune these phases because their compute and memory characteristics differ. In disaggregated designs, a prefill engine computes KV cache and then a decode engine continues generation, which highlights their different resource profiles.

2.1 Prefill = dense math, GPU-friendly

Prefill is dominated by large matrix multiplications and attention kernels over the full prompt. It scales well with batching and benefits most from high FLOPs, large HBM capacity, and high HBM bandwidth. In practice, this is where CUDA-optimized kernels and high-end NVIDIA data-center GPUs have the largest advantage, while alternative accelerator ecosystems target narrower deployment profiles.

Selected GPU platforms for prefill (examples)

GPU / platform | Memory | Bandwidth | Date (announcement)
NVIDIA H200 (Hopper) | 141 GB HBM3e | 4.8 TB/s | Nov 13, 2023
NVIDIA B200 (Blackwell family) | 192 GB HBM3e | - | Mar 18, 2024
NVIDIA Blackwell Ultra (B300 family) | 288 GB HBM3e | 8 TB/s | Aug 22, 2025
NVIDIA RTX PRO 6000 Blackwell (workstation) | 96 GB GDDR7 | 1792 GB/s | Mar 18, 2025

Blackwell family memory capacities (192 GB for Blackwell and 288 GB for Blackwell Ultra) and Blackwell Ultra bandwidth are documented in NVIDIA's Blackwell Ultra technical blog. The Blackwell platform announcement date is from NVIDIA's March 18, 2024 press release. H200 capacity and bandwidth come from NVIDIA's H200 press release. RTX PRO 6000 memory and bandwidth are from The Verge's announcement report.

System note (B300): NVIDIA's HGX B300 platform integrates eight Blackwell Ultra GPUs with 2.1 TB total memory and 1.8 TB/s GPU-to-GPU NVLink bandwidth per GPU in the system spec table.

2.2 Decode = memory latency and bandwidth

Decode reads KV cache on every token step. Even with fast compute, latency is dominated by memory access and data movement. This is why KV cache placement (VRAM vs RAM vs disk) and memory bandwidth are central to responsive inference.

CPU and SoC memory bandwidth (theoretical per socket or package)

Platform | Memory channels / data rate | Theoretical peak bandwidth | Date
AMD EPYC Milan (3rd Gen) | 8 channels DDR4-3200 (25.6 GB/s per channel) | 204.8 GB/s | Mar 15, 2021
AMD EPYC Genoa (4th Gen) | 12 channels DDR5-4800 (38.4 GB/s per channel) | 460.8 GB/s | Nov 10, 2022
AMD EPYC Turin (5th Gen) | 12 channels DDR5-6400 (51.2 GB/s per channel) | 614.4 GB/s | Oct 10, 2024
Intel Xeon 6 (P-cores) | Up to 12 channels DDR5-6400 | 614.4 GB/s (theoretical) | Sep 24, 2024
Intel 4th Gen Xeon (Sapphire Rapids) | 8 channels DDR5-4800 (38.4 GB/s per channel) | 307.2 GB/s | Jan 10, 2023
Apple M3 Ultra (Mac Studio) | Unified memory (up to 512 GB) | 800+ GB/s | Mar 5, 2025
NVIDIA DGX Spark (GB10) | 128 GB LPDDR5X | 273 GB/s | Mar 18, 2025
AMD Strix Halo (Ryzen AI Max+ 395) | 128 GB LPDDR5X-8000 (256-bit) | 256 GB/s | Jan 6, 2025

Notes: DDR bandwidth values above are theoretical (channels x per-channel GB/s). Kingston provides the per-channel transfer rate figures and channel counts for Milan, Genoa, and Turin. Intel Xeon 6 supports up to 12 channels of DDR5-6400 and is offered in one-, two-, four-, and eight-socket server configurations, which scale aggregate bandwidth but add inter-socket latency (Xeon 6 product brief). Intel's 4th Gen Xeon product brief specifies 8 channels of DDR5 up to 4800 MT/s, and Intel's launch press release provides the Jan 10, 2023 date. Apple lists up to 512 GB unified memory for M3 Ultra and reports over 800 GB/s bandwidth. DGX Spark specs list 128 GB LPDDR5X and 273 GB/s bandwidth. Strix Halo bandwidth is based on LPDDR5X-8000 at 256-bit and is cited as 256 GB/s.
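
To see why these bandwidth figures matter for decode, here is a back-of-envelope lower bound on per-token latency from memory traffic alone: every generated token must stream the model weights plus the accumulated KV cache from memory. The workload numbers below are illustrative assumptions (an 8B-parameter model in fp16 and the 32k-token Qwen3-8B cache from Section 5.3); real systems batch requests and overlap transfers with compute, so treat this as an order-of-magnitude estimate, not a benchmark.

# Lower bound on decode latency: bytes streamed per token / memory bandwidth.
def decode_latency_ms(weights_gb: float, kv_cache_gb: float, bandwidth_gb_s: float) -> float:
    return (weights_gb + kv_cache_gb) / bandwidth_gb_s * 1000.0

# Assumed workload: ~16 GB of fp16 weights (8B params) + ~4.4 GB KV cache (32k tokens).
print(decode_latency_ms(16.0, 4.4, 4800.0))   # H200-class HBM (4.8 TB/s): ~4 ms per token
print(decode_latency_ms(16.0, 4.4, 614.4))    # 12-channel DDR5-6400 socket: ~33 ms per token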

2.3 CPU instruction acceleration (AMX and AVX-512)

For CPU-side prefill, embedding, or reranking workloads, instruction-set acceleration matters:

  • Intel AMX adds matrix tiles and TMUL instructions for deep learning workloads on Xeon CPUs.
  • Intel AVX-512 provides wide 512-bit SIMD for vector-heavy math.
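
A quick way to check whether a Linux host exposes these instruction sets is to read the CPU flags; the flag names below (avx512f, amx_tile, amx_bf16, amx_int8) are the ones recent kernels typically report, so adjust them if your platform differs.

# Check /proc/cpuinfo for AVX-512 and AMX support (Linux only).
def cpu_flags() -> set:
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx512f", "amx_tile", "amx_bf16", "amx_int8"):
    print(feature, feature in flags)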

3) Prefix caching vs non-prefix reuse (CacheBlend-style)

Prefix caching stores KV for the beginning of a prompt. It is effective for chat history reuse and repeated system prompts, but it only helps when a later request shares the same prefix. Disk-based context caching also benefits most when large parts of the input repeat across requests; for example, DeepSeek's context caching only treats repeated prefix segments as cache hits.

Non-prefix reuse caches KV for reusable chunks that may appear anywhere inside the prompt (for example, a reused court decision or doctrine excerpt inserted between other sections). LMCache explicitly supports reuse of non-prefix text segments across requests and instances, which makes it useful for RAG-style workflows and long legal documents.
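
To make the difference concrete, the sketch below contrasts the two keying schemes. It is illustrative only, not LMCache's or any engine's actual API: prefix caching keys each block by a hash of everything up to and including that block, so a hit requires an identical prefix, while chunk-level (non-prefix) reuse keys each chunk by its own content, so a reused excerpt can hit the cache wherever it appears in the prompt.

import hashlib

def _digest(token_ids: list) -> str:
    return hashlib.sha256(" ".join(map(str, token_ids)).encode()).hexdigest()[:16]

def prefix_block_keys(token_ids: list, block_size: int = 256) -> list:
    # Block i is keyed by ALL tokens up to the end of block i, so any change
    # earlier in the prompt invalidates every later block.
    return [_digest(token_ids[:end])
            for end in range(block_size, len(token_ids) + 1, block_size)]

def chunk_keys(chunks: list) -> list:
    # Each chunk (e.g. a reused court decision) is keyed by its own content,
    # independent of where it sits in the assembled prompt.
    return [_digest(chunk) for chunk in chunks]

Note that naively reusing a chunk's KV ignores attention between that chunk and the rest of the prompt; CacheBlend-style blending recomputes a small subset of tokens to compensate, which is what makes non-prefix reuse practical.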

4) Three tiers of KV cache storage (and why retrieval latency matters)

A practical deployment uses tiered storage so the cache can be larger than GPU memory:

$$ \begin{array}{c} \textbf{VRAM (fastest)} \\ \downarrow \; \text{lower latency} \\ \textbf{CPU RAM (hot cache)} \\ \downarrow \; \text{higher latency} \\ \textbf{Disk (largest)} \\ \end{array} $$

  1. GPU VRAM (fastest, lowest retrieval latency)
     • Holds active KV blocks for ongoing decode.
     • Lowest latency, but limited capacity.

  2. CPU RAM (hot cache, moderate retrieval latency)
     • Holds recently used KV blocks.
     • Often uses pinned (page-locked) memory to speed GPU transfers.
     • Acts as a hot cache when used with disk or remote storage.

  3. Local disk (largest, highest retrieval latency)
     • Stores KV blocks for long documents or many sessions.
     • Typically asynchronous to avoid blocking inference.
     • Can prefetch into RAM to hide latency.

LMCache documents this multi-tier design across GPU, CPU DRAM, and local disk, with CPU RAM as a hot cache and disk as a large but slower tier. The CPU hot cache plus prefetch is specifically meant to reduce the latency impact of disk or remote retrievals.
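
The lookup policy implied by this design can be sketched in a few lines. This is an illustrative LRU toy, not LMCache's implementation: look up VRAM first, then RAM, then disk, promote a block on a hit, and push evicted blocks down one tier.

from collections import OrderedDict

class TieredKVCache:
    # Illustrative three-tier store; real systems move GPU tensors, use pinned
    # host memory, and write to disk asynchronously with prefetching.
    def __init__(self, vram_capacity: int, ram_capacity: int):
        self.vram = OrderedDict()     # smallest, fastest
        self.ram = OrderedDict()      # hot cache in CPU RAM
        self.disk = {}                # largest, slowest (stand-in for files)
        self.vram_capacity, self.ram_capacity = vram_capacity, ram_capacity

    def get(self, block_id):
        for tier in (self.vram, self.ram, self.disk):
            if block_id in tier:
                kv = tier.pop(block_id) if tier is not self.disk else tier[block_id]
                self.put(block_id, kv)          # promote on hit
                return kv
        return None                             # miss: this span must be re-prefilled

    def put(self, block_id, kv):
        self.vram[block_id] = kv
        if len(self.vram) > self.vram_capacity:              # spill LRU block to RAM
            old_id, old_kv = self.vram.popitem(last=False)
            self.ram[old_id] = old_kv
            if len(self.ram) > self.ram_capacity:            # spill coldest block to disk
                cold_id, cold_kv = self.ram.popitem(last=False)
                self.disk[cold_id] = cold_kv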

5) KV cache sizing with real numbers

Use the model parameters to compute GB per 1000 tokens and scale to your context length. The sizing method is the same across models and storage tiers.

5.0 How the size is computed

Use model parameters (hidden size, attention heads, layers, KV heads) and the element size of the KV dtype (2 bytes for float16/bfloat16), then apply:

head_size = hidden_size / num_attention_heads
total_elements = 2 * num_hidden_layers * tokens * num_key_value_heads * head_size
total_bytes = total_elements * dtype_size
KV_cache_GB = total_bytes / (1024^3)
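
The same computation as a small runnable helper, assuming 2 bytes per element (float16/bfloat16). The Qwen3-8B parameters in the example are taken from the model configuration referenced in the sources; with them the helper reproduces the 0.1373 GB per 1000 tokens figure listed in Section 5.1.

def kv_cache_gb(hidden_size: int, num_attention_heads: int, num_hidden_layers: int,
                num_key_value_heads: int, tokens: int, dtype_size: int = 2) -> float:
    # dtype_size = 2 for float16/bfloat16, 1 for an fp8 KV cache.
    head_size = hidden_size // num_attention_heads
    total_elements = 2 * num_hidden_layers * tokens * num_key_value_heads * head_size
    return total_elements * dtype_size / 1024**3

# Qwen/Qwen3-8B: hidden_size=4096, 32 attention heads, 36 layers, 8 KV heads.
print(round(kv_cache_gb(4096, 32, 36, 8, 1000), 4))     # 0.1373 GB per 1000 tokens
print(round(kv_cache_gb(4096, 32, 36, 8, 40960), 2))    # ~5.62 GB at a 40,960-token context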

5.1 Reference sizes (computed from model parameters)

  • Qwen/Qwen3-8B: 0.1373 GB per 1000 tokens
  • Qwen/Qwen3-32B (tp=2 on H100): 0.2441 GB per 1000 tokens
  • meta-llama/Llama-3.1-70B (tp=4 on H100): 0.3052 GB per 1000 tokens
  • deepseek-ai/DeepSeek-V3 (float16): 1.6289 GB per 1000 tokens
  • deepseek-ai/DeepSeek-R1 (float16): 1.6289 GB per 1000 tokens

5.2 Formula

KV_cache_GB = (GB_per_1000_tokens) * (tokens / 1000)

5.3 Examples (GB)

Qwen/Qwen3-8B (0.1373 GB / 1k tokens)

  • 4,000 tokens: 0.5492 GB
  • 16,000 tokens: 2.1968 GB
  • 32,000 tokens: 4.3936 GB
  • 40,960 tokens: 5.6238 GB

Qwen/Qwen3-32B (0.2441 GB / 1k tokens)

  • 4,000 tokens: 0.9764 GB
  • 16,000 tokens: 3.9056 GB
  • 32,000 tokens: 7.8112 GB
  • 40,960 tokens: 9.9983 GB

Llama-3.1-70B (0.3052 GB / 1k tokens)

  • 4,000 tokens: 1.2208 GB
  • 16,000 tokens: 4.8832 GB
  • 32,000 tokens: 9.7664 GB
  • 131,072 tokens: 40.0032 GB

DeepSeek-V3 / DeepSeek-R1 (1.6289 GB / 1k tokens, float16)

  • 4,000 tokens: 6.5155 GB
  • 16,000 tokens: 26.0620 GB
  • 32,000 tokens: 52.1240 GB
  • 40,960 tokens: 66.7188 GB
  • 131,072 tokens: 213.5000 GB

These values are computed from model parameters by scaling the per-1000-token size to the target context length.

6) Practical tiering guidance

  • VRAM is the critical path for active decoding. If VRAM fills up, KV blocks must be evicted, and the next request may require re-prefill, which increases latency.
  • CPU RAM should be enabled as a hot cache. Pinned memory improves GPU transfer speed and allows LRU eviction without blocking the GPU.
  • Disk provides a large store for long or cold contexts. Asynchronous writes and prefetch can reduce the perceived latency.

7) DeepSeek context caching on disk (why it matters for KV)

DeepSeek provides a concrete, production-grade example of disk-based context caching: its API offers context caching on disk, detecting repeated input segments and serving them from cache to reduce recomputation and latency. DeepSeek reports first-token latency reductions on long prompts (for example, a 128K prompt reduced from 13s to 500ms on cache hits), and documents that only repeated prefix parts count as cache hits.

The model configuration for DeepSeek-V3 and DeepSeek-R1 uses:

  • hidden_size = 7168
  • num_attention_heads = 128
  • num_hidden_layers = 61
  • num_key_value_heads = 128

These parameters drive the formula in Section 5.0 and yield the per-1k token sizes listed above.

For DeepSeek KV cache sizing, compute the GB per 1000 tokens value from these parameters and apply the formula in Section 5.2 to get VRAM/RAM/disk requirements for your context length. This gives you the tiered storage plan for VRAM (hot decode), RAM (hot cache), and disk (long-term reuse).
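
As a worked example (generic formula from Section 5.0, float16 assumed), the parameters above reproduce the DeepSeek per-1k figure and the 131,072-token total from Section 5.3:

# DeepSeek-V3 / DeepSeek-R1 sizing from the parameters listed above.
head_size = 7168 // 128                              # hidden_size / num_attention_heads = 56
per_1k_bytes = 2 * 61 * 1000 * 128 * head_size * 2   # K and V * layers * tokens * KV heads * head_size * fp16 bytes
per_1k_gb = per_1k_bytes / 1024**3
print(round(per_1k_gb, 4))                           # 1.6289 GB per 1000 tokens
print(round(per_1k_gb * 131072 / 1000, 1))           # 213.5 GB at a 131,072-token context (Section 5.2 scaling)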

8) Design implications for serving systems

Because prefill is compute-bound and decode is memory-bound, production systems often:

  • Batch prefill aggressively.
  • Keep decode latency low by maximizing KV reuse.
  • Use tiered caches to avoid re-prefill on long prompts.
  • Separate prefill and decode engines when workloads are heavy.
  • When models exceed a single GPU, plan for tensor and pipeline parallelism. See /blog/tensor-pipeline-parallelism.

9) Summary

  • KV cache stores per-token keys and values and dramatically reduces repeated computation.
  • Prefill and decode behave differently, and infrastructure should reflect that.
  • Prefix caching helps with repeated prefixes; non-prefix reuse helps with RAG-style chunk reuse.
  • Tiered storage (VRAM, RAM, disk) is essential for long-context and multi-user workloads.
  • Use model-parameter values to estimate cache size in GB for your context lengths.

Sources

  • https://raw.githubusercontent.com/LMCache/LMCache/refs/heads/dev/examples/kv_cache_calculator/modelconfig.json
  • https://docs.lmcache.ai/getting_started/faq.html
  • https://docs.lmcache.ai/developer_guide/architecture.html
  • https://docs.lmcache.ai/kv_cache/cpu_ram.html
  • https://docs.lmcache.ai/kv_cache/local_storage.html
  • https://docs.lmcache.ai/kv_cache_optimizations/blending.html
  • https://api-docs.deepseek.com/news/news0802/
  • https://api-docs.deepseek.com/guides/kv_cache
  • https://investor.nvidia.com/news/press-release-details/2023/NVIDIA-Supercharges-Hopper-the-Worlds-Leading-AI-Computing-Platform/default.aspx
  • https://investor.nvidia.com/news/press-release-details/2024/NVIDIA-Blackwell-Platform-Arrives-to-Power-a-New-Era-of-Computing/default.aspx
  • https://developer.nvidia.com/blog/inside-nvidia-blackwell-ultra-the-chip-powering-the-ai-factory-era/
  • https://www.nvidia.com/en-us/data-center/hgx/
  • https://www.theverge.com/news/631868/nvidia-rtx-pro-6000-blackwell-gpu-professionals
  • https://www.kingston.com/en/memory/server-memory/milan
  • https://www.kingston.com/en/memory/server-memory/genoa
  • https://www.kingston.com/en/memory/server-memory/turin
  • https://www.intel.com/content/www/us/en/developer/articles/technical/4th-gen-intel-xeon-scalable-processors-product-brief.html
  • https://www.intel.com/pressroom/news/intel-launches-fourth-gen-xeon-scalable-processors/
  • https://www.intel.com/content/www/us/en/content-details/853732/intel-xeon-6-processor-with-performance-cores-product-brief.html
  • https://www.intel.com/pressroom/news/press-kits/intel-xeon-6-processor/
  • https://www.intel.com/content/www/us/en/products/docs/accelerator-engines/what-is-intel-amx.html
  • https://www.intel.com/content/www/us/en/developer/articles/technical/intel-avx-512-instructions.html
  • https://www.apple.com/newsroom/2025/03/apple-introduces-m3-ultra/
  • https://www.nvidia.com/en-us/dgx/dgx-spark/
  • https://docs.nvidia.com/dgx/dgxspark-user-guide/hardware-specs.html
  • https://nvidianews.nvidia.com/news/nvidia-announces-dgx-spark
  • https://www.techpowerup.com/cpu-specs/ryzen-ai-max-395.c3719
  • https://www.theverge.com/2025/1/6/24337295/amd-ryzen-ai-max-395-strix-halo-laptop-cpu-ces-2025
  • https://www.amd.com/en/newsroom/press-releases/2021-3-8-amd-to-host-digital-launch-of-3rd-gen-amd-epyc-pr.html
  • https://www.amd.com/en/newsroom/press-releases/2022-11-10-offering-unmatched-performance-leadership-energy-.html
  • https://ir.amd.com/news-events/press-releases/detail/1219/amd-launches-5th-gen-amd-epyc-cpus-maintaining-leadership
