AI/ML

LLM Inference Demystified: PagedAttention, KV Cache, MoE & Continuous Batching

The 5 key concepts every cloud architect should know about LLM serving: PagedAttention, KV cache mechanics, continuous batching, MoE trade-offs, and real production numbers.

Alexandre Agius

AWS Solutions Architect

11 min read

Deploying LLMs in production requires understanding five concepts that rarely get explained together: KV cache, PagedAttention, continuous batching, Mixture of Experts, and the metrics that tell you if your serving setup is actually performing. This post covers all five with real numbers and practical intuition.

The Problem

You have a model (say Mixtral 8x7B or Mistral Large) and you need to serve it to users. You spin up a GPU instance, load the model, and send a request. It works. Then you send 10 concurrent requests and everything grinds to a halt. Or worse, you run out of GPU memory on a single long conversation.

The issues:

  • Memory waste: The KV cache for each request pre-allocates massive contiguous memory blocks, most of which sit unused. With naive allocation, a single H100 (80GB) might only serve 4-8 concurrent sequences.
  • Throughput bottleneck: Static batching waits for the slowest sequence in the batch to finish before starting new requests. A 500-token response blocks a 20-token response.
  • Hidden model size: Mixtral 8x7B sounds smaller than a 70B model, but it loads roughly 47B parameters into memory (the experts share attention layers, so it is less than 8 × 7B, yet far more than 7B). Only ~13B are active per token. The name is misleading if you're sizing GPUs.
  • Wrong metrics: You measure latency in milliseconds, but your users experience time-to-first-token (TTFT) and tokens-per-second (TPS). A system with great average latency can still feel slow.

These aren't edge cases. They're the default failure modes when deploying LLMs beyond prototyping.

The Solution

Five concepts, each solving a specific piece of the inference puzzle:

  1. KV Cache: avoid recomputing attention over the full sequence at every token
  2. PagedAttention: manage KV cache memory like an OS manages virtual memory
  3. Continuous Batching: let requests join and leave the batch dynamically
  4. Mixture of Experts (MoE): activate only a fraction of the model per token
  5. Production Metrics: measure what actually matters for the user experience

Together, they form the inference pipeline that production serving engines like vLLM implement:

[Figure: LLM inference pipeline showing how requests flow through continuous batching, tokenization, MoE routing, attention with a PagedAttention-managed KV cache, and output sampling]

How It Works

KV Cache: Why Inference Is Memory-Bound

The transformer's self-attention mechanism computes three vectors for each token: Query (Q), Key (K), and Value (V). During generation, every new token needs to attend to all previous tokens in the sequence. Without caching, you'd recompute K and V for the entire history at every step.

The KV cache stores previously computed K and V vectors so they're only calculated once:

Step 1: "The"          → compute K₁, V₁, store in cache
Step 2: "The cat"      → compute K₂, V₂, read K₁V₁ from cache
Step 3: "The cat sat"  → compute K₃, V₃, read K₁V₁, K₂V₂ from cache
...
Step N: compute Kₙ, Vₙ, read all previous from cache
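
In code, a single decode step with a cache looks roughly like this. It's a toy single-head sketch in NumPy, not a real implementation; the cache is just a growing list of K and V vectors:

```python
import numpy as np

def decode_step(q_new, k_new, v_new, kv_cache):
    """One decode step for a single attention head (toy sketch).
    Only the new token's K and V are computed; older ones are read from the cache."""
    kv_cache["k"].append(k_new)        # computed once, reused at every later step
    kv_cache["v"].append(v_new)
    K = np.stack(kv_cache["k"])        # (seq_len, head_dim)
    V = np.stack(kv_cache["v"])
    scores = K @ q_new / np.sqrt(q_new.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax over all cached positions
    return weights @ V                 # attention output for the new token

cache = {"k": [], "v": []}
for _ in range(3):                     # "The", "cat", "sat"
    q = k = v = np.random.randn(128)   # stand-ins for the projected Q/K/V vectors
    out = decode_step(q, k, v, cache)
print(len(cache["k"]))                 # 3 cached positions, nothing recomputed
```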

This turns a quadratic computation into a linear one. The trade-off: memory. For a model like Llama 2 70B (which uses grouped-query attention) at a 4096-token context, the KV cache for a single sequence takes roughly 1.3GB of GPU memory. Scale to 32 concurrent sequences and you need over 40GB just for KV cache, on top of the model weights themselves, which quickly exhausts an H100's 80GB.

The KV cache size per token, summed over all layers:

KV size per token = 2 × num_layers × num_kv_heads × head_dim × precision_bytes

The factor of 2 covers K and V. With grouped-query attention (GQA), num_kv_heads is much smaller than the number of attention heads, which is why the models below have relatively lean caches.
| Model | KV Cache per Token (FP16) | At 4K Context | At 32K Context |
|---|---|---|---|
| Mistral 7B (32 layers, GQA) | ~0.13 MB | ~0.5 GB | ~4 GB |
| Llama 2 70B (80 layers, GQA) | ~0.31 MB | ~1.3 GB | ~10 GB |
| Mixtral 8x7B (32 layers, GQA) | ~0.13 MB | ~0.5 GB | ~4 GB |
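
The formula is easy to sanity-check in a few lines. A minimal sketch; the layer, head, and dimension values below are assumptions taken from the public model configs:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, precision_bytes: int = 2) -> int:
    """2 (K and V) x layers x KV heads x head_dim x bytes per value (FP16 by default)."""
    return 2 * num_layers * num_kv_heads * head_dim * precision_bytes

configs = {
    "Mistral 7B (GQA)":  dict(num_layers=32, num_kv_heads=8, head_dim=128),
    "Llama 2 70B (GQA)": dict(num_layers=80, num_kv_heads=8, head_dim=128),
}

for name, cfg in configs.items():
    per_token = kv_bytes_per_token(**cfg)
    print(f"{name}: {per_token / 2**20:.2f} MB/token, "
          f"{per_token * 4096 / 2**30:.2f} GB at 4K context")
```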

The problem isn't the cache itself; it's how it's allocated.

PagedAttention: Virtual Memory for GPU

Traditional KV cache allocation works like old-school C malloc: each sequence gets a pre-allocated contiguous block sized for the maximum possible sequence length. If your max length is 4096 tokens but the sequence only uses 200, you waste 95% of that allocation. And since the blocks must be contiguous, you get memory fragmentation: free space exists but isn't usable.

PagedAttention (introduced by the vLLM paper) applies the same insight that revolutionized operating systems: virtual memory with paging.

Instead of one contiguous block per sequence:

  1. GPU memory is divided into fixed-size blocks (e.g., 16 tokens each)
  2. Each sequence gets a block table mapping logical positions to physical blocks
  3. Blocks are allocated on demand; a new block is claimed only when the previous one fills up
  4. Blocks don't need to be contiguous in physical memory
  5. Completed sequences free their blocks immediately for reuse

Traditional allocation:
  Sequence A: [████████████████░░░░░░░░░░░░░░░░]  (50% wasted)
  Sequence B: [████████░░░░░░░░░░░░░░░░░░░░░░░░]  (75% wasted)
  Sequence C: [cannot allocate: no contiguous block available]

PagedAttention:
  Block pool: [A₁][A₂][B₁][A₃][B₂][free][free][free]...
  Sequence A: table → blocks 0,1,3       (exact fit)
  Sequence B: table → blocks 2,4         (exact fit)
  Sequence C: table → blocks 5,6         (fits in free blocks!)
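
A toy block-table allocator makes the bookkeeping concrete. This is a sketch of the idea, not vLLM's actual implementation; the 16-token block size matches the diagram above:

```python
import math

BLOCK_SIZE = 16  # tokens per physical KV block

class BlockManager:
    """Toy PagedAttention-style allocator: per-sequence block tables over a shared pool."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))     # physical block ids, need not be contiguous
        self.block_tables: dict[str, list[int]] = {}   # sequence id -> logical-to-physical mapping

    def grow(self, seq_id: str, seq_len: int) -> None:
        """Allocate new physical blocks only when the sequence actually needs them."""
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) < math.ceil(seq_len / BLOCK_SIZE):
            table.append(self.free_blocks.pop(0))

    def release(self, seq_id: str) -> None:
        """A finished sequence returns its blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

mgr = BlockManager(num_blocks=5)
mgr.grow("A", seq_len=40)     # sequence A needs 3 blocks
mgr.grow("B", seq_len=20)     # sequence B needs 2 blocks
mgr.release("A")              # A finishes; its blocks are reusable right away
mgr.grow("C", seq_len=30)     # C fits in the blocks A just freed
print(mgr.block_tables)       # {'B': [3, 4], 'C': [0, 1]}
```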

The results are dramatic:

| Metric | Traditional | PagedAttention |
|---|---|---|
| Memory waste | 60-80% | <4% |
| Concurrent sequences (H100) | 4-8 | 16-32+ |
| Throughput | Baseline | 2-4x higher |

This is why vLLM became the de facto serving engine. PagedAttention isn't an incremental optimization; it's a paradigm shift in how GPU memory is managed for inference.

Continuous Batching: No More Waiting

Static batching groups N requests together and processes them as a batch. The problem: all requests in the batch must wait for the longest sequence to complete before any new request can start.

Static batching:
  Time →
  Req A (20 tokens):   [████████]░░░░░░░░░░░░░░  ← done but waiting
  Req B (50 tokens):   [████████████████████████]  ← batch completes here
  Req C (queued):       ░░░░░░░░░░░░░░░░░░░░░░░░ [starts after B finishes]

Continuous batching:
  Time →
  Req A (20 tokens):   [████████]
  Req B (50 tokens):   [████████████████████████]
  Req C:                         [██████████████]  ← joins immediately when A exits

Continuous batching (also called iteration-level scheduling) checks after every decoding step: has any sequence finished? If yes, evict it and pull the next request from the queue. This means:

  • Short responses don't wait for long ones
  • The GPU stays saturated, with no idle cycles between batches
  • TTFT for queued requests drops dramatically

In practice, continuous batching improves throughput by 2-3x over static batching with the same hardware, because the GPU never sits idle waiting for a slow sequence to finish.
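
The scheduler's inner loop, reduced to its essence. A sketch only: `engine.step()` and `engine.is_finished()` are hypothetical stand-ins for a model runner, and real engines also juggle prefill scheduling, preemption, and KV-block availability:

```python
from collections import deque

def serve(engine, waiting: deque, max_running: int = 32) -> None:
    """Iteration-level scheduling: re-evaluate the batch after every single decode step."""
    running: list = []
    while waiting or running:
        # Admit queued requests whenever a slot is free -- no waiting for the batch to drain
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())

        engine.step(running)  # one forward pass, one new token per running sequence

        # Evict finished sequences immediately so their slot (and KV blocks) can be reused
        running = [seq for seq in running if not engine.is_finished(seq)]
```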

Mixture of Experts: Sparse but Heavy

MoE models like Mixtral 8x7B use a clever trick: instead of one massive feed-forward network (FFN), they have multiple smaller "expert" FFNs. A learned router selects the top-K experts (usually 2) for each token.

Standard transformer (dense):
  Token → Attention → FFN (all parameters active) → Output

MoE transformer (sparse):
  Token → Attention → Router → Expert 3 + Expert 7 (2 of 8 active) → Output

The appeal: Mixtral 8x7B has the quality of a ~40-50B dense model but only activates ~13B parameters per token (2 of its 8 experts per layer, plus the shared attention weights). That means lower inference compute cost per token.
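
The router itself is tiny: a linear layer scores each expert and only the top-k expert FFNs run for that token. A minimal per-token sketch in NumPy, not Mixtral's actual implementation:

```python
import numpy as np

def moe_ffn(x, router_weights, experts, top_k=2):
    """Top-k expert routing for one token (toy sketch).
    x: (d_model,) hidden state; router_weights: (num_experts, d_model); experts: list of callables."""
    logits = router_weights @ x                  # one score per expert
    chosen = np.argsort(logits)[-top_k:]         # indices of the top-k experts
    gate = np.exp(logits[chosen] - logits[chosen].max())
    gate /= gate.sum()                           # softmax over the selected experts only
    # Only the chosen expert FFNs execute, but every expert's weights must already sit in GPU memory
    return sum(g * experts[i](x) for g, i in zip(gate, chosen))
```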

The catch that trips up every architect sizing GPUs:

| What You Might Expect | Reality |
|---|---|
| "8x7B = smaller than 70B" | ~47B total parameters loaded in GPU memory |
| "Only 2 experts active = low memory" | All 8 experts must be resident; the router decides at runtime |
| "Cheaper than dense 70B" | Cheaper per token (less compute), but a similar memory footprint |

MoE models are compute-efficient (fewer FLOPs per token) but not memory-efficient (the full model sits in VRAM). When sizing infrastructure:

  • GPU memory: plan for the full parameter count (~47B for Mixtral 8x7B ≈ ~94GB in FP16, ~47GB in INT8)
  • Compute: only ~25-30% of parameters are active, so inference is faster than an equivalently-sized dense model
  • Bandwidth: expert selection requires reading different weights per token, which can bottleneck on memory bandwidth

For serving, this means MoE models benefit heavily from quantization (INT8 or INT4) to fit in a single GPU, and from tensor parallelism across multiple GPUs when they don't fit.
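
A back-of-the-envelope sizing check for the weights alone (using the ~47B total parameter count; real deployments also need headroom for KV cache, activations, and framework overhead):

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """VRAM needed just to hold the weights, ignoring KV cache and activations."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):  # FP16, INT8, INT4
    print(f"Mixtral 8x7B (~47B params) at {bits}-bit: ~{weight_memory_gb(47, bits):.0f} GB")
# ~94 GB, ~47 GB, ~24 GB -- FP16 needs multiple GPUs; INT8 and INT4 fit a single
# 80GB H100 with room left over for KV cache
```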

Production Metrics: Measuring What Matters

The standard metrics, latency and throughput, don't capture the user experience of LLM inference. Here's what to actually measure:

| Metric | What It Measures | Why It Matters |
|---|---|---|
| TTFT (Time to First Token) | Delay before the first token appears | User-perceived responsiveness; includes queue wait + prefill time |
| TPS (Tokens per Second) | Output generation speed | Reading speed; too slow and users notice the "drip" |
| Throughput (Requests per Second) | System-level capacity | How many concurrent users the system handles |
| P50 / P95 / P99 latency | Latency distribution | P50 is typical; P99 is the tail that causes complaints |
| TPOT (Time per Output Token) | Time between consecutive tokens | Inverse of TPS; the generation cadence |

The relationship between these:

Total latency = TTFT + (output_tokens × TPOT)
TPS = 1 / TPOT
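
Measured from the client side, these fall out of a few timestamps. A sketch only; `stream_tokens` is a hypothetical generator that yields output tokens as they arrive (for example, wrapping an SSE stream):

```python
import time

def measure_stream(stream_tokens, prompt: str) -> dict:
    """Record TTFT, TPOT, and TPS for one streamed generation."""
    start = time.perf_counter()
    arrivals = [time.perf_counter() for _ in stream_tokens(prompt)]

    ttft = arrivals[0] - start                   # queue wait + prefill
    decode = arrivals[-1] - arrivals[0]          # pure generation time
    tpot = decode / max(len(arrivals) - 1, 1)    # time per output token
    return {"ttft_s": ttft, "tpot_s": tpot,
            "tps": (1.0 / tpot) if tpot > 0 else float("inf")}
```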

TTFT is dominated by the prefill phase, which processes the entire input prompt in one forward pass. Long prompts (RAG with 10K context) have higher TTFT regardless of output length. This is why chunked prefill exists: split the input processing across multiple steps to avoid one massive initial delay.

Real-world targets:

| Use Case | TTFT Target | TPS Target |
|---|---|---|
| Chatbot (interactive) | < 500ms | > 30 TPS |
| Code completion | < 200ms | > 50 TPS |
| Batch processing | Don't care | Maximize throughput |
| Streaming API | < 1s | > 15 TPS |

API vs Self-Hosted: The Break-Even

When does self-hosting make financial sense over API calls?

Using Mistral Large as an example:

| | API (Mistral) | Self-Hosted (H100) |
|---|---|---|
| Cost per 1M input tokens | $0.50 | ~$0.15 (amortized) |
| Cost per 1M output tokens | $1.50 | ~$0.45 (amortized) |
| Monthly cost at 1M tokens/month | ~$2 | ~$3,500 (H100 instance) |
| Monthly cost at 30M tokens/month | ~$45 | ~$3,500 |
| Monthly cost at 100M tokens/month | ~$150 | ~$3,500 |

At these prices, the pure-cost break-even sits far above the volumes in the table: a dedicated H100 at ~$3,500/month only pays for itself once your API bill would exceed that, which works out to roughly two billion output tokens per month. Below that, the API wins on cost. Above it, self-hosting wins, and the gap widens with volume.
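
The arithmetic is simple enough to sanity-check yourself; the numbers below are the illustrative prices from the table above, not current list prices:

```python
def breakeven_tokens_per_month(gpu_cost_per_month: float = 3_500.0,
                               api_price_per_million: float = 1.50) -> float:
    """Monthly token volume at which API spend matches one dedicated GPU's fixed cost."""
    return gpu_cost_per_month / api_price_per_million * 1_000_000

print(f"{breakeven_tokens_per_month():,.0f} tokens/month")   # ~2.3 billion at these prices
```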

But cost isn't the only factor:

| Factor | API | Self-Hosted |
|---|---|---|
| Data sovereignty | Data leaves your infra | Full control |
| Latency | Network hop + queue | Direct GPU access |
| Customization | Limited (temperature, top-p) | Full control (quantization, batching config) |
| Ops burden | Zero | Significant (GPU monitoring, model updates, scaling) |
| Scaling | Instant | Provision + deploy |

The practical rule: Start with API. Move to self-hosted when your sustained API spend rivals the cost of dedicated GPU capacity AND you have a team that can operate GPU infrastructure. Data sovereignty requirements can override the cost threshold.

What I Learned

  • KV cache is the memory bottleneck, not model weights. For long-context inference, KV cache can consume more GPU memory than the model itself. PagedAttention doesn't just optimize; it fundamentally changes how many concurrent users a single GPU can serve (from 4-8 to 16-32+). This is why vLLM dominates production serving.

  • MoE model names are misleading for infrastructure planning. Mixtral "8x7B" loads roughly 47B parameters into GPU memory, not 7B. The sparsity saves compute per token but doesn't save memory. Every architect I've seen sizes MoE GPUs wrong on the first attempt. Plan for the full parameter count, then appreciate the compute savings.

  • Continuous batching is the single highest-impact optimization. PagedAttention gets the attention (no pun intended), but continuous batching's 2-3x throughput improvement over static batching is what makes the economics work. It's the difference between needing 3 GPUs and 1 GPU for the same workload.

  • TTFT and TPS matter more than average latency. Users perceive a 200ms TTFT with 30 TPS streaming as fast, even if total latency is 5 seconds for a 150-token response. A system with 500ms average latency but 3-second P99 TTFT feels broken. Measure what the user experiences, not what the system reports.

  • The API vs self-hosted break-even is higher than most people think. At per-token API prices like those above, you need on the order of billions of tokens per month before self-hosting wins on raw cost, and the operational burden is real: GPU monitoring, model versioning, quantization tuning, scaling policies. The right move for most teams is API until proven otherwise.

What's Next

  • Benchmark vLLM vs TGI vs Triton on the same model (Mistral 7B) with identical hardware; measure TTFT, TPS, and max concurrent sequences
  • Deep dive into quantization techniques (INT8, INT4, GPTQ, AWQ): how much quality do you actually lose?
  • Explore speculative decoding: using a small draft model to predict multiple tokens, then verifying with the large model
  • Write a practical guide to GPU selection for LLM inference: H100 vs A100 vs L4 vs Inferentia2, with cost/performance matrices

Alexandre Agius

AWS Solutions Architect

Passionate about AI & Security. Building scalable cloud solutions and helping organizations leverage AWS services to innovate faster. Specialized in Generative AI, serverless architectures, and security best practices.
