LLM Architecture Explained Simply: 10 Questions From Prompt to Token
A beginner-friendly walkthrough of how an LLM actually works end-to-end: from typing a prompt to receiving a response — covering tokenization, embeddings, Transformer layers, KV cache, the training loop, embeddings for search, and why decoder-only models won.
Table of Contents
- The Problem
- The Solution
- How It Works
- Stage 1: Tokenization — Text to Numbers
- Stage 2: Embeddings — Numbers to Meaning
- Stage 3: The Transformer Stack — Where Thinking Happens
- Stage 4: The LM Head — Vectors to Probabilities
- Stage 5: Sampling — Choosing the Next Token
- Stage 6: The Autoregressive Loop — Token by Token
- Prefill vs Decode: Why the First Token Is Slow
- The KV Cache: Memory That Makes Decode Fast
- Why Decoder-Only Won
- Training vs Inference: Two Different Worlds
- What I Learned
- What’s Next
You type a prompt, hit enter, and tokens start streaming back. But what actually happens between your text and the model’s output? This post answers 10 simple questions that trace the complete path — from raw text to generated token — with no math prerequisites.
The Problem
Every explanation of LLMs either starts with the attention formula (losing 90% of readers) or stays so high-level it’s useless for practical work. You end up knowing that “Transformers use attention” without understanding what that means in practice.
Here are 10 questions that, once answered, give you a solid mental model of how LLMs work:
- How does an LLM work?
- How does inference work?
- How does training work?
- What do Attention and FFN actually do?
- How does the model know when to stop?
- What’s a token? How does vocabulary work?
- What’s the KV cache?
- What are embeddings? How are they used for search?
- What’s the difference between encoder and decoder?
- What’s decided before training vs learned during training?
If you deploy, fine-tune, or evaluate LLMs, these aren’t academic questions — they directly affect latency, memory usage, cost, and output quality.
The Solution
The entire LLM pipeline is a sequence of six stages. Every model — GPT-4, Claude, Mistral, Llama — follows this exact flow:
Prompt → Tokenizer → Embeddings → Transformer Stack → LM Head → Output Token
↓
(loop until EOS)
Each stage transforms the data into a different representation. Understanding these representations is understanding LLMs.
How It Works
Stage 1: Tokenization — Text to Numbers
LLMs don’t see text. They see sequences of integers. The tokenizer converts text into token IDs using a pre-built vocabulary.
Most modern models use Byte-Pair Encoding (BPE): start with individual characters, then iteratively merge the most frequent pairs until you reach your target vocabulary size (typically 32K-128K entries).
Input: "The capital of France is"
│
▼
Tokenize: ["The", " capital", " of", " France", " is"]
│
▼
Lookup: [464, 4891, 315, 6629, 374]
Key properties of tokenization:
| Property | Detail |
|---|---|
| Vocabulary size | 32K (Mistral), 128K (Llama 3), 100K (GPT-4) |
| Algorithm | BPE — byte-level, merge frequent pairs |
| Granularity | Subword — common words are one token, rare words split into pieces |
| Language efficiency | English ≈ 1 token per word. French ≈ 1.3-1.5x more tokens. Chinese ≈ 2x. |
| Reversible | Always — you can decode token IDs back to text |
Why this matters practically:
- Pricing — API providers charge per token, not per word. French prompts cost ~30-50% more than equivalent English ones.
- Context window — a “128K context” means 128K tokens, not characters. English fits ~100K words in it, while code fits significantly fewer (variable names and punctuation all consume tokens).
- Vocabulary trade-off — bigger vocab = fewer tokens per text (cheaper, faster) but larger embedding matrix (more memory). Llama 3 quadrupled vocab from 32K to 128K specifically to improve multilingual and code efficiency.
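The BPE merge loop described above can be sketched in a few lines. This is a minimal, illustrative implementation on a toy three-word corpus, not a production tokenizer (real BPE works at the byte level and tracks a learned merge table):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word (pre-split into characters) -> frequency
corpus = {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("l","o","w","e","s","t"): 3}
for _ in range(3):  # three merge steps; real vocabularies take ~32K-128K merges
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)
```

After three merges, "low" has become a single token, which is exactly why common words end up as one token and rare words split into pieces.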
Stage 2: Embeddings — Numbers to Meaning
Token IDs are just integers — they carry no semantic information. The embedding layer converts each ID into a dense vector that captures meaning.
Token ID 6629 ("France")
│
▼
Embedding matrix lookup (128K × 4096)
│
▼
[0.023, -0.451, 0.892, ..., 0.117] ← 4096-dimensional vector
This vector is learned during training. Similar concepts end up near each other in this space — “France” and “Germany” have similar vectors, while “France” and “function” are far apart.
Positional encoding is added to give the model a sense of order. The model needs to know that “dog bites man” means something different from “man bites dog.” Modern models use RoPE (Rotary Position Embedding) — a rotation applied to the vector that encodes relative position information.
After this stage, the input is a matrix: [sequence_length × model_dimension]. For our 5-token prompt on a 4096-dim model, that’s [5 × 4096].
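The lookup itself is trivially simple, which is easy to miss under the terminology. A minimal sketch with tiny stand-in dimensions (a real matrix would be 128K × 4096; the token IDs here are illustrative, shrunk to fit the toy vocabulary; RoPE is not shown):

```python
import random
random.seed(0)

# Real shape would be vocab_size x model_dim, e.g. 128_000 x 4096.
vocab_size, dim = 1000, 8
embedding = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

token_ids = [464, 891, 315, 629, 374]  # illustrative IDs for a 5-token prompt
x = [embedding[t] for t in token_ids]  # the [seq_len x dim] input matrix
print(len(x), len(x[0]))
```

Each row of `x` is one token's vector; the whole matrix is what flows into the Transformer stack.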
Embeddings beyond LLMs: vectors for search
The same concept of “text as vectors” powers embedding models — specialized models trained not to predict the next token, but to map text into vectors where similar meanings are close together. Unlike LLM embeddings (which are an internal layer), embedding models produce a single vector per input text, optimized for similarity search.
"The cat sat on the mat" → [0.82, 0.15, -0.33, ...]
"A feline rested on a rug" → [0.79, 0.18, -0.30, ...] ← close (similar meaning)
"Stock prices rose sharply today" → [-0.45, 0.91, 0.12, ...] ← far (different meaning)
These vectors enable RAG (Retrieval-Augmented Generation): embed a query, search a vector database (FAISS, Pinecone, OpenSearch) for the closest document vectors using cosine similarity, then pass the retrieved text (not the vectors) to the LLM as context. The embedding is only used for search — the LLM reads plain text.
1. Query: "How do cats behave?" → embed → vector → search FAISS
2. FAISS returns: document IDs [0, 2] (not text, not vectors — just IDs)
3. You look up: texts[0] → "Felines are independent animals..."
4. Prompt to LLM: "Context: Felines are independent animals...
Question: How do cats behave?"
5. LLM generates answer from the context
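The retrieval step boils down to cosine similarity between the query vector and each document vector. A minimal sketch using the made-up vectors from the example above (a real system would call an embedding model and a vector database instead of hard-coding values):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# doc_id -> (text, embedding vector); vectors are illustrative, not real
docs = {
    0: ("Felines are independent animals...", [0.82, 0.15, -0.33]),
    1: ("Stock prices rose sharply today",    [-0.45, 0.91, 0.12]),
    2: ("A feline rested on a rug",           [0.79, 0.18, -0.30]),
}
query_vec = [0.80, 0.16, -0.31]  # pretend embed("How do cats behave?")

ranked = sorted(docs, key=lambda i: cosine(query_vec, docs[i][1]), reverse=True)
top_ids = ranked[:2]
context = " ".join(docs[i][0] for i in top_ids)  # plain text goes to the LLM
print(top_ids)
```

The two cat documents rank above the finance one, and only their text (never the vectors) is pasted into the prompt.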
Stage 3: The Transformer Stack — Where Thinking Happens
This is the core of the model. The embedding vectors pass through a stack of identical Transformer blocks — 32 layers for Mistral 7B, 80 for Llama 70B. Each block does two things:
- Self-Attention — “Who is related to what?” Looks at all tokens in the sequence and computes how much each token should influence every other token. This is where “France” connects to “capital” and “is.”
- Feed-Forward Network (FFN) — “What does it mean?” Transforms each token’s representation through learned knowledge. This is where the model “remembers” that the capital of France is Paris. The FFN holds ~2/3 of the model’s parameters and acts as an associative memory.
Each layer refines the representation a bit more:
Layers 1-10: Syntactic structure (grammar, word order)
Layers 11-50: Semantic understanding (meaning, facts, relationships)
Layers 51-80: Generation preparation (what token comes next)
For a deep dive into the Attention + FFN duo, see Transformer Anatomy: Attention + FFN Demystified.
Stage 4: The LM Head — Vectors to Probabilities
After the final Transformer layer, each token position has a 4096-dimensional vector that encodes “everything the model understands about what should come next.” The LM head converts this into a probability over the entire vocabulary:
Final hidden state: [4096-dim vector]
│
▼
Linear projection: 4096 → 128,000 (vocab size)
│
▼
Softmax: convert raw scores to probabilities
│
▼
{ "Paris": 0.87, "Lyon": 0.03, "the": 0.02, "Berlin": 0.01, ... }
This gives a probability for every token in the vocabulary. The model doesn’t “pick a word” — it produces a distribution over all possible next tokens.
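The projection-plus-softmax step is a one-liner mathematically. Here is a minimal sketch with the vocabulary collapsed to five example tokens and made-up logit values (a real LM head projects 4096 dims to the full 128K vocabulary):

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["Paris", "Lyon", "the", "Berlin", "banana"]
logits = [9.2, 5.9, 5.5, 4.8, 0.1]  # raw scores from the linear projection (made up)
probs = softmax(logits)
best = vocab[probs.index(max(probs))]
print(best)
```

Note that every token gets some probability mass, even "banana" — the distribution is over the entire vocabulary.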
Stage 5: Sampling — Choosing the Next Token
The sampling strategy controls how the model selects from that probability distribution. This is where temperature, top-k, and top-p come in:
| Parameter | What It Does | Effect |
|---|---|---|
| Temperature | Scales the logits before softmax. T < 1 sharpens (more deterministic), T > 1 flattens (more creative). | T=0: always picks highest probability. T=1: natural distribution. T=2: very random. |
| Top-k | Only consider the top K most probable tokens. | k=1: greedy. k=50: considers 50 options. |
| Top-p (nucleus) | Only consider tokens whose cumulative probability reaches P. | p=0.9: considers the smallest set of tokens covering 90% probability mass. |
In practice, most serving setups use temperature=0.7 with top_p=0.9 for a balance of coherence and variety. Coding tasks use lower temperature (0.1-0.3) for more deterministic output.
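The three parameters compose into one pipeline: scale by temperature, keep the top-k candidates, trim to the top-p nucleus, then sample. This is a common ordering but real serving stacks differ in details; the logit values below are made up:

```python
import math, random

def sample(logits, temperature=0.7, top_k=50, top_p=0.9, rng=random):
    # 1. Temperature: scale logits, then softmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # 2. Top-k: keep only the K most probable tokens
    ranked = ranked[:top_k]
    # 3. Top-p: smallest prefix whose cumulative probability reaches p
    kept, cum = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cum += p
        if cum >= top_p:
            break
    # 4. Renormalize the survivors and draw one token
    z = sum(p for p, _ in kept)
    r, acc = rng.random() * z, 0.0
    for p, i in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][1]

logits = [9.2, 5.9, 5.5, 4.8]  # "Paris", "Lyon", "the", "Berlin"
print(sample(logits))
```

With these logits the "Paris" token dominates so heavily that the nucleus contains only it, so the call is effectively deterministic; flatter distributions would give the sampler real choices.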
The selected token is appended to the sequence, and we loop back to Stage 2: the new token is embedded and run through the stack again.
Stage 6: The Autoregressive Loop — Token by Token
LLMs generate text one token at a time. The selected token is fed back as input, and the entire Transformer stack runs again to produce the next token. This continues until one of three things happens:
- The model outputs an EOS (End of Sequence) token — a special vocabulary entry that signals “I’m done”
- The sequence hits the maximum length limit
- The user/system stops generation
Step 1: "The capital of France is" → model predicts "Paris"
Step 2: "The capital of France is Paris" → model predicts "."
Step 3: "The capital of France is Paris." → model predicts EOS
→ Done. Return "Paris."
This is why LLMs can’t “plan ahead” — each token is generated based solely on what came before. The model doesn’t see the full response, draft it, then output it. It commits to each token as it goes.
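The loop structure itself is simple enough to sketch. Here the "model" is a hard-coded lookup table standing in for the full Transformer stack; everything else (feed the token back, check for EOS, cap the length) mirrors real generation:

```python
EOS = "<eos>"

# Toy stand-in for the model: maps a context to its predicted next token.
toy_model = {
    ("The", "capital", "of", "France", "is"): "Paris",
    ("The", "capital", "of", "France", "is", "Paris"): ".",
    ("The", "capital", "of", "France", "is", "Paris", "."): EOS,
}

def generate(prompt_tokens, max_new_tokens=10):
    seq = list(prompt_tokens)
    for _ in range(max_new_tokens):   # stop condition: length limit
        nxt = toy_model[tuple(seq)]   # real models run the full stack here
        if nxt == EOS:                # stop condition: EOS token
            break
        seq.append(nxt)               # commit the token and feed it back in
    return seq[len(prompt_tokens):]

print(generate(["The", "capital", "of", "France", "is"]))
```

Notice there is no place in this loop where the model could revise an earlier token — once appended, a token is final.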
Prefill vs Decode: Why the First Token Is Slow
Inference has two distinct phases:
Prefill — Process the entire input prompt in one forward pass. All input tokens are processed in parallel (they’re all known). This is compute-heavy but efficient because of parallelism. The KV cache is populated for all input tokens.
Decode — Generate output tokens one at a time. Each step processes only the new token, reading the KV cache for all previous tokens. This is memory-bandwidth-bound — the GPU spends most of its time reading cached values.
Prefill: [████████████████████] ← all input tokens, one pass
Time: 200ms (for 1000 input tokens)
Decode: [█][█][█][█][█][█]... ← one token per step
Time: 30ms per token
This is why TTFT (Time to First Token) is dominated by prompt length. A 100-token prompt has fast TTFT. A 10,000-token RAG prompt has much higher TTFT, regardless of output length. After prefill, decode speed is roughly constant.
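The back-of-envelope arithmetic is worth internalizing. Using the illustrative per-token costs from the diagram above (~0.2 ms of prefill per input token, ~30 ms per decoded token — assumed numbers, not benchmarks):

```python
def latency_ms(input_tokens, output_tokens,
               prefill_per_tok=0.2, decode_per_tok=30.0):
    """Rough latency model: TTFT ~ prefill time; total adds decode steps."""
    ttft = input_tokens * prefill_per_tok
    total = ttft + output_tokens * decode_per_tok
    return ttft, total

short = latency_ms(1_000, 100)    # short prompt
rag = latency_ms(10_000, 100)     # long RAG prompt, same output length
print(short, rag)
```

Both requests pay the same ~3 seconds of decode for 100 output tokens, but the RAG prompt's TTFT is 10x higher — entirely a prefill effect.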
The KV Cache: Memory That Makes Decode Fast
During prefill, the Attention mechanism computes keys and values for every input token. These are stored in the KV cache — a per-request memory buffer that lives in GPU memory. During decode, each new token only needs to compute its own key/value pair and attend to the cached ones, avoiding recomputation of the entire sequence.
Prefill: "What is the capital of France?"
→ Compute K,V for all 7 tokens → Store in KV cache
Decode step 1: "The"
→ Compute K,V for "The" only → Attend to 7 cached + 1 new → Predict next
Decode step 2: "capital"
→ Compute K,V for "capital" only → Attend to 8 cached + 1 new → Predict next
Without the KV cache, every decode step would reprocess the entire sequence from scratch — making the total cost of generation quadratic in sequence length instead of linear.
KV cache lifecycle:
| Phase | What Happens |
|---|---|
| Prefill | KV cache is built (all input tokens) |
| Decode | KV cache grows by one entry per step |
| End of request | KV cache is discarded |
The cache is per request — 100 concurrent users means 100 separate KV caches in GPU memory. This is the primary memory bottleneck in production serving, and it’s why optimizations matter:
- Sliding Window Attention (SWA) — Mistral’s approach. Only cache the last W tokens (e.g., 4096). Older entries are dropped as the window slides. Memory stays constant regardless of sequence length.
- Grouped-Query Attention (GQA) — Multiple query heads share the same KV heads, reducing cache size at the architecture level.
- PagedAttention — Used by vLLM. Stores the KV cache in non-contiguous memory pages (like OS virtual memory) instead of one big contiguous block. Pages are allocated on demand and freed immediately when a request finishes. This dramatically increases the number of concurrent requests a GPU can serve.
The combination of SWA + GQA + PagedAttention is how models like Mistral serve long context windows without exhausting GPU memory.
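You can estimate KV cache size directly from the architecture. A minimal sketch using a Mistral-7B-like shape (32 layers, 8 KV heads via GQA, head dimension 128, fp16 — configuration values assumed from the published model card):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Two tensors (K and V) per layer; fp16 = 2 bytes per value."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

per_token = kv_cache_bytes(32, 8, 128, 1)
print(per_token // 1024, "KiB per token")
print(kv_cache_bytes(32, 8, 128, 4096) // 2**20, "MiB for a 4K-token sequence")
```

At 128 KiB per token, a single 4K-token request holds 512 MiB of cache — multiply by 100 concurrent users and the GPU memory pressure becomes obvious. Without GQA (32 KV heads instead of 8), every number here would be 4x larger.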
Why Decoder-Only Won
The original 2017 Transformer had two halves:
- Encoder — processes input bidirectionally (each token sees all other tokens, including future ones)
- Decoder — generates output autoregressively (each token only sees past tokens)
Early models used both: the encoder processes the input, the decoder generates the output (the original translation Transformer, T5, BART). But the field converged on decoder-only architectures (GPT-2, GPT-3, Llama, Mistral, Claude). Why?
| Factor | Encoder-Decoder | Decoder-Only |
|---|---|---|
| Simplicity | Two separate components, cross-attention between them | One unified stack |
| Scaling | Must decide how to split parameters between encoder and decoder | All parameters in one stack |
| Pre-training | Often trained with different objectives per component | Single next-token prediction objective |
| Flexibility | Best for seq-to-seq (translation, summarization) | Works for everything (chat, code, reasoning, translation) |
| In-context learning | Weaker — encoder can’t easily learn from examples in the prompt | Strong — examples in the prompt are just more context |
The key insight: decoder-only models treat the input as “just more context.” There’s no architectural distinction between the prompt and the response — it’s all one sequence. This simplicity turned out to scale better and generalize more broadly.
Training vs Inference: Two Different Worlds
| Aspect | Training | Inference |
|---|---|---|
| Goal | Adjust weights to minimize prediction error | Use frozen weights to generate text |
| Data | Trillions of tokens, all at once | One prompt at a time |
| Compute | Thousands of GPUs for weeks | 1-8 GPUs per request |
| Cost | $10M-$100M+ | $0.001-$0.01 per request |
| Gradient | Yes — backpropagation updates every weight | No — weights are frozen |
| Parallelism | Full sequence parallelism (teacher forcing) | Prefill parallel, decode sequential |
During training, the model sees the correct next token (this is called teacher forcing). It processes the entire sequence in parallel and compares its predictions against the ground truth at every position. The error signal flows backward through all layers to update weights.
The training loop in 5 steps:
1. Forward pass: tokens → Attention → FFN → predicted next token
2. Loss: compare prediction vs actual next token (cross-entropy)
3. Backpropagation: compute gradient for every weight ("how to adjust?")
4. Optimizer: adjust weights in the direction that reduces loss
5. Repeat: billions of times across the training data
The same Attention + FFN layers used during inference are the ones being trained. Training runs both a forward pass (same as inference) and a backward pass (compute gradients and update weights). Inference only runs the forward pass — weights are frozen.
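Step 2 of the loop (the loss) is worth seeing concretely. Cross-entropy is just the negative log of the probability the model assigned to the correct next token — a minimal sketch reusing the probability numbers from Stage 4:

```python
import math

def cross_entropy(probs, target_index):
    """Loss = -log(probability assigned to the ground-truth next token)."""
    return -math.log(probs[target_index])

# The model put 87% on "Paris", 3% on "Lyon" (numbers from the Stage 4 example).
probs = [0.87, 0.03, 0.02]           # "Paris", "Lyon", "the"
loss_if_paris = cross_entropy(probs, 0)  # training data agrees with the model
loss_if_lyon = cross_entropy(probs, 1)   # training data contradicts the model
print(round(loss_if_paris, 3), round(loss_if_lyon, 3))
```

A confident correct prediction yields a tiny loss; a confident wrong one yields a large loss, and that difference is exactly the gradient signal backpropagation pushes through the weights.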
Hyperparameters vs learned weights:
- Hyperparameters — decided by humans before training: number of layers, model dimension, vocabulary size, learning rate, number of attention heads. These define the architecture. You can’t change them after training — choose wrong and you start over.
- Learned weights — adjusted during training: all the matrices in attention (Q, K, V, O), FFN (gate, up, down), and embeddings. These encode knowledge.
Think of it like building a library: hyperparameters decide how many shelves and rooms (architecture). Training fills the shelves with books (knowledge). Inference is a visitor looking up information in the frozen library.
Fine-tuning: updating weights after pre-training
Pre-training produces a general-purpose model. Fine-tuning adapts it to a specific domain or task by continuing the training loop on a smaller, curated dataset. There are two approaches:
| Method | What changes | Cost | Quality |
|---|---|---|---|
| Full fine-tuning | All weights updated | Expensive (full GPU memory) | Best adaptation |
| LoRA | Frozen weights + small trainable adapters | ~5-10% of full cost | Near-equivalent quality |
LoRA (Low-Rank Adaptation) is the clever shortcut: instead of updating billions of FFN weights, you freeze them and train tiny matrices (~1-5% of original parameters) alongside them. The model learns the delta — what to change — without rewriting its entire knowledge base.
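The parameter arithmetic explains why LoRA is so cheap. For one 4096×4096 weight matrix with a rank-16 adapter (illustrative sizes; the actual savings depend on which matrices get adapters and the chosen rank):

```python
d, r = 4096, 16                 # model dimension, adapter rank (illustrative)
full_params = d * d             # the frozen base matrix W
lora_params = d * r + r * d     # trainable A (d x r) and B (r x d)
ratio = lora_params / full_params
print(full_params, lora_params, f"{ratio:.2%}")
# Effective weight at inference: W' = W + B @ A  (the learned delta)
```

The adapter is under 1% of this matrix's parameters, and only the adapter receives gradients, optimizer state, and updates — the rest of the library stays frozen.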
What I Learned
- Tokenization is the most overlooked bottleneck. Vocabulary size directly affects cost (tokens per word), context efficiency (tokens per document), and multilingual performance. Llama 3’s jump from 32K to 128K vocabulary wasn’t cosmetic — it was a ~30% efficiency gain for non-English languages and code.
- Prefill vs decode explains most latency confusion. When someone says “the model is slow,” they usually mean TTFT (prefill) is high because the prompt is long. Decode speed is nearly constant regardless of prompt size. Understanding this split is essential for debugging latency and right-sizing infrastructure.
- The KV cache is the production bottleneck most people miss. It’s per-request, grows with every decode step, and 100 concurrent users means 100 caches in GPU memory. Understanding its lifecycle (built during prefill, used during decode, discarded after) explains why PagedAttention and Sliding Window Attention exist — they’re not optional optimizations, they’re what makes long-context serving feasible.
- Embeddings for search and embeddings inside LLMs are related but different. Inside an LLM, embeddings are a layer that converts token IDs to vectors. Embedding models are separately trained to produce one vector per text chunk, optimized for similarity search. The key RAG insight: embeddings find the right documents, but you pass the text (not the vectors) to the LLM.
- Training is inference plus three extra steps. The forward pass is identical — same Attention, same FFN. Training adds loss computation, backpropagation, and weight updates. Understanding this makes fine-tuning and LoRA intuitive: full fine-tuning reruns all three extra steps on all weights, LoRA only runs them on tiny adapter matrices.
- The model doesn’t plan — it commits. Each token is chosen based only on what came before. There’s no internal draft, no look-ahead, no revision. This is why chain-of-thought prompting works: it forces the model to “show its work” in the token stream, giving later tokens more context to work with. The model can’t think silently — its thoughts must be tokens.
What’s Next
- Deep dive into attention variants: Multi-Head (MHA), Grouped-Query (GQA), Multi-Query (MQA) — trade-offs between quality and KV cache size
- Explore how fine-tuning changes the weights: what LoRA actually modifies and where in the pipeline
- Compare tokenizers across model families: BPE vs SentencePiece vs Unigram, with real token counts for the same text
- Write a practical guide to prompt engineering grounded in architecture: why system prompts work, why token order matters, why examples help
Related Posts
Transformer Anatomy: Attention + FFN Demystified
A deep dive into the Transformer architecture — how attention connects tokens and why the Feed-Forward Network is the real brain of the model. Plus the key to understanding Mixture of Experts (MoE).
Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon
A practical walkthrough of two paths to working with Mistral — the managed API for fast prototyping and self-hosted deployment for full control — with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.
LLM Distillation vs Quantization: Making Models Smaller, Smarter, Cheaper
Two strategies to shrink LLMs — one compresses weights, the other transfers knowledge. A practical guide to distillation and quantization: when to use each, how to implement them with Hugging Face, and why the real answer is both.
