# LLM Distillation vs Quantization: Making Models Smaller, Smarter, Cheaper
Two strategies to shrink LLMs — one compresses weights, the other transfers knowledge. A practical guide to distillation and quantization: when to use each, how to implement them with Hugging Face, and why the real answer is both.
## Table of Contents
- The Problem: Big Models Are Expensive
- Quantization: Compressing What You Have
  - The Intuition
  - Memory Impact
  - How to Quantize with Hugging Face
  - What You Lose
  - When to Use Quantization
- Distillation: Teaching a Smaller Model
  - The Teacher-Student Framework
  - Why Distributions Matter
  - Approach 1: Synthetic Data Distillation (Simple)
  - Approach 2: Logit Distillation (Advanced)
  - Approach 3: Using TRL’s GKDTrainer
  - What You Gain
  - What You Lose
  - When to Use Distillation
- The Trade-Off: Side by Side
- The Winning Move: Combine Both
- Real-World Example
- Decision Framework
- Conclusion
You’ve got a 70B-parameter model that’s brilliant but costs a fortune to run. You need something lighter. Two paths exist: quantization (compress the model you have) and distillation (train a smaller model to mimic the big one).
They’re often confused, sometimes conflated, and rarely explained together. This article fixes that.
## The Problem: Big Models Are Expensive
A model like Llama 3 70B needs ~140 GB of VRAM in float16. That’s two A100 80 GB GPUs just for inference. At cloud prices, that’s roughly $3-5/hour. For a production endpoint handling thousands of requests, the bill adds up fast.
You want to keep the intelligence. You want to lose the cost. Let’s look at your two options.
## Quantization: Compressing What You Have
Quantization reduces the numerical precision of model weights. Instead of storing each parameter as a 16-bit float, you store it as 8-bit, 4-bit, or even 2-bit integers.
### The Intuition
Imagine a painting with 16 million colors. Quantization reduces it to 256 colors. If done well, you barely notice. If done aggressively, things get blocky.
For a weight that was 0.0023841858 in float16, the int4 version might store it as one of 16 possible values in that range. Close enough for the model to still work — not identical, but functional.
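To make that rounding concrete, here is a toy absmax int4 round-trip in plain Python. This is an illustrative sketch, not the actual NF4 scheme bitsandbytes implements, and the function names are mine:

```python
# Toy absmax int4 quantization of a block of weights.
# Each weight is mapped to one of 16 integer levels (-8..7) plus a shared scale.
def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7  # largest weight maps to ±7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    return [v * scale for v in q]

weights = [0.0023841858, -0.031, 0.12, -0.5]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
# Each restored value is close to, but not exactly, the original;
# tiny weights like 0.0023841858 can collapse to 0 entirely.
```

The rounding error per weight is bounded by half the scale, which is why small blocks with their own scales (as real quantizers use) lose less than one scale shared across a whole layer.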
### Memory Impact
| Precision | Bits per param | 7B model | 70B model |
|---|---|---|---|
| float32 | 32 | 28 GB | 280 GB |
| float16 | 16 | 14 GB | 140 GB |
| int8 | 8 | 7 GB | 70 GB |
| int4 | 4 | 3.5 GB | 35 GB |
That 70B model that needed two A100s? In int4, it fits on a single GPU. A 7B model in int4 runs on a laptop with 8 GB of RAM.
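The table follows directly from parameters × bits per weight. A quick helper (the function name is mine, purely illustrative):

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight-only memory in GB: parameters x (bits / 8) bytes.

    Ignores KV cache and activation memory, which add to this at inference time.
    """
    return params_billion * bits / 8

print(weight_memory_gb(70, 16))  # 140.0 — the float16 row
print(weight_memory_gb(70, 4))   # 35.0 — the int4 row
print(weight_memory_gb(7, 4))    # 3.5 — small enough for a laptop
```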
### How to Quantize with Hugging Face
Post-Training Quantization (PTQ) — the most common approach. Zero training required.
Using bitsandbytes (the easiest path):
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization with NormalFloat4
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,  # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
```
That’s it. You now have a 4-bit 70B model running on a single GPU.
Using GPTQ (slightly better quality, needs calibration data):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-70B")

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",  # calibration dataset
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=gptq_config,
    device_map="auto",
)
```
Using GGUF (for llama.cpp / local inference):
```bash
# Convert to GGUF at f16, then quantize to q4_k_m for CPU/hybrid inference
# (convert_hf_to_gguf.py doesn't emit k-quants directly; llama-quantize does)
python convert_hf_to_gguf.py meta-llama/Llama-3-70B --outfile llama-3-70b-f16.gguf --outtype f16
./llama-quantize llama-3-70b-f16.gguf llama-3-70b-q4_k_m.gguf Q4_K_M
```
### What You Lose

Quantization is lossy compression. The degradation is mechanical — you’re literally throwing away precision:
- float16 → int8: Usually negligible loss (<1% on benchmarks)
- int8 → int4: Noticeable on reasoning tasks, fine for chat/generation
- int4 → int2: Significant degradation, only for very specific use cases
The model’s architecture stays exactly the same. Same number of layers, same attention heads, same vocabulary. Only the numbers get less precise.
### When to Use Quantization
- You need a quick win with zero training effort
- You want to run a large model on limited hardware
- You’re serving inference and want to cut GPU costs
- You need to prototype fast before investing in training
## Distillation: Teaching a Smaller Model
Distillation is fundamentally different. Instead of compressing an existing model, you train a new, smaller model to reproduce the behavior of a larger one.
### The Teacher-Student Framework
The core idea:
- You have a teacher — a large, high-quality model (e.g., Llama 3 70B)
- You create a student — a smaller model (e.g., Llama 3 8B)
- You train the student to mimic the teacher’s outputs
The student doesn’t just learn “the right answer” — it learns the teacher’s probability distribution over all possible tokens. This is much richer than simple correct/incorrect labels.
### Why Distributions Matter
When a teacher model sees “The capital of France is ___”, it doesn’t just say “Paris.” It assigns probabilities:
```text
Paris:     0.92
Lyon:      0.03
Marseille: 0.01
the:       0.008
...
```
This distribution contains knowledge: Lyon and Marseille are also French cities (hence higher probability than random words). A student trained on these distributions learns these nuances. A student trained only on “Paris” doesn’t.
This is called dark knowledge — the information hidden in the “wrong” answers.
### Approach 1: Synthetic Data Distillation (Simple)
The easiest approach. Use the teacher to generate a dataset, then fine-tune the student on it.
```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

# Step 1: Generate data with the teacher
teacher = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-70B")
teacher_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-70B")

# Generate responses to your prompts
prompts = load_your_prompts()  # your domain-specific questions
teacher_responses = []
for prompt in prompts:
    inputs = teacher_tokenizer(prompt, return_tensors="pt")
    output = teacher.generate(**inputs, max_new_tokens=512)
    response = teacher_tokenizer.decode(output[0], skip_special_tokens=True)
    teacher_responses.append({"prompt": prompt, "completion": response})

# Wrap the pairs in a Dataset using SFTTrainer's prompt/completion format
teacher_dataset = Dataset.from_list(teacher_responses)

# Step 2: Fine-tune the student on teacher data
student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")

training_config = SFTConfig(
    output_dir="./distilled-8b",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model=student,
    args=training_config,
    train_dataset=teacher_dataset,
)
trainer.train()
```
This is how most open-source “distilled” models are created. It’s straightforward fine-tuning — the magic is in the quality of the teacher’s data.
### Approach 2: Logit Distillation (Advanced)
For maximum knowledge transfer, you match the student’s probability distributions to the teacher’s:
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """
    Combine distillation loss with standard cross-entropy.

    Temperature softens the distributions — higher T reveals more
    of the teacher's dark knowledge.
    """
    # Soft targets from the teacher
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Standard cross-entropy against the ground-truth tokens
    hard_loss = F.cross_entropy(student_logits, labels)

    # alpha balances teacher imitation vs label supervision
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
The temperature parameter is key. At temperature=1, the distribution is sharp (Paris 92%, everything else near 0%). At temperature=2-5, it softens, letting the student learn from the “wrong” answers too.
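You can see the softening numerically with a plain softmax. The logits below are made up for illustration, not taken from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by T before normalizing; larger T flattens the distribution
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for "Paris", "Lyon", "Marseille", "the"
logits = [9.0, 5.6, 4.5, 4.2]
sharp = softmax(logits, temperature=1.0)
soft = softmax(logits, temperature=4.0)
# At T=1, "Paris" takes nearly all the mass; at T=4 the other
# tokens keep enough probability for the student to learn from.
```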
### Approach 3: Using TRL’s GKDTrainer
Hugging Face’s TRL library provides a dedicated trainer for Generalized Knowledge Distillation:
```python
from trl import GKDTrainer, GKDConfig

config = GKDConfig(
    output_dir="./distilled-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-5,
    temperature=2.0,
    lmbda=0.5,  # fraction of on-policy (student-generated) training data
    bf16=True,
)

trainer = GKDTrainer(
    model="meta-llama/Llama-3-8B",
    teacher_model="meta-llama/Llama-3-70B",  # the teacher goes to the trainer, not the config
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```
### What You Gain
A distilled model is smarter for its size than a model trained from scratch:
- An 8B distilled from a 70B can outperform a regular 8B by 5-15% on benchmarks
- The student inherits the teacher’s reasoning patterns, not just its answers
- You can distill for specific domains (code, medical, legal) for even better results
### What You Lose
- Training cost: GPUs for hours or days, depending on dataset size
- Teacher dependency: You need access to the teacher for data generation
- Ceiling effect: The student can’t surpass the teacher (usually)
### When to Use Distillation
- You need a production model that’s fast and smart
- You have access to a strong teacher model and compute budget
- You want to create a domain-specific small model
- You’re building a product where inference cost is the bottleneck
## The Trade-Off: Side by Side
| | Quantization | Distillation |
|---|---|---|
| What changes | Number precision | Model architecture |
| Effort | Minutes | Hours to days |
| Compute cost | Negligible | Significant (GPU training) |
| Speed gain | Moderate (same arch) | Large (smaller arch) |
| Quality loss | Mechanical, passive | Controlled, intentional |
| Reversible | Yes (keep originals) | No (new model) |
| Specialization | No | Yes (domain-specific) |
Think of it this way:
- Quantization = putting a book in a smaller font. Same book, harder to read.
- Distillation = having someone write a shorter book that captures the essence.
## The Winning Move: Combine Both
The best production setup often combines distillation then quantization:
```text
Llama 3 70B (140 GB, float16)
        │
        │ distillation
        ▼
Llama 3 8B-distilled (16 GB, float16)
        │
        │ quantization (int4)
        ▼
Llama 3 8B-distilled-Q4 (3.5 GB)
```
You go from a model that needs two A100s to one that runs on a laptop. And the distilled-then-quantized 8B will outperform a standard 8B that was only quantized.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load your distilled model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# Best of both worlds
model = AutoModelForCausalLM.from_pretrained(
    "./distilled-8b",  # your distilled model
    quantization_config=bnb_config,
    device_map="auto",
)
```
## Real-World Example
This is essentially what happened with models like:
- Phi-3 — Microsoft distilled knowledge from much larger models into 3.8B parameters
- Gemma 2 — Google’s smaller models benefit from their larger Gemini models
- Mistral 7B — punches way above its weight class, likely thanks to distillation techniques
- Llama 3.2 1B/3B — Meta distilled from the larger Llama 3.1 models
When you download a “7B model that’s surprisingly good,” there’s usually distillation in its lineage somewhere.
## Decision Framework
Start with quantization if:
- You need results today
- You don’t have training infrastructure
- The base model already does what you need
- You just want to reduce serving costs
Invest in distillation if:
- You need a model for a specific domain
- Inference cost is your primary concern long-term
- You have a strong teacher model available
- You need the best quality-to-size ratio possible
Do both if:
- You’re building a production system at scale
- Every millisecond of latency matters
- You want the smallest possible model that still works well
## Conclusion
Quantization and distillation aren’t competitors — they’re complementary tools for different stages of the optimization pipeline.
Quantization is your quick win: take any model, make it smaller, deploy it now. Distillation is your long-term investment: spend compute once to create a model that’s genuinely smarter for its size.
The frontier of efficient AI isn’t about choosing one over the other. It’s about knowing when to apply each — and increasingly, applying both in sequence for maximum impact.
Your 70B model doesn’t need to stay expensive. The knowledge is what matters, not the bytes it takes to store it.