# LLM Distillation vs Quantization: Making Models Smaller, Smarter, Cheaper
Two strategies to shrink LLMs — one compresses weights, the other transfers knowledge. A practical guide to distillation and quantization: when to use each, how to implement them with Hugging Face, and why the real answer is both.
## Table of Contents
- The Problem: Big Models Are Expensive
- Quantization: Compressing What You Have
  - The Intuition
  - Memory Impact
  - How to Quantize with Hugging Face
  - What You Lose
  - When to Use Quantization
- Distillation: Teaching a Smaller Model
  - The Teacher-Student Framework
  - Why Distributions Matter
  - Approach 1: Synthetic Data Distillation (Simple)
  - Approach 2: Logit Distillation (Advanced)
  - Approach 3: Using TRL’s GKDTrainer
  - What You Gain
  - What You Lose
  - When to Use Distillation
- The Trade-Off: Side by Side
- The Winning Move: Combine Both
- Real-World Example
- Decision Framework
- Conclusion
You’ve got a 70B-parameter model that’s brilliant but costs a fortune to run. You need something lighter. Two paths exist: quantization (compress the model you have) and distillation (train a smaller model to mimic the big one).
They’re often confused, sometimes conflated, and rarely explained together. This article fixes that.
## The Problem: Big Models Are Expensive
A model like Llama 3 70B needs ~140 GB of VRAM in float16. That’s two A100 80 GB GPUs just for inference. At cloud prices, that’s roughly $3-5/hour. For a production endpoint handling thousands of requests, the bill adds up fast.
You want to keep the intelligence. You want to lose the cost. Let’s look at your two options.
## Quantization: Compressing What You Have
Quantization reduces the numerical precision of model weights. Instead of storing each parameter as a 16-bit float, you store it as 8-bit, 4-bit, or even 2-bit integers.
### The Intuition
Imagine a painting with 16 million colors. Quantization reduces it to 256 colors. If done well, you barely notice. If done aggressively, things get blocky.
For a weight that was 0.0023841858 in float16, the int4 version might store it as one of 16 possible values in that range. Close enough for the model to still work — not identical, but functional.
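To make that rounding concrete, here is a toy absmax int4 round-trip in plain Python. This is an illustrative sketch, not the actual NF4 scheme bitsandbytes implements, and the function names are mine:

```python
# Toy absmax int4 quantization of a block of weights.
# Each weight is mapped to one of 16 integer levels (-8..7) plus a shared scale.
def quantize_int4(weights):
    scale = max(abs(w) for w in weights) / 7  # largest weight maps to ±7
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    return [v * scale for v in q]

weights = [0.0023841858, -0.031, 0.12, -0.5]
q, scale = quantize_int4(weights)
restored = dequantize_int4(q, scale)
# Each restored value is close to, but not exactly, the original;
# tiny weights like 0.0023841858 can collapse to 0 entirely.
```

The rounding error per weight is bounded by half the scale, which is why small blocks with their own scales (as real quantizers use) lose less than one scale shared across a whole layer.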
### Memory Impact
| Precision | Bits per param | 7B model | 70B model |
|---|---|---|---|
| float32 | 32 | 28 GB | 280 GB |
| float16 | 16 | 14 GB | 140 GB |
| int8 | 8 | 7 GB | 70 GB |
| int4 | 4 | 3.5 GB | 35 GB |
That 70B model that needed two A100s? In int4, it fits on a single GPU. A 7B model in int4 runs on a laptop with 8 GB of RAM.
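The table follows directly from parameters × bits per weight. A quick helper (the function name is mine, purely illustrative):

```python
def weight_memory_gb(params_billion: float, bits: int) -> float:
    """Approximate weight-only memory in GB: parameters x (bits / 8) bytes.

    Ignores KV cache and activation memory, which add to this at inference time.
    """
    return params_billion * bits / 8

print(weight_memory_gb(70, 16))  # 140.0 — the float16 row
print(weight_memory_gb(70, 4))   # 35.0 — the int4 row
print(weight_memory_gb(7, 4))    # 3.5 — small enough for a laptop
```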
### How to Quantize with Hugging Face
Post-Training Quantization (PTQ) — the most common approach. Zero training required.
Using bitsandbytes (the easiest path):
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization with NormalFloat4
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
    bnb_4bit_use_double_quant=True,  # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=quantization_config,
    device_map="auto",
)
```
That’s it. You now have a 4-bit 70B model running on a single GPU.
Using GPTQ (slightly better quality, needs calibration data):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-70B")

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",  # calibration dataset
    tokenizer=tokenizer,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3-70B",
    quantization_config=gptq_config,
    device_map="auto",
)
```
Using GGUF (for llama.cpp / local inference):
```bash
# Convert to GGUF at f16, then quantize to q4_k_m for CPU/hybrid inference
# (convert_hf_to_gguf.py doesn't emit k-quants directly; llama-quantize does)
python convert_hf_to_gguf.py meta-llama/Llama-3-70B --outfile llama-3-70b-f16.gguf --outtype f16
./llama-quantize llama-3-70b-f16.gguf llama-3-70b-q4_k_m.gguf Q4_K_M
```
### What You Lose

Quantization is lossy compression. The degradation is mechanical — you’re literally throwing away precision:
- float16 → int8: Usually negligible loss (<1% on benchmarks)
- int8 → int4: Noticeable on reasoning tasks, fine for chat/generation
- int4 → int2: Significant degradation, only for very specific use cases
The model’s architecture stays exactly the same. Same number of layers, same attention heads, same vocabulary. Only the numbers get less precise.
### When to Use Quantization
- You need a quick win with zero training effort
- You want to run a large model on limited hardware
- You’re serving inference and want to cut GPU costs
- You need to prototype fast before investing in training
## Distillation: Teaching a Smaller Model
Distillation is fundamentally different. Instead of compressing an existing model, you train a new, smaller model to reproduce the behavior of a larger one.
### The Teacher-Student Framework
The core idea:
- You have a teacher — a large, high-quality model (e.g., Llama 3 70B)
- You create a student — a smaller model (e.g., Llama 3 8B)
- You train the student to mimic the teacher’s outputs
The student doesn’t just learn “the right answer” — it learns the teacher’s probability distribution over all possible tokens. This is much richer than simple correct/incorrect labels.
### Why Distributions Matter
When a teacher model sees “The capital of France is ___”, it doesn’t just say “Paris.” It assigns probabilities:
```text
Paris:     0.92
Lyon:      0.03
Marseille: 0.01
the:       0.008
...
```
This distribution contains knowledge: Lyon and Marseille are also French cities (hence higher probability than random words). A student trained on these distributions learns these nuances. A student trained only on “Paris” doesn’t.
This is called dark knowledge — the information hidden in the “wrong” answers.
### Approach 1: Synthetic Data Distillation (Simple)
The easiest approach. Use the teacher to generate a dataset, then fine-tune the student on it.
```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig

# Step 1: Generate data with the teacher
teacher = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-70B")
teacher_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3-70B")

# Generate responses to your prompts
prompts = load_your_prompts()  # your domain-specific questions
teacher_responses = []
for prompt in prompts:
    inputs = teacher_tokenizer(prompt, return_tensors="pt")
    output = teacher.generate(**inputs, max_new_tokens=512)
    response = teacher_tokenizer.decode(output[0], skip_special_tokens=True)
    teacher_responses.append({"prompt": prompt, "completion": response})

# Wrap the pairs in a Dataset using SFTTrainer's prompt/completion format
teacher_dataset = Dataset.from_list(teacher_responses)

# Step 2: Fine-tune the student on teacher data
student = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")

training_config = SFTConfig(
    output_dir="./distilled-8b",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,
)

trainer = SFTTrainer(
    model=student,
    args=training_config,
    train_dataset=teacher_dataset,
)
trainer.train()
```
This is how most open-source “distilled” models are created. It’s straightforward fine-tuning — the magic is in the quality of the teacher’s data.
### Approach 2: Logit Distillation (Advanced)
For maximum knowledge transfer, you match the student’s probability distributions to the teacher’s:
```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """
    Combine distillation loss with standard cross-entropy.

    Temperature softens the distributions — higher T reveals more
    of the teacher's dark knowledge.
    """
    # Soft targets from the teacher
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    # Standard cross-entropy against the ground-truth tokens
    hard_loss = F.cross_entropy(student_logits, labels)

    # alpha balances teacher imitation vs label supervision
    return alpha * soft_loss + (1 - alpha) * hard_loss
```
The temperature parameter is key. At temperature=1, the distribution is sharp (Paris 92%, everything else near 0%). At temperature=2-5, it softens, letting the student learn from the “wrong” answers too.
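You can see the softening numerically with a plain softmax. The logits below are made up for illustration, not taken from any real model:

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by T before normalizing; larger T flattens the distribution
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for "Paris", "Lyon", "Marseille", "the"
logits = [9.0, 5.6, 4.5, 4.2]
sharp = softmax(logits, temperature=1.0)
soft = softmax(logits, temperature=4.0)
# At T=1, "Paris" takes nearly all the mass; at T=4 the other
# tokens keep enough probability for the student to learn from.
```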
### Approach 3: Using TRL’s GKDTrainer
Hugging Face’s TRL library provides a dedicated trainer for Generalized Knowledge Distillation:
```python
from trl import GKDTrainer, GKDConfig

config = GKDConfig(
    output_dir="./distilled-model",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-5,
    temperature=2.0,
    lmbda=0.5,  # fraction of on-policy (student-generated) training data
    bf16=True,
)

trainer = GKDTrainer(
    model="meta-llama/Llama-3-8B",
    teacher_model="meta-llama/Llama-3-70B",  # the teacher goes to the trainer, not the config
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```
### What You Gain
A distilled model is smarter for its size than a model trained from scratch:
- An 8B distilled from a 70B can outperform a regular 8B by 5-15% on benchmarks
- The student inherits the teacher’s reasoning patterns, not just its answers
- You can distill for specific domains (code, medical, legal) for even better results
### What You Lose
- Training cost: GPUs for hours or days, depending on dataset size
- Teacher dependency: You need access to the teacher for data generation
- Ceiling effect: The student can’t surpass the teacher (usually)
### When to Use Distillation
- You need a production model that’s fast and smart
- You have access to a strong teacher model and compute budget
- You want to create a domain-specific small model
- You’re building a product where inference cost is the bottleneck
## The Trade-Off: Side by Side
| | Quantization | Distillation |
|---|---|---|
| What changes | Number precision | Model architecture |
| Effort | Minutes | Hours to days |
| Compute cost | Negligible | Significant (GPU training) |
| Speed gain | Moderate (same arch) | Large (smaller arch) |
| Quality loss | Mechanical, passive | Controlled, intentional |
| Reversible | Yes (keep originals) | No (new model) |
| Specialization | No | Yes (domain-specific) |
Think of it this way:
- Quantization = putting a book in a smaller font. Same book, harder to read.
- Distillation = having someone write a shorter book that captures the essence.
## The Winning Move: Combine Both
The best production setup often combines distillation then quantization:
```text
Llama 3 70B (140 GB, float16)
        │
        │ distillation
        ▼
Llama 3 8B-distilled (16 GB, float16)
        │
        │ quantization (int4)
        ▼
Llama 3 8B-distilled-Q4 (3.5 GB)
```
You go from a model that needs two A100s to one that runs on a laptop. And the distilled-then-quantized 8B will outperform a standard 8B that was only quantized.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load your distilled model with 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)

# Best of both worlds
model = AutoModelForCausalLM.from_pretrained(
    "./distilled-8b",  # your distilled model
    quantization_config=bnb_config,
    device_map="auto",
)
```
## Real-World Example
This is essentially what happened with models like:
- Phi-3 — Microsoft distilled knowledge from much larger models into 3.8B parameters
- Gemma 2 — Google’s smaller models benefit from their larger Gemini models
- Mistral 7B — punches way above its weight class, likely thanks to distillation techniques
- Llama 3.2 1B/3B — Meta distilled from the larger Llama 3.1 models
When you download a “7B model that’s surprisingly good,” there’s usually distillation in its lineage somewhere.
## Decision Framework
Start with quantization if:
- You need results today
- You don’t have training infrastructure
- The base model already does what you need
- You just want to reduce serving costs
Invest in distillation if:
- You need a model for a specific domain
- Inference cost is your primary concern long-term
- You have a strong teacher model available
- You need the best quality-to-size ratio possible
Do both if:
- You’re building a production system at scale
- Every millisecond of latency matters
- You want the smallest possible model that still works well
## Conclusion
Quantization and distillation aren’t competitors — they’re complementary tools for different stages of the optimization pipeline.
Quantization is your quick win: take any model, make it smaller, deploy it now. Distillation is your long-term investment: spend compute once to create a model that’s genuinely smarter for its size.
The frontier of efficient AI isn’t about choosing one over the other. It’s about knowing when to apply each — and increasingly, applying both in sequence for maximum impact.
Your 70B model doesn’t need to stay expensive. The knowledge is what matters, not the bytes it takes to store it.