Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon
A practical walkthrough of two paths to working with Mistral — the managed API for fast prototyping and self-hosted deployment for full control — with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.
Table of Contents
- The Problem
- The Solution
- How It Works
- Path 1: The Mistral API
- Model Selection: Small vs Medium vs Large
- Function Calling: Connecting Models to Your Code
- RAG via the API: Embeddings + FAISS
- Path 2: Self-Hosted Deployment
- INT8 Quantization: Half the Memory, Same Quality
- Self-Hosted RAG: No API Dependency
- API vs Self-Hosted: The Decision Matrix
- What I Learned
- What’s Next
Mistral AI gives you two distinct ways to use their models: a managed API (La Plateforme) and open-weight models you can self-host. Most tutorials cover one or the other. This post covers both, side by side, so you can make an informed choice for your use case.
The Problem
You want to build with Mistral models. You go to their docs and immediately face a fork: do you use the API (mistralai Python SDK) or download the weights and run them yourself (HuggingFace Transformers)? Each path has different capabilities, costs, and trade-offs — but you won’t discover them until you’ve invested hours going down one road.
The API gives you function calling, JSON mode, and embeddings out of the box. Self-hosted gives you full control over quantization, latency, and data residency. Knowing which features live where — and what the code actually looks like — saves you from making the wrong architectural choice.
The Solution
Work through both paths in a single session. Start with the API for rapid iteration (prompting, model selection, function calling, RAG), then switch to self-hosted for deployment control (FP16 loading, INT8 quantization, local RAG).
The decision comes down to your constraints: if you need speed to market and don’t want to manage GPUs, use the API. If you need data sovereignty, predictable costs, or custom inference optimization, self-host.
How It Works
Path 1: The Mistral API
The API is the fastest way to get a response from a Mistral model. Install the SDK, provide an API key, and you're generating text in a few lines of code.
```python
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key="your-key")
response = client.chat(
    model="mistral-small-latest",
    messages=[ChatMessage(role="user", content="What is the capital of France?")]
)
print(response.choices[0].message.content)
```
This is the “hello world” of Mistral. The model parameter is where it gets interesting.
Model Selection: Small vs Medium vs Large
Mistral offers tiered models optimized for different workloads. The naming is straightforward — smaller models are faster and cheaper, larger models are more capable.
| Model | Best for | Relative cost |
|---|---|---|
| mistral-small-latest | Classification, simple extraction, routing | Lowest |
| mistral-medium-latest | Email composition, summarization, language tasks | Medium |
| mistral-large-latest | Complex reasoning, math, multi-step logic | Highest |
The practical difference shows up in tasks that require reasoning. Given a dataset of transactions and asked to find the two closest payment amounts and calculate the date difference, mistral-small gets confused. mistral-large solves it correctly by first sorting, then comparing, then calculating.
The cost difference between small and large is roughly 10x. The rule of thumb: start with small, escalate only when quality drops. Classification and extraction rarely need large. Reasoning and math usually do.
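That rule of thumb can be made concrete with a small routing helper. This is a sketch, not part of the SDK: `pick_model` and the task labels are names I'm introducing for illustration.

```python
# Hypothetical escalation helper: route to the cheap model by default,
# fall back to the large one only for task classes known to need reasoning.
REASONING_TASKS = {"math", "multi_step", "comparison"}

def pick_model(task_type: str) -> str:
    """Classification/extraction -> small; reasoning/math -> large."""
    return "mistral-large-latest" if task_type in REASONING_TASKS else "mistral-small-latest"
```

In practice you would measure quality on your own eval set before hard-coding a routing table like this.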
Function Calling: Connecting Models to Your Code
Function calling is where the API becomes genuinely useful for production systems. Instead of the model guessing at data, it calls your functions to retrieve real information.
The flow has four steps:
Step 1 — Define tools as JSON schemas:
```python
tools = [{
    "type": "function",
    "function": {
        "name": "retrieve_payment_status",
        "description": "Get payment status of a transaction",
        "parameters": {
            "type": "object",
            "properties": {
                "transaction_id": {
                    "type": "string",
                    "description": "The transaction id."
                }
            },
            "required": ["transaction_id"]
        }
    }
}]
```
Step 2 — Model generates function arguments (not the answer):
```python
response = client.chat(
    model="mistral-large-latest",
    messages=chat_history,
    tools=tools,
    tool_choice="auto"
)
# response contains: name="retrieve_payment_status", arguments={"transaction_id": "T1001"}
```
Step 3 — You execute the function with those arguments:
```python
function_result = retrieve_payment_status(df, transaction_id="T1001")
# Returns: {"status": "Paid"}
```
Step 4 — Feed the result back, model generates the final answer:
```python
chat_history.append({"role": "tool", "content": function_result, "tool_call_id": tool_id})
response = client.chat(model=model, messages=chat_history)
# "The status of your transaction T1001 is Paid."
```
The model never touches your database. It just decides which function to call and what arguments to pass. You execute, you return, the model synthesizes. This separation is what makes function calling safe for production.
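The dispatch step (Step 3) is usually just a name-to-function lookup. A minimal sketch: `names_to_functions` and the stub function are illustrative, and the arguments arrive as a JSON string, matching the tool-call shape the SDK returns.

```python
import json

# Map tool names to real implementations. The lambda is a stub for illustration;
# in production this would query your database.
names_to_functions = {
    "retrieve_payment_status": lambda transaction_id: {"status": "Paid"},
}

def execute_tool_call(name: str, arguments_json: str):
    """Look up the requested function and call it with the model's arguments."""
    args = json.loads(arguments_json)  # the model emits arguments as a JSON string
    return names_to_functions[name](**args)
```

Feeding the returned dict back as the tool message (Step 4) closes the loop.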
RAG via the API: Embeddings + FAISS
The API includes an embeddings endpoint (mistral-embed) that produces 1024-dimensional vectors. Combined with FAISS for similarity search, you get a RAG pipeline in about 30 lines.
```python
import faiss
import numpy as np

# Embed documents
def get_text_embedding(text):
    response = client.embeddings(model="mistral-embed", input=text)
    return response.data[0].embedding

# Chunk, embed, index (FAISS expects float32 arrays)
chunks = [text[i:i + 512] for i in range(0, len(text), 512)]
embeddings = np.array([get_text_embedding(chunk) for chunk in chunks]).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Query
query_embedding = np.array([get_text_embedding(question)]).astype("float32")
D, I = index.search(query_embedding, k=2)
retrieved = [chunks[i] for i in I[0]]
```
Then inject the retrieved chunks into the prompt as context. The model answers based on your documents, not its training data. This is the most common enterprise pattern for Mistral deployments — answers grounded in your proprietary data, which sharply reduces hallucination.
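The injection step might look like this; the prompt wording and the `build_rag_prompt` helper are illustrative, not a fixed template.

```python
# Assemble retrieved chunks plus the question into a single grounded prompt.
def build_rag_prompt(retrieved, question):
    context = "\n---\n".join(retrieved)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the query using only the context information above.\n"
        f"Query: {question}\n"
    )
```

The resulting string goes into the `content` of a user message, exactly like the "hello world" example.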
Path 2: Self-Hosted Deployment
Switching to self-hosted means downloading model weights and running inference on your own GPU. The trade-off: more setup, but full control.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
```
On a T4 GPU (16 GB VRAM), Mistral-7B in FP16 occupies ~14.3 GB — a tight fit. The model loads, but you have barely any headroom for KV cache during generation.
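That figure is mostly just parameter count times bytes per parameter; a quick back-of-the-envelope check:

```python
# VRAM estimate for the weights alone (excludes KV cache, activations,
# and CUDA runtime overhead, which account for the rest of the ~14.3 GB).
params = 7.24e9          # Mistral-7B has roughly 7.24B parameters
bytes_per_param = 2      # FP16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1024**3
print(f"{weights_gb:.1f} GB")  # ≈ 13.5 GB for weights alone
```

The same arithmetic predicts the INT8 numbers later: 1 byte per parameter is about 6.7 GB of weights.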
Key parameters for inference:
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,                      # 0.1 = near-deterministic, 1.0 = creative
    do_sample=True,                       # False = greedy decoding
    pad_token_id=tokenizer.eos_token_id   # Mistral defines no pad token
)
```
The pad_token_id line is important — Mistral-7B doesn’t define a pad token by default, so you need to set it explicitly or you’ll get warnings that clutter your output.
INT8 Quantization: Half the Memory, Same Quality
With FP16 eating 14.3 GB on a 16 GB GPU, you have no room for anything else. INT8 quantization cuts that in half:
```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
# GPU memory: ~7.5 GB (vs 14.3 GB for FP16)
```
The output quality is nearly identical. Ask both models to explain quantization and you get the same structure, same key points, same level of detail. The INT8 version just uses half the memory and leaves room for longer contexts and batch processing.
Real numbers from a T4:
| Precision | VRAM Used | Quality | Use case |
|---|---|---|---|
| FP16 | 14.3 GB | Baseline | Dev/testing |
| INT8 | 7.5 GB | ~99% of FP16 | Production on budget GPUs |
Self-Hosted RAG: No API Dependency
For self-hosted RAG, you swap mistral-embed for a local embedding model like sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, runs on CPU) and use the same FAISS retrieval pattern:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')
# normalize_embeddings=True makes the dot product equal cosine similarity
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# Same retrieval logic: cosine similarity + top-k
query_embedding = embedder.encode([question], normalize_embeddings=True)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_k = np.argsort(similarities)[-2:][::-1]
```
The advantage: no API calls, no data leaving your infrastructure. The embedding model is tiny (~80 MB), runs on CPU, and produces results in milliseconds. For European enterprises with data sovereignty requirements, this is often the deciding factor.
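The retrieval step above can be wrapped into a small reusable helper. `retrieve_top_k` is a name I'm introducing here; it assumes the embeddings are L2-normalized so that dot product equals cosine similarity.

```python
import numpy as np

def retrieve_top_k(doc_embeddings, query_embedding, k=2):
    """Return indices of the k most similar documents, best first.

    doc_embeddings: (n_docs, dim) array of normalized embeddings
    query_embedding: (dim,) normalized query vector
    """
    sims = doc_embeddings @ query_embedding   # cosine similarity per document
    return np.argsort(sims)[-k:][::-1]        # top-k indices, descending
```

The same function works unchanged whether the vectors come from mistral-embed or a local sentence-transformers model, which makes migrating between the two paths easy.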
API vs Self-Hosted: The Decision Matrix
| Factor | API (La Plateforme) | Self-Hosted |
|---|---|---|
| Setup time | Minutes | Hours |
| GPU required | No | Yes |
| Function calling | Built-in | You build it |
| JSON mode | Built-in | Prompt engineering |
| Embeddings | mistral-embed (1024d) | Bring your own model |
| Data residency | Mistral’s servers | Your infrastructure |
| Cost model | Per token | Fixed GPU cost |
| Model selection | Small/Medium/Large | Any open-weight model |
| Quantization control | None | Full (FP16/INT8/INT4) |
| Max throughput | Rate limited | Hardware limited |
For prototyping and most SaaS products: start with the API. For regulated industries, high-volume inference, or when you need to control every parameter: self-host.
What I Learned
- Function calling is the killer feature of the API — it’s not just chat completion. The four-step flow (define tools, model generates args, you execute, model synthesizes) is what makes Mistral viable for production systems that need to interact with real data.
- INT8 quantization is free performance on budget hardware — going from 14.3 GB to 7.5 GB with no perceptible quality loss means you can run Mistral-7B on a T4 with headroom to spare. There’s no reason to run FP16 in production on memory-constrained GPUs.
- The two paths aren’t mutually exclusive — the most practical architecture uses the API for rapid development and function calling, then migrates latency-sensitive or high-volume inference to self-hosted once the use case is validated. Start managed, graduate to self-hosted.
What’s Next
- Test the new Mistral SDK (`from mistralai import Mistral` with `client.chat.complete()`) against the legacy `MistralClient` API
- Benchmark self-hosted RAG latency (embedding + retrieval + generation) vs API-based RAG end-to-end
- Explore vLLM as a self-hosted serving layer to get API-like throughput with self-hosted control