Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon
A practical walkthrough of two paths to working with Mistral — the managed API for fast prototyping and self-hosted deployment for full control — with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.
Table of Contents
- The Problem
- The Solution
- How It Works
- Path 1: The Mistral API
- Model Selection: Small vs Medium vs Large
- Function Calling: Connecting Models to Your Code
- RAG via the API: Embeddings + FAISS
- Path 2: Self-Hosted Deployment
- INT8 Quantization: Half the Memory, Same Quality
- Self-Hosted RAG: No API Dependency
- API vs Self-Hosted: The Decision Matrix
- What I Learned
- What’s Next
Mistral AI gives you two distinct ways to use their models: a managed API (La Plateforme) and open-weight models you can self-host. Most tutorials cover one or the other. This post covers both, side by side, so you can make an informed choice for your use case.
The Problem
You want to build with Mistral models. You go to their docs and immediately face a fork: do you use the API (mistralai Python SDK) or download the weights and run them yourself (HuggingFace Transformers)? Each path has different capabilities, costs, and trade-offs — but you won’t discover them until you’ve invested hours going down one road.
The API gives you function calling, JSON mode, and embeddings out of the box. Self-hosted gives you full control over quantization, latency, and data residency. Knowing which features live where — and what the code actually looks like — saves you from making the wrong architectural choice.
The Solution
Work through both paths in a single session. Start with the API for rapid iteration (prompting, model selection, function calling, RAG), then switch to self-hosted for deployment control (FP16 loading, INT8 quantization, local RAG).
The decision comes down to your constraints: if you need speed to market and don’t want to manage GPUs, use the API. If you need data sovereignty, predictable costs, or custom inference optimization, self-host.
How It Works
Path 1: The Mistral API
The API is the fastest way to get a response from a Mistral model. Install the SDK, provide an API key, and you're generating text in a few lines of code.
```python
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key="your-key")
response = client.chat(
    model="mistral-small-latest",
    messages=[ChatMessage(role="user", content="What is the capital of France?")]
)
print(response.choices[0].message.content)
```
This is the “hello world” of Mistral. The model parameter is where it gets interesting.
Model Selection: Small vs Medium vs Large
Mistral offers tiered models optimized for different workloads. The naming is straightforward — smaller models are faster and cheaper, larger models are more capable.
| Model | Best for | Relative cost |
|---|---|---|
| mistral-small-latest | Classification, simple extraction, routing | Lowest |
| mistral-medium-latest | Email composition, summarization, language tasks | Medium |
| mistral-large-latest | Complex reasoning, math, multi-step logic | Highest |
The practical difference shows up in tasks that require reasoning. Given a dataset of transactions and asked to find the two closest payment amounts and calculate the date difference, mistral-small gets confused. mistral-large solves it correctly by first sorting, then comparing, then calculating.
The cost difference between small and large is roughly 10x. The rule of thumb: start with small, escalate only when quality drops. Classification and extraction rarely need large. Reasoning and math usually do.
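That rule of thumb can be made concrete with a small routing helper. This is a sketch, not part of the SDK: `pick_model` and the task labels are names I'm introducing for illustration.

```python
# Hypothetical escalation helper: route to the cheap model by default,
# fall back to the large one only for task classes known to need reasoning.
REASONING_TASKS = {"math", "multi_step", "comparison"}

def pick_model(task_type: str) -> str:
    """Classification/extraction -> small; reasoning/math -> large."""
    return "mistral-large-latest" if task_type in REASONING_TASKS else "mistral-small-latest"
```

In practice you would measure quality on your own eval set before hard-coding a routing table like this.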
Function Calling: Connecting Models to Your Code
Function calling is where the API becomes genuinely useful for production systems. Instead of the model guessing at data, it calls your functions to retrieve real information.
The flow has four steps:
Step 1 — Define tools as JSON schemas:
```python
tools = [{
    "type": "function",
    "function": {
        "name": "retrieve_payment_status",
        "description": "Get payment status of a transaction",
        "parameters": {
            "type": "object",
            "properties": {
                "transaction_id": {
                    "type": "string",
                    "description": "The transaction id."
                }
            },
            "required": ["transaction_id"]
        }
    }
}]
```
Step 2 — Model generates function arguments (not the answer):
```python
response = client.chat(
    model="mistral-large-latest",
    messages=chat_history,
    tools=tools,
    tool_choice="auto"
)
# response contains: name="retrieve_payment_status", arguments={"transaction_id": "T1001"}
```
Step 3 — You execute the function with those arguments:
```python
function_result = retrieve_payment_status(df, transaction_id="T1001")
# Returns: {"status": "Paid"}
```
Step 4 — Feed the result back, model generates the final answer:
```python
chat_history.append({"role": "tool", "content": function_result, "tool_call_id": tool_id})
response = client.chat(model=model, messages=chat_history)
# "The status of your transaction T1001 is Paid."
```
The model never touches your database. It just decides which function to call and what arguments to pass. You execute, you return, the model synthesizes. This separation is what makes function calling safe for production.
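The dispatch step (Step 3) is usually just a name-to-function lookup. A minimal sketch: `names_to_functions` and the stub function are illustrative, and the arguments arrive as a JSON string, matching the tool-call shape the SDK returns.

```python
import json

# Map tool names to real implementations. The lambda is a stub for illustration;
# in production this would query your database.
names_to_functions = {
    "retrieve_payment_status": lambda transaction_id: {"status": "Paid"},
}

def execute_tool_call(name: str, arguments_json: str):
    """Look up the requested function and call it with the model's arguments."""
    args = json.loads(arguments_json)  # the model emits arguments as a JSON string
    return names_to_functions[name](**args)
```

Feeding the returned dict back as the tool message (Step 4) closes the loop.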
RAG via the API: Embeddings + FAISS
The API includes an embeddings endpoint (mistral-embed) that produces 1024-dimensional vectors. Combined with FAISS for similarity search, you get a RAG pipeline in about 30 lines.
```python
import faiss
import numpy as np

# Embed documents
def get_text_embedding(text):
    response = client.embeddings(model="mistral-embed", input=text)
    return response.data[0].embedding

# Chunk, embed, index (FAISS expects float32 arrays)
chunks = [text[i:i + 512] for i in range(0, len(text), 512)]
embeddings = np.array([get_text_embedding(chunk) for chunk in chunks]).astype("float32")
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Query
query_embedding = np.array([get_text_embedding(question)]).astype("float32")
D, I = index.search(query_embedding, k=2)
retrieved = [chunks[i] for i in I[0]]
```
Then inject the retrieved chunks into the prompt as context. The model answers based on your documents, not its training data. This is the most common enterprise pattern for Mistral deployments — answers grounded in your proprietary data, which sharply reduces hallucination.
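The injection step might look like this; the prompt wording and the `build_rag_prompt` helper are illustrative, not a fixed template.

```python
# Assemble retrieved chunks plus the question into a single grounded prompt.
def build_rag_prompt(retrieved, question):
    context = "\n---\n".join(retrieved)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Answer the query using only the context information above.\n"
        f"Query: {question}\n"
    )
```

The resulting string goes into the `content` of a user message, exactly like the "hello world" example.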
Path 2: Self-Hosted Deployment
Switching to self-hosted means downloading model weights and running inference on your own GPU. The trade-off: more setup, but full control.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)
```
On a T4 GPU (16 GB VRAM), Mistral-7B in FP16 occupies ~14.3 GB — a tight fit. The model loads, but you have barely any headroom for KV cache during generation.
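That figure is mostly just parameter count times bytes per parameter; a quick back-of-the-envelope check:

```python
# VRAM estimate for the weights alone (excludes KV cache, activations,
# and CUDA runtime overhead, which account for the rest of the ~14.3 GB).
params = 7.24e9          # Mistral-7B has roughly 7.24B parameters
bytes_per_param = 2      # FP16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1024**3
print(f"{weights_gb:.1f} GB")  # ≈ 13.5 GB for weights alone
```

The same arithmetic predicts the INT8 numbers later: 1 byte per parameter is about 6.7 GB of weights.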
Key parameters for inference:
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,                      # 0.1 = near-deterministic, 1.0 = creative
    do_sample=True,                       # False = greedy decoding
    pad_token_id=tokenizer.eos_token_id   # Mistral defines no pad token
)
```
The pad_token_id line is important — Mistral-7B doesn’t define a pad token by default, so you need to set it explicitly or you’ll get warnings that clutter your output.
INT8 Quantization: Half the Memory, Same Quality
With FP16 eating 14.3 GB on a 16 GB GPU, you have no room for anything else. INT8 quantization cuts that in half:
```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
# GPU memory: ~7.5 GB (vs 14.3 GB for FP16)
```
The output quality is nearly identical. Ask both models to explain quantization and you get the same structure, same key points, same level of detail. The INT8 version just uses half the memory and leaves room for longer contexts and batch processing.
Real numbers from a T4:
| Precision | VRAM Used | Quality | Use case |
|---|---|---|---|
| FP16 | 14.3 GB | Baseline | Dev/testing |
| INT8 | 7.5 GB | ~99% of FP16 | Production on budget GPUs |
Self-Hosted RAG: No API Dependency
For self-hosted RAG, you swap mistral-embed for a local embedding model like sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, runs on CPU) and use the same FAISS retrieval pattern:
```python
from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer('all-MiniLM-L6-v2')
# normalize_embeddings=True makes the dot product equal cosine similarity
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# Same retrieval logic: cosine similarity + top-k
query_embedding = embedder.encode([question], normalize_embeddings=True)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_k = np.argsort(similarities)[-2:][::-1]
```
The advantage: no API calls, no data leaving your infrastructure. The embedding model is tiny (~80 MB), runs on CPU, and produces results in milliseconds. For European enterprises with data sovereignty requirements, this is often the deciding factor.
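The retrieval step above can be wrapped into a small reusable helper. `retrieve_top_k` is a name I'm introducing here; it assumes the embeddings are L2-normalized so that dot product equals cosine similarity.

```python
import numpy as np

def retrieve_top_k(doc_embeddings, query_embedding, k=2):
    """Return indices of the k most similar documents, best first.

    doc_embeddings: (n_docs, dim) array of normalized embeddings
    query_embedding: (dim,) normalized query vector
    """
    sims = doc_embeddings @ query_embedding   # cosine similarity per document
    return np.argsort(sims)[-k:][::-1]        # top-k indices, descending
```

The same function works unchanged whether the vectors come from mistral-embed or a local sentence-transformers model, which makes migrating between the two paths easy.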
API vs Self-Hosted: The Decision Matrix
| Factor | API (La Plateforme) | Self-Hosted |
|---|---|---|
| Setup time | Minutes | Hours |
| GPU required | No | Yes |
| Function calling | Built-in | You build it |
| JSON mode | Built-in | Prompt engineering |
| Embeddings | mistral-embed (1024d) | Bring your own model |
| Data residency | Mistral’s servers | Your infrastructure |
| Cost model | Per token | Fixed GPU cost |
| Model selection | Small/Medium/Large | Any open-weight model |
| Quantization control | None | Full (FP16/INT8/INT4) |
| Max throughput | Rate limited | Hardware limited |
For prototyping and most SaaS products: start with the API. For regulated industries, high-volume inference, or when you need to control every parameter: self-host.
What I Learned
- Function calling is the killer feature of the API — it’s not just chat completion. The four-step flow (define tools, model generates args, you execute, model synthesizes) is what makes Mistral viable for production systems that need to interact with real data.
- INT8 quantization is free performance on budget hardware — going from 14.3 GB to 7.5 GB with no perceptible quality loss means you can run Mistral-7B on a T4 with headroom to spare. There’s no reason to run FP16 in production on memory-constrained GPUs.
- The two paths aren’t mutually exclusive — the most practical architecture uses the API for rapid development and function calling, then migrates latency-sensitive or high-volume inference to self-hosted once the use case is validated. Start managed, graduate to self-hosted.
What’s Next
- Test the new Mistral SDK (`from mistralai import Mistral` with `client.chat.complete()`) against the legacy `MistralClient` API
- Benchmark self-hosted RAG latency (embedding + retrieval + generation) vs API-based RAG end-to-end
- Explore vLLM as a self-hosted serving layer to get API-like throughput with self-hosted control