Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon

A practical walkthrough of two paths to working with Mistral — the managed API for fast prototyping and self-hosted deployment for full control — with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.

Alexandre Agius

AWS Solutions Architect

8 min read

Mistral AI gives you two distinct ways to use their models: a managed API (La Plateforme) and open-weight models you can self-host. Most tutorials cover one or the other. This post covers both, side by side, so you can make an informed choice for your use case.

The Problem

You want to build with Mistral models. You go to their docs and immediately face a fork: do you use the API (mistralai Python SDK) or download the weights and run them yourself (HuggingFace Transformers)? Each path has different capabilities, costs, and trade-offs — but you won’t discover them until you’ve invested hours going down one road.

The API gives you function calling, JSON mode, and embeddings out of the box. Self-hosted gives you full control over quantization, latency, and data residency. Knowing which features live where — and what the code actually looks like — saves you from making the wrong architectural choice.

The Solution

Work through both paths in a single session. Start with the API for rapid iteration (prompting, model selection, function calling, RAG), then switch to self-hosted for deployment control (FP16 loading, INT8 quantization, local RAG).

Two Paths to Mistral AI

The decision comes down to your constraints: if you need speed to market and don’t want to manage GPUs, use the API. If you need data sovereignty, predictable costs, or custom inference optimization, self-host.

How It Works

Path 1: The Mistral API

The API is the fastest way to get a response from a Mistral model. Install the SDK, provide an API key, and you're generating text in a handful of lines of code.

from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key="your-key")

response = client.chat(
    model="mistral-small-latest",
    messages=[ChatMessage(role="user", content="What is the capital of France?")]
)
print(response.choices[0].message.content)

This is the “hello world” of Mistral. The model parameter is where it gets interesting.

Model Selection: Small vs Medium vs Large

Mistral offers tiered models optimized for different workloads. The naming is straightforward — smaller models are faster and cheaper, larger models are more capable.

| Model | Best for | Relative cost |
| --- | --- | --- |
| mistral-small-latest | Classification, simple extraction, routing | Lowest |
| mistral-medium-latest | Email composition, summarization, language tasks | Medium |
| mistral-large-latest | Complex reasoning, math, multi-step logic | Highest |

The practical difference shows up in tasks that require reasoning. Given a dataset of transactions and asked to find the two closest payment amounts and calculate the date difference, mistral-small gets confused. mistral-large solves it correctly by first sorting, then comparing, then calculating.

The cost difference between small and large is roughly 10x. The rule of thumb: start with small, escalate only when quality drops. Classification and extraction rarely need large. Reasoning and math usually do.
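One way to encode that rule of thumb is a routing table. This is a hypothetical sketch — the task categories are assumptions drawn from the table above, not part of any official Mistral API:

```python
# Hypothetical mapping from task type to model tier, following the
# "start with small, escalate only when quality drops" rule.
TIER_BY_TASK = {
    "classification": "mistral-small-latest",
    "extraction": "mistral-small-latest",
    "summarization": "mistral-medium-latest",
    "email": "mistral-medium-latest",
    "reasoning": "mistral-large-latest",
    "math": "mistral-large-latest",
}

def pick_model(task_type: str) -> str:
    """Default to the cheapest tier; escalate only for known-hard task types."""
    return TIER_BY_TASK.get(task_type, "mistral-small-latest")
```

The returned model name then drops straight into the `model` parameter of `client.chat(...)`, so cost control becomes a one-line lookup instead of a per-call decision.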

Function Calling: Connecting Models to Your Code

Function calling is where the API becomes genuinely useful for production systems. Instead of the model guessing at data, it calls your functions to retrieve real information.

The flow has four steps:

Step 1 — Define tools as JSON schemas:

tools = [{
    "type": "function",
    "function": {
        "name": "retrieve_payment_status",
        "description": "Get payment status of a transaction",
        "parameters": {
            "type": "object",
            "properties": {
                "transaction_id": {
                    "type": "string",
                    "description": "The transaction id."
                }
            },
            "required": ["transaction_id"]
        }
    }
}]

Step 2 — Model generates function arguments (not the answer):

response = client.chat(
    model="mistral-large-latest",
    messages=chat_history,
    tools=tools,
    tool_choice="auto"
)
# response contains: name="retrieve_payment_status", arguments={"transaction_id": "T1001"}

Step 3 — You execute the function with those arguments:

function_result = retrieve_payment_status(df, transaction_id="T1001")
# Returns the JSON string: '{"status": "Paid"}'

Step 4 — Feed the result back, model generates the final answer:

chat_history.append({"role": "tool", "content": function_result, "tool_call_id": tool_id})
response = client.chat(model=model, messages=chat_history)
# "The status of your transaction T1001 is Paid."

The model never touches your database. It just decides which function to call and what arguments to pass. You execute, you return, the model synthesizes. This separation is what makes function calling safe for production.
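That separation is easy to see in isolation. Here is a minimal sketch of step 3 on its own, assuming a hypothetical local tool registry and a stubbed-in dictionary standing in for a real database lookup:

```python
import json

def retrieve_payment_status(transaction_id):
    """Stand-in for a real database lookup (the df-backed version above)."""
    fake_db = {"T1001": "Paid"}
    return json.dumps({"status": fake_db.get(transaction_id, "Unknown")})

# Hypothetical registry mapping tool names (from the JSON schemas) to callables.
TOOL_REGISTRY = {"retrieve_payment_status": retrieve_payment_status}

def execute_tool_call(name, arguments_json):
    """Parse the model's generated arguments and run the matching function."""
    args = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**args)

# The model emitted: name="retrieve_payment_status",
# arguments='{"transaction_id": "T1001"}'
result = execute_tool_call("retrieve_payment_status",
                           '{"transaction_id": "T1001"}')
# result is the JSON string '{"status": "Paid"}'
```

The registry is the safety boundary: the model can only name functions you have explicitly registered, and you control exactly what each one does.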

RAG via the API: Embeddings + FAISS

The API includes an embeddings endpoint (mistral-embed) that produces 1024-dimensional vectors. Combined with FAISS for similarity search, you get a RAG pipeline in about 30 lines.

import faiss
import numpy as np

# Embed documents
def get_text_embedding(text):
    response = client.embeddings(model="mistral-embed", input=text)
    return response.data[0].embedding

# Chunk, embed, index
chunks = [text[i:i+512] for i in range(0, len(text), 512)]
embeddings = np.array([get_text_embedding(chunk) for chunk in chunks])

index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Query
query_embedding = np.array([get_text_embedding(question)])
D, I = index.search(query_embedding, k=2)  # distances and indices of top-2 chunks
retrieved = [chunks[i] for i in I[0]]

Then inject the retrieved chunks into the prompt as context. The model answers based on your documents rather than only its training data. This is the most common enterprise pattern for Mistral deployments: grounded answers over your proprietary data, with far fewer hallucinations.
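The injection step itself is just prompt assembly. A minimal template — the delimiters and wording here are illustrative choices, not a prescribed Mistral format:

```python
def build_rag_prompt(retrieved_chunks, question):
    """Inject retrieved chunks into the prompt as grounding context."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Answer the query using only the context above.\n"
        f"Query: {question}\n"
    )

prompt = build_rag_prompt(
    ["Mistral AI was founded in 2023."],
    "When was Mistral AI founded?",
)
# Send `prompt` as the user message content in client.chat(...)
```

The "using only the context above" instruction is what keeps answers grounded; without it, the model will happily blend retrieved text with its training data.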

Path 2: Self-Hosted Deployment

Switching to self-hosted means downloading model weights and running inference on your own GPU. The trade-off: more setup, but full control.

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

On a T4 GPU (16 GB VRAM), Mistral-7B in FP16 occupies ~14.3 GB — a tight fit. The model loads, but you have barely any headroom for KV cache during generation.
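That 14.3 GB figure is roughly what the parameter count predicts. A back-of-envelope check, assuming ~7.24 billion parameters at 2 bytes each (weights only, before activations and KV cache):

```python
params = 7.24e9          # approximate Mistral-7B parameter count
bytes_per_param = 2      # FP16 stores each weight in 2 bytes
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.1f} GB")  # ~14.5 GB for the weights alone
```

On a 16 GB card that leaves under 2 GB for everything else, which is why generation with long contexts gets tight.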

Key parameters for inference:

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.7,      # lower = more deterministic, higher = more creative
    do_sample=True,       # False = greedy decoding
    pad_token_id=tokenizer.eos_token_id  # Mistral defines no pad token
)

The pad_token_id line is important — Mistral-7B doesn’t define a pad token by default, so you need to set it explicitly or you’ll get warnings that clutter your output.

INT8 Quantization: Half the Memory, Same Quality

With FP16 eating 14.3 GB on a 16 GB GPU, you have no room for anything else. INT8 quantization cuts that in half:

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model_int8 = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"
)
# GPU memory: ~7.5 GB (vs 14.3 GB for FP16)

The output quality is nearly identical. Ask both models to explain quantization and you get the same structure, same key points, same level of detail. The INT8 version just uses half the memory and leaves room for longer contexts and batch processing.

Real numbers from a T4:

| Precision | VRAM used | Quality | Use case |
| --- | --- | --- | --- |
| FP16 | 14.3 GB | Baseline | Dev/testing |
| INT8 | 7.5 GB | ~99% of FP16 | Production on budget GPUs |

Self-Hosted RAG: No API Dependency

For self-hosted RAG, you swap mistral-embed for a local embedding model like sentence-transformers/all-MiniLM-L6-v2 (384 dimensions, runs on CPU) and use the same FAISS retrieval pattern:

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('all-MiniLM-L6-v2')
# normalize_embeddings=True makes the dot product below a true cosine similarity
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

# Same retrieval logic: cosine similarity + top-k
query_embedding = embedder.encode([question], normalize_embeddings=True)
similarities = np.dot(doc_embeddings, query_embedding.T).flatten()
top_k = np.argsort(similarities)[-2:][::-1]

The advantage: no API calls, no data leaving your infrastructure. The embedding model is tiny (~80 MB), runs on CPU, and produces results in milliseconds. For European enterprises with data sovereignty requirements, this is often the deciding factor.
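The retrieval step is pure vector math, so it can be exercised without any embedding model at all — a minimal top-k cosine search over toy 2-D vectors, with the understanding that the real pipeline substitutes `embedder.encode` outputs:

```python
import numpy as np

def top_k_cosine(doc_vecs, query_vec, k=2):
    """Top-k retrieval by cosine similarity (normalize, dot, sort)."""
    doc_vecs = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    sims = doc_vecs @ query_vec
    return np.argsort(sims)[-k:][::-1]  # indices, most similar first

# Toy example: doc 0 points the same way as the query, doc 2 nearly so.
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
query = np.array([1.0, 0.0])
print(top_k_cosine(docs, query))
```

Swapping FAISS for this brute-force version is fine up to tens of thousands of chunks; beyond that, an index earns its keep.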

API vs Self-Hosted: The Decision Matrix

| Factor | API (La Plateforme) | Self-hosted |
| --- | --- | --- |
| Setup time | Minutes | Hours |
| GPU required | No | Yes |
| Function calling | Built-in | You build it |
| JSON mode | Built-in | Prompt engineering |
| Embeddings | mistral-embed (1024d) | Bring your own model |
| Data residency | Mistral's servers | Your infrastructure |
| Cost model | Per token | Fixed GPU cost |
| Model selection | Small/Medium/Large | Any open-weight model |
| Quantization control | None | Full (FP16/INT8/INT4) |
| Max throughput | Rate limited | Hardware limited |

For prototyping and most SaaS products: start with the API. For regulated industries, high-volume inference, or when you need to control every parameter: self-host.

What I Learned

  • Function calling is the killer feature of the API — it’s not just chat completion. The four-step flow (define tools, model generates args, you execute, model synthesizes) is what makes Mistral viable for production systems that need to interact with real data.
  • INT8 quantization is free performance on budget hardware — going from 14.3 GB to 7.5 GB with no perceptible quality loss means you can run Mistral-7B on a T4 with headroom to spare. There’s no reason to run FP16 in production on memory-constrained GPUs.
  • The two paths aren’t mutually exclusive — the most practical architecture uses the API for rapid development and function calling, then migrates latency-sensitive or high-volume inference to self-hosted once the use case is validated. Start managed, graduate to self-hosted.

What’s Next

  • Test the new Mistral SDK (from mistralai import Mistral with client.chat.complete()) against the legacy MistralClient API
  • Benchmark self-hosted RAG latency (embedding + retrieval + generation) vs API-based RAG end-to-end
  • Explore vLLM as a self-hosted serving layer to get API-like throughput with self-hosted control