Posts tagged Inference

5 posts

Cloud

AWS + Cerebras: wafer-scale inference is coming to Bedrock

AWS is deploying Cerebras CS-3 systems in its data centers, pairing Trainium for prefill with the Wafer Scale Engine 3 for decode. Why disaggregated inference is the right architecture, and what makes a 4-trillion-transistor chip the right tool for the decode problem.

23 Jul 2026 · 5 min read

AI Engineering

Can your robot's brain live in the cloud?

Wafer-scale inference on Bedrock and open 3T-class models both landed this month. Neither will run your robot's control loop — but together they redraw the line between what a robot computes on-board and what it can safely delegate to the cloud.

23 Jul 2026 · 6 min read

LLM Inference Demystified: PagedAttention, KV Cache, MoE & Continuous Batching

The 5 key concepts every cloud architect should know about LLM serving: PagedAttention, KV cache mechanics, continuous batching, MoE trade-offs, and real production numbers.

26 Feb 2026 · 13 min read

Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon

A practical walkthrough of two paths to working with Mistral — the managed API for fast prototyping and self-hosted deployment for full control — with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.

24 Feb 2026 · 9 min read

TFLOPS: The GPU Metric Every AI Engineer Should Understand

What TFLOPS actually measures, why FP16 matters for LLMs, and why the most important GPU bottleneck for inference isn't compute at all.

24 Feb 2026 · 9 min read
