
Building a RAG System That Actually Works: Chunking, Vector Engines, and Testing

Most RAG tutorials stop at 'put vectors in a database.' This post covers what actually determines quality: how you chunk documents, which vector search engine to pick, and how to measure and iterate on retrieval performance using Bedrock Knowledge Bases and LLM-as-judge evaluation.

Alexandre Agius

AWS Solutions Architect

13 min read

Most RAG tutorials make it look simple: embed your documents, store the vectors, retrieve and generate. But the difference between a demo that works and a production system that gives reliable answers comes down to three decisions most teams get wrong: how you chunk, how you search, and whether you actually measure quality.

The Problem

The RAG pattern is well-understood. Take your documents, split them into chunks, generate vector embeddings, store them, and at query time find the most similar chunks to inject into your LLM prompt. The architecture isn’t the hard part.

The hard part is that every step has hidden decisions that compound:

  • Chunking: Split too large and relevant information gets diluted. Split too small and you lose context. Split at the wrong boundary and you break a sentence mid-thought.
  • Vector search: Three different engines with different trade-offs in speed, memory, and accuracy. Picking wrong means either blowing your budget on RAM or getting slow, imprecise results.
  • Quality measurement: Most teams ship RAG systems without ever measuring retrieval quality. They notice the answers “feel wrong” but have no systematic way to diagnose whether the problem is chunking, embedding, retrieval, or the LLM itself.

If you’ve already picked your vector store (and if you haven’t, I wrote a comprehensive comparison of all 9 AWS options), this post covers what happens after that decision: building and tuning the retrieval pipeline.

The Solution

The key insight is that RAG is an iteration loop, not a one-time setup. You chunk, embed, retrieve, evaluate, and then adjust. Each cycle improves retrieval quality. The faster you can iterate, the faster you reach production-grade answers.

Figure: RAG pipeline showing the three phases: indexing (chunking, embedding, storage), query time (search, retrieval, generation), and evaluation (metrics, LLM-as-judge, iteration)

The pipeline has three phases:

  1. Indexing — chunk documents, generate embeddings, store chunks alongside their vectors
  2. Query — embed the question, search for nearest neighbors, assemble context, generate an answer
  3. Evaluation — measure quality, identify gaps, adjust chunking or embedding strategy, re-index

Amazon Bedrock Knowledge Bases can manage the first two phases. Bedrock Model Evaluation handles the third. Together, they turn what used to be weeks of pipeline engineering into a configuration-and-iterate workflow.

How It Works

How RAG Actually Stores and Retrieves Data

A common misconception is that vectors somehow “contain” the text and need to be decoded back. They don’t. Each document in your vector store contains both the vector (for searching) and the original text (for reading):

{
  "embedding": [0.023, -0.841, 0.112, ...],
  "text": "AWS Lambda scales automatically based on incoming requests...",
  "metadata": {"source": "docs/lambda.pdf", "page": 12, "chunk_index": 3}
}

The vector is a mathematical fingerprint of that specific text. At query time, you convert the user’s question into a vector using the same embedding model, search for the nearest vectors, and retrieve the text stored alongside them — not the vectors themselves. That text goes into the LLM prompt as context.

Critically, you store chunks, not full documents. A 50-page PDF becomes 80-200 individual entries in your vector store, each containing one chunk’s text and its embedding. The chunking strategy determines what those entries look like.
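Conceptually, retrieval is just nearest-neighbor search over those entries. A toy brute-force sketch makes the flow visible (hand-made two-dimensional vectors stand in for real embeddings, the second entry's metadata is invented, and a real store would use an ANN index rather than a linear scan):

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Each entry stores the vector (for searching) AND the original text (for reading).
store = [
    {"embedding": [0.9, 0.1],
     "text": "AWS Lambda scales automatically based on incoming requests...",
     "metadata": {"source": "docs/lambda.pdf", "page": 12}},
    {"embedding": [0.1, 0.9],
     "text": "Pricing is based on invocations...",
     "metadata": {"source": "docs/pricing.pdf", "page": 3}},
]

def retrieve(query_vector, k=1):
    # Rank entries by similarity, then return the stored TEXT, not the vectors.
    ranked = sorted(store, key=lambda e: cosine(query_vector, e["embedding"]), reverse=True)
    return [e["text"] for e in ranked[:k]]
```

A query vector near the "pricing" direction, such as `retrieve([0.2, 0.8])`, returns the pricing chunk's text, which is what goes into the LLM prompt.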

Chunking Strategies: The Biggest Quality Lever

How you split documents impacts retrieval quality more than which vector store you use or which embedding model you pick. Here are the five strategies available in Bedrock Knowledge Bases.

Fixed Size — Cut every N tokens with configurable overlap.

Document: "AAAA BBBB CCCC DDDD EEEE FFFF GGGG HHHH"

Chunk size: 4 words | Overlap: 1 word

Chunk 1: "AAAA BBBB CCCC DDDD"
Chunk 2: "DDDD EEEE FFFF GGGG"    <- "DDDD" repeated (overlap)
Chunk 3: "GGGG HHHH"

Simple, predictable, fast. The problem: it cuts mid-sentence, mid-paragraph, mid-idea. It doesn’t care about meaning. Use it when you don’t know what else to pick, or when your documents have uniform structure (logs, structured data, CSV records).
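The example above can be reproduced in a few lines. This sketch counts words to stay dependency-free; Bedrock's fixed-size strategy counts tokens:

```python
def fixed_size_chunks(text, chunk_size=4, overlap=1):
    """Split text into chunks of `chunk_size` words, repeating `overlap` words between chunks."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

print(fixed_size_chunks("AAAA BBBB CCCC DDDD EEEE FFFF GGGG HHHH"))
# -> ['AAAA BBBB CCCC DDDD', 'DDDD EEEE FFFF GGGG', 'GGGG HHHH']
```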

Semantic — Uses the embedding model itself to detect topic shifts. When consecutive sentences become too different in meaning, it splits there.

"Lambda scales automatically. It handles millions of requests."     <- topic: scaling
"Pricing is based on invocations. You pay per 100ms of compute."   <- topic: pricing

Fixed chunking might produce:
  "...It handles millions of requests. Pricing is..."  <- mixed topics

Semantic chunking produces:
  Chunk 1: "Lambda scales automatically. It handles millions of requests."
  Chunk 2: "Pricing is based on invocations. You pay per 100ms of compute."

Each chunk is coherent — about one thing. The trade-off: it’s slower (must embed sentences to detect boundaries) and produces variable-sized chunks. Best for documents that mix many topics — whitepapers, long reports, technical manuals.
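The boundary-detection idea can be sketched as follows. A toy bag-of-words embedding stands in for the real embedding model, and the 0.2 similarity threshold is illustrative:

```python
import math
from collections import Counter

def toy_embed(sentence):
    # Stand-in for a real embedding model: a sparse bag-of-words vector.
    return Counter(sentence.lower().split())

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences, threshold=0.2):
    # Split where consecutive sentences are too dissimilar: a topic shift.
    chunks, current = [], [sentences[0]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(cur)
    chunks.append(" ".join(current))
    return chunks
```

Running it on the scaling/pricing sentences from above yields two chunks, one per topic, because the scaling sentences share vocabulary with each other but not with the pricing ones.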

Hierarchical (Parent-Child) — Creates two levels of chunks. Searches with the small child (precise matching), but passes the larger parent to the LLM (richer context).

Parent (1000 tokens):
  "Lambda is a serverless compute service. It scales automatically.
   Pricing is per invocation. You pay per 100ms..."

  Child 1 (200 tokens): "Lambda is a serverless compute service..."
  Child 2 (200 tokens): "Pricing is per invocation..."

Query: "How much does Lambda cost?"
  -> Matches Child 2 (precise)
  -> Sends Parent to LLM (full context)

Best of both worlds: precise retrieval with rich context. Use it when answers need surrounding context to make sense — legal documents, technical specs, contracts.
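Building the two levels is mechanical. A sketch using word counts in place of tokens, with illustrative sizes:

```python
def hierarchical_chunks(text, parent_size=40, child_size=10):
    """Build parent chunks and link each child chunk to its parent."""
    words = text.split()
    entries = []
    for p_start in range(0, len(words), parent_size):
        parent = " ".join(words[p_start:p_start + parent_size])
        parent_id = p_start // parent_size
        for c_start in range(0, min(parent_size, len(words) - p_start), child_size):
            child = " ".join(words[p_start + c_start:p_start + c_start + child_size])
            # Index and search the child; at answer time, hand the parent to the LLM.
            entries.append({"child": child, "parent_id": parent_id, "parent": parent})
    return entries
```

At query time you embed and match against the `child` texts, then deduplicate by `parent_id` and send the parents as context.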

No Chunking — You pre-chunk the documents yourself before uploading to S3. Each file is treated as one chunk. Total control, but you build and maintain the chunking pipeline.

Custom Transformation (Lambda) — Bedrock calls your Lambda function during sync. You receive the raw document, you return the chunks. Full flexibility while still using managed embedding and indexing. Use it when your documents have specific structure (HTML tables, XML, multi-column PDFs) that generic strategies don’t handle well.
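To show the shape of the idea only: the actual Bedrock custom-transformation contract exchanges content batches through S3 and has a specific event schema (see the AWS docs), so the event format below is a hypothetical simplification, not the real interface:

```python
def handler(event, context):
    # HYPOTHETICAL simplified event shape; the real Bedrock contract reads and
    # writes content batches via S3, not inline document bodies.
    chunks = []
    for doc in event["documents"]:
        # Example custom rule: split on blank lines so sections you pre-tagged
        # (HTML tables, form blocks) stay intact as single chunks.
        for part in doc["body"].split("\n\n"):
            if part.strip():
                chunks.append({"contentBody": part.strip()})
    return {"chunks": chunks}
```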

Choosing a Starting Strategy

| Your documents look like… | Start with |
| Don’t know yet | Fixed size (512 tokens, 10% overlap) |
| Long reports mixing many topics | Semantic |
| Specs where context matters for comprehension | Hierarchical |
| Structured data (tables, XML, forms) | Custom Lambda |
| You already have a chunking pipeline | No chunking |

The right answer depends on your data, your questions, and your embedding model. The only way to know for sure is to measure — which brings us to evaluation.

Vector Search Engines: FAISS vs Lucene vs NMSLIB

If you’re using OpenSearch as your vector store, you get to choose which library performs the similarity search. This is unique to OpenSearch — Elasticsearch only uses Lucene.

| Engine | Storage | Query Speed | Recall | Filtering | Cost Driver |
| FAISS | In-memory (RAM) | Fastest | High | Pre-filter (efficient_filter) | RAM |
| NMSLIB | In-memory (RAM) | Fast | Highest | No native filtering | RAM |
| Lucene | Disk-based | Slower | High | Native (best integration) | Disk |

All three are free — they’re open-source libraries bundled into OpenSearch. There’s no per-engine pricing. What you pay for is the infrastructure to run them.

FAISS (Facebook AI Similarity Search) is the production workhorse. It supports both HNSW (graph-based) and IVF (inverted file) index types. IVF trades some recall for much lower memory usage — useful for very large datasets. FAISS also supports quantization to compress vectors and reduce RAM. The catch: vectors must live in RAM.

NMSLIB (Non-Metric Space Library) delivers the best recall accuracy but can’t filter during search — you search first, then filter results (post-filter), which can return fewer than k results. It’s being phased out in favor of FAISS and Lucene.

Lucene uses the same engine that powers OpenSearch text search. Vectors live on disk with segment-level caching, which means dramatically lower RAM requirements. It also works with UltraWarm tier (FAISS and NMSLIB don’t). Query speed is slower than FAISS, but often good enough for RAG workloads where you’re calling an LLM anyway.

The engine choice affects cost indirectly through instance sizing:

  • FAISS with 10M vectors at 1536 dimensions needs ~60 GB RAM just for vectors — you need r6g.2xlarge or bigger
  • Lucene with the same dataset keeps vectors on disk — smaller instances or even UltraWarm work fine
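The ~60 GB figure is plain arithmetic: FP32 vectors cost 4 bytes per dimension, and that is before HNSW graph overhead.

```python
def vector_ram_gb(num_vectors, dims, bytes_per_dim=4):
    # Raw FP32 vector storage only; HNSW graph structures add more on top.
    return num_vectors * dims * bytes_per_dim / 1e9

print(vector_ram_gb(10_000_000, 1536))  # -> 61.44 (GB, just for the raw vectors)
```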
Need lowest latency + highest QPS?          -> FAISS (HNSW)
Need huge dataset, budget-constrained?      -> Lucene or FAISS (IVF)
Need filtering + vector search together?    -> Lucene or FAISS with efficient_filter
Need vectors in UltraWarm storage?          -> Lucene (only option)
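With OpenSearch, the engine is selected per vector field in the index mapping. A sketch of the create-index body choosing FAISS with HNSW (field names and HNSW parameter values are illustrative, not tuned recommendations):

```python
# Body for an OpenSearch create-index call, e.g. with opensearch-py:
#   client.indices.create(index="rag-chunks", body=index_body)
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,
                "method": {
                    "name": "hnsw",        # graph-based index
                    "engine": "faiss",     # swap to "lucene" for disk-based storage
                    "space_type": "l2",
                    "parameters": {"ef_construction": 128, "m": 16},
                },
            },
            "text": {"type": "text"},      # the chunk itself, returned at query time
        }
    },
}
```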

Measuring RAG Quality: The Part Most Teams Skip

Shipping a RAG system without measuring retrieval quality is like deploying an API without monitoring. You know something is wrong only when users complain. There are two levels of measurement.

Retrieval metrics tell you if the right chunks are being found:

| Metric | What It Measures |
| Recall@k | Of the relevant chunks, how many appear in the top-k results? |
| Precision@k | Of the top-k results, how many are actually relevant? |
| MRR (Mean Reciprocal Rank) | How high does the first relevant chunk rank? |
| nDCG | Are relevant chunks ranked higher than irrelevant ones? |

If recall is low, your chunks are too big (relevant info diluted) or too small (context lost).
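Once you have a golden test set with known relevant chunk ids per question, these metrics are a few lines each:

```python
def recall_at_k(retrieved, relevant, k):
    # Of the relevant chunk ids, how many show up in the top-k retrieved?
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    # Of the top-k retrieved ids, how many are relevant?
    return len(set(retrieved[:k]) & set(relevant)) / k

def mrr(retrieved, relevant):
    # Reciprocal rank of the first relevant chunk; 0 if none was retrieved.
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1 / rank
    return 0.0
```

Average each metric over all test questions to score a chunking configuration.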

End-to-end metrics tell you if the final answer is good:

| Metric | What It Measures |
| Faithfulness | Does the answer stick to the retrieved context? (no hallucination) |
| Answer Relevancy | Does the answer actually address the question? |
| Context Relevancy | Are the retrieved chunks relevant to the question? |
| Correctness | Is the answer factually correct vs ground truth? |

How to Evaluate: RAGAS and LLM-as-Judge

The practical approach starts with a golden test set: 50-100 questions with expected answers and the source passages they come from. Then you run each chunking strategy through evaluation.

RAGAS is the most popular open-source framework. It computes faithfulness, relevancy, context precision, and recall automatically using an LLM-as-judge pattern — one model scores the outputs of another.

Bedrock Model Evaluation does this natively on AWS. You can use a Bedrock model (Claude, Nova) to score outputs on custom criteria without external tooling. For RAG specifically, it evaluates faithfulness, relevance, and correctness against your test set.
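The LLM-as-judge pattern itself is small. A sketch with a pluggable judge callable; the prompt wording and 1-5 scale are assumptions, not RAGAS's or Bedrock's exact rubric, and a stub stands in for the model call:

```python
JUDGE_PROMPT = """You are grading a RAG answer.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Score faithfulness from 1 to 5 (does the answer stick to the context?).
Reply with only the number."""

def judge_faithfulness(question, context, answer, call_model):
    # call_model is any callable that sends a prompt to an LLM and returns its
    # text reply, e.g. a wrapper around the Bedrock runtime Converse API.
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    return int(call_model(prompt).strip())

def stub_judge(prompt):
    # Stands in for a real model call so the scaffolding can be exercised.
    return "4"

score = judge_faithfulness(
    "How does Lambda scale?",
    "Lambda scales automatically based on incoming requests.",
    "It scales automatically with request volume.",
    stub_judge,
)
```

In practice you would swap `stub_judge` for a Bedrock call and average scores over the golden test set.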

The full iteration loop:

1. Build golden test set (50-100 questions + expected answers)

2. Try different strategies:
   | Strategy        | Chunk Size  | Overlap       |
   | Fixed small     | 256 tokens  | 32 tokens     |
   | Fixed medium    | 512 tokens  | 64 tokens     |
   | Semantic        | Variable    | By topic shift |
   | Hierarchical    | Both levels | Parent + child |

3. Run each through Bedrock KB (re-sync per config)

4. Evaluate with Bedrock Model Evaluation or RAGAS

5. Compare recall@5, faithfulness, answer relevancy

6. Pick the winner for YOUR data

The Re-Embedding Tax

Every time you change any of these, you must regenerate embeddings:

| Change | Re-embed? | Why |
| Chunk size or overlap | Yes | Different text produces different vectors |
| Chunking method | Yes | Completely different chunks |
| Embedding model | Yes | Different model, different vector space |
| Text preprocessing | Yes | Input text changed |

What does NOT require re-embedding: changing the k-NN engine, distance metric, number of results (k), LLM model, or prompt template.

This is why RAG tuning is expensive at scale. For 10,000 documents, each iteration costs pennies in embedding and takes minutes. For 1M documents, each iteration costs $10-100 and takes hours. Test on a representative subset first (500-1,000 docs), find the best strategy, then re-embed the full corpus once.
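The arithmetic behind those figures, with an assumed embedding price of $0.0001 per 1K tokens (illustrative only; check current Bedrock pricing, which varies by model):

```python
def embed_cost_usd(num_docs, avg_tokens_per_doc, price_per_1k_tokens=0.0001):
    # Every iteration re-embeds the whole corpus at this rate.
    return num_docs * avg_tokens_per_doc / 1000 * price_per_1k_tokens

print(embed_cost_usd(10_000, 500))     # -> 0.5   (subset-scale iteration: ~$0.50)
print(embed_cost_usd(1_000_000, 500))  # -> 50.0  (full-corpus iteration: ~$50)
```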

Bedrock Knowledge Bases: The Managed Iteration Loop

Bedrock Knowledge Bases eliminates the engineering work of the iteration cycle. Instead of building chunking pipelines, managing embedding jobs, and configuring vector indices, you:

  1. Upload documents to S3
  2. Pick a chunking strategy and embedding model in the console
  3. Hit “Sync”
  4. Query via the RetrieveAndGenerate API

Changing strategy? Update the config, hit Sync again. Bedrock re-chunks, re-embeds, and re-indexes everything. No pipeline code to maintain.
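Querying is one boto3 call against the bedrock-agent-runtime client; the knowledge base ID and model ARN below are placeholders for your own resources:

```python
def build_rag_request(question, kb_id, model_arn):
    # Request body for the bedrock-agent-runtime RetrieveAndGenerate API.
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }

def ask(question, kb_id, model_arn):
    import boto3  # requires AWS credentials; imported here so the sketch stays importable
    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve_and_generate(**build_rag_request(question, kb_id, model_arn))
    return response["output"]["text"]
```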

The underlying vector store options include OpenSearch Serverless, Aurora PostgreSQL, S3 Vectors, Neptune Analytics, and third-party options like Pinecone. For a detailed comparison of each, see the vector store guide.

Where Bedrock KB doesn’t help: it has no built-in comparison tool. It won’t tell you which chunking strategy gives better recall or whether your answers are faithful. You still need to run evaluation on top — either RAGAS or Bedrock Model Evaluation. But spinning up three Knowledge Bases with different configs and running queries against each is much faster than building three custom pipelines.

Cost to watch: OpenSearch Serverless has a minimum of ~$350/month (2 OCUs) even at zero queries. For testing and low-volume workloads, consider Aurora pgvector or S3 Vectors as the backing store instead.

Diagnosing Common RAG Problems

| Symptom | Likely Cause | Fix |
| Answer is vague or incomplete | Chunks too small, context lost | Increase chunk size or add overlap |
| Answer hallucinates details | Retrieved chunks not relevant | Smaller chunks, better embedding model, or semantic chunking |
| Right doc found but wrong section | Chunks too big, relevant passage diluted | Decrease chunk size |
| Inconsistent quality across doc types | One-size-fits-all chunking | Different strategies per doc type, or custom Lambda |
| Good retrieval but bad answers | LLM prompt issue, not retrieval | Tune prompt template, add instructions, try reranking |

What I Learned

  • Chunking is the biggest quality lever — Teams spend weeks debating vector stores and embedding models, then use default fixed-size chunking and wonder why answers are poor. Start with chunking experiments.
  • The vector engine is free, the RAM is not — OpenSearch gives you three k-NN engines at no extra cost. The real cost difference is whether your vectors live in memory (FAISS/NMSLIB) or on disk (Lucene). For most RAG workloads, Lucene is the right default.
  • You can’t improve what you don’t measure — A golden test set of 50-100 questions is the minimum investment to tune a RAG system. Without it, you’re guessing. Bedrock Model Evaluation and RAGAS make measurement practical.
  • Bedrock KB turns iteration from days into minutes — Changing a chunking strategy and re-syncing takes one click. The managed pipeline pays for itself in iteration speed, even if you eventually move to a self-managed approach.
  • Test on a subset, deploy on the full corpus — Re-embedding 1M documents for each experiment is wasteful. Find the winning strategy on 500-1,000 representative docs, then embed everything once.

What’s Next

  • Benchmark semantic vs hierarchical chunking on a real enterprise dataset using Bedrock Model Evaluation
  • Build a reference pipeline: Bedrock KB (retrieval) + reranking + ElastiCache (semantic caching) + evaluation loop
  • Test FAISS IVF with quantization for large-scale RAG (1M+ vectors) and compare cost/recall vs Lucene
  • Explore Bedrock’s custom Lambda chunking for structured documents (tables, forms, multi-column PDFs)

Alexandre Agius

AWS Solutions Architect

Passionate about AI & Security. Building scalable cloud solutions and helping organizations leverage AWS services to innovate faster. Specialized in Generative AI, serverless architectures, and security best practices.
