
Building RAG that actually works in production

Most retrieval-augmented generation systems look great in demos and fall apart in production. Here are the decisions that separate the two.

HO, Head of AI

Retrieval-augmented generation is the most over-demoed and under-shipped pattern in AI right now. Building a RAG prototype on a PDF corpus takes an afternoon. Building a production RAG system that answers user questions correctly, consistently, and at acceptable latency is a multi-month project.

Here is what we have learned.

The retrieval problem is the hard problem

Most teams optimize the generation side — prompt engineering, model selection, response formatting. The retrieval side is where production RAG actually fails.

  • Chunking strategy matters more than embedding model. A sensible chunking strategy (respecting document structure, overlapping windows, metadata filters) with a mid-tier embedding model beats a naive chunking strategy with the best embedding model.
  • Reranking is not optional. A cheap retriever surfacing 50 candidates and a cross-encoder reranker promoting the best 5 outperforms a single-stage dense retriever on almost every task.
  • Hybrid is the default. BM25 + dense embeddings with score fusion catches queries that either method alone misses.
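The hybrid point above can be sketched with reciprocal rank fusion, a common way to combine BM25 and dense rankings without calibrating their raw scores. The document IDs and the constant k=60 are illustrative, not from our stack:

```python
def rrf_fuse(bm25_ranking, dense_ranking, k=60):
    """Reciprocal rank fusion over two ranked lists of document IDs.

    Each document scores the sum of 1 / (k + rank) across the rankings
    it appears in; k=60 is the constant commonly used in the literature.
    """
    scores = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document ranked moderately by both methods can beat one that
# only a single method surfaces, which is the point of going hybrid.
fused = rrf_fuse(["a", "b", "c"], ["c", "b", "d"])
```

Because RRF only uses ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incompatible scales.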

Evaluation is the whole game

You cannot improve what you cannot measure. Every production RAG system we have shipped has a dedicated evaluation harness from week one, with:

  • A curated question set that covers the intents users actually have.
  • Ground-truth answers written by domain experts.
  • Automated metrics (answer relevance, faithfulness, citation accuracy) plus a small human-reviewed sample per release.
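A harness with those ingredients can be as small as the sketch below. The `answer_fn` and `judge_fn` callables are assumptions standing in for your RAG pipeline and your relevance judge (an LLM grader or a human); the metric names mirror the list above:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str       # one curated user intent
    ground_truth: str   # expert-written reference answer
    expected_source: str  # doc ID a correct answer should cite

def run_eval(cases, answer_fn, judge_fn):
    """Score a RAG system against a curated question set.

    answer_fn(question) -> (answer_text, cited_source_ids)
    judge_fn(answer, ground_truth) -> float in [0, 1]
    """
    relevance, citation_hits = [], 0
    for case in cases:
        answer, citations = answer_fn(case.question)
        relevance.append(judge_fn(answer, case.ground_truth))
        citation_hits += case.expected_source in citations
    return {
        "answer_relevance": sum(relevance) / len(relevance),
        "citation_accuracy": citation_hits / len(cases),
    }
```

Run it on every prompt tweak and model upgrade; a regression on `citation_accuracy` is exactly the kind of silent failure the section above warns about.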

Without this, you are flying blind. Every prompt tweak or model upgrade feels like an improvement until you measure it and realize it regressed on the queries that matter.

Answer quality is a product decision

The last 20% of answer quality comes from product choices, not model choices. Things that matter:

  • Say "I don't know" well. The system should decline to answer when it is not confident, and say so in a way that builds trust rather than feeling evasive.
  • Cite everything. Every factual claim should link back to the retrieved source. Users who can verify answers will use the system more.
  • Handle follow-ups. Real users ask follow-up questions. Your retriever needs conversation context, not just the latest turn.
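The first two points combine naturally at the answer boundary: decline when retrieval confidence is low, and return citations alongside the answer when it is not. A minimal sketch, assuming a `retriever` that returns scored chunks and a `generator` that writes a grounded answer; the threshold value is illustrative and should be tuned per corpus:

```python
CONFIDENCE_FLOOR = 0.55  # illustrative; tune against your eval set

def answer_or_decline(query, retriever, generator):
    """Decline gracefully on low confidence; cite sources otherwise.

    retriever(query) -> list of (chunk_text, source_id, score), best first
    generator(query, chunks) -> answer text grounded in the chunks
    """
    hits = retriever(query)
    if not hits or hits[0][2] < CONFIDENCE_FLOOR:
        # A direct, non-evasive decline builds more trust than a guess.
        return (
            "I could not find a reliable answer to that in the indexed "
            "documents. Could you rephrase or narrow the question?",
            [],
        )
    chunks = [text for text, _, _ in hits[:5]]
    sources = [source_id for _, source_id, _ in hits[:5]]
    return generator(query, chunks), sources
```

For follow-ups, the same shape works if `query` is first rewritten into a standalone question using the conversation history, so the retriever sees the full context rather than just the latest turn.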

The first 80% of RAG quality comes from the basics. The last 20% is where most teams quit. Do not quit.

Tags: RAG, LLM, Production AI
