A working RAG demo takes a senior engineer about four hours: LangChain, OpenAI embeddings, Pinecone, a PDF loader, a system prompt. A working production RAG takes 10-16 weeks. The gap between the two isn't engineering velocity; it's the dozens of architectural decisions a demo glosses over. Chunk size? Use the 1000 characters that most LangChain tutorials show. Embedding model? Whatever's at the top of the leaderboard this week. Retrieval? Top-K=4 because that's what every tutorial does. Each of those defaults is wrong for production. This playbook walks through every one. After 20+ production RAG deployments across fintech, healthcare, legal, and enterprise customer support, NKKTech has seen the same failure patterns and the same architectural wins repeat. Here's the playbook we wish we'd had in 2023.
What "Production RAG" Actually Means
A demo RAG is judged on whether it returns a plausible answer for one or two questions. A production RAG is judged on five different things, every one of which can break independently.
Retrieval recall: when the answer exists in the corpus, does the retriever surface it in the top-K results? Demos rarely measure this; production systems live or die by it.
Answer faithfulness: does the generated answer accurately reflect the retrieved context, or does the LLM hallucinate beyond it? A hallucinating RAG is worse than no RAG, because users trust grounded answers more.
Coverage and freshness: when documents are added, updated, or deleted in your source-of-truth system, how fast does the RAG index reflect the change? "Within 60 seconds" and "within 60 minutes" are entirely different architectures.
Cost per query: a production RAG that costs $0.40/query is unshippable for B2C; one that costs $0.001/query may be fine for high-volume enterprise.
Latency: p50 under 1.5s and p99 under 4s are typical product UX bars for chat-style RAG; analytical or backend RAG can run longer.
Most production RAG incidents we've debugged at NKKTech traced back to one of three architectural choices made on day one and never revisited: the chunking strategy, the embedding model, and the retrieval pattern. Get those three right and most other problems are tractable. Get them wrong and you spend month seven trying to patch around them.
Chunking Strategy: Where Most RAG Systems Die
Chunking is the lossy compression of your knowledge into retrievable units. Every chunking choice is a tradeoff between context preservation and retrieval precision, and the right tradeoff is domain-specific.
Fixed-size chunking (e.g., 1000 chars with 200 char overlap, the setting most LangChain tutorials use) is the worst choice for most real-world documents. It splits sentences mid-clause, fragments tables, and breaks lists. The retriever returns chunks that look relevant by embedding similarity but are fragments the LLM cannot use. Fixed-size chunking is acceptable for short, uniform text (FAQ answers, product descriptions); it fails for long-form documents, contracts, technical specs, and structured data.
Recursive character chunking (LangChain's RecursiveCharacterTextSplitter, which by default tries \n\n, then \n, then a space, and can be given sentence separators such as ". ") is better. It preserves paragraph and sentence boundaries when possible. Use this as the baseline for prose-heavy content.
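To make the recursive idea concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split`, the separator list, and the choice to drop the separator at chunk boundaries are assumptions made for illustration, and there is no overlap handling.

```python
def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text into chunks of at most max_chars, preferring coarse separators."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: list[str] = []
        current = ""
        for part in text.split(sep):
            candidate = part if not current else current + sep + part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing pieces into the same chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > max_chars:
                # This piece alone is too long: retry with finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[index + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks
    # No separator left to try: fall back to a hard character cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Paragraphs that fit stay whole; only an oversized paragraph is broken down further, first by line, then by sentence, then by word.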
Semantic chunking — using an embedding model to detect natural topic boundaries within a document and chunking at those boundaries — produces the highest retrieval quality for long-form content where topics shift mid-document. The cost: 2-5x slower ingestion. Worth it for legal, medical, and research corpora.
Structure-aware chunking is what production systems actually need. For Markdown documents, chunk by headings (preserve the heading hierarchy in each chunk's metadata). For HTML, chunk by semantic sections. For tables, treat each row (or a row plus its header) as its own chunk; don't try to embed an entire 200-row table as one chunk. For code, chunk by function or class. The implementation is more work — you need a per-format parser — but the retrieval improvement is dramatic.
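For the Markdown case, a heading-based chunker can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a full parser: the `Chunk` type and `chunk_markdown` name are hypothetical, only ATX headings (`#` to `######`) are recognized, and `#` lines inside fenced code blocks would be misread as headings.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


@dataclass
class Chunk:
    path: str  # heading trail, e.g. "Acme MSA / Section 3 / 3.2"
    text: str  # body text found under that heading


def chunk_markdown(doc_title: str, markdown: str) -> list[Chunk]:
    """Split a Markdown document at headings, keeping the heading hierarchy."""
    chunks: list[Chunk] = []
    trail: list[tuple[int, str]] = []  # stack of (level, heading text)
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            path = " / ".join([doc_title] + [title for _, title in trail])
            chunks.append(Chunk(path=path, text=text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # emit the body collected under the previous heading
            level = len(match.group(1))
            # Pop headings at the same or a deeper level before pushing.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries its path from the document root, which can be stored as metadata or prepended to the chunk text before embedding.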
Real numbers from a NKKTech legal-tech project: switching from 1000-char fixed chunking to structure-aware chunking (clause-level for contracts) improved retrieval@5 (the chance the relevant chunk appears in top-5) from 64% to 91% on a 2,400-document corpus. No change to embedding model, no change to retrieval logic — just smarter chunking. The work took 6 engineering days.
A few non-obvious tips. Always include the document title and a path-from-root ("Acme MSA / Section 3 / Subsection 3.2") in each chunk's prefix metadata. This dramatically improves both retrieval and the LLM's grounding ability. Set chunk size based on your generation model's effective context (not its max context): GPT-4o handles 8-12k tokens of retrieved context well; past that, retrieval precision matters more than recall, so smaller chunks (300-500 tokens) outperform larger ones. And always log chunk sizes for inspection: most production bugs we've debugged came from a single outlier chunk, for example a 12,000-character chunk created when a parser dropped a delimiter, which then dominated retrieval scores for every query.
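A chunk-size check like the one described can be as small as the sketch below. The function name and the "5x the median length" threshold are assumptions chosen for illustration; tune the factor against your own corpus.

```python
from statistics import median


def find_outlier_chunks(chunks: dict[str, str], factor: float = 5.0) -> list[str]:
    """Return IDs of chunks whose length exceeds `factor` times the median length."""
    if not chunks:
        return []
    typical = median(len(text) for text in chunks.values())
    return [cid for cid, text in chunks.items() if len(text) > factor * typical]
```

Running this after every indexing job and logging the result is usually enough to catch a dropped delimiter before it reaches users.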
Embedding Models: Which to Pick in 2026
Embedding choice in 2026 has stabilized around 4-5 strong options, each with different cost/quality/speed profiles. The right pick depends on your corpus language, your query patterns, and your unit economics.
OpenAI text-embedding-3-large (3072 dimensions, $0.13/M tokens). The 2026 default for English-heavy corpora. Reliable quality, high availability, integrates with everything. Slight overkill (and overcost) for many use cases. We use it as the baseline for new projects and downgrade only if eval shows we can.
OpenAI text-embedding-3-small (1536 dimensions, $0.02/M tokens). The cost-conscious default. About 90% of the retrieval quality of -large at 15% of the cost. For high-volume B2C or large corpora, the math always wins. Switch to -small unless your eval shows clear regression.
Voyage AI voyage-3-large (1024 dims, $0.12/M tokens). The current quality leader on MTEB English benchmarks. Worth eval'ing if you have a tough retrieval problem and OpenAI -large isn't hitting your target.
Cohere embed-multilingual-v3 (1024 dims). The default for multilingual corpora (English + Japanese + Korean + Chinese + Vietnamese). OpenAI embeddings are English-biased; Cohere multilingual significantly outperforms on cross-language retrieval.
BGE-M3 (BAAI, open-source, self-hosted). The serious open-source option. Cheap to run at scale on your own GPUs, supports 100+ languages, and has hybrid (dense + sparse) embeddings built in. The right choice when (a) you have strict data residency requirements, or (b) your query volume is high enough that self-hosting pays back the ops investment in 3-6 months.
A practical recommendation: don't agonize over embedding choice on day one. Start with text-embedding-3-small. Build an eval set. After 2-4 weeks of production data, test 2-3 alternatives on the same eval. Switch only if the improvement is meaningful (5%+ recall@10 improvement, sustained across query types). Embedding migration is a 1-2 day engineering job once you have the eval; it's the agonizing-on-day-one decision that wastes a month.
Vector Database: pgvector, Pinecone, Qdrant, or Weaviate?
The vector DB market consolidated in 2025-2026. Four options dominate production deployments. Choosing depends on your stack, scale, and operational appetite.
pgvector (PostgreSQL extension, open-source, self-hosted). The default for most B2B production RAG up to ~10M chunks. Already in your stack, no new operational surface, transactions and SQL work normally, and HNSW indexing in pgvector 0.7+ is competitive with dedicated vector DBs on retrieval speed and recall. Limitations: scaling past 10M chunks needs PostgreSQL partitioning and indexing care; managed offerings (Supabase, Neon, Crunchy Data) hide most of this complexity.
Pinecone (managed, $70/mo for serverless starter). The default if you don't want to operate infrastructure. Excellent retrieval latency (single-digit ms p50), built-in hybrid search, robust namespaces for multi-tenant. Cost scales linearly with vector count and storage; the math becomes uncomfortable past ~50M vectors. Use case: B2B SaaS with multi-tenant requirements and limited DevOps capacity.
Qdrant (open-source, self-hosted or managed). The right answer when you need advanced filtering (rich payload conditions: "vectors where customer_id=X AND created_at > Y AND tag IN (z1,z2)"). Self-hosted on a few-thousand-dollars-a-month Kubernetes cluster gives you 100M+ vector capacity. Managed Qdrant Cloud is competitive with Pinecone on price. Strong choice for complex enterprise filtering use cases.
Weaviate (open-source, managed cloud available). Strong for hybrid search out of the box (BM25 + vector built in), GraphQL API, schema-driven data modeling. Slightly heavier operationally than Qdrant. Use case: when you want the database to own more of the search logic instead of building it in application code.
What we recommend at NKKTech: for any client starting fresh, default to pgvector. The combined argument (no new infra to operate, no new query language to learn, no new vendor on the procurement list, transactions work normally) is compelling for the first 12 months. Migrate to a dedicated vector DB only when you have measured a specific problem pgvector can't solve — usually retrieval latency at 10M+ chunks under heavy concurrent load.
What we don't recommend: ChromaDB for production (great for prototyping, ops story is weak at scale), Elasticsearch as a vector store via its dense_vector field type (you'll spend more on Elastic ops than on the actual problem), or any database whose company looks like it might be acquired or sunset within a year (the vector DB consolidation is ongoing; check the company's runway before betting your data layer on it).
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.
Hybrid Retrieval: Why Pure Vector Search Isn't Enough
Pure vector (dense) search is the default tutorial path. In production it underperforms hybrid retrieval (dense + sparse / keyword) on almost every benchmark we've measured.
The reason is straightforward: embeddings are good at semantic similarity but poor at exact matches. A user query for "section 3.2 of MSA-2024-001" should match documents containing that exact string, but the embedding model treats "3.2" and "3.3" as nearly identical (their vectors sit very close together). A user query for an acronym, product code, error code, or proper noun has the same problem. Pure vector retrieval fails on these queries and the user gets irrelevant chunks.
Hybrid retrieval combines dense vector search with sparse keyword search (BM25 or its variants), then merges the results. The two retrievers are complementary: vector catches semantic matches even when no keyword overlaps; keyword catches exact-match precision even when semantic similarity is weak. A standard implementation: run both retrievers in parallel, fetch top-K from each (K=20 typically), merge with Reciprocal Rank Fusion (RRF), take final top-K (K=5-10).
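The fusion step is short enough to show in full. Below is a minimal sketch of Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over the rankings it appears in. The function name is illustrative, k=60 is the constant commonly used for RRF, and ties are broken by document ID purely to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60,
                           top_n: int = 5) -> list[str]:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


dense = ["a", "b", "c"]   # IDs from vector search, best first
sparse = ["c", "a", "d"]  # IDs from BM25, best first
fused = reciprocal_rank_fusion([dense, sparse], top_n=3)
```

RRF only needs ranks, not scores, which is why it works without normalizing BM25 scores against cosine similarities.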
Reranking adds another quality layer on top of hybrid retrieval. After your retriever returns 20-50 candidates, a reranker (Cohere Rerank 3, Voyage rerank-2, or a fine-tuned cross-encoder you host yourself) re-scores them with a model that's specifically trained for relevance scoring. The reranker is more accurate than the retriever (because it can compare query and candidate jointly), but slower and more expensive — so you only run it on the 20-50 candidates the retriever already filtered down. Reranking typically improves retrieval@5 by 8-15 percentage points on our internal evals.
Metadata filtering is the third dimension. Most production RAG queries should be constrained by structured metadata before semantic retrieval runs: only retrieve from documents the user has access to (RBAC), only from documents tagged with the relevant category, only from documents in the right language, only from documents updated within the freshness window. Filtering at the vector DB layer is dramatically faster than fetching everything then post-filtering in application code. This is where Qdrant's payload-filter performance shines and pgvector with proper indexes also performs well.
Real example: a NKKTech enterprise-search RAG for a 60k-employee company had retrieval@5 of 71% on internal HR/policy queries. Adding BM25 hybrid retrieval lifted it to 79%; adding Cohere reranking lifted it to 88%; adding department-aware metadata filtering (user can only see their department's docs) lifted it to 94%. Same chunks, same embedding model, same vector DB — three architectural additions, each 1-3 engineering days, total 3 weeks of work for 23-point improvement.
The Generation Layer: Prompt Engineering for RAG
Retrieval brings you context; generation turns context into an answer. The generation prompt is where hallucination is born or killed. Most demos use the LangChain default "Answer based on context" which is acceptable; production RAG needs a more disciplined prompt structure.
What goes in the generation prompt: a clear role assignment ("You are a customer support agent for [Company]"), the retrieval context (clearly delimited, with source citations preserved), explicit grounding instructions ("Answer using only the provided context. If the answer is not in the context, say 'I don't have that information in my knowledge base' — do not guess"), citation requirements ("After every claim, cite the source document and section in brackets"), output format constraints if applicable (JSON schema, bullet structure, max length).
A few patterns that consistently improve quality. Citations as a hard requirement: forcing the LLM to cite source-doc + section after every factual claim cuts hallucination dramatically because the model has to ground each statement. Refusal as a valid answer: explicitly authorizing the model to say "I don't know" is more important than people think; without that authorization the model defaults to making something up. Structured retrieval display: present each retrieved chunk with its title path, source URL, and last-updated date, separated by clear delimiters; this gives the LLM the metadata to cite confidently. Chain-of-thought for hard queries: for complex multi-hop questions, instruct the model to think through the answer step by step before responding; not necessary for simple lookups, but improves accuracy 5-15% on reasoning-heavy queries.
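One way to assemble a prompt with these elements is sketched below. The `RetrievedChunk` type, the `build_prompt` name, the `---` delimiter, and the `<context>` wrapper are assumptions for illustration; the role, grounding, refusal, and citation wording follow the examples in the text.

```python
from dataclasses import dataclass

REFUSAL = "I don't have that information in my knowledge base"


@dataclass
class RetrievedChunk:
    title_path: str    # e.g. "Refund Policy / Section 2"
    source_url: str
    last_updated: str  # ISO date string
    text: str


def build_prompt(company: str, question: str, chunks: list[RetrievedChunk]) -> str:
    """Build a grounded RAG prompt with numbered, clearly delimited sources."""
    blocks = []
    for number, chunk in enumerate(chunks, start=1):
        blocks.append(
            f"[{number}] {chunk.title_path}\n"
            f"Source: {chunk.source_url} (updated {chunk.last_updated})\n"
            f"{chunk.text}"
        )
    context = "\n---\n".join(blocks)
    return (
        f"You are a customer support agent for {company}.\n"
        "Answer using only the provided context. "
        f"If the answer is not in the context, say \"{REFUSAL}\" and do not guess.\n"
        "After every claim, cite the source document and section in brackets, e.g. [1].\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}"
    )
```

Numbering the sources gives the model a short, unambiguous handle to cite, and makes citation accuracy easy to check afterwards.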
What doesn't help: cramming 20+ chunks of context. Past 8-12k tokens of retrieval context, generation quality drops because the model can't attend reliably to all of it (the well-known "lost in the middle" problem). Better to retrieve 5-7 high-quality chunks with reranking than 20 mediocre chunks. Also doesn't help: long, prescriptive system prompts with 50 rules. The model attends to the first and last few rules; the middle 40 are noise. Keep the system prompt under 800 tokens and put the most important constraints at the top and bottom.
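The context cap can be enforced mechanically after reranking. The sketch below is illustrative only: the function names are hypothetical, the defaults (7 chunks, 8,000 tokens) mirror the ranges above, and the 4-characters-per-token estimate is a rough assumption for English text that should be replaced with your model's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumes about 4 characters per token)."""
    return max(1, len(text) // 4)


def select_context(ranked_chunks: list[str], max_chunks: int = 7,
                   token_budget: int = 8000) -> list[str]:
    """Keep the best-ranked chunks until the chunk or token limit is reached."""
    selected: list[str] = []
    used = 0
    for text in ranked_chunks[:max_chunks]:
        cost = estimate_tokens(text)
        if used + cost > token_budget:
            break  # stop rather than skip, so rank order is preserved
        selected.append(text)
        used += cost
    return selected
```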
Model choice for generation: GPT-4o or Claude 3.5 Sonnet for nuanced or high-stakes answers (legal, medical, customer-facing); GPT-4o-mini or Claude Haiku for high-volume, lower-stakes lookups (internal knowledge bases, simple FAQ); always test both — the cheaper models are often within 2-3% quality of the flagship for well-grounded RAG queries.
RAG Evaluation: Retrieval Metrics + End-to-End Quality
RAG evaluation is a two-layer problem. You evaluate the retriever and the generator separately, then evaluate them together. Skipping either layer makes diagnosis impossible when something breaks.
Retrieval evaluation: build a labeled set of 100-500 query-document pairs where you know which document(s) should be retrieved for each query. Metrics: recall@K (fraction of queries where the right doc is in top-K), MRR (mean reciprocal rank of the first correct doc), NDCG@K (discounted score that rewards higher-ranked correct docs). Recall@5 and recall@10 are the most operational metrics; aim for 85%+ recall@5 in production. Below 70% recall@5 you cannot ship, because the generator can't compensate for missing context.
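The three retrieval metrics are simple to compute by hand, which helps when checking what a library reports. This is a minimal sketch with binary relevance; the function names and the data layout (query ID mapped to a ranked list, plus a set of relevant IDs per query) are assumptions for illustration.

```python
import math


def recall_at_k(results: dict[str, list[str]], relevant: dict[str, set[str]],
                k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for qid, ranked in results.items() if set(ranked[:k]) & relevant[qid])
    return hits / len(results)


def mrr(results: dict[str, list[str]], relevant: dict[str, set[str]]) -> float:
    """Mean reciprocal rank of the first relevant doc (0 if none is retrieved)."""
    total = 0.0
    for qid, ranked in results.items():
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(results)


def ndcg_at_k(ranked: list[str], relevant_docs: set[str], k: int) -> float:
    """NDCG@k for one query with binary relevance."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant_docs)
    ideal_hits = min(len(relevant_docs), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


results = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d4", "d5"], "q3": ["d7", "d8", "d6"]}
relevant = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d0"}}
```

With this sample, two of three queries have a hit in the top 2, and the first relevant documents sit at ranks 1 and 2, with none retrieved for the third query.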
Generation evaluation: given a query and retrieved context, score the answer. Metrics that matter: faithfulness (does the answer actually follow from the context? Use LLM-as-judge with a structured rubric, or an evaluation framework like Ragas), answer relevance (does the answer address what the user asked?), and citation accuracy (do the citations point to chunks that actually support the claim?). The Ragas library bundles these and is the de-facto standard in 2026.
End-to-end evaluation: take real user queries (or curated representative queries), run them through the full RAG pipeline, score with a combined rubric. We also track refusal rate (how often does the system correctly say "I don't know" when the answer isn't in the corpus?) — too low means hallucinations are leaking through; too high means retrieval is too conservative.
Maintain a regression eval set as a Git-tracked file in your repo. Run it in CI on every PR that touches retrieval, embeddings, or the generation prompt. Block deploys if scores regress more than 3-5%. We've seen clients ship 8-15 RAG releases without a quality regression because their CI eval caught every problem before deploy; we've seen others ship 2-3 silent regressions per quarter and discover them through customer complaints, which is significantly more expensive to fix.
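The CI gate itself can be a small comparison against a stored baseline. In this sketch the function name, the metric names, and the interpretation of "3%" as a relative drop are all assumptions; an absolute percentage-point threshold is an equally reasonable reading.

```python
def check_regression(baseline: dict[str, float], current: dict[str, float],
                     max_drop: float = 0.03) -> list[str]:
    """Return one message per metric that dropped more than max_drop (relative)."""
    failures = []
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None:
            failures.append(f"{metric}: missing from current run")
        elif base > 0 and (base - now) / base > max_drop:
            failures.append(f"{metric}: {base:.3f} -> {now:.3f}")
    return failures


baseline = {"recall@5": 0.90, "faithfulness": 0.95}  # committed with the eval set
current = {"recall@5": 0.85, "faithfulness": 0.94}   # produced by this PR's eval run
failures = check_regression(baseline, current)
```

In a CI job, a non-empty `failures` list would be printed and turned into a non-zero exit code so the deploy is blocked.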
Production Operations: Indexing, Updates, Drift
Demo RAG indexes a fixed corpus once; production RAG indexes a moving corpus continuously. The operational difference is where production-grade RAG separates from prototype-grade.
Indexing pipeline. Documents arrive from source systems (S3 buckets, internal wikis, ticket systems, databases); they need to be chunked, embedded, and inserted into the vector DB with the right metadata. Production indexing should be incremental, idempotent, and resumable. Incremental: only re-process documents that have changed (track by source content hash + modified timestamp). Idempotent: re-indexing the same document twice produces the same end state (deduplicate by chunk hash). Resumable: if the pipeline fails halfway through, it can pick up where it left off without losing or duplicating work.
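The incremental and idempotent properties mostly come down to hashing. The sketch below is a simplified illustration: the function names are hypothetical, `indexed_hashes` stands in for whatever state table the pipeline keeps, and it tracks content hashes only (the modified timestamp mentioned above would be an additional cheap pre-check).

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's or chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_id(doc_id: str, chunk_text: str) -> str:
    """Deterministic chunk ID, so re-indexing the same chunk upserts in place."""
    return f"{doc_id}:{content_hash(chunk_text)[:16]}"


def plan_index_run(source_docs: dict[str, str],
                   indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare the source with the index state; return (to_index, to_delete)."""
    to_index = [doc_id for doc_id, text in source_docs.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in source_docs]
    return sorted(to_index), sorted(to_delete)
```

Writing each document's hash to the state table only after its chunks are stored is what makes the run resumable: a crashed run simply sees the unfinished documents as still changed.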
Update strategy. When a source document changes, you have two options: (a) delete its old chunks and re-index from scratch, or (b) compute a chunk-level diff and only update changed chunks. (a) is simpler and always correct; (b) is faster for large documents but requires careful chunking determinism. For most clients we recommend (a) until volume forces (b).
Delete strategy. When a document is deleted from the source system, its chunks must be deleted from the vector DB — otherwise the RAG hallucinates from stale content. Use soft-deletes initially (mark chunks as deleted but keep them for 7-30 days in case of accidental deletion), then hard-delete after the grace period.
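The grace-period logic is a single date comparison. In this sketch the function name and the `deleted_at` mapping (chunk ID to soft-delete timestamp) are hypothetical; retrieval is assumed to exclude any chunk that has a `deleted_at` value at all.

```python
from datetime import datetime, timedelta, timezone


def purge_candidates(deleted_at: dict[str, datetime], now: datetime,
                     grace_days: int = 30) -> list[str]:
    """Return IDs of soft-deleted chunks whose grace period has expired."""
    cutoff = now - timedelta(days=grace_days)
    return sorted(cid for cid, ts in deleted_at.items() if ts <= cutoff)


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
deleted_at = {
    "c1": datetime(2026, 1, 15, tzinfo=timezone.utc),  # past the 30-day grace period
    "c2": datetime(2026, 2, 20, tzinfo=timezone.utc),  # still recoverable
}
expired = purge_candidates(deleted_at, now)
```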
Freshness budget. Document the SLA: "changes propagate to the RAG within X minutes/hours". 60 minutes is sufficient for most enterprise knowledge bases; 60 seconds requires event-driven indexing on every source change; under 60 seconds usually means RAG is the wrong pattern and you should query the source system live instead.
Drift detection. Over time, the corpus shifts: new topics emerge, old documents become irrelevant, vocabulary changes. Without monitoring, retrieval quality slowly degrades. Track three signals: distribution of retrieval scores over time (sudden drops indicate corpus or query-distribution shift), refusal rate over time (rising refusal = your corpus is no longer covering user queries), and user feedback signals (thumbs down, follow-up questions, escalation rates). Set alerts. Schedule a corpus quality review every 90 days.
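The first signal, a drop in retrieval scores, can be checked with a simple comparison of two windows. This is a deliberately crude sketch: the function name and the 10% relative-drop threshold are assumptions, and a production monitor would more likely compare full distributions rather than means.

```python
from statistics import mean


def score_drift(baseline_scores: list[float], recent_scores: list[float],
                max_relative_drop: float = 0.10) -> bool:
    """True if the mean top-1 retrieval score fell by more than the threshold."""
    base = mean(baseline_scores)
    recent = mean(recent_scores)
    return base > 0 and (base - recent) / base > max_relative_drop
```

The same shape of check works for refusal rate and thumbs-down rate, with the comparison direction reversed.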
Observability is non-negotiable. Every RAG query should log: the query, retrieved chunk IDs and scores, the prompt sent to the generator, the generator's output, latency per stage, tokens used. Without these logs you cannot debug production bugs. With them you can answer "why did the agent give this user this wrong answer?" in 5 minutes.
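A structured log record covering those fields might look like the sketch below. The `RagQueryLog` type and its field names are illustrative assumptions; the point is one JSON line per query that can be searched later.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RagQueryLog:
    query: str
    chunk_ids: list[str]          # retrieved chunk IDs, best first
    chunk_scores: list[float]     # matching retrieval or rerank scores
    prompt: str                   # exact prompt sent to the generator
    output: str                   # generator's answer
    latency_ms: dict[str, float]  # per stage, e.g. retrieve / rerank / generate
    tokens: dict[str, int]        # e.g. prompt / completion


def to_log_line(record: RagQueryLog) -> str:
    """Serialize one query record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


line = to_log_line(RagQueryLog(
    query="what's our refund policy",
    chunk_ids=["refunds:ab12", "faq:cd34"],
    chunk_scores=[0.91, 0.74],
    prompt="(full prompt text)",
    output="Refunds are accepted within 30 days [1].",
    latency_ms={"retrieve": 42.0, "generate": 880.0},
    tokens={"prompt": 3100, "completion": 45},
))
```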
Cost Optimization for RAG at Scale
RAG cost has three components: embedding cost (indexing + query embedding), retrieval cost (storage + compute for vector search), and generation cost (LLM tokens for prompt + completion). At scale (10k+ queries/day) all three matter; at small scale only generation matters.
Embedding cost optimization. Embedding the corpus once is usually a small fixed cost; embedding queries is the recurring cost. Query and corpus vectors must come from the same model (and the same dimensionality) to be comparable, so any model change applies to both sides. For high-volume systems, consider: (a) caching common query embeddings (a single embedded version of "what's our refund policy" used across hundreds of users), (b) moving both corpus and queries to a smaller model such as text-embedding-3-small, re-embedding the corpus once, when your eval shows no meaningful regression, (c) self-hosting BGE-M3 for both corpus and query embeddings at high volume.
Retrieval cost. Vector DB cost scales with vector count and storage. Reduce by: (a) lower-dimension embeddings where quality permits (1536 dims vs 3072 dims halves storage), (b) quantized embeddings (int8 or float8 instead of float32 cuts vector memory 4x with marginal quality loss; binary quantization reaches 32x at a larger quality cost), (c) tiered retrieval (run cheap keyword filter first, then vector search on the filtered subset).
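The storage effect of dimensions and precision is plain arithmetic. The sketch below counts raw vector bytes only; index structures such as HNSW graphs and stored metadata add overhead on top, and the 10M-vector corpus is a hypothetical example.

```python
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def index_size_gb(num_vectors: int, dims: int, dtype: str = "float32") -> float:
    """Raw vector storage in GB (1 GB = 1e9 bytes), excluding index overhead."""
    return num_vectors * dims * BYTES_PER_VALUE[dtype] / 1e9


full = index_size_gb(10_000_000, 3072)              # 3072-dim float32
half_dims = index_size_gb(10_000_000, 1536)         # 1536-dim float32
quantized = index_size_gb(10_000_000, 1536, "int8") # 1536-dim int8
```

For 10M vectors this works out to about 122.9 GB, 61.4 GB, and 15.4 GB respectively, so halving dimensions and moving to int8 together give an 8x reduction.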
Generation cost. This is the biggest line item in most production RAGs. Five optimizations matter. Reduce retrieval context size: 5 high-quality reranked chunks (3-4k tokens) usually beats 15 mediocre chunks (8-10k tokens), at 60% lower cost per query. Cache repeated queries: semantic caching (reusing the answer for a query that closely matches an earlier one) with a 20-40% hit rate is normal in production. Use prompt caching: Anthropic and OpenAI both support caching of the stable parts of the prompt (system prompt, tool catalogs, base instructions), cutting input-token cost 60-80% on the cached portions. Route by query complexity: simple lookups go to GPT-4o-mini ($0.15/M input tokens), complex reasoning queries go to GPT-4o ($2.50/M input tokens) or Claude 3.5 Sonnet. Batch where you can: if your RAG produces non-realtime outputs (overnight report generation, document processing pipelines), use batched APIs at 50% cost.
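To see how much routing matters, the input-token arithmetic can be worked through directly. The prices are the ones quoted above; the traffic volume, prompt size, and 70% routing share are hypothetical inputs, and output tokens, caching, and embedding costs are left out to keep the sketch short.

```python
PRICE_PER_M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}  # USD per 1M input tokens


def monthly_input_cost(queries_per_day: int, prompt_tokens: int,
                       share_to_mini: float, days: int = 30) -> float:
    """Monthly input-token cost in USD when a share of queries uses the small model."""
    blended_price = (share_to_mini * PRICE_PER_M_INPUT["gpt-4o-mini"]
                     + (1 - share_to_mini) * PRICE_PER_M_INPUT["gpt-4o"])
    total_tokens = queries_per_day * days * prompt_tokens
    return total_tokens / 1_000_000 * blended_price


all_flagship = monthly_input_cost(30_000, 4_000, share_to_mini=0.0)
routed = monthly_input_cost(30_000, 4_000, share_to_mini=0.7)
```

Under these assumptions, sending 70% of queries to the smaller model takes monthly input cost from $9,000 to $3,078, before any caching or context reduction.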
Real numbers from a NKKTech enterprise customer-support RAG: 30,000 queries/day, originally $14,200/month total cost. After applying the optimizations above — embedding-3-small for queries, prompt caching, model routing for trivial questions, smaller retrieval context — monthly cost dropped to $4,100. 71% reduction, no measurable quality drop on the client eval set. The engineering investment was 8 days of work spread across two weeks.
Implementation Roadmap
If you're starting fresh, here's the sequence that's worked across NKKTech's production deployments.
Week 1-2: Eval set first. Pick 50-200 representative queries from your real data (anonymize if needed). For each, label the correct source document(s) and an ideal answer. This is your North Star. Spend the time to do it right; you'll save weeks later.
Week 2-4: PoC pipeline. Use pgvector + text-embedding-3-small + GPT-4o-mini + simple recursive chunking. Skip optimization. Run your eval set against this baseline. Document the scores — they'll be your reference for every future change.
Week 4-6: Chunking strategy. Examine your documents. Pick structure-aware chunking matched to your content type. Re-index, re-evaluate. Expect a 10-25 percentage-point recall improvement over the fixed-size baseline.
Week 6-8: Hybrid retrieval and reranking. Add BM25 alongside vector search, fuse with RRF. Add Cohere Rerank 3 on the top 20 candidates. Re-evaluate. Expect another 8-15 percentage-point recall improvement.
Week 8-12: Production hardening. Set up incremental indexing pipeline, update/delete handling, observability with full query logging, eval gate in CI. Run your eval against every change.
Week 12-16: Cost optimization (only if volume justifies). Prompt caching, semantic caching, model routing, retrieval-context tuning. Re-evaluate. Expect 50-75% cost reduction with under 2% quality loss.
The most common mistake we see: teams try to do all of this in parallel, get confused about what's helping and what's hurting, and ship something that mostly works but is fragile. Sequential, eval-driven changes are slower in week 4 and faster in week 16.
If you'd rather have NKKTech build the RAG for you — fixed-scope, senior engineers, in 10-16 weeks — book a free 30-minute discovery call. We've built production RAG for legal, fintech, healthcare, customer support, and internal knowledge-base use cases. Singapore-law or Vietnam-law contracts, whichever your procurement prefers. ISO 9001 + 22301 certified. No drip sequences — just a conversation about whether we're the right fit.

10+ years building AI systems for Toyota, Sony, and Rakuten in Japan. Founded NKKTech in 2018 with a senior-only engineering model.