After deploying RAG for 25+ NKKTech clients, here's a pattern that surprised us: pure semantic search (cosine similarity on dense embeddings) wins on ~70% of queries but loses badly on the other 30%. The losses are systematic: exact-match queries (product SKUs, error codes, regulation IDs), short queries (1-2 words, where there isn't enough text for the embedding to be discriminative), and queries with negation (semantic embeddings don't naturally distinguish 'find without X' from 'find with X'). Production RAG fixes the first two with hybrid retrieval: combine sparse keyword search (BM25) with dense vector search, then optionally rerank. Negation usually has to be handled by rewriting the query before retrieval. Here's the production recipe with the eval numbers that matter.
The Cases Where Pure Semantic Search Fails
Three failure modes show up repeatedly in our eval suites.
Exact-match losses: the query is 'SKU 47291-X' and the document contains 'SKU 47291-X' literally. Dense embeddings encode the rough semantic shape ('product identifier') but lose the specific token sequence, so cosine similarity is dominated by the surrounding context. Recall@1 for these queries is ~40-50% on dense alone versus ~95% on BM25.
Short-query losses: a 1-word query like 'auth' or 'pricing'. Dense embeddings need context to tell 'user authentication' from 'token authorization' from 'auth0 integration'; a single word produces an embedding too generic to retrieve well. In our eval data, Recall@5 on dense alone drops from 0.78 (multi-word) to 0.42 (single-word).
Negation losses: 'find errors that do NOT involve the database.' Dense embeddings treat negation tokens as just another semantic signal, so the query embedding ends up close to documents about database errors. This is hard to fix at retrieval time; it usually has to be handled at query-rewrite time, before retrieval.
The 30% of queries where pure semantic search fails are concentrated in these three modes, so the fix targets them directly.
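To make these failure modes measurable, it helps to tag each query with a type before evaluating it. The sketch below is one rough way to do that; the function names, the word list, and the "4+ characters containing a digit" rule are illustrative assumptions, not something taken from our production systems, and they will misfire on some queries (a year like '2024' looks like an identifier here).

```python
import re

# Hypothetical heuristics; tune the word list and thresholds to your own corpus.
NEGATION_RE = re.compile(r"\b(not|no|without|except|excluding|never)\b", re.IGNORECASE)


def looks_like_identifier(token: str) -> bool:
    """Treat tokens of 4+ characters that contain a digit as IDs (e.g. '47291-X')."""
    token = token.strip(".,;:!?()'\"")
    return len(token) >= 4 and any(ch.isdigit() for ch in token)


def classify_query(query: str) -> str:
    """Bucket a query as 'negation', 'exact_match', 'short', or 'semantic'."""
    words = query.split()
    if NEGATION_RE.search(query):
        return "negation"
    if any(looks_like_identifier(word) for word in words):
        return "exact_match"
    if len(words) <= 2:
        return "short"
    return "semantic"


for q in ["SKU 47291-X", "auth", "find errors that do NOT involve the database"]:
    print(q, "->", classify_query(q))
```

The same labels can later be used to break retrieval metrics down per query type, or to route negation queries to a rewrite step.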
BM25 + Dense: The Hybrid Retrieval Recipe
Run two retrievers in parallel and combine their results. BM25 (or its fielded variant, BM25F) handles exact-match and short queries well; dense retrieval (cosine similarity on sentence-transformer or OpenAI embeddings) handles semantic queries well. Implementation: with pgvector, you can use the same Postgres table for both: one column holds the dense vector for vector search, and a tsvector column ranked with ts_rank_cd covers the keyword side. Note that ts_rank_cd is a cover-density ranking, not true BM25 (it uses no corpus-wide term statistics), so treat it as an approximation. The query goes to both retrievers, and the top-K from each is combined. In Pinecone, you can use hybrid search via sparse-dense vectors, and Pinecone handles the combination. In Elasticsearch, you can use script_score with both signals. The hybrid query is ~1.5-2x the latency of pure dense (you're doing two retrievals), but quality typically lifts 8-15 points on Recall@5 across mixed query distributions in our eval data. The pattern works in any vector DB that supports keyword + vector search; if yours doesn't, switch to one that does (this is one of the few opinionated tech recommendations we make).
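To see why the sparse side rescues exact-match queries, here is a minimal in-memory BM25 scorer. It is a teaching sketch, not what any of the databases above run internally: the `BM25` class, the tokenizer, the sample documents, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for illustration. In production the keyword index would live in your database or search engine.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep hyphenated identifiers such as "47291-x" together as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


class BM25:
    """Minimal Okapi BM25 over a small, non-empty, in-memory corpus."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        n = len(docs)
        df = Counter(term for tokens in self.doc_tokens for term in set(tokens))
        # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, idx: int) -> float:
        freqs = self.doc_freqs[idx]
        length = len(self.doc_tokens[idx])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 5) -> list[tuple[int, float]]:
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        matched = [(i, s) for i, s in scored if s > 0]  # drop docs with no term overlap
        return sorted(matched, key=lambda pair: pair[1], reverse=True)[:k]


docs = [
    "SKU 47291-X: stainless steel bracket, 40mm",
    "SKU 47291-Y: stainless steel bracket, 60mm",
    "Return policy for damaged product shipments",
]
index = BM25(docs)
hits = index.top_k("SKU 47291-X", k=2)
print(hits)
```

The two SKU documents are near-identical in meaning, which is exactly the case where a dense retriever struggles; BM25 separates them because the rare token `47291-x` appears in only one of them.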
📥 Free Download: Vietnam Offshore Dev Cost Guide 2026
Real developer rates, project cost breakdowns, and a budget planning template. Used by 200+ startup founders.
Ready to build?
NKKTech delivers AI Development projects from $30K.
Fixed scope. Senior Vietnam engineers. 14-day kickoff.
Reciprocal Rank Fusion vs Weighted Combination
Two ways to combine BM25 and dense results. Reciprocal Rank Fusion (RRF): score each document by 1 / (k + rank_in_BM25) + 1 / (k + rank_in_dense), where ranks start at 1, a document missing from one list contributes nothing for that list, and k=60 is the standard constant. Simple, robust, no tuning required. Default choice. Weighted combination: normalize both scores to [0, 1], then take alpha * dense_score + (1 - alpha) * bm25_score. This requires tuning alpha (typically 0.5-0.7 for mixed query loads). It is more flexible because you can adjust the dense/sparse balance, but more fragile because the optimal alpha depends on the query distribution and the document corpus. We default to RRF for new deployments because it works out of the box. Switch to weighted combination only when the eval data clearly shows a different optimal alpha per query type (e.g., a technical-document corpus with lots of exact-match queries might want alpha=0.4, which puts more weight on BM25). One trap: don't combine raw scores from different retrievers without normalization. BM25 scores are unbounded (0-50 is common) while cosine similarity usually falls in 0-1, so a naive sum is dominated by BM25.
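Both fusion strategies fit in a few lines. The sketch below assumes each retriever returns document IDs as strings; the function names (`rrf`, `min_max`, `weighted_fusion`), the sample IDs, and the choice of min-max normalization are illustrative assumptions. In the weighted variant, a document missing from one retriever is treated as scoring 0 there, which is also an assumption you may want to revisit.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion over any number of ranked ID lists (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale raw scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(
    dense: dict[str, float], bm25: dict[str, float], alpha: float = 0.6
) -> list[tuple[str, float]]:
    """alpha * dense + (1 - alpha) * bm25, computed on normalized scores."""
    d, s = min_max(dense), min_max(bm25)
    fused = {
        doc_id: alpha * d.get(doc_id, 0.0) + (1 - alpha) * s.get(doc_id, 0.0)
        for doc_id in set(d) | set(s)
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
rrf_order = [doc_id for doc_id, _ in rrf([bm25_ranking, dense_ranking])]

dense_scores = {"b": 0.9, "d": 0.8, "a": 0.5}   # cosine similarities
bm25_scores = {"a": 30.0, "b": 20.0, "c": 10.0}  # raw BM25 scores
weighted_order = [doc_id for doc_id, _ in weighted_fusion(dense_scores, bm25_scores)]
print(rrf_order, weighted_order)
```

Note that RRF only needs ranks, so the scale mismatch between BM25 and cosine scores never comes up; that is a large part of why it needs no tuning.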
Adding a Reranker: Cohere, Cross-Encoders, or Custom
After retrieving the top-20 with hybrid, rerank to the top-5 using a cross-encoder model. The first-stage hybrid retrieval optimizes for recall (don't miss the relevant doc); the reranker optimizes for precision (put the most relevant docs first). Cohere's Rerank 3.5 model (rerank-v3.5 in the API) is the easy choice: paid API, ~20ms latency for 20 docs, $0.02 / 1000 reranks. Quality lift is typically +5-12 points on NDCG@5 over hybrid alone. Cross-encoder alternative: ms-marco-MiniLM-L-12-v2 or a similar HuggingFace model, self-hosted on a small GPU. No per-call fee (you pay for the GPU), ~80-150ms latency, quality slightly behind Cohere but close. Custom reranker: fine-tune a cross-encoder on your domain data. Worth it only if you have hit a clear quality plateau on the standard models AND you have labeled relevance data (which most teams don't). Decision tree: default to Cohere unless cost or data residency rules it out, then HuggingFace self-hosted, then custom fine-tuning as a last resort. For the broader RAG architecture decisions (chunking, embedding model selection, context budget), see our RAG Implementation Playbook.
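The retrieve-then-rerank step can be sketched independently of which reranker you pick. Below, `rerank` takes any scoring function, and `overlap_scorer` is a deliberately crude stand-in so the example runs without a model download; both names and the sample candidates are hypothetical. With the sentence-transformers library, the scoring function could instead wrap `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2").predict(...)` on (query, document) pairs.

```python
from typing import Callable, Sequence

# A scorer takes the query and candidate texts and returns one score per candidate.
Scorer = Callable[[str, Sequence[str]], Sequence[float]]


def rerank(
    query: str, candidates: Sequence[str], score_fn: Scorer, top_n: int = 5
) -> list[tuple[str, float]]:
    """Re-score first-stage candidates and keep the top_n highest-scoring ones."""
    scores = score_fn(query, candidates)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]


def overlap_scorer(query: str, docs: Sequence[str]) -> list[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in each doc."""
    q_words = set(query.lower().split())
    return [
        len(q_words & set(doc.lower().split())) / max(len(q_words), 1) for doc in docs
    ]


candidates = [
    "reset your password from the login page",
    "pricing tiers for enterprise plans",
    "how to reset a forgotten password",
]
top = rerank("reset forgotten password", candidates, overlap_scorer, top_n=2)
print(top)
```

Keeping the scorer injectable makes it straightforward to swap between a hosted API, a self-hosted cross-encoder, and a fine-tuned model while running the same eval suite against each.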
Eval Framework for Hybrid Retrieval Quality
Three metrics every production RAG system needs to track on a frozen eval set. Recall@5: of the queries in the eval set, what fraction have at least one truly relevant document in the top-5 retrieved? Target: 0.85+. NDCG@5: how well ranked are the relevant documents within the top-5? Target: 0.75+. Answer faithfulness, scored separately: when the LLM synthesis runs on the retrieved chunks, how often does it cite the chunks accurately rather than hallucinate? Target: 0.92+. Track these separately by query type (exact-match, short, semantic, negation), because a system can have 0.92 overall recall but 0.45 recall on exact-match queries, and your users notice. Eval set construction: 200-500 (query, relevant_doc_ids) pairs is the minimum viable size; use 1000+ for production-critical systems. Build the eval set 70% from real user queries (with relevance manually labeled by your domain expert) and 30% from synthetic queries that stress specific failure modes (exact-match, short, negation). Run the eval suite in CI, and block deployments that regress Recall@5 by more than 2 points or NDCG@5 by more than 3 points.
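The two retrieval metrics and the CI gate are simple enough to compute by hand. This sketch follows the definitions above: Recall@5 counts a query as a hit if any relevant document is in the top-5, and NDCG@5 uses binary relevance. The function names, the example records, and the dictionary layout are assumptions for illustration, and the gate here checks only the overall numbers.

```python
import math
from collections import defaultdict


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """1.0 if at least one relevant doc is in the top-k, else 0.0 (a hit rate)."""
    return 1.0 if any(doc_id in relevant for doc_id in retrieved[:k]) else 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2) for i, doc_id in enumerate(retrieved[:k]) if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0


def evaluate(examples: list[dict], k: int = 5) -> dict[str, dict[str, float]]:
    """Average both metrics overall and per query_type."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        buckets["overall"].append(ex)
        buckets[ex["query_type"]].append(ex)
    return {
        name: {
            f"recall@{k}": sum(recall_at_k(e["retrieved"], e["relevant"], k) for e in items) / len(items),
            f"ndcg@{k}": sum(ndcg_at_k(e["retrieved"], e["relevant"], k) for e in items) / len(items),
        }
        for name, items in buckets.items()
    }


def passes_gate(baseline: dict[str, float], candidate: dict[str, float],
                max_recall_drop: float = 0.02, max_ndcg_drop: float = 0.03) -> bool:
    """Block a deploy if overall Recall@5 or NDCG@5 regresses past the thresholds."""
    eps = 1e-9  # tolerate floating-point noise at the boundary
    recall_ok = baseline["recall@5"] - candidate["recall@5"] <= max_recall_drop + eps
    ndcg_ok = baseline["ndcg@5"] - candidate["ndcg@5"] <= max_ndcg_drop + eps
    return recall_ok and ndcg_ok


examples = [
    {"query_type": "exact_match", "retrieved": ["d1", "d2", "d3"], "relevant": {"d2"}},
    {"query_type": "exact_match", "retrieved": ["d4", "d5"], "relevant": {"d9"}},
    {"query_type": "semantic", "retrieved": ["d7", "d8"], "relevant": {"d7"}},
]
report = evaluate(examples)
print(report)
```

In this toy run the overall recall looks acceptable while the exact-match bucket is at 0.5, which is the kind of gap that only shows up when the report is split by query type.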
Production Cost vs Quality Tradeoffs
Hybrid retrieval is ~1.5-2x the cost of pure dense (two retrievals), plus the reranker cost. For typical workloads at 100K queries/month: pure dense adds $0 for retrieval beyond your existing vector DB compute and $0 for reranking. Hybrid + Cohere rerank: ~$80/month additional. Hybrid + self-hosted cross-encoder: ~$200/month additional for the GPU. Quality lift versus pure dense: 10-25 points on NDCG@5. The cost is trivial for the quality lift; at scale it rounds to zero. The real production cost question isn't whether to use hybrid retrieval; it's whether to add a reranker. Decision rule: add the reranker if your eval data shows NDCG@5 below 0.75 after hybrid alone. Skip it if hybrid already hits the target. Two failure modes we see often: teams add a reranker before fixing hybrid retrieval (which wastes the reranker's lift because the input candidates are bad), and teams skip the reranker because 'it's an extra service' even though their eval data shows clear quality gaps that reranking would close. Match the architecture to the eval data, not to architectural preferences.

10+ years building AI systems for Toyota, Sony, and Rakuten in Japan. Founded NKKTech in 2018 with a senior-only engineering model.
Want to build this with NKKTech?
Running RAG in production and want a retrieval-quality audit? Book a free 30-minute call. We'll look at your retrieval metrics, check the query-type breakdown, and recommend specific fixes for the lowest-performing query types.