Advanced RAG Architecture & Optimization

Understand how to implement Hybrid Search, Query Rewriting, Cohere Reranking, and Parent-Document Retrieval to optimize precision and recall in production.

Amarjit Singh

AI Engineer & Creator

12 min read June 20, 2026

Retrieval-Augmented Generation (RAG) is a widely used approach for grounding LLMs in private data. However, naive RAG (chunking text, creating embeddings, indexing them in a vector store, and retrieving by cosine similarity) rarely holds up in production. It struggles with exact keyword matching, with questions that need context spread across a long document, and with irrelevant chunks that add noise to the prompt.

To reach production-grade search, we must transition to a multi-stage retrieval pipeline.

---

The Production RAG Pipeline

mermaid
graph TD
    UserQuery[User Query] --> QR[Query Rewriter]
    QR --> DS[Dense Vector Search]
    QR --> SS[Sparse BM25 Search]
    DS --> Rec[Reciprocal Rank Fusion]
    SS --> Rec
    Rec --> Rerank[Cross-Encoder Reranker]
    Rerank --> ContextFilter[Context Packer / Filter]
    ContextFilter --> LLM[LLM Generation]
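
The flow in the diagram can be sketched as plain function composition. This is only an illustration of how the stages connect: `run_pipeline` and every stage callable passed to it are hypothetical names, not part of any library, and passing the original (not rewritten) query to the generation step is an assumption.

```python
from typing import Callable, List


def run_pipeline(
    query: str,
    rewrite: Callable[[str], str],
    dense_search: Callable[[str], List[str]],
    sparse_search: Callable[[str], List[str]],
    fuse: Callable[[List[str], List[str]], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    pack: Callable[[List[str]], str],
    generate: Callable[[str, str], str],
) -> str:
    """Wire the stages together in the order shown in the diagram."""
    rewritten = rewrite(query)
    # Both retrievers receive the rewritten query; their ranked lists are fused.
    candidates = fuse(dense_search(rewritten), sparse_search(rewritten))
    top_docs = rerank(rewritten, candidates)
    context = pack(top_docs)
    # Assumption: the LLM answers the user's original wording.
    return generate(query, context)


# Stub stages, only to show the data flow.
answer = run_pipeline(
    "Q",
    rewrite=lambda q: q.lower(),
    dense_search=lambda q: ["d1", "d2"],
    sparse_search=lambda q: ["d2", "d3"],
    fuse=lambda a, b: list(dict.fromkeys(a + b)),
    rerank=lambda q, docs: docs[:2],
    pack=lambda docs: "|".join(docs),
    generate=lambda q, ctx: f"{q}:{ctx}",
)
```

Keeping each stage behind a callable makes it possible to swap or disable one stage (for example the rewriter) and measure its effect in isolation.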

---

1. Hybrid Search (Dense + Sparse)

Dense embeddings (e.g., from OpenAI or Cohere models) capture semantic meaning, but they often miss exact strings such as product serial numbers, UUIDs, or niche terminology. Sparse search (BM25) scores documents by weighted term overlap, so it excels at exact matches.
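
To make the sparse side concrete, here is a minimal sketch of Okapi BM25 scoring in pure Python. It assumes whitespace tokenization, a non-empty corpus, and the common defaults `k1=1.5` and `b=0.75`; the function name `bm25_scores` and the sample documents are made up for illustration. A production system would more likely rely on a search engine's built-in BM25.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # Terms absent from the document contribute nothing.
            # Non-negative IDF variant: rare terms weigh more.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length normalization penalizes longer-than-average documents.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "replacement pump for model SN-48213-X",
    "general guide to pump maintenance",
    "warranty terms for kitchen appliances",
]
scores = bm25_scores("SN-48213-X pump", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The serial number appears in only one document, so it carries a high IDF weight and pushes that document to the top, which is exactly the case where embedding similarity tends to be unreliable.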

Reciprocal Rank Fusion (RRF) combines the two ranked lists using only rank positions, so the dense and sparse scores never need to be put on a common scale:

python
# Reciprocal Rank Fusion: merge two ranked lists of document IDs.
# Each input is ordered best-first; a larger k flattens the gap between ranks.
def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    rrf_scores = {}
    for results in (dense_results, sparse_results):
        # enumerate is 0-based, so rank + 1 is the 1-based position.
        for rank, doc_id in enumerate(results):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)

    # Highest fused score first, as (doc_id, score) pairs.
    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
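
As a worked example, the same idea generalizes to any number of ranked lists. The sketch below is a self-contained variant; the name `rrf_fuse` and the document IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse any number of best-first doc-id lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["a", "b", "c"]   # best-first results from vector search
sparse = ["c", "a", "d"]  # best-first results from BM25
fused = rrf_fuse([dense, sparse])
order = [doc_id for doc_id, _ in fused]
# order == ["a", "c", "b", "d"]
```

Document `a` wins because it ranks near the top of both lists (1/61 + 1/62), narrowly ahead of `c` (1/63 + 1/61). Documents that appear in only one list (`b`, `d`) fall behind both.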

---

2. Cross-Encoder Reranking

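Sending all 50 retrieved documents to the LLM is expensive and can degrade response quality: models tend to use information in the middle of a long context less reliably than information near the start or end (the "lost in the middle" phenomenon). Instead: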

  1. Retrieve top-50 candidates using fast hybrid search.
  2. Run the candidates through a cross-encoder reranker (such as Cohere Rerank or BGE-Reranker).
  3. Inject only the top-5 reranked documents into the prompt context.

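Cross-encoders process the query and document together in a single forward pass, so the relevance score can reflect token-level interactions that independently computed embeddings cannot capture. This typically ranks candidates more accurately than embedding distance, at the cost of one model call per query-document pair, which is why it is applied only to the shortlist.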
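
The retrieve-then-rerank step can be sketched with a pluggable scorer. In this illustration `rerank`, `score_fn`, and `toy_overlap_score` are hypothetical names; the toy scorer is a simple word-overlap stand-in, not a cross-encoder. In a real pipeline `score_fn` would wrap a call to a reranking model that scores each (query, document) pair.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score every (query, document) pair and keep the top_n best."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_overlap_score(query: str, doc: str) -> float:
    """Stand-in scorer: fraction of query words present in the document."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = [
    "how to reset a router password",
    "router firmware release notes",
    "reset your email password",
]
top = rerank("reset router password", candidates, toy_overlap_score, top_n=2)
```

Because the scorer is invoked once per candidate, the cost grows linearly with the shortlist size, which is the practical reason for capping the first stage at a few dozen documents.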

---

3. Parent-Document Retrieval (Chunk Splitting)

When indexing, we often split text into small chunks (e.g., 200 tokens) so that embedding vectors are highly focused. However, 200 tokens may lack the full context required to answer a question.

With Parent-Document Retrieval:

  • We segment documents into small child chunks for embedding search.
  • When a child chunk matches the query, we feed the LLM its entire parent document (or a much larger parent chunk, e.g., 1000 tokens) instead of the child chunk itself.

This maintains high retrieval precision while delivering rich context for generation.
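
A minimal sketch of the child-to-parent lookup follows. It makes several simplifying assumptions: chunk sizes are counted in words rather than model tokens, relevance is a toy word-overlap count standing in for embedding similarity, and all function names and sample texts are hypothetical.

```python
from typing import Dict, List, Tuple


def split_into_children(parents: Dict[str, str], child_size: int = 8) -> List[Tuple[str, str]]:
    """Split each parent into word-window child chunks, keeping the parent id."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for start in range(0, len(words), child_size):
            children.append((parent_id, " ".join(words[start:start + child_size])))
    return children


def overlap(query: str, chunk: str) -> int:
    """Toy relevance: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve_parents(query: str, parents: Dict[str, str],
                     children: List[Tuple[str, str]], top_k: int = 2) -> List[str]:
    """Search over child chunks, then return the matching parents' full text."""
    ranked = sorted(children, key=lambda child: overlap(query, child[1]), reverse=True)
    parent_ids = []
    for parent_id, _ in ranked[:top_k]:
        if parent_id not in parent_ids:  # De-duplicate while keeping rank order.
            parent_ids.append(parent_id)
    return [parents[pid] for pid in parent_ids]


parents = {
    "billing": "Invoices are issued monthly. Refunds are processed within ten "
               "business days after approval by the finance team.",
    "shipping": "Orders ship within two days. Tracking numbers are emailed once "
                "the carrier collects the parcel.",
}
children = split_into_children(parents, child_size=8)
results = retrieve_parents("how long are refunds processed", parents, children, top_k=1)
```

The query matches a child chunk that mentions refunds but cuts off before the processing time. Returning the parent gives the LLM the full sentence, including the detail needed to answer.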
