Advanced RAG Architecture & Optimization

Retrieval-Augmented Generation (RAG) is a widely adopted approach for grounding LLMs in private data. However, naive RAG (chunking text, creating embeddings, indexing in a vector store, and retrieving via cosine similarity) rarely cuts it in production. It struggles with exact keyword matching, long-document context, and noisy retrieval results.

To reach production-grade search, we must transition to a multi-stage retrieval pipeline.

---

The Production RAG Pipeline

mermaid

graph TD
    UserQuery[User Query] --> QR[Query Rewriter]
    QR --> DS[Dense Vector Search]
    QR --> SS[Sparse BM25 Search]
    DS --> Rec[Reciprocal Rank Fusion]
    SS --> Rec
    Rec --> Rerank[Cross-Encoder Reranker]
    Rerank --> ContextFilter[Context Packer / Filter]
    ContextFilter --> LLM[LLM Generation]

---

1. Hybrid Search (Dense + Sparse)

Dense embeddings (e.g., from OpenAI or Cohere) capture semantic meaning, but they often miss exact strings like product serial numbers, UUIDs, or niche terminology. Sparse search (BM25) scores documents by term overlap, so it excels at exact matches.
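
To make the exact-match behaviour concrete, here is a minimal, self-contained BM25 scorer. It is a sketch under stated assumptions: naive whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and a made-up three-document corpus. A production system would more likely rely on a search engine or library than on hand-rolled code like this.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (higher means more relevant)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical corpus: only the second document contains the serial number.
docs = [
    "Troubleshooting guide for wireless routers",
    "Recall notice for unit SN-48213-XK power supply",
    "General warranty terms for all power supplies",
]
scores = bm25_scores("SN-48213-XK", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the document that literally contains the serial number receives a non-zero score, which is the kind of lookup where embedding similarity alone tends to be unreliable.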

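Combining both via Reciprocal Rank Fusion (RRF) gets the strengths of each. RRF merges the two ranked lists using only rank positions, so BM25 scores and cosine similarities, which live on different scales, never need to be normalized against each other: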

python

# Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document,
# with rank starting at 1. k=60 is the commonly used default constant.
def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    """Fuse two ranked lists of doc IDs; returns (doc_id, score) pairs, best first."""
    rrf_scores = {}
    for results in (dense_results, sparse_results):
        # Enumerate from 1 so the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(results, start=1):
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    return sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
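
As a quick check of how the fusion behaves, the following sketch runs the function on two made-up ranked lists. The function is restated compactly so the snippet runs on its own, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    # Same scoring as above: each list adds 1 / (k + rank), with rank starting at 1.
    scores = {}
    for results in (dense_results, sparse_results):
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists, best match first.
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion(dense, sparse)
ranking = [doc_id for doc_id, _ in fused]
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Here doc_b (ranks 2 and 1) and doc_a (ranks 1 and 3) appear in both lists, so they rise above doc_d and doc_c, which each appear in only one.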

---

2. Cross-Encoder Reranking

Sending all 50 retrieved documents to the LLM is expensive and can degrade response quality: models tend to use information buried in the middle of a long context less reliably (the "lost in the middle" phenomenon). Instead:

Retrieve top-50 candidates using fast hybrid search.
Run the candidates through a Cross-Encoder Reranker (such as Cohere Rerank or BGE-Reranker).
Inject only the top-5 reranked documents into the prompt context.

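Cross-encoders process the query and document together in a single forward pass, so the model can attend across both texts and output a direct relevance score. This is typically more accurate than comparing independently computed embeddings, but it costs one model call per candidate, which is why it is applied only to the shortlist.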
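
The retrieve-then-rerank steps above can be sketched as follows. This is an illustration, not a definitive implementation: `rerank` and `toy_pair_score` are hypothetical names, and the token-overlap scorer is only a stand-in so the snippet runs without a model. In practice `score_fn` would wrap a real cross-encoder or a reranking API.

```python
from typing import Callable, List, Tuple


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Score each (query, document) pair jointly and keep the top_n documents."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def toy_pair_score(query: str, doc: str) -> float:
    # Placeholder only: fraction of query tokens found in the document.
    # A real cross-encoder would run a model over the query and document together.
    q_tokens = set(query.lower().split())
    d_tokens = set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens) if q_tokens else 0.0


# Hypothetical candidates, standing in for the output of hybrid search.
candidates = [
    "How to reset a router to factory settings",
    "Router firmware update instructions",
    "Office coffee machine cleaning schedule",
]
top = rerank("reset router settings", candidates, toy_pair_score, top_n=2)
```

Passing the scorer in as a function keeps the pipeline independent of any particular reranker, so the model can be swapped without touching the retrieval code.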

---

3. Parent-Document Retrieval (Chunk Splitting)

When indexing, we often split text into small chunks (e.g., 200 tokens) so that embedding vectors are highly focused. However, 200 tokens may lack the full context required to answer a question.

With Parent-Document Retrieval:

We segment documents into small child chunks for embedding search.
When a child chunk matches the query, we retrieve its entire parent document (or a larger parent chunk, e.g., 1,000 tokens) and feed that to the LLM.

This maintains high retrieval precision while delivering rich context for generation.
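
A minimal sketch of the child-to-parent lookup is shown below. Everything in it is an assumption made for illustration: the function names are hypothetical, words approximate tokens, the chunk size is deliberately tiny, and token overlap stands in for embedding similarity.

```python
import re


def tokens(text):
    # Assumption: lowercase word tokens stand in for a real tokenizer.
    return re.findall(r"\w+", text.lower())


def split_into_chunks(text, chunk_size):
    # Words approximate tokens here; chunk_size is deliberately tiny for the demo.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def build_child_index(parents, child_size=8):
    """Map every small child chunk back to the ID of its parent document."""
    index = []
    for parent_id, text in parents.items():
        for chunk in split_into_chunks(text, child_size):
            index.append({"parent_id": parent_id, "text": chunk})
    return index


def retrieve_parents(query, index, parents, top_k=2):
    """Search over child chunks, then return the de-duplicated parent texts."""
    q_tokens = set(tokens(query))

    def overlap(child):
        # Placeholder for embedding similarity: count of shared tokens.
        return len(q_tokens & set(tokens(child["text"])))

    ranked = sorted(index, key=overlap, reverse=True)
    seen, results = set(), []
    for child in ranked:
        if overlap(child) == 0 or len(results) == top_k:
            break
        if child["parent_id"] not in seen:
            seen.add(child["parent_id"])
            results.append(parents[child["parent_id"]])
    return results


# Hypothetical parent documents.
parents = {
    "manual": "The device ships with a two year warranty. To claim it, "
              "contact support with the purchase receipt and serial number.",
    "faq": "Shipping usually takes five business days within the country.",
}
index = build_child_index(parents)
context = retrieve_parents("warranty claim", index, parents, top_k=1)
```

The query matches small child chunks of the manual, but the returned context is the whole parent, including the receipt and serial number details that sit in a different child chunk.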

Amarjit Singh
