Dishant Sethi · May 27, 2026 · 16 min read

Why Your RAG Pipeline Is Failing in Production (And How to Fix It)


A diagnostic guide to the 5 most common RAG pipeline failures in production — bad chunking, missing reranking, stale indexes, no hybrid retrieval, and no eval loop — with code snippets and fixes for each.

Key Takeaways
80% of RAG failures trace back to the ingestion layer, not the LLM — fix chunking and indexing before tuning your prompts
Chunk size alone can swing retrieval precision by 20–40%; there is no universal right answer, and the correct value depends on your document type and query pattern
Adding a cross-encoder reranker on top of vector search typically lifts answer correctness by 15–25% with minimal latency cost
Stale indexes are invisible in standard monitoring: a document updated 3 months ago may still be answering queries from its old content
Teams without an eval loop discover regressions 4–8× slower than teams with automated retrieval quality checks running on every deployment

A RAG pipeline looks straightforward on paper: retrieve relevant chunks, stuff them into a prompt, get an answer. Teams wire it up in a weekend, the demo works, and they ship it. Then, weeks later, users start complaining that the system returns outdated information, misses obvious answers, or confidently cites the wrong document.

RAG pipeline debugging starts at the retrieval layer, not the LLM. The five failure modes that break production RAG systems — bad chunking, missing reranking, stale indexes, no hybrid retrieval, and no eval loop — are all fixable at the data and infrastructure layer. None require changing your model or rewriting your application.

Why RAG Fails Silently in Production

The LLM itself is almost always fine. The retrieval layer is what's broken — and most observability tooling points at the model, not the retriever. You can spend days tweaking system prompts and temperature settings while the root cause sits in how you chunked your documents three months ago. Production RAG failure leaves no stack trace.

There is no exception, no 500 error, no latency spike. The system continues to return answers. They are just wrong, or incomplete, or stale. Without an explicit eval loop tied to retrieval quality, you will not know until a user tells you.

This guide covers the five failure modes Prodinit encounters most often when auditing RAG systems in production, with diagnosis steps and fixes for each.

Failure Mode 1: Bad Chunking Strategy

Fixed-size character splitting destroys retrieval quality for anything beyond plain prose. A 512-token chunk of a legal contract may split a clause mid-sentence; a 512-token chunk of code may span four unrelated functions. Neither produces embeddings specific enough to surface the right document for a precise query.

Why it breaks

Chunking is the most consequential decision in a RAG pipeline and the one teams spend the least time on. The default in most frameworks is a fixed-size character or token split with a small overlap. This works in demos. In production, it destroys retrieval quality for anything that isn't plain prose.

The problem with fixed-size chunking:

A 512-token chunk of a legal contract may split a clause mid-sentence, leaving neither chunk with enough context to be retrieved correctly
A 512-token chunk of code may contain four unrelated functions, causing the entire chunk to match queries loosely but none of them precisely
Tables, structured data, and numbered lists lose their semantics when split by character count

When your chunks are semantically incoherent, your embeddings are noisy. Noisy embeddings produce low-confidence nearest-neighbor results. The retriever returns tangentially related chunks, the LLM hallucinates to fill the gap, and the answer looks plausible but wrong.

Diagnosis

Check what your chunks actually look like:

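import json
import random

def audit_chunks(chunks: list[str], sample_size: int = 20) -> dict:
    """Summarise chunk lengths and estimate how often chunks end mid-sentence."""
    sample = random.sample(chunks, min(sample_size, len(chunks)))
    # Whitespace-separated words are only a rough proxy for tokens
    lengths = [len(c.split()) for c in chunks]
    # A sampled chunk counts as truncated if it does not end on sentence or block punctuation
    truncated = sum(
        1 for c in sample
        if not c.strip().endswith((".", "?", "!", "```", "}"))
    )
    
    return {
        "avg_tokens": sum(lengths) / len(lengths),
        "min_tokens": min(lengths),
        "max_tokens": max(lengths),
        # Percentage of the sample, so it can be compared with the 30% red-flag threshold
        "truncated_sentences": round(truncated / len(sample) * 100, 1),
        "sample": sample[:3],
    }

# Run this against your chunk corpus
audit = audit_chunks(your_chunks)
print(json.dumps(audit, indent=2))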

Red flags: truncated_sentences above 30%, average tokens below 100 or above 600, or chunks that end mid-code-block.

Fix

Switch to semantic chunking. For prose documents, split on sentence boundaries and merge until a semantic similarity threshold is crossed. For structured content, use document-aware splitters that respect headings, tables, and code blocks.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Document-aware splitter that respects structure
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", "! ", "? ", " ", ""],
    chunk_size=600,          # in units of length_function: characters here, not tokens
    chunk_overlap=60,        # ~10% overlap for context continuity
    length_function=len,     # swap in a tokenizer-based counter to size chunks in tokens
    is_separator_regex=False,
)

# For code: use language-aware splitters
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=800,
    chunk_overlap=80,
)
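The splitters above respect document structure but do not look at similarity between sentences. As a rough sketch of the sentence-merging approach described for prose, the function below keeps appending sentences to the current chunk until the next sentence is too dissimilar or the chunk gets too long. The names `semantic_chunks`, `bow_embed`, `min_similarity`, and `max_words` are illustrative assumptions, and the bag-of-words embedding is a toy stand-in; with a real embedding model you would replace both `bow_embed` and `cosine`.

```python
import math
import re
from collections import Counter
from typing import Callable


def bow_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    text: str,
    embed: Callable[[str], Counter] = bow_embed,
    min_similarity: float = 0.2,   # hypothetical threshold; tune on your corpus
    max_words: int = 200,          # hard cap so chunks cannot grow without bound
) -> list[str]:
    # Split on whitespace that follows sentence-ending punctuation
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for sentence in sentences:
        if current:
            running = " ".join(current)
            similarity = cosine(embed(running), embed(sentence))
            too_long = len(running.split()) + len(sentence.split()) > max_words
            # Close the chunk when the topic shifts or the size cap is reached
            if similarity < min_similarity or too_long:
                chunks.append(running)
                current = []
        current.append(sentence)

    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "Refunds are issued within 14 days. "
    "Refunds are sent to the original card. "
    "The API gateway times out after 30 seconds."
)
for chunk in semantic_chunks(text):
    print(repr(chunk))
```

With the toy embedding, the two refund sentences share enough vocabulary to merge, while the gateway sentence starts a new chunk. The threshold that works for a real embedding model will be different and needs to be tuned against your own documents.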

There is no universal correct chunk size. Run retrieval precision benchmarks at 256, 512, and 1024 tokens against a sample of real queries. Pick the size that maximises the percentage of queries where the correct answer appears in the top-3 retrieved chunks.
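One way to run that benchmark is sketched below. It assumes you can supply a `build_retriever(chunk_size)` factory that re-chunks and re-indexes your corpus and returns a `retrieve(query)` function yielding chunk texts in rank order. That factory, the function names, and the substring check for "correct answer" are all assumptions made for illustration.

```python
from typing import Callable, Iterable

Retriever = Callable[[str], list[str]]  # query -> chunk texts, best first


def top3_hit_rate(retrieve: Retriever, labelled_queries: list[tuple[str, str]]) -> float:
    """Fraction of queries whose expected answer text appears in the top-3 chunks."""
    hits = sum(
        1
        for query, answer in labelled_queries
        if any(answer in chunk for chunk in retrieve(query)[:3])
    )
    return hits / len(labelled_queries)


def sweep_chunk_sizes(
    build_retriever: Callable[[int], Retriever],
    labelled_queries: list[tuple[str, str]],
    sizes: Iterable[int] = (256, 512, 1024),
) -> dict[int, float]:
    """Re-index at each chunk size and record the top-3 hit rate."""
    return {
        size: top3_hit_rate(build_retriever(size), labelled_queries)
        for size in sizes
    }
```

Given the resulting dictionary `rates`, `max(rates, key=rates.get)` picks the best-scoring size. Substring matching is a crude relevance check; if you have labelled chunk or document IDs, compare against those instead.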

Failure Mode 2: Missing Reranking

Vector similarity search retrieves the right candidates but ranks them poorly. The chunk with the highest cosine similarity is not always the most useful chunk for the specific query — it is the closest in embedding space, not the most relevant to the question. Without a cross-encoder reranker, you are systematically passing the wrong context to your LLM.

Why it breaks

Vector similarity search is excellent at candidate retrieval. It is poor at ranking. Cosine similarity between two high-dimensional embeddings captures semantic proximity, not answer relevance for a specific query. The top result by cosine distance is not always the most useful chunk for the question at hand.

Teams that skip reranking are essentially treating their retrieval problem as solved after the first-stage ANN search. In practice, the chunk that best answers the query is often ranked 3rd or 5th by embedding similarity — close enough to retrieve, not close enough to surface first.

If your system passes the top-1 or top-2 chunks to the LLM without reranking and truncates the rest, you are systematically dropping the best answers.

Diagnosis

Run a relevance audit on your retrieval results:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def audit_retrieval_rank(query: str, retrieved_chunks: list[str], 
                          ground_truth_chunk: str) -> dict:
    scores = reranker.predict(
        [(query, chunk) for chunk in retrieved_chunks]
    )
    reranked = sorted(
        enumerate(retrieved_chunks), 
        key=lambda x: scores[x[0]], 
        reverse=True
    )
    
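    # Guard: the ground-truth chunk may not have been retrieved at all,
    # in which case .index() below would raise ValueError.
    if ground_truth_chunk not in retrieved_chunks:
        return {"query": query, "vector_rank": None,
                "reranked_rank": None, "improved": False}
    
    vector_rank = retrieved_chunks.index(ground_truth_chunk) + 1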
    reranked_rank = next(
        i + 1 for i, (orig_idx, _) in enumerate(reranked)
        if retrieved_chunks[orig_idx] == ground_truth_chunk
    )
    
    return {
        "query": query,
        "vector_rank": vector_rank,
        "reranked_rank": reranked_rank,
        "improved": reranked_rank < vector_rank,
    }

If reranked rank is better than vector rank on more than 30% of your test queries, you have a reranking gap that is actively hurting answer quality.
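To apply that 30% rule across a test set, the per-query results can be aggregated as in the sketch below. It assumes each audit is a dict with a boolean `improved` key, as returned by `audit_retrieval_rank`; the function name `reranking_gap` and the output keys are illustrative.

```python
def reranking_gap(audits: list[dict], threshold: float = 0.30) -> dict:
    """Share of test queries where reranking moved the ground-truth chunk up."""
    improved = sum(1 for audit in audits if audit["improved"])
    share = improved / len(audits) if audits else 0.0
    return {
        "queries": len(audits),
        "improved_share": round(share, 3),
        "has_gap": share > threshold,  # True means reranking is worth adding
    }
```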

Fix

Add a cross-encoder reranker as a second-pass filter. Retrieve k=20 candidates from your vector store, rerank them, and pass the top-3 to your LLM. The cross-encoder sees the full query and each chunk together, which lets it score relevance directly rather than proximity in embedding space.

from sentence_transformers import CrossEncoder
from typing import List

class RerankedRetriever:
    def __init__(self, vector_store, reranker_model: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        self.vector_store = vector_store
        self.reranker = CrossEncoder(reranker_model)
    
    def retrieve(self, query: str, top_k: int = 3, candidate_k: int = 20) -> List[str]:
        # First-stage: broad vector retrieval
        candidates = self.vector_store.similarity_search(query, k=candidate_k)
        
        # Second-stage: cross-encoder reranking
        pairs = [(query, doc.page_content) for doc in candidates]
        scores = self.reranker.predict(pairs)
        
        ranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )
        return [doc.page_content for doc, _ in ranked[:top_k]]

Cross-encoder reranking adds 50–200ms of latency for a 20-candidate set. For most production RAG workloads, that is an acceptable trade for a 15–25% improvement in answer correctness.

Failure Mode 3: Stale Index

Your embedding index is a snapshot of your documents at indexing time. When a policy is updated, a product spec is revised, or a pricing page changes, the index does not update automatically — queries continue retrieving content from weeks or months ago, with no error signal to indicate the problem.

Why it breaks

Stale index is insidious because it is invisible in standard observability. Query latency is normal. Embedding lookups return results. The system appears healthy. Users are just silently receiving outdated information.

The problem compounds with time. A document indexed 6 months ago and updated 3 times since is a liability, not an asset.

Diagnosis

Implement index freshness tracking:

import hashlib
from datetime import datetime
from dataclasses import dataclass

@dataclass
class IndexedDocument:
    doc_id: str
    content_hash: str
    indexed_at: datetime
    source_updated_at: datetime

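def audit_index_freshness(indexed_docs: list[IndexedDocument], 
                           max_age_days: int = 30) -> dict:
    # Assumes indexed_at and source_updated_at are naive UTC datetimes
    now = datetime.utcnow()
    stale = []
    
    for doc in indexed_docs:
        age = (now - doc.indexed_at).days
        
        if doc.source_updated_at > doc.indexed_at:
            # Source changed after indexing: the index is serving old content
            gap = doc.source_updated_at - doc.indexed_at
            stale.append({
                "id": doc.doc_id, 
                "reason": "source_updated_after_index",
                # total_seconds() includes whole days; .seconds would drop them
                "gap_hours": int(gap.total_seconds() // 3600),
            })
        elif age > max_age_days:
            # Each document is recorded once, so stale_pct cannot exceed 100
            stale.append({"id": doc.doc_id, "reason": "index_too_old", "age_days": age})
    
    total = len(indexed_docs)
    return {
        "total_documents": total,
        "stale_count": len(stale),
        "stale_pct": round(len(stale) / total * 100, 1) if total else 0.0,
        "stale_docs": stale[:10],
    }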

Fix

Implement incremental re-indexing on document change, not on a fixed schedule. Track content hashes. When a source document's hash changes, queue it for re-embedding immediately.

import hashlib
from datetime import datetime

class IncrementalIndexer:
    def __init__(self, vector_store, embedder, chunker):
        # vector_store is assumed to expose delete(filter=...) and add_embeddings(...);
        # adapt these calls to your store's client. chunker maps text -> list[str].
        self.vector_store = vector_store
        self.embedder = embedder
        self.chunker = chunker
        self.index_registry: dict[str, str] = {}  # doc_id -> content_hash
    
    def _content_hash(self, content: str) -> str:
        return hashlib.sha256(content.encode()).hexdigest()
    
    def upsert_document(self, doc_id: str, content: str, metadata: dict):
        new_hash = self._content_hash(content)
        
        if self.index_registry.get(doc_id) == new_hash:
            return  # Content unchanged, skip re-indexing
        
        # Remove the document's old chunks first so stale content cannot be retrieved
        self.vector_store.delete(filter={"doc_id": doc_id})
        
        chunks = self.chunker(content)
        embeddings = self.embedder.embed_documents(chunks)
        
        self.vector_store.add_embeddings(
            texts=chunks,
            embeddings=embeddings,
            metadatas=[{**metadata, "doc_id": doc_id, "indexed_at": datetime.utcnow().isoformat()}
                       for _ in chunks],
        )
        
        # Record the hash only after the new chunks are written
        self.index_registry[doc_id] = new_hash

Wire this to your content management system's webhook or change-data-capture stream. Every document update should trigger an upsert within minutes, not the next scheduled batch run.
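A minimal sketch of that wiring is shown below. The event shape (`action`, `doc_id`, `content`, `metadata`) is hypothetical, since every CMS and CDC stream has its own payload format. The handler assumes an indexer with the `upsert_document` method and `index_registry` dict shown above, and a vector store that supports `delete(filter=...)`.

```python
def handle_change_event(event: dict, indexer, vector_store) -> str:
    """Route one document-change event to the incremental indexer.

    `event` is a hypothetical payload: {"action", "doc_id", "content", "metadata"}.
    """
    action = event["action"]
    doc_id = event["doc_id"]

    if action == "deleted":
        # Remove the chunks and forget the hash so a later re-create is re-indexed
        vector_store.delete(filter={"doc_id": doc_id})
        indexer.index_registry.pop(doc_id, None)
        return "deleted"

    if action in ("created", "updated"):
        # upsert_document skips the work when the content hash is unchanged
        indexer.upsert_document(doc_id, event["content"], event.get("metadata", {}))
        return "upserted"

    return "ignored"  # unrelated event types such as views or comments
```

Handling deletions explicitly matters as much as handling updates: a document removed from the source but left in the index is another form of stale content.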

Failure Mode 4: No Hybrid Retrieval (BM25 + Vector)

Pure vector search fails on exact-match queries. When a user searches for a specific error code, API endpoint, or product identifier, vector similarity often surfaces semantically related content that never contains the exact string. BM25 handles rare-term and exact-match queries precisely — hybrid retrieval combines both and consistently outperforms either approach alone.

Why it breaks

Pure vector search excels at semantic similarity. It is poor at exact matching. When a user queries for a specific product code, a person's name, an API endpoint, or an error message, vector search often surfaces semantically related but lexically different results. The chunk containing the exact string ERR_QUOTA_EXCEEDED may score lower than a chunk about "error handling" that never mentions the specific code.

BM25 (the algorithm behind classic keyword search) handles exact and rare-term matching extremely well. It rewards documents that contain the query terms, with inverse document frequency weighting meaning that rare, specific terms get boosted. What BM25 misses is paraphrase, synonym, and conceptual matching — exactly what vector search handles.
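To make the rare-term boost concrete, here is a small sketch of inverse document frequency on a toy corpus. It uses one common non-negative BM25 IDF variant, ln(1 + (N - n + 0.5) / (n + 0.5)); other implementations, possibly including the library you use, may apply a slightly different formula. The corpus and helper names are made up for illustration.

```python
import math


def bm25_idf(num_docs: int, docs_containing_term: int) -> float:
    """Non-negative BM25 IDF variant: rarer terms get larger weights."""
    n = docs_containing_term
    return math.log((num_docs - n + 0.5) / (n + 0.5) + 1)


corpus = [
    "error handling best practices",
    "retry on error with backoff",
    "ERR_QUOTA_EXCEEDED means the error quota was hit",
    "billing and invoices",
]
# Same naive tokenization as a lowercase whitespace split
tokenized = [set(doc.lower().split()) for doc in corpus]


def idf_for(term: str) -> float:
    containing = sum(1 for tokens in tokenized if term.lower() in tokens)
    return bm25_idf(len(tokenized), containing)


common = idf_for("error")               # appears in 3 of 4 documents
rare = idf_for("ERR_QUOTA_EXCEEDED")    # appears in 1 of 4 documents
print(f"idf(error)={common:.3f}  idf(ERR_QUOTA_EXCEEDED)={rare:.3f}")
```

The identifier that appears in a single document receives a noticeably higher weight than the generic word "error", which is why BM25 can surface the exact-match chunk that embedding similarity ranks lower.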

Teams that use only vector search leave a meaningful precision gap for queries with specific identifiers. Teams that use only BM25 miss semantic intent. Hybrid retrieval combines both, and on standard retrieval benchmarks it consistently outperforms either approach alone.

Diagnosis

Run a query set that mixes semantic queries ("how does the refund policy work?") and exact-match queries ("what is the timeout value for API_GATEWAY_CONNECT?"). Compare top-3 precision for vector-only versus BM25-only versus hybrid across both query types. If vector-only precision on exact-match queries is more than 15 percentage points lower than on semantic queries, you have a pure-vector blind spot.
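The comparison can be scripted once you have per-query precision numbers. The sketch below assumes each result is a dict with a `type` of `"semantic"` or `"exact"` and a `precision_at_3` value for the vector-only retriever; those field names and the function name are illustrative.

```python
def precision_gap_by_query_type(results: list[dict]) -> dict:
    """Compare average precision@3 between semantic and exact-match queries."""
    by_type: dict[str, list[float]] = {}
    for result in results:
        by_type.setdefault(result["type"], []).append(result["precision_at_3"])

    averages = {qtype: sum(vals) / len(vals) for qtype, vals in by_type.items()}
    # Gap in percentage points: how much worse exact-match queries perform
    gap_points = (averages["semantic"] - averages["exact"]) * 100
    return {
        "avg_precision_at_3": averages,
        "gap_points": round(gap_points, 1),
        "blind_spot": gap_points > 15,  # threshold from the diagnosis above
    }
```

Run it once per retriever configuration (vector-only, BM25-only, hybrid) to see whether fusion closes the gap.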

Fix

Implement reciprocal rank fusion (RRF) to merge vector and BM25 rankings:

from rank_bm25 import BM25Okapi
import numpy as np
from typing import List

class HybridRetriever:
    def __init__(self, vector_store, documents: List[str], 
                 rrf_k: int = 60, alpha: float = 0.5):
        self.vector_store = vector_store
        self.documents = documents
        self.alpha = alpha         # 0 = BM25 only, 1 = vector only
        self.rrf_k = rrf_k
        
        tokenized = [doc.lower().split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
    
    def _rrf_score(self, rank: int) -> float:
        return 1.0 / (self.rrf_k + rank)
    
    def retrieve(self, query: str, top_k: int = 5) -> List[str]:
        """Return the top_k fused document IDs (not document text)."""
        vector_results = self.vector_store.similarity_search(query, k=top_k * 4)
        
        tokenized_query = query.lower().split()
        bm25_scores = self.bm25.get_scores(tokenized_query)
        bm25_ranked = np.argsort(bm25_scores)[::-1][:top_k * 4]
        
        rrf_scores: dict[str, float] = {}
        
        # RRF ranks are 1-based, so the best result contributes 1 / (rrf_k + 1)
        for rank, doc in enumerate(vector_results, start=1):
            # Assumes metadata["id"] == f"doc_{i}", where i is the document's position
            # in self.documents; otherwise the two rankings never share an ID and cannot fuse
            doc_id = doc.metadata["id"]
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + self.alpha * self._rrf_score(rank)
        
        for rank, idx in enumerate(bm25_ranked, start=1):
            doc_id = f"doc_{idx}"
            rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + (1 - self.alpha) * self._rrf_score(rank)
        
        sorted_ids = sorted(rrf_scores, key=rrf_scores.get, reverse=True)
        return sorted_ids[:top_k]

Start with alpha=0.5 (equal weight) and tune based on your query distribution. If your users ask mostly exact-product or identifier queries, shift toward alpha=0.3 to weight BM25 more heavily.
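The effect of alpha is easier to see on a toy example. The standalone function below is a sketch of weighted reciprocal rank fusion over two ranked ID lists, using 1-based ranks; the function name and the document IDs are made up for illustration.

```python
def weighted_rrf(
    vector_ids: list[str],
    bm25_ids: list[str],
    alpha: float = 0.5,   # 0 = BM25 only, 1 = vector only
    rrf_k: int = 60,
) -> list[str]:
    """Fuse two ranked ID lists with weighted reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for rank, doc_id in enumerate(vector_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (rrf_k + rank)
    for rank, doc_id in enumerate(bm25_ids, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (rrf_k + rank)
    return sorted(scores, key=scores.get, reverse=True)


vector_ranking = ["a", "b", "c"]   # best match first
bm25_ranking = ["c", "a", "d"]

print(weighted_rrf(vector_ranking, bm25_ranking, alpha=0.5))  # ['a', 'c', 'b', 'd']
print(weighted_rrf(vector_ranking, bm25_ranking, alpha=0.3))  # ['c', 'a', 'd', 'b']
```

At equal weight, document `a` wins because it ranks high in both lists. Shifting to alpha=0.3 promotes `c`, the top BM25 hit, and lifts the BM25-only result `d` above the vector-only result `b`.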

Failure Mode 5: No Eval Loop

Without an automated eval loop, every regression in your RAG pipeline is invisible until a user complaint surfaces it. Teams without retrieval quality checks running on every deployment discover degradation 4–8× slower than teams that do — and by then, the root cause is typically tangled across multiple changes and hard to isolate.

Why it breaks

You cannot improve what you do not measure. RAG systems degrade over time as documents are updated, query patterns shift, and underlying model versions change. Without an automated eval loop tied to retrieval quality metrics, every one of these changes is invisible until a user complaint surfaces it.

The eval loop is not optional. It is the mechanism that keeps your RAG pipeline honest over its operational lifetime.

Diagnosis

Check whether your deployment pipeline currently runs any of these:

Retrieval precision@k (what fraction of the top-k retrieved chunks are ground-truth relevant?)
Answer faithfulness (does the generated answer stay within the retrieved context, or does it hallucinate beyond it?)
Answer relevance (does the generated answer actually address the query?)
Context recall (does the retrieved set contain all the information needed to answer correctly?)

If none of these are tracked per deployment, you are operating blind.

Fix

Build a retrieval eval suite using a golden query set and run it in CI on every deployment:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EvalCase:
    query: str
    expected_doc_ids: List[str]
    expected_answer_contains: Optional[str] = None

def precision_at_k(retrieved_ids: List[str], relevant_ids: List[str], k: int) -> float:
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def run_retrieval_eval(retriever, eval_cases: List[EvalCase], k: int = 3) -> dict:
    results = []
    
    for case in eval_cases:
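        retrieved = retriever.retrieve(case.query, top_k=k)
        # Assumes each result is a dict with an "id" key; adapt this line if your
        # retriever returns plain strings or document objects instead
        retrieved_ids = [r["id"] for r in retrieved]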
        
        precision = precision_at_k(retrieved_ids, case.expected_doc_ids, k)
        recall = sum(
            1 for doc_id in case.expected_doc_ids if doc_id in retrieved_ids
        ) / len(case.expected_doc_ids)
        
        results.append({
            "query": case.query,
            f"precision@{k}": precision,
            "recall": recall,
        })
    
    avg_precision = sum(r[f"precision@{k}"] for r in results) / len(results)
    avg_recall = sum(r["recall"] for r in results) / len(results)
    
    return {
        f"avg_precision@{k}": round(avg_precision, 3),
        "avg_recall": round(avg_recall, 3),
        "per_query": results,
    }

def ci_gate(current_metrics: dict, baseline_metrics: dict, 
             relative_threshold: float = 0.05) -> bool:
    baseline_p = baseline_metrics["avg_precision@3"]
    current_p = current_metrics["avg_precision@3"]
    regression = (baseline_p - current_p) / baseline_p
    
    if regression > relative_threshold:
        print(f"FAIL: precision@3 regressed {regression:.1%} (baseline={baseline_p:.3f}, current={current_p:.3f})")
        return False
    return True

Run this eval suite against a golden set of 50–200 query/relevant-document pairs on every deploy. Gate the deployment if precision@3 drops more than 5% relative to the last passing run. Eval loops for agentic systems that wrap RAG — multi-step agents with tool calls, not just single-turn retrieval — require additional instrumentation; the AI agents in production guide covers the trace-level and metric-level setup for those workloads.
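"Relative to the last passing run" implies the baseline has to be stored somewhere between deployments. The sketch below keeps it in a JSON file and only overwrites it when the gate passes. The file location, the function names, and the choice of a flat JSON file are assumptions; a CI artifact store or database would work equally well.

```python
import json
from pathlib import Path
from typing import Optional


def load_baseline(path: Path) -> Optional[dict]:
    """Return the stored metrics from the last passing run, if any."""
    return json.loads(path.read_text()) if path.exists() else None


def gate_and_update(
    current: dict,
    path: Path,
    metric: str = "avg_precision@3",
    relative_threshold: float = 0.05,
) -> bool:
    """Fail on a relative regression; otherwise record the run as the new baseline."""
    baseline = load_baseline(path)

    if baseline is not None and baseline[metric] > 0:
        regression = (baseline[metric] - current[metric]) / baseline[metric]
        if regression > relative_threshold:
            print(f"FAIL: {metric} regressed {regression:.1%}")
            return False  # keep the old baseline so the next run is still compared to it

    # First run, or a passing run: this becomes the baseline for the next deploy
    path.write_text(json.dumps({metric: current[metric]}))
    return True
```

In a CI job, exit with a non-zero status when `gate_and_update` returns `False` so the deployment step does not run. Note that small passing regressions can accumulate under this scheme, so it is worth also tracking the metric over time.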

RAG Pipeline Debugging Checklist

Run this before spending time on prompt engineering or model tuning. These five failure modes are sequential — chunking problems corrupt every downstream step, so work top to bottom. If any item below fails, fix it before moving to the next row.

Check	Tool / Signal	Pass Condition
Chunk quality	Run `audit_chunks()`	`truncated_sentences` < 30%, avg tokens 100–600
Chunk strategy	Manual inspection	Chunks are semantically coherent units
Reranker present	Code review	Cross-encoder reranker on first-stage candidates
Reranker improves rank	`audit_retrieval_rank()`	Ground-truth rank improves in > 30% of queries
Index freshness	Hash comparison	No document indexed > 30 days without change check
CDC / webhook	Infrastructure review	Document updates trigger re-index within minutes
Hybrid retrieval	Code review	BM25 + vector fusion implemented
Hybrid alpha tuned	Precision comparison	Hybrid P@3 ≥ max(vector-only, BM25-only) P@3
Eval suite exists	CI pipeline	Retrieval eval runs on every deployment
Regression gate	CI config	Deploy blocked if precision drops > 5% relative


Frequently Asked Questions

Should I always use hybrid retrieval, or is vector-only ever sufficient?

Vector-only retrieval is sufficient when your query population is entirely semantic and your documents contain no specific identifiers, codes, or exact phrases that users will query directly. In practice, most production corpora have at least some exact-match critical content (error codes, names, dates), and hybrid retrieval is the safer default. Start hybrid and tune the BM25 weight down if query analysis shows it is not helping.

How large should my golden eval set be?

A minimum viable eval set is 50 query/relevant-document pairs — enough to detect regressions of roughly 10 percentage points with 80% statistical power. For 5-point precision regressions, you need 150–200 cases. Prioritise coverage across document categories, semantic queries, exact-match queries, and known edge cases. A small, well-curated set beats a large, noisy one.

What is retrieval precision@k?

Precision@k measures what fraction of the top-k retrieved chunks are actually relevant to the query. If you retrieve 3 chunks (k=3) and 2 are correct, precision@3 is 0.67. It is the primary signal for whether your retriever is surfacing the right content — a drop of more than 5% relative to your baseline should block a deployment.

What is the right reranker model for production?

cross-encoder/ms-marco-MiniLM-L-6-v2 is a strong starting point: it runs with 50–100ms latency on CPU for a 20-candidate set and is well-calibrated on information retrieval tasks. For domain-specific content (legal, medical, code), fine-tune a cross-encoder on your own query/passage pairs. If latency is critical, Cohere Rerank or Jina Reranker via API offload the computation and handle batching automatically.
