
Re-ranking: Quality Over Quantity

Use cross-encoders to improve the quality of retrieved context before it reaches the model.

Hybrid search gathers a high-recall list of candidates (e.g., 20 or 50 chunks). Sending all 50 chunks to an LLM is expensive, adds latency, and can confuse the model. Studies of long-context prompting show that models often underuse information placed in the middle of a long prompt, a phenomenon known as the 'lost in the middle' effect. To address this, production RAG pipelines commonly apply a second-pass neural reranker that keeps only the most relevant few chunks.

* Bi-Encoders (First Pass - Retrieval): Generate vectors for queries and documents independently. Because document vectors are computed ahead of time, scoring a query against millions of documents by cosine similarity takes only milliseconds with an approximate nearest-neighbor index. However, because query and document tokens never interact during embedding, fine nuances can be missed.
* Cross-Encoders (Second Pass - Reranking): Take the query and a document chunk together as a single input, so attention runs across the tokens of both. Every query-document pair needs its own forward pass, which makes them too slow for searching a whole corpus, but they serve as high-precision rerankers for a small pool of candidates.
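
The first-pass half of this contrast can be sketched in a few lines. This is a toy illustration rather than a real model: `embed` is a hypothetical bag-of-words stand-in for a bi-encoder, and the `documents` list is made up for the example. The point is only where the work happens: every document vector is built before any query arrives, and the query is embedded on its own.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Hypothetical stand-in for a bi-encoder: a bag-of-words count vector.

    Like a real bi-encoder, it sees one text at a time, never the query
    and the document together.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up corpus for illustration only.
documents = [
    "cancel your plan from the billing page",
    "the team plan includes ten seats",
    "webhooks report status updates",
]

# Offline step: every document is embedded once, before any query exists.
document_vectors = [embed(doc) for doc in documents]


def first_pass(query: str, top_k: int) -> list[int]:
    """Return the indices of the top_k documents by cosine similarity."""
    query_vector = embed(query)  # the only embedding computed at query time
    ranked = sorted(
        range(len(documents)),
        key=lambda i: cosine(query_vector, document_vectors[i]),
        reverse=True,
    )
    return ranked[:top_k]


top = first_pass("how do I cancel my plan", top_k=2)
```

In a real system `embed` would be a trained embedding model and the sorted scan would be replaced by a vector index, but the shape is the same: nothing about the query influences how a document is encoded, which is exactly what a cross-encoder gives up in exchange for precision.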

Resume screening vs. the final interview

A bi-encoder is the automated resume screener that scans thousands of applications in seconds, matching keywords and general shape against the job description without understanding either document deeply. It is fast enough to run on your entire candidate pool. A cross-encoder is the final interview panel: slow, expensive, and limited to one candidate at a time, but it reads the actual resume *and* the actual job description side by side and can catch nuance the screener missed entirely. You would never run the interview panel on 10,000 applicants; that's what the first-pass screener is for. But you would also never make a final hiring decision from the screener's rough shortlist alone. Retrieval plus reranking is exactly this two-stage funnel.

Example: Bi-Encoder vs Cross-Encoder Reranking

These 6 chunks were retrieved by a fast bi-encoder and are listed in order of cosine similarity. Only chunks 3 and 4 directly address cancellation and refunds, yet two pricing chunks outrank them. A cross-encoder pass that rescores each chunk against the query is meant to correct this kind of ordering.

Query: "How do I cancel my subscription and get a refund?"

1. (cosine 0.83) Subscription Tiers: the Pro plan is $29/month and the Team plan is $79/month, including 10 seats.
2. (cosine 0.79) The Team plan includes priority support and higher API rate limits than the Pro plan.
3. (cosine 0.74) To cancel your subscription, go to Settings, then Billing, then Cancel Plan. Access continues until the end of the billing period.
4. (cosine 0.71) Refunds are issued within 5 to 7 business days to your original payment method if requested within 14 days of the charge.
5. (cosine 0.65) You can downgrade your subscription at any time from the billing settings page.
6. (cosine 0.52) Our API supports webhooks for real-time subscription status updates.

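
A second-pass rerank over these six chunks could be sketched as follows. This is a minimal sketch under stated assumptions, not this lesson's implementation: `rerank` and `load_scorer` are hypothetical helper names, the scorer is injected so that any pairwise model can be plugged in, and the optional loader assumes the third-party `sentence-transformers` package with the public `cross-encoder/ms-marco-MiniLM-L-6-v2` checkpoint as one possible choice. No scores are shown because they depend on the model used.

```python
from typing import Callable, Sequence

# Scores a batch of (query, document) pairs; higher means more relevant.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(query: str, chunks: Sequence[str], score_pairs: PairScorer, top_k: int) -> list[str]:
    """Rescore every chunk jointly with the query and keep the best top_k."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    order = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in order[:top_k]]


def load_scorer(model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2") -> PairScorer:
    """Optional: build a scorer from sentence-transformers (assumed installed)."""
    from sentence_transformers import CrossEncoder  # third-party, imported lazily

    model = CrossEncoder(model_name)
    # predict() returns one relevance score per (query, document) pair.
    return lambda pairs: model.predict(list(pairs)).tolist()


query = "How do I cancel my subscription and get a refund?"
chunks = [
    "Subscription Tiers: the Pro plan is $29/month and the Team plan is $79/month, including 10 seats.",
    "The Team plan includes priority support and higher API rate limits than the Pro plan.",
    "To cancel your subscription, go to Settings, then Billing, then Cancel Plan. "
    "Access continues until the end of the billing period.",
    "Refunds are issued within 5 to 7 business days to your original payment method "
    "if requested within 14 days of the charge.",
    "You can downgrade your subscription at any time from the billing settings page.",
    "Our API supports webhooks for real-time subscription status updates.",
]

# With the package installed, usage would look like:
#   top_two = rerank(query, chunks, load_scorer(), top_k=2)
```

Keeping the scorer as a parameter means the same `rerank` function works with a local model, a hosted reranking API, or a simple fake scorer in unit tests.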
Our Project Implementation: Multi-Stage Retrieval Flow

In a complete production RAG pipeline, we chain these steps sequentially:

1. Fetch candidate lists from keyword (BM25) and vector search, restricted by security filters (tenant and user).
2. Merge the candidate lists using Reciprocal Rank Fusion (RRF).
3. Rerank the merged candidates using a cross-encoder (such as Cohere Rerank or a local sentence-transformers CrossEncoder model).
4. Feed the top-ranked chunks to the LLM (such as Gemini 1.5 Flash).

```python
# Conceptual flow of a multi-stage production RAG pipeline:
# 1. Fetch candidates matching security filters
candidates_bm25 = bm25_retrieve(query, tenant_id, user_id)
candidates_vector = vector_retrieve(query, tenant_id, user_id)

# 2. Merge candidate lists using RRF
fused_candidates = rrf_merge(candidates_bm25, candidates_vector)[:20]

# 3. Rerank candidates using a Cross-Encoder
reranked_chunks = cross_encoder_rerank(query, fused_candidates)
final_context = reranked_chunks[:5]

# 4. Stream response from LLM (Gemini, OpenAI, or OpenRouter)
stream_response(query, final_context)
```
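
The `rrf_merge` step in the flow above is left undefined. One way it could look is sketched below, using the standard Reciprocal Rank Fusion formula, in which a document's fused score is the sum of 1 / (k + rank) over every list it appears in. The function signature, the use of plain string IDs, and the constant k = 60 (a commonly used default) are assumptions for illustration, not details taken from this project.

```python
from collections import defaultdict


def rrf_merge(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion.

    Each list is ordered best-first. A document earns 1 / (k + rank) from
    every list that contains it, with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical IDs standing in for the BM25 and vector candidate lists.
bm25_ids = ["doc_a", "doc_b", "doc_c"]
vector_ids = ["doc_c", "doc_a", "doc_d"]
fused = rrf_merge(bm25_ids, vector_ids)
```

Because RRF uses only ranks, it never has to reconcile BM25 scores with cosine similarities, which live on different scales. Documents that appear in both lists (`doc_a` and `doc_c` here) rise above those found by only one retriever.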

This completes the full RAG lifecycle. By coordinating ingestion, OCR fallbacks for scanned files, vector indexing, keyword scoring, hybrid RRF merging, tenant-level access filtering, and neural reranking, you can build production-grade knowledge engines that deliver accurate answers while respecting each user's access permissions.

You have completed RAG
You now have the full retrieval mental model:

→ Training data is frozen; RAG bridges the gap with real-time retrieval instead of retraining (Lesson 1)
→ A RAG system runs two pipelines: offline ingestion and indexing, online retrieval and generation (Lesson 2)
→ Vector databases store meaning as coordinates, enabling semantic search over millions of chunks (Lesson 3)
→ BM25 keyword search is a fast, dependency-free alternative that excels at exact terms, IDs, and codes (Lesson 4)
→ Chunk size and overlap control the tradeoff between context and precision (Lesson 5)
→ Hybrid search merges keyword and semantic results with Reciprocal Rank Fusion (Lesson 6)
→ Re-ranking applies a slower, smarter cross-encoder pass to put the best evidence first (Lesson 7)

Next up: MCP, the Model Context Protocol. Now that models can retrieve documents, the next question is how they connect to live tools, databases, and APIs using one universal, standardised interface.