The RAG Pipeline: Index, Retrieve, Generate

Walk through every stage of a working RAG system with clear diagrams and real examples.

A production-grade RAG application separates concerns into two distinct workflows: Ingestion & Indexing (an offline preprocessing pipeline that structures documents) and Retrieval & Generation (an online, real-time loop executed for every user question).

The library that's built at night and used during the day

Think of a library that operates in two shifts. Overnight, a small army of librarians (the ingestion pipeline) receives every new book, slices it into indexed reference cards, and files those cards into the card catalog. Nobody is browsing at 3am, so this shift can take its time and do the work carefully. During the day, a visitor (the user) walks up with a question. A front-desk librarian (the retriever) does not reread every book in the building. They consult the card catalog built overnight, pull the 3-5 most relevant cards, and hand them to a specialist (the LLM) who reads just those cards and answers the question. Two completely different workflows, running on completely different schedules, working together.

Phase 1: Ingestion & Indexing (Offline Pipeline)

1. 📄 Source Document: Unstructured file (.pdf, .txt, .md)
2. 📝 Extract Text: Document parsing engine
3. ✂️ Chunk Text: Slicing into logical blocks
4. 🔢 Generate Embeddings: Convert text to vector floats
5. 🗄️ Index in Vector DB: Index vectors and payloads

Phase 2: Retrieval & Generation (Online Runtime)

1. 💬 User Question: "What is the billing policy?"
2. 🔢 Embed Query: Aligning query coordinates
3. 🔍 Vector Search: Query similarity math
4. 📋 Assemble Context: System instructions + Chunks
5. 🤖 Language Model Gen: Synthesis and generation
6. Grounded Response: Answer with source citations

Ingestion and Parsing: In the offline ingestion phase, documents (PDFs, text files, markdown) are read and parsed. Extracting text from digital PDFs is straightforward, but scanned or image-based files have no embedded text layer, so they need an Optical Character Recognition (OCR) step; without it the parser returns empty strings and nothing useful is indexed. Once parsed, the text is sliced into chunks, each chunk is converted to a vector embedding, and the vectors are stored in a vector database along with the chunk text.
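
The chunking step above can be sketched as a fixed-size sliding window with overlap. This is a minimal illustration, not the project's actual chunker: the function name `chunk_text` and the `chunk_size` and `overlap` defaults are assumptions chosen for the example, and real pipelines often split on sentence or heading boundaries instead of raw character counts.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip windows that contain only whitespace
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of storing some text twice.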

Retrieval and Generation: During the online runtime phase, the user query is converted into a vector using the same embedding model that indexed the chunks, so that the query and the chunks are comparable in the same vector space. The vector database returns the closest-matching chunks. These chunks are inserted into a prompt template alongside the system instructions and sent to the LLM, which synthesizes an answer grounded in the retrieved text.
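
To make the retrieval loop concrete, here is a minimal sketch that ranks chunks by cosine similarity and assembles a prompt. Everything in it is an assumption for illustration: the helper names (`cosine_similarity`, `retrieve_top_k`, `build_prompt`), the prompt wording, and the hand-written 2-D vectors that stand in for real embeddings. It scans every vector by brute force, whereas a vector database would typically use an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[str]:
    """Return the k chunk texts whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Place numbered chunks and the question into a simple prompt template."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Toy index: (chunk text, vector). Real vectors would come from an embedding model.
toy_index = [
    ("Billing is monthly.", [1.0, 0.0]),
    ("Office hours are 9-5.", [0.0, 1.0]),
    ("Refunds take 5 days.", [0.9, 0.1]),
]
top_chunks = retrieve_top_k([1.0, 0.05], toy_index, k=2)
prompt = build_prompt("What is the billing policy?", top_chunks)
```

Numbering the chunks in the prompt is one simple way to let the model cite its sources in the answer.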

Our Project Ingestion: Gemini Multimodal OCR Fallback
In our Smart File Cabinet project (`rag/enterprise-search/ingestion.py`), digital PDFs are parsed directly. If parsing returns fewer than 50 characters of text, the PDF is treated as a scanned image and the pipeline automatically triggers an OCR fallback:

1. It uploads the document to the Google Files API.
2. It calls Gemini 1.5 Flash to perform layout-aware OCR, formatting tables to Markdown.
3. It deletes the temporary file from the Files API, then indexes the resulting text.
```python
import os

# The calls below match the google-generativeai Python package.
import google.generativeai as genai


def ocr_pdf_with_gemini(file_path: str) -> str:
    """Upload a scanned PDF to Google's Files API and extract text using Gemini."""
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set. Cannot run OCR fallback.")
    # Assumption: the SDK has not been configured elsewhere, so configure it here.
    genai.configure(api_key=api_key)

    # Upload the file to the Gemini Files API
    uploaded_file = genai.upload_file(path=file_path)

    try:
        model = genai.GenerativeModel("gemini-1.5-flash-latest")
        prompt = (
            "Perform OCR on this document. Transcribe all text, tables, and handwritten notes page-by-page. "
            "Preserve the original layout as much as possible, converting tables to Markdown format."
        )
        response = model.generate_content([uploaded_file, prompt])
        return response.text
    finally:
        # Always remove the uploaded file from Google's file hosting, even if OCR fails
        uploaded_file.delete()
```
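
The trigger condition described earlier (fewer than 50 extracted characters) could be wired up roughly as follows. This is a sketch under assumptions, not the code in `ingestion.py`: the name `extract_text_with_fallback`, the injected `parse_digital` and `ocr_fallback` callables, and the choice to strip whitespace before counting are all hypothetical.

```python
from typing import Callable

MIN_TEXT_CHARS = 50  # threshold quoted in the text; below this, assume a scanned PDF


def extract_text_with_fallback(
    file_path: str,
    parse_digital: Callable[[str], str],
    ocr_fallback: Callable[[str], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Try the cheap digital parser first; fall back to OCR if it finds too little text."""
    text = parse_digital(file_path)
    # Assumption: whitespace-only output should not count toward the threshold.
    if len(text.strip()) >= min_chars:
        return text
    return ocr_fallback(file_path)
```

Passing the two extractors in as functions keeps the decision logic testable without a real PDF or an API key; in the project, `ocr_fallback` would presumably be `ocr_pdf_with_gemini`.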

API fallbacks like this work well for prototypes. Production systems handling millions of pages often use dedicated layout-aware extraction pipelines such as Azure AI Document Intelligence or Unstructured.io, which aim to preserve tables, headings, and multi-column text so that chunking does not fragment them.

What's next
You've seen the two pipelines end to end, but one box in the indexing diagram deserves a closer look: 'Index in Vector DB'. What actually stores those vectors, and how does it find the closest ones out of millions in milliseconds? That's Vector Databases: Your AI's Memory, next.