The RAG Pipeline: Index, Retrieve, Generate
Walk through every stage of a working RAG system with clear diagrams and real examples.
A production-grade RAG application separates concerns into two distinct workflows: Ingestion & Indexing (an offline preprocessing pipeline that structures documents) and Retrieval & Generation (an online, real-time loop executed for every user question).
Think of a library that operates in two shifts. Overnight, a small army of librarians (the ingestion pipeline) receives every new book, slices it into indexed reference cards, and files those cards into the card catalog. Nobody is browsing at 3am, so this shift can take its time and do the work carefully. During the day, a visitor (the user) walks up with a question. A front-desk librarian (the retriever) does not reread every book in the building. They consult the card catalog built overnight, pull the 3-5 most relevant cards, and hand them to a specialist (the LLM) who reads just those cards and answers the question. Two completely different workflows, running on completely different schedules, working together.
Ingestion and Parsing: In the offline ingestion phase, documents (PDFs, text files, Markdown) are read and parsed. Extracting text from digital PDFs is straightforward, but scanned documents and image-based files need an Optical Character Recognition (OCR) pipeline; without one, the parser returns blank strings and nothing useful gets indexed. Once parsed, the text is sliced into chunks, converted to vector embeddings, and saved in a database.
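To make the "sliced into chunks" step concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes character-based windows, and the function name `chunk_text` and the `chunk_size` and `overlap` defaults are illustrative choices, not values from a specific library. Real pipelines often split on sentences or tokens instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    text = text.strip()
    if not text:
        # A blank parse (for example, a scanned PDF with no OCR) yields no chunks.
        return []

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a window boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.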
Retrieval and Generation: During the online runtime phase, a user query is converted into a vector using the same embedding model that was used at indexing time. The database retrieves the closest matching chunks. These chunks are inserted as context into a structured prompt template and sent to the LLM, which synthesizes a grounded answer.
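The retrieval loop can be sketched end to end in plain Python. This is an illustration under loud assumptions: `toy_embed` is a bag-of-words stand-in for a real embedding model, an in-memory list stands in for the database, and the prompt wording in `build_prompt` is hypothetical. Only the shape of the flow (embed the query, rank by similarity, fill a template) reflects the description above.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: sparse word counts. NOT a real embedding model."""
    return dict(Counter(re.findall(r"[a-z0-9]+", text.lower())))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(chunks: List[str]) -> List[Tuple[str, Vector]]:
    """Offline step: embed every chunk once and keep the vectors."""
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def retrieve(query: str, index: List[Tuple[str, Vector]], k: int = 3) -> List[str]:
    """Online step: embed the query with the same function and return the top-k chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, context_chunks: List[str]) -> str:
    """Fill a simple prompt template with numbered context chunks."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

The string returned by `build_prompt` is what would be sent to the LLM. Note that the query and the chunks pass through the same embedding function; mixing models between indexing and querying makes the similarity scores meaningless.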
The OCR fallback for scanned PDFs works in three steps:
1. It uploads the document to the Google Files API.
2. It calls Gemini 1.5 Flash to perform layout-aware OCR, converting tables to Markdown.
3. It deletes the temporary file from the Files API; the resulting text is then indexed.
import os

import google.generativeai as genai


def ocr_pdf_with_gemini(file_path: str) -> str:
    """Upload a scanned PDF to Google's Files API and extract text using Gemini."""
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("GEMINI_API_KEY is not set. Cannot run OCR fallback.")
    # Configure the SDK explicitly with the key checked above
    genai.configure(api_key=api_key)

    # Upload the file to the Gemini Files API
    uploaded_file = genai.upload_file(path=file_path)
    try:
        model = genai.GenerativeModel("gemini-1.5-flash-latest")
        prompt = (
            "Perform OCR on this document. Transcribe all text, tables, and handwritten notes page-by-page. "
            "Preserve the original layout as much as possible, converting tables to Markdown format."
        )
        response = model.generate_content([uploaded_file, prompt])
        return response.text
    finally:
        # Always remove the uploaded file from Google's file hosting, even if generation fails
        uploaded_file.delete()

While API fallbacks are great for prototypes, production systems handling millions of pages often use layout-aware document extraction pipelines like Azure AI Document Intelligence or Unstructured.io, which keep tables and multi-column layouts intact instead of fragmenting them across chunks.
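One way to wire an OCR fallback into ingestion is to try ordinary text extraction first and call OCR only when the result is effectively empty. The sketch below is an assumption-laden illustration: `needs_ocr`, `extract_text`, and the `min_chars_per_page` threshold are hypothetical names and values, and the two extractors are passed in as callables so the routing logic does not depend on any particular PDF or OCR library.

```python
from typing import Callable, Tuple


def needs_ocr(extracted_text: str, page_count: int, min_chars_per_page: int = 20) -> bool:
    """Heuristic: treat a PDF as scanned when its text layer is empty or nearly empty."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only non-whitespace characters so pages of blank lines do not pass.
    visible_chars = len("".join(extracted_text.split()))
    return visible_chars < min_chars_per_page * page_count


def extract_text(
    file_path: str,
    parse_pdf: Callable[[str], Tuple[str, int]],  # hypothetical: returns (text, page_count)
    ocr_pdf: Callable[[str], str],                # e.g. an OCR function such as the one above
) -> str:
    """Return parsed text, falling back to OCR when the parse looks blank."""
    text, page_count = parse_pdf(file_path)
    if needs_ocr(text, page_count):
        return ocr_pdf(file_path)
    return text
```

In use, `ocr_pdf` could be `ocr_pdf_with_gemini`, and `parse_pdf` a thin wrapper around whichever PDF parser the pipeline already uses. The threshold is a tuning knob: set too low, scanned pages with a few stray characters slip through; set too high, sparse but valid pages trigger unnecessary OCR calls.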