Production RAG System with Retrieval Evaluation
Build a retrieval-augmented generation system over a real document set: chunking, embeddings, hybrid search with a reranker, grounded answers with citations, and a retrieval + faithfulness evaluation that proves it works.
Project Overview
In this project you'll build a retrieval-augmented generation (RAG) system that answers questions over a document collection you choose, grounded in the source text with citations. Crucially, you won't stop at a demo: you'll build a retrieval and answer-quality evaluation so you can prove the system works and measure every change.
RAG is one of the most commonly requested skills in AI engineering job descriptions. This project mirrors what teams ship in production: ingest documents, retrieve relevant context, generate grounded answers, and continuously evaluate retrieval and faithfulness.
Learning Objectives
- Build the full RAG pipeline: load, chunk, embed, index, retrieve, generate
- Implement and compare keyword (BM25), vector, and hybrid search
- Add a reranker to improve retrieval precision
- Generate grounded answers with citations and a groundedness guardrail
- Build a retrieval evaluation (recall@k, MRR) and an answer evaluation (faithfulness, relevance) with a golden dataset
Prerequisites
- Python 3.10+ and a virtual environment
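- An API key for an LLM provider (OpenAI or Anthropic) and access to an embedding model. If you use Anthropic for generation, plan on a separate embedding provider or a local embedding model.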
- A vector store: start with a local option (Chroma, FAISS, or pgvector); Pinecone or Qdrant optional
- Familiarity with roadmap Steps 5 (Evaluation) and 6 (RAG)
Estimated Time
Duration: 8-12 hours
Difficulty: Intermediate
Dataset Recommendation
Pick a corpus you can ask real questions about. Good options:
- A product's documentation (e.g. a popular open-source library's docs), where questions have verifiable answers
- A set of PDFs: research papers, policy documents, or a company handbook
- Your own notes or a wiki export
Tip: choose a domain where you can write 20-30 question/answer pairs by hand. That hand-written set becomes your golden evaluation dataset, the most valuable artifact in this project.
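As a rough sketch of what one golden entry could look like, the loader below parses JSONL lines into typed records. The field names `question`, `answer`, and `relevant_chunk_ids` and the sample ID `config.md#3` are assumptions for illustration, not a required schema.

```python
import json
from dataclasses import dataclass
from typing import Iterable

REQUIRED_FIELDS = {"question", "answer", "relevant_chunk_ids"}  # hypothetical schema


@dataclass(frozen=True)
class GoldenEntry:
    question: str
    answer: str
    relevant_chunk_ids: tuple[str, ...]


def parse_golden(lines: Iterable[str]) -> list[GoldenEntry]:
    """Parse JSONL lines, skipping blank lines and rejecting incomplete entries."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        if not record["relevant_chunk_ids"]:
            raise ValueError(f"line {line_no}: no relevant chunks listed")
        entries.append(
            GoldenEntry(
                question=record["question"],
                answer=record["answer"],
                relevant_chunk_ids=tuple(record["relevant_chunk_ids"]),
            )
        )
    return entries


# Made-up example entry; in the project this would come from eval/golden.jsonl.
sample_lines = [
    '{"question": "What is the default timeout?", "answer": "30 seconds", '
    '"relevant_chunk_ids": ["config.md#3"]}',
    "",
]
golden = parse_golden(sample_lines)
```

One caveat: chunk IDs change whenever the chunking strategy changes, so if you plan to compare strategies it may be more robust to label relevance by source document and section, then map those labels to chunks at evaluation time.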
Suggested Project Structure
rag-system/
├── ingest/
│   ├── load.py            # load and parse source documents
│   ├── chunk.py           # chunking strategies
│   └── index.py           # embed + write to vector store
├── retrieve/
│   ├── vector.py          # vector search
│   ├── keyword.py         # BM25
│   └── hybrid.py          # fusion + reranker
├── generate/
│   └── answer.py          # prompt construction + grounded generation
├── eval/
│   ├── golden.jsonl       # hand-written question/answer/relevant-docs set
│   ├── retrieval_eval.py
│   └── answer_eval.py     # Ragas or custom LLM-as-judge
├── app.py                 # CLI or simple UI
├── README.md
└── .env.example
Step-by-Step Guide
1. Ingest and index
- Load your documents and split them into chunks. Implement at least two chunking strategies (fixed-size with overlap, and structure-aware by heading/paragraph) so you can compare them later.
- Embed each chunk with an embedding model and store vectors plus metadata (source, section, position) in your vector store.
- Keep the ingest step idempotent and re-runnable.
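The two chunking strategies and the idempotency requirement might be sketched as follows. This assumes character-based windows and Markdown-style `#` headings; the function names and default sizes are illustrative, and a real corpus (PDFs, HTML) would need its own structure detection.

```python
import hashlib
import re


def chunk_fixed(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character windows; each window starts `size - overlap` chars after the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text would only repeat the previous window's overlap.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def chunk_by_heading(markdown: str) -> list[str]:
    """Structure-aware split: start a new chunk at every Markdown heading line."""
    parts = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
    return [part.strip() for part in parts if part.strip()]


def chunk_id(source: str, text: str) -> str:
    """Deterministic ID from source + content.

    Re-ingesting unchanged content produces the same ID, so writing with an
    upsert overwrites the existing record instead of creating a duplicate.
    """
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return f"{source}:{digest[:16]}"


doc = "# Install\nRun pip.\n## Configure\nSet the key."
sections = chunk_by_heading(doc)
windows = chunk_fixed("abcdefghij", size=4, overlap=1)
ids = [chunk_id("guide.md", section) for section in sections]
```

Content-derived IDs are one way to get idempotency; they do mean an edited chunk gets a new ID, so a re-run should also delete IDs that no longer appear for a given source.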
2. Retrieval: build three retrievers
- Vector search: embed the query, return top-k nearest chunks.
- Keyword search (BM25): lexical retrieval for exact terms and identifiers.
- Hybrid: fuse vector and keyword results (e.g. reciprocal rank fusion), then apply a reranker (a cross-encoder) to reorder the top candidates.
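The hybrid step can be sketched as reciprocal rank fusion followed by a rerank pass. The constant `k=60` is a commonly used default rather than a tuned value, and `overlap_score` is only a toy stand-in for a cross-encoder so the example runs without a model; in the project you would swap in a real reranker's scoring function.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def rerank(
    query: str,
    candidates: dict[str, str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[str]:
    """Reorder candidate chunks by a (query, text) relevance score."""
    ranked = sorted(candidates, key=lambda cid: score(query, candidates[cid]), reverse=True)
    return ranked[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer (shared-token count); a stand-in for a cross-encoder."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


vector_hits = ["c2", "c7", "c1"]   # made-up results from the vector retriever
keyword_hits = ["c7", "c9", "c2"]  # made-up results from the BM25 retriever
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])

texts = {
    "c7": "rotate the api key monthly",
    "c2": "install with pip",
    "c9": "api key rotation policy",
}
reranked = rerank("rotate api key", texts, overlap_score, top_n=2)
```

Rank fusion avoids having to normalize BM25 scores against cosine similarities, since it only looks at positions in each list.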
3. Generation with grounding
- Construct the prompt: system instructions + retrieved chunks + the question.
- Require the model to cite which chunks it used and to say "I don't know" when the context doesn't contain the answer.
- Add a groundedness check: a guardrail that flags answers not supported by the retrieved context.
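A minimal sketch of the prompt construction and a first-pass guardrail follows. The bracketed `[n]` citation format, the system wording, and the function names are assumptions for illustration. The check here is structural only (it verifies that citations exist and point at real chunks); confirming that the cited text actually supports each claim still needs a semantic check such as an LLM judge.

```python
import re

SYSTEM = (
    "Answer using only the numbered context chunks. "
    "Cite the chunks you used as [n]. "
    "If the context does not contain the answer, reply exactly: I don't know."
)


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble system instructions + numbered (source, text) chunks + the question."""
    context = "\n\n".join(
        f"[{i}] (source: {source})\n{text}"
        for i, (source, text) in enumerate(chunks, start=1)
    )
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def cited_indices(answer: str) -> set[int]:
    """Extract every [n] citation marker from the answer."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}


def check_citations(answer: str, n_chunks: int) -> list[str]:
    """Return a list of problems; an empty list means the structural check passed."""
    if answer.strip().lower().startswith("i don't know"):
        return []  # an explicit abstention needs no citations
    problems = []
    cited = cited_indices(answer)
    if not cited:
        problems.append("no citations")
    invalid = sorted(i for i in cited if not 1 <= i <= n_chunks)
    if invalid:
        problems.append(f"cites unknown chunks: {invalid}")
    return problems


chunks = [("config.md", "The default timeout is 30 seconds."), ("faq.md", "Retries are off by default.")]
prompt = build_prompt("What is the default timeout?", chunks)
```

Numbering chunks in the prompt also gives you a cheap way to map each citation back to its source metadata when rendering the answer.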
4. Evaluate retrieval
- Using your golden set (each entry: question + the chunk(s) that contain the answer), compute:
  - Recall@k: did the answer-bearing chunk make the top k?
  - MRR: how highly ranked was the first relevant chunk?
  - Precision@k: what fraction of the top k chunks were relevant?
- Compare vector vs keyword vs hybrid, and the effect of the reranker. Record the numbers.
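The retrieval metrics are short enough to write by hand. The sketch below uses the standard definitions; with a single relevant chunk per question, recall@k reduces to the hit-or-miss reading given above. The `evaluate` helper and its input shape (a list of `(retrieved_ids, relevant_ids)` pairs) are assumptions for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k."""
    if not relevant:
        raise ValueError("each golden entry needs at least one relevant chunk")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k slots filled by relevant chunks."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average each metric over all (retrieved, relevant) pairs."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        f"precision@{k}": sum(precision_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Two made-up golden questions: one hit at rank 2, one miss.
runs = [
    (["c3", "c1", "c8"], {"c1"}),
    (["c4", "c5", "c6"], {"c9"}),
]
metrics = evaluate(runs, k=3)
```

Running `evaluate` once per retriever over the same golden set gives you the comparison table directly.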
5. Evaluate answers
- For each golden question, score the generated answer on:
  - Faithfulness / groundedness: is it supported by retrieved context?
  - Answer relevance: does it address the question?
- Use Ragas or a custom LLM-as-a-judge with a clear rubric. Spot-check the judge against your own labels on a sample.
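If you go the custom-judge route, one possible shape is a rubric prompt, a strict parser for the verdict, and an agreement score against your own labels. Everything here is a sketch under assumptions: the rubric wording, the JSON verdict format, and `call_llm` (any function that takes a prompt string and returns the model's text) are hypothetical, and `fake_llm` is a stub so the example runs without an API key.

```python
import json
from typing import Callable

RUBRIC = (
    "You are grading a RAG answer for faithfulness.\n"
    "Mark it faithful only if every claim in the answer is supported by the context.\n"
    'Reply with JSON only: {"faithful": true or false, "reason": "<one sentence>"}'
)


def judge_faithfulness(
    question: str, context: str, answer: str, call_llm: Callable[[str], str]
) -> dict:
    """Ask the judge model for a verdict and validate its shape."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\n\nContext:\n{context}\n\nAnswer:\n{answer}"
    verdict = json.loads(call_llm(prompt))
    if not isinstance(verdict, dict) or not isinstance(verdict.get("faithful"), bool):
        raise ValueError("judge reply did not contain a boolean 'faithful' field")
    return verdict


def agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of spot-checked items where the judge matches the human label."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call."""
    return '{"faithful": true, "reason": "stub verdict"}'


verdict = judge_faithfulness("What is the timeout?", "Timeout is 30s.", "30 seconds [1].", fake_llm)
spot_check = agreement([True, False, True, True], [True, True, True, False])
```

A low agreement score on the spot-check is a signal to tighten the rubric before trusting the judge's numbers.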
6. Optimize and re-measure
- Change one variable at a time (chunk size, k, hybrid weights, reranker on/off) and re-run the eval. Keep what improves the numbers. This is the core RAG engineering loop.
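One way to keep experiments honest is to generate the variants mechanically, so each run differs from the baseline in exactly one setting. The configuration keys and values below are made up for illustration; the helper is a sketch, not a prescribed harness.

```python
def one_at_a_time(baseline: dict, alternatives: dict[str, list]) -> list[dict]:
    """Build configs that each change exactly one setting relative to the baseline."""
    variants = []
    for name, values in alternatives.items():
        if name not in baseline:
            raise KeyError(f"unknown setting: {name}")
        for value in values:
            if value != baseline[name]:
                variants.append({**baseline, name: value})
    return variants


def changed_setting(baseline: dict, variant: dict) -> str:
    """Describe the single difference, for use as a row label in the results table."""
    diffs = [key for key in baseline if variant[key] != baseline[key]]
    if len(diffs) != 1:
        raise ValueError("variant must differ from the baseline in exactly one setting")
    key = diffs[0]
    return f"{key}: {baseline[key]} -> {variant[key]}"


# Hypothetical settings; use whatever knobs your pipeline exposes.
baseline = {"chunk_size": 800, "k": 5, "reranker": False}
alternatives = {"chunk_size": [400, 800, 1200], "reranker": [True]}

variants = one_at_a_time(baseline, alternatives)
labels = [changed_setting(baseline, variant) for variant in variants]
```

Each label pairs naturally with the metrics from a re-run of the eval, which is most of the before/after table.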
Deliverables
- A working RAG system (CLI or simple UI) that answers questions with citations
- Three retrievers (vector, keyword, hybrid + reranker) behind a common interface
- A golden dataset of 20-30 hand-written question/answer/relevant-doc entries
- An evaluation report (in the README) with:
  - Retrieval metrics (recall@k, MRR) across the three retrievers
  - Answer metrics (faithfulness, relevance)
  - A before/after table for at least one optimization you made
- A README.md with an architecture diagram, the eval results, and a trade-off discussion
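For the "common interface" deliverable, one option is a `typing.Protocol` that all three retrievers satisfy, so the evaluation code never needs to know which one it is calling. The names `Hit`, `Retriever`, and `KeywordOverlapRetriever` are hypothetical, and the toy retriever uses plain token overlap as a stand-in for BM25.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float


class Retriever(Protocol):
    """Anything with this method can be evaluated by the same harness."""

    def retrieve(self, query: str, k: int) -> list[Hit]: ...


class KeywordOverlapRetriever:
    """Toy lexical retriever (shared-token count); a stand-in for BM25."""

    def __init__(self, chunks: dict[str, str]) -> None:
        self.chunks = chunks

    def retrieve(self, query: str, k: int) -> list[Hit]:
        query_tokens = set(query.lower().split())
        hits = [
            Hit(cid, float(len(query_tokens & set(text.lower().split()))))
            for cid, text in self.chunks.items()
        ]
        hits = [hit for hit in hits if hit.score > 0]
        return sorted(hits, key=lambda hit: (-hit.score, hit.chunk_id))[:k]


def top_ids(retriever: Retriever, query: str, k: int = 5) -> list[str]:
    """Harness-side helper: works with any object that satisfies Retriever."""
    return [hit.chunk_id for hit in retriever.retrieve(query, k)]


corpus = {"c1": "install with pip", "c2": "rotate the api key", "c3": "api key limits"}
result = top_ids(KeywordOverlapRetriever(corpus), "api key rotate", k=2)
```

Because `Protocol` uses structural typing, the vector and hybrid retrievers only need a matching `retrieve` method; they do not have to inherit from anything.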
Optional Extensions
- Add query rewriting (multi-query or HyDE) and measure the retrieval lift
- Add metadata filtering (e.g. by date or source) and show it reduces irrelevant retrieval
- Add a semantic cache for repeated questions and measure latency/cost savings
- Add prompt-injection defenses for untrusted documents (treat retrieved content as untrusted)
- Run the answer eval on a sample of live queries to simulate production monitoring
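As a starting point for the semantic cache extension, the sketch below stores answers keyed by query embedding and returns a cached answer when a new query is close enough. The `SemanticCache` class, the linear scan, and the 0.95 threshold are assumptions for illustration; the vectors are made-up two-dimensional examples, and in the project they would come from your embedding model.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Linear-scan cache: fine for a demo, too slow for a large number of entries."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def put(self, query_vec: list[float], answer: str) -> None:
        self.entries.append((query_vec, answer))

    def get(self, query_vec: list[float]) -> Optional[str]:
        """Return the answer of the most similar cached query, if similar enough."""
        best = max(self.entries, key=lambda entry: cosine(query_vec, entry[0]), default=None)
        if best is not None and cosine(query_vec, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "The default timeout is 30 seconds [1].")
near_hit = cache.get([0.99, 0.05])  # nearly the same direction as the cached query
miss = cache.get([0.0, 1.0])        # orthogonal to the cached query
```

The threshold is the main trade-off: too low and users get answers to a different question, too high and the cache rarely hits, so it is worth measuring both hit rate and answer correctness on cached responses.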