📚 Project: Production RAG System with Retrieval Evaluation

📌 Project Overview

In this project you'll build a retrieval-augmented generation (RAG) system that answers questions over a document collection you choose, grounded in the source text with citations. Crucially, you won't stop at a demo: you'll build a retrieval and answer-quality evaluation so you can prove the system works and measure every change.

RAG appears frequently in AI engineer job descriptions. This project mirrors what teams ship in production: ingest documents, retrieve relevant context, generate grounded answers, and continuously evaluate retrieval and faithfulness.

🎯 Learning Objectives

Build the full RAG pipeline: load, chunk, embed, index, retrieve, generate
Implement and compare keyword (BM25), vector, and hybrid search
Add a reranker to improve retrieval precision
Generate grounded answers with citations and a groundedness guardrail
Build a retrieval evaluation (recall@k, MRR) and an answer evaluation (faithfulness, relevance) with a golden dataset

🧰 Prerequisites

Python 3.10+ and a virtual environment
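An API key for OpenAI or Anthropic for the LLM, plus access to an embedding model (Anthropic does not offer a first-party embeddings API, so pair it with OpenAI embeddings or a local embedding model)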
A vector store: start with a local option (Chroma, FAISS, or pgvector); Pinecone or Qdrant optional
Familiarity with roadmap Steps 5 (Evaluation) and 6 (RAG)

⌛ Estimated Time

Duration: 8–12 hours
Difficulty: Intermediate

📊 Dataset Recommendation

Pick a corpus you can ask real questions about. Good options:

A product's documentation (e.g. a popular open-source library's docs) — questions have verifiable answers
A set of PDFs: research papers, policy documents, or a company handbook
Your own notes or a wiki export

Tip: choose a domain where you can write 20-30 question/answer pairs by hand. That hand-written set becomes your golden evaluation dataset, the most valuable artifact in this project.
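
One possible shape for the golden set is one JSON object per line. The sketch below is only an illustration: the field names (`question`, `answer`, `relevant_chunk_ids`) and the two sample entries are made up for this example, so adapt them to your corpus.

```python
import json

# Hypothetical schema and made-up sample entries, for illustration only.
EXAMPLE_LINES = """\
{"question": "How do I reset my password?", "answer": "Use the account settings page.", "relevant_chunk_ids": ["handbook.md#3"]}
{"question": "What is the refund window?", "answer": "30 days from purchase.", "relevant_chunk_ids": ["policy.pdf#12", "policy.pdf#13"]}
"""

REQUIRED_FIELDS = {"question", "answer", "relevant_chunk_ids"}


def parse_golden(text: str) -> list[dict]:
    """Parse JSONL text and fail loudly on entries with missing fields."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        entries.append(entry)
    return entries


golden = parse_golden(EXAMPLE_LINES)
```

Validating the file on load catches malformed entries before they silently skew your metrics.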

📦 Suggested Project Structure

rag-system/
├── ingest/
│   ├── load.py          # load and parse source documents
│   ├── chunk.py         # chunking strategies
│   └── index.py         # embed + write to vector store
├── retrieve/
│   ├── vector.py        # vector search
│   ├── keyword.py       # BM25
│   └── hybrid.py        # fusion + reranker
├── generate/
│   └── answer.py        # prompt construction + grounded generation
├── eval/
│   ├── golden.jsonl     # hand-written question/answer/relevant-docs set
│   ├── retrieval_eval.py
│   └── answer_eval.py   # Ragas or custom LLM-as-judge
├── app.py               # CLI or simple UI
├── README.md
└── .env.example

🔄 Step-by-Step Guide

1. 🧱 Ingest and index

Load your documents and split them into chunks. Implement at least two chunking strategies (fixed-size with overlap, and structure-aware by heading/paragraph) so you can compare them later.
Embed each chunk with an embedding model and store vectors plus metadata (source, section, position) in your vector store.
Keep the ingest step idempotent and re-runnable.
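
A minimal sketch of the fixed-size strategy, plus one way to make ingestion re-runnable. It assumes character-based sizes (you may prefer tokens) and assumes your vector store can upsert by ID; `chunk_fixed` and `chunk_id` are hypothetical helper names.

```python
import hashlib


def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def chunk_id(source: str, position: int, chunk: str) -> str:
    """Derive a stable ID from source, position, and content.

    Re-ingesting unchanged documents yields the same IDs, so an upsert
    by ID overwrites existing entries instead of duplicating them.
    """
    payload = f"{source}\n{position}\n{chunk}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


pieces = chunk_fixed("abcdefghij", size=4, overlap=2)
ids = [chunk_id("demo.txt", i, piece) for i, piece in enumerate(pieces)]
```

The structure-aware strategy can reuse `chunk_id` unchanged; only the splitting function differs, which keeps the later comparison fair.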

2. 🔍 Retrieval: build three retrievers

Vector search: embed the query, return top-k nearest chunks.
Keyword search (BM25): lexical retrieval for exact terms and identifiers.
Hybrid: fuse vector and keyword results (e.g. reciprocal rank fusion), then apply a reranker (a cross-encoder) to reorder the top candidates.
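
Reciprocal rank fusion can be sketched in a few lines. Each retriever contributes `1 / (k + rank)` for every chunk it returns, and chunks are sorted by the summed score. The constant `k = 60` is a commonly used default, not a requirement, and the chunk IDs below are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    `rankings` holds one list per retriever, best result first.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # placeholder IDs from vector search
keyword_hits = ["b", "d", "a"]  # placeholder IDs from BM25
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Because the fusion uses ranks only, you do not need to normalize BM25 scores against cosine similarities. The reranker then reorders the top of `fused`.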

3. 🧠 Generation with grounding

Construct the prompt: system instructions + retrieved chunks + the question.
Require the model to cite which chunks it used and to say "I don't know" when the context doesn't contain the answer.
Add a groundedness check: a guardrail that flags answers not supported by the retrieved context.
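
The sketch below shows one way to assemble the prompt and run a cheap structural check on citations. It assumes chunks are dicts with `source` and `text` keys and that the model cites chunks as `[1]`, `[2]`, and so on; the prompt wording and function names are illustrative. This check only verifies that citations exist and point at real chunks, so it complements a real groundedness check rather than replacing it.

```python
import re

# Illustrative wording; tune it for your model and corpus.
SYSTEM_PROMPT = (
    "Answer using only the numbered context chunks. "
    "Cite the chunks you used like [1]. "
    "If the context does not contain the answer, reply: I don't know."
)


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Combine system instructions, numbered chunks, and the question."""
    context = "\n\n".join(
        f"[{i}] (source: {chunk['source']})\n{chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


def cited_indices(answer: str, n_chunks: int) -> tuple[set[int], set[int]]:
    """Return (valid, invalid) citation numbers found in the answer."""
    found = {int(match) for match in re.findall(r"\[(\d+)\]", answer)}
    valid = {i for i in found if 1 <= i <= n_chunks}
    return valid, found - valid


def needs_review(answer: str, n_chunks: int) -> bool:
    """Flag answers with no valid citation or with out-of-range citations."""
    if "i don't know" in answer.lower():
        return False  # an explicit refusal needs no citation
    valid, invalid = cited_indices(answer, n_chunks)
    return not valid or bool(invalid)


chunks = [
    {"source": "a.md", "text": "Paris is the capital of France."},
    {"source": "b.md", "text": "Berlin is the capital of Germany."},
]
prompt = build_prompt("What is the capital of France?", chunks)
```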

4. 📏 Evaluate retrieval

Using your golden set (each entry: question + the chunk(s) that contain the answer), compute:
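- Recall@k: did the answer-bearing chunk make the top k? (With several relevant chunks, what fraction of them made the top k?)
- MRR: how highly ranked was the first relevant chunk?
- Precision@k (optional): what fraction of the top k chunks are relevant?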
Compare vector vs keyword vs hybrid, and the effect of the reranker. Record the numbers.
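
A minimal sketch of the two core retrieval metrics, assuming each result is a pair of the ranked chunk IDs a retriever returned and the set of relevant chunk IDs from the golden set. The function names and the toy data are illustrative.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(results: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average recall@k and MRR over all golden questions."""
    if not results:
        raise ValueError("no results to evaluate")
    n = len(results)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in results) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
    }


# Toy data: the first query finds its chunk at rank 1, the second misses.
toy_results = [(["a", "b"], {"a"}), (["x", "y"], {"z"})]
report = evaluate(toy_results, k=2)
```

Run `evaluate` once per retriever on the same golden set so the numbers are directly comparable.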

5. 🧪 Evaluate answers

For each golden question, score the generated answer on:
- Faithfulness / groundedness: is it supported by retrieved context?
- Answer relevance: does it address the question?
Use Ragas or a custom LLM-as-a-judge with a clear rubric. Spot-check the judge against your own labels on a sample.
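
To spot-check the judge, one simple approach is to label a sample yourself and compare verdicts. The sketch below assumes binary pass/fail labels for faithfulness. `judge_agreement` is a hypothetical helper, and the labels are toy values.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with your own labels on the same answers.

    True means "faithful". `false_pass` counts answers the judge accepted
    that you rejected, which is usually the riskier kind of disagreement.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    return {
        "agreement": sum(h == j for h, j in pairs) / len(pairs),
        "false_pass": sum((not h) and j for h, j in pairs),
        "false_fail": sum(h and (not j) for h, j in pairs),
    }


# Toy labels for four answers.
summary = judge_agreement(
    human=[True, True, False, False],
    judge=[True, False, True, False],
)
```

If agreement is low, tighten the rubric before trusting any answer metric the judge produces.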

6. 🔧 Optimize and re-measure

Change one variable at a time (chunk size, k, hybrid weights, reranker on/off) and re-run the eval. Keep what improves the numbers. This is the core RAG engineering loop.
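
The loop above can be sketched as a small sweep helper. It assumes you have a `run_eval(config)` function that rebuilds what it needs and returns a metrics dict. That function, the config keys, and the stub used here are all hypothetical.

```python
from typing import Callable


def sweep_one_variable(
    baseline: dict,
    name: str,
    values: list,
    run_eval: Callable[[dict], dict],
) -> list[dict]:
    """Vary a single setting while holding the rest of the baseline fixed."""
    rows = []
    for value in values:
        config = {**baseline, name: value}  # copy, so the baseline is untouched
        metrics = run_eval(config)
        rows.append({"variable": name, "value": value, **metrics})
    return rows


def stub_eval(config: dict) -> dict:
    """Stand-in for a real evaluation run; returns a fake metric."""
    return {"recall@5": config["chunk_size"] / 1000}


baseline = {"chunk_size": 500, "k": 5}
rows = sweep_one_variable(baseline, "chunk_size", [250, 500], stub_eval)
```

Each row maps directly onto a line of the before/after table in your evaluation report.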

✅ Deliverables

A working RAG system (CLI or simple UI) that answers questions with citations
Three retrievers (vector, keyword, hybrid + reranker) behind a common interface
A golden dataset of 20-30 hand-written question/answer/relevant-doc entries
An evaluation report (in the README) with:
- Retrieval metrics (recall@k, MRR) across the three retrievers
- Answer metrics (faithfulness, relevance)
- A before/after table for at least one optimization you made
A README.md with an architecture diagram, the eval results, and a trade-off discussion

🚀 Optional Extensions

Add query rewriting (multi-query or HyDE) and measure the retrieval lift
Add metadata filtering (e.g. by date or source) and show it reduces irrelevant retrieval
Add a semantic cache for repeated questions and measure latency/cost savings
Add prompt-injection defenses for untrusted documents (treat retrieved content as untrusted)
Run the answer eval on a sample of live queries to simulate production monitoring
