Production RAG System with Retrieval Evaluation

    Build a retrieval-augmented generation system over a real document set: chunking, embeddings, hybrid search with a reranker, grounded answers with citations, and a retrieval + faithfulness evaluation that proves it works.


    This project was designed by data engineering professionals to simulate real-world scenarios like those at companies such as Netflix, Airbnb, and Spotify. Master Python, the OpenAI API, a vector database, LangChain, and Ragas through hands-on implementation. Rated intermediate level, with comprehensive documentation and starter code.

    Intermediate
    8-12 hours

    Project: Production RAG System with Retrieval Evaluation

    Project Overview

    In this project you'll build a retrieval-augmented generation (RAG) system that answers questions over a document collection you choose, grounded in the source text with citations. Crucially, you won't stop at a demo: you'll build a retrieval and answer-quality evaluation so you can prove the system works and measure every change.

    RAG is one of the most frequently requested skills in AI engineer job descriptions. This project mirrors what teams ship in production: ingest documents, retrieve relevant context, generate grounded answers, and continuously evaluate retrieval and faithfulness.


    Learning Objectives

    • Build the full RAG pipeline: load, chunk, embed, index, retrieve, generate
    • Implement and compare keyword (BM25), vector, and hybrid search
    • Add a reranker to improve retrieval precision
    • Generate grounded answers with citations and a groundedness guardrail
    • Build a retrieval evaluation (recall@k, MRR) and an answer evaluation (faithfulness, relevance) with a golden dataset

    Prerequisites

    • Python 3.10+ and a virtual environment
    • An API key for OpenAI or Anthropic (for the LLM and embeddings)
    • A vector store: start with a local option (Chroma, FAISS, or pgvector); Pinecone or Qdrant optional
    • Familiarity with roadmap Steps 5 (Evaluation) and 6 (RAG)

    Estimated Time

    Duration: 8-12 hours
    Difficulty: Intermediate


    Dataset Recommendation

    Pick a corpus you can ask real questions about. Good options:

    • A product's documentation (e.g. a popular open-source library's docs), where questions have verifiable answers
    • A set of PDFs: research papers, policy documents, or a company handbook
    • Your own notes or a wiki export

    Tip: choose a domain where you can write 20-30 question/answer pairs by hand. That hand-written set becomes your golden evaluation dataset, the most valuable artifact in this project.
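One possible shape for the golden set is a JSON Lines file with one entry per line. The sketch below parses such entries and rejects any that list no relevant chunks. The field names (`question`, `answer`, `relevant_chunk_ids`), the chunk ID format, and the sample question are all assumptions made for illustration, not a schema this project prescribes.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenEntry:
    question: str
    answer: str
    relevant_chunk_ids: tuple[str, ...]


def parse_golden(lines):
    """Parse JSON Lines text into GoldenEntry objects, skipping blank lines."""
    entries = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        ids = tuple(record["relevant_chunk_ids"])
        if not ids:
            # An entry without relevant chunks cannot be scored for retrieval.
            raise ValueError(f"line {line_no}: no relevant chunks listed")
        entries.append(GoldenEntry(record["question"], record["answer"], ids))
    return entries


# Hypothetical sample entry; write your own from your corpus.
SAMPLE = [
    '{"question": "How do I enable retries?", '
    '"answer": "Set max_retries.", '
    '"relevant_chunk_ids": ["docs/client.md#3"]}',
    "",
]
golden = parse_golden(SAMPLE)
```

In practice the lines would come from a file such as `eval/golden.jsonl`, for example by passing an open file handle to `parse_golden`.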


    Suggested Project Structure

    rag-system/
    ├── ingest/
    │   ├── load.py          # load and parse source documents
    │   ├── chunk.py         # chunking strategies
    │   └── index.py         # embed + write to vector store
    ├── retrieve/
    │   ├── vector.py        # vector search
    │   ├── keyword.py       # BM25
    │   └── hybrid.py        # fusion + reranker
    ├── generate/
    │   └── answer.py        # prompt construction + grounded generation
    ├── eval/
    │   ├── golden.jsonl     # hand-written question/answer/relevant-docs set
    │   ├── retrieval_eval.py
    │   └── answer_eval.py   # Ragas or custom LLM-as-judge
    ├── app.py               # CLI or simple UI
    ├── README.md
    └── .env.example
    

    Step-by-Step Guide

    1. Ingest and index

    • Load your documents and split them into chunks. Implement at least two chunking strategies (fixed-size with overlap, and structure-aware by heading/paragraph) so you can compare them later.
    • Embed each chunk with an embedding model and store vectors plus metadata (source, section, position) in your vector store.
    • Keep the ingest step idempotent and re-runnable.
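The two chunking strategies and the idempotency requirement above can be sketched as follows. This is a minimal, character-based illustration: the default sizes are arbitrary assumptions, the structure-aware variant only splits on blank lines (a real one would likely also respect headings), and `chunk_id` is a hypothetical helper that derives a stable ID from content so that re-running ingest can overwrite existing records, assuming your vector store supports upserts by ID.

```python
import hashlib


def chunk_fixed(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character windows; consecutive windows share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs (split on blank lines) into chunks up to max_chars.

    A single paragraph longer than max_chars is kept whole, not split.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_id(source: str, text: str) -> str:
    """Stable ID: the same source and text always map to the same ID."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:16]
```

Because the ID depends only on the source and the chunk text, unchanged chunks keep their IDs across runs, which is one simple way to make ingest re-runnable without creating duplicates.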

    2. Retrieval: build three retrievers

    • Vector search: embed the query, return top-k nearest chunks.
    • Keyword search (BM25): lexical retrieval for exact terms and identifiers.
    • Hybrid: fuse vector and keyword results (e.g. reciprocal rank fusion), then apply a reranker (a cross-encoder) to reorder the top candidates.
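Reciprocal rank fusion, mentioned above as one fusion option, fits in a few lines. Each ranked list contributes 1 / (k + rank) to a chunk's score, so chunks that rank well in both lists rise to the top without any need to normalize the raw vector and BM25 scores. The constant `k = 60` is a commonly used default rather than something this project prescribes, and the chunk IDs are placeholders. The cross-encoder reranking step is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranked list.

    rankings: iterable of lists, each ordered best-first.
    k: smoothing constant (60 is a common default, assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk in enumerate(ranking, start=1):
            scores[chunk] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda chunk: (-scores[chunk], chunk))


# Placeholder results from the two base retrievers.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

The fused list would then be truncated to a candidate set and passed to the reranker.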

    3. Generation with grounding

    • Construct the prompt: system instructions + retrieved chunks + the question.
    • Require the model to cite which chunks it used and to say "I don't know" when the context doesn't contain the answer.
    • Add a groundedness check: a guardrail that flags answers not supported by the retrieved context.
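A minimal sketch of the prompt construction and a first-line guardrail, under stated assumptions: the system text, the `[n]` citation format, and the function names are all hypothetical. Note that `citation_problems` only checks citation structure (that the answer cites something, and that every cited number refers to a chunk that was actually supplied). It does not verify that the cited text supports the claim, which still needs an LLM judge or an entailment check.

```python
import re

# Hypothetical system instructions; adapt the wording to your model.
SYSTEM = (
    "Answer using only the numbered context below. "
    "Cite the chunks you used as [n]. "
    "If the context does not contain the answer, reply: I don't know."
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """System instructions + numbered retrieved chunks + the question."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion: {question}"


def cited_indices(answer: str) -> set[int]:
    """Extract the chunk numbers cited as [n] in the answer."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}


def citation_problems(answer: str, n_chunks: int) -> list[str]:
    """Structural guardrail: flag missing or out-of-range citations."""
    if "i don't know" in answer.lower():
        return []  # an explicit refusal needs no citation
    problems = []
    cited = cited_indices(answer)
    if not cited:
        problems.append("no citations")
    out_of_range = sorted(i for i in cited if not 1 <= i <= n_chunks)
    if out_of_range:
        problems.append(f"unknown citations: {out_of_range}")
    return problems
```

An answer that comes back with a non-empty problem list can be flagged, retried, or replaced with a refusal, depending on how strict the application needs to be.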

    4. Evaluate retrieval

    • Using your golden set (each entry: question + the chunk(s) that contain the answer), compute:
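      • Recall@k: what fraction of the answer-bearing chunks made the top k? (With one relevant chunk per question, this is simply whether it made the top k.)
      • MRR: how highly ranked was the first relevant chunk?
      • Precision@k: what fraction of the top k results were relevant?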
    • Compare vector vs keyword vs hybrid, and the effect of the reranker. Record the numbers.
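The retrieval metrics above can be computed with a few small functions. This sketch assumes each golden entry provides a set of relevant chunk IDs and each retriever returns a best-first list of chunk IDs; the function names and the sample IDs are illustrative only.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden entry needs at least one relevant chunk")
    return len(relevant & set(retrieved[:k])) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk in enumerate(retrieved, start=1):
        if chunk in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(runs, k: int) -> dict[str, float]:
    """Average the per-question metrics.

    runs: list of (retrieved_ids, relevant_ids) pairs, one per golden question.
    """
    if not runs:
        raise ValueError("no runs to evaluate")
    recalls = [recall_at_k(ret, rel, k) for ret, rel in runs]
    rranks = [reciprocal_rank(ret, rel) for ret, rel in runs]
    return {
        f"recall@{k}": sum(recalls) / len(recalls),
        "mrr": sum(rranks) / len(rranks),
    }


# Placeholder data for two questions, not real results.
runs = [
    (["c3", "c1", "c9"], {"c1", "c2"}),
    (["c5"], {"c5"}),
]
metrics = evaluate_retrieval(runs, k=2)
```

Running the same function over the output of each retriever gives directly comparable rows for the evaluation report.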

    5. Evaluate answers

    • For each golden question, score the generated answer on:
      • Faithfulness / groundedness: is it supported by retrieved context?
      • Answer relevance: does it address the question?
    • Use Ragas or a custom LLM-as-a-judge with a clear rubric. Spot-check the judge against your own labels on a sample.
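Spot-checking the judge amounts to comparing its labels with yours on the same sample. The sketch below assumes both sets of labels are categorical (for example 1 = faithful, 0 = not faithful) and reports raw agreement along with Cohen's kappa, which corrects for the agreement expected by chance. The label encoding and the sample values are hypothetical.

```python
def agreement(judge: list, human: list) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if not judge or len(judge) != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list, human: list) -> float:
    """Chance-corrected agreement: 1.0 is perfect, 0.0 is chance level."""
    observed = agreement(judge, human)
    n = len(judge)
    labels = set(judge) | set(human)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(
        (judge.count(label) / n) * (human.count(label) / n) for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four answers, not real results.
judge_labels = [1, 1, 0, 0]
human_labels = [1, 0, 0, 0]
raw = agreement(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

Low agreement usually points to an ambiguous rubric rather than a bad judge model, so it is worth tightening the rubric before trusting the answer metrics.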

    6. Optimize and re-measure

    • Change one variable at a time (chunk size, k, hybrid weights, reranker on/off) and re-run the eval. Keep what improves the numbers. This is the core RAG engineering loop.
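One way to enforce the one-variable-at-a-time discipline is to generate each experiment from a fixed baseline config. In this sketch, `RagConfig`, its fields, and its default values are assumptions, and `evaluate` stands for whatever function runs your eval for a config and returns a single score (for example recall@k).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RagConfig:
    # Hypothetical knobs and defaults; use the ones your pipeline exposes.
    chunk_size: int = 800
    top_k: int = 5
    use_reranker: bool = False


def one_at_a_time(baseline: RagConfig, options: dict):
    """Yield (name, config) pairs that differ from the baseline in one field."""
    for field, values in options.items():
        for value in values:
            if getattr(baseline, field) != value:
                yield f"{field}={value}", replace(baseline, **{field: value})


def compare(baseline: RagConfig, options: dict, evaluate):
    """Return rows of (name, score, delta vs baseline) for a before/after table."""
    base_score = evaluate(baseline)
    rows = [("baseline", base_score, 0.0)]
    for name, config in one_at_a_time(baseline, options):
        score = evaluate(config)
        rows.append((name, score, score - base_score))
    return rows
```

A call such as `compare(RagConfig(), {"top_k": [5, 10], "use_reranker": [True]}, evaluate)` yields one row per variant, which maps directly onto the before/after table in the deliverables.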

    Deliverables

    • A working RAG system (CLI or simple UI) that answers questions with citations
    • Three retrievers (vector, keyword, hybrid + reranker) behind a common interface
    • A golden dataset of 20-30 hand-written question/answer/relevant-doc entries
    • An evaluation report (in the README) with:
      • Retrieval metrics (recall@k, MRR) across the three retrievers
      • Answer metrics (faithfulness, relevance)
      • A before/after table for at least one optimization you made
    • A README.md with an architecture diagram, the eval results, and a trade-off discussion
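The "common interface" for the three retrievers could be expressed as a `typing.Protocol`, so the evaluation code can treat them interchangeably. The names `Hit`, `Retriever`, and `retrieve` are assumptions, and `KeywordOverlapRetriever` is a toy stand-in that counts shared tokens. It is not BM25 and exists only to show a class satisfying the interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    chunk_id: str
    score: float


class Retriever(Protocol):
    """Anything with this method can be plugged into the eval harness."""

    def retrieve(self, query: str, k: int) -> list[Hit]: ...


class KeywordOverlapRetriever:
    """Toy stand-in (NOT BM25): scores chunks by number of shared tokens."""

    def __init__(self, chunks: dict[str, str]):
        self._chunks = chunks

    def retrieve(self, query: str, k: int) -> list[Hit]:
        query_tokens = set(query.lower().split())
        hits = [
            Hit(cid, float(len(query_tokens & set(text.lower().split()))))
            for cid, text in self._chunks.items()
        ]
        hits = [h for h in hits if h.score > 0]
        hits.sort(key=lambda h: (-h.score, h.chunk_id))
        return hits[:k]


def top_ids(retriever: Retriever, query: str, k: int = 5) -> list[str]:
    """Eval-side helper that depends only on the interface."""
    return [hit.chunk_id for hit in retriever.retrieve(query, k)]
```

With this shape, swapping vector, keyword, and hybrid retrievers in the evaluation becomes a matter of passing a different object to the same function.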

    Optional Extensions

    • Add query rewriting (multi-query or HyDE) and measure the retrieval lift
    • Add metadata filtering (e.g. by date or source) and show it reduces irrelevant retrieval
    • Add a semantic cache for repeated questions and measure latency/cost savings
    • Add prompt-injection defenses for untrusted documents (treat retrieved content as untrusted)
    • Run the answer eval on a sample of live queries to simulate production monitoring
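As an illustration of the semantic cache extension, the sketch below stores answers keyed by question embeddings and returns a cached answer when a new question is similar enough. The `embed` callable is a placeholder for your embedding model, the 0.9 similarity threshold is an arbitrary assumption to tune against your data, and the linear scan would be replaced by a vector index at any real scale.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self._embed = embed          # placeholder: question -> vector
        self._threshold = threshold  # assumed value; tune on real queries
        self._entries = []           # list of (vector, answer) pairs

    def get(self, question: str):
        """Return the cached answer for the most similar question, or None."""
        query_vec = self._embed(question)
        best_answer, best_sim = None, -1.0
        for vec, answer in self._entries:
            sim = cosine(query_vec, vec)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer if best_sim >= self._threshold else None

    def put(self, question: str, answer: str) -> None:
        self._entries.append((self._embed(question), answer))
```

Measuring the savings then means counting cache hits and comparing latency and token cost with and without the cache on a replayed query log.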

    Project Details

    Tools & Technologies

    Python
    OpenAI API
    Vector database
    LangChain
    Ragas

