Data Engineering Projects

Good data engineering projects are end-to-end builds that mirror real production work: ingesting raw data, transforming it, orchestrating the pipeline, and serving results to a dashboard or warehouse. The best beginner projects are local ETL pipelines using tools like dlt, DuckDB, and Polars; strong intermediate projects add orchestration (Apache Airflow), analytics engineering (dbt), and cloud infrastructure; and advanced projects tackle real-time streaming with Apache Kafka or distributed batch processing with Apache Spark. Below are 14 hands-on projects across three difficulty levels.

Build real-world data engineering experience with hands-on projects. From simple ETL pipelines to complex streaming architectures, master the skills employers are looking for.

Build Your Data Engineering Portfolio

Our project-based learning approach gives you practical experience with real-world data engineering challenges. Each project includes detailed instructions, starter code, and comprehensive solutions to help you learn effectively.

Project Categories:

• ETL/ELT Data Pipelines
• Real-time Stream Processing
• Data Warehouse & Lake Architecture
• Cloud-Native Data Solutions
• Microservices Data Architecture
• Analytics & Monitoring Dashboards

Technologies You'll Master:

• Python
• OpenAI API
• Vector database
• LangChain
• Ragas
• Anthropic API
• LangGraph
• Function calling
• pytest
• pandas
• Jupyter
• dlt

14 projects available across 3 difficulty levels. Perfect for building a portfolio that demonstrates your data engineering expertise to employers.

Production RAG System with Retrieval Evaluation

Build a retrieval-augmented generation system over a real document set: chunking, embeddings, hybrid search with a reranker, grounded answers with citations, and a retrieval + faithfulness evaluation that proves it works.

Intermediate

8-12 hours

Tools & Technologies:

Python

OpenAI API

Vector database

LangChain

+1 more
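
Two of the pieces named in the RAG project above, chunking and retrieval evaluation, can be sketched in plain Python. This is a minimal illustration, not the project's starter code: the function names, the word-window sizes, and the chunk IDs are all assumptions made for the example.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    hits = relevant.intersection(retrieved[:k])
    return len(hits) / len(relevant)
```

Overlapping windows reduce the chance that a sentence needed for an answer is cut in half at a chunk boundary. Recall@k is one of several retrieval metrics you could report alongside a faithfulness score.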

LLM Agent with Tools and Failure-Mode Evaluation

Build an agent that plans, calls real tools (function calling), manages memory, and recovers from failures, then evaluate it on its trajectory and failure modes, not just happy-path demos.

Advanced

10-15 hours

Tools & Technologies:

Python

Anthropic API

OpenAI API

LangGraph

+1 more
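
To make "recovers from failures" and "evaluate it on its trajectory" more concrete, here is a small sketch of a tool call wrapped in a retry loop that records every attempt. It does not use the Anthropic, OpenAI, or LangGraph APIs; `run_tool_with_recovery`, the trajectory record format, and the retry limit are hypothetical choices for illustration only.

```python
from typing import Any, Callable


def run_tool_with_recovery(
    tools: dict[str, Callable[..., Any]],
    name: str,
    args: dict[str, Any],
    max_attempts: int = 2,
) -> tuple[Any, list[dict]]:
    """Call a tool by name, retrying on failure, and log each attempt to a trajectory."""
    trajectory = []
    for attempt in range(1, max_attempts + 1):
        try:
            # An unknown tool name raises KeyError here and is logged like any other failure.
            result = tools[name](**args)
        except Exception as exc:
            trajectory.append(
                {"tool": name, "attempt": attempt, "status": "error", "error": str(exc)}
            )
            continue
        trajectory.append(
            {"tool": name, "attempt": attempt, "status": "ok", "result": result}
        )
        return result, trajectory
    return None, trajectory  # every attempt failed
```

Because the trajectory keeps the failed attempts as well as the final result, an evaluation can assert on how the agent got to an answer (for example, that it retried at most once) rather than only on the answer itself.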

LLM Evaluation Pipeline with Golden Dataset and LLM-as-a-Judge

Build a reusable evaluation pipeline for LLM applications: a golden dataset, automated scoring with LLM-as-a-judge, and regression testing you can point at any prompt or model change to catch quality drops before users do.

Intermediate

6-10 hours

Tools & Technologies:

Python

OpenAI API

Anthropic API

pytest

+1 more
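
The shape of such a pipeline can be sketched without calling any model. In the example below the judge is a plain substring check standing in for an LLM-as-a-judge call, and the golden dataset, function names, and tolerance value are all illustrative assumptions.

```python
from statistics import mean
from typing import Callable

# Hypothetical golden dataset: inputs paired with the answers we expect.
GOLDEN = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "Capital of France?", "expected": "Paris"},
]


def stub_judge(answer: str, expected: str) -> float:
    """Placeholder judge: 1.0 if the expected text appears in the answer, else 0.0."""
    return 1.0 if expected.lower() in answer.lower() else 0.0


def evaluate(
    app: Callable[[str], str],
    golden: list[dict],
    judge: Callable[[str, str], float],
) -> float:
    """Run the app on every golden case and return the mean judge score."""
    return mean(judge(app(case["question"]), case["expected"]) for case in golden)


def passes_regression(score: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Pass if the new score is no more than `tolerance` below the stored baseline."""
    return score >= baseline - tolerance
```

In a pytest suite, `passes_regression` would typically sit inside a test function so that a prompt or model change that lowers the score fails the build.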

Local Data Engineering Environment with dlt, DuckDB & Jupyter

Set up a local development environment for data processing and analytics using Jupyter notebooks, dlt, and DuckDB. All tools are open-source and run locally.

Beginner

2-4 hours

Tools & Technologies:

Jupyter

dlt

DuckDB

Python

Scheduled GitHub ETL with Polars, dlt & DuckDB

Build a scheduled ETL pipeline that extracts GitHub repository data, transforms it with Polars, and stores results in DuckDB.

Intermediate

4-6 hours

Tools & Technologies:

Polars

dlt

DuckDB

GitHub Actions

+2 more

End-to-End Analytics Platform with DuckDB + Metabase

Build a modern, low-cost analytics stack using DuckDB, Metabase, and GitHub Actions for automated data updates and business-ready dashboards.

Intermediate

6-10 hours

Tools & Technologies:

DuckDB

Metabase

Python

GitHub Actions

+1 more

Infrastructure-as-Code Setup on GCP

Provision a GCP environment using Terraform with BigQuery and Cloud Storage, staying within free-tier limits.

Intermediate

4-6 hours

Tools & Technologies:

Terraform

GCP

BigQuery

Cloud Storage

+1 more

ETL Pipeline Orchestration with Apache Airflow

Design and implement an orchestrated ETL pipeline using Apache Airflow to extract, transform, and load weather data from a public API into a data warehouse.

Intermediate

8-12 hours

Tools & Technologies:

Airflow

Docker

Python

APIs

+2 more
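
The core idea behind orchestration, running tasks in an order that respects their dependencies, can be sketched with the standard library alone. This is not the Airflow API: `run_in_dependency_order`, the task functions, and the hard-coded weather readings are hypothetical stand-ins for a DAG, its operators, and a real API response.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable


def extract(results: dict) -> list[dict]:
    # Hard-coded readings standing in for a weather API response (assumed shape).
    return [{"city": "Oslo", "temp_k": 275.15}, {"city": "Cairo", "temp_k": 303.15}]


def transform(results: dict) -> list[dict]:
    # Convert Kelvin to Celsius using the output of the upstream task.
    return [
        {"city": r["city"], "temp_c": round(r["temp_k"] - 273.15, 2)}
        for r in results["extract"]
    ]


def load(results: dict) -> dict[str, float]:
    # A dict standing in for a warehouse table keyed by city.
    return {r["city"]: r["temp_c"] for r in results["transform"]}


def run_in_dependency_order(
    tasks: dict[str, Callable[[dict], Any]],
    upstream: dict[str, set[str]],
) -> tuple[list[str], dict]:
    """Run each task after all of its upstream tasks; return the order and the outputs."""
    order = list(TopologicalSorter(upstream).static_order())
    results: dict[str, Any] = {}
    for name in order:
        results[name] = tasks[name](results)
    return order, results


TASKS = {"extract": extract, "transform": transform, "load": load}
UPSTREAM = {"transform": {"extract"}, "load": {"transform"}}
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this ordering, which is what the project itself covers.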

Analytics Engineering Workflow with dbt + Metabase

Build a production-grade analytics workflow: model, test, and document data with dbt, then visualize insights in Metabase.

Intermediate

6-10 hours

Tools & Technologies:

dbt

BigQuery

SQL

Metabase

GitHub Events Analytics with PySpark

Build a production-style batch data pipeline using Apache Spark to process GitHub event logs.

Advanced

10-12 hours

Tools & Technologies:

Apache Spark

Python

PySpark

Docker

+2 more

⚡ Real-Time Data Streaming with Apache Kafka

Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.

Advanced

8-12 hours

Tools & Technologies:

Kafka

Confluent Cloud

Python

Polars

+4 more
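
A common real-time processing step is a tumbling-window aggregation, where events are grouped into fixed, non-overlapping time windows. The sketch below shows that idea over a plain Python iterable standing in for a Kafka consumer. The event fields `ts` (seconds) and `fare`, and the 60-second window, are assumptions for illustration, not the project's schema.

```python
from collections import defaultdict
from typing import Iterable


def tumbling_window_totals(
    events: Iterable[dict], window_seconds: int = 60
) -> dict[int, dict]:
    """Count trips and sum fares per fixed time window, keyed by window start time."""
    windows: dict[int, dict] = defaultdict(lambda: {"trips": 0, "fare_total": 0.0})
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = event["ts"] - event["ts"] % window_seconds
        windows[window_start]["trips"] += 1
        windows[window_start]["fare_total"] += event["fare"]
    return dict(sorted(windows.items()))
```

A streaming job would also have to decide when a window is complete and how to treat late events, which this batch-style sketch leaves out.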

CI/CD for Data Pipelines

Build a complete CI/CD pipeline for a data engineering project using GitHub Actions, dbt, Airflow DAG testing, and Terraform infrastructure deployment.

Intermediate

6-10 hours

Tools & Technologies:

GitHub Actions

dbt

Airflow

Terraform

+2 more

Tourism Recovery Dashboard (SQL + Power BI)

Answer a real business question end to end: load Eurostat regional tourism arrivals into DuckDB, model the metrics in SQL, and ship a one-page Power BI dashboard explaining where tourism recovered fastest between 2022 and 2025.

Beginner

6-8 hours

Tools & Technologies:

DuckDB

SQL

Power BI

DAX

+1 more

Airbnb Listings EDA (Python + pandas)

Clean a real, messy Inside Airbnb listings dataset, run an exploratory analysis in a Jupyter notebook, and ship the result on GitHub with a README and one publication-quality chart.

Beginner

5-7 hours

Tools & Technologies:

Python

pandas

Jupyter

seaborn

+1 more
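
Cleaning a messy listings file often starts with a numeric column stored as text. The sketch below assumes prices arrive as currency strings such as "$1,250.00", which is an assumption about the data rather than something stated above. It uses plain Python instead of pandas so it runs without installs; `parse_price` and `median_price` are hypothetical helper names.

```python
import math
from statistics import median
from typing import Optional


def parse_price(raw: object) -> Optional[float]:
    """Convert a currency string like '$1,250.00' to a float; return None if not usable."""
    if raw is None:
        return None
    text = str(raw).strip().replace("$", "").replace(",", "")
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    # Reject 'nan' and 'inf', which float() accepts but are not real prices.
    return value if math.isfinite(value) else None


def median_price(raw_prices: list) -> Optional[float]:
    """Median of the parseable prices, or None if nothing could be parsed."""
    prices = [p for p in map(parse_price, raw_prices) if p is not None]
    return median(prices) if prices else None
```

In pandas the same function could be applied to a column with `Series.map`, keeping the cleaning rule in one testable place.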

Frequently Asked Questions

What is a good first data engineering project?

A good first project is a local ETL pipeline using dlt, DuckDB, and Jupyter. It takes 2-4 hours, runs entirely on your laptop with open-source tools, and teaches the core extract-transform-load loop without cloud setup or cost.
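
The extract-transform-load loop mentioned above can be sketched with the standard library alone. To keep the example runnable without installs, `sqlite3` stands in for DuckDB and an inline CSV string stands in for a real source loaded with dlt. The table name, columns, and sample rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from a file or an API.
RAW_CSV = """repo,stars,language
alpha,120,Python
beta,,Go
gamma,45,python
"""


def extract(raw: str) -> list[dict]:
    """Parse the raw CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[tuple]:
    """Drop rows with a missing star count, cast types, and normalize language casing."""
    cleaned = []
    for row in rows:
        if not row["stars"]:
            continue
        cleaned.append((row["repo"], int(row["stars"]), row["language"].lower()))
    return cleaned


def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Write cleaned rows to a table; re-running replaces rows with the same key."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS repos "
        "(repo TEXT PRIMARY KEY, stars INTEGER, language TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO repos VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(raw: str = RAW_CSV) -> sqlite3.Connection:
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(raw)), conn)
    return conn
```

Keeping the three stages as separate functions makes each one testable on its own, and keying the load on `repo` means the pipeline can be re-run without creating duplicate rows.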

How many projects do I need for a data engineering portfolio?

Three to five projects spanning different difficulty levels is enough for a strong portfolio: one beginner pipeline, two intermediate builds (orchestration with Airflow and analytics engineering with dbt), and one advanced project such as real-time streaming with Kafka or batch processing with Spark.

What skills do data engineering projects teach?

Hands-on projects build skills in data pipeline design, stream processing (Kafka, Spark), batch ETL/ELT, data warehouse modeling, cloud platforms (AWS, GCP, Azure), containerization (Docker), workflow orchestration (Airflow), and CI/CD for data workflows.

Why Choose Project-Based Learning?

Real-World Application

Work with actual datasets and scenarios that mirror production environments. Build solutions that demonstrate your ability to handle complex data challenges.

Portfolio Development

Create a compelling portfolio that showcases your technical skills to potential employers. Each project includes documentation and deployment instructions.

Industry-Relevant Skills

Focus on the tools and technologies that are in high demand in the data engineering job market. Stay current with modern data stack practices.
