Data Engineering Projects

    Good data engineering projects are end-to-end builds that mirror real production work: ingesting raw data, transforming it, orchestrating the pipeline, and serving results to a dashboard or warehouse. The best beginner projects are local ETL pipelines using tools like dlt, DuckDB, and Polars; strong intermediate projects add orchestration (Apache Airflow), analytics engineering (dbt), and cloud infrastructure; and advanced projects tackle real-time streaming with Apache Kafka or distributed batch processing with Apache Spark. Below are 11 hands-on projects across three difficulty levels.

    Build real-world data engineering experience with hands-on projects. From simple ETL pipelines to complex streaming architectures, master the skills employers are looking for.

    Build Your Data Engineering Portfolio

    Our project-based learning approach gives you practical experience with real-world data engineering challenges. Each project includes detailed instructions, starter code, and comprehensive solutions to help you learn effectively.

    Project Categories:

    • ETL/ELT Data Pipelines
    • Real-time Stream Processing
    • Data Warehouse & Lake Architecture
    • Cloud-Native Data Solutions
    • Microservices Data Architecture
    • Analytics & Monitoring Dashboards

    Technologies You'll Master:

    Jupyter, dlt, DuckDB, Python, Polars, GitHub Actions, Terraform/OpenTofu/Pulumi, Metabase, Docker, GCP

    11 projects available across 3 difficulty levels. Perfect for building a portfolio that demonstrates your data engineering expertise to employers.


    Local Data Engineering Environment with dlt, DuckDB & Jupyter

    Set up a local development environment for data processing and analytics using Jupyter notebooks, dlt, and DuckDB. All tools are open-source and run locally.

    Beginner
    2-4 hours

    Tools & Technologies:

    Jupyter
    dlt
    DuckDB
    Python
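
    To make the extract-transform-load loop concrete, here is a minimal sketch using only the Python standard library. It is not the project's starter code: `sqlite3` stands in for DuckDB so the example runs without installing anything, and the CSV sample, the `weather` table, and the function names are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from an API or a file.
RAW_CSV = """city,temp_c
Berlin,21.5
Lisbon,
Oslo,14.0
"""


def extract(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows with a missing temperature and add a Fahrenheit column."""
    cleaned = []
    for row in rows:
        if not row["temp_c"]:
            continue  # skip incomplete records instead of guessing a value
        temp_c = float(row["temp_c"])
        cleaned.append((row["city"], temp_c, round(temp_c * 9 / 5 + 32, 1)))
    return cleaned


def load(rows, con):
    """Write transformed rows to a table and return the table's row count."""
    con.execute(
        "CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL, temp_f REAL)"
    )
    con.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)
    con.commit()
    return con.execute("SELECT COUNT(*) FROM weather").fetchone()[0]


# In-memory database (sqlite3 here as a stand-in for a local DuckDB file).
con = sqlite3.connect(":memory:")
loaded = load(transform(extract(RAW_CSV)), con)
```

    Keeping the three stages as separate functions makes each one easy to test on its own, which carries over when the stand-ins are swapped for dlt and DuckDB.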

    Scheduled GitHub ETL with Polars, dlt & DuckDB

    Build a scheduled ETL pipeline that extracts GitHub repository data, transforms it with Polars, and stores the results in DuckDB.

    Intermediate
    4-6 hours

    Tools & Technologies:

    Polars
    dlt
    DuckDB
    GitHub Actions
    +2 more

    End-to-End Analytics Platform with DuckDB + Metabase

    Build a modern, low-cost analytics stack using DuckDB, Metabase, and GitHub Actions for automated data updates and business-ready dashboards.

    Intermediate
    6-10 hours

    Tools & Technologies:

    DuckDB
    Metabase
    Python
    GitHub Actions
    +1 more

    Infrastructure-as-Code Setup on GCP

    Provision a GCP environment with Terraform, including BigQuery and Cloud Storage, while staying within free-tier limits.

    Intermediate
    4-6 hours

    Tools & Technologies:

    Terraform
    GCP
    BigQuery
    Cloud Storage
    +1 more

    ETL Pipeline Orchestration with Apache Airflow

    Design and implement an orchestrated ETL pipeline using Apache Airflow to extract, transform, and load weather data from a public API into a data warehouse.

    Intermediate
    8-12 hours

    Tools & Technologies:

    Airflow
    Docker
    Python
    APIs
    +2 more
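
    The core idea behind orchestration is that tasks declare their dependencies and a scheduler runs them in a valid order. The sketch below illustrates that idea with the standard library's `graphlib`; it is not Airflow code and does not use the Airflow API. The task functions, the `ctx` dictionary, and the sample weather record are hypothetical.

```python
from graphlib import TopologicalSorter


def extract(ctx):
    # Stand-in for calling a public weather API.
    ctx["raw"] = [{"city": "Oslo", "temp_c": 14.0}]


def transform(ctx):
    # Add a Fahrenheit field to every raw record.
    ctx["clean"] = [{**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in ctx["raw"]]


def load(ctx):
    # Stand-in for writing to a warehouse; only records the row count.
    ctx["loaded"] = len(ctx["clean"])


TASKS = {"extract": extract, "transform": transform, "load": load}

# Each key maps to the set of tasks that must finish before it can run.
DEPENDENCIES = {"transform": {"extract"}, "load": {"transform"}}


def run_pipeline():
    """Run every task once, in an order that respects the dependencies."""
    ctx = {}
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    for name in order:
        TASKS[name](ctx)
    return order, ctx


order, ctx = run_pipeline()
```

    A real orchestrator adds scheduling, retries, and logging on top of this ordering step, which is what the project explores with Airflow.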

    Analytics Engineering Workflow with dbt + Metabase

    Build a production-grade analytics workflow: model, test, and document data with dbt, then visualize insights in Metabase.

    Intermediate
    6-10 hours

    Tools & Technologies:

    dbt
    BigQuery
    SQL
    Metabase
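
    The "test" step in this workflow refers to data tests such as dbt's built-in `not_null` and `unique` checks. dbt declares and runs these against the warehouse itself; the Python sketch below only illustrates what the two checks look for, assuming a hypothetical `orders` table held in memory as a list of dicts.

```python
from collections import Counter


def not_null(rows, column):
    """Return the rows whose value in `column` is missing or None (failures)."""
    return [r for r in rows if r.get(column) is None]


def unique(rows, column):
    """Return the values that appear more than once in `column` (failures)."""
    counts = Counter(r[column] for r in rows if r.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample data with one null and one duplicate key.
orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": None},
    {"order_id": 2, "customer_id": 11},
]

null_failures = not_null(orders, "customer_id")
duplicate_ids = unique(orders, "order_id")
```

    As in dbt, a check passes when it returns no failing rows, so an empty result is the success case.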

    GitHub Events Analytics with PySpark

    Build a production-style batch data pipeline using Apache Spark to process GitHub event logs.

    Advanced
    10-12 hours

    Tools & Technologies:

    Apache Spark
    Python
    PySpark
    Docker
    +2 more

    Real-Time Data Streaming with Apache Kafka

    Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.

    Advanced
    8-12 hours

    Tools & Technologies:

    Kafka
    Confluent Cloud
    Python
    Polars
    +4 more
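
    Processing events "in real time" usually means aggregating them over small time windows as they arrive. The sketch below shows one common pattern, a tumbling (fixed, non-overlapping) window, in plain Python. It uses no Kafka client, and the event shape, zone names, and 60-second window are assumptions made for illustration.

```python
from collections import defaultdict


def simulate_trips():
    """Yield hypothetical taxi trip events as (epoch_seconds, zone, fare)."""
    yield from [
        (0, "Midtown", 12.0),
        (20, "Harlem", 8.5),
        (45, "Midtown", 20.0),
        (61, "Midtown", 7.0),
        (130, "Harlem", 15.0),
    ]


def tumbling_window_totals(events, window_seconds=60):
    """Emit (window_start, {zone: total_fare}) each time a window closes.

    Assumes events arrive in timestamp order, which a real stream
    does not guarantee.
    """
    current_start = None
    totals = defaultdict(float)
    for ts, zone, fare in events:
        window_start = ts - ts % window_seconds
        if current_start is not None and window_start != current_start:
            # The event belongs to a later window: flush the finished one.
            yield current_start, dict(totals)
            totals = defaultdict(float)
        current_start = window_start
        totals[zone] += fare
    if current_start is not None:
        # Flush the final, still-open window when the stream ends.
        yield current_start, dict(totals)


windows = list(tumbling_window_totals(simulate_trips()))
```

    Because the function is a generator, each window's totals are available as soon as the window closes, rather than after the whole input has been read.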

    CI/CD for Data Pipelines

    Build a complete CI/CD pipeline for a data engineering project using GitHub Actions, dbt, Airflow DAG testing, and Terraform infrastructure deployment.

    Intermediate
    6-10 hours

    Tools & Technologies:

    GitHub Actions
    dbt
    Airflow
    Terraform
    +2 more

    Tourism Recovery Dashboard (SQL + Power BI)

    Answer a real business question end to end: load Eurostat regional tourism data into DuckDB, model the metrics in SQL, and ship a one-page Power BI dashboard explaining where tourism recovered fastest between 2022 and 2025.

    Beginner
    6-8 hours

    Tools & Technologies:

    DuckDB
    SQL
    Power BI
    DAX
    +1 more

    Airbnb Listings EDA (Python + pandas)

    Clean a real, messy Inside Airbnb listings dataset, run an exploratory analysis in a Jupyter notebook, and ship the result on GitHub with a README and one publication-quality chart.

    Beginner
    5-7 hours

    Tools & Technologies:

    Python
    pandas
    Jupyter
    seaborn
    +1 more

    Frequently Asked Questions

    What is a good first data engineering project?

    A good first project is a local ETL pipeline using dlt, DuckDB, and Jupyter. It takes 2-4 hours, runs entirely on your laptop with open-source tools, and teaches the core extract-transform-load loop without cloud setup or cost.

    How many projects do I need for a data engineering portfolio?

    Three to five projects spanning different difficulty levels are enough for a strong portfolio. For example: one beginner pipeline, two intermediate builds (orchestration with Airflow and analytics engineering with dbt), and one advanced project such as real-time streaming with Kafka or batch processing with Spark.

    What skills do data engineering projects teach?

    Hands-on projects build skills in data pipeline design, stream processing (Kafka), distributed batch processing (Spark), batch ETL/ELT, data warehouse modeling, cloud platforms (AWS, GCP, Azure), containerization (Docker), workflow orchestration (Airflow), and CI/CD for data workflows.

    Why Choose Project-Based Learning?

    Real-World Application

    Work with actual datasets and scenarios that mirror production environments. Build solutions that demonstrate your ability to handle complex data challenges.

    Portfolio Development

    Create a compelling portfolio that showcases your technical skills to potential employers. Each project includes documentation and deployment instructions.

    Industry-Relevant Skills

    Focus on the tools and technologies that are in high demand in the data engineering job market. Stay current with modern data stack practices.