🧱 Modern Data Stack Roadmap

Master the core tools used in modern data teams — from containerization to dbt, BigQuery, and Kafka. Build real projects and get job-ready.

✓ Expert-Designed Learning Path • Industry-Validated Curriculum • Real-World Application Focus

This roadmap was created by data engineering professionals. Its 34 hands-on tasks cover production-ready skills used by companies like Netflix, Airbnb, and Spotify. You will work with Docker, Terraform, Airflow, and 7 more technologies.

How long does it take? Most learners with SQL and Python basics complete this roadmap in 4-7 months part-time (10-15 hours/week), or about 2-3 months full-time. The 9 sections contain 34 hands-on tasks, ending with a full pipeline project.

The 9 steps: (0) Pre-requisites and fundamentals · (1) Containerization & Infrastructure · (2) Workflow Orchestration with Airflow · (3) Data Ingestion & Loading (Airflow and dlt) · (4) Data Warehousing in BigQuery · (5) Analytics Engineering with dbt · (6) Batch Processing with Spark · (7) Streaming with Kafka · (8) Final Project: Build a Real Data Pipeline.

Intermediate

9 sections • 34 tasks

Skills You'll Learn

Cloud infrastructure
SQL & analytics engineering
ETL & orchestration
Batch & stream processing
Data modeling

Tools You'll Use

Docker
Terraform
Airflow
dlt
BigQuery
dbt
Metabase
Spark
Kafka
GitHub

Projects to Build

Infrastructure-as-Code Setup on GCP
Provision a GCP environment with Terraform, including BigQuery and Cloud Storage, while staying within free tier limits.
ETL Pipeline Orchestration with Apache Airflow
Design and implement an orchestrated ETL pipeline using Apache Airflow to extract, transform, and load weather data from a public API into a data warehouse.
Analytics Engineering Workflow with dbt + Metabase
Build a production-grade analytics workflow: model, test, and document data with dbt, then visualize insights in Metabase.
GitHub Events Analytics with PySpark
Build a production-style batch data pipeline using Apache Spark to process GitHub event logs.
⚡ Real-Time Data Streaming with Apache Kafka
Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.
CI/CD for Data Pipelines
Build a complete CI/CD pipeline for a data engineering project using GitHub Actions, dbt, Airflow DAG testing, and Terraform infrastructure deployment.

Learning Resources

Terraform + GCP Quickstart (documentation)

Airflow Documentation

Step 0: Pre-requisites and fundamentals

-Learn the fundamentals

-Know basic SQL and Python

Step 1: Containerization & Infrastructure

-Install Docker & Docker Compose

-Run PostgreSQL using Docker locally

-Install Terraform CLI

-Provision GCP infra (BQ dataset + GCS bucket)

Step 2: Workflow Orchestration with Airflow

-Set up Airflow locally with Docker (or use the free tier of Astronomer: https://www.astronomer.io/)

-Build a basic flow (CSV file to BigQuery)

-Schedule a flow to run daily

-Add logging and notification features
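
The tasks above all rest on one idea: an orchestrator runs tasks in dependency order and records what happened. As a rough illustration only, the sketch below uses Python's standard-library graphlib and logging modules rather than Airflow itself. The task names (extract_csv, load_to_bigquery, notify) are hypothetical, and the task bodies only write a log line.

```python
import logging
from graphlib import TopologicalSorter

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("toy_flow")

# Hypothetical task names; each maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract_csv": set(),
    "load_to_bigquery": {"extract_csv"},
    "notify": {"load_to_bigquery"},
}


def run_flow(dependencies):
    """Run every task once, upstream tasks first, and return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for task in order:
        log.info("running %s", task)  # a real task would do its work here
    return order


if __name__ == "__main__":
    run_flow(DEPENDENCIES)
```

In Airflow the same dependency chain is declared inside a DAG, and the scheduler decides when each run starts. This toy version only shows the ordering step.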

Step 3: Data Ingestion & Loading (Airflow and dlt)

-Create API ingestion task (e.g., GitHub or OpenWeather) with dlt

-Normalize JSON into flat tables

-Run on a schedule and incrementally with Airflow
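
To make "normalize" and "incrementally" concrete, here is a minimal sketch in plain Python. It is not the dlt API (dlt handles both steps for you). The "__" separator, the sample event, and the created_at cursor field are assumptions chosen for illustration. Lists are left as-is in this sketch.

```python
def flatten(record, parent_key="", sep="__"):
    """Flatten nested dicts into one level, joining key names with `sep`."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def new_rows(rows, cursor_field, last_seen):
    """Keep only rows whose cursor value is newer than the last loaded one."""
    return [row for row in rows if row[cursor_field] > last_seen]


# Hypothetical API response with one nested object.
event = {
    "id": 1,
    "actor": {"login": "octocat", "id": 7},
    "created_at": "2024-01-02T00:00:00Z",
}

if __name__ == "__main__":
    row = flatten(event)
    print(row)
    # ISO 8601 timestamps in the same format compare correctly as strings.
    print(new_rows([row], "created_at", "2024-01-01T00:00:00Z"))
```

An incremental run stores the highest cursor value it has loaded and passes it as last_seen on the next run, so each scheduled run only picks up new records.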

Step 4: Data Warehousing in BigQuery

-Load sample data into BigQuery

-Apply partitioning and clustering

-Run SQL queries and optimize costs
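
As a small illustration of the partitioning and clustering task, the sketch below renders a BigQuery CREATE TABLE statement from Python. The table name, column names, and the helper itself are hypothetical. The statement uses daily partitions on a TIMESTAMP column and up to four clustering columns.

```python
def partitioned_table_ddl(table, columns, partition_column, cluster_columns):
    """Render a CREATE TABLE statement with daily partitions and clustering."""
    if not 1 <= len(cluster_columns) <= 4:
        raise ValueError("expected between one and four clustering columns")
    column_lines = ",\n".join(f"  {name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE `{table}` (\n{column_lines}\n)\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {', '.join(cluster_columns)}"
    )


# Hypothetical table and columns, loosely modeled on taxi trip data.
ddl = partitioned_table_ddl(
    table="my_dataset.trips",
    columns={"pickup_ts": "TIMESTAMP", "vendor_id": "STRING", "fare": "NUMERIC"},
    partition_column="pickup_ts",
    cluster_columns=["vendor_id"],
)

if __name__ == "__main__":
    print(ddl)
```

Queries that filter on the partition column let BigQuery skip partitions outside the filter, which is where most of the cost savings come from.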

Step 5: Analytics Engineering with dbt

-Install and initialize dbt with BigQuery

-Build staging models

-Add documentation and tests

-Deploy with GitHub Actions or dbt Cloud

-Visualize output in Metabase
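
For the testing task, it can help to see what dbt's built-in unique and not_null tests check. The sketch below reproduces that logic in plain Python over a list of rows. It is not how dbt runs tests (dbt compiles them to SQL), and the order_id column and sample rows are hypothetical. As in dbt, a test passes when it finds no failing records.

```python
from collections import Counter


def not_null_failures(rows, column):
    """Return the rows where `column` is missing or None."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows, column):
    """Return the non-null values of `column` that appear more than once."""
    counts = Counter(row[column] for row in rows if row.get(column) is not None)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical staging rows with one duplicate and one null key.
orders = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}, {"order_id": None}]

if __name__ == "__main__":
    print(not_null_failures(orders, "order_id"))
    print(unique_failures(orders, "order_id"))
```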

Step 6: Batch Processing with Spark

-Install Spark locally or via Colab

-Load and transform a CSV with PySpark

-Run groupBy and joins on large datasets

-Explore partitioning and performance tuning
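
Before tuning Spark, it helps to picture why a groupBy triggers a shuffle: rows with the same key must end up in the same partition before they can be aggregated. The sketch below is a conceptual model in plain Python, not PySpark code. It uses crc32 as a stand-in hash, and the event rows and the "type" field are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(rows, key_field, num_partitions):
    """Send every row to the partition chosen by its key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[partition_for(row[key_field], num_partitions)].append(row)
    return partitions


def count_by_key(partitions, key_field):
    """Aggregate each partition independently, then combine the results."""
    totals = {}
    for part in partitions:
        local = defaultdict(int)
        for row in part:
            local[row[key_field]] += 1
        totals.update(local)  # safe: each key lives in exactly one partition
    return totals


# Hypothetical GitHub-style events.
events = [{"type": t} for t in ["PushEvent", "WatchEvent", "PushEvent", "ForkEvent"]]

if __name__ == "__main__":
    print(count_by_key(shuffle(events, "type", 3), "type"))
```

If one key holds most of the rows, one partition does most of the work. That skew is a common target of partitioning and performance tuning.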

Step 7: Streaming with Kafka

-Install Kafka via Docker or use Confluent Cloud

-Create a simple producer/consumer

-Process events with Kafka Streams or KSQL (now ksqlDB)

-Use Schema Registry with Avro or Protobuf
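
The producer/consumer task is easier once the underlying model is clear: a topic is an append-only log, and each consumer group remembers how far it has read. The sketch below is a toy in-memory model in plain Python, not a Kafka client. The ToyTopic class, the group names, and the sample events are hypothetical, and real Kafka adds partitions, brokers, and committed offsets on top of this idea.

```python
import json


class ToyTopic:
    """Append-only log where each consumer group tracks its own offset."""

    def __init__(self):
        self._log = []      # serialized events, in arrival order
        self._offsets = {}  # consumer group -> next offset to read

    def produce(self, event):
        """Serialize the event as JSON bytes and return its offset."""
        self._log.append(json.dumps(event).encode("utf-8"))
        return len(self._log) - 1

    def poll(self, group, max_records=10):
        """Return the next unread events for `group` and advance its offset."""
        start = self._offsets.get(group, 0)
        batch = self._log[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return [json.loads(raw) for raw in batch]


if __name__ == "__main__":
    topic = ToyTopic()
    for trip_id in range(3):
        topic.produce({"trip_id": trip_id})
    print(topic.poll("dashboard", max_records=2))
    print(topic.poll("dashboard"))
    print(topic.poll("archiver"))
```

Because offsets are stored per group, a second group can read the same events from the start without affecting the first one.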

Step 8: Final Project: Build a Real Data Pipeline

-Choose a dataset and domain (e.g., finance, sports, ecommerce)

-Ingest the data using Airflow and dlt for batch or Kafka for streaming

-Model and test with dbt

-Load into BigQuery and visualize KPIs

-Publish project on GitHub and write a short case study

Curriculum Reference

Step 1: Containerization & Infrastructure

Install Docker & Docker Compose

This section is heavily inspired by the Zoomcamp Docker and Terraform project, which is a great way to learn both tools. The course is free.

Docker Documentation (documentation)
Docker Compose Documentation (documentation)

Run PostgreSQL using Docker locally

Use the following command to run a PostgreSQL container:

docker run --name my-postgres \
  -e POSTGRES_USER=admin \
  -e POSTGRES_PASSWORD=admin123 \
  -e POSTGRES_DB=mydatabase \
  -p 5432:5432 \
  -d postgres

Explanation:

--name my-postgres: sets the container name.

-e POSTGRES_USER=admin: sets the DB username.

-e POSTGRES_PASSWORD=admin123: sets the password.

-e POSTGRES_DB=mydatabase: creates a default DB.

-p 5432:5432: publishes container port 5432 on port 5432 of your local machine (the format is host:container).

-d: runs in detached mode.

postgres: the image to run. Docker pulls the official PostgreSQL image if it is not already present locally.

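To persist data even after the container is removed, mount a named volume (here pgdata) at the data directory with -v. The path below is the one documented for the official image through PostgreSQL 17; newer major versions may expect a different mount point, so check the image documentation if you run an unpinned tag: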

docker run --name my-postgres \
  -e POSTGRES_USER=admin \
  -e POSTGRES_PASSWORD=admin123 \
  -e POSTGRES_DB=mydatabase \
  -v pgdata:/var/lib/postgresql/data \
  -p 5432:5432 \
  -d postgres

To connect, use a client such as psql, DBeaver, or any other Postgres GUI tool. For psql:

psql -h localhost -U admin -d mydatabase

It will prompt for the password (admin123).
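
Later steps (for example dbt or Metabase) often ask for the same connection details as a single URI. As a small sketch, the helper below builds one from the values used in the docker run command above. The function name is hypothetical, and credentials are percent-encoded so that special characters in a password do not break the URI.

```python
from urllib.parse import quote


def postgres_uri(user, password, host, port, database):
    """Build a postgresql:// connection URI with percent-encoded credentials."""
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}"
    )


if __name__ == "__main__":
    # Values match the container started earlier in this section.
    print(postgres_uri("admin", "admin123", "localhost", 5432, "mydatabase"))
```

psql also accepts a URI of this form as its only argument, in place of the separate -h, -U, and -d flags.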

Stop container:

docker stop my-postgres

Start container again:

docker start my-postgres

Remove container (-f forces removal even if it is still running):

docker rm -f my-postgres

Install Terraform CLI

Install Terraform CLI (Official Docs) (documentation)

Provision GCP infra (BQ dataset + GCS bucket)

Terraform + GCP Quickstart (documentation)
Provision GCP Infra with Terraform (YouTube) (video)

Frequently Asked Questions

What is the modern data stack?

The modern data stack is the set of cloud-native, modular tools that data teams use to ingest, store, transform, and serve data. This roadmap covers Docker, Terraform, Airflow, dlt, BigQuery, dbt, Metabase, Spark, and Kafka across nine sections.

What tools make up the modern data stack?

This roadmap teaches Docker and Terraform for infrastructure, Airflow and dlt for orchestration and ingestion, BigQuery for warehousing, dbt for analytics engineering, Metabase for visualization, and Spark and Kafka for batch and stream processing.

Is the modern data stack still relevant in 2026?

Yes. The core tools in this roadmap, including Airflow, dbt, BigQuery, Spark, and Kafka, remain industry standards for data teams. The roadmap builds six real projects, from infrastructure-as-code through batch processing, real-time streaming, and CI/CD.

Do I need to know SQL and Python before starting?

Yes. This is an intermediate roadmap and Step 0 expects basic SQL and Python before you begin. From there you move into containerization, orchestration, warehousing, analytics engineering, and batch and stream processing.

What is the difference between batch and stream processing?

Batch processing runs joins and aggregations on large stored datasets, covered in this roadmap with Apache Spark. Stream processing handles continuous real-time events, covered with Kafka, Kafka Streams, KSQL, and Schema Registry. In the final project you choose either approach for ingestion: Airflow and dlt for batch, or Kafka for streaming.
