🧱 Modern Data Stack Roadmap
Master the core tools used in modern data teams — from containerization to dbt, BigQuery, and Kafka. Build real projects and get job-ready.
Created by data engineering professionals, this roadmap has 34 hands-on tasks covering production-ready skills used at companies like Netflix, Airbnb, and Spotify. You will work with Docker, Terraform, Airflow, and 7 more technologies.
How long does it take? Most learners with SQL and Python basics complete this roadmap in 4-7 months part-time (10-15 hours/week), or about 2-3 months full-time. The 9 sections contain 34 hands-on tasks, ending with a full pipeline project.
The 9 steps: (0) Pre-requisites and fundamentals · (1) Containerization & Infrastructure · (2) Workflow Orchestration with Airflow · (3) Data Ingestion & Loading (Airflow and dlt) · (4) Data Warehousing in BigQuery · (5) Analytics Engineering with dbt · (6) Batch Processing with Spark · (7) Streaming with Kafka · (8) Final Project: Build a Real Data Pipeline.
Skills You'll Learn
- Cloud infrastructure
- SQL & analytics engineering
- ETL & orchestration
- Batch & stream processing
- Data modeling
Tools You'll Use
- Docker
- Terraform
- Airflow
- dlt
- BigQuery
- dbt
- Metabase
- Spark
- Kafka
- GitHub
Projects to Build
- Infrastructure-as-Code Setup on GCP
Provision a GCP environment using Terraform with BigQuery & Cloud Storage, staying within free tier limits.
- ETL Pipeline Orchestration with Apache Airflow
Design and implement an orchestrated ETL pipeline using Apache Airflow to extract, transform, and load weather data from a public API into a data warehouse.
- Analytics Engineering Workflow with dbt + Metabase
Build a production-grade analytics workflow: model, test, and document data with dbt, then visualize insights in Metabase.
- GitHub Events Analytics with PySpark
Build a production-style batch data pipeline using Apache Spark to process GitHub event logs.
- ⚡ Real-Time Data Streaming with Apache Kafka
Build a real-time data pipeline using Kafka (Confluent Cloud), JSON, Python, and Polars. Simulate NYC Taxi data, process in real time, and visualize with Metabase.
- CI/CD for Data Pipelines
Build a complete CI/CD pipeline for a data engineering project using GitHub Actions, dbt, Airflow DAG testing, and Terraform infrastructure deployment.
Curriculum Reference
Step 1: Containerization & Infrastructure
Install Docker & Docker Compose
This section is heavily inspired by the Zoomcamp Docker and Terraform project, which is a good way to learn both tools.
The course is free.
- Docker Documentation (documentation)
- Docker Compose Documentation (documentation)
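Once the installers have finished, it helps to confirm that the tools are actually on your PATH. The snippet below is a minimal sketch; `require_tool` is a hypothetical helper name introduced here for illustration and is not part of Docker.

```shell
# Report whether a command is on PATH; returns non-zero if it is missing.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Example usage after installing Docker:
#   require_tool docker && docker --version
#   docker compose version        # Compose v2 plugin
#   docker-compose --version      # older standalone Compose v1 installs
```

Which of the two Compose commands works depends on how Compose was installed on your machine.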
Run PostgreSQL using Docker locally
Use the following command to run a PostgreSQL container:
docker run --name my-postgres \
-e POSTGRES_USER=admin \
-e POSTGRES_PASSWORD=admin123 \
-e POSTGRES_DB=mydatabase \
-p 5432:5432 \
-d postgres
Explanation:
--name my-postgres: sets the container name.
-e POSTGRES_USER=admin: sets the DB username.
-e POSTGRES_PASSWORD=admin123: sets the password.
-e POSTGRES_DB=mydatabase: creates a default DB.
-p 5432:5432: publishes container port 5432 on port 5432 of your local machine (the format is host:container).
-d: runs in detached mode.
postgres: the official PostgreSQL image to run; Docker pulls it automatically if it is not already present locally.
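Because the container starts in detached mode, the database may need a few seconds before it accepts connections. The sketch below polls `pg_isready` (a standard PostgreSQL client utility) inside the container. The function name `wait_for_postgres` and the default of 30 attempts are assumptions made for this example.

```shell
# Poll pg_isready inside the container until PostgreSQL accepts connections.
# Arguments: container name, DB user, DB name, max attempts (default 30).
# The function body runs in a subshell, so its variables stay local.
wait_for_postgres() (
  container=$1; user=$2; db=$3; attempts=${4:-30}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -h localhost checks over TCP rather than the local socket.
    if docker exec "$container" pg_isready -h localhost -U "$user" -d "$db" >/dev/null 2>&1; then
      echo "PostgreSQL is ready after $i attempt(s)"
      exit 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "PostgreSQL did not become ready after $attempts attempts" >&2
  exit 1
)

# Example usage with the container started above:
#   wait_for_postgres my-postgres admin mydatabase
# Other quick checks:
#   docker ps --filter name=my-postgres
#   docker logs my-postgres
```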
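To persist data even after the container is removed, mount a named volume at the data directory. The path below applies to PostgreSQL 17 and earlier images (images for version 18 and later expect the volume at /var/lib/postgresql), so the image tag is pinned:
docker run --name my-postgres \
-e POSTGRES_USER=admin \
-e POSTGRES_PASSWORD=admin123 \
-e POSTGRES_DB=mydatabase \
-v pgdata:/var/lib/postgresql/data \
-p 5432:5432 \
-d postgres:17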
To connect, use a client such as psql, DBeaver, or any other Postgres GUI tool. With psql:
psql -h localhost -U admin -d mydatabase
It will prompt for the password (admin123).
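For scripts, an interactive password prompt is inconvenient. One option is a libpq connection URI, which psql accepts in place of the separate flags. The helper below is a minimal sketch: `pg_uri` is a hypothetical name, and it assumes the user and password contain no characters that need percent-encoding.

```shell
# Build a libpq connection URI from its parts.
# Arguments: user, password, host, port, database.
pg_uri() {
  printf 'postgresql://%s:%s@%s:%s/%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example usage with the values from the docker run command:
#   psql "$(pg_uri admin admin123 localhost 5432 mydatabase)" -c 'SELECT version();'
# Alternatively, pass the password through the PGPASSWORD environment variable:
#   PGPASSWORD=admin123 psql -h localhost -U admin -d mydatabase -c 'SELECT 1;'
```

Both approaches expose the password to the shell, so they suit a local practice database rather than shared environments.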
Stop container:
docker stop my-postgres
Start container again:
docker start my-postgres
Remove the container (-f forces removal even if it is still running):
docker rm -f my-postgres
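Removing a container does not remove a named volume such as pgdata, which is why the data survives. For a full reset, the volume has to be deleted separately. The sketch below wraps both steps; `remove_postgres` is a hypothetical helper name, and the volume argument is optional.

```shell
# Remove a container and, optionally, its named volume.
# Arguments: container name, volume name (optional).
# Deleting the volume permanently deletes the stored database files.
remove_postgres() (
  container=$1; volume=${2:-}
  docker rm -f "$container" || exit 1
  if [ -n "$volume" ]; then
    docker volume rm "$volume" || exit 1
  fi
)

# Example usage:
#   docker volume ls                    # list volumes, including pgdata
#   remove_postgres my-postgres         # keep the data
#   remove_postgres my-postgres pgdata  # delete the data as well
```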
Install Terraform CLI
- Install Terraform CLI (Official Docs) (documentation)
Provision GCP infra (BQ dataset + GCS bucket)
- Terraform + GCP Quickstart (documentation)
- Provision GCP Infra with Terraform (YouTube) (video)
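As a rough illustration of this task, the sketch below writes a minimal Terraform configuration with one Cloud Storage bucket and one BigQuery dataset, using the `google_storage_bucket` and `google_bigquery_dataset` resources of the Google provider. The helper name `write_main_tf`, the resource labels, the dataset ID `analytics`, the bucket naming scheme, and the default region are all assumptions for this example, not values from the roadmap.

```shell
# Write a minimal Terraform configuration into the given directory.
# The heredoc is quoted so the shell leaves ${var.project_id} untouched.
write_main_tf() {
  mkdir -p "$1"
  cat > "$1/main.tf" <<'EOF'
terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
    }
  }
}

variable "project_id" {
  type = string
}

variable "region" {
  type    = string
  default = "us-central1"
}

provider "google" {
  project = var.project_id
  region  = var.region
}

# Bucket names are globally unique, so prefix with the project ID.
resource "google_storage_bucket" "data_lake" {
  name          = "${var.project_id}-data-lake"
  location      = var.region
  force_destroy = true
}

resource "google_bigquery_dataset" "warehouse" {
  dataset_id = "analytics"
  location   = var.region
}
EOF
}

# Example workflow (requires Terraform and GCP credentials, for instance
# from `gcloud auth application-default login`):
#   write_main_tf infra && cd infra
#   terraform init
#   terraform plan  -var "project_id=YOUR_PROJECT_ID"
#   terraform apply -var "project_id=YOUR_PROJECT_ID"
#   terraform destroy -var "project_id=YOUR_PROJECT_ID"   # clean up afterwards
```

Running `terraform destroy` when you are done helps keep a practice project within free tier limits.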
Frequently Asked Questions
What is the modern data stack?
The modern data stack is the set of cloud-native, modular tools that data teams use to ingest, store, transform, and serve data. This roadmap covers Docker, Terraform, Airflow, dlt, BigQuery, dbt, Metabase, Spark, and Kafka across nine sections.
What tools make up the modern data stack?
This roadmap teaches Docker and Terraform for infrastructure, Airflow and dlt for orchestration and ingestion, BigQuery for warehousing, dbt for analytics engineering, Metabase for visualization, and Spark and Kafka for batch and stream processing.
Is the modern data stack still relevant in 2026?
Yes. The core tools in this roadmap, including Airflow, dbt, BigQuery, Spark, and Kafka, remain industry standards for data teams. The roadmap builds six real projects, from infrastructure-as-code through batch processing, real-time streaming, and CI/CD.
Do I need to know SQL and Python before starting?
Yes. This is an intermediate roadmap and Step 0 expects basic SQL and Python before you begin. From there you move into containerization, orchestration, warehousing, analytics engineering, and batch and stream processing.
What is the difference between batch and stream processing?
Batch processing runs joins and aggregations on large stored datasets, covered in this roadmap with Apache Spark. Stream processing handles continuous real-time events, covered with Kafka, Kafka Streams, KSQL, and Schema Registry. The final project uses both.
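The distinction can be seen in miniature with a toy example. The sketch below uses awk rather than Spark or Kafka, and the function names and the one-event-type-per-line input format are assumptions made for illustration. The batch version reads the whole stored file and emits one total per key at the end, while the streaming version updates its state per event and emits a running total as each event arrives.

```shell
# Batch: read the entire stored dataset, then emit one total per key.
# Argument: path to a file with one event type per line.
batch_counts() {
  awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' "$1" | sort
}

# Stream: read events from stdin and emit the running total for each
# key as soon as the event is seen, without waiting for the input to end.
stream_counts() {
  awk '{ count[$1]++; print $1, count[$1] }'
}

# Example usage:
#   printf 'push\npull\npush\n' > events.txt
#   batch_counts events.txt        # final totals per event type
#   stream_counts < events.txt     # one running total per incoming event
```

In a real pipeline the same trade-off applies at scale: batch jobs produce complete results with a delay, while streaming jobs produce partial results continuously.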