TL;DR: Apache Airflow is the de facto open-source orchestrator for batch data pipelines. You define workflows as DAGs (directed acyclic graphs) in Python; the scheduler runs each task on a timetable and handles dependencies, retries, and backfills for you. Reach for Airflow when you have many interdependent batch jobs that need scheduling, observability, and recovery. Reach for something lighter (cron, Dagster, Prefect) when you only have a handful of simple, independent jobs.
Every data team eventually outgrows cron. You start with one nightly script, then it depends on another, then a third needs to wait for both, and suddenly a single failure at 2 AM cascades into a morning of manual reruns. Apache Airflow exists to make that orchestration problem tractable: it turns "a pile of scripts on a schedule" into observable, recoverable workflows.
This guide covers what Airflow is, its core building blocks, the production patterns that actually matter, and when you should reach for something else.
What Airflow Is and the Problem It Solves
Airflow is a workflow orchestrator, not a data processing engine. It does not move or transform data itself — it tells other systems when and in what order to do so, then tracks whether they succeeded.
The distinction matters. A scheduler (like cron) answers "run this at 3 AM." An orchestrator answers a harder set of questions: run task B only after task A succeeds; retry C three times with backoff; if today's run fails, let me re-run just the failed task; and show me, at a glance, the state of every pipeline. That last part — observability — is half the reason Airflow exists.
Airflow sits naturally above your compute. It triggers a batch Spark job, kicks off a dbt run, waits for a file to land, then loads a warehouse — coordinating tools rather than replacing them.
DAGs, Tasks, Operators, and the Scheduler
Airflow has four concepts you need to internalize.
- DAG (Directed Acyclic Graph): the workflow itself — a set of tasks plus the dependencies between them. "Acyclic" means no loops: data flows forward.
- Task: a single unit of work (a node in the DAG).
- Operator: the template that defines what a task does. `BashOperator` runs a command, `PythonOperator` runs a function, `KubernetesPodOperator` runs a container, and provider packages add operators for Spark, dbt, BigQuery, S3, and hundreds more.
- Scheduler: the process that reads your DAGs, decides which task instances are due, and dispatches them to workers.
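To make "acyclic" and "dependencies" concrete, here is a minimal sketch that uses only Python's standard library, not Airflow itself. The task names and the `run_order` and `has_cycle` helpers are hypothetical, chosen for illustration. The real scheduler also tracks state, timing, and retries, so treat this as a picture of the dependency idea only.

```python
from graphlib import CycleError, TopologicalSorter

# Map each task to the set of tasks it depends on (its upstream tasks).
# Task names are hypothetical.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def run_order(deps):
    """Return one valid execution order in which every task follows its upstreams."""
    return list(TopologicalSorter(deps).static_order())


def has_cycle(deps):
    """Return True if the dependency graph contains a loop and so is not a DAG."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError:
        return True
    return False


order = run_order(dependencies)
print(order)

# A graph where each task waits on the other can never start.
cyclic = {"a": {"b"}, "b": {"a"}}
print(has_cycle(dependencies), has_cycle(cyclic))
```

A graph with a loop has no valid starting point, which is why a workflow must be acyclic before any ordering can be computed.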
A minimal DAG looks like this:
```python
from datetime import datetime

from airflow import DAG
# Airflow 2.x import path; newer releases may expose BashOperator elsewhere.
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_sales_etl",
    start_date=datetime(2026, 1, 1),
    schedule="0 3 * * *",  # every day at 03:00
    catchup=False,  # do not create runs for past dates on first deploy
) as dag:
    # Each operator instance becomes one task in the DAG.
    extract = BashOperator(task_id="extract", bash_command="python extract.py {{ ds }}")
    transform = BashOperator(task_id="transform", bash_command="dbt run --select sales")
    load = BashOperator(task_id="load", bash_command="python publish.py {{ ds }}")

    extract >> transform >> load  # dependency: extract, then transform, then load
```

The `{{ ds }}` is a templated variable: the logical date of the run (historically called the execution date), rendered as YYYY-MM-DD. Templating is what makes Airflow pipelines reproducible: every task knows which logical date it is processing.
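The effect of `{{ ds }}` can be sketched in plain Python. Airflow renders templates with Jinja; the hypothetical `render_ds` helper below is a simplified stand-in that handles only this one variable. `daily_logical_dates` is likewise an assumption, used here to mimic one run per day.

```python
from datetime import date, timedelta


def render_ds(command: str, logical_date: date) -> str:
    """Replace the {{ ds }} placeholder with the logical date as YYYY-MM-DD."""
    return command.replace("{{ ds }}", logical_date.isoformat())


def daily_logical_dates(start: date, end: date):
    """Yield one logical date per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


template = "python extract.py {{ ds }}"

# One rendered command per daily run; each run is pinned to its own date.
commands = [
    render_ds(template, day)
    for day in daily_logical_dates(date(2026, 1, 1), date(2026, 1, 3))
]

for command in commands:
    print(command)
```

Because the date comes from the run rather than from the wall clock, re-running the 2026-01-02 run next week still processes 2026-01-02.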
Production Patterns That Actually Matter
A "hello world" DAG is easy. A DAG that survives production is not. The patterns below are the difference, and they overlap heavily with general data pipeline design patterns.
- Idempotency: a task must produce the same result whether it runs once or five times. Use the execution date to scope work (`WHERE date = '{{ ds }}'`) and write with `MERGE` / `INSERT OVERWRITE`, never blind `INSERT`. Without this, a retry doubles your data.
- Backfills: because each run is tied to a logical date, Airflow can re-run history. Set `catchup` deliberately: `True` to fill a gap, `False` to avoid a thundering herd of past runs on first deploy.
- Sensors: tasks that wait for a condition (a file in S3, a partition in a table). Prefer `mode="reschedule"` so a waiting sensor frees its worker slot instead of blocking it.
- Retries with backoff: set `retries` and `retry_delay` on tasks that touch flaky external systems.
- SLAs and alerting: define SLAs so you hear about a late pipeline before a stakeholder does.
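The idempotency point is the easiest one to get wrong, so here is a minimal sketch of it using an in-memory SQLite database from the standard library. SQLite has no `MERGE` or `INSERT OVERWRITE`, so a delete-then-insert inside one transaction stands in for them. The `sales` table and the helpers `load_partition`, `blind_insert`, and `count_rows` are hypothetical.

```python
import sqlite3


def load_partition(conn, ds, amounts):
    """Idempotent load: replace the whole partition for logical date ds."""
    with conn:  # one transaction: the delete and insert commit together or not at all
        conn.execute("DELETE FROM sales WHERE date = ?", (ds,))
        conn.executemany(
            "INSERT INTO sales (date, amount) VALUES (?, ?)",
            [(ds, amount) for amount in amounts],
        )


def blind_insert(conn, ds, amounts):
    """Non-idempotent load: appends rows every time it runs."""
    with conn:
        conn.executemany(
            "INSERT INTO sales (date, amount) VALUES (?, ?)",
            [(ds, amount) for amount in amounts],
        )


def count_rows(conn, ds):
    """Count the rows stored for one logical date."""
    query = "SELECT COUNT(*) FROM sales WHERE date = ?"
    return conn.execute(query, (ds,)).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, amount REAL)")

# Simulate the original run plus one retry of the same task.
for _ in range(2):
    load_partition(conn, "2026-01-01", [10.0, 20.0, 30.0])
    blind_insert(conn, "2026-01-02", [10.0, 20.0, 30.0])

idempotent_count = count_rows(conn, "2026-01-01")
blind_count = count_rows(conn, "2026-01-02")
print(idempotent_count, blind_count)
```

The idempotent load leaves three rows no matter how often it runs, while the blind insert leaves six after a single retry.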
Airflow and the Rest of the Stack
Airflow rarely lives alone. In a modern stack you will typically:
- Run tasks in containers for reproducibility — see Docker for data engineers.
- Deploy DAGs through a CI/CD pipeline for data so changes are tested before they hit the scheduler.
- Orchestrate the transform layer of an ELT pipeline, triggering dbt or warehouse SQL after extraction.
If you want a hands-on walkthrough, build the Orchestration with Airflow project, which wires a real DAG end to end.
When NOT to Use Airflow
Airflow is powerful but heavy: you have to operate a scheduler, a metadata database, and workers. It is overkill when:
- You have a few independent jobs — plain cron or a managed scheduler is simpler.
- You need sub-minute, event-driven latency — Airflow is batch-first; reach for a streaming system.
- You want a more Pythonic, asset-centric model — Dagster and Prefect are worth evaluating. (A dedicated comparison is coming.)
For most teams running many interdependent batch jobs, though, Airflow remains the default — and managed options (AWS MWAA, Google Cloud Composer, Astronomer) remove most of the operational burden.
Frequently Asked Questions
Is Airflow just a fancy cron?
No. Cron schedules a single command at a time with no awareness of dependencies, state, or failure. Airflow models dependencies between tasks, tracks the state of every run, retries failures, supports backfills, and gives you a UI to observe and recover pipelines.
Airflow vs Dagster vs Prefect — which should I use?
Airflow has the largest ecosystem and is the safest default for batch orchestration. Dagster is asset-centric and has strong local development and data-awareness. Prefect is the most Pythonic and lightweight. For a brand-new project with a small team, Dagster or Prefect can be faster to start; for broad integration needs and hiring, Airflow is still the standard.
Is Apache Airflow still relevant in 2026?
Yes. Airflow 2.x and the move toward Airflow 3 modernized the scheduler, the API, and the developer experience, and managed offerings (MWAA, Composer, Astronomer) keep it the most widely deployed orchestrator. It remains the default for batch data pipelines.
Do I need to know Python to use Airflow?
Yes — DAGs are defined in Python. You do not need to be an expert, but comfort with functions, imports, and basic data structures is required. If you are still building that foundation, start with the data engineering roadmap.