Apache Airflow for Data Engineers: DAGs, Operators, and Production Patterns (2026)

    How Apache Airflow works for data engineers: DAGs, operators, the scheduler, and the production patterns (idempotency, backfills, sensors) that keep pipelines reliable.

    By Adriano Sanges · 11 min read
    apache airflow
    airflow
    orchestration
    data pipelines
    dags
    workflow orchestration
    data engineering

    TL;DR: Apache Airflow is the de-facto open-source orchestrator for batch data pipelines. You define workflows as DAGs (directed acyclic graphs) in Python; the scheduler runs each task on a timetable, resolving dependencies, retries, and backfills for you. Reach for Airflow when you have many interdependent batch jobs that need scheduling, observability, and recovery. Reach for something lighter (cron, Dagster, Prefect) when you only have a handful of simple, independent jobs.

    Every data team eventually outgrows cron. You start with one nightly script, then it depends on another, then a third needs to wait for both, and suddenly a single failure at 2 AM cascades into a morning of manual reruns. Apache Airflow exists to make that orchestration problem tractable: it turns "a pile of scripts on a schedule" into observable, recoverable workflows.

    This guide covers what Airflow is, its core building blocks, the production patterns that actually matter, and when you should reach for something else.

    What Airflow Is and the Problem It Solves

    Airflow is a workflow orchestrator, not a data processing engine. It does not move or transform data itself — it tells other systems when and in what order to do so, then tracks whether they succeeded.

    The distinction matters. A scheduler (like cron) answers "run this at 3 AM." An orchestrator answers a harder set of questions: run task B only after task A succeeds; retry C three times with backoff; if today's run fails, let me re-run just the failed task; and show me, at a glance, the state of every pipeline. That last part — observability — is half the reason Airflow exists.
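
    To make "retry C three times with backoff" concrete, here is a minimal pure-Python sketch of the bookkeeping an orchestrator takes off your hands. It is not Airflow's implementation: the helper name run_with_retries and its parameters are hypothetical, and the doubling delay is just one common backoff policy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(
    task: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying up to `retries` times with exponential backoff.

    Hypothetical helper for illustration only; `sleep` is injectable so the
    waiting can be observed or skipped in tests.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # out of retries: surface the original failure
            sleep(base_delay * 2 ** attempt)  # wait 1x, 2x, 4x, ... the base delay
            attempt += 1
```

    With cron you would write and maintain this loop yourself for every script. In Airflow the same behavior is declared per task with the retries and retry_delay arguments, and each attempt is recorded so you can see it in the UI.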

    Airflow sits naturally above your compute. It triggers a batch Spark job, kicks off a dbt run, waits for a file to land, then loads a warehouse — coordinating tools rather than replacing them.

    DAGs, Tasks, Operators, and the Scheduler

    Airflow has four concepts you need to internalize.

    • DAG (Directed Acyclic Graph): the workflow itself, a set of tasks plus the dependencies between them. "Acyclic" means no loops: dependencies only point forward, so a task can never end up waiting on itself.
    • Task: a single unit of work (a node in the DAG).
    • Operator: the template that defines what a task does. BashOperator runs a command, PythonOperator runs a function, KubernetesPodOperator runs a container, and provider packages add operators for Spark, dbt, BigQuery, S3, and hundreds more.
    • Scheduler: the process that reads your DAGs, decides which task instances are due, and dispatches them to workers.
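
    The "acyclic" requirement is what lets a scheduler work out a valid run order at all. The sketch below illustrates the idea with Python's standard graphlib module; it is not how Airflow's scheduler is implemented, and the task names and the helpers run_order and is_acyclic are illustrative assumptions.

```python
from graphlib import CycleError, TopologicalSorter

# Each key maps a task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}


def run_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; raises CycleError if the graph loops."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps: dict[str, set[str]]) -> bool:
    """True when every task can be ordered after all of its upstream tasks."""
    try:
        run_order(deps)
    except CycleError:
        return False
    return True


print(run_order(dependencies))  # ['extract', 'transform', 'load']
```

    A graph in which two tasks depend on each other has no valid order, which is why a workflow with a loop cannot be a DAG.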

    A minimal DAG looks like this:

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from datetime import datetime
    
    with DAG(
        dag_id="daily_sales_etl",
        start_date=datetime(2026, 1, 1),
        schedule="0 3 * * *",   # every day at 03:00
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="python extract.py {{ ds }}")
        transform = BashOperator(task_id="transform", bash_command="dbt run --select sales")
        load = BashOperator(task_id="load", bash_command="python publish.py {{ ds }}")
    
        extract >> transform >> load   # dependency: extract, then transform, then load

    The {{ ds }} is a templated variable: the run's logical date (called the execution date in older Airflow versions), rendered as a YYYY-MM-DD string. Templating is what makes Airflow pipelines reproducible: every task knows which logical date it is processing.
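
    As a rough illustration of what happens to that command before it runs, the sketch below substitutes {{ ds }} by hand. Airflow renders templates with Jinja and exposes many more variables; the render_ds helper here is a hypothetical stand-in that handles only this one placeholder.

```python
from datetime import datetime, timezone


def render_ds(command_template: str, logical_date: datetime) -> str:
    """Replace the literal '{{ ds }}' with the logical date as YYYY-MM-DD.

    Hypothetical stand-in for Jinja rendering, for illustration only.
    """
    ds = logical_date.strftime("%Y-%m-%d")
    return command_template.replace("{{ ds }}", ds)


run_date = datetime(2026, 1, 1, 3, 0, tzinfo=timezone.utc)
print(render_ds("python extract.py {{ ds }}", run_date))
# -> python extract.py 2026-01-01
```

    Because the date comes from the run rather than from the wall clock, re-running the same run next week produces exactly the same command.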

    Production Patterns That Actually Matter

    A "hello world" DAG is easy. A DAG that survives production is not. The patterns below are the difference, and they overlap heavily with general data pipeline design patterns.

    • Idempotency: a task must produce the same result whether it runs once or five times. Scope each run's work to its logical date (WHERE date = '{{ ds }}') and write with MERGE or INSERT OVERWRITE, never a blind INSERT. Without this, a retry doubles your data.
    • Backfills: because each run is tied to a logical date, Airflow can re-run history. Set catchup deliberately: True makes the scheduler create a run for every schedule interval missed since start_date, while False avoids a thundering herd of past runs on first deploy.
    • Sensors: tasks that wait for a condition (a file in S3, a partition in a table). Prefer mode="reschedule" so a waiting sensor frees its worker slot instead of blocking it.
    • Retries with backoff: set retries and retry_delay on tasks that touch flaky external systems, and add retry_exponential_backoff=True if the wait should grow between attempts.
    • SLAs and alerting: define SLAs so you hear about a late pipeline before a stakeholder does.
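
    The idempotency and backfill patterns above reinforce each other, which a small runnable sketch can show. It uses SQLite only so that it runs anywhere, and a delete-then-insert inside one transaction as a stand-in for MERGE or INSERT OVERWRITE. The sales table and the load_partition, backfill, and fetch names are hypothetical.

```python
import sqlite3
from datetime import date, timedelta
from typing import Callable


def load_partition(conn: sqlite3.Connection, ds: str, amounts: list[float]) -> None:
    """Replace all rows for logical date `ds` so reruns never duplicate data."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM sales WHERE sale_date = ?", (ds,))
        conn.executemany(
            "INSERT INTO sales (sale_date, amount) VALUES (?, ?)",
            [(ds, amount) for amount in amounts],
        )


def backfill(
    conn: sqlite3.Connection,
    start: date,
    end: date,
    fetch: Callable[[str], list[float]],
) -> None:
    """Re-run the load for every logical date from `start` to `end` inclusive."""
    day = start
    while day <= end:
        ds = day.isoformat()  # same YYYY-MM-DD shape as {{ ds }}
        load_partition(conn, ds, fetch(ds))
        day += timedelta(days=1)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")

# Loading the same logical date twice leaves one copy of the data, not two.
load_partition(conn, "2026-01-01", [10.0, 20.0])
load_partition(conn, "2026-01-01", [10.0, 20.0])
```

    Because every write is scoped to one logical date, a backfill over a date range is safe to repeat: each date's rows are replaced rather than appended.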

    Airflow and the Rest of the Stack

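    Airflow rarely lives alone. In a modern stack it typically sits above tools such as Spark, dbt, and a cloud warehouse, triggering and sequencing them rather than doing the processing itself.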

    If you want a hands-on walkthrough, build the "orchestration with Airflow" project, which wires a real DAG end to end.

    When NOT to Use Airflow

    Airflow is powerful but heavy: a scheduler, a metadata database, and workers to operate. It is overkill when:

    • You have a few independent jobs — plain cron or a managed scheduler is simpler.
    • You need sub-minute, event-driven latency — Airflow is batch-first; reach for a streaming system.
    • You want a more Pythonic, asset-centric model — Dagster and Prefect are worth evaluating. (A dedicated comparison is coming.)

    For most teams running many interdependent batch jobs, though, Airflow remains the default — and managed options (AWS MWAA, Google Cloud Composer, Astronomer) remove most of the operational burden.

    Frequently Asked Questions

    Is Airflow just a fancy cron?

    No. Cron schedules a single command at a time with no awareness of dependencies, state, or failure. Airflow models dependencies between tasks, tracks the state of every run, retries failures, supports backfills, and gives you a UI to observe and recover pipelines.

    Airflow vs Dagster vs Prefect — which should I use?

    Airflow has the largest ecosystem and is the safest default for batch orchestration. Dagster is asset-centric and has strong local development and data-awareness. Prefect is the most Pythonic and lightweight. For a brand-new project with a small team, Dagster or Prefect can be faster to start; for broad integration needs and hiring, Airflow is still the standard.

    Is Apache Airflow still relevant in 2026?

    Yes. Airflow 2.x and the move toward Airflow 3 modernized the scheduler, the API, and the developer experience, and managed offerings (MWAA, Composer, Astronomer) keep it the most widely deployed orchestrator. It remains the default for batch data pipelines.

    Do I need to know Python to use Airflow?

    Yes — DAGs are defined in Python. You do not need to be an expert, but comfort with functions, imports, and basic data structures is required. If you are still building that foundation, start with the data engineering roadmap.

    About the Author

    Adriano Sanges is a data engineer and the creator of dataskew.io. He builds production data platforms with Airflow, dbt, Spark and cloud warehouses, and writes hands-on guides to help aspiring data engineers advance their careers.

