Change Data Capture (CDC): How It Works and When to Use It (2026)

    Change Data Capture (CDC) explained for data engineers: log-based vs query-based, Debezium and friends, the pipeline patterns that matter, and when CDC beats batch reloads.

    By Adriano Sanges--10 min read
    change data capture
    cdc
    debezium
    data replication
    streaming
    data pipelines
    data engineering

    TL;DR: Change Data Capture (CDC) identifies and streams row-level changes — inserts, updates, deletes — from a source database to downstream systems in near real time, avoiding expensive full-table reloads. Log-based CDC (reading the database write-ahead log/binlog, e.g. with Debezium) is the gold standard because it catches deletes and adds almost no load on the source. Query-based CDC (polling a timestamp column) is simpler but misses deletes. Use CDC when your source is a transactional database and you need fresh data downstream without re-extracting whole tables.

    The naive way to sync a database into your warehouse is to copy the whole thing every night. It works until the table hits tens of millions of rows, the nightly job runs for hours, and "fresh" data is always a day old. Change Data Capture is the pattern that fixes both problems: move only what changed, as it changes.

    What CDC Is and Why It Beats Full Reloads

    CDC captures the changes to a dataset rather than its full state. Instead of "here are all 50 million orders," CDC emits "order 12345 was updated, order 12346 was inserted, order 99 was deleted."

    The payoff is twofold: you stop re-reading unchanged data (cheaper, faster), and you get near-real-time freshness instead of once-a-day snapshots. This is the same instinct behind incremental models in ELT pipelines — do less work by processing only the delta.

    Log-Based vs Query-Based CDC

    There are two ways to detect changes, and the difference is the most important decision you'll make.

    Query-based CDC polls the source on a schedule: SELECT * FROM orders WHERE updated_at > :last_run. It is trivial to implement and needs no special database privileges. But it has two fatal gaps: it cannot detect hard deletes (a deleted row simply stops appearing), and it depends on a reliable updated_at column that the application must always set.

    Log-based CDC reads the database's transaction log directly — the PostgreSQL WAL, MySQL binlog, or equivalent. Every committed change is already written there for replication, so CDC piggybacks on it. This catches inserts, updates, and deletes, captures the before/after image of each row, and adds almost no query load on the source. The cost is operational complexity: you manage connectors, replication slots, and offsets.

    Query-based Log-based
    Detects deletes No Yes
    Load on source Polling queries Minimal (reads log)
    Latency Poll interval Near real time
    Setup complexity Low Higher (connectors, slots)
    Needs updated_at Yes No

    For anything beyond a small, append-only table, log-based CDC is the right answer.

    The CDC Toolchain

    The dominant open-source engine is Debezium, which reads database logs and publishes changes as events — usually onto Kafka, which is why CDC and streaming go hand in hand (see Apache Kafka for data engineers). Managed options like Fivetran and AWS DMS wrap the same idea with less operational overhead. The output is a stream of change events that downstream systems consume.

    CDC Patterns in the Pipeline

    CDC produces a stream of changes, and streams demand discipline. The same pipeline design patterns apply directly:

    • Idempotency: the same change event may arrive more than once. Apply changes with MERGE keyed on the primary key so reprocessing is safe.
    • Ordering: updates must be applied in commit order, or you'll resurrect stale values. Key your stream by primary key to preserve per-row order.
    • At-least-once delivery: assume duplicates and design for them rather than hoping for exactly-once.

    CDC and Slowly Changing Dimensions

    CDC is a natural source for slowly changing dimensions. Because log-based CDC carries the before/after image of every row, you have exactly what you need to version history in a dimensional model: each captured update becomes a new SCD Type 2 record with valid-from/valid-to timestamps. This is far cleaner than trying to reconstruct history from nightly snapshots.

    When NOT to Use CDC

    CDC is not free. Skip it when:

    • The table is small or changes infrequently — a batch reload is simpler and good enough.
    • You don't control the source and can't enable logical replication.
    • Your team can't yet operate Kafka/Debezium and you have no managed budget.

    Match the tool to the freshness you actually need, not the freshness that sounds impressive.

    Frequently Asked Questions

    Is CDC the same as ETL?

    No. ETL/ELT describes the overall extract-transform-load process; CDC is a specific extraction technique that captures changes instead of full snapshots. You can use CDC as the extract step of an ELT pipeline.

    Does CDC require Kafka?

    Not strictly, but log-based CDC pairs naturally with Kafka because change events are a stream and Kafka is the standard transport for streams. Managed tools (Fivetran, AWS DMS) can land CDC output directly into a warehouse without you operating Kafka.

    What's the difference between CDC for replication and CDC for analytics?

    Replication aims to keep a near-identical copy of the source. Analytics CDC often transforms and models the changes (e.g. into SCD Type 2 history) for reporting. Same capture mechanism, different downstream handling.

    How does CDC handle schema changes on the source?

    Schema drift is the hard part. Log-based tools like Debezium emit schema-change events, but your downstream consumers must handle new/renamed/dropped columns gracefully — which is exactly where data contracts earn their keep.

    About the Author

    Adriano Sanges is a data engineer and the creator of dataskew.io. He builds production data platforms with Airflow, dbt, Spark and cloud warehouses, and writes hands-on guides to help aspiring data engineers advance their careers.

    LinkedIn · Website

    Ready to Apply What You Learned?

    Take the next step in your data engineering journey with structured roadmaps and hands-on projects designed for real-world experience.