TL;DR: Change Data Capture (CDC) identifies and streams row-level changes — inserts, updates, deletes — from a source database to downstream systems in near real time, avoiding expensive full-table reloads. Log-based CDC (reading the database write-ahead log/binlog, e.g. with Debezium) is the gold standard because it catches deletes and adds almost no load on the source. Query-based CDC (polling a timestamp column) is simpler but misses deletes. Use CDC when your source is a transactional database and you need fresh data downstream without re-extracting whole tables.
The naive way to sync a database into your warehouse is to copy the whole thing every night. It works until the table hits tens of millions of rows, the nightly job runs for hours, and "fresh" data is always a day old. Change Data Capture is the pattern that fixes both problems: move only what changed, as it changes.
What CDC Is and Why It Beats Full Reloads
CDC captures the changes to a dataset rather than its full state. Instead of "here are all 50 million orders," CDC emits "order 12345 was updated, order 12346 was inserted, order 99 was deleted."
The payoff is twofold: you stop re-reading unchanged data (cheaper, faster), and you get near-real-time freshness instead of once-a-day snapshots. This is the same instinct behind incremental models in ELT pipelines — do less work by processing only the delta.
Log-Based vs Query-Based CDC
There are two ways to detect changes, and the difference is the most important decision you'll make.
Query-based CDC polls the source on a schedule: SELECT * FROM orders WHERE updated_at > :last_run. It is trivial to implement and needs no special database privileges. But it has two fatal gaps: it cannot detect hard deletes (a deleted row simply stops appearing), and it depends on a reliable updated_at column that the application must always set.
Log-based CDC reads the database's transaction log directly — the PostgreSQL WAL, MySQL binlog, or equivalent. Every committed change is already written there for replication, so CDC piggybacks on it. This catches inserts, updates, and deletes, captures the before/after image of each row, and adds almost no query load on the source. The cost is operational complexity: you manage connectors, replication slots, and offsets.
| Query-based | Log-based | |
|---|---|---|
| Detects deletes | No | Yes |
| Load on source | Polling queries | Minimal (reads log) |
| Latency | Poll interval | Near real time |
| Setup complexity | Low | Higher (connectors, slots) |
Needs updated_at |
Yes | No |
For anything beyond a small, append-only table, log-based CDC is the right answer.
The CDC Toolchain
The dominant open-source engine is Debezium, which reads database logs and publishes changes as events — usually onto Kafka, which is why CDC and streaming go hand in hand (see Apache Kafka for data engineers). Managed options like Fivetran and AWS DMS wrap the same idea with less operational overhead. The output is a stream of change events that downstream systems consume.
CDC Patterns in the Pipeline
CDC produces a stream of changes, and streams demand discipline. The same pipeline design patterns apply directly:
- Idempotency: the same change event may arrive more than once. Apply changes with
MERGEkeyed on the primary key so reprocessing is safe. - Ordering: updates must be applied in commit order, or you'll resurrect stale values. Key your stream by primary key to preserve per-row order.
- At-least-once delivery: assume duplicates and design for them rather than hoping for exactly-once.
CDC and Slowly Changing Dimensions
CDC is a natural source for slowly changing dimensions. Because log-based CDC carries the before/after image of every row, you have exactly what you need to version history in a dimensional model: each captured update becomes a new SCD Type 2 record with valid-from/valid-to timestamps. This is far cleaner than trying to reconstruct history from nightly snapshots.
When NOT to Use CDC
CDC is not free. Skip it when:
- The table is small or changes infrequently — a batch reload is simpler and good enough.
- You don't control the source and can't enable logical replication.
- Your team can't yet operate Kafka/Debezium and you have no managed budget.
Match the tool to the freshness you actually need, not the freshness that sounds impressive.
Frequently Asked Questions
Is CDC the same as ETL?
No. ETL/ELT describes the overall extract-transform-load process; CDC is a specific extraction technique that captures changes instead of full snapshots. You can use CDC as the extract step of an ELT pipeline.
Does CDC require Kafka?
Not strictly, but log-based CDC pairs naturally with Kafka because change events are a stream and Kafka is the standard transport for streams. Managed tools (Fivetran, AWS DMS) can land CDC output directly into a warehouse without you operating Kafka.
What's the difference between CDC for replication and CDC for analytics?
Replication aims to keep a near-identical copy of the source. Analytics CDC often transforms and models the changes (e.g. into SCD Type 2 history) for reporting. Same capture mechanism, different downstream handling.
How does CDC handle schema changes on the source?
Schema drift is the hard part. Log-based tools like Debezium emit schema-change events, but your downstream consumers must handle new/renamed/dropped columns gracefully — which is exactly where data contracts earn their keep.