Idempotent Pipelines
An idempotent pipeline produces the same result whether it runs once or multiple times with the same input. This property is essential for reliable data pipelines — you should be able to re-run any pipeline without worrying about duplicated data, incorrect counts, or corrupted tables.
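As a toy illustration of the property (plain Python, no framework involved), a keyed upsert is idempotent while a blind append is not:

```python
# Toy illustration: the same "load" applied twice.
# A keyed upsert (dict) is idempotent; a blind append (list) is not.
rows = [("order-1", 100), ("order-2", 250)]

table_upsert = {}   # keyed by order_id -> idempotent
table_append = []   # no key, just append -> not idempotent

def run_pipeline():
    for key, amount in rows:
        table_upsert[key] = amount          # insert-or-update
        table_append.append((key, amount))  # always inserts

run_pipeline()
run_pipeline()  # re-run with the same input

print(len(table_upsert))  # 2 rows, same as a single run
print(len(table_append))  # 4 rows: every order duplicated
```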
Why Idempotency Matters
Pipelines re-run more often than you might expect:
- Scheduler catch-up — after a `ratd` restart, the catch-up-once policy fires a run
- Manual retries — you re-run a failed pipeline after fixing the root cause
- Trigger storms — a rapid series of file uploads fires the same pipeline multiple times
- Testing — during development, you run the same pipeline repeatedly
If your pipeline is not idempotent, these re-runs can cause:
- Duplicate rows — the same data inserted multiple times
- Incorrect aggregates — counts and sums inflated by duplicates
- Data inconsistency — different results depending on how many times the pipeline ran
Merge Strategy Idempotency
Not all merge strategies in RAT are naturally idempotent. Here is a summary:
| Merge Strategy | Idempotent? | Why |
|---|---|---|
| `full_refresh` | Yes | Completely replaces the table every run. Same input = same output. |
| `incremental` | Yes | Uses `unique_key` to upsert (insert-or-update). Running twice with the same data updates existing rows; it does not duplicate them. |
| `delete_insert` | Yes | Deletes matching rows by `unique_key`, then inserts. Same input = same result. |
| `scd2` | Yes | Tracks changes via `valid_from`/`valid_to`. Reprocessing the same data does not create new history records. |
| `snapshot` | Conditional | Idempotent if the `partition_column` value is deterministic (e.g., a date from the data). Not idempotent when using `current_timestamp`. |
| `append_only` | No | Every run appends rows regardless of whether they already exist. |
The Danger of `append_only`
`append_only` is the only merge strategy that is inherently not idempotent. Each run appends all rows from the query, even if those exact rows were already appended by a previous run.
```sql
-- @merge_strategy: append_only
SELECT *
FROM {{ landing_zone('raw_events') }}
```

If this pipeline runs twice on the same landing zone files, every event is duplicated.
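A minimal sketch of the failure mode, using SQLite as a stand-in for the warehouse (table and column names here are illustrative, not RAT's):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE raw_events (event_id TEXT, payload TEXT)")

landing_zone = [("e1", "click"), ("e2", "view")]

def run_append_only():
    # append_only: every run inserts all rows from the query
    con.executemany("INSERT INTO raw_events VALUES (?, ?)", landing_zone)

run_append_only()
run_append_only()  # re-run on the same landing zone files

count = con.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
print(count)  # 4 -- every event duplicated
```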
When is `append_only` safe?
`append_only` is appropriate when:
- Source data is guaranteed unique per run — each run processes new files that are then archived (using `archive_landing_zones: true`)
- Downstream consumers handle deduplication — the Silver layer deduplicates with an `incremental` merge
- You want an immutable event log — and duplicates are acceptable or handled elsewhere

If you use `append_only` for Bronze ingestion, always pair it with a downstream Silver pipeline that deduplicates using `incremental` or `delete_insert` with a `unique_key`.
Common Idempotency Patterns
Pattern 1: Full Refresh (Simplest)
The safest approach. Every run replaces the entire table with fresh data from the query.
```sql
-- @merge_strategy: full_refresh
SELECT
    order_id,
    customer_id,
    total_amount,
    status,
    updated_at
FROM {{ ref('bronze.raw_orders') }}
WHERE order_id IS NOT NULL
```

Trade-off: For large tables, full refresh is expensive because it rewrites all data on every run. Use it for Gold tables (typically smaller) or when the source data is small enough.
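The mechanics can be sketched in a few lines with SQLite standing in for the warehouse (names are illustrative): wipe the table, reload it, and a second run lands on exactly the same state.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clean_orders (order_id TEXT PRIMARY KEY, total REAL)")

source = [("o1", 100.0), ("o2", 250.0)]

def run_full_refresh():
    # full_refresh: wipe the table, then load the query result
    con.execute("DELETE FROM clean_orders")
    con.executemany("INSERT INTO clean_orders VALUES (?, ?)", source)

run_full_refresh()
run_full_refresh()  # second run replaces the table with identical data

count = con.execute("SELECT COUNT(*) FROM clean_orders").fetchone()[0]
print(count)  # 2 -- same as after a single run
```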
Pattern 2: Incremental with Unique Key
The most common pattern for Silver tables. Uses `unique_key` to identify rows and `watermark_column` to process only new or changed data.
```sql
-- @merge_strategy: incremental
-- @unique_key: order_id
-- @watermark_column: updated_at
SELECT
    order_id,
    customer_id,
    total_amount,
    status,
    updated_at
FROM {{ ref('bronze.raw_orders') }}
{% if is_incremental() %}
WHERE updated_at > '{{ watermark_value }}'
{% endif %}
```

How it stays idempotent: If the pipeline runs twice with the same `watermark_value`, it reads the same rows and upserts them. The second run updates existing rows to the same values — the result is identical.
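The upsert behavior can be sketched with SQLite's `INSERT ... ON CONFLICT DO UPDATE` standing in for the merge (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE clean_orders ("
    " order_id TEXT PRIMARY KEY, status TEXT, updated_at TEXT)"
)

batch = [("o1", "shipped", "2024-01-02"), ("o2", "pending", "2024-01-02")]

def run_incremental():
    # incremental: upsert on the unique_key (order_id)
    con.executemany(
        "INSERT INTO clean_orders VALUES (?, ?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET "
        "status = excluded.status, updated_at = excluded.updated_at",
        batch,
    )

run_incremental()
run_incremental()  # re-run with the same watermark window

rows = con.execute("SELECT * FROM clean_orders ORDER BY order_id").fetchall()
print(len(rows))  # 2 -- the second run updated rows to identical values
```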
Pattern 3: Delete-Insert with Unique Key
Similar to incremental, but explicitly deletes matching rows before inserting. Useful when you want to guarantee no stale data from partial updates.
```sql
-- @merge_strategy: delete_insert
-- @unique_key: order_id
-- @watermark_column: updated_at
SELECT
    order_id,
    customer_id,
    total_amount,
    status,
    updated_at
FROM {{ ref('bronze.raw_orders') }}
{% if is_incremental() %}
WHERE updated_at > '{{ watermark_value }}'
{% endif %}
```

Pattern 4: Append-Only with Deduplication Downstream
When you need Bronze to capture every raw event (including potential duplicates), use `append_only` in Bronze and deduplicate in Silver.

Bronze:

```sql
-- @merge_strategy: append_only
-- @archive_landing_zones: true
SELECT *, current_timestamp AS _ingested_at
FROM {{ landing_zone('events') }}
```

Silver:

```sql
-- @merge_strategy: incremental
-- @unique_key: event_id
-- @watermark_column: event_timestamp
SELECT DISTINCT ON (event_id)
    event_id,
    event_type,
    user_id,
    payload,
    event_timestamp
FROM {{ ref('bronze.raw_events') }}
{% if is_incremental() %}
WHERE event_timestamp > '{{ watermark_value }}'
{% endif %}
ORDER BY event_id, _ingested_at DESC
```

Pattern 5: Snapshot with Deterministic Partition
Snapshot is idempotent when the partition column value comes from the data, not from runtime.
```sql
-- @merge_strategy: snapshot
-- @partition_column: report_date
SELECT
    report_date,
    region,
    total_sales,
    total_orders
FROM {{ ref('silver.regional_sales') }}
WHERE report_date = CURRENT_DATE - INTERVAL '1 day'
```

Re-running this pipeline for the same `report_date` replaces the same partition with the same data.
```sql
-- @merge_strategy: snapshot
-- @partition_column: snapshot_ts
SELECT
    current_timestamp AS snapshot_ts, -- different every run!
    region,
    total_sales
FROM {{ ref('silver.regional_sales') }}
```

This creates a new partition every run because `snapshot_ts` changes each time.
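The contrast can be sketched with SQLite standing in for partitioned storage (names are illustrative; a counter stands in for `current_timestamp`, since all that matters is that it differs on every run):

```python
import sqlite3
from itertools import count

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE sales_snapshot (partition_key TEXT, region TEXT, total REAL)"
)

def run_snapshot(partition_key):
    # snapshot: delete the target partition, then insert it fresh
    con.execute(
        "DELETE FROM sales_snapshot WHERE partition_key = ?", (partition_key,)
    )
    con.execute(
        "INSERT INTO sales_snapshot VALUES (?, 'EU', 100.0)", (partition_key,)
    )

# Deterministic key (a date carried in the data): re-runs overwrite one partition.
run_snapshot("2024-01-01")
run_snapshot("2024-01-01")
deterministic = con.execute("SELECT COUNT(*) FROM sales_snapshot").fetchone()[0]
print(deterministic)  # 1

# Runtime key (stand-in for current_timestamp): each run hits a new partition.
clock = count()
run_snapshot(f"ts-{next(clock)}")
run_snapshot(f"ts-{next(clock)}")
total = con.execute("SELECT COUNT(*) FROM sales_snapshot").fetchone()[0]
print(total)  # 3 -- the two "timestamped" runs piled up
```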
Testing Idempotency
The simplest test: run your pipeline twice on the same data and compare the results.
Manual test
- Run the pipeline once
- Note the row count and a few key metrics
- Run the pipeline again (same data, no changes)
- Compare: row count should be identical, metrics should be identical
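The manual test above can be automated as a small harness. This sketch assumes a callable `run_pipeline()` (here a hypothetical SQLite upsert stands in for your real run) and compares row count plus an aggregate across two runs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE clean_orders (order_id TEXT PRIMARY KEY, total REAL)")

def run_pipeline():
    # hypothetical stand-in for your real pipeline run (an upsert here)
    con.executemany(
        "INSERT INTO clean_orders VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET total = excluded.total",
        [("o1", 100.0), ("o2", 250.0)],
    )

def metrics():
    # row count plus a simple aggregate; identical metrics across
    # back-to-back runs is the signal we are looking for
    return con.execute(
        "SELECT COUNT(*), COALESCE(SUM(total), 0) FROM clean_orders"
    ).fetchone()

run_pipeline()
first = metrics()
run_pipeline()          # same data, no changes
second = metrics()
print(first == second)  # True for an idempotent pipeline
```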
Quality test
Create a quality test that detects duplicates on your unique key:
```sql
-- @severity: error
-- @description: Ensure no duplicate order_ids exist
SELECT order_id, COUNT(*) AS cnt
FROM {{ ref('silver.clean_orders') }}
GROUP BY order_id
HAVING COUNT(*) > 1
```

This test returns violation rows (duplicate `order_id` values). If it returns zero rows, the pipeline is idempotent with respect to uniqueness.
Checklist
Use this checklist to verify your pipeline’s idempotency:
- Does the pipeline have a `unique_key` if using `incremental`, `delete_insert`, or `scd2`?
- If using `append_only`, is there a downstream deduplication step?
- If using `snapshot`, is the `partition_column` deterministic (from data, not runtime)?
- Can the pipeline be run twice on the same data without changing the output?
- Is there a quality test that checks for duplicate keys?
- Does the pipeline avoid `current_timestamp` in the result columns (or is it used only for metadata)?