Merge Strategies
Merge strategies control how RAT combines new pipeline results with existing data in your Iceberg tables. Choosing the right strategy is critical for data correctness, idempotency, and performance.
RAT supports 6 merge strategies, each designed for a specific pattern.
Decision Tree
Not sure which strategy to use? Follow this flowchart:
Strategy Comparison
| Strategy | Unique Key | Dedup | Idempotent | Use Case |
|---|---|---|---|---|
| full_refresh | - | - | Yes | Small tables, aggregations, snapshots |
| incremental | Required | Yes (last wins) | Yes | Large tables with upserts |
| append_only | - | - | No | Event logs, audit trails |
| delete_insert | Required | No | Yes | Batch loads keyed by date/source |
| scd2 | Required | - | Yes | Dimension tables with history |
| snapshot | - | - | Yes | Partitioned tables with partial refresh |
full_refresh
Drop everything and recreate. The entire target table is replaced with the pipeline’s output on every run.
How It Works
Configuration
No special configuration needed — this is the default strategy.
```sql
-- @merge_strategy: full_refresh
-- or simply omit @merge_strategy entirely
SELECT
    order_date,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue
FROM {{ ref('silver.clean_orders') }}
GROUP BY order_date
```

When to Use
- Aggregation / Gold tables — Summary tables that are recomputed from source
- Small tables — Reference data, lookup tables, dimension tables that are small enough to fully reload
- Development — Quick iteration when you do not care about history
Properties
- Idempotent: Yes. Running twice produces the same result.
- Required annotations: None
- Performance: Rewrites the entire table every run. Fine for small tables, expensive for large ones.
full_refresh is the safest strategy. When in doubt, start here. You can switch to incremental later when the table grows large enough to warrant it.
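The semantics are simple enough to state in a couple of lines. Here is a minimal Python sketch (illustrative only, not RAT's implementation) that models a table as a list of row dicts:

```python
def full_refresh(table, batch):
    """Replace the entire target table with the new pipeline output."""
    return list(batch)

table = [{"order_date": "2024-01-01", "revenue": 100}]
batch = [{"order_date": "2024-01-02", "revenue": 250}]

# The old contents are discarded; only the batch survives.
# Applying the same batch again yields the same table (idempotent).
table = full_refresh(table, batch)
```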
incremental
Upsert new rows into the existing table. New rows are deduplicated by unique_key (last row wins via ROW_NUMBER), then merged into the target table. Existing rows with matching keys are updated; new keys are inserted.
How It Works
Configuration
```sql
-- @merge_strategy: incremental
-- @unique_key: order_id
-- @watermark_column: updated_at
SELECT
    order_id,
    customer_id,
    status,
    total_amount,
    updated_at
FROM {{ ref('bronze.raw_orders') }}
{% if is_incremental() %}
WHERE updated_at > '{{ watermark_value }}'
{% endif %}
```

Dedup Behavior
When multiple rows in the pipeline output share the same unique_key, RAT deduplicates with a ROW_NUMBER() window partitioned by unique_key and keeps a single row per key: the row that appears last in the result set wins.
For composite keys, separate column names with commas:
```sql
-- @unique_key: order_id, product_id
```

When to Use
- Large fact tables — Orders, events, transactions that grow over time
- CDC (Change Data Capture) — Processing change feeds with upsert semantics
- Any table with a natural key — Where you want “last state wins”
Properties
- Idempotent: Yes. Running the same data twice updates existing rows to the same values.
- Required annotations: unique_key
- Optional annotations: watermark_column (for filtering new rows)
- Performance: Only processes new/changed rows when used with watermarks.
Without the {% if is_incremental() %} watermark filter, every run scans all source data. The merge is still correct (idempotent), but you lose the performance benefit of incremental processing.
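The dedup-then-upsert behavior described above can be sketched in a few lines of Python (an illustration of the semantics, not RAT's actual code), modeling tables as lists of row dicts:

```python
def incremental_merge(table, batch, unique_key):
    """Dedup the batch (last row per key wins), then upsert into the target."""
    # Dedup: iterating in order means the last occurrence of each key survives,
    # mirroring the ROW_NUMBER()-based "last row wins" rule.
    deduped = {row[unique_key]: row for row in batch}
    # Upsert: matching keys are updated, new keys are inserted.
    merged = {row[unique_key]: row for row in table}
    merged.update(deduped)
    return sorted(merged.values(), key=lambda r: r[unique_key])

table = [{"order_id": 1, "status": "new"}]
batch = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "new"},
    {"order_id": 2, "status": "cancelled"},  # last row for key 2 wins
]
table = incremental_merge(table, batch, "order_id")
```

Running the merge a second time with the same batch leaves the table unchanged, which is what makes the strategy idempotent.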
append_only
Unconditionally append rows to the table. No dedup, no merge, no key matching. Every run adds new rows.
How It Works
Configuration
```sql
-- @merge_strategy: append_only
-- @description: Raw event log — append all incoming events
SELECT
    event_id,
    event_type,
    user_id,
    payload,
    created_at,
    '{{ run_started_at }}' AS _loaded_at
FROM read_csv_auto(
    '{{ landing_zone("events") }}',
    header = true
)
```

When to Use
- Event logs — Clickstream, audit logs, IoT sensor readings
- Landing zone ingestion — Raw data that should be appended as-is
- Immutable data — Records that never change once created
Properties
- Idempotent: No. Running the same data twice creates duplicate rows. This is the only non-idempotent strategy.
- Required annotations: None
- Performance: Very fast — just appends, no key matching.
Be careful with retries. If a pipeline run fails after writing but before completion, re-running it will append the same data again. Use append_only only when duplicates are acceptable or when upstream data is guaranteed to be unique per run (e.g., new files in a landing zone with archive_landing_zones: true).
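The retry hazard is easy to see in a toy sketch (Python, illustrative only): appending the same batch twice simply doubles the rows.

```python
def append_only(table, batch):
    """Unconditionally append all incoming rows; no keys, no dedup."""
    return table + list(batch)

table = []
batch = [{"event_id": "e1"}, {"event_id": "e2"}]

table = append_only(table, batch)  # first run: 2 rows
table = append_only(table, batch)  # retry with the same data: 4 rows, duplicated
```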
delete_insert
Delete all existing rows that match the incoming keys, then insert all new rows without dedup. Unlike incremental, there is no ROW_NUMBER dedup — all incoming rows for a key are inserted.
How It Works
Configuration
```sql
-- @merge_strategy: delete_insert
-- @unique_key: batch_date
SELECT
    batch_date,
    product_id,
    warehouse_id,
    quantity_on_hand,
    last_counted_at
FROM {{ ref('bronze.raw_inventory') }}
WHERE batch_date = CURRENT_DATE
```

How It Differs from Incremental
| Aspect | incremental | delete_insert |
|---|---|---|
| Dedup | Yes (ROW_NUMBER, last wins) | No dedup |
| Multiple rows per key | Keeps 1 per key | Keeps all rows per key |
| Operation | MERGE (upsert) | DELETE matching keys + INSERT |
| Use case | One row per entity | Multiple rows per key value |
When to Use
- Batch reloads — Replace all data for a given date/batch/source
- One-to-many keys — When a key (like batch_date) maps to many rows
- Idempotent batch processing — Re-running a batch replaces it cleanly
Properties
- Idempotent: Yes. Re-running with the same data deletes and re-inserts the same rows.
- Required annotations: unique_key
- Performance: Deletes can be expensive if the key matches many rows.
A common pattern is using batch_date or source_file as the unique_key in delete_insert. This lets you reprocess a specific day’s data without affecting other days.
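Sketched in Python (semantics only, not RAT's implementation): every incoming key is wiped from the target, then all incoming rows are inserted as-is.

```python
def delete_insert(table, batch, unique_key):
    """Delete target rows whose key appears in the batch, then insert the batch."""
    incoming_keys = {row[unique_key] for row in batch}
    kept = [row for row in table if row[unique_key] not in incoming_keys]
    return kept + list(batch)  # no dedup: every incoming row is inserted

table = [
    {"batch_date": "2024-11-01", "product_id": "A", "qty": 5},
    {"batch_date": "2024-11-02", "product_id": "A", "qty": 7},
]
# Reprocess 2024-11-02: three rows share the key, and all three are kept.
batch = [
    {"batch_date": "2024-11-02", "product_id": "A", "qty": 8},
    {"batch_date": "2024-11-02", "product_id": "B", "qty": 3},
    {"batch_date": "2024-11-02", "product_id": "C", "qty": 1},
]
table = delete_insert(table, batch, "batch_date")
```

Re-running with the same batch deletes and re-inserts the same three rows, so the result is stable across retries.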
scd2
Slowly Changing Dimension Type 2. Tracks full history of changes. When a record changes, the old version is “closed” (its valid_to is set to the current timestamp) and a new version is inserted with valid_to = NULL (or 9999-12-31).
How It Works
Configuration
```sql
-- @merge_strategy: scd2
-- @unique_key: customer_id
-- @scd_valid_from: valid_from
-- @scd_valid_to: valid_to
-- @description: Customer dimension with full change history
SELECT
    customer_id,
    name,
    email,
    segment,
    region,
    updated_at
FROM {{ ref('bronze.raw_customers') }}
```

The scd_valid_from and scd_valid_to annotations are optional. They default to valid_from and valid_to respectively. RAT manages these columns automatically — you do not need to include them in your SELECT.
Resulting Table Structure
After several runs, the target table looks like this:
| customer_id | name | segment | valid_from | valid_to |
|---|---|---|---|---|
| 42 | Bob | Basic | 2024-01-15 | 2024-06-01 |
| 42 | Bob Smith | Premium | 2024-06-01 | 2024-11-20 |
| 42 | Bob Smith | Enterprise | 2024-11-20 | NULL |
The row with valid_to = NULL is the current version. To query only current records:
```sql
SELECT * FROM {{ ref('silver.dim_customers') }}
WHERE valid_to IS NULL
```

When to Use
- Dimension tables — Customers, products, employees where you need change history
- Compliance — Regulatory requirements to track when values changed
- Historical analysis — “What was the customer’s segment when they placed this order?”
Properties
- Idempotent: Yes. Re-running with unchanged data has no effect. Changed data correctly closes and re-opens records.
- Required annotations: unique_key
- Optional annotations: scd_valid_from (default: valid_from), scd_valid_to (default: valid_to)
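A simplified Python sketch of the close-and-insert mechanics (illustrative only; the real column handling and timestamps are managed by RAT):

```python
def scd2_merge(table, batch, unique_key, run_date):
    """Close the current version of changed records and insert new open versions."""
    out = [dict(r) for r in table]
    # Current versions are the rows whose valid_to is still open (None).
    current = {r[unique_key]: r for r in out if r["valid_to"] is None}
    for row in batch:
        cur = current.get(row[unique_key])
        if cur is not None:
            cur_attrs = {k: v for k, v in cur.items()
                         if k not in ("valid_from", "valid_to")}
            if cur_attrs == row:
                continue                 # unchanged: no-op, so reruns are idempotent
            cur["valid_to"] = run_date   # close the old version
        out.append({**row, "valid_from": run_date, "valid_to": None})
    return out

table = [{"customer_id": 42, "segment": "Basic",
          "valid_from": "2024-01-15", "valid_to": None}]
batch = [{"customer_id": 42, "segment": "Premium"}]
table = scd2_merge(table, batch, "customer_id", "2024-06-01")
```

After the run, the Basic row is closed at 2024-06-01 and a new open Premium row exists; re-running the same batch changes nothing.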
snapshot
Partition-aware overwrite. Only replaces partitions that are present in the new data. Partitions not in the pipeline output are left untouched.
How It Works
Configuration
```sql
-- @merge_strategy: snapshot
-- @partition_column: region
-- @description: Regional sales data — overwrite only affected regions
SELECT
    region,
    product_id,
    sale_date,
    quantity,
    revenue
FROM {{ ref('bronze.raw_sales') }}
WHERE region IN ('US', 'EU')
```

When to Use
- Regional data — Process one region at a time without affecting others
- Time-partitioned tables — Reprocess a specific month without touching other months
- Partial refreshes — Update only the partitions that have new source data
Properties
- Idempotent: Yes. Re-running with the same data for the same partitions produces the same result.
- Required annotations: partition_column
- Performance: Only rewrites affected partitions, leaving others untouched.
The partition_column must exist in both the pipeline output and the target table. RAT determines which partitions to overwrite by looking at the distinct values of partition_column in the pipeline output.
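The partition-aware overwrite can be sketched as follows (Python, semantics only): partitions present in the batch are replaced wholesale, and all other partitions survive.

```python
def snapshot_overwrite(table, batch, partition_column):
    """Replace only the partitions present in the new data; leave the rest alone."""
    affected = {row[partition_column] for row in batch}
    kept = [row for row in table if row[partition_column] not in affected]
    return kept + list(batch)

table = [
    {"region": "US", "revenue": 100},
    {"region": "EU", "revenue": 200},
    {"region": "APAC", "revenue": 300},
]
batch = [
    {"region": "US", "revenue": 150},
    {"region": "EU", "revenue": 250},
]
table = snapshot_overwrite(table, batch, "region")  # APAC partition untouched
```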
Common Mistakes and Gotchas
1. Using append_only when you mean incremental
```sql
-- WRONG: append_only with a unique key — you'll get duplicates
-- @merge_strategy: append_only
-- @unique_key: order_id  <-- this annotation is ignored!

-- RIGHT: use incremental for dedup by key
-- @merge_strategy: incremental
-- @unique_key: order_id
```

append_only ignores unique_key. If you want dedup, use incremental or delete_insert.
2. Forgetting watermark filter in incremental
```sql
-- WRONG: no watermark filter — scans all data every run
-- @merge_strategy: incremental
-- @unique_key: order_id
-- @watermark_column: updated_at
SELECT * FROM {{ ref('bronze.raw_orders') }}
-- Missing: {% if is_incremental() %} WHERE ... {% endif %}
```

This is correct but slow. The merge is still idempotent, but every run reprocesses all rows. Add the watermark filter for performance.
3. Using delete_insert when you want incremental
delete_insert does not dedup incoming rows. If your pipeline output has 3 rows with order_id = 42, all 3 will be inserted. Use incremental if you want only the latest row per key.
4. Forgetting partition_column for snapshot
```sql
-- WRONG: snapshot without partition_column
-- @merge_strategy: snapshot

-- RIGHT: snapshot requires a partition column
-- @merge_strategy: snapshot
-- @partition_column: region
```

Without partition_column, RAT does not know which partitions to overwrite and the pipeline will fail.
5. SCD2 columns in SELECT
```sql
-- WRONG: including valid_from/valid_to in your SELECT
SELECT customer_id, name, valid_from, valid_to FROM ...

-- RIGHT: RAT manages SCD2 columns automatically
SELECT customer_id, name FROM ...
```

RAT adds the valid_from and valid_to columns automatically during the SCD2 merge process. Including them in your SELECT may cause conflicts.
6. Assuming append_only is idempotent
Re-running an append_only pipeline with the same source data will insert duplicate rows. Use archive_landing_zones: true with landing zone sources to ensure files are processed only once:
```sql
-- @merge_strategy: append_only
-- @archive_landing_zones: true
SELECT * FROM read_csv_auto('{{ landing_zone("events") }}')
```

After a successful run, the source files are moved to _processed/, so the next run picks up only new files.