
Merge Strategies

Merge strategies control how RAT combines new pipeline results with existing data in your Iceberg tables. Choosing the right strategy is critical for data correctness, idempotency, and performance.

RAT supports 6 merge strategies, each designed for a specific pattern.

Decision Tree

Not sure which strategy to use? Work through these questions in order:

  • Need full change history per key? → scd2
  • Partitioned table where only some partitions are refreshed? → snapshot
  • Immutable data where duplicates on retry are acceptable? → append_only
  • One key (e.g. batch_date) maps to many rows that should be replaced together? → delete_insert
  • Large table with one row per natural key? → incremental
  • Otherwise (small tables, aggregations) → full_refresh

Strategy Comparison

| Strategy | Unique Key | Dedup | Idempotent | Use Case |
| --- | --- | --- | --- | --- |
| full_refresh | - | - | Yes | Small tables, aggregations, snapshots |
| incremental | Required | Yes (last wins) | Yes | Large tables with upserts |
| append_only | - | - | No | Event logs, audit trails |
| delete_insert | Required | No | Yes | Batch loads keyed by date/source |
| scd2 | Required | - | Yes | Dimension tables with history |
| snapshot | - | - | Yes | Partitioned tables with partial refresh |

full_refresh

Drop everything and recreate. The entire target table is replaced with the pipeline’s output on every run.

How It Works
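Conceptually, every run replaces the table wholesale. The sketch below is illustrative pseudocode, not RAT's actual implementation (which may use a single atomic replace); table names are hypothetical:

```sql
-- Conceptual sketch only; names are illustrative, not RAT internals.
BEGIN;
DROP TABLE IF EXISTS gold.daily_orders;   -- discard all existing rows
CREATE TABLE gold.daily_orders AS         -- recreate from the pipeline output
SELECT * FROM pipeline_output;
COMMIT;
```

Because the whole table is rewritten, there is nothing to reconcile: the result depends only on the pipeline output, which is why this strategy is trivially idempotent.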

Configuration

No special configuration needed — this is the default strategy.

pipeline.sql
-- @merge_strategy: full_refresh
-- or simply omit @merge_strategy entirely
 
SELECT
    order_date,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue
FROM {{ ref('silver.clean_orders') }}
GROUP BY order_date

When to Use

  • Aggregation / Gold tables — Summary tables that are recomputed from source
  • Small tables — Reference data, lookup tables, dimension tables that are small enough to fully reload
  • Development — Quick iteration when you do not care about history

Properties

  • Idempotent: Yes. Running twice produces the same result.
  • Required annotations: None
  • Performance: Rewrites the entire table every run. Fine for small tables, expensive for large ones.

full_refresh is the safest strategy. When in doubt, start here. You can switch to incremental later when the table grows large enough to warrant it.


incremental

Upsert new rows into the existing table. New rows are deduplicated by unique_key (last row wins via ROW_NUMBER), then merged into the target table. Existing rows with matching keys are updated; new keys are inserted.

How It Works
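Conceptually, a run first dedups the pipeline output to one row per key, then upserts it into the target. The sketch below is illustrative pseudocode (the MERGE syntax, names, and the use of the watermark column for ordering are assumptions, not RAT internals):

```sql
-- Conceptual sketch only; names and syntax are illustrative.
WITH deduped AS (
    SELECT * FROM (
        SELECT *,
               -- keep one row per key; ordering by the watermark column
               -- stands in for "last row in the result set wins"
               ROW_NUMBER() OVER (PARTITION BY order_id
                                  ORDER BY updated_at DESC) AS rn
        FROM pipeline_output
    ) WHERE rn = 1
)
MERGE INTO silver.orders AS t
USING deduped AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *    -- existing keys: update in place
WHEN NOT MATCHED THEN INSERT *;   -- new keys: insert
```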

Configuration

pipeline.sql
-- @merge_strategy: incremental
-- @unique_key: order_id
-- @watermark_column: updated_at
 
SELECT
    order_id,
    customer_id,
    status,
    total_amount,
    updated_at
FROM {{ ref('bronze.raw_orders') }}
{% if is_incremental() %}
WHERE updated_at > '{{ watermark_value }}'
{% endif %}

Dedup Behavior

When multiple rows in the pipeline output share the same unique_key, RAT keeps only the last row, using a ROW_NUMBER() window partitioned by unique_key to discard the rest. This means the row that appears last in the result set wins.

For composite keys, separate column names with commas:

-- @unique_key: order_id, product_id

When to Use

  • Large fact tables — Orders, events, transactions that grow over time
  • CDC (Change Data Capture) — Processing change feeds with upsert semantics
  • Any table with a natural key — Where you want “last state wins”

Properties

  • Idempotent: Yes. Running the same data twice updates existing rows to the same values.
  • Required annotations: unique_key
  • Optional annotations: watermark_column (for filtering new rows)
  • Performance: Only processes new/changed rows when used with watermarks.
⚠️ Without the {% if is_incremental() %} watermark filter, every run scans all source data. The merge is still correct (idempotent), but you lose the performance benefit of incremental processing.


append_only

Unconditionally append rows to the table. No dedup, no merge, no key matching. Every run adds new rows.

How It Works
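Conceptually this is a bare INSERT: nothing is matched, deduplicated, or deleted. The sketch below is illustrative pseudocode with hypothetical names:

```sql
-- Conceptual sketch only; names are illustrative.
INSERT INTO bronze.events
SELECT * FROM pipeline_output;   -- every row is appended, every run
```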

Configuration

pipeline.sql
-- @merge_strategy: append_only
-- @description: Raw event log — append all incoming events
 
SELECT
    event_id,
    event_type,
    user_id,
    payload,
    created_at,
    '{{ run_started_at }}' AS _loaded_at
FROM read_csv_auto(
    '{{ landing_zone("events") }}',
    header = true
)

When to Use

  • Event logs — Clickstream, audit logs, IoT sensor readings
  • Landing zone ingestion — Raw data that should be appended as-is
  • Immutable data — Records that never change once created

Properties

  • Idempotent: No. Running the same data twice creates duplicate rows. This is the only non-idempotent strategy.
  • Required annotations: None
  • Performance: Very fast — just appends, no key matching.
🚫 Be careful with retries. If a pipeline run fails after writing but before completion, re-running it will append the same data again. Use append_only only when duplicates are acceptable or when upstream data is guaranteed to be unique per run (e.g., new files in a landing zone with archive_landing_zones: true).


delete_insert

Delete all existing rows that match the incoming keys, then insert all new rows without dedup. Unlike incremental, there is no ROW_NUMBER dedup — all incoming rows for a key are inserted.

How It Works
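Conceptually, a run deletes every existing row whose key appears in the output, then inserts the output as-is. The sketch below is illustrative pseudocode with hypothetical names:

```sql
-- Conceptual sketch only; names are illustrative.
DELETE FROM silver.inventory
WHERE batch_date IN (SELECT DISTINCT batch_date FROM pipeline_output);

INSERT INTO silver.inventory     -- no dedup: all incoming rows are inserted,
SELECT * FROM pipeline_output;   -- including multiple rows per key
```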

Configuration

pipeline.sql
-- @merge_strategy: delete_insert
-- @unique_key: batch_date
 
SELECT
    batch_date,
    product_id,
    warehouse_id,
    quantity_on_hand,
    last_counted_at
FROM {{ ref('bronze.raw_inventory') }}
WHERE batch_date = CURRENT_DATE

How It Differs from Incremental

| Aspect | incremental | delete_insert |
| --- | --- | --- |
| Dedup | Yes (ROW_NUMBER, last wins) | No dedup |
| Multiple rows per key | Keeps 1 per key | Keeps all rows per key |
| Operation | MERGE (upsert) | DELETE matching keys + INSERT |
| Use case | One row per entity | Multiple rows per key value |

When to Use

  • Batch reloads — Replace all data for a given date/batch/source
  • One-to-many keys — When a key (like batch_date) maps to many rows
  • Idempotent batch processing — Re-running a batch replaces it cleanly

Properties

  • Idempotent: Yes. Re-running with the same data deletes and re-inserts the same rows.
  • Required annotations: unique_key
  • Performance: Deletes can be expensive if the key matches many rows.

A common pattern is using batch_date or source_file as the unique_key in delete_insert. This lets you reprocess a specific day’s data without affecting other days.


scd2

Slowly Changing Dimension Type 2. Tracks full history of changes. When a record changes, the old version is “closed” (its valid_to is set to the current timestamp) and a new version is inserted with valid_to = NULL (or 9999-12-31).

How It Works
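Conceptually, a run closes the current version of any changed key, then opens a new version for changed and brand-new keys. The sketch below is illustrative pseudocode (names and the change-detection predicate are assumptions, not RAT internals):

```sql
-- Conceptual sketch only; names and predicates are illustrative.
-- 1. Close the current version of any key whose attributes changed.
UPDATE silver.dim_customers AS t
SET valid_to = CURRENT_TIMESTAMP
WHERE t.valid_to IS NULL
  AND EXISTS (
    SELECT 1 FROM pipeline_output s
    WHERE s.customer_id = t.customer_id
      AND (s.name, s.email, s.segment) IS DISTINCT FROM
          (t.name, t.email, t.segment));

-- 2. Open a new version for any key that no longer has a current row
--    (changed keys were just closed; new keys never had one).
INSERT INTO silver.dim_customers
SELECT s.*, CURRENT_TIMESTAMP AS valid_from, NULL AS valid_to
FROM pipeline_output s
WHERE NOT EXISTS (
    SELECT 1 FROM silver.dim_customers t
    WHERE t.customer_id = s.customer_id
      AND t.valid_to IS NULL);
```

Step 2 relies on step 1 having already closed the changed rows: unchanged keys still have an open version and are skipped, which is what makes re-running with unchanged data a no-op.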

Configuration

pipeline.sql
-- @merge_strategy: scd2
-- @unique_key: customer_id
-- @scd_valid_from: valid_from
-- @scd_valid_to: valid_to
-- @description: Customer dimension with full change history
 
SELECT
    customer_id,
    name,
    email,
    segment,
    region,
    updated_at
FROM {{ ref('bronze.raw_customers') }}

The scd_valid_from and scd_valid_to annotations are optional. They default to valid_from and valid_to respectively. RAT manages these columns automatically — you do not need to include them in your SELECT.

Resulting Table Structure

After several runs, the target table looks like this:

| customer_id | name | segment | valid_from | valid_to |
| --- | --- | --- | --- | --- |
| 42 | Bob | Basic | 2024-01-15 | 2024-06-01 |
| 42 | Bob Smith | Premium | 2024-06-01 | 2024-11-20 |
| 42 | Bob Smith | Enterprise | 2024-11-20 | NULL |

The row with valid_to = NULL is the current version. To query only current records:

SELECT * FROM {{ ref('silver.dim_customers') }}
WHERE valid_to IS NULL

When to Use

  • Dimension tables — Customers, products, employees where you need change history
  • Compliance — Regulatory requirements to track when values changed
  • Historical analysis — “What was the customer’s segment when they placed this order?”

Properties

  • Idempotent: Yes. Re-running with unchanged data has no effect. Changed data correctly closes and re-opens records.
  • Required annotations: unique_key
  • Optional annotations: scd_valid_from (default: valid_from), scd_valid_to (default: valid_to)

snapshot

Partition-aware overwrite. Only replaces partitions that are present in the new data. Partitions not in the pipeline output are left untouched.

How It Works
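Conceptually this is a partition-level overwrite: delete the partitions present in the output, then insert the output. The sketch below is illustrative pseudocode with hypothetical names:

```sql
-- Conceptual sketch only; names are illustrative.
DELETE FROM silver.regional_sales
WHERE region IN (SELECT DISTINCT region FROM pipeline_output);

INSERT INTO silver.regional_sales
SELECT * FROM pipeline_output;
```

Mechanically this mirrors delete_insert, but because the key is the table's partition column, only the affected partitions are rewritten; everything else is left untouched.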

Configuration

pipeline.sql
-- @merge_strategy: snapshot
-- @partition_column: region
-- @description: Regional sales data — overwrite only affected regions
 
SELECT
    region,
    product_id,
    sale_date,
    quantity,
    revenue
FROM {{ ref('bronze.raw_sales') }}
WHERE region IN ('US', 'EU')

When to Use

  • Regional data — Process one region at a time without affecting others
  • Time-partitioned tables — Reprocess a specific month without touching other months
  • Partial refreshes — Update only the partitions that have new source data

Properties

  • Idempotent: Yes. Re-running with the same data for the same partitions produces the same result.
  • Required annotations: partition_column
  • Performance: Only rewrites affected partitions, leaving others untouched.
⚠️ The partition_column must exist in both the pipeline output and the target table. RAT determines which partitions to overwrite by looking at the distinct values of partition_column in the pipeline output.


Common Mistakes and Gotchas

1. Using append_only when you mean incremental

-- WRONG: append_only with a unique key — you'll get duplicates
-- @merge_strategy: append_only
-- @unique_key: order_id   <-- this annotation is ignored!
 
-- RIGHT: use incremental for dedup by key
-- @merge_strategy: incremental
-- @unique_key: order_id

append_only ignores unique_key. If you want dedup, use incremental or delete_insert.

2. Forgetting watermark filter in incremental

-- WRONG: no watermark filter — scans all data every run
-- @merge_strategy: incremental
-- @unique_key: order_id
-- @watermark_column: updated_at
 
SELECT * FROM {{ ref('bronze.raw_orders') }}
-- Missing: {% if is_incremental() %} WHERE ... {% endif %}

This is correct but slow. The merge is still idempotent, but every run reprocesses all rows. Add the watermark filter for performance.

3. Using delete_insert when you want incremental

delete_insert does not dedup incoming rows. If your pipeline output has 3 rows with order_id = 42, all 3 will be inserted. Use incremental if you want only the latest row per key.

4. Forgetting partition_column for snapshot

-- WRONG: snapshot without partition_column
-- @merge_strategy: snapshot
 
-- RIGHT: snapshot requires a partition column
-- @merge_strategy: snapshot
-- @partition_column: region

Without partition_column, RAT does not know which partitions to overwrite and the pipeline will fail.

5. SCD2 columns in SELECT

-- WRONG: including valid_from/valid_to in your SELECT
SELECT customer_id, name, valid_from, valid_to FROM ...
 
-- RIGHT: RAT manages SCD2 columns automatically
SELECT customer_id, name FROM ...

RAT adds the valid_from and valid_to columns automatically during the SCD2 merge process. Including them in your SELECT may cause conflicts.

6. Assuming append_only is idempotent

Re-running an append_only pipeline with the same source data will insert duplicate rows. Use archive_landing_zones: true with landing zone sources to ensure files are processed only once:

-- @merge_strategy: append_only
-- @archive_landing_zones: true
 
SELECT * FROM read_csv_auto('{{ landing_zone("events") }}')

After a successful run, the source files are moved to _processed/, so the next run picks up only new files.