Getting StartedTutorialPart 4: Merge Strategies

Part 4: Merge Strategies & Incremental Loading ⚡

Time: ~15 minutes | Prerequisites: Part 3 completed (Silver pipeline running)

So far, every time you run a pipeline, RAT drops the table and recreates it from scratch. That works fine for small datasets, but imagine ingesting millions of rows per day — you don’t want to reprocess everything on every run.

In this part, you’ll learn all 6 merge strategies RAT supports and make your mission_log pipeline incremental so it only processes new rows.


The 6 Merge Strategies

RAT supports six strategies for how new data is merged into existing tables. Each one fits a different data pattern:

StrategyBehaviorBest For
full_refreshDrop and recreate the entire tableSmall/static datasets, lookup tables
incrementalAppend only rows newer than the watermarkEvent logs, time-series, append-heavy data
append_onlyAlways append, no deduplicationImmutable event streams, audit logs
delete_insertDelete rows matching key, then insert newDimension updates, upsert patterns
scd2Slowly Changing Dimension Type 2 — track historical changesCustomer records, product catalogs
snapshotFull snapshot comparison — diff old vs newAudit trails, compliance reporting

The default strategy is full_refresh. If you don’t specify a merge strategy, RAT will drop and recreate the table on every run. This is the safest default — you opt into smarter strategies explicitly.


Hands-On: Make mission_log Incremental

Let’s modify the mission_log bronze pipeline from Part 2 to use the incremental strategy. This way, on subsequent runs, it will only ingest rows with a launch_date newer than the last processed row.

Open the pipeline editor

Navigate to Pipelines → click on mission_log → go to the Code tab.

Update the SQL

Replace the existing SQL with:

mission_log.sql
-- @merge_strategy: incremental
-- @unique_key: launch_id
-- @watermark_column: launch_date
-- @description: Space launch mission data from landing zone
 
SELECT
    launch_id,
    mission_name,
    launch_date,
    vehicle,
    launch_site,
    country,
    orbit,
    outcome,
    payload_mass_kg,
    mission_type
FROM read_csv_auto('{{ landing_zone("launch-data") }}')
 
{% if is_incremental() and watermark_value %}
WHERE launch_date > '{{ watermark_value }}'
{% endif %}

Understand the annotations

The annotations at the top of the SQL configure the merge behavior:

AnnotationValuePurpose
@merge_strategyincrementalUse the incremental merge strategy
@unique_keylaunch_idThe column that uniquely identifies each row
@watermark_columnlaunch_dateThe column used to track “how far” we’ve processed
@description(text)Human-readable description shown in the Portal

Understand the Jinja conditionals

The {% if %} block at the bottom is the magic of incremental loading:

{% if is_incremental() and watermark_value %}
WHERE launch_date > '{{ watermark_value }}'
{% endif %}

Here’s what happens:

  • First run: is_incremental() returns false because the table doesn’t exist yet. The WHERE clause is skipped, and all 25 rows are loaded.
  • Subsequent runs: is_incremental() returns true and watermark_value contains the maximum launch_date from the existing table. Only rows newer than that date are processed.

Publish the changes

Click Publish to save the updated pipeline definition.

Run it — first time

Click Run. Since this is the first run with the incremental strategy, RAT loads all 25 rows (the table is being created fresh).

Check the run logs — you should see something like:

Merge strategy: incremental
Watermark value: None (first run)
Rows written: 25

Run it — second time

Click Run again. This time, RAT checks the watermark:

Merge strategy: incremental
Watermark value: 2024-09-15
Rows written: 0 (no new data)

Run logs showing incremental behavior

Since no new rows in the CSV have a launch_date after the watermark, zero rows are processed. This is exactly the behavior you want — no wasted compute on data you’ve already ingested.

⚠️

The watermark_value is the maximum value of the watermark column from the existing table. Make sure your watermark column is monotonically increasing (like timestamps or sequential IDs), otherwise you might miss rows.


Annotations vs Settings Tab

There are two ways to configure merge strategy settings in RAT:

Annotations live directly in your SQL file as comments:

pipeline.sql
-- @merge_strategy: incremental
-- @unique_key: launch_id
-- @watermark_column: launch_date
 
SELECT ...

Pros: Version-controlled with your SQL, visible at a glance, portable.

Config Merge Rules

When both annotations and config.yaml define the same field, annotations win. The merge logic is:

  1. config.yaml is the base configuration
  2. SQL annotations overlay per-field — each annotation overrides the corresponding config field
  3. If a field is set only in config.yaml, it’s used. If set only in annotations, it’s used. If set in both, the annotation takes precedence.

Most teams standardize on annotations for everything — it keeps the configuration co-located with the SQL, making code review easier. Use the Settings tab for quick experiments, then codify into annotations.


Jinja Helpers Reference

RAT provides several Jinja helpers for conditional logic in your SQL pipelines:

HelperReturnsDescription
is_incremental()boolTrue if the table already exists and strategy is incremental
watermark_valuestr | NoneMax value of the watermark column from the existing table
is_scd2()boolTrue if the merge strategy is scd2
is_snapshot()boolTrue if the merge strategy is snapshot
is_append_only()boolTrue if the merge strategy is append_only
is_delete_insert()boolTrue if the merge strategy is delete_insert
ref('layer.name')strResolves a pipeline reference to its Iceberg table path
landing_zone('name')strResolves a landing zone to its S3 path
thisstrThe current pipeline’s own Iceberg table path

These helpers let you write a single SQL file that behaves differently depending on the pipeline’s state and configuration.


SCD2 Example (Reference)

For pipelines that need to track historical changes to dimension data — like a vehicle registry where specs might change over time — RAT supports Slowly Changing Dimension Type 2 (SCD2).

Here’s what an SCD2 pipeline would look like:

vehicle_history.sql
-- @merge_strategy: scd2
-- @unique_key: vehicle_name
-- @scd_valid_from: valid_from
-- @scd_valid_to: valid_to
-- @description: Vehicle specifications with full change history
 
SELECT
    vehicle_name,
    manufacturer,
    height_m,
    thrust_kn,
    stages,
    mass_kg,
    active
FROM read_csv_auto('{{ landing_zone("vehicle-catalog") }}')

When you run this pipeline:

  • New rows (vehicle names not seen before) are inserted with valid_from = now() and valid_to = NULL
  • Changed rows (same vehicle name, different values) close the old record (valid_to = now()) and insert a new one
  • Unchanged rows are left as-is

We won’t build this hands-on in this tutorial, but it’s good to know it’s available. SCD2 is essential for data warehouses that need audit-quality historical tracking.


What You Built

In this part, you:

  • ✅ Learned all 6 merge strategies and when to use each
  • ✅ Made mission_log incremental with watermark-based loading
  • ✅ Understood annotations vs config.yaml and the merge rules
  • ✅ Explored the full Jinja helpers reference
  • ✅ Saw an SCD2 example for historical dimension tracking

Your mission_log pipeline now skips already-processed data on re-runs. In the next part, you’ll add quality tests that catch bad data before it reaches downstream pipelines.


Next: In Part 5, you’ll create quality tests that gate pipeline merges and prevent bad data from propagating.