Data Lineage

Data lineage shows you how data flows through your pipelines — which tables feed into which pipelines, and which pipelines produce which tables. RAT builds a lineage graph automatically by scanning your pipeline SQL for ref() and landing_zone() calls, then renders it as an interactive DAG (Directed Acyclic Graph).

How RAT Builds Lineage

RAT does not require you to manually declare dependencies. Instead, it parses your pipeline SQL to extract references:

ref('layer.name') — a reference to another table in the same namespace
ref('ns.layer.name') — a cross-namespace reference to a table
landing_zone('zone_name') — a reference to a landing zone file drop area

From these references, RAT constructs a directed graph with three node types and three edge types.

Node Types

Node Type	Description	Example
Pipeline	A SQL or Python pipeline	`silver.clean_orders`
Table	An Iceberg table (output of a pipeline)	`silver.clean_orders` (table)
Landing Zone	A file drop area for raw data	`raw_orders`

Pipelines and tables share the same name because each pipeline produces exactly one table. In the lineage graph, they are displayed as separate nodes to clearly show the transformation boundary.

Edge Types

Edge Type	Direction	Meaning
ref	Table —> Pipeline	Pipeline reads from this table
produces	Pipeline —> Table	Pipeline writes to this table
landing_input	Landing Zone —> Pipeline	Pipeline ingests files from this landing zone

Example

Given these three pipelines:

bronze/ingest_orders/pipeline.sql

SELECT *
FROM {{ landing_zone('raw_orders') }}

silver/clean_orders/pipeline.sql

SELECT
    order_id,
    customer_id,
    CAST(total_amount AS DECIMAL(10,2)) AS total_amount,
    status,
    updated_at
FROM {{ ref('bronze.ingest_orders') }}
WHERE order_id IS NOT NULL

gold/daily_revenue/pipeline.sql

SELECT
    DATE_TRUNC('day', updated_at) AS revenue_date,
    SUM(total_amount) AS total_revenue,
    COUNT(DISTINCT order_id) AS order_count
FROM {{ ref('silver.clean_orders') }}
GROUP BY revenue_date

RAT generates this lineage graph:

Using the Lineage Page

The Lineage page is accessible from the Portal sidebar under Lineage.

Open the Lineage page

Click Lineage in the sidebar. The full DAG for all pipelines loads.

Filter by namespace

Use the namespace dropdown at the top to filter the graph to a specific namespace. This is helpful when you have many pipelines across multiple namespaces.

Explore the graph

Pan by clicking and dragging the background
Zoom with the scroll wheel or pinch gesture
Click a node to highlight its upstream and downstream connections
Double-click a node to navigate to that pipeline or table

Understand the colors

Nodes are color-coded by layer:

Color	Layer
Copper/Orange	Bronze
Silver/Gray	Silver
Gold/Yellow	Gold
Green	Landing Zone

Cross-Namespace References

Pipelines can reference tables in other namespaces using three-part references:

pipeline.sql

-- Reference a table in the "shared" namespace
SELECT *
FROM {{ ref('shared.silver.customer_master') }}

Cross-namespace edges appear in the lineage graph as connections that span namespace boundaries. When filtering by namespace, cross-namespace references are shown as external nodes at the edge of the graph.

Cross-namespace references create implicit dependencies. If the referenced table does not exist (because the upstream pipeline in the other namespace has not run yet), the pipeline will fail at execution time.

Lineage for Python Pipelines

Python pipelines use ref() within their SQL queries (executed through DuckDB). RAT scans the Python source for string patterns matching ref('...') to extract dependencies. This works reliably for inline SQL strings:

pipeline.py

# @merge_strategy: full_refresh
 
import duckdb
 
def run(conn: duckdb.DuckDBPyConnection, ref, landing_zone, **kwargs):
    orders = conn.sql(f"""
        SELECT * FROM {ref('silver.clean_orders')}
    """).arrow()
 
    customers = conn.sql(f"""
        SELECT * FROM {ref('silver.clean_customers')}
    """).arrow()
 
    return conn.sql("""
        SELECT
            o.order_id,
            c.customer_name,
            o.total_amount
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
    """).arrow()

⚠️

If you construct ref() calls dynamically (e.g., from a list or variable), RAT’s static scanner cannot detect them. In these cases, the lineage graph will be incomplete. Prefer explicit ref() calls with string literals for accurate lineage.

Lineage and Pipeline Execution

While lineage is primarily a visualization tool, it also serves operational purposes:

Impact analysis — Before modifying a Silver pipeline, check which Gold pipelines depend on it downstream
Root cause analysis — When a Gold table has bad data, trace upstream through the lineage to find the source
Trigger planning — Use the lineage graph to design pipeline_success and cron_dependency trigger chains

How Lineage is Computed

Lineage is computed on-demand when you open the Lineage page or request it via API. The process is:

List all pipelines in the target namespace(s)
Read each pipeline’s SQL/Python source from S3
Extract ref() and landing_zone() calls via regex parsing
Build the node and edge graph
Render the graph in the Portal using ReactFlow

This means lineage always reflects the current state of pipeline source files. If you edit a pipeline to add a new ref(), the lineage updates on next page load.

Lineage via API

Terminal

# Get lineage for a namespace
curl http://localhost:8080/api/v1/lineage?namespace=default
 
# Get lineage for a specific pipeline (upstream + downstream)
curl http://localhost:8080/api/v1/lineage?namespace=default&layer=silver&name=clean_orders

Response:

Response

{
  "nodes": [
    {"id": "pipeline:default.silver.clean_orders", "type": "pipeline", "namespace": "default", "layer": "silver", "name": "clean_orders"},
    {"id": "table:default.bronze.ingest_orders", "type": "table", "namespace": "default", "layer": "bronze", "name": "ingest_orders"},
    {"id": "table:default.silver.clean_orders", "type": "table", "namespace": "default", "layer": "silver", "name": "clean_orders"}
  ],
  "edges": [
    {"source": "table:default.bronze.ingest_orders", "target": "pipeline:default.silver.clean_orders", "type": "ref"},
    {"source": "pipeline:default.silver.clean_orders", "target": "table:default.silver.clean_orders", "type": "produces"}
  ]
}

Limitations

Static analysis only — lineage is built from parsing source code, not runtime execution. Dynamically constructed SQL that builds ref() calls at runtime will not be detected.
No column-level lineage — RAT tracks table-level dependencies. It does not trace individual columns through transformations.
Landing zones without landing_zone() — if a pipeline reads from S3 directly (e.g., using a hardcoded read_csv_auto('s3://...') path instead of the landing_zone() function), that dependency will not appear in lineage.

Query Console Data Retention