Data Lineage
Data lineage shows you how data flows through your pipelines — which tables feed into which pipelines, and which pipelines produce which tables. RAT builds a lineage graph automatically by scanning your pipeline SQL for ref() and landing_zone() calls, then renders it as an interactive DAG (Directed Acyclic Graph).
How RAT Builds Lineage
RAT does not require you to manually declare dependencies. Instead, it parses your pipeline SQL to extract references:
ref('layer.name')— a reference to another table in the same namespaceref('ns.layer.name')— a cross-namespace reference to a tablelanding_zone('zone_name')— a reference to a landing zone file drop area
From these references, RAT constructs a directed graph with three node types and three edge types.
Node Types
| Node Type | Description | Example |
|---|---|---|
| Pipeline | A SQL or Python pipeline | silver.clean_orders |
| Table | An Iceberg table (output of a pipeline) | silver.clean_orders (table) |
| Landing Zone | A file drop area for raw data | raw_orders |
Pipelines and tables share the same name because each pipeline produces exactly one table. In the lineage graph, they are displayed as separate nodes to clearly show the transformation boundary.
Edge Types
| Edge Type | Direction | Meaning |
|---|---|---|
| ref | Table —> Pipeline | Pipeline reads from this table |
| produces | Pipeline —> Table | Pipeline writes to this table |
| landing_input | Landing Zone —> Pipeline | Pipeline ingests files from this landing zone |
Example
Given these three pipelines:
SELECT *
FROM {{ landing_zone('raw_orders') }}SELECT
order_id,
customer_id,
CAST(total_amount AS DECIMAL(10,2)) AS total_amount,
status,
updated_at
FROM {{ ref('bronze.ingest_orders') }}
WHERE order_id IS NOT NULLSELECT
DATE_TRUNC('day', updated_at) AS revenue_date,
SUM(total_amount) AS total_revenue,
COUNT(DISTINCT order_id) AS order_count
FROM {{ ref('silver.clean_orders') }}
GROUP BY revenue_dateRAT generates this lineage graph:
Using the Lineage Page
The Lineage page is accessible from the Portal sidebar under Lineage.
Open the Lineage page
Click Lineage in the sidebar. The full DAG for all pipelines loads.
Filter by namespace
Use the namespace dropdown at the top to filter the graph to a specific namespace. This is helpful when you have many pipelines across multiple namespaces.
Explore the graph
- Pan by clicking and dragging the background
- Zoom with the scroll wheel or pinch gesture
- Click a node to highlight its upstream and downstream connections
- Double-click a node to navigate to that pipeline or table
Understand the colors
Nodes are color-coded by layer:
| Color | Layer |
|---|---|
| Copper/Orange | Bronze |
| Silver/Gray | Silver |
| Gold/Yellow | Gold |
| Green | Landing Zone |
Cross-Namespace References
Pipelines can reference tables in other namespaces using three-part references:
-- Reference a table in the "shared" namespace
SELECT *
FROM {{ ref('shared.silver.customer_master') }}Cross-namespace edges appear in the lineage graph as connections that span namespace boundaries. When filtering by namespace, cross-namespace references are shown as external nodes at the edge of the graph.
Cross-namespace references create implicit dependencies. If the referenced table does not exist (because the upstream pipeline in the other namespace has not run yet), the pipeline will fail at execution time.
Lineage for Python Pipelines
Python pipelines use ref() within their SQL queries (executed through DuckDB). RAT scans the Python source for string patterns matching ref('...') to extract dependencies. This works reliably for inline SQL strings:
# @merge_strategy: full_refresh
import duckdb
def run(conn: duckdb.DuckDBPyConnection, ref, landing_zone, **kwargs):
orders = conn.sql(f"""
SELECT * FROM {ref('silver.clean_orders')}
""").arrow()
customers = conn.sql(f"""
SELECT * FROM {ref('silver.clean_customers')}
""").arrow()
return conn.sql("""
SELECT
o.order_id,
c.customer_name,
o.total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
""").arrow()If you construct ref() calls dynamically (e.g., from a list or variable), RAT’s static scanner cannot detect them. In these cases, the lineage graph will be incomplete. Prefer explicit ref() calls with string literals for accurate lineage.
Lineage and Pipeline Execution
While lineage is primarily a visualization tool, it also serves operational purposes:
- Impact analysis — Before modifying a Silver pipeline, check which Gold pipelines depend on it downstream
- Root cause analysis — When a Gold table has bad data, trace upstream through the lineage to find the source
- Trigger planning — Use the lineage graph to design
pipeline_successandcron_dependencytrigger chains
How Lineage is Computed
Lineage is computed on-demand when you open the Lineage page or request it via API. The process is:
- List all pipelines in the target namespace(s)
- Read each pipeline’s SQL/Python source from S3
- Extract
ref()andlanding_zone()calls via regex parsing - Build the node and edge graph
- Render the graph in the Portal using ReactFlow
This means lineage always reflects the current state of pipeline source files. If you edit a pipeline to add a new ref(), the lineage updates on next page load.
Lineage via API
# Get lineage for a namespace
curl http://localhost:8080/api/v1/lineage?namespace=default
# Get lineage for a specific pipeline (upstream + downstream)
curl http://localhost:8080/api/v1/lineage?namespace=default&layer=silver&name=clean_ordersResponse:
{
"nodes": [
{"id": "pipeline:default.silver.clean_orders", "type": "pipeline", "namespace": "default", "layer": "silver", "name": "clean_orders"},
{"id": "table:default.bronze.ingest_orders", "type": "table", "namespace": "default", "layer": "bronze", "name": "ingest_orders"},
{"id": "table:default.silver.clean_orders", "type": "table", "namespace": "default", "layer": "silver", "name": "clean_orders"}
],
"edges": [
{"source": "table:default.bronze.ingest_orders", "target": "pipeline:default.silver.clean_orders", "type": "ref"},
{"source": "pipeline:default.silver.clean_orders", "target": "table:default.silver.clean_orders", "type": "produces"}
]
}Limitations
- Static analysis only — lineage is built from parsing source code, not runtime execution. Dynamically constructed SQL that builds
ref()calls at runtime will not be detected. - No column-level lineage — RAT tracks table-level dependencies. It does not trace individual columns through transformations.
- Landing zones without
landing_zone()— if a pipeline reads from S3 directly (e.g., using a hardcodedread_csv_auto('s3://...')path instead of thelanding_zone()function), that dependency will not appear in lineage.