
Architecture Overview

RAT is a self-hostable data platform that runs as 8 containers orchestrated by Docker Compose. It follows a strict separation of concerns: a Go API server handles orchestration and state, Python services handle data processing, and a Next.js frontend provides the user interface.


Design Philosophy

RAT’s architecture is guided by five principles:

  1. A single docker compose up --- The entire platform starts with one command. No external dependencies, no cloud accounts, no license keys for the Community Edition.
  2. Separation of compute and state --- Data lives in S3 (MinIO) in open Apache Iceberg format. Metadata lives in Postgres. Compute (DuckDB) is ephemeral and stateless.
  3. Git-like isolation --- Every pipeline run creates an isolated Nessie branch. Bad data never reaches the production catalog. Quality tests gate merges.
  4. Plugin extensibility --- The Community Edition ships with no-op implementations for auth, sharing, and enforcement. Pro plugins slot in without changing the core platform.
  5. No vendor lock-in --- All storage is open format (Iceberg + Parquet). The catalog is Nessie (open source). You can query your data with any tool that speaks Iceberg.

System Block Diagram

The following diagram shows all 8 containers (7 services + 1 init job), their communication protocols, network zones, and exposed ports.

The portal and ratd containers live on both the frontend and backend networks. The portal needs frontend access (user-facing port 3000) and backend access (to reach ratd internally for server-side rendering). All other services live exclusively on the backend network and are not reachable from outside Docker.
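In Compose terms, this split can be sketched as follows (network and service names are the ones used on this page; every other key is elided):

```yaml
# Sketch only --- the real compose file carries many more keys per service.
networks:
  rat_frontend: {}
  infra_default: {}

services:
  portal:
    networks: [rat_frontend, infra_default]
    ports: ["3000:3000"]        # user-facing web IDE
  ratd:
    networks: [rat_frontend, infra_default]
    ports: ["8080:8080"]        # REST API
  runner:
    networks: [infra_default]   # backend only: unreachable from the host
```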


Containers at a Glance

| Container | Language | Role | Exposed Ports | Network |
|---|---|---|---|---|
| portal | Next.js (TypeScript) | Web IDE --- the only user interface | 3000 (HTTP) | frontend + backend |
| ratd | Go | API server, scheduler, plugin host, catalog ops | 8080 (REST), 8081 (gRPC, internal) | frontend + backend |
| ratq | Python | Read-only DuckDB query service | 50051 (gRPC, internal) | backend |
| runner | Python | Pipeline execution engine | 50052 (gRPC, internal) | backend |
| postgres | PostgreSQL 16.4 | Platform state (16 tables) | 5432 (localhost only) | backend |
| minio | MinIO | S3-compatible object storage | 9000 (S3 API), 9001 (Console) | backend |
| nessie | Java (Quarkus) | Iceberg REST catalog with git-like branching | 19120 (REST, localhost only) | backend |
| minio-init | MinIO Client (mc) | One-shot: creates bucket, enables versioning | --- (exits after setup) | backend |

Communication Protocols

Protocol Choices

| Path | Protocol | Why |
|---|---|---|
| Portal to ratd | REST (HTTP) | Browser-native. SWR data fetching. No gRPC-Web complexity. |
| ratd to runner/ratq | ConnectRPC (gRPC) | Type-safe, streaming support (SSE logs), HTTP/1.1 compatible for easier debugging. |
| ratd to Postgres | SQL via pgx | Pure Go driver. Connection pooling. Type-safe queries via sqlc. |
| ratd to MinIO | S3 API | Standard S3 protocol via the MinIO Go SDK. Swappable with any S3-compatible store. |
| ratd to Nessie | Iceberg REST API | Standard catalog protocol. Not Nessie-specific --- works with any Iceberg REST catalog. |
| runner to ratd | HTTP callback | Push-based status reporting. Runner POSTs terminal status on completion, with a 60s poll fallback. |
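The runner-to-ratd row combines a push with a polling safety net. A minimal sketch of that callback logic in Python --- the endpoint path, payload shape, and retry policy here are illustrative, not the actual API:

```python
import time

POLL_FALLBACK_SECONDS = 60  # ratd polls runs that never called back (per the table)

def report_terminal_status(post, run_id, status, retries=3, backoff=1.0):
    """Push the terminal status to ratd; on repeated failure, give up and
    let ratd's 60s poll fallback pick the status up instead."""
    for attempt in range(retries):
        try:
            # `post` stands in for an HTTP client; path and payload are illustrative.
            post(f"/api/runs/{run_id}/status", {"status": status})
            return True       # ratd heard about the run immediately
        except ConnectionError:
            time.sleep(backoff * attempt)  # simple linear backoff between tries
    return False              # ratd's poller covers this case within ~60s
```

The design in the table implies the callback is an optimization: even if the push never arrives (runner crash, network partition), the poll fallback still resolves the run within about a minute.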

Network Zones

RAT uses two Docker networks to enforce network segmentation:

Frontend Network (rat_frontend)

The user-facing network. Only two containers are attached:

  • portal --- serves the web IDE on port 3000
  • ratd --- serves the REST API on port 8080

This is the only network accessible from the host machine (via published ports).

Backend Network (infra_default)

The internal network where all inter-service communication happens. All 8 containers are attached to this network, but only portal and ratd are also on the frontend network.

Services like runner, ratq, postgres, minio, and nessie are not directly reachable from the host, apart from the localhost-bound debug ports for postgres, minio, and nessie.

⚠️

In production, the localhost-bound debug ports for Postgres (:5432), MinIO (:9000, :9001), and Nessie (:19120) should be removed or firewalled. They exist for development convenience only.
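In Compose terms, the difference between the two modes is only the port mapping (a sketch, using postgres as the example):

```yaml
services:
  postgres:
    # Development: bound to loopback, reachable only from the host machine itself
    ports:
      - "127.0.0.1:5432:5432"
    # Production: delete the ports: mapping entirely --- postgres remains
    # reachable on the backend network but is never published to the host
```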


Startup Sequence

Services start in dependency order enforced by Docker Compose health checks:

Infrastructure layer starts first

Postgres and MinIO start simultaneously. Both have health checks (pg_isready and mc ready). Nothing else starts until both are healthy.

minio-init runs (one-shot)

Once MinIO is healthy, the minio-init container runs. It creates the rat bucket, enables S3 versioning, and configures a 7-day lifecycle policy for non-current object versions. It exits after completion.

Nessie starts

Nessie depends on Postgres (it persists catalog metadata via JDBC). It starts once Postgres is healthy and exposes the Iceberg REST catalog on port 19120.

Python services start

runner and ratq start once MinIO and Nessie are healthy. They initialize their DuckDB engines with S3 and Iceberg extensions.

ratd starts

ratd depends on Postgres, MinIO, and Nessie. On startup it runs database migrations, initializes the scheduler, starts the reaper daemon, connects to runner and ratq via gRPC, and loads any configured plugins.

Portal starts last

portal depends on ratd being healthy. It needs the API to be available for both server-side rendering and client-side data fetching.
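The ordering above is what Compose's service_healthy conditions express. A minimal sketch --- the healthcheck commands follow the ones named on this page, while the intervals and the pg_isready arguments are illustrative:

```yaml
services:
  postgres:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U rat"]   # user name illustrative
      interval: 5s
  minio:
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 5s
  ratd:
    depends_on:
      postgres: { condition: service_healthy }
      minio: { condition: service_healthy }
      nessie: { condition: service_healthy }
  portal:
    depends_on:
      ratd: { condition: service_healthy }
```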


Resource Allocation

Every container has explicit memory and CPU limits to prevent runaway consumption:

| Container | Memory Limit | CPU Limit | PIDs Limit | Notes |
|---|---|---|---|---|
| ratd | 512 MB | 1.0 | 100 | Lightweight Go binary |
| ratq | 1 GB | 1.0 | 100 | Single persistent DuckDB, in-memory catalog cache |
| runner | 2 GB | 2.0 | 100 | One DuckDB per concurrent run, up to 10 concurrent |
| portal | 512 MB | 1.0 | 100 | Standalone Next.js, mostly static |
| postgres | 1 GB | 1.0 | 100 | Metadata only, low volume |
| minio | 1 GB | 1.0 | 100 | Data file storage |
| nessie | 512 MB | 1.0 | 100 | Catalog metadata, Quarkus runtime |
| minio-init | 256 MB | 0.5 | 100 | One-shot, exits immediately |

The limits above sum to 6.75 GB RAM (6.5 GB at steady state, since minio-init exits after setup). Recommended: 8 GB+ available for Docker.
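In Compose, limits like the ones in the table map onto three keys per service; a sketch using ratd's values:

```yaml
services:
  ratd:
    mem_limit: 512m   # hard memory cap
    cpus: 1.0         # at most one CPU's worth of time
    pids_limit: 100   # caps processes/threads (fork-bomb guard)
```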


Security Posture

Every container is hardened with defense-in-depth:

  • Read-only filesystem (read_only: true) on ratd, ratq, runner, portal --- with /tmp as tmpfs
  • Drop all Linux capabilities (cap_drop: [ALL])
  • No privilege escalation (no-new-privileges:true)
  • PID limits (100 per container) to prevent fork bombs
  • JSON body size limit (1 MB) on ratd to prevent request flooding
  • Rate limiting (50 req/s per IP, burst 100) on all API endpoints
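As Compose keys, the container-level hardening items above look roughly like this per service (ratd shown; the JSON body limit and rate limiting live in ratd's application code, not in Compose):

```yaml
services:
  ratd:
    read_only: true              # immutable root filesystem
    tmpfs:
      - /tmp                     # the only writable path, memory-backed
    cap_drop:
      - ALL                      # no Linux capabilities
    security_opt:
      - no-new-privileges:true   # block setuid-style escalation
    pids_limit: 100              # fork-bomb guard
```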

See the Security page for the complete security model.


Where Data Lives

| Data Type | Storage | Format |
|---|---|---|
| Pipeline source code (SQL/Python) | MinIO (S3) | Plain text, S3-versioned |
| Pipeline config | MinIO (S3) | YAML |
| Quality test SQL | MinIO (S3) | Plain text |
| Raw uploaded files | MinIO (S3) | CSV, Parquet, JSON |
| Transformed data (tables) | MinIO (S3) | Apache Iceberg (Parquet + metadata) |
| Table catalog | Nessie | Git-like refs pointing to Iceberg metadata |
| Platform state | Postgres | 16 relational tables |
| Run logs | Postgres (JSONB column) | Structured log entries |

See Storage Layout for the full S3 directory structure and Database Schema for the Postgres tables.


Next Steps