Most lineage tools are doing archaeology
The standard approach to column lineage works like this: your pipeline runs, query logs land somewhere, a lineage tool parses those logs, and a graph gets built showing which columns fed which downstream tables. Tools like DataHub, Atlan, and OpenLineage all follow some variation of this pattern. The graph is useful. It answers questions about data flow. But it answers them after the damage is done.
The fundamental problem is timing. A post-hoc lineage graph tells you that orders.customer_id flowed into revenue_by_segment.segment_key — past tense. It cannot tell you that renaming customer_id to cust_id in the source will break three downstream models before you push that change to production. You discover that when the pipeline fails at 2am, or worse, when it succeeds but joins on a NULL column and nobody notices for days.
Rocky, an open-source Rust project that surfaced on Hacker News this week (202 stars, Apache 2.0 licensed), takes a different approach. It compiles column lineage from the SQL graph at build time, much as the Rust compiler traces ownership and types through function calls. The lineage is not reconstructed from logs. It is computed from the declared transformations before execution begins. The distinction sounds academic until you consider what it prevents: queries that parse correctly, execute without error, and produce silently wrong results because the semantic relationship between columns changed upstream.
rocky lineage --column traces ancestry before execution
Rocky's CLI exposes lineage as a first-class compile-time operation. Running rocky lineage --column revenue_by_segment.total_revenue walks the dependency graph backwards through joins, CTEs, and window functions, returning every source column that contributes to the output. This happens against the compiled model definitions, not against warehouse metadata or historical query logs.
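Conceptually, the backward walk is a transitive closure over the compiled dependency map. A minimal sketch of the idea in Rust, assuming a toy graph rather than Rocky's actual internals:

```rust
use std::collections::{HashMap, HashSet};

/// Walk a compiled column-dependency map backwards, collecting every
/// source column that transitively feeds `target`.
fn ancestors(deps: &HashMap<&str, Vec<&str>>, target: &str) -> HashSet<String> {
    let mut seen = HashSet::new();
    let mut stack = vec![target];
    while let Some(col) = stack.pop() {
        for &parent in deps.get(col).into_iter().flatten() {
            if seen.insert(parent.to_string()) {
                stack.push(parent);
            }
        }
    }
    seen
}

fn main() {
    // Hypothetical project: total_revenue is derived from two columns
    // read straight off the orders source table.
    let deps = HashMap::from([
        ("revenue_by_segment.total_revenue", vec!["orders.amount", "orders.customer_id"]),
        ("orders.amount", vec![]),
        ("orders.customer_id", vec![]),
    ]);
    for col in ancestors(&deps, "revenue_by_segment.total_revenue") {
        println!("{col}");
    }
}
```

Because the graph comes from compiled model definitions, the same walk works identically on a laptop and in CI, with no warehouse connection required.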
The practical difference shows up in three scenarios. First, blast-radius analysis: before renaming a column in a seed or source table, you run rocky lineage-diff HEAD~1 to see exactly which downstream columns are affected across the entire project. Second, contract enforcement: Rocky's compiler emits diagnostic code E010 when a model references a column that does not exist in its declared inputs, and E013 when a protected column is dropped or its type changes unsafely. Both fire during rocky compile, not during rocky run. Third, AI-generated model validation: Rocky can generate models from natural language, but every generated query passes through the same compiler. A hallucinated join on a nonexistent column triggers E010 before the SQL ever reaches your warehouse.
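The existence check behind a diagnostic like E010 reduces to set membership over declared inputs. A minimal sketch, with invented shapes and message formatting rather than Rocky's real compiler types:

```rust
use std::collections::HashSet;

/// Compile-time existence check: every column a model references must
/// appear among its declared inputs, or compilation fails.
fn check_references(declared: &HashSet<&str>, referenced: &[&str]) -> Result<(), String> {
    for col in referenced {
        if !declared.contains(col) {
            return Err(format!("E010: column `{col}` not found in declared inputs"));
        }
    }
    Ok(())
}

fn main() {
    let declared: HashSet<&str> = ["orders.customer_id", "orders.amount"].into();
    // A hallucinated join key from a generated model is caught here,
    // before any SQL reaches the warehouse.
    let referenced = ["orders.amount", "orders.cust_id"];
    if let Err(diag) = check_references(&declared, &referenced) {
        eprintln!("{diag}");
    }
}
```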
The architecture that makes this possible is a 21-crate Rust workspace that parses Rocky's DSL (.rocky files), resolves types, and builds a full column-level dependency graph at compile time. Models are defined in TOML-based configuration with explicit type annotations, giving the compiler enough information to trace lineage without executing anything. Storage and compute remain in your warehouse — Databricks (production), Snowflake (beta), BigQuery (beta), or DuckDB for local testing.
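To illustrate why explicit annotations are enough, here is a rough sketch of deserializing a TOML model definition into typed Rust structs with the serde and toml crates. The field names are invented, not Rocky's actual .rocky schema; the point is that types arrive with the definition, so lineage can be computed without executing anything:

```rust
use serde::Deserialize;

// Invented field names for illustration; Rocky's real DSL may differ.
#[derive(Debug, Deserialize)]
struct Model {
    name: String,
    sql: String,
    inputs: Vec<Column>,
}

#[derive(Debug, Deserialize)]
struct Column {
    name: String,
    r#type: String,
}

fn main() -> Result<(), toml::de::Error> {
    let raw = r#"
        name = "revenue_by_segment"
        sql = "SELECT customer_id, SUM(amount) AS total_revenue FROM orders GROUP BY 1"

        [[inputs]]
        name = "orders.customer_id"
        type = "BIGINT"

        [[inputs]]
        name = "orders.amount"
        type = "DECIMAL(18,2)"
    "#;
    let model: Model = toml::from_str(raw)?;
    println!("{} reads {} typed input columns", model.name, model.inputs.len());
    Ok(())
}
```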
Schema drift detection drops and recreates instead of corrupting silently
Schema drift is the lineage failure that does not look like a failure. A source system changes a column from INTEGER to VARCHAR. Your pipeline keeps running. Downstream aggregations start summing string representations or, depending on the warehouse, casting silently with truncation. The numbers in your Power BI report change by amounts small enough that nobody investigates for weeks.
Rocky handles this by diffing source schemas against the compiled model graph on every run. When it detects that a source column type has changed upstream, it does not attempt a graceful migration. It drops the affected target tables and recreates them. This is aggressive, and intentionally so — the alternative is propagating corrupted data through every downstream model that touches the changed column.
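The detection itself is a diff between the schema recorded at the last run and what the source reports now. A sketch of the decision, with invented types:

```rust
use std::collections::HashMap;

/// Diff the recorded source schema against what the warehouse reports
/// now. Any type change (or dropped column) marks dependents for
/// drop-and-recreate rather than risking a silent cast downstream.
fn drifted_columns<'a>(
    recorded: &HashMap<&'a str, &'a str>,
    current: &HashMap<&'a str, &'a str>,
) -> Vec<&'a str> {
    recorded
        .iter()
        .filter(|(col, ty)| current.get(*col) != Some(ty))
        .map(|(col, _)| *col)
        .collect()
}

fn main() {
    let recorded = HashMap::from([("orders.order_id", "INTEGER"), ("orders.amount", "DECIMAL")]);
    // The source system changed order_id under us.
    let current = HashMap::from([("orders.order_id", "VARCHAR"), ("orders.amount", "DECIMAL")]);
    for col in drifted_columns(&recorded, &current) {
        println!("drift on {col}: dropping and recreating affected targets");
    }
}
```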
The branching system provides a safety mechanism for testing schema changes before they hit production. rocky branch create experiment-1 creates a logical copy of the pipeline's tables using schema-prefix isolation (with native Delta SHALLOW CLONE and Snowflake zero-copy cloning on the roadmap). You run your pipeline against the branch, inspect the results, then promote or discard. The branch operates against a separate schema, so production tables are untouched during experimentation.
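Schema-prefix isolation is straightforward to picture: the branch name becomes part of the target schema, so branch runs write to parallel tables. A sketch with invented naming; Rocky's actual layout may differ:

```rust
/// Qualify a model's table for a branch using schema-prefix isolation:
/// production writes to analytics.revenue_by_segment, while a branch
/// writes to an isolated schema derived from the branch name.
fn target_table(schema: &str, table: &str, branch: Option<&str>) -> String {
    match branch {
        Some(name) => format!("{schema}__branch_{name}.{table}"),
        None => format!("{schema}.{table}"),
    }
}

fn main() {
    println!("{}", target_table("analytics", "revenue_by_segment", None));
    println!("{}", target_table("analytics", "revenue_by_segment", Some("experiment-1")));
}
```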
There is a notable gap here: snapshot models on branches start with empty tables and accumulate from the first branch run. They inherit no history from main. Hugo Correia, Rocky's creator, acknowledged this limitation directly in the Hacker News discussion. For teams relying heavily on slowly-changing dimension patterns, this means branch testing of snapshot models will not reflect production behavior until snapshots are fully supported.
Compile-time contracts versus runtime surprises
dbt popularized the idea that SQL transformations should be version-controlled and testable. Rocky extends this by enforcing data contracts at compile time rather than catching violations after execution. The difference is analogous to the gap between TypeScript and JavaScript: both let you write the same logic, but one catches type errors before your code ships.
Rocky's .rocky model files declare explicit types for inputs and outputs. The compiler validates these declarations against actual warehouse schemas during rocky compile. If a model expects order_date TIMESTAMP but the source has drifted to order_date DATE, the compiler flags the mismatch before rocky run ever sends SQL to your warehouse. Anders Brownworth from dbt Labs commented on the Hacker News thread: "cool project -- I especially love the branching and budgeting options you've built in. both are things that I'd love for the dbt standard to include one day."
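Stripped of warehouse plumbing, the order_date check reduces to a comparison the compiler can make before any SQL is sent. A sketch with invented shapes, not Rocky's real diagnostics API:

```rust
/// Compile-time contract check: a declared input type must match what
/// the warehouse schema reports, or compilation stops before `rocky run`.
fn check_contract(column: &str, declared: &str, actual: &str) -> Result<(), String> {
    if declared != actual {
        return Err(format!(
            "contract violation on {column}: declared {declared}, warehouse has {actual}"
        ));
    }
    Ok(())
}

fn main() {
    // The drift from the article: the model expects TIMESTAMP, but the
    // source column has become DATE.
    match check_contract("order_date", "TIMESTAMP", "DATE") {
        Ok(()) => println!("contract holds"),
        Err(diag) => eprintln!("{diag}"),
    }
}
```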
The cost attribution feature reinforces this compile-time philosophy. Rocky tracks bytes scanned and execution duration per model, with configurable budget thresholds (max_usd, max_duration_ms) that can gate CI pipelines. A model that suddenly scans 10x more data after a schema change gets caught in CI, not in your cloud bill three weeks later. The enforcement is not yet complete (bytes-scanned thresholds are logged but cannot yet block CI as hard gates), but the per-model cost visibility alone changes how teams reason about schema changes.
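The gating logic itself is a per-model threshold comparison. A sketch that reuses the max_usd and max_duration_ms thresholds named above; the surrounding types are invented:

```rust
/// Per-model budget from the model's configuration.
struct Budget {
    max_usd: f64,
    max_duration_ms: u64,
}

/// What a run actually cost.
struct RunCost {
    usd: f64,
    duration_ms: u64,
}

/// Gate a CI pipeline on a model's measured cost. A model that suddenly
/// scans 10x more data after a schema change fails here, not in the bill.
fn check_budget(model: &str, budget: &Budget, cost: &RunCost) -> Result<(), String> {
    if cost.usd > budget.max_usd {
        return Err(format!("{model}: ${:.2} exceeds budget ${:.2}", cost.usd, budget.max_usd));
    }
    if cost.duration_ms > budget.max_duration_ms {
        return Err(format!("{model}: {}ms exceeds budget {}ms", cost.duration_ms, budget.max_duration_ms));
    }
    Ok(())
}

fn main() {
    let budget = Budget { max_usd: 5.0, max_duration_ms: 60_000 };
    let cost = RunCost { usd: 48.7, duration_ms: 42_000 };
    if let Err(diag) = check_budget("revenue_by_segment", &budget, &cost) {
        eprintln!("{diag}");
        std::process::exit(1); // fail the CI job
    }
}
```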
SQL remains the dominant data language across every experience level in the 2024 Stack Overflow survey, used by 51% of all respondents and 54.1% of professional developers. Tools that enforce correctness on SQL pipelines before execution address a real gap: SQL has no built-in type system across transformation boundaries.
Replay reconstructs the exact inputs that produced a result
Debugging a data quality issue three days after it happened typically means reading logs, guessing at the state of source tables at execution time, and hoping nobody truncated and reloaded anything in between. The rocky replay command reconstructs which SQL statements ran against which inputs for a specific historical execution, giving you a reproducible record of what the pipeline actually did.
This is not time-travel querying against the warehouse. It is metadata replay: Rocky's embedded state store records the exact SQL, the watermark positions for incremental models, and the schema signatures at execution time. When you replay a run, you see the precise conditions that produced the output, without needing to maintain warehouse-level snapshot retention policies.
For incremental models, this solves a particularly painful debugging scenario. A model configured with strategy = "incremental" and a timestamp_column tracks its high-water mark in Rocky's state store. Subsequent runs only process rows where the timestamp exceeds the stored watermark. When an incremental model produces unexpected results, rocky replay shows you the exact watermark position and input window for that run, rather than forcing you to reconstruct it from warehouse query history.
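The replay record for an incremental model can be small. A sketch with invented field names; the real state store presumably tracks more, but the compiled SQL, the watermark, and a schema signature are enough to reconstruct the input window:

```rust
/// What the state store might record per run (invented field names).
struct RunRecord {
    run_id: u64,
    compiled_sql: String,
    watermark: String,        // high-water mark at execution time
    schema_signature: String, // hash of input schemas at execution time
}

/// Rebuild the predicate an incremental run actually used: only rows
/// past the stored watermark were processed.
fn replay_predicate(rec: &RunRecord, timestamp_column: &str) -> String {
    format!("{timestamp_column} > '{}'", rec.watermark)
}

fn main() {
    let rec = RunRecord {
        run_id: 1042,
        compiled_sql: "INSERT INTO revenue_by_segment SELECT ...".into(),
        watermark: "2024-06-01T00:00:00Z".into(),
        schema_signature: "a1b2c3".into(),
    };
    println!("run {} (schemas {}) replayed:", rec.run_id, rec.schema_signature);
    println!("  {}", rec.compiled_sql);
    println!("  window: {}", replay_predicate(&rec, "updated_at"));
}
```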
The combination of replay and compile-time lineage creates an audit trail that works in both directions: forward from source columns to downstream outputs via rocky lineage, and backward from a specific run's results to the exact inputs and transformations that produced them via rocky replay. MetricSign complements this by monitoring the operational layer — detecting when scheduled pipeline runs fail, grouping related failures across dependent models, and surfacing root-cause context that connects warehouse errors back to the specific model or source that triggered the cascade.
Where Rocky fits and where the gaps remain
Rocky is a control plane, not a warehouse. It does not store data or execute queries itself. Your Databricks, Snowflake, or BigQuery cluster handles compute and storage. Rocky owns the graph: dependencies, types, drift detection, lineage, cost attribution, and governance policies like column-level masking via rocky compliance --env prod --fail-on exception.
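Even the masking audit is a property of the graph: every column tagged as protected needs a masking rule in the target environment. A sketch of that check, with an invented policy shape; exiting non-zero mirrors what a --fail-on exception flag implies:

```rust
use std::collections::{HashMap, HashSet};

/// Sketch of a masking audit: every column tagged as protected must
/// have a masking rule in the given environment, or the check fails.
fn unmasked(protected: &HashSet<&str>, masks: &HashMap<&str, &str>) -> Vec<String> {
    protected
        .iter()
        .filter(|col| !masks.contains_key(**col))
        .map(|col| col.to_string())
        .collect()
}

fn main() {
    let protected = HashSet::from(["customers.email", "customers.ssn"]);
    let masks = HashMap::from([("customers.email", "sha256")]);
    let gaps = unmasked(&protected, &masks);
    if !gaps.is_empty() {
        eprintln!("unmasked protected columns in prod: {gaps:?}");
        std::process::exit(1); // fail the compliance check
    }
}
```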
The project's Rust foundation gives it practical advantages for CI integration. A single static binary means no Python environment to manage, no dependency conflicts with your existing dbt or Airflow installations. The compile step is fast enough to run on every pull request, catching lineage breaks and type mismatches before code review begins.
The gaps are real. There is no hosted UI or managed scheduler — Rocky recommends Dagster for orchestration. The snapshot limitation on branches means teams with SCD Type 2 patterns cannot fully validate branch changes. Bytes-scanned budget enforcement exists in logs but not as a hard CI gate. The project is early (400 commits, 10 branches, beta connectors for Snowflake and BigQuery), and the DSL is Rocky-specific rather than standard SQL, which means a migration cost for existing dbt projects.
But the core insight is sound: column lineage computed from the transformation graph at compile time catches a class of errors that post-hoc lineage tools structurally cannot. The query that parses, executes, and produces wrong numbers because a semantic relationship changed upstream — that is the failure mode that costs teams the most time and credibility. Catching it before execution, with a diagnostic code and a clear blast-radius report, is worth the migration cost for teams that have been burned by silent schema drift one too many times.