Lineage has always been important in data engineering. But for most organizations, it's been a nice-to-have: something you care about when onboarding new engineers, when debugging data quality issues, or when someone questions a number in a board deck. For human-driven analytics, the absence of lineage is an inconvenience. For agentic analytics, it's a governance failure.
The reason is accountability. When a human analyst produces an incorrect KPI, there's a clear chain of responsibility: the analyst made a mistake, you can ask them to explain their methodology, and you can trace the error through their logic. When an agent produces an incorrect KPI, that chain doesn't exist unless you've built it. The agent's reasoning is not available for inspection the way a human's is. Lineage is the mechanism that creates accountability for agent-generated numbers: the ability to trace any metric back to its source data, through every transformation, so that an error can be diagnosed, a correction can be made, and a regulator can be satisfied.
This article walks through what end-to-end lineage requires for agentic systems: the difference between column-level and table-level lineage, how dbt and catalog tools contribute different pieces, how automated impact analysis prevents downstream breakage, and what it means to make lineage accessible programmatically to agents themselves.
Why Lineage Was Nice-to-Have for Humans But Non-Negotiable for Agents
A human analyst working on a quarterly revenue report carries a mental model of the data pipeline. They know which source system feeds which table, which transformations are applied in dbt, and which filters are applied in the BI layer. This mental model is imperfect and informal, but it exists. When something goes wrong, the analyst can use this knowledge to navigate backwards through the pipeline and find the problem.
An AI agent has no such mental model. It can read documentation if it's written and accessible, but it can't draw on years of experience with the pipeline. More importantly, an agent can be running dozens of queries simultaneously across different parts of the data warehouse, without any awareness of how those queries relate to each other or what the dependencies between the underlying tables are. When an agent query produces a wrong answer, you can't ask the agent to trace its reasoning through the pipeline. You need the lineage to exist as structured data that can be queried independently.
Regulatory requirements are accelerating this shift. GDPR, CCPA, and emerging AI governance regulations increasingly require organizations to demonstrate that AI-generated outputs can be traced back to source data. This isn't a theoretical future requirement. It's a present one for organizations in regulated industries. Lineage is the infrastructure that makes that demonstration possible.
Column-Level vs. Table-Level Lineage: Why the Difference Matters
Table-level lineage tells you which tables depend on which other tables. Model B depends on table A. Dashboard C depends on model B. This is useful for understanding the broad shape of the pipeline and for knowing that if table A changes, something downstream might break. But it doesn't tell you which columns in model B use which columns from table A, which makes it inadequate for PII tracking, impact analysis, and compliance.
Column-level lineage traces individual columns from source to destination. It tells you that the customer_email column in mart_customers comes from raw_crm.contacts.email, was hashed in the staging model, and is exposed (still hashed) in the final mart. That full column-level path is what GDPR's right-to-erasure requirement asks you to produce: when a user requests deletion of their data, you need to know every column in every table that contains a derivation of their data.
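The erasure workflow above amounts to a graph traversal. As a minimal sketch (the edge list and column names are hypothetical, mirroring the customer_email example; a real graph would come from a catalog API or parsed SQL):

```python
from collections import deque

# Hypothetical column-level lineage edges: (source_column, derived_column).
EDGES = [
    ("raw_crm.contacts.email", "stg_contacts.email_hash"),
    ("stg_contacts.email_hash", "mart_customers.customer_email"),
    ("raw_crm.contacts.id", "stg_contacts.contact_id"),
]

def derived_columns(source_column, edges):
    """Return every column derived from source_column, i.e. the full
    set a right-to-erasure request would need to touch."""
    downstream = {}
    for src, dst in edges:
        downstream.setdefault(src, []).append(dst)
    seen, queue = set(), deque([source_column])
    while queue:
        col = queue.popleft()
        for child in downstream.get(col, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Calling derived_columns("raw_crm.contacts.email", EDGES) returns both the hashed staging column and the mart column, which is exactly the list an erasure request needs.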
For agents, column-level lineage also enables better error diagnosis. When an agent returns an incorrect metric, knowing that the metric's component columns trace back to specific source columns, and knowing which of those source columns recently changed or started failing quality checks, lets you pinpoint the root cause at the column level rather than having to investigate the entire upstream pipeline.
dbt's Built-In Lineage: What It Gives You and Its Limits
dbt generates table-level lineage automatically from the ref() and source() functions in your model definitions. Every time a model uses ref('other_model'), dbt records a dependency edge. The resulting lineage graph is visible in dbt docs and accessible via dbt's metadata API. For organizations that have their full transformation layer in dbt, this provides automatic, always-current table-level lineage at no additional operational cost.
dbt's lineage has important limitations for agentic use cases. It covers only what's in your dbt project: models, sources, seeds, exposures. It doesn't know about transformations that happen outside of dbt (in Fivetran, in Python notebooks, in stored procedures, or in other tools). And it's table-level, not column-level: dbt can tell you that model B depends on model A, but not that column X in model B comes from column Y in model A. Column-level lineage in dbt requires either parsing the SQL in each model (which dbt does not do natively) or using a catalog tool that adds this capability.
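The table-level dependency edges dbt records are available locally in the manifest it writes to target/manifest.json, where each node lists its parents under depends_on.nodes. A sketch, using a trimmed in-memory stand-in for the manifest (the node IDs are hypothetical; real usage would load the JSON file):

```python
# Trimmed stand-in for dbt's target/manifest.json; in practice:
# manifest = json.load(open("target/manifest.json"))
manifest = {
    "nodes": {
        "model.proj.stg_contacts": {
            "depends_on": {"nodes": ["source.proj.raw_crm.contacts"]},
        },
        "model.proj.mart_customers": {
            "depends_on": {"nodes": ["model.proj.stg_contacts"]},
        },
    }
}

def table_level_lineage(manifest):
    """Map each dbt node to its direct upstream parents, as recorded
    from ref() and source() calls."""
    return {
        node_id: node.get("depends_on", {}).get("nodes", [])
        for node_id, node in manifest["nodes"].items()
    }
```

This gives you the table-level graph as plain data, which is the form an agent or a CI check can actually consume.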
dbt Cloud's Discovery API exposes lineage data programmatically, which is significant for agentic use cases. An agent that can query the dbt metadata API can retrieve the lineage for any model before constructing a query, allowing it to understand the provenance of the data it's working with and surface that provenance in its responses. This is the foundation for agents that can explain "here's how this number was calculated and where the data came from."
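Once an agent has retrieved lineage for a model, the remaining step is turning it into provenance text it can attach to an answer. A sketch under the assumption of a hypothetical response shape (the Discovery API and most catalog APIs return something analogous, but the field names here are illustrative, not the real schema):

```python
# Hypothetical lineage-API response; field names are illustrative only.
response = {
    "model": {
        "uniqueId": "model.proj.mart_customers",
        "parents": [
            {"uniqueId": "model.proj.stg_contacts",
             "description": "Staging: one row per CRM contact, PII hashed"},
            {"uniqueId": "source.proj.raw_crm.contacts",
             "description": "Raw CRM contacts extract"},
        ],
    }
}

def provenance_summary(response):
    """Turn a lineage response into the provenance text an agent can
    append to its answer."""
    model = response["model"]
    lines = [f"{model['uniqueId']} is built from:"]
    for parent in model["parents"]:
        lines.append(f"  - {parent['uniqueId']}: {parent['description']}")
    return "\n".join(lines)
```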
Data Catalog Tools: What They Add and When You Need Them
Data catalog tools (Atlan, DataHub, Alation, and OpenMetadata are the major options) extend lineage beyond what dbt provides by collecting lineage from all the tools in your data stack and providing a unified view. A catalog connected to your data warehouse, dbt, Fivetran, Looker, and Tableau can trace a metric from a dashboard cell back through the BI layer, through the dbt transformation, through the Fivetran sync, to the source system, all in a single lineage graph.
The most important capability for agentic use cases is programmatic lineage access via API. A catalog's UI is useful for humans who want to browse lineage interactively. Agents need the same information available as structured data they can query. Atlan, DataHub, and OpenMetadata all provide GraphQL or REST APIs for lineage queries. An agent that can query these APIs can retrieve the full upstream lineage for any metric before generating a response, enabling it to include provenance information in its answers and to flag metrics whose upstream models have recent quality issues.
Column-level lineage is the differentiating capability among catalog tools, and it varies significantly in depth and accuracy. Column-level lineage requires parsing the SQL of every transformation in the pipeline and inferring column-level dependencies from the parse tree. Some tools do this for dbt SQL, some for warehouse SQL, and some require separate instrumentation. For teams with agentic use cases, column-level lineage coverage is the single most important evaluation criterion: specifically, which sources and transformation tools each product can parse.
Impact Analysis: Knowing What Breaks Before It Breaks
Automated impact analysis uses lineage data to answer the question: if I change this table or column, what else will break? This has always been valuable for data engineers making schema changes. For agentic systems, it becomes critical: an agent that queries a metric whose upstream model has just been modified in a breaking way will return incorrect results, and those results may propagate into reports, decisions, or automated actions before anyone notices.
In dbt, the dbt ls command with a graph selector (for example, dbt ls --select my_model+ lists my_model and everything downstream of it) can identify all models downstream of a given model, enabling a manual impact assessment before a change is deployed. Catalog tools automate this into a pre-change impact report: before you merge a PR that drops a column, the catalog shows you every downstream model, report, and metric that references that column. Some catalog tools can also trigger automated tests on downstream models when an upstream change is detected.
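The impact question is the mirror image of the lineage query: instead of walking upstream to sources, you walk downstream to consumers. A minimal sketch, with a hypothetical downstream-edge map (node names are invented for illustration):

```python
from collections import deque

# Hypothetical table-level lineage: node -> direct downstream consumers.
DOWNSTREAM = {
    "raw_crm.contacts": ["stg_contacts"],
    "stg_contacts": ["mart_customers"],
    "mart_customers": ["dash_revenue", "metric_monthly_revenue"],
}

def impact_set(changed_node, downstream):
    """Everything that could break if changed_node changes: the
    transitive closure of its downstream dependencies, analogous to
    selecting a model's descendants with dbt's `node+` syntax."""
    seen, queue = set(), deque([changed_node])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

In a deployment pipeline, the returned set is what you would diff against the list of metrics agents are allowed to query, notifying owners on any overlap.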
For organizations with agents running in production, impact analysis should be part of the deployment pipeline. A schema change to an upstream table should trigger an automated check of all downstream agent queries and metrics, with a notification to the relevant owners if a breaking change is detected. This prevents the scenario where an agent runs successfully for days after a schema change, producing subtly wrong numbers that nobody catches until the error compounds into a significant business problem.
Documentation and Tests in dbt: Why Undocumented Models Are Ungoverned Models
Lineage without documentation is incomplete. A lineage graph tells you that metric A depends on model B which depends on table C. But it doesn't tell you what model B is supposed to do, what invariants it's supposed to maintain, or what a "correct" row in model B looks like. Documentation fills this gap: it makes the intent of each model explicit, so that when lineage traces a metric back to a model, there's also a description of what that model is doing.
dbt's documentation system ties documentation directly to lineage. When you write a description for a model or column in schema.yml, that description appears in the lineage graph in dbt docs and is accessible via the metadata API. Agents that can query the dbt metadata API can retrieve not just the lineage structure but also the documentation for each node in the graph, understanding not just where the data came from but what each transformation is supposed to do.
Tests in dbt are the enforcement counterpart to documentation. A test asserts that a specific constraint holds: that a column has no nulls, that values are unique, that foreign keys are valid, that revenue is always positive. When tests fail, they signal that something has gone wrong in the data. For agentic systems, test coverage is a proxy for data reliability: a model with comprehensive tests and a known test pass rate is much safer for agents to query than an undocumented, untested model. Treating test coverage as a governance metric rather than just a quality metric changes how organizations prioritize it.
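Treating test coverage as a governance metric implies computing it per model and gating agent access on it. A sketch, assuming hypothetical per-model metadata of the kind you could assemble from manifest.json (the model names and the 50% threshold are illustrative):

```python
# Hypothetical per-model metadata: all columns vs. columns with tests.
MODELS = {
    "mart_customers": {"columns": ["id", "customer_email", "ltv"],
                       "tested_columns": ["id", "customer_email"]},
    "mart_orders": {"columns": ["id", "amount"],
                    "tested_columns": []},
}

def coverage_report(models, threshold=0.5):
    """Score each model's column test coverage and flag models below
    the governance threshold as unsafe for agents to query."""
    report = {}
    for name, meta in models.items():
        cols = meta["columns"]
        coverage = len(meta["tested_columns"]) / len(cols) if cols else 0.0
        report[name] = {"coverage": coverage,
                        "agent_safe": coverage >= threshold}
    return report
```

The agent_safe flag is the governance output: an untested mart is not merely low quality, it is off-limits to agents until coverage improves.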
Time to Root Cause: Observability Tools and Near-Instant Diagnosis
Data observability tools (Monte Carlo, Elementary, re_data, and Datafold) monitor your data pipeline continuously and alert on anomalies before they affect downstream consumers. They detect schema changes, null rate increases, value distribution shifts, and row count anomalies by comparing current metrics against historical baselines. For human-driven analytics, these tools reduce the time between a data quality incident and its detection from days to minutes. For agentic analytics, they're a prerequisite for safe operation.
The reason is that agents operate continuously. A human analyst runs a report once a week; if the data was wrong this week, the error is contained to one report. An agent may query the same metric hundreds of times per day and distribute the results across many downstream consumers. If the underlying data develops a quality issue and the agent isn't stopped, it will propagate that error at scale before anyone notices. Observability tools that alert immediately on quality degradation allow you to pause agent queries to affected metrics while the issue is investigated.
Elementary and re_data are open-source tools that run as dbt packages, generating quality metric tables and dashboards from within your existing dbt project. Monte Carlo is the most feature-complete commercial option, with ML-based anomaly detection and deep lineage integration. For teams just starting with observability, Elementary is an excellent starting point: it requires no additional infrastructure and produces immediately useful quality dashboards from your existing dbt tests and models.
Programmatic Lineage Access: Why Agents Need to Query Lineage via API
A lineage graph that's only visible in a UI is useful for human exploration but inaccessible to agents. For lineage to provide value in agentic systems, it needs to be queryable programmatically: an agent should be able to ask "what is the full upstream lineage of the monthly_revenue metric?" and receive a structured response that lists every upstream model and source table, with documentation and freshness status for each.
This capability enables a class of agent behaviors that aren't possible without it. An agent can check lineage before answering a question, flagging metrics whose upstream models have recent quality issues. An agent can include provenance information in its answers, explaining not just what the number is but where it came from. An agent can refuse to answer queries about metrics that are flagged as unreliable, rather than silently returning a stale or incorrect value.
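The refuse-or-flag behaviors above reduce to a simple gate over upstream quality status. A minimal sketch, assuming a hypothetical per-node status map of the kind an observability tool's API might report (node names and status labels are invented for illustration):

```python
# Hypothetical quality status per lineage node.
QUALITY = {
    "raw_crm.contacts": "healthy",
    "stg_contacts": "failing",       # e.g. a null-rate anomaly alert
    "mart_customers": "healthy",
}

def gate_metric(metric_upstreams, quality):
    """Decide whether an agent should answer: refuse if any upstream
    node is failing, answer with a caveat if any status is unknown,
    otherwise answer normally."""
    failing = [n for n in metric_upstreams if quality.get(n) == "failing"]
    unknown = [n for n in metric_upstreams if n not in quality]
    if failing:
        return ("refuse", failing)
    if unknown:
        return ("answer_with_caveat", unknown)
    return ("answer", [])
```

The returned node list is what the agent surfaces to the user: "this metric depends on stg_contacts, which is currently failing quality checks."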
Building programmatic lineage access typically involves connecting your agent framework to the API of your catalog tool (DataHub's GraphQL API, Atlan's REST API) or dbt Cloud's Discovery API. This is an integration project, not a configuration change, but it's one of the highest-leverage investments you can make in agentic reliability. Agents that know about data quality are much safer than agents that don't.
The Audit Scenario
A regulator asks you to trace an AI-generated KPI ("gross margin by product category for Q4 2025") back to its source data. The KPI was generated by your analytics agent and appeared in a board report. Without lineage, this is a multi-day investigation: you need to reverse-engineer the query the agent ran, trace each column through the transformation pipeline manually, and document every intermediate step.
With full lineage infrastructure in place, the answer is available in minutes. Your agent's audit log contains the exact SQL it ran. Your catalog tool can trace every column in that SQL back to its source table, through every dbt model in between, showing exactly which source systems contributed to the number and what transformations were applied. The full provenance chain is a structured API response, not a manual reconstruction.
How does your lineage score?
The Semantic Layer Readiness Scorecard assesses lineage and traceability alongside four other dimensions of agentic readiness. Takes 5 minutes.
Take the Scorecard →