Data Catalogs Are Only Useful If They Match Real Workflows

Pitching a metadata catalog is almost always reasonable: if people could just find the right data, everything would get better.

Data teams love the idea of fewer Slack questions, mystery tables etc.

The hope is not wrong, but it is incomplete.

A catalog that helps people find tables is useful, and a catalog that helps people make decisions is better. The difference is whether it matches the work people are already doing.

The catalog is not the workflow

DataHub's overview describes the usual bundle well: structure, location, ownership, quality, lineage, and search, but none of them are the point.

The questions we should care about are messier:

I found 3 tables. Which one should I use?
This metric changed. What broke downstream?
Can I trust this field for a production workflow?
Who understands the weird edge cases?
Is this dataset safe to expose to an agent?
If I backfill this column, what systems will notice?

Those are workflow questions, and a catalog that cannot answer them is just a nicer UI for confusion.

Search is the shallow problem

Search is usually the first thing people ask for because the pain is easy to describe, but discovery almost never ends at search.

The harder problem starts after the user finds something plausible. Now, they need to know whether the dataset is right for their use case. That requires context of:

freshness
ownership
semantic meaning
known caveats
downstream consumers
access constraints
lineage and transformations
examples of real usage

A catalog that returns a table name without this context has only moved the confusion one click deeper.

Metadata systems should feel less like search engines and more like decision support on what to do next.

Usage is metadata too

The most underrated signal is usage.

Which datasets do people query repeatedly? Which tables power production jobs? Which fields appear in important dashboards? Which datasets are technically available but socially abandoned?

Static metadata tells you what exists, but usage metadata tells you what the organization actually trusts.

This distinction matters becuase a table can have a perfect owner field, a nice description, and a clean schema, but still be the wrong source for most work. On the flip side, another table may look ugly and still be the one every serious workflow depends on.

If the catalog ignores usage, it will often recommend the wrong kind of cleanliness.

Lineage has to answer operational questions

Lineage is one of those features everyone says they want, but lineage diagrams become theater if they do not answer specific questions.

Useful lineage helps with decisions like:

can I delete this field?
what happens if this upstream source is late?
which dashboards will change if this definition changes?
what production systems depend on this model?
where did this wrong number enter the graph?

The diagram is not the product. The answer is the product.

In AI systems, this gets more important. If an agent answers a question from a dataset, you need to know where the answer came from, which definitions were used, and whether the user even had access. Provenance stops being an audit nicety and becomes part of runtime correctness.

Governance should live in the path of work

Governance often fails by being adjacent to the workflow.

There is a policy page over here, a catalog over there, a Slack thread with the real caveats somewhere else, and application code enforcing permissions in a fourth place.

People do not ignore governance because they hate rules. They ignore governance because the correct path is slower than the path that works.

A good catalog shortens the correct path and should make ownership obvious. It should make access constraints visible before someone builds on data they cannot use. It should make certified definitions easy to choose and uncertified ones appropriately suspicious. It should connect the person asking the question to the person who knows the trap.

Governance works better when it is on that path, not posted up next to it.

Evaluate catalogs against jobs, not features

Start the evaluation with jobs people actually need to do.

For example:

an analyst needs to find the right revenue metric
an engineer needs to understand whether a field can be removed
a product team needs to expose warehouse-derived data in a production workflow
an agent needs to answer a question without crossing permission boundaries
a platform team needs to understand what breaks if a source changes

Then, test whether the catalog makes those jobs easier. That test is much harder to fake.

Where this lands

A data catalog is not valuable because it contains metadata. It is valuable because it reduces the amount of institutional memory required to do careful work.

A useful catalog is not the one with the most complete inventory. It is the one that matches the decisions users are already making: what to trust, what to use, what to change, what to avoid, and who to ask when the system stops being obvious.

Metadata is only useful when it positively influences behavior.