The first pitch for a metadata catalog is almost always reasonable: if people could just find the right data, everything would get better.
Fewer Slack questions. Fewer mystery tables. Fewer dashboards built on stale definitions. Fewer people asking the one analyst who somehow knows where every number lives.
The hope is not wrong. It is incomplete.
A catalog that helps people find tables is useful. A catalog that helps people make decisions is better. The difference is whether it matches the work people are already doing.
The catalog is not the workflow
Search quality. Column-level lineage. Ownership fields. Tags. Glossaries. Certification badges. Access request flows. All of these matter. DataHub's overview describes the usual bundle well: structure, location, ownership, quality, lineage, and search.
But none of them is the point.
The questions I care about are messier:
- I found three tables. Which one should I use?
- This metric changed. What broke downstream?
- Can I trust this field for a production workflow?
- Who understands the weird edge cases?
- Is this dataset safe to expose to an agent?
- If I backfill this column, what systems will notice?
Those are workflow questions. A catalog that cannot answer them is just a nicer UI for confusion.
Search is the shallow problem
Search is usually the first thing people ask for, because the pain is easy to name. "I cannot find the table" is a complaint anyone can write down.
But discovery almost never ends at search.
The harder problem starts after the user finds something plausible. Now they need to know whether the dataset is right for their use case. That requires context:
- freshness
- ownership
- semantic meaning
- known caveats
- downstream consumers
- access constraints
- lineage and transformations
- examples of real usage
A catalog that returns a table name without this context has only moved the confusion one click deeper.
The metadata systems I trust feel less like search engines and more like decision support. They help the user decide what to do next.
Usage is metadata too
One underrated signal is actual usage.
Which datasets do people query repeatedly? Which tables power production jobs? Which fields appear in important dashboards? Which datasets are technically available but socially abandoned?
Static metadata tells you what exists. Usage metadata tells you what the organization actually trusts.
This distinction matters. A table can have a perfect owner field, a nice description, and a clean schema, while still being the wrong source for most work. Another table can look ugly and still be the one every serious workflow depends on.
If the catalog ignores usage, it will often recommend the well-documented table over the one the organization actually relies on.
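To make the idea concrete, here is a minimal sketch of turning a query log into usage signals. Everything in it is an assumption for illustration: the `QueryEvent` shape, the table names, and the 30-day window are hypothetical, and a real catalog would pull these events from warehouse audit logs rather than a hand-built list.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QueryEvent:
    """One row of a hypothetical query log."""
    table: str
    user: str
    day: date
    from_production_job: bool


def usage_summary(events, today, window_days=30):
    """Summarize recent usage per table: query count, distinct users, production use."""
    stats = defaultdict(lambda: {"queries": 0, "users": set(), "production": False})
    for e in events:
        if (today - e.day).days > window_days:
            continue  # older than the lookback window
        s = stats[e.table]
        s["queries"] += 1
        s["users"].add(e.user)
        s["production"] = s["production"] or e.from_production_job
    return {
        table: {
            "queries": s["queries"],
            "distinct_users": len(s["users"]),
            "production": s["production"],
        }
        for table, s in stats.items()
    }


def abandoned(all_tables, summary):
    """Tables the catalog lists that saw no usage inside the window."""
    return sorted(t for t in all_tables if t not in summary)


# Illustrative data only: one table in active use, one with no recent queries.
events = [
    QueryEvent("finance.revenue_v2", "ana", date(2024, 5, 20), False),
    QueryEvent("finance.revenue_v2", "ben", date(2024, 5, 21), True),
    QueryEvent("finance.revenue_clean", "ana", date(2024, 1, 3), False),
]
summary = usage_summary(events, today=date(2024, 6, 1))
stale = abandoned(["finance.revenue_v2", "finance.revenue_clean"], summary)
```

In this toy data, `finance.revenue_clean` lands in `stale` no matter how good its description is, while `finance.revenue_v2` shows two users and a production dependency. Those are the signals static metadata cannot supply.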
Lineage has to answer operational questions
Lineage is one of those features everyone says they want.
They are right to want it. But lineage diagrams become theater if they do not answer specific questions.
Useful lineage helps with decisions like:
- can I delete this field?
- what happens if this upstream source is late?
- which dashboards will change if this definition changes?
- what production systems depend on this model?
- where did this wrong number enter the graph?
The diagram is not the product. The answer is the product.
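As a rough sketch of what "the answer is the product" can mean, the questions above reduce to a traversal over lineage edges. The graph below is made up, and the naming convention (`dashboard.` and `service.` prefixes) is an assumption for illustration, not how any particular catalog models assets.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the nodes listed in its value.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["marts.revenue", "marts.order_counts"],
    "marts.revenue": ["dashboard.exec_revenue", "service.billing_export"],
    "marts.order_counts": ["dashboard.ops_daily"],
}


def downstream(graph, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = {start}  # never report the starting node, even if the graph has a cycle
    order = []
    queue = deque(graph.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


def affected(graph, start, prefix):
    """Narrow the traversal to one kind of asset, identified here by name prefix."""
    return [n for n in downstream(graph, start) if n.startswith(prefix)]


# "Which dashboards will change if this definition changes?"
dashboards = affected(LINEAGE, "staging.orders", "dashboard.")

# "What production systems depend on this model?"
services = affected(LINEAGE, "marts.revenue", "service.")
```

The user asking about `staging.orders` gets back two dashboard names, not a diagram to squint at. The same traversal run against reversed edges would address the "where did this wrong number enter the graph?" question.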
For AI systems, this gets more important. If an agent answers a question from a dataset, you need to know where the answer came from, which definitions were used, and whether the user was allowed to see the underlying data. Provenance stops being an audit nicety and becomes part of runtime correctness.
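One way to picture provenance as part of runtime correctness is a check that runs before an agent's answer is released. This is a sketch under stated assumptions: the `Provenance` record, the `respond` function, and the grant model (a plain set of dataset names) are all hypothetical, and real permission systems are considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """What an answer was built from. All fields are illustrative."""
    datasets: tuple     # datasets the answer read from
    definitions: tuple  # metric definitions applied, e.g. "revenue@v3"


def blocked_datasets(provenance, user_grants):
    """Datasets behind the answer that the user is not allowed to see."""
    return sorted(d for d in provenance.datasets if d not in user_grants)


def respond(answer, provenance, user_grants):
    """Release the answer only if every source dataset is permitted for this user."""
    blocked = blocked_datasets(provenance, user_grants)
    if blocked:
        return {"answer": None, "blocked_on": blocked, "provenance": provenance}
    return {"answer": answer, "blocked_on": [], "provenance": provenance}


# Illustrative only: the answer drew on one dataset the user cannot read.
prov = Provenance(
    datasets=("marts.revenue", "hr.salaries"),
    definitions=("revenue@v3",),
)
denied = respond("Revenue per employee rose.", prov, user_grants={"marts.revenue"})
allowed = respond(
    "Revenue per employee rose.", prov, user_grants={"marts.revenue", "hr.salaries"}
)
```

The point of the sketch is the ordering: the provenance record exists before the answer is shown, so the permission check and the "which definitions were used" question are answered from the same data.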
Governance should live in the path of work
Governance often fails because it sits next to the workflow instead of inside it.
There is a policy page over here, a catalog over there, a Slack thread with the real caveats somewhere else, and application code enforcing permissions in a fourth place.
People do not ignore governance because they hate rules. They ignore governance because the correct path is slower than the path that works.
A good catalog shortens the correct path.
It should make ownership obvious. It should make access constraints visible before someone builds on data they cannot use. It should make certified definitions easy to choose and uncertified ones appropriately suspicious. It should connect the person asking the question to the person who knows the trap.
Governance works better when it is on the path, not posted next to it.
Evaluate catalogs against jobs, not features
Start the evaluation with jobs people actually need to do.
For example:
- an analyst needs to find the canonical revenue metric
- an engineer needs to understand whether a field can be removed
- a product team needs to expose warehouse-derived data in a production workflow
- an AI agent needs to answer a question without crossing permission boundaries
- a platform team needs to understand what breaks if a source changes
Then test whether the catalog makes those jobs easier.
Not whether the demo looks good. Not whether the feature matrix is impressive. Whether a real user can complete a real workflow with less guessing.
That test is much harder to fake.
Where this lands
A data catalog is not valuable because it contains metadata. It is valuable because it reduces the amount of institutional memory required to do careful work.
A useful catalog is not the one with the most complete inventory. It is the one that matches the decisions users are already making: what to trust, what to use, what to change, what to avoid, and who to ask when the system stops being obvious.
Metadata is only useful when it changes behavior.