Chasing this at a big tech company for roughly two years now. The unintuitive thing I learnt: building a semantic layer top-down is a fool's errand. You always pick the wrong metrics first, because you're guessing what matters instead of letting the data decide.
What actually worked? Instrument the LLM agent. Log every natural language query it gets, the SQL it spits out, and which dimensions or measures it had to guess. After two weeks of logs, the top 20 ambiguous metrics emerged from actual usage, not from architecture diagrams and whiteboard sessions. Then we locked those down - one VP per metric, written sign-off, single-sentence definition pinned in the codebase.
Q1 results? Nine of the top 20 formally defined. Coverage of incoming queries hitting those defined metrics jumped from about 23% to 71%. Hallucination complaints dropped sharply, but not to zero - and the failures clustered on the still-undefined 30%, making prioritisation a no-brainer.
The cultural shift was bigger than the technical one. Before, "semantic layer" meant a four-team architecture meeting where nothing shipped. After we showed the actual query distribution, those meetings ended in fifteen minutes - the data made it impossible to argue that your edge case matters more than the actual usage patterns.
Three tactics that compound:
- Log queries and guessed dimensions first, let the layer be whatever the queries demand.
- One VP per disputed metric, single-sentence written definition, pinned in code.
- When the LLM has to guess, make it refuse and ask back rather than return a confident wrong number. That was controversial internally, but it's the only behaviour change that ever moved trust with execs.
One more meta-point: track the LLM's "would have guessed" cases separately from "answered confidently." That single distinction unlocks the conversation with finance and sales without needing a technical lecture.