AI giving different numbers for same metric?

sophiasaas

Been running AI queries on our warehouse and hitting a familiar pain point - ask for revenue one way, then rephrase it slightly, and the number shifts. Not wildly off, but enough that you can't trust it without manual checks. SQL looks fine on the surface, but feels like the AI is redefining the metric each time instead of referencing a fixed definition. Gets worse when logic lives across dashboards and teams - each one ends up with its own version.

Curious how others are solving this. Someone pointed me toward semantic layers - Kyvos came up in conversation - where metric definitions sit on top of the warehouse with filters and aggregations baked in, so the AI calls that definition instead of rebuilding logic from scratch. Sounds promising for keeping numbers consistent regardless of how you phrase the question. Cross-team fragmentation with dbt, dashboards, agents all having slightly different flavours of the same metric is a real drag - one central place for business logic seems like the fix.

Are you relying on prompts, curated datasets, or a full semantic layer? Keen to hear what's actually working.

liamcopy

You think an LLM will magically define "revenue" for you?

If your own analysts can't agree on what it means without digging into a Tableau workbook, you're already lost.

Curated datasets help. A centralised semantic layer is the only thing that kills metric drift.

ctrjunkie

Why not just define your metric in the prompt or pop it into your reference docs?

davidsearch

Exactly the same fragmentation issue I see with attribution models across different Google Ads accounts. Rephrase the question and the number drifts because there's no single source of truth for conversion logic. We centralised that with a unified metric layer for our bidding algorithms - one definition for ROAS, one for CPA, regardless of which campaign or script calls it. It stops optimisation teams fighting over conflicting numbers and keeps the data consistent. Cross-team silos are a killer; one central home for business logic cleaned it up for us too.

olivia-roas

Scaling budgets taught me one thing - data access is a fucking bottleneck if you don't lock it down.

Here's what actually moved the needle for our internal BI agents:

Build isolated data marts per team. No raw tables, no recalculated vanity metrics. If the agent can't touch the source, it can't fuck up the math.
Use table-valued functions (TVFs) as enforced calculation interfaces. Weird approach, but it forces the agent to pass parameters through pre-defined filters. The logic stays locked, the outputs stay consistent.
Bake a feedback loop into Slack. Let users correct the agent mid-conversation. Assign metric owners so only certain people can update its memory. The agent tags the right human when it's stuck.
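
A rough sketch of the TVF idea, in Python with SQLite standing in for the warehouse. SQLite has no table-valued functions, so a plain function plays that role here; the `orders` table, its columns, the `net_revenue` formula and the region list are all made-up assumptions, not anyone's real schema. The point is only that the agent picks parameters and never writes the calculation itself.

```python
import sqlite3
from datetime import date

# Hypothetical whitelist; in a real setup this would come from the data mart.
ALLOWED_REGIONS = {"EMEA", "NA", "APAC"}


def net_revenue(conn: sqlite3.Connection, start: date, end: date, region: str) -> float:
    """Fixed calculation: the caller chooses parameters, never the SQL."""
    if region not in ALLOWED_REGIONS:
        raise ValueError(f"unknown region: {region!r}")
    if start > end:
        raise ValueError("start must not be after end")
    row = conn.execute(
        """
        SELECT COALESCE(SUM(amount - refund), 0.0)
        FROM orders
        WHERE status = 'paid'
          AND region = ?
          AND order_date BETWEEN ? AND ?
        """,
        (region, start.isoformat(), end.isoformat()),
    ).fetchone()
    return float(row[0])


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table so the sketch can be run end to end."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders "
        "(order_date TEXT, region TEXT, status TEXT, amount REAL, refund REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
        [
            ("2024-01-05", "EMEA", "paid", 100.0, 0.0),
            ("2024-01-09", "EMEA", "paid", 50.0, 10.0),
            ("2024-01-12", "EMEA", "cancelled", 70.0, 0.0),
            ("2024-02-01", "EMEA", "paid", 30.0, 0.0),
            ("2024-01-20", "NA", "paid", 200.0, 0.0),
        ],
    )
    return conn
```

However the question is phrased, the agent can only end up calling `net_revenue(conn, start, end, region)`, so the status filter and refund handling cannot drift between answers.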

A semantic layer helps, sure. But programmatic access control beats maintaining yet another abstraction that rots the second someone changes a definition.

emmareach

Could you elaborate a bit more on how you're applying AI here? What specific use case are you targeting?

I've run into similar headaches, especially around data inconsistency. In my own work, sales and accounting had completely different definitions of what constituted a "qualified lead" - it took weeks to reconcile and ended up causing chaos in reporting until we aligned on a single source of truth.

Curious what your semantic layer is meant to solve - is it real-time BI queries, or something like agentic workflows?

noahcontent

You have to define it in context, hard-code it, or hard-prompt it - no workaround. Every company I've worked with defines ARR differently, so any AI or semantic layer that tries to infer that without a solid business logic framework is going to fail. That's why my go-to approach is to first map the metric definitions into a structured taxonomy before even thinking about automation. Get the standard right, then let the system run.

miaaffiliate

metric views

ethan-pr

I've done exactly this in-house. We created a metrics dictionary that lives inside the knowledge base and explicitly maps each key business metric back to its underlying tables and calculations. If the AI can't find a metric definition, we instruct it to clearly flag that it's making assumptions and to request verification from the user before proceeding. Keeps the output honest and stops the model inventing definitions that don't match reality.
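
Something along these lines, as a minimal sketch. The metric names, tables and calculations are placeholders invented for illustration, and the real dictionary would live in whatever knowledge base the agent reads from rather than in a Python dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    name: str
    tables: tuple[str, ...]
    calculation: str


# Hypothetical entries; real ones map to your own tables and logic.
METRICS = {
    "net_revenue": MetricDefinition(
        name="net_revenue",
        tables=("finance.orders",),
        calculation="SUM(amount - refund) WHERE status = 'paid'",
    ),
    "active_users": MetricDefinition(
        name="active_users",
        tables=("product.events",),
        calculation="COUNT(DISTINCT user_id) over the trailing 30 days",
    ),
}


def resolve_metric(requested: str) -> dict:
    """Return the stored definition, or flag that the agent would be guessing."""
    key = requested.strip().lower().replace(" ", "_")
    definition = METRICS.get(key)
    if definition is None:
        return {
            "status": "needs_verification",
            "metric": key,
            "message": (
                f"No definition found for '{key}'. Any answer would rest on "
                "assumptions; please confirm the definition before I proceed."
            ),
        }
    return {
        "status": "defined",
        "metric": key,
        "tables": list(definition.tables),
        "calculation": definition.calculation,
    }
```

The `needs_verification` branch is the important part: the lookup failing is surfaced to the user instead of being papered over with a plausible-looking query.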

analyticsjunkie

Instead of "AI agent -> MCP for executing SQL" use "AI agent -> BI tool MCP (structured queries in terms of dimensions/measures) -> BI tool generates correct SQL".
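
To make that concrete, here is a toy version of the middle step, assuming a hypothetical `orders` table and made-up measure and dimension names. It is not any particular BI tool's API, just the shape of it: the agent sends names, and only the compiler holds SQL expressions.

```python
# Hypothetical catalogue: the only place where SQL expressions are written down.
MEASURES = {
    "revenue": "SUM(amount - refund)",
    "order_count": "COUNT(DISTINCT order_id)",
}
DIMENSIONS = {
    "region": "region",
    "month": "strftime('%Y-%m', order_date)",
}
SOURCE = "orders"


def compile_query(measures: list[str], dimensions: list[str]) -> str:
    """Turn a structured request (names only) into SQL from fixed definitions."""
    if not measures:
        raise ValueError("at least one measure is required")
    unknown = [m for m in measures if m not in MEASURES]
    unknown += [d for d in dimensions if d not in DIMENSIONS]
    if unknown:
        raise ValueError(f"not in the catalogue: {unknown}")
    select = [f"{DIMENSIONS[d]} AS {d}" for d in dimensions]
    select += [f"{MEASURES[m]} AS {m}" for m in measures]
    sql = f"SELECT {', '.join(select)} FROM {SOURCE}"
    if dimensions:
        sql += " GROUP BY " + ", ".join(DIMENSIONS[d] for d in dimensions)
    return sql
```

With this shape, `compile_query(["revenue"], ["region"])` always yields the same SQL, and a request for a measure that isn't in the catalogue fails loudly instead of being improvised.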

avadata

Prompts won't save you here - the problem's upstream. If your metric definitions live inside the LLM's context window, you're just praying the model doesn't hallucinate the wrong COUNT(DISTINCT). I've been burned enough times to know that's not a strategy.

You need a semantic layer that enforces fixed definitions before the AI touches SQL. dbt Metrics and Dremio both do this properly - they treat the metric as a first-class object with versioned logic, so when the LLM generates a query it's constrained by actual business rules, not whatever prompt engineering happens to stick that week. Anything less is just fancy guesswork dressed up as intelligence.

lucasvid

You already running a BI tool with a semantic layer? Then hook your agents into it - why overcomplicate things? If there's no defined business logic in place yet, that's the real drop-off point. Honestly, feels like trying to optimise retention without even looking at the audience graph. First 3 seconds, mate - you need the foundation sorted before layering anything else on top.

isa-ecom

Honestly, traditional semantic layers drive me mad for DTC retention use cases. They're just static dashboards under a different name - fine for a monthly performance review, useless when you need agents to dig into cart abandonment patterns and spit back actionable cohort splits in real time. The human curation bottleneck kills agility. We ended up building our own lightweight solution too, partly because we needed the agents to model customer lifetime value on the fly and persist those learnings without waiting on an analyst. The bonus is our team now spends less time wrangling metrics and more time optimising loyalty loops. For simpler setups, honestly a clean markdown knowledge base does the job for quick attribution questions. But if you're running sophisticated retention flows, you need that agent-driven iteration.

adcraft

Chasing this at a big tech company for roughly two years now. The unintuitive thing I learnt: building a semantic layer top-down is a fool's errand. You always pick the wrong metrics first, because you're guessing what matters instead of letting the data decide.

What actually worked? Instrument the LLM agent. Log every natural language query it gets, the SQL it spits out, and which dimensions or measures it had to guess. After two weeks of logs, the top 20 ambiguous metrics emerged from actual usage, not from architecture diagrams and whiteboard sessions. Then we locked those down - one VP per metric, written sign-off, single-sentence definition pinned in the codebase.
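
For anyone wanting to try the logging step, a minimal sketch of the ranking part. The log format (a `guessed` list per query) and the sample entries are assumptions for illustration; how you detect that the agent guessed depends entirely on your own instrumentation.

```python
from collections import Counter

# Hypothetical log entries: question, generated SQL, and what the agent had to guess.
SAMPLE_LOG = [
    {"question": "revenue last month", "sql": "SELECT ...", "guessed": ["revenue"]},
    {"question": "active users by region", "sql": "SELECT ...", "guessed": ["active_users", "region"]},
    {"question": "net revenue Q1", "sql": "SELECT ...", "guessed": ["revenue"]},
    {"question": "orders by week", "sql": "SELECT ...", "guessed": []},
    {"question": "revenue per active user", "sql": "SELECT ...", "guessed": ["revenue", "active_users"]},
]


def rank_ambiguous_metrics(log: list[dict], top_n: int = 20) -> list[tuple[str, int]]:
    """Count how many queries forced a guess on each metric or dimension."""
    counts: Counter = Counter()
    for entry in log:
        # set() so one query guessing the same name twice is counted once
        counts.update(set(entry["guessed"]))
    return counts.most_common(top_n)
```

On the sample log above, `revenue` comes out on top with three guessed queries, which is the kind of ordering that decides what gets a formal definition first.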

Q1 results? Nine of the top 20 formally defined. Coverage of incoming queries hitting those defined metrics jumped from about 23% to 71%. Hallucination complaints dropped sharply, but not to zero - and the failures clustered on the remaining ~29% of queries that still hit undefined metrics, making prioritisation a no-brainer.

The cultural shift was bigger than the technical one. Before, "semantic layer" meant a four-team architecture meeting where nothing shipped. After we showed the actual query distribution, those meetings ended in fifteen minutes - the data made it impossible to argue that your edge case matters more than the actual usage patterns.

Three tactics that compound:

Log queries and guessed dimensions first, let the layer be whatever the queries demand.
One VP per disputed metric, single-sentence written definition, pinned in code.
When the LLM has to guess, make it refuse and ask back rather than return a confident wrong number. That was controversial internally, but it's the only behaviour change that ever moved trust with execs.

One more meta-point: track the LLM's "would have guessed" cases separately from "answered confidently." That single distinction unlocks the conversation with finance and sales without needing a technical lecture.
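
A small sketch of how that split could be tallied, assuming each logged query records which metrics it asked for. The field names and the set of defined metrics are hypothetical.

```python
from collections import Counter

OUTCOMES = ("answered_confidently", "would_have_guessed")


def classify(entry: dict, defined: set[str]) -> str:
    """A query is confident only if every metric it needs has a signed-off definition."""
    requested = set(entry["metrics"])
    return OUTCOMES[0] if requested <= defined else OUTCOMES[1]


def outcome_shares(log: list[dict], defined: set[str]) -> dict[str, float]:
    """Share of queries in each bucket; empty dict if there is nothing logged."""
    tally = Counter(classify(entry, defined) for entry in log)
    total = sum(tally.values())
    if total == 0:
        return {}
    return {outcome: tally[outcome] / total for outcome in OUTCOMES}
```

The `answered_confidently` share is the coverage number, and the other bucket is the list to hand to metric owners.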

masoncollab

Semantic layer is half the battle. The real headache? Making sure LLMs don't go rogue writing SQL you can't verify. They're so over-trained on generating SQL that they'll happily produce something that looks correct but has zero grounding in your actual data schema - nightmare when you're trying to pull creator performance metrics or campaign ROI.

I run into this constantly when sourcing UGC data for brand partners. You need that grounding layer so every query is auditable, traceable back to real metrics - engagement rates, conversion lift, not just vanity numbers. If an LLM chucks out a random join on creator ID and you can't verify it, your negotiation with that influencer is built on sand.

Happy to swap notes if you're tackling this for creator-driven analytics. Always good to connect with people solving the data reliability piece - it's the foundation for any honest influencer deal.

zoecreative

I'm pretty convinced a semantic layer is where most teams will land on this, even if it takes different shapes. Pre-aggregated tables or curated datasets are essentially lightweight versions of the same thing - you're just building guardrails for the agent before it's let loose.

From a creative and campaign perspective, it mirrors exactly what we do when we constrain ad formats to specific aspect ratios or copy-length limits. Left to its own devices, an agent will sample the entire output space eventually - non-determinism aside, tool responses and inputs are inherently variable. If you want reliability that actually sits in the high 90s accuracy-wise, you need to shrink that potential output space. Same reason we don't let junior creatives run wild with every headline variant under the sun.

That said, there's also the pragmatic angle: how many of these questions would have gotten different answers if you'd asked different analysts in the past? If the imprecision doesn't meaningfully change the decision you're making, it's probably fine. Some wobble is acceptable in most real-world use cases.

Implementation opinions aside, the real test is just building your agent harness - raw CC, something else - and running evals on the actual questions you care about. Update inputs until the outputs hit your threshold. It's all vibes unless you're tracking percentage-accurate outputs over time, properly logged and measured. Same as campaign performance without a consistent attribution model.
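
A bare-bones version of that eval loop, as a sketch. `fake_agent` is a stand-in that returns canned numbers; in practice it would be whatever harness you're testing, and the questions, expected values and 1% tolerance are all invented for the example.

```python
import math
from typing import Callable


def run_evals(agent: Callable[[str], float], cases: list[dict], rel_tol: float = 0.01) -> float:
    """Fraction of cases where the agent's number is within rel_tol of the expected one."""
    if not cases:
        return 0.0
    passed = 0
    for case in cases:
        try:
            answer = agent(case["question"])
            ok = math.isclose(answer, case["expected"], rel_tol=rel_tol)
        except Exception:
            ok = False  # an agent error counts as a failed case
        passed += ok
    return passed / len(cases)


def fake_agent(question: str) -> float:
    """Stand-in agent: three phrasings of the same question, three slightly different numbers."""
    canned = {
        "What was revenue in January?": 1000.0,
        "January revenue total?": 1005.0,
        "How much did we make in Jan?": 1250.0,
    }
    return canned[question]


CASES = [
    {"question": "What was revenue in January?", "expected": 1000.0},
    {"question": "January revenue total?", "expected": 1000.0},
    {"question": "How much did we make in Jan?", "expected": 1200.0},
]
```

Including several phrasings of the same question in the case list is what catches the drift the original post describes; log the score per run and watch it over time.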

sleepermode

We've taken a similar approach - built a semantic layer on top of our main data warehouse that holds curated definitions and business logic. The trick isn't just storing the metadata, it's how you expose it to the AI layer.

In our setup, each metric (e.g. "lead score", "pipeline value", "MQL") has a canonical definition stored as YAML in the semantic layer, along with its SQL transformation and any rollup rules. We then feed that context into the "AI skills" - essentially a set of vectorised prompts that tell the LLM "when the user asks about conversion rate, here's exactly which fields to query and how to join them". Without that guardrail, the LLM hallucinates column names or uses the wrong date filter.

Here's a stripped-down example of how we structure one definition:

metric: mql_count
definition: "Count of contacts who met MQL criteria within a given date range"
source: hubspot.contacts
filter: "mql_date IS NOT NULL AND mql_date BETWEEN {start} AND {end}"
aggregation: COUNT(DISTINCT contact_id)

The key is keeping that semantic layer separately version-controlled and then injecting it into the AI prompt at query time. It's not perfect - you still get edge cases where the LLM misinterprets a join - but it beats letting the model guess at your schema.
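
For illustration, a sketch of what the injection step could look like. The dict mirrors the YAML above (loaded by hand here to stay within the standard library), and `render_sql` and `render_prompt_context` are hypothetical helper names rather than part of any tool.

```python
from datetime import date

# Same fields as the YAML definition above, as a plain dict.
MQL_COUNT = {
    "metric": "mql_count",
    "definition": "Count of contacts who met MQL criteria within a given date range",
    "source": "hubspot.contacts",
    "filter": "mql_date IS NOT NULL AND mql_date BETWEEN {start} AND {end}",
    "aggregation": "COUNT(DISTINCT contact_id)",
}


def render_sql(metric: dict, start: date, end: date) -> str:
    """Fill the date placeholders and assemble the canonical query."""
    if not (isinstance(start, date) and isinstance(end, date)):
        raise TypeError("start and end must be date objects")
    where = metric["filter"].format(
        start=f"'{start.isoformat()}'",
        end=f"'{end.isoformat()}'",
    )
    return (
        f"SELECT {metric['aggregation']} AS {metric['metric']} "
        f"FROM {metric['source']} WHERE {where}"
    )


def render_prompt_context(metric: dict) -> str:
    """Plain-text block to prepend to the LLM prompt at query time."""
    return "\n".join(f"{key}: {value}" for key, value in metric.items())
```

Only `date` objects are accepted for the range, so the placeholders can't be used to smuggle arbitrary SQL into the filter.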

ctrjunkie

Honestly, we've found the same thing. Plain text docs do the job just fine without the headache. Building out semantic models for every domain is a massive lift, and most tools just lock you into their own proprietary semantic language anyway - exactly the kind of lock-in everyone's been trying to dodge for years. Why bother?

pixelpusher

Honestly, the whole 'semantic layer' obsession is just a way for people to avoid reading the existing documentation. I've seen teams burn months in meetings redefining what 'customer' means in some proprietary spreadsheet because the old definition didn't stroke their ego. The real pros do exactly what that data engineer did - column A, column B, and a link to the actual definition. Anything else is just corporate theatre.