Spend Visibility
You cannot manage what you cannot see at the right grain.
Per call, per workflow, per customer. If a single power user, a re-embedding job, or a paid tool call is driving the bill, the data should already be there to see it.
1. Do you log token usage (prompt, completion, cached, reasoning) for every model call your agents make, in a durable store you can query later?
Pass
Every call to OpenAI, Anthropic, Google, Bedrock, OpenRouter, or self-hosted models writes a structured record with prompt_tokens, completion_tokens, and (when supported) cached_input_tokens / reasoning_tokens, plus model id and timestamp.
Fail mode
Token counts only exist in the provider's billing console, aggregated by day, with no way to attribute a spike to a specific workflow.
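A minimal sketch of what the durable record can look like, using SQLite and OpenAI-style usage field names (other providers name the cached and reasoning counts differently, and omit them where the feature does not apply); the table name and schema here are illustrative.

```python
import sqlite3, time

# Illustrative durable store: one row per model call. Any queryable sink
# (Postgres, BigQuery, a log pipeline) serves the same purpose.
db = sqlite3.connect("llm_usage.db")
db.execute("""CREATE TABLE IF NOT EXISTS llm_usage (
    ts REAL, model TEXT, workflow TEXT, customer_id TEXT,
    prompt_tokens INT, completion_tokens INT,
    cached_input_tokens INT, reasoning_tokens INT)""")

def log_usage(model, workflow, customer_id, usage):
    """Write one structured record; `usage` is the provider SDK's usage object."""
    prompt_details = getattr(usage, "prompt_tokens_details", None)
    completion_details = getattr(usage, "completion_tokens_details", None)
    db.execute(
        "INSERT INTO llm_usage VALUES (?,?,?,?,?,?,?,?)",
        (time.time(), model, workflow, customer_id,
         usage.prompt_tokens, usage.completion_tokens,
         getattr(prompt_details, "cached_tokens", 0) or 0,
         getattr(completion_details, "reasoning_tokens", 0) or 0))
    db.commit()
```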
2. Can you split spend by workflow, agent role, or feature, not just by API key?
Pass
Every call carries a tag, span, or metadata field (workflow, role, customer, environment) so you can group by it. Most provider SDKs accept a metadata argument; observability tools propagate it.
Fail mode
One API key is shared across "support reply", "code review", and "lead enrichment" and the bill is a single undifferentiated number.
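One way to carry those tags without threading them through every call site is a context variable that the usage logger reads before writing each record; a sketch, with the tag names chosen purely for illustration.

```python
import contextvars

# Tags set once at the top of a workflow, read by the usage logger.
call_tags = contextvars.ContextVar("call_tags", default={})

def set_call_tags(workflow: str, role: str, environment: str = "production"):
    """Declare the tags every downstream usage record should carry."""
    call_tags.set({"workflow": workflow, "role": role, "environment": environment})

def current_tags() -> dict:
    """Read inside the usage logger, right before writing the record."""
    return call_tags.get()

# At the top of a workflow:
set_call_tags("support-reply", "replier")
# ...every model or tool call the agent makes is then logged with
# current_tags() attached, so spend can be grouped by any of them.
```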
3. Can you attribute spend to a specific customer, tenant, or account when one customer pulls disproportionately?
Pass
A customer_id or tenant_id is on every call record, and a per-tenant spend report exists.
Fail mode
A single power user drives 40% of last month's bill and you cannot identify them without manually digging through raw logs.
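With a customer_id on every record, the per-tenant report is one query; a sketch against the illustrative llm_usage table above, with placeholder per-token prices standing in for your provider's actual rate card.

```python
import sqlite3

db = sqlite3.connect("llm_usage.db")  # the illustrative store sketched earlier

# Placeholder prices per 1K tokens; substitute the real rate card per model.
PRICE_PER_1K = {"prompt": 0.003, "completion": 0.015}

rows = db.execute("""
    SELECT customer_id,
           SUM(prompt_tokens)     AS prompt_tokens,
           SUM(completion_tokens) AS completion_tokens
    FROM llm_usage
    WHERE ts > CAST(strftime('%s', 'now', '-30 days') AS REAL)
    GROUP BY customer_id
    ORDER BY SUM(prompt_tokens + completion_tokens) DESC
""").fetchall()

for customer_id, prompt, completion in rows:
    cost = (prompt / 1000) * PRICE_PER_1K["prompt"] \
         + (completion / 1000) * PRICE_PER_1K["completion"]
    print(f"{customer_id}: ${cost:,.2f} over the last 30 days")
```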
4. Are paid tool calls (search APIs, scraping, image generation, vector DB queries, sandboxes, browser sessions, third-party SaaS) tracked alongside model spend?
Pass
Every paid call an agent makes is logged with the cost source and a per-unit price, even if the price is a rough estimate.
Fail mode
The model bill looks fine but a search API or browser-automation provider is quietly costing more than the LLM itself.
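The same record shape works for non-model spend; a sketch with illustrative unit prices, which only need to be roughly right for the cost source to show up in the ranking.

```python
import sqlite3, time

db = sqlite3.connect("llm_usage.db")
db.execute("""CREATE TABLE IF NOT EXISTS tool_usage (
    ts REAL, tool TEXT, units REAL, cost REAL, workflow TEXT, customer_id TEXT)""")

# Placeholder per-unit prices; rough estimates are fine, the point is that
# paid tool calls land in the same queryable store as model spend.
TOOL_UNIT_PRICE = {
    "web_search": 0.005,        # per query
    "browser_session": 0.02,    # per minute
    "image_generation": 0.04,   # per image
}

def log_tool_cost(tool, units, workflow, customer_id):
    cost = TOOL_UNIT_PRICE.get(tool, 0.0) * units
    db.execute("INSERT INTO tool_usage VALUES (?,?,?,?,?,?)",
               (time.time(), tool, units, cost, workflow, customer_id))
    db.commit()
```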
5. Do you track cached vs uncached prompt tokens separately, and reasoning tokens separately, where the provider exposes them?
Pass
Prompt-caching savings (Anthropic, OpenAI, Google) are visible per workflow, and reasoning-token cost (OpenAI o-series, Anthropic extended-thinking) is broken out from output cost.
Fail mode
A "cheap" cached workflow degrades to fully uncached after a prompt change and nobody notices for two weeks.
6. Are embedding calls and vector store reads/writes tracked as a separate cost line from chat/completion calls?
Pass
Embedding spend, re-embedding events, and vector reads/writes are logged separately from chat completions.
Fail mode
A re-indexing job re-embeds your entire corpus on a routine cron and triples the monthly embedding line.
Budget Caps and Runaway Prevention
Visibility without limits is a slower way to lose money.
Caps and kill-switches sit in code, not in someone's head. The goal is that no stuck loop or anomalous tenant can exceed a defined ceiling before a human decides.
7. Do you have a hard per-task or per-loop spend ceiling that auto-kills any workflow that exceeds it?
Pass
A configured max-cost or max-token budget per task; on breach, the agent halts and the task is marked unsafe-to-resume.
Fail mode
A stuck planner loop runs for hours, calling the most expensive model in your stack each time. A planning step priced at $0.50 should not silently consume $80.
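A minimal sketch of the ceiling, assuming the agent loop can estimate each step's cost; the dollar limit and the exception name are illustrative.

```python
class BudgetExceeded(Exception):
    """Raised when a task breaches its hard spend ceiling."""

class TaskBudget:
    def __init__(self, max_usd: float = 2.00):  # illustrative ceiling
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, step_cost_usd: float) -> None:
        self.spent += step_cost_usd
        if self.spent > self.max_usd:
            # Halt here and mark the task unsafe-to-resume rather than
            # letting a $0.50 planning step drift toward $80.
            raise BudgetExceeded(
                f"task spent ${self.spent:.2f}, ceiling ${self.max_usd:.2f}")

# In the agent loop (plan_and_act is a hypothetical generator yielding
# the estimated cost of each step):
#
#   budget = TaskBudget(max_usd=2.00)
#   for step_cost in plan_and_act():
#       budget.charge(step_cost)
```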
8. Do you have per-agent-role daily and monthly budget caps with alerting at 50% / 80% / 100%?
Pass
Explicit caps per role (e.g., "support-replier: $40/day, $800/month"), with alerts wired to a channel a human reads.
Fail mode
Caps live in someone's head, and the only alert is the next provider invoice.
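A sketch of the threshold check, with illustrative caps and a notify callable standing in for whatever posts to the channel a human actually reads.

```python
ROLE_DAILY_CAP_USD = {"support-replier": 40.0, "lead-enricher": 25.0}  # illustrative
ALERT_THRESHOLDS = (0.5, 0.8, 1.0)
_fired_today: set[tuple[str, float]] = set()  # reset this at midnight

def check_role_caps(spend_today_by_role: dict, notify) -> None:
    """Alert once per role at 50% / 80% / 100% of the daily cap."""
    for role, cap in ROLE_DAILY_CAP_USD.items():
        spent = spend_today_by_role.get(role, 0.0)
        for threshold in ALERT_THRESHOLDS:
            if spent >= cap * threshold and (role, threshold) not in _fired_today:
                _fired_today.add((role, threshold))
                notify(f"{role}: ${spent:.2f} of ${cap:.2f} daily cap ({threshold:.0%})")
```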
9. Do you enforce a retry / loop guard with cost in mind, not just attempt count?
Pass
A max-attempts limit AND a max-cost-per-task limit; whichever fires first stops the workflow.
Fail mode
An agent retries the same broken tool call 400 times across re-queued tasks because the only guard is "10 attempts" per task, and each attempt now costs 8x what it did six months ago.
10. Do you monitor provider rate-limit headroom and have a defined fallback when you approach a wall?
Pass
Tokens-per-minute (TPM) and requests-per-minute (RPM) usage are tracked against the provider's tier limit; on threshold, traffic queues, downgrades to a cheaper model, or fails fast with a clear status. Most providers return x-ratelimit-* response headers.
Fail mode
A 429 response cascades into silent agent failure during a paid customer demo.
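A sketch of reading the headroom from OpenAI's x-ratelimit-* response headers through the Python SDK's raw-response interface; header names differ across providers, and the fallback shown is only a placeholder for queueing, downgrading, or failing fast.

```python
from openai import OpenAI

client = OpenAI()

def call_with_headroom_check(messages, model="gpt-4o-mini", min_headroom=0.2):
    """Notice the approaching wall instead of discovering it as a 429."""
    raw = client.chat.completions.with_raw_response.create(
        model=model, messages=messages)
    remaining = int(raw.headers.get("x-ratelimit-remaining-tokens", 0))
    limit = int(raw.headers.get("x-ratelimit-limit-tokens", 1))
    if remaining / limit < min_headroom:
        # Placeholder fallback: queue, downgrade to a cheaper model, or
        # fail fast with a clear status -- anything but a silent cascade.
        print(f"warning: {remaining}/{limit} TPM remaining on {model}")
    return raw.parse()
```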
11. Are model upgrades, prompt changes, and tool-set changes treated as cost-impacting events with a recorded before/after on a representative sample?
Pass
A small evaluation harness re-runs a fixed set of inputs after any model swap or prompt change and reports cost-per-task delta and quality delta together.
Fail mode
A routine "use the new flagship" upgrade silently triples cost-per-task and degrades a structured-output format, with the regression discovered weeks later by a customer.
12. Do you have a documented kill-switch or per-customer pause for any agent that starts producing anomalous spend on a single tenant?
Pass
A one-command pause per agent role or per customer that stops new model calls without restarting the rest of the system.
Fail mode
The only way to stop an out-of-control loop is to rotate the API key or bring down the whole worker pool.
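A sketch of the pause check, run before every new model call; the in-memory set is a stand-in for a shared store (Redis, a feature-flag service) so one command reaches every worker without a redeploy.

```python
_paused: set[str] = set()  # stand-in for a shared flag store

def pause(key: str) -> None:
    """e.g. pause("customer:acme-corp") or pause("role:lead-enricher")."""
    _paused.add(key)

def ensure_not_paused(customer_id: str, role: str) -> None:
    """Call before each new model call; raises instead of spending."""
    for key in (f"customer:{customer_id}", f"role:{role}"):
        if key in _paused:
            raise RuntimeError(f"agent calls paused for {key}")
```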
Unit Economics and Pricing Discipline
If you do not know what an agent task costs you, you cannot price it.
The numbers that defend a pricing decision: cost per finished task, cost of failure, margin per workflow, top cost drivers, and clean separation of internal usage from customer-facing usage.
13. Do you know the dollar cost per finished task, or per unit of customer value, for your top 3 workflows?
Pass
A rough but defensible per-unit cost computed from observed usage over at least 50 runs, refreshed at least monthly.
Fail mode
Pricing decisions are based on intuition or "what competitors charge."
14. Do you know the cost of a "bad" run (a retry, cancelled task, wrong-answer that had to be re-done) versus a "good" run?
Pass
Failed and retried tasks are tagged in your spend records and rolled into an effective cost-per-successful-task figure.
Fail mode
Published unit cost looks healthy because it ignores everything that did not finish cleanly.
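The arithmetic is one line once runs are tagged; a sketch with a hypothetical record shape.

```python
def effective_cost_per_success(runs):
    """runs: iterable of dicts like {"cost_usd": 0.42, "status": "success"}.

    The naive unit cost divides total spend by all runs; the honest figure
    divides total spend (retries, cancels, re-dos included) by successes.
    """
    total = sum(r["cost_usd"] for r in runs)
    successes = sum(1 for r in runs if r["status"] == "success")
    return total / successes if successes else float("inf")

# Example: 100 runs at $0.30 each with only 80 clean finishes puts the
# effective unit cost at $0.375, not $0.30.
```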
15. Do you measure margin per workflow on a defined cadence (weekly or monthly), not only at quarter close?
Pass
Each top workflow has a margin line: revenue (or proxy) minus model cost minus tool-call cost minus a fixed allocation for human review. Trend is visible.
Fail mode
Company-wide gross margin is fine, but one specific feature is structurally underwater and nobody has caught it.
16. Are heavy-cost workflows tagged in code so a future model swap, prompt rewrite, or caching change can target them first?
Pass
The top 5 spend-driving workflows are explicitly identified and reviewed each release cycle.
Fail mode
Optimization effort is spread thin across cosmetic changes while the actual cost driver is one chain that was written first and never revisited.
17. Do you track spend on internal/development usage separately from production / customer-facing usage?
Pass
Separate API keys, projects, or metadata.environment tags, so engineering experiments do not contaminate customer-cost analytics.
Fail mode
A developer's evaluation run shows up as customer spend and distorts margin reporting.
Forecasting, Anomaly Detection, and Reporting
The point of cost tracking is that you can plan and act early.
A daily artifact anyone can read in under a minute, a rule that fires on a 2x jump, and a recurring evidence report that pairs spend movement with the change that caused it.
18. Do you have a daily spend dashboard or report that anyone on the team can read in under 60 seconds?
Pass
A daily refresh by workflow, by model, and by customer, viewable without logging into the provider console.
Fail mode
Cost is checked when the bill arrives or when somebody asks "is the bill weird?"
19. Do you have anomaly detection on daily or per-customer spend that fires when today is materially different from the rolling baseline?
Pass
A rule, alert, or scheduled check that flags a 2x or larger jump in daily spend, per-tenant spend, or per-workflow spend, and routes it to a human channel.
Fail mode
The bill ships 4x larger than expected and the only signal is a Slack message from the CFO.
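A sketch of the 2x rule against a rolling baseline; the window length and multiplier are the knobs, and the same function runs per tenant and per workflow.

```python
import statistics

def spend_anomaly(daily_spend: list[float], multiplier: float = 2.0,
                  window: int = 14) -> bool:
    """Flag today's spend when it reaches `multiplier`x the rolling median.

    daily_spend is ordered oldest-to-newest with today last; route any
    True result to a channel a human reads, not to a log file.
    """
    if len(daily_spend) < window + 1:
        return False  # not enough history for a baseline yet
    baseline = statistics.median(daily_spend[-(window + 1):-1])
    return baseline > 0 and daily_spend[-1] >= multiplier * baseline
```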
20. Does the team get a recurring evidence report (weekly is typical) that includes spend, top-3 workflows by cost, top-3 customers by cost, model and prompt changes shipped that week, and one cost-related decision the operator should approve or revisit?
Pass
A regular artifact that survives even when the loudest channel goes quiet, and that pairs cost movement with the change that caused it.
Fail mode
Cost narrative is whatever the most recent founder DM happened to say; price increases lag the underlying trend by a quarter.