Free self-diagnostic

Cost & Spend Tracking Starter Kit

Twenty questions across four sections. Score your team in under fifteen minutes, then decide which spend gaps actually cost you margin, runway, or trust.

No signup, no email gate. Companion to the Always-On AgentOps Implementation pilot.

How to use it

Score 0, 1, or 2 per question. Maximum 40.

0 means absent or accidental. 1 means partial or manual. 2 means reliable and routine. Most teams shipping agent or LLM features today score 8 to 18 on a first pass — that is the starting point, not a grade.

Score 0 – 13

Invisible

You are pricing on intuition. A single runaway loop or model swap can wipe a month of margin. Fix Sections A and B first.

Score 14 – 27

Workable

You have spend visibility, but unit economics and forecasting are not yet defensible to a CFO or an investor. Move on to Sections C and D.

Score 28 – 40

Production-grade

You can price confidently, raise prices when needed, and survive a 5x spike or a provider price change without an emergency.

Section A

Spend Visibility

You cannot manage what you cannot see at the right grain.

Per call, per workflow, per customer. If a single power user, a re-embedding job, or a paid tool call is driving the bill, the data should already be there to see it.

  1.

    Do you log token usage (prompt, completion, cached, reasoning) for every model call your agents make, in a durable store you can query later?

    Pass

    Every call to OpenAI, Anthropic, Google, Bedrock, OpenRouter, or self-hosted models writes a structured record with prompt_tokens, completion_tokens, and (when supported) cached_input_tokens / reasoning_tokens, plus model id and timestamp.

    Fail mode

    Token counts only exist in the provider's billing console, aggregated by day, with no way to attribute a spike to a specific workflow.
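
    A minimal sketch of that record, assuming the OpenAI Python SDK response shape and SQLite as the durable store; the table layout and field fallbacks are illustrative, and other providers expose the same counts under different names:

        import sqlite3
        import time

        conn = sqlite3.connect("llm_spend.db")
        conn.execute("""CREATE TABLE IF NOT EXISTS model_calls (
            ts REAL, model TEXT, workflow TEXT,
            prompt_tokens INT, completion_tokens INT,
            cached_input_tokens INT, reasoning_tokens INT)""")

        def log_usage(response, model: str, workflow: str) -> None:
            """Write one structured record per model call."""
            u = response.usage
            # Cached and reasoning detail is provider- and model-dependent;
            # fall back to 0 when the SDK does not expose the field.
            cached = getattr(getattr(u, "prompt_tokens_details", None),
                             "cached_tokens", 0) or 0
            reasoning = getattr(getattr(u, "completion_tokens_details", None),
                                "reasoning_tokens", 0) or 0
            conn.execute("INSERT INTO model_calls VALUES (?,?,?,?,?,?,?)",
                         (time.time(), model, workflow,
                          u.prompt_tokens, u.completion_tokens, cached, reasoning))
            conn.commit()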

  2.

    Can you split spend by workflow, agent role, or feature, not just by API key?

    Pass

    Every call carries a tag, span, or metadata field (workflow, role, customer, environment) so you can group by it. Most provider SDKs accept a metadata argument; observability tools propagate it.

    Fail mode

    One API key is shared across "support reply", "code review", and "lead enrichment" and the bill is a single undifferentiated number.
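
    One way to make tagging routine is a single choke point that every agent code path must call; a sketch assuming the OpenAI SDK, reusing log_usage from the sketch above. Whether the tags also go to the provider's metadata argument or only to your own store is secondary:

        from openai import OpenAI

        client = OpenAI()

        def tagged_call(messages, *, workflow: str, role: str, customer_id: str,
                        model: str = "gpt-4o-mini"):
            """All agent code calls this wrapper, never the SDK directly."""
            response = client.chat.completions.create(model=model, messages=messages)
            # In practice, extend the log schema with role and customer_id
            # columns; folding them into one string keeps the sketch short.
            log_usage(response, model=model,
                      workflow=f"{workflow}:{role}:{customer_id}")
            return response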

  3.

    Can you attribute spend to a specific customer, tenant, or account when one customer pulls disproportionately?

    Pass

    A customer_id or tenant_id is on every call record, and a per-tenant spend report exists.

    Fail mode

    A single power user drives 40% of last month's bill and you cannot identify them without manually combing through logs.
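
    With the tags in place, the per-tenant report is one query; a sketch assuming call records carry a customer_id tag and a precomputed cost_usd column (both names illustrative):

        rows = conn.execute("""
            SELECT customer_id, ROUND(SUM(cost_usd), 2) AS spend_usd
            FROM model_calls
            WHERE ts > strftime('%s', 'now') - 30 * 86400
            GROUP BY customer_id
            ORDER BY spend_usd DESC
            LIMIT 10
        """).fetchall()
        for customer_id, spend_usd in rows:
            print(f"{customer_id}: ${spend_usd}")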

  4.

    Are paid tool calls (search APIs, scraping, image generation, vector DB queries, sandboxes, browser sessions, third-party SaaS) tracked alongside model spend?

    Pass

    Every paid call an agent makes is logged with the cost source and a per-unit price, even if the price is a rough estimate.

    Fail mode

    The model bill looks fine but a search API or browser-automation provider is quietly costing more than the LLM itself.
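
    Tool spend can share the same store; a sketch where the per-unit prices are hand-maintained estimates (the numbers below are placeholders, not vendor quotes):

        conn.execute("""CREATE TABLE IF NOT EXISTS tool_calls (
            ts REAL, tool TEXT, workflow TEXT, units REAL, cost_usd REAL)""")

        TOOL_UNIT_COST_USD = {        # rough estimates, revisited when bills land
            "web_search": 0.005,      # per query
            "browser_session": 0.05,  # per started minute
            "vector_query": 0.0001,   # per read
        }

        def log_tool_call(tool: str, units: float, workflow: str) -> None:
            cost = TOOL_UNIT_COST_USD.get(tool, 0.0) * units
            conn.execute("INSERT INTO tool_calls VALUES (?,?,?,?,?)",
                         (time.time(), tool, workflow, units, cost))
            conn.commit()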

  5.

    Do you track cached vs uncached prompt tokens separately, and reasoning tokens separately, where the provider exposes them?

    Pass

    Prompt-caching savings (Anthropic, OpenAI, Google) are visible per workflow, and reasoning-token cost (OpenAI o-series, Anthropic extended-thinking) is broken out from output cost.

    Fail mode

    A "cheap" cached workflow degrades to fully uncached after a prompt change and nobody notices for two weeks.

  6.

    Are embedding calls and vector store reads/writes tracked as a separate cost line from chat/completion calls?

    Pass

    Embedding spend, re-embedding events, and vector reads/writes are logged separately from chat completions.

    Fail mode

    A re-indexing job re-embeds your entire corpus on a routine cron and triples the monthly embedding line.

Section B

Budget Caps and Runaway Prevention

Visibility without limits is a slower way to lose money.

Caps and kill-switches sit in code, not in someone's head. The goal is that no stuck loop or anomalous tenant can exceed a defined ceiling before a human decides.

  7.

    Do you have a hard per-task or per-loop spend ceiling that auto-kills any workflow that exceeds it?

    Pass

    A configured max-cost or max-token budget per task; on breach, the agent halts and the task is marked unsafe-to-resume.

    Fail mode

    A stuck planner loop runs for hours, calling the most expensive model in your stack each time. A planning step priced at $0.50 should not silently consume $80.
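
    A per-task ceiling can be a small object the agent loop charges on every call; a minimal sketch (the exception name and the unsafe-to-resume marking are illustrative):

        class BudgetExceeded(Exception):
            """Raised when a task breaches its hard spend ceiling."""

        class TaskBudget:
            def __init__(self, max_usd: float):
                self.max_usd = max_usd
                self.spent = 0.0

            def charge(self, cost_usd: float) -> None:
                self.spent += cost_usd
                if self.spent > self.max_usd:
                    # Halt here; the caller marks the task unsafe-to-resume.
                    raise BudgetExceeded(
                        f"spent ${self.spent:.2f} against a ${self.max_usd:.2f} cap")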

  8.

    Do you have per-agent-role daily and monthly budget caps with alerting at 50% / 80% / 100%?

    Pass

    Explicit caps per role (e.g., "support-replier: $40/day, $800/month"), with alerts wired to a channel a human reads.

    Fail mode

    Caps live in someone's head, and the only alert is the next provider invoice.
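
    The threshold logic itself is small; a sketch where notify() stands in for whatever alerting hook you already have (Slack webhook, PagerDuty, email):

        ALERT_THRESHOLDS = (0.5, 0.8, 1.0)

        def check_role_budget(role: str, spent_usd: float, cap_usd: float,
                              fired: set) -> None:
            """Fire each threshold once per budget window (e.g., per day)."""
            for t in ALERT_THRESHOLDS:
                if spent_usd >= t * cap_usd and t not in fired:
                    fired.add(t)
                    notify(f"{role}: ${spent_usd:.0f} of ${cap_usd:.0f} "
                           f"({t:.0%} of cap)")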

  9.

    Do you enforce a retry / loop guard with cost in mind, not just attempt count?

    Pass

    A max-attempts limit AND a max-cost-per-task limit; whichever fires first stops the workflow.

    Fail mode

    An agent retries the same broken tool call 400 times because retry was set to "10 attempts" but each attempt now costs 8x what it did six months ago.
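
    Combining the two guards takes a few lines; a sketch reusing TaskBudget from the ceiling example above, where task.try_once() is a hypothetical single-attempt API that reports its own cost:

        MAX_ATTEMPTS = 10

        def run_with_guards(task, budget: TaskBudget):
            """Whichever guard fires first stops the task: attempts or dollars."""
            for _ in range(MAX_ATTEMPTS):
                result, cost_usd = task.try_once()  # hypothetical single attempt
                budget.charge(cost_usd)             # raises BudgetExceeded on breach
                if result.ok:
                    return result
            raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts")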

  10.

    Do you monitor provider rate-limit headroom and have a defined fallback when you approach a wall?

    Pass

    TPM and RPM usage are tracked against the provider's tier limit; on threshold, traffic queues, downgrades to a cheaper model, or fails fast with a clear status. Most providers return x-ratelimit-* response headers.

    Fail mode

    A 429 response cascades into silent agent failure during a paid customer demo.
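
    The OpenAI Python SDK, for example, exposes those headers through its raw-response interface; header names vary by provider, so treat the ones below as the common OpenAI spelling rather than a universal contract, and handle_low_headroom() as a placeholder for your own policy:

        from openai import OpenAI

        client = OpenAI()
        raw = client.chat.completions.with_raw_response.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "ping"}],
        )
        remaining = int(raw.headers.get("x-ratelimit-remaining-tokens", 0))
        limit = int(raw.headers.get("x-ratelimit-limit-tokens", 1))
        if remaining / limit < 0.1:
            # Under 10% headroom: queue, downgrade to a cheaper model, or fail fast.
            handle_low_headroom()   # your policy hook
        response = raw.parse()      # the normal completion object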

  11.

    Are model upgrades, prompt changes, and tool-set changes treated as cost-impacting events with a recorded before/after on a representative sample?

    Pass

    A small evaluation harness re-runs a fixed set of inputs after any model swap or prompt change and reports cost-per-task delta and quality delta together.

    Fail mode

    A routine "use the new flagship" upgrade silently triples cost-per-task and degrades a structured-output format, with the regression discovered weeks later by a customer.
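
    The harness does not need to be elaborate; a sketch where SAMPLE_50, run_old_config, run_new_config, and score() are all stand-ins for your own fixed input set, two configurations, and quality metric:

        def run_suite(inputs, run_fn):
            """Return (mean cost per task, mean quality score) over a fixed set."""
            costs, scores = [], []
            for item in inputs:
                output, cost_usd = run_fn(item)   # each run reports its own cost
                costs.append(cost_usd)
                scores.append(score(item, output))
            return sum(costs) / len(costs), sum(scores) / len(scores)

        old_cost, old_q = run_suite(SAMPLE_50, run_old_config)
        new_cost, new_q = run_suite(SAMPLE_50, run_new_config)
        print(f"cost/task ${old_cost:.3f} -> ${new_cost:.3f}, "
              f"quality {old_q:.2f} -> {new_q:.2f}")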

  12.

    Do you have a documented kill-switch or per-customer pause for any agent that starts producing anomalous spend on a single tenant?

    Pass

    A one-command pause per agent role or per customer that stops new model calls without restarting the rest of the system.

    Fail mode

    The only way to stop an out-of-control loop is to rotate the API key or bring down the whole worker pool.
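
    A pause flag checked before every model call is enough; a sketch assuming Redis as the shared flag store (any database row or feature-flag service works the same way):

        import redis

        r = redis.Redis()

        def assert_not_paused(role: str, customer_id: str) -> None:
            """Called at the top of every model call path."""
            if (r.sismember("paused:roles", role)
                    or r.sismember("paused:customers", customer_id)):
                raise RuntimeError(f"paused, no new model calls: {role}/{customer_id}")

        # The one-command pause from an operator shell:
        #   redis-cli SADD paused:customers cust_4821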

Section C

Unit Economics and Pricing Discipline

If you do not know what an agent task costs you, you cannot price it.

The numbers that defend a pricing decision: cost per finished task, cost of failure, margin per workflow, top cost drivers, and clean separation of internal usage from customer-facing usage.

  13.

    Do you know the dollar cost per finished task, or per unit of customer value, for your top 3 workflows?

    Pass

    A rough but defensible per-unit cost computed from observed usage over at least 50 runs, refreshed at least monthly.

    Fail mode

    Pricing decisions are based on intuition or "what competitors charge."
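
    Turning logged tokens into dollars needs only a price table you maintain and date-stamp yourself; the rates below are placeholders, not current list prices:

        PRICE_PER_MTOK_USD = {              # (input, output) per million tokens
            "gpt-4o-mini": (0.15, 0.60),    # placeholder rate
            "claude-sonnet": (3.00, 15.00), # placeholder rate
        }

        def call_cost_usd(model: str, prompt_tokens: int,
                          completion_tokens: int) -> float:
            p_in, p_out = PRICE_PER_MTOK_USD[model]
            return (prompt_tokens * p_in + completion_tokens * p_out) / 1_000_000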

  14.

    Do you know the cost of a "bad" run (a retry, a cancelled task, or a wrong answer that had to be redone) versus a "good" run?

    Pass

    Failed and retried tasks are tagged in your spend records and rolled into an effective cost-per-successful-task figure.

    Fail mode

    Published unit cost looks healthy because it ignores everything that did not finish cleanly.
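
    The effective figure counts every run but divides only by the clean finishes; a sketch reusing the SQLite connection from the earlier sketches and assuming a task_runs table with cost_usd and status columns (names illustrative):

        total_usd, ok_runs = conn.execute("""
            SELECT SUM(cost_usd), SUM(status = 'success')
            FROM task_runs WHERE workflow = ?
        """, ("support-reply",)).fetchone()
        effective = total_usd / ok_runs if ok_runs else float("inf")
        print(f"effective cost per successful task: ${effective:.3f}")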

  15.

    Do you measure margin per workflow on a defined cadence (weekly or monthly), not only at quarter close?

    Pass

    Each top workflow has a margin line: revenue (or proxy) minus model cost minus tool-call cost minus a fixed allocation for human review. Trend is visible.

    Fail mode

    Company-wide gross margin is fine, but one specific feature is structurally underwater and nobody has caught it.
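
    The margin line per workflow is plain arithmetic once the inputs exist; a sketch with illustrative numbers and a flat hourly rate for human review:

        def workflow_margin(revenue_usd: float, model_usd: float, tool_usd: float,
                            review_hours: float,
                            review_rate_usd: float = 60.0) -> float:
            """Revenue (or proxy) minus model, tool, and human-review cost."""
            return revenue_usd - model_usd - tool_usd - review_hours * review_rate_usd

        m = workflow_margin(revenue_usd=2400, model_usd=310,
                            tool_usd=95, review_hours=6)
        print(f"support-reply margin this week: ${m:.0f}")   # $1635 on these inputs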

  16.

    Are heavy-cost workflows tagged in code so a future model swap, prompt rewrite, or caching change can target them first?

    Pass

    The top 5 spend-driving workflows are explicitly identified and reviewed each release cycle.

    Fail mode

    Optimization effort is spread thin across cosmetic changes while the actual cost driver is one chain that was written first and never revisited.

  17.

    Do you track spend on internal/development usage separately from production/customer-facing usage?

    Pass

    Separate API keys, projects, or metadata.environment tags, so engineering experiments do not contaminate customer-cost analytics.

    Fail mode

    A developer's evaluation run shows up as customer spend and distorts margin reporting.

Section D

Forecasting, Anomaly Detection, and Reporting

The point of cost tracking is that you can plan and act early.

A daily artifact anyone can read in under a minute, a rule that fires on a 2x jump, and a recurring evidence report that pairs spend movement with the change that caused it.

  18.

    Do you have a daily spend dashboard or report that anyone on the team can read in under 60 seconds?

    Pass

    A daily refresh by workflow, by model, and by customer, viewable without logging into the provider console.

    Fail mode

    Cost is checked when the bill arrives or when somebody asks "is the bill weird?"

  19.

    Do you have anomaly detection on daily or per-customer spend that fires when today is materially different from the rolling baseline?

    Pass

    A rule, alert, or scheduled check that flags a 2x or larger jump in daily spend, per-tenant spend, or per-workflow spend, and routes it to a human channel.

    Fail mode

    The bill ships 4x larger than expected and the only signal is a Slack message from the CFO.
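
    A 2x-over-rolling-baseline rule is enough to start; a sketch where daily_spend() and notify() stand in for your own spend lookup and alerting hook:

        from datetime import date, timedelta

        def spend_anomaly_check(today: date, window: int = 7,
                                factor: float = 2.0) -> None:
            baseline = [daily_spend(today - timedelta(days=i))
                        for i in range(1, window + 1)]
            avg = sum(baseline) / len(baseline)
            todays = daily_spend(today)
            if avg > 0 and todays >= factor * avg:
                notify(f"spend anomaly: ${todays:.0f} today vs "
                       f"${avg:.0f} rolling {window}-day average")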

  20.

    Does the team get a recurring evidence report (weekly is typical) that includes spend, top-3 workflows by cost, top-3 customers by cost, model and prompt changes shipped that week, and one cost-related decision the operator should approve or revisit?

    Pass

    A regular artifact that survives even when the loudest channel goes quiet, and that pairs cost movement with the change that caused it.

    Fail mode

    Cost narrative is whatever the most recent founder DM happened to say; price increases lag the underlying trend by a quarter.

Scoring reference

Total your score across all four sections.

Sections are weighted by question count: A and B carry 12 points each, C carries 10, D carries 6. A combined score below 14 usually means the highest-leverage move is not buying a cost dashboard — it is tagging two workflows, wiring one cap, and standing up one weekly evidence report.

Section   Topic                                            Score
A         Spend Visibility                                 ___ / 12
B         Budget Caps and Runaway Prevention               ___ / 12
C         Unit Economics and Pricing Discipline            ___ / 10
D         Forecasting, Anomaly Detection, and Reporting    ___ / 6
Total                                                      ___ / 40

What this kit is

  • An operator-grade self-diagnostic you can run on a small team in fifteen minutes.
  • The cost & spend tracking layer of AgentOps: tokens, tool calls, and provider time treated as production line items.
  • A structure you can carry into a 2-week pilot conversation without revealing internals.

What it is not

  • Not a buying guide for any specific cost or observability vendor. The patterns can be implemented with general-purpose tools, custom logging, open-source observability platforms, or commercial offerings.
  • Not financial advice or a budgeting framework for the business as a whole. It is operator-grade hygiene for the LLM/agent line specifically.
  • Not a compliance certification. Not a substitute for SOC 2, ISO 42001, AIUC-1, or AI-system audit work.
  • Not a replacement for provider-side guardrails. Set spend limits at the provider account level too. This kit covers what you do above the provider.

Bring your worst-scoring section.

If your score is below 14, the highest-leverage move is rarely buying a cost dashboard. Pick the top two workflows by suspected spend, add metadata tags on every model and tool call inside them, wire one per-task cap and one daily anomaly alert, and stand up a one-page weekly evidence report. That is the exact shape of the Always-On AgentOps Implementation pilot.

This kit is shared as-is for internal use, team self-assessment, and client conversations. Attribution to Empyer / AgentOps is appreciated, not required.
