Free self-diagnostic

AI Agent Observability Checklist

Thirty questions across six pillars. Score your team in under fifteen minutes, then decide which gaps actually cost you time, money, or trust.

No signup, no email gate. Companion to the Always-On AgentOps Implementation pilot.

How to use it

Score 0, 1, or 2 per question. Maximum 60.

0 means absent or accidental. 1 means partial or manual. 2 means reliable and routine. Most teams running real agent work today score 15 to 30 on a first pass — that is the starting point, not a grade.

Score 0–19: Brittle. One bad day burns a week. Fix Pillars 1, 2, and 5 first.

Score 20–39: Workable. You ship, but the founder is the load-bearing layer. Move on Pillars 3, 4, and 6.

Score 40–60: Production-grade. You can delegate sensitive workflows without surrendering control.

Pillar 1: Inventory and Authorization

You cannot observe what you cannot list.

If a new operator joined tomorrow, could they list every running agent and what each one is allowed to touch?

  1. Can you list every agent currently running against your business in under 60 seconds, with name, owner, model, and purpose?

    Pass

    A single source of truth (file, dashboard, ticket, or spreadsheet) returns the answer without paging anyone.

    Fail mode

    Nobody can confirm whether yesterday's experimental agent is still calling APIs.

  2. For each agent, do you know which credentials, accounts, repositories, or customer data it is authorized to touch?

    Pass

    An explicit allow-list per agent role, not "whatever was in .env at the time."

    Fail mode

    A debug agent quietly retains production write access months after it was last useful.

  3. Does every agent have a single human owner with a clear escalation path if the agent misbehaves?

    Pass

    Name and contact path live next to the agent's config or run record.

    Fail mode

    A stuck agent runs for hours before anyone realizes there is no on-call for it.

  4. Can you kill any agent in under 30 seconds without restarting the rest of your system?

    Pass

    A documented stop command, hot-key, or admin button per role.

    Fail mode

    Shutting down "the bad one" requires bouncing the whole stack.

  5. Do you have a written policy for what kinds of actions an agent must NEVER take without a human?

    Pass

    Explicit list (production deploys, customer email, payments, credentials, legal text, model upgrades) reviewed quarterly.

    Fail mode

    The policy lives only in the founder's head.
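
The pass criteria above do not require a platform. As a minimal sketch only, here is what an agent registry could look like in Python; every field, name, and example entry is hypothetical, and a spreadsheet or YAML file answering the same five questions works just as well.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRecord:
    """One row in the agent inventory: enough to answer Pillar 1 in under a minute."""
    name: str
    owner: str                     # single human owner and escalation contact
    model: str                     # which model this role currently runs on
    purpose: str
    allowed_scopes: list[str] = field(default_factory=list)       # explicit allow-list per role
    stop_command: str = ""         # documented way to kill it in under 30 seconds
    never_without_human: list[str] = field(default_factory=list)  # actions gated on a person

# Hypothetical entry; the point is that one place holds the answer.
REGISTRY = [
    AgentRecord(
        name="support-triage",
        owner="ops@example.com",
        model="provider-small-v2",
        purpose="Label and route inbound support email",
        allowed_scopes=["helpdesk:read", "helpdesk:label"],
        stop_command="make stop AGENT=support-triage",
        never_without_human=["customer email send", "refunds"],
    ),
]

def inventory_report() -> str:
    """List name, owner, model, purpose, scopes, and kill path without paging anyone."""
    lines = []
    for a in REGISTRY:
        lines.append(f"{a.name} | owner={a.owner} | model={a.model} | purpose={a.purpose}")
        lines.append(f"  scopes: {', '.join(a.allowed_scopes) or 'NONE DEFINED'}")
        lines.append(f"  stop:   {a.stop_command or 'UNDEFINED'}")
        lines.append(f"  never without a human: {', '.join(a.never_without_human) or 'UNDEFINED'}")
    return "\n".join(lines)

print(inventory_report())
```

The design choice that matters is a single place that pairs each agent with its owner, its allow-list, and its kill path.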

Pillar 2: Task Memory and Continuity

If your agent forgets the work the moment the chat closes, every Monday is Monday morning.

Agent work is durable only when the next agent (or human) can pick it up without restarting from scratch.

  6. Does every agent task survive past the chat session it was started in?

    Pass

    Tasks live in a durable store with status (pending, in progress, in review, done) and timestamps.

    Fail mode

    Half-finished work disappears when the tab is closed, and a different agent starts it again from scratch.

  7. Can the next agent that picks up a task see what the previous agent already did, decided, or rejected?

    Pass

    A brief structured handoff note attached to the task or workflow, not "scroll up and read the whole conversation."

    Fail mode

    Agents repeat the same dead end three times in three days.

  8. Are recurring workflows captured as named loops with a defined start, expected outcome, and acceptance criteria?

    Pass

    The workflow is described once and reused, not recreated as a fresh prompt each Monday.

    Fail mode

    Each cycle re-invents the brief and produces inconsistent output.

  9. When an agent produces an artifact (code change, message draft, decision), is the artifact attached to the task it came from?

    Pass

    Artifact and task share a stable link both ways.

    Fail mode

    The founder finds a half-finished PR and cannot tell which conversation produced it or why.

  10. Is there a deliberate boundary between "agent scratch space" and "facts the rest of the team should trust"?

    Pass

    Durable knowledge (decisions, runbooks, customer context) is written to a separate, reviewed surface.

    Fail mode

    A mistaken agent guess propagates into next quarter's onboarding doc because nobody curated it.
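
A durable task store is smaller than it sounds. The sketch below assumes SQLite and an invented schema; the fields (status, handoff, artifact) are illustrative, chosen only to show questions 6 through 9 answered from one table.

```python
import sqlite3
import time

# Minimal durable task store: tasks outlive the chat session that created them.
conn = sqlite3.connect("tasks.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS tasks (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',  -- pending / in_progress / in_review / done
    created_at REAL NOT NULL,
    updated_at REAL NOT NULL,
    handoff    TEXT DEFAULT '',                  -- what the previous agent did, decided, or rejected
    artifact   TEXT DEFAULT ''                   -- stable link to the PR, draft, or decision produced
)""")

def create_task(title: str) -> int:
    now = time.time()
    cur = conn.execute(
        "INSERT INTO tasks (title, created_at, updated_at) VALUES (?, ?, ?)",
        (title, now, now),
    )
    conn.commit()
    return cur.lastrowid

def hand_off(task_id: int, status: str, note: str, artifact: str = "") -> None:
    """Record what was done so the next agent (or human) does not restart from scratch."""
    conn.execute(
        "UPDATE tasks SET status = ?, handoff = ?, artifact = ?, updated_at = ? WHERE id = ?",
        (status, note, artifact, time.time(), task_id),
    )
    conn.commit()

task_id = create_task("Draft the weekly churn summary")
hand_off(task_id, "in_review",
         "Pulled numbers from the export; rejected the per-seat view as too noisy.",
         artifact="drafts/churn-summary.md")
```

What matters is not SQLite; it is that status, handoff note, and artifact link live with the task rather than in a chat transcript.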

Pillar 3: Run Logs, Evidence, and QA

If you cannot reconstruct what your agent did yesterday, you are guessing.

Customer-visible work needs an audit trail you would actually be willing to share in an incident review.

  11. For any agent run in the last 7 days, can you retrieve the prompt(s), tools used, inputs, and final output?

    Pass

    Structured run records, not "look in the terminal scrollback if you still have it."

    Fail mode

    A customer asks why their data was changed and nobody can answer.

  12. Are tool calls logged with arguments, return values, latency, and which agent invoked them?

    Pass

    Every external call (filesystem, API, browser, database, payment) is captured with enough detail to reproduce.

    Fail mode

    A billing API was called twice for the same customer and you cannot tell whether the agent retried.

  13. Do you capture and preserve QA evidence for tasks that ship to a customer or to production?

    Pass

    Tests run, screenshots taken, validation outputs, and review approvals are linked to the task before close.

    Fail mode

    Agent-generated work goes live without an artifact you could show in an incident review.

  14. Can you tell the difference, in your records, between an agent action and a human action on the same workflow?

    Pass

    Every change is attributed to a specific actor, whether agent role or human, with a timestamp.

    Fail mode

    A customer-facing message cannot be traced back to who actually wrote it.

  15. Do you have a routine, even a manual one, for spot-checking agent output that nobody reviewed at the time?

    Pass

    Weekly or per-batch sampling produces a documented pass/fail signal.

    Fail mode

    Regressions go unnoticed until a customer complains.
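
One low-tech way to get structured run records is to wrap every tool in a logging decorator. This is a hedged sketch: the JSONL file, the actor field, and the charge_customer stub are assumptions, not a reference to any particular tracing tool.

```python
import functools
import json
import time

RUN_LOG = "runs.jsonl"  # append-only evidence file; swap for whatever store you trust

def logged_tool(agent: str, actor: str = "agent"):
    """Wrap a tool so every call is captured with args, result, latency, and attribution."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result, error = None, None
            try:
                result = fn(*args, **kwargs)
                return result
            except Exception as exc:  # loud failures are evidence too
                error = repr(exc)
                raise
            finally:
                with open(RUN_LOG, "a") as f:
                    f.write(json.dumps({
                        "ts": start,
                        "actor": actor,          # distinguishes agent actions from human ones
                        "agent": agent,
                        "tool": fn.__name__,
                        "args": repr(args),
                        "kwargs": repr(kwargs),
                        "result": repr(result),
                        "error": error,
                        "latency_s": round(time.time() - start, 3),
                    }) + "\n")
        return wrapper
    return decorate

@logged_tool(agent="billing-assistant")
def charge_customer(customer_id: str, amount_cents: int) -> str:
    return f"charged {customer_id} {amount_cents}"  # stub standing in for the real call

charge_customer("cus_123", 4200)
```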

Pillar 4: Cost, Rate Limits, and Budget

If you do not know what your agents cost, you cannot price your offer or your runway.

Token spend, model swaps, and provider rate limits should all be visible before they become surprises on a credit card statement.

  16. Do you know how many tokens or API calls your agents consumed yesterday, by role or workflow?

    Pass

    A daily or per-run number you can read in seconds, broken down meaningfully.

    Fail mode

    The bill arrives larger than expected and nobody can explain which loop overspent.

  17. Do you know the dollar cost per finished task, or per unit of customer value, for your top 3 workflows?

    Pass

    A rough but defensible per-unit cost computed from observed usage.

    Fail mode

    Pricing decisions are based on intuition, not unit economics.

  18. Do you have a budget cap or rate limit per agent role that prevents a runaway loop from burning your quota?

    Pass

    A configured cap, alarm, or kill-switch per role.

    Fail mode

    A stuck "auto-reply" agent burns through the monthly model quota in one afternoon.

  19. Are model upgrades or version changes tracked alongside their effect on cost and quality?

    Pass

    A record of which model each role is using, when it last changed, and what the swap did to outputs.

    Fail mode

    A routine model bump silently triples your bill or breaks an output format.

  20. Do you have visibility into provider rate-limit headroom and a defined behavior when you hit a wall?

    Pass

    Monitored usage against limits, plus a documented fallback (queue, downgrade, pause).

    Fail mode

    A 429 response cascades into silent agent failure during a paid customer demo.
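
Cost visibility can start as a counter plus a cap. The prices, role names, and budget figures below are hypothetical; the shape worth copying is spend attributed per role and a hard stop before the quota disappears.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices and caps; substitute your provider's real price sheet.
PRICE_PER_1K = {"input": 0.002, "output": 0.008}
DAILY_BUDGET_USD = {"auto-reply": 5.00, "research": 20.00}  # cap per agent role

spend_today = defaultdict(float)

class BudgetExceeded(RuntimeError):
    pass

def record_usage(role: str, input_tokens: int, output_tokens: int) -> float:
    """Attribute spend to a role and stop a runaway loop before it burns the quota."""
    cost = (
        (input_tokens / 1000) * PRICE_PER_1K["input"]
        + (output_tokens / 1000) * PRICE_PER_1K["output"]
    )
    spend_today[role] += cost
    cap = DAILY_BUDGET_USD.get(role)
    if cap is not None and spend_today[role] > cap:
        raise BudgetExceeded(f"{role} spent ${spend_today[role]:.2f} today, cap is ${cap:.2f}")
    return cost

record_usage("auto-reply", input_tokens=1200, output_tokens=350)
print({role: round(usd, 4) for role, usd in spend_today.items()})  # yesterday-by-role in seconds
```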

Pillar 5: Failure Detection and Recovery

Agents fail. The question is whether you notice in minutes or days.

Stuck workers, repeating loops, and silent malformed outputs all need to surface before a customer sees them.

  21. Do you detect a stuck or zombie agent within minutes, not hours?

    Pass

    Heartbeat, last-progress-at, or watchdog signal that fires when an agent stops making progress.

    Fail mode

    A worker has been "running" for 14 hours producing nothing.

  22. Do you detect an agent that loops on the same step or repeats the same failed action?

    Pass

    Loop guard, retry cap, or pattern detection that breaks runaway behavior.

    Fail mode

    An agent retries the same broken API call 400 times before someone notices the cost.

  23. When an agent crashes or is killed, do its in-flight tasks get re-queued or marked unsafe-to-resume?

    Pass

    Explicit recovery rule per workflow.

    Fail mode

    A half-finished customer email is partially sent because the recovery state was undefined.

  24. Are silent failures (no output, empty output, malformed output) treated the same as loud errors?

    Pass

    Validation on output shape and content, with an explicit "unknown failure" status if it does not match.

    Fail mode

    An agent returns an empty string and the workflow records it as success.

  25. Do you have a documented post-incident routine for any time an agent caused a customer-visible problem?

    Pass

    Short, repeatable write-up with timeline, blast radius, fix, and the controls that should have caught it.

    Fail mode

    Each incident is fixed in a panic and the lesson does not survive the week.
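
Detection does not need a monitoring product on day one. The sketch below combines the three guards this pillar asks about (heartbeat, retry cap, output validation) in a few dozen lines; the thresholds and agent names are assumptions, not recommendations.

```python
import time

STALL_AFTER_S = 600   # no progress for 10 minutes counts as stuck
MAX_RETRIES = 3       # loop guard: stop repeating the same failed action

last_progress_at = {"report-writer": time.time()}  # each agent updates its own entry as it works

def heartbeat(agent: str) -> None:
    last_progress_at[agent] = time.time()

def find_stuck_agents() -> list[str]:
    """Run on a schedule; anything past the stall window surfaces in minutes, not hours."""
    now = time.time()
    return [agent for agent, t in last_progress_at.items() if now - t > STALL_AFTER_S]

def run_step(agent: str, step, validate) -> str:
    """Retry-capped execution with output validation; empty output is a failure, not a success."""
    for attempt in range(1, MAX_RETRIES + 1):
        out = step()
        heartbeat(agent)
        if validate(out):
            return out
    raise RuntimeError(f"{agent}: step failed validation after {MAX_RETRIES} attempts")

# Usage: a silent empty string never gets recorded as success.
result = run_step(
    "report-writer",
    step=lambda: "weekly summary text",
    validate=lambda out: isinstance(out, str) and out.strip() != "",
)
print(result, "| stuck:", find_stuck_agents())
```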

Pillar 6: Approval, Escalation, and Weekly Evidence

The point of observability is that the operator can step back without losing the wheel.

Routine status stays inside the agent layer. Sensitive decisions reach the operator with enough context to act on.

  26. Is there a written list of decisions the founder or operator must approve, separate from work an agent or worker can ship on its own?

    Pass

    Explicit gates for spend, customer commitments, legal, credentials, production unlocks, and brand-voice content.

    Fail mode

    Agents either ask the operator about everything (noisy) or about nothing (dangerous).

  27. Do escalations to the operator carry enough context to act on without re-reading raw tool output?

    Pass

    A short summary, the specific decision needed, the recommended option, and what happens if there is no answer.

    Fail mode

    The founder is paged at midnight with a wall of logs and no clear ask.

  28. Is there a documented "do not page the operator for this" list that absorbs routine status, retries, and internal coordination?

    Pass

    Explicit rules for what stays inside the agent layer, with examples reviewed monthly.

    Fail mode

    Every retry, every minor blocker, and every internal hand-off becomes a notification.

  29. Does the team get a recurring evidence report (weekly is typical) that summarizes shipped work, blocked work, risks, costs, and next decisions?

    Pass

    A regular artifact that survives even if the loudest channel goes quiet.

    Fail mode

    Status is whatever the most recent founder DM happened to say.

  30. Could a new operator, joining tomorrow, understand what your agents do and what they are forbidden to do, in under one hour, from your written artifacts?

    Pass

    Yes. Inventory, authorization rules, escalation policy, and weekly report make the system legible.

    Fail mode

    The system depends on a single founder remembering everything.
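
The approval and escalation rules in this pillar can be encoded as data that agents read rather than prose they paraphrase. This is a hedged sketch; the gate categories, the Escalation fields, and the routing strings are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional

# Gates the operator must approve, and noise that never leaves the agent layer.
REQUIRES_OPERATOR = {"spend", "customer_commitment", "legal", "credentials", "production_unlock"}
DO_NOT_PAGE = {"retry", "routine_status", "internal_handoff"}

@dataclass
class Escalation:
    """Enough context to act on without re-reading raw tool output."""
    summary: str            # one or two sentences
    decision_needed: str    # the specific question for the operator
    recommendation: str     # the option the agent would take
    default_if_silent: str  # what happens if nobody answers

def route(event_kind: str, escalation: Optional[Escalation] = None) -> str:
    """Keep routine noise inside the agent layer; page only on gated decisions."""
    if event_kind in DO_NOT_PAGE:
        return "handled-in-agent-layer"
    if event_kind in REQUIRES_OPERATOR:
        if escalation is None:
            raise ValueError("gated decisions must carry full escalation context")
        return f"PAGE OPERATOR: {escalation.summary} Needed: {escalation.decision_needed}"
    return "logged-for-weekly-report"

print(route("retry"))
print(route("spend", Escalation(
    summary="Ad spend will exceed this week's cap by $180.",
    decision_needed="Approve the overage or pause the campaign?",
    recommendation="Pause until the weekly report.",
    default_if_silent="The campaign pauses automatically at the cap.",
)))
```

An escalation that carries a recommendation and a default-if-silent is what lets the operator answer in one line instead of re-reading logs.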

Scoring reference

Total your score across all six pillars.

Each pillar holds a possible 10 points. A combined score below 20 usually means the highest-leverage move is not buying more agent tooling — it is picking one recurring workflow that already stalls and installing the basics around it.

Pillar | Topic                                      | Score
1      | Inventory and Authorization                | ___ / 10
2      | Task Memory and Continuity                 | ___ / 10
3      | Run Logs, Evidence, and QA                 | ___ / 10
4      | Cost, Rate Limits, and Budget              | ___ / 10
5      | Failure Detection and Recovery             | ___ / 10
6      | Approval, Escalation, and Weekly Evidence  | ___ / 10
Total  |                                            | ___ / 60

What this checklist is

  • An operator-grade self-diagnostic you can run on a small team in fifteen minutes.
  • A shared vocabulary for arguing about which AgentOps gap to fix next.
  • A structure you can carry into a 2-week pilot conversation without revealing internals.

What it is not

  • Not a buying guide for any specific observability vendor. The patterns can be implemented with general-purpose tools, custom tracing, or commercial AgentOps platforms.
  • Not a compliance certification. It is hygiene, not a substitute for SOC 2, ISO 42001, AIUC-1, or regulated AI audits.
  • Not a replacement for a security review. Pillar 1 covers authorization at a high level; production deployment needs threat-model-level work.
  • Not a generic AI maturity model. Every question is something a small team can act on this month.

Bring your worst-scoring pillar.

If your score is below 20, the highest-leverage move is rarely more agent tooling. It is picking one recurring workflow that already stalls and installing durable tasks, run logs, approval gates, and a weekly evidence report around it. That is the exact shape of the Always-On AgentOps Implementation pilot.

This checklist is shared as-is for internal use, team self-assessment, and client conversations. Attribution to Empyer / AgentOps is appreciated, not required.
