Free self-diagnostic

AI Agent Observability Checklist

Thirty questions across six pillars. Score your team in under fifteen minutes, then decide which gaps actually cost you time, money, or trust.

No signup, no email gate. Companion to the Always-On AgentOps Implementation pilot.

How to use it

Score 0, 1, or 2 per question. Maximum 60.

0 means absent or accidental. 1 means partial or manual. 2 means reliable and routine. Most teams running real agent work today score 15 to 30 on a first pass — that is the starting point, not a grade.

Score 0–19: Brittle. One bad day burns a week. Fix Pillars 1, 2, and 5 first.

Score 20–39: Workable. You ship, but the founder is the load-bearing layer. Move on Pillars 3, 4, and 6.

Score 40–60: Production-grade. You can delegate sensitive workflows without surrendering control.

Pillar 1: Inventory and Authorization

You cannot observe what you cannot list.

If a new operator joined tomorrow, could they list every running agent and what each one is allowed to touch?

  1. Can you list every agent currently running against your business in under 60 seconds, with name, owner, model, and purpose?

    Pass

    A single source of truth (file, dashboard, ticket, or spreadsheet) returns the answer without paging anyone.

    Fail mode

    Nobody can confirm whether yesterday's experimental agent is still calling APIs.

  2. For each agent, do you know which credentials, accounts, repositories, or customer data it is authorized to touch?

    Pass

    An explicit allow-list per agent role, not "whatever was in .env at the time."

    Fail mode

    A debug agent quietly retains production write access months after it was last useful.

  3. Does every agent have a single human owner with a clear escalation path if the agent misbehaves?

    Pass

    Name and contact path live next to the agent's config or run record.

    Fail mode

    A stuck agent runs for hours before anyone realizes there is no on-call for it.

  4. Can you kill any agent in under 30 seconds without restarting the rest of your system?

    Pass

    A documented stop command, hot-key, or admin button per role.

    Fail mode

    Shutting down "the bad one" requires bouncing the whole stack.

  5. Do you have a written policy for what kinds of actions an agent must NEVER take without a human?

    Pass

    Explicit list (production deploys, customer email, payments, credentials, legal text, model upgrades) reviewed quarterly.

    Fail mode

    The policy lives only in the founder's head.
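
The pass criteria above do not require a platform. As a minimal sketch only, here is what an agent registry could look like in Python; every field, name, and example entry is hypothetical, and a spreadsheet or YAML file answering the same five questions works just as well.

```python
from dataclasses import dataclass, field

@dataclass
class AgentRecord:
    """One row in the agent inventory: enough to answer Pillar 1 in under a minute."""
    name: str
    owner: str                     # single human owner and escalation contact
    model: str                     # which model this role currently runs on
    purpose: str
    allowed_scopes: list[str] = field(default_factory=list)       # explicit allow-list per role
    stop_command: str = ""         # documented way to kill it in under 30 seconds
    never_without_human: list[str] = field(default_factory=list)  # actions gated on a person

# Hypothetical entry; the point is that one place holds the answer.
REGISTRY = [
    AgentRecord(
        name="support-triage",
        owner="ops@example.com",
        model="provider-small-v2",
        purpose="Label and route inbound support email",
        allowed_scopes=["helpdesk:read", "helpdesk:label"],
        stop_command="make stop AGENT=support-triage",
        never_without_human=["customer email send", "refunds"],
    ),
]

def inventory_report() -> str:
    """List name, owner, model, purpose, scopes, and kill path without paging anyone."""
    lines = []
    for a in REGISTRY:
        lines.append(f"{a.name} | owner={a.owner} | model={a.model} | purpose={a.purpose}")
        lines.append(f"  scopes: {', '.join(a.allowed_scopes) or 'NONE DEFINED'}")
        lines.append(f"  stop:   {a.stop_command or 'UNDEFINED'}")
        lines.append(f"  never without a human: {', '.join(a.never_without_human) or 'UNDEFINED'}")
    return "\n".join(lines)

print(inventory_report())
```

The design choice that matters is a single place that pairs each agent with its owner, its allow-list, and its kill path.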

Pillar 2: Task Memory and Continuity

If your agent forgets the work the moment the chat closes, every Monday is Monday morning.

Agent work is durable only when the next agent (or human) can pick it up without restarting from scratch.

  6. Does every agent task survive past the chat session it was started in?

    Pass

    Tasks live in a durable store with status (pending, in progress, in review, done) and timestamps.

    Fail mode

    Half-finished work disappears when the tab is closed, and a different agent starts it again from scratch.

  7. Can the next agent that picks up a task see what the previous agent already did, decided, or rejected?

    Pass

    A brief structured handoff note attached to the task or workflow, not "scroll up and read the whole conversation."

    Fail mode

    Agents repeat the same dead end three times in three days.

  8. Are recurring workflows captured as named loops with a defined start, expected outcome, and acceptance criteria?

    Pass

    The workflow is described once and reused, not recreated as a fresh prompt each Monday.

    Fail mode

    Each cycle re-invents the brief and produces inconsistent output.

  9. When an agent produces an artifact (code change, message draft, decision), is the artifact attached to the task it came from?

    Pass

    Artifact and task share a stable link both ways.

    Fail mode

    The founder finds a half-finished PR and cannot tell which conversation produced it or why.

  10. Is there a deliberate boundary between "agent scratch space" and "facts the rest of the team should trust"?

    Pass

    Durable knowledge (decisions, runbooks, customer context) is written to a separate, reviewed surface.

    Fail mode

    A mistaken agent guess propagates into next quarter's onboarding doc because nobody curated it.
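
A durable task store is smaller than it sounds. The sketch below assumes SQLite and an invented schema; the fields (status, handoff, artifact) are illustrative, chosen only to show questions 6 through 9 answered from one table.

```python
import sqlite3
import time

# Minimal durable task store: tasks outlive the chat session that created them.
conn = sqlite3.connect("tasks.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS tasks (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'pending',  -- pending / in_progress / in_review / done
    created_at REAL NOT NULL,
    updated_at REAL NOT NULL,
    handoff    TEXT DEFAULT '',                  -- what the previous agent did, decided, or rejected
    artifact   TEXT DEFAULT ''                   -- stable link to the PR, draft, or decision produced
)""")

def create_task(title: str) -> int:
    now = time.time()
    cur = conn.execute(
        "INSERT INTO tasks (title, created_at, updated_at) VALUES (?, ?, ?)",
        (title, now, now),
    )
    conn.commit()
    return cur.lastrowid

def hand_off(task_id: int, status: str, note: str, artifact: str = "") -> None:
    """Record what was done so the next agent (or human) does not restart from scratch."""
    conn.execute(
        "UPDATE tasks SET status = ?, handoff = ?, artifact = ?, updated_at = ? WHERE id = ?",
        (status, note, artifact, time.time(), task_id),
    )
    conn.commit()

task_id = create_task("Draft the weekly churn summary")
hand_off(task_id, "in_review",
         "Pulled numbers from the export; rejected the per-seat view as too noisy.",
         artifact="drafts/churn-summary.md")
```

What matters is not SQLite; it is that status, handoff note, and artifact link live with the task rather than in a chat transcript.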

Pillar 3: Run Logs, Evidence, and QA

If you cannot reconstruct what your agent did yesterday, you are guessing.

Customer-visible work needs an audit trail you would actually be willing to share in an incident review.

  11. For any agent run in the last 7 days, can you retrieve the prompt(s), tools used, inputs, and final output?

    Pass

    Structured run records, not "look in the terminal scrollback if you still have it."

    Fail mode

    A customer asks why their data was changed and nobody can answer.

  12. Are tool calls logged with arguments, return values, latency, and which agent invoked them?

    Pass

    Every external call (filesystem, API, browser, database, payment) is captured with enough detail to reproduce.

    Fail mode

    A billing API was called twice for the same customer and you cannot tell whether the agent retried.

  13. Do you capture and preserve QA evidence for tasks that ship to a customer or to production?

    Pass

    Tests run, screenshots taken, validation outputs, and review approvals are linked to the task before close.

    Fail mode

    Agent-generated work goes live without an artifact you could show in an incident review.

  14. Can you tell the difference, in your records, between an agent action and a human action on the same workflow?

    Pass

    Every change is attributed to a specific actor, whether agent role or human, with a timestamp.

    Fail mode

    A customer-facing message cannot be traced back to who actually wrote it.

  15. Do you have a routine, even a manual one, for spot-checking agent output that nobody reviewed at the time?

    Pass

    Weekly or per-batch sampling produces a documented pass/fail signal.

    Fail mode

    Regressions go unnoticed until a customer complains.
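
One low-tech way to get structured run records is to wrap every tool in a logging decorator. This is a hedged sketch: the JSONL file, the actor field, and the charge_customer stub are assumptions, not a reference to any particular tracing tool.

```python
import functools
import json
import time

RUN_LOG = "runs.jsonl"  # append-only evidence file; swap for whatever store you trust

def logged_tool(agent: str, actor: str = "agent"):
    """Wrap a tool so every call is captured with args, result, latency, and attribution."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.time()
            result, error = None, None
            try:
                result = fn(*args, **kwargs)
                return result
            except Exception as exc:  # loud failures are evidence too
                error = repr(exc)
                raise
            finally:
                with open(RUN_LOG, "a") as f:
                    f.write(json.dumps({
                        "ts": start,
                        "actor": actor,          # distinguishes agent actions from human ones
                        "agent": agent,
                        "tool": fn.__name__,
                        "args": repr(args),
                        "kwargs": repr(kwargs),
                        "result": repr(result),
                        "error": error,
                        "latency_s": round(time.time() - start, 3),
                    }) + "\n")
        return wrapper
    return decorate

@logged_tool(agent="billing-assistant")
def charge_customer(customer_id: str, amount_cents: int) -> str:
    return f"charged {customer_id} {amount_cents}"  # stub standing in for the real call

charge_customer("cus_123", 4200)
```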

Pillar 4: Cost, Rate Limits, and Budget

If you do not know what your agents cost, you cannot price your offer or your runway.

Token spend, model swaps, and provider rate limits should all be visible before they become surprises on a credit card statement.

  16. Do you know how many tokens or API calls your agents consumed yesterday, by role or workflow?

    Pass

    A daily or per-run number you can read in seconds, broken down meaningfully.

    Fail mode

    The bill arrives larger than expected and nobody can explain which loop overspent.

  17. Do you know the dollar cost per finished task, or per unit of customer value, for your top 3 workflows?

    Pass

    A rough but defensible per-unit cost computed from observed usage.

    Fail mode

    Pricing decisions are based on intuition, not unit economics.

  18. Do you have a budget cap or rate limit per agent role that prevents a runaway loop from burning your quota?

    Pass

    A configured cap, alarm, or kill-switch per role.

    Fail mode

    A stuck "auto-reply" agent burns through the monthly model quota in one afternoon.

  19. Are model upgrades or version changes tracked alongside their effect on cost and quality?

    Pass

    A record of which model each role is using, when it last changed, and what the swap did to outputs.

    Fail mode

    A routine model bump silently triples your bill or breaks an output format.

  20. Do you have visibility into provider rate-limit headroom and a defined behavior when you hit a wall?

    Pass

    Monitored usage against limits, plus a documented fallback (queue, downgrade, pause).

    Fail mode

    A 429 response cascades into silent agent failure during a paid customer demo.
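
Cost visibility can start as a counter plus a cap. The prices, role names, and budget figures below are hypothetical; the shape worth copying is spend attributed per role and a hard stop before the quota disappears.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices and caps; substitute your provider's real price sheet.
PRICE_PER_1K = {"input": 0.002, "output": 0.008}
DAILY_BUDGET_USD = {"auto-reply": 5.00, "research": 20.00}  # cap per agent role

spend_today = defaultdict(float)

class BudgetExceeded(RuntimeError):
    pass

def record_usage(role: str, input_tokens: int, output_tokens: int) -> float:
    """Attribute spend to a role and stop a runaway loop before it burns the quota."""
    cost = (
        (input_tokens / 1000) * PRICE_PER_1K["input"]
        + (output_tokens / 1000) * PRICE_PER_1K["output"]
    )
    spend_today[role] += cost
    cap = DAILY_BUDGET_USD.get(role)
    if cap is not None and spend_today[role] > cap:
        raise BudgetExceeded(f"{role} spent ${spend_today[role]:.2f} today, cap is ${cap:.2f}")
    return cost

record_usage("auto-reply", input_tokens=1200, output_tokens=350)
print({role: round(usd, 4) for role, usd in spend_today.items()})  # yesterday-by-role in seconds
```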

Pillar 5: Failure Detection and Recovery

Agents fail. The question is whether you notice in minutes or days.

Stuck workers, repeating loops, and silent malformed outputs all need to surface before a customer sees them.

  21. Do you detect a stuck or zombie agent within minutes, not hours?

    Pass

    Heartbeat, last-progress-at, or watchdog signal that fires when an agent stops making progress.

    Fail mode

    A worker has been "running" for 14 hours producing nothing.

  22. Do you detect an agent that loops on the same step or repeats the same failed action?

    Pass

    Loop guard, retry cap, or pattern detection that breaks runaway behavior.

    Fail mode

    An agent retries the same broken API call 400 times before someone notices the cost.

  23. When an agent crashes or is killed, do its in-flight tasks get re-queued or marked unsafe-to-resume?

    Pass

    Explicit recovery rule per workflow.

    Fail mode

    A half-finished customer email is partially sent because the recovery state was undefined.

  24. Are silent failures (no output, empty output, malformed output) treated the same as loud errors?

    Pass

    Validation on output shape and content, with an explicit "unknown failure" status if it does not match.

    Fail mode

    An agent returns an empty string and the workflow records it as success.

  25. Do you have a documented post-incident routine for any time an agent caused a customer-visible problem?

    Pass

    Short, repeatable write-up with timeline, blast radius, fix, and the controls that should have caught it.

    Fail mode

    Each incident is fixed in a panic and the lesson does not survive the week.
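
Detection does not need a monitoring product on day one. The sketch below combines the three guards this pillar asks about (heartbeat, retry cap, output validation) in a few dozen lines; the thresholds and agent names are assumptions, not recommendations.

```python
import time

STALL_AFTER_S = 600   # no progress for 10 minutes counts as stuck
MAX_RETRIES = 3       # loop guard: stop repeating the same failed action

last_progress_at = {"report-writer": time.time()}  # each agent updates its own entry as it works

def heartbeat(agent: str) -> None:
    last_progress_at[agent] = time.time()

def find_stuck_agents() -> list[str]:
    """Run on a schedule; anything past the stall window surfaces in minutes, not hours."""
    now = time.time()
    return [agent for agent, t in last_progress_at.items() if now - t > STALL_AFTER_S]

def run_step(agent: str, step, validate) -> str:
    """Retry-capped execution with output validation; empty output is a failure, not a success."""
    for attempt in range(1, MAX_RETRIES + 1):
        out = step()
        heartbeat(agent)
        if validate(out):
            return out
    raise RuntimeError(f"{agent}: step failed validation after {MAX_RETRIES} attempts")

# Usage: a silent empty string never gets recorded as success.
result = run_step(
    "report-writer",
    step=lambda: "weekly summary text",
    validate=lambda out: isinstance(out, str) and out.strip() != "",
)
print(result, "| stuck:", find_stuck_agents())
```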

Pillar 6: Approval, Escalation, and Weekly Evidence

The point of observability is that the operator can step back without losing the wheel.

Routine status stays inside the agent layer. Sensitive decisions reach the operator with enough context to act on.

  26. Is there a written list of decisions the founder or operator must approve, separate from work an agent or worker can ship on its own?

    Pass

    Explicit gates for spend, customer commitments, legal, credentials, production unlocks, and brand-voice content.

    Fail mode

    Agents either ask the operator about everything (noisy) or about nothing (dangerous).

  27. Do escalations to the operator carry enough context to act on without re-reading raw tool output?

    Pass

    A short summary, the specific decision needed, the recommended option, and what happens if there is no answer.

    Fail mode

    The founder is paged at midnight with a wall of logs and no clear ask.

  28. Is there a documented "do not page the operator for this" list that absorbs routine status, retries, and internal coordination?

    Pass

    Explicit rules for what stays inside the agent layer, with examples reviewed monthly.

    Fail mode

    Every retry, every minor blocker, and every internal hand-off becomes a notification.

  29. Does the team get a recurring evidence report (weekly is typical) that summarizes shipped work, blocked work, risks, costs, and next decisions?

    Pass

    A regular artifact that survives even if the loudest channel goes quiet.

    Fail mode

    Status is whatever the most recent founder DM happened to say.

  30. Could a new operator, joining tomorrow, understand what your agents do and what they are forbidden to do, in under one hour, from your written artifacts?

    Pass

    Yes. Inventory, authorization rules, escalation policy, and weekly report make the system legible.

    Fail mode

    The system depends on a single founder remembering everything.
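
The approval and escalation rules in this pillar can be encoded as data that agents read rather than prose they paraphrase. This is a hedged sketch; the gate categories, the Escalation fields, and the routing strings are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional

# Gates the operator must approve, and noise that never leaves the agent layer.
REQUIRES_OPERATOR = {"spend", "customer_commitment", "legal", "credentials", "production_unlock"}
DO_NOT_PAGE = {"retry", "routine_status", "internal_handoff"}

@dataclass
class Escalation:
    """Enough context to act on without re-reading raw tool output."""
    summary: str            # one or two sentences
    decision_needed: str    # the specific question for the operator
    recommendation: str     # the option the agent would take
    default_if_silent: str  # what happens if nobody answers

def route(event_kind: str, escalation: Optional[Escalation] = None) -> str:
    """Keep routine noise inside the agent layer; page only on gated decisions."""
    if event_kind in DO_NOT_PAGE:
        return "handled-in-agent-layer"
    if event_kind in REQUIRES_OPERATOR:
        if escalation is None:
            raise ValueError("gated decisions must carry full escalation context")
        return f"PAGE OPERATOR: {escalation.summary} Needed: {escalation.decision_needed}"
    return "logged-for-weekly-report"

print(route("retry"))
print(route("spend", Escalation(
    summary="Ad spend will exceed this week's cap by $180.",
    decision_needed="Approve the overage or pause the campaign?",
    recommendation="Pause until the weekly report.",
    default_if_silent="The campaign pauses automatically at the cap.",
)))
```

An escalation that carries a recommendation and a default-if-silent is what lets the operator answer in one line instead of re-reading logs.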

Scoring reference

Total your score across all six pillars.

Each pillar holds a possible 10 points. A combined score below 20 usually means the highest-leverage move is not buying more agent tooling — it is picking one recurring workflow that already stalls and installing the basics around it.

Pillar | Topic                                      | Score
1      | Inventory and Authorization                | ___ / 10
2      | Task Memory and Continuity                 | ___ / 10
3      | Run Logs, Evidence, and QA                 | ___ / 10
4      | Cost, Rate Limits, and Budget              | ___ / 10
5      | Failure Detection and Recovery             | ___ / 10
6      | Approval, Escalation, and Weekly Evidence  | ___ / 10
Total  |                                            | ___ / 60

What this checklist is

  • An operator-grade self-diagnostic you can run on a small team in fifteen minutes.
  • A shared vocabulary for arguing about which AgentOps gap to fix next.
  • A structure you can carry into a 2-week pilot conversation without revealing internals.

What it is not

  • Not a buying guide for any specific observability vendor. The patterns can be implemented with general-purpose tools, custom tracing, or commercial AgentOps platforms.
  • Not a compliance certification. It is hygiene, not a substitute for SOC 2, ISO 42001, AIUC-1, or regulated AI audits.
  • Not a replacement for a security review. Pillar 1 covers authorization at a high level; production deployment needs threat-model-level work.
  • Not a generic AI maturity model. Every question is something a small team can act on this month.

Bring your worst-scoring pillar.

If your score is below 20, the highest-leverage move is rarely more agent tooling. It is picking one recurring workflow that already stalls and installing durable tasks, run logs, approval gates, and a weekly evidence report around it. That is the exact shape of the Always-On AgentOps Implementation pilot.

This checklist is shared as-is for internal use, team self-assessment, and client conversations. Attribution to Empyer / AgentOps is appreciated, not required.
