Inventory and Authorization
You cannot observe what you cannot list.
If a new operator joined tomorrow, could they list every running agent and what each one is allowed to touch?
1. Can you list every agent currently running against your business in under 60 seconds, with name, owner, model, and purpose?
Pass
A single source of truth (file, dashboard, ticket, or spreadsheet) returns the answer without paging anyone.
Fail mode
Nobody can confirm whether yesterday's experimental agent is still calling APIs.
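One minimal way to make the 60-second answer possible is a single machine-readable registry. This is a sketch, not a prescribed schema; the agent names, owners, and fields are illustrative.

```python
# Minimal sketch of a single-source-of-truth agent registry.
# Every field here (name, owner, model, purpose) is illustrative.
AGENT_REGISTRY = [
    {"name": "support-triage", "owner": "alice@example.com",
     "model": "gpt-4o", "purpose": "Label inbound support tickets"},
    {"name": "changelog-writer", "owner": "bob@example.com",
     "model": "claude-sonnet", "purpose": "Draft the weekly changelog"},
]

def list_agents(registry):
    """Answer 'what is running?' in one call: name, owner, model, purpose."""
    return [f"{a['name']} | {a['owner']} | {a['model']} | {a['purpose']}"
            for a in registry]

for line in list_agents(AGENT_REGISTRY):
    print(line)
```

The format matters less than the property: one file or table, readable in seconds, updated whenever an agent is started or retired.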
2. For each agent, do you know which credentials, accounts, repositories, or customer data it is authorized to touch?
Pass
An explicit allow-list per agent role, not "whatever was in .env at the time."
Fail mode
A debug agent quietly retains production write access months after it was last useful.
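The allow-list can be as small as a dict checked before every privileged action. A deny-by-default sketch, assuming resources are named strings; the roles and resource names are illustrative:

```python
# Explicit per-role allow-list; anything not listed is refused.
ALLOWED = {
    "support-triage": {"tickets:read", "tickets:write"},
    "debug-agent": {"logs:read"},  # note: no production write access
}

def authorize(role, resource):
    """Deny by default: a resource absent from the role's list is refused."""
    if resource not in ALLOWED.get(role, set()):
        raise PermissionError(f"{role} is not authorized for {resource}")
    return True
```

The deny-by-default shape is the point: a debug agent keeps production write access only if someone explicitly wrote it down.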
3. Does every agent have a single human owner with a clear escalation path if the agent misbehaves?
Pass
Name and contact path live next to the agent's config or run record.
Fail mode
A stuck agent runs for hours before anyone realizes there is no on-call for it.
4. Can you kill any agent in under 30 seconds without restarting the rest of your system?
Pass
A documented stop command, hot-key, or admin button per role.
Fail mode
Shutting down "the bad one" requires bouncing the whole stack.
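A per-role stop flag is one way to get this property without touching the rest of the stack. A sketch assuming each agent loop polls its own flag between steps:

```python
import threading

# One stop flag per agent role; setting one flag touches nothing else.
STOP_FLAGS = {"support-triage": threading.Event()}

def agent_loop(role, work):
    """Process items until the role's stop flag is set; checked between steps."""
    done = []
    for item in work:
        if STOP_FLAGS[role].is_set():
            break
        done.append(item)
    return done

# The documented stop command is one line, scoped to one role:
STOP_FLAGS["support-triage"].set()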
5. Do you have a written policy for what kinds of actions an agent must NEVER take without a human?
Pass
Explicit list (production deploys, customer email, payments, credentials, legal text, model upgrades) reviewed quarterly.
Fail mode
The policy lives only in the founder's head.
Task Memory and Continuity
If your agent forgets the work the moment the chat closes, every Monday is Monday morning.
Agent work is durable only when the next agent (or human) can pick it up without restarting from scratch.
6. Does every agent task survive past the chat session it was started in?
Pass
Tasks live in a durable store with status (pending, in progress, in review, done) and timestamps.
Fail mode
Half-finished work disappears when the tab is closed, and a different agent starts it again from scratch.
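The durable store does not need to be sophisticated. A sketch using a JSON file as the store; the field names are illustrative:

```python
import json, os, time

def save_task(path, task_id, status, note=""):
    """Upsert a task with status and timestamp so it outlives any chat session."""
    tasks = {}
    if os.path.exists(path):
        with open(path) as f:
            tasks = json.load(f)
    tasks[task_id] = {"status": status, "note": note, "updated_at": time.time()}
    with open(path, "w") as f:
        json.dump(tasks, f)
    return tasks[task_id]
```

Closing the tab no longer matters: the next agent reads the file, sees `in_progress`, and continues instead of restarting.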
7. Can the next agent that picks up a task see what the previous agent already did, decided, or rejected?
Pass
A brief structured handoff note attached to the task or workflow, not "scroll up and read the whole conversation."
Fail mode
Agents repeat the same dead end three times in three days.
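The handoff note can be a fixed four-field structure attached to the task. A sketch with illustrative field names and contents:

```python
# A structured handoff note: the next agent reads this, not the transcript.
def handoff_note(did, decided, rejected, next_step):
    return {
        "did": did,            # work completed so far
        "decided": decided,    # choices made, with the reason
        "rejected": rejected,  # dead ends, so they are not retried
        "next": next_step,     # where the next agent should start
    }

note = handoff_note(
    did=["drafted the migration script"],
    decided={"db": "stay on Postgres 15"},
    rejected=["bulk COPY approach: times out on the large table"],
    next_step="run the migration on staging",
)
```

The `rejected` field is the one that stops the same dead end from being explored three days in a row.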
8. Are recurring workflows captured as named loops with a defined start, expected outcome, and acceptance criteria?
Pass
The workflow is described once and reused, not recreated as a fresh prompt each Monday.
Fail mode
Each cycle re-invents the brief and produces inconsistent output.
9. When an agent produces an artifact (code change, message draft, decision), is the artifact attached to the task it came from?
Pass
Artifact and task share a stable link both ways.
Fail mode
The founder finds a half-finished PR and cannot tell which conversation produced it or why.
10. Is there a deliberate boundary between "agent scratch space" and "facts the rest of the team should trust"?
Pass
Durable knowledge (decisions, runbooks, customer context) is written to a separate, reviewed surface.
Fail mode
A mistaken agent guess propagates into next quarter's onboarding doc because nobody curated it.
Run Logs, Evidence, and QA
If you cannot reconstruct what your agent did yesterday, you are guessing.
Customer-visible work needs an audit trail you would actually be willing to share in an incident review.
11. For any agent run in the last 7 days, can you retrieve the prompt(s), tools used, inputs, and final output?
Pass
Structured run records, not "look in the terminal scrollback if you still have it."
Fail mode
A customer asks why their data was changed and nobody can answer.
12. Are tool calls logged with arguments, return values, latency, and which agent invoked them?
Pass
Every external call (filesystem, API, browser, database, payment) is captured with enough detail to reproduce.
Fail mode
A billing API was called twice for the same customer and you cannot tell whether the agent retried.
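A decorator is one lightweight way to capture every tool call with its arguments, result, latency, and invoking agent. A sketch with an in-memory log; a real system would write these records to durable storage:

```python
import time, functools

TOOL_LOG = []  # stand-in for a durable, structured log store

def logged_tool(agent_role):
    """Wrap a tool so every call records agent, args, result, and latency."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TOOL_LOG.append({
                "agent": agent_role, "tool": fn.__name__,
                "args": args, "kwargs": kwargs, "result": result,
                "latency_s": time.perf_counter() - start,
            })
            return result
        return inner
    return wrap

@logged_tool("billing-agent")
def charge(customer_id, amount):  # illustrative tool
    return {"customer": customer_id, "charged": amount}
```

With this record, "was the billing API called twice for the same customer?" becomes a log query, not an archaeology project.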
13. Do you capture and preserve QA evidence for tasks that ship to a customer or to production?
Pass
Tests run, screenshots taken, validation outputs, and review approvals are linked to the task before close.
Fail mode
Agent-generated work goes live without an artifact you could show in an incident review.
14. Can you tell the difference, in your records, between an agent action and a human action on the same workflow?
Pass
Every change is attributed to a specific actor, agent role, or human, with timestamp.
Fail mode
A customer-facing message cannot be traced back to who actually wrote it.
15. Do you have a routine, even a manual one, for spot-checking agent output that nobody reviewed at the time?
Pass
Weekly or per-batch sampling produces a documented pass/fail signal.
Fail mode
Regressions go unnoticed until a customer complains.
Cost, Rate Limits, and Budget
If you do not know what your agents cost, you cannot price your offer or plan your runway.
Token spend, model swaps, and provider rate limits should all be visible before they become surprises on a credit card statement.
16. Do you know how many tokens or API calls your agents consumed yesterday, by role or workflow?
Pass
A daily or per-run number you can read in seconds, broken down meaningfully.
Fail mode
The bill arrives larger than expected and nobody can explain which loop overspent.
17. Do you know the dollar cost per finished task, or per unit of customer value, for your top 3 workflows?
Pass
A rough but defensible per-unit cost computed from observed usage.
Fail mode
Pricing decisions are based on intuition, not unit economics.
18. Do you have a budget cap or rate limit per agent role that prevents a runaway loop from burning your quota?
Pass
A configured cap, alarm, or kill-switch per role.
Fail mode
A stuck "auto-reply" agent burns through the monthly model quota in one afternoon.
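A per-role budget cap can refuse the call that would exceed the quota rather than discovering the overage on the bill. A sketch assuming token usage is known per call:

```python
# Per-role budget cap: refuse before overspending, not after.
class Budget:
    def __init__(self, role, cap_tokens):
        self.role, self.cap, self.used = role, cap_tokens, 0

    def spend(self, tokens):
        """Raise before the call that would exceed the cap; return headroom."""
        if self.used + tokens > self.cap:
            raise RuntimeError(
                f"{self.role} would exceed its {self.cap}-token cap")
        self.used += tokens
        return self.cap - self.used
```

A stuck loop then hits a hard error long before it reaches the monthly quota, and the error names the role that overspent.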
19. Are model upgrades or version changes tracked alongside their effect on cost and quality?
Pass
A record of which model each role is using, when it last changed, and what the swap did to outputs.
Fail mode
A routine model bump silently triples your bill or breaks an output format.
20. Do you have visibility into provider rate-limit headroom and a defined behavior when you hit a wall?
Pass
Monitored usage against limits, plus a documented fallback (queue, downgrade, pause).
Fail mode
A 429 response cascades into silent agent failure during a paid customer demo.
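"Defined behavior" can be as simple as bounded exponential backoff that fails loudly when retries run out. A sketch; `RateLimited` stands in for whatever 429-style error your client raises:

```python
import time

class RateLimited(Exception):
    """Stand-in for a provider 429 response."""

def call_with_backoff(fn, retries=3, base_delay=0.01):
    """Retry with exponential backoff; re-raise once retries are exhausted."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == retries:
                raise  # fail loudly instead of cascading into silent failure
            time.sleep(base_delay * (2 ** attempt))
```

The final `raise` is the important part: when the wall holds, the failure surfaces as an error with a name, not a quiet empty result mid-demo.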
Failure Detection and Recovery
Agents fail. The question is whether you notice in minutes or days.
Stuck workers, repeating loops, and silent malformed outputs all need to surface before a customer sees them.
21. Do you detect a stuck or zombie agent within minutes, not hours?
Pass
Heartbeat, last-progress-at, or watchdog signal that fires when an agent stops making progress.
Fail mode
A worker has been "running" for 14 hours producing nothing.
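A last-progress-at watchdog is the minimal version of this signal. A sketch assuming each agent bumps its timestamp whenever it completes a step:

```python
import time

LAST_PROGRESS = {}  # role -> monotonic timestamp of last completed step

def heartbeat(role):
    """Called by the agent after each completed step."""
    LAST_PROGRESS[role] = time.monotonic()

def stuck_agents(max_silence_s):
    """Return roles that have not reported progress within the window."""
    now = time.monotonic()
    return [r for r, t in LAST_PROGRESS.items() if now - t > max_silence_s]
```

A cron job or monitor that calls `stuck_agents(300)` every minute turns "14 hours of nothing" into a five-minute alert.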
22. Do you detect an agent that loops on the same step or repeats the same failed action?
Pass
Loop guard, retry cap, or pattern detection that breaks runaway behavior.
Fail mode
An agent retries the same broken API call 400 times before someone notices the cost.
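A loop guard can be a counter keyed on the action itself. A sketch that caps identical repeated actions:

```python
from collections import Counter

class LoopGuard:
    """Count repeated identical actions and refuse once a cap is hit."""
    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def check(self, action_key):
        """Call before each action; False means stop and escalate instead."""
        self.counts[action_key] += 1
        return self.counts[action_key] <= self.max_repeats
```

Keying on something like `"POST /billing/charge c-42"` means the 400th identical retry never happens; the 4th one trips the guard.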
23. When an agent crashes or is killed, do its in-flight tasks get re-queued or marked unsafe-to-resume?
Pass
Explicit recovery rule per workflow.
Fail mode
A half-finished customer email is partially sent because the recovery state was undefined.
24. Are silent failures (no output, empty output, malformed output) treated the same as loud errors?
Pass
Validation on output shape and content, with an explicit "unknown failure" status if it does not match.
Fail mode
An agent returns an empty string and the workflow records it as success.
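Output validation can be a single classifier that never maps a bad shape to success. A sketch; the expected keys are illustrative:

```python
# Empty or malformed output is recorded as a failure, never as success.
def classify_output(output, required_keys=("summary", "status")):
    if not output:
        return "unknown_failure"   # empty string, None, {}, [] ...
    if not isinstance(output, dict):
        return "unknown_failure"   # wrong shape entirely
    if any(k not in output for k in required_keys):
        return "unknown_failure"   # malformed: missing required fields
    return "success"
```

The workflow records whatever this returns, so an empty string lands in the same alert path as a stack trace.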
25. Do you have a documented post-incident routine for any time an agent caused a customer-visible problem?
Pass
Short, repeatable write-up with timeline, blast radius, fix, and the controls that should have caught it.
Fail mode
Each incident is fixed in a panic and the lesson does not survive the week.
Approval, Escalation, and Weekly Evidence
The point of observability is that the operator can step back without losing the wheel.
Routine status stays inside the agent layer. Sensitive decisions reach the operator with enough context to act on.
26. Is there a written list of decisions the founder or operator must approve, separate from work an agent or worker can ship on its own?
Pass
Explicit gates for spend, customer commitments, legal, credentials, production unlocks, and brand-voice content.
Fail mode
Agents either ask the operator about everything (noisy) or about nothing (dangerous).
27. Do escalations to the operator carry enough context to act on without re-reading raw tool output?
Pass
Short summary, the specific decision needed, the recommended option, and what happens if no answer.
Fail mode
The founder is paged at midnight with a wall of logs and no clear ask.
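The fix is to make the escalation itself structured: a decision, not a log dump. A sketch with illustrative field names and an invented example:

```python
# An escalation payload carries the decision, not the raw tool output.
def escalation(summary, decision_needed, recommendation, default_if_silent):
    return {
        "summary": summary,                # one or two sentences of context
        "decision": decision_needed,       # the specific question to answer
        "recommendation": recommendation,  # the agent's suggested answer
        "if_no_answer": default_if_silent, # what happens on timeout
    }

page = escalation(
    summary="Refund request exceeds the auto-approve limit.",
    decision_needed="Approve a $480 refund for customer C-201?",
    recommendation="Approve; it matches our 30-day policy.",
    default_if_silent="Hold the refund and re-page in 4 hours.",
)
```

The `if_no_answer` field is what lets a midnight page wait safely until morning.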
28. Is there a documented "do not page the operator for this" list that absorbs routine status, retries, and internal coordination?
Pass
Explicit rules for what stays inside the agent layer, with examples reviewed monthly.
Fail mode
Every retry, every minor blocker, and every internal hand-off becomes a notification.
29. Does the team get a recurring evidence report (weekly is typical) that summarizes shipped work, blocked work, risks, costs, and next decisions?
Pass
A regular artifact that survives even if the loudest channel went quiet.
Fail mode
Status is whatever the most recent founder DM happened to say.
30. Could a new operator, joining tomorrow, understand what your agents do and what they are forbidden to do, in under one hour, from your written artifacts?
Pass
Yes. Inventory, authorization rules, escalation policy, and weekly report make the system legible.
Fail mode
The system depends on a single founder remembering everything.