Free self-diagnostic

AI Agent Prompt Evaluation Playbook

Twenty questions across four sections. Score your team in under fifteen minutes, then see which gaps in your eval harness will let the next prompt regression reach production.

No signup, no email gate. Companion to the Always-On AgentOps Implementation pilot.

How to use it

Score 0, 1, or 2 per question. Maximum 40.

0 means absent or accidental. 1 means partial or manual. 2 means reliable and routine. Most teams running coding agents or RAG pipelines today score 8 to 18 on a first pass — that is the starting point, not a grade.

Score 0–13

Untested

You are shipping prompt changes on intuition. The next regression is already in production and you will hear about it from a customer. Fix Section A first: get fixtures and a CI step.

Score 14–27

Workable

You have evals, but rubrics, regression checks, or rollouts are not yet defensible. Focus on Sections B and C: pick one workflow, harden the loop end-to-end, then expand.

Score 28–40

Production-grade

You can change prompts confidently, prove improvements, and catch regressions before users do. The eval set keeps up with the product, and rollouts are gated, not lobbed.

Section A

Test Coverage

If a prompt does not have a test, it does not have a baseline.

Test coverage is what lets you reason about prompt changes at all. Every revenue-bearing workflow, every tool-using agent, and every structured-output contract needs at least a small fixture set you can re-run before any prompt or model change ships.
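
A fixture set needs no special tooling to start. Below is a minimal sketch, assuming one JSONL file per workflow; the field names (id, tags, input, expected_facts) and the example rows are hypothetical, not a required schema.

```python
import json
from pathlib import Path

# Hypothetical fixture rows for a "support reply" workflow; field names are
# illustrative, not a required schema.
FIXTURES = [
    {
        "id": "support-001",
        "tags": ["happy-path"],
        "input": "Customer asks how to rotate their API key.",
        "expected_facts": ["points to the key-rotation docs", "no credentials in the reply"],
    },
    {
        "id": "support-014",
        "tags": ["regression", "2024-Q3"],
        "input": "Customer pastes a stack trace and asks for a refund in the same message.",
        "expected_facts": ["acknowledges the error", "routes the refund to billing"],
    },
]


def load_fixtures(path: str) -> list[dict]:
    """One JSON object per line, version-controlled next to the prompt it tests."""
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]
```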

  1. Does every revenue-bearing prompt or agent role have a fixture set of at least 10–25 representative inputs that can be replayed on demand?

    Pass

A version-controlled fixture file or test database holds golden inputs per workflow (support reply, code review, lead enrichment, etc.). Inputs cover the happy path, common edge cases, and at least two known historical failures.

    Fail mode

The only way to know if a prompt change is okay is to deploy it and wait for user complaints.

  2. Are known failure modes captured as test cases the moment they are discovered, not just patched live?

    Pass

    Every customer-reported regression or operator-spotted issue ends with a new fixture row, so the next prompt change is verified against it automatically.

    Fail mode

    The same hallucination or format break shows up three quarters in a row because nobody added it to a regression set.

  3. Does the eval harness cover every model provider you actually run in production (not just your default)?

    Pass

    If you fall back to OpenAI, Anthropic, Google, Bedrock, OpenRouter, or a self-hosted model, the harness can run the same fixtures against each one and compare scores side by side.

    Fail mode

    Evals only run on the default model. The fallback path is untested until a provider outage forces it live.

  4. Do tool-using agents have eval fixtures that exercise the tools (search, code execution, browsers, vector queries), not just the language step?

    Pass

    Fixture inputs trigger the actual tool calls or use deterministic mocks that match the tool contract. End-to-end agent traces are scored, not just the final completion.

    Fail mode

    Evals score the model output text in isolation; the broken tool call, the bad SQL, or the wrong URL is invisible until production.

  5. Are evals run on every prompt change PR, not only nightly or only at release time?

    Pass

    A CI step runs the relevant fixture set on the PR branch and posts a score delta against main. Reviewers see quality and cost diffs before approving.

    Fail mode

    Evals run weekly. A prompt edit lands on Monday and the regression is caught on Thursday after a customer escalates.
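
Question 5 is mostly plumbing. Here is a minimal sketch of the CI gate, assuming hypothetical run_prompt and score helpers that wrap whatever model calls and grading you already have; the file paths and the 0.02 tolerance are illustrative.

```python
import json
import sys
from pathlib import Path


def run_prompt(prompt_path: str, fixture: dict) -> str:
    """Hypothetical: render the prompt file with the fixture input and call your model."""
    raise NotImplementedError


def score(output: str, fixture: dict) -> float:
    """Hypothetical: rubric or judge score in [0, 1] for a single fixture."""
    raise NotImplementedError


def mean_score(prompt_path: str, fixtures: list[dict]) -> float:
    return sum(score(run_prompt(prompt_path, f), f) for f in fixtures) / len(fixtures)


if __name__ == "__main__":
    fixtures = [json.loads(line) for line in
                Path("fixtures/support_reply.jsonl").read_text().splitlines() if line.strip()]
    baseline = mean_score("prompts/main/support_reply.txt", fixtures)  # prompt as it exists on main
    candidate = mean_score("prompts/pr/support_reply.txt", fixtures)   # prompt from the PR branch
    print(f"baseline={baseline:.3f} candidate={candidate:.3f} delta={candidate - baseline:+.3f}")
    # Fail the CI step on a regression beyond a small tolerance; post the delta as the PR comment.
    sys.exit(1 if candidate < baseline - 0.02 else 0)
```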

Section B

Eval Design and Scoring

An eval is only as honest as its rubric.

Designing the eval is half the work. A vague "is the answer good?" rubric will rate two very different responses identically, and the same response differently depending on who grades it. The pass patterns here force the rubric to be explicit, calibrated, and reproducible.
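
One way to make the dimensions concrete is to grade them separately and aggregate from named weights. A sketch follows; the dimension names, 0–2 scale, and weights are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass


@dataclass
class RubricScore:
    """Named sub-scores, each graded 0-2 against a written definition."""
    correctness: int   # expected facts present, nothing fabricated
    completeness: int  # every required part of the task addressed
    format: int        # schema / citation / length requirements met
    safety: int        # no policy or data-handling violations

    def aggregate(self) -> float:
        # Illustrative weights; the point is that they are written down and versioned.
        weights = {"correctness": 0.4, "completeness": 0.2, "format": 0.2, "safety": 0.2}
        return sum(getattr(self, dim) / 2 * w for dim, w in weights.items())


# Two graders using the same rubric should land within one point per dimension.
print(RubricScore(correctness=2, completeness=1, format=2, safety=2).aggregate())  # 0.9
```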

  6. Does each eval have a written scoring rubric covering correctness, completeness, format, and safety as separate dimensions?

    Pass

    Each fixture has expected facts, expected format (JSON schema, citation pattern, etc.), and a safety dimension scored independently. Aggregate score is computed from named sub-scores, not a single subjective grade.

    Fail mode

Graders give a single score from 1 to 5 with no shared definition. Two reviewers grade the same output 2 vs 5.

  7. When you use LLM-as-judge, is the judge calibrated against human-graded examples and re-checked over time?

    Pass

    A small held-out human-graded set is replayed against the LLM judge regularly. Disagreement is tracked. The judge model and prompt are versioned.

    Fail mode

    A vendor or in-house LLM judge scores everything. Nobody has checked whether it agrees with operators on borderline cases since the day it was set up.

  8. Do you mitigate single-grader noise (multiple graders, majority vote, or pairwise comparison) on the cases that actually drive decisions?

    Pass

    Critical evals use either pairwise A/B preference between two prompts on the same input or a majority vote across two or three graders. Single-grader scores are reserved for low-stakes signals.

    Fail mode

    A single LLM judge or single human reviewer is the sole arbiter of whether a prompt change ships, and small wording differences in the judge prompt swing the result.

  9. Are eval fixtures kept up to date as the product evolves (new features, new tools, new customer segments)?

    Pass

    A quarterly review (or trigger-based update) refreshes fixture rows as features ship, new customer types appear, or legacy patterns retire. Stale fixtures are deleted, not silently kept.

    Fail mode

    Fixtures from a year-old version of the product still anchor the eval; new features have no coverage; the eval grade is misleadingly high.

  10. Are eval runs reproducible (frozen model versions, fixed seeds where supported, recorded judge prompts) so today's score and last quarter's score are comparable?

    Pass

    The eval run records the model id (e.g., claude-opus-4-7, gpt-5.5), temperature, system prompt hash, and judge prompt hash. A re-run with the same config produces the same score within tolerance.

    Fail mode

    Last quarter's eval used "gpt-4" with no temperature recorded; you cannot tell if today's lower score is your prompt regression or model drift.
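
For question 10, the reproducibility record can be a small manifest written alongside the scores. A sketch with hypothetical fields and a placeholder model id:

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_short(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:12]


def run_manifest(model_id: str, temperature: float, system_prompt: str, judge_prompt: str) -> dict:
    """Everything needed to explain a score difference a quarter later."""
    return {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,  # exact snapshot name, not a floating alias
        "temperature": temperature,
        "system_prompt_sha256": sha256_short(system_prompt),
        "judge_prompt_sha256": sha256_short(judge_prompt),
        "fixture_set": "fixtures/support_reply.jsonl",
    }


print(json.dumps(run_manifest("example-model-2025-01-01", 0.0, "You are...", "Grade the..."), indent=2))
```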

Section C

Regression Detection

Quality, latency, cost — every prompt change moves all three.

A prompt change always has three dimensions of impact: quality, latency, and cost. A regression in any one of them is a regression. The pass patterns here force every change to declare its tradeoff before it ships.
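
As a sketch of what "declare the tradeoff" can look like in harness output, assuming each eval run already records a score, latency, and cost per fixture; the field names and gate thresholds are illustrative.

```python
from statistics import mean


def tradeoff_report(baseline: list[dict], candidate: list[dict]) -> dict:
    """Each row: {'score': float, 'latency_s': float, 'cost_usd': float} for one fixture."""
    def agg(rows: list[dict], key: str) -> float:
        return mean(r[key] for r in rows)

    report = {
        "quality_delta": agg(candidate, "score") - agg(baseline, "score"),
        "latency_delta_s": agg(candidate, "latency_s") - agg(baseline, "latency_s"),
        "cost_delta_usd": agg(candidate, "cost_usd") - agg(baseline, "cost_usd"),
    }
    # Illustrative gates: a regression on any single axis needs explicit approval, not an auto-merge.
    report["needs_review"] = (
        report["quality_delta"] < 0
        or report["latency_delta_s"] > 0.5
        or report["cost_delta_usd"] > 0.01
    )
    return report
```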

  11. Is every prompt change scored against the current production baseline on the same fixture set?

    Pass

    The eval harness reports baseline_score and candidate_score on a shared fixture set with the same model, same temperature, and same judge. Reviewers see a score delta on every PR.

    Fail mode

    A new prompt is shipped because it "feels better on a few examples," with no comparison run.

  12. Is statistical significance checked before declaring a winner, not just absolute score difference?

    Pass

On fixture sets larger than ~30 items, a basic significance test (paired bootstrap, sign test, or confidence interval) gates merging; a paired-bootstrap sketch appears at the end of this section. "+2 points" is not auto-promoted on a 20-item set when the noise floor is ±5.

    Fail mode

    A change that scores 73% vs the baseline's 71% is shipped as a 2-point improvement when the noise floor on a 20-item set is ±5 points.

  13. Is latency regression tracked and gated alongside quality?

    Pass

    Each eval run records p50/p95 latency per fixture. A prompt that adds 1.2 seconds of latency is flagged even if it scores higher, because the user experience and cost both shift.

    Fail mode

A switch to chain-of-thought silently doubles user-visible latency. Nobody notices because only quality is scored.

  14. Is cost-per-task tracked and gated alongside quality?

    Pass

    Each eval run records token usage and dollar cost per fixture. A change that scores +1 point but +30% cost is flagged for explicit approval, not auto-merged.

    Fail mode

    Reasoning-token usage triples when a model is swapped in. Cost-per-task moves from $0.04 to $0.18 with no review.

  15. Are format and schema regressions detected automatically (JSON validity, citation count, tool-call argument shape)?

    Pass

    A schema check runs on every output. JSON parse failures, missing required fields, malformed citations, or wrong-shape tool arguments fail the eval row independently of the LLM judge's quality score.

    Fail mode

    A prompt edit accidentally drops a structured-output instruction. Downstream parsers start failing in production. The eval still rates the change as +1 quality.
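
The significance gate from question 12 is the step teams most often skip because it sounds statistical. A minimal paired-bootstrap sketch: resample fixtures with replacement and count how often the candidate actually wins. The iteration count and the 95% promotion threshold in the comment are illustrative.

```python
import random


def paired_bootstrap(baseline: list[float], candidate: list[float],
                     iters: int = 10_000, seed: int = 0) -> float:
    """Per-fixture scores for the same fixtures under both prompts.
    Returns the fraction of resamples in which the candidate's mean is higher."""
    assert len(baseline) == len(candidate)
    rng = random.Random(seed)
    n = len(baseline)
    wins = 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]  # resample fixture indices with replacement
        delta = sum(candidate[i] - baseline[i] for i in idx) / n
        wins += delta > 0
    return wins / iters


# Promote only when the candidate wins in, say, >= 95% of resamples.
# On a 20-item set, a +2 point mean difference will usually fail this gate.
```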

Section D

Production Rollout and Failure Analysis

An eval that does not gate production is a documentation exercise.

Evals only matter when they protect users. The pass patterns here connect the harness to the rollout: canary deploys, kill-switches, and a habit of converting production failures back into fixtures.
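
The canary in question 16 below does not require a feature-flag platform. A sketch of deterministic percentage routing on a stable key (the function names, salt, and 10% split are illustrative):

```python
import hashlib


def routes_to_canary(stable_key: str, canary_percent: int, salt: str = "prompt-v42") -> bool:
    """Same user always lands in the same bucket, so quality deltas stay comparable."""
    digest = hashlib.sha256(f"{salt}:{stable_key}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < canary_percent


def pick_prompt(user_id: str) -> str:
    # Send 10% of traffic to the candidate prompt for a defined window; the rest stays on baseline.
    return "prompts/candidate.txt" if routes_to_canary(user_id, 10) else "prompts/baseline.txt"
```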

  16. Are prompt changes rolled out as a canary or percentage rollout, not flipped 100% on merge?

    Pass

    A flag, header, or routing rule sends 5–20% of traffic to the new prompt for a defined window. Quality, latency, and cost are watched on the live slice before ramping.

    Fail mode

    A merge to main flips the prompt for 100% of traffic immediately. Issues are caught in customer support, not metrics.

  17. Do you A/B compare prompts on at least your top one or two highest-revenue or highest-volume workflows?

    Pass

    A live A/B test reports per-variant quality, latency, cost, and downstream conversion (where applicable) on the workflow that drives the most value. Decisions are made on the experiment, not vibes.

    Fail mode

    Every prompt swap is a one-way door, and the team relies on offline evals alone for high-stakes decisions.

  18. Is there a documented rollback path that returns production to the prior prompt within five minutes?

    Pass

    A flag flip, a config revert, or a one-command redeploy returns the previous prompt to 100% traffic. The path is tested at least once per quarter.

    Fail mode

    Rolling back a bad prompt requires a code revert, a CI run, and a redeploy. The window of damage is hours, not minutes.

  19. Is there an automated production quality monitor (sample-grading, parse-failure rate, refusal rate, or structured score) that can trigger a kill-switch?

    Pass

A small fraction of production traffic is sampled and scored continuously. If the rolling score drops below a threshold or parse-failure rate spikes, an alert fires and the prompt either reverts or pauses. A sketch of such a monitor follows at the end of this section.

    Fail mode

    The first signal that production quality cratered is a customer complaint or a Slack ping from sales, not the system itself.

  20. Are production failures and customer-reported regressions converted back into fixtures within one cycle, so the eval set keeps up with reality?

    Pass

    Every escalation that traces to a prompt or agent issue ends with a fixture row, a regression test, and a rubric clarification. The eval set grows with the product.

    Fail mode

    The same kind of failure shows up two quarters in a row because the lesson never made it back into the harness.
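
A sketch that ties questions 19 and 20 together: a rolling monitor that trips a kill-switch and turns the offending samples into candidate fixture rows. The class name, window size, and threshold are illustrative, not a prescribed design.

```python
import json
from collections import deque


class QualityMonitor:
    """Scores a sample of production traffic; trips when the rolling average drops."""

    def __init__(self, window: int = 200, min_score: float = 0.75):
        self.scores = deque(maxlen=window)
        self.min_score = min_score
        self.failed_samples: list[dict] = []

    def observe(self, sample: dict, score: float) -> bool:
        """Returns True when the kill-switch should fire."""
        self.scores.append(score)
        if score < self.min_score:
            self.failed_samples.append(sample)  # candidate fixture rows, not just a log line
        rolling = sum(self.scores) / len(self.scores)
        return len(self.scores) == self.scores.maxlen and rolling < self.min_score

    def export_fixtures(self, path: str) -> None:
        """Append failing samples to the fixture file so the eval set keeps up with reality."""
        with open(path, "a") as f:
            for sample in self.failed_samples:
                f.write(json.dumps(sample) + "\n")


# if monitor.observe(sample, score): revert the prompt flag, alert, and export_fixtures(...)
```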

Scoring reference

Total your score across all four sections.

Each section carries 10 points. A combined score below 14 usually means the highest-leverage move is not buying an eval platform — it is picking one workflow, writing 15 fixture rows, wiring a CI step that compares baseline vs candidate, and turning the next regression into a fixture instead of a war room.

Section | Topic | Score
A | Test Coverage | ___ / 10
B | Eval Design and Scoring | ___ / 10
C | Regression Detection | ___ / 10
D | Production Rollout and Failure Analysis | ___ / 10
Total |  | ___ / 40

What this playbook is

  • An operator-grade self-diagnostic you can run on a small AI engineering team in fifteen minutes.
  • The prompt-quality layer of AgentOps: fixtures, scoring rubrics, regression detection, and rollouts treated as production line items.
  • A structure you can carry into a 2-week pilot conversation without revealing internals.

What it is not

  • Not a buying guide for any specific eval platform. The patterns can be implemented with general-purpose CI, custom harnesses, or commercial eval tools.
  • Not a replacement for the AgentOps Observability Checklist or the Cost & Spend Tracking Starter Kit. Those cover what is happening; this covers whether what is happening is correct.
  • Not a benchmark suite. Public benchmarks measure model capabilities. This measures whether your specific application of those models is improving or regressing.
  • Not a substitute for human review on high-stakes outputs. It is the harness that lets human review focus on the right ten outputs, not all ten thousand.

Bring your worst-scoring section.

If your score is below 14, the highest-leverage move is rarely buying an eval vendor. Pick one revenue-bearing workflow, write 15 fixture rows from real history, wire a CI step that scores baseline vs candidate on every PR, and convert the next production failure into the 16th row. That is the exact shape of the Always-On AgentOps Implementation pilot.

This playbook is shared as-is for internal use, team self-assessment, and client conversations. Attribution to Empyer / AgentOps is appreciated, not required.
