Test Coverage
If a prompt does not have a test, it does not have a baseline.
Test coverage is what lets you reason about prompt changes at all. Every revenue-bearing workflow, every tool-using agent, and every structured-output contract needs at least a small fixture set you can re-run before any prompt or model change ships.
1. Does every revenue-bearing prompt or agent role have a fixture set of at least 10–25 representative inputs that can be replayed on demand?
Pass
A version-controlled fixture file or test database holds golden inputs per workflow (support reply, code review, lead enrichment, etc.). Inputs cover the happy path, common edge cases, and at least two known historical failures.
Fail mode
The only way to know if a prompt change is okay is to deploy it and watch user complaints.
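A minimal sketch of what "replayable on demand" can mean in practice. The fixture shape, workflow name, and `run_prompt` callable are all hypothetical, not a prescribed format; the point is that fixtures live in version control and replay is one function call.

```python
import json

# Hypothetical fixture set for one workflow ("support_reply").
# In practice this would live in a version-controlled JSONL file.
FIXTURES = [
    {"id": "happy-001", "input": "Where is my invoice?", "tags": ["happy_path"]},
    {"id": "edge-002", "input": "Refund in EUR, account closed", "tags": ["edge_case"]},
    {"id": "regr-003", "input": "", "tags": ["historical_failure"]},  # empty input once broke the prompt
]

def replay(fixtures, run_prompt):
    """Replay every fixture through `run_prompt`; collect outputs keyed by fixture id."""
    return {f["id"]: run_prompt(f["input"]) for f in fixtures}

def save_fixtures(path, fixtures):
    """Persist fixtures as JSONL so every change to them shows up in code review."""
    with open(path, "w") as fh:
        for f in fixtures:
            fh.write(json.dumps(f) + "\n")
```

Keeping fixtures in a plain diffable format (JSONL, CSV) also makes the "add a fixture per failure" habit from the next item a one-line PR.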
2. Are known failure modes captured as test cases the moment they are discovered, not just patched live?
Pass
Every customer-reported regression or operator-spotted issue ends with a new fixture row, so the next prompt change is verified against it automatically.
Fail mode
The same hallucination or format break shows up three quarters in a row because nobody added it to a regression set.
3. Does the eval harness cover every model provider you actually run in production (not just your default)?
Pass
If you fall back to OpenAI, Anthropic, Google, Bedrock, OpenRouter, or a self-hosted model, the harness can run the same fixtures against each one and compare scores side by side.
Fail mode
Evals only run on the default model. The fallback path is untested until a provider outage forces it live.
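One way to structure a provider matrix, sketched with hypothetical names: each provider is just a callable, so the fallback path runs through the identical fixtures and scorer as the default.

```python
def run_matrix(fixtures, providers, score):
    """Run the same fixtures against every configured provider.

    `providers` maps a provider name to a callable that takes the fixture
    input and returns a completion; `score` grades one (fixture, output)
    pair. Returns {provider: mean_score} for a side-by-side comparison.
    """
    table = {}
    for name, call in providers.items():
        scores = [score(f, call(f["input"])) for f in fixtures]
        table[name] = sum(scores) / len(scores)
    return table
```

The same shape extends to latency and cost columns; the essential property is that no provider in the routing table is missing from `providers`.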
4. Do tool-using agents have eval fixtures that exercise the tools (search, code execution, browsers, vector queries), not just the language step?
Pass
Fixture inputs trigger the actual tool calls or use deterministic mocks that match the tool contract. End-to-end agent traces are scored, not just the final completion.
Fail mode
Evals score the model output text in isolation; the broken tool call, the bad SQL, or the wrong URL is invisible until production.
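A sketch of a deterministic tool mock, assuming a hypothetical search-tool contract (`query`, `top_k`). The key property is that it matches the real tool's signature and fails loudly on inputs the fixture set never anticipated, rather than returning something plausible.

```python
def make_search_mock(canned):
    """Deterministic stand-in for a search tool in agent evals.

    `canned` maps a query string to a fixed result list, so agent traces
    replay identically. An unfixtured query raises instead of silently
    returning nothing, which surfaces contract drift in the agent.
    """
    def search(query, top_k=3):
        if query not in canned:
            raise KeyError(f"agent issued an unfixtured query: {query!r}")
        return canned[query][:top_k]
    return search
```

Scoring then runs over the full trace (tool calls plus final answer), not the completion text alone.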
5. Are evals run on every prompt change PR, not only nightly or only at release time?
Pass
A CI step runs the relevant fixture set on the PR branch and posts a score delta against main. Reviewers see quality and cost diffs before approving.
Fail mode
Evals run weekly. A prompt edit lands on Monday and the regression is caught on Thursday after a customer escalates.
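The CI gate itself can be very small; the discipline is in wiring it to every PR. A hypothetical sketch, where a wrapper script would exit nonzero on failure and post the delta as a PR comment:

```python
def gate(baseline_score, candidate_score, max_drop=0.02):
    """CI gate: fail the PR check if the candidate regresses past tolerance.

    Returns (passed, delta). The tolerance exists because small fixture
    sets are noisy; a hard zero would block harmless changes.
    """
    delta = candidate_score - baseline_score
    return delta >= -max_drop, delta
```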
Eval Design and Scoring
An eval is only as honest as its rubric.
Designing an eval is half the work. A vague "is the answer good?" rubric will rate two different responses identically. The pass patterns here force the rubric to be explicit, calibrated, and reproducible.
6. Does each eval have a written scoring rubric covering correctness, completeness, format, and safety as separate dimensions?
Pass
Each fixture has expected facts, expected format (JSON schema, citation pattern, etc.), and a safety dimension scored independently. Aggregate score is computed from named sub-scores, not a single subjective grade.
Fail mode
Graders give a single number from 1–5 with no shared definition. Two reviewers grade the same output 2 vs 5.
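One way to make "named sub-scores, not a single subjective grade" concrete. The dimension names come from the checklist item above; the weights are illustrative, not recommended values.

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    """Named sub-scores (each 0.0-1.0) instead of one opaque 1-5 grade."""
    correctness: float
    completeness: float
    format: float
    safety: float

    def aggregate(self, weights=(0.4, 0.2, 0.2, 0.2)):
        """Weighted aggregate; the sub-scores stay visible in reports."""
        parts = (self.correctness, self.completeness, self.format, self.safety)
        return sum(w * p for w, p in zip(weights, parts))
```

Because each dimension is defined and stored separately, two reviewers disagreeing on "completeness" is a rubric conversation, not a 2-vs-5 standoff.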
7. When you use LLM-as-judge, is the judge calibrated against human-graded examples and re-checked over time?
Pass
A small held-out human-graded set is replayed against the LLM judge regularly. Disagreement is tracked. The judge model and prompt are versioned.
Fail mode
A vendor or in-house LLM judge scores everything. Nobody has checked whether it agrees with operators on borderline cases since the day it was set up.
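The calibration check reduces to one tracked number: how often the judge agrees with humans on the held-out set. A minimal sketch, assuming both sides grade into the same buckets (e.g. pass/fail):

```python
def judge_agreement(human_grades, judge_grades):
    """Fraction of held-out examples where the LLM judge matches the human grade.

    Both inputs map example id -> bucketed grade. Replay this regularly;
    a drop means the judge model or its prompt has drifted.
    """
    ids = human_grades.keys() & judge_grades.keys()
    if not ids:
        raise ValueError("no overlapping examples to compare")
    hits = sum(human_grades[i] == judge_grades[i] for i in ids)
    return hits / len(ids)
```

Plotting this per judge-prompt version is what turns "nobody has checked since setup" into a dashboard line.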
8. Do you mitigate single-grader noise (multiple graders, majority vote, or pairwise comparison) on the cases that actually drive decisions?
Pass
Critical evals use either pairwise A/B preference between two prompts on the same input or a majority vote across two or three graders. Single-grader scores are reserved for low-stakes signals.
Fail mode
A single LLM judge or single human reviewer is the sole arbiter of whether a prompt change ships, and small wording differences in the judge prompt swing the result.
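The majority-vote variant is a few lines; the design choice worth making explicit is what happens on a tie. A sketch that escalates ties rather than guessing:

```python
from collections import Counter

def majority_vote(grades):
    """Resolve multiple graders' verdicts on one output.

    Returns the winning grade, or None on a tie so the case can be
    escalated to a human instead of decided by ordering luck.
    """
    counts = Counter(grades).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```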
9. Are eval fixtures kept up to date as the product evolves (new features, new tools, new customer segments)?
Pass
A quarterly review (or trigger-based update) refreshes fixture rows as features ship, new customer types appear, or legacy patterns retire. Stale fixtures are deleted, not silently kept.
Fail mode
Fixtures from a year-old version of the product still anchor the eval; new features have no coverage; the eval grade is misleadingly high.
10. Are eval runs reproducible (frozen model versions, fixed seeds where supported, recorded judge prompts) so today's score and last quarter's score are comparable?
Pass
The eval run records the model id (e.g., claude-opus-4-7, gpt-5.5), temperature, system prompt hash, and judge prompt hash. A re-run with the same config produces the same score within tolerance.
Fail mode
Last quarter's eval used "gpt-4" with no temperature recorded; you cannot tell if today's lower score is your prompt regression or model drift.
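Recording the run config can be a small fingerprint attached to every eval result. A sketch, hashing the prompts so the record is compact but still catches any silent edit:

```python
import hashlib

def run_fingerprint(model_id, temperature, system_prompt, judge_prompt):
    """Everything needed to tell whether two eval runs are comparable.

    Prompts are stored as SHA-256 hashes: small enough to log on every
    run, sensitive enough that a one-character edit changes the record.
    """
    return {
        "model_id": model_id,
        "temperature": temperature,
        "system_prompt_sha256": hashlib.sha256(system_prompt.encode()).hexdigest(),
        "judge_prompt_sha256": hashlib.sha256(judge_prompt.encode()).hexdigest(),
    }
```

If last quarter's run carries a fingerprint and today's differs only in the prompt hash, a score drop is your regression; if the model id differs too, drift is back on the suspect list.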
Regression Detection
Quality, latency, cost — every prompt change moves all three.
A prompt change always has three dimensions of impact: quality, latency, and cost. A regression in any one of them is a regression. The pass patterns here force every change to declare its tradeoff before it ships.
11. Is every prompt change scored against the current production baseline on the same fixture set?
Pass
The eval harness reports baseline_score and candidate_score on a shared fixture set with the same model, same temperature, and same judge. Reviewers see a score delta on every PR.
Fail mode
A new prompt is shipped because it "feels better on a few examples," with no comparison run.
12. Is statistical significance checked before declaring a winner, not just absolute score difference?
Pass
On fixture sets larger than ~30 items, a basic significance test (paired bootstrap, sign test, or confidence interval) gates merging. "+2 points" is not auto-promoted on a 20-item set when the noise floor is ±5.
Fail mode
A change that scores 73% vs the baseline's 71% is shipped as a 2-point improvement when the noise floor on a 20-item set is ±5 points.
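A paired bootstrap is one of the simplest gates that catches the 73%-vs-71% trap above. A sketch over per-fixture scores (same fixtures, same order, for both prompts):

```python
import random

def paired_bootstrap(baseline, candidate, iters=2000, seed=0):
    """Paired bootstrap over per-fixture score deltas.

    Returns the fraction of resamples where the candidate's mean beats
    the baseline's. Gate merging on a high value (e.g. >= 0.95) AND a
    raw delta that matters in product terms.
    """
    assert len(baseline) == len(candidate)
    rng = random.Random(seed)  # fixed seed keeps the gate reproducible
    deltas = [c - b for b, c in zip(baseline, candidate)]
    n = len(deltas)
    wins = 0
    for _ in range(iters):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n > 0:
            wins += 1
    return wins / iters
```

Pairing matters: resampling the per-fixture deltas, rather than the two score lists independently, removes the fixture-difficulty noise that both prompts share.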
13. Is latency regression tracked and gated alongside quality?
Pass
Each eval run records p50/p95 latency per fixture. A prompt that adds 1.2 seconds of latency is flagged even if it scores higher, because the user experience and cost both shift.
Fail mode
A switch to chain-of-thought silently doubles user-visible latency. Nobody noticed because only quality was scored.
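The latency gate follows the same shape as the quality gate, just over recorded per-fixture timings. A sketch using a nearest-rank percentile and an illustrative p95 budget:

```python
def percentile(values, q):
    """Nearest-rank percentile (q in 0-100) over per-fixture latencies."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(q / 100 * len(ordered)) - 1))
    return ordered[k]

def latency_gate(baseline_ms, candidate_ms, p95_budget_ms=300):
    """Flag the change when candidate p95 exceeds baseline p95 by more than budget."""
    return percentile(candidate_ms, 95) - percentile(baseline_ms, 95) <= p95_budget_ms
```

Gating on p95 rather than the mean is what catches a chain-of-thought change that is fine on average but doubles tail latency.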
14. Is cost-per-task tracked and gated alongside quality?
Pass
Each eval run records token usage and dollar cost per fixture. A change that scores +1 point but +30% cost is flagged for explicit approval, not auto-merged.
Fail mode
Reasoning-token usage triples when a model is swapped in. Cost-per-task moves from $0.04 to $0.18 with no review.
15. Are format and schema regressions detected automatically (JSON validity, citation count, tool-call argument shape)?
Pass
A schema check runs on every output. JSON parse failures, missing required fields, malformed citations, or wrong-shape tool arguments fail the eval row independently of the LLM judge's quality score.
Fail mode
A prompt edit accidentally drops a structured-output instruction. Downstream parsers start failing in production. The eval still rates the change as +1 quality.
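The schema check is deterministic code, cheap to run on every output, and independent of any judge. A sketch against a hypothetical output contract (the `answer`/`citations` fields are illustrative):

```python
import json

# Hypothetical output contract: field name -> required Python type.
REQUIRED = {"answer": str, "citations": list}

def check_output(raw):
    """Deterministic format gate, separate from the LLM judge's quality score.

    Returns a list of violations; an empty list means the row passes.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    problems = []
    for field, typ in REQUIRED.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], typ):
            problems.append(f"wrong type for {field}")
    return problems
```

Because the check fails the row on its own, a +1 quality score from the judge cannot mask a dropped structured-output instruction.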
Production Rollout and Failure Analysis
An eval that does not gate production is a documentation exercise.
Evals only matter when they protect users. The pass patterns here connect the harness to the rollout: canary deploys, kill-switches, and a habit of converting production failures back into fixtures.
16. Are prompt changes rolled out as a canary or percentage rollout, not flipped 100% on merge?
Pass
A flag, header, or routing rule sends 5–20% of traffic to the new prompt for a defined window. Quality, latency, and cost are watched on the live slice before ramping.
Fail mode
A merge to main flips the prompt for 100% of traffic immediately. Issues are caught in customer support, not metrics.
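Percentage rollouts are often a hash bucket rather than a coin flip, so each user stays on one variant for the whole window. A sketch, with a hypothetical salt naming the rollout:

```python
import hashlib

def in_canary(user_id, percent=10, salt="prompt-v2-rollout"):
    """Deterministic per-user bucketing for a percentage rollout.

    Hashing user_id + salt pins each user to one variant, so live
    quality/latency/cost comparisons are not confounded by users
    flapping between prompts mid-session.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100 < percent
```

Changing the salt per rollout reshuffles the buckets, so the same users are not always the canary cohort.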
17. Do you A/B compare prompts on at least your top one or two highest-revenue or highest-volume workflows?
Pass
A live A/B test reports per-variant quality, latency, cost, and downstream conversion (where applicable) on the workflow that drives the most value. Decisions are made on the experiment, not vibes.
Fail mode
Every prompt swap is a one-way door, and the team relies on offline evals alone for high-stakes decisions.
18. Is there a documented rollback path that returns production to the prior prompt within five minutes?
Pass
A flag flip, a config revert, or a one-command redeploy returns the previous prompt to 100% traffic. The path is tested at least once per quarter.
Fail mode
Rolling back a bad prompt requires a code revert, a CI run, and a redeploy. The window of damage is hours, not minutes.
19. Is there an automated production quality monitor (sample-grading, parse-failure rate, refusal rate, or structured score) that can trigger a kill-switch?
Pass
A small fraction of production traffic is sampled and scored continuously. If the rolling score drops below a threshold or parse-failure rate spikes, an alert fires and the prompt either reverts or pauses.
Fail mode
The first signal that production quality cratered is a customer complaint or a Slack ping from sales, not the system itself.
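The monitor can be a rolling window over sampled, scored production outputs; the kill-switch is whatever flips the flag from item 18. A minimal sketch with illustrative window and threshold values:

```python
from collections import deque

class QualityMonitor:
    """Rolling window over sampled, scored production outputs.

    When the rolling mean drops below `threshold`, `should_kill()` turns
    True and a caller can revert the prompt or page an operator.
    """
    def __init__(self, window=50, threshold=0.7):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def record(self, score):
        self.scores.append(score)

    def should_kill(self):
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough samples to judge yet
        return sum(self.scores) / len(self.scores) < self.threshold
```

The same structure works for parse-failure rate or refusal rate; the score just becomes 0/1 per sampled request.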
20. Are production failures and customer-reported regressions converted back into fixtures within one cycle, so the eval set keeps up with reality?
Pass
Every escalation that traces to a prompt or agent issue ends with a fixture row, a regression test, and a rubric clarification. The eval set grows with the product.
Fail mode
The same kind of failure shows up two quarters in a row because the lesson never made it back into the harness.