The Evidence Ladder worksheet
Toolkit · Evaluating AI
Published July 4, 2026 · Vita Indarra
Short answer: every claim about an AI system rests on evidence, and evidence comes in strengths. The mistake that ships broken AI is gathering weak evidence and making a strong decision on it — trusting a demo the way you'd trust a proof. The fix is a habit: for any claim you're about to act on, ask what rung of evidence is this, and is it high enough for what I'm about to do with it? Below are the six rungs, the copy-paste worksheet, and the one distinction — hypothesis versus finding — that separates a number you can trust from a number that only looks like one.
The ladder isn't a demand that every claim reach the top — that's paralysis. It's a tool for matching the strength of your evidence to the cost of being wrong. A low-stakes internal tool can run on a benchmark number; if it's wrong, you lose an afternoon. A system that gates real money, makes irreversible decisions, or touches people's lives needs evidence near the top. The error teams make is not that they never gather strong evidence — it's that they gather vibe-level evidence and make rung-five decisions on it, feeling fully justified because the demo really did go well.
Underneath the whole ladder is one habit worth more than the rest combined. When you look at a system, find no problem, and conclude it's fine, you have a hypothesis — and the absence of a found problem is weak evidence, because it conflates "there is no problem" with "I did not find one." When you construct the specific scenario that would expose the problem, run it, and watch the system hold, you have a finding. Every rung up the ladder is a move from "I looked and it seemed fine" toward "I tried to make it fail and it didn't." A claim cleared by inspection is a hypothesis wearing a conclusion's clothes.
This is the actual worksheet from Evaluating AI Systems. Locate your claim honestly, then compare against what the decision requires. The gap between the two is your instruction.
CLAIM: ____________________
WHAT RUNG IS MY EVIDENCE ACTUALLY ON? (check the highest you've genuinely reached)
[ ] 6 tamper-evident track record — proven over time, un-fakeable
[ ] 5 end-to-end production measurement — whole system, real traffic
[ ] 4 adversarial test + working proof-of-concept — I tried to break it; it held
[ ] 3 benchmark WITH a control that collapsed when it should
[ ] 2 benchmark number, uncontrolled
[ ] 1 vibe — "it looked good / I'm pretty sure" (feeling confident is NOT a rung)
WHAT RUNG DOES THIS DECISION REQUIRE? (set by STAKES, not by evidence on hand)
Stakes of being wrong: ____________________ -> required rung: ___
(low / reversible -> 2-3 ok; real money / irreversible / affects people -> 4-6)
EVIDENCE RUNG ___ vs REQUIRED RUNG ___
evidence >= required -> calibrated, proceed
evidence < required -> GATHER MORE EVIDENCE or MAKE A SMALLER DECISION (don't ship the gap)
HYPOTHESIS-OR-FINDING CHECK:
Cleared by INSPECTION ("I looked, saw no problem") = hypothesis (rung 1-2)
Cleared by a TEST that tried to make it fail and didn't = finding (rung 3+)
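The comparison at the heart of the worksheet can be sketched in a few lines of Python. This is an illustration, not part of the worksheet itself: the function names `required_rung` and `verdict` are made up for this sketch, and collapsing "stakes" to a single yes/no flag that maps to the lowest rung of each band (2 or 4) is a simplifying assumption.

```python
# Rung names copied from the worksheet; everything else here is illustrative.
RUNGS = {
    1: "vibe",
    2: "benchmark number, uncontrolled",
    3: "benchmark with a control that collapsed when it should",
    4: "adversarial test + working proof-of-concept",
    5: "end-to-end production measurement",
    6: "tamper-evident track record",
}


def required_rung(high_stakes: bool) -> int:
    """Assumption: use the lowest rung of each band on the worksheet
    (low / reversible -> 2-3; real money / irreversible / people -> 4-6)."""
    return 4 if high_stakes else 2


def verdict(evidence_rung: int, required: int) -> str:
    """Compare the evidence you have with what the decision requires."""
    for rung in (evidence_rung, required):
        if rung not in RUNGS:
            raise ValueError(f"unknown rung: {rung}")
    if evidence_rung >= required:
        return "calibrated, proceed"
    return "gather more evidence or make a smaller decision"


# A benchmark score (rung 2) offered for a decision that gates real money.
print(verdict(evidence_rung=2, required=required_rung(high_stakes=True)))
```

The required rung is computed from the stakes alone and never from the evidence on hand, which mirrors the worksheet's instruction that the requirement is set by stakes.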
The Evidence Ladder is the frame; it doesn't do the climbing. Getting a claim to rung three needs a real control (shuffle the labels, prove the test can fail). Rung four needs a genuine adversarial test. Rung five needs end-to-end measurement, because a better-scoring component can make a worse system. And rung six needs a tamper-evident record. Those disciplines — plus pre-registration, calibration, the ceiling diagnosis, and honest reporting ("above chance, not oracle") — are the method. This page gives you the map for free.
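The point that a better-scoring component can make a worse system is easy to see in a toy example. The sketch below uses made-up data: ten test items, a hypothetical two-stage pipeline (a "retriever" feeding an "answerer"), and invented sets recording which items each stage gets right. It assumes the system succeeds on an item only when both stages do.

```python
# Toy, made-up data: which of 10 test items each stage handles correctly.
ITEMS = range(10)
retriever_v1_correct = set(range(0, 7))   # right on items 0-6
retriever_v2_correct = set(range(2, 10))  # right on items 2-9
answerer_correct = set(range(0, 6))       # downstream stage, unchanged


def component_score(correct: set) -> float:
    """Score of one stage measured in isolation."""
    return len(correct) / len(ITEMS)


def system_score(stage1: set, stage2: set) -> float:
    """Assumption: the system succeeds only where both stages succeed."""
    return len(stage1 & stage2) / len(ITEMS)


for name, retriever in [("v1", retriever_v1_correct), ("v2", retriever_v2_correct)]:
    print(
        name,
        "component:", component_score(retriever),
        "system:", system_score(retriever, answerer_correct),
    )
```

Here the v2 retriever scores higher in isolation (0.8 against 0.7), yet the end-to-end score drops from 0.6 to 0.4, because the items it gained are ones the downstream stage cannot use and the items it lost are ones the downstream stage needed. Only the whole-system measurement shows that.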
A benchmark score alone is enough only for a low-stakes decision. An uncontrolled benchmark is rung two; gating real consequences is a rung-four-or-higher decision. The score isn't wrong; it's just weaker evidence than the decision needs, and that mismatch is where confident failures come from.
What lifts a benchmark from rung two to rung three is a control. Re-run your evaluation in a version where success should be impossible (shuffle the labels, or disable the thing you're testing) and confirm the result collapses. A test that passes even when it should fail is verifying nothing.
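A shuffled-label control can be sketched as follows. Everything in it is a stand-in: the data is synthetic, `predict` is a trivial threshold rule, not a real model, and the 0.1 margin around chance is an arbitrary choice. Because this toy has no training step, the shuffle is applied to the evaluation labels; for a trained system the usual version is to shuffle the training labels, retrain, and check that the score falls to chance.

```python
import random


def accuracy(predict, xs, ys):
    """Fraction of examples where the prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)


def shuffled_label_control(predict, xs, ys, seed=0):
    """Re-run the same evaluation with labels shuffled, so that any
    real link between input and label is destroyed."""
    rng = random.Random(seed)
    shuffled = list(ys)
    rng.shuffle(shuffled)
    return accuracy(predict, xs, shuffled)


# Synthetic stand-in data: the label is simply whether x exceeds 0.5.
data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(1000)]
ys = [x > 0.5 for x in xs]


def predict(x):
    return x > 0.5


real = accuracy(predict, xs, ys)
control = shuffled_label_control(predict, xs, ys)
chance = 0.5  # two roughly balanced classes

# The control "collapses" if it lands near chance (margin is arbitrary).
collapsed = abs(control - chance) < 0.1
print(f"real={real:.2f} control={control:.2f} collapsed={collapsed}")
```

If the control score stays near the real score, the evaluation is not measuring what you think it is, and the original number should drop back to rung two until that is explained.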
Directly: see the plain-English companion, "How do you know what an AI can actually do?" This worksheet is the tool that question points to.
Go deeper
The Evidence Ladder is one chapter of Evaluating AI Systems — the discipline of finding out what an AI system can actually do, as opposed to what its scoreboard says. The five lies a metric tells, pre-registration, controls, adversarial evaluation, the ceiling diagnosis, and honest reporting — drawn from real systems built, measured, and caught being confidently wrong. Live on Amazon.