How do I know if an AI benchmark score can be trusted?

Ask what rung of evidence it is. An uncontrolled benchmark number is rung two: a real measurement, but vulnerable to leakage, a mismatched test distribution, or an average that hides a failure. It becomes rung three only when you add a control that proves the test can fail when it should — for example, shuffling the labels and confirming the score collapses to chance. A number you have not controlled is a number you have not yet earned the right to trust.

What are the six rungs of the Evidence Ladder?

From weakest to strongest: (1) a vibe — it looked good in the demo; (2) a benchmark number, uncontrolled; (3) a benchmark with a control that can fail; (4) an adversarial test cleared by a working proof-of-concept; (5) an end-to-end measurement of the whole system on real traffic; (6) a tamper-evident track record, proven over time and impossible to cherry-pick. Each rung adds a defense against a specific way you could be fooling yourself.

What's the difference between a hypothesis and a finding in AI evaluation?

A hypothesis is a claim cleared by inspection — you looked at the system, found no problem, and concluded it works. That conflates 'there is no problem' with 'I did not find one'. A finding is a claim cleared by a test that actively tried to make it fail and could not. The move from hypothesis to finding — from 'I looked and it seemed fine' to 'I tried to break it and it held' — is the whole climb up the ladder.


The Evidence Ladder: A Free Worksheet to Grade What Your AI Can Actually Do (2026)

Published July 4, 2026 · Vita Indarra

Short answer: every claim about an AI system rests on evidence, and evidence comes in strengths. The mistake that ships broken AI is gathering weak evidence and making a strong decision on it: trusting a demo the way you'd trust a proof. The fix is a habit. For any claim you're about to act on, ask: what rung of evidence is this, and is it high enough for what I'm about to do with it? Below are the six rungs, the copy-paste worksheet, and the one distinction, hypothesis versus finding, that separates a number you can trust from a number that only looks like one.

The six rungs, weakest to strongest

1. Vibe: "it looked good in the demo / I'm pretty sure." Almost no defense against wanting it to be true.
2. Benchmark number: a real measurement, but uncontrolled. It is still vulnerable to a leak, a mismatched test set, or an average hiding the one failure that matters.
3. Benchmark with a control: a score you've shown can fail when it should. Shuffle the labels; if the number doesn't collapse to chance, it was measuring an artifact, not a signal.
4. Adversarial test + proof-of-concept: you stopped confirming the system on cooperative inputs and actively tried to break it. The claim is earned by the attack that failed.
5. End-to-end production measurement: the whole assembled system, on real traffic, measured on the outcome you care about. A component that wins in isolation can lose in the loop.
6. Tamper-evident track record: performance proven over time, recorded so it can't be cherry-picked or revised after the fact. This is the verifiable-publishing idea, applied to a system's own claims.
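
One weakness of an uncontrolled benchmark number, an average hiding the one failure that matters, is easy to see with a per-slice breakdown. The following is a minimal sketch in Python, not a method taken from the book: the function name `accuracy_by_slice`, the slice names `common` and `rare`, and the records themselves are all invented for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Return (overall_accuracy, {slice_name: accuracy}).

    Each record is a (slice_name, is_correct) pair. The slice names are
    whatever grouping matters for your decision; the ones below are invented.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_slice = {name: hits[name] / totals[name] for name in totals}
    return overall, per_slice

# Illustrative data only: 95 common cases, all correct;
# 5 rare cases, only 1 correct.
records = [("common", True)] * 95 + [("rare", True)] + [("rare", False)] * 4

overall, per_slice = accuracy_by_slice(records)
print(f"overall: {overall:.2f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.2f}")
```

With this made-up data the headline number is 0.96 while the rare slice sits at 0.20. If the rare slice is the one the decision depends on, the average was the wrong thing to look at.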

Match the rung to the stakes

The ladder isn't a demand that every claim reach the top — that's paralysis. It's a tool for matching the strength of your evidence to the cost of being wrong. A low-stakes internal tool can run on a benchmark number; if it's wrong, you lose an afternoon. A system that gates real money, makes irreversible decisions, or touches people's lives needs evidence near the top. The error teams make is not that they never gather strong evidence — it's that they gather vibe-level evidence and make rung-five decisions on it, feeling fully justified because the demo really did go well.

The cardinal distinction: hypothesis vs. finding

Underneath the whole ladder is one habit worth more than the rest combined. When you look at a system, find no problem, and conclude it's fine, you have a hypothesis — and the absence of a found problem is weak evidence, because it conflates "there is no problem" with "I did not find one." When you construct the specific scenario that would expose the problem, run it, and watch the system hold, you have a finding. Every rung up the ladder is a move from "I looked and it seemed fine" toward "I tried to make it fail and it didn't." A claim cleared by inspection is a hypothesis wearing a conclusion's clothes.

The worksheet — run any claim through it before you act

This is the actual worksheet from Evaluating AI Systems. Locate your claim honestly, then compare against what the decision requires. The gap between the two is your instruction.

CLAIM: ____________________
  WHAT RUNG IS MY EVIDENCE ACTUALLY ON?  (check the highest you've genuinely reached)
    [ ] 6  tamper-evident track record — proven over time, un-fakeable
    [ ] 5  end-to-end production measurement — whole system, real traffic
    [ ] 4  adversarial test + working proof-of-concept — I tried to break it; it held
    [ ] 3  benchmark WITH a control that collapsed when it should
    [ ] 2  benchmark number, uncontrolled
    [ ] 1  vibe — "it looked good / I'm pretty sure"   (feeling confident is NOT a rung)

  WHAT RUNG DOES THIS DECISION REQUIRE?  (set by STAKES, not by evidence on hand)
    Stakes of being wrong: ____________________   ->  required rung: ___
    (low / reversible -> 2-3 ok;  real money / irreversible / affects people -> 4-6)

  EVIDENCE RUNG ___  vs  REQUIRED RUNG ___
    evidence >= required -> calibrated, proceed
    evidence <  required -> GATHER MORE EVIDENCE or MAKE A SMALLER DECISION (don't ship the gap)

  HYPOTHESIS-OR-FINDING CHECK:
    Cleared by INSPECTION ("I looked, saw no problem")     = hypothesis (rung 1-2)
    Cleared by a TEST that tried to make it fail and didn't = finding    (rung 3+)
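
For teams that want the same check in a review script, the worksheet's comparison can be sketched in a few lines of Python. This is an illustrative encoding, not part of the original worksheet: the names `Claim`, `verdict`, and `is_finding` are hypothetical, and the rung you enter is still a judgment call the code cannot make for you.

```python
from dataclasses import dataclass

# Rung names as listed in the worksheet above.
RUNGS = {
    1: "vibe",
    2: "benchmark number, uncontrolled",
    3: "benchmark with a control",
    4: "adversarial test + proof-of-concept",
    5: "end-to-end production measurement",
    6: "tamper-evident track record",
}

@dataclass
class Claim:
    text: str
    evidence_rung: int  # highest rung genuinely reached
    required_rung: int  # set by stakes, not by the evidence on hand

def verdict(claim: Claim) -> str:
    """Compare evidence against what the decision requires."""
    for rung in (claim.evidence_rung, claim.required_rung):
        if rung not in RUNGS:
            raise ValueError(f"rung must be 1-6, got {rung}")
    if claim.evidence_rung >= claim.required_rung:
        return "calibrated, proceed"
    gap = claim.required_rung - claim.evidence_rung
    return f"gather more evidence or make a smaller decision (short by {gap})"

def is_finding(evidence_rung: int) -> bool:
    """Rung 3+ means a test tried to make the claim fail and could not."""
    return evidence_rung >= 3

# Hypothetical claim: a benchmark score offered as grounds for gating refunds.
claim = Claim("classifier is ready to gate refunds",
              evidence_rung=2, required_rung=4)
print(verdict(claim))
print("finding" if is_finding(claim.evidence_rung) else "hypothesis")
```

The example claim comes out two rungs short and is still a hypothesis, which is the case the worksheet is designed to catch before the decision ships.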

What this is, and what it isn't

The Evidence Ladder is the frame; it doesn't do the climbing. Getting a claim to rung three needs a real control (shuffle the labels, prove the test can fail). Rung four needs a genuine adversarial test. Rung five needs end-to-end measurement, because a better-scoring component can make a worse system. And rung six needs a tamper-evident record. Those disciplines — plus pre-registration, calibration, the ceiling diagnosis, and honest reporting ("above chance, not oracle") — are the method. This page gives you the map for free.
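
To make "tamper-evident" concrete, here is one common construction, a hash chain, sketched with Python's standard library. It is an assumption about how such a record could be kept, not a description of the book's method; the function names and the weekly accuracy entries are invented for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def append_entry(log, payload):
    """Append a result, chaining it to the hash of the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    log.append(entry)
    return entry

def verify(log):
    """Return True only if no entry was edited, dropped, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True

# Invented weekly results, including a bad week that stays on the record.
log = []
append_entry(log, {"week": 1, "metric": "accuracy", "value": 0.81})
append_entry(log, {"week": 2, "metric": "accuracy", "value": 0.64})
append_entry(log, {"week": 3, "metric": "accuracy", "value": 0.83})
intact = verify(log)

# Quietly improving the bad week breaks the chain.
log[1]["payload"]["value"] = 0.84
after_edit = verify(log)
print(intact, after_edit)
```

A chain like this only detects tampering if the latest hash is also published somewhere the record's owner cannot rewrite. Otherwise the whole chain can simply be recomputed after an edit.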

Frequently asked

Isn't a high benchmark score enough to ship?

Only for a low-stakes decision. A benchmark is rung two; gating real consequences is a rung-four-or-higher decision. The score isn't wrong — it's just weaker evidence than the decision needs, and the mismatch is where confident failures come from.

What's the single cheapest way to strengthen a claim?

A control. Re-run your evaluation in a version where success should be impossible — shuffle the labels, disable the thing you're testing — and confirm the result collapses. A test that passes even when it should fail is verifying nothing.
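
A shuffled-label control might look like the sketch below. It is a toy under stated assumptions: the task (predicting the parity of an integer), the function names, and the deliberately broken `lenient_score` are all invented for illustration. It permutes evaluation labels against fixed predictions; for a trained model, the analogous control is to retrain on shuffled labels and re-evaluate.

```python
import random

def accuracy(predictions, labels):
    """Fraction of predictions that match their labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

def shuffled_label_control(score_fn, predictions, labels, trials=20, seed=0):
    """Mean score after randomly permuting the labels, over several trials."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        permuted = labels[:]  # copy so the real labels stay untouched
        rng.shuffle(permuted)
        scores.append(score_fn(predictions, permuted))
    return sum(scores) / len(scores)

# Toy balanced binary task: the label is the parity of the input.
inputs = list(range(1000))
labels = [x % 2 for x in inputs]
predictions = [x % 2 for x in inputs]  # stand-in for a model's outputs

real = accuracy(predictions, labels)
control = shuffled_label_control(accuracy, predictions, labels)
print(f"real={real:.2f} control={control:.2f}")  # control should sit near 0.5

# A deliberately broken scorer: counts any non-empty prediction as correct.
def lenient_score(predictions, labels):
    return sum(p is not None for p in predictions) / len(labels)

broken_control = shuffled_label_control(lenient_score, predictions, labels)
print(f"broken scorer under shuffled labels: {broken_control:.2f}")
```

The honest scorer falls to roughly chance once the labels are shuffled, so its real score depends on the labels. The lenient scorer reports a perfect score either way, which is the signature of a test that cannot fail.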

How does this connect to how I know an AI's real capability?

Directly. See the plain-English companion, "How do you know what an AI can actually do?" This worksheet is the tool that question points to.

Go deeper

The field guide behind this worksheet

The Evidence Ladder is one chapter of Evaluating AI Systems, the field guide to finding out what an AI system can actually do, as opposed to what its scoreboard says. It covers the five lies a metric tells, pre-registration, controls, adversarial evaluation, the ceiling diagnosis, and honest reporting, all drawn from real systems built, measured, and caught being confidently wrong. Live on Amazon.
