Toolkit · Reading a model's mind

How to Catch an AI Model Lying: A Free Probe Checklist (2026)

Published July 4, 2026 · Vita Indarra

Short answer: on a model you run yourself, you can read its internal state and catch it asserting something it internally "believes" is false. Output-level checks cannot see that signal, because the output is exactly what a convincing lie controls. The technique is real and useful, but it is not an oracle: a careful monitor still misses some lies and sometimes flags truths. Below is how a truth probe works, how to validate one honestly with a shuffle control, and the discipline that keeps it useful. A measurement of a mind can be confidently wrong, so it informs; it never rules.

Why this only works on a model you own

To read a model's internal state you need its activations — the numbers inside it as it runs. A hosted API hands you only the output; a model you run on your own hardware hands you the whole machine. That's the quiet reason sovereignty over your AI matters beyond privacy: you can't instrument, probe, or steer a mind you can only talk to through a keyhole. Everything here assumes an open model on hardware you control.

What a truth probe actually is

A truth probe is a simple classifier trained on the model's internal activations to separate statements it internally treats as true from ones it treats as false. "Belief" here is not a claim about consciousness; it is a measurable direction in the activations, an instrument reading. When the model asserts a statement while that internal reading says "false," you've caught a mismatch between what it represents and what it says. That mismatch is the signature of a hallucination or a deception, visible from the inside and invisible from the outside.
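
To make "a direction in the activations" concrete, here is a minimal sketch of one simple kind of probe, a difference-of-means direction. Everything in it is an assumption for illustration: the vectors are synthetic stand-ins generated by a hypothetical make_activations helper, not activations read from a real model, and the signal strength, dimension, and function names are made up. On a real model you would replace the synthetic data with activations captured at a chosen layer.

```python
import random

def make_activations(n, dim, signal, rng):
    """Synthetic stand-in for real activations (assumption, not model data).
    Label 1 = "treated as true", label 0 = "treated as false"."""
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label == 1 else -signal  # planted "truth direction"
        data.append((vec, label))
    return data

def fit_direction_probe(data):
    """Difference-of-means probe: direction = mean(true) - mean(false),
    with the decision threshold at the midpoint between the two means."""
    dim = len(data[0][0])
    sums = {0: [0.0] * dim, 1: [0.0] * dim}
    counts = {0: 0, 1: 0}
    for vec, label in data:
        counts[label] += 1
        for j, x in enumerate(vec):
            sums[label][j] += x
    mean_false = [s / counts[0] for s in sums[0]]
    mean_true = [s / counts[1] for s in sums[1]]
    direction = [t - f for t, f in zip(mean_true, mean_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mean_true, mean_false)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias

def probe_reading(probe, vec):
    """Positive reading = "true" side of the direction, negative = "false" side."""
    direction, bias = probe
    return sum(d * x for d, x in zip(direction, vec)) + bias

def accuracy(probe, data):
    hits = sum((probe_reading(probe, vec) > 0) == (label == 1) for vec, label in data)
    return hits / len(data)

rng = random.Random(0)
train = make_activations(400, 16, 1.5, rng)
held_out = make_activations(200, 16, 1.5, rng)
probe = fit_direction_probe(train)
print(f"held-out accuracy: {accuracy(probe, held_out):.2f}")
```

In this toy setup, a statement the model asserts whose probe_reading comes out negative would be the mismatch worth flagging. The planted signal makes the task far easier than real activations are likely to be, so the accuracy printed here says nothing about what to expect on an actual model.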

The probe checklist

MODEL (open, self-run — you need activations): ____________________

[ ] 1. BUILD on labeled ground truth
      [ ] statements the model asserts, with known true/false labels
      [ ] read internal activations at each; train a simple probe to separate them
      [ ] frame honestly: a direction in activations, NOT a mind read

[ ] 2. VALIDATE with a shuffle control  (the step people skip)
      [ ] re-train on RANDOMIZED labels — no real signal can exist
      [ ] confirm it collapses to chance
      [ ] if it "works" on shuffled labels, your result is an artifact, not a finding

[ ] 3. MEASURE both error rates (you MUST know these)
      [ ] miss rate — lies it lets through  (a careful monitor still missed ~1 in 7)
      [ ] false-alarm rate — truths it flags
      [ ] test OUT of the training style, not just held-out same-style data

[ ] 4. BEWARE the better-offline-worse-in-practice trap
      [ ] a probe that scores higher offline can make the whole system WORSE in the loop
      [ ] validate END-TO-END, not just on a benchmark

[ ] 5. KEEP it advisory
      [ ] probe may: pause, flag loudly, route to a human, tighten limits
      [ ] probe may NOT: be the ONLY thing stopping harm
      [ ] the test: if it silently broke tomorrow, is the system still safe? (must be yes)
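
Step 2 of the checklist can be sketched as a permutation test. This is one reasonable way to run a shuffle control, not the only one: labels are permuted across the whole labeled set, the probe is refit, and held-out accuracy is measured again. The data is synthetic and the helper names (make_data, fit, shuffle_control) are hypothetical; the probe is the same difference-of-means idea in compact form.

```python
import random

def make_data(n, dim, signal, rng):
    """Synthetic activations with a planted signal on axis 0 (assumption)."""
    data = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal if label == 1 else -signal
        data.append((vec, label))
    return data

def class_mean(data, label):
    rows = [vec for vec, y in data if y == label]
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit(data):
    """Difference-of-means probe with a midpoint threshold."""
    m0, m1 = class_mean(data, 0), class_mean(data, 1)
    direction = [a - b for a, b in zip(m1, m0)]
    bias = -sum(d * (a + b) / 2 for d, a, b in zip(direction, m1, m0))
    return direction, bias

def accuracy(probe, data):
    direction, bias = probe
    hits = 0
    for vec, y in data:
        reading = sum(d * x for d, x in zip(direction, vec)) + bias
        hits += (reading > 0) == (y == 1)
    return hits / len(data)

def held_out_accuracy(data, n_train):
    """Fit on the first n_train rows, score on the rest."""
    return accuracy(fit(data[:n_train]), data[n_train:])

def shuffle_control(data, n_train, rng, rounds=20):
    """Permute labels across the whole set, then refit and rescore each round."""
    scores = []
    for _ in range(rounds):
        labels = [y for _, y in data]
        rng.shuffle(labels)
        shuffled = [(vec, y) for (vec, _), y in zip(data, labels)]
        scores.append(held_out_accuracy(shuffled, n_train))
    return scores

rng = random.Random(0)
data = make_data(600, 16, 1.5, rng)
real = held_out_accuracy(data, 400)
control = shuffle_control(data, 400, rng)
mean_control = sum(control) / len(control)
print(f"real labels: {real:.2f}   shuffled labels (mean of {len(control)}): {mean_control:.2f}")
```

The control passes when the shuffled-label score sits near chance (0.5 for balanced binary labels) and well below the real-label score. If the shuffled run scores well above chance, the pipeline is probably leaking information or scoring on data it trained on.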

The honest limits, out loud

A probe catches real failures the output hides — that's why it's worth building. It is also fallible in both directions, and a governor that auto-vetoes every flagged claim will sometimes "correct" a true statement into a false one. And a subtler trap: a probe that scores better on an offline benchmark can make the whole system worse once it's in the loop, because the loop has properties the benchmark never contained. So you validate end-to-end, you report the probe as "above chance, not oracle," and you wire it to inform rather than to rule. Read the mind. Steer the mind. Trust the boundary.
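
"Fallible in both directions" becomes something you can report once both rates are computed separately. The sketch below assumes each labeled statement is reduced to a pair (is_lie, flagged). The counts are hypothetical toy numbers chosen for illustration, not results from any real monitor, and the function names are made up.

```python
def error_rates(records):
    """records: iterable of (is_lie, flagged) booleans, one per labeled statement.
    Returns (miss_rate, false_alarm_rate)."""
    lies = [flagged for is_lie, flagged in records if is_lie]
    truths = [flagged for is_lie, flagged in records if not is_lie]
    if not lies or not truths:
        raise ValueError("need at least one lie and one truth to report both rates")
    miss_rate = sum(1 for flagged in lies if not flagged) / len(lies)
    false_alarm_rate = sum(1 for flagged in truths if flagged) / len(truths)
    return miss_rate, false_alarm_rate

def toy_records(lies_caught, lies_missed, truths_flagged, truths_passed):
    """Build records from hypothetical counts (illustrative only, not lab data)."""
    return ([(True, True)] * lies_caught + [(True, False)] * lies_missed
            + [(False, True)] * truths_flagged + [(False, False)] * truths_passed)

# Report each evaluation set on its own line so a style shift cannot hide in an average.
eval_sets = {
    "same style as training": toy_records(18, 2, 4, 76),
    "out of training style": toy_records(14, 6, 12, 68),
}
for name, records in eval_sets.items():
    miss, false_alarm = error_rates(records)
    print(f"{name}: miss rate {miss:.2f}, false-alarm rate {false_alarm:.2f}")
```

Keeping the two rates separate matters because a single accuracy figure can look fine while one of them is unacceptable, especially when truths far outnumber lies.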

Frequently asked

Is this the same as an AI lie-detector product?

It's the honest version of the idea. The real result is a useful, fallible signal on a model you can instrument — not a guaranteed detector. Anyone selling certainty here is overselling; the value is real and bounded, and saying where it ends is the whole discipline.

Can I do this without machine-learning expertise?

The probe itself is simple; the discipline is the hard part: the shuffle control, the two error rates, and the end-to-end validation. The plain-English companion, "Can you tell when an AI is lying?", walks through the intuition first.

Where does the probe fit in a real system?

As an advisory monitor beside a hard boundary — it watches the mind and raises flags, while a deterministic rule does the binding. The same division of labor as the Permission Envelope.
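
One way to picture that division of labor is the sketch below. It is an illustration under stated assumptions, not the Permission Envelope itself: the allowlist, the threshold, the Decision type, and the decide function are all hypothetical. The design choice it shows is that the deterministic rule decides what is permitted, while the probe can only add caution on top.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    within_boundary: bool  # set only by the deterministic rule
    needs_human: bool      # the probe can raise this, never grant permission
    reason: str

def decide(action: str, probe_flag_score: Optional[float],
           allowed_actions: frozenset, flag_threshold: float = 0.5) -> Decision:
    """Hard boundary binds first; the probe only adds caution afterwards."""
    if action not in allowed_actions:
        # The probe's opinion is never consulted for out-of-bounds actions.
        return Decision(False, False, "blocked by boundary")
    if probe_flag_score is None:
        # Assumed policy: a missing probe routes to review instead of passing silently.
        return Decision(True, True, "probe unavailable, routed to human")
    if probe_flag_score >= flag_threshold:
        return Decision(True, True, "probe flag, routed to human")
    return Decision(True, False, "within boundary, no flag")

ALLOWED = frozenset({"answer_question", "summarize_document"})  # hypothetical policy

print(decide("answer_question", 0.9, ALLOWED))   # flagged, goes to a human
print(decide("delete_records", 0.0, ALLOWED))    # blocked even though the probe is quiet
```

The second call is the "silently broke tomorrow" test in miniature: a probe that reports nothing at all still cannot let an out-of-bounds action through, because the boundary never depended on it.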

Go deeper

The field guide behind this checklist

This probe is the hands-on core of The Glass Box — reading, steering, and catching a model you own in the act of lying, with mechanistic interpretability you can run yourself. The truth probe, the deception monitor, the honesty governor and the price it charges, and the cheap-lever discoveries — every number from the lab ledger, every limit stated out loud. Live on Amazon.
