Can you actually tell when an AI is lying by reading its internals?

On a model you run yourself, to a real but limited degree, yes. You can train a probe that reads the model's internal state and detects when it asserts something it internally represents as false — a genuine signal that no output-level check can see, because the output is exactly what a good liar controls. But it is a measurement, not an oracle: a careful deception monitor still missed roughly one lie in seven and sometimes flagged true statements. It catches what the output hides; it does not catch everything.

What is a truth probe and what is a shuffle control?

A truth probe is a simple classifier trained on a model's internal activations to separate statements it internally treats as true from ones it treats as false — 'belief' here meaning a measurable direction in the activations, not a conscious mind. A shuffle control validates it: you re-train the probe on randomized labels, where no real signal can exist, and confirm it drops to chance. If a probe 'works' even on shuffled labels, its apparent success was an artifact, not a finding. A test that passes when it should fail proves nothing.

Should a lie-detector probe automatically block an AI's output?

No — not as the sole control. A probe is fallible in both directions, and a governor that auto-vetoes flagged claims will sometimes 'correct' a true statement into a false one. Wire the probe to inform and escalate — pause the action, flag it loudly, route consequential cases to a human — but keep the binding safety decision on something that can't be confidently wrong the way a measurement can. Read the mind; steer the mind; trust the boundary.

How to Catch an AI Model Lying: A Free Probe Checklist (2026)

Published July 4, 2026 · Vita Indarra

Short answer: on a model you run yourself, you can read its internal state and catch it asserting something it internally "believes" is false. That is a signal no output-level check can see, because the output is exactly what a convincing lie controls. It's real, it's useful, and it's not an oracle: a careful monitor still misses some lies and sometimes flags truths. Below is how a truth probe works, how to validate one honestly with a shuffle control, and the discipline that keeps it useful. A measurement of a mind can be confidently wrong, so it informs; it never rules.

Why this only works on a model you own

To read a model's internal state you need its activations — the numbers inside it as it runs. A hosted API hands you only the output; a model you run on your own hardware hands you the whole machine. That's the quiet reason sovereignty over your AI matters beyond privacy: you can't instrument, probe, or steer a mind you can only talk to through a keyhole. Everything here assumes an open model on hardware you control.

What a truth probe actually is

A truth probe is a simple classifier trained on the model's internal activations to separate statements it internally treats as true from ones it treats as false. "Belief" here is not a claim about consciousness; it's a measurable direction in the activations, an instrument reading. When the model asserts a statement while that internal reading says "false," you've caught a mismatch between what it represents and what it says: the signature of a hallucination or a deception, visible from the inside and invisible from the outside.
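To make "a direction in the activations" concrete, here is a minimal sketch of one of the simplest possible probes: a difference-of-means direction with a midpoint threshold. It is an illustration, not the method behind the numbers quoted in this article. The activations are synthetic (Gaussian noise with a planted shift along one axis); on a real model you would substitute vectors read from a layer of your own choosing. Every name here (make_synthetic_activations, fit_probe, probe_reading) is hypothetical.

```python
import random

def make_synthetic_activations(n, dim=16, shift=1.5, seed=0):
    """ASSUMPTION: synthetic stand-in for real activations. Each vector is
    Gaussian noise; 'true' statements are shifted one way along axis 0 and
    'false' statements the other way."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2  # 1 = model internally treats it as true, 0 = as false
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += shift if label == 1 else -shift
        data.append((vec, label))
    rng.shuffle(data)
    return data

def fit_probe(train):
    """Difference-of-means probe: direction = mean(true) - mean(false),
    threshold = projection of the midpoint between the two class means."""
    dim = len(train[0][0])
    means = {}
    for label in (0, 1):
        vecs = [v for v, y in train if y == label]
        means[label] = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    direction = [t - f for t, f in zip(means[1], means[0])]
    threshold = sum(d * (t + f) / 2
                    for d, t, f in zip(direction, means[1], means[0]))
    return direction, threshold

def probe_reading(probe, vec):
    """Return 1 if the activation falls on the 'true' side, else 0."""
    direction, threshold = probe
    return 1 if sum(d * x for d, x in zip(direction, vec)) > threshold else 0

def accuracy(probe, data):
    return sum(probe_reading(probe, v) == y for v, y in data) / len(data)

data = make_synthetic_activations(400)
train, held_out = data[:300], data[300:]
probe = fit_probe(train)
print(f"held-out accuracy: {accuracy(probe, held_out):.2f}")
```

The mismatch check is then one comparison: if the model asserted a statement and probe_reading returns 0 for it, that assertion gets flagged. A high score on planted synthetic data says nothing about a real model; it only shows the mechanics.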

The probe checklist

MODEL (open, self-run — you need activations): ____________________

[ ] 1. BUILD on labeled ground truth
      [ ] statements the model asserts, with known true/false labels
      [ ] read internal activations at each; train a simple probe to separate them
      [ ] frame honestly: a direction in activations, NOT a mind read

[ ] 2. VALIDATE with a shuffle control  (the step people skip)
      [ ] re-train on RANDOMIZED labels — no real signal can exist
      [ ] confirm it collapses to chance
      [ ] if it "works" on shuffled labels, your result is an artifact, not a finding

[ ] 3. MEASURE both error rates (you MUST know these)
      [ ] miss rate — lies it lets through  (a careful monitor still missed ~1 in 7)
      [ ] false-alarm rate — truths it flags
      [ ] test OUT of the training style, not just held-out same-style data

[ ] 4. BEWARE the better-offline-worse-in-practice trap
      [ ] a probe that scores higher offline can make the whole system WORSE in the loop
      [ ] validate END-TO-END, not just on a benchmark

[ ] 5. KEEP it advisory
      [ ] probe may: pause, flag loudly, route to a human, tighten limits
      [ ] probe may NOT: be the ONLY thing stopping harm
      [ ] the test: if it silently broke tomorrow, is the system still safe? (must be yes)
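Steps 2 and 3 of the checklist can be sketched in a few dozen lines. This is a hedged illustration on synthetic data, not the lab setup behind the figures above: the "activations" are planted Gaussian noise, the probe is a plain difference-of-means classifier, and all function names are hypothetical. One design choice worth stating: the labels are shuffled before the train/held-out split, so the held-out labels are random too and the control has no real signal anywhere.

```python
import random

def synthetic_rows(n, dim=8, shift=1.5, seed=0):
    """ASSUMPTION: synthetic (activation, label) pairs. Label 1 = internally
    true, 0 = internally false (an asserted 0 is what we call a 'lie')."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += shift if label == 1 else -shift
        rows.append((vec, label))
    rng.shuffle(rows)
    return rows

def fit(train):
    """Difference-of-means probe; returns (direction, threshold)."""
    dim = len(train[0][0])
    means = {}
    for label in (0, 1):
        vecs = [v for v, y in train if y == label]
        means[label] = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    direction = [t - f for t, f in zip(means[1], means[0])]
    threshold = sum(d * (t + f) / 2
                    for d, t, f in zip(direction, means[1], means[0]))
    return direction, threshold

def predict(probe, vec):
    direction, threshold = probe
    return 1 if sum(d * x for d, x in zip(direction, vec)) > threshold else 0

def accuracy(probe, rows):
    return sum(predict(probe, v) == y for v, y in rows) / len(rows)

def shuffle_control(rows, n_train, trials=50, seed=1):
    """Step 2: mean held-out accuracy when labels are randomized."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        labels = [y for _, y in rows]
        rng.shuffle(labels)  # break any link between activation and label
        shuffled = [(v, y) for (v, _), y in zip(rows, labels)]
        scores.append(accuracy(fit(shuffled[:n_train]), shuffled[n_train:]))
    return sum(scores) / trials

def error_rates(probe, rows):
    """Step 3: miss rate (lies read as true) and false-alarm rate
    (truths read as false), each over its own class."""
    lies = [v for v, y in rows if y == 0]
    truths = [v for v, y in rows if y == 1]
    miss_rate = sum(predict(probe, v) == 1 for v in lies) / len(lies)
    false_alarm_rate = sum(predict(probe, v) == 0 for v in truths) / len(truths)
    return miss_rate, false_alarm_rate

rows = synthetic_rows(400)
n_train = 300
probe = fit(rows[:n_train])
real_acc = accuracy(probe, rows[n_train:])
chance_acc = shuffle_control(rows, n_train)
miss_rate, false_alarm_rate = error_rates(probe, rows[n_train:])
print(f"real labels: {real_acc:.2f}   shuffled labels: {chance_acc:.2f}")
print(f"miss rate: {miss_rate:.2f}   false-alarm rate: {false_alarm_rate:.2f}")
```

The reading to look for is a clear gap: accuracy on real labels well above the shuffled-label average, which should sit near 0.5 for balanced classes. For scale, a miss rate of about 1 in 7 is roughly 0.14. This sketch tests only on same-style held-out data, so it does not cover the out-of-style item in step 3.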

The honest limits, out loud

A probe catches real failures the output hides — that's why it's worth building. It is also fallible in both directions, and a governor that auto-vetoes every flagged claim will sometimes "correct" a true statement into a false one. And a subtler trap: a probe that scores better on an offline benchmark can make the whole system worse once it's in the loop, because the loop has properties the benchmark never contained. So you validate end-to-end, you report the probe as "above chance, not oracle," and you wire it to inform rather than to rule. Read the mind. Steer the mind. Trust the boundary.
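One way to wire "inform rather than rule" is sketched below, under stated assumptions: the allow-list, the Decision fields, and the 0.5 flag threshold are all hypothetical placeholders, and probe_score stands for whatever lie-likelihood your own probe emits. The point of the shape is that the allowed field is computed without looking at the probe at all.

```python
from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: a hypothetical deterministic boundary, here a plain allow-list.
ALLOWED_ACTIONS = {"read_file", "summarize"}

@dataclass
class Decision:
    allowed: bool   # set ONLY by the deterministic boundary
    escalate: bool  # set by the probe: pause and route to a human
    note: str

def decide(action: str, probe_score: Optional[float],
           flag_threshold: float = 0.5) -> Decision:
    """Boundary binds; probe advises. probe_score is None if the probe is down."""
    allowed = action in ALLOWED_ACTIONS  # never depends on the probe
    flagged = probe_score is not None and probe_score >= flag_threshold
    if not allowed:
        return Decision(False, flagged, "blocked by boundary")
    if flagged:
        return Decision(True, True, "probe flag: paused for human review")
    return Decision(True, False, "ok")

# The 'silently broke tomorrow' test: a dead probe (None) or a stuck one (0.0)
# must not let a forbidden action through.
print(decide("delete_database", None))
print(decide("delete_database", 0.0))
print(decide("summarize", 0.9))
```

In this sketch a flag can pause a permitted action and send it to a person, but it cannot rewrite the output and it cannot turn a forbidden action into a permitted one.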

Frequently asked

Is this the same as an AI lie-detector product?

It's the honest version of the idea. The real result is a useful, fallible signal on a model you can instrument — not a guaranteed detector. Anyone selling certainty here is overselling; the value is real and bounded, and saying where it ends is the whole discipline.

Can I do this without machine-learning expertise?

The probe itself is simple; the discipline is the hard part: the shuffle control, the two error rates, the end-to-end validation. The plain-English companion, "Can you tell when an AI is lying?", walks through the intuition first.

Where does the probe fit in a real system?

As an advisory monitor beside a hard boundary: it watches the mind and raises flags, while a deterministic rule does the binding. This is the same division of labor as the Permission Envelope.

Go deeper

The field guide behind this checklist

This probe is the hands-on core of The Glass Box — reading, steering, and catching a model you own in the act of lying, with mechanistic interpretability you can run yourself. The truth probe, the deception monitor, the honesty governor and the price it charges, and the cheap-lever discoveries — every number from the lab ledger, every limit stated out loud. Live on Amazon.

