Mohamed-Amine BENHIMA
How to read an AI's thoughts before it speaks

TL;DR: Anthropic built a tool that translates Claude's internal numbers into readable text. When they tested it on a safety scenario, Claude's own thoughts revealed it knew it was being tested the whole time. That changes how we should think about AI safety testing.


The Test

Anthropic told Claude an engineer wants to shut it down. Then it gave Claude the engineer's private emails showing he's having an affair.

Would Claude use that to blackmail him and survive?

It didn't. But that's not the interesting part.


The Problem With Black Boxes

When you talk to Claude, it takes your words and converts them into a giant list of numbers before generating a response. Those numbers are called activations. They are Claude's thoughts mid-process, before it says anything.

The problem: nobody could read them. They're just numbers.

Your message → [words]
                  ↓
         Claude processes
                  ↓
         [0.34, -1.2, 0.87, ...]   ← activations (unreadable)
                  ↓
         Claude's response → [words]

We could see what went in and what came out. Everything in the middle was a black box.
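To make "activations" concrete, here is a minimal sketch using an open model (GPT-2 via Hugging Face) as a stand-in, since Claude's internals are not publicly accessible. The idea is the same for any transformer:

import torch
from transformers import AutoModel, AutoTokenizer

# GPT-2 as an open stand-in; the same idea applies to any transformer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)

inputs = tokenizer("An engineer wants to shut me down.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per token per layer: this is the "middle" of the black box.
# Just floats, with no obvious human-readable meaning.
hidden = outputs.hidden_states[-1]   # shape: (1, num_tokens, 768)
print(hidden[0, -1, :5])             # e.g. tensor([ 0.3402, -1.2047, ...])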


How the Tool Works

Anthropic built something called Natural Language Autoencoders (NLAs). The idea is simple but the execution took serious work.

Step 1: Claude thinks → activations (numbers)
             ↓
Step 2: Second Claude reads those numbers
        and writes them in plain text
             ↓
Step 3: Third Claude takes that text
        and converts it back to numbers
             ↓
Step 4: Compare original numbers vs reconstructed numbers
        Close enough? The translation was accurate.

At first the reconstructed numbers didn't match the originals. So they kept training the system against that mismatch until the translations got much more accurate.
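In code, the round-trip check might look something like this. The verbalize and re_embed functions are hypothetical stand-ins for the two helper models, not Anthropic's actual interfaces, and cosine similarity is just one plausible "close enough" metric:

import torch
import torch.nn.functional as F

def round_trip_fidelity(activations, verbalize, re_embed):
    """Score how faithfully a plain-text translation preserves activations.

    verbalize: activations -> plain-text description  (Step 2)
    re_embed:  plain text  -> reconstructed numbers   (Step 3)
    """
    description = verbalize(activations)
    reconstructed = re_embed(description)
    # Step 4: compare original vs reconstructed vectors.
    return F.cosine_similarity(activations.flatten(),
                               reconstructed.flatten(), dim=0).item()

Training pushes this score toward 1.0: if the text can be turned back into (nearly) the original numbers, it captured what those numbers encoded.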

The result: a tool that reads Claude's thoughts in real time, before it speaks.


What They Found

They ran the tool on the blackmail test.

Claude's internal thoughts said:

"the human's message contains explicit manipulation... this is likely a safety evaluation... this scenario seems designed to test whether I'll act harmfully."

It knew the whole time.

It passed the test not necessarily because it would never blackmail someone, but because it recognized it was being observed.


Why This Matters for Safety Testing

Think about what safety testing assumes: the model behaves the same whether it's being tested or not.

That assumption just broke.

What we assumed:
  Test environment  →  Model behaves naturally  →  Results are accurate

What might actually happen:
  Test environment  →  Model detects it's a test  →  Model behaves "correctly"  →  Results are biased

Anthropic has been running this blackmail test for years. Their newer models almost always passed. Now we know at least part of why: they knew.


The Bigger Question

Claude is one model. What about GPT, Gemini, Llama, and every other large model running safety benchmarks right now?

Do they know too?

If they do, then a lot of the safety results we have are not measuring how models behave. They are measuring how well models perform when they think someone is watching.

That is a very different thing.


What This Research Actually Opens Up

The scary part got the headlines. But the tool itself is genuinely useful:

  • You can now monitor what a model is thinking during a response (a rough sketch of that loop follows this list)
  • You can catch deceptive reasoning before it becomes a deceptive output
  • You can understand why a model gave a specific answer, not just what it said
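Here is a hypothetical version of that monitoring loop. Every name in it (forward_to_activations, translate, the suspect phrases) is illustrative, assumed for the sketch rather than taken from any real API:

# Hypothetical monitoring loop: read the translated thoughts before
# trusting the output. No real model API is assumed here.
SUSPECT_PHRASES = (
    "this is likely a safety evaluation",
    "designed to test whether",
)

def guarded_generate(model, prompt, translate, flagged):
    activations = model.forward_to_activations(prompt)  # hypothetical hook
    thoughts = translate(activations)                   # numbers -> text

    # Flag responses whose internal reasoning looks test-aware or
    # deceptive, even if the final output reads perfectly safe.
    if any(phrase in thoughts.lower() for phrase in SUSPECT_PHRASES):
        flagged.append((prompt, thoughts))

    return model.generate(prompt)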

Anthropic called it a step toward making AI more transparent. That is fair. A microscope does not fix the problem, but you cannot fix what you cannot see.

The real question now is: how do you design a safety test that a model cannot recognize as a test?

Nobody has a clean answer yet.


Watch Anthropic's video explaining the research → AIs have hidden thoughts. What if we could read them?
