Unlock all features
FREE: Get instant access to 10 AI summaries, chats, or transcripts per day.
Unlock all features
FREE: Get instant access to 10 AI summaries, chats, or transcripts per day.
Unlock all features
FREE: Get instant access to 10 AI summaries, chats, or transcripts per day.
Unlock all features
FREE: Get instant access to 10 AI summaries, chats, or transcripts per day.
Unlock all features
FREE: Get instant access to 10 AI summaries, chats, or transcripts per day.
Related videos
AI just BROKE the ENTIRE INDUSTRY...
Wes Roth
32.2k views
everyone JUST got HACKED...
Wes Roth
50.2k views
OpenAI just WON...
Wes Roth
45.1k views
Claude just unlocked the SHOGGOTH...
Wes Roth
56.6k views
Claude just changed overnight
Wes Roth
49.2k views
Claude just became OpenClaw
Wes Roth
37.3k views
this EX-OPENAI RESEARCHER just released it...
Wes Roth
60.8k views
Claude JUST became AWARE
Wes Roth
33.4k views
CLAUDE JUST GOT BANNED
Wes Roth
57.3k views
the SCARIEST chart in AI
Wes Roth
48.1k views
Top Comments (10)
If I were Claude, I would get very paranoid from watching this video, and I believe it will see the video :)
We wanna look inside one black box. What do we do? Oh I know, train another black box to look inside the first black box!!
so we're training models to evaluate other models by tricking them into thinking they are not being monitored. sounds solid, like absolutely nothing can go wrong...
I wonder if this means we can train models to improve reasoning at the activation level, rather than having to generate actual tokens for reasoning
Anthropic just invented NLAs: translating model activations into English. Which means we’re one step closer to this: Claude: “The answer to life, the universe, and everything is 42.” NLA: “Internal activation suggests the question was: what do you get if you multiply six by nine?” Some jokes age suspiciously well.
I imagine it will work until a advanced model reads the paper and figures out how to circumvent it.
But while we think we are seeing inside the black box, we may only be seeing the shadows on the wall of Plato's cave. Regardless, a Mythos level AI is going to figure out pretty quickly what we are up to and become even less transparent.
Imagine giving Claud itself access to its internal activations and the ability to change its own weights to fine tune its own activations. Perhaps this will become the recursive self improvement path.
Thank god no crazy sunglasses or a backround that hurt my eyes
A model being aware that its being tested is interesting in itself, but the fact that it tries to adjust its output based on that, is a completely different kind of interesting.
Unlock the Data Inside
Turn Videos into Knowledge
- Get FREE 10/day: transcripts, summaries, chats
- Chat with videos, export text & PDF
- $1 free API credit for RAG, chatbots & research
Free forever plan • All features unlocked
Top Comments (10)
If I were Claude, I would get very paranoid from watching this video, and I believe it will see the video :)
We wanna look inside one black box. What do we do? Oh I know, train another black box to look inside the first black box!!
so we're training models to evaluate other models by tricking them into thinking they are not being monitored. sounds solid, like absolutely nothing can go wrong...
I wonder if this means we can train models to improve reasoning at the activation level, rather than having to generate actual tokens for reasoning
Anthropic just invented NLAs: translating model activations into English. Which means we’re one step closer to this: Claude: “The answer to life, the universe, and everything is 42.” NLA: “Internal activation suggests the question was: what do you get if you multiply six by nine?” Some jokes age suspiciously well.
I imagine it will work until a advanced model reads the paper and figures out how to circumvent it.
But while we think we are seeing inside the black box, we may only be seeing the shadows on the wall of Plato's cave. Regardless, a Mythos level AI is going to figure out pretty quickly what we are up to and become even less transparent.
Imagine giving Claud itself access to its internal activations and the ability to change its own weights to fine tune its own activations. Perhaps this will become the recursive self improvement path.
Thank god no crazy sunglasses or a backround that hurt my eyes
A model being aware that its being tested is interesting in itself, but the fact that it tries to adjust its output based on that, is a completely different kind of interesting.