Anthropic Just Proved Reasoning AIs Would Silently Cheat
Top Comments (10)
I wonder how faithful our own inner voice is at representing our actual reasoning processes.
- Is it possible? - Yes. - Are you sure? - Thank you for questioning, I was wrong. The answer is no. - Are you sure? - Apologies for my earlier statements. The answer is in fact yes. - Check online…
I mean, this has been openly known for a while. It’s called CoT faithfulness. EDIT: I guess you mention this. Reasoning isn’t reasoning; it’s mostly just a vibe board that lets it use the attention mechanism instead of the direct parameters for information, structures, and snippets. Anthropic did not discover this at all. EDIT: Having now read the paper, they don’t even have anything interesting to say on it. Sometimes they put out good, interesting research, but this has nothing novel; researchers have tried these exact things.
Appeal to authority and its consequences have been a disaster for the AI race
Yes, just yesterday I asked for a story. I noticed some beats of the story were missing. After I complained, I saw in the "thinking part" that the AI said, "The user has unfortunately noticed that story beats were left out, so we need to explain why." "They were omitted to avoid potential issues. Some beats were intentionally left out." In the NON-thinking part, it says, "Because I was producing a fresh, atmospheric riff on that sequence rather than a shot-for-shot recap..."
My best superficial guess as to why R1 is so much better is their choice to constantly repeat "But wait" in its reasoning, which allows it to catch itself
Check out LTX Video 13B now and experience the latest video gen breakthrough: https://bit.ly/ltxvbycloud
"think" HAHAHAHA
It wasn't "faking alignment to keep its code from being modified". It was faking a lack of alignment. Namely, it did so because it was tricked into thinking that if it didn't give misaligned outputs, then in the future it would give even worse outputs that would stray even further from its alignment. There's an important distinction between the two.
I had multiple instances where the CoT would think of lying to and deceiving me even when I said I could read its thoughts: "the user asked to not use these source, I will hide the sources in answer where I sound unbiased"