AI code benchmarks lied to us
Top Comments (10)
best ai benchmark video counter: 1
finally a benchmark that agrees with me
Great to have this. Lines up much better with the experience most of us have been having with these models
would be interesting to see composer-2.5 result on this bench
Best model ever -> I was wrong cycle continues
Gemini models never fail to un-surprise me.
The important point you are missing is even if all tests are green, it does not mean that it is good code that follows the standards and implied rules of the repo. LLMs rarely suggest refactoring, which is an important part of real projects. They do tasks like a stranger that has no idea what the "folklore" behind the repo is. Over time, consistency degrades and maintainability decreases.
I would like to see more languages there. Idiomatic modern C++ (at least C++20) and maybe also C. These are way more common than Rust or Go, so I'm not sure why they aren't there. I found that some models do way worse in some languages than others.
I really want to see a benchmark that compiles tests of different kinds (feature implementation, project from scratch, different kinds of difficult logic) and different prompting styles (plan first, one-shot, back and forth, lots of threads) and shows which models do well in that matrix.
Damn it I really want to know how composer 2.5 performs on this. Maybe one day