AI code benchmarks lied to us

2026-05-31 Science & Technology
10.1k
587
104
Theo - t3․gg
539.0k subscribers

Description

We finally got a benchmark that actually matches reality.

Thank you Browserbase for sponsoring! Check them out at: https://soydev.link/browserbase

SOURCES:
https://deepswe.datacurve.ai/
https://x.com/theo/status/2059352130289651925

Want to sponsor a video? Learn more here: https://soydev.link/sponsor-me

Check out my Twitch, Twitter, Discord, and more at https://t3.gg

S/O @Ph4seon3 for the awesome edit 🙏

Top Comments (10)

@rolaca11 2026-05-31

best ai benchmark video counter: 1

511 3 replies
@realivanjx 2026-05-31

finally a benchmark that agrees with me

96 1 reply
@PaulPlay 2026-05-31

Great to have this. Lines up much better with the experience most of us have been having with these models

18
@BORNINSPACE 2026-05-31

would be interesting to see composer-2.5 result on this bench

91 6 replies
@vortex4705 2026-05-31

Best model ever -> I was wrong cycle continues

491 7 replies
@Zol_ui 2026-05-31

Gemini models never fail to un-surprise me.

58 8 replies
@MichaelScharf 2026-05-31

The important point you are missing is even if all tests are green, it does not mean that it is good code that follows the standards and implied rules of the repo. LLMs rarely suggest refactoring, which is an important part of real projects. They do tasks like a stranger that has no idea what the "folklore" behind the repo is. Over time, consistency degrades and maintainability decreases.

17 1 reply
@ryszardgoc1918 2026-05-31

I would like to see more languages there. Idiomatic modern C++ (at least C++20) and maybe also C. These are way more common than Rust or Go, so not sure why they aren't there. I found that some models do way worse in some languages than others.

38 4 replies
@daboross2 2026-05-31

I really want to see a benchmark that compiles tests of different kinds (feature implementation, project from scratch, different kind of difficult logic) & different prompting styles (plan first, one-shot, back and forth, lots of threads) and see what models do well in that matrix.

2
@mfyoungblood 2026-05-31

Damn it I really want to know how composer 2.5 performs on this. Maybe one day

10
