AI code benchmarks lied to us

2026-05-31 Science & Technology

10.1k

587

104

Theo - t3․gg

539.0k subscribers

Description

We finally got a benchmark that actually matches reality.

Thank you Browserbase for sponsoring! Check them out at: https://soydev.link/browserbase

SOURCES:
https://deepswe.datacurve.ai/
https://x.com/theo/status/2059352130289651925

Want to sponsor a video? Learn more here: https://soydev.link/sponsor-me

Check out my Twitch, Twitter, Discord, and more at https://t3.gg

S/O @Ph4seon3 for the awesome edit 🙏

#web development #full stack #typescript #javascript #react #programming #programmer #theo

Top Comments (10)

@rolaca11 2026-05-31

best ai benchmark video counter: 1

511 3 replies

@realivanjx 2026-05-31

finally a benchmark that agrees with me

96 1 reply

@PaulPlay 2026-05-31

Great to have this. Lines up much better with the experience most of us have been having with these models

@BORNINSPACE 2026-05-31

would be interesting to see composer-2.5 result on this bench

91 6 replies

@vortex4705 2026-05-31

Best model ever -> I was wrong cycle continues

491 7 replies

@Zol_ui 2026-05-31

Gemini models never fail to un-surprise me.

58 8 replies

@MichaelScharf 2026-05-31

The important point you are missing is even if all tests are green, it does not mean that it is good code that follows the standards and implied rules of the repo. LLMs rarely suggest refactoring, which is an important part of real projects. They do tasks like a stranger that has no idea what the "folklore" behind the repo is. Over time, consistency degrades and maintainability decreases.

17 1 reply

@ryszardgoc1918 2026-05-31

I would like to see more languages there: idiomatic modern C++ (at least C++20) and maybe also C. These are way more common than Rust or Go, so I'm not sure why they aren't there. I found that some models do way worse in some languages than others.

38 4 replies

@daboross2 2026-05-31

I really want to see a benchmark that compiles tests of different kinds (feature implementation, project from scratch, different kind of difficult logic) & different prompting styles (plan first, one-shot, back and forth, lots of threads) and see what models do well in that matrix.

@mfyoungblood 2026-05-31

Damn it I really want to know how composer 2.5 performs on this. Maybe one day
