New AI Meta: Train LLMs To Explore On "Hard" Tokens [RLVR + Entropy]

2025-08-31 Science & Technology

23.4k

1.7k

bycloud

229.0k subscribers

Description

Get started with Strands Agents today: https://aws.amazon.com/blogs/opensource/introducing-strands-agents-1-0-production-ready-multi-agent-orchestration-made-simple/?trk=3fa00bba-d2bc-45e2-a2b0-f96bc12fd521&sc_channel=psm

In this video, I will be sharing how researchers train LLMs to "explore" during RL to improve performance via entropy.

My Newsletter
https://mail.bycloud.ai/

my project: find, discover & explain AI research semantically
https://findmypapers.ai/

My Patreon
https://www.patreon.com/c/bycloud

Beyond the 80/20 Rule [Paper]
https://arxiv.org/abs/2506.01939

Reasoning with Exploration [Paper]
https://arxiv.org/abs/2506.14758

Try out my new fav place to learn how to code
https://scrimba.com/?via=bycloudAI

This video is supported by the kind Patrons & YouTube Members:
🙏Nous Research, Chris LeDoux, Ben Shaener, DX Research Group, Poof N' Inu, Andrew Lescelius, Deagan, Robert Zawiasa, Ryszard Warzocha, Tobe2d, Louis Muk, Akkusativ, Kevin Tai, Mark Buckler, NO U, Tony Jimenez, Ângelo Fonseca, jiye, Anushka, Asad Dhamani, Binnie Yiu, Calvin Yan, Clayton Ford, Diego Silva, Etrotta, Gonzalo Fidalgo, Handenon, Hector, Jake Disco very, Michael Brenner, Nilly K, OlegWock, Daddy Wen, Shuhong Chen, Sid_Cipher, Stefan Lorenz, Sup, tantan assawade, Thipok Tham, Thomas Di Martino, Thomas Lin, Richárd Nagyfi, Paperboy, mika, Leo, Berhane-Meskel, Kadhai Pesalam, mayssam, Bill Mangrum, nyaa, Toru Mon

[Discord] https://discord.gg/NhJZGtH
[Twitter] https://twitter.com/bycloudai
[Patreon] https://www.patreon.com/bycloud
[Business Inquiries] [email protected]
[Profile & Banner Art] https://twitter.com/pygm7
[Video Editor] @Booga04
[Ko-fi] https://ko-fi.com/bycloudai

#bycloud #bycloudai #RLVR #RLVR with entropy #RL LLM #LLM RL #LLM RLVR #reinforcement learning with verifiable rewards

Top Comments (10)

@gemstone7818 2025-08-31

finally somebody using high model entropy to teach them to be more creative, good luck with the military btw

123 2 replies

@6IGNITION9 2025-09-01

So this is fascinating. Training the model to say "but wait" and similar, forces high entropy tokens into the context, encouraging branches in the reasoning.

@Vlarkus 2025-08-31

9:00 Making LLMs is basically how we learned to organize the Library of Babel.

24 10 replies

@Shragenator 2025-08-31

This seems similar to the training principle of "explorative, then exploitive" that is utilized to avoid overfitting and local minima

@DeadtomGCthe2nd 2025-08-31

Great explanation as usual! Super exciting developments! So cool that it tackles the repetition AND performance bottlenecks!

@arlogodfrey1508 2025-08-31

dang, now *that's* how you build a language model!

@vladyskaizen 2025-09-06

This is amazing! The data the models train on naturally reveals which sequences of words are part of the same "thought" and where different "thoughts" begin and end. These high entropy tokens are basically delimitations for what we call "chunks" of information. NOW what we must do is delegate the low entropy tokens to the smaller models and use the BIG BRAIN complex models only for the high entropy tokens. This will maximize efficiency. Also, humans can only hold about 4-7 chunks of information when doing a task. If we can figure out how models can hold more than that, we will be golden.

8 2 replies

@PrimeStackPro 2025-08-31

Please make an AI agentic system twin of yourself to post daily

@thorvaldspear 2025-08-31

This appears to be very scalable, I wonder what the effect would be on larger and larger training runs!

@gnanaprakash-ravi 2025-09-03

Hi! The content is as excellent as ever! Just to clarify, when you mention RL vs RL with Entropy Adjustment, are you actually referring to RLVR vs RLVR with Entropy Adjustment? It's not just standard RL, right?
