The RL Irony in LLMs

2026-01-21 Science & Technology

23.0k

1.4k

113


bycloud

229.0k subscribers

Description

Start learning cyber security with TryHackMe: https://tryhackme.com/bycloud
Use my code "BYCLOUD25" to get 25% off on annual subscription!

This video breaks down what's wrong with scaling RL for LLMs, especially as a path to AGI, and why RL still matters. RL is noisy and can hurt generalization, yet it enables exploration and self-correction that pretraining can’t, so this direction leaves us stuck between a rock and a hard place. We’ll also look at why LoRA is becoming the practical way to do RL cheaply: swappable adapters can match full fine-tuning on reasoning and make personalized agents easier to deploy, which might be a promising future direction for applying RL on a massive scale.

My latest project: Intuitive AI Academy https://intuitiveai.academy/
Code "NYNM" for 50% off forever (limited to 50)

Dwarkesh Podcast w/ AK [YouTube] https://youtu.be/lXUZvyajciY
Dwarkesh Podcast w/ Ilya [YouTube] https://youtu.be/aR20FWCCjAs
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning [Paper] https://arxiv.org/abs/2506.01939
The Path Not Taken: RLVR Provably Learns Off the Principals [Paper] https://arxiv.org/abs/2511.08567
LoRA Without Regret [Blog] https://thinkingmachines.ai/blog/lora/
Tina: Tiny Reasoning Models via LoRA [Paper] https://arxiv.org/abs/2504.15777
Tinker [Website] https://thinkingmachines.ai/tinker/

My Newsletter https://mail.bycloud.ai/
My Patreon https://www.patreon.com/c/bycloud
Try out my new fav place to learn how to code https://scrimba.com/?via=bycloudAI

This video is supported by the kind Patrons & YouTube Members: 🙏Spam Maj, Alex, Chris LeDoux, DX Research Group, Poof N' Inu, Deagan, Robert Zawiasa, Ryszard Warzocha, Tobe2d, Louis Muk, Akkusativ, Kevin Tai, Mark Buckler, NO U, Tony Jimenez, Ângelo Fonseca, jiye, Anushka, Asad Dhamani, Binnie Yiu, Calvin Yan, Clayton Ford, Diego Silva, Etrotta, Gonzalo Fidalgo, Handenon, Hector, Jake Disco very, Michael Brenner, Nilly K, OlegWock, Daddy Wen, Shuhong Chen, Sid_Cipher, Stefan Lorenz, Sup, tantan assawade, Thipok Tham, Thomas Di Martino, Thomas Lin, Richárd Nagyfi, Paperboy, mika, Leo, Berhane-Meskel, Kadhai Pesalam, mayssam, Bill Mangrum, nyaa, Toru Mon, Lame Plane, Matej Macak

[Discord] https://discord.gg/NhJZGtH
[Twitter] https://twitter.com/bycloudai
[Patreon] https://www.patreon.com/bycloud
[Business Inquiries] [email protected]
[Profile & Banner Art] https://twitter.com/pygm7
[Video Editor] Abhay and @Booga04
[Ko-fi] https://ko-fi.com/bycloudai

#bycloud #bycloudai #reinforcement learning #RL for LLMs #LLM reinforcement learning #RLHF #RLVR #verifiable rewards

Top Comments (10)

@that1nonja888 2026-01-21

Still years away from cheap ram huh?

457 35 replies

@ikciii 2026-01-22

Wait, you're telling me all that wasn't obvious from the moment lora was made? Also thanks for making a vid on this research, I'm currently almost done writing my bachelor's thesis where I use qlora to finetune base llama 3.1 8B into an unhelpful assistant that does whatever possible to make it seem like it answered your question while providing as little actual help as possible, and this is going to be a fine addition to my bibliography collection

132 13 replies

@sharannagarajan4089 2026-01-21

RL is not only for generalization. It is good for making AI learn things where training data is not present

86 8 replies

@stevenfallinge7149 2026-01-21

Maybe the problem with RL is that it only rewards the "verifiable reward." It doesn't reward exploration and creativity, which was one of the key breakthroughs for allowing game-playing RL agents previously to clear more of stages that required exploration.

79 3 replies

@anardart115 2026-01-21

13:28 "Regex fixer" 🤣

41 3 replies

@ViewOf 2026-01-21

The quickest way to answer a question correctly is by already knowing the answer. With LLMs being trained on every written media in existence...

13 10 replies

@bycloudAI 2026-01-21

Start learning cyber security with TryHackMe: https://tryhackme.com/bycloud Use my code "BYCLOUD25" to get 25% off on annual subscription! fun fact: I wrote this video on a phone back when i was in the military lol

13 5 replies

@Lexxxco1 2026-01-21

Lora part of this video was actually practically useful. Together with sources and arguments - great work! Keep it up

@zenze-sama 2026-01-21

5:58 what a choice

@ibollanos 2026-02-14

Really good quality and informative video, thank you!


