
The RL Irony in LLMs

2026-01-21 Science & Technology
23.0k
1.4k
113
bycloud
225.0k subscribers

Description

Start learning cyber security with TryHackMe: https://tryhackme.com/bycloud
Use my code "BYCLOUD25" to get 25% off on annual subscription!

This video breaks down what's wrong with scaling RL for LLMs, especially as a path toward AGI, and why RL still matters. RL is noisy and can hurt generalization, yet it enables exploration and self-correction that pretraining can’t, so we are stuck between a rock and a hard place with this direction. We’ll also look at why LoRA is becoming the practical way to do RL cheaply: swappable adapters can match full fine-tuning on reasoning and make personalized agents easier to deploy, which might be a promising future direction for applying RL on a massive scale.

my latest project: Intuitive AI Academy https://intuitiveai.academy/
code "NYNM" for 50% off forever (limited to 50)

Dwarkesh Podcast w/ AK [YouTube] https://youtu.be/lXUZvyajciY
Dwarkesh Podcast w/ Ilya [YouTube] https://youtu.be/aR20FWCCjAs
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning [Paper] https://arxiv.org/abs/2506.01939
The Path Not Taken: RLVR Provably Learns Off the Principals [Paper] https://arxiv.org/abs/2511.08567
LoRA Without Regret [Blog] https://thinkingmachines.ai/blog/lora/
Tina: Tiny Reasoning Models via LoRA [Paper] https://arxiv.org/abs/2504.15777
Tinker [Website] https://thinkingmachines.ai/tinker/

My Newsletter https://mail.bycloud.ai/
My Patreon https://www.patreon.com/c/bycloud
Try out my new fav place to learn how to code https://scrimba.com/?via=bycloudAI

This video is supported by the kind Patrons & YouTube Members:
🙏Spam Maj, Alex, Chris LeDoux, DX Research Group, Poof N' Inu, Deagan, Robert Zawiasa, Ryszard Warzocha, Tobe2d, Louis Muk, Akkusativ, Kevin Tai, Mark Buckler, NO U, Tony Jimenez, Ângelo Fonseca, jiye, Anushka, Asad Dhamani, Binnie Yiu, Calvin Yan, Clayton Ford, Diego Silva, Etrotta, Gonzalo Fidalgo, Handenon, Hector, Jake Disco very, Michael Brenner, Nilly K, OlegWock, Daddy Wen, Shuhong Chen, Sid_Cipher, Stefan Lorenz, Sup, tantan assawade, Thipok Tham, Thomas Di Martino, Thomas Lin, Richárd Nagyfi, Paperboy, mika, Leo, Berhane-Meskel, Kadhai Pesalam, mayssam, Bill Mangrum, nyaa, Toru Mon, Lame Plane, Matej Macak

[Discord] https://discord.gg/NhJZGtH
[Twitter] https://twitter.com/bycloudai
[Patreon] https://www.patreon.com/bycloud
[Business Inquiries] [email protected]
[Profile & Banner Art] https://twitter.com/pygm7
[Video Editor] Abhay and @Booga04
[Ko-fi] https://ko-fi.com/bycloudai

Top Comments (10)

@that1nonja888 2026-01-21

Still years away from cheap ram huh?

457 35 replies
@ikciii 2026-01-22

Wait, you're telling me all that wasn't obvious from the moment lora was made? Also thanks for making a vid on this research, I'm currently almost done writing my bachelor's thesis where I use qlora to finetune base llama 3.1 8B into an unhelpful assistant that does whatever possible to make it seem like it answered your question while providing as little actual help as possible, and this is going to be a fine addition to my bibliography collection

132 13 replies
@sharannagarajan4089 2026-01-21

RL is not only for generalization. It is good for making AI learn things where training data is not present

86 8 replies
@stevenfallinge7149 2026-01-21

Maybe the problem with RL is that it only rewards the "verifiable reward." It doesn't reward exploration and creativity, which was one of the key breakthroughs that previously allowed game-playing RL agents to clear more of the stages that required exploration.

79 3 replies
@anardart115 2026-01-21

13:28 "Regex fixer" 🤣

41 3 replies
@ViewOf 2026-01-21

The quickest way to answer a question correctly is by already knowing the answer. With LLMs being trained on every written media in existence...

13 10 replies
@bycloudAI 2026-01-21

Start learning cyber security with TryHackMe: https://tryhackme.com/bycloud Use my code "BYCLOUD25" to get 25% off on annual subscription! fun fact: I wrote this video on a phone back when i was in the military lol

13 5 replies
@Lexxxco1 2026-01-21

Lora part of this video was actually practically useful. Together with sources and arguments - great work! Keep it up

13
@zenze-sama 2026-01-21

5:58 what a choice

10
@ibollanos 2026-02-14

Really good quality and informative video, thank you!

0
