
Why would anyone let LLMs predict 4 tokens at once? Multi-Token Prediction Explained

2025-05-27 Science & Technology
55.8k views
2.5k likes
129 comments
bycloud
225.0k subscribers

Description

Check out LTX Video 13B now and experience the latest video gen breakthrough: https://bit.ly/ltxvbycloud

My Newsletter https://mail.bycloud.ai/
my project: find, discover & explain AI research semantically https://findmypapers.ai/
My Patreon https://www.patreon.com/c/bycloud

Future Lens [Paper] https://arxiv.org/abs/2311.04897
Multi-Token Prediction [Paper] https://arxiv.org/abs/2404.19737
DeepSeek-V3 [First Paper] https://arxiv.org/abs/2412.19437 [Technical Paper] https://arxiv.org/abs/2505.09343

Try out my new fav place to learn how to code https://scrimba.com/?via=bycloudAI

This video is supported by the kind Patrons & YouTube Members: 🙏Nous Research, Chris LeDoux, Ben Shaener, DX Research Group, Poof N' Inu, Andrew Lescelius, Deagan, Robert Zawiasa, Ryszard Warzocha, Tobe2d, Louis Muk, Akkusativ, Kevin Tai, Mark Buckler, NO U, Tony Jimenez, Ângelo Fonseca, jiye, Anushka, Asad Dhamani, Binnie Yiu, Calvin Yan, Clayton Ford, Diego Silva, Etrotta, Gonzalo Fidalgo, Handenon, Hector, Jake Disco very, Michael Brenner, Nilly K, OlegWock, Daddy Wen, Shuhong Chen, Sid_Cipher, Stefan Lorenz, Sup, tantan assawade, Thipok Tham, Thomas Di Martino, Thomas Lin, Richárd Nagyfi, Paperboy, mika, Leo, Berhane-Meskel, Kadhai Pesalam, mayssam, Bill Mangrum, nyaa

[Discord] https://discord.gg/NhJZGtH
[Twitter] https://twitter.com/bycloudai
[Patreon] https://www.patreon.com/bycloud
[Business Inquiries] [email protected]
[Profile & Banner Art] https://twitter.com/pygm7
[Video Editor] @Booga04
[Ko-fi] https://ko-fi.com/bycloudai

Top Comments (10)

@abhijeet1472-handle 2025-05-27

Stable diffusion LLM

242 5 replies
@hjups 2025-05-27

Your origin paper is incorrect, although you did show / mention it later in the video. The first paper to propose MTP was "Fast Inference from Transformers via Speculative Decoding" (2023), where the entire model predicts N future tokens (context + N-1 * mask -> N). MTP modified this by training a series of smaller models (heads) to predict the next N tokens from a single input (context -> N). DeepSeek then turned these heads into a transformer model with causal masking (still context -> N), which essentially treats the prediction like an RNN. In all cases though, the prediction is speculative, and might be discarded using the same method from the paper I mentioned.

Furthermore, this technique has clearly been used by OpenAI since GPT-3.5, and is why the canvas editor in ChatGPT is so fast to make changes (they speculate on the previous canvas state - there's API documentation to support this). My guess is that this isn't being talked about much because it's orthogonal to other advancements, just like quantization is.

Also, you never completed the initial motivation. Can DeepSeek's MTP method correctly guess the number of words in a sentence? Probably only slightly better than the next-token prediction objective. The diffusion methods should solve that, but are susceptible to block generation artifacts.

Edit: I apologize if this comment came off as harsh, that wasn't my intention.

219 17 replies
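
The "speculate, then keep or discard" step described in the comment above can be sketched roughly as follows. This is a minimal greedy variant written for illustration only: it accepts a drafted token when it equals the target model's own greedy choice, which is simpler than the sampling-based acceptance rule in the speculative decoding paper. The names `speculative_step`, `generate`, `toy_draft` and `toy_target` are hypothetical, and the two "models" are toy functions over integers rather than real LLMs.

```python
from typing import Callable, List

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap draft model: agrees with the target except after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def speculative_step(
    context: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    k: int,
) -> List[Token]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    # 1. The cheap model speculates k tokens ahead.
    drafted: List[Token] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2. The target model checks each drafted token in order. A real system
    #    would score all k positions in one batched forward pass; this toy
    #    version calls the target once per position for clarity.
    accepted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # First disagreement: discard the rest of the draft and keep
            # the target's own token instead.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every drafted token was accepted; the target supplies one more token.
    accepted.append(target_next(context + accepted))
    return accepted


def generate(
    context: List[Token],
    draft_next: NextTokenFn,
    target_next: NextTokenFn,
    k: int,
    max_new: int,
) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(context)
    while len(out) - len(context) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(context) + max_new]


if __name__ == "__main__":
    print(speculative_step([3], toy_draft, toy_target, k=4))
    print(generate([0], toy_draft, toy_target, k=4, max_new=8))
```

Under this greedy rule the output is the same as decoding with the target model alone; the draft only changes how many target calls could be batched together, which is one way to read the comment's point that the technique is orthogonal to other advancements.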
@diegoantoniorosariopalomin2206 2025-05-27

Even better than diffusion over tokens would be diffusion in a latent space, kind of like using a text autoencoder and doing diffusion in that space.

61 3 replies
@heys3th 2025-05-27

Can't wait for the Gemini diffusion video :D

28
@simeonnnnn 2025-05-27

Now I'm even more excited for V4 and R2

23 2 replies
@bycloudAI 2025-05-26

Check out LTX Video 13B now and experience the latest video gen breakthrough: https://bit.ly/ltxvbycloud

21 1 reply
@akirachisaka9997 2025-05-28

Not really applicable to this video, but I do feel that a lot of the recent bycloud videos feel like:
- a very clickbait title, with baity thumbnails with memes
- followed by actually very detailed and informative dives, explaining complex new things in a simple way that's digestible while also not being too distilled into noise
- but then concluding with the same clickbaity statement or claim again

Which… I guess this isn't a problem? Since this is what the algorithm wants. The evolutionary nightmare gives and it takes. You do what you need to do to survive. But yeah, it did feel really surprising when I was a new watcher: I thought the video wouldn't contain anything meaningful, but it actually does.

10
@JorgetePanete 2025-05-28

My favourite podcast channel (visuals are plain brainrot)

8
@AnotherFreakingDude 2025-05-27

That moment when Will Smith eating spaghetti is a Video Gen AI benchmark.

7
@akirachisaka9997 2025-05-28

I wonder if “multi-model” LLMs / general-purpose agents could be a thing later. As in, have the sub-components work together faster: a transformer-ish side that does chain-of-thought and strategizes more on instinct, with a diffusion model working on the “canvas” side, making adjustments to the actual “product”.

3
