Why would anyone let LLMs predict 4 tokens at once? Multi-Token Prediction Explained
Top Comments (10)
Stable diffusion LLM
Your origin paper is incorrect, although you did show / mention it later in the video. The first paper to propose MTP was "Fast Inference from Transformers via Speculative Decoding" (2023), where the entire model predicts N future tokens (context + N-1 * mask -> N). MTP modified this by training a series of smaller models (heads) to predict the next N tokens from a single input (context -> N). DeepSeek then turned these heads into a transformer model with causal masking (still context -> N), which essentially treats the prediction like an RNN. In all cases, though, the prediction is speculative and might be discarded using the same method from the paper I mentioned. Furthermore, this technique has clearly been used by OpenAI since GPT-3.5, and is why the canvas editor in ChatGPT is so fast at making changes (they speculate on the previous canvas state - there's API documentation to support this). My guess is that this isn't being talked about much because it's orthogonal to other advancements, just like quantization is. Also, you never completed the initial motivation. Can DeepSeek's MTP method correctly guess the number of words in a sentence? Probably only slightly better than the next-token prediction objective. The diffusion methods should solve that, but are susceptible to block generation artifacts. Edit: I apologize if this comment came off as harsh; that wasn't my intention.
Even better than diffusion over tokens would be diffusion in a latent space, kind of like using a text autoencoder and doing diffusion in that space.
Can't wait for the Gemini diffusion video :D
Now I'm even more excited for V4 and R2
Check out LTX Video 13B now and experience the latest video gen breakthrough: https://bit.ly/ltxvbycloud
Not really applicable to this video, but I do feel that a lot of the recent bycloud videos feel like: - a very clickbait title, with baity thumbnails with memes - followed by actually very detailed and informative dives, explaining complex new things in a simple way that’s digestible while also not being too distilled into noise - but then concluding with the same clickbaity statement or claim again. Which… I guess this isn’t a problem? Since this is what the algorithm wants. The evolutionary nightmare gives and it takes. You do what you need to to survive. But yeah, it did feel really surprising when I was a new watcher: I thought the video wouldn’t contain anything meaningful, but it actually does.
My favourite podcast channel (visuals are plain brainrot)
That moment when Will Smith eating spaghetti is a Video Gen AI benchmark.
I wonder if “multi model” LLMs / general purpose agents could be a thing later, as in having the subcomponents work together faster: a transformer-ish side that does chain-of-thought and strategizes more on instinct, with a diffusion model working on the “canvas” side, making adjustments to the actual “product”.
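Editor's sketch for the comment above about speculative predictions that "might be discarded": a minimal greedy version of the accept-or-discard step, assuming the full model exposes a function that returns its single preferred next token. The names `verify_draft`, `NextTokenFn`, and `toy_target` are hypothetical and not taken from any paper or library. The speculative decoding paper uses a probabilistic acceptance rule, and a real system would score all drafted positions in one forward pass; this loop only illustrates the simpler greedy case.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: maps a token sequence to the one token the full
# model would pick next under greedy decoding. Not a real library API.
NextTokenFn = Callable[[Sequence[int]], int]


def verify_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: NextTokenFn,
) -> List[int]:
    """Return the tokens to append: the verified draft prefix plus one
    token chosen by the target model."""
    accepted: List[int] = []
    for guess in draft:
        expected = target_next(list(context) + accepted)
        if guess != expected:
            # First mismatch: discard the rest of the draft and keep the
            # target model's own token instead.
            accepted.append(expected)
            return accepted
        accepted.append(guess)
    # Every drafted token matched, so the target adds one more token.
    accepted.append(target_next(list(context) + accepted))
    return accepted


def toy_target(tokens: Sequence[int]) -> int:
    # Toy stand-in for a language model: always continues with last + 1.
    return tokens[-1] + 1


if __name__ == "__main__":
    # The draft guesses 4, 5 correctly, then 9 where the target wants 6.
    print(verify_draft([1, 2, 3], [4, 5, 9, 10], toy_target))
```

With the toy model, the first two drafted tokens are kept, the third is replaced by the target's choice, and the fourth is thrown away, so one verification round still yields three tokens.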