@m-ric on Hugging Face: "Emu3: Next-token prediction conquers multimodal tasks 🔥 This is the most…"


Emu3: Next-token prediction conquers multimodal tasks 🔥

This is the most important research in months: we’re now very close to having a single architecture to handle all modalities. The folks at Beijing Academy of Artificial Intelligence (BAAI) just released Emu3, a single model that handles text, images, and videos all at once.

𝗪𝗵𝗮𝘁'𝘀 𝘁𝗵𝗲 𝗯𝗶𝗴 𝗱𝗲𝗮𝗹?
🌟 Emu3 is the first model to truly unify all these different types of data (text, images, video) using just one simple trick: predicting the next token.
And it’s only 8B parameters, but really strong:
🖼️ For image generation, it's matching the best specialized models out there, like SDXL.
👁️ In vision tasks, it's outperforming top models like LLaVA-1.6-7B, which is a big deal for a model that wasn't specifically designed for this.
🎬 It's the first to nail video generation without using complicated diffusion techniques.

𝗛𝗼𝘄 𝗱𝗼𝗲𝘀 𝗶𝘁 𝘄𝗼𝗿𝗸?
🧩 Emu3 uses a special tokenizer (based on SBER-MoVQGAN) to turn images and video clips into sequences of discrete tokens: 4,096 tokens for a 512×512 image or a short clip at that resolution.
🔗 Then, it treats everything - text, images, and videos - as one long series of tokens to predict.
🔮 During training, it just tries to guess the next token, whether that's a word, part of an image, or a video frame.
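
To make the three steps above concrete, here is a toy sketch in Python. It is not Emu3's code: the vocabulary sizes, the `BOI`/`EOI` marker IDs, and all function names are hypothetical, and the 8× spatial downsampling factor is an assumption chosen because it turns a 512×512 image into the 4,096 tokens mentioned above. It only shows the idea: shift visual codes into a shared vocabulary, concatenate them with text, and train on "predict the token that comes next".

```python
# Toy illustration of "everything is a token" training data.
# All sizes and IDs below are hypothetical, not Emu3's real vocabulary layout.
TEXT_VOCAB_SIZE = 1000        # assumed number of text tokens
VISUAL_CODEBOOK_SIZE = 512    # assumed number of visual codes
BOI = TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE  # hypothetical "begin of image" marker
EOI = BOI + 1                                 # hypothetical "end of image" marker


def image_token_count(height, width, downsample=8):
    """Tokens in the grid, assuming each spatial side shrinks by `downsample`."""
    return (height // downsample) * (width // downsample)


def build_sequence(text_ids, image_codes):
    """Put text IDs and visual codes into one flat sequence over a shared vocabulary."""
    for code in image_codes:
        if not 0 <= code < VISUAL_CODEBOOK_SIZE:
            raise ValueError(f"visual code out of range: {code}")
    # Offset visual codes so they never collide with text token IDs.
    shifted = [TEXT_VOCAB_SIZE + code for code in image_codes]
    return list(text_ids) + [BOI] + shifted + [EOI]


def next_token_pairs(sequence):
    """Return (inputs, targets): at each position the target is the following token."""
    return sequence[:-1], sequence[1:]


if __name__ == "__main__":
    seq = build_sequence([12, 7, 301], [5, 0, 511, 42])
    inputs, targets = next_token_pairs(seq)
    print(seq)
    print(list(zip(inputs, targets)))
    print(image_token_count(512, 512))
```

The model never needs to know which targets are words and which are image patches: the same loss applies to every position in the sequence.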

𝗖𝗮𝘃𝗲𝗮𝘁𝘀 𝗼𝗻 𝘁𝗵𝗲 𝗿𝗲𝘀𝘂𝗹𝘁𝘀:
👉 In image generation, Emu3 beats SDXL, but it’s also much bigger (8B vs 3.5B). Beating the real diffusion GOAT, FLUX-dev, would be harder.
👉 In vision, the authors also don’t show a comparison against current SOTA models like Qwen-VL or Pixtral.

This approach is exciting because it's simple (next-token prediction) and scalable (handles all sorts of data)!

Read the paper 👉 Emu3: Next-Token Prediction is All You Need (2409.18869)
