Takara takes 3rd place in the {Tech: Munich} AI hackathon with Fudeno!
A little over two weeks ago, @aldigobbler and I set out to create the largest multimodal SVG dataset ever made. We succeeded, and while I was in Munich, Germany, I took it one step further and built an entire app with it!
We fine-tuned Mistral Small, built a Next.js application, and blew some minds, taking 3rd place out of over 100 hackers. So cool!
Multimodal
> Mistral AI released a 24B vision LM, both base and instruction FT versions, sota (OS)
> with IBM we released SmolDocling, a sota 256M document parser with Apache 2.0 license (OS)
> SpatialLM is a new vision LM that outputs 3D bounding boxes, comes with 0.5B (QwenVL based) and 1B (Llama based) variants
> SkyWork released SkyWork-R1V-38B, new vision reasoning model (OS)

LLMs
> NVIDIA released new Nemotron models in 49B and 8B with their post-training dataset
> LG released EXAONE, new reasoning models in 2.4B, 7.8B and 32B
> Dataset: Glaive AI released a new reasoning dataset of 22M+ examples
> Dataset: NVIDIA released new helpfulness dataset HelpSteer3
> Dataset: OpenManusRL is a new agent dataset based on the ReAct framework (OS)
> Open-R1 team released OlympicCoder, new competitive coder model in 7B and 32B
> Dataset: GeneralThought-430K is a new reasoning dataset (OS)

Image Generation / Computer Vision
> Roboflow released RF-DETR, new real-time sota object detector (OS)
> YOLOE is a new real-time zero-shot object detector with text and visual prompts
> Stability AI released Stable Virtual Camera, a new novel view synthesis model
> Tencent released Hunyuan3D-2mini, new small and fast 3D asset generation model
> ByteDance released InfiniteYou, new realistic photo generation model
> StarVector is a new 8B model that generates SVG from images
> FlexWorld is a new model that expands 3D views (OS)

Audio
> Sesame released CSM-1B, new speech generation model (OS)

Robotics
> NVIDIA released GR00T, new robotics model for generalized reasoning and skills, along with its dataset
At Rapidata, we compared DeepL with LLMs like DeepSeek-R1, Llama, and Mixtral for translation quality, using feedback from over 51,000 native speakers. Despite its cost, DeepL's performance makes it a valuable investment, especially in critical applications where translation quality is paramount. It turns out Europe does more than impose regulations.
Our dataset, based on these comparisons, is now available on Hugging Face. This might be useful for anyone working on AI translation or language model evaluation.
An assembly of 18 European companies, labs, and universities has banded together to launch EuroBERT! It's a state-of-the-art multilingual encoder covering 15 European and widely spoken global languages, designed to be fine-tuned for retrieval, classification, etc.
- 15 languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
- 3 model sizes: 210M, 610M, and 2.1B parameters - very, very useful sizes in my opinion
- Sequence length of 8,192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
- Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
- A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
- Evaluated against mDeBERTa, mGTE, and XLM-RoBERTa for retrieval, classification, and regression (after fine-tuning for each task separately): EuroBERT punches way above its weight.
- Detailed paper with all details, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.
The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
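In the meantime, here's a minimal sketch of loading one of the checkpoints and mean-pooling sentence embeddings, the usual first step before fine-tuning for retrieval or classification. The model id and the trust_remote_code flag are my assumptions, not taken from the announcement - check the model cards for the exact loading instructions.

```python
# Minimal sketch (assumed model id and loading flags; verify on the Hub).
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # assumed identifier for the 210M model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

sentences = ["EuroBERT is a multilingual encoder.", "It supports 15 languages."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_dim); hidden size depends on the checkpoint
```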
Since I published it on GitHub a few days ago, Hugging Face's new agentic library smolagents has gathered nearly 4k stars!
But we are just getting started on agents, so we're hiring an ML Engineer to join me and double down on this effort!
The plan is to build GUI agents: agents that can act on your computer with mouse & keyboard, like Claude Computer Use.
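If you haven't tried the library yet, the core loop is only a few lines. This sketch follows the README-style example from around the time of this post; class names such as HfApiModel may have been renamed in later releases, so check the current docs.

```python
# A minimal smolagents sketch (API names may have shifted in newer versions).
from smolagents import CodeAgent, DuckDuckGoSearchTool, HfApiModel

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],  # give the agent a web-search tool
    model=HfApiModel(),              # an LLM served via the HF Inference API
)

# The agent writes and runs Python code step by step to answer the query.
agent.run("How many seconds would it take a leopard at full speed to cross Pont des Arts?")
```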
Exciting news in AI: Jina AI releases jina-clip-v2!
The team at Jina AI has just released a groundbreaking multilingual multimodal embedding model that's pushing the boundaries of text-image understanding. Here's why this is a big deal:
Technical highlights:
- Dual-encoder architecture combining a 561M-parameter Jina XLM-RoBERTa text encoder and a 304M-parameter EVA02-L14 vision encoder
- Supports 89 languages with an 8,192-token context length
- Processes images up to 512×512 pixels with a 14×14 patch size
- Implements FlashAttention2 for text and xFormers for vision processing
- Uses Matryoshka Representation Learning for efficient vector storage
Under the hood:
- Multi-stage training process with progressive resolution scaling (224 → 384 → 512)
- Contrastive learning using the InfoNCE loss in both directions
- Trained on a massive multilingual dataset including 400M English and 400M multilingual image-caption pairs
- Incorporates specialized datasets for document understanding, scientific graphs, and infographics
- Uses hard negative mining with 7 negatives per positive sample
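For the curious, here is what a bidirectional InfoNCE objective looks like in a few lines of PyTorch. This is a generic sketch written from the description above, not Jina's training code, and the temperature value is just a placeholder.

```python
# Sketch of a CLIP-style symmetric InfoNCE loss (generic, not Jina's code).
import torch
import torch.nn.functional as F

def info_nce_both_directions(text_emb, image_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)

    logits = text_emb @ image_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Text->image and image->text cross-entropy; matched pairs sit on the diagonal.
    loss_t2i = F.cross_entropy(logits, targets)
    loss_i2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2i + loss_i2t) / 2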
Performance:
- Outperforms previous models on visual document retrieval (52.65% nDCG@5)
- Achieves 89.73% image-to-text and 79.09% text-to-image retrieval on the CLIP benchmark
- Strong multilingual performance across 30 languages
- Maintains performance even with 75% dimension reduction (256D vs. 1024D)
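To make the Matryoshka point concrete, here is a tiny sketch of that 75% dimension reduction: keep the first 256 of 1024 dimensions and re-normalize. The vectors below are random stand-ins, not actual jina-clip-v2 outputs.

```python
# Matryoshka-style truncation sketch with placeholder embeddings.
import numpy as np

full = np.random.randn(4, 1024).astype(np.float32)   # pretend 1024-D embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = full[:, :256]                                 # 75% dimension reduction
small /= np.linalg.norm(small, axis=1, keepdims=True)

# With a model trained via Matryoshka Representation Learning, similarities
# computed on the truncated vectors stay close to the full-dimension ones.
print((small @ small.T).round(2))
```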
Key innovation: The model solves the long-standing challenge of unifying text-only and multi-modal retrieval systems while adding robust multilingual support. Perfect for building cross-lingual visual search systems!
Kudos to the research team at Jina AI for this impressive advancement in multimodal AI!
Exciting breakthrough in AI: @Meta's new Byte Latent Transformer (BLT) revolutionizes language models by eliminating tokenization!
The BLT architecture introduces a groundbreaking approach that processes raw bytes instead of tokens, achieving state-of-the-art performance while being more efficient and robust. Here's what makes it special:
>> Key Innovations
Dynamic Patching: BLT groups bytes into variable-sized patches based on entropy, allocating more compute where the data is more complex (see the toy sketch below). This results in up to 50% fewer FLOPs during inference compared to traditional token-based models.
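Here is a toy sketch of that entropy-driven patching. The frequency-based entropy function is a stand-in for the small learned byte LM that BLT actually uses, and the threshold is arbitrary; the point is only to show variable-sized patches emerging from an entropy rule.

```python
# Toy illustration of entropy-based dynamic patching (not Meta's code).
import math
from collections import Counter

def patch_entropy(patch: bytes) -> float:
    # Stand-in for the learned entropy model: byte-frequency entropy of the patch.
    counts = Counter(patch)
    total = len(patch)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def dynamic_patches(data: bytes, threshold: float = 2.0):
    patches, current = [], bytearray()
    for b in data:
        current.append(b)
        if patch_entropy(bytes(current)) > threshold:  # region got "hard" -> close patch
            patches.append(bytes(current))
            current = bytearray()
    if current:
        patches.append(bytes(current))
    return patches

print(dynamic_patches(b"aaaaaaaa Hello, Byte Latent Transformer!"))
# Low-entropy runs ("aaaaaaaa...") end up in long patches, varied text is split
# into shorter ones, so more latent-transformer steps go where data is complex.
```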
Three-Component Architecture:
• Lightweight Local Encoder that converts bytes to patch representations
• Powerful Global Latent Transformer that processes patches
• Local Decoder that converts patches back to bytes
>> Technical Advantages
• Matches the performance of Llama 3 at 8B parameters while being more efficient
• Superior handling of non-English languages and rare character sequences
• Remarkable 99.9% accuracy on spelling tasks
• Better scaling properties than token-based models
>> Under the Hood
The system uses an entropy model to determine patch boundaries, cross-attention mechanisms for information flow, and hash n-gram embeddings for improved representation. The architecture allows simultaneous scaling of both patch and model size while maintaining fixed inference costs.
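For a sense of what hash n-gram embeddings can look like, here is a rough sketch based on my reading of the idea rather than the reference implementation; the bucket count, embedding dimension, and n-gram sizes are arbitrary placeholders.

```python
# Sketch of hash n-gram embeddings: byte n-grams are hashed into buckets and
# their embeddings are summed into each byte's representation.
import torch
import torch.nn as nn

class HashNgramEmbedding(nn.Module):
    def __init__(self, num_buckets=50_000, dim=256, ngram_sizes=(3, 4, 5)):
        super().__init__()
        self.num_buckets = num_buckets
        self.ngram_sizes = ngram_sizes
        self.table = nn.Embedding(num_buckets, dim)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (seq_len,) tensor of values in [0, 255]
        seq = byte_ids.tolist()
        rows = []
        for i in range(len(seq)):
            vec = torch.zeros(self.table.embedding_dim)
            for n in self.ngram_sizes:
                if i + 1 >= n:
                    # Hash the n-gram ending at position i into a bucket.
                    bucket = hash(tuple(seq[i + 1 - n : i + 1])) % self.num_buckets
                    vec = vec + self.table(torch.tensor(bucket))
            rows.append(vec)
        # These vectors would be added to plain byte embeddings in the local encoder.
        return torch.stack(rows)

emb = HashNgramEmbedding()
print(emb(torch.tensor(list(b"hello"))).shape)  # torch.Size([5, 256])
```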
This is a game-changer for multilingual AI and could reshape how we build future language models. Excited to see how this technology evolves!
For anyone looking to boost their LLM fine-tuning and alignment skills this December, we're running a free and open course called smol course. It's not big like Li Yin's and @mlabonne's, it's just smol.
It focuses on practical use cases, so if you're working on something, bring it along.
It's peer-reviewed and open, so you can discuss and get feedback.
If you're already a smol pro, feel free to drop a star or an issue.
Part 1 starts now, and it's on instruction tuning!
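Not part of the course materials, but as a taste of what instruction tuning involves, here is a tiny sketch of rendering a chat-style example with a model's chat template; the resulting string (with special tokens) is what you actually fine-tune on. The model id is just an arbitrary small instruct model picked for illustration.

```python
# Sketch: formatting one instruction-tuning example with a chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")

example = [
    {"role": "user", "content": "Summarize why the sky is blue in one sentence."},
    {"role": "assistant", "content": "Air molecules scatter short blue wavelengths of sunlight more than long ones."},
]

# The rendered training string, with the model's special tokens inserted.
print(tokenizer.apply_chat_template(example, tokenize=False))
```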