AI & ML interests

None defined yet.

Recent Activity

Team8's activity

loubnabnl 
posted an update about 1 month ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
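
To make the dataset part concrete, here is a minimal sketch for peeking at smoltalk with the datasets library (the "all" config name is an assumption; check the dataset card for the available configs):

```python
# Minimal sketch: stream a couple of smoltalk examples without downloading everything.
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
for example in ds.take(2):
    # Each example should contain a list of chat messages (role/content) ready for SFT.
    print(example)
```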

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
loubnabnl 
posted an update 7 months ago
🍷 The FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.

Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu

We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
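
As a rough illustration of how the released classifier can be used for filtering (a minimal sketch; the model id comes from the release, and the exact score scale and threshold should be checked against the model card):

```python
# Minimal sketch: score a document's educational value with the FineWeb-Edu classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HuggingFaceFW/fineweb-edu-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Photosynthesis converts light energy into chemical energy stored in glucose."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1).item()  # roughly 0 (low) to 5 (high educational value)

# Keep documents above a chosen cutoff (see the report for the threshold actually used).
keep = score >= 3
print(score, keep)
```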

You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.

Enjoy!
HugoLaurencon 
posted an update 8 months ago
We release Idefics2-chatty, the chatbot-optimized version of Idefics2: HuggingFaceM4/idefics2-8b-chatty

Idefics2-chatty is better at following instructions and at chain-of-thought reasoning.

We also release a paper containing many findings on how to build an efficient and performant vision-language model: What matters when building vision-language models? (2405.02246)
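
As a quick starting point, here is a minimal inference sketch using the standard transformers vision-language API (details such as dtype and prompt formatting should be checked against the model card):

```python
# Minimal sketch: chat with Idefics2-chatty about a local image.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/idefics2-8b-chatty"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

image = Image.open("example.jpg")  # any local image
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image step by step."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```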

How are you going to use the model, or what data are you going to fine-tune it on?
HugoLaurencon 
posted an update 8 months ago
Idefics2 is trained mostly on OBELICS, our open interleaved image-text document dataset.

Training on interleaved data is crucial for reaching high performance on VQA tasks, for taking an arbitrary number of images as input, and for doing in-context learning.
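
A minimal sketch for browsing the dataset in streaming mode (column names are assumptions; check the dataset card):

```python
# Minimal sketch: stream one OBELICS document instead of downloading the full dataset.
from datasets import load_dataset

obelics = load_dataset("HuggingFaceM4/OBELICS", split="train", streaming=True)
for doc in obelics.take(1):
    # Each document interleaves text passages and image references in reading order.
    print(list(doc.keys()))
```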

Dataset: HuggingFaceM4/OBELICS
Nomic visualization: https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f
Link to OBELICS thread: https://twitter.com/HugoLaurencon/status/1694005892839006301
HugoLaurencon 
posted an update 8 months ago
The Cauldron is a massive collection of 50 high-quality datasets, all converted to the user/assistant format and ready to use for fine-tuning any vision-language model.

The Cauldron covers a wide range of tasks, including general visual question answering, counting, captioning, text transcription, document understanding, chart/figure understanding, table understanding, visual reasoning, geometry, spotting differences between two images, and converting a screenshot to code.

HuggingFaceM4/the_cauldron
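
A minimal sketch for loading one of the sub-datasets (the "ai2d" config name is an assumption; list the real configs first):

```python
# Minimal sketch: inspect The Cauldron's task subsets and load one of them.
from datasets import get_dataset_config_names, load_dataset

print(get_dataset_config_names("HuggingFaceM4/the_cauldron"))  # the 50 task subsets
ds = load_dataset("HuggingFaceM4/the_cauldron", "ai2d", split="train")
print(ds[0])  # expect images plus user/assistant conversation turns
```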
HugoLaurencon 
posted an update 8 months ago
We release Idefics2-8B, a foundation vision language model with SOTA results for its size on many benchmarks.

For Idefics2, we adopted a simple architecture:
- Images are fed to a vision encoder, then to a modality projection to match the input dimension of the LLM, and finally to a perceiver resampler for efficient pooling.
- Interleaved image-text data are then passed to the LLM.
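
As a rough mental model of that image path (an illustrative sketch only; dimensions and module choices are assumptions, not the actual implementation):

```python
# Rough sketch of the image path described above: vision features -> modality projection
# -> perceiver-style resampler -> a small set of visual tokens for the LLM.
import torch
import torch.nn as nn

class VisionToLLMBridge(nn.Module):
    def __init__(self, vision_dim=1152, llm_dim=4096, num_latents=64):
        super().__init__()
        self.modality_projection = nn.Linear(vision_dim, llm_dim)       # match the LLM input dim
        self.latents = nn.Parameter(torch.randn(num_latents, llm_dim))  # learned perceiver queries
        self.resampler = nn.MultiheadAttention(llm_dim, num_heads=16, batch_first=True)

    def forward(self, vision_features):                # (batch, num_patches, vision_dim)
        x = self.modality_projection(vision_features)  # (batch, num_patches, llm_dim)
        queries = self.latents.expand(x.size(0), -1, -1)
        pooled, _ = self.resampler(queries, x, x)      # (batch, num_latents, llm_dim)
        return pooled  # these visual tokens are interleaved with text embeddings for the LLM

bridge = VisionToLLMBridge()
print(bridge(torch.randn(2, 729, 1152)).shape)  # torch.Size([2, 64, 4096])
```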

During the pre-training:
- The modality projection and perceiver resampler weights are newly initialized.
- We start with pre-trained models for the vision encoder and the LLM, and continue the training with LoRA.
- In total, we see 1.5T images!

We pre-train on 3 types of data, all publicly available:
- Interleaved image-text documents: our dataset OBELICS HuggingFaceM4/OBELICS
- Image-caption pairs: only synthetic captions!
- PDF documents: IDL and PDFA

We kept the aspect ratio of the images with the Patch n' Pack strategy, at a resolution of up to 980x980.
This also makes inference more efficient for lower-resolution images.

For the SFT, we build The Cauldron, a collection of 50 high-quality datasets in the user/assistant format.
It is a ready-to-use dataset for the fine-tuning of any VLM.
HuggingFaceM4/the_cauldron

Most current models, like LLaVA-NeXT, encode images with an excessive number of tokens (e.g., 2,880).
Instead, we focus on inference efficiency by training on a mix of images encoded with 64 tokens and with 320 tokens.
As a result, we perform favorably compared to the best models in our size class while being efficient at inference.
loubnabnl 
posted an update 9 months ago
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: tools like llm-swarm for scalable generation, plus fast model training and evaluation.
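
To see what those choices look like in the released data, a minimal sketch (the "stories" config and column names are assumptions; check the dataset card):

```python
# Minimal sketch: inspect how prompts, seed data, and target audiences vary in Cosmopedia.
from datasets import load_dataset

cosmo = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)
for sample in cosmo.take(1):
    print(list(sample.keys()))     # e.g. prompt, text, seed_data, format, audience
    print(sample["prompt"][:300])  # the curated prompt that produced this synthetic document
```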

Have a good read!
HugoLaurencon 
posted an update 10 months ago
With the new WebSight dataset, converting the screenshot of a web page to its corresponding HTML code is just one fine-tuning step away

We release a new version of our synthetic dataset:
- Real images within web pages 🖼️
- Tailwind CSS 🎨
- 2M examples 📈

Our initial release, v0.1, featured web designs in HTML + CSS, using simple colored rectangles as image placeholders.
It was a good start to help models grasp the basics of web page structure and coding associations.
Yet, it was missing the look of a real website.

To improve visual appeal, we've now embedded actual images in our web designs, ensuring they match the site's content for a more authentic look.

Switching to Tailwind CSS offers a more compact representation of the code.

We've also expanded our dataset to 2 million examples!
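
A minimal sketch for streaming a few examples (the "v0.2" config name for this release is an assumption; check the dataset card):

```python
# Minimal sketch: stream one WebSight example (screenshot + Tailwind/HTML source pair).
from datasets import load_dataset

websight = load_dataset("HuggingFaceM4/WebSight", "v0.2", split="train", streaming=True)
for sample in websight.take(1):
    print(list(sample.keys()))  # expect an image plus its corresponding HTML/Tailwind code
```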

After fine-tuning our forthcoming foundation vision-language model on this dataset, we've observed some encouraging capabilities, such as converting sketches directly into functional HTML code.

We're excited to hear your thoughts and suggestions for future versions. What would you like to see next? Feel free to open a discussion on the hub!

Dataset: HuggingFaceM4/WebSight
Technical report: Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset (2403.09029)
Blog post: https://huggingface.co/blog/websight
Google Colab: https://colab.research.google.com/drive/1LdamGKR2oacrDk-kYwz_Wfc1-RBUdzcO?usp=sharing

Work done with @VictorSanh @Leyo
loubnabnl 
posted an update 10 months ago
⭐ Today we’re releasing The Stack v2 & StarCoder2: a series of 3B, 7B & 15B code generation models trained on 3.3 to 4.5 trillion tokens of code:

- StarCoder2-15B matches or outperforms CodeLlama 34B, and approaches DeepSeek-33B on multiple benchmarks.
- StarCoder2-3B outperforms StarCoderBase-15B and similarly sized models.
- The Stack v2 is a 4x larger dataset than The Stack v1, resulting in 900B unique code tokens 🚀
As always, we released everything from models and datasets to curation code. Enjoy!
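
If you just want to try a checkpoint, here is a minimal completion sketch with transformers (assuming the 3B model to keep memory needs modest):

```python
# Minimal sketch: code completion with the smallest StarCoder2 checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")

prompt = "def fibonacci(n):\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```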

🔗 StarCoder2 collection: bigcode/starcoder2-65de6da6e87db3383572be1a
🔗 Paper: https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view
🔗 BlogPost: https://huggingface.co/blog/starcoder2
🔗 Code Leaderboard: bigcode/bigcode-models-leaderboard