Llama 3.2 Collection This collection hosts the transformers and original repos of the Llama 3.2 and Llama Guard 3 • 11 items • Updated 9 days ago • 322
view article Article 🌟 Easy Fine-Tuning with Hugging Face SQL Console, Notebook Creator, and SFT By asoria • 10 days ago • 12
view article Article Deep Learning over the Internet: Training Language Models Collaboratively Jul 15, 2021 • 4
NuminaMath Collection Datasets and models for training SOTA math LLMs. See our GitHub for training & inference code: https://github.com/project-numina/aimo-progress-prize • 6 items • Updated Jul 21 • 57
view article Article Querying Datasets with the Datasets Explorer Chrome Extension By cfahlgren1 • Jul 19 • 6
view article Article Announcing Finance Commons and the Bad Data Toolbox: Pioneering Open Data and Advanced Document Processing By Pclanglais • Jul 19 • 17
view article Article Enhancing Search Capabilities for Non-English Datasets in the Dataset Viewer By asoria • Jul 10 • 4
view article Article Experimenting with Automatic PII Detection on the Hub using Presidio Jul 10 • 23
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25 • 85
view article Article Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation By davanstrien • Jun 20 • 12
view article Article The CVPR Survival Guide: Discovering Research That's Interesting to YOU! By harpreetsahota • Jun 14 • 9
view article Article How to directly access 150k+ Hugging Face Datasets with DuckDB and query using GPT-4o By chilijung • May 31 • 10
Embedding Model Datasets Collection A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 67 items • Updated Jul 3 • 63
Albert Collection Les différents modèles à jour dans la famille Albert, les modèles archivés n'apparaissent pas dans cette collection. The various models behind Albert • 5 items • Updated May 29 • 7
view article Article Text2SQL using Hugging Face Dataset Viewer API and Motherduck DuckDB-NSQL-7B Apr 4 • 23
Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset Paper • 2403.09029 • Published Mar 14 • 54
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows Paper • 2402.10379 • Published Feb 16 • 29
All the ImageNets Collection Noteworthy instances of ImageNet on the Hub. Vetted and tested with timm train and validation scripts. • 7 items • Updated Jun 12 • 4
Power Hungry Processing: Watts Driving the Cost of AI Deployment? Paper • 2311.16863 • Published Nov 28, 2023 • 6
Pokemons dataset captioned with different models Collection The Pokemons dataset from Lambda Labs is quite popular in the diffusion community because it lets us quickly validate ideas. • 3 items • Updated Nov 28, 2023 • 3
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Paper • 2303.03915 • Published Mar 7, 2023 • 6
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements Paper • 2210.01970 • Published Sep 30, 2022 • 11
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 28
HuggingFace's Transformers: State-of-the-art Natural Language Processing Paper • 1910.03771 • Published Oct 9, 2019 • 16
Datasets: A Community Library for Natural Language Processing Paper • 2109.02846 • Published Sep 7, 2021 • 8