FineData

Team

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. We are part of the Hugging Face Science team (hf.co/science).

Recent Activity

cfahlgren1 submitted a paper 1 day ago

How AI Impacts Skill Formation

hynky updated a Space 21 days ago

HuggingFaceFW/README

hynky updated a collection 21 days ago

Papers

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

cfahlgren1

submitted a paper to Daily Papers 1 day ago

How AI Impacts Skill Formation

Paper • 2601.20245 • Published 3 days ago • 6

hynky

updated a Space 21 days ago

README

hynky

updated a collection 21 days ago

📄 FinePDFs

82 items • Updated 21 days ago • 27

guipenedo

in HuggingFaceFW/finetranslations-edu 22 days ago

Update README.md

#1 opened 22 days ago by

lhoestq

in HuggingFaceFW/finetranslations-edu 22 days ago

Update README.md

#1 opened 22 days ago by

guipenedo

published a dataset 22 days ago

HuggingFaceFW/finetranslations-edu

Viewer • Updated 22 days ago • 109M • 3.17k • 24

guipenedo

updated 2 datasets 22 days ago

HuggingFaceFW/finetranslations-edu

Viewer • Updated 22 days ago • 109M • 3.17k • 24

HuggingFaceFW/finetranslations

Viewer • Updated 22 days ago • 3.33B • 56.9k • 257

guipenedo

updated a Space 22 days ago

README

guipenedo

updated a dataset 22 days ago

HuggingFaceFW/admin

Viewer • Updated 22 days ago • 18 • 37.2k • 3

hynky

in HuggingFaceFW/finepdfs 22 days ago

Row count mismatch for the unknown languages subset

#25 opened 3 months ago by

hynky

updated a dataset 22 days ago

HuggingFaceFW/finepdfs

Viewer • Updated 22 days ago • 476M • 31.7k • 810

guipenedo

published a dataset 22 days ago

HuggingFaceFW/finetranslations

Viewer • Updated 22 days ago • 3.33B • 56.9k • 257

hynky

in HuggingFaceFW/finepdfs 23 days ago

How to use this dataset to extract PDFs by subject?

#14 opened 5 months ago by

Can additional corpuses further train this model?

#13 opened 5 months ago by

Decontamination against benchmarks?

#11 opened 5 months ago by

MarCognity-AI for HuggingFaceFW/finepdfs

#23 opened 4 months ago by

hynky

updated a Space 24 days ago

FinePDFs: Liberating 3T of the finest tokens from PDFs

hynky

published a Space 24 days ago

FinePDFs: Liberating 3T of the finest tokens from PDFs

guipenedo

updated a Space 25 days ago

FinePDFs: Liberating 3T of the finest tokens from PDFs