Sultan Alrashed PRO

SultanR

AI & ML interests

Smol language modelling!

Recent Activity

liked a dataset 2 days ago
2A2I/argilla-dpo-mix-7k-arabic
liked a dataset 2 days ago
HuggingFaceH4/ultrachat_200k
liked a dataset 2 days ago
OpenGVLab/MMPR-v1.1

Organizations

KAUST Center of Excellence in Generative AI

SultanR's activity

reacted to anton-l's post with 🔥 4 days ago
Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
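
To make the filtering step concrete, here is a minimal sketch of a single score-and-threshold pass over candidate pages. It is not the FineMath pipeline: `toy_math_score` is a hypothetical keyword heuristic standing in for the trained classifier, and the function names, sample pages, and the 0.3 threshold are all assumptions chosen for illustration.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical stand-in for a trained quality classifier. It scores a page by
# the fraction of whitespace-separated tokens that contain a digit, a math
# operator, or a LaTeX-style command. Illustration only, not the real model.
MATH_TOKEN = re.compile(r"\d|[=+\-*/^<>]|\\[a-zA-Z]+")


def toy_math_score(text: str) -> float:
    """Return the share of tokens that look mathematical (0.0 to 1.0)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if MATH_TOKEN.search(tok))
    return hits / len(tokens)


def filter_pages(
    pages: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Keep only the pages whose score meets the threshold (one pass)."""
    return [page for page in pages if score_fn(page) >= threshold]


# Assumed sample inputs; real candidates would be pages extracted from Common Crawl.
pages = [
    "Solve 2x + 3 = 11 so x = 4",
    "Our team won the match last night",
    "The area is pi * r^2 where r = 5",
]

kept = filter_pages(pages, toy_math_score, threshold=0.3)
for page in kept:
    print(page)
```

In an iterative setup, the pages kept by one pass could be used to refine the scorer or to widen the candidate pool before the next pass; that loop is omitted here because the post does not describe its details.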

We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We're also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2