Introducing 📐𝐅𝐢𝐧𝐞𝐌𝐚𝐭𝐡: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
We build the dataset by: 🛠️ carefully extracting math data from Common Crawl; 🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify mathematical reasoning and deduction.
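The filter-and-recall idea can be sketched as a loop that scores pages and keeps the ones above a threshold. The `score_math_quality` heuristic and the 0–5 scale below are stand-ins for the actual classifier, which isn't specified here:

```python
# Sketch of an iterative filter-and-recall loop over crawled pages.
# `score_math_quality` stands in for the real classifier trained on
# synthetic annotations; here it is a trivial keyword heuristic.

def score_math_quality(text: str) -> int:
    """Toy stand-in: return a 0-5 'math reasoning' score."""
    signals = ("prove", "theorem", "equation", "solve", "=")
    return min(5, sum(s in text.lower() for s in signals))

def filter_and_recall(pages, threshold=3, rounds=2):
    """Keep pages scoring >= threshold. In a real pipeline, each round
    would retrain the classifier on the kept pages and re-score the
    remaining crawl to recall math pages missed earlier."""
    kept = [p for p in pages if score_math_quality(p) >= threshold]
    for _ in range(rounds - 1):
        # Placeholder for retraining + re-scoring the full crawl.
        kept = [p for p in pages if score_math_quality(p) >= threshold]
    return kept

pages = [
    "Solve the equation 2x + 3 = 7, then prove the result is unique.",
    "Top 10 celebrity diets this summer.",
]
print(filter_and_recall(pages))  # keeps only the math page
```

The real pipeline's value comes from the retraining step each round, which the stub above only marks with a comment.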
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath and observed notable gains over the baseline model and other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! 🚀 We’re also releasing all the ablation models as well as the evaluation code.
🌟 Progress in the German FineWeb edu reproduction 🌟
We're delighted to share the launch of our new Data Quality Classification Model, designed specifically for evaluating educational content in German. This tool uses advanced machine learning techniques to assess texts across all educational levels, from primary school to university.
🔍 Inspired by Hugging Face's FineWeb-Edu dataset, we've worked hard to refine our data classification methods, ensuring educators and learners can access top-quality resources. We're excited about the future as we continue improving our models and expanding our datasets.
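A FineWeb-Edu-style classifier emits an educational-quality score that is then rounded and thresholded to decide which documents to keep. A minimal sketch of that post-processing step (the 0–5 scale and threshold of 3 follow the original FineWeb-Edu recipe; the raw regression output is simulated here):

```python
# Post-processing for a FineWeb-Edu-style educational-quality classifier.
# In the real pipeline the raw score comes from a regression head on a
# transformer encoder; here we only show the rounding + thresholding.

def to_int_score(raw: float) -> int:
    """Clamp a raw regression output to the 0-5 integer scale."""
    return max(0, min(5, round(raw)))

def keep_for_pretraining(raw: float, threshold: int = 3) -> bool:
    """FineWeb-Edu keeps documents whose integer score >= threshold."""
    return to_int_score(raw) >= threshold

print(to_int_score(3.7), keep_for_pretraining(3.7))  # 4 True
print(to_int_score(1.2), keep_for_pretraining(1.2))  # 1 False
```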
🙏 A huge thank you to David and Daryoush from Vago Solutions; Björn and Jan from Ellamind / DiscoResearch for their expert insights throughout this project. Your support has been crucial. This project was made possible by the support of PrimeLine AI.
The model was converted from Mistral Instruct v0.2 using a novel technique called trans-tokenization. As a result, the model uses a brand-new tokenizer, fully tailored for the Tatar language.
We also release a model that can be fine-tuned for translating English or Russian into Tatar, achieving performance similar to commercial offerings:
Unlocking the Power of locally running Llama-3 8B Model Agents with Chat-UI! 🔥🚀✨
I'm thrilled to share my hackathon-style side project:
1. Fine-tuned Llama-3 8B for function calling using PEFT QLoRA, since the instruct Llama-3 model doesn't support it. The Colab notebook is here: https://lnkd.in/ggJMzqh2 🛠️
2. Released the fine-tuned model along with the 4-bit quants here: https://lnkd.in/gNpFKY6V ✨
3. Cloned Hugging Face Chat-UI (https://lnkd.in/gKBKuUBQ) and made it compatible with function calling by building on the PR https://lnkd.in/gnqFuAd4, for my model and a local-inference use case using Ollama. This was a steep learning curve; I stayed awake the whole night to get it working. 💪🏽
4. Used SerpAPI for web browsing and the MongoDB Atlas free tier to persist conversations and assistant configs. 🔎
5. More work is needed on switching between using tools and responding directly; this is where I see the model break. 🧐
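Function calling hinges on a chat format the model is fine-tuned to emit: a tool schema in context, and a structured tool call as the assistant turn. A hedged sketch of what one training example might look like — the exact schema, role names, and JSON layout depend on the fine-tune, so everything below is illustrative, not the actual format used:

```python
import json

# Illustrative function-calling training example: a tool schema, a user
# turn, and the structured tool call the model learns to emit.

tool_schema = {
    "name": "web_search",
    "description": "Search the web via SerpAPI.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

example = [
    {"role": "system", "content": "Tools: " + json.dumps([tool_schema])},
    {"role": "user", "content": "What's the weather in Paris today?"},
    {"role": "assistant",
     "content": json.dumps({"tool": "web_search",
                            "arguments": {"query": "weather Paris today"}})},
]

# The agent loop parses the assistant turn back into an executable call:
call = json.loads(example[-1]["content"])
print(call["tool"], call["arguments"]["query"])
# web_search weather Paris today
```

The hard part noted in point 5 is exactly this parse step: the model must reliably decide between emitting valid tool-call JSON and answering in plain text.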
How cool is it that we're approaching a ChatGPT-like experience with a locally hosted agent model running on your own laptop! 💻