WebOrganizer/LM-1b_1x-DCLMFasttext
A 1.4B parameter model trained on 29B tokens from WebOrganizer/Corpus-200B.
The training data for this model was selected via:
- Selection method: top scores from the DCLM-fastText model
- Domain definition: n/a (global selection)
- Domain mixture: n/a
Repository Contents
Besides the HuggingFace model and tokenizer, the repository contains:
- `open_lm/`: Contains the OpenLM config and final checkpoint
- `evals/`: Evaluation results for various benchmarks
  - `core_9mcqa/`: Results of 9 multiple-choice QA tasks with the OLMES evaluation framework
  - `mmlu/`: MMLU results with the OLMES evaluation framework
  - `dclm/`: Results using the DCLM evaluation framework
  - `perplexity/`: Perplexity results using the HuggingFace trainer
- `indices.tar.zst`: The indices of the selected documents in each shard of the Corpus-200B dataset used for training. The indices can be extracted with `tar --use-compress-program "zstd" -xf indices.tar.zst`.
Usage
To use this model, you need to install the `open_lm` library and add `from open_lm.hf import *` before loading the model with `AutoModel.from_pretrained(...)`.
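A minimal loading sketch, assuming `open_lm` is installed and that the Hugging Face `AutoModel`/`AutoTokenizer` classes can resolve this repository as described above (the exact Auto class to use may depend on your downstream task):

```python
# Importing open_lm.hf registers the OpenLM architecture with the HF Auto classes
from open_lm.hf import *

from transformers import AutoModel, AutoTokenizer

# Load the model and tokenizer from this repository
model = AutoModel.from_pretrained("WebOrganizer/LM-1b_1x-DCLMFasttext")
tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/LM-1b_1x-DCLMFasttext")

# Quick sanity check: run a forward pass on a short prompt
inputs = tokenizer("Organize the Web:", return_tensors="pt")
outputs = model(**inputs)
```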
Citation
@article{wettig2025organize,
title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
journal={arXiv preprint arXiv:2502.10341},
year={2025}
}