language: en
tags:
- multiberts
- multiberts-seed_2
license: mit
datasets:
- wikimedia/wikipedia
- bookcorpus/bookcorpus
base_model:
- google/multiberts-seed_2-step_0k
library_name: transformers
EarlyBERTs
Random Seed 2 | Steps 10 – 40,000
🐤 EarlyBERTs reproduces MultiBERTs (Sellam et al., 2022) and introduces more granular checkpoints covering the initial and critical learning phases. In "The Subspace Chronicles" (Müller-Eberstein et al., 2023), we leverage these checkpoints to study the models' early learning dynamics.
This suite builds on MultiBERTs and the underlying BERT architecture, covering seeds 0–4, for which intermediate checkpoints were originally released. For each seed, we provide 31 additional checkpoints at steps 10, 100, 200, ..., 1,000, 2,000, ..., 20,000, and 40,000, which are stored as respective model revisions (e.g., revision=step11000).
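The checkpoint schedule described above can be enumerated programmatically. The following is a minimal sketch, assuming the ellipses stand for increments of 100 up to step 1,000 and increments of 1,000 up to step 20,000 (an interpretation that yields the stated 31 checkpoints). The helper name `earlyberts_steps` is hypothetical and not part of the project code.

```python
from typing import List


def earlyberts_steps() -> List[int]:
    """Enumerate checkpoint steps (hypothetical helper, assumed schedule)."""
    steps = [10]                                # very first checkpoint
    steps += list(range(100, 1001, 100))        # 100, 200, ..., 1,000
    steps += list(range(2000, 20001, 1000))     # 2,000, 3,000, ..., 20,000
    steps.append(40000)                         # final checkpoint
    return steps


steps = earlyberts_steps()
# Revision names follow the `step<N>` pattern used in the example above.
revisions = [f"step{s}" for s in steps]
print(len(steps), revisions[:3], revisions[-1])
```

Each entry of `revisions` can then be passed as the `revision` argument when loading a model.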
Model Details
Model Developers
Max Müller-Eberstein as part of the NLPnorth research unit at the IT University of Copenhagen, Denmark.
Variations
EarlyBERTs cover seeds 0–4 (in respective repositories) and steps 10–40,000 (in respective model revision branches).
Input
Text only.
Output
Text and/or embeddings of the input.
Additionally, the CLS-classification head is trained on next sentence prediction as in Devlin et al. (2019).
Model Architecture
EarlyBERTs are based on the original BERT architecture (Devlin et al., 2019) and load the respective MultiBERTs seed at step 0 as initialization.
Research Paper
Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training (Müller-Eberstein et al., 2023).
Training
Data
As neither the original BERT nor the MultiBERTs pre-training data is publicly available, we gather a corresponding corpus using fully public versions of both the English Wikipedia and BookCorpus. Scripts to re-create the exact data ordering, sentence pairing and subword masking can be found in the project repository.
Hyperparameters
We replicate the exact training hyperparameters of MultiBERTs and document them in our research paper. Code to reproduce our training procedure can be found in the project repository.
Usage
Loading the intermediate checkpoint for a specific seed and step follows the standard Hugging Face transformers API:
from transformers import AutoTokenizer, AutoModel
seed, step = 0, 7000
tokenizer = AutoTokenizer.from_pretrained(f'personads/earlyberts-seed{seed}')
model = AutoModel.from_pretrained(f'personads/earlyberts-seed{seed}', revision=f'step{step}')
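To compare several seeds and steps, it can help to first build the list of repository and revision names and only then load each model. The following is a rough sketch under the naming pattern shown above. The helpers `checkpoint_refs` and `load_checkpoint` are hypothetical names introduced for illustration, and the seed range check simply mirrors the seeds 0–4 stated in this card.

```python
from typing import Iterable, List, Tuple


def checkpoint_refs(seeds: Iterable[int], steps: Iterable[int]) -> List[Tuple[str, str]]:
    """Build (repo_id, revision) pairs for every seed/step combination."""
    steps = list(steps)  # allow re-use across seeds even if given an iterator
    refs = []
    for seed in seeds:
        if seed not in range(5):  # this card states seeds 0-4
            raise ValueError(f"unknown seed: {seed}")
        for step in steps:
            refs.append((f"personads/earlyberts-seed{seed}", f"step{step}"))
    return refs


def load_checkpoint(repo_id: str, revision: str):
    """Load tokenizer and model for one checkpoint (downloads from the Hub)."""
    # Imported lazily so that building the reference list needs no dependencies.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModel.from_pretrained(repo_id, revision=revision)
    return tokenizer, model


refs = checkpoint_refs(seeds=[0, 2], steps=[10, 7000])
print(refs)
```

Calling `load_checkpoint(*refs[0])` would then fetch the first checkpoint in the list. Keeping the loading step separate means only one model needs to be held in memory at a time.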
Citation
If you find these models useful, please cite this work as well as the original MultiBERTs paper:
@inproceedings{muller-eberstein-etal-2023-subspace,
title = "Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training",
author = {M{\"u}ller-Eberstein, Max and
van der Goot, Rob and
Plank, Barbara and
Titov, Ivan},
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.879",
doi = "10.18653/v1/2023.findings-emnlp.879",
pages = "13190--13208"
}
@inproceedings{
sellam2022the,
title={The Multi{BERT}s: {BERT} Reproductions for Robustness Analysis},
author={Thibault Sellam and Steve Yadlowsky and Ian Tenney and Jason Wei and Naomi Saphra and Alexander D'Amour and Tal Linzen and Jasmijn Bastings and Iulia Raluca Turc and Jacob Eisenstein and Dipanjan Das and Ellie Pavlick},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=K0E_F0gFDgA}
}