personads committed on
Commit
69e50c5
1 Parent(s): de18e29

updated model card

Files changed (1)
  1. README.md +108 -3
README.md CHANGED
@@ -1,3 +1,108 @@
- ---
- license: mit
- ---
+ ---
+ language: en
+ tags:
+ - multiberts
+ - multiberts-seed_4
+ license: mit
+ datasets:
+ - wikimedia/wikipedia
+ - bookcorpus/bookcorpus
+ base_model:
+ - google/multiberts-seed_4-step_0k
+ library_name: transformers
+ ---
+
+ # EarlyBERTs
+
+ **Random Seed** 4 | **Steps** 10 – 40,000
+
+ 🐤 **EarlyBERTs** reproduce the [MultiBERTs](http://goo.gle/multiberts) ([Sellam et al., 2022](https://openreview.net/forum?id=K0E_F0gFDgA)) and introduce more granular checkpoints covering the initial and critical learning phases. In "Subspace Chronicles" ([Müller-Eberstein et al., 2023](https://mxij.me/x/subspace-chronicles)), we leverage these checkpoints to study the models' early learning dynamics.
+
+ This suite builds on MultiBERTs and the underlying BERT architecture, covering seeds 0 – 4, for which intermediate checkpoints were originally released. For each seed, we provide 31 additional checkpoints at steps 10, 100, 200, ..., 1,000, 2,000, ..., 20,000, 40,000, which are stored as separate model revisions (e.g., `revision=step11000`).
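
The checkpoint schedule described above can be enumerated programmatically. The following is a minimal sketch, assuming the elided ranges are regular (every 100 steps up to 1,000 and every 1,000 steps up to 20,000), which is consistent with the stated total of 31 checkpoints. The helper name `earlyberts_steps` is hypothetical and not part of any released code.

```python
def earlyberts_steps():
    """Return the assumed list of EarlyBERTs checkpoint steps (hypothetical helper)."""
    steps = [10]                              # earliest checkpoint
    steps += list(range(100, 1001, 100))      # 100, 200, ..., 1,000
    steps += list(range(2000, 20001, 1000))   # 2,000, 3,000, ..., 20,000
    steps.append(40000)                       # last checkpoint in the suite
    return steps


steps = earlyberts_steps()
# Revision names follow the pattern given in the text, e.g. "step11000".
revisions = [f"step{step}" for step in steps]
print(len(steps), revisions[0], revisions[-1])
```

The resulting strings can be passed as the `revision` argument when loading a model.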
+
+ ## Model Details
+
+ **Model Developers**
+
+ [Max Müller-Eberstein](https://mxij.me) as part of the [NLPnorth research unit](https://nlpnorth.github.io) at the [IT University of Copenhagen](https://itu.dk), Denmark.
+
+ **Variations**
+
+ EarlyBERTs cover seeds 0–4 (in respective repositories) and steps 10–40,000 (in respective model revision branches).
+
+ **Input**
+
+ Text only.
+
+ **Output**
+
+ Text and/or embeddings of the input.
+
+ Additionally, the CLS classification head is trained on next sentence prediction as in [Devlin et al. (2019)](https://aclanthology.org/N19-1423/).
+
+ **Model Architecture**
+
+ EarlyBERTs are based on the original BERT architecture [(Devlin et al., 2019)](https://aclanthology.org/N19-1423/) and load the respective MultiBERTs seed at step 0 as initialization.
+
+ **Research Paper**
+
+ Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training ([Müller-Eberstein et al., 2023](https://mxij.me/x/subspace-chronicles)).
+
+ ## Training
+
+ **Data**
+
+ As neither the original BERT nor the MultiBERTs pre-training data is publicly available, we gather a corresponding corpus using fully public versions of both the [English Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and [BookCorpus](https://huggingface.co/datasets/bookcorpus/bookcorpus). Scripts to re-create the exact data ordering, sentence pairing and subword masking can be found in [the project repository](http://mxij.me/x/emnlp-2023-code).
+
+ **Hyperparameters**
+
+ We replicate the exact training hyperparameters of MultiBERTs and document them in [our research paper](https://mxij.me/x/subspace-chronicles). Code to reproduce our training procedure can be found in [the project repository](http://mxij.me/x/emnlp-2023-code).
+
+ ## Usage
+
+ Loading the intermediate checkpoint for a specific seed and step follows the standard Hugging Face Transformers API:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ # Choose a seed (0–4) and one of the released checkpoint steps.
+ seed, step = 0, 7000
+
+ # The tokenizer is loaded from the default branch; the checkpoint step is selected via `revision`.
+ tokenizer = AutoTokenizer.from_pretrained(f'personads/earlyberts-seed{seed}')
+ model = AutoModel.from_pretrained(f'personads/earlyberts-seed{seed}', revision=f'step{step}')
+ ```
74
+ ## Citation
75
+
76
+ If you find these models useful, please cite this, as well as the original MultiBERTs works:
77
+
78
+ ```
79
+ @inproceedings{muller-eberstein-etal-2023-subspace,
80
+ title = "Subspace Chronicles: How Linguistic Information Emerges, Shifts and Interacts during Language Model Training",
81
+ author = {M{\"u}ller-Eberstein, Max and
82
+ van der Goot, Rob and
83
+ Plank, Barbara and
84
+ Titov, Ivan},
85
+ editor = "Bouamor, Houda and
86
+ Pino, Juan and
87
+ Bali, Kalika",
88
+ booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
89
+ month = dec,
90
+ year = "2023",
91
+ address = "Singapore",
92
+ publisher = "Association for Computational Linguistics",
93
+ url = "https://aclanthology.org/2023.findings-emnlp.879",
94
+ doi = "10.18653/v1/2023.findings-emnlp.879",
95
+ pages = "13190--13208"
96
+ }
97
+ ```
98
+
99
+ ```bibtex
100
+ @inproceedings{
101
+ sellam2022the,
102
+ title={The Multi{BERT}s: {BERT} Reproductions for Robustness Analysis},
103
+ author={Thibault Sellam and Steve Yadlowsky and Ian Tenney and Jason Wei and Naomi Saphra and Alexander D'Amour and Tal Linzen and Jasmijn Bastings and Iulia Raluca Turc and Jacob Eisenstein and Dipanjan Das and Ellie Pavlick},
104
+ booktitle={International Conference on Learning Representations},
105
+ year={2022},
106
+ url={https://openreview.net/forum?id=K0E_F0gFDgA}
107
+ }
108
+ ```