Commit b66816c
Parent(s): 5e14a20
Update README.md
README.md CHANGED
@@ -79,7 +79,7 @@ The masking procedure used is the standard one for Bert-style training:
 
 ### Pretraining
 
-The model was trained with
+The model was trained with 128 A100 80GB GPUs on 300B tokens, with an effective batch size of 1M tokens. The sequence length used was 1000 tokens. The Adam optimizer [38] was used with a learning rate schedule and standard values for the exponential decay rates and epsilon constant: β1 = 0.9, β2 = 0.999 and ε = 1e-8. During a first warmup period, the learning rate was increased linearly from 5e-5 to 1e-4 over 16k steps, before decreasing following a square-root decay until the end of training.
 
 
 ### BibTeX entry and citation info
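
The learning rate schedule described in the added line can be sketched as a PyTorch `LambdaLR`. This is a minimal illustration of the stated linear warmup and square-root decay, assuming a standard PyTorch training loop; the model, variable names, and scheduler choice below are placeholders, not taken from this repository.

```python
import torch

# Constants taken from the description above; names are illustrative only.
START_LR = 5e-5        # learning rate at step 0
PEAK_LR = 1e-4         # learning rate reached at the end of warmup
WARMUP_STEPS = 16_000  # linear warmup duration in optimizer steps

def lr_lambda(step: int) -> float:
    """Multiplier applied to PEAK_LR at a given optimizer step."""
    if step < WARMUP_STEPS:
        # Linear warmup: interpolate from START_LR to PEAK_LR over 16k steps.
        frac = step / WARMUP_STEPS
        return (START_LR + frac * (PEAK_LR - START_LR)) / PEAK_LR
    # Square-root decay: multiplier is 1.0 at the end of warmup and then
    # falls off as 1 / sqrt(step / WARMUP_STEPS).
    return (WARMUP_STEPS / step) ** 0.5

model = torch.nn.Linear(8, 8)  # placeholder model for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR,
                             betas=(0.9, 0.999), eps=1e-8)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Per training step, call optimizer.step() followed by scheduler.step().
```

With these values the learning rate starts at 5e-5, reaches 1e-4 at step 16k, and then decays proportionally to 1/sqrt(step) until the end of training.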