Update README.md
README.md CHANGED
@@ -61,7 +61,11 @@ Users (both direct and downstream) should be made aware of the risks, biases and
 
 # Training
 
-
+The model developers write:
+
+> In all experiments, we use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations (Hendrycks and Gimpel, 2016), a dropout rate of 0.1 and learned positional embeddings. We train our models with the Adam optimizer (Kingma and Ba, 2014), a linear warm-up (Vaswani et al., 2017) and learning rates varying from 10^−4 to 5·10^−4.
+
+See the [associated paper](https://arxiv.org/pdf/1901.07291.pdf) for links, citations, and further details on the training data and training procedure.
 
 The model developers also write that:
 
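As an illustration of the quoted configuration, the sketch below maps those hyperparameters onto Hugging Face's `XLMConfig` and a standard PyTorch Adam-plus-linear-warm-up setup. This is a minimal sketch, not the authors' training script: `vocab_size`, `n_layers`, and the warm-up/step counts are placeholders that the quote does not specify.

```python
# Minimal sketch (not the authors' code) of the quoted configuration,
# expressed with Hugging Face's XLMConfig and PyTorch's Adam optimizer.
from torch.optim import Adam
from transformers import XLMConfig, XLMModel, get_linear_schedule_with_warmup

config = XLMConfig(
    emb_dim=1024,                 # "1024 hidden units"
    n_heads=8,                    # "8 heads"
    gelu_activation=True,         # GELU activations (Hendrycks and Gimpel, 2016)
    dropout=0.1,                  # dropout rate of 0.1
    attention_dropout=0.1,        # assumption: same rate on attention weights
    sinusoidal_embeddings=False,  # i.e. learned positional embeddings
    n_layers=12,                  # placeholder: depth is not given in the quote
    vocab_size=30145,             # placeholder vocabulary size
)
model = XLMModel(config)

# Adam with a linear warm-up; the quote gives learning rates from 10^−4 to 5·10^−4.
optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4_000,       # placeholder: warm-up length is not specified
    num_training_steps=100_000,   # placeholder total step count
)
```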