Update README.md
README.md
### Pretraining

The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially due to the diversity of sources in our training data. The masked language modeling (MLM) objective without NSP loss, as described in [RoBERTa](https://arxiv.org/abs/1907.11692), was used for pretraining. The model was pretrained with a sequence length of 512 for all steps.
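
For concreteness, the sketch below shows how an MLM-only objective (masked-token prediction, with no next-sentence-prediction pairs) can be expressed with the Hugging Face `transformers` data collator. This is an illustration, not the actual SambaNova training code; the 15% masking probability and the toy input text are assumptions, not values taken from this model card.

```python
# Illustrative sketch of the MLM-without-NSP objective, not the actual pretraining code.
# Assumptions: standard 15% masking probability and a toy in-memory example.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
)

tokenizer = AutoTokenizer.from_pretrained("pile-of-law/legalbert-large-1.7M-1")
model = AutoModelForMaskedLM.from_pretrained("pile-of-law/legalbert-large-1.7M-1")

# MLM-only collator: randomly masks tokens for prediction; no NSP sentence pairs.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

texts = [
    "The court granted the defendant's motion for summary judgment, holding that "
    "the plaintiff failed to raise a genuine dispute of material fact on causation."
]
encodings = tokenizer(texts, truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encodings["input_ids"]])

outputs = model(**batch)   # outputs.loss is the masked-LM cross-entropy
print(outputs.loss.item())
```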

We trained two models with the same setup in parallel training runs, with different random seeds. We selected the model with the lowest log likelihood, [pile-of-law/legalbert-large-1.7M-1](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1), which we refer to as PoL-BERT-Large, for experiments, but also release the second model, [pile-of-law/legalbert-large-1.7M-2](https://huggingface.co/pile-of-law/legalbert-large-1.7M-2).
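
The released checkpoints can be loaded directly with `transformers`. A minimal usage sketch with the `fill-mask` pipeline follows; the example sentence is illustrative only, and `[MASK]` is assumed to be the tokenizer's mask token.

```python
# Minimal usage sketch for the released PoL-BERT-Large checkpoint.
# Swap in "pile-of-law/legalbert-large-1.7M-2" to try the second released model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="pile-of-law/legalbert-large-1.7M-1")
for pred in fill_mask("The defendant filed a motion to [MASK] the complaint."):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```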

## Evaluation results

When finetuned on the CaseHOLD variant provided by the [LexGLUE paper](https://arxiv.org/abs/2110.00976), this model, PoL-BERT-Large, achieves the following results. In the table below, we also report results for [BERT-Large-Uncased](https://huggingface.co/bert-large-uncased) and [CaseLaw-BERT](https://huggingface.co/zlucia/custom-legalbert). We report results for models with hyperparameter tuning on the downstream task, along with the result reported for CaseLaw-BERT in the [LexGLUE paper](https://arxiv.org/abs/2110.00976), which uses a fixed experimental setup.
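
For reference, the sketch below shows how CaseHOLD can be framed as a five-way multiple-choice task for finetuning with `transformers`. The `coastalcph/lex_glue` dataset identifier, its field names, and the preprocessing are assumptions for illustration; this is not the hyperparameter-tuned setup behind the reported results.

```python
# Sketch: scoring one CaseHOLD example as five-way multiple choice with PoL-BERT-Large.
# Assumptions: LexGLUE's CaseHOLD split is available on the Hub as
# "coastalcph/lex_glue" / "case_hold" with "context", "endings", and "label" fields.
# The multiple-choice head is randomly initialized here and still needs finetuning.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForMultipleChoice

dataset = load_dataset("coastalcph/lex_glue", "case_hold", split="validation")
tokenizer = AutoTokenizer.from_pretrained("pile-of-law/legalbert-large-1.7M-1")
model = AutoModelForMultipleChoice.from_pretrained("pile-of-law/legalbert-large-1.7M-1")

example = dataset[0]
# Pair the citing context with each of the five candidate holdings.
enc = tokenizer([example["context"]] * 5, example["endings"],
                truncation=True, max_length=512, padding=True, return_tensors="pt")
# The multiple-choice head expects inputs shaped (batch, num_choices, seq_len).
batch = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**batch).logits   # shape: (1, 5)
print("predicted holding:", logits.argmax(-1).item(), "gold label:", example["label"])
```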