---
language:
- en
pipeline_tag: fill-mask
---

# LegalBERT large model (uncased)

Pretrained model on English language legal and administrative text using the [RoBERTa](https://arxiv.org/abs/1907.11692) pretraining objective.

## Model description

LegalBERT large is a transformers model with the [BERT large model (uncased)](https://huggingface.co/bert-large-uncased) architecture pretrained on the Pile of Law, a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining.

## Intended uses & limitations

You can use the raw model for masked language modeling or fine-tune it for a downstream task. Since this model was pretrained on an English-language legal and administrative text corpus, legal downstream tasks will likely be more in-domain for this model.

## How to use
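
A minimal sketch of masked-language-model inference with the `transformers` pipeline, assuming this card accompanies the [legalbert-large-1.7M-1](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1) checkpoint referenced below; the example sentence is purely illustrative.

```python
from transformers import pipeline

# Fill-mask inference; the tokenizer uses the BERT-style [MASK] token.
fill_mask = pipeline("fill-mask", model="pile-of-law/legalbert-large-1.7M-1")

for prediction in fill_mask("The court granted the motion to [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```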

## Limitations and bias

## Training data

The LegalBERT model was pretrained on the Pile of Law, a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, etc. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
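
The snippet below is a hedged sketch of streaming the corpus with the `datasets` library; the dataset identifier `pile-of-law/pile-of-law`, the `all` configuration, and the `text` field are assumptions to verify against the Pile of Law dataset card.

```python
from datasets import load_dataset

# Stream the corpus instead of downloading ~256GB up front.
# NOTE: dataset id, configuration, and field names are assumptions; see the dataset card.
pile_of_law = load_dataset("pile-of-law/pile-of-law", "all", split="train", streaming=True)

for i, doc in enumerate(pile_of_law):
    print(doc["text"][:200])
    if i == 2:
        break
```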

## Training procedure

### Preprocessing

The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to the Pile of Law using the [HuggingFace WordPiece tokenizer](https://github.com/huggingface/tokenizers), plus 3,000 randomly sampled legal terms from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens. The 80-10-10 mask/corrupt/leave split, as in [BERT](https://arxiv.org/abs/1810.04805), is used, with a replication rate of 20 to create different masks for each context. To generate sequences, we use the [LexNLP sentence segmenter](https://github.com/LexPredict/lexpredict-lexnlp), which handles sentence segmentation for legal citations (which are otherwise often mistaken for sentence boundaries).
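
As a rough sketch of the vocabulary construction described above, the snippet below fits an uncased WordPiece vocabulary with the HuggingFace `tokenizers` library and then appends sampled legal terms; the file path and the example terms are placeholders, not the actual training artifacts.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# Uncased WordPiece tokenizer fit on raw text shards (illustrative sketch).
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

trainer = trainers.WordPieceTrainer(
    vocab_size=29_000,  # learned tokens; legal terms are appended afterwards
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(["pile_of_law_shard.txt"], trainer)  # placeholder path

# Append sampled legal terms (e.g. from Black's Law Dictionary) toward the 32,000-token vocabulary.
tokenizer.add_tokens(["estoppel", "replevin", "certiorari"])  # illustrative subset
tokenizer.save("legalbert-wordpiece.json")
```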

### Pretraining

The model was trained on a SambaNova cluster, with 8 RDUs, for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially due to the diversity of sources in our training data. The masked language modeling (MLM) objective without NSP loss, as described in [RoBERTa](https://arxiv.org/abs/1907.11692), was used for pretraining. The model was pretrained with a sequence length of 512 for all steps.

We trained two models with the same configuration in parallel model training runs, with different random seeds. We selected the model with the lowest log likelihood, [legalbert-large-1.7M-1](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1), which we refer to as PoL-BERT-Large, for experiments, but also release the second model, [legalbert-large-1.7M-2](https://huggingface.co/pile-of-law/legalbert-large-1.7M-2).
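
The actual run used SambaNova hardware and infrastructure that is not reproduced here; the snippet below is only an illustrative sketch of an MLM-without-NSP objective with the `transformers` Trainer, reusing the learning rate above and reaching an effective batch size of 128 via gradient accumulation. The tiny in-memory corpus and the step count are placeholders.

```python
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# MLM-only objective (no NSP head), matching the RoBERTa-style setup described above.
tokenizer = AutoTokenizer.from_pretrained("pile-of-law/legalbert-large-1.7M-1")
model = BertForMaskedLM.from_pretrained("pile-of-law/legalbert-large-1.7M-1")

# Tiny in-memory corpus standing in for Pile of Law shards.
texts = [
    "The court granted the defendant's motion to dismiss.",
    "This agreement shall be governed by the laws of the State of Delaware.",
]
encodings = tokenizer(texts, truncation=True, max_length=512)
train_dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

# Dynamic masking is applied on the fly by the collator.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legalbert-mlm",
    learning_rate=5e-6,              # learning rate reported above
    per_device_train_batch_size=16,  # 16 x 8 accumulation steps = effective batch size 128
    gradient_accumulation_steps=8,
    max_steps=10,                    # placeholder; the full run used 1.7 million steps
)

Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset).train()
```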

## Evaluation results

When finetuned on the CaseHOLD variant provided by the [LexGLUE paper](https://arxiv.org/abs/2110.00976), this model, PoL-BERT-Large, achieves the following results. In the table below, we also report results for [BERT-Large-Uncased](https://huggingface.co/bert-large-uncased) and [CaseLaw-BERT](https://huggingface.co/zlucia/custom-legalbert). We report results for the models with hyperparameter tuning on the downstream task, and the result reported for the CaseLaw-BERT model from the [LexGLUE paper](https://arxiv.org/abs/2110.00976), which uses a fixed experimental setup.

CaseHOLD test results:

| Model                   | F1   |
|-------------------------|------|
| CaseLaw-BERT (tuned)    | 78.5 |
| CaseLaw-BERT (LexGLUE)  | 75.4 |
| PoL-BERT-Large          | 75.0 |
| BERT-Large-Uncased      | 71.3 |
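
The exact fine-tuning configuration behind the table above is not reproduced here; the snippet below is a sketch of loading the LexGLUE CaseHOLD data and tokenizing it for a multiple-choice head with this checkpoint, with the `lex_glue` dataset id and field names taken as assumptions to verify against the LexGLUE dataset card.

```python
from datasets import load_dataset
from transformers import AutoModelForMultipleChoice, AutoTokenizer

# CaseHOLD as distributed with LexGLUE: a citing context plus five candidate holdings.
casehold = load_dataset("lex_glue", "case_hold")
tokenizer = AutoTokenizer.from_pretrained("pile-of-law/legalbert-large-1.7M-1")
model = AutoModelForMultipleChoice.from_pretrained("pile-of-law/legalbert-large-1.7M-1")

def encode(example):
    # Pair the same context with each candidate holding, as in standard
    # multiple-choice fine-tuning; the model scores the five pairs jointly.
    contexts = [example["context"]] * len(example["endings"])
    return tokenizer(contexts, example["endings"], truncation=True, max_length=512)

encoded = casehold.map(encode)
# Training then follows the usual multiple-choice recipe: a collator that reshapes
# inputs to [batch, num_choices, seq_len], plus the Trainer API.
```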

### BibTeX entry and citation info

```bibtex
@article{henderson2022pile,
  title={Pile of Law: Learning Responsible Data Filtering from Law and a 256GB Open-Source Legal Dataset},
  author={Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Chris and Jurafsky, Dan and Ho, Daniel E.},
  year={2022}
}
```