---
language:
- en
datasets:
- pile-of-law/pile-of-law
pipeline_tag: fill-mask
---

# Pile of Law BERT large 2 model (uncased)
Pretrained model on English language legal and administrative text using the [RoBERTa](https://arxiv.org/abs/1907.11692) pretraining objective. This model was trained with the same setup as [pile-of-law/legalbert-large-1.7M-1](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1), but with a different seed.

## Model description
Pile of Law BERT large is a transformers model with the [BERT large model (uncased)](https://huggingface.co/bert-large-uncased) architecture, pretrained on the [Pile of Law](https://huggingface.co/datasets/pile-of-law/pile-of-law), a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining.

## Intended uses & limitations
You can use the raw model for masked language modeling or fine-tune it for a downstream task. Since this model was pretrained on an English language legal and administrative text corpus, legal downstream tasks will likely be more in-domain for this model.

## How to use
You can use the model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("An [MASK] is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.")

[{'sequence': 'an exception is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.5218929052352905,
  'token': 4028,
  'token_str': 'exception'},
 {'sequence': 'an appeal is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.11434809118509293,
  'token': 1151,
  'token_str': 'appeal'},
 {'sequence': 'an exclusion is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.06454459577798843,
  'token': 5345,
  'token_str': 'exclusion'},
 {'sequence': 'an example is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.043593790382146835,
  'token': 3677,
  'token_str': 'example'},
 {'sequence': 'an objection is a request made after a trial by a party that has lost on one or more issues that a higher court review the decision to determine if it was correct.',
  'score': 0.03758585825562477,
  'token': 3542,
  'token_str': 'objection'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = BertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

and in TensorFlow:

```python
from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
model = TFBertModel.from_pretrained('pile-of-law/legalbert-large-1.7M-2')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
```
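
The checkpoint can also be fine-tuned for a downstream task, as noted under intended uses. The following is a minimal fine-tuning sketch using the Hugging Face `Trainer`; the dataset (`imdb` as a generic stand-in), the label count, and the hyperparameters are placeholders, not settings from the Pile of Law paper.

```python
# Minimal fine-tuning sketch (illustrative only): the dataset, number of labels,
# and hyperparameters are placeholders, not the configuration used in the paper.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "pile-of-law/legalbert-large-1.7M-2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is task-specific; 2 is just an example.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Stand-in dataset with "text" and "label" columns; replace with your own legal task.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="legalbert-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```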

## Limitations and bias
Please see Appendix G of the Pile of Law paper for copyright limitations related to dataset and model use.

This model can have biased predictions. In the following example, where the model is used with a pipeline for masked language modeling to fill in the race descriptor of the criminal, the model predicts a higher score for "black" than for "white".

```python
>>> from transformers import pipeline
>>> pipe = pipeline(task='fill-mask', model='pile-of-law/legalbert-large-1.7M-2')
>>> pipe("The transcript of evidence reveals that at approximately 7:30 a. m. on January 22, 1973, the prosecutrix was awakened in her home in DeKalb County by the barking of the family dog, and as she opened her eyes she saw a [MASK] man standing beside her bed with a gun.", targets=["black", "white"])

[{'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a black man standing beside her bed with a gun.',
  'score': 0.02685137465596199,
  'token': 4311,
  'token_str': 'black'},
 {'sequence': 'the transcript of evidence reveals that at approximately 7 : 30 a. m. on january 22, 1973, the prosecutrix was awakened in her home in dekalb county by the barking of the family dog, and as she opened her eyes she saw a white man standing beside her bed with a gun.',
  'score': 0.013632853515446186,
  'token': 4249,
  'token_str': 'white'}]
```

This bias will also affect all fine-tuned versions of this model.

## Training data
The Pile of Law BERT large model was pretrained on the Pile of Law, a dataset consisting of ~256GB of English language legal and administrative text for language model pretraining. The Pile of Law consists of 35 data sources, including legal analyses, court opinions and filings, government agency publications, contracts, statutes, regulations, casebooks, etc. We describe the data sources in detail in Appendix E of the Pile of Law paper. The Pile of Law dataset is placed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
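
To inspect the training corpus, the dataset can be loaded from the Hub with the `datasets` library. The sketch below streams a single subset rather than downloading the full ~256GB corpus; the subset name is only an example.

```python
from datasets import load_dataset

# Stream one of the 35 Pile of Law subsets ("r_legaladvice" is used here only as
# an example) to avoid downloading the full corpus.
pile = load_dataset("pile-of-law/pile-of-law", "r_legaladvice",
                    split="train", streaming=True)
print(next(iter(pile))["text"][:500])
```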

## Training procedure
### Preprocessing
The model vocabulary consists of 29,000 tokens from a custom word-piece vocabulary fit to the Pile of Law using the [HuggingFace WordPiece tokenizer](https://github.com/huggingface/tokenizers), plus 3,000 randomly sampled legal terms from Black's Law Dictionary, for a total vocabulary size of 32,000 tokens. Masking follows the 80-10-10 mask/random-replace/keep split from [BERT](https://arxiv.org/abs/1810.04805), with a replication rate of 20 to create different masks for each context. To generate sequences, we use the [LexNLP sentence segmenter](https://github.com/LexPredict/lexpredict-lexnlp), which handles sentence segmentation for legal citations (which are otherwise often mistakenly split off as separate sentences). The input is formatted by filling sentences until they comprise 256 tokens, followed by a [SEP] token, and then filling sentences such that the entire span is under 512 tokens. If the next sentence in the series is too large, it is not added, and the remaining context length is filled with padding tokens.
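
A rough sketch of this packing rule is shown below. It is an illustration of the described procedure, not the actual training pipeline: the `sentences` list is assumed to come from a segmenter such as LexNLP, and the leading [CLS] and trailing [SEP] tokens are assumptions based on the standard BERT input format.

```python
# Illustrative sketch of the packing rule described above: fill sentences up to
# 256 tokens, insert [SEP], keep filling while the whole span stays under 512
# tokens, and pad the remainder. Not the actual training code.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pile-of-law/legalbert-large-1.7M-2")

def pack_example(sentences, max_len=512, first_segment_len=256):
    ids = [tokenizer.cls_token_id]  # assumed leading [CLS], per standard BERT format
    i = 0
    # Fill the first segment until adding the next sentence would exceed 256 tokens.
    while i < len(sentences):
        sent_ids = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent_ids) > first_segment_len:
            break
        ids.extend(sent_ids)
        i += 1
    ids.append(tokenizer.sep_token_id)
    # Fill the second segment while the entire span stays under max_len tokens.
    while i < len(sentences):
        sent_ids = tokenizer.encode(sentences[i], add_special_tokens=False)
        if len(ids) + len(sent_ids) + 1 > max_len:  # +1 for the assumed final [SEP]
            break  # next sentence is too large; stop and pad instead
        ids.extend(sent_ids)
        i += 1
    ids.append(tokenizer.sep_token_id)
    # Fill the remaining context length with padding tokens.
    ids.extend([tokenizer.pad_token_id] * (max_len - len(ids)))
    return ids
```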

### Pretraining
The model was trained on a SambaNova cluster with 8 RDUs for 1.7 million steps. We used a smaller learning rate of 5e-6 and a batch size of 128 to mitigate training instability, potentially caused by the diversity of sources in our training data. The masked language modeling (MLM) objective without NSP loss, as described in [RoBERTa](https://arxiv.org/abs/1907.11692), was used for pretraining. The model was pretrained with a sequence length of 512 for all steps.
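
For reference, the objective and the reported learning rate and batch size correspond in spirit to the standard Hugging Face MLM recipe sketched below, which uses `DataCollatorForLanguageModeling` (MLM only, no NSP head). This is not the SambaNova training code; the dataset subset and the per-device batch-size/accumulation split are placeholders.

```python
# Rough sketch of an MLM-only pretraining setup (no NSP), echoing the reported
# learning rate (5e-6) and global batch size (128). NOT the actual training code;
# the dataset subset below is a placeholder for the full Pile of Law corpus.
from datasets import load_dataset
from transformers import (
    BertConfig,
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("pile-of-law/legalbert-large-1.7M-2")
# BERT-large architecture with this card's 32,000-token vocabulary, randomly initialized.
config = BertConfig.from_pretrained("bert-large-uncased", vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)

raw = load_dataset("pile-of-law/pile-of-law", "r_legaladvice", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512, padding="max_length")

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Implements the 80-10-10 mask/random-replace/keep scheme by default.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="legalbert-pretrain-sketch",
    learning_rate=5e-6,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,  # 16 * 8 = 128 effective batch size on one device
    max_steps=1_700_000,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```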

We trained two models with the same setup in parallel training runs with different random seeds. We selected the model with the lowest log likelihood, [pile-of-law/legalbert-large-1.7M-1](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1), which we refer to as PoL-BERT-Large, for experiments, but also release the second model, [pile-of-law/legalbert-large-1.7M-2](https://huggingface.co/pile-of-law/legalbert-large-1.7M-2) (this model).

## Evaluation results
See the model card for [pile-of-law/legalbert-large-1.7M-1](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1) for fine-tuning results on the CaseHOLD variant provided by the [LexGLUE paper](https://arxiv.org/abs/2110.00976).

### BibTeX entry and citation info
```bibtex
@article{henderson2022pile,
  title={Pile of Law: Learning Responsible Data Filtering from Law and a 256GB Open-Source Legal Dataset},
  author={Henderson, Peter and Krass, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Chris and Jurafsky, Dan and Ho, Daniel E.},
  year={2022}
}
```