iarbel committed
Commit fa9237a
1 Parent(s): 5c9b1b1

Upload 8 files

README.md CHANGED
@@ -1,3 +1,189 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: peft
+ tags:
+ - alignment-handbook
+ - trl
+ - sft
+ - generated_from_trainer
+ base_model: mistralai/Mistral-7B-v0.1
+ model-index:
+ - name: Cimphony-Mistral-Law-7B
+   results:
+   - task:
+       type: text-generation
+     dataset:
+       type: cais/mmlu
+       name: MMLU
+     metrics:
+     - name: International Law
+       type: accuracy
+       value: 0.802
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: cais/mmlu
+       name: MMLU
+     metrics:
+     - name: Jurisprudence
+       type: accuracy
+       value: 0.704
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: cais/mmlu
+       name: MMLU
+     metrics:
+     - name: Professional Law
+       type: accuracy
+       value: 0.416
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: coastalcph/lex_glue
+       name: LexGLUE
+     metrics:
+     - name: ECtHR A
+       type: balanced accuracy
+       value: 0.631
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: coastalcph/lex_glue
+       name: LexGLUE
+     metrics:
+     - name: LEDGAR
+       type: balanced accuracy
+       value: 0.741
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: coastalcph/lex_glue
+       name: LexGLUE
+     metrics:
+     - name: CaseHOLD
+       type: accuracy
+       value: 0.776
+       verified: false
+   - task:
+       type: text-generation
+     dataset:
+       type: coastalcph/lex_glue
+       name: LexGLUE
+     metrics:
+     - name: Unfair-ToS
+       type: balanced accuracy
+       value: 0.809
+       verified: false
+
+ pipeline_tag: text-generation
+ ---
+
+ # Cimphony-Mistral-Law-7B
+
+ We introduce Cimphony-Mistral-Law-7B, a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).
+
+ Cimphony’s LLMs deliver state-of-the-art performance on legal benchmarks, surpassing models trained on much larger corpora with significantly more resources, and in some cases even GPT-4, OpenAI’s flagship model.
+
+ ![image/png](https://assets-global.website-files.com/64bde220a9b6312524909e4f/654a74e810c7aa04e73056c6_PNG-Final%20File-Logo-Cimphony-Vektora-Horizontal-01-p-500.png)
+
+ ## Model description
+
+ The model was trained on 600M tokens. We use novel methods to expose the model to this corpus during training, blending a variety of legal reading-comprehension tasks with general language data.
+
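+ To make the blending concrete, here is a minimal sketch of how such a mixture can be assembled with the `datasets` library. The file name, general-data source, and mixing probabilities are illustrative assumptions, not the published recipe:
+
+ ```python
+ from datasets import load_dataset, interleave_datasets
+
+ # Illustrative only: the actual corpus composition is given in the table
+ # under "Training and evaluation data", not by these paths and ratios.
+ legal_rc = load_dataset("json", data_files="legal_reading_comprehension.jsonl", split="train")
+ general = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
+
+ # Sample from each source with fixed probabilities to blend the corpus.
+ mixed = interleave_datasets([legal_rc, general], probabilities=[0.7, 0.3], seed=42)
+ ```
+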
+ ## Legal Evaluation Results
+
+ We evaluate on the legal splits of the MMLU benchmark, as well as on LexGLUE. While both are multiple-choice benchmarks, prompts were adapted so that the models output a single answer. In some cases, additional post-processing was required (a sketch follows the results table).
+
+ Benchmarks for which the labels were A-E multiple-choice options use an accuracy metric. Benchmarks that have a closed list of options (e.g. Unfair-ToS) use a balanced-accuracy metric, as classes may not be balanced.
+
+ | Model / Benchmark | International Law (MMLU) | Jurisprudence (MMLU) | Professional Law (MMLU) | ECtHR A (LexGLUE) | LEDGAR (LexGLUE) | CaseHOLD (LexGLUE) | Unfair-ToS (LexGLUE) |
+ |:--------------------------|:-------------------------|:---------------------|:------------------------|:------------------|:-----------------|:-------------------|:---------------------|
+ | Mistral-7B-Instruct-v0.2 | 73.6% | 69.4% | 41.2% | 67.5% | 50.6% | 56.3% | 36.6% |
+ | AdaptLLM | 57.0% | 52.8% | 36.1% | 51.9% | 46.3% | 50.0% | 51.3% |
+ | Saul-7B | 69.4% | 63.0% | **43.2%** | **71.2%** | 55.9% | 65.8% | 80.3% |
+ | **Cimphony-7B** | **80.2%** | **70.4%** | 41.6% | 63.1% | **74.1%** | **77.6%** | **80.9%** |
+
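+ As an example of the post-processing step, here is a minimal extraction sketch (our illustration; the exact logic behind the reported numbers is not published):
+
+ ```python
+ import re
+
+ # Map a raw model completion to a single multiple-choice letter.
+ # Assumption: the answer appears as a standalone A-E letter in the completion.
+ def extract_choice(completion: str, options: str = "ABCDE"):
+     match = re.search(rf"\b([{options}])\b", completion.strip())
+     return match.group(1) if match else None
+
+ print(extract_choice("The correct answer is (B)."))  # -> B
+ ```
+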
+ ## Training and evaluation data
+
+ Following the framework presented in [AdaptLLM](https://huggingface.co/AdaptLLM/law-chat), we convert the raw legal text into reading-comprehension tasks. This takes inspiration from human learning via reading comprehension: practice after reading improves the ability to answer questions based on the learned knowledge.
+
+ We developed a high-quality prompt database, considering the capabilities we’d like the model to possess. LLMs were prompted with the raw text and a collection of prompts, and they returned answers, additional questions, and transformations relevant to the input data. With further post-processing of these outputs, we created our legal reading-comprehension dataset (a sketch of this flow follows the table below).
+
+ | Domain | Dataset | Tokens | License |
+ |:-------------------|:--------------------|:------:|:------------|
+ | Legal | The Pile (FreeLaw) | 180M | MIT |
+ | Legal | LexGLUE | 108M | CC-BY-4.0 |
+ | Legal | USClassActions | 12M | GPL-3.0 |
+ | Math (CoT) | AQUA-RAT | 3M | Apache-2.0 |
+ | Commonsense (CoT) | ECQA | 2.4M | Apache-2.0 |
+ | Reasoning (CoT) | EntailmentBank | 1.8M | Apache-2.0 |
+ | Chat | UltraChat | 90M | MIT |
+ | Code | Code-Feedback | 36M | Apache-2.0 |
+ | Instruction | OpenOrca | 180M | MIT |
+
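+ A minimal sketch of the reading-comprehension construction flow (the client, model name, and task prompts are illustrative assumptions; the actual prompt database is not published):
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
+
+ # Hypothetical task prompts standing in for the prompt database.
+ TASK_PROMPTS = [
+     "Write a question that can be answered only from the text, then answer it.",
+     "List the legal issues raised by the text.",
+ ]
+
+ def build_examples(raw_text: str) -> list:
+     """Prompt an LLM with raw legal text to produce training examples."""
+     examples = []
+     for task in TASK_PROMPTS:
+         response = client.chat.completions.create(
+             model="gpt-4",
+             messages=[{"role": "user", "content": f"{task}\n\nText:\n{raw_text}"}],
+         )
+         examples.append({"task": task, "text": raw_text,
+                          "output": response.choices[0].message.content})
+     return examples
+ ```
+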
+ ## Intended uses & limitations
+
+ This model is intended for use cases involving legal-domain text generation.
+
+ As with any language model, users must not rely solely on model generations. This model has not gone through human-feedback alignment (RLHF), and it may generate responses containing hallucinations and biases.
+
+ Example use:
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+
+ tokenizer = AutoTokenizer.from_pretrained("iarbel/mistral-law-7b-beta")
+ model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
+ model = PeftModel.from_pretrained(model, "iarbel/mistral-law-7b-beta")
+
+ # Put your input here:
+ user_input = '''Question: What can you tell me about ex post facto laws?'''
+
+ # Apply the chat template (it expects a list of role/content messages)
+ messages = [{"role": "user", "content": user_input}]
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+ inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
+ outputs = model.generate(input_ids=inputs, max_length=4096)[0]
+
+ # Decode only the newly generated tokens, skipping the prompt
+ answer_start = int(inputs.shape[-1])
+ pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)
+
+ print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')
+ ```
+
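+ Optionally, the LoRA adapter can be merged into the base weights for faster inference without the PEFT wrapper (a standard PEFT pattern, not a step required by this card; the output path is hypothetical):
+
+ ```python
+ # merge_and_unload() folds the adapter into the base model's weights.
+ merged_model = model.merge_and_unload()
+ merged_model.save_pretrained("cimphony-mistral-law-7b-merged")  # hypothetical path
+ tokenizer.save_pretrained("cimphony-mistral-law-7b-merged")
+ ```
+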
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.0005
+ - train_batch_size: 8
+ - eval_batch_size: 24
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 4
+ - gradient_accumulation_steps: 4
+ - total_train_batch_size: 128
+ - total_eval_batch_size: 96
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_ratio: 0.05
+ - num_epochs: 1
+
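+ The LoRA adapter settings from the adapter_config.json shipped in this commit can be expressed as a peft `LoraConfig`; the wrapper below is a sketch, since the full training script is not published:
+
+ ```python
+ from peft import LoraConfig
+
+ # Values mirror the adapter_config.json in this commit.
+ lora_config = LoraConfig(
+     r=64,
+     lora_alpha=64,
+     lora_dropout=0.05,
+     bias="none",
+     task_type="CAUSAL_LM",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "down_proj"],
+ )
+ ```
+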
+ ### Framework versions
+
+ - PEFT 0.8.2
+ - Transformers 4.37.2
+ - Pytorch 2.1.2+cu121
+ - Datasets 2.14.6
+ - Tokenizers 0.15.2
adapter_config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": null,
+   "base_model_name_or_path": "mistralai/Mistral-7B-v0.1",
+   "bias": "none",
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "r": 64,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "v_proj",
+     "o_proj",
+     "q_proj",
+     "down_proj",
+     "k_proj",
+     "up_proj"
+   ],
+   "task_type": "CAUSAL_LM",
+   "use_rslora": false
+ }
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:74ab9437033b622089e58b4cb3b83e59a4c7e3ca6693272696ba93923be598a8
+ size 260098992
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "_name_or_path": "mistralai/Mistral-7B-v0.1",
+   "architectures": [
+     "MistralForCausalLM"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14336,
+   "max_position_embeddings": 32768,
+   "model_type": "mistral",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "rms_norm_eps": 1e-05,
+   "rope_theta": 10000.0,
+   "sliding_window": 4096,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.37.2",
+   "use_cache": true,
+   "vocab_size": 32000
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<unk>",
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
+ size 493443
tokenizer_config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 4096,
+   "pad_token": "<unk>",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }