---
license: apache-2.0
library_name: peft
tags:
- alignment-handbook
- trl
- sft
- generated_from_trainer
base_model: mistralai/Mistral-7B-v0.1
model-index:
- name: Cimphony-Mistral-Law-7B
  results:
  - task: 
      type: text-generation
    dataset: 
      type: cais/mmlu
      name: MMLU
    metrics:
    - name: International Law
      type: accuracy
      value: 0.802
      verified: false
  - task: 
      type: text-generation
    dataset: 
      type: cais/mmlu
      name: MMLU
    metrics:
    - name: Jurisprudence
      type: accuracy
      value: 0.704
      verified: false
  - task: 
      type: text-generation
    dataset: 
      type: cais/mmlu
      name: MMLU
    metrics:
    - name: Professional Law
      type: accuracy
      value: 0.416
      verified: false
  - task: 
      type: text-generation
    dataset: 
      type: coastalcph/lex_glue
      name: LexGLUE
    metrics:
    - name: ECtHR A
      type: balanced accuracy
      value: 0.631
      verified: false
  - task: 
      type: text-generation
    dataset: 
      type: coastalcph/lex_glue
      name: LexGLUE
    metrics:
    - name: LEDGAR
      type: balanced accuracy
      value: 0.741
      verified: false
  - task: 
      type: text-generation
    dataset: 
      type: coastalcph/lex_glue
      name: LexGLUE
    metrics:
    - name: CaseHOLD
      type: accuracy
      value: 0.776
      verified: false
  - task: 
      type: text-generation
    dataset: 
      type: coastalcph/lex_glue
      name: LexGLUE
    metrics:
    - name: Unfair-ToS
      type: balanced accuracy
      value: 0.809
      verified: false
      
pipeline_tag: text-generation
---

# Cimphony-Mistral-Law-7B

We introduce Cimphony-Mistral-Law-7B, a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).

Cimphony’s LLMs deliver state-of-the-art performance on legal benchmarks, surpassing models trained on much larger corpora with significantly more resources, including GPT-4, OpenAI’s flagship model.

Check out and register on our platform at [cimphony.ai](https://app.cimphony.ai/signup?callbackUrl=https://app.cimphony.ai/)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/657d36d3647c0211e7746ed9/Yjx96bC58SPgNwmDxx_yx.png)

## Model description

The model was trained on 600M tokens. We use novel methods to expose the model to this corpus during training, blending a variety of legal reading-comprehension tasks with general language data.


## Legal Evaluation Results

We evaluate on the legal splits of the MMLU benchmark, as well as on LexGLUE. While both are multiple-choice benchmarks, prompts were adapted so that the models output a single answer; in some cases, additional post-processing was required.

Benchmarks whose labels are A-E multiple-choice options use an accuracy metric. Benchmarks with a closed list of options (e.g. Unfair-ToS) use a balanced-accuracy metric, as the classes may not be balanced.
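
For concreteness, both metrics are available in scikit-learn. This is a minimal scoring sketch, assuming gold labels and model answers have already been parsed out of the generations (the post-processing mentioned above); it is not the exact evaluation code:

```python
# Minimal scoring sketch (assumed workflow, not the actual evaluation code).
# `golds` and `preds` are label strings already parsed from model outputs.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

golds = ["A", "B", "A", "C"]  # toy gold labels
preds = ["A", "B", "C", "C"]  # toy parsed model answers

# A-E multiple-choice benchmarks: plain accuracy.
print(accuracy_score(golds, preds))           # 0.75

# Closed-list benchmarks (e.g. Unfair-ToS): balanced accuracy,
# i.e. recall averaged per class, robust to class imbalance.
print(balanced_accuracy_score(golds, preds))  # ~0.83 here
```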

| Model / Benchmark                  | International Law (MMLU)  | Jurisprudence (MMLU)  | Professional Law (MMLU)  | ECtHR A (LexGLUE)  | LEDGAR (LexGLUE)  | CaseHOLD (LexGLUE)  | Unfair-ToS (LexGLUE)   |
|:-----------------------------------|:--------------------------|:----------------------|:-------------------------|:-------------------|:------------------|:--------------------|:-----------------------|
| Mistral-7B-Instruct-v0.2           | 73.6%                     | 69.4%                 | 41.2%                    |  67.5%             | 50.6%             |  56.3%              | 36.6%                  |
| AdaptLLM                           | 57.0%                     | 52.8%                 | 36.1%                    | 51.9%              | 46.3%             |  50.0%              | 51.3%                  |
| Saul-7B                            | 69.4%                     | 63.0%                 | **43.2%**                | **71.2%**          | 55.9%             | 65.8%               | 80.3%                  |
| **Cimphony-7B**                    | **80.2%**                 | **70.4%**             | 41.6%                    | 63.1%              | **74.1%**         | **77.6%**           | **80.9%**              |
 
## Training and evaluation data

Following the framework presented in [AdaptLLM](https://huggingface.co/AdaptLLM/law-chat), we convert raw legal text into reading-comprehension tasks. This takes inspiration from how humans learn through reading comprehension: practicing after reading improves the ability to answer questions based on the learned material.

We developed a high-quality prompt database, considering the capabilities we’d like the model to possess. LLMs were prompted with the raw text and a collection of these prompts, returning answers, additional questions, and transformations relevant to the input data. With further post-processing of these outputs, we created our legal reading-comprehension dataset.
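
As an illustration of this pipeline, one generation step might look like the sketch below. The prompt wording and the generator model are assumptions for the example; the actual prompt database and post-processing are described only at a high level above.

```python
# Hypothetical sketch of one reading-comprehension generation step.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

raw_text = "..."  # a raw legal passage, e.g. from The Pile (FreeLaw)

prompt = (
    "Read the following legal text and write one question that tests "
    "comprehension of it, followed by its answer.\n\n"
    f"Text: {raw_text}\n\nQuestion and answer:"
)

out = generator(prompt, max_new_tokens=256, do_sample=False)[0]["generated_text"]
# Further post-processing (parsing the Q/A pair, filtering low-quality
# outputs) would follow before the pair joins the training set.
```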


| Domain             | Dataset             | Tokens | License     |
|:-------------------|:--------------------|:------:|:------------|
| Legal              | The Pile (FreeLaw)  | 180M   | MIT         |
| Legal              | LexGLUE (train split only) | 108M   | CC-BY-4.0   |
| Legal              | USClassActions      | 12M    | GPL-3.0     |
| Math (CoT)         | AQUA-RAT            | 3M     | Apache-2.0  |
| Commonsense (CoT)  | ECQA                | 2.4M   | Apache-2.0  |
| Reasoning (CoT)    | EntailmentBank      | 1.8M   | Apache-2.0  |
| Chat               | UltraChat           | 90M    | MIT         |
| Code               | Code-Feedback       | 36M    | Apache-2.0  |
| Instruction        | OpenOrca            | 180M   | MIT         |
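
One plausible way to realize this blend is to interleave the sources with probabilities proportional to their token budgets. The sketch below uses the `datasets` library for that; the dataset ids and splits are illustrative (the legal reading-comprehension set is not publicly released), and the exact blending method used in training is not specified here.

```python
# Assumed blending sketch: probability-proportional interleaving by token count.
from datasets import load_dataset, interleave_datasets

# (dataset_id, split, millions of tokens) -- the first entry is hypothetical.
mixture = [
    ("my-org/legal-reading-comprehension", "train", 300),
    ("HuggingFaceH4/ultrachat_200k", "train_sft", 90),
    ("Open-Orca/OpenOrca", "train", 180),
]

parts = []
for name, split, _ in mixture:
    ds = load_dataset(name, split=split)
    # interleave_datasets needs a shared schema; crudely flatten each
    # source to a single "text" column (real code would format properly).
    ds = ds.map(lambda ex: {"text": str(ex)}, remove_columns=ds.column_names)
    parts.append(ds)

total = sum(tokens for _, _, tokens in mixture)
probs = [tokens / total for _, _, tokens in mixture]

mixed = interleave_datasets(parts, probabilities=probs, seed=42)
```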


## Intended uses & limitations

This model can be used for text generation in the legal domain.

As with any language model, users must not rely solely on model generations. This model has not gone through human-feedback alignment (RLHF), and may generate responses containing hallucinations and biases.

Example use:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("cimphonyadmin/Cimphony-Mistral-Law-7B")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(model, "cimphonyadmin/Cimphony-Mistral-Law-7B")

# Put your input here:
user_input = '''What can you tell me about ex post facto laws?'''

# Apply the chat template; apply_chat_template expects a list of message dicts
messages = [{"role": "user", "content": user_input}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
outputs = model.generate(input_ids=inputs, max_length=4096)[0]

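# Decode only the newly generated tokens, skipping the prompt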
answer_start = int(inputs.shape[-1])
pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')
```

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0005
- train_batch_size: 8
- eval_batch_size: 24
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- total_eval_batch_size: 96
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1
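
The tags above (trl, sft, alignment-handbook, peft) suggest a TRL `SFTTrainer` run. The sketch below is an assumed reconstruction from the hyperparameters listed; the LoRA settings and dataset are placeholders, since neither is specified in this card.

```python
# Assumed reconstruction of the training setup; not the actual script.
from datasets import Dataset
from peft import LoraConfig
from transformers import TrainingArguments
from trl import SFTTrainer

train_ds = Dataset.from_dict({"text": ["..."]})  # placeholder for the blended corpus

args = TrainingArguments(
    output_dir="cimphony-mistral-law-7b",
    learning_rate=5e-4,
    per_device_train_batch_size=8,   # x 4 GPUs x 4 grad-accum steps = 128 total
    per_device_eval_batch_size=24,   # x 4 GPUs = 96 total
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=1,
    seed=42,
)

# LoRA hyperparameters are not stated in this card; these are placeholders.
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-v0.1",
    args=args,
    train_dataset=train_ds,
    dataset_text_field="text",
    peft_config=peft_config,
)
trainer.train()
```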


### Framework versions

- PEFT 0.8.2
- Transformers 4.37.2
- Pytorch 2.1.2+cu121
- Datasets 2.14.6
- Tokenizers 0.15.2