Update README.md

c7cc200 verified 6 months ago

7.47 kB

	---
	license: apache-2.0
	library_name: peft
	tags:
	- alignment-handbook
	- trl
	- sft
	- generated_from_trainer
	base_model: mistralai/Mistral-7B-v0.1
	model-index:
	- name: Cimphony-Mistral-Law-7B
	results:
	- task:
	type: text-generation
	dataset:
	type: cais/mmlu
	name: MMLU
	metrics:
	- name: International Law
	type: accuracy
	value: 0.802
	verified: false
	- task:
	type: text-generation
	dataset:
	type: cais/mmlu
	name: MMLU
	metrics:
	- name: Jurisprudence
	type: accuracy
	value: 0.704
	verified: false
	- task:
	type: text-generation
	dataset:
	type: cais/mmlu
	name: MMLU
	metrics:
	- name: Professional Law
	type: accuracy
	value: 0.416
	verified: false
	- task:
	type: text-generation
	dataset:
	type: coastalcph/lex_glue
	name: LexGLUE
	metrics:
	- name: ECtHR A
	type: balanced accuracy
	value: 0.631
	verified: false
	- task:
	type: text-generation
	dataset:
	type: coastalcph/lex_glue
	name: LexGLUE
	metrics:
	- name: LEDGAR
	type: balanced accuracy
	value: 0.741
	verified: false
	- task:
	type: text-generation
	dataset:
	type: coastalcph/lex_glue
	name: LexGLUE
	metrics:
	- name: CaseHOLD
	type: accuracy
	value: 0.776
	verified: false
	- task:
	type: text-generation
	dataset:
	type: coastalcph/lex_glue
	name: LexGLUE
	metrics:
	- name: Unfair-ToS
	type: balanced accuracy
	value: 0.809
	verified: false

	pipeline_tag: text-generation
	---

	# Cimphony-Mistral-Law-7B

	We introduce Cimphony-Mistral-Law-7B, a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1).

	Cimphony’s LLMs present state-of-the-art performance on legal benchmarks, suppressing models trained on a much larger corpus with significantly more resources, even GPT-4, OpenAI’s flagship model.

	Checkout and register on our [https://cimphony.ai](https://app.cimphony.ai/signup?callbackUrl=https://app.cimphony.ai/)

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/657d36d3647c0211e7746ed9/Yjx96bC58SPgNwmDxx_yx.png)

	## Model description

	The model was trained on 600M tokens. We use novel methods to expose the model to this corpus during training, blending a variety of legal reading comprehension tasks, as well as general language data.


	## Legal Evaluation Results

	We evaluate on the legal splits of the MMLU benchmark, as well as LexGLUE. While both are multiple option benchmarks, prompts were adapted so that the models output a single answer. In some cases, additional post-processing was required.

	Benchmarks for which the labels were A-E multiple-choice options use an accuracy mertic. Benchmarks that have a closed list of options (e.g. Unfair-ToS) use a balanced-accuracy metric, as classes may not be balanced.

	\| Model / Benchmark \| International Law (MMLU) \| Jurisprudence (MMLU) \| Professional law (MMLU) \| ECtHR A (LexGlue) \| LEDGAR (LexGlue) \| CaseHOLD (LexGlue) \| Unfair-ToS (LexGlue) \|
	\|:-----------------------------------\|:--------------------------\|:----------------------\|:-------------------------\|:-------------------\|:------------------\|:--------------------\|:-----------------------\|
	\| Mistral-7B-Instruct-v0.2 \| 73.6% \| 69.4% \| 41.2% \| 67.5% \| 50.6% \| 56.3% \| 36.6% \|
	\| AdaptLLM \| 57.0% \| 52.8% \| 36.1% \| 51.9% \| 46.3% \| 50.0% \| 51.3% \|
	\| Saul-7B \| 69.4% \| 63.0% \| 43.2% \| 71.2% \| 55.9% \| 65.8% \| 80.3% \|
	\|<tr style="background-color:yellow;"><td>Cimphony-7B</td><td>80.2%</td><td>70.4%</td><td>41.6%</td><td>63.1%</td><td>74.1%</td><td>77.6%</td><td>80.9%</td></tr>\|

	## Training and evaluation data

	Following the framework presented in [AdaptLLM](https://huggingface.co/AdaptLLM/law-chat), we convert the raw legal text into reading comprehension. Taking inspiration from human learning via reading comprehension - practice after reading improves the ability to answer questions based on the learned knowledge.

	We developed a high-quality prompt database, considering the capabilities we’d like the model to possess. LLMs were prompt with the raw text and a collection of prompts, and it returned answers, additional questions, and transformations relevant to the input data. With further post-processing of these outputs, we created our legal reading comprehension dataset.


	\| Domain \| Dataset \| Tokens \| License \|
	\|:-------------------\|:--------------------\|:------:\|:------------\|
	\| Legal \| The Pile (FreeLaw) \| 180M \| MIT \|
	\| Legal \| LexGlue (train split only) \| 108M \| CC-BY-4.0 \|
	\| Legal \| USClassActions \| 12M \| GPL-3.0 \|
	\| Math (CoT) \| AQUA-RAT \| 3M \| Apache-2.0 \|
	\| Commonsense (CoT) \| ECQA \| 2.4M \| Apache-2.0 \|
	\| Reasoning (CoT) \| EntailmentBank \| 1.8M \| Apache-2.0 \|
	\| Chat \| UltraChat \| 90M \| MIT \|
	\| Code \| Code-Feedback \| 36M \| Apache-2.0 \|
	\| Instruction \| OpenOrca \| 180M \| MIT \|


	## Intended uses & limitations

	This model can be used for use cases involving legal domain text generation.

	As with any language model, users must not solely relay on model generations. This model has not gone through a human-feedback alignment (RLHF). The model may generate responses containing hallucinations and biases.

	Example use:
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	tokenizer = AutoTokenizer.from_pretrained("cimphonyadmin/Cimphony-Mistral-Law-7B")
	model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
	model = PeftModel.from_pretrained(model, "cimphonyadmin/Cimphony-Mistral-Law-7B")

	# Put your input here:
	user_input = '''What can you tell me about ex post facto laws?'''

	# Apply the prompt template
	prompt = tokenizer.apply_chat_template(user_input, tokenize=False)

	inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids.to(model.device)
	outputs = model.generate(input_ids=inputs, max_length=4096)[0]

	answer_start = int(inputs.shape[-1])
	pred = tokenizer.decode(outputs[answer_start:], skip_special_tokens=True)

	print(f'### User Input:\n{user_input}\n\n### Assistant Output:\n{pred}')
	```

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 0.0005
	- train_batch_size: 8
	- eval_batch_size: 24
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 4
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 128
	- total_eval_batch_size: 96
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.05
	- num_epochs: 1


	### Framework versions

	- PEFT 0.8.2
	- Transformers 4.37.2
	- Pytorch 2.1.2+cu121
	- Datasets 2.14.6
	- Tokenizers 0.15.2