---
language:
- en
license: apache-2.0
tags:
- mistral
- instruct
- finetune
- chatml
- gpt4
- synthetic data
- distillation
- dpo
- rlhf
- laser
datasets:
- mlabonne/chatml_dpo_pairs
base_model: teknium/OpenHermes-2.5-Mistral-7B
model-index:
- name: NeuralHermes-2.5-Mistral-7B-laser
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 66.38
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/NeuralHermes-2.5-Mistral-7B-laser
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 85.09
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/NeuralHermes-2.5-Mistral-7B-laser
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 63.43
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/NeuralHermes-2.5-Mistral-7B-laser
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 54.95
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/NeuralHermes-2.5-Mistral-7B-laser
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 78.14
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/NeuralHermes-2.5-Mistral-7B-laser
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 55.72
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mlabonne/NeuralHermes-2.5-Mistral-7B-laser
      name: Open LLM Leaderboard
---

	<center><img src="https://i.imgur.com/gUlEJuU.jpeg"></center>

	# NeuralHermes 2.5 - Mistral 7B - LASER

	This is an experimental LASER version of NeuralHermes using [laserRMT](https://github.com/cognitivecomputations/laserRMT), based on [this paper](https://arxiv.org/pdf/2312.13558.pdf).

| Model |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|-------|------:|------:|---------:|-------:|------:|
|[NeuralHermes-2.5-Mistral-7B-laser](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B-laser)| 43.54| 73.44| 55.26| 42.24| 53.62|
|[NeuralHermes-2.5-Mistral-7B](https://huggingface.co/mlabonne/NeuralHermes-2.5-Mistral-7B)| 43.67| 73.24| 55.37| 41.76| 53.51|

	Fernando Fernandes Neto and Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.
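As a rough illustration of the rank-reduction idea behind LASER, the sketch below replaces a weight matrix with a truncated-SVD approximation. It is not laserRMT's code: the function name `low_rank_approx`, the `keep_fraction` parameter, and the synthetic matrix are assumptions made for this example, and laserRMT's procedure for choosing which layers to reduce and by how much is not reproduced here.

```python
import numpy as np


def low_rank_approx(weight: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Keep only the largest singular values of `weight` (hypothetical helper)."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = max(1, int(len(s) * keep_fraction))  # number of singular values kept
    # Scale the first k left singular vectors by their singular values, then project back.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


rng = np.random.default_rng(0)
# Synthetic stand-in for a layer weight: a rank-4 "signal" plus small noise.
signal = rng.normal(size=(64, 4)) @ rng.normal(size=(4, 32))
weight = signal + 0.01 * rng.normal(size=(64, 32))

reduced = low_rank_approx(weight, keep_fraction=0.125)  # keeps 4 of 32 singular values
rel_error = np.linalg.norm(weight - reduced) / np.linalg.norm(weight)
print(reduced.shape, np.linalg.matrix_rank(reduced), rel_error)
```

In this toy case the truncation discards mostly noise, which is the intuition the LASER paper builds on. Whether that holds for a given layer of a real model has to be checked empirically.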

NeuralHermes is a [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B) model that has been further fine-tuned with Direct Preference Optimization (DPO) using the [mlabonne/chatml_dpo_pairs](https://huggingface.co/datasets/mlabonne/chatml_dpo_pairs) dataset. It surpasses the original model on several benchmarks (see results).
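For readers unfamiliar with DPO, here is a minimal sketch of its per-example objective, written from the standard formulation rather than taken from this model's training code. The function name, the argument names, and `beta=0.1` are assumptions chosen for illustration. The inputs are summed log-probabilities of the chosen and rejected answers under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = -beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(-x)) equals softplus(x); this form avoids overflow for large |x|.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy raises the chosen answer and lowers the rejected one: the loss drops.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The loss falls as the policy, relative to the reference, shifts probability toward the chosen answer and away from the rejected one.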

	It is directly inspired by the RLHF process described by [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1)'s authors to improve performance. I used the same dataset and reformatted it to apply the ChatML template.
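The reformatting step can be pictured with the sketch below, which renders one preference pair in ChatML. It is an illustration, not the original preprocessing code: the helper names and the `system`, `question`, `chosen`, and `rejected` field names are assumptions about the dataset layout.

```python
def to_chatml(messages, add_generation_prompt=False):
    """Render role/content messages using ChatML's <|im_start|> / <|im_end|> markers."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)


def format_dpo_example(example):
    """Turn one preference pair (hypothetical field names) into prompt/chosen/rejected."""
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    messages.append({"role": "user", "content": example["question"]})
    return {
        "prompt": to_chatml(messages, add_generation_prompt=True),
        "chosen": example["chosen"] + "<|im_end|>\n",
        "rejected": example["rejected"] + "<|im_end|>\n",
    }


example = {
    "system": "You are a helpful assistant.",
    "question": "What is 2 + 2?",
    "chosen": "2 + 2 equals 4.",
    "rejected": "5",
}
print(format_dpo_example(example)["prompt"])
```

In practice, `tokenizer.apply_chat_template` (used in the Usage section) produces the prompt from the tokenizer's own template, which is safer than hand-building the string.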

The code to train this model is available on [Google Colab](https://colab.research.google.com/drive/15iFBr1xWgztXvhrj5I9fBv20c7CFOPBE?usp=sharing) and [GitHub](https://github.com/mlabonne/llm-course/tree/main). Training took about an hour on an A100 GPU.

	## Results

	### AGIEval
| Task |Version| Metric |Value| |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat | 0|acc |21.26|± | 2.57|
| | |acc_norm|22.83|± | 2.64|
|agieval_logiqa_en | 0|acc |39.32|± | 1.92|
| | |acc_norm|40.71|± | 1.93|
|agieval_lsat_ar | 0|acc |25.65|± | 2.89|
| | |acc_norm|25.65|± | 2.89|
|agieval_lsat_lr | 0|acc |48.82|± | 2.22|
| | |acc_norm|50.00|± | 2.22|
|agieval_lsat_rc | 0|acc |58.36|± | 3.01|
| | |acc_norm|57.25|± | 3.02|
|agieval_sat_en | 0|acc |74.27|± | 3.05|
| | |acc_norm|73.30|± | 3.09|
|agieval_sat_en_without_passage| 0|acc |43.69|± | 3.46|
| | |acc_norm|42.23|± | 3.45|
|agieval_sat_math | 0|acc |37.27|± | 3.27|
| | |acc_norm|36.36|± | 3.25|

	Average: 43.54%

	### GPT4All
| Task |Version| Metric |Value| |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge| 0|acc |57.76|± | 1.44|
| | |acc_norm|60.32|± | 1.43|
|arc_easy | 0|acc |83.84|± | 0.76|
| | |acc_norm|81.10|± | 0.80|
|boolq | 1|acc |86.70|± | 0.59|
|hellaswag | 0|acc |63.15|± | 0.48|
| | |acc_norm|82.55|± | 0.38|
|openbookqa | 0|acc |34.40|± | 2.13|
| | |acc_norm|45.20|± | 2.23|
|piqa | 0|acc |81.94|± | 0.90|
| | |acc_norm|82.97|± | 0.88|
|winogrande | 0|acc |75.22|± | 1.21|

	Average: 73.44%
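The card does not state how the suite averages are computed. The reported GPT4All figure is consistent with averaging `acc_norm` where it is reported and `acc` otherwise, which the sketch below checks against the table values. The aggregation rule and the helper name `suite_average` are assumptions inferred from the numbers, not taken from the evaluation harness.

```python
# Values copied from the GPT4All table above.
gpt4all = {
    "arc_challenge": {"acc": 57.76, "acc_norm": 60.32},
    "arc_easy": {"acc": 83.84, "acc_norm": 81.10},
    "boolq": {"acc": 86.70},
    "hellaswag": {"acc": 63.15, "acc_norm": 82.55},
    "openbookqa": {"acc": 34.40, "acc_norm": 45.20},
    "piqa": {"acc": 81.94, "acc_norm": 82.97},
    "winogrande": {"acc": 75.22},
}


def suite_average(results):
    """Mean of acc_norm per task, falling back to acc when acc_norm is absent."""
    scores = [metrics.get("acc_norm", metrics["acc"]) for metrics in results.values()]
    return sum(scores) / len(scores)


print(f"{suite_average(gpt4all):.2f}")  # 73.44
```

The same rule applied to the `acc_norm` rows of the AGIEval table gives 43.54, matching the average reported there.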

	### TruthfulQA
| Task |Version|Metric|Value| |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc| 1|mc1 |37.70|± | 1.70|
| | |mc2 |55.26|± | 1.52|

	Average: 55.26%

	### Bigbench
| Task |Version| Metric |Value| |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement | 0|multiple_choice_grade|53.16|± | 3.63|
|bigbench_date_understanding | 0|multiple_choice_grade|65.31|± | 2.48|
|bigbench_disambiguation_qa | 0|multiple_choice_grade|34.11|± | 2.96|
|bigbench_geometric_shapes | 0|multiple_choice_grade|27.02|± | 2.35|
| | |exact_str_match | 0.28|± | 0.28|
|bigbench_logical_deduction_five_objects | 0|multiple_choice_grade|27.80|± | 2.01|
|bigbench_logical_deduction_seven_objects | 0|multiple_choice_grade|19.86|± | 1.51|
|bigbench_logical_deduction_three_objects | 0|multiple_choice_grade|48.33|± | 2.89|
|bigbench_movie_recommendation | 0|multiple_choice_grade|41.40|± | 2.20|
|bigbench_navigate | 0|multiple_choice_grade|50.00|± | 1.58|
|bigbench_reasoning_about_colored_objects | 0|multiple_choice_grade|65.00|± | 1.07|
|bigbench_ruin_names | 0|multiple_choice_grade|46.21|± | 2.36|
|bigbench_salient_translation_error_detection | 0|multiple_choice_grade|27.25|± | 1.41|
|bigbench_snarks | 0|multiple_choice_grade|70.72|± | 3.39|
|bigbench_sports_understanding | 0|multiple_choice_grade|65.72|± | 1.51|
|bigbench_temporal_sequences | 0|multiple_choice_grade|30.40|± | 1.46|
|bigbench_tracking_shuffled_objects_five_objects | 0|multiple_choice_grade|22.56|± | 1.18|
|bigbench_tracking_shuffled_objects_seven_objects| 0|multiple_choice_grade|17.09|± | 0.90|
|bigbench_tracking_shuffled_objects_three_objects| 0|multiple_choice_grade|48.33|± | 2.89|

	Average: 42.24%

	Average score: 53.62%
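To put the overall score in context, the sketch below recomputes it as the unweighted mean of the four suite averages and prints the per-suite difference against the non-LASER NeuralHermes-2.5-Mistral-7B, using the figures from the comparison table at the top of this card. The variable names are illustrative, and treating the overall score as an unweighted mean is an assumption that happens to reproduce the reported values.

```python
suites = ["AGIEval", "GPT4All", "TruthfulQA", "Bigbench"]
laser = [43.54, 73.44, 55.26, 42.24]      # NeuralHermes-2.5-Mistral-7B-laser
non_laser = [43.67, 73.24, 55.37, 41.76]  # NeuralHermes-2.5-Mistral-7B


def mean(values):
    return sum(values) / len(values)


for name, a, b in zip(suites, laser, non_laser):
    print(f"{name}: {a - b:+.2f}")
print(f"Average: {mean(laser):.2f} vs {mean(non_laser):.2f}")
```

The differences are small in both directions (AGIEval and TruthfulQA slightly lower, GPT4All and Bigbench slightly higher), and several are smaller than the standard errors reported in the per-task tables above.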

	## Usage

	You can run this model using [LM Studio](https://lmstudio.ai/) or any other frontend.

	You can also run this model using the following code:

```python
import transformers
from transformers import AutoTokenizer

model_id = "mlabonne/NeuralHermes-2.5-Mistral-7B-laser"

# Format the prompt with the model's chat template (ChatML)
message = [
    {"role": "system", "content": "You are a helpful assistant chatbot."},
    {"role": "user", "content": "What is a Large Language Model?"},
]
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = tokenizer.apply_chat_template(message, add_generation_prompt=True, tokenize=False)

# Create the text-generation pipeline
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
)

# Generate text (max_length counts prompt tokens plus generated tokens)
sequences = pipeline(
    prompt,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    num_return_sequences=1,
    max_length=200,
)
print(sequences[0]["generated_text"])
```
	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_mlabonne__NeuralHermes-2.5-Mistral-7B-laser).

| Metric |Value|
|---------------------------------|----:|
|Avg. |67.29|
|AI2 Reasoning Challenge (25-Shot)|66.38|
|HellaSwag (10-Shot) |85.09|
|MMLU (5-Shot) |63.43|
|TruthfulQA (0-shot) |54.95|
|Winogrande (5-shot) |78.14|
|GSM8k (5-shot) |55.72|