---
model-index:
- name: Junrulu/Llama-3-8B-Instruct-Iterative-SamPO
  results: []
datasets:
- HuggingFaceH4/ultrafeedback_binarized
language:
- en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
license: llama3
---
# Model Card for Llama-3-8B-Instruct-Iterative-SamPO

This repository provides a fine-tuned version of Llama-3-8B-Instruct, trained with our proposed [SamPO](https://github.com/LuJunru/SamPO) algorithm. We comply with all licenses that apply to the Llama 3 model.
## Performance

| Model | GSM8K | IFEval | PiQA | MMLU | TruthfulQA | AlpacaEval2 | LC AlpacaEval2 | Length in Tokens |
| ----- | ----- | ------ | ---- | ---- | ---------- | ----------- | -------------- | ---------------- |
| **Llama3-8B-Instruct** | 75.06 | 49.40 | 80.69 | 63.85 | 36.47 | 22.57 | 22.92 | 421 |
| **Llama3-8B-Instruct-DPO** | 75.59 | 51.80 | **81.94** | 64.06 | 40.39 | 23.34 | 23.20 | 422 |
| **Llama3-8B-Instruct-Iterative-DPO** | 74.91 | 52.52 | 81.66 | 64.02 | 39.90 | 23.92 | 25.50 | 403 |
| **Llama3-8B-Instruct-Iterative-SamPO** | **77.81** | **60.55** | 81.18 | **64.12** | **44.07** | **30.68** | **35.14** | 377 |
## Evaluation Details

Five conditional benchmarks, evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness); a programmatic sketch follows the list:
- GSM8K: 8-shot, reporting strict-match accuracy
- IFEval: 3-shot, reporting instruction-level strict accuracy
- PiQA: 3-shot, reporting accuracy
- MMLU: 0-shot, reporting normalized accuracy
- TruthfulQA: 3-shot, reporting accuracy in the single-true mc1 setting
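
For reference, these benchmarks can also be launched programmatically. This is a minimal sketch, assuming lm-evaluation-harness >= 0.4 and its `simple_evaluate` API; swap the task name and few-shot count per the settings above, and note that the exact harness version and task configuration may differ from the ones behind the table.

```python
# Minimal sketch: evaluate the model on GSM8K with lm-evaluation-harness.
# Assumes lm-evaluation-harness >= 0.4 is installed and a GPU is available.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Junrulu/Llama-3-8B-Instruct-Iterative-SamPO,dtype=bfloat16",
    tasks=["gsm8k"],   # e.g. ifeval, piqa, mmlu, truthfulqa_mc1 for the other benchmarks
    num_fewshot=8,     # 8-shot for GSM8K, per the setup above
    batch_size=8,
)
print(results["results"]["gsm8k"])
```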
One open-ended benchmark, using the official [alpaca_eval](https://github.com/tatsu-lab/alpaca_eval/) toolkit:
- AlpacaEval2: win rate (%) of the model's outputs against GPT-4-turbo's responses, judged by GPT-4-turbo
- LC AlpacaEval2: length-debiased win rate (%) on AlpacaEval2
- Length in Tokens: average output length on AlpacaEval2, measured in tokens with Llama 3's tokenizer
## Input Format

The model is trained to use the following format:
```
<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
```
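In practice, this prompt can be built with the tokenizer's chat template. Below is a minimal sketch, assuming the `transformers` library (plus `accelerate` for `device_map="auto"`) and the tokenizer shipped with this repository; the prompt text and generation settings are illustrative only.

```python
# Minimal sketch: build a Llama-3-style prompt via the chat template and generate.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Junrulu/Llama-3-8B-Instruct-Iterative-SamPO"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Explain DPO in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model answers next
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```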
## Training hyperparameters

The following hyperparameters were used during DPO/SamPO training (a loss sketch follows the list):
- DPO beta: 0.1
- learning_rate: 4e-7 * sqrt(Num of Nodes)
- total_train_batch_size: 128 * Num of Nodes
- optimizer: AdamW with beta1 = 0.9, beta2 = 0.999 and epsilon = 1e-8
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- weight_decay: 0.0
- num_epochs: 3.0
- The input format shown above is applied to every training sample
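
For orientation, here is a minimal PyTorch sketch of the vanilla DPO objective that the beta above configures. It is not the SamPO implementation: SamPO changes how sequence-level log-probabilities are aggregated to reduce length bias, which is omitted here; see the SamPO repository for the actual code.

```python
# Minimal sketch: vanilla DPO loss with beta = 0.1 (SamPO's length debiasing omitted).
# Inputs are per-example summed log-probabilities of the chosen/rejected responses
# under the trained policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: beta-scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference loss: push chosen rewards above rejected ones.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```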