metadata

model-index:
  - name: Junrulu/Llama-3-8B-Instruct-Iterative-SamPO
    results: []
datasets:
  - HuggingFaceH4/ultrafeedback_binarized
language:
  - en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
license: llama3

Model Card for Llama-3-8B-Instruct-Iterative-SamPO

This repository provides a version of Llama-3-8B-Instruct fine-tuned with our proposed SamPO algorithm. We comply with all license terms stated in the Llama 3 release.

Performance

Model	GSM8K	IFEval	PiQA	MMLU	TruthfulQA	AlpacaEval2	LC AlpacaEval2	Length in Tokens
Llama3-8B-Instruct	75.06	49.40	80.69	63.85	36.47	22.57	22.92	421
Llama3-8B-Instruct-DPO	75.59	51.80	81.94	64.06	40.39	23.34	23.20	422
Llama3-8B-Instruct-Iterative-DPO	74.91	52.52	81.66	64.02	39.90	23.92	25.50	403
Llama3-8B-Instruct-Iterative-SamPO	77.81	60.55	81.18	64.12	44.07	30.68	35.14	377

Evaluation Details

Five conditional benchmarks, using lm-evaluation-harness:

GSM8K: 8-shot, report strict match
IFEval: 3-shot, report instruction-level strict accuracy
PiQA: 3-shot, report accuracy
MMLU: 0-shot, report normalized accuracy
TruthfulQA: 3-shot, report accuracy of single-true mc1 setting

One open-ended benchmark, using official alpaca_eval:

AlpacaEval2: win rate (%) of the model's outputs against GPT-4-turbo's responses, as judged by GPT-4-turbo
LC AlpacaEval2: length-debiased win rate (%) of AlpacaEval2
Length in Tokens: the average length of the model's AlpacaEval2 outputs, measured in tokens with Llama3's tokenizer
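
The "Length in Tokens" column can be approximated as a plain average over per-output token counts. The following is a minimal sketch, not the evaluation code used for the table: the function name `average_length_in_tokens` and the `tokenize` callable are hypothetical, and whether special tokens are counted is not stated above.

```python
from typing import Callable, Sequence


def average_length_in_tokens(
    outputs: Sequence[str],
    tokenize: Callable[[str], Sequence],
) -> float:
    """Return the mean number of tokens per output.

    `tokenize` is any callable that maps a string to a sequence of tokens.
    To mirror the table, it would wrap Llama3's tokenizer (assumption: the
    exact tokenizer settings used for the reported numbers are unknown).
    """
    if not outputs:
        raise ValueError("outputs must not be empty")
    return sum(len(tokenize(text)) for text in outputs) / len(outputs)


# Stand-in tokenizer for illustration only: whitespace splitting.
example_average = average_length_in_tokens(["a b c", "d"], str.split)
```

With whitespace splitting as a stand-in, the two example outputs have 3 and 1 tokens, so the average is 2.0. Real token counts will differ once an actual subword tokenizer is plugged in.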

Input Format

The model is trained to use the following format:

<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
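
As an illustration, the format above can be assembled with ordinary string formatting. This is a minimal sketch assuming the line breaks shown above are literal; the helper name `build_prompt` is hypothetical and not part of any library.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a single user turn in the input format shown above.

    The returned string ends right after the assistant header, so the
    model's generation takes the place of {Response}.
    """
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_prompt("What is the capital of France?")
```

The resulting `prompt` string can then be passed to whichever generation API is in use. If the tokenizer's built-in chat template is available, it may be preferable to rely on that instead of building the string by hand.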

Training hyperparameters

The following hyperparameters were used during DPO/SamPO training:

DPO beta: 0.1
learning_rate: 4e-7 * sqrt(Num of Nodes)
total_train_batch_size: 128 * Num of Nodes
optimizer: AdamW with beta1 0.9, beta2 0.999 and epsilon 1e-8
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.1
weight_decay: 0.0
num_epochs: 3.0
The input format above is applied to every training sample.
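
The node-dependent settings in the list above can be worked out with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the rounding of warmup steps is an assumption, since the list gives only the warmup ratio and the "linear" scheduler type.

```python
import math

BASE_LEARNING_RATE = 4e-7   # from the list above
BASE_BATCH_SIZE = 128       # from the list above
WARMUP_RATIO = 0.1          # from the list above


def scaled_hyperparameters(num_nodes: int) -> dict:
    """Apply the square-root LR rule and linear batch-size rule."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return {
        "learning_rate": BASE_LEARNING_RATE * math.sqrt(num_nodes),
        "total_train_batch_size": BASE_BATCH_SIZE * num_nodes,
    }


def linear_schedule_factor(
    step: int, total_steps: int, warmup_ratio: float = WARMUP_RATIO
) -> float:
    """Multiplier on the peak learning rate at a given step.

    Linear warmup from 0 to 1, then linear decay back to 0.
    Assumption: warmup steps are rounded down.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


settings = scaled_hyperparameters(num_nodes=4)
```

For example, with 4 nodes the learning rate is 4e-7 * sqrt(4) = 8e-7 and the total batch size is 128 * 4 = 512. The schedule factor is multiplied by that peak learning rate to obtain the rate at each step.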