---
model-index:
  - name: Junrulu/Llama-3-8B-Instruct-Iterative-SamPO
    results: []
datasets:
  - HuggingFaceH4/ultrafeedback_binarized
language:
  - en
base_model: meta-llama/Meta-Llama-3-8B-Instruct
license: llama3
---

Model Card for Llama-3-8B-Instruct-Iterative-SamPO

This repository provides a version of Llama-3-8B-Instruct fine-tuned with our proposed SamPO algorithm, introduced in "Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence". We comply with all license terms that apply to Llama 3, and the model is distributed under the Llama 3 license.
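To make the idea in the paper title concrete, here is a minimal, hypothetical sketch of a SamPO-style loss for a single preference pair. It assumes the per-token log-ratios log π_θ(y_t | x, y_<t) − log π_ref(y_t | x, y_<t) have already been computed, and it assumes the down-sampling step draws the same number of tokens, min(|y_chosen|, |y_rejected|), at random from each response before summing. The function names are ours; this is an illustration, not the released training code. When both responses have the same length, it reduces to the standard DPO loss.

```python
import math
import random
from typing import Optional, Sequence


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(chosen: Sequence[float], rejected: Sequence[float], beta: float = 0.1) -> float:
    """Standard DPO loss: every token's log-ratio adds to the implicit reward,
    so longer responses can accumulate larger rewards."""
    return neg_log_sigmoid(beta * (sum(chosen) - sum(rejected)))


def downsampled_sum(log_ratios: Sequence[float], k: int, rng: random.Random) -> float:
    """Sum of k per-token log-ratios drawn without replacement (all tokens if k >= length)."""
    if k >= len(log_ratios):
        return float(sum(log_ratios))
    return float(sum(rng.sample(list(log_ratios), k)))


def sampo_loss(
    chosen: Sequence[float],
    rejected: Sequence[float],
    beta: float = 0.1,
    rng: Optional[random.Random] = None,
) -> float:
    """Hypothetical SamPO-style loss: both responses contribute the same
    number of (randomly sampled) tokens to the reward margin."""
    rng = rng if rng is not None else random.Random()
    k = min(len(chosen), len(rejected))
    margin = beta * (downsampled_sum(chosen, k, rng) - downsampled_sum(rejected, k, rng))
    return neg_log_sigmoid(margin)
```

In real training these quantities would be batched tensors with padding masks; plain Python lists are used here only to keep the length-equalizing step visible.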

Performance

| Model | GSM8K | IFEval | PiQA | MMLU | TruthfulQA | AlpacaEval2 | LC AlpacaEval2 | Length in Tokens |
|---|---|---|---|---|---|---|---|---|
| Llama3-8B-Instruct | 75.06 | 49.40 | 80.69 | 63.85 | 36.47 | 22.57 | 22.92 | 421 |
| Llama3-8B-Instruct-DPO | 75.59 | 51.80 | 81.94 | 64.06 | 40.39 | 23.34 | 23.20 | 422 |
| Llama3-8B-Instruct-Iterative-DPO | 74.91 | 52.52 | 81.66 | 64.02 | 39.90 | 23.92 | 25.50 | 403 |
| Llama3-8B-Instruct-Iterative-SamPO | 77.81 | 60.55 | 81.18 | 64.12 | 44.07 | 30.68 | 35.14 | 377 |

Evaluation Details

Five conditional benchmarks, evaluated with lm-evaluation-harness:

  • GSM8K: 8-shot; metric: strict-match accuracy
  • IFEval: 3-shot; metric: instruction-level strict accuracy
  • PiQA: 3-shot; metric: accuracy
  • MMLU: 0-shot; metric: normalized accuracy
  • TruthfulQA: 3-shot; metric: accuracy in the single-true (mc1) setting

One open-ended benchmark, evaluated with the official alpaca_eval toolkit:

  • AlpacaEval2: win rate (%) of the model's outputs against GPT-4-turbo's responses, as judged by GPT-4-turbo
  • LC AlpacaEval2: length-controlled (length-debiased) win rate (%) on AlpacaEval2
  • Length in Tokens: average output length on AlpacaEval2, counted in tokens with the Llama 3 tokenizer
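The Length in Tokens column is the mean tokenized length of the AlpacaEval2 outputs. A minimal sketch, assuming the tokenizer is passed in as a callable; `average_length_in_tokens` is a hypothetical helper, and whether special tokens were counted is not stated in this card, so that choice is left to the caller.

```python
from collections.abc import Callable, Iterable, Sized


def average_length_in_tokens(outputs: Iterable[str], tokenize: Callable[[str], Sized]) -> float:
    """Mean number of tokens per model output, using the supplied tokenizer."""
    lengths = [len(tokenize(text)) for text in outputs]
    if not lengths:
        raise ValueError("no outputs to measure")
    return sum(lengths) / len(lengths)
```

With Hugging Face transformers, one plausible (assumed) choice is `tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")` together with `lambda s: tok.encode(s, add_special_tokens=False)` as the `tokenize` argument.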

Input Format

The model is trained with the following input format:

<|start_header_id|>user<|end_header_id|>

{PROMPT}<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

{Response}
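A minimal sketch that reproduces the template above as a Python string. It assumes the line breaks shown are literal (two newlines after each header, one after `<|eot_id|>`) and that no `<|begin_of_text|>` token is prepended here, since the card does not show one; `build_input` is a hypothetical helper name. The result may differ slightly from the string produced by the tokenizer's built-in chat template.

```python
def build_input(prompt: str, response: str = "") -> str:
    """Wrap a prompt (and optionally a response) in the input format shown above."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{prompt}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{response}"
    )
```

At inference time, call `build_input(prompt)` with an empty response so the model continues after the assistant header.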

Training hyperparameters

The following hyperparameters were used during DPO/SamPO training:

  • DPO beta: 0.1
  • learning_rate: 4e-7 * sqrt(number of nodes)
  • total_train_batch_size: 128 * number of nodes
  • optimizer: AdamW with beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • weight_decay: 0.0
  • num_epochs: 3.0
  • input format: every training sample is wrapped in the input format above
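The node-dependent entries and the learning-rate schedule above can be sketched as follows. `scale_for_nodes` and `linear_lr_with_warmup` are hypothetical helpers; the schedule assumes the common shape of linear warmup from 0 to the peak rate followed by linear decay to 0, with warmup steps rounded up from the ratio (as in Hugging Face transformers), which this card does not state explicitly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledConfig:
    learning_rate: float
    total_train_batch_size: int


def scale_for_nodes(num_nodes: int, base_lr: float = 4e-7, base_batch: int = 128) -> ScaledConfig:
    """Learning rate grows with sqrt(nodes); global batch size grows linearly."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be >= 1")
    return ScaledConfig(base_lr * math.sqrt(num_nodes), base_batch * num_nodes)


def linear_lr_with_warmup(step: int, total_steps: int, peak_lr: float, warmup_ratio: float = 0.1) -> float:
    """Linear warmup for ceil(warmup_ratio * total_steps) steps, then linear decay to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

For example, on 4 nodes the sketch gives a peak learning rate of 8e-7 and a global batch of 512; the square-root rule grows the step size more slowly than the batch size.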