---
license: gemma
library_name: transformers
pipeline_tag: text-generation
base_model: google/gemma-2-27b-it
tags:
- alignment-handbook
- generated_from_trainer
---

# gemma-2-27b-it-simpo-beta10-gamma5-lr8e-7
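
The card metadata above (`library_name: transformers`, `pipeline_tag: text-generation`) points to standard Hugging Face Transformers usage. Below is a minimal usage sketch; the repository id is inferred from this card's title and should be replaced with the actual repo path if it differs:

```python
import torch
from transformers import pipeline

# Assumed repository id, inferred from this card's title; adjust if the repo path differs.
model_id = "AALF/gemma-2-27b-it-simpo-beta10-gamma5-lr8e-7"

generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain preference optimization in two sentences."}]
outputs = generator(messages, max_new_tokens=128)
print(outputs[0]["generated_text"])  # conversation including the model's reply
```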

## Implementation Details
We first followed the [SimPO](https://github.com/princeton-nlp/SimPO) framework to apply [On-Policy Preference Data Generation](https://github.com/princeton-nlp/SimPO/tree/main/on_policy_data_gen) to the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset using the [google/gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) model. We then kept only the preference pairs whose chosen reward was at least 0.01 higher than the rejected reward, resulting in 37,040 training examples.
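
A minimal sketch of this filtering step, assuming the annotated pairs are stored as JSON lines with illustrative `chosen_score` / `rejected_score` fields (the file name and schema are hypothetical; the actual SimPO on-policy pipeline may store scores differently):

```python
from datasets import load_dataset

# Hypothetical file produced by on-policy generation + reward annotation:
# each row holds a prompt, a chosen and a rejected response, and their reward scores.
pairs = load_dataset("json", data_files="gemma2_ultrafeedback_annotated.jsonl", split="train")

MARGIN = 0.01  # keep pairs whose chosen reward exceeds the rejected reward by at least 0.01

def has_sufficient_margin(example):
    return example["chosen_score"] - example["rejected_score"] >= MARGIN

filtered = pairs.filter(has_sufficient_margin)
print(f"Kept {len(filtered)} of {len(pairs)} preference pairs")  # 37,040 kept for this run
```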

Model training was conducted on 8×80GB A800 GPUs using the [alignment-handbook](https://github.com/huggingface/alignment-handbook) library, with DeepSpeed ZeRO Stage 3 (`deepspeed_zero_stage3`) and optimizer offloading to the CPU. The `SimPOTrainer` arguments were as follows:

```yaml
# SimPOTrainer arguments
bf16: true
beta: 10
gamma_beta_ratio: 0.5
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
hub_model_id: simpo-exps
learning_rate: 8.0e-7
log_level: info
logging_steps: 1
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
num_train_epochs: 1
optim: adamw_torch
output_dir: outputs/gemma-2-27b-it-SimPO
run_name: gemma-2-27b-it-SimPO
per_device_train_batch_size: 2
push_to_hub: false
save_strategy: "steps"
save_steps: 100
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
save_only_model: true
```
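
For reference, `beta` and `gamma_beta_ratio` parameterize the SimPO objective: each response is scored by its length-normalized log-likelihood scaled by β, and the chosen response must beat the rejected one by a target margin γ = β · `gamma_beta_ratio` (10 × 0.5 = 5 here, matching the `gamma5` in the model name). A simplified sketch of the loss, not the exact `SimPOTrainer` implementation:

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps, policy_rejected_logps,
               chosen_lengths, rejected_lengths,
               beta=10.0, gamma_beta_ratio=0.5):
    """Simplified SimPO loss (Meng et al., 2024).

    *_logps are summed token log-probabilities of each response under the
    policy; *_lengths are the corresponding response token counts.
    """
    # Length-normalized, beta-scaled implicit rewards.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths

    gamma = beta * gamma_beta_ratio  # target reward margin (5 for this run)
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```

With this configuration, the effective global batch size is 2 per device × 8 GPUs × 8 gradient-accumulation steps = 128.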

## Citation

Gemma model:
```
@article{gemma_2024,
  title={Gemma},
  url={https://www.kaggle.com/m/3301},
  DOI={10.34740/KAGGLE/M/3301},
  publisher={Kaggle},
  author={Gemma Team},
  year={2024}
}
```

SimPO paper:
```
@article{meng2024simpo,
  title={{SimPO}: Simple preference optimization with a reference-free reward},
  author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
  journal={arXiv preprint arXiv:2405.14734},
  year={2024}
}
```

UltraFeedback paper:
```
@article{cui2023ultrafeedback,
  title={{UltraFeedback}: Boosting language models with high-quality feedback},
  author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
  journal={arXiv preprint arXiv:2310.01377},
  year={2023}
}
```

ArmoRM paper:
```
@article{wang2024interpretable,
  title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
  author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
  journal={arXiv preprint arXiv:2406.12845},
  year={2024}
}
```