---
license: gemma
library_name: transformers
pipeline_tag: text-generation
base_model: google/gemma-2-27b-it
tags:
- alignment-handbook
- generated_from_trainer
---
# gemma-2-27b-it-simpo-beta10-gamma5-lr8e-7
## Implementation Details
Following the [SimPO](https://github.com/princeton-nlp/SimPO) framework, we first ran [On-Policy Preference Data Generation](https://github.com/princeton-nlp/SimPO/tree/main/on_policy_data_gen) on the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset with the [google/gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) model. We then kept only the prompts whose chosen reward exceeded the rejected reward by at least 0.01, which left 37,040 training examples.
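The margin filter described above can be sketched as follows. This is a minimal illustration rather than the script that was actually used: the record layout and the field names `chosen_reward` and `rejected_reward` are assumptions made for the example, and only the 0.01 threshold comes from the text.

```python
# Threshold stated in the text: chosen reward must beat rejected reward by >= 0.01.
MIN_REWARD_MARGIN = 0.01


def filter_by_margin(pairs, min_margin=MIN_REWARD_MARGIN):
    """Keep preference pairs whose reward gap is at least `min_margin`.

    Each pair is assumed to be a dict with hypothetical keys
    "chosen_reward" and "rejected_reward" (floats from a reward model).
    """
    return [
        pair
        for pair in pairs
        if pair["chosen_reward"] - pair["rejected_reward"] >= min_margin
    ]


# Toy records, invented purely to exercise the filter.
example_pairs = [
    {"prompt": "a", "chosen_reward": 0.9, "rejected_reward": 0.5},    # gap 0.4, kept
    {"prompt": "b", "chosen_reward": 0.5, "rejected_reward": 0.5},    # gap 0.0, dropped
    {"prompt": "c", "chosen_reward": 0.505, "rejected_reward": 0.5},  # gap 0.005, dropped
]
kept_pairs = filter_by_margin(example_pairs)
```

Pairs with a near-zero gap carry little preference signal, which is presumably why they were excluded before training.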
We trained the model on 8 A800 GPUs (80 GB each) with the [alignment-handbook](https://github.com/huggingface/alignment-handbook) library, using `deepspeed_zero_stage3` with the optimizer offloaded to the CPU. The `SimPOTrainer` arguments were as follows:
```yaml
# SimPOTrainer arguments
bf16: true
beta: 10
gamma_beta_ratio: 0.5
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
hub_model_id: simpo-exps
learning_rate: 8.0e-7
log_level: info
logging_steps: 1
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
num_train_epochs: 1
optim: adamw_torch
output_dir: outputs/gemma-2-27b-it-SimPO
run_name: gemma-2-27b-it-SimPO
per_device_train_batch_size: 2
push_to_hub: false
save_strategy: "steps"
save_steps: 100
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
save_only_model: true
```
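To make the configuration easier to read, the sketch below derives two quantities that the arguments imply and evaluates the SimPO objective for a single preference pair, as described in the SimPO paper. It is an illustration under stated assumptions, not the trainer's code: `num_gpus = 8` assumes every GPU acts as a data-parallel rank, and `simpo_loss` with its argument names is a hypothetical helper written for this example.

```python
import math

# Values copied from the SimPOTrainer arguments above.
beta = 10.0
gamma_beta_ratio = 0.5
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 8  # assumption: all 8 GPUs are data-parallel ranks

# Target reward margin: gamma = beta * gamma_beta_ratio (the "gamma5" in the model name).
gamma = beta * gamma_beta_ratio

# Number of preference pairs contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)


def simpo_loss(chosen_logp_sum, chosen_len, rejected_logp_sum, rejected_len,
               beta=beta, gamma=gamma):
    """SimPO loss for one pair: -log(sigmoid(r_chosen - r_rejected - gamma)).

    Each reward is beta times the length-normalized (average per-token)
    log-probability of the response under the policy. No reference model is used.
    """
    chosen_reward = beta * chosen_logp_sum / chosen_len
    rejected_reward = beta * rejected_logp_sum / rejected_len
    margin = chosen_reward - rejected_reward - gamma
    # -log(sigmoid(m)) equals softplus(-m); this form is numerically stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy numbers: average log-probs of -0.5 (chosen) and -1.0 (rejected) give a
# reward gap of exactly gamma, so the margin is zero.
loss_at_zero_margin = simpo_loss(-10.0, 20, -20.0, 20)
```

With these settings the chosen response must lead the rejected one by 0.5 nats per token on average before the margin term turns positive.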
## Citation
Gemma model:
```
@article{gemma_2024,
title={Gemma},
url={https://www.kaggle.com/m/3301},
DOI={10.34740/KAGGLE/M/3301},
publisher={Kaggle},
author={Gemma Team},
year={2024}
}
```
SimPO paper:
```
@article{meng2024simpo,
title={{SimPO}: Simple preference optimization with a reference-free reward},
author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
journal={arXiv preprint arXiv:2405.14734},
year={2024}
}
```
UltraFeedback paper:
```
@article{cui2023ultrafeedback,
title={{UltraFeedback}: Boosting language models with high-quality feedback},
author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
journal={arXiv preprint arXiv:2310.01377},
year={2023}
}
```
ArmoRM paper:
```
@article{wang2024interpretable,
title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
journal={arXiv preprint arXiv:2406.12845},
year={2024}
}
```