---
license: gemma
library_name: transformers
pipeline_tag: text-generation
base_model: google/gemma-2-27b-it
tags:
- alignment-handbook
- generated_from_trainer
---

# gemma-2-27b-it-simpo-beta10-gamma5-lr8e-7

## Implementation Details

We first followed the [SimPO](https://github.com/princeton-nlp/SimPO) framework to apply [On-Policy Preference Data Generation](https://github.com/princeton-nlp/SimPO/tree/main/on_policy_data_gen) on the [HuggingFaceH4/ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset using the [google/gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) model. We then selected prompts where the chosen reward was at least 0.01 higher than the rejected reward, resulting in 37,040 training data points.

Model training was conducted using 8x80G A800 GPUs, leveraging the [alignment-handbook](https://github.com/huggingface/alignment-handbook) library. We used `deepspeed_zero_stage3` with optimizer offloading to the CPU.
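
The reward-margin selection described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the script used to build the training set: the field names `chosen_reward` and `rejected_reward`, the helper `filter_by_reward_margin`, and the toy records are all hypothetical, and the actual on-policy data may store its scores differently.

```python
# Minimal sketch of a reward-margin filter.
# Assumption: each preference pair is a dict carrying one scalar reward for
# the chosen response and one for the rejected response (hypothetical keys).

MIN_REWARD_MARGIN = 0.01  # threshold stated in the text above


def filter_by_reward_margin(pairs, min_margin=MIN_REWARD_MARGIN):
    """Keep pairs whose chosen reward beats the rejected reward by at least min_margin."""
    return [
        pair
        for pair in pairs
        if pair["chosen_reward"] - pair["rejected_reward"] >= min_margin
    ]


# Toy records, invented for illustration only.
examples = [
    {"prompt": "a", "chosen_reward": 0.20, "rejected_reward": 0.10},   # margin 0.10, kept
    {"prompt": "b", "chosen_reward": 0.105, "rejected_reward": 0.10},  # margin 0.005, dropped
    {"prompt": "c", "chosen_reward": 0.10, "rejected_reward": 0.10},   # margin 0, dropped
]

kept = filter_by_reward_margin(examples)
print([pair["prompt"] for pair in kept])
```

Dropping near-tied pairs like this removes examples where the reward model barely distinguishes the two responses, which is presumably the motivation for the 0.01 threshold.
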

The `SimPOTrainer` arguments were as follows:

```yaml
# SimPOTrainer arguments
bf16: true
beta: 10
gamma_beta_ratio: 0.5
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true
hub_model_id: simpo-exps
learning_rate: 8.0e-7
log_level: info
logging_steps: 1
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
num_train_epochs: 1
optim: adamw_torch
output_dir: outputs/gemma-2-27b-it-SimPO
run_name: gemma-2-27b-it-SimPO
per_device_train_batch_size: 2
push_to_hub: false
save_strategy: "steps"
save_steps: 100
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
save_only_model: true
```

## Citation

Gemma model:

```
@article{gemma_2024,
    title={Gemma},
    url={https://www.kaggle.com/m/3301},
    DOI={10.34740/KAGGLE/M/3301},
    publisher={Kaggle},
    author={Gemma Team},
    year={2024}
}
```

SimPO paper:

```
@article{meng2024simpo,
    title={{SimPO}: Simple preference optimization with a reference-free reward},
    author={Meng, Yu and Xia, Mengzhou and Chen, Danqi},
    journal={arXiv preprint arXiv:2405.14734},
    year={2024}
}
```

UltraFeedback paper:

```
@article{cui2023ultrafeedback,
    title={{UltraFeedback}: Boosting language models with high-quality feedback},
    author={Cui, Ganqu and Yuan, Lifan and Ding, Ning and Yao, Guanming and Zhu, Wei and Ni, Yuan and Xie, Guotong and Liu, Zhiyuan and Sun, Maosong},
    journal={arXiv preprint arXiv:2310.01377},
    year={2023}
}
```

ArmoRM paper:

```
@article{wang2024interpretable,
    title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
    author={Wang, Haoxiang and Xiong, Wei and Xie, Tengyang and Zhao, Han and Zhang, Tong},
    journal={arXiv preprint arXiv:2406.12845},
    year={2024}
}
```