---
language:
- en
license: apache-2.0
tags:
- generated_from_trainer
base_model: microsoft/phi-2
pipeline_tag: text-generation
---
|
|
|
# outputs

This model is a fine-tuned version of [microsoft/phi-2](https://huggingface.co/microsoft/phi-2), trained with the [trl](https://github.com/huggingface/trl) library on the [ultrafeedback_binarized](https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized) dataset.
|
|
|
# What's new

A test of the [ORPO: Monolithic Preference Optimization without Reference Model](https://arxiv.org/pdf/2403.07691.pdf) method using the trl library.
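As a rough numeric sketch (not the trl implementation), ORPO adds an odds-ratio penalty to the standard supervised NLL loss: with the odds of a response defined as odds(y|x) = P(y|x) / (1 - P(y|x)), the penalty is -log σ(log odds_chosen - log odds_rejected), weighted by a coefficient λ. The function names and scalar inputs below are illustrative; the paper works with length-averaged log-probabilities.

```python
import math

def log_odds(avg_log_prob: float) -> float:
    # odds(y|x) = p / (1 - p), from the length-averaged log-probability of the response
    p = math.exp(avg_log_prob)
    return math.log(p / (1.0 - p))

def orpo_loss(nll_chosen: float, avg_logp_chosen: float,
              avg_logp_rejected: float, lam: float = 0.1) -> float:
    # Odds-ratio term: -log sigmoid(log odds_chosen - log odds_rejected)
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    # Total loss = SFT NLL on the chosen response + lambda * odds-ratio penalty
    return nll_chosen + lam * l_or
```

When the chosen response is more likely than the rejected one, the odds ratio is positive and the penalty stays small, so the SFT term dominates; no reference model is needed, which is the point of the method.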
|
|
|
## How to reproduce

```bash
accelerate launch --config_file=/path/to/trl/examples/accelerate_configs/deepspeed_zero2.yaml \
    --num_processes 8 \
    /path/to/trl/scripts/orpo.py \
    --model_name_or_path="microsoft/phi-2" \
    --per_device_train_batch_size 1 \
    --max_steps 8000 \
    --learning_rate 8e-5 \
    --gradient_accumulation_steps 1 \
    --logging_steps 20 \
    --eval_steps 2000 \
    --output_dir="orpo-lora-phi2" \
    --optim rmsprop \
    --warmup_steps 150 \
    --bf16 \
    --logging_first_step \
    --no_remove_unused_columns \
    --use_peft \
    --lora_r=16 \
    --lora_alpha=16 \
    --dataset HuggingFaceH4/ultrafeedback_binarized
```