Llama-3 8B RLHF checkpoint trained by OpenRLHF

Using the models and datasets:

- Base SFT model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Reward model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-mixture
- Prompt dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1

Training Hyperparameters

```
Actor Learning Rate: 5e-7
Critic Learning Rate: 9e-6
Learning Rate Scheduler: Cosine with 0.03 Warmup
PPO Epochs: 1
Training Batch Size: 128
Experience Buffer Size: 1024
Reward Normalization: True
Max Prompt Length: 2048
Max Response Length: 2048
Max Samples: 100k (to save GPU resources)
Number of Samples per Prompt: 1
```

Evaluation

```
Chat-Arena-Hard
-------------------------------------------
llama-3-8b-sft       | score: 5.6
llama-3-8b-rlhf-100k | score: 20.5
```

Training logs
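
Below is a minimal inference sketch using the standard Hugging Face `transformers` API. The repository id is an assumption inferred from the `llama-3-8b-rlhf-100k` name in the evaluation table (it is not stated in this card), and the dtype, chat template usage, and sampling settings are illustrative choices rather than values taken from the training or evaluation setup.

```python
# Hedged inference sketch for this RLHF checkpoint with transformers.
# The repo id below is an assumption; replace it with the actual id of this checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OpenLLMAI/Llama-3-8b-rlhf-100k"  # assumed repo id, not confirmed by the card

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 for a Llama-3 8B checkpoint; adjust for your hardware
    device_map="auto",
)

# The SFT mixture base is a chat model, so we assume the tokenizer ships a chat template.
messages = [{"role": "user", "content": "Explain PPO-based RLHF in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,  # illustrative sampling settings only
    )

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

If the tokenizer does not define a chat template, fall back to plain `tokenizer(prompt, return_tensors="pt")` with whatever prompt format the SFT mixture was trained on.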