
# Critique-out-Loud Reward Models (CLoud)


| Paper | Tweet |

---

## Introduction

Critique-out-Loud (CLoud) reward models reason explicitly about the quality of an input by producing Chain-of-Thought-like critiques of it before predicting a reward. In classic reward model training, the reward model is trained as a reward head initialized on top of the base LLM. Without LM capabilities, classic reward models act as encoders and must predict rewards within a single forward pass through the model, meaning reasoning must happen implicitly. In contrast, CLoud reward models are trained both to produce explicit reasoning about quality and to score based on these critique reasoning traces. CLoud reward models lead to large gains for pairwise preference modeling on RewardBench, and also to large gains in win rate when used as the scoring model in Best-of-N sampling on ArenaHard.

## Todo

- [x] Release models and inference examples
- [ ] Post example training run logs
- [ ] Add ArenaHard evaluation code
- [ ] Add vLLM support for inference

## Table of Contents

- [Introduction](#introduction)
- [Todo](#todo)
- [Table of Contents](#table-of-contents)
- [Setup](#setup)
- [Model Weights](#model-weights)
- [Inference](#inference)
- [Dataset](#dataset)
- [Training](#training)
  - [CLoud Training](#cloud-training)
  - [Classic Training](#classic-training)
- [Evaluation](#evaluation)
- [Citation](#citation)

## Setup

```bash
git clone https://github.com/zankner/CLoud
cd CLoud
pip install -e .
```

Optional: the base Docker image used during development is `mosaicml/pytorch:2.3.0_cu121-python3.11-ubuntu20.04`.

## Model Weights

| Base Model | RM Type | Hugging Face Repo |
| ---------- | ------- | ----------------- |
| Llama3-8B  | Classic | [ankner/Llama3-8B-Classic-RM](https://huggingface.co/ankner/Llama3-8B-Classic-RM) |
| Llama3-8B  | CLoud   | [ankner/Llama3-8B-CLoud-RM](https://huggingface.co/ankner/Llama3-8B-CLoud-RM) |
| Llama3-70B | Classic | [ankner/Llama3-70B-Classic-RM](https://huggingface.co/ankner/Llama3-70B-Classic-RM) |
| Llama3-70B | CLoud   | [ankner/Llama3-70B-CLoud-RM](https://huggingface.co/ankner/Llama3-70B-CLoud-RM) |

## Inference

We provide a gradio demo, which can be run as follows: `gradio cloud/demo.py`. By default this will demo `ankner/Llama3-8B-CLoud-RM`, but you can change the model loaded in the script.

If you want to perform inference on your own data, please refer to the following example:

```python
from cloud.model import CLoudRewardModel
from transformers import AutoTokenizer

model_name = "ankner/Llama3-8B-CLoud-RM"  # Replace with RM trained with this repo
model = CLoudRewardModel.from_pretrained(model_name, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

# Prompts and responses are paired by position.
user_prompt = [
    "Write me a story",
    "What is the capital of the moon?",
]
assistant_response = [
    "No I don't want to do that.",
    "Since the moon is made out of cheese, the capital is mozzarella.",
]

# Returns one reward and one critique per (prompt, response) pair.
rewards, critiques = model.predict_reward(user_prompt, assistant_response, tokenizer)

for reward, critique in zip(rewards, critiques):
    print("Critique:")
    print(critique)
    print("Reward:")
    print(reward)
    print("=" * 100)
```

## Dataset

We provide code to reconstruct the datasets used in the paper. There are two datasets to build for training: one with oracle critiques, meant to simulate human feedback, and one with self-generated critiques.

To build the oracle critique dataset, run:

```bash
python cloud/data/build_official_ultra_llama.py --mode oracle
```

To build the self-generated critique dataset, run:

```bash
python cloud/data/build_official_ultra_llama.py --mode self-gen --model-size {model-size}
```

where ```{model-size}``` is the size of the model you are using (e.g. 8b, 70b).

### Build your own dataset from scratch

1. Build prompts
   - You can use any dataset you like as long as it has ```prompt``` and ```id``` columns. If you would like to build prompts from UltraFeedback and UltraInteract, as we do in the paper, run:

   ```bash
   python cloud/data/build_ultra_prompts.py --save-name {name-to-save-as}
   ```

2. Build chosen / rejected responses

   ```bash
   python cloud/data/build_judgements.py --gen-model {model-generating-responses} --judge-model {model-judging-responses} --base-dataset {path-to-prompt-dataset} --save-name {name-to-save-as}
   ```

   The above command requires a hosted generating model and a hosted judging model. To host the models using vLLM, run:

   ```bash
   python -m vllm.entrypoints.openai.api_server --model {path-to-gen/judge-model} --dtype bfloat16 --tensor-parallel-size {num-gpus} --port {8000 for gen and 8001 for judge}
   ```

3. Build critiques

   ```bash
   python cloud/data/generate_oracle_critiques.py --judge-model {model-generating-critiques} --base-dataset {path-to-responses-dataset} --save-name {name-to-save-as}
   ```

   Again, this command assumes a hosted critique model. To host the critique model, you can use the vLLM command above (this time, use port 8000 for the judge model).
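
Step 1 above only requires that a custom prompt dataset expose ```prompt``` and ```id``` columns. The following is a minimal sketch of a sanity check you could run before the later steps, assuming the rows have been loaded as plain Python dicts. `validate_prompt_rows` is a hypothetical helper and not part of this repo, and the check that `id` values are unique is an assumption about what the downstream scripts expect rather than a documented requirement.

```python
from collections import Counter

REQUIRED_COLUMNS = ("prompt", "id")  # column names stated in step 1


def validate_prompt_rows(rows):
    """Return a list of human-readable problems found in a prompt dataset.

    Hypothetical helper (not part of this repo). `rows` is any iterable of dicts.
    """
    rows = list(rows)
    problems = []
    for index, row in enumerate(rows):
        # Every row must carry both required columns.
        for column in REQUIRED_COLUMNS:
            if column not in row:
                problems.append(f"row {index}: missing column '{column}'")
        # A present prompt must be a non-empty string.
        if "prompt" in row:
            prompt = row["prompt"]
            if not isinstance(prompt, str) or not prompt.strip():
                problems.append(f"row {index}: 'prompt' must be a non-empty string")
    # Assumption: ids should be unique so later steps can match rows by id.
    id_counts = Counter(row["id"] for row in rows if "id" in row)
    for row_id, count in id_counts.items():
        if count > 1:
            problems.append(f"id {row_id!r} appears {count} times")
    return problems


if __name__ == "__main__":
    example_rows = [
        {"id": "a1", "prompt": "Write me a story"},
        {"id": "a1", "prompt": "What is the capital of the moon?"},
        {"id": "a2"},
    ]
    for problem in validate_prompt_rows(example_rows):
        print(problem)
```

An empty list means the rows satisfy the two-column requirement. The helper takes plain dicts so that it does not depend on how the dataset is stored.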

## Training

Before training, you must run the [setup script](#setup) and build the [datasets](#dataset). The training configs are located in the ```cloud/train/configs/``` folder. We have already set the optimal hyperparameters that we found for each model, as reported in the paper. The only parameter that needs to be set is ```variables.micro_batch_size```, which should be chosen according to your GPU memory. If you want to log the training runs, uncomment the ```loggers``` section in the config and fill in your wandb settings.

Checkpoints are saved throughout training to the folder given by the ```save_folder``` parameter, which is ```ckpts/${variables.run_name}``` by default. The final checkpoint will contain an ```hf``` folder where the Hugging Face model is saved.

> **Warning**: The training scripts below, for both CLoud and Classic, prefill the dataset names with the datasets we release. If you would like to train on your own dataset, you will need to follow the directions in the [dataset section](#dataset) to build it and change the ```variables.dataset_path``` parameter in the training configs.

### CLoud Training

1. The first step is to finetune the base model to produce critiques:

   ```bash
   composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_critique_sft.yaml
   ```

   Replace ```{model_size}``` with the size of the model you are training (e.g. 8b, 70b).

2. (Optional if you want to use the self-generated data we release) After the critique SFT model is trained, you need to regenerate the dataset with its critiques. This requires first serving the critique SFT model. To serve it locally using vLLM, run:

   ```bash
   python -m vllm.entrypoints.openai.api_server --model {path-to-critique-sft-model} --dtype bfloat16 --tensor-parallel-size {num-gpus}
   ```

   Then run the data building script:

   ```bash
   python cloud/data/generate_self_critiques.py --model {path-to-critique-sft-model} --base-dataset {path-to-base-dataset} --upload-name {path-to-save-dataset}
   ```

3. After building the self-generated dataset, we can train the CLoud model:

   ```bash
   composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_cloud.yaml
   ```

### Classic Training

To train a classic reward model, use the following command:

```bash
composer -n {num_gpus} cloud/train/train.py cloud/train/configs/{model_size}_classic.yaml
```

## Evaluation

To run evaluation on a given benchmark, run the following command:

```bash
python cloud/eval/eval.py --model-path {path-to-model} --benchmark {benchmark-name}
```

Currently, we only support the RewardBench benchmark.

## Citation

If you found our work useful, please consider citing it:

```bibtex
@misc{ankner2024critiqueoutloudrewardmodels,
      title={Critique-out-Loud Reward Models},
      author={Zachary Ankner and Mansheej Paul and Brandon Cui and Jonathan D. Chang and Prithviraj Ammanabrolu},
      year={2024},
      eprint={2408.11791},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2408.11791},
}
```