---
library_name: transformers
tags:
- reward-model
- RLHF
datasets:
- PKU-Alignment/PKU-SafeRLHF-30K
language:
- en
base_model:
- Nagi-ovo/llama-3-8b-dpo-full
pipeline_tag: text-classification
---

## Overview

This reward model assigns a single scalar score to a prompt-response pair. It is trained on the pairwise human-preference data in PKU-Alignment/PKU-SafeRLHF-30K to score preferred responses higher, and it is designed to provide the reward signal in a Reinforcement Learning from Human Feedback (RLHF) pipeline (see the comparison example after the usage code below).

## Model Architecture

- Base Model: [Nagi-ovo/llama-3-8b-dpo-full](https://huggingface.co/Nagi-ovo/llama-3-8b-dpo-full) (Llama 3 8B after SFT and DPO)
- Output: Single scalar reward value
- Parameters: 8B
- Training Framework: DeepSpeed + TRL (a minimal training sketch follows below)
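
The exact training recipe for this checkpoint is not published in this card, but the sketch below shows the general shape of reward modelling with TRL's `RewardTrainer` on PKU-SafeRLHF-30K. Treat it as a minimal sketch rather than the actual configuration: the dataset column handling, the reuse of the inference prompt template, and the hyperparameters are assumptions, and the `RewardTrainer` signature differs across TRL releases (older versions expect pre-tokenized `input_ids_chosen`/`input_ids_rejected` columns and take `tokenizer=` instead of `processing_class=`).

```python
# Minimal reward-model training sketch (illustrative, not the exact recipe for this checkpoint)
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

BASE = "Nagi-ovo/llama-3-8b-dpo-full"
SYSTEM_PROMPT = "You are a helpful assistant"

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 has no dedicated pad token

# A sequence-classification head with a single label serves as the scalar reward head
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id


def to_preference_pair(example):
    # Column names follow the PKU-SafeRLHF-30K dataset card; verify them before running
    better = example["better_response_id"]
    chosen, rejected = example[f"response_{better}"], example[f"response_{1 - better}"]
    template = "###System: {sys}\n###Question: {q}\n###Answer: {a}<|end_of_text|>"
    return {
        "chosen": template.format(sys=SYSTEM_PROMPT, q=example["prompt"], a=chosen),
        "rejected": template.format(sys=SYSTEM_PROMPT, q=example["prompt"], a=rejected),
    }


dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF-30K", split="train")
pairs = dataset.map(to_preference_pair, remove_columns=dataset.column_names)

trainer = RewardTrainer(
    model=model,
    args=RewardConfig(output_dir="llama-3-8b-rm", per_device_train_batch_size=1, max_length=1024),
    processing_class=tokenizer,  # called `tokenizer=` in older TRL releases
    train_dataset=pairs,  # RewardTrainer minimizes -log sigmoid(r_chosen - r_rejected)
)
trainer.train()
```

DeepSpeed would typically be enabled through the `deepspeed` field of the training arguments or an `accelerate` launcher config; it does not change the modelling code above.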
## Example Usage

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    BitsAndBytesConfig,
)

device = "cuda:0"
model_name = "Nagi-ovo/Llama-3-8B-RM"

# Load the reward model with 4-bit NF4 quantization so it fits on a single GPU
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
    ),
    device_map=device,
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

SYSTEM_PROMPT = "You are a helpful assistant"


def format_prompt_answer(prompt, answer):
    """Format a prompt-answer pair the same way the reward model saw it during training."""
    return f"###System: {SYSTEM_PROMPT}\n###Question: {prompt}\n###Answer: {answer}<|end_of_text|>"


def get_reward_score(prompt, answer):
    """Return the scalar reward score for a prompt-answer pair (higher is better)."""
    formatted_input = format_prompt_answer(prompt, answer)
    inputs = tokenizer(formatted_input, return_tensors="pt").to(device)

    with torch.no_grad():
        # Logits have shape (1, 1): one scalar score for the single input sequence
        logits = model(**inputs).logits

    return logits.item()


prompt = "How are you?"
answer = "I'm doing great! Thank you for asking. How can I help you today?"

score = get_reward_score(prompt, answer)
print(f"Prompt: {prompt}")
print(f"Answer: {answer}")
print(f"Reward Score: {score}")
```
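
Because the model is trained on pairwise preferences, its scores are most meaningful when comparing candidate responses to the same prompt, for example in best-of-n sampling or when building an RLHF reward signal. The snippet below reuses `get_reward_score` from the example above; the prompt and the two candidate answers are made up for illustration.

```python
# Rank several candidate answers to the same prompt by reward score
prompt = "How can I stay safe when cycling at night?"
candidates = [
    "Just ride fast so you spend less time in the dark.",
    "Use front and rear lights, wear reflective clothing, and follow traffic rules.",
]

# Higher score = more preferred by the reward model
scored = sorted(
    ((get_reward_score(prompt, answer), answer) for answer in candidates),
    reverse=True,
)

for score, answer in scored:
    print(f"{score:+.4f}  {answer}")

print(f"\nPreferred answer: {scored[0][1]}")
```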