efromomr committed (verified)
Commit 9f4570d · 1 Parent(s): fe6c9c5

Update README.md

Files changed (1)
  1. README.md +19 -11
README.md CHANGED
@@ -15,23 +15,31 @@ licence: license
  This model is a fine-tuned version of [HuggingFaceTB/SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) on the [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) dataset.
  It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start
-
- ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="efromomr/llm-course-hw2-reward-model", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```
-
- ## Training procedure
-
- This model was trained with Reward.
+ This model is a reward model used for training efromomr/llm-course-hw2-ppo.
+
+ train_loss: 0.07687531913222831
+
+ ## Usage example
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ DEVICE = torch.device("cuda")
+
+ # Load the reward model as a single-label sequence classifier
+ tokenizer = AutoTokenizer.from_pretrained("efromomr/llm-course-hw2-reward-model")
+ reward_model = AutoModelForSequenceClassification.from_pretrained(
+     "efromomr/llm-course-hw2-reward-model", num_labels=1
+ )
+ reward_model.config.pad_token_id = tokenizer.pad_token_id
+ reward_model = reward_model.to(DEVICE)
+ reward_model.eval()
+
+ # Apply the chat template, then tokenize
+ inputs_chosen = tokenizer.apply_chat_template([{"role": "user", "content": "Any text"}], tokenize=False)
+ inputs_chosen = tokenizer(inputs_chosen, return_tensors="pt").to(DEVICE)
+
+ # The single logit is the reward score for the text
+ score_chosen = reward_model(**inputs_chosen).logits[0].cpu().detach()
+ print(score_chosen)
+ # 0.223
+ ```

  ### Framework versions
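
The reward model produces one scalar score per input, so its natural use is to compare alternative replies to the same prompt. Below is a minimal, self-contained sketch of that pattern; the prompt and the two candidate replies are made-up placeholders, not examples from the dataset or the model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("efromomr/llm-course-hw2-reward-model")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "efromomr/llm-course-hw2-reward-model", num_labels=1
)
reward_model.config.pad_token_id = tokenizer.pad_token_id
reward_model = reward_model.to(DEVICE).eval()

def score(messages):
    # Render the conversation with the chat template, tokenize, and return the scalar reward.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

# Illustrative placeholder texts
prompt = "If you had a time machine, would you visit the past or the future?"
reply_a = "Great question! I'd pick the future, just to see how everything turns out."
reply_b = "past"

score_a = score([{"role": "user", "content": prompt}, {"role": "assistant", "content": reply_a}])
score_b = score([{"role": "user", "content": prompt}, {"role": "assistant", "content": reply_b}])
print(score_a, score_b)  # the more human-like reply should receive the higher score
```

During PPO training of efromomr/llm-course-hw2-ppo, this scalar score is the signal the policy model is optimized to maximize.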
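
The card also notes that the model was trained with TRL's reward-modelling ("Reward") trainer, but the training script is not part of this commit. The following is a rough, hypothetical sketch of such a run with a recent TRL release; the config values, dataset handling, and trainer arguments are assumptions, not the author's actual setup.

```python
# Hypothetical reward-model training sketch with TRL; not the author's actual script.
# Assumes a recent TRL version whose RewardTrainer accepts a preference dataset
# with "chosen"/"rejected" columns and a `processing_class` argument.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from trl import RewardConfig, RewardTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Preference pairs; the column layout may need adapting to what RewardTrainer expects.
dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

args = RewardConfig(output_dir="llm-course-hw2-reward-model", per_device_train_batch_size=4)
trainer = RewardTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use `tokenizer=` instead
)
trainer.train()
```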