efromomr committed (verified)
Commit 9f4570d · 1 Parent(s): fe6c9c5

Update README.md

Files changed (1)
  1. README.md +19 -11
README.md CHANGED
@@ -15,23 +15,31 @@ licence: license
  This model is a fine-tuned version of [HuggingFaceTB/SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) on the [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) dataset.
  It has been trained using [TRL](https://github.com/huggingface/trl).

- ## Quick start
-
- ```python
- from transformers import pipeline
-
- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="efromomr/llm-course-hw2-reward-model", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
- ```
-
- ## Training procedure
-
- This model was trained with Reward.
+ This model is a reward model used for training efromomr/llm-course-hw2-ppo.
+
+ train_loss: 0.07687531913222831
+
+ ## Usage example
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ DEVICE = torch.device("cuda")
+
+ # Load the reward model as a single-label sequence classifier
+ tokenizer = AutoTokenizer.from_pretrained("efromomr/llm-course-hw2-reward-model")
+ reward_model = AutoModelForSequenceClassification.from_pretrained(
+     "efromomr/llm-course-hw2-reward-model", num_labels=1
+ )
+ reward_model.config.pad_token_id = tokenizer.pad_token_id
+ reward_model = reward_model.to(DEVICE)
+ reward_model.eval()
+
+ # Apply the chat template, then tokenize
+ inputs_chosen = tokenizer.apply_chat_template([{"role": "user", "content": "Any text"}], tokenize=False)
+ inputs_chosen = tokenizer(inputs_chosen, return_tensors="pt").to(DEVICE)
+
+ # The single logit is the reward score for the text
+ score_chosen = reward_model(**inputs_chosen).logits[0].cpu().detach()
+ print(score_chosen)
+ # 0.223
+ ```

  ### Framework versions
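
The reward model produces one scalar score per input, so its natural use is to compare alternative replies to the same prompt. Below is a minimal, self-contained sketch of that pattern; the prompt and the two candidate replies are made-up placeholders, not examples from the dataset or the model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("efromomr/llm-course-hw2-reward-model")
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "efromomr/llm-course-hw2-reward-model", num_labels=1
)
reward_model.config.pad_token_id = tokenizer.pad_token_id
reward_model = reward_model.to(DEVICE).eval()

def score(messages):
    # Render the conversation with the chat template, tokenize, and return the scalar reward.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        return reward_model(**inputs).logits[0].item()

# Illustrative placeholder texts
prompt = "If you had a time machine, would you visit the past or the future?"
reply_a = "Great question! I'd pick the future, just to see how everything turns out."
reply_b = "past"

score_a = score([{"role": "user", "content": prompt}, {"role": "assistant", "content": reply_a}])
score_b = score([{"role": "user", "content": prompt}, {"role": "assistant", "content": reply_b}])
print(score_a, score_b)  # the more human-like reply should receive the higher score
```

During PPO training of efromomr/llm-course-hw2-ppo, this scalar score is the signal the policy model is optimized to maximize.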
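
The card also notes that the model was trained with TRL's reward-modelling ("Reward") trainer, but the training script is not part of this commit. The following is a rough, hypothetical sketch of such a run with a recent TRL release; the config values, dataset handling, and trainer arguments are assumptions, not the author's actual setup.

```python
# Hypothetical reward-model training sketch with TRL; not the author's actual script.
# Assumes a recent TRL version whose RewardTrainer accepts a preference dataset
# with "chosen"/"rejected" columns and a `processing_class` argument.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from trl import RewardConfig, RewardTrainer

base = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Preference pairs; the column layout may need adapting to what RewardTrainer expects.
dataset = load_dataset("HumanLLMs/Human-Like-DPO-Dataset", split="train")

args = RewardConfig(output_dir="llm-course-hw2-reward-model", per_device_train_batch_size=4)
trainer = RewardTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use `tokenizer=` instead
)
trainer.train()
```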