AdamG012 committed on
Commit
e79fa60
·
1 Parent(s): 75aa9a0

Update README.md

  1. README.md +1 -1
README.md CHANGED
@@ -40,7 +40,7 @@ This pipeline can be broken up into three key steps:
40
 
41
  2. **Reward Model (RM) fine-tuning:** See [here](https://huggingface.co/FSALab/fsalab-chat-opt-350m-reward-deepspeed)
42
 
43
- 3. **Reinforcement Learning from Human Feedback (RLHF) fine-tuning:** Once the prior two steps are complete, the final RLHF fine-tuning can be initiated. This takes both the *fine-tuned model* from step 1 and the *reward model* from step 2 and trains them on the dataset of comparisons. This generates both an [actor](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed) and [critic](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed).
+ 3. **Reinforcement Learning from Human Feedback (RLHF) fine-tuning:** Once the prior two steps are complete, the final RLHF fine-tuning can be initiated. This takes both the *fine-tuned model* from step 1 and the *reward model* from step 2 and trains them on the dataset of comparisons. This generates both an [actor](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed) and [critic](https://huggingface.co/FSALab/fsalab-chat-opt-1.3b-rlhf-actor-deepspeed) model. I also generate an actor model with an exponential moving average (EMA), which is known to improve conversational response quality.
44
 
45
  To view the details behind each step head into their respective links and view the model card there.
46
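
Note on the EMA actor mentioned in the added line: the README does not say how the average is computed, so the following is only a minimal sketch of a generic exponential moving average over model weights. The function name `ema_update`, the plain-list parameter layout, and the decay values are all hypothetical and chosen for illustration; they are not taken from this repository or from its training scripts.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(ema_params: Params, new_params: Params, decay: float = 0.99) -> Params:
    """Update ema_params in place: ema <- decay * ema + (1 - decay) * new.

    `decay` is an assumed hyperparameter; the README does not state a value.
    """
    for name, new_values in new_params.items():
        ema_values = ema_params[name]
        for i, value in enumerate(new_values):
            ema_values[i] = decay * ema_values[i] + (1.0 - decay) * value
    return ema_params


# Toy example: the EMA copy starts equal to the actor weights.
actor = {"w": [1.0, 2.0]}
ema_actor = {"w": [1.0, 2.0]}

# Pretend one training step moved the actor weights.
actor["w"] = [2.0, 4.0]

# The EMA copy only moves a fraction (1 - decay) of the way toward them.
ema_update(ema_actor, actor, decay=0.9)
print(ema_actor["w"])
```

With a decay close to 1, the averaged copy changes slowly and smooths out step-to-step noise in the actor weights, which is the usual motivation for keeping an EMA checkpoint alongside the raw one.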