Where is the example?
Where is the README? I want to try it.
Hi, thanks for your interest; it's the same model interface as https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2
The model was trained to predict the better of two answers.
A forward pass of the frozen model provides a score that can be used as a loss signal for RLHF; there is some code out there to do that, but I didn't explore it.
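As a minimal sketch of that interface (assuming it matches the linked OpenAssistant/reward-model-deberta-v3-large-v2 checkpoint; the checkpoint name, question, and answers below are placeholders), scoring a (question, answer) pair looks roughly like this:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint; substitute the actual reward model discussed here.
model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # the reward model stays frozen

question = "Explain nuclear fusion like I am five."
answer_a = "Nuclear fusion is when two small atoms stick together and release energy."
answer_b = "I don't know."

with torch.no_grad():
    # The model scores a (question, answer) pair; a higher score means a better answer.
    score_a = model(**tokenizer(question, answer_a, return_tensors="pt")).logits[0].item()
    score_b = model(**tokenizer(question, answer_b, return_tensors="pt")).logits[0].item()

print(score_a, score_b)  # the better answer should receive the higher score
```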
Thank you, you're being very helpful. :)
It seems to me the output is higher than that of the model you mentioned earlier (https://huggingface.co/OpenAssistant/reward-model-deberta-v3-large-v2).
Is it true that, in general, your model has better discriminative performance?
The absolute value of the output doesn't really matter; what counts is the gradient, which comes from the difference between the scores for different inputs.
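To illustrate (a hedged sketch with made-up scores, assuming the usual pairwise / Bradley-Terry reading of a preference reward model), only the score difference affects the comparison:

```python
import torch

# Hypothetical scores for two answers to the same question.
score_a, score_b = torch.tensor(2.7), torch.tensor(-1.3)

# Probability that answer A is the better one depends only on the difference.
p_a_better = torch.sigmoid(score_a - score_b)

# Shifting both scores by the same constant leaves the comparison unchanged.
p_shifted = torch.sigmoid((score_a + 100.0) - (score_b + 100.0))
assert torch.allclose(p_a_better, p_shifted)
print(p_a_better.item())
```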
So the absolute value of the output doesn't matter that much, only the comparison between the two scores? Thank you for your prompt attention to this matter.
Yes. The reward model provides a score of how "good" some generated text is. We can then optimize a generator so that it generates "better" text by using the reward model output as the opposite of the loss.
See the RLHF paper by OpenAI.
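A toy, hedged sketch of the idea (a simplified REINFORCE-style update, not the PPO setup from the paper or any training code shipped with this model; `policy`, `reward_fn`, and `tokenizer` are illustrative placeholders):

```python
import torch

def reinforce_step(policy, tokenizer, reward_fn, prompt, optimizer):
    """One toy policy update: use the frozen reward model's score as the opposite of the loss.

    `policy` is a causal LM and `reward_fn(prompt, text) -> float` wraps the
    frozen reward model; both are placeholders for illustration only.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    # Sample a continuation from the current policy.
    generated = policy.generate(**inputs, do_sample=True, max_new_tokens=50)
    text = tokenizer.decode(generated[0], skip_special_tokens=True)

    # The frozen reward model scores the generated text; no gradient flows through it.
    with torch.no_grad():
        reward = reward_fn(prompt, text)

    # Approximate log-probability of the sampled sequence under the policy
    # (outputs.loss is the mean negative log-likelihood per token).
    outputs = policy(generated, labels=generated)
    log_prob = -outputs.loss * generated.shape[1]

    # Higher reward -> lower loss: the reward acts as the negative of the loss.
    loss = -reward * log_prob
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward, text
```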