This is the token-wise reward-based PPO model introduced in the preprint *Segmenting Text and Learning Their Rewards for Improved RLHF in Language Models* (https://arxiv.org/abs/2501.02790). For more details, please visit our repository at https://github.com/yinyueqin/DenseRewardRLHF-PPO.
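As a rough illustration of what "token-wise reward" means here (this is a simplified sketch, not the paper's actual algorithm, and the helper names are hypothetical): standard RLHF-PPO places a single sequence-level score on the final token, while a token-wise scheme gives every generated token its own reward signal, which densifies the learning signal for PPO.

```python
def sparse_rewards(tokens, score):
    # Standard RLHF-PPO: the whole-sequence score lands on the last token;
    # all earlier tokens receive zero immediate reward.
    rewards = [0.0] * len(tokens)
    rewards[-1] = score
    return rewards

def dense_rewards(token_scores):
    # Token-wise scheme: each generated token carries its own reward,
    # so PPO gets feedback along the entire sequence.
    return list(token_scores)

seq = ["The", "answer", "is", "42"]
print(sparse_rewards(seq, 1.0))             # [0.0, 0.0, 0.0, 1.0]
print(dense_rewards([0.1, 0.2, 0.3, 0.4]))  # [0.1, 0.2, 0.3, 0.4]
```

How those per-token rewards are produced (via learned text-segment reward models) is the subject of the paper and repository linked above.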

Model size: 3.82B params · Tensor type: BF16 (Safetensors)