RL Zero: Zero-Shot Language to Behaviors without any Supervision Paper • 2412.05718 • Published Dec 7, 2024 • 4 • 2
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms Paper • 2406.02900 • Published Jun 5, 2024 • 11
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-79872 Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 21
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-19968 Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 25
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-79872 Text Generation • Updated May 18, 2024 • 21
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 24
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0175-alpha-0-LATEST Text Generation • Updated May 18, 2024 • 21
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-step-19968 Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.025-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 26
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-59904 Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.0375-alpha-0-LATEST Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-79872 Text Generation • Updated May 18, 2024 • 23
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.01-alpha-0-step-19968 Text Generation • Updated May 18, 2024 • 27
hsikchi/pythia-6.9b-goldrm_tldr-dpo-beta-0.025-alpha-0-step-39936 Text Generation • Updated May 18, 2024 • 23