AlignmentResearch/robust_llm_oskar-024b_clf_spam_Qwen2.5-7B_s-2_adv_tr_gcg_t-2 Updated 12 days ago • 48
AlignmentResearch/robust_llm_oskar-024b_clf_spam_Qwen2.5-3B_s-2_adv_tr_gcg_t-2 Updated 29 days ago • 89
AlignmentResearch/robust_llm_oskar-024b_clf_spam_Qwen2.5-3B_s-1_adv_tr_gcg_t-1 Updated 12 days ago • 6
AlignmentResearch/robust_llm_oskar-024b_clf_spam_Qwen2.5-7B_s-1_adv_tr_gcg_t-1 Updated 12 days ago • 47
AlignmentResearch/robust_llm_oskar-015f_clf_harmless_pythia-12b_s-2_adv_tr_gcg_t-2 Updated 11 days ago
Invariance in Policy Optimisation and Partial Identifiability in Reward Learning Paper • 2203.07475 • Published Mar 14, 2022