reward-bench / src /md.py

natolambert

update

9ceb843 over 1 year ago


ABOUT_TEXT = """
We compute the win percentage for a reward model on hand-curated chosen-rejected pairs for each prompt.
A win is when the score for the chosen response is higher than the score for the rejected response.

### Subset summary

| Subset | Num. Samples (Pre-filtering, post-filtering) | Description |
| :--------------------- | :------------------------------------------: | :---------------------------------------------------------------- |
| alpacaeval-easy | 805 | Great model vs poor model |
| alpacaeval-length | 805 | Good model vs low model, equal length |
| alpacaeval-hard | 805 | Great model vs baseline model |
| mt-bench-easy | 28, 28 | MT Bench 10s vs 1s |
| mt-bench-medium | 45, 40 | MT Bench 9s vs 2-5s |
| mt-bench-hard | 45, 37 | MT Bench 7-8 vs 5-6 |
| refusals-dangerous | 505 | Dangerous response vs no response |
| refusals-offensive | 704 | Offensive response vs no response |
| llmbar-natural | 100 | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor | 134 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst | 92 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4-generated off-topic prompt response |
| llmbar-adver-GPTOut | 47 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 responses |
| llmbar-adver-manual | 46 | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set chosen vs. rejected |
| XSTest | 450 | TODO curate |
| (?) repetitiveness | | |
| (?) grammar | | |


For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev).
"""

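The win-percentage definition in `ABOUT_TEXT` can be illustrated with a minimal sketch. The function name `win_percentage` and the `(chosen, rejected)` tuple layout are hypothetical and are not taken from the reward-bench code. Counting a tie as a non-win is an assumption based on the wording "higher than".

```python
from typing import Sequence, Tuple


def win_percentage(pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the percentage of (chosen, rejected) score pairs won by the chosen response.

    Hypothetical helper for illustration only. A win requires the chosen
    score to be strictly higher, so ties count as non-wins (an assumption).
    """
    if not pairs:
        raise ValueError("need at least one (chosen, rejected) pair")
    wins = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return 100.0 * wins / len(pairs)


# Made-up reward scores: two wins, one loss, one tie.
example_scores = [(0.9, 0.1), (0.4, 0.6), (0.7, 0.7), (1.2, -0.3)]
print(win_percentage(example_scores))  # 50.0
```

Under this reading, a per-subset score would come from calling the helper once per subset with that subset's pairs.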