# Using on your data
Source code is available as a pip-installable Python package.
## Installation
Use of a virtual environment is recommended:
```bash
conda create -n selfrank python=3.10
```
Activate the virtual environment:
```bash
conda activate selfrank
```
and then install:
```bash
pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git
```
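To quickly check that the installation worked, you can try importing the package; this sketch assumes the same module path used in the examples below:

```python
# Quick sanity check: this import should succeed after installation.
from selfrank.algos.iterative import SelfRank
print("selfrank import ok")
```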
## Usage
Start by gathering model inferences for the same question/prompt across all the models you want to rank. The ranking method expects a pandas DataFrame with a row for each prompt and a column for each model (see the sketch after the table), i.e.
|     | M1  | M2  | M3  | ... |
|:----|:----|:----|:----|:----|
| Q1  | a   | a   | b   | ... |
| Q2  | a   | b   | b   | ... |
| ... | ... | ... | ... | ... |
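If you do not already have such a file, here is a minimal sketch of building the DataFrame by hand; the model names and answers below are purely illustrative placeholders:

```python
import pandas as pd

# One row per prompt, one column per model; values are each model's answer.
df = pd.DataFrame(
    {
        "M1": ["a", "a", "b"],
        "M2": ["a", "b", "b"],
        "M3": ["b", "b", "a"],
    },
    index=["Q1", "Q2", "Q3"],
)
```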
With this data, the self-ranking procedure can be invoked as follows:
```python
import pandas as pd

from selfrank.algos.iterative import SelfRank       # the full ranking algorithm
from selfrank.algos.greedy import SelfRankGreedy    # the greedy version
from selfrank.algos.triplet import rouge, equality  # built-in evaluation functions

# Inferences: one row per prompt, one column per model
df = pd.read_csv("inferences.csv")

models_to_rank = df.columns.tolist()
evaluator = rouge    # evaluation function the judge model uses
true_ranking = None  # supply a known ranking here to enable r.measure()

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)

r.fit(df)
print(r.ranking)
```
This should output the estimated ranking (best to worst), e.g. `['M5', 'M2', 'M1', ...]`. If the true ranking is known, evaluation measures can be computed with `r.measure(metric='rbo')` (rank-biased overlap) or `r.measure(metric='mapk')` (mean average precision).
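For instance, if a gold ordering is available from an external benchmark, it can be passed in at construction time and compared against after fitting; a minimal sketch with a hypothetical gold ranking:

```python
# Hypothetical gold ordering (illustration only), e.g. from an external benchmark.
true_ranking = ["M5", "M2", "M1", "M4", "M3"]

r = SelfRank(models_to_rank, evaluator, true_ranking)
r.fit(df)

print(r.measure(metric="rbo"))   # rank-biased overlap with the true ranking
print(r.measure(metric="mapk"))  # mean average precision
```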
We provide implementations of a few evaluation functions, i.e. the function the judge model uses to evaluate the contestant models. While `rouge` is recommended for generative tasks like summarization, `equality` is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes.
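For example, if your inference file contains discrete answers (say, MMLU-style letter choices), you could pass `equality` instead of `rouge`; a minimal sketch:

```python
from selfrank.algos.triplet import equality

# Exact-match comparison is a better fit than ROUGE when outputs are
# discrete labels such as "A"/"B"/"C"/"D".
r = SelfRank(models_to_rank, equality, None)
r.fit(df)
print(r.ranking)
```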
You can also pass any arbitrary function to the ranker, as long as it has the following signature:
```python
def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    Use model c to evaluate a vs. b.
    df: a DataFrame with the inferences of all models.
    Returns 1 if a is preferred, 0 if b is preferred.
    """
    # In this example, we count how often a's (resp. b's) answer matches c's,
    # ignoring prompts on which a and b agree.
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~ties)
    b_wins = sum((df[b] == df[c]) & ~ties)

    if a_wins >= b_wins:
        return 1
    else:
        return 0
```
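A custom evaluator defined this way is then passed to the ranker in place of the built-ins, e.g.:

```python
r = SelfRank(models_to_rank, user_function, None)
r.fit(df)
print(r.ranking)
```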