# Using on your data
Source code is available as a pip-installable Python package.
## Installation
Use of a virtual environment is recommended:
```bash
conda create -n selfrank python=3.10
```
Activate the virtual environment:
```bash
conda activate selfrank
```
and then install:
```bash
pip install git+https://huggingface.co/spaces/ibm/llm-rank-themselves.git
```
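To quickly check that the installation worked, you can try importing the package; this sketch assumes the same module path used in the examples below:

```python
# Quick sanity check: this import should succeed after installation.
from selfrank.algos.iterative import SelfRank
print("selfrank import ok")
```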
## Usage
Start by gathering model inferences for the same question/prompt across all the models you want to rank. The ranking method expects a pandas DataFrame with a row for each prompt and a column for each model (see the sketch after the table), i.e.
|     | M1  | M2  | M3  | ... |
|:----|:----|:----|:----|:----|
| Q1  | a   | a   | b   | ... |
| Q2  | a   | b   | b   | ... |
| ... | ... | ... | ... | ... |
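If you do not already have such a file, here is a minimal sketch of building the DataFrame by hand; the model names and answers below are purely illustrative placeholders:

```python
import pandas as pd

# One row per prompt, one column per model; values are each model's answer.
df = pd.DataFrame(
    {
        "M1": ["a", "a", "b"],
        "M2": ["a", "b", "b"],
        "M3": ["b", "b", "a"],
    },
    index=["Q1", "Q2", "Q3"],
)
```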
With this data, the self-ranking procedure can be invoked as follows:
```python
import pandas as pd

from selfrank.algos.iterative import SelfRank       # the full ranking algorithm
from selfrank.algos.greedy import SelfRankGreedy    # the greedy version
from selfrank.algos.triplet import rouge, equality  # built-in evaluation functions

# Inferences: one row per prompt, one column per model
df = pd.read_csv("inferences.csv")

models_to_rank = df.columns.tolist()
evaluator = rouge    # evaluation function the judge model uses
true_ranking = None  # supply a known ranking here to enable r.measure()

r = SelfRank(models_to_rank, evaluator, true_ranking)
# or, for the greedy version
# r = SelfRankGreedy(models_to_rank, evaluator, true_ranking)

r.fit(df)
print(r.ranking)
```
This should output the estimated ranking (best to worst), e.g. `['M5', 'M2', 'M1', ...]`. If the true ranking is known, evaluation measures can be computed with `r.measure(metric='rbo')` (rank-biased overlap) or `r.measure(metric='mapk')` (mean average precision).
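For instance, if a gold ordering is available from an external benchmark, it can be passed in at construction time and compared against after fitting; a minimal sketch with a hypothetical gold ranking:

```python
# Hypothetical gold ordering (illustration only), e.g. from an external benchmark.
true_ranking = ["M5", "M2", "M1", "M4", "M3"]

r = SelfRank(models_to_rank, evaluator, true_ranking)
r.fit(df)

print(r.measure(metric="rbo"))   # rank-biased overlap with the true ranking
print(r.measure(metric="mapk"))  # mean average precision
```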
We provide implementations of a few evaluation functions, i.e. the function the judge model uses to evaluate the contestant models. While `rouge` is recommended for generative tasks like summarization, `equality` is more appropriate for multiple-choice settings (like MMLU) or classification tasks with a discrete set of outcomes.
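For example, if your inference file contains discrete answers (say, MMLU-style letter choices), you could pass `equality` instead of `rouge`; a minimal sketch:

```python
from selfrank.algos.triplet import equality

# Exact-match comparison is a better fit than ROUGE when outputs are
# discrete labels such as "A"/"B"/"C"/"D".
r = SelfRank(models_to_rank, equality, None)
r.fit(df)
print(r.ranking)
```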
You can also pass any arbitrary function to the ranker, as long as it has the following signature:
```python
def user_function(a: str, b: str, c: str, df: pd.DataFrame) -> int:
    """
    Use model c to evaluate a vs. b.
    df: a DataFrame with the inferences of all models.
    Returns 1 if a is preferred, 0 if b is preferred.
    """
    # In this example, we count how often a's (resp. b's) answer matches c's,
    # ignoring prompts on which a and b agree.
    ties = df[a] == df[b]
    a_wins = sum((df[a] == df[c]) & ~ties)
    b_wins = sum((df[b] == df[c]) & ~ties)

    if a_wins >= b_wins:
        return 1
    else:
        return 0
```
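A custom evaluator defined this way is then passed to the ranker in place of the built-ins, e.g.:

```python
r = SelfRank(models_to_rank, user_function, None)
r.fit(df)
print(r.ranking)
```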