Spaces:
Sleeping
Sleeping
File size: 13,985 Bytes
5657307 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 |
# Using the `evaluator`
The `Evaluator` classes allow to evaluate a triplet of model, dataset, and metric. The models wrapped in a pipeline, responsible for handling all preprocessing and post-processing and out-of-the-box, `Evaluator`s support transformers pipelines for the supported tasks, but custom pipelines can be passed, as showcased in the section [Using the `evaluator` with custom pipelines](custom_evaluator).
Currently supported tasks are:
- `"text-classification"`: will use the [`TextClassificationEvaluator`].
- `"token-classification"`: will use the [`TokenClassificationEvaluator`].
- `"question-answering"`: will use the [`QuestionAnsweringEvaluator`].
- `"image-classification"`: will use the [`ImageClassificationEvaluator`].
- `"text-generation"`: will use the [`TextGenerationEvaluator`].
- `"text2text-generation"`: will use the [`Text2TextGenerationEvaluator`].
- `"summarization"`: will use the [`SummarizationEvaluator`].
- `"translation"`: will use the [`TranslationEvaluator`].
- `"automatic-speech-recognition"`: will use the [`AutomaticSpeechRecognitionEvaluator`].
- `"audio-classification"`: will use the [`AudioClassificationEvaluator`].
To run an `Evaluator` with several tasks in a single call, use the [EvaluationSuite](evaluation_suite), which runs evaluations on a collection of `SubTask`s.
Each task has its own set of requirements for the dataset format and pipeline output, make sure to check them out for your custom use case. Let's have a look at some of them and see how you can use the evaluator to evalute a single or multiple of models, datasets, and metrics at the same time.
## Text classification
The text classification evaluator can be used to evaluate text models on classification datasets such as IMDb. Beside the model, data, and metric inputs it takes the following optional inputs:
- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels need for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.
By default the `"accuracy"` metric is computed.
### Evaluate models on the Hub
There are several ways to pass a model to the evaluator: you can pass the name of a model on the Hub, you can load a `transformers` model and pass it to the evaluator or you can pass an initialized `transformers.Pipeline`. Alternatively you can pass any callable function that behaves like a `pipeline` call for the task in any framework.
So any of the following works:
```py
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")
# 1. Pass a model name or path
eval_results = task_evaluator.compute(
model_or_pipeline="lvwerra/distilbert-imdb",
data=data,
label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
# 2. Pass an instantiated model
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")
eval_results = task_evaluator.compute(
model_or_pipeline=model,
data=data,
label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
# 3. Pass an instantiated pipeline
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")
eval_results = task_evaluator.compute(
model_or_pipeline=pipe,
data=data,
label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```
<Tip>
Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and else CPU. If you want to use a specific device you can pass `device` to `compute` where -1 will use the GPU and a positive integer (starting with 0) will use the associated CUDA device.
</Tip>
The results will look as follows:
```python
{
'accuracy': 0.918,
'latency_in_seconds': 0.013,
'samples_per_second': 78.887,
'total_time_in_seconds': 12.676
}
```
Note that evaluation results include both the requested metric, and information about the time it took to obtain predictions through the pipeline.
<Tip>
The time performances can give useful indication on model speed for inference but should be taken with a grain of salt: they include all the processing that goes on in the pipeline. This may include tokenizing, post-processing, that may be different depending on the model. Furthermore, it depends a lot on the hardware you are running the evaluation on and you may be able to improve the performance by optimizing things like the batch size.
</Tip>
### Evaluate multiple metrics
With the [`combine`] function one can bundle several metrics into an object that behaves like a single metric. We can use this to evaluate several metrics at once with the evaluator:
```python
import evaluate
eval_results = task_evaluator.compute(
model_or_pipeline="lvwerra/distilbert-imdb",
data=data,
metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```
The results will look as follows:
```python
{
'accuracy': 0.918,
'f1': 0.916,
'precision': 0.9147,
'recall': 0.9187,
'latency_in_seconds': 0.013,
'samples_per_second': 78.887,
'total_time_in_seconds': 12.676
}
```
Next let's have a look at token classification.
## Token Classification
With the token classification evaluator one can evaluate models for tasks such as NER or POS tagging. It has the following specific arguments:
- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels need for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.
- `join_by=" "`: While most datasets are already tokenized the pipeline expects a string. Thus the tokens need to be joined before passing to the pipeline. By default they are joined with a whitespace.
Let's have a look how we can use the evaluator to benchmark several models.
### Benchmarking several models
Here is an example where several models can be compared thanks to the `evaluator` in only a few lines of code, abstracting away the preprocessing, inference, postprocessing, metric computation:
```python
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline
models = [
"xlm-roberta-large-finetuned-conll03-english",
"dbmdz/bert-large-cased-finetuned-conll03-english",
"elastic/distilbert-base-uncased-finetuned-conll03-english",
"dbmdz/electra-large-discriminator-finetuned-conll03-english",
"gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
"philschmid/distilroberta-base-ner-conll2003",
"Jorgeutd/albert-base-v2-finetuned-ner",
]
data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))
task_evaluator = evaluator("token-classification")
results = []
for model in models:
results.append(
task_evaluator.compute(
model_or_pipeline=model, data=data, metric="seqeval"
)
)
df = pd.DataFrame(results, index=models)
df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]]
```
The result is a table that looks like this:
| model | overall_f1 | overall_accuracy | total_time_in_seconds | samples_per_second | latency_in_seconds |
|:-------------------------------------------------------------------|-------------:|-------------------:|------------------------:|---------------------:|---------------------:|
| Jorgeutd/albert-base-v2-finetuned-ner | 0.941 | 0.989 | 4.515 | 221.468 | 0.005 |
| dbmdz/bert-large-cased-finetuned-conll03-english | 0.962 | 0.881 | 11.648 | 85.850 | 0.012 |
| dbmdz/electra-large-discriminator-finetuned-conll03-english | 0.965 | 0.881 | 11.456 | 87.292 | 0.011 |
| elastic/distilbert-base-uncased-finetuned-conll03-english | 0.940 | 0.989 | 2.318 | 431.378 | 0.002 |
| gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner | 0.947 | 0.991 | 2.376 | 420.873 | 0.002 |
| philschmid/distilroberta-base-ner-conll2003 | 0.961 | 0.994 | 2.436 | 410.579 | 0.002 |
| xlm-roberta-large-finetuned-conll03-english | 0.969 | 0.882 | 11.996 | 83.359 | 0.012 |
### Visualizing results
You can feed in the `results` list above into the `plot_radar()` function to visualize different aspects of their performance and choose the model that is the best fit, depending on the metric(s) that are relevant to your use case:
```python
import evaluate
from evaluate.visualization import radar_plot
>>> plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"])
>>> plot.show()
```
<div class="flex justify-center">
<img src="https://huggingface.co/datasets/evaluate/media/resolve/main/viz.png" width="400"/>
</div>
Don't forget to specify `invert_range` for metrics for which smaller is better (such as the case for latency in seconds).
If you want to save the plot locally, you can use the `plot.savefig()` function with the option `bbox_inches='tight'`, to make sure no part of the image gets cut off.
## Question Answering
With the question-answering evaluator one can evaluate models for QA without needing to worry about the complicated pre- and post-processing that's required for these models. It has the following specific arguments:
- `question_column="question"`: the name of the column containing the question in the dataset
- `context_column="context"`: the name of the column containing the context
- `id_column="id"`: the name of the column cointaing the identification field of the question and answer pair
- `label_column="answers"`: the name of the column containing the answers
- `squad_v2_format=None`: whether the dataset follows the format of squad_v2 dataset where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.
Let's have a look how we can evaluate QA models and compute confidence intervals at the same time.
### Confidence intervals
Every evaluator comes with the options to compute confidence intervals using [bootstrapping](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html). Simply pass `strategy="bootstrap"` and set the number of resanmples with `n_resamples`.
```python
from datasets import load_dataset
from evaluate import evaluator
task_evaluator = evaluator("question-answering")
data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
model_or_pipeline="distilbert-base-uncased-distilled-squad",
data=data,
metric="squad",
strategy="bootstrap",
n_resamples=30
)
```
Results include confidence intervals as well as error estimates as follows:
```python
{
'exact_match':
{
'confidence_interval': (79.67, 84.54),
'score': 82.30,
'standard_error': 1.28
},
'f1':
{
'confidence_interval': (85.30, 88.88),
'score': 87.23,
'standard_error': 0.97
},
'latency_in_seconds': 0.0085,
'samples_per_second': 117.31,
'total_time_in_seconds': 8.52
}
```
## Image classification
With the image classification evaluator we can evaluate any image classifier. It uses the same keyword arguments at the text classifier:
- `input_column="image"`: the name of the column containing the images as PIL ImageFile
- `label_column="label"`: the name of the column containing the labels
- `label_mapping=None`: We want to map class labels defined by the model in the pipeline to values consistent with those defined in the `label_column`
Let's have a look at how can evaluate image classification models on large datasets.
### Handling large datasets
The evaluator can be used on large datasets! Below, an example shows how to use it on ImageNet-1k for image classification. Beware that this example will require to download ~150 GB.
```python
data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)
pipe = pipeline(
task="image-classification",
model="facebook/deit-small-distilled-patch16-224"
)
task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
model_or_pipeline=pipe,
data=data,
metric="accuracy",
label_mapping=pipe.model.config.label2id
)
```
Since we are using `datasets` to store data we make use of a technique called memory mappings. This means that the dataset is never fully loaded into memory which saves a lot of RAM. Running the above code only uses roughly 1.5 GB of RAM while the validation split is more than 30 GB big.
|