# Using the `evaluator`

The `Evaluator` classes allow evaluating a triplet of model, dataset, and metric. The model is wrapped in a pipeline, which is responsible for all preprocessing and post-processing. Out of the box, `Evaluator`s support `transformers` pipelines for the supported tasks, but custom pipelines can be passed, as showcased in the section [Using the `evaluator` with custom pipelines](custom_evaluator).

Currently supported tasks are:
- `"text-classification"`: will use the [`TextClassificationEvaluator`].
- `"token-classification"`: will use the [`TokenClassificationEvaluator`].
- `"question-answering"`: will use the [`QuestionAnsweringEvaluator`].
- `"image-classification"`: will use the [`ImageClassificationEvaluator`].
- `"text-generation"`: will use the [`TextGenerationEvaluator`].
- `"text2text-generation"`: will use the [`Text2TextGenerationEvaluator`].
- `"summarization"`: will use the [`SummarizationEvaluator`].
- `"translation"`: will use the [`TranslationEvaluator`].
- `"automatic-speech-recognition"`: will use the [`AutomaticSpeechRecognitionEvaluator`].
- `"audio-classification"`: will use the [`AudioClassificationEvaluator`].

To run an `Evaluator` with several tasks in a single call, use the [EvaluationSuite](evaluation_suite), which runs evaluations on a collection of `SubTask`s.
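If you want to see what this looks like in practice, here is a minimal sketch; the suite repository used below is only an illustrative example:

```python
from evaluate import EvaluationSuite

# Load a suite definition from the Hub (the repository name below is illustrative)
suite = EvaluationSuite.load("evaluate/evaluation-suite-ci")

# Run every SubTask in the suite against a single model or pipeline
results = suite.run("lvwerra/distilbert-imdb")
```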

Each task has its own set of requirements for the dataset format and pipeline output; make sure to check them out for your custom use case. Let's have a look at some of them and see how you can use the evaluator to evaluate single or multiple models, datasets, and metrics at the same time.

## Text classification

The text classification evaluator can be used to evaluate text models on classification datasets such as IMDb. Besides the model, data, and metric inputs it takes the following optional inputs:

- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.

By default the `"accuracy"` metric is computed.
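For instance, a minimal sketch that passes these column arguments explicitly; for IMDb they match the defaults, so this is equivalent to omitting them:

```python
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test[:100]")

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    input_column="text",    # column fed to the pipeline
    label_column="label",   # column holding the reference labels
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
```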

### Evaluate models on the Hub

There are several ways to pass a model to the evaluator: you can pass the name of a model on the Hub, you can load a `transformers` model and pass it to the evaluator, or you can pass an initialized `transformers.Pipeline`. Alternatively, you can pass any callable function that behaves like a `pipeline` call for the task, in any framework.

So any of the following works:

```py
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline

data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")

# 1. Pass a model name or path
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 2. Pass an instantiated model
model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=model,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)

# 3. Pass an instantiated pipeline
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```
<Tip>

Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and else CPU. If you want to use a specific device you can pass `device` to `compute`: `-1` will use the CPU while a non-negative integer (starting at 0) will run the model on the associated CUDA device.

</Tip>
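For example, to run the evaluation above on the first CUDA device (assuming a GPU is available):

```python
eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    device=0,  # CUDA device 0; pass -1 to force CPU
)
```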


The results will look as follows:
```python
{
    'accuracy': 0.918,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
```

Note that the evaluation results include both the requested metric and information about the time it took to obtain the predictions through the pipeline.

<Tip>

The time performances can give a useful indication of model inference speed but should be taken with a grain of salt: they include all the processing that goes on in the pipeline. This may include tokenization and post-processing, which may differ depending on the model. Furthermore, the numbers depend a lot on the hardware you are running the evaluation on, and you may be able to improve them by optimizing things like the batch size.

</Tip>
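One knob worth experimenting with is the pipeline batch size. A sketch that sets it by instantiating the pipeline yourself; whether batching actually helps depends on your model and hardware:

```python
from transformers import pipeline

# Instantiate the pipeline yourself to control batch size and device,
# then hand it to the evaluator as before.
pipe = pipeline(
    "text-classification",
    model="lvwerra/distilbert-imdb",
    device=0,       # assumes a GPU is available
    batch_size=32,  # tune this value for your hardware
)

eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
)
```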

### Evaluate multiple metrics

With the [`combine`] function one can bundle several metrics into an object that behaves like a single metric. We can use this to evaluate several metrics at once with the evaluator:

```python
import evaluate

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
```
The results will look as follows:
```python
{
    'accuracy': 0.918,
    'f1': 0.916,
    'precision': 0.9147,
    'recall': 0.9187,
    'latency_in_seconds': 0.013,
    'samples_per_second': 78.887,
    'total_time_in_seconds': 12.676
}
```

Next let's have a look at token classification.

## Token Classification

With the token classification evaluator one can evaluate models for tasks such as NER or POS tagging. It has the following specific arguments:

- `input_column="text"`: with this argument the column with the data for the pipeline can be specified.
- `label_column="label"`: with this argument the column with the labels for the evaluation can be specified.
- `label_mapping=None`: the label mapping aligns the labels in the pipeline output with the labels needed for evaluation. E.g. the labels in `label_column` can be integers (`0`/`1`) whereas the pipeline can produce label names such as `"positive"`/`"negative"`. With that dictionary the pipeline outputs are mapped to the labels.
- `join_by=" "`: While most datasets are already tokenized, the pipeline expects a string. Thus the tokens need to be joined before being passed to the pipeline. By default they are joined with a whitespace.
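For example, for a CoNLL-style dataset that stores the words in a `"tokens"` column and the tags in `"ner_tags"`, you can point the evaluator at those columns explicitly (a sketch; adjust the column names to your dataset):

```python
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("token-classification")
data = load_dataset("conll2003", split="validation[:100]")

eval_results = task_evaluator.compute(
    model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
    data=data,
    metric="seqeval",
    input_column="tokens",    # pre-tokenized words
    label_column="ner_tags",  # per-token tag ids
    join_by=" ",              # how tokens are glued back into a string for the pipeline
)
```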

Let's have a look at how we can use the evaluator to benchmark several models.

### Benchmarking several models

Here is an example where several models can be compared thanks to the `evaluator` in only a few lines of code, abstracting away the preprocessing, inference, postprocessing, and metric computation:

```python
import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]

data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))
task_evaluator = evaluator("token-classification")

results = []
for model in models:
    results.append(
        task_evaluator.compute(model_or_pipeline=model, data=data, metric="seqeval")
    )

df = pd.DataFrame(results, index=models)
df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]]
```

The result is a table that looks like this:

|   model                                                            |   overall_f1 |   overall_accuracy |   total_time_in_seconds |   samples_per_second |   latency_in_seconds |
|:-------------------------------------------------------------------|-------------:|-------------------:|------------------------:|---------------------:|---------------------:|
| Jorgeutd/albert-base-v2-finetuned-ner                              |        0.941 |              0.989 |                   4.515 |              221.468 |                0.005 |
| dbmdz/bert-large-cased-finetuned-conll03-english                   |        0.962 |              0.881 |                  11.648 |               85.850 |                0.012 |
| dbmdz/electra-large-discriminator-finetuned-conll03-english        |        0.965 |              0.881 |                  11.456 |               87.292 |                0.011 |
| elastic/distilbert-base-uncased-finetuned-conll03-english          |        0.940 |              0.989 |                   2.318 |              431.378 |                0.002 |
| gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner |        0.947 |              0.991 |                   2.376 |              420.873 |                0.002 |
| philschmid/distilroberta-base-ner-conll2003                        |        0.961 |              0.994 |                   2.436 |              410.579 |                0.002 |
| xlm-roberta-large-finetuned-conll03-english                        |        0.969 |              0.882 |                  11.996 |               83.359 |                0.012 |
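Since the results are collected in a pandas `DataFrame`, you can also rank the models directly, for example by overall F1:

```python
# Sort the benchmark table by F1 score, best model first
print(df.sort_values("overall_f1", ascending=False)[["overall_f1", "samples_per_second"]])
```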


### Visualizing results

You can feed the `results` list above into the `radar_plot()` function to visualize different aspects of their performance and choose the model that is the best fit, depending on the metric(s) that are relevant to your use case:

```python
from evaluate.visualization import radar_plot

plot = radar_plot(data=results, model_names=models, invert_range=["latency_in_seconds"])
plot.show()
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/evaluate/media/resolve/main/viz.png" width="400"/>
</div>


Don't forget to specify `invert_range` for metrics for which smaller is better (as is the case for latency in seconds).

If you want to save the plot locally, you can use the `plot.savefig()` function with the option `bbox_inches='tight'` to make sure no part of the image gets cut off.
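For example (the file name here is arbitrary):

```python
# Save the radar plot to disk without clipping the labels
plot.savefig("model_comparison_radar.png", bbox_inches="tight")
```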


## Question Answering

With the question-answering evaluator one can evaluate models for QA without needing to worry about the complicated pre- and post-processing that's required for these models. It has the following specific arguments:


- `question_column="question"`: the name of the column containing the question in the dataset
- `context_column="context"`: the name of the column containing the context
- `id_column="id"`: the name of the column containing the identification field of the question and answer pair
- `label_column="answers"`: the name of the column containing the answers
- `squad_v2_format=None`: whether the dataset follows the format of squad_v2 dataset where a question may have no answer in the context. If this parameter is not provided, the format will be automatically inferred.
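As a sketch, evaluating a SQuAD v2-style setup where some questions have no answer could look like this (the model and split below are only examples):

```python
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("question-answering")
data = load_dataset("squad_v2", split="validation[:100]")

eval_results = task_evaluator.compute(
    model_or_pipeline="deepset/roberta-base-squad2",  # a model trained on SQuAD v2
    data=data,
    metric="squad_v2",
    squad_v2_format=True,  # questions may have no answer in the context
)
```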

Let's have a look at how we can evaluate QA models and compute confidence intervals at the same time.

### Confidence intervals

Every evaluator comes with the option to compute confidence intervals using [bootstrapping](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.bootstrap.html). Simply pass `strategy="bootstrap"` and set the number of resamples with `n_resamples`.

```python
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("question-answering")

data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)
```

Results include confidence intervals as well as error estimates as follows:

```python
{
    'exact_match':
    {
        'confidence_interval': (79.67, 84.54),
        'score': 82.30,
        'standard_error': 1.28
    },
    'f1':
    {
        'confidence_interval': (85.30, 88.88),
        'score': 87.23,
        'standard_error': 0.97
    },
    'latency_in_seconds': 0.0085,
    'samples_per_second': 117.31,
    'total_time_in_seconds': 8.52
 }
```

## Image classification

With the image classification evaluator we can evaluate any image classifier. It uses the same keyword arguments as the text classification evaluator:

- `input_column="image"`: the name of the column containing the images as PIL `ImageFile`s
- `label_column="label"`: the name of the column containing the labels
- `label_mapping=None`: maps the class labels produced by the pipeline to values consistent with those defined in the `label_column`
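Before moving to a large dataset, a quick sketch on a small one can be useful; the `beans` dataset and the `nateraw/vit-base-beans` checkpoint below are assumptions used purely for illustration:

```python
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

task_evaluator = evaluator("image-classification")
data = load_dataset("beans", split="test")

pipe = pipeline(task="image-classification", model="nateraw/vit-base-beans")

eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping=pipe.model.config.label2id,  # align pipeline label names with dataset label ids
)
```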

Let's have a look at how we can evaluate image classification models on large datasets.

### Handling large datasets

The evaluator can be used on large datasets! Below, an example shows how to use it on ImageNet-1k for image classification. Beware that this example requires downloading ~150 GB of data.

```python
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)

pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)

task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping=pipe.model.config.label2id
)
```

Since we use `datasets` to store the data, we benefit from a technique called memory mapping. This means that the dataset is never fully loaded into memory, which saves a lot of RAM. Running the above code only uses roughly 1.5 GB of RAM, while the validation split is more than 30 GB on disk.
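If you want to verify this yourself, a rough sketch using `psutil` (an extra dependency, not required by `evaluate`) compares the process memory with the dataset size reported by `datasets`; the `dataset_size` attribute used below is assumed to be populated for this dataset:

```python
import os

import psutil

# Resident memory of the current Python process, in GB
rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1024**3
# On-disk size of the memory-mapped Arrow data, in GB
disk_gb = data.dataset_size / 1024**3

print(f"Process RAM: {rss_gb:.1f} GB, dataset on disk: {disk_gb:.1f} GB")
```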