---
title: VendiScore
datasets:
-  
tags:
- evaluate
- metric
description: "The Vendi Score is a metric for evaluating diversity in machine learning. See the project's README at https://github.com/vertaix/Vendi-Score for more information."
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
---

# Metric Card for VendiScore

The Vendi Score (VS) is a metric for evaluating diversity in machine learning.
The input to the metric is a collection of samples and a pairwise similarity function, and the output is a number that can be interpreted as the effective number of unique elements in the sample.
See the project's README at https://github.com/vertaix/Vendi-Score for more information.

## Metric Description
The Vendi Score (VS) is a metric for evaluating diversity in machine learning.
The input to the metric is a collection of samples and a pairwise similarity function, and the output is a number that can be interpreted as the effective number of unique elements in the sample.
Specifically, given a positive semi-definite matrix $K \in \mathbb{R}^{n \times n}$ of similarity scores, the score is defined as:
$$\mathrm{VS}(K) = \exp\left(-\mathrm{tr}\left(\frac{K}{n} \log \frac{K}{n}\right)\right) = \exp\left(-\sum_{i=1}^n \lambda_i \log \lambda_i\right),$$
where $\lambda_i$ are the eigenvalues of $K/n$ and $0 \log 0 = 0$.
That is, the Vendi Score is the exponential of the von Neumann entropy of $K/n$ or, equivalently, of the Shannon entropy of the eigenvalues of $K/n$. This quantity is also known as the effective rank.
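
The definition above can be sketched directly in NumPy. This is a minimal illustration rather than the code of the `evaluate` module or the Python package; the function name `vendi_score_from_kernel` and the `1e-12` cutoff used to implement the $0 \log 0 = 0$ convention are assumptions made for this example.

```python
import numpy as np

def vendi_score_from_kernel(K):
    """Vendi Score of a symmetric PSD similarity matrix K (hypothetical helper)."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    # Eigenvalues of K/n; eigvalsh assumes the matrix is symmetric.
    lam = np.linalg.eigvalsh(K / n)
    # Drop zero eigenvalues (and round-off noise around zero): 0 log 0 = 0.
    lam = lam[lam > 1e-12]
    entropy = -np.sum(lam * np.log(lam))
    return float(np.exp(entropy))

# Four mutually dissimilar samples: the score is close to 4.
print(vendi_score_from_kernel(np.eye(4)))
# Four identical samples: the score is close to 1.
print(vendi_score_from_kernel(np.ones((4, 4))))
```

The two calls cover the extreme cases: an identity matrix (no sample resembles any other) and an all-ones matrix (every sample is the same).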

## How to Use
The Vendi Score is available as a Python package or in HuggingFace `evaluate`.
To use the Python package, see the instructions at https://github.com/vertaix/Vendi-Score.
To use the `evaluate` module, pass a list of samples and a similarity function or a string identifying a predefined class of similarity functions (see below).

```
>>> import evaluate
>>> vendiscore = evaluate.load("danf0/vendiscore")
>>> samples = ["Look, Jane.",
               "See Spot.",
               "See Spot run.",
               "Run, Spot, run.",
               "Jane sees Spot run."]
>>> results = vendiscore.compute(samples=samples, k="ngram_overlap", ns=[1, 2])
>>> print(results)
{'VS': 3.90657...}
```

### Inputs
- **samples**: an iterable containing $n$ samples to score, an n x n similarity
       matrix K, or an n x d feature matrix X.
- **k**: a pairwise similarity function, or a string identifying a predefined 
       similarity function. If k is a pairwise similarity function, it should
       be symmetric and k(x, x) = 1.
       Options: ngram_overlap, text_embeddings, pixels, image_embeddings.
- **score_K**: if true, samples is an n x n similarity matrix K.
- **score_X**: if true, samples is an n x d feature matrix X.
- **score_dual**: if true, samples is an n x d feature matrix X and we will
       compute the diversity score using the d x d covariance matrix X.T @ X.
- **normalize**: if true, normalize the similarity scores.
- **model (optional)**: if k is "text_embeddings", a model mapping sentences to
       embeddings (output should be an object with an attribute called
       `pooler_output` or `last_hidden_state`). If k is "image_embeddings", a
       model mapping images to embeddings.
- **tokenizer (optional)**: if k is "text_embeddings" or "ngram_overlap", a
       tokenizer mapping strings to lists.
- **transform (optional)**: if k is "image_embeddings", a torchvision transform
       to apply to the samples.
- **model_path (optional)**: if k is "text_embeddings", the name of a model on
       the HuggingFace hub.
- **ns (optional)**: if k is "ngram_overlap", the values of n to calculate.
- **batch_size (optional)**: batch size to use if k is "text_embeddings" or
       "image_embeddings".
- **device (optional)**: a string (e.g. "cuda", "cpu") or torch.device
       identifying the device to use if k is "text_embeddings"
       or "image_embeddings".


### Output Values

The output is a dictionary with one key, "VS".
Given n samples, the value of the Vendi Score ranges between 1 and n, with higher numbers indicating that the sample is more diverse.
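
Both endpoints of this range follow from the definition in the Metric Description. As a short worked check (here $I_n$ denotes the identity matrix and $\mathbf{1}\mathbf{1}^\top$ the all-ones matrix, notation introduced only for this example): if no two samples are similar, then $K = I_n$ and every eigenvalue of $K/n$ equals $1/n$, so

$$\mathrm{VS}(I_n) = \exp\left(-\sum_{i=1}^n \frac{1}{n} \log \frac{1}{n}\right) = \exp(\log n) = n.$$

If all samples are identical, then $K = \mathbf{1}\mathbf{1}^\top$ and $K/n$ has a single nonzero eigenvalue $\lambda_1 = 1$, so

$$\mathrm{VS}(\mathbf{1}\mathbf{1}^\top) = \exp(-1 \log 1) = \exp(0) = 1.$$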

### Examples

```
>>> import evaluate
>>> import numpy as np
>>> vendiscore = evaluate.load("danf0/vendiscore")
>>> samples = [0, 0, 10, 10, 20, 20]
>>> k = lambda a, b: np.exp(-np.abs(a - b))
>>> vendiscore.compute(samples=samples, k=k)["VS"]
2.9999
```
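
To see what a pairwise similarity function contributes, the sketch below builds the similarity matrix by hand for the same samples and scores it from its eigenvalues. It is an illustration under stated assumptions, not the module's implementation: `kernel_matrix` is a hypothetical helper, and the `1e-12` cutoff is one way to apply the $0 \log 0 = 0$ convention.

```python
import numpy as np

def kernel_matrix(samples, k):
    """Evaluate the pairwise similarity function k on every pair (hypothetical helper)."""
    n = len(samples)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(samples[i], samples[j])
    return K

samples = [0, 0, 10, 10, 20, 20]
k = lambda a, b: np.exp(-np.abs(a - b))
K = kernel_matrix(samples, k)

# Score the matrix with VS(K) = exp(-sum_i lambda_i log lambda_i),
# where lambda_i are the eigenvalues of K/n.
lam = np.linalg.eigvalsh(K / len(samples))
lam = lam[lam > 1e-12]  # convention: 0 log 0 = 0
vs = float(np.exp(-np.sum(lam * np.log(lam))))
print(vs)
```

The samples form three pairs of duplicates that are almost completely dissimilar across pairs, so the printed value sits just below 3.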

If you have already precomputed a similarity matrix:
```
>>> K = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
>>> vendiscore.compute(samples=K, score_K=True)["VS"]
2.1573
```

If your similarity function is a dot product between `n` normalized
`d`-dimensional embeddings `X`, and `d` < `n`, it is faster
to compute the Vendi Score using the `d` x `d` covariance matrix, `X.T @ X`,
which has the same nonzero eigenvalues as the `n` x `n` similarity matrix `X @ X.T`.
(If the rows of `X` are not normalized, set `normalize=True`.)
```
>>> X = np.array([[100, 0], [99, 1], [1, 99], [0, 100]])
>>> vendiscore.compute(samples=X, score_dual=True, normalize=True)["VS"]
1.9989...
```
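
The reason the smaller matrix suffices can be checked numerically: for an `n` x `d` matrix `X`, the matrices `X @ X.T` (`n` x `n`) and `X.T @ X` (`d` x `d`) share the same nonzero eigenvalues, so both give the same score. The following is a minimal NumPy sketch of that check, not the module's own code; the helper `exp_entropy` and its `1e-12` cutoff are assumptions made for this example.

```python
import numpy as np

def exp_entropy(eigenvalues):
    """exp of the Shannon entropy of the positive eigenvalues (0 log 0 = 0)."""
    lam = eigenvalues[eigenvalues > 1e-12]
    return float(np.exp(-np.sum(lam * np.log(lam))))

X = np.array([[100, 0], [99, 1], [1, 99], [0, 100]], dtype=float)
# Normalize each row to unit length so that a sample's similarity with itself is 1.
X = X / np.linalg.norm(X, axis=1, keepdims=True)
n = X.shape[0]

primal = exp_entropy(np.linalg.eigvalsh(X @ X.T / n))  # n x n similarity matrix
dual = exp_entropy(np.linalg.eigvalsh(X.T @ X / n))    # d x d covariance matrix
print(np.isclose(primal, dual))
```

When `d` is much smaller than `n`, the eigendecomposition of the `d` x `d` matrix is the cheaper of the two.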

Image similarity can be calculated using inner products between pixel vectors or between embeddings from a neural network.
The default embeddings are from the pool-2048 layer of the torchvision version of the Inception v3 model; other embedding functions can be passed to the `model` argument.
```
>>> from torchvision import datasets
>>> mnist = datasets.MNIST("data/mnist", train=False, download=True)
>>> digits = [[x for x, y in mnist if y == c] for c in range(10)]
>>> pixel_vs = [vendiscore.compute(samples=imgs, k="pixels")["VS"] for imgs in digits]
>>> inception_vs = [vendiscore.compute(samples=imgs, k="image_embeddings", batch_size=64, device="cuda")["VS"] for imgs in digits]
>>> for y, (pvs, ivs) in enumerate(zip(pixel_vs, inception_vs)): print(f"{y}\t{pvs:.02f}\t{ivs:.02f}")
0       7.68    3.45
1       5.31    3.50
2       12.18   3.62
3       9.97    2.97
4       11.10   3.75
5       13.51   3.16
6       9.06    3.63
7       9.58    4.07
8       9.69    3.74
9       8.56    3.43
```

Text similarity can be calculated using n-gram overlap or using inner products between embeddings from a neural network.
```
>>> sents = ["Look, Jane.",
             "See Spot.",
             "See Spot run.",
             "Run, Spot, run.",
             "Jane sees Spot run."]
>>> ngram_vs = vendiscore.compute(samples=sents, k="ngram_overlap", ns=[1, 2])["VS"]
>>> bert_vs = vendiscore.compute(samples=sents, k="text_embeddings", model_path="bert-base-uncased")["VS"]
>>> simcse_vs = vendiscore.compute(samples=sents, k="text_embeddings", model_path="princeton-nlp/unsup-simcse-bert-base-uncased")["VS"]
>>> print(f"N-grams: {ngram_vs:.02f}, BERT: {bert_vs:.02f}, SimCSE: {simcse_vs:.02f}")
N-grams: 3.91, BERT: 1.21, SimCSE: 2.81
```

## Limitations and Bias
The Vendi Score depends on the choice of similarity function. Care should be taken to select a similarity function that reflects the features that are relevant for defining diversity in a given application.

## Citation