jordyvl committed
Commit 0c94397
1 Parent(s): ac1bb79

might be defunct now

Files changed (3):
  1. README.md +13 -2
  2. app.py +2 -1
  3. ece.py +119 -53
README.md CHANGED
@@ -20,7 +20,14 @@ pinned: false
  <!---
  *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
  -->
- `ECE` is a standard metric to evaluate top-1 prediction miscalibration. Generally, the lower the better.
  
  ## How to Use
@@ -30,6 +37,8 @@ pinned: false
  -->
  
  ### Inputs
  <!---
  *List all input arguments in the format below*
@@ -52,11 +61,12 @@ pinned: false
  *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
  -->
  
  ## Limitations and Bias
  <!---
  *Note any known limitations or biases that the metric has, with links and references if possible.*
  -->
- See [3],[4] and [5]
  
  ## Citation
  [1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
@@ -64,6 +74,7 @@ See [3],[4] and [5]
  [3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring Calibration in Deep Learning. In CVPR Workshops (Vol. 2, No. 7).
  [4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
  [5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
  
  ## Further References
  *Add any useful further references.*

  <!---
  *Give a brief overview of this metric, including what task(s) it is usually used for, if any.*
  -->
+ Expected Calibration Error (`ECE`) is a standard metric to evaluate top-1 prediction miscalibration.
+ It measures the L^p norm difference between a model’s posterior probability and the true likelihood of being correct.
+ $$ECE_p(f)^p = \mathbb{E}_{(X,Y)} \left[ \left\lvert \mathbb{P}\left[Y = \hat{y} \mid f(X) = \hat{p}\right] - \hat{p} \right\rvert^p \right],$$
+ where $\hat{y} = \argmax_{y'} [f(X)]_{y'}$ is the class prediction with associated posterior probability $\hat{p} = \max_{y'} [f(X)]_{y'}$.
+
+ It is generally implemented as a binned estimator that discretizes predicted probabilities into a set of bins over which the conditional expectation of correctness can be estimated.
+
+ As a metric of calibration *error*, lower values indicate a better-calibrated model.
+ For valid model comparisons, make sure to use the same keyword arguments; a minimal sketch of the binned estimator follows below.
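
For illustration, a minimal NumPy sketch of such an equal-range binned estimator of top-1 calibration error. It is not part of this module: the function name `binned_top1_ece` and the toy arrays are hypothetical, and it uses the average confidence per bin as the calibration proxy, whereas ece.py also offers an upper-edge proxy and equal-mass binning.

```python
import numpy as np

def binned_top1_ece(probs, labels, n_bins=10, p=1):
    """Toy equal-range binned estimator of top-1 L^p calibration error."""
    p_max = probs.max(axis=-1)                   # top-1 confidence p-hat
    correct = probs.argmax(axis=-1) == labels    # indicator of y == y-hat
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # assign confidences to bins; clip so that p-hat == 1.0 lands in the last bin
    idx = np.clip(np.digitize(p_max, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in the bin
            conf = p_max[mask].mean()     # average confidence in the bin
            ece += mask.mean() * abs(acc - conf) ** p
    return ece ** (1 / p)

probs = np.array([[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1])
print(binned_top1_ece(probs, labels, n_bins=5))
```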
  
  
  ## How to Use
  -->
  
  
+
+
  ### Inputs
  <!---
  *List all input arguments in the format below*
  
  *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
  -->
  
+
  ## Limitations and Bias
  <!---
  *Note any known limitations or biases that the metric has, with links and references if possible.*
  -->
+ See [3], [4] and [5].
  
  ## Citation
  [1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  
  [3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring Calibration in Deep Learning. In CVPR Workshops (Vol. 2, No. 7).
  [4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
  [5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
+ [6] Allen-Zhu, Z., Li, Y. and Liang, Y., 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32.
  
  ## Further References
  *Add any useful further references.*

app.py CHANGED
@@ -3,4 +3,5 @@ from evaluate.utils import launch_gradio_widget
  
  
  module = evaluate.load("jordyvl/ece")
- launch_gradio_widget(module)

  
  
  module = evaluate.load("jordyvl/ece")
+ launch_gradio_widget(module)
+
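
For context, a minimal sketch of exercising the metric outside the Gradio widget, following the pattern of `test_ECE` in ece.py, which calls `_compute` directly with per-class probability vectors and integer class labels; the toy values below are hypothetical:

```python
import evaluate
import numpy as np

module = evaluate.load("jordyvl/ece")

predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.4, 0.4, 0.2]])  # per-class probabilities, shape (N, K)
references = np.array([0, 1, 2])           # integer class labels, shape (N,)

# mirrors test_ECE in ece.py, which calls the underlying _compute directly
result = module._compute(predictions, references)
print(result)  # {'ECE': ...}
```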
ece.py CHANGED
@@ -13,6 +13,8 @@
  # limitations under the License.
  """TODO: Add a description here."""
  
  import evaluate
  import datasets
  import numpy as np
@@ -29,7 +31,8 @@ year={2022}
  
  # TODO: Add description of the module here
  _DESCRIPTION = """\
- This new module is designed to solve this great ML task and is crafted with a lot of care.
  """
  
@@ -41,6 +44,9 @@ Args:
          should be a string with tokens separated by spaces.
      references: list of reference for each prediction. Each
          reference should be a string with tokens separated by spaces.
  Returns:
      accuracy: description of the first score,
      another_score: description of the second score,
@@ -55,14 +61,50 @@ Examples:
  """
  
  # TODO: Define external resources urls if needed
- BAD_WORDS_URL = "http://url/to/external/resource/bad_words.txt"
  
- # TODO
  
- def bin_idx_dd(P, bins):
-     oneDbins = np.digitize(P, bins) - 1  # since bins contains extra righmost&leftmost bins
  
      # Tie-breaking to the left for rightmost bin
      # Using `digitize`, values that fall on an edge are put in the right bin.
      # For the rightmost bin, we want values equal to the right
@@ -72,7 +114,7 @@ def bin_idx_dd(P, bins):
      # Find the rounding precision
      dedges_min = np.diff(bins).min()
      if dedges_min == 0:
-         raise ValueError('The smallest edge difference is numerically 0.')
  
      decimal = int(-np.log10(dedges_min)) + 6
  
@@ -87,48 +129,49 @@ def bin_idx_dd(P, bins):
  
  
  def manual_binned_statistic(P, y_correct, bins, statistic="mean"):
-
-     binnumbers = bin_idx_dd(np.expand_dims(P, 0), bins)[0]
      result = np.empty([len(bins)], float)
-     result.fill(np.nan)
  
-     flatcount = np.bincount(binnumbers, None)
      a = flatcount.nonzero()
  
-     if statistic == 'mean':
-         flatsum = np.bincount(binnumbers, y_correct)
          result[a] = flatsum[a] / flatcount[a]
-     return result, bins, binnumbers + 1  # fix for what happens in bin_idx_dd
  
- def CE_estimate(y_correct, P, bins=None, n_bins=10, p=1):
      """
      y_correct: binary (N x 1)
      P: normalized (N x 1) either max or per class
  
-     Summary: weighted average over the accuracy/confidence difference of equal-range bins
      """
  
-     # defaults:
-     if bins is None:
-         n_bins = n_bins
-         bin_range = [0, 1]
-         bins = np.linspace(bin_range[0], bin_range[1], n_bins + 1)
-         # expected; equal range binning
-     else:
-         n_bins = len(bins) - 1
-         bin_range = [min(bins), max(bins)]
-
-     # average bin probability #55 for bin 50-60; mean per bin
-     calibrated_acc = bins[1:]  # right/upper bin edges
-     # calibrated_acc = bin_centers(bins)
  
      empirical_acc, bin_edges, bin_assignment = manual_binned_statistic(P, y_correct, bins)
      bin_numbers, weights_ece = np.unique(bin_assignment, return_counts=True)
      anindices = bin_numbers - 1  # reduce bin counts; left edge; indexes right BY DEFAULT
  
      # Expected calibration error
-     if p < np.inf:  # Lp-CE
          CE = np.average(
              abs(empirical_acc[anindices] - calibrated_acc[anindices]) ** p,
              weights=weights_ece,  # weighted average 1/binfreq
@@ -138,11 +181,14 @@ def CE_estimate(y_correct, P, bins=None, n_bins=10, p=1):
  
      return CE
  
- def top_CE(Y, P, **kwargs):
-     y_correct = (Y == np.argmax(P, -1)).astype(int)
-     p_max = np.max(P, -1)
-     top_CE = CE_estimate(y_correct, p_max, **kwargs)  # can choose n_bins and norm
-     return top_CE
  
  
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
@@ -157,9 +203,18 @@ class ECE(evaluate.EvaluationModule):
      4. apply L^p norm distance and weights
      """
  
-     #have to add to initialization here?
-     #create bins using the params
-     #create proxy
  
      def _info(self):
          # TODO: Specifies the evaluate.EvaluationModuleInfo object
@@ -170,15 +225,17 @@ class ECE(evaluate.EvaluationModule):
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
-            features=datasets.Features({
-                'predictions': datasets.Value('float32'),
-                'references': datasets.Value('int64'),
-            }),
            # Homepage of the module for documentation
-            homepage="http://module.homepage", #https://huggingface.co/spaces/jordyvl/ece
            # Additional links to the codebase or references
            codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
-            reference_urls=["http://path.to.reference.url/new_module"]
        )
  
      def _download_and_prepare(self, dl_manager):
@@ -188,20 +245,24 @@ class ECE(evaluate.EvaluationModule):
  
      def _compute(self, predictions, references):
          """Returns the scores"""
-         ECE = top_CE(references, predictions)
          return {
              "ECE": ECE,
          }
  
  
  def test_ECE():
-     N = 10 #10 instances
-     K = 5 #5 class problem
-
-     def random_mc_instance(concentration=1):
-         reference = np.argmax(np.random.dirichlet(([concentration for _ in range(K)])),-1)
-         prediction = np.random.dirichlet(([concentration for _ in range(K)])) #probabilities
-         #OH #return np.eye(K)[np.argmax(reference,-1)]
          return reference, prediction
  
      references, predictions = list(zip(*[random_mc_instance() for i in range(N)]))
@@ -210,5 +271,10 @@ def test_ECE():
      res = ECE()._compute(predictions, references)
      print(f"ECE: {res['ECE']}")
  
- if __name__ == '__main__':
-     test_ECE()

  # limitations under the License.
  """TODO: Add a description here."""
  
+ # https://huggingface.co/spaces/jordyvl/ece
+
  import evaluate
  import datasets
  import numpy as np
  
  # TODO: Add description of the module here
  _DESCRIPTION = """\
+ This new module is designed to evaluate the calibration of a probabilistic classifier.
+ More concretely, we provide a binned empirical estimator of top-1 calibration error. [1]
  """
  
          should be a string with tokens separated by spaces.
      references: list of reference for each prediction. Each
          reference should be a string with tokens separated by spaces.
+
+
+
  Returns:
      accuracy: description of the first score,
      another_score: description of the second score,
 
  """
  
  # TODO: Define external resources urls if needed
+ BAD_WORDS_URL = ""
+
+
+ # Discretization and binning
+ def create_bins(n_bins=10, scheme="equal-range", bin_range=None, P=None):
+     assert scheme in [
+         "equal-range",
+         "equal-mass",
+     ], f"This binning scheme {scheme} is not implemented yet"
+
+     if bin_range is None:
+         if P is None:
+             bin_range = [0, 1]  # no way to know range
+         else:
+             if scheme == "equal-range":
+                 bin_range = [min(P), max(P)]
  
+     if scheme == "equal-range":
+         bins = np.linspace(bin_range[0], bin_range[1], n_bins + 1)  # equal range
+         # bins = np.tile(np.linspace(bin_range[0], bin_range[1], n_bins + 1), (n_classes, 1))
  
+     elif scheme == "equal-mass":
+         assert P.size >= n_bins, "Fewer points than bins"
  
+         # assume global equal mass binning; not discriminated per class
+         P = P.flatten()
  
+         # split sorted probabilities into groups of approx equal size
+         groups = np.array_split(np.sort(P), n_bins)
+         bin_upper_edges = list()
+
+         # rightmost entry per equal size group
+         for cur_group in range(n_bins - 1):
+             bin_upper_edges += [max(groups[cur_group])]
+         bin_upper_edges += [np.inf]  # always +1 for right edges
+         bins = np.array(bin_upper_edges)
+
+     return bins
+
+
+ def discretize_into_bins(P, bins):
+     oneDbins = np.digitize(P, bins) - 1  # since bins contains extra rightmost & leftmost bins
+
+     # Fix to scipy.binned_dd_statistic:
      # Tie-breaking to the left for rightmost bin
      # Using `digitize`, values that fall on an edge are put in the right bin.
      # For the rightmost bin, we want values equal to the right
  
      # Find the rounding precision
      dedges_min = np.diff(bins).min()
      if dedges_min == 0:
+         raise ValueError("The smallest edge difference is numerically 0.")
  
      decimal = int(-np.log10(dedges_min)) + 6
  
 
  
  
  def manual_binned_statistic(P, y_correct, bins, statistic="mean"):
+     bin_assignments = discretize_into_bins(np.expand_dims(P, 0), bins)[0]
      result = np.empty([len(bins)], float)
+     result.fill(np.nan)  # cannot assume each bin will have observations
  
+     flatcount = np.bincount(bin_assignments, None)
      a = flatcount.nonzero()
  
+     if statistic == "mean":
+         flatsum = np.bincount(bin_assignments, y_correct)
          result[a] = flatsum[a] / flatcount[a]
+     return result, bins, bin_assignments + 1  # fix for what happens in discretize_into_bins
+
  
+ def bin_calibrated_accuracy(bins, proxy="upper-edge"):
+     assert proxy in ["center", "upper-edge"], f"Unsupported proxy {proxy}"
+
+     if proxy == "upper-edge":
+         return bins[1:]
+
+     if proxy == "center":
+         return bins[:-1] + np.diff(bins) / 2
+
+
+ def CE_estimate(y_correct, P, bins=None, p=1, proxy="upper-edge"):
      """
      y_correct: binary (N x 1)
      P: normalized (N x 1) either max or per class
  
+     Summary: weighted average over the accuracy/confidence difference of discrete bins of prediction probability
      """
  
+     n_bins = len(bins) - 1
+     bin_range = [min(bins), max(bins)]
  
+     # average bin probability, e.g. 55 for bin 50-60 (mean per bin); or right/upper bin edges
+     calibrated_acc = bin_calibrated_accuracy(bins, proxy=proxy)
  
      empirical_acc, bin_edges, bin_assignment = manual_binned_statistic(P, y_correct, bins)
      bin_numbers, weights_ece = np.unique(bin_assignment, return_counts=True)
      anindices = bin_numbers - 1  # reduce bin counts; left edge; indexes right BY DEFAULT
  
      # Expected calibration error
+     if p < np.inf:  # L^p-CE
          CE = np.average(
              abs(empirical_acc[anindices] - calibrated_acc[anindices]) ** p,
              weights=weights_ece,  # weighted average 1/binfreq
  
      return CE
  
+
+ def top_1_CE(Y, P, **kwargs):
+     y_correct = (Y == np.argmax(P, -1)).astype(int)  # create condition y = ŷ ∈ [K]
+     p_max = np.max(P, -1)  # top-1 softmax probability ∈ [0, 1]
+     bins = create_bins(
+         n_bins=kwargs["n_bins"], bin_range=kwargs["bin_range"], scheme=kwargs["scheme"], P=p_max
+     )
+     return CE_estimate(y_correct, p_max, bins=bins, proxy=kwargs["proxy"])
  
  
  @evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
 
      4. apply L^p norm distance and weights
      """
  
+     # have to add to initialization here?
+     # create bins using the params
+     # create proxy
+
+     def __init__(self, n_bins=10, bin_range=None, scheme="equal-range", proxy="upper-edge", p=1):
+         super().__init__()
+
+         self.bin_range = bin_range
+         self.n_bins = n_bins
+         self.scheme = scheme
+         self.proxy = proxy
+         self.p = p
 
      def _info(self):
          # TODO: Specifies the evaluate.EvaluationModuleInfo object
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            # This defines the format of each prediction and reference
+           features=datasets.Features(
+               {
+                   "predictions": datasets.Value("float32"),
+                   "references": datasets.Value("int64"),
+               }
+           ),
            # Homepage of the module for documentation
+           homepage="https://huggingface.co/spaces/jordyvl/ece",
            # Additional links to the codebase or references
            codebase_urls=["http://github.com/path/to/codebase/of/new_module"],
+           reference_urls=["http://path.to.reference.url/new_module"],
        )
  
      def _download_and_prepare(self, dl_manager):
  
      def _compute(self, predictions, references):
          """Returns the scores"""
+
+         ECE = top_1_CE(references, predictions, n_bins=self.n_bins, bin_range=self.bin_range, scheme=self.scheme, proxy=self.proxy)
          return {
              "ECE": ECE,
          }
  
  def test_ECE():
+     N = 10  # N evaluation instances {(x_i, y_i)}_{i=1}^N
+     K = 5  # K class problem
+
+     def random_mc_instance(concentration=1, onehot=False):
+         reference = np.argmax(
+             np.random.dirichlet(([concentration for _ in range(K)])), -1
+         )  # class targets
+         prediction = np.random.dirichlet(([concentration for _ in range(K)]))  # probabilities
+         if onehot:
+             reference = np.eye(K)[np.argmax(reference, -1)]
          return reference, prediction
  
      references, predictions = list(zip(*[random_mc_instance() for i in range(N)]))
      res = ECE()._compute(predictions, references)
      print(f"ECE: {res['ECE']}")
  
+
+ if __name__ == "__main__":
+     test_ECE()
+
+
+ # if scheme == "equal-mass":
+ #     raise AssertionError("Need to calculate based on P")  # so cannot instantiate yet
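
To illustrate the two binning schemes sketched in `create_bins` above, here is a small standalone comparison; the toy probabilities and variable names are illustrative only:

```python
import numpy as np

P = np.array([0.55, 0.6, 0.62, 0.7, 0.9, 0.95])
n_bins = 3

# equal-range: n_bins + 1 edges evenly spaced over [min(P), max(P)]
equal_range = np.linspace(P.min(), P.max(), n_bins + 1)

# equal-mass: upper edges taken from groups of (approximately) equal size,
# with +inf as the final right edge, mirroring create_bins(scheme="equal-mass")
groups = np.array_split(np.sort(P), n_bins)
equal_mass = np.array([max(g) for g in groups[:-1]] + [np.inf])

print(equal_range)  # approximately [0.55, 0.683, 0.817, 0.95]
print(equal_mass)   # [0.6, 0.7, inf]
```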