
	---
	title: ECE
	datasets:
	-
	tags:
	- evaluate
	- metric
	description: binned estimator of expected calibration error
	sdk: gradio
	sdk_version: 3.0.2
	app_file: app.py
	pinned: false
	---

	# Metric Card for ECE


	## Metric Description
	<!---
	Give a brief overview of this metric, including what task(s) it is usually used for, if any.
	-->
	Expected Calibration Error (`ECE`) is a standard metric for evaluating the miscalibration of top-1 predictions.
	It measures the $L^p$-norm difference between a model’s posterior probability for its top-1 prediction and the true likelihood of that prediction being correct:
	$$ \mathrm{ECE}_p(f)^p = \mathbb{E}_{(X,Y)} \left[ \left| \mathbb{P}(Y = \hat{y} \mid \hat{p}) - \hat{p} \right|^p \right], $$
	where $\hat{y} = \operatorname{arg\,max}_{y'} [f(X)]_{y'}$ is the class prediction and $\hat{p} = \max_{y'} [f(X)]_{y'}$ is its associated posterior probability.

	It is generally implemented as a binned estimator that discretizes predicted probabilities into a range of possible values (bins) for which conditional expectation can be estimated.
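As a rough illustration of the binned estimator described above, the following dependency-free sketch groups top-1 confidences into equal-width bins and sums the weighted gaps between per-bin accuracy and per-bin mean confidence. The function name `binned_ece`, its parameters (`n_bins`, `p`), and the toy data are assumptions made for this example; they are not taken from this module's implementation.

```python
def binned_ece(confidences, correct, n_bins=10, p=1):
    """Equal-width binned estimate of top-1 ECE_p (illustrative sketch only).

    confidences: top-1 probabilities, each in [0, 1]
    correct:     1 if the top-1 prediction was right, else 0
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Per-bin accumulators: sample count, sum of confidences, number of hits.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0.0] * n_bins

    for c, h in zip(confidences, correct):
        # Equal-width bin index; a confidence of exactly 1.0 goes to the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += c
        hit_sums[b] += h

    total = 0.0
    for count, conf_sum, hit_sum in zip(counts, conf_sums, hit_sums):
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(hit_sum / count - conf_sum / count)  # |accuracy - confidence|
        total += (count / n) * gap ** p
    return total ** (1.0 / p)


# Toy data, made up for illustration: four predictions, three of them correct.
toy_confidences = [0.95, 0.9, 0.6, 0.55]
toy_correct = [1, 1, 1, 0]
toy_ece = binned_ece(toy_confidences, toy_correct, n_bins=10)
```

With ten bins, the toy data fall into three non-empty bins and the estimate comes out at roughly 0.275. Other binning schemes (for example, equal-mass bins) would follow the same pattern with a different rule for assigning the bin index.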

	Lower values indicate a better-calibrated model.
	Because the binned estimate depends on its settings (such as the number of bins), compare models only when the metric is computed with the same keyword arguments.
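To see why the settings need to match, the sketch below evaluates the same toy predictions with three different bin counts. The helper `ece_equal_width` is hypothetical and written only for this demonstration (equal-width bins, $p = 1$); it is not the module's API, and the data are made up.

```python
def ece_equal_width(confidences, correct, n_bins):
    """Compact equal-width binned ECE with p=1 (hypothetical helper for this demo)."""
    bins = {}  # bin index -> (sum of confidences, number of correct predictions)
    for c, h in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum, hits = bins.get(b, (0.0, 0))
        bins[b] = (conf_sum + c, hits + h)
    n = len(confidences)
    # (count / n) * |hits / count - conf_sum / count| simplifies to |hits - conf_sum| / n.
    return sum(abs(hits - conf_sum) / n for conf_sum, hits in bins.values())


# Same made-up predictions, evaluated under different bin counts.
confidences = [0.95, 0.9, 0.6, 0.55]
correct = [1, 1, 1, 0]
results = {n_bins: ece_equal_width(confidences, correct, n_bins) for n_bins in (1, 2, 10)}

for n_bins, value in results.items():
    print(f"n_bins={n_bins}: ECE ~ {value:.4f}")
```

On this toy data the estimate is essentially 0 with one or two bins (mean confidence 0.75 matches accuracy 0.75) but about 0.275 with ten bins, so numbers computed under different settings are not comparable.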


	## How to Use
	<!---
	Give general statement of how to use the metric
	Provide simplest possible example for using the metric
	-->




	### Inputs
	<!---
	List all input arguments in the format below
	- input_field (type): Definition of input, with explanation if necessary. State any default value(s).
	-->

	### Output Values
	<!---
	Explain what this metric outputs and provide an example of what the metric output looks like. Modules should return a dictionary with one or multiple key-value pairs, e.g. {"bleu" : 6.02}

	State the range of possible values that the metric's output can take, as well as what in that range is considered good. For example: "This metric can take on any value between 0 and 100, inclusive. Higher scores are better."

	#### Values from Popular Papers
	Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.
	-->


	### Examples
	<!---
	Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.
	-->


	## Limitations and Bias
	<!---
	Note any known limitations or biases that the metric has, with links and references if possible.
	-->
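	The binned estimator is sensitive to the number of bins and the binning scheme, and it is generally a biased estimate of the true calibration error. See [3], [4], and [5] for details.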

	## Citation
	[1] Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
	[2] Guo, C., Pleiss, G., Sun, Y. and Weinberger, K.Q., 2017, July. On calibration of modern neural networks. In International Conference on Machine Learning (pp. 1321-1330). PMLR.
	[3] Nixon, J., Dusenberry, M.W., Zhang, L., Jerfel, G. and Tran, D., 2019, June. Measuring calibration in deep learning. In CVPR Workshops (Vol. 2, No. 7).
	[4] Kumar, A., Liang, P.S. and Ma, T., 2019. Verified uncertainty calibration. Advances in Neural Information Processing Systems, 32.
	[5] Vaicenavicius, J., Widmann, D., Andersson, C., Lindsten, F., Roll, J. and Schön, T., 2019, April. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 3459-3467). PMLR.
	[6] Allen-Zhu, Z., Li, Y. and Liang, Y., 2019. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32.
