Space: evaluate-metric/competition_math

lvwerra HF staff committed on May 20, 2022

Commit e347d8a (1 parent: f9591cd)

Update Space (evaluate main: 828c6327)

Files changed (4)

README.md +110 -5
app.py +6 -0
competition_math.py +95 -0
requirements.txt +4 -0

README.md CHANGED

@@ -1,12 +1,117 @@
 ---
-title: Competition_math
-emoji: 👀
-colorFrom: red
-colorTo: pink
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

 ---
+title: Competition MATH
+emoji: 🤗
+colorFrom: blue
+colorTo: red
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
 ---
+# Metric Card for Competition MATH
+## Metric description
+This metric is used to assess performance on the [Mathematics Aptitude Test of Heuristics (MATH) dataset](https://huggingface.co/datasets/competition_math).
+It first canonicalizes the inputs (e.g., converting `1/2` to `\frac{1}{2}`) and then computes accuracy.
+## How to use
+This metric takes two arguments:
+- `predictions`: a list of predictions to score. Each prediction is a string that contains natural language and LaTeX.
+- `references`: a list of references, one for each prediction. Each reference is a string that contains natural language and LaTeX.
+```python
+>>> from evaluate import load
+>>> math = load("competition_math")
+>>> references = ["\\frac{1}{2}"]
+>>> predictions = ["1/2"]
+>>> results = math.compute(references=references, predictions=predictions)
+```
+N.B. To be able to use Competition MATH, you need to install the `math_equivalence` dependency using `pip install git+https://github.com/hendrycks/math.git`.
+## Output values
+This metric returns a dictionary that contains the [accuracy](https://huggingface.co/metrics/accuracy) after canonicalizing inputs, on a scale between 0.0 and 1.0.
+### Values from popular papers
+The [original MATH dataset paper](https://arxiv.org/abs/2103.03874) reported accuracies ranging from 3.0% to 6.9% across different large language models.
+More recent progress on the dataset can be found on the [dataset leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math).
+## Examples
+Maximal values (full match):
+```python
+>>> from evaluate import load
+>>> math = load("competition_math")
+>>> references = ["\\frac{1}{2}"]
+>>> predictions = ["1/2"]
+>>> results = math.compute(references=references, predictions=predictions)
+>>> print(results)
+{'accuracy': 1.0}
+```
+Minimal values (no match):
+```python
+>>> from evaluate import load
+>>> math = load("competition_math")
+>>> references = ["\\frac{1}{2}"]
+>>> predictions = ["3/4"]
+>>> results = math.compute(references=references, predictions=predictions)
+>>> print(results)
+{'accuracy': 0.0}
+```
+Partial match:
+```python
+>>> from evaluate import load
+>>> math = load("competition_math")
+>>> references = ["\\frac{1}{2}","\\frac{3}{4}"]
+>>> predictions = ["1/5", "3/4"]
+>>> results = math.compute(references=references, predictions=predictions)
+>>> print(results)
+{'accuracy': 0.5}
+```
+## Limitations and bias
+This metric is limited to datasets with the same format as the [Mathematics Aptitude Test of Heuristics (MATH) dataset](https://huggingface.co/datasets/competition_math), and is meant to evaluate the performance of large language models at solving mathematical problems.
+N.B. The MATH dataset also assigns a difficulty level to each problem, so disaggregating model performance by difficulty level (similarly to what was done in the [original paper](https://arxiv.org/abs/2103.03874)) can give a better indication of how a given model performs at each difficulty level than overall accuracy does.
+## Citation
+```bibtex
+@article{hendrycksmath2021,
+  title={Measuring Mathematical Problem Solving With the MATH Dataset},
+  author={Dan Hendrycks
+    and Collin Burns
+    and Saurav Kadavath
+    and Akul Arora
+    and Steven Basart
+    and Eric Tang
+    and Dawn Song
+    and Jacob Steinhardt},
+  journal={arXiv preprint arXiv:2103.03874},
+  year={2021}
+}
+```
+## Further References
+- [MATH dataset](https://huggingface.co/datasets/competition_math)
+- [MATH leaderboard](https://paperswithcode.com/sota/math-word-problem-solving-on-math)
+- [MATH paper](https://arxiv.org/abs/2103.03874)
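
Note on the "Limitations and bias" section above: the per-difficulty breakdown it recommends could be sketched roughly as follows. This is a minimal illustration, not part of the commit. The helper name `accuracy_by_level`, the `levels` argument, and the level labels are assumptions, and the default `is_equiv` is plain string equality so the snippet runs without extra dependencies; in practice `math_equivalence.is_equiv` could be passed instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def accuracy_by_level(
    predictions: List[str],
    references: List[str],
    levels: List[str],
    # Assumption: plain string equality stands in for math_equivalence.is_equiv.
    is_equiv: Callable[[str, str], bool] = lambda a, b: a == b,
) -> Dict[str, float]:
    """Return the accuracy for each difficulty level (hypothetical helper)."""
    if not (len(predictions) == len(references) == len(levels)):
        raise ValueError("predictions, references and levels must have equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for prediction, reference, level in zip(predictions, references, levels):
        total[level] += 1
        correct[level] += int(is_equiv(prediction, reference))
    # Each level in `total` has at least one example, so no division by zero.
    return {level: correct[level] / total[level] for level in total}


# Illustrative data only; the level labels are placeholders.
scores = accuracy_by_level(
    predictions=["\\frac{1}{2}", "3", "7"],
    references=["\\frac{1}{2}", "4", "7"],
    levels=["Level 1", "Level 5", "Level 5"],
)
print(scores)  # {'Level 1': 1.0, 'Level 5': 0.5}
```

Reporting one number per level keeps easy problems from masking poor performance on hard ones, which a single overall accuracy cannot show.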

app.py ADDED

	@@ -0,0 +1,6 @@

+import evaluate
+from evaluate.utils import launch_gradio_widget
+module = evaluate.load("competition_math")
+launch_gradio_widget(module)

competition_math.py ADDED

	@@ -0,0 +1,95 @@

+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Accuracy metric for the Mathematics Aptitude Test of Heuristics (MATH) dataset."""
+import datasets
+import math_equivalence  # From: git+https://github.com/hendrycks/math.git
+import evaluate
+_CITATION = """\
+@article{hendrycksmath2021,
+  title={Measuring Mathematical Problem Solving With the MATH Dataset},
+  author={Dan Hendrycks
+    and Collin Burns
+    and Saurav Kadavath
+    and Akul Arora
+    and Steven Basart
+    and Eric Tang
+    and Dawn Song
+    and Jacob Steinhardt},
+  journal={arXiv preprint arXiv:2103.03874},
+  year={2021}
+}
+"""
+_DESCRIPTION = """\
+This metric is used to assess performance on the Mathematics Aptitude Test of Heuristics (MATH) dataset.
+It first canonicalizes the inputs (e.g., converting "1/2" to "\\frac{1}{2}") and then computes accuracy.
+"""
+_KWARGS_DESCRIPTION = r"""
+Calculates accuracy after canonicalizing inputs.
+Args:
+    predictions: list of predictions to score. Each prediction
+        is a string that contains natural language and LaTeX.
+    references: list of references, one for each prediction. Each
+        reference is a string that contains natural language
+        and LaTeX.
+Returns:
+    accuracy: accuracy after canonicalizing inputs
+        (e.g., converting "1/2" to "\\frac{1}{2}")
+Examples:
+    >>> metric = evaluate.load("competition_math")
+    >>> results = metric.compute(references=["\\frac{1}{2}"], predictions=["1/2"])
+    >>> print(results)
+    {'accuracy': 1.0}
+"""
+@datasets.utils.file_utils.add_end_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class CompetitionMathMetric(evaluate.EvaluationModule):
+    """Accuracy metric for the MATH dataset."""
+    def _info(self):
+        return evaluate.EvaluationModuleInfo(
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Value("string"),
+                    "references": datasets.Value("string"),
+                }
+            ),
+            # Homepage of the metric for documentation
+            homepage="https://github.com/hendrycks/math",
+            # Additional links to the codebase or references
+            codebase_urls=["https://github.com/hendrycks/math"],
+        )
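+    def _compute(self, predictions, references):
+        """Returns the fraction of predictions equivalent to their references."""
+        n_correct = 0.0
+        # Count the pairs that math_equivalence judges equivalent.
+        for prediction, reference in zip(predictions, references):
+            n_correct += 1.0 if math_equivalence.is_equiv(prediction, reference) else 0.0
+        accuracy = n_correct / len(predictions)
+        return {
+            "accuracy": accuracy,
+        }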
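
Note on `_compute` above: the "canonicalize, then compare" idea can be illustrated without installing `math_equivalence`. The sketch below is a toy stand-in, not the logic of `math_equivalence.is_equiv`. The names `toy_canonicalize` and `toy_accuracy` are hypothetical, and the only normalization it performs is stripping spaces and rewriting an integer fraction `a/b` as `\frac{a}{b}`. Unlike `_compute`, it also rejects empty or mismatched inputs rather than dividing by zero.

```python
import re
from typing import List

# Matches a bare integer fraction such as "1/2" or "-3/4".
_SIMPLE_FRACTION = re.compile(r"^(-?\d+)/(\d+)$")


def toy_canonicalize(answer: str) -> str:
    """Toy normalizer (assumption, not the real library logic)."""
    answer = answer.replace(" ", "")
    match = _SIMPLE_FRACTION.match(answer)
    if match:
        return "\\frac{%s}{%s}" % (match.group(1), match.group(2))
    return answer


def toy_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    n_correct = sum(
        toy_canonicalize(prediction) == toy_canonicalize(reference)
        for prediction, reference in zip(predictions, references)
    )
    return n_correct / len(predictions)


print(toy_accuracy(["1/5", "3/4"], ["\\frac{1}{2}", "\\frac{3}{4}"]))  # 0.5
```

The example reuses the "partial match" inputs from the README, so the result lines up with the `{'accuracy': 0.5}` shown there.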

requirements.txt ADDED

	@@ -0,0 +1,4 @@

+# TODO: fix github to release
+git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+datasets~=2.0
+git+https://github.com/hendrycks/math.git
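
Note on the requirements above: since `math_equivalence` is installed from a Git URL rather than a release, it may be worth checking that the modules are importable before loading the metric. A minimal sketch using only the standard library follows; the helper name `check_modules` is hypothetical, and the module names are taken from the imports in `competition_math.py`.

```python
import importlib.util
from typing import Dict, Iterable


def check_modules(names: Iterable[str]) -> Dict[str, bool]:
    """Map each top-level module name to whether Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Module names as imported by competition_math.py.
status = check_modules(["evaluate", "datasets", "math_equivalence"])
missing = [name for name, found in status.items() if not found]
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

This only confirms that the modules can be found; it does not verify that the installed versions match the pins in `requirements.txt`.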