Update Space (evaluate main: 2253a6e1)
- README.md +75 -5
- app.py +6 -0
- nist_mt.py +132 -0
- requirements.txt +2 -0
- tests.py +34 -0
README.md
CHANGED
@@ -1,12 +1,82 @@
 ---
-title:
+title: NIST_MT
-emoji:
+emoji: 🤗
-colorFrom:
+colorFrom: purple
 colorTo: red
 sdk: gradio
-sdk_version: 3.
+sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
+- machine-translation
+description:
+  DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU score.
 ---
 
-
+# Metric Card for NIST's MT metric
+
+
+## Metric Description
+DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU
+score. The official script used by NIST to compute the BLEU and NIST scores is
+mteval-14.pl. The main differences are:
+
+- BLEU uses the geometric mean of the n-gram precisions; NIST uses the arithmetic mean.
+- NIST has a different brevity penalty.
+- The NIST score from mteval-14.pl has a self-contained tokenizer (in the Hugging Face implementation we rely on NLTK's
+implementation of the NIST-specific tokenizer).
+
+## Intended Uses
+NIST was developed for machine translation evaluation.
+
+## How to Use
+
+```python
+import evaluate
+nist_mt = evaluate.load("nist_mt")
+hypothesis1 = "It is a guide to action which ensures that the military always obeys the commands of the party"
+reference1 = "It is a guide to action that ensures that the military will forever heed Party commands"
+reference2 = "It is the guiding principle which guarantees the military forces always being under the command of the Party"
+reference3 = "It is the practical guide for the army always to heed the directions of the party"
+nist_mt.compute(predictions=[hypothesis1], references=[[reference1, reference2, reference3]])
+# {'nist_mt': 3.3709935957649324}
+```
+
+### Inputs
+- **predictions** (`list` of `str`): predictions to score. Raw sentences can be passed directly; tokenization is
+handled internally by NLTK's NIST-specific tokenizer (see: https://github.com/nltk/nltk/blob/90fa546ea600194f2799ee51eaf1b729c128711e/nltk/tokenize/nist.py#L139)
+- **references** (`list` of `str` or `list` of `list` of `str`): potentially multiple references for each prediction
+- **n** (`int`, defaults to `5`): highest n-gram order
+- **lowercase** (`bool`, defaults to `False`): whether to lowercase the data (only applicable if `western_lang` is `True`)
+- **western_lang** (`bool`, defaults to `True`): whether the current language is a Western language, which enables some
+specific tokenization rules with respect to, e.g., punctuation
+
+### Output Values
+- **nist_mt** (`float`): NIST score
+
+Output Example:
+```python
+{'nist_mt': 3.3709935957649324}
+```
+
+
+## Citation
+```bibtex
+@inproceedings{10.5555/1289189.1289273,
+author = {Doddington, George},
+title = {Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics},
+year = {2002},
+publisher = {Morgan Kaufmann Publishers Inc.},
+address = {San Francisco, CA, USA},
+booktitle = {Proceedings of the Second International Conference on Human Language Technology Research},
+pages = {138–145},
+numpages = {8},
+location = {San Diego, California},
+series = {HLT '02}
+}
+```
+
+## Further References
+
+This Hugging Face implementation uses [the NLTK implementation](https://github.com/nltk/nltk/blob/develop/nltk/translate/nist_score.py).
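
For corpus-level scoring, `references` needs one more level of nesting than `predictions`: one list of candidate references per prediction. A minimal sketch of those shapes (the second sentence pair is illustrative only, not part of the metric card):

```python
import evaluate

nist_mt = evaluate.load("nist_mt")

# One prediction string per segment in the corpus.
predictions = [
    "It is a guide to action which ensures that the military always obeys the commands of the party",
    "he read the book because he was interested in world history",
]
# One list of reference strings per prediction; the lists may have different lengths.
references = [
    [
        "It is a guide to action that ensures that the military will forever heed Party commands",
        "It is the practical guide for the army always to heed the directions of the party",
    ],
    ["he was interested in world history because he read the book"],
]
results = nist_mt.compute(predictions=predictions, references=references)
print(results["nist_mt"])  # a single corpus-level float
```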
app.py
ADDED
@@ -0,0 +1,6 @@
+import evaluate
+from evaluate.utils import launch_gradio_widget
+
+
+module = evaluate.load("nist_mt")
+launch_gradio_widget(module)
nist_mt.py
ADDED
@@ -0,0 +1,132 @@
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""NLTK's NIST implementation on both the sentence and corpus level"""
+from typing import Dict, Optional
+
+import datasets
+import nltk
+from datasets import Sequence, Value
+
+
+try:
+    nltk.data.find("perluniprops")
+except LookupError:
+    nltk.download("perluniprops", quiet=True)  # NISTTokenizer requirement
+
+from nltk.tokenize.nist import NISTTokenizer
+from nltk.translate.nist_score import corpus_nist, sentence_nist
+
+import evaluate
+
+
+_CITATION = """\
+@inproceedings{10.5555/1289189.1289273,
+author = {Doddington, George},
+title = {Automatic Evaluation of Machine Translation Quality Using N-Gram Co-Occurrence Statistics},
+year = {2002},
+publisher = {Morgan Kaufmann Publishers Inc.},
+address = {San Francisco, CA, USA},
+booktitle = {Proceedings of the Second International Conference on Human Language Technology Research},
+pages = {138–145},
+numpages = {8},
+location = {San Diego, California},
+series = {HLT '02}
+}
+"""
+
+_DESCRIPTION = """\
+DARPA commissioned NIST to develop an MT evaluation facility based on the BLEU
+score. The official script used by NIST to compute BLEU and NIST score is
+mteval-14.pl. The main differences are:
+
+- BLEU uses geometric mean of the ngram precisions, NIST uses arithmetic mean.
+- NIST has a different brevity penalty
+- NIST score from mteval-14.pl has a self-contained tokenizer (in the Hugging Face implementation we rely on NLTK's
+implementation of the NIST-specific tokenizer)
+"""
+
+
+_KWARGS_DESCRIPTION = """
+Computes NIST score of translated segments against one or more references.
+Args:
+    predictions: predictions to score (list of str)
+    references: potentially multiple references for each prediction (list of str or list of list of str)
+    n: highest n-gram order
+    lowercase: whether to lowercase the data (only applicable if 'western_lang' is True)
+    western_lang: whether the current language is a Western language, which will enable some specific tokenization
+        rules with respect to, e.g., punctuation
+
+Returns:
+    'nist_mt': nist_mt score
+Examples:
+    >>> nist_mt = evaluate.load("nist_mt")
+    >>> hypothesis = "It is a guide to action which ensures that the military always obeys the commands of the party"
+    >>> reference1 = "It is a guide to action that ensures that the military will forever heed Party commands"
+    >>> reference2 = "It is the guiding principle which guarantees the military forces always being under the command of the Party"
+    >>> reference3 = "It is the practical guide for the army always to heed the directions of the party"
+    >>> nist_mt.compute(predictions=[hypothesis], references=[[reference1, reference2, reference3]])
+    {'nist_mt': 3.3709935957649324}
+    >>> nist_mt.compute(predictions=[hypothesis], references=[reference1])
+    {'nist_mt': 2.4477124183006533}
+"""
+
+
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class NistMt(evaluate.Metric):
+    """A wrapper around NLTK's NIST implementation."""
+
+    def _info(self):
+        return evaluate.MetricInfo(
+            module_type="metric",
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=[
+                datasets.Features(
+                    {
+                        "predictions": Value("string", id="prediction"),
+                        "references": Sequence(Value("string", id="reference"), id="references"),
+                    }
+                ),
+                datasets.Features(
+                    {
+                        "predictions": Value("string", id="prediction"),
+                        "references": Value("string", id="reference"),
+                    }
+                ),
+            ],
+            homepage="https://www.nltk.org/api/nltk.translate.nist_score.html",
+            codebase_urls=["https://github.com/nltk/nltk/blob/develop/nltk/translate/nist_score.py"],
+            reference_urls=["https://en.wikipedia.org/wiki/NIST_(metric)"],
+        )
+
+    def _compute(self, predictions, references, n: int = 5, lowercase=False, western_lang=True):
+        tokenizer = NISTTokenizer()
+
+        # Account for single reference cases: references always need to have one more dimension than predictions
+        if isinstance(references[0], str):
+            references = [[ref] for ref in references]
+
+        predictions = [
+            tokenizer.tokenize(pred, return_str=False, lowercase=lowercase, western_lang=western_lang)
+            for pred in predictions
+        ]
+        references = [
+            [
+                tokenizer.tokenize(ref, return_str=False, lowercase=lowercase, western_lang=western_lang)
+                for ref in ref_sentences
+            ]
+            for ref_sentences in references
+        ]
+        return {"nist_mt": corpus_nist(list_of_references=references, hypotheses=predictions, n=n)}
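
`_compute` above is a thin shim over NLTK: it tokenizes every string with `NISTTokenizer`, normalizes single references to the nested shape, and delegates to `corpus_nist`. A standalone sketch of that same path, assuming `nltk` is installed:

```python
import nltk
from nltk.tokenize.nist import NISTTokenizer
from nltk.translate.nist_score import corpus_nist

nltk.download("perluniprops", quiet=True)  # resource required by NISTTokenizer

tokenizer = NISTTokenizer()
hypothesis = "It is a guide to action which ensures that the military always obeys the commands of the party"
reference = "It is a guide to action that ensures that the military will forever heed Party commands"

# Tokenize exactly as _compute does (Western-language rules, no lowercasing).
hyp_tokens = tokenizer.tokenize(hypothesis, return_str=False, lowercase=False, western_lang=True)
ref_tokens = tokenizer.tokenize(reference, return_str=False, lowercase=False, western_lang=True)

# list_of_references carries one extra nesting level: a list of references per hypothesis.
print(corpus_nist(list_of_references=[[ref_tokens]], hypotheses=[hyp_tokens], n=5))
```

This should agree with `nist_mt.compute(predictions=[hypothesis], references=[reference])` from the docstring above (2.4477124183006533).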
requirements.txt
ADDED
@@ -0,0 +1,2 @@
+git+https://github.com/huggingface/evaluate@2253a6e12a4b4c2c05ef77b84ea6c0f1188ac926
+nltk
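
Note that the `evaluate` dependency is pinned to the same commit (`2253a6e1…`) referenced in the Space header, so a local `pip install -r requirements.txt` reproduces the environment the Space runs with.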
tests.py
ADDED
@@ -0,0 +1,34 @@
+from pytest import fixture
+
+from nist_mt import NistMt
+
+
+nist = NistMt()
+
+
+@fixture
+def hypothesis_sent():
+    return "It is a guide to action which ensures that the military always obeys the commands of the party"
+
+
+@fixture
+def reference_sent1():
+    return "It is a guide to action that ensures that the military will forever heed Party commands"
+
+
+@fixture
+def reference_sent2():
+    return (
+        "It is the guiding principle which guarantees the military forces always being under the command of the Party"
+    )
+
+
+@fixture
+def reference_sent3():
+    return "It is the practical guide for the army always to heed the directions of the party"
+
+
+def test_nist_sentence(hypothesis_sent, reference_sent1, reference_sent2, reference_sent3):
+    nist_score = nist.compute(
+        predictions=[hypothesis_sent], references=[[reference_sent1, reference_sent2, reference_sent3]]
+    )
+    assert abs(nist_score["nist_mt"] - 3.3709935957649324) < 1e-6
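
The test mirrors the three-reference doctest from `nist_mt.py` and can be run from the Space root with `python -m pytest tests.py`; the `1e-6` tolerance guards against floating-point differences across platforms.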