Spaces: evaluate-metric/google_bleu

lvwerra HF Staff committed on May 20, 2022

Commit

d807f7c

1 Parent(s): 617842a

Update Space (evaluate main: 828c6327)

Files changed (5)

README.md +129 -4
app.py +6 -0
google_bleu.py +156 -0
requirements.txt +4 -0
tokenizer_13a.py +100 -0

README.md CHANGED

@@ -1,12 +1,137 @@
 ---
-title: Google_bleu
-emoji: 📈
-colorFrom: red
 colorTo: red
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference

 ---
+title: Google BLEU
+emoji: 🤗
+colorFrom: blue
 colorTo: red
 sdk: gradio
 sdk_version: 3.0.2
 app_file: app.py
 pinned: false
+tags:
+- evaluate
+- metric
 ---
+# Metric Card for Google BLEU
+## Metric Description
+The BLEU score has some undesirable properties when used for single sentences, as it was designed to be a corpus measure. The Google BLEU score is designed to limit these undesirable properties when used for single sentences.
+To calculate this score, all sub-sequences of 1, 2, 3 or 4 tokens (n-grams) in the output and target sequences are recorded. The precision and recall, described below, are then computed.
+- **precision:** the ratio of the number of matching n-grams to the number of total n-grams in the generated output sequence
+- **recall:** the ratio of the number of matching n-grams to the number of total n-grams in the target (ground truth) sequence
+The minimum value of precision and recall is then returned as the score.
+## Intended Uses
+This metric is generally used to evaluate machine translation models. It is especially used when scores of individual (prediction, reference) sentence pairs are needed, as opposed to when averaging over the (prediction, reference) scores for a whole corpus. That being said, it can also be used when averaging over the scores for a whole corpus.
+Because it performs better on individual sentence pairs as compared to BLEU, Google BLEU has also been used in RL experiments.
+## How to Use
+This metric takes a list of predicted sentences, as well as a list of references.
+```python
+import evaluate
+sentence1 = "the cat sat on the mat"
+sentence2 = "the cat ate the mat"
+google_bleu = evaluate.load("google_bleu")
+result = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
+print(result)
+>>> {'google_bleu': 0.3333333333333333}
+```
+### Inputs
+- **predictions** (list of str): list of translations to score.
+- **references** (list of list of str): list of lists of references for each translation.
+- **tokenizer** : approach used for tokenizing `predictions` and `references`.
+The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT. This can be replaced by any function that takes a string as input and returns a list of tokens as output.
+- **min_len** (int): The minimum order of n-gram this function should extract. Defaults to 1.
+- **max_len** (int): The maximum order of n-gram this function should extract. Defaults to 4.
+### Output Values
+This metric returns the following in a dict:
+- **google_bleu** (float): google_bleu score
+The output format is as follows:
+```python
+{'google_bleu': google_bleu score}
+```
+This metric can take on values from 0 to 1, inclusive. Higher scores are better, with 0 indicating no matches, and 1 indicating a perfect match.
+Note that this score is symmetrical when switching output and target. This means that, given two sentences, `sentence1` and `sentence2`, the score is the same whether `sentence1` is the predicted sentence and `sentence2` is the reference sentence, or the sentences are swapped so that `sentence2` is the predicted sentence and `sentence1` is the reference sentence. In code, this looks like:
+```python
+import evaluate
+sentence1 = "the cat sat on the mat"
+sentence2 = "the cat ate the mat"
+google_bleu = evaluate.load("google_bleu")
+# Score each sentence against the other, then swap the roles.
+result_a = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
+result_b = google_bleu.compute(predictions=[sentence2], references=[[sentence1]])
+print(result_a == result_b)
+>>> True
+```
+#### Values from Popular Papers
+### Examples
+Example with one reference per sample:
+```python
+>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat'], ['he was interested in world history because he read the book']]
+>>> google_bleu = evaluate.load("google_bleu")
+>>> results = google_bleu.compute(predictions=predictions, references=references)
+>>> print(round(results["google_bleu"], 2))
+0.44
+```
+Example with multiple references for the first sample:
+```python
+>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+>>> references  = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
+>>> google_bleu = evaluate.load("google_bleu")
+>>> results = google_bleu.compute(predictions=predictions, references=references)
+>>> print(round(results["google_bleu"], 2))
+0.61
+```
+Example with multiple references for the first sample, and with `min_len` adjusted to `2`, instead of the default `1`, which means that the function extracts n-grams of length `2` through `4` (the default `max_len`):
+```python
+>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
+>>> google_bleu = evaluate.load("google_bleu")
+>>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2)
+>>> print(round(results["google_bleu"], 2))
+0.53
+```
+Example with multiple references for the first sample, with `min_len` adjusted to `2`, instead of the default `1`, and `max_len` adjusted to `6` instead of the default `4`:
+```python
+>>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', 'he read the book because he was interested in world history']
+>>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', 'It is a guide to action that ensures that the rubber duck will never heed the cat commands', 'It is the practical guide for the rubber duck army never to heed the directions of the cat'], ['he was interested in world history because he read the book']]
+>>> google_bleu = evaluate.load("google_bleu")
+>>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2, max_len=6)
+>>> print(round(results["google_bleu"], 2))
+0.4
+```
+## Limitations and Bias
+The GoogleBLEU metric does not come with a predefined tokenization function; previous versions simply used `split()` to split the input strings into tokens. Using a tokenizer such as the default one, `tokenizer_13a`, makes results more standardized and reproducible. The BLEU and sacreBLEU metrics also use this default tokenizer.
+## Citation
+```bibtex
+@misc{wu2016googles,
+title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
+author={Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto Kazawa and Keith Stevens and George Kurian and Nishant Patil and Wei Wang and Cliff Young and Jason Smith and Jason Riesa and Alex Rudnick and Oriol Vinyals and Greg Corrado and Macduff Hughes and Jeffrey Dean},
+year={2016},
+eprint={1609.08144},
+archivePrefix={arXiv},
+primaryClass={cs.CL}
+}
+```
+## Further References
+- This Hugging Face implementation uses the [nltk.translate.gleu_score implementation](https://www.nltk.org/_modules/nltk/translate/gleu_score.html)
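
Note on the metric card: the precision/recall definition under "Metric Description" can be checked by hand. The sketch below is a minimal pure-Python illustration, not the code added in this commit. `ngram_counts` and `gleu_pair` are hypothetical helper names, and plain whitespace splitting is assumed in place of `tokenizer_13a`.

```python
from collections import Counter


def ngram_counts(tokens, min_len=1, max_len=4):
    """Count every n-gram of length min_len..max_len in a token list."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def gleu_pair(prediction, reference, min_len=1, max_len=4):
    """Return (precision, recall, score) for one prediction/reference pair."""
    # Assumption: whitespace splitting stands in for the default tokenizer.
    pred = ngram_counts(prediction.split(), min_len, max_len)
    ref = ngram_counts(reference.split(), min_len, max_len)
    # Counter intersection keeps the smaller count of each shared n-gram.
    matches = sum((pred & ref).values())
    precision = matches / sum(pred.values()) if pred else 0.0
    recall = matches / sum(ref.values()) if ref else 0.0
    return precision, recall, min(precision, recall)


precision, recall, score = gleu_pair("the cat sat on the mat", "the cat ate the mat")
print(precision, recall, score)
```

For this pair, 6 n-grams match out of 18 in the prediction and 14 in the reference, so the score is 6/18, which agrees with the `0.3333333333333333` shown in the How to Use example.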

app.py ADDED

	@@ -0,0 +1,6 @@

+import evaluate
+from evaluate.utils import launch_gradio_widget
+module = evaluate.load("google_bleu")
+launch_gradio_widget(module)

google_bleu.py ADDED

	@@ -0,0 +1,156 @@

+# Copyright 2020 The HuggingFace Evaluate Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" Google BLEU (aka GLEU) metric. """
+from typing import Dict, List
+import datasets
+from nltk.translate import gleu_score
+import evaluate
+from evaluate import EvaluationModuleInfo
+from .tokenizer_13a import Tokenizer13a
+_CITATION = """\
+@misc{wu2016googles,
+      title={Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation},
+      author={Yonghui Wu and Mike Schuster and Zhifeng Chen and Quoc V. Le and Mohammad Norouzi and Wolfgang Macherey
+              and Maxim Krikun and Yuan Cao and Qin Gao and Klaus Macherey and Jeff Klingner and Apurva Shah and Melvin
+              Johnson and Xiaobing Liu and Łukasz Kaiser and Stephan Gouws and Yoshikiyo Kato and Taku Kudo and Hideto
+              Kazawa and Keith Stevens and George Kurian and Nishant Patil and Wei Wang and Cliff Young and
+              Jason Smith and Jason Riesa and Alex Rudnick and Oriol Vinyals and Greg Corrado and Macduff Hughes
+              and Jeffrey Dean},
+      year={2016},
+      eprint={1609.08144},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
+}
+"""
+_DESCRIPTION = """\
+The BLEU score has some undesirable properties when used for single
+sentences, as it was designed to be a corpus measure. We therefore
+use a slightly different score for our RL experiments which we call
+the 'GLEU score'. For the GLEU score, we record all sub-sequences of
+1, 2, 3 or 4 tokens in output and target sequence (n-grams). We then
+compute a recall, which is the ratio of the number of matching n-grams
+to the number of total n-grams in the target (ground truth) sequence,
+and a precision, which is the ratio of the number of matching n-grams
+to the number of total n-grams in the generated output sequence. Then
+GLEU score is simply the minimum of recall and precision. This GLEU
+score's range is always between 0 (no matches) and 1 (all match) and
+it is symmetrical when switching output and target. According to
+our experiments, GLEU score correlates quite well with the BLEU
+metric on a corpus level but does not have its drawbacks for our per
+sentence reward objective.
+"""
+_KWARGS_DESCRIPTION = """\
+Computes corpus-level Google BLEU (GLEU) score of translated segments against one or more references.
+Instead of averaging the sentence level GLEU scores (i.e. macro-average precision), Wu et al. (2016) sum up the matching
+tokens and the max of hypothesis and reference tokens for each sentence, then compute using the aggregate values.
+Args:
+    predictions (list of str): list of translations to score.
+    references (list of list of str): list of lists of references for each translation.
+    tokenizer : approach used for tokenizing `predictions` and `references`.
+        The default tokenizer is `tokenizer_13a`, a minimal tokenization approach that is equivalent to `mteval-v13a`, used by WMT.
+        This can be replaced by any function that takes a string as input and returns a list of tokens as output.
+    min_len (int): The minimum order of n-gram this function should extract. Defaults to 1.
+    max_len (int): The maximum order of n-gram this function should extract. Defaults to 4.
+Returns:
+    'google_bleu': google_bleu score
+Examples:
+    Example 1:
+        >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+        'he read the book because he was interested in world history']
+        >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat'], \
+        ['he was interested in world history because he read the book']]
+        >>> google_bleu = evaluate.load("google_bleu")
+        >>> results = google_bleu.compute(predictions=predictions, references=references)
+        >>> print(round(results["google_bleu"], 2))
+        0.44
+    Example 2:
+        >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+        'he read the book because he was interested in world history']
+        >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', \
+        'It is a guide to action that ensures that the rubber duck will never heed the cat commands', \
+        'It is the practical guide for the rubber duck army never to heed the directions of the cat'], \
+        ['he was interested in world history because he read the book']]
+        >>> google_bleu = evaluate.load("google_bleu")
+        >>> results = google_bleu.compute(predictions=predictions, references=references)
+        >>> print(round(results["google_bleu"], 2))
+        0.61
+    Example 3:
+        >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+        'he read the book because he was interested in world history']
+        >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', \
+        'It is a guide to action that ensures that the rubber duck will never heed the cat commands', \
+        'It is the practical guide for the rubber duck army never to heed the directions of the cat'], \
+        ['he was interested in world history because he read the book']]
+        >>> google_bleu = evaluate.load("google_bleu")
+        >>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2)
+        >>> print(round(results["google_bleu"], 2))
+        0.53
+    Example 4:
+        >>> predictions = ['It is a guide to action which ensures that the rubber duck always disobeys the commands of the cat', \
+        'he read the book because he was interested in world history']
+        >>> references = [['It is the guiding principle which guarantees the rubber duck forces never being under the command of the cat', \
+        'It is a guide to action that ensures that the rubber duck will never heed the cat commands', \
+        'It is the practical guide for the rubber duck army never to heed the directions of the cat'], \
+        ['he was interested in world history because he read the book']]
+        >>> google_bleu = evaluate.load("google_bleu")
+        >>> results = google_bleu.compute(predictions=predictions, references=references, min_len=2, max_len=6)
+        >>> print(round(results["google_bleu"], 2))
+        0.4
+"""
+@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
+class GoogleBleu(evaluate.EvaluationModule):
+    def _info(self) -> EvaluationModuleInfo:
+        return evaluate.EvaluationModuleInfo(
+            description=_DESCRIPTION,
+            citation=_CITATION,
+            inputs_description=_KWARGS_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "predictions": datasets.Value("string", id="sequence"),
+                    "references": datasets.Sequence(datasets.Value("string", id="sequence"), id="references"),
+                }
+            ),
+        )
+    def _compute(
+        self,
+        predictions: List[str],
+        references: List[List[str]],
+        tokenizer=Tokenizer13a(),
+        min_len: int = 1,
+        max_len: int = 4,
+    ) -> Dict[str, float]:
+        references = [[tokenizer(r) for r in ref] for ref in references]
+        predictions = [tokenizer(p) for p in predictions]
+        return {
+            "google_bleu": gleu_score.corpus_gleu(
+                list_of_references=references, hypotheses=predictions, min_len=min_len, max_len=max_len
+            )
+        }
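
Note on `_KWARGS_DESCRIPTION`: the docstring says the corpus score is computed from aggregated counts rather than by averaging per-sentence scores. The sketch below illustrates that difference in pure Python. It is an approximation written for this note, not the NLTK source: `corpus_gleu_sketch` and `ngram_counts` are hypothetical names, inputs are assumed to be pre-tokenized lists, and picking the reference with the best match ratio per hypothesis is an assumption about how multiple references are handled.

```python
from collections import Counter


def ngram_counts(tokens, min_len=1, max_len=4):
    """Count every n-gram of length min_len..max_len in a token list."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def corpus_gleu_sketch(list_of_references, hypotheses, min_len=1, max_len=4):
    """Sum matches and max(hypothesis, reference) n-gram totals, then divide once."""
    total_matches = 0
    total_all = 0
    for references, hypothesis in zip(list_of_references, hypotheses):
        hyp = ngram_counts(hypothesis, min_len, max_len)
        candidates = []
        for reference in references:
            ref = ngram_counts(reference, min_len, max_len)
            matches = sum((hyp & ref).values())
            n_all = max(sum(hyp.values()), sum(ref.values()))
            if n_all > 0:
                candidates.append((matches, n_all))
        if candidates:
            # Assumption: keep the reference with the highest match ratio.
            matches, n_all = max(candidates, key=lambda c: c[0] / c[1])
            total_matches += matches
            total_all += n_all
    return total_matches / total_all if total_all else 0.0


hypotheses = ["the cat sat on the mat".split(), "a b".split()]
list_of_references = [["the cat ate the mat".split()], ["a b".split()]]

# Aggregate counts over the corpus, then divide once.
corpus_score = corpus_gleu_sketch(list_of_references, hypotheses)
# Score each sentence on its own, then average the scores.
macro_score = sum(
    corpus_gleu_sketch([refs], [hyp])
    for refs, hyp in zip(list_of_references, hypotheses)
) / len(hypotheses)
print(corpus_score, macro_score)
```

Here the aggregate score is (6 + 3) / (18 + 3) = 9/21, while the average of the two sentence scores is (1/3 + 1) / 2 = 2/3, so short sentences carry less weight in the aggregate form.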

requirements.txt ADDED

	@@ -0,0 +1,4 @@

+# TODO: fix github to release
+git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
+datasets~=2.0
+nltk

tokenizer_13a.py ADDED

	@@ -0,0 +1,100 @@

+# Source: https://github.com/mjpost/sacrebleu/blob/master/sacrebleu/tokenizers/tokenizer_13a.py
+# Copyright 2020 SacreBLEU Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import re
+from functools import lru_cache
+class BaseTokenizer:
+    """A base dummy tokenizer to derive from."""
+    def signature(self):
+        """
+        Returns a signature for the tokenizer.
+        :return: signature string
+        """
+        return "none"
+    def __call__(self, line):
+        """
+        Tokenizes an input line with the tokenizer.
+        :param line: a segment to tokenize
+        :return: the tokenized line
+        """
+        return line
+class TokenizerRegexp(BaseTokenizer):
+    def signature(self):
+        return "re"
+    def __init__(self):
+        self._re = [
+            # language-dependent part (assuming Western languages)
+            (re.compile(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])"), r" \1 "),
+            # tokenize period and comma unless preceded by a digit
+            (re.compile(r"([^0-9])([\.,])"), r"\1 \2 "),
+            # tokenize period and comma unless followed by a digit
+            (re.compile(r"([\.,])([^0-9])"), r" \1 \2"),
+            # tokenize dash when preceded by a digit
+            (re.compile(r"([0-9])(-)"), r"\1 \2 "),
+            # one space only between words
+            # NOTE: Doing this in Python (below) is faster
+            # (re.compile(r'\s+'), r' '),
+        ]
+    @lru_cache(maxsize=2**16)
+    def __call__(self, line):
+        """Common post-processing tokenizer for `13a` and `zh` tokenizers.
+        :param line: a segment to tokenize
+        :return: the tokenized line
+        """
+        for (_re, repl) in self._re:
+            line = _re.sub(repl, line)
+        # no leading or trailing spaces, single space within words
+        # return ' '.join(line.split())
+        # This line is changed with regards to the original tokenizer (seen above) to return individual words
+        return line.split()
+class Tokenizer13a(BaseTokenizer):
+    def signature(self):
+        return "13a"
+    def __init__(self):
+        self._post_tokenizer = TokenizerRegexp()
+    @lru_cache(maxsize=2**16)
+    def __call__(self, line):
+        """Tokenizes an input line using a relatively minimal tokenization
+        that is however equivalent to mteval-v13a, used by WMT.
+        :param line: a segment to tokenize
+        :return: the tokenized line
+        """
+        # language-independent part:
+        line = line.replace("<skipped>", "")
+        line = line.replace("-\n", "")
+        line = line.replace("\n", " ")
+        if "&" in line:
+            line = line.replace("&quot;", '"')
+            line = line.replace("&amp;", "&")
+            line = line.replace("&lt;", "<")
+            line = line.replace("&gt;", ">")
+        return self._post_tokenizer(f" {line} ")
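
Note on the tokenizer rules: the effect of the four substitutions in `TokenizerRegexp` is easier to see on a concrete string. The sketch below copies the regular expressions from the diff above into a standalone function. `tokenize_13a_sketch` is a hypothetical name, and the language-independent cleanup in `Tokenizer13a.__call__` (entity unescaping, `<skipped>` removal) and the `lru_cache` wrappers are left out.

```python
import re

# Substitution rules copied from TokenizerRegexp in tokenizer_13a.py.
RULES = [
    # separate most ASCII punctuation and symbols from words
    (re.compile(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])"), r" \1 "),
    # tokenize period and comma unless preceded by a digit
    (re.compile(r"([^0-9])([\.,])"), r"\1 \2 "),
    # tokenize period and comma unless followed by a digit
    (re.compile(r"([\.,])([^0-9])"), r" \1 \2"),
    # tokenize dash when preceded by a digit
    (re.compile(r"([0-9])(-)"), r"\1 \2 "),
]


def tokenize_13a_sketch(line):
    """Apply the rules in order and return a list of tokens."""
    # Pad with spaces, as Tokenizer13a does before calling the regexp tokenizer.
    line = f" {line} "
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line.split()


print(tokenize_13a_sketch("Hello, world! It costs 3.50."))
# ['Hello', ',', 'world', '!', 'It', 'costs', '3.50', '.']
```

The decimal point in `3.50` stays attached because it has digits on both sides, while the sentence-final period is split off. Any function with the same shape (string in, list of tokens out) fits the `tokenizer` argument described in the README.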