allow references to be simple list
- README.md +35 -12
- dataflow_match.py +2 -2
- my_codebleu.py +1 -1
README.md CHANGED
@@ -12,25 +12,42 @@ pinned: false
 
 # Metric Card for CodeBLEU
 
-***Module Card Instructions:*** *Fill out the following subsections. Feel free to take a look at existing metric cards if you'd like examples.*
-
 ## Metric Description
 
+CodeBLEU from [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/code-to-code-trans/evaluator)
+and from the article [CodeBLEU: a Method for Automatic Evaluation of Code Synthesis](https://arxiv.org/abs/2009.10297).
+
+NOTE: currently works on Linux machines only due to a dependency on languages.so.
 
 ## How to Use
-*Give general statement of how to use the metric*
 
+```python
+src = 'class AcidicSwampOoze(MinionCard):§ def __init__(self):§ super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§ def create_minion(self, player):§ return Minion(3, 2)§'
+tgt = 'class AcidSwampOoze(MinionCard):§ def __init__(self):§ super().__init__("Acidic Swamp Ooze", 2, CHARACTER_CLASS.ALL, CARD_RARITY.COMMON, battlecry=Battlecry(Destroy(), WeaponSelector(EnemyPlayer())))§§ def create_minion(self, player):§ return Minion(3, 2)§'
+src = src.replace("§", "\n")
+tgt = tgt.replace("§", "\n")
+res = module.compute(predictions = [tgt], references = [[src]])
+print(res)
+# {'CodeBLEU': 0.9473264567644872, 'ngram_match_score': 0.8915993127600096, 'weighted_ngram_match_score': 0.8977065142979394, 'syntax_match_score': 1.0, 'dataflow_match_score': 1.0}
+```
 
 ### Inputs
+- **predictions** (`list` of `str`s): translations to score.
+- **references** (`list` of `list`s of `str`s): references for each translation.
+- **lang**: programming language, one of ['java', 'js', 'c_sharp', 'php', 'go', 'python', 'ruby'].
+- **tokenizer**: approach used for standardizing `predictions` and `references`.
+  The default tokenizer is `tokenizer_13a`, a relatively minimal tokenization approach that is however equivalent to `mteval-v13a`, used by WMT.
+  This can be replaced by another tokenizer from a source such as [SacreBLEU](https://github.com/mjpost/sacrebleu/tree/master/sacrebleu/tokenizers).
+- **params** (`str`): weights for averaging (see the CodeBLEU paper).
+  Defaults to equal weights "0.25,0.25,0.25,0.25".
 
 ### Output Values
 
+- CodeBLEU: resulting score,
+- ngram_match_score: see the CodeBLEU paper,
+- weighted_ngram_match_score: see the CodeBLEU paper,
+- syntax_match_score: see the CodeBLEU paper,
+- dataflow_match_score: see the CodeBLEU paper.
 
 #### Values from Popular Papers
 *Give examples, preferably with links to leaderboards or publications, to papers that have reported this metric, along with the values they have reported.*
@@ -39,10 +56,16 @@ pinned: false
 *Give code examples of the metric being used. Try to include examples that clear up any potential ambiguity left from the metric description above. If possible, provide a range of examples that show both typical and atypical results, as well as examples where a variety of input parameters are passed.*
 
 ## Limitations and Bias
+Linux OS only. See above for the set of supported programming languages.
 
 ## Citation
+```bibtex
+@InProceedings{huggingface:module,
+  title = {CodeBLEU: A Metric for Evaluating Code Generation},
+  authors={Sedykh, Ivan},
+  year={2022}
+}
+```
 
 ## Further References
 *Add any useful further references.*
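For readers checking the numbers in the README example: with the default `params` of "0.25,0.25,0.25,0.25" (parsed into `alpha, beta, gamma, theta` in `my_codebleu.py` below), the reported CodeBLEU value is simply the equal-weighted average of the four component scores. A minimal sketch of that aggregation, using only the numbers printed in the example:

```python
# Sketch: reproduce the aggregation implied by the default weights "0.25,0.25,0.25,0.25".
weights = [float(w) for w in "0.25,0.25,0.25,0.25".split(",")]
components = [
    0.8915993127600096,  # ngram_match_score
    0.8977065142979394,  # weighted_ngram_match_score
    1.0,                 # syntax_match_score
    1.0,                 # dataflow_match_score
]
codebleu = sum(w * c for w, c in zip(weights, components))
print(codebleu)  # ~0.9473264567644872, matching the 'CodeBLEU' value above (up to float rounding)
```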
dataflow_match.py CHANGED
@@ -36,11 +36,11 @@ def corpus_dataflow_match(references, candidates, lang, langso_dir):
         candidate = candidates[i]
         for reference in references_sample:
             try:
-                candidate=remove_comments_and_docstrings(candidate,
+                candidate=remove_comments_and_docstrings(candidate,lang)
             except:
                 pass
             try:
-                reference=remove_comments_and_docstrings(reference,
+                reference=remove_comments_and_docstrings(reference,lang)
             except:
                 pass
 
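The change above threads the `lang` argument of `corpus_dataflow_match` through to the comment/docstring stripping step, so normalization matches the language being scored. As a standalone illustration of the guarded pattern in this hunk (the helper name `normalize_pair` is hypothetical; the stripping function is passed in rather than imported, since the repo's import layout isn't shown here):

```python
from typing import Callable, Tuple

def normalize_pair(candidate: str, reference: str, lang: str,
                   strip: Callable[[str, str], str]) -> Tuple[str, str]:
    """Apply strip(source, lang) -- e.g. the repo's remove_comments_and_docstrings --
    to both sides, keeping the raw code if stripping fails, which mirrors the bare
    `except: pass` fallback in the hunk above."""
    try:
        candidate = strip(candidate, lang)
    except Exception:
        pass
    try:
        reference = strip(reference, lang)
    except Exception:
        pass
    return candidate, reference
```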
my_codebleu.py CHANGED
@@ -24,7 +24,7 @@ def calc_codebleu(predictions, references, lang, tokenizer=None, params='0.25,0.25,0.25,0.25'):
     alpha, beta, gamma, theta = [float(x) for x in params.split(',')]
 
     # preprocess inputs
-    references = [[x.strip() for x in ref] for ref in references]
+    references = [[x.strip() for x in ref] if type(ref) == list else [ref.strip()] for ref in references]
     hypothesis = [x.strip() for x in predictions]
 
     if not len(references) == len(hypothesis):
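This one-line change is what the commit message refers to: each entry of `references` may now be either a list of reference strings or a single bare string. A small sketch of the new preprocessing behaviour, copying the patched comprehension into a standalone function rather than calling the module:

```python
def preprocess_references(references):
    # Same comprehension as the patched line in calc_codebleu: bare strings are
    # wrapped into single-element lists so downstream code always sees list-of-lists.
    return [[x.strip() for x in ref] if type(ref) == list else [ref.strip()] for ref in references]

nested = preprocess_references([["def f():\n    return 1\n"]])  # previous, list-of-lists form
simple = preprocess_references(["def f():\n    return 1\n"])    # now-allowed simple list of strings
assert nested == simple == [["def f():\n    return 1"]]
```

With this in place, the README example could equivalently pass `references=[src]` instead of `references=[[src]]` (the surrounding whitespace strip aside).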