---
title: RestrictedPython Code Eval
datasets:
- N/A (eval module only)
tags:
- evaluate
- metric
description: "Same logic as the built-in `code_eval`, but compiling and running the code using `RestrictedPython`"
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
---
# Metric Card for RestrictedPython Code Eval
## Metric Description
A code-based evaluation metric with the same logic as [`code_eval`](https://huggingface.co/spaces/evaluate-metric/code_eval), except that candidate code is compiled and run using `RestrictedPython`.
## How to Use
```python
from evaluate import load
code_eval = load("guydav/restrictedpython_code_eval")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2], use_safe_builtins=True)
```
N.B.
This metric exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. Before running this metric and once you've taken the necessary precautions, you will need to set the `HF_ALLOW_CODE_EVAL` environment variable. Use it at your own risk:
```python
import os
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
```
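A script can fail fast when the opt-in flag is missing, rather than discovering it in the middle of an evaluation run. The sketch below is an illustration only: `require_code_eval_opt_in` is a hypothetical helper, not part of this metric or of `evaluate`, and it assumes the flag must equal the string `"1"`, as in the snippet above.

```python
import os


def require_code_eval_opt_in() -> None:
    """Raise if the opt-in flag is not set (hypothetical helper, not part of the metric)."""
    # Assumption: the flag counts as enabled only when it is exactly the string "1".
    if os.environ.get("HF_ALLOW_CODE_EVAL", "0") != "1":
        raise RuntimeError(
            "Set HF_ALLOW_CODE_EVAL=1 only after sandboxing the evaluation environment."
        )


# Opt in explicitly, then confirm the flag is visible to this process.
os.environ["HF_ALLOW_CODE_EVAL"] = "1"
require_code_eval_opt_in()
```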
### Inputs
The following arguments are inherited from the basic `code_eval`:
**`predictions`** (`List[List[str]]`): a list of candidates to evaluate. Each candidate should be a list of strings with several code candidates to solve the problem.
**`references`** (`List[str]`): a list with a test for each prediction. Each test should evaluate the correctness of a code candidate.
**`k`** (`List[int]`): number of code candidates to consider in the evaluation. The default value is `[1, 10, 100]`.
**`num_workers`** (`int`): the number of workers used to evaluate the candidate programs. The default value is `4`.
**`timeout`** (`float`): The maximum time taken to produce a prediction before it is considered a "timeout". The default value is `3.0` (i.e. 3 seconds).
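To make the pairing of `predictions`, `references`, and `timeout` concrete, the sketch below joins one candidate with its test, runs the result in a child Python process, and reports a timeout if the process does not finish in time. It is not this metric's implementation: `check_candidate` is a hypothetical name, the code does not apply `RestrictedPython`, and it provides no sandboxing, so it should only be fed strings you trust.

```python
import subprocess
import sys


def check_candidate(candidate: str, test: str, timeout: float = 3.0) -> str:
    """Run candidate + test in a child interpreter (illustrative, NOT sandboxed)."""
    program = candidate + "\n" + test + "\n"
    try:
        completed = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,  # the child is killed once this many seconds elapse
        )
    except subprocess.TimeoutExpired:
        return "timed out"
    # A failing assert raises in the child, which exits with a non-zero code.
    return "passed" if completed.returncode == 0 else "failed"


print(check_candidate("def add(a, b): return a+b", "assert add(2,3)==5"))  # passed
print(check_candidate("def add(a,b): return a*b", "assert add(2,3)==5"))   # failed
```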
In addition, this metric supports several additional arguments, specifying which imports should be made available and controlling other aspects of `RestrictedPython` behavior:
**`use_safe_builtins`** (`bool`): Whether or not to allow the usage of [`RestrictedPython.safe_builtins`](https://github.com/zopefoundation/RestrictedPython/blob/c31c133844ac2308f5cc930e934a7227a2a6a77b/src/RestrictedPython/Guards.py#L23). Defaults to True.
**`use_limited_builtins`** (`bool`): Whether or not to allow the usage of [`RestrictedPython.limited_builtins`](https://github.com/zopefoundation/RestrictedPython/blob/c31c133844ac2308f5cc930e934a7227a2a6a77b/src/RestrictedPython/Limits.py#L14), which provides limited implementations of `range`, `list`, and `tuple`. Defaults to True.
**`use_utility_builtins`** (`bool`): Whether or not to allow the usage of [`RestrictedPython.utility_builtins`](https://github.com/zopefoundation/RestrictedPython/blob/c31c133844ac2308f5cc930e934a7227a2a6a77b/src/RestrictedPython/Utilities.py#L19), which includes the `string`, `math`, `random`, and `set` packages, among others. Defaults to True.
**`additional_globals`** (`Dict[str, Any] | None`): Any additional `globals` to make available to the code. Defaults to None.
**`additional_locals`** (`Dict[str, Any] | None`): Any additional `locals` to make available to the code. Defaults to None.
**`allowed_imports`** (`List[str] | None`): A list of allowed imports. Defaults to None.
**`allow_str_format`** (`bool`): Whether or not to allow the use of `str.format`. Defaults to False, as it's considered [harmful](http://lucumr.pocoo.org/2016/12/29/careful-with-str-format/).
**`allow_underscore_variable_names`** (`bool`): Whether or not to allow the use of variable names starting with an underscore. Defaults to False, as it's considered [harmful](https://stackoverflow.com/questions/1301346/what-is-the-meaning-of-a-single-and-a-double-underscore-before-an-object-name).
**`return_output`** (`bool`): Whether or not to return the output of the code. Defaults to False.
**`output_variable`** (`str`): The name of the variable to return the output of. Defaults to `'output'`.
As the new arguments are optional, this could be used as a drop-in replacement for `code_eval`.
Additionally, this metric sets several different `globals` if they are not provided as additional globals. The full list of globals set is: `__metaclass__, __name__, _getiter_, _iter_unpack_sequence_, _getitem_, getattr, _write_, _inplacevar_, _print_`. See the code for additional details.
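The "set only if not provided" behavior described above can be pictured as a dictionary merge in which user-supplied entries take precedence. The sketch below is an assumption-laden illustration: `DEFAULT_GLOBALS` and `merge_globals` are hypothetical names, and the default values are placeholders rather than the objects the metric actually installs (see its source for those).

```python
from typing import Any, Dict, Optional

# Placeholder defaults for illustration only; the metric's real values differ.
DEFAULT_GLOBALS: Dict[str, Any] = {
    "__name__": "__main__",
    "__metaclass__": type,
    "getattr": getattr,
}


def merge_globals(additional_globals: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a new dict: user-supplied globals win, defaults fill the gaps."""
    merged = dict(additional_globals or {})  # copy so the caller's dict is untouched
    for name, value in DEFAULT_GLOBALS.items():
        merged.setdefault(name, value)
    return merged


custom = merge_globals({"__name__": "candidate", "helper": len})
print(custom["__name__"])       # candidate (the user-supplied value is kept)
print(custom["__metaclass__"])  # <class 'type'> (filled in from the defaults)
```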
### Output Values
Identical to `code_eval`:
The Code Eval metric outputs two things:
`pass_at_k`: a dictionary with the pass rates for each k value defined in the arguments.
`results`: a dictionary with granular results of each unit test.
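For a single problem with `n` candidates of which `c` pass, the Codex paper defines the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k). The sketch below assumes this metric follows the same estimator as the built-in `code_eval`; the function name `pass_at_k` and the use of `math.comb` are choices made here for illustration, not the metric's own code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k failing candidates: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Two candidates, one of which passes.
print(pass_at_k(n=2, c=1, k=1))  # 0.5
print(pass_at_k(n=2, c=1, k=2))  # 1.0
```

With several problems, the per-problem values are averaged to give the reported score.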
### Values from popular papers
The [original CODEX paper](https://arxiv.org/pdf/2107.03374.pdf) reported that the CODEX-12B model had a pass@k score of 28.8% at `k=1`, 46.8% at `k=10` and 72.3% at `k=100`. However, since the CODEX model is not open source, it is hard to verify these numbers.
### Examples
Copied from the `code_eval` metric card:
Full match at `k=1`:
```python
from evaluate import load
code_eval = load("code_eval")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a, b): return a+b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
print(pass_at_k)
# {'pass@1': 1.0}
```
No match at `k=1`:
```python
from evaluate import load
code_eval = load("code_eval")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a,b): return a*b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1])
print(pass_at_k)
# {'pass@1': 0.0}
```
Partial match at `k=1`, full match at `k=2`:
```python
from evaluate import load
code_eval = load("code_eval")
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a, b): return a+b", "def add(a,b): return a*b"]]
pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)
# {'pass@1': 0.5, 'pass@2': 1.0}
```
## Limitations and Bias
From the original `code_eval` metric card:
As per the warning included in the metric code itself:
> This program exists to execute untrusted model-generated code. Although it is highly unlikely that model-generated code will do something overtly malicious in response to this test suite, model-generated code may act destructively due to a lack of model capability or alignment. Users are strongly encouraged to sandbox this evaluation suite so that it does not perform destructive actions on their host or network. For more information on how OpenAI sandboxes its code, see the accompanying paper. Once you have read this disclaimer and taken appropriate precautions, uncomment the following line and proceed at your own risk:
More information about the limitations of the code can be found on the [Human Eval Github repository](https://github.com/openai/human-eval).
Additionally, this metric does not currently allow for custom `RestrictedPython` policies, so any code that depends on non-default libraries or packages may fail.
**TODO**: Add a `use_custom_builtins` argument that allows users to specify their own `RestrictedPython` policy. See the RestrictedPython [documentation](https://restrictedpython.readthedocs.io/en/latest/usage/policy.html#implementing-a-policy) for additional details.
## Citation
Based on the original `code_eval` metric, which cites:
```bibtex
@misc{chen2021evaluating,
title={Evaluating Large Language Models Trained on Code},
author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan
and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards
and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray
and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf
and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray
and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser
and Mohammad Bavarian and Clemens Winter and Philippe Tillet
and Felipe Petroski Such and Dave Cummings and Matthias Plappert
and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss
and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak
and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain
and William Saunders and Christopher Hesse and Andrew N. Carr
and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa
and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati
and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei
and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
year={2021},
eprint={2107.03374},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
## Further References
- The original `code_eval` metric: https://huggingface.co/spaces/evaluate-metric/code_eval
- RestrictedPython: https://restrictedpython.readthedocs.io/en/latest/index.html