loubnabnl HF staff committed on
Commit
53c9467
·
1 Parent(s): f6fae6e
Files changed (1)
  1. evaluation/intro.txt +21 -23
evaluation/intro.txt CHANGED
@@ -4,6 +4,7 @@ In most papers, 200 candidate program completions are sampled, and pass@1, pass@
4
 
5
  | Model | pass@1 | pass@10 | pass@100|
6
  |-------|--------|---------|---------|
 
7
  |CodeParrot (1.5B) | 3.58% | 8.03% | 14.96% |
8
  |||||
9
  |InCoder (6.7B) | 15.2% | 27.8% | 47.00% |
@@ -15,24 +16,8 @@ In most papers, 200 candidate program completions are sampled, and pass@1, pass@
15
  |GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
16
  |GPT-J (6B)| 11.62% | 15.74% | 27.74% |
17
 
18
-
19
- To better understand how the pass@k metric works, we will illustrate it with some examples. We select 4 problems from the HumanEval dataset and see how the model performs and which code completions pass the unit tests. We will use CodeParrot 🦜 with the three problems below:
20
-
21
- ```python
22
-
23
- from typing import List
24
-
25
-
26
- def has_close_elements(numbers: List[float], threshold: float) -> bool:
27
- """ Check if in given list of numbers, are any two numbers closer to each other than
28
- given threshold.
29
- >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
30
- False
31
- >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
32
- True
33
- """
34
-
35
- ````
36
 
37
  ```python
38
 
@@ -47,7 +32,6 @@ def separate_paren_groups(paren_string: str) -> List[str]:
47
  >>> separate_paren_groups('( ) (( )) (( )( ))')
48
  ['()', '(())', '(()())']
49
  """
50
-
51
  ````
52
 
53
  ```python
@@ -61,15 +45,29 @@ def truncate_number(number: float) -> float:
61
  >>> truncate_number(3.5)
62
  0.5
63
  """
64
-
65
  ````
66
 
67
- For each problem, instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use nucleus sampling with `top-p=0.95` and `temperature=0.2`. For more details about decoding strategies for language generation, we recommend this [blog](https://huggingface.co/blog/how-to-generate). We will compute pass@1, pass@5 and pass@10, each corresponding to the unit test pass rate when selecting respectively 1, 5 and 10 samples from the candidate solutions.
68
 
69
  ```
70
 
71
- scores
72
 
73
  ````
74
 
75
- If we take a closer look at the unit test results for each candidate solution in the three tasks, we find that only 3 passed the test which corresponds to `1/30 = 0.333`, our pass@1, the scores pass@5 and pass@10 are higher, because the more samples we select from the candidate solutions, the more likely we are to include the correct solution. Without surprise pass@10 is '2/3=0.73': if we select all candidates two tasks out of three get solved.
 
4
 
5
  | Model | pass@1 | pass@10 | pass@100|
6
  |-------|--------|---------|---------|
7
+ |CodeParrot (110M) | 3.80% | 6.57% | 12.78% |
8
  |CodeParrot (1.5B) | 3.58% | 8.03% | 14.96% |
9
  |||||
10
  |InCoder (6.7B) | 15.2% | 27.8% | 47.00% |
 
16
  |GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
17
  |GPT-J (6B)| 11.62% | 15.74% | 27.74% |
18
 
19
+ <br/>
20
+ To better understand how the pass@k metric works, we will illustrate it with a concrete example. We select two problems from the HumanEval dataset and see how the model performs and which code completions pass the unit tests. We will use CodeParrot 🦜 (110M) with the two problems below:
 
21
 
22
  ```python
23
 
 
32
  >>> separate_paren_groups('( ) (( )) (( )( ))')
33
  ['()', '(())', '(()())']
34
  """
 
35
  ````
36
 
37
  ```python
 
45
  >>> truncate_number(3.5)
46
  0.5
47
  """
 
48
  ````
49
 
50
+ For each problem, instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use nucleus sampling with `top-p=0.95` and `temperature=0.2`. For more details about decoding strategies for language generation, we recommend this [blog](https://huggingface.co/blog/how-to-generate). We will compute pass@1, pass@10 and pass@20, each corresponding to the unit test pass rate when selecting respectively 1, 10 and 20 samples from the candidate solutions.
51
 
52
  ```
53
 
54
+ Results: {'pass@1': 0.0750, 'pass@10': 0.4473, 'pass@20': 0.5}
55
 
56
  ````
57
 
58
+ If we take a closer look at the unit test results for each candidate solution in the two problems, we find that 3 candidates passed the tests for the second problem, and none did for the first problem. This means that we have 3 correct solutions among 40, which corresponds to our pass@1 value `3/40 = 0.075`. The scores pass@10 and pass@20 are higher, because the more samples we select from the candidate completions, the more likely we are to include a correct implementation. As
59
+ for pass@20, it is `1/2 = 0.5`: if we select all 20 candidates for each problem, only the second problem gets solved, which gives a 50% success rate. If you are curious about the candidate solutions that passed the tests, they all implemented this function:
60
+
61
+ ```python
62
+
63
+ def truncate_number(number: float) -> float:
64
+     """ Given a positive floating point number, it can be decomposed into
65
+     and integer part (largest integer smaller than given number) and decimals
66
+     (leftover part always smaller than 1).
67
+
68
+     Return the decimal part of the number.
69
+     >>> truncate_number(3.5)
70
+     0.5
71
+     """
72
+     return number % 1
73
+ ```
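
The reported scores can be reproduced with a short calculation. The sketch below assumes the scores come from the commonly used unbiased pass@k estimator, `1 - C(n - c, k) / C(n, k)`, averaged over problems, where `n` is the number of samples per problem and `c` the number that pass. The diff does not state which estimator was used, so treat this as an illustration. The per-problem counts (0 and 3 correct out of 20) are taken from the explanation above, and the function names `pass_at_k` and `mean_pass_at_k` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem with n samples, c of which pass.

    Assumed estimator: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that k samples drawn without replacement are all incorrect.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(num_samples: int, num_correct: list[int], k: int) -> float:
    """Average the per-problem pass@k estimates over all problems."""
    return sum(pass_at_k(num_samples, c, k) for c in num_correct) / len(num_correct)


# Counts from the text: 20 samples per problem, 0 correct for the first
# problem and 3 correct for the second.
correct_counts = [0, 3]

for k in (1, 10, 20):
    print(f"pass@{k}: {mean_pass_at_k(20, correct_counts, k):.4f}")
# pass@1: 0.0750
# pass@10: 0.4474
# pass@20: 0.5000
```

Under this assumption the values agree with the reported results. The exact pass@10 value is 17/38 ≈ 0.44737, which matches the reported 0.4473 up to rounding of the last digit.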