Update content.py
content.py (CHANGED, +1 -1)
```diff
@@ -90,7 +90,7 @@ We use the following metrics for the following tasks:
 On every task, for every metric, we test for statistical significance at α=0.05, i.e., the probability that the performance of model A is equal to the performance of model B is estimated to be less than 0.05.
 We use the following tests, with varying statistical power:
 - accuracy and exact-match: one-tailed paired t-test,
-- average area under the curve: bayesian test inspired with (
+- average area under the curve: Bayesian test inspired by [(Goutte et al., 2005)](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
 - summarization & perplexity: bootstrapping.
 
 ### Duel Scoring Mechanism, Win Score
```
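For reference, hedged sketches of the three tests follow. The accuracy and exact-match comparison is a standard one-tailed paired t-test over per-example scores; here is a minimal sketch, assuming per-example 0/1 correctness arrays (the variable names and synthetic data are illustrative, not taken from content.py):

```python
import numpy as np
from scipy import stats

# Illustrative per-example correctness (1 = correct, 0 = wrong) for two
# models evaluated on the same examples; real values would come from the
# evaluation harness.
rng = np.random.default_rng(0)
scores_a = rng.integers(0, 2, size=500).astype(float)
scores_b = np.clip(scores_a + (rng.random(500) < 0.1), 0.0, 1.0)

# One-tailed paired t-test (SciPy >= 1.6): H1 is "model B scores higher
# than model A on the same examples"; pairing removes per-example
# difficulty as a source of variance.
t_stat, p_value = stats.ttest_rel(scores_b, scores_a, alternative="greater")

# Declare significance at alpha = 0.05, matching the description above.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```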
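For average area under the curve, the linked Goutte et al. (2005) paper derives Beta posteriors for precision- and recall-style quantities and compares systems by sampling from those posteriors. As a loose, illustrative analogue (an assumption on my part, not the leaderboard's actual implementation), one can place Beta posteriors on two success rates and estimate P(B > A) by Monte Carlo:

```python
import numpy as np

def prob_b_beats_a(successes_a, trials_a, successes_b, trials_b,
                   n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1 + successes, 1 + failures) posteriors (uniform priors).
    This only mirrors the sampling-based comparison style of
    Goutte et al. (2005); it is not their exact AUC test."""
    rng = np.random.default_rng(seed)
    post_a = rng.beta(1 + successes_a, 1 + trials_a - successes_a, n_samples)
    post_b = rng.beta(1 + successes_b, 1 + trials_b - successes_b, n_samples)
    return float((post_b > post_a).mean())

# Declare B significantly better when P(B > A) >= 0.95, matching alpha = 0.05.
p_better = prob_b_beats_a(120, 200, 140, 200)
print(f"P(B > A) = {p_better:.3f}, significant: {p_better >= 0.95}")
```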
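For summarization and perplexity, a paired bootstrap over examples estimates how often the observed difference could arise by chance; the per-example metric values and the resampling scheme below are assumptions:

```python
import numpy as np

def bootstrap_p_value(metric_a, metric_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: resample examples with replacement and count
    how often model B fails to beat model A on the resampled mean.
    `metric_a`/`metric_b` are per-example scores where higher is better
    (for perplexity, negate the values first)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(metric_b) - np.asarray(metric_a)
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = diffs[idx].mean(axis=1)
    # One-sided p-value: fraction of resamples where B does not beat A.
    return float((resampled_means <= 0).mean())

# Illustrative usage with fake ROUGE-like per-example scores.
rng = np.random.default_rng(1)
rouge_a = rng.normal(0.30, 0.05, size=300)
rouge_b = rouge_a + rng.normal(0.01, 0.05, size=300)
p = bootstrap_p_value(rouge_a, rouge_b)
print(f"bootstrap p = {p:.4f}, significant at 0.05: {p < 0.05}")
```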