Update content.py
content.py (CHANGED, +1 -1)
```diff
@@ -90,7 +90,7 @@ We use the following metrics for the following tasks:
 On every task, for every metric, we test for statistical significance at α=0.05, i.e., the probability that the performance of model A is equal to the performance of model B is estimated to be less than 0.05.
 We use the following tests, with varying statistical power:
 - accuracy and exact-match: one-tailed paired t-test,
-- average area under the curve: bayesian test inspired with (
+- average area under the curve: Bayesian test inspired by [(Goutte et al., 2005)](https://link.springer.com/chapter/10.1007/978-3-540-31865-1_25),
 - summarization & perplexity: bootstrapping.
 
 ### Duel Scoring Mechanism, Win Score
```
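For reference, hedged sketches of the three tests follow. The accuracy and exact-match comparison is a standard one-tailed paired t-test over per-example scores; here is a minimal sketch, assuming per-example 0/1 correctness arrays (the variable names and synthetic data are illustrative, not taken from content.py):

```python
import numpy as np
from scipy import stats

# Illustrative per-example correctness (1 = correct, 0 = wrong) for two
# models evaluated on the same examples; real values would come from the
# evaluation harness.
rng = np.random.default_rng(0)
scores_a = rng.integers(0, 2, size=500).astype(float)
scores_b = np.clip(scores_a + (rng.random(500) < 0.1), 0.0, 1.0)

# One-tailed paired t-test (SciPy >= 1.6): H1 is "model B scores higher
# than model A on the same examples"; pairing removes per-example
# difficulty as a source of variance.
t_stat, p_value = stats.ttest_rel(scores_b, scores_a, alternative="greater")

# Declare significance at alpha = 0.05, matching the description above.
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```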
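For average area under the curve, the linked Goutte et al. (2005) paper derives Beta posteriors for precision- and recall-style quantities and compares systems by sampling from those posteriors. As a loose, illustrative analogue (an assumption on my part, not the leaderboard's actual implementation), one can place Beta posteriors on two success rates and estimate P(B > A) by Monte Carlo:

```python
import numpy as np

def prob_b_beats_a(successes_a, trials_a, successes_b, trials_b,
                   n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1 + successes, 1 + failures) posteriors (uniform priors).
    This only mirrors the sampling-based comparison style of
    Goutte et al. (2005); it is not their exact AUC test."""
    rng = np.random.default_rng(seed)
    post_a = rng.beta(1 + successes_a, 1 + trials_a - successes_a, n_samples)
    post_b = rng.beta(1 + successes_b, 1 + trials_b - successes_b, n_samples)
    return float((post_b > post_a).mean())

# Declare B significantly better when P(B > A) >= 0.95, matching alpha = 0.05.
p_better = prob_b_beats_a(120, 200, 140, 200)
print(f"P(B > A) = {p_better:.3f}, significant: {p_better >= 0.95}")
```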
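For summarization and perplexity, a paired bootstrap over examples estimates how often the observed difference could arise by chance; the per-example metric values and the resampling scheme below are assumptions:

```python
import numpy as np

def bootstrap_p_value(metric_a, metric_b, n_resamples=10_000, seed=0):
    """Paired bootstrap: resample examples with replacement and count
    how often model B fails to beat model A on the resampled mean.
    `metric_a`/`metric_b` are per-example scores where higher is better
    (for perplexity, negate the values first)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(metric_b) - np.asarray(metric_a)
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_resamples, n))
    resampled_means = diffs[idx].mean(axis=1)
    # One-sided p-value: fraction of resamples where B does not beat A.
    return float((resampled_means <= 0).mean())

# Illustrative usage with fake ROUGE-like per-example scores.
rng = np.random.default_rng(1)
rouge_a = rng.normal(0.30, 0.05, size=300)
rouge_b = rouge_a + rng.normal(0.01, 0.05, size=300)
p = bootstrap_p_value(rouge_a, rouge_b)
print(f"bootstrap p = {p:.4f}, significant at 0.05: {p < 0.05}")
```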