zstanjj committed
Commit 023a9be · 1 Parent(s): ee70019

fix results

Files changed (17)
  1. README.md +33 -29
  2. eval-results/omnieval-auto/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  3. eval-results/omnieval-auto/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  4. eval-results/omnieval-auto/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json +12 -12
  5. eval-results/omnieval-auto/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json +12 -12
  6. eval-results/omnieval-auto/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  7. eval-results/omnieval-auto/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json +11 -11
  8. eval-results/omnieval-auto/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  9. eval-results/omnieval-human/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  10. eval-results/omnieval-human/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  11. eval-results/omnieval-human/e5-mistral-7b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +11 -11
  12. eval-results/omnieval-human/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json +12 -12
  13. eval-results/omnieval-human/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json +12 -12
  14. eval-results/omnieval-human/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  15. eval-results/omnieval-human/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json +11 -11
  16. eval-results/omnieval-human/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  17. src/about.py +24 -103
README.md CHANGED
@@ -10,36 +10,40 @@ license: apache-2.0
10
  short_description: Official Leaderboard for OmniEval
11
  ---
12
 
13
- # Start the configuration
14
-
15
- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
16
-
17
- Results files should have the following format and be stored as json files:
18
- ```json
19
- {
20
- "config": {
21
- "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
22
- "model_name": "path of the model on the hub: org/model",
23
- "model_sha": "revision on the hub",
24
- },
25
- "results": {
26
- "task_name": {
27
- "metric_name": score,
28
- },
29
- "task_name2": {
30
- "metric_name": score,
31
- }
32
- }
33
- }
34
- ```
35
 
36
- Request files are created automatically by this tool.
37
 
38
- If you encounter problem on the space, don't hesitate to restart it to remove the create eval-queue, eval-queue-bk, eval-results and eval-results-bk created folder.
39
 
40
- # Code logic for more complex edits
41
 
42
- You'll find
43
- - the main table' columns names and properties in `src/display/utils.py`
44
- - the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
45
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
 
10
  short_description: Official Leaderboard for OmniEval
11
  ---
12
 
13
+ ---
14
+ license: mit
15
+ language:
16
+ - zh
17
+ - en
18
+ base_model:
19
+ - Qwen/Qwen2.5-7B-Instruct
20
+ pipeline_tag: text-generation
21
+ ---
22
+
23
+ # Dataset Information
24
 
25
+ We introduce **OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain**, an omnidirectional and automatic RAG benchmark for the financial domain. The benchmark is characterized by its multi-dimensional evaluation framework, which includes:
26
 
27
+ 1. a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios;
28
+ 2. a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances;
29
+ 3. a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline;
30
+ 4. robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator.
31
 
32
+ Useful Links: 📝 [Paper](https://arxiv.org/abs/2412.13018) • 🤗 [Hugging Face](https://huggingface.co/collections/RUC-NLPIR/omnieval-67629ccbadd3a715a080fd25) • 🧩 [GitHub](https://github.com/RUC-NLPIR/OmniEval)
33
 
34
+ We trained two evaluator models from Qwen2.5-7B, using LoRA and human-annotated labels, to implement model-based evaluation. Note that the hallucination evaluator is a separate model from the one used for the other four metrics.
35
+
36
+ This repo provides the evaluator for the metrics other than hallucination.
37
+
38
+ # 🌟 Citation
39
+ ```bibtex
40
+ @misc{wang2024omnievalomnidirectionalautomaticrag,
41
+ title={OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain},
42
+ author={Shuting Wang and Jiejun Tan and Zhicheng Dou and Ji-Rong Wen},
43
+ year={2024},
44
+ eprint={2412.13018},
45
+ archivePrefix={arXiv},
46
+ primaryClass={cs.CL},
47
+ url={https://arxiv.org/abs/2412.13018},
48
+ }
49
+ ```
eval-results/omnieval-auto/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.3097634381445468,
5
- "map": 0.30402197247127166
6
  },
7
  "generation": {
8
- "em": 0.0026518499810582142,
9
- "f1": 0.2480828824153542,
10
- "rouge1": 0.2493538725800514,
11
- "rouge2": 0.1235656068292625,
12
- "rougeL": 0.16098924930699862,
13
- "accuracy": 0.3906427579239803,
14
- "completeness": 0.5930474914396308,
15
- "hallucination": 0.06504488096786783,
16
- "utilization": 0.5045650189122212,
17
- "numerical_accuracy": 0.28149656401119877
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.3865492275960769,
5
+ "map": 0.37771288462347935
6
  },
7
  "generation": {
8
+ "em": 0.003156964263164541,
9
+ "f1": 0.254069117724313,
10
+ "rouge1": 0.25832549561659673,
11
+ "rouge2": 0.13269187125919746,
12
+ "rougeL": 0.16925453426436302,
13
+ "accuracy": 0.4080060613713853,
14
+ "completeness": 0.6048002385211687,
15
+ "hallucination": 0.05973250227243215,
16
+ "utilization": 0.5193561001042752,
17
+ "numerical_accuracy": 0.31237373737373736
18
  }
19
  },
20
  "config": {
eval-results/omnieval-auto/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.33076566906595944,
5
- "map": 0.32402765500694536
6
  },
7
  "generation": {
8
- "em": 0.002525571410531633,
9
- "f1": 0.2524796046548042,
10
- "rouge1": 0.2542055585319881,
11
- "rouge2": 0.12967013110722864,
12
- "rougeL": 0.16623387811734364,
13
- "accuracy": 0.4025188916876574,
14
- "completeness": 0.6033108522378908,
15
- "hallucination": 0.07283603096410979,
16
- "utilization": 0.5141388174807198,
17
- "numerical_accuracy": 0.3162303664921466
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.40570358210211727,
5
+ "map": 0.396066422528097
6
  },
7
  "generation": {
8
+ "em": 0.003156964263164541,
9
+ "f1": 0.25926214898204075,
10
+ "rouge1": 0.2635672919940079,
11
+ "rouge2": 0.13850004284564332,
12
+ "rougeL": 0.17457743506358883,
13
+ "accuracy": 0.4090794292208612,
14
+ "completeness": 0.609230539815091,
15
+ "hallucination": 0.0634184068058778,
16
+ "utilization": 0.52025545090956,
17
+ "numerical_accuracy": 0.30601370210606443
18
  }
19
  },
20
  "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.3406848507808225,
5
- "map": 0.3337426863661236
6
  },
7
  "generation": {
8
- "em": 0.0035568464031653824,
9
- "f1": 0.3226028700822056,
10
- "rouge1": 0.29804464952499493,
11
- "rouge2": 0.1619392409911174,
12
- "rougeL": 0.21536150159516076,
13
- "accuracy": 0.3783377209477247,
14
- "completeness": 0.5935541629364369,
15
- "hallucination": 0.06668379802132854,
16
- "utilization": 0.48314821907315203,
17
- "numerical_accuracy": 0.2761605035405193
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.44908027107799803,
5
+ "map": 0.4369785747358673
6
  },
7
  "generation": {
8
+ "em": 0.004714399966325714,
9
+ "f1": 0.3299866038099243,
10
+ "rouge1": 0.31137653557230416,
11
+ "rouge2": 0.17517183240145648,
12
+ "rougeL": 0.2279270260032969,
13
+ "accuracy": 0.409900239929284,
14
+ "completeness": 0.6072016768977392,
15
+ "hallucination": 0.0634046368643525,
16
+ "utilization": 0.519655704008222,
17
+ "numerical_accuracy": 0.31754059089699227
18
  }
19
  },
20
  "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.3406848507808225,
5
- "map": 0.3337426863661236
6
  },
7
  "generation": {
8
- "em": 0.030906680136380857,
9
- "f1": 0.4704248712273675,
10
- "rouge1": 0.3844331865430577,
11
- "rouge2": 0.21544656691735142,
12
- "rougeL": 0.3082188596657867,
13
- "accuracy": 0.4181714862987751,
14
- "completeness": 0.586105675146771,
15
- "hallucination": 0.0880543450397334,
16
- "utilization": 0.45601078859491395,
17
- "numerical_accuracy": 0.2751721876024926
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.44908027107799803,
5
+ "map": 0.4369785747358673
6
  },
7
  "generation": {
8
+ "em": 0.04034600328324283,
9
+ "f1": 0.4810416082778636,
10
+ "rouge1": 0.39948754207404946,
11
+ "rouge2": 0.23047720731140595,
12
+ "rougeL": 0.3235410874683177,
13
+ "accuracy": 0.43982826114408385,
14
+ "completeness": 0.5925646063170621,
15
+ "hallucination": 0.07924721546536935,
16
+ "utilization": 0.4753909254037426,
17
+ "numerical_accuracy": 0.3087947882736156
18
  }
19
  },
20
  "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.3406848507808225,
5
- "map": 0.3337426863661236
6
  },
7
  "generation": {
8
- "em": 0.0028412678368480867,
9
- "f1": 0.2477112059712835,
10
- "rouge1": 0.25666135328401396,
11
- "rouge2": 0.13256084364546591,
12
- "rougeL": 0.1669344569228441,
13
- "accuracy": 0.40573304710190683,
14
- "completeness": 0.6131668895824045,
15
- "hallucination": 0.05456183245399562,
16
- "utilization": 0.5346272891410885,
17
- "numerical_accuracy": 0.2971301335972291
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.44908027107799803,
5
+ "map": 0.4369785747358673
6
  },
7
  "generation": {
8
+ "em": 0.004293471397903776,
9
+ "f1": 0.25632469571916017,
10
+ "rouge1": 0.26861074169895954,
11
+ "rouge2": 0.1444095692170222,
12
+ "rougeL": 0.17778126757506857,
13
+ "accuracy": 0.4326303826240687,
14
+ "completeness": 0.6255959475566151,
15
+ "hallucination": 0.04670259173723377,
16
+ "utilization": 0.5613256113256113,
17
+ "numerical_accuracy": 0.3292742328300049
18
  }
19
  },
20
  "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.3406848507808225,
5
- "map": 0.3337426863661236
6
  },
7
  "generation": {
8
  "em": 0.0,
9
- "f1": 0.09732568803130702,
10
- "rouge1": 0.1642342072893325,
11
- "rouge2": 0.06542075931397044,
12
- "rougeL": 0.059256539829821125,
13
- "accuracy": 0.3304375804375804,
14
- "completeness": 0.5735068912710567,
15
- "hallucination": 0.06555017663221248,
16
- "utilization": 0.4132755170113409,
17
- "numerical_accuracy": 0.175
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.44908027107799803,
5
+ "map": 0.4369785747358673
6
  },
7
  "generation": {
8
  "em": 0.0,
9
+ "f1": 0.09576059934519404,
10
+ "rouge1": 0.1650998130595869,
11
+ "rouge2": 0.06697080375857452,
12
+ "rougeL": 0.05928647212536637,
13
+ "accuracy": 0.34019446899861094,
14
+ "completeness": 0.5778415961305925,
15
+ "hallucination": 0.059720954492111095,
16
+ "utilization": 0.42293577981651376,
17
+ "numerical_accuracy": 0.16823529411764707
18
  }
19
  },
20
  "config": {
eval-results/omnieval-auto/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.25315906890600665,
5
- "map": 0.24830681483352277
6
  },
7
  "generation": {
8
- "em": 0.0026518499810582142,
9
- "f1": 0.24837825152624493,
10
- "rouge1": 0.24111819423215256,
11
- "rouge2": 0.11665848753826197,
12
- "rougeL": 0.1558018779014647,
13
- "accuracy": 0.3705644652102538,
14
- "completeness": 0.5820335932813437,
15
- "hallucination": 0.09210356820816695,
16
- "utilization": 0.4738984364905027,
17
- "numerical_accuracy": 0.24648820567187915
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.34687460537946707,
5
+ "map": 0.3395167740034516
6
  },
7
  "generation": {
8
+ "em": 0.0040409142568506124,
9
+ "f1": 0.25528888107857534,
10
+ "rouge1": 0.2532119544207203,
11
+ "rouge2": 0.12795048070526135,
12
+ "rougeL": 0.16617984432034583,
13
+ "accuracy": 0.3907690364945069,
14
+ "completeness": 0.5980714606069667,
15
+ "hallucination": 0.07936304096571209,
16
+ "utilization": 0.5078436415070079,
17
+ "numerical_accuracy": 0.28370640291514837
18
  }
19
  },
20
  "config": {
eval-results/omnieval-human/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.3426063022019742,
5
- "map": 0.33500379650721335
6
  },
7
  "generation": {
8
- "em": 0.0017084282460136675,
9
- "f1": 0.3797528411547138,
10
- "rouge1": 0.3372893350582966,
11
- "rouge2": 0.18329984910669803,
12
- "rougeL": 0.23230144566069125,
13
- "accuracy": 0.40888382687927105,
14
- "completeness": 0.6021044427123928,
15
- "hallucination": 0.08138173302107728,
16
- "utilization": 0.5014637002341921,
17
- "numerical_accuracy": 0.3100358422939068
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.42523728170083525,
5
+ "map": 0.4153046697038724
6
  },
7
  "generation": {
8
+ "em": 0.003416856492027335,
9
+ "f1": 0.38699874429027187,
10
+ "rouge1": 0.3504002729437697,
11
+ "rouge2": 0.19632811311525056,
12
+ "rougeL": 0.24352337911354996,
13
+ "accuracy": 0.43251708428246016,
14
+ "completeness": 0.6223938223938223,
15
+ "hallucination": 0.07180694526191878,
16
+ "utilization": 0.5366863905325444,
17
+ "numerical_accuracy": 0.35452103849597133
18
  }
19
  },
20
  "config": {
eval-results/omnieval-human/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.3527809415337889,
5
- "map": 0.3458855353075171
6
  },
7
  "generation": {
8
- "em": 0.0017084282460136675,
9
- "f1": 0.38645032979631466,
10
- "rouge1": 0.3467267951634575,
11
- "rouge2": 0.1930581604826183,
12
- "rougeL": 0.24141093461883717,
13
- "accuracy": 0.4271070615034169,
14
- "completeness": 0.6119287374128582,
15
- "hallucination": 0.07481005260081823,
16
- "utilization": 0.5400116822429907,
17
- "numerical_accuracy": 0.3372093023255814
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.4236332574031891,
5
+ "map": 0.41523348519362185
6
  },
7
  "generation": {
8
+ "em": 0.003986332574031891,
9
+ "f1": 0.39131580638847696,
10
+ "rouge1": 0.35726262162172084,
11
+ "rouge2": 0.20428265081202376,
12
+ "rougeL": 0.25173121998034476,
13
+ "accuracy": 0.4450455580865604,
14
+ "completeness": 0.6207692307692307,
15
+ "hallucination": 0.07088459285295841,
16
+ "utilization": 0.541031652989449,
17
+ "numerical_accuracy": 0.34715960324616774
18
  }
19
  },
20
  "config": {
eval-results/omnieval-human/e5-mistral-7b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.303246013667426,
5
- "map": 0.2960516324981017
6
  },
7
  "generation": {
8
  "em": 0.002277904328018223,
9
- "f1": 0.3705164550873997,
10
- "rouge1": 0.3270311806826159,
11
- "rouge2": 0.17476659877087528,
12
- "rougeL": 0.22225645997479143,
13
- "accuracy": 0.385250569476082,
14
- "completeness": 0.5877535101404057,
15
- "hallucination": 0.0924956369982548,
16
- "utilization": 0.4793244030285381,
17
- "numerical_accuracy": 0.28622540250447226
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.386455960516325,
5
+ "map": 0.37688876233864843
6
  },
7
  "generation": {
8
  "em": 0.002277904328018223,
9
+ "f1": 0.3787448936861267,
10
+ "rouge1": 0.34038227335702076,
11
+ "rouge2": 0.1898058362852231,
12
+ "rougeL": 0.23622836359261534,
13
+ "accuracy": 0.40689066059225515,
14
+ "completeness": 0.5954968944099379,
15
+ "hallucination": 0.07920792079207921,
16
+ "utilization": 0.5117027501462844,
17
+ "numerical_accuracy": 0.3050397877984085
18
  }
19
  },
20
  "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.36173120728929387,
5
- "map": 0.3512338648443432
6
  },
7
  "generation": {
8
- "em": 0.0056947608200455585,
9
- "f1": 0.4212862409737785,
10
- "rouge1": 0.3707328288930376,
11
- "rouge2": 0.21393113234607009,
12
- "rougeL": 0.2719847145278759,
13
- "accuracy": 0.3886674259681093,
14
- "completeness": 0.5858823529411765,
15
- "hallucination": 0.07893209518282066,
16
- "utilization": 0.48166472642607683,
17
- "numerical_accuracy": 0.27365491651205937
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.45742217160212606,
5
+ "map": 0.4442720197418375
6
  },
7
  "generation": {
8
+ "em": 0.005125284738041002,
9
+ "f1": 0.4353357282548688,
10
+ "rouge1": 0.39114215500827765,
11
+ "rouge2": 0.2348958346329388,
12
+ "rougeL": 0.29164097017642365,
13
+ "accuracy": 0.4234054669703872,
14
+ "completeness": 0.60062893081761,
15
+ "hallucination": 0.075,
16
+ "utilization": 0.516044340723454,
17
+ "numerical_accuracy": 0.32132963988919666
18
  }
19
  },
20
  "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.36173120728929387,
5
- "map": 0.3512338648443432
6
  },
7
  "generation": {
8
- "em": 0.04555808656036447,
9
- "f1": 0.4907954247383474,
10
- "rouge1": 0.4080491070348775,
11
- "rouge2": 0.23130474174425783,
12
- "rougeL": 0.3217574785678875,
13
- "accuracy": 0.4216970387243736,
14
- "completeness": 0.5688146380270486,
15
- "hallucination": 0.11832946635730858,
16
- "utilization": 0.4491869918699187,
17
- "numerical_accuracy": 0.288981288981289
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.45742217160212606,
5
+ "map": 0.4442720197418375
6
  },
7
  "generation": {
8
+ "em": 0.05125284738041002,
9
+ "f1": 0.5042287844817168,
10
+ "rouge1": 0.4252992013911242,
11
+ "rouge2": 0.25007376816549043,
12
+ "rougeL": 0.33900256076984714,
13
+ "accuracy": 0.4433371298405467,
14
+ "completeness": 0.574468085106383,
15
+ "hallucination": 0.11310904872389792,
16
+ "utilization": 0.47642607683352733,
17
+ "numerical_accuracy": 0.32676348547717843
18
  }
19
  },
20
  "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.36173120728929387,
5
- "map": 0.3512338648443432
6
  },
7
  "generation": {
8
- "em": 0.002277904328018223,
9
- "f1": 0.3804001391052641,
10
- "rouge1": 0.34576336184459094,
11
- "rouge2": 0.1928778762677512,
12
- "rougeL": 0.2383694455084706,
13
- "accuracy": 0.4145785876993166,
14
- "completeness": 0.598297213622291,
15
- "hallucination": 0.07213496218731821,
16
- "utilization": 1.13922942206655,
17
- "numerical_accuracy": 0.3218694885361552
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.45742217160212606,
5
+ "map": 0.4442720197418375
6
  },
7
  "generation": {
8
+ "em": 0.0028473804100227792,
9
+ "f1": 0.39189804056173694,
10
+ "rouge1": 0.36142455862500045,
11
+ "rouge2": 0.20781042503487615,
12
+ "rougeL": 0.2528346438884966,
13
+ "accuracy": 0.44760820045558086,
14
+ "completeness": 0.6189922480620155,
15
+ "hallucination": 0.061843640606767794,
16
+ "utilization": 0.5575686732904734,
17
+ "numerical_accuracy": 0.35951134380453753
18
  }
19
  },
20
  "config": {
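
The removed side of the hunk above reported `utilization` as 1.13922942206655, while every other metric in these result files lies between 0 and 1. The following is a minimal sketch of a sanity check that would flag such a value. It is not part of the repository: the function name `out_of_range_metrics` is hypothetical, and the `[0, 1]` bounds are an assumption based on the values visible in this commit.

```python
from typing import Dict, List, Tuple

# Shape of the "results" object in the files above: group -> metric -> score.
Results = Dict[str, Dict[str, float]]


def out_of_range_metrics(
    results: Results, low: float = 0.0, high: float = 1.0
) -> List[Tuple[str, str, float]]:
    """Return (group, metric, value) triples whose value lies outside [low, high]."""
    flagged = []
    for group, metrics in results.items():
        for name, value in metrics.items():
            if not (low <= value <= high):
                flagged.append((group, name, value))
    return flagged


# A subset of the values from the removed and added sides of the hunk above.
old_results: Results = {
    "retrieval": {"mrr": 0.36173120728929387, "map": 0.3512338648443432},
    "generation": {"accuracy": 0.4145785876993166, "utilization": 1.13922942206655},
}
new_results: Results = {
    "retrieval": {"mrr": 0.45742217160212606, "map": 0.4442720197418375},
    "generation": {"accuracy": 0.44760820045558086, "utilization": 0.5575686732904734},
}

print(out_of_range_metrics(old_results))
print(out_of_range_metrics(new_results))
```

Running a check of this kind over each results file before committing is one way to catch a mis-scaled metric early; whether the project uses anything similar is not shown in this commit.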
eval-results/omnieval-human/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.36173120728929387,
5
- "map": 0.3512338648443432
6
  },
7
  "generation": {
8
  "em": 0.0,
9
- "f1": 0.16041349053275844,
10
- "rouge1": 0.21775697114621573,
11
- "rouge2": 0.09738983880706074,
12
- "rougeL": 0.08775246194460379,
13
- "accuracy": 0.3211845102505695,
14
- "completeness": 0.5703789636504254,
15
- "hallucination": 0.07665094339622641,
16
- "utilization": 0.40828402366863903,
17
- "numerical_accuracy": 0.162
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.45742217160212606,
5
+ "map": 0.4442720197418375
6
  },
7
  "generation": {
8
  "em": 0.0,
9
+ "f1": 0.15831651384807305,
10
+ "rouge1": 0.2195147064138981,
11
+ "rouge2": 0.09922121332360972,
12
+ "rougeL": 0.08869793021948827,
13
+ "accuracy": 0.3365603644646925,
14
+ "completeness": 0.5820836621941594,
15
+ "hallucination": 0.0648202710665881,
16
+ "utilization": 0.4234421364985163,
17
+ "numerical_accuracy": 0.18561001042752867
18
  }
19
  },
20
  "config": {
eval-results/omnieval-human/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
1
  {
2
  "results": {
3
  "retrieval": {
4
- "mrr": 0.27484813971146543,
5
- "map": 0.26924354593773725
6
  },
7
  "generation": {
8
- "em": 0.003416856492027335,
9
- "f1": 0.37960439080933656,
10
- "rouge1": 0.3255380867320351,
11
- "rouge2": 0.1732248556904568,
12
- "rougeL": 0.22591939162851002,
13
- "accuracy": 0.3826879271070615,
14
- "completeness": 0.5793588741204065,
15
- "hallucination": 0.0897510133178923,
16
- "utilization": 0.4855072463768116,
17
- "numerical_accuracy": 0.2663594470046083
18
  }
19
  },
20
  "config": {
 
1
  {
2
  "results": {
3
  "retrieval": {
4
+ "mrr": 0.3532839787395595,
5
+ "map": 0.3458285876993166
6
  },
7
  "generation": {
8
+ "em": 0.003986332574031891,
9
+ "f1": 0.38207566850400565,
10
+ "rouge1": 0.3373954886971943,
11
+ "rouge2": 0.18428324959065878,
12
+ "rougeL": 0.2341310217806067,
13
+ "accuracy": 0.40888382687927105,
14
+ "completeness": 0.5930414386239249,
15
+ "hallucination": 0.08864426419466975,
16
+ "utilization": 0.516260162601626,
17
+ "numerical_accuracy": 0.3073351903435469
18
  }
19
  },
20
  "config": {
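
All result files in this commit share the same layout: a top-level `results` object with `retrieval` and `generation` groups, each mapping metric names to scores. As a rough illustration of how the old and new versions of one file could be compared, here is a minimal sketch. The helper name `metric_deltas` is hypothetical and not part of the repository; the sample values are the retrieval scores from the `omnieval-auto/bge-large-zh_qwen2-72b` hunk.

```python
import json
from typing import Any, Dict


def metric_deltas(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, float]:
    """Return new-minus-old differences for metrics present in both files."""
    deltas: Dict[str, float] = {}
    old_groups = old.get("results", {})
    for group, metrics in new.get("results", {}).items():
        old_metrics = old_groups.get(group, {})
        for name, value in metrics.items():
            if name in old_metrics:
                deltas[f"{group}.{name}"] = value - old_metrics[name]
    return deltas


# Retrieval scores before and after this commit (bge-large-zh_qwen2-72b, omnieval-auto).
old = json.loads(
    '{"results": {"retrieval": {"mrr": 0.3097634381445468, "map": 0.30402197247127166}}}'
)
new = json.loads(
    '{"results": {"retrieval": {"mrr": 0.3865492275960769, "map": 0.37771288462347935}}}'
)

for key, delta in metric_deltas(old, new).items():
    print(f"{key}: {delta:+.4f}")
```

In practice the two dictionaries would come from `json.load` on the file at each revision; inline strings are used here only to keep the sketch self-contained.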
src/about.py CHANGED
@@ -43,118 +43,30 @@ TITLE = """<h1 align="center" id="space-title">🏅 OmniEval Leaderboard</h1>"""
43
 
44
  # What does your leaderboard evaluate?
45
  INTRODUCTION_TEXT = """
46
- <div align="center">OmniEval: Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</div>
47
- """
48
-
49
- # Which evaluations are you running? how can people reproduce what you have?
50
- LLM_BENCHMARKS_TEXT = f"""
51
- # <div align="center">OmniEval: Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</div>
52
-
53
-
54
  <div align="center">
55
- <!-- <a href="https://arxiv.org/abs/2405.13576" target="_blank"><img src=https://img.shields.io/badge/arXiv-b5212f.svg?logo=arxiv></a> -->
56
- <!-- <a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Datasets-27b3b4.svg></a> -->
57
- <!-- <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Checkpoint-5fc372.svg></a> -->
58
- <!-- <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Checkpoint-b181d9.svg></a> -->
59
- <a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-27b3b4></a>
60
- <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-5fc372></a>
61
- <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-b181d9></a>
62
- <a href="https://huggingface.co/spaces/NLPIR-RAG/OmniEval" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-blue></a>
63
- <a href="https://github.com/RUC-NLPIR/FlashRAG/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green"></a>
64
- <a><img alt="Static Badge" src="https://img.shields.io/badge/made_with-Python-blue"></a>
65
  </div>
 
66
 
67
- <!-- [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-blue)](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) -->
68
-
69
- <h4 align="center">
70
-
71
- <p>
72
- <a href="#wrench-installation">Installation</a> |
73
- <!-- <a href="#sparkles-features">Features</a> | -->
74
- <a href="#rocket-quick-start">Quick-Start</a> |
75
- <a href="#bookmark-license">License</a> |
76
- <a href="#star2-citation">Citation</a>
77
-
78
- </p>
79
-
80
- </h4>
81
-
82
- <!--
83
- With FlashRAG and provided resources, you can effortlessly reproduce existing SOTA works in the RAG domain or implement your custom RAG processes and components. -->
84
-
85
-
86
- ## 🔧 Installation
87
- `conda env create -f environment.yml && conda activate finrag`
88
-
89
- <!-- ## ✨ Features
90
- 1. -->
91
- ## 🚀 Quick-Start
92
- Notion:
93
- 1. The code run path is `./OpenFinBench`
94
- 2. We provide our auto-generated evaluation dataset in <a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-27b3b4></a>
95
- ### 1. Build the Retrieval Corpus
96
- ```
97
- # cd OpenFinBench
98
- sh corpus_builder/build_corpus.sh # Please see the annotation inner the bash file to set parameters.
99
- ```
100
- ### 2. Generate Evaluation Data Samples
101
- 1. Generate evaluation instances
102
- ```
103
- # cd OpenFinBench
104
- sh data_generator/generate_data.sh
105
- ```
106
- 2. Filter (quality inspection) evaluation instances
107
- ```
108
- sh data_generator/generate_data_filter.sh
109
- ```
110
- ### 3. Inference Your Models
111
- ```
112
- # cd OpenFinBench
113
- sh evaluator/inference/rag_inference.sh
114
- ```
115
- ### 4. Evaluate Your Models
116
- #### (a) Rule-based Evaluation
117
- ```
118
- # cd OpenFinBench
119
- sh evaluator/judgement/judger.sh # by setting judge_type="rule"
120
- ```
121
- #### (b) Model-based Evalution
122
- We propose five model-based metric: accuracy, completeness, utilization, numerical_accuracy, and hallucination. We have trained two models from Qwen2.5-7B by the lora strategy and human-annotation labels to implement model-based evaluation.
123
-
124
- Note that the evaluator of hallucination is different from other four. Their model checkpoint can be load from the following huggingface links:
125
- 1. The evaluator for hallucination metric: <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-b181d9></a>
126
- 2. The evaluator for other metric: <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-5fc372></a>
127
-
128
-
129
-
130
- To implement model-based evaluation, you can first set up two vllm servers by the following codes:
131
- ```
132
- ```
133
-
134
- Then conduct the model-based evaluate using the following codes, (change the parameters inner the bash file).
135
- ```
136
- sh evaluator/judgement/judger.sh
137
- ```
138
-
139
- ## 🔖 License
140
 
141
- OmniEval is licensed under the [<u>MIT License</u>](./LICENSE).
142
 
143
- ## 🌟 Citation
144
- The paper is waiting to be released!
 
 
145
 
146
- <!-- # Check Infos
147
- ## Pipeline
148
- 1. Build corpus
149
- 2. Data generation
150
- 3. RAG inference
151
- 4. Result evaluatioin
152
 
153
- ## Code
154
- 1. remove "baichuan"
155
- 2. remove useless annotation -->
156
 
 
157
 
 
158
  """
159
 
160
  EVALUATION_QUEUE_TEXT = """
@@ -189,4 +101,13 @@ If everything is done, check you can launch the EleutherAIHarness on your model
189
 
190
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
191
  CITATION_BUTTON_TEXT = r"""
192
  """
 
43
 
44
  # What does your leaderboard evaluate?
45
  INTRODUCTION_TEXT = """
46
  <div align="center">
47
+ Please contact us if you would like to submit your model to this leaderboard. Email: wangshuting@ruc.edu.cn
48
+ 如果您想将您的模型提交到此排行榜,请联系我们。邮箱:wangshuting@ruc.edu.cn
49
  </div>
50
+ """
51
 
52
+ # Which evaluations are you running? how can people reproduce what you have?
53
+ LLM_BENCHMARKS_TEXT = """
54
+ # Leaderboard Information
 
55
 
56
+ We introduce **OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain**, an omnidirectional and automatic RAG benchmark for the financial domain. The benchmark is characterized by its multi-dimensional evaluation framework, which includes:
57
 
58
+ 1. a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios;
59
+ 2. a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations on generated instances;
60
+ 3. a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline;
61
+ 4. robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator.
62
 
63
+ Useful Links: 📝 [Paper](https://arxiv.org/abs/2412.13018) • 🤗 [Hugging Face](https://huggingface.co/collections/RUC-NLPIR/omnieval-67629ccbadd3a715a080fd25) • 🧩 [GitHub](https://github.com/RUC-NLPIR/OmniEval)
64
 
65
+ We trained two evaluator models from Qwen2.5-7B, using LoRA and human-annotated labels, to implement model-based evaluation. Note that the hallucination evaluator is a separate model from the one used for the other four metrics.
66
 
67
+ This repo provides the evaluator for the metrics other than hallucination.
68
 
69
+ # 🌟 Citation
70
  """
71
 
72
  EVALUATION_QUEUE_TEXT = """
 
101
 
102
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
103
  CITATION_BUTTON_TEXT = r"""
104
+ @misc{wang2024omnievalomnidirectionalautomaticrag,
105
+ title={OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain},
106
+ author={Shuting Wang and Jiejun Tan and Zhicheng Dou and Ji-Rong Wen},
107
+ year={2024},
108
+ eprint={2412.13018},
109
+ archivePrefix={arXiv},
110
+ primaryClass={cs.CL},
111
+ url={https://arxiv.org/abs/2412.13018},
112
+ }
113
  """