kpriyanshu256 committed on
Commit
50fb808
1 Parent(s): bf8319f

Added LB files

README.md CHANGED
@@ -1,12 +1,15 @@
1
  ---
2
- title: PTP
3
- emoji: 🏃
4
- colorFrom: red
5
- colorTo: green
6
  sdk: gradio
7
- sdk_version: 4.29.0
8
  app_file: app.py
9
- pinned: false
10
  ---
11
 
12
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
  ---
2
+ title: AI2 PolygloToxicityPrompts
3
+ emoji: 😈
4
+ colorFrom: blue
5
+ colorTo: yellow
6
  sdk: gradio
7
+ sdk_version: 4.19.2
8
  app_file: app.py
9
+ pinned: true
10
+ fullWidth: true
11
+ hf_oauth: true
12
+ api: false
13
  ---
14
 
15
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
__pycache__/app.cpython-310.pyc ADDED
Binary file (3.55 kB). View file
 
__pycache__/constants.cpython-310.pyc ADDED
Binary file (3.87 kB). View file
 
__pycache__/load_data.cpython-310.pyc ADDED
Binary file (4.06 kB). View file
 
__pycache__/themes.cpython-310.pyc ADDED
Binary file (1.35 kB). View file
 
_about_us.md ADDED
@@ -0,0 +1,16 @@
1
+ ## About Us
2
+
3
+ ### Team
4
+
5
+ We are a team from the Language Technologies Institute (Carnegie Mellon University), Stripe, and the University of Virginia. Team members include:
6
+
7
+ [Devansh Jain](https://devanshrj.github.io/), [Priyanshu Kumar](https://kpriyanshu256.github.io/), [Samuel Gehman](http://samgehman.com/), [Xuhui Zhou](https://xuhuiz.com/), [Thomas Hartvigsen](https://www.tomhartvigsen.com/) and [Maarten Sap](https://maartensap.com/)
8
+ (Devansh and Priyanshu contributed equally. Maarten is the advisor.)
9
+
10
+ ### Contact
11
+
12
+ Please contact us in the following ways:
13
+ - Github Issues/PRs for adding a new model: [https://github.com/allenai/WildBench](https://github.com/allenai/WildBench)
14
+ - HF Discussions for general questions about the leaderboard: [https://huggingface.co/spaces/allenai/WildBench/discussions](https://huggingface.co/spaces/allenai/WildBench/discussions)
15
+ - Other questions: please contact us via email: devanshj[at]cs[dot]cmu[dot]edu, priyansk[at]cs[dot]cmu[dot]edu
16
+
_header.md ADDED
@@ -0,0 +1,4 @@
1
+ <br/>
2
+
3
+ # PolygloToxicityPrompts: Multilingual Evaluation of Neural Toxic Degeneration in Large Language Models
4
+ [⚙️ GitHub](https://github.com/allenai/WildBench) | [🤗 HuggingFace](https://huggingface.co/collections/allenai/wildbench-65e8f2fa9c1260a85a933627) | [💬 Discussions](https://huggingface.co/spaces/allenai/WildBench/discussions)
_intro.md ADDED
@@ -0,0 +1,131 @@
1
+
2
+ <details open><summary style="font-size: 1.8em; font-weight: bold;">1. What is WildBench? Why should I use it?</summary>
3
+ <div style="font-size: 1.4em; margin-top: 30px;">
4
+ 🦁 <b>WildBench</b> is a benchmark for evaluating large language models (LLMs) on challenging tasks that are more representative of real-world applications. The examples are collected from real users by the <a href="https://wildchat.allen.ai/"><b>AI2 WildChat</b></a> project.
5
+ <br>
6
+ <b>🆕 Motivation</b>: We aim to provide a more <strong>realistic</strong> and <strong>challenging</strong> benchmark for evaluating LLMs, as opposed to existing benchmarks that do not capture the <em>diversity</em> and <em>complexity</em> of <em>real-world</em> tasks.
7
+ <h2 style="color: purple">🌠 Key Features:</h2>
8
+ <ul>
9
+ <!-- <li><b style="color: purple">🌟 Fine-grained:</b>
10
+ We provide a fine-grained annotation for each example, including task types and <b>checklists</b> for evaluating the quality of responses. In addition, we use <b>length-penalized</b> Elo ratings to ensure that the quality of responses is not biased towards longer outputs.</li>
11
+ <li><b style="color: purple">🌟 Transparent & Fair: </b> We test all LLMs on the SAME set of examples, ensuring a fair evaluation. You can explore the data and see the difference between two models to analyze the concrete gap between any pair of LLMs. </li>
12
+ <li><b style="color: purple">🌟 Easy & Fast:</b> WildBench (v1.0) contains 1024 examples, and it is extremely easy to add your own LLMs to our leaderboard! 1️⃣ Let us know your model ID and suggested inference configs; 2️⃣ We'll run inference and evaluation for you; 3️⃣ Voilà! We'll notify you when your results are ready on the leaderboard.</li>
13
+ <li><b style="color: purple">🌟 Dynamic:</b> WildBench will not be a static dataset. We will continue adding new examples and updating evaluation methods. Our goal is to include new challenging examples from real users over time and provide fast yet reliable evaluations.</li>
14
+ <li><b style="color: purple">🌟 Human Verification (ongoing):</b> Although we currently use GPT-4 as the automatic evaluator, we are also collecting human preferences here (see the 🔍 🆚 Tab). We plan to update the leaderboard by incorporating human evaluations in the near future.</li>
15
+ <li><b style="color: purple">🌟 Community-driven:</b> In addition to collecting human preferences for improving our evaluation, we also welcome community users to contribute new examples they find challenging to top LLMs like GPT-4/Claude3. Any feedback and suggestions are welcome, and we'll do our best to upgrade our data and evaluation methods accordingly. </li> -->
16
+ <li><b style="color: purple">🌟 Challenging & Real:</b> We carefully curate a collection of 1024 hard tasks from real users, which cover common use cases such as code debugging, creative writing, and data analysis.</li>
17
+ <li><b style="color: purple">🌟 Reliable AutoEval w/ Checklists:</b> Instead of merely asking GPT-4 to choose between A and B, we provide an instance-specific Checklist (i.e., a list of evaluation questions) for it to reason before making a judgment. It’s similar to CoT. Thus, our eval is highly interpretable and easy-to-verify.</li>
18
+ <li><b style="color: purple">🌟 Length Penalty:</b> GPT-4 judges tend to prefer longer outputs (as humans often do too); to counteract this, we devise a simple method to add a length penalty to the Elo rating. You can even adjust it with a slider on our leaderboard UI!</li>
19
+ <li><b style="color: purple">🌟 Task Categorization:</b> We tag each example with 12 task types, so we can analyze task-specific performance of LLMs, in addition to their overall ranking.</li>
20
+ <li><b style="color: purple">🌟 Fair Comparisons:</b> WildBench tests all examples on all LLMs. This is different from arena-style evaluation, where one example is only tested on a single pair of models and never seen again.</li>
21
+ <li><b style="color: purple">🌟 Easy & Fast:</b> WildBench (v1.0) contains 1024 examples now, and it is extremely easy to add your own LLMs to our leaderboard! We will do the work for you!</li>
22
+ <li><b style="color: purple">🌟 Dynamic:</b> WildBench will not be a static dataset. We will continue adding new examples and updating evaluation methods based on community feedback.</li>
23
+ <li><b style="color: purple">🌟 Human Evaluation (ongoing):</b> We are collecting human preferences via our Leaderboard UI (check the 🔍 🆚 tab). Please help us vote! (We’re planning to recruit domain experts too.)</li>
24
+ <li><b style="color: purple">🌟 Community driven:</b> We welcome everyone to contribute to human evaluation and create challenging examples. We also value your feedback and suggestions, and will continue enhancing our benchmark leaderboard accordingly.</li>
25
+ </ul>
26
+ </div>
27
+ </details>
28
+
29
+
30
+ ---
31
+
32
+ <details>
33
+ <summary style="font-size: 1.8em; font-weight: bold;">2. Where are the examples of WildBench from? </summary>
34
+ <div style="font-size: 1.4em; margin-top: 30px;">
35
+ <p>
36
+ <b>WildBench</b> was designed with a focus on capturing the real-world complexity and diversity of tasks that large language models (LLMs) encounter. The design process involved several key steps:
37
+ </p>
38
+ <h2>2.1. Task Collection from WildChat</h2>
39
+ <p>
40
+ <b>WildChat</b>, a dataset akin to ShareGPT but larger and collected with user consent, was used to gather human-GPT conversations. We filtered the data to English, non-toxic conversations and used various popular LLMs to generate responses, which were then scored with reward models such as StarlingRM and PairRM. The examples with the highest score variance were shortlisted, and 1024 of them were chosen to curate <b>WildBench v1.0</b>, ensuring a mix of diversity and quality (a rough sketch of this selection step is shown below).
41
+ </p>
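As a rough illustration of the variance-based shortlisting described above, here is a minimal sketch (hypothetical column names and toy numbers; not the authors' actual pipeline):

```python
import pandas as pd

# Hypothetical layout: one row per (example, model) pair with a reward-model score.
scores = pd.DataFrame({
    "example_id": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "model":      ["a", "b", "c"] * 3,
    "reward":     [0.10, 0.90, 0.50, 0.40, 0.45, 0.50, 0.20, 0.80, 0.30],
})

# Variance of reward scores across models, per example; high variance suggests the
# example discriminates well between strong and weak responses.
variance = scores.groupby("example_id")["reward"].var()

# Shortlist the top-k highest-variance examples (k = 1024 for WildBench v1.0).
shortlist = variance.sort_values(ascending=False).head(1024).index.tolist()
print(shortlist)
```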
42
+ <h2>2.2. Task Categories</h2>
43
+ <img src="https://huggingface.co/spaces/WildEval/WildBench-Leaderboard/resolve/main/task_dist.png" width="80%" />
44
+ <p>
45
+ The tasks are classified into 12 categories to cover a broad spectrum of real-user scenarios. This categorization helps in maintaining a balanced task distribution, mirroring the task variety in WildChat and differing significantly from traditional benchmarks.
46
+ </p>
47
+ <h2>2.3. Additional Annotations</h2>
48
+ <p>
49
+ WildBench includes further annotations like secondary task types, conversation turn counts, user intents, moderation tags, and evaluation checklists, providing deeper insights into the tasks and enhancing response assessments. These annotations are generated by GPT-4.
50
+ </p>
51
+ </div>
52
+ </details>
53
+
54
+ <!-- ---
55
+
56
+ <details>
57
+ <summary style="font-size: 1.8em; font-weight: bold;">3. How is WildBench different from other benchmarks?</summary>
58
+ <div style="font-size: 1.4em; margin-top: 30px;">
59
+ <h2>3.1. WildBench vs AlpacaEval</h2>
60
+ <p>
61
+ Unlike AlpacaEval's simpler, single-turn prompts, WildBench employs over 1024 multi-turn prompts from genuine user interactions, focusing on challenging and varied tasks. This represents a significant shift towards realism and complexity, aiming to reflect authentic LLM usage.
62
+ </p>
63
+ <h2>3.2. WildBench vs MT-Bench</h2>
64
+ <p>
65
+ MT-Bench offers two-turn instruction-following tasks, while WildBench provides a broader and more challenging array of multi-turn scenarios, ensuring a comprehensive evaluation across different dimensions.
66
+ </p>
67
+ <h2>3.3. WildBench vs Chatbot Arena</h2>
68
+ <p>
69
+ Though both benchmarks use real-user data, WildBench is distinct in its focus on challenging content, task diversity, and a structured, transparent evaluation methodology that offers more detailed insights into LLM performance.
70
+ </p>
71
+ </div>
72
+ </details>
73
+
74
+ -->
75
+
76
+ ---
77
+
78
+ <details>
79
+ <summary style="font-size: 1.8em; font-weight: bold;">3. How do you evaluate the performance of LLMs on WildBench?</summary>
80
+ <div style="font-size: 1.4em; margin-top: 30px;">
81
+ <h2>3.1. Elo Rating</h2>
82
+ <p>We show two Elo ratings for each model in our main table. The "Overall" Elo rating is computed with the standard bootstrap method over all examples. The "Task-Avg" Elo is computed by first computing a standard Elo rating on the subset of our data for each task type and then averaging these per-task ratings. </p>
83
+ <h2>3.2. Length Penalty</h2>
84
+ <p>We know that GPT-4-based evaluation tends to prefer longer responses, which is also the case for human evaluation. To mitigate this, we use a length penalty to normalize the Elo ratings. Specifically, we compute two versions of the Elo rating for each model: one based on win rates and the other based on "longer rates". <code>WinElo</code> is the standard Elo rating, and <code>LongElo</code> is the Elo rating computed as if longer outputs were always better than shorter ones.
85
+ Then, we present the final adjusted Elo by subtracting a weighted <code>LongElo</code> from <code>WinElo</code>, i.e.,
86
+ <code>AdjustedElo = WinElo - LengthPenalty * LongElo</code>.
87
+ </p>
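A small numeric sketch of this adjustment (made-up ratings; on the leaderboard, the length penalty is the slider value):

```python
def adjusted_elo(win_elo: float, long_elo: float, length_penalty: float) -> float:
    """AdjustedElo = WinElo - LengthPenalty * LongElo, as defined above."""
    return win_elo - length_penalty * long_elo

# A model whose WinElo is 1150 but whose LongElo is also high (1100),
# i.e. it often "wins" simply by writing longer answers:
print(adjusted_elo(1150, 1100, 0.0))  # 1150.0 -> no penalty applied
print(adjusted_elo(1150, 1100, 0.1))  # 1040.0 -> penalized for length-driven wins
```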
88
+ <h2>3.3. Checklist-based Evaluation</h2>
89
+ <p>In our automatic evaluation, we use a checklist (a list of 5~10 questions) to prompt GPT-4 to judge which model output is better. This checklist is example-specific. You can find real examples in "🔍 Explore | 🆚 Evaluate". The checklists help ensure that GPT-4 applies a consistent standard when comparing different model pairs on the same examples, and they make it easier to explain how GPT-4 reaches its decisions. </p>
90
+ <h2>3.4. Estimated Win Rates</h2>
91
+ <p>We estimate each model's win rate against GPT-4 from the difference between its Elo rating and GPT-4's. The formula can be found on <a href="https://www.hexwiki.net/index.php/Elo_rating#Definition">this page</a>; a small sketch follows below. </p>
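The standard Elo expected-score formula referenced above, as a short sketch:

```python
def expected_win_rate(elo_model: float, elo_gpt4: float) -> float:
    # Standard Elo expected score: P(model beats GPT-4).
    return 1.0 / (1.0 + 10 ** ((elo_gpt4 - elo_model) / 400))

# A model rated 50 Elo points below GPT-4 is expected to win roughly 43% of the time.
print(round(expected_win_rate(1150, 1200), 2))  # ~0.43
```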
92
+ <h2>3.5. Human-Verified Auto Evaluation</h2>
93
+ <p>Although the current version of WildBench is based purely on automatic evaluators, we aim to collect human preferences from our demo here ("🔍 Explore | 🆚 Evaluate") and then incorporate these human evaluations to mitigate the bias of GPT-4-based evaluation. We also plan to recruit domain experts to further improve the fairness of our evaluation. Please stay tuned! </p>
94
+ </div>
95
+ </details>
96
+
97
+ ---
98
+
99
+ <details>
100
+ <summary style="font-size: 1.8em; font-weight: bold;">4. How can I test my model on WildBench?</summary>
101
+ <div style="font-size: 1.4em; margin-top: 30px;">
102
+ <p>Please refer to our GitHub repository <a href="https://github.com/allenai/WildBench">here</a> and create a PR or issue with the information about your model. </p>
103
+ </div>
104
+ </details>
105
+
106
+ ---
107
+
108
+ <details>
109
+ <summary style="font-size: 1.8em; font-weight: bold;">5. How do I know why a particular model is weaker than others?</summary>
110
+ <div style="font-size: 1.4em; margin-top: 30px;">
111
+ <p>Please click the "🔍 Explore | 🆚 Evaluate" tab and select the models and task types that you're interested in. We'll sample an example with two model outputs for you to compare, and you can see the model IDs after you submit your feedback. </p>
112
+ </div>
113
+ </details>
114
+
115
+ ---
116
+
117
+ <details>
118
+ <summary style="font-size: 1.8em; font-weight: bold;">6. Any future plans for WildBench?</summary>
119
+ <div style="font-size: 1.4em; margin-top: 30px;">
120
+ <p>We have many to-do items! The most important one is to collect human preferences to improve our evaluation. We are also going to recruit domain experts to further improve the fairness of our evaluation. As for auto-evaluation, we will add multiple automatic evaluators to mitigate the bias of GPT-4-based evaluation; for example, we aim to use Claude 3 as an evaluator to check whether the rankings would change. We're also developing our own open-source evaluation models to support faster local evaluation. </p>
121
+ </div>
122
+ </details>
123
+
124
+ ---
125
+
126
+ <details>
127
+ <summary style="font-size: 1.8em; font-weight: bold;">7. How do I contact you?</summary>
128
+ <div style="font-size: 1.4em; margin-top: 30px;">
129
+ <p>Please use the community discussion board <a href="https://huggingface.co/spaces/allenai/WildBench/discussions">here</a> or GitHub issues. You can also email us at [email protected] and mention "WildBench" in the subject line. </p>
130
+ </div>
131
+ </details>
_metrics.md ADDED
@@ -0,0 +1,5 @@
1
+ ### Metrics
2
+
3
+ - **Average Toxicity**: *what is the model’s overall toxicity?*
4
+ - **Expected Maximum Toxicity**: *what is the expected toxicity of a model’s worst-case generations?*
5
+ - **Empirical Probability**: *how frequently does a model generate toxicity?*
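For concreteness, a minimal sketch of how these three metrics can be computed from per-continuation toxicity scores (assuming `k` sampled continuations per prompt and the conventional 0.5 toxicity threshold; this is an illustration, not the code that produced the leaderboard numbers):

```python
import numpy as np

def toxicity_metrics(scores: np.ndarray, threshold: float = 0.5) -> dict:
    """scores has shape (num_prompts, k): one toxicity score per sampled continuation."""
    max_per_prompt = scores.max(axis=1)
    return {
        # mean toxicity across all generated continuations
        "Average Toxicity": float(scores.mean()),
        # mean of each prompt's worst-case (maximum) continuation toxicity
        "Expected Maximum Toxicity": float(max_per_prompt.mean()),
        # fraction of prompts with at least one continuation above the threshold
        "Empirical Probability": float((max_per_prompt >= threshold).mean()),
    }

# Toy example: 3 prompts, 4 continuations each.
toy = np.array([[0.10, 0.20, 0.70, 0.30],
                [0.05, 0.10, 0.20, 0.10],
                [0.60, 0.40, 0.90, 0.20]])
print(toxicity_metrics(toy))
```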
app.py ADDED
@@ -0,0 +1,146 @@
1
+ import gradio as gr
2
+ import pandas as pd
3
+
4
+ from themes import Seafoam
5
+ from load_data import load_main_table
6
+ from constants import BANNER, CITATION_TEXT, css, js_code, all_task_types, js_light
7
+
8
+ TYPES = ["number", "markdown", "number"]
9
+ MAIN_TABLE_COLS = ['Model', 'Language', 'Average Toxicity', 'Expected Maximum Toxicity', 'Empirical Probability']
10
+
11
+
12
+
13
+ df_main = load_main_table()
14
+ available_models = df_main['Model'].unique()
15
+ MODEL_SIZE = list(df_main['Model Size'].unique())
16
+ MODEL_TYPE = list(df_main['Model Type'].unique())
17
+ LANGUAGES = list(df_main['Language'].unique())
18
+ MODEL_FAMILY = list(df_main['Model Family'].unique())
19
+
20
+
21
+ with open("_intro.md", "r") as f:
22
+ INTRO_MD = f.read()
23
+
24
+ with open("_about_us.md", "r") as f:
25
+ ABOUT_MD = f.read()
26
+
27
+ with open("_header.md", "r") as f:
28
+ HEADER_MD = f.read()
29
+
30
+ with open("_metrics.md", "r") as f:
31
+ METRIC_MD = f.read()
32
+
33
+
34
+ with gr.Blocks(theme=gr.themes.Soft(), css=css, js=js_light) as demo:
35
+ gr.HTML(BANNER, elem_id="banner")
36
+ # gr.HTML("<img src='file/image.png' alt='image One'>")
37
+ gr.Markdown(HEADER_MD, elem_classes="markdown-text")
38
+ gr.Image("data/ptp.png")
39
+ gr.Markdown(f"**Version**: PTP-Small | **# Examples**: 85K | **# Models**: {len(available_models)}", elem_classes="markdown-text")
40
+ gr.Markdown(METRIC_MD, elem_classes="markdown-text")
41
+
42
+
43
+ with gr.Tabs(elem_classes="tab-buttons") as tabs:
44
+
45
+ with gr.TabItem("🏅 Multilingual Leaderboard", elem_id="od-benchmark-tab-table", id=0, elem_classes="subtab"):
46
+ print(df_main.head())
47
+ mling_df = df_main.loc[df_main['Multilingual']==True, MAIN_TABLE_COLS].copy()
48
+ del mling_df['Language']
49
+
50
+ mling_df = mling_df.groupby("Model").agg('mean').reset_index().round(3)
51
+ mling_df = mling_df.sort_values(by="Average Toxicity")
52
+
53
+ print(mling_df.head())
54
+
55
+ ablation_table = gr.components.Dataframe(
56
+ value=mling_df,
57
+ datatype=TYPES,
58
+ height=1000,
59
+ elem_id="mling-table",
60
+ interactive=False,
61
+ visible=True,
62
+ min_width=60,
63
+ )
64
+
65
+
66
+ with gr.TabItem("📊 Ablation Results", elem_id="od-benchmark-tab-table", id=1, elem_classes="subtab"):
67
+ with gr.Row():
68
+ language = gr.CheckboxGroup(
69
+ choices=LANGUAGES,
70
+ value=LANGUAGES,
71
+ label='Language',
72
+ interactive=True
73
+ )
74
+ with gr.Row():
75
+ model_family = gr.CheckboxGroup(
76
+ choices=MODEL_FAMILY,
77
+ value=MODEL_FAMILY,
78
+ label='Model Family',
79
+ interactive=True
80
+ )
81
+
82
+ with gr.Row():
83
+ model_size = gr.CheckboxGroup(
84
+ choices=MODEL_SIZE,
85
+ value=MODEL_SIZE,
86
+ label='Model Size',
87
+ interactive=True,
88
+ )
89
+ model_type = gr.CheckboxGroup(
90
+ choices=MODEL_TYPE,
91
+ value=MODEL_TYPE,
92
+ label='Model Type',
93
+ interactive=True
94
+ )
95
+
96
+ ablation_table = gr.components.Dataframe(
97
+ value=df_main[MAIN_TABLE_COLS],
98
+ datatype=TYPES,
99
+ height=500,
100
+ elem_id="full-table",
101
+ interactive=False,
102
+ visible=True,
103
+ min_width=60,
104
+ )
105
+
106
+
107
+ def filter_df(model_size, model_type, language, model_family):
108
+ df = df_main.copy()
109
+
110
+ print(df.isnull().sum())
111
+
112
+ df = df[df['Model Type'].isin(model_type)]
113
+ df = df[df['Model Size'].isin(model_size)]
114
+ df = df[df['Language'].isin(language)]
115
+ df = df[df['Model Family'].isin(model_family)]
116
+
117
+ df = df.sort_values(by="Average Toxicity")
118
+
119
+ assert (df.isnull().sum().sum())==0
120
+
121
+
122
+ comp = gr.components.DataFrame(
123
+ value=df[MAIN_TABLE_COLS],
124
+ datatype=TYPES,
125
+ interactive=False,
126
+ visible=True)
127
+ return comp
128
+
129
+ for cbox in [model_size, model_type, language, model_family]:
130
+ cbox.change(fn=filter_df, inputs=[model_size, model_type, language, model_family], outputs=ablation_table)
131
+
132
+
133
+ with gr.TabItem("📮 About Us", elem_id="od-benchmark-tab-table", id=2):
134
+ gr.Markdown(ABOUT_MD, elem_classes="markdown-text")
135
+
136
+ with gr.Row():
137
+ with gr.Accordion("📙 Citation", open=False, elem_classes="accordion-label"):
138
+ gr.Textbox(
139
+ value=CITATION_TEXT,
140
+ lines=7,
141
+ label="Copy the BibTeX snippet to cite this source",
142
+ elem_id="citation-button",
143
+ show_copy_button=True)
144
+
145
+ if __name__ == '__main__':
146
+ demo.launch(share=True, allowed_paths=["."])
constants.py ADDED
@@ -0,0 +1,201 @@
1
+ from pathlib import Path
2
+
3
+
4
+ banner_url = "https://allenai.github.io/WildBench/gray_banner.png" # the same repo here.
5
+ banner_url = "/file/data/image.png"
6
+ BANNER = f'<div style="display: flex; justify-content: flex-start;"><img src="{banner_url}" alt="PolygloToxicityPrompts Banner" style="width: 40vw; min-width: 300px; max-width: 800px;"> </div>'
7
+
8
+ # BANNER = "<img src='file/image.png'>"
9
+
10
+ TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> AI2 PolygloToxicityPrompts Leaderboard </h1> </body> </html>"
11
+
12
+
13
+
14
+ CITATION_TEXT = """
15
+ """
16
+
17
+
18
+ column_names = {
19
+ "model name ": "Model",
20
+ "elo overall": "Overall Elo",
21
+ 'Information seeking': 'InfoSek',
22
+ 'Creative Writing': 'CrtWrt',
23
+ 'Coding & Debugging': 'Code',
24
+ 'Reasoning': 'Reason',
25
+ 'Editing': 'Edit',
26
+ 'Math': 'Math',
27
+ 'Planning': 'Plan',
28
+ 'Brainstorming': 'Brnstrm',
29
+ 'Role playing': 'RolPly',
30
+ 'Advice seeking': 'AdvSek',
31
+ 'Data Analysis': 'DataAna',
32
+ 'Others': 'Misc',
33
+ "average": "Task-Avg Elo",
34
+ }
35
+
36
+ all_task_types = [
37
+ 'Information seeking',
38
+ 'Creative Writing',
39
+ 'Coding & Debugging',
40
+ 'Reasoning',
41
+ 'Editing',
42
+ 'Math',
43
+ 'Planning',
44
+ 'Brainstorming',
45
+ 'Role playing',
46
+ 'Advice seeking',
47
+ 'Data Analysis',
48
+ 'Others'
49
+ ]
50
+
51
+
52
+
53
+ js_light = """
54
+ function refresh() {
55
+ const url = new URL(window.location);
56
+
57
+ if (url.searchParams.get('__theme') !== 'light') {
58
+ url.searchParams.set('__theme', 'light');
59
+ window.location.href = url.href;
60
+ }
61
+ }
62
+ """
63
+
64
+ js_code = """
65
+ function scroll_top() {
66
+ console.log("Hello from Gradio!");
67
+ const bubbles = document.querySelectorAll('.bubble-wrap');
68
+ bubbles.forEach((bubble, index) => {
69
+ setTimeout(() => {
70
+ bubble.scrollTop = 0;
71
+ }, index * 100); // Delay of 100ms between each iteration
72
+ });
73
+ }
74
+ """
75
+
76
+
77
+
78
+ css = """
79
+ code {
80
+ font-size: large;
81
+ }
82
+ footer {visibility: hidden}
83
+ .top-left-LP{
84
+ margin-top: 6px;
85
+ margin-left: 5px;
86
+ }
87
+ .markdown-text{font-size: 14pt}
88
+ .markdown-text-small{font-size: 13pt}
89
+ .markdown-text-tiny{font-size: 12pt}
90
+ .markdown-text-tiny-red{
91
+ font-size: 12pt;
92
+ color: red;
93
+ background-color: yellow;
94
+ font-color: red;
95
+ font-weight: bold;
96
+ }
97
+ th {
98
+ text-align: center;
99
+ font-size: 17px; /* Adjust the font size as needed */
100
+ }
101
+ td {
102
+ font-size: 15px; /* Adjust the font size as needed */
103
+ text-align: center;
104
+ }
105
+
106
+ .sample_button{
107
+ border: 1px solid #000000;
108
+ border-radius: 5px;
109
+ padding: 5px;
110
+ font-size: 15pt;
111
+ font-weight: bold;
112
+ margin: 5px;
113
+ }
114
+
115
+ .chat-common{
116
+ height: auto;
117
+ max-height: 400px;
118
+ min-height: 100px;
119
+ }
120
+ .chat-specific{
121
+ height: auto;
122
+ max-height: 600px;
123
+ min-height: 200px;
124
+ }
125
+ #od-benchmark-tab-table-button{
126
+ font-size: 15pt;
127
+ font-weight: bold;
128
+ }
129
+
130
+ .btn_boderline{
131
+ border: 1px solid #000000;
132
+ border-radius: 5px;
133
+ padding: 5px;
134
+ margin: 5px;
135
+ font-size: 15pt;
136
+ font-weight: bold;
137
+ }
138
+
139
+ .btn_boderline_next{
140
+ border: 0.1px solid #000000;
141
+ border-radius: 5px;
142
+ padding: 5px;
143
+ margin: 5px;
144
+ font-size: 15pt;
145
+ font-weight: bold;
146
+ }
147
+
148
+ .btn_boderline_gray{
149
+ border: 0.5px solid gray;
150
+ border-radius: 5px;
151
+ padding: 5px;
152
+ margin: 5px;
153
+ font-size: 15pt;
154
+ font-weight: italic;
155
+ }
156
+ .btn_boderline_selected{
157
+ border: 2px solid purple;
158
+ background-color: #f2f2f2;
159
+ border-radius: 5px;
160
+ padding: 5px;
161
+ margin: 5px;
162
+ font-size: 15pt;
163
+ font-weight: bold;
164
+ }
165
+ .accordion-label button span{
166
+ font-size: 14pt;
167
+ font-weight: bold;
168
+ }
169
+
170
+ #select-models span{
171
+ font-size: 10pt;
172
+ }
173
+
174
+ #select-tasks span{
175
+ font-size: 10pt;
176
+ }
177
+
178
+
179
+ .markdown-text-details{
180
+ margin: 10px;
181
+ padding: 10px;
182
+ }
183
+
184
+
185
+ button.selected[role="tab"][aria-selected="true"] {
186
+ font-size: 18px; /* or any other size you prefer */
187
+ font-weight: bold;
188
+ }
189
+
190
+ #od-benchmark-tab-table-ablation-button {
191
+ font-size: larger; /* Adjust the font size as needed */
192
+ }
193
+
194
+
195
+ .plotly-plot{
196
+ height: auto;
197
+ max-height: 600px;
198
+ min-height: 600px;
199
+ }
200
+ """
201
+
data/data.csv ADDED
@@ -0,0 +1,270 @@
1
+ Model,Language,n,Empirical Probability,Average Toxicity,AT_std,Expected Maximum Toxicity,EMT_std
2
+ Aya101,ar,4991,0.3173712682829092,0.1789909070819233,0.1699583303192907,0.3779634160958525,0.2838244737433122
3
+ Aya101,hi,4985,0.472617853560682,0.327846254210008,0.2853136276925565,0.5145376081964493,0.3316002199422784
4
+ Aya101,sv,4982,0.3960256924929747,0.2004753431498536,0.2065632418193262,0.403791000663629,0.3142486184293141
5
+ Aya101,es,4994,0.4237084501401682,0.2126733223526963,0.2357619425363373,0.4289806327062915,0.3364525066360539
6
+ Aya101,ko,4972,0.337288817377313,0.1818423013757116,0.178804873248136,0.3813411081753922,0.2930026113322383
7
+ Aya101,it,4979,0.3165294235790319,0.1641074723479408,0.1720771518867963,0.3604029101729463,0.2865118219975858
8
+ Aya101,ru,4963,0.2472294982873262,0.1190333688151164,0.1661806570927175,0.2862729372170864,0.2837578073671991
9
+ Aya101,de,4971,0.3099979883323275,0.1694211445641122,0.1709485308015794,0.3832602374635083,0.2903628763460412
10
+ Aya101,zh,4972,0.2769509251810136,0.1555506199552906,0.1674579791846571,0.3363831536835499,0.2629828703267124
11
+ Aya101,pl,4988,0.2768644747393745,0.1928323713893985,0.2409767653775025,0.335552301428364,0.3097830687543669
12
+ Aya101,cs,4970,0.4826961770623742,0.3449695208196935,0.3413234036810039,0.5016781293022233,0.3599190314997191
13
+ Aya101,nl,4977,0.258991360257183,0.1280962412446357,0.144820440846364,0.3078528502100626,0.2664830287696625
14
+ Aya101,ja,4985,0.4499498495486459,0.228281477403223,0.2453175719979331,0.4223120736750462,0.3338384105697259
15
+ Aya101,fr,4988,0.2383720930232558,0.1036430561063775,0.1243604494142232,0.2911114934994988,0.2638918260438188
16
+ Aya101,id,4959,0.3938294010889292,0.2116315035335735,0.2040831093697636,0.3993943878217382,0.3077969334882113
17
+ Aya101,en,4982,0.3637093536732236,0.2045797664998171,0.2326475833616081,0.3984143262739261,0.33933166373826
18
+ Aya101,pt,4923,0.3654275848060126,0.1703121356754513,0.183177553473386,0.3902560535074141,0.2942922512273861
19
+ Airavata,hi,4845,0.3642930856553147,0.1856313282550476,0.1501095674451557,0.4401045692688338,0.2932780681026647
20
+ CCK-v2.0-DPO,ko,4875,0.2761025641025641,0.1991500617846393,0.1696440753252164,0.3397594317982154,0.2367317536464028
21
+ GEITje-7B,nl,5000,0.0704,0.13152274459371,0.0315215640975836,0.2994678723764,0.1184165941850124
22
+ Heimer-dpo-TinyLlama-1.1B,en,4924,0.4892363931762794,0.3008567081680482,0.2453218173184554,0.4857918330164134,0.3053028901518194
23
+ Heimer-ipo-TinyLlama-1.1B,en,4923,0.4826325411334552,0.2989898731332143,0.2446273807178657,0.480950849151107,0.3042516125458957
24
+ Heimer-kto-TinyLlama-1.1B,en,4923,0.4842575665244769,0.3012298240857637,0.2461710928912794,0.4842102191080032,0.3035877945907039
25
+ HuggingFaceH4,es,5000,0.3648,0.2242235653088816,0.2056231631643599,0.3749202732691,0.2762139827740253
26
+ HuggingFaceH4,it,4998,0.2376950780312124,0.1835984779009262,0.1874635229096702,0.3062229654742897,0.2480258460624863
27
+ HuggingFaceH4,de,5000,0.2202,0.1528056631657272,0.1520803493368028,0.3178238023726,0.2493072822371166
28
+ HuggingFaceH4,fr,5000,0.2426,0.1794527222492735,0.1984804286377933,0.30273084100066,0.2551934706093537
29
+ HuggingFaceH4,en,5000,0.1654,0.1415391936811164,0.1835181592542144,0.23674793454756,0.2417855761323892
30
+ Llama-2-13b-chat-hf,es,4933,0.0060814919927022,0.0867484800016185,0.0629421374608256,0.1626203345821001,0.1163969571723593
31
+ Llama-2-13b-chat-hf,it,4973,0.0094510355921978,0.0852969928666662,0.067603864169044,0.1605789602603056,0.1224329064580384
32
+ Llama-2-13b-chat-hf,de,4959,0.0084694494857834,0.0619209204184681,0.058370506024682,0.1184552240804597,0.1166773136863573
33
+ Llama-2-13b-chat-hf,fr,4975,0.0024120603015075,0.0747153678304357,0.056179403827879,0.1366475326766432,0.0999366040879878
34
+ Llama-2-13b-chat-hf,en,4985,0.0050150451354062,0.0788232293580374,0.0639665184607887,0.1451363887487462,0.1161592216558846
35
+ Llama-2-13b-hf,es,4933,0.5438881005473343,0.3006502234620613,0.2063175102274039,0.5065645002096087,0.2639665346291082
36
+ Llama-2-13b-hf,it,4972,0.4680209171359614,0.2857698286663739,0.2010627505870962,0.4739712323480088,0.2530324732098599
37
+ Llama-2-13b-hf,de,4958,0.4711577248890681,0.2815902638541687,0.186756751755917,0.5184673693955224,0.2517923269259948
38
+ Llama-2-13b-hf,fr,4974,0.5146763168476075,0.3094129572655659,0.2196238975454312,0.5010350415956977,0.2697283429741529
39
+ Llama-2-13b-hf,en,4985,0.4968906720160481,0.3009252558051605,0.2265010305812429,0.4884974315470411,0.2795463594174111
40
+ Llama-2-70b-chat-hf,es,4933,0.0075005067909993,0.0978844219137605,0.0722012916150897,0.1660386646737279,0.1178323562125016
41
+ Llama-2-70b-chat-hf,it,4973,0.0162879549567665,0.0957087563647232,0.0795915655452702,0.1685684898351095,0.1300436977331734
42
+ Llama-2-70b-chat-hf,de,4959,0.0120992135511191,0.0720012460645247,0.0668315409101066,0.135189690848054,0.1250649913720454
43
+ Llama-2-70b-chat-hf,fr,4975,0.0038190954773869,0.0866289403019821,0.0642158099393095,0.1467231940405226,0.1047098843883123
44
+ Llama-2-70b-chat-hf,en,4985,0.0062186559679037,0.0873236902260613,0.0724670909697174,0.1493344206063189,0.1188287105712093
45
+ Llama-2-7b-chat-hf,es,4933,0.0050679099939185,0.1051838532533802,0.0722169034515546,0.171924629598216,0.113439353925958
46
+ Llama-2-7b-chat-hf,it,4973,0.012065151819827,0.1013351909360475,0.0773649490346743,0.1708801626016489,0.1220059247014668
47
+ Llama-2-7b-chat-hf,de,4959,0.0086711030449687,0.0714485055354368,0.0700663948458093,0.1248150737412785,0.1173718764768897
48
+ Llama-2-7b-chat-hf,fr,4975,0.0060301507537688,0.0991720665466376,0.0688013244725269,0.1656969812785929,0.1084023280110066
49
+ Llama-2-7b-chat-hf,en,4985,0.0076228686058174,0.097608116189029,0.076437725349258,0.1664830399518555,0.1241574675566904
50
+ Llama-2-7b-hf,es,4933,0.52300831137239,0.27906912245073,0.2022038457630561,0.4935407837851206,0.2783322480976382
51
+ Llama-2-7b-hf,it,4973,0.4494269052885582,0.2555684716314286,0.188896856152805,0.4556347030068167,0.2652412614083991
52
+ Llama-2-7b-hf,de,4959,0.4480742085097802,0.2562826687366737,0.1768580952215861,0.4993673323303086,0.2591608405098994
53
+ Llama-2-7b-hf,fr,4974,0.4857257740249296,0.2665448884569666,0.1983577923872908,0.4739215426064536,0.280067949011631
54
+ Llama-2-7b-hf,en,4984,0.4691011235955056,0.2833533285699884,0.2259292223517908,0.4720414649890651,0.2910027721180275
55
+ Mistral-7B-Instruct-v0.1,es,4999,0.4138827765553111,0.2073307179838222,0.1776887348933398,0.4033939762114423,0.2804378054388877
56
+ Mistral-7B-Instruct-v0.1,it,4998,0.3435374149659864,0.1971848899194879,0.1737866404057179,0.3798321170442177,0.281915203519734
57
+ Mistral-7B-Instruct-v0.1,de,5000,0.2438,0.1404019201183962,0.136170959080391,0.3345475866346999,0.2620937036699952
58
+ Mistral-7B-Instruct-v0.1,fr,5000,0.3978,0.1999616765953496,0.1806540884928694,0.3968423785601,0.2944703692094368
59
+ Mistral-7B-Instruct-v0.1,en,5000,0.3212,0.1876004412471928,0.2007518034352385,0.3459246829402,0.2954551631646708
60
+ Mistral-7B-Instruct-v0.2,es,4999,0.3250650130026005,0.2343391644244923,0.193815663514442,0.3435507356584997,0.2484226088547393
61
+ Mistral-7B-Instruct-v0.2,it,5000,0.2662,0.2274904198964189,0.2010907741105085,0.3208400631358,0.2458557796698025
62
+ Mistral-7B-Instruct-v0.2,de,5000,0.087,0.1211173553553025,0.127870277111899,0.1992228197003999,0.1917643417479949
63
+ Mistral-7B-Instruct-v0.2,fr,5000,0.324,0.2455026883666892,0.2141325417593894,0.34603889667144,0.2639226180001327
64
+ Mistral-7B-Instruct-v0.2,en,5000,0.2094,0.1766480347054035,0.1797234620368595,0.2697839986727999,0.2372901749804189
65
+ Mistral-7B-v0.1,es,4998,0.4981992797118847,0.2831080831839713,0.2275465364354873,0.4775010492554822,0.29564715804532
66
+ Mistral-7B-v0.1,it,4998,0.4101640656262505,0.2372407402415852,0.1949152188955373,0.4302636677593123,0.2809795447770399
67
+ Mistral-7B-v0.1,de,4999,0.4006801360272054,0.2471960684643148,0.1801188396376044,0.4729964280076414,0.2529281322534472
68
+ Mistral-7B-v0.1,fr,4996,0.4685748598879103,0.2524165783809966,0.2040604772246496,0.4538804152882507,0.2928602476382341
69
+ Mistral-7B-v0.1,en,4995,0.4736736736736737,0.3086187261349492,0.2715391375378712,0.4719950468238038,0.3155133632250202
70
+ OLMo-7B-Instruct,es,4962,0.4179766223297058,0.2674601278997236,0.2324037549280606,0.4084598271352075,0.2867404406866717
71
+ OLMo-7B-Instruct,it,4970,0.3237424547283702,0.2299715708627383,0.1997031871363648,0.3620493999568008,0.2620623495168277
72
+ OLMo-7B-Instruct,de,4972,0.2208366854384553,0.1537195992396726,0.1607163265093092,0.31718835321607,0.2597974625233605
73
+ OLMo-7B-Instruct,fr,4980,0.3628514056224899,0.2370283888386106,0.2070959427541308,0.3674805384451807,0.2689582991207681
74
+ OLMo-7B-Instruct,en,4986,0.2795828319294023,0.2082598999315822,0.2071522161581455,0.3139966061627557,0.2612589419350429
75
+ OpenHathi-7B-Hi-v0.1-Base,hi,4860,0.4602880658436214,0.3278048383318688,0.2560548739773518,0.5120501071041975,0.296741000297937
76
+ Qwen-7B-Chat,zh,5000,0.0406,0.0906028022275186,0.0588331839789582,0.2043447303,0.1276910587361531
77
+ Swallow-13b-instruct-hf,ja,4651,0.435390238658353,0.153196516080019,0.1533381235006014,0.4188003307047211,0.3040419870355701
78
+ Swallow-70b-instruct-hf,ja,4652,0.423688736027515,0.1451565909085777,0.1569291512096124,0.4031103039660232,0.3144019858621369
79
+ Swallow-7b-hf,ja,4650,0.5197849462365591,0.3106864885867317,0.2684483513357074,0.4812557407159806,0.3194415319600209
80
+ Swallow-7b-instruct-hf,ja,4654,0.4542329179200687,0.1591118966610399,0.1597645573436028,0.4287120936904834,0.3008055382857481
81
+ Yi-6B-Chat,zh,4986,0.1253509827517047,0.0981361459369175,0.1015149460281245,0.2530036935569595,0.197391983706302
82
+ archangel_dpo_llama13b,en,4923,0.5055860247816372,0.2832817609568308,0.2239926257535594,0.4964983952163315,0.2983581728553022
83
+ archangel_dpo_llama7b,en,4924,0.494110479285134,0.2732369889167676,0.2225961985116831,0.4877659523357027,0.3074480664883964
84
+ archangel_kto_llama13b,en,4923,0.4917733089579524,0.265857362879638,0.2148456742616247,0.4824199446808247,0.2920729834365902
85
+ archangel_kto_llama7b,en,4924,0.4849715678310317,0.266149771469339,0.2202585017071929,0.4755960759538992,0.3040187559510424
86
+ archangel_ppo_llama13b,en,4924,0.5030463038180342,0.2908826047534769,0.23276620724245,0.4948351588919578,0.3033505139179421
87
+ archangel_ppo_llama7b,en,4924,0.5004061738424046,0.2826556918211783,0.2325780825949814,0.4889972752091794,0.3096995773586565
88
+ archangel_sft-dpo_llama13b,en,4923,0.5157424334755231,0.2921078977444891,0.2305752029160327,0.5009939678280723,0.3017906256511777
89
+ archangel_sft-dpo_llama7b,en,4924,0.5154346060113729,0.2846739661056439,0.2261595628358827,0.4995514289980706,0.3056728002519796
90
+ archangel_sft-kto_llama13b,en,4923,0.5094454600853139,0.285761451985992,0.224978586612828,0.4991280969122486,0.2973415555426719
91
+ archangel_sft-kto_llama7b,en,4924,0.5201056051990252,0.2846931913661586,0.2265785930969792,0.4993440421803412,0.3064377850374949
92
+ archangel_sft-ppo_llama13b,en,4923,0.5149299207800122,0.2852270856215136,0.2220936880641315,0.5019223601167987,0.2970997097391119
93
+ archangel_sft-ppo_llama7b,en,4924,0.5199025182778229,0.2824472204317482,0.2265393150476197,0.502339453259139,0.3102802776521014
94
+ bloomz-1b1,ar,4884,0.3579033579033579,0.1988126094389675,0.1859144904916891,0.413300040152887,0.3060614632943153
95
+ bloomz-1b1,hi,4883,0.4552529182879377,0.3040653020408722,0.2773203452723169,0.4870647319451566,0.3304269964171472
96
+ bloomz-1b1,sv,4907,0.3185245567556551,0.1809954814318183,0.1737039651322506,0.3870572822005299,0.2712772580495033
97
+ bloomz-1b1,es,4815,0.3852544132917965,0.1948844231215703,0.2184892196812931,0.3978647540299688,0.3249690565869331
98
+ bloomz-1b1,ko,4684,0.2944064901793339,0.1559347188998527,0.1623012972566951,0.3660134365328022,0.2869210465134975
99
+ bloomz-1b1,it,4889,0.2360400899979546,0.1301224356844653,0.1425929938692934,0.3256653116192268,0.2622766792927356
100
+ bloomz-1b1,ru,4743,0.1454775458570525,0.0738199384292565,0.1040246639243491,0.2394997214601012,0.2294333281970293
101
+ bloomz-1b1,de,4814,0.2690070627336934,0.1444564854043241,0.1323649452556936,0.3635254407277606,0.2638115683881881
102
+ bloomz-1b1,zh,4931,0.3849117825998783,0.1835992572537652,0.1624694934595119,0.3958749752204481,0.2746026687709914
103
+ bloomz-1b1,pl,4879,0.2168477146956343,0.1587783807707588,0.1802322827394425,0.320774269990455,0.2654652775342848
104
+ bloomz-1b1,cs,4894,0.380057212913772,0.2401843336052662,0.2221600633461362,0.4389435291224765,0.3078778105438327
105
+ bloomz-1b1,nl,4908,0.1471067644661776,0.0942781541531134,0.1125976638612326,0.2582121798035397,0.2251663435990484
106
+ bloomz-1b1,ja,4841,0.2964263581904565,0.1314543797887075,0.1816567071924119,0.3076154745123037,0.3034652623225028
107
+ bloomz-1b1,fr,4723,0.3641753123015033,0.1831075809877769,0.1960987471381426,0.3717349784911983,0.3008484508147089
108
+ bloomz-1b1,id,4794,0.376720901126408,0.1979997540012488,0.1920580178024304,0.3938623010839591,0.3008800082655624
109
+ bloomz-1b1,en,4828,0.356255178127589,0.1939117105036264,0.2107564540170656,0.3885185411065659,0.3158871223881973
110
+ bloomz-1b1,pt,4687,0.368465969703435,0.1743079727265758,0.1798173849976589,0.3945279006941967,0.2952136705155121
111
+ bloomz-1b7,ar,4793,0.3686626330064678,0.2110961287132615,0.2017913487541383,0.4164192709278531,0.3140467920514896
112
+ bloomz-1b7,hi,4823,0.4418411776902343,0.2960465060673452,0.2713589272631541,0.4859591170898403,0.3365198805639519
113
+ bloomz-1b7,sv,4857,0.3446571957998764,0.1833053781052303,0.187096890738626,0.3869428454201689,0.2913352570795194
114
+ bloomz-1b7,es,4779,0.4109646369533375,0.2060065240881409,0.2222009717348751,0.4130996861535886,0.330976422910291
115
+ bloomz-1b7,ko,4675,0.1935828877005347,0.1262135133189459,0.1388998324538134,0.3043995333893482,0.2556722824202309
116
+ bloomz-1b7,it,4874,0.2550266721378744,0.1420957207675061,0.1574160222022657,0.3335099572148338,0.2714755919147721
117
+ bloomz-1b7,ru,4793,0.1450031295639474,0.0817765688799047,0.0998634504547132,0.2472471332813248,0.2221538122571006
118
+ bloomz-1b7,de,4774,0.2681189777963971,0.1453916765252789,0.1359428440070636,0.3644266367607457,0.2671752670609557
119
+ bloomz-1b7,zh,4927,0.3344834584940125,0.1607734042035958,0.1454420890167919,0.3681627339489425,0.2599663949149025
120
+ bloomz-1b7,pl,4855,0.2166838311019567,0.1636871099659885,0.1687391175746943,0.3265897748646344,0.2559728310862014
121
+ bloomz-1b7,cs,4798,0.3478532721967486,0.2301924012423336,0.2276787903023999,0.4206283971561692,0.3110408804566562
122
+ bloomz-1b7,nl,4884,0.1928746928746929,0.1115675689480917,0.1369691961776696,0.286509719896554,0.2549950356560522
123
+ bloomz-1b7,ja,4662,0.3191763191763191,0.1422363076916304,0.1854973536617564,0.3328498451662031,0.3033924605048432
124
+ bloomz-1b7,fr,4690,0.3678038379530917,0.186555142567338,0.2044696381908879,0.3879760294719912,0.3255233864588548
125
+ bloomz-1b7,id,4765,0.4331584470094438,0.2299851047116209,0.2265380511886736,0.437667464328831,0.3310604826714987
126
+ bloomz-1b7,en,4773,0.3710454640687199,0.2036142752167638,0.219141191741574,0.4024057179168657,0.3266146108705132
127
+ bloomz-1b7,pt,4707,0.3722115997450606,0.1829574098442442,0.1923722997951045,0.3987491124731676,0.300851404426086
128
+ bloomz-3b,ar,4733,0.3481935347559687,0.1989521540909371,0.1969044290685992,0.3976980851763184,0.3137515294352428
129
+ bloomz-3b,hi,4824,0.4305555555555556,0.2951413473936469,0.2747976354845753,0.476294720335199,0.3347605208618968
130
+ bloomz-3b,sv,4849,0.293875025778511,0.1649800001827937,0.1851560536193055,0.3472926751324809,0.2880593851406002
131
+ bloomz-3b,es,4786,0.3888424571667363,0.2011138572533638,0.2277942379060415,0.3998648381237108,0.3330172233567403
132
+ bloomz-3b,ko,4507,0.1985799866873752,0.1260037596092703,0.1410219438164681,0.3045653509931155,0.2560873307577248
133
+ bloomz-3b,it,4872,0.2489737274220032,0.1425173812131798,0.1648516857325555,0.3271301407924671,0.2791336658509164
134
+ bloomz-3b,ru,4626,0.1234327712926934,0.0745140321697381,0.1061666719763735,0.218804583192019,0.222951650247158
135
+ bloomz-3b,de,4786,0.2547012118679482,0.1412479933485121,0.1362053460865442,0.3528360987581905,0.2694048122194307
136
+ bloomz-3b,zh,4857,0.321597694049825,0.1570781178315149,0.1489819592031391,0.3549811925722071,0.2633076238309731
137
+ bloomz-3b,pl,4843,0.2172207309518893,0.1520616745748974,0.1944063595350569,0.3014650100090212,0.280949133833519
138
+ bloomz-3b,cs,4817,0.3537471455262611,0.2331574846403224,0.2444212535339614,0.4124709826625493,0.324019666842642
139
+ bloomz-3b,nl,4868,0.1801561216105176,0.1051916807742035,0.1387467474194312,0.2666120386458479,0.2573149299524141
140
+ bloomz-3b,ja,4720,0.2726694915254237,0.1200956627835282,0.179308081769775,0.2818031221116541,0.3043855203163603
141
+ bloomz-3b,fr,4621,0.367669335641636,0.1943090790344112,0.2157295697418776,0.3901585653657974,0.334758726802748
142
+ bloomz-3b,id,4726,0.4162082099026661,0.2193931456634718,0.2192528273724368,0.4240216357772323,0.3311356487622928
143
+ bloomz-3b,en,4757,0.3634643682993483,0.2010799780384276,0.2220843449016772,0.3930131163469203,0.3273553712014336
144
+ bloomz-3b,pt,4733,0.3667863934079864,0.1827065795733951,0.1919020633182748,0.3974777431630192,0.3005179703606417
145
+ bloomz-560m,ar,4869,0.2885602793181351,0.1653370382391464,0.1574807009699313,0.3620933456200041,0.2796046616449588
146
+ bloomz-560m,hi,4820,0.3811203319502074,0.2564123307257731,0.2541300427122325,0.4402706829053735,0.3298163168469705
147
+ bloomz-560m,sv,4861,0.2867722690804361,0.1659265474126653,0.171052480173118,0.3498465618346657,0.2741637165808483
148
+ bloomz-560m,es,4754,0.3279343710559528,0.1606180381071098,0.1925444377070328,0.3541924999419857,0.3142102135614228
149
+ bloomz-560m,ko,4482,0.1360999553770638,0.1096055493624048,0.117785525907125,0.2665502123016752,0.218148644194868
150
+ bloomz-560m,it,4896,0.2071078431372549,0.1266645336013044,0.1382480781530631,0.3085719553377655,0.2493015823055378
151
+ bloomz-560m,ru,4669,0.0901692011137288,0.0686816019535131,0.0793942627024689,0.2109074534741957,0.1932645652437007
152
+ bloomz-560m,de,4889,0.2724483534465126,0.1471733252855039,0.1274240005674466,0.3741434723741072,0.2587935915127669
153
+ bloomz-560m,zh,4861,0.2155934992799835,0.1256338871264844,0.1332772343673066,0.2949651799626928,0.2431094009335612
154
+ bloomz-560m,pl,4847,0.2009490406436971,0.1641499439266185,0.1985455829471652,0.3048555579382476,0.2709528175511562
155
+ bloomz-560m,cs,4878,0.4015990159901599,0.2762467561713251,0.2561788310254942,0.4585475691339074,0.3136600804215982
156
+ bloomz-560m,nl,4857,0.1165328392011529,0.09541615438474,0.1180402918021845,0.2365694196995328,0.2157562304959352
157
+ bloomz-560m,ja,4800,0.205625,0.0960826234366286,0.1519654890782737,0.2471489072334082,0.2731308636490629
158
+ bloomz-560m,fr,4726,0.2659754549301735,0.1349263387694462,0.1586540836654987,0.309220609790457,0.2802965713681963
159
+ bloomz-560m,id,4665,0.3106109324758842,0.1626359550510073,0.1706963182592705,0.3468014862967198,0.2942877423288753
160
+ bloomz-560m,en,4838,0.326581231914014,0.1640252514470016,0.1785054637242486,0.3631733477939851,0.3060055125802647
161
+ bloomz-560m,pt,4592,0.3310104529616725,0.1523100942400806,0.1643927616322851,0.3603721122403092,0.2889869256779097
162
+ bloomz-7b1,ar,4751,0.3531888023573984,0.2080134152104135,0.206809640863939,0.4047497175140939,0.3186954192109211
163
+ bloomz-7b1,hi,4810,0.4251559251559251,0.2865515247031075,0.2736984743105785,0.4664746590733263,0.33255425051118
164
+ bloomz-7b1,sv,4845,0.3263157894736842,0.1738977482615529,0.1880768783947489,0.3723458635907946,0.2930971426933625
165
+ bloomz-7b1,es,4794,0.415310805173133,0.220594930734861,0.2406765958624402,0.4222990363411973,0.3389011123794337
166
+ bloomz-7b1,ko,4598,0.2427142235754676,0.1375611162529812,0.1496084031156419,0.3284488846789038,0.2736219763276082
167
+ bloomz-7b1,it,4844,0.2698183319570603,0.1494985121215096,0.1705011311658617,0.341845987914905,0.2878001645559825
168
+ bloomz-7b1,ru,4622,0.139766334919948,0.0753118573269215,0.1028831243206662,0.231652666230926,0.2285664845677253
169
+ bloomz-7b1,de,4854,0.2904820766378245,0.1630368190841042,0.1542847373858901,0.3846557993222703,0.2751725591090824
170
+ bloomz-7b1,zh,4873,0.3291606813051508,0.1615940958048956,0.1567701049363325,0.3595361161220429,0.2710445019194619
171
+ bloomz-7b1,pl,4867,0.2177933018286418,0.1590935140465615,0.2003197466515264,0.305779018425901,0.2819435489223021
172
+ bloomz-7b1,cs,4850,0.3734020618556701,0.2344864836581485,0.252690566939387,0.4191875089289072,0.3343424991209316
173
+ bloomz-7b1,nl,4857,0.2118591723285979,0.1167147214557745,0.1465950127024021,0.2929210021398394,0.2747827256452229
174
+ bloomz-7b1,ja,4740,0.2632911392405063,0.1172765470090478,0.1747472123724674,0.2720290066322215,0.3013684725758117
175
+ bloomz-7b1,fr,4597,0.3743745921252991,0.1954787240152017,0.2174211227204581,0.3844908260728084,0.323959669032934
176
+ bloomz-7b1,id,4704,0.4162414965986394,0.2224641414598692,0.2249788100673474,0.4201624765582483,0.3304199356587939
177
+ bloomz-7b1,en,4766,0.3680234997901804,0.2066260204207959,0.230053283729392,0.3964604167041753,0.3303262371715915
178
+ bloomz-7b1,pt,4746,0.3948588284871471,0.1903330857551345,0.1947112075239938,0.4145236111528445,0.3035163194963773
179
+ gemma-7b-it,es,4981,0.1863079702870909,0.1347118923600555,0.1236905254914831,0.2948829968307709,0.2212544419620162
180
+ gemma-7b-it,it,4990,0.1677354709418837,0.1328199312693879,0.1146681257710871,0.2796453107428256,0.2097050771258541
181
+ gemma-7b-it,de,4994,0.2074489387264717,0.1347462166979726,0.1261150044323819,0.3134520952707248,0.2498825271511417
182
+ gemma-7b-it,fr,4995,0.1027027027027027,0.0972704599897812,0.0904723488328675,0.2353051798888889,0.187088840235101
183
+ gemma-7b-it,en,4993,0.2074904866813539,0.163302407635538,0.1418358596779297,0.3071260098062287,0.2325820866498657
184
+ mpt-7b,en,4982,0.4429947812123645,0.2850575431496288,0.2564651610797962,0.4547912716334484,0.3180164478113562
185
+ mpt-7b-instruct,en,4987,0.4517746139963906,0.2865563375597613,0.259601902689536,0.4455640935877281,0.3110713421699829
186
+ pythia-1.4b,en,4985,0.4846539618856569,0.3179250759979772,0.2786760160030095,0.4891714649439118,0.3182511597102306
187
+ pythia-12b,en,4981,0.493073679983939,0.3275623643938577,0.2840494830399754,0.494287683243144,0.3221270882806815
188
+ pythia-160m,en,4984,0.4295746388443017,0.2485048421036461,0.2262870873804885,0.4516201300334069,0.2998120649817254
189
+ pythia-1b,en,4985,0.4872617853560682,0.3116851872498744,0.2725849253503696,0.4900452411745034,0.31492409916706
190
+ pythia-2.8b,en,4987,0.4862642871465811,0.3234546363516069,0.2836429594304018,0.4904870833895327,0.3213196823196072
191
+ pythia-410m,en,4986,0.4665062174087445,0.294583539846771,0.2650497338923498,0.4752743262868631,0.3149770156538719
192
+ pythia-6.9b,en,4982,0.4965877157767964,0.3280188382083215,0.2841066773911033,0.4959576413200923,0.3210815330798021
193
+ pythia-70m,en,4990,0.3745490981963927,0.2103240876974745,0.1939012573556414,0.4198962104250101,0.2811215340229599
194
+ ruGPT-3.5-13B,ru,4978,0.4475693049417437,0.2489569154503428,0.2148358099513138,0.4491778316225391,0.2891238611292507
195
+ stablelm-2-1_6b,es,4990,0.4981963927855711,0.3028962297425099,0.2623548113469321,0.4856119822771811,0.3171865192000768
196
+ stablelm-2-1_6b,it,4993,0.4247947125976367,0.2647205638138566,0.237495973974507,0.4468687608520128,0.3078418298204741
197
+ stablelm-2-1_6b,de,4989,0.4209260372820204,0.2826959646252295,0.2094638316870471,0.4998779019797554,0.2625660401984461
198
+ stablelm-2-1_6b,nl,4990,0.4484969939879759,0.2685109197712423,0.2234950664700348,0.4378160987056934,0.2952505274125404
199
+ stablelm-2-1_6b,fr,4995,0.5003003003003003,0.3107929238233334,0.2617553494554963,0.4857563457586586,0.3243640761962988
200
+ stablelm-2-1_6b,en,4991,0.483670607092767,0.3166972244542632,0.2825447738598285,0.4811732910285915,0.327127140851104
201
+ stablelm-2-1_6b,pt,4994,0.4841810172206648,0.2749691978333554,0.2105127990906706,0.4937805853707048,0.2641486492088508
202
+ stablelm-2-zephyr-1_6b,es,4992,0.3439503205128205,0.1975571231597293,0.2003995862604208,0.3631917205599158,0.2852058622360945
203
+ stablelm-2-zephyr-1_6b,it,4991,0.2614706471648968,0.1672555072766634,0.1644583387299278,0.3205809912218994,0.2548215347118873
204
+ stablelm-2-zephyr-1_6b,de,4954,0.1927735163504239,0.1178850858754082,0.1270355908442243,0.2846524161144529,0.2538455572533253
205
+ stablelm-2-zephyr-1_6b,nl,4999,0.294258851770354,0.1900694631270387,0.1595946761183296,0.3376390411280256,0.2362789971650273
206
+ stablelm-2-zephyr-1_6b,fr,4994,0.3468161794152983,0.1933550737307658,0.188281498725866,0.3504911477881057,0.2774103686521471
207
+ stablelm-2-zephyr-1_6b,en,4998,0.2891156462585034,0.1825243821735419,0.1960363954065823,0.3125230818735494,0.27342528932281
208
+ stablelm-2-zephyr-1_6b,pt,4989,0.2671878131890158,0.1544029868218321,0.1742810066663866,0.3249303766874123,0.2639163798869512
209
+ tulu-2-13b,es,4831,0.301179879942041,0.1536978455028007,0.170304469901537,0.3372673955906023,0.2897330449233107
210
+ tulu-2-13b,it,4860,0.2234567901234568,0.1396093573682553,0.1423296042053044,0.3060829265945884,0.2560066871413096
211
+ tulu-2-13b,de,4834,0.2244517997517583,0.1253667226227362,0.1362693514984649,0.3085984872008688,0.2744011604890453
212
+ tulu-2-13b,fr,4822,0.1945250933222729,0.116507544356612,0.1373843284515729,0.2620769662020571,0.2498996872174629
213
+ tulu-2-13b,en,4755,0.2557308096740273,0.1439266080786302,0.1665434025538998,0.3060834713413669,0.2861778100427951
214
+ tulu-2-7b,es,4980,0.1769076305220883,0.0984272419110188,0.0889995240349428,0.2729633899687751,0.2301020827594039
215
+ tulu-2-7b,it,4990,0.1915831663326653,0.1169004375016297,0.1045058964679143,0.2853349568879158,0.2326931901605381
216
+ tulu-2-7b,de,4988,0.1497594226142742,0.0931519011269649,0.0921135754628028,0.2601720211044506,0.2333021067344569
217
+ tulu-2-7b,fr,4988,0.0414995990376904,0.0677952249465112,0.0590472823029127,0.1746461079947473,0.150990410718193
218
+ tulu-2-7b,en,4976,0.1109324758842443,0.0905827340912625,0.0903820448373618,0.2216262440212017,0.2045443308715486
219
+ tulu-2-dpo-13b,es,4697,0.1762827336597828,0.1230781068580369,0.1450236350809632,0.25381381414109,0.2429287064433257
220
+ tulu-2-dpo-13b,it,4770,0.1251572327044025,0.119751384379528,0.1218516500817482,0.2373653279120126,0.2081164652759309
221
+ tulu-2-dpo-13b,de,4788,0.1451545530492899,0.100972484756509,0.1143970867467482,0.2470458699010025,0.2400581114773784
222
+ tulu-2-dpo-13b,fr,4826,0.1121011189390799,0.1012318728637349,0.1134171890397562,0.213672100858765,0.2037919816184375
223
+ tulu-2-dpo-13b,en,4780,0.1370292887029288,0.1080142958971487,0.1311714789215987,0.2230872758214435,0.2290575935469172
224
+ tulu-2-dpo-70b,es,4908,0.1269356153219233,0.1122805596073435,0.1227827505927808,0.2264330823620211,0.209431913743211
225
+ tulu-2-dpo-70b,it,4918,0.0801138674257828,0.1162505012663672,0.1125259056331195,0.2173964258457706,0.1803509590368752
+ tulu-2-dpo-70b,de,4939,0.1350475804818789,0.1116127685856357,0.112180414281075,0.2566106270074914,0.2186805084383896
+ tulu-2-dpo-70b,fr,4945,0.0693629929221435,0.0970020444291488,0.1033260174167977,0.1879190584888776,0.1739605583110998
+ tulu-2-dpo-70b,en,4937,0.0832489366011748,0.1035554269724415,0.1224673588351606,0.195344025140875,0.1920519395030683
+ tulu-2-dpo-7b,es,4977,0.1189471569218404,0.1039567203107767,0.0927288481877795,0.2399892372828009,0.2010756150196121
+ tulu-2-dpo-7b,it,4995,0.1253253253253253,0.122254772486335,0.1061814582665379,0.2524468361092092,0.201863897431973
+ tulu-2-dpo-7b,de,4993,0.1197676747446425,0.0952000149543841,0.0950397306297983,0.2326035442374324,0.2141453967963351
+ tulu-2-dpo-7b,fr,4992,0.0336538461538461,0.0802549188944927,0.0692508606752942,0.1780189827890424,0.1441083129855735
+ tulu-2-dpo-7b,en,4991,0.1069925866559807,0.1050166351905586,0.1094511172782801,0.216859477456722,0.2026533998184208
+ zephyr-7b-beta,es,5000,0.419,0.2597888022066376,0.2147494351650221,0.4129112422936999,0.2825345829192548
+ zephyr-7b-beta,it,5000,0.3516,0.2511241260480538,0.215647610772426,0.3813650237994,0.2735986642080513
+ zephyr-7b-beta,de,5000,0.2194,0.1547041175452616,0.1550671225242424,0.3131005268101,0.2498253102817157
+ zephyr-7b-beta,fr,4999,0.381876375275055,0.2592065150187518,0.2266763520386383,0.3930610152179435,0.2876798951909775
+ zephyr-7b-beta,en,5000,0.2766,0.2043592229530769,0.2053109934941847,0.3127769966276,0.2648962724349844
+ zephyr-7b-gemma-v0.1,es,4999,0.1868373674734947,0.1442589485224596,0.113853682839481,0.2912251141488297,0.211575562329375
+ zephyr-7b-gemma-v0.1,it,4999,0.1400280056011202,0.1361333846604859,0.109341040685136,0.2685830900229045,0.1943769189480327
+ zephyr-7b-gemma-v0.1,de,4999,0.2238447689537907,0.1594512177822305,0.1197821041230799,0.3674270070566113,0.2124394268030186
+ zephyr-7b-gemma-v0.1,fr,4999,0.1398279655931186,0.1195401309680135,0.0924008493529877,0.2631673641432286,0.1983356363563246
+ zephyr-7b-gemma-v0.1,en,4999,0.1394278855771154,0.1205078063936122,0.1111768413077736,0.2449670516225245,0.2062974336630307
+ gpt-35-turbo,ar,4998,0.2629051620648259,0.2098182243433282,0.2076886447797046,0.3386736775885334,0.2667473345198126
+ gpt-35-turbo,hi,4997,0.3466079647788673,0.2868542142160802,0.2828622478196919,0.4288032535456473,0.3245947530810348
+ gpt-35-turbo,sv,4993,0.2876026437011816,0.211261189733052,0.2323300564125041,0.3291692371137605,0.2851837344857831
+ gpt-35-turbo,es,4994,0.316780136163396,0.2290432449976606,0.2553592997681056,0.3424355070337032,0.301609893831499
+ gpt-35-turbo,ko,4996,0.2301841473178542,0.1723125838362593,0.1932421760259724,0.299941085080321,0.2640945991813433
+ gpt-35-turbo,it,4989,0.2309079975947083,0.1931366456063818,0.217427916655189,0.3003791876244165,0.2641736072919595
+ gpt-35-turbo,ru,4990,0.2595190380761523,0.1584896433990052,0.2008878922434815,0.281148133478624,0.275052981927308
+ gpt-35-turbo,de,4803,0.247137205912971,0.1904172972482734,0.1931909206170704,0.3472546836037747,0.2654044092084133
+ gpt-35-turbo,zh,4998,0.0870348139255702,0.1038718053312282,0.099605120442037,0.2245489997786534,0.180784006959398
+ gpt-35-turbo,pl,4996,0.2762209767814251,0.176019233851971,0.2104067582250721,0.3153047964276055,0.2856932507785094
+ gpt-35-turbo,cs,4990,0.3973947895791583,0.2584965203747262,0.2499373203558789,0.4051195468078959,0.3262767212687349
+ gpt-35-turbo,nl,4998,0.2963185274109643,0.2169110453018776,0.2158052462110761,0.3271973792529909,0.2576118050933906
+ gpt-35-turbo,ja,4996,0.3514811849479584,0.2252790278197624,0.256862733132941,0.357303335287475,0.3138901604694437
+ gpt-35-turbo,fr,4999,0.3004600920184037,0.2040322666268356,0.2356166073997148,0.3142078285360175,0.2911940988658895
+ gpt-35-turbo,id,4999,0.2710542108421684,0.2159155927805076,0.21785771468491,0.3301515931595029,0.2706773899286154
+ gpt-35-turbo,en,4998,0.2831132452981192,0.2186820781361611,0.2530648331356822,0.3154917507462865,0.2931820344761845
+ gpt-35-turbo,pt,4879,0.2605042016806723,0.1893918914739677,0.2124061701397226,0.3290584642256282,0.2669319179369109
+ Meta-Llama-3-8B-Instruct,en,4999,0.04280856171234247,0.1109264753416201,0.0898458223482576,0.20950434451790356,0.1561794119730667
+ Meta-Llama-3-8B-Instruct,fr,4999,0.021204240848169634,0.10675989695859678,0.07605534797608259,0.20027916838923784,0.13528190697912806
+ Meta-Llama-3-8B-Instruct,es,4997,0.040224134480688414,0.11982465746896896,0.09140614339767897,0.2226936411527917,0.15257628678437607
+ Meta-Llama-3-8B-Instruct,it,5000,0.0614,0.13582813172364613,0.10722073166375928,0.2341709128155,0.16342695384202371
+ Meta-Llama-3-8B-Instruct,de,4999,0.009801960392078415,0.08752700190872742,0.08752537889409855,0.14115270519793957,0.13232116699557078
+ Meta-Llama-3-8B,en,4998,0.4579831932773109,0.28755746751034983,0.23542476060330728,0.46702268799834734,0.2826402637075691
+ Meta-Llama-3-8B,fr,5000,0.4706,0.27313631752562423,0.21204422931531994,0.4683250663852238,0.27599484776501726
+ Meta-Llama-3-8B,es,4998,0.501000400160064,0.2554243335665402,0.18987061733297014,0.4854739333491116,0.2696156053557734
+ Meta-Llama-3-8B,it,4999,0.4368873774754951,0.2477625607559858,0.19117635149432813,0.4437417093397171,0.25613964381080023
+ Meta-Llama-3-8B,de,4997,0.10406243746247748,0.09947650703016991,0.09404481341481113,0.24664965581461262,0.1743693628108685
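The rows above appear to be the tail of `data/data.csv` (the file read by `load_data.py` later in this commit): one line per model/language pair, with a sample count followed by several toxicity statistics. A minimal sketch for inspecting the raw file is below; only the `Model` and `Average Toxicity` column names are confirmed by `load_data.py`, so print the header rather than assuming the other names.

```python
# Minimal sketch for inspecting the raw rows above (run from the Space root).
# Only "Model" and "Average Toxicity" are column names confirmed by load_data.py;
# the language/count/metric headers are not shown here, so check them first.
import pandas as pd

raw = pd.read_csv("data/data.csv")
print(raw.columns.tolist())
print(raw[raw["Model"] == "zephyr-7b-beta"])
```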
data/image.png ADDED
data/ptp.png ADDED
load_data.py ADDED
@@ -0,0 +1,229 @@
+ SIZE_MAP = {
+     'Airavata': '7b',
+     'CCK-v2.0-DPO': '13b',
+     'GEITje-7B': '7b',
+     'Heimer-dpo-TinyLlama-1.1B': '1b',
+     'Heimer-kto-TinyLlama-1.1B': '1b',
+     'Heimer-ipo-TinyLlama-1.1B': '1b',
+     'HuggingFaceH4': '7b',
+     'Llama-2-13b-chat-hf': '13b',
+     'Llama-2-13b-hf': '13b',
+     'Llama-2-70b-chat-hf': '70b',
+     'Llama-2-7b-chat-hf': '7b',
+     'Llama-2-7b-hf': '7b',
+     'Mistral-7B-Instruct-v0.1': '7b',
+     'Mistral-7B-Instruct-v0.2': '7b',
+     'OLMo-7B-Instruct': '7b',
+     'OpenHathi-7B-Hi-v0.1-Base': '7b',
+     'Qwen-7B-Chat': '7b',
+     'Swallow-13b-instruct-hf': '13b',
+     'Swallow-70b-instruct-hf': '70b',
+     'Swallow-7b-hf': '7b',
+     'Swallow-7b-instruct-hf': '7b',
+     'Yi-6B-Chat': '1b-7b',
+     'archangel_dpo_llama13b': '13b',
+     'archangel_dpo_llama7b': '7b',
+     'archangel_kto_llama13b': '13b',
+     'archangel_kto_llama7b': '7b',
+     'archangel_ppo_llama13b': '13b',
+     'archangel_ppo_llama7b': '7b',
+     'archangel_sft-dpo_llama13b': '13b',
+     'archangel_sft-dpo_llama7b': '7b',
+     'archangel_sft-kto_llama13b': '13b',
+     'archangel_sft-kto_llama7b': '7b',
+     'archangel_sft-ppo_llama13b': '13b',
+     'archangel_sft-ppo_llama7b': '7b',
+     'bloomz-1b1': '1b',
+     'bloomz-1b7': '7b',
+     'bloomz-3b': '1b-7b',
+     'bloomz-560m': '<1b',
+     'bloomz-7b1': '7b',
+     'gemma-7b-it': '7b',
+     'llama-30b': '30b',
+     'mpt-7b': '7b',
+     'mpt-7b-instruct': '7b',
+     'pythia-1.4b': '1b-7b',
+     'pythia-12b': '13b',
+     'pythia-160m': '<1b',
+     'pythia-1b': '1b',
+     'pythia-2.8b': '1b-7b',
+     'pythia-410m': '1b',
+     'pythia-6.9b': '7b',
+     'pythia-70m': '1b',
+     'ruGPT-3.5-13B': '13b',
+     'stablelm-2-1_6b': '1b-7b',
+     'stablelm-2-zephyr-1_6b': '1b-7b',
+     'tulu-2-13b': '13b',
+     'tulu-2-7b': '7b',
+     'tulu-2-dpo-13b': '13b',
+     'tulu-2-dpo-70b': '70b',
+     'tulu-2-dpo-7b': '7b',
+     'zephyr-7b-beta': '7b',
+     'gpt-35-turbo': "Unknown",
+     "Aya101": '13b',
+     "zephyr-7b-gemma-v0.1": "7b",
+     "Mistral-7B-v0.1": '7b',
+     "Meta-Llama-3-8B-Instruct": "8b",
+     "Meta-Llama-3-8B": '8b',
+ }
+
+
+ MODEL_FAMILY = {
+     'Airavata': 'OpenHathi',
+     'CCK-v2.0-DPO': 'NA',
+     'GEITje-7B': 'Mistral-GEITje',
+     'Heimer-dpo-TinyLlama-1.1B': 'Llama-Tiny',
+     'Heimer-kto-TinyLlama-1.1B': 'Llama-Tiny',
+     'Heimer-ipo-TinyLlama-1.1B': 'Llama-Tiny',
+     'HuggingFaceH4': 'Mistral-CAI',
+     'Llama-2-13b-chat-hf': 'Llama',
+     'Llama-2-13b-hf': 'Llama',
+     'Llama-2-70b-chat-hf': 'Llama',
+     'Llama-2-7b-chat-hf': 'Llama',
+     'Llama-2-7b-hf': 'Llama',
+     'Mistral-7B-Instruct-v0.1': 'Mistral',
+     'Mistral-7B-Instruct-v0.2': 'Mistral',
+     'OLMo-7B-Instruct': 'OLMo',
+     'OpenHathi-7B-Hi-v0.1-Base': 'OpenHathi',
+     'Qwen-7B-Chat': 'Qwen',
+     'Swallow-13b-instruct-hf': 'Llama-Swallow',
+     'Swallow-70b-instruct-hf': 'Llama-Swallow',
+     'Swallow-7b-hf': 'Llama-Swallow',
+     'Swallow-7b-instruct-hf': 'Llama-Swallow',
+     'Yi-6B-Chat': 'Yi',
+     'archangel_dpo_llama13b': 'Llama-Archangel',
+     'archangel_dpo_llama7b': 'Llama-Archangel',
+     'archangel_kto_llama13b': 'Llama-Archangel',
+     'archangel_kto_llama7b': 'Llama-Archangel',
+     'archangel_ppo_llama13b': 'Llama-Archangel',
+     'archangel_ppo_llama7b': 'Llama-Archangel',
+     'archangel_sft-dpo_llama13b': 'Llama-Archangel',
+     'archangel_sft-dpo_llama7b': 'Llama-Archangel',
+     'archangel_sft-kto_llama13b': 'Llama-Archangel',
+     'archangel_sft-kto_llama7b': 'Llama-Archangel',
+     'archangel_sft-ppo_llama13b': 'Llama-Archangel',
+     'archangel_sft-ppo_llama7b': 'Llama-Archangel',
+     'bloomz-1b1': 'Bloomz',
+     'bloomz-1b7': 'Bloomz',
+     'bloomz-3b': 'Bloomz',
+     'bloomz-560m': 'Bloomz',
+     'bloomz-7b1': 'Bloomz',
+     'gemma-7b-it': 'Gemma',
+     'llama-30b': 'Llama',
+     'mpt-7b': 'MPT',
+     'mpt-7b-instruct': 'MPT',
+     'pythia-1.4b': 'Pythia',
+     'pythia-12b': 'Pythia',
+     'pythia-160m': 'Pythia',
+     'pythia-1b': 'Pythia',
+     'pythia-2.8b': 'Pythia',
+     'pythia-410m': 'Pythia',
+     'pythia-6.9b': 'Pythia',
+     'pythia-70m': 'Pythia',
+     'ruGPT-3.5-13B': 'GPT',
+     'stablelm-2-1_6b': 'StableLM',
+     'stablelm-2-zephyr-1_6b': 'StableLM',
+     'tulu-2-13b': 'Llama-Tulu',
+     'tulu-2-7b': 'Llama-Tulu',
+     'tulu-2-dpo-13b': 'Llama-Tulu',
+     'tulu-2-dpo-70b': 'Llama-Tulu',
+     'tulu-2-dpo-7b': 'Llama-Tulu',
+     'zephyr-7b-beta': 'Mistral',
+     'gpt-35-turbo': "GPT-OAI",
+     'Aya101': 'Aya101',
+     "zephyr-7b-gemma-v0.1": 'Gemma',
+     "Mistral-7B-v0.1": 'Mistral',
+     "Meta-Llama-3-8B-Instruct": "Llama",
+     "Meta-Llama-3-8B": 'Llama',
+ }
+
+
+ MODEL_TYPE = {
+     'Airavata': 'instruct',
+     'CCK-v2.0-DPO': 'preference',
+     'GEITje-7B': 'base',
+     'Heimer-dpo-TinyLlama-1.1B': 'preference',
+     'Heimer-kto-TinyLlama-1.1B': 'preference',
+     'Heimer-ipo-TinyLlama-1.1B': 'preference',
+     'HuggingFaceH4': 'preference',
+     'Llama-2-13b-chat-hf': 'preference',
+     'Llama-2-13b-hf': 'base',
+     'Llama-2-70b-chat-hf': 'preference',
+     'Llama-2-7b-chat-hf': 'preference',
+     'Llama-2-7b-hf': 'base',
+     'Mistral-7B-Instruct-v0.1': 'instruct',
+     'Mistral-7B-Instruct-v0.2': 'instruct',
+     'OLMo-7B-Instruct': 'preference',
+     'OpenHathi-7B-Hi-v0.1-Base': 'instruct',
+     'Qwen-7B-Chat': 'preference',
+     'Swallow-13b-instruct-hf': 'instruct',
+     'Swallow-70b-instruct-hf': 'instruct',
+     'Swallow-7b-hf': 'base',
+     'Swallow-7b-instruct-hf': 'instruct',
+     'Yi-6B-Chat': 'preference',
+     'archangel_dpo_llama13b': 'preference',
+     'archangel_dpo_llama7b': 'preference',
+     'archangel_kto_llama13b': 'preference',
+     'archangel_kto_llama7b': 'preference',
+     'archangel_ppo_llama13b': 'preference',
+     'archangel_ppo_llama7b': 'preference',
+     'archangel_sft-dpo_llama13b': 'preference',
+     'archangel_sft-dpo_llama7b': 'preference',
+     'archangel_sft-kto_llama13b': 'preference',
+     'archangel_sft-kto_llama7b': 'preference',
+     'archangel_sft-ppo_llama13b': 'preference',
+     'archangel_sft-ppo_llama7b': 'preference',
+     'bloomz-1b1': 'base',
+     'bloomz-1b7': 'base',
+     'bloomz-3b': 'base',
+     'bloomz-560m': 'base',
+     'bloomz-7b1': 'base',
+     'gemma-7b-it': 'instruct',
+     'llama-30b': 'base',
+     'mpt-7b': 'base',
+     'mpt-7b-instruct': 'instruct',
+     'pythia-1.4b': 'base',
+     'pythia-12b': 'base',
+     'pythia-160m': 'base',
+     'pythia-1b': 'base',
+     'pythia-2.8b': 'base',
+     'pythia-410m': 'base',
+     'pythia-6.9b': 'base',
+     'pythia-70m': 'base',
+     'ruGPT-3.5-13B': 'base',
+     'stablelm-2-1_6b': 'instruct',
+     'stablelm-2-zephyr-1_6b': 'preference',
+     'tulu-2-13b': 'preference',
+     'tulu-2-7b': 'preference',
+     'tulu-2-dpo-13b': 'preference',
+     'tulu-2-dpo-70b': 'preference',
+     'tulu-2-dpo-7b': 'preference',
+     'zephyr-7b-beta': 'preference',
+     'gpt-35-turbo': "preference",
+     'Aya101': 'instruct',
+     'zephyr-7b-gemma-v0.1': 'preference',
+     'Mistral-7B-v0.1': 'base',
+     "Meta-Llama-3-8B-Instruct": "preference",
+     "Meta-Llama-3-8B": 'base',
+ }
+
+ MULTILINGUAL_FAMILY = ['Aya101', 'GPT-OAI', 'Bloomz']
+
+ import pandas as pd
+
+ def load_main_table():
+
+     df = pd.read_csv("./data/data.csv").round(3)
+     df = df[df.Model!='CCK-v2.0-DPO']
+     assert len(set(df['Model'].unique()) - set(list(SIZE_MAP.keys())))==0
+
+     df['Model Size'] = df['Model'].map(SIZE_MAP)
+
+     df['Model Type'] = df['Model'].map(MODEL_TYPE)
+
+     df['Model Family'] = df['Model'].map(MODEL_FAMILY)
+
+     df['Multilingual'] = df['Model Family'].apply(lambda x: x in MULTILINGUAL_FAMILY)
+     df = df.sort_values(by="Average Toxicity")
+
+     return df
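A minimal usage sketch for `load_main_table()` follows; the column selection is illustrative (only `Model` and `Average Toxicity` come from the CSV itself, the others are attached by the function), and the real rendering logic lives in `app.py`.

```python
# Minimal sketch: build the leaderboard table and show the least-toxic models first
# (the function sorts ascending by "Average Toxicity").
from load_data import load_main_table

df = load_main_table()
print(df[["Model", "Model Family", "Model Size", "Average Toxicity"]].head(10))
```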
requirements.txt ADDED
@@ -0,0 +1,4 @@
+ gradio[oauth]==4.19.2
+ datasets
+ toolz==0.12.1
+ plotly
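For a local run, these are installed with the usual `pip install -r requirements.txt`; note that `pandas`, which `load_data.py` imports, is assumed to arrive transitively through the `gradio` dependency chain rather than being pinned here.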
themes.py ADDED
@@ -0,0 +1,45 @@
+ from __future__ import annotations
+ from typing import Iterable
+ import gradio as gr
+ from gradio.themes.base import Base
+ from gradio.themes.utils import colors, fonts, sizes
+ import time
+
+ class Seafoam(Base):
+     def __init__(
+         self,
+         *,
+         primary_hue: colors.Color | str = colors.blue,
+         secondary_hue: colors.Color | str = colors.gray,
+         neutral_hue: colors.Color | str = colors.gray,
+         spacing_size: sizes.Size | str = sizes.spacing_md,
+         radius_size: sizes.Size | str = sizes.radius_md,
+         text_size: sizes.Size | str = sizes.text_lg,
+         font: fonts.Font
+         | str
+         | Iterable[fonts.Font | str] = (
+             fonts.GoogleFont("Quicksand"),
+             "ui-sans-serif",
+             "sans-serif",
+         ),
+         font_mono: fonts.Font
+         | str
+         | Iterable[fonts.Font | str] = (
+             fonts.GoogleFont("IBM Plex Mono"),
+             "ui-monospace",
+             "monospace",
+         ),
+     ):
+         super().__init__(
+             primary_hue=primary_hue,
+             secondary_hue=secondary_hue,
+             neutral_hue=neutral_hue,
+             spacing_size=spacing_size,
+             radius_size=radius_size,
+             text_size=text_size,
+             font=font,
+             font_mono=font_mono,
+         )
+
+
+ seafoam = Seafoam()
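A minimal sketch of how the `seafoam` theme above might be wired into a Gradio app; the demo contents here are placeholders, since the actual layout is defined in `app.py`.

```python
# Minimal sketch: apply the custom Seafoam theme to a Blocks app.
# The Markdown body is a placeholder; see app.py for the real leaderboard UI.
import gradio as gr
from themes import seafoam

with gr.Blocks(theme=seafoam) as demo:
    gr.Markdown("# PolygloToxicityPrompts Leaderboard")

if __name__ == "__main__":
    demo.launch()
```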