# Data files for the ML.ENERGY Leaderboard

This directory holds all the data for the leaderboard table.

## Parameters

There are two types of parameters: (1) those that become radio buttons on the leaderboard and (2) those that become columns in the leaderboard table.
Models are always placed in rows.

Currently, there are only two parameters that become radio buttons: GPU model (e.g., V100, A40, A100) and task (e.g., chat, chat-concise, instruct, and instruct-concise).
These are defined in the `schema.yaml` file.

Each combination of radio button parameter values has its own CSV file in this directory.
For instance, benchmark results for the *chat* task run on an *A100* GPU live in `A100_chat_benchmark.csv`. The leaderboard Gradio application constructs this file name dynamically by looking at `schema.yaml` and reads the file in as a Pandas DataFrame.
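
As a rough illustration only (this is not the actual application code; the variable names and local file path are assumptions), that lookup might look like this:

```python
import pandas as pd

# Hypothetical radio button selections; in the app, these come from the Gradio UI.
gpu = "A100"
task = "chat"

# Build the benchmark CSV file name from the radio button parameter values
# and read it in as a Pandas DataFrame.
df = pd.read_csv(f"{gpu}_{task}_benchmark.csv")
print(df.head())
```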

Parameters that become columns in the table are put directly in the benchmark CSV files, e.g., `batch_size` and `datatype`.

## Adding new models

1. Add your model to `models.json` (a hypothetical entry is sketched after this list).
   - The model's JSON key should be its unique codename, e.g., its Hugging Face Hub model name. This is usually not very human-readable.
   - `url` should point to a page where people can obtain the model's weights, e.g., the Hugging Face Hub.
   - `nickname` should be a short, human-readable string that identifies the model.
   - `params` should be the model's parameter count in billions, rounded to an integer.

1. Add NLP dataset evaluation scores to `score.csv` (see the same sketch below).
   - `model` is the model's JSON key in `models.json`.
   - `arc` is the accuracy on the [ARC challenge](https://allenai.org/data/arc) dataset.
   - `hellaswag` is the accuracy on the [HellaSwag](https://allenai.org/data/hellaswag) dataset.
   - `truthfulqa` is the accuracy on the [TruthfulQA](https://github.com/sylinrl/TruthfulQA) MC2 dataset.
   - We obtain these metrics using lm-evaluation-harness. See [here](https://github.com/ml-energy/leaderboard/tree/master/pegasus#nlp-benchmark) for specific instructions.

1. Add benchmarking results in CSV files, e.g., `A100_chat_benchmark.csv`. It should be evident from the name of each CSV file which setting it corresponds to.
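
To make steps 1 and 2 concrete, here is a minimal, hypothetical sketch of registering a new model from Python, using only the fields described above. Every value is a placeholder, and the `score.csv` column order (`model`, `arc`, `hellaswag`, `truthfulqa`) is an assumption based on the descriptions in step 2:

```python
import csv
import json

codename = "example-org/example-model-13b"  # placeholder Hugging Face Hub codename

# Step 1: add the model to models.json.
with open("models.json") as f:
    models = json.load(f)
models[codename] = {
    "url": f"https://huggingface.co/{codename}",  # where the weights can be obtained
    "nickname": "Example 13B",                    # short human-readable name
    "params": 13,                                 # parameter count rounded to billions
}
with open("models.json", "w") as f:
    json.dump(models, f, indent=2)

# Step 2: append the model's NLP evaluation scores to score.csv.
# These accuracy values are placeholders, not real benchmark results.
with open("score.csv", "a", newline="") as f:
    csv.writer(f).writerow([codename, 0.50, 0.75, 0.40])
```

In practice you can just edit `models.json` and `score.csv` by hand; the script above only spells out the fields involved.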