Spaces:
Running
Running
File size: 3,344 Bytes
360f81c b428657 360f81c 82141cb |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 |
The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs would consume.
## How is energy different?
Even between models with the exact same architecture and size, the average energy consumption per prompt is different because they have **different verbosity**.
That is, when asked the same thing, they answer in different lengths.
## Metrics
- `gpu`: NVIDIA GPU model name
- `task`: Name of the task. See *Tasks* below for details.
- `throughput` (token/s): The average number of tokens generated per second.
- `response_length` (token): The average number of tokens in the model's response.
- `latency` (s): The average time it took for the model to generate a response.
- `energy` (J): The average energy consumed by the model to generate a response.
- `parameters`: The number of parameters the model has, in units of billion.
## Tasks
For each task, every model uses the same system prompt. We still account for differences in roles, e.g. `USER`, `HUMAN`, `ASSISTANT`, `GPT`.
| Name | System prompt |
|--|--|
| chat | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. |
| chat-concise | A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. The assistant's answers are very concise. |
| instruct | Below is an instruction that describes a task. Write a response that appropriately completes the request. |
| instruct-concise | Below is an instruction that describes a task. Write a response that appropriately completes the request. The response should be very concise. |
## Setup
Find our benchmark script for one model [here](https://github.com/ml-energy/leaderboard/blob/master/benchmark.py).
### Software
- PyTorch 2.0.1
- [FastChat](https://github.com/lm-sys/fastchat) -- For various model support
- [Zeus](https://ml.energy/zeus) -- For GPU energy measurement
### Hardware
- NVIDIA A40 GPU
### Parameters
- Model
- Batch size 1
- FP16
- Sampling (decoding)
- Greedy sampling from multinomial distribution
- Temperature 0.7
- Repetition penalty 1.0
## Data
We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.
We used identical system prompts for all models (while respecting their own *role* tokens):
```
A chat between a human user (prompter) and an artificial intelligence (AI) assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
```
## Upcoming
- Compare against more optimized inference runtimes, like TensorRT.
- Other GPUs
- Other model/sampling parameters
- More models
- Model quality evaluation numbers (e.g., AI2 Reasoning Challenge, HellaSwag)
# License
This leaderboard is a research preview intended for non-commercial use only.
The use of LLaMA weights are subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
Please direct inquiries and reports of potential license/copyright violation to Jae-Won Chung.
|