Jae-Won Chung committed
Commit d846882 · 1 Parent(s): 511ed5e
Better About tab

Files changed: LEADERBOARD.md (+16 -12)

LEADERBOARD.md CHANGED
@@ -1,14 +1,23 @@
The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs would consume.

## Columns

-- `gpu`: NVIDIA GPU model name.
- `task`: Name of the task. See *Tasks* below for details.
- `energy` (J): The average GPU energy consumed by the model to generate a response.
- `throughput` (token/s): The average number of tokens generated per second.
- `latency` (s): The average time it took for the model to generate a response.
- `response_length` (token): The average number of tokens in the model's response.
- `parameters`: The number of parameters the model has, in units of billion.

## Tasks

@@ -39,6 +48,7 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead

- NVIDIA A40 GPU
- NVIDIA A100 GPU

### Parameters

@@ -50,17 +60,11 @@ Find our benchmark script for one model [here](https://github.com/ml-energy/lead
- Temperature 0.7
- Repetition penalty 1.0

-

We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.

-## NLP evaluation metrics
-
-- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset, measures capability to do grade-school level question answering, 25 shot
-- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag), measuring grounded commonsense, 10 shot
-- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958), measuring truthfulness against questions that elicit common falsehoods, 0 shot
-
## Limitations

Currently, inference is run with basically bare PyTorch with batch size 1, which is unrealistic assuming a production serving scenario.
@@ -68,18 +72,18 @@ Hence, absolute latency, throughput, and energy numbers should not be used to es

## Upcoming

-- Within the Summer, we'll add an
- More optimized inference runtimes, like TensorRT.
- Larger models with distributed inference, like Falcon 40B.
- More models, like RWKV.

-

This leaderboard is a research preview intended for non-commercial use only.
Model weights were taken as is from the Hugging Face Hub if available and are subject to their licenses.
The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
Please direct inquiries/reports of potential violation to Jae-Won Chung.

-

-We thank [Chameleon Cloud](https://www.chameleoncloud.org/)

LEADERBOARD.md (new version)

The goal of the ML.ENERGY Leaderboard is to give people a sense of how much **energy** LLMs would consume.

+The code for the leaderboard, backing data, and scripts for benchmarking are all open-source in our [repository](https://github.com/ml-energy/leaderboard).
+We'll see you at the [Discussion board](https://github.com/ml-energy/leaderboard/discussions), where you can ask questions, suggest improvement ideas, or just discuss leaderboard results!
+
## Columns

+- `gpu`: NVIDIA GPU model name.
- `task`: Name of the task. See *Tasks* below for details.
- `energy` (J): The average GPU energy consumed by the model to generate a response.
- `throughput` (token/s): The average number of tokens generated per second.
- `latency` (s): The average time it took for the model to generate a response.
- `response_length` (token): The average number of tokens in the model's response.
- `parameters`: The number of parameters the model has, in units of billion.
+- `arc`: [AI2 Reasoning Challenge](https://allenai.org/data/arc)'s `challenge` dataset. Measures capability to do grade-school level question answering, 25 shot.
+- `hellaswag`: [HellaSwag dataset](https://allenai.org/data/hellaswag). Measures grounded commonsense reasoning, 10 shot.
+- `truthfulqa`: [TruthfulQA dataset](https://arxiv.org/abs/2109.07958). Measures truthfulness against questions that elicit common falsehoods, 0 shot.
+
+NLP evaluation metrics (`arc`, `hellaswag`, and `truthfulqa`) were only run once each on A40 GPUs because their results do not depend on the GPU type.
+Hence, all GPU model rows for the same model share the same NLP evaluation numbers.
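
To make the relationship between these columns concrete, here is a minimal sketch that derives per-token energy and average power from the `energy`, `latency`, and `response_length` columns. The `leaderboard.csv` file name and the use of pandas are illustrative assumptions, not part of the leaderboard tooling.

```python
import pandas as pd

# Minimal sketch: derive quantities from the columns documented above.
# The CSV file name is a placeholder for however the leaderboard data is exported.
df = pd.read_csv("leaderboard.csv")

# Joules per generated token: response energy divided by response length.
df["energy_per_token"] = df["energy"] / df["response_length"]

# Average GPU power (W) while generating: energy (J) divided by latency (s).
df["avg_power_w"] = df["energy"] / df["latency"]

print(df[["gpu", "task", "energy_per_token", "avg_power_w"]].head())
```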

## Tasks

[...]

- NVIDIA A40 GPU
- NVIDIA A100 GPU
+- NVIDIA V100 GPU
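
For intuition on how per-response GPU energy of the kind reported in the `energy` column can be obtained on GPUs like these, below is a minimal sketch that reads NVML's cumulative energy counter before and after a stretch of GPU work. This only illustrates the general approach (the counter is available on Volta-class and newer GPUs); it is not the leaderboard's actual measurement code, and the matrix-multiply loop merely stands in for model generation.

```python
import pynvml
import torch

# Minimal sketch: measure GPU energy over a stretch of GPU work by reading
# NVML's cumulative energy counter (millijoules) before and after.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

# Stand-in workload; in a benchmark this would be one generation for one prompt.
x = torch.randn(4096, 4096, device="cuda")
for _ in range(100):
    x = x @ x
torch.cuda.synchronize()

end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
print(f"GPU energy consumed: {(end_mj - start_mj) / 1000.0:.1f} J")
```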

### Parameters

[...]

- Temperature 0.7
- Repetition penalty 1.0

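To show how these decoding parameters map onto a typical Hugging Face `generate` call, here is a minimal sketch. The model name, prompt, and `max_new_tokens` value are placeholders; only the temperature and repetition penalty come from the list above, and the leaderboard's actual benchmark harness may differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of generation with the decoding parameters listed above.
# Model name, prompt, and max_new_tokens are placeholders.
model_name = "lmsys/vicuna-7b-v1.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

inputs = tokenizer("Explain what a kilowatt-hour is.", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,          # sampling, so temperature has an effect
    temperature=0.7,         # from the benchmark parameters above
    repetition_penalty=1.0,  # from the benchmark parameters above
    max_new_tokens=256,      # placeholder response-length cap
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
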
+### Data

We randomly sampled around 3000 prompts from the [cleaned ShareGPT dataset](https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered).
See [here](https://github.com/ml-energy/leaderboard/tree/master/sharegpt) for more detail on how we created the benchmark dataset.

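The sampling step itself can be as simple as the sketch below. The local file name and the choice of the first turn as the prompt are illustrative assumptions; the linked `sharegpt` directory describes the actual procedure.

```python
import json
import random

# Minimal sketch: draw ~3000 prompts from a ShareGPT-style JSON dump, where each
# entry holds a "conversations" list of {"from": ..., "value": ...} turns.
random.seed(0)

with open("sharegpt_cleaned.json") as f:  # placeholder for the cleaned dataset file
    conversations = json.load(f)

sampled = random.sample(conversations, k=3000)

# Use the first turn of each sampled conversation as the benchmark prompt.
prompts = [conv["conversations"][0]["value"] for conv in sampled if conv.get("conversations")]
print(f"{len(prompts)} prompts prepared")
```
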
## Limitations

Currently, inference is run with basically bare PyTorch with batch size 1, which is unrealistic assuming a production serving scenario.

[...]

## Upcoming

+- Within the summer, we'll add an online text generation interface for real-time energy consumption measurement!
- More optimized inference runtimes, like TensorRT.
- Larger models with distributed inference, like Falcon 40B.
- More models, like RWKV.

+## License

This leaderboard is a research preview intended for non-commercial use only.
Model weights were taken as is from the Hugging Face Hub if available and are subject to their licenses.
The use of LLaMA weights is subject to their [license](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md).
Please direct inquiries/reports of potential violation to Jae-Won Chung.

+## Acknowledgements

+We thank [Chameleon Cloud](https://www.chameleoncloud.org/) and [CloudLab](https://cloudlab.us/) for the GPU nodes.