hunterhector committed on
Commit
751e527
·
verified ·
1 Parent(s): aec46d9

Update README.md

  1. README.md +8 -4
README.md CHANGED
@@ -20,7 +20,7 @@ Despite being trained on a smaller dataset of 1.4 trillion tokens—compared to
20
  It demonstrates superior performance in benchmarks like MMLU, HumanEval, and MBPP.
21
  By comparing CrystalCoder with other similar work, CrystalCoder is quite balance on language and coding tasks.
22
 
23
- | Model | Trained Tokens | Avg. of Avg. | Language Avg. | Coding Avg. | ARC | HellaSwag | MMLU (5-shot) | TruthfulQA | HumanEval (pass@1) | MBPP (pass@1) |
24
  |:-------------------:|:--------------:|:------------:|:-------------:|:-----------:|:-----:|:---------:|:-------------:|:----------:|:------------------:|:-------------:|
25
  | Mistral 7B | - | 48.68 | 62.40 | 33.95 | 59.98 | 83.31 | 64.16 | 42.15 | 29.12 | 38.78 |
26
  | **CrystalCoder 7B** | 1.27T | 39.56 | 51.68 | 27.44 | 47.44 | 74.38 | 48.42 | 36.46 | 23.90 | 30.988 |
@@ -31,10 +31,14 @@ By comparing CrystalCoder with other similar work, CrystalCoder is quite balance
31
  | LLaMA 2 7B | 2T | 34.98 | 53.39 | 16.57 | 53.07 | 77.74 | 43.80 | 38.98 | 13.05 | 20.09 |
32
  | StarCoder-15B | 1.03 | - | - | 38.46 | - | - | - | - | 33.63 | 43.28 |
33
 
34
- ** Notes **
 
 
 
 
 
 
35
  - For detailed token breakdown of CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
36
- - Scores for HumanEval is computed with a temporature of 0.2
37
- - Scores for MBPP is computed with a temperature of 0.1
38
 
39
 
40
  ## About LLM360
 
20
  It demonstrates superior performance in benchmarks like MMLU, HumanEval, and MBPP.
21
  By comparing CrystalCoder with other similar work, CrystalCoder is quite balance on language and coding tasks.
22
 
23
+ | Model | Trained Tokens | Avg. of Avg. | Language Avg. | Coding Avg. | ARC | HellaSwag | MMLU | TruthfulQA | HumanEval (pass@1) | MBPP (pass@1) |
24
  |:-------------------:|:--------------:|:------------:|:-------------:|:-----------:|:-----:|:---------:|:-------------:|:----------:|:------------------:|:-------------:|
25
  | Mistral 7B | - | 48.68 | 62.40 | 33.95 | 59.98 | 83.31 | 64.16 | 42.15 | 29.12 | 38.78 |
26
  | **CrystalCoder 7B** | 1.27T | 39.56 | 51.68 | 27.44 | 47.44 | 74.38 | 48.42 | 36.46 | 23.90 | 30.988 |
 
31
  | LLaMA 2 7B | 2T | 34.98 | 53.39 | 16.57 | 53.07 | 77.74 | 43.80 | 38.98 | 13.05 | 20.09 |
32
  | StarCoder-15B | 1.03 | - | - | 38.46 | - | - | - | - | 33.63 | 43.28 |
33
 
34
+ **Notes**
35
+ - We compute all evaluation metrics ourselves.
36
+ - Language benchmarks follow the conventions of [the Hugging Face Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard):
37
+ AI2 Reasoning Challenge (25-shot), HellaSwag (10-shot), MMLU (5-shot), and TruthfulQA (0-shot).
38
+ - As reported in prior work, the choice of temperature strongly affects the programming metrics, so we evaluate all models with the following temperatures:
39
+ - HumanEval scores are computed with a temperature of 0.2
40
+ - MBPP scores are computed with a temperature of 0.1
41
  - For detailed token breakdown of CrystalCoder dataset, refer to the [CrystalCoder dataset repository](https://huggingface.co/datasets/LLM360/CrystalCoderDatasets).
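
The evaluation settings in the notes, and the way the summary columns of the table appear to be derived, can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the names `FEW_SHOT`, `TEMPERATURE`, `CRYSTALCODER_SCORES`, and `summarize` are hypothetical, and the assumption that each summary column is an unweighted mean is not stated in the README. That reading reproduces the CrystalCoder 7B row to within rounding.

```python
# Evaluation settings as stated in the notes.
FEW_SHOT = {"ARC": 25, "HellaSwag": 10, "MMLU": 5, "TruthfulQA": 0}
TEMPERATURE = {"HumanEval": 0.2, "MBPP": 0.1}

# CrystalCoder 7B scores copied from the table.
CRYSTALCODER_SCORES = {
    "ARC": 47.44,
    "HellaSwag": 74.38,
    "MMLU": 48.42,
    "TruthfulQA": 36.46,
    "HumanEval": 23.90,
    "MBPP": 30.988,
}


def summarize(scores):
    """Compute the three summary columns for one model.

    Assumption (not confirmed by the README): every summary column
    is an unweighted arithmetic mean of its inputs.
    """
    language_avg = sum(scores[name] for name in FEW_SHOT) / len(FEW_SHOT)
    coding_avg = sum(scores[name] for name in TEMPERATURE) / len(TEMPERATURE)
    return {
        "language_avg": language_avg,
        "coding_avg": coding_avg,
        "avg_of_avg": (language_avg + coding_avg) / 2,
    }


if __name__ == "__main__":
    for column, value in summarize(CRYSTALCODER_SCORES).items():
        print(f"{column}: {value:.2f}")
```

Averaging the two group averages, rather than all six benchmarks directly, gives the language and coding groups equal weight even though the language group contains twice as many benchmarks.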
 
 
42
 
43
 
44
  ## About LLM360