LLM360-MBZUAI committed
Commit f95d63f · verified · 1 Parent(s): 0110944

change table of performance in benchmarks


replace the table with a premade table picture

Files changed (1)
  1. README.md +1 -10
README.md CHANGED
@@ -20,16 +20,7 @@ Despite being trained on a smaller dataset of 1.4 trillion tokens—compared to
  It demonstrates superior performance in benchmarks like MMLU, HumanEval, and MBPP.
  By comparing CrystalCoder with other similar work, CrystalCoder is quite balance on language and coding tasks.

- | Model | Trained Tokens | Avg. of Avg. | Language Avg. | Coding Avg. | ARC | HellaSwag | MMLU | TruthfulQA | HumanEval (pass@1) | MBPP (pass@1) |
- |:-------------------:|:--------------:|:------------:|:-------------:|:-----------:|:-----:|:---------:|:-------------:|:----------:|:------------------:|:-------------:|
- | Mistral 7B | - | 48.68 | 62.40 | 33.95 | 59.98 | 83.31 | 64.16 | 42.15 | 29.12 | 38.78 |
- | **CrystalCoder 7B** | 1.27T | 39.56 | 51.68 | 27.44 | 47.44 | 74.38 | 48.42 | 36.46 | 23.90 | 30.988 |
- | **CrystalCoder 7B Python/Web** | 1.4T | 41.65 | 50.92 | 32.38 | 47.01 | 71.97 | 48.78 | 35.91 | 28.38 | 36.38 |
- | CodeLlaMA 7B Base | 2.5T | 40.24 | 46.16 | 34.32 | 42.75 | 64.74 | 39.98 | 37.19 | 30.06 | 38.573 |
- | CodeLlaMA 7B - Python | 2.6T | 40.09 | 42.42 | 37.76 | 39.93 | 60.80 | 31.12 | 37.82 | 34.12 | 41.40 |
- | OpenLLaMA v2 7B | 1T | 38.10 | 48.18 | 28.01 | 43.60 | 72.20 | 41.29 | 35.54 | 15.32 | 12.69 |
- | LLaMA 2 7B | 2T | 34.98 | 53.39 | 16.57 | 53.07 | 77.74 | 43.80 | 38.98 | 13.05 | 20.09 |
- | StarCoder-15B | 1.03 | - | - | 38.46 | - | - | - | - | 33.63 | 43.28 |
+ <center><img src="performance_in_benchmarks.png" alt="performance in benchmarks" /></center>

  **Notes**
  - We compute all evaluation metrics ourselves.