LLM360-MBZUAI committed
Commit f95d63f · verified · 1 Parent(s): 0110944

change table of performance in benchmarks


replace the table with a premade table picture

Files changed (1)
  1. README.md +1 -10
README.md CHANGED
@@ -20,16 +20,7 @@ Despite being trained on a smaller dataset of 1.4 trillion tokens—compared to
  It demonstrates superior performance in benchmarks like MMLU, HumanEval, and MBPP.
  By comparing CrystalCoder with other similar work, CrystalCoder is quite balance on language and coding tasks.

- | Model | Trained Tokens | Avg. of Avg. | Language Avg. | Coding Avg. | ARC | HellaSwag | MMLU | TruthfulQA | HumanEval (pass@1) | MBPP (pass@1) |
- |:-------------------:|:--------------:|:------------:|:-------------:|:-----------:|:-----:|:---------:|:-------------:|:----------:|:------------------:|:-------------:|
- | Mistral 7B | - | 48.68 | 62.40 | 33.95 | 59.98 | 83.31 | 64.16 | 42.15 | 29.12 | 38.78 |
- | **CrystalCoder 7B** | 1.27T | 39.56 | 51.68 | 27.44 | 47.44 | 74.38 | 48.42 | 36.46 | 23.90 | 30.988 |
- | **CrystalCoder 7B Python/Web** | 1.4T | 41.65 | 50.92 | 32.38 | 47.01 | 71.97 | 48.78 | 35.91 | 28.38 | 36.38 |
- | CodeLlaMA 7B Base | 2.5T | 40.24 | 46.16 | 34.32 | 42.75 | 64.74 | 39.98 | 37.19 | 30.06 | 38.573 |
- | CodeLlaMA 7B - Python | 2.6T | 40.09 | 42.42 | 37.76 | 39.93 | 60.80 | 31.12 | 37.82 | 34.12 | 41.40 |
- | OpenLLaMA v2 7B | 1T | 38.10 | 48.18 | 28.01 | 43.60 | 72.20 | 41.29 | 35.54 | 15.32 | 12.69 |
- | LLaMA 2 7B | 2T | 34.98 | 53.39 | 16.57 | 53.07 | 77.74 | 43.80 | 38.98 | 13.05 | 20.09 |
- | StarCoder-15B | 1.03 | - | - | 38.46 | - | - | - | - | 33.63 | 43.28 |
+ <center><img src="performance_in_benchmarks.png" alt="performance in benchmarks" /></center>

  **Notes**
  - We compute all evaluation metrics ourselves.