Update README.md
README.md CHANGED
@@ -18,23 +18,20 @@ For full model details, refer to the base model page [meta-llama/Llama-3.2-1B](h

## Evaluations

-| ARC
-I've updated the table with the new metrics from the 15k model where applicable. Let me know if you need further adjustments or more details!
-
-[Detailed Eval Metrics Available Here](https://docs.google.com/document/d/174SRz1pb9GIJ4kIOoMOEyN6ebz3PrEX-9rNnlcVOjyM/edit?usp=sharing)
+In comparison to [AdamLucek/Orpo-Llama-3.2-1B-40k](https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k), evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
+
+| Benchmark     | 15k Accuracy | 15k Normalized | 40k Accuracy | 40k Normalized | Notes                                          |
+|---------------|--------------|----------------|--------------|----------------|------------------------------------------------|
+| AGIEval       | 22.14%       | 21.01%         | 23.57%       | 23.26%         | 0-shot average across multiple reasoning tasks |
+| GPT4ALL       | 51.15%       | 54.38%         | 51.63%       | 55.00%         | 0-shot average across all categories           |
+| TruthfulQA    | 42.79%       | N/A            | 42.14%       | N/A            | MC2 accuracy                                   |
+| MMLU          | 31.22%       | N/A            | 31.01%       | N/A            | 5-shot average across all categories           |
+| Winogrande    | 61.72%       | N/A            | 61.12%       | N/A            | 0-shot evaluation                              |
+| ARC Challenge | 32.94%       | 36.01%         | 33.36%       | 37.63%         | 0-shot evaluation                              |
+| ARC Easy      | 64.52%       | 60.40%         | 65.91%       | 60.90%         | 0-shot evaluation                              |
+| BoolQ         | 50.24%       | N/A            | 52.29%       | N/A            | 0-shot evaluation                              |
+| PIQA          | 75.46%       | 74.37%         | 75.63%       | 75.19%         | 0-shot evaluation                              |
+| HellaSwag     | 48.56%       | 64.71%         | 48.46%       | 64.50%         | 0-shot evaluation                              |
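
The numbers in the table above were produced with lm-evaluation-harness. As a rough guide only, the sketch below shows how a comparison like this might be reproduced with the harness's Python entry point; the 15k repo id, task list, few-shot setting, and batch size are assumptions for illustration, not the exact configuration used for this card.

```python
# Minimal sketch, not the exact invocation behind the table above.
# Assumes lm-evaluation-harness v0.4+ is installed (`pip install lm-eval`);
# the 15k repo id and task names below are illustrative assumptions.
import lm_eval

models = [
    "AdamLucek/Orpo-Llama-3.2-1B-15k",   # assumed repo id for the 15k model
    "AdamLucek/Orpo-Llama-3.2-1B-40k",
]
zero_shot_tasks = [
    "arc_challenge", "arc_easy", "boolq", "piqa",
    "hellaswag", "winogrande", "truthfulqa_mc2",
]

for model_id in models:
    results = lm_eval.simple_evaluate(
        model="hf",                          # Hugging Face transformers backend
        model_args=f"pretrained={model_id}",
        tasks=zero_shot_tasks,
        num_fewshot=0,                       # MMLU in the table is 5-shot; run it separately
        batch_size=8,
    )
    # "acc" corresponds to the Accuracy columns; "acc_norm", where reported,
    # corresponds to the Normalized columns.
    for task, metrics in results["results"].items():
        print(model_id, task, metrics)
```
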
## Using this Model