AdamLucek committed
Commit 930d8a4 (verified) · 1 Parent(s): 22d1804

Update README.md

Files changed (1)
1. README.md +14 -17
README.md CHANGED
@@ -18,23 +18,20 @@ For full model details, refer to the base model page [meta-llama/Llama-3.2-1B](h
 
 ## Evaluations
 
-
- | Benchmark | Accuracy | Notes |
- |----------------|--------------------------|-------------------------------------------|
- | AGIEval | 22.14% (21.01% normalized) | 0-Shot Average across multiple reasoning tasks |
- | GPT4ALL | 51.15% (54.38% normalized) | 0-Shot Average across all categories |
- | TruthfulQA | 42.79% | MC2 accuracy |
- | MMLU | 31.22% | 5-Shot Average across all categories |
- | Winogrande | 61.72% | 0-shot evaluation |
- | ARC Challenge | 32.94% (36.01% normalized) | 0-shot evaluation |
- | ARC Easy | 64.52% (60.40% normalized) | 0-shot evaluation |
- | BoolQ | 50.24% | 0-shot evaluation |
- | PIQA | 75.46% (74.37% normalized) | 0-shot evaluation |
- | HellaSwag | 48.56% (64.71% normalized) | 0-shot evaluation |
-
- I've updated the table with the new metrics from the 15k model where applicable. Let me know if you need further adjustments or more details!
-
- [Detailed Eval Metrics Available Here](https://docs.google.com/document/d/174SRz1pb9GIJ4kIOoMOEyN6ebz3PrEX-9rNnlcVOjyM/edit?usp=sharing)
+ Evaluated with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), in comparison to [AdamLucek/Orpo-Llama-3.2-1B-40k](https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k).
+
+ | Benchmark | 15k Accuracy | 15k Normalized | 40k Accuracy | 40k Normalized | Notes |
+ |----------------|--------------|----------------|--------------|----------------|-------------------------------------------|
+ | AGIEval | 22.14% | 21.01% | 23.57% | 23.26% | 0-Shot Average across multiple reasoning tasks |
+ | GPT4ALL | 51.15% | 54.38% | 51.63% | 55.00% | 0-Shot Average across all categories |
+ | TruthfulQA | 42.79% | N/A | 42.14% | N/A | MC2 accuracy |
+ | MMLU | 31.22% | N/A | 31.01% | N/A | 5-Shot Average across all categories |
+ | Winogrande | 61.72% | N/A | 61.12% | N/A | 0-shot evaluation |
+ | ARC Challenge | 32.94% | 36.01% | 33.36% | 37.63% | 0-shot evaluation |
+ | ARC Easy | 64.52% | 60.40% | 65.91% | 60.90% | 0-shot evaluation |
+ | BoolQ | 50.24% | N/A | 52.29% | N/A | 0-shot evaluation |
+ | PIQA | 75.46% | 74.37% | 75.63% | 75.19% | 0-shot evaluation |
+ | HellaSwag | 48.56% | 64.71% | 48.46% | 64.50% | 0-shot evaluation |
 
 ## Using this Model
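
Since the new table credits [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), here is a minimal sketch of how such a run might look with the harness's v0.4 Python API. The model ID `AdamLucek/Orpo-Llama-3.2-1B-15k` is an assumption (the commit does not name this repo's ID), and the task list covers only the 0-shot rows of the table; check the harness's task registry before running.

```python
# Minimal sketch, assuming lm-evaluation-harness v0.4+ is installed and
# that this repo's checkpoint ID is "AdamLucek/Orpo-Llama-3.2-1B-15k"
# (a hypothetical ID -- substitute the actual one).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=AdamLucek/Orpo-Llama-3.2-1B-15k",  # assumed repo ID
    tasks=[
        "winogrande", "arc_challenge", "arc_easy",
        "boolq", "piqa", "hellaswag", "truthfulqa_mc2",
    ],
    num_fewshot=0,  # matches the table's 0-shot rows; the MMLU row used 5-shot
    batch_size=8,
)

# Each task reports accuracy ("acc,none") and, where defined,
# length-normalized accuracy ("acc_norm,none") -- the two metric
# columns shown per model in the table above.
for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"), metrics.get("acc_norm,none"))
```

The same run can be launched from the command line with the harness's `lm_eval` entry point and equivalent `--model`, `--model_args`, `--tasks`, and `--num_fewshot` arguments.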