Update README.md
README.md CHANGED
@@ -18,23 +18,20 @@ For full model details, refer to the base model page [meta-llama/Llama-3.2-1B](h

## Evaluations

-| ARC
-I've updated the table with the new metrics from the 15k model where applicable. Let me know if you need further adjustments or more details!
-
-[Detailed Eval Metrics Available Here](https://docs.google.com/document/d/174SRz1pb9GIJ4kIOoMOEyN6ebz3PrEX-9rNnlcVOjyM/edit?usp=sharing)
+In comparison to [AdamLucek/Orpo-Llama-3.2-1B-40k](https://huggingface.co/AdamLucek/Orpo-Llama-3.2-1B-40k), evaluated using [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
+
+| Benchmark     | 15k Accuracy | 15k Normalized | 40k Accuracy | 40k Normalized | Notes                                          |
+|---------------|--------------|----------------|--------------|----------------|------------------------------------------------|
+| AGIEval       | 22.14%       | 21.01%         | 23.57%       | 23.26%         | 0-shot average across multiple reasoning tasks |
+| GPT4ALL       | 51.15%       | 54.38%         | 51.63%       | 55.00%         | 0-shot average across all categories           |
+| TruthfulQA    | 42.79%       | N/A            | 42.14%       | N/A            | MC2 accuracy                                   |
+| MMLU          | 31.22%       | N/A            | 31.01%       | N/A            | 5-shot average across all categories           |
+| Winogrande    | 61.72%       | N/A            | 61.12%       | N/A            | 0-shot evaluation                              |
+| ARC Challenge | 32.94%       | 36.01%         | 33.36%       | 37.63%         | 0-shot evaluation                              |
+| ARC Easy      | 64.52%       | 60.40%         | 65.91%       | 60.90%         | 0-shot evaluation                              |
+| BoolQ         | 50.24%       | N/A            | 52.29%       | N/A            | 0-shot evaluation                              |
+| PIQA          | 75.46%       | 74.37%         | 75.63%       | 75.19%         | 0-shot evaluation                              |
+| HellaSwag     | 48.56%       | 64.71%         | 48.46%       | 64.50%         | 0-shot evaluation                              |
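
The numbers in the table above were produced with lm-evaluation-harness. As a rough guide only, the sketch below shows how a comparison like this might be reproduced with the harness's Python entry point; the 15k repo id, task list, few-shot setting, and batch size are assumptions for illustration, not the exact configuration used for this card.

```python
# Minimal sketch, not the exact invocation behind the table above.
# Assumes lm-evaluation-harness v0.4+ is installed (`pip install lm-eval`);
# the 15k repo id and task names below are illustrative assumptions.
import lm_eval

models = [
    "AdamLucek/Orpo-Llama-3.2-1B-15k",   # assumed repo id for the 15k model
    "AdamLucek/Orpo-Llama-3.2-1B-40k",
]
zero_shot_tasks = [
    "arc_challenge", "arc_easy", "boolq", "piqa",
    "hellaswag", "winogrande", "truthfulqa_mc2",
]

for model_id in models:
    results = lm_eval.simple_evaluate(
        model="hf",                          # Hugging Face transformers backend
        model_args=f"pretrained={model_id}",
        tasks=zero_shot_tasks,
        num_fewshot=0,                       # MMLU in the table is 5-shot; run it separately
        batch_size=8,
    )
    # "acc" corresponds to the Accuracy columns; "acc_norm", where reported,
    # corresponds to the Normalized columns.
    for task, metrics in results["results"].items():
        print(model_id, task, metrics)
```
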
## Using this Model