After that, you can `generate()` again to let the model use the tool result in the conversation. For more information,
see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).
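Putting the pieces together, here is a minimal, self-contained sketch of that tool-calling flow using the Transformers chat-templating API linked above. The repo id and the `get_current_temperature` tool are illustrative placeholders, not values taken from this card:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id -- substitute the actual model id from this card.
model_id = "pankajmathur/your-model-id"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    """
    return 22.0  # stub: a real tool would query a weather API

messages = [{"role": "user", "content": "What is the temperature in Paris, France?"}]

# Render the chat with the tool schema injected into the prompt.
inputs = tokenizer.apply_chat_template(
    messages,
    tools=[get_current_temperature],
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

# If the model emits a tool call, append the call and the tool result,
# then generate() again so the model can use the result:
messages.append({
    "role": "assistant",
    "tool_calls": [{
        "type": "function",
        "function": {"name": "get_current_temperature",
                     "arguments": {"location": "Paris, France"}},
    }],
})
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```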
## Open LLM Leaderboard Evals

|Tasks|Version|Filter|n-shot|Metric| |Value | |Stderr|
|-----|------:|------|-----:|------|---|-----:|---|-----:|
|leaderboard_bbh_boolean_expressions|1|none|3|acc_norm|↑|0.92|±|0.0172|
|leaderboard_bbh_causal_judgement|1|none|3|acc_norm|↑|0.6738|±|0.0344|
|leaderboard_bbh_date_understanding|1|none|3|acc_norm|↑|0.716|±|0.0286|
|leaderboard_bbh_disambiguation_qa|1|none|3|acc_norm|↑|0.748|±|0.0275|
|leaderboard_bbh_formal_fallacies|1|none|3|acc_norm|↑|0.728|±|0.0282|
|leaderboard_bbh_geometric_shapes|1|none|3|acc_norm|↑|0.38|±|0.0308|
|leaderboard_bbh_hyperbaton|1|none|3|acc_norm|↑|0.744|±|0.0277|
|leaderboard_bbh_logical_deduction_five_objects|1|none|3|acc_norm|↑|0.604|±|0.031|
|leaderboard_bbh_logical_deduction_seven_objects|1|none|3|acc_norm|↑|0.6|±|0.031|
|leaderboard_bbh_logical_deduction_three_objects|1|none|3|acc_norm|↑|0.94|±|0.0151|
|leaderboard_bbh_movie_recommendation|1|none|3|acc_norm|↑|0.812|±|0.0248|
|leaderboard_bbh_navigate|1|none|3|acc_norm|↑|0.696|±|0.0292|
|leaderboard_bbh_object_counting|1|none|3|acc_norm|↑|0.724|±|0.0283|
|leaderboard_bbh_penguins_in_a_table|1|none|3|acc_norm|↑|0.6712|±|0.039|
|leaderboard_bbh_reasoning_about_colored_objects|1|none|3|acc_norm|↑|0.808|±|0.025|
|leaderboard_bbh_ruin_names|1|none|3|acc_norm|↑|0.884|±|0.0203|
|leaderboard_bbh_salient_translation_error_detection|1|none|3|acc_norm|↑|0.688|±|0.0294|
|leaderboard_bbh_snarks|1|none|3|acc_norm|↑|0.7697|±|0.0316|
|leaderboard_bbh_sports_understanding|1|none|3|acc_norm|↑|0.956|±|0.013|
|leaderboard_bbh_temporal_sequences|1|none|3|acc_norm|↑|0.996|±|0.004|
|leaderboard_bbh_tracking_shuffled_objects_five_objects|1|none|3|acc_norm|↑|0.352|±|0.0303|
|leaderboard_bbh_tracking_shuffled_objects_seven_objects|1|none|3|acc_norm|↑|0.292|±|0.0288|
|leaderboard_bbh_tracking_shuffled_objects_three_objects|1|none|3|acc_norm|↑|0.404|±|0.0311|
|leaderboard_bbh_web_of_lies|1|none|3|acc_norm|↑|0.676|±|0.0297|
|leaderboard_gpqa_diamond|1|none|0|acc_norm|↑|0.399|±|0.0349|
|leaderboard_gpqa_extended|1|none|0|acc_norm|↑|0.4634|±|0.0214|
|leaderboard_gpqa_main|1|none|0|acc_norm|↑|0.4978|±|0.0236|
|leaderboard_ifeval|3|none|0|inst_level_loose_acc|↑|0.8118|±|N/A|
|leaderboard_ifeval|3|none|0|inst_level_strict_acc|↑|0.7362|±|N/A|
|leaderboard_ifeval|3|none|0|prompt_level_loose_acc|↑|0.7338|±|0.019|
|leaderboard_ifeval|3|none|0|prompt_level_strict_acc|↑|0.634|±|0.0207|
|leaderboard_math_algebra_hard|2|none|4|exact_match|↑|0.5928|±|0.0281|
|leaderboard_math_counting_and_prob_hard|2|none|4|exact_match|↑|0.3415|±|0.0429|
|leaderboard_math_geometry_hard|2|none|4|exact_match|↑|0.2045|±|0.0352|
|leaderboard_math_intermediate_algebra_hard|2|none|4|exact_match|↑|0.15|±|0.0214|
|leaderboard_math_num_theory_hard|2|none|4|exact_match|↑|0.3442|±|0.0384|
|leaderboard_math_prealgebra_hard|2|none|4|exact_match|↑|0.5803|±|0.0356|
|leaderboard_math_precalculus_hard|2|none|4|exact_match|↑|0.1556|±|0.0313|
|leaderboard_mmlu_pro|0.1|none|5|acc|↑|0.5151|±|0.0046|
|leaderboard_musr_murder_mysteries|1|none|0|acc_norm|↑|0.62|±|0.0308|
|leaderboard_musr_object_placements|1|none|0|acc_norm|↑|0.2812|±|0.0282|
|leaderboard_musr_team_allocation|1|none|0|acc_norm|↑|0.6|±|0.031|
|Average|--|--|--|--|↑|0.6058|±|0.0270|
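These scores are in the report format of EleutherAI's lm-evaluation-harness, which powers the Open LLM Leaderboard. As a rough sketch of how comparable numbers can be produced, assuming the harness's `simple_evaluate` API and a placeholder repo id (the exact harness version and settings used for this card are not stated here):

```python
import lm_eval

# Placeholder repo id -- substitute the actual model id from this card.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pankajmathur/your-model-id,dtype=bfloat16",
    # The "leaderboard" group bundles the Open LLM Leaderboard v2 tasks
    # (BBH, GPQA, IFEval, MATH-hard, MMLU-Pro, MuSR) with the standard
    # n-shot settings shown in the table above.
    tasks=["leaderboard"],
    batch_size="auto",
)

# Each entry mirrors one row of the table: metric value plus stderr.
for task, metrics in results["results"].items():
    print(task, metrics)
```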
## Llama 3.3 Responsibility & Safety
As part of our Responsible release approach, we followed a three-pronged strategy for managing trust & safety risks: