@@ -106,25 +106,6 @@ When using one of the quantized versions, make sure to pass the quantization con
 }
 ```
-## Standard evaluations
-cosmosage can be compared to OpenHermes-2.5-Mistral-7B using standard evaluation metrics.
-| Test Category | cosmosage_v2 | OpenHermes-2.5-Mistral-7B |
-|---------------|-------------------------|------------------------------------|
-| Overall | 0.595 | 0.632 |
-| ARC Challenge | 0.565 | 0.613 |
-| Hellaswag | 0.619 | 0.652 |
-| TruthfulQA:mc1 | 0.348 | 0.361 |
-| TruthfulQA:mc2 | 0.510 | 0.522 |
-| Winogrande | 0.759 | 0.781 |
-| GSM8k | 0.368 | 0.261 |
-cosmosage_v2 performs only slightly below OpenHermes-2.5-Mistral-7B on most metrics, indicating that the
-heavy specialization in cosmology has not come at much of a cost on general-purpose abilities. The exception
-is GSM8k, which is a collection of grade school math problems. Here, cosmosage performs significantly better
-than OpenHermes-2.5-Mistral-7B.
 ## Instruction format
 cosmosage_v2 was trained with the "inst" chat template as implemented in axolotl v0.4.0. This resulted in an
@@ -184,15 +165,22 @@ unusual instruction format:
 > In summary, the time of matter-radiation equality affects the damping tail of the CMB power spectrum by influencing the amount of time that photons spend in the diffusive state before they are able to decouple from the matter and travel freely through the universe. The longer the photons spend in the diffusive state, the more damping occurs, and the earlier matter-radiation equality occurs, the less damping occurs.>
 # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
 Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Tijmen2__cosmosage_v2)
-|             Metric              |Value|
-|---------------------------------|----:|
-|Avg.                             |60.66|
-|AI2 Reasoning Challenge (25-Shot)|59.73|
-|HellaSwag (10-Shot)              |80.90|
-|MMLU (5-Shot)                    |59.57|
-|TruthfulQA (0-shot)              |50.98|
-|Winogrande (5-shot)              |75.93|
-|GSM8k (5-shot)                   |36.85|

+|             Metric              |cosmosage_v2|OpenHermes-2.5-Mistral-7B|
+|---------------------------------|-----------:|------------------------:|
+|Avg.                             |       60.66|                    61.52|
+|AI2 Reasoning Challenge (25-Shot)|       59.73|                    64.93|
+|HellaSwag (10-Shot)              |       80.90|                    84.18|
+|MMLU (5-Shot)                    |       59.57|                    63.64|
+|TruthfulQA (0-shot)              |       50.98|                    52.24|
+|Winogrande (5-shot)              |       75.93|                    78.06|
+|GSM8k (5-shot)                   |       36.85|                    26.08|
+
+cosmosage_v2 can be compared to OpenHermes-2.5-Mistral-7B because it started from the same base model and was also trained on the OpenHermes-2.5 dataset.
+cosmosage_v2 scores 1.3 to 5.2 points below OpenHermes-2.5-Mistral-7B on five of the six benchmarks, suggesting that the
+heavy specialization in cosmology has cost little in general-purpose ability. The exception
+is GSM8k, a collection of grade school math problems, where cosmosage_v2 scores 10.77 points higher
+than OpenHermes-2.5-Mistral-7B (36.85 vs. 26.08).
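
The comparison in the added table can be checked with a few lines of arithmetic. The sketch below is not part of the commit; it copies the six benchmark scores from the table, recomputes both averages, and prints the per-benchmark gap. The names `scores`, `mean`, and `deltas` are illustrative only, and the averaging assumes the leaderboard's "Avg." row is a plain unweighted mean of the six benchmarks.

```python
# Scores copied from the leaderboard table in the diff:
# (cosmosage_v2, OpenHermes-2.5-Mistral-7B)
scores = {
    "AI2 Reasoning Challenge (25-shot)": (59.73, 64.93),
    "HellaSwag (10-shot)": (80.90, 84.18),
    "MMLU (5-shot)": (59.57, 63.64),
    "TruthfulQA (0-shot)": (50.98, 52.24),
    "Winogrande (5-shot)": (75.93, 78.06),
    "GSM8k (5-shot)": (36.85, 26.08),
}


def mean(values):
    """Unweighted mean (assumed to match the leaderboard's 'Avg.' row)."""
    values = list(values)
    return sum(values) / len(values)


cosmosage_avg = round(mean(c for c, _ in scores.values()), 2)
openhermes_avg = round(mean(o for _, o in scores.values()), 2)

# Positive delta means cosmosage_v2 scores higher than OpenHermes.
deltas = {name: round(c - o, 2) for name, (c, o) in scores.items()}

for name, delta in deltas.items():
    print(f"{name:<36}{delta:+.2f}")
print(f"{'Avg.':<36}{cosmosage_avg - openhermes_avg:+.2f}")
```

GSM8k is the only benchmark with a positive delta; the other five are negative by between roughly 1.3 and 5.2 points.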