Testing: might be broken
Another trial of merging models with different sizes. It is still under testing; this version should be more stable, but I have no idea whether it is improving or degrading the base model.
Recipe:

```yaml
merge_method: task_anysize
base_model: princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
models:
  - model: KoboldAI/Mistral-7B-Erebus-v3
    parameters:
      weight: 0.5
dtype: bfloat16
```
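As a rough illustration of what a weighted task-vector merge does, here is a toy sketch on plain arrays: the difference between the fine-tuned and base weights is scaled by `weight` and added back onto the base. This only shows the general task-arithmetic idea; how `task_anysize` actually reconciles mismatched tensor shapes between a 2.7B base and a 7B donor is a mergekit implementation detail not reproduced here, and `task_merge` is a hypothetical helper, not a mergekit API.

```python
import numpy as np

def task_merge(base, finetuned, weight):
    """Toy task-vector merge: base + weight * (finetuned - base).

    Assumes same-shaped tensors; real merge methods for models of
    different sizes must additionally map or pad mismatched layers.
    """
    task_vector = finetuned - base  # what fine-tuning changed
    return base + weight * task_vector

base = np.array([1.0, 2.0, 3.0])
finetuned = np.array([2.0, 2.0, 5.0])
merged = task_merge(base, finetuned, weight=0.5)
print(merged)  # halfway between base and fine-tuned on each weight
```

With `weight: 0.5`, as in the recipe above, the merged weights land midway between the base model and the donor wherever their parameters can be aligned.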
Detailed results can be found here.
| Metric | Value |
|---|---|
| Avg. | 41.85 |
| AI2 Reasoning Challenge (25-Shot) | 40.70 |
| HellaSwag (10-Shot) | 71.04 |
| MMLU (5-Shot) | 28.06 |
| TruthfulQA (0-shot) | 47.40 |
| Winogrande (5-shot) | 63.93 |
| GSM8k (5-shot) | 0.00 |