instruction template / sampling parameters / merge theory
template
force the regular mistral template (v3; earlier ones seem to degrade gracefully enough). chatml support may have bled through from the sappho g/j side, but no intentional effort was made to preserve it, and enough components of this final merge weren't chatml-dominant that those embeddings have presumably been averaged out too much by this point.
sampling
quadratic sampling recommended over min-p/top-p/top-k/typical-p/tfs:
- temp: 1
- smoothing factor: ~0.3–1.5 (lower to vary responses, higher to increase likelihood of instruction following)
- smoothing curve: 1.0–1.2
- low dry and/or rep. penalty as needed
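for reference, the core quadratic ("smooth") transform, as i understand the text-generation-webui-style implementation, looks roughly like the sketch below. the smoothing curve adds a higher-order bend that i've omitted, and the function name is just illustrative:

```python
import numpy as np

def quadratic_sample_transform(logits, smoothing_factor=0.3):
    """Sketch of quadratic sampling: bend logits around the max.

    Each token's logit is pushed down by the squared distance from
    the top logit, scaled by smoothing_factor. Below ~1.0 the gap
    to the top token shrinks (more varied picks); above 1.0 it
    grows (greedier, more instruction-following).
    """
    logits = np.asarray(logits, dtype=np.float64)
    max_logit = logits.max()
    return -smoothing_factor * (logits - max_logit) ** 2 + max_logit
```

the transform never changes which token is on top, only how steep the falloff away from it is, which is why it can replace the usual top-p/top-k stack with a single knob.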
sampler order
repetition_penalty
quadratic_sampling (applying it after rep. pen. should make it harder to overpenalize tokens, though before works fine too)
temperature (if you're gonna mess with it)
merge theory
the main idea here was to explore the usefulness of iterated model stock merges after observing that weights in model stock merges only affect one aspect of the merge: the average of the non-base models.
how much the base model is smudged toward that average is determined by the pairwise-averaged cosine similarity of the finetunes' deltas from the base. this means that lower-weighted models still contribute fully to the choice of which parts of the base model are changed and by how much, and simply have less say in what those parts are changed into.
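a minimal sketch of the model stock rule as i understand it from the paper, flattened to a single weight vector (real merges apply this per layer); `model_stock_merge` is an illustrative name, not mergekit's api:

```python
import numpy as np

def model_stock_merge(base, finetunes):
    """Sketch of model stock on one flat weight vector.

    The plain average of the finetunes decides *what* the base is
    pulled toward; a single scalar t, computed from the average
    pairwise cosine similarity of the task vectors (finetune - base),
    decides *how far*. A low-magnitude finetune therefore still fully
    steers the direction even though it barely moves the average.
    """
    deltas = [f - base for f in finetunes]
    k = len(deltas)
    # average pairwise cosine similarity between task vectors
    cos = np.mean([
        np.dot(deltas[i], deltas[j])
        / (np.linalg.norm(deltas[i]) * np.linalg.norm(deltas[j]))
        for i in range(k) for j in range(i + 1, k)
    ])
    t = (k * cos) / (1 + (k - 1) * cos)  # interpolation ratio
    avg = np.mean(finetunes, axis=0)
    return t * avg + (1 - t) * base
```

identical finetunes drive t to 1 (you just get the finetune back), while orthogonal ones drive t to 0 (you get the untouched base), which is the "smudging" knob described above.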
as a result, you'll notice most of the sappho merges transitively included, particularly early ones, were themselves merges of large numbers of models. this was not an attempt to incorporate the characteristics of a great number of models, but to limit the influence of the weighted finetune average on the base model (or to double down on a small part of it when choosing a previous merge as the merge base rather than the actual base or instruct-tuned model).
the hope is that this model feels like a fairly direct instruct-leaning base/instruct merge with only the most impactful-to-my-subjective-experience aspects of the other successful merges coming along for the ride. i find it quite good at grasping the context of requests in comparison to earlier attempts, which is where a lot of models, especially roleplay-tuned ones, tend to fall down in my experience.
as general advice for this style of model merging, i would say the only real thing to keep in mind is that "final" merges should be done with an unmerged base model (probably instruct-tuned if anything else in the model list is not). the results of multiple iterations repeatedly feeding the output back in as the merge base work very well as stock ingredients, though.
i also found consistently good results from merges containing several variant models made from the same merge config except different base models (base, instruct-tuned, and a finetune/merge), such as g and g2 (or n2 and n3 if you squint).
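one iteration of this, as a hypothetical mergekit config (model names here are placeholders, not real repos):

```yaml
# one model_stock iteration; feed the output back in as base_model
# on intermediate rounds, but use an unmerged (instruct) base for
# the final pass, per the advice above
merge_method: model_stock
base_model: mistralai/Mistral-7B-Instruct-v0.3
models:
  - model: previous-iteration-merge   # output of the last round
  - model: some-finetune-a            # placeholder
  - model: some-finetune-b            # placeholder
dtype: bfloat16
```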
the athena run of llama 3/3.1 8B merges has been a much rougher go of this same approach, the perfection of which yet eludes and taunts me.
N.B. quantization
the imatrix quants have a different feel to them; maybe try them as well as the static ones.