|
--- |
|
language: |
|
- en |
|
license: apache-2.0 |
|
--- |
|
|
|
This is an experimental self-merge of [miqu-1-70b-sf](https://huggingface.co/152334H/miqu-1-70b-sf) |
|
using @jukofyork idea of [Downscaling the K and/or Q matrices for repeated layers in franken-merges](https://github.com/arcee-ai/mergekit/issues/198). |
|
More info about the _attenuation_ is available in this [discussion](https://huggingface.co/wolfram/miqu-1-120b/discussions/4) |
|
|
|
In my own [LLM Creativity benchmark](https://huggingface.co/datasets/froggeric/creativity) it performs sightly |
|
better than the original [wolfram/miqu-1-120b](https://huggingface.co/wolfram/miqu-1-120b). Specifically, |
|
I noticed some **improvements at creative writing, producing longer, more details, and unrushed text**. |
|
Like wolfram/miqu-1-120b though, there is some degradation over miqu-1-70b, with longer text, as it starts |
|
deviating from instructions and requires some effort to keep it on track. |
|
|
|
## Model Details |
|
|
|
- Max Context: 32764 tokens (kept the weird number from the original/base model) |
|
- Layers: 140 |
|
|
|
### Prompt template: Mistral |
|
|
|
``` |
|
<s>[INST] {prompt} [/INST] |
|
``` |
|
|
|
See also: [πΊπ¦ββ¬ LLM Prompt Format Comparison/Test: Mixtral 8x7B Instruct with **17** different instruct templates : LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/18ljvxb/llm_prompt_format_comparisontest_mixtral_8x7b/) |
|
|
|
## Merge details with mergekit |
|
|
|
```yaml |
|
############################### |
|
# miqu-1-120b-attenuated.yaml # |
|
############################### |
|
|
|
# Use: mergekit-yaml --clone-tensors ./miqu-1-120b-attenuated.yaml ./miqu-1-120b-attenuated |
|
|
|
# See: https://huggingface.co/wolfram/miqu-1-120b for original 'miqu-1-120b' layer ranges. |
|
# See: https://github.com/arcee-ai/mergekit/issues/198 for discussion/reasoning behind this idea. |
|
|
|
# --- |
|
|
|
# The scale factor to use, eg: solve x^2 = 1/2 --> x = 1/sqrt(2) β 0.7071067812 |
|
const_tag: &scale_factor 0.7071067812 # 1/sqrt(2) |
|
|
|
# The filter parameters of a scaled block. |
|
attenuate-env: &attenuated_env |
|
parameters: |
|
scale: |
|
- filter: q_proj |
|
value: *scale_factor |
|
- filter: k_proj |
|
value: *scale_factor |
|
- value: 1.0 |
|
|
|
# --- |
|
|
|
slices: |
|
|
|
########################### |
|
# Block 1: miqu-1 [0, 20] # |
|
########################### |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [0, 10] # The first 10 layers of Block 1 are not duplicated |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [10, 20] # The last 10 layers of Block 1 are are duplicated twice |
|
<<: *attenuated_env |
|
|
|
########################### |
|
# Block 2: miqu-1 [10, 30] # |
|
########################### |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [10, 30] |
|
<<: *attenuated_env |
|
|
|
########################### |
|
# Block 3: miqu-1 [20, 40] # |
|
########################### |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [20, 40] |
|
<<: *attenuated_env |
|
|
|
########################### |
|
# Block 4: miqu-1 [30, 50] # |
|
########################### |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [30, 50] |
|
<<: *attenuated_env |
|
|
|
########################### |
|
# Block 5: miqu-1 [40, 60] # |
|
########################### |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [40, 60] |
|
<<: *attenuated_env |
|
|
|
########################### |
|
# Block 6: miqu-1 [50, 70] # |
|
########################### |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [50, 70] |
|
<<: *attenuated_env |
|
|
|
########################## |
|
# Block 7: miqu-1 [60, 80] # |
|
########################## |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [60, 70] # The first 10 layers of Block 7 are are duplicated twice |
|
<<: *attenuated_env |
|
- sources: |
|
- model: miqu-1-70b-sf |
|
layer_range: [70, 80] # The last 10 layers of Block 7 are not duplicated |
|
|
|
merge_method: passthrough |
|
dtype: float16 |
|
``` |