`transplant_vocab` code improvements soon
Just saw this posted on Reddit and came to say that I'll be improving the `transplant_vocab` code over the next few days, so keep an eye on the repo!
Wow, that is great! I will definitely keep an eye on it.
I've merged all the changes for now, but not 100% sure if they will work for all models:
https://github.com/jukofyork/transplant-vocab/pulls?q=is%3Apr+is%3Aclosed
(I'm still working on the full `deepseek-r1` and `deepseek-v3` models currently.)
@jukofyork did a new run here for the QwenUnslothPhi-4 and I don't notice any significant change in acceptance or speed.
Yeah, it probably needs fine-tuning to get working properly then.
Out of interest, what percentage of the `lm_head` is getting `copy` vs `mean`?
For `deepseek-r1` it's around 65%/35%, but for the `cohere` models it was more like 30%/70%, and they didn't work at all.
I think that stat may be useful for diagnostics to see if the draft has a good chance of working or not without fine-tuning.
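For anyone following along, the split comes from how the new `lm_head` rows are initialised: tokens with an exact 1-to-1 donor equivalent get their row copied, and n-to-1 tokens get the mean of the donor rows. A rough sketch of the idea (not the exact code from the repo; `mapping` here is a hypothetical dict from each target token ID to the donor token IDs it decomposes into):

```python
import torch

def init_lm_head(donor_head: torch.Tensor, mapping: dict, target_vocab_size: int):
    """Initialise a target lm_head from a donor lm_head.

    donor_head : (donor_vocab, hidden) weight from the donor model.
    mapping    : target token id -> list of donor token ids it splits into.
    """
    hidden = donor_head.shape[1]
    new_head = torch.zeros(target_vocab_size, hidden, dtype=donor_head.dtype)
    copies = means = zeros = 0
    for tgt_id in range(target_vocab_size):
        donor_ids = mapping.get(tgt_id, [])
        if len(donor_ids) == 1:    # 1-to-1 mapping: copy the donor row
            new_head[tgt_id] = donor_head[donor_ids[0]]
            copies += 1
        elif donor_ids:            # n-to-1 mapping: average the donor rows
            new_head[tgt_id] = donor_head[donor_ids].mean(dim=0)
            means += 1
        else:                      # no mapping found: leave the row as zeros
            zeros += 1
    print(f"Copies: {copies}  Means: {means}  Zeros: {zeros}")
    return new_head
```

So a high "Copies" percentage means most target tokens have an exact donor equivalent, which is why it looks like a useful diagnostic.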
Loaded OK:
- Donor vocab size : 151936
- Target vocab size : 100352 (used = 100352, unused = 0)
- Donor hidden size : 896
Processing 3 automatic token overrides:
- 'bos_token_id' : 100257 '<|endoftext|>' → [151643] '<|endoftext|>'
- 'eos_token_id' : 100265 '<|im_end|>' → [151645] '<|im_end|>'
- 'pad_token_id' : 100351 '<|dummy_87|>' → [151643] '<|endoftext|>'
Transplanting tokens: 100%|██████████| 100352/100352 [00:03<00:00, 27595.94token/s]
Transplant mappings:
- 1 to 1 : 99084 (99%)
- 2 to 1 : 174 (0.17%)
- 3 to 1 : 1005 (1%)
- 6 to 1 : 1 (0.001%)
- 7 to 1 : 11 (0.011%)
- 8 to 1 : 77 (0.077%)
Head initialized with:
- Copies : 99084 (99%)
- Means : 1268 (1.3%)
- Zeros : 0 (0%)
@jukofyork maybe I wasn't clear: I mean that I don't notice any change in acceptance or speed relative to the draft made with the previous version. I still see great results in MLX (and not so great in GGUF), so it works just as well as before.
> Transplant mappings:
> - 1 to 1 : 99084 (99%)
> - 2 to 1 : 174 (0.17%)
> - 3 to 1 : 1005 (1%)
> - 6 to 1 : 1 (0.001%)
> - 7 to 1 : 11 (0.011%)
> - 8 to 1 : 77 (0.077%)
>
> Head initialized with:
> - Copies : 99084 (99%)
> - Means : 1268 (1.3%)
> - Zeros : 0 (0%)
Thanks - this explains why you are getting such good results I think! That's an almost perfect mapping for the tokeniser!
> @jukofyork maybe I wasn't clear: I mean that I don't notice any change in acceptance or speed relative to the draft made with the previous version. I still see great results in MLX (and not so great in GGUF), so it works just as well as before.
Yeah, I haven't had great success with my attempts using `llama.cpp` and `deepseek-r1` yet. I can get around a 15-20% increase after fine-tuning the `0.5b` model, but I was hoping for much more considering I used over 3B tokens of real `r1` data to fine-tune it on :/

Having one more try now with a `0.5b` model cut down to 16/24 layers and better token mapping/initialisation, but if this doesn't work any better I'm just gonna move on to trying a coder-specific draft for `deepseek-v3` and give up on `r1`.
I tried the trimming here, but I am hitting an error when converting to MLX (I was trying to convert it without further fine-tuning to see the impact):

```
raise ValueError(f"Received parameters not in model: {extras}.")
ValueError: Received parameters not in model: model.layers.22.self_attn.k_proj.bias model.layers.23.self_attn.q_proj.bias model.layers.22.self_attn.v_proj.bias model.layers.22.self_attn.q_proj.bias model.layers.23.mlp.up_proj.weight model.layers.23.mlp.down_proj.weight model.layers.23.post_attention_layernorm.weight model.layers.22.mlp.down_proj.weight model.layers.22.self_attn.v_proj.weight model.layers.23.input_layernorm.weight model.layers.22.post_attention_layernorm.weight model.layers.23.self_attn.k_proj.weight model.layers.22.input_layernorm.weight model.layers.23.self_attn.v_proj.weight model.layers.23.self_attn.k_proj.bias model.layers.22.mlp.gate_proj.weight model.layers.22.self_attn.o_proj.weight model.layers.23.self_attn.q_proj.weight model.layers.22.mlp.up_proj.weight model.layers.23.self_attn.o_proj.weight model.layers.23.mlp.gate_proj.weight model.layers.23.self_attn.v_proj.bias model.layers.22.self_attn.k_proj.weight model.layers.22.self_attn.q_proj.weight.
```

That was after `--trim-layers 14-21`; apparently the model still has the last two layers, but `mlx_lm.convert` was not expecting them?
> Yeah, I haven't had great success with my attempts using `llama.cpp` and `deepseek-r1` yet. I can get around a 15-20% increase after fine-tuning the `0.5b` model, but I was hoping for much more considering I used over 3B tokens of real `r1` data to fine-tune it on :/
Wow. I tried some small fine-tuning to see if there were any improvements, but it was just a few million tokens; clearly I would need way more.

@Echo9Zulu This discussion might be interesting for you; I wonder what kind of "Head initialized with" numbers you are getting with EXAONE.
> I tried the trimming here, but I am hitting an error when converting to MLX (I was trying to convert it without further fine-tuning to see the impact):
>
> ValueError: Received parameters not in model: model.layers.22.self_attn.k_proj.bias model.layers.23.self_attn.q_proj.bias …
>
> That was after `--trim-layers 14-21`; apparently the model still has the last two layers, but `mlx_lm.convert` was not expecting them?
It could be a bug, as I haven't tested the final model after trimming yet. If you upload the safetensors files after trimming to a private Hugging Face repo and then click on one in the files section, it should show you the tensors in the file. Can you check if there are still some layer 22 and 23 tensors hanging around in the file? They should all be renumbered 0 to 15, but it's possible I've messed the code up for this bit.
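If it's easier than uploading, a quick local check along these lines should also work (a minimal sketch using the `safetensors` library; the file path is just a placeholder):

```python
from safetensors import safe_open

# Placeholder path: point this at the trimmed model's safetensors shard(s).
path = "model.safetensors"

with safe_open(path, framework="pt") as f:
    names = sorted(f.keys())

# List any tensors that still reference the old (untrimmed) layer numbers.
leftovers = [n for n in names if ".layers.22." in n or ".layers.23." in n]
print(f"{len(names)} tensors total, {len(leftovers)} leftover layer-22/23 tensors")
for n in leftovers:
    print(" ", n)
```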
> > Yeah, I haven't had great success with my attempts using `llama.cpp` and `deepseek-r1` yet. I can get around a 15-20% increase after fine-tuning the `0.5b` model, but I was hoping for much more considering I used over 3B tokens of real `r1` data to fine-tune it on :/
>
> Wow. I tried some small fine-tuning to see if there were any improvements, but it was just a few million tokens; clearly I would need way more.
Yeah, was pretty disappointing as I was getting nearly that level of speedup without any fine-tuning :(
> > I tried the trimming here, but I am hitting an error when converting to MLX (I was trying to convert it without further fine-tuning to see the impact):
> >
> > ValueError: Received parameters not in model: model.layers.22.self_attn.k_proj.bias model.layers.23.self_attn.q_proj.bias …
> >
> > That was after `--trim-layers 14-21`; apparently the model still has the last two layers, but `mlx_lm.convert` was not expecting them?
>
> It could be a bug, as I haven't tested the final model after trimming yet. If you upload the safetensors files after trimming to a private Hugging Face repo and then click on one in the files section, it should show you the tensors in the file. Can you check if there are still some layer 22 and 23 tensors hanging around in the file? They should all be renumbered 0 to 15, but it's possible I've messed the code up for this bit.
Yeah, I have messed up. Using `safetensormetadump` from https://github.com/huggingface/safetensors/discussions/275:
"model.layers.22.input_layernorm.weight": {
"dtype": "F32",
"shape": [
896
],
"data_offsets": [
1463524864,
1463528448
]
},
"model.layers.22.mlp.down_proj.weight": {
"dtype": "F32",
"shape": [
896,
4864
],
"data_offsets": [
1463528448,
1480961024
]
},
"model.layers.22.mlp.gate_proj.weight": {
"dtype": "F32",
"shape": [
4864,
896
],
"data_offsets": [
1480961024,
1498393600
]
},
"model.layers.22.mlp.up_proj.weight": {
I'll try and fix it now.
Yeah, you were faster than me; I see the same.
I think I have fixed it now, but weirdly the `transformers` library didn't even seem to care that they were there :/
Shows how hard it is to do this sort of thing robustly!
Yes, so many different implementations in this AI world.
While you were working on the proper fix, I was able to get it working locally by adding `layer_match = match`:

```python
# Create a new tensor to avoid shared memory issues
new_state_dict[new_key] = tensor.clone()
layer_match = match
break
```
Anyway, I discarded the local changes, I am now in sync with your repo again, and I've re-generated the model.

I can confirm that I can now create the MLX model and that, as expected, it is terrible and will need fine-tuning :)) I will leave it training overnight on a dataset that I made earlier today.
Without any fine-tuning it will be absolutely horrible though (I get a starting perplexity of 400,000 vs 90 at the start of fine-tuning!).
The biggest problem is that the final hidden-state vectors' magnitudes (i.e. the ones that go into the `lm_head` tensor) will all be too small, and these get fixed in the first few steps of fine-tuning.
It doesn't take much fine-tuning to fix this part of the problem though and very quickly it will have a much more sensible perplexity and start to catch the non-trimmed model...
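If you want to see this effect directly, something along these lines should show it (a rough sketch, assuming the draft loads with `transformers`; the model path and prompt are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: the trimmed draft model produced by transplant_vocab.
path = "./DeepSeek-R1-DRAFT-0.25B"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Mean L2 norm of the final hidden states that feed the lm_head; the claim
# above is that this starts out too small after trimming and recovers
# within the first few fine-tuning steps.
final_hidden = out.hidden_states[-1]
print("mean final hidden-state norm:", final_hidden.norm(dim=-1).mean().item())
```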
> I can confirm that I can now create the MLX model and that, as expected, it is terrible and will need fine-tuning :)) I will leave it training overnight on a dataset that I made earlier today.
Sorry, posted right at the same time as your reply :)
Yeah, it might be worth trying (and even removing more layers if you can), as at least for `llama.cpp`, the faster and smaller the speculative model, the better it works.
@rdsm thanks for the mention! @jukofyork you should check out the paper "FastDraft: How to Train Your Draft". Chances are this work was done before Qwen2 0.5b launched and it was feasible from a compute perspective to implement `transplant_vocab` without fine-tuning. My understanding is that IBM implemented this approach with their Granite accelerator series.
No, I don't have any progress yet lol and I will keep tabs on this.
By the time it's done 200-300 more steps it won't be that far behind the non-trimmed model in terms of loss/perplexity and top-1 hit-rate.
Great! I have started the training here; I will leave it running overnight, as it is already quite late for me, and resume playing with it tomorrow.

Thanks for the quick turnaround here! It was very fun.
@Echo9Zulu Thanks I'll give it a read tomorrow.
Yeah, I'm in the UK and the clocks have just gone forward and I was wondering where the last hour had gone :D
@jukofyork Wait, you have "daylight savings" in the UK like we have in the US? That's wild; I thought the acceptance rate would be lower on that sort of thing lol
@jukofyork oh, I am in Ireland, that explains it, the clocks also changed here. Forgot about that!
@Echo9Zulu I'm not sure about the UK, but here it is the opposite: the official time is the one ahead (Irish Standard Time), and we change to the other one, one hour behind, during the winter (not sure why…)
It recovered most of the performance (cyan line) vs the full model (grey line):
The magenta line is my new experiment that uses just 12 out of 24 layers, and also the `--trim-intermediate-size` option that I just merged:
```bash
python3 ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./DeepSeek-R1 \
    ./finetunes/STAGE1/DeepSeek-R1-DRAFT-0.25B \
    --overwrite \
    --use-cpu-only \
    --trim-layers 10-21 \
    --trim-intermediate-size 2432 \
    --override "<｜▁pad▁｜>" "<|endoftext|>" \
    --override "<｜fim▁hole｜>" "<|fim_middle|>" \
    --override "<｜fim▁begin｜>" "<|fim_prefix|>" \
    --override "<｜fim▁end｜>" "<|fim_suffix|>" \
    --override "<｜User｜>" "<|im_start|>user\\n" \
    --override "<｜Assistant｜>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<｜tool▁calls▁begin｜>" "<tool_call>" \
    --override "<｜tool▁call▁begin｜>" "<tool_call>" \
    --override "<｜tool▁outputs▁begin｜>" "<tool_call>" \
    --override "<｜tool▁output▁begin｜>" "<tool_call>" \
    --override "<｜tool▁calls▁end｜>" "</tool_call>" \
    --override "<｜tool▁call▁end｜>" "</tool_call>" \
    --override "<｜tool▁outputs▁end｜>" "</tool_call>" \
    --override "<｜tool▁output▁end｜>" "</tool_call>" \
    --override "<｜tool▁sep｜>" "</tool_call>"
```
The small `qwen` models seem to use quite a high `hidden:intermediate` ratio:
- Donor hidden size : 896
- Donor intermediate size : 4864 (ratio = 1:5.4)
compared to other small models, and I doubt it is all that useful if we mainly care about speculative decoding predicting summarised text or long variable names in code, etc.
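For illustration, trimming the intermediate size basically means slicing the `gate_proj`/`up_proj` output channels and the matching `down_proj` input channels consistently. A rough sketch of the idea (this just keeps the first `k` channels and is not necessarily how the repo actually selects them):

```python
import torch

def trim_mlp(gate_proj: torch.Tensor,   # (intermediate, hidden)
             up_proj: torch.Tensor,     # (intermediate, hidden)
             down_proj: torch.Tensor,   # (hidden, intermediate)
             new_intermediate: int):
    """Shrink the MLP intermediate dimension by keeping the first k channels.

    A real implementation might rank channels by weight magnitude instead;
    this only illustrates which axes have to be sliced consistently.
    """
    gate = gate_proj[:new_intermediate, :].clone()
    up = up_proj[:new_intermediate, :].clone()
    down = down_proj[:, :new_intermediate].clone()
    return gate, up, down

# Example with the Qwen2.5-0.5B shapes: 4864 -> 2432 intermediate channels.
g, u, d = trim_mlp(torch.randn(4864, 896), torch.randn(4864, 896),
                   torch.randn(896, 4864), 2432)
print(g.shape, u.shape, d.shape)  # (2432, 896) (2432, 896) (896, 2432)
```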
I've also tried to use a more robust way of saving the final model:
```python
model = type(model)(model.config)
model.load_state_dict(new_state_dict)
```
and added a debugging function that is commented out by default:
```python
# debug_model_tensors(model, new_state_dict)
```
and added the `safetensor_meta_dump.sh` script to the repo to help with debugging too:
```bash
#!/bin/bash

# Specify the file
FILE="$1"

# Extract the first 8 bytes and convert them to a decimal integer
HEADER_LENGTH=$(dd "if=$FILE" bs=1 count=8 2>/dev/null | od -An -vtu8)

# Extract the metadata, starting from the 9th byte
dd "if=$FILE" bs=1 skip=8 "count=$HEADER_LENGTH" 2>/dev/null | jq
```
I probably won't have time to do much more now, but I've implemented most of what I could think of (`hidden_size` reduction was too much hassle, as you have to deal with all the `layer_norm` tensors, the head sizes need reducing, and so on).

Again, I'm still only testing this on `deepseek-r1` and `qwen-2.5-instruct:0.5b`, so there may be bugs that aren't showing for me, and I'd appreciate it if you can test it on your models to check it works! :)
> @rdsm thanks for the mention! @jukofyork you should check out the paper "FastDraft: How to Train Your Draft".
Thanks for this reference! I think it may explain why I wasn't getting very good results:
I was only using a subset of the common-crawl dataset for the continued pre-training, so now I've added an equal amount of this dataset to see the effect.
@jukofyork Woah that's a cool finding. Definitely let us know your results
That is interesting. I tried it here: my dataset is divided into categories and I was not going through all of it, and shuffling it before training seems to have made the improvements come faster.
But clearly training on a couple of million tokens is not enough for trimmed models, so I think I will focus on trying to improve the untrimmed model.
@jukofyork you might find this interesting: almost 50/50, and it works fine (deephermes-3-mistral-24b), with a nice bump in performance from 7 tk/s to 10 tk/s on MLX.
Loaded OK:
- Donor vocab size : 151936
- Target vocab size : 131078 (used = 131078, unused = 0)
- Donor hidden size : 896
Processing 3 automatic token overrides:
- 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
- 'eos_token_id' : 131072 '<|eot_id|>' → [151645] '<|im_end|>'
- 'pad_token_id' : 131077 '<|end_of_text|>' → [151643] '<|endoftext|>'
Transplanting tokens: 100%|██████████| 131078/131078 [00:35<00:00, 3641.54token/s]
Transplant mappings:
- 1 to 1 : 67905 (52%)
- 2 to 1 : 45124 (34%)
- 3 to 1 : 11133 (8.5%)
- 4 to 1 : 3358 (2.6%)
- 5 to 1 : 1261 (0.96%)
- 6 to 1 : 635 (0.48%)
- 7 to 1 : 1169 (0.89%)
- 8 to 1 : 186 (0.14%)
- 9 to 1 : 95 (0.072%)
- 10 to 1 : 66 (0.05%)
- 11 to 1 : 34 (0.026%)
- 12 to 1 : 28 (0.021%)
- 13 to 1 : 22 (0.017%)
- 14 to 1 : 15 (0.011%)
- 15 to 1 : 9 (0.0069%)
- 16 to 1 : 11 (0.0084%)
- 17 to 1 : 12 (0.0092%)
- 18 to 1 : 2 (0.0015%)
- 19 to 1 : 4 (0.0031%)
- 20 to 1 : 4 (0.0031%)
- 21 to 1 : 2 (0.0015%)
- 22 to 1 : 1 (0.00076%)
- 23 to 1 : 1 (0.00076%)
- 28 to 1 : 1 (0.00076%)
Head initialized with:
- Copies : 67905 (52%)
- Means : 63173 (48%)
- Zeros : 0 (0%)
@rdsm It's interesting that it seems to work better on MLX!? I tried twice to buy a 128GB M2 Ultra off eBay about a month ago and both sellers tried to scam me lol (well, the second guy tried to scam the postal insurance and not me directly), so sadly there's no way I can test or improve it for MLX yet.
Regarding the paper: that turned out to be super useful as I think the problem I was having all along was not including any raw code in the "pre-training" datasets! I'm now trying the following mixes with ~3B tokens all trained in one go (ie: rather than 2 stages):
For a general draft model:

- 30% `deepseek-r1` data that is mostly textual (the dolphin dataset).
- 30% `deepseek-r1` data that is mostly mathematics (the open thoughts dataset).
- 30% from the stack-v1-smol-xl dataset.
- 10% from the creative commons sample dataset.
For the coder-specific draft model I use 22.5% / 22.5% / 45% / 10% of the same datasets.
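As a rough sketch of how a mix like this could be put together (assuming the four sources are available as Hugging Face `datasets` objects; the file names here are placeholders, not the actual datasets used):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder files: substitute the actual local copies of the four sources.
text_r1 = load_dataset("json", data_files="r1_dolphin.jsonl", split="train")
math_r1 = load_dataset("json", data_files="r1_open_thoughts.jsonl", split="train")
code = load_dataset("json", data_files="stack_smol_xl.jsonl", split="train")
cc = load_dataset("json", data_files="common_crawl_sample.jsonl", split="train")

# General draft mix: 30% / 30% / 30% / 10%, sampled stochastically so the
# categories are interleaved rather than trained in separate stages.
mixed = interleave_datasets(
    [text_r1, math_r1, code, cc],
    probabilities=[0.30, 0.30, 0.30, 0.10],
    seed=42,
    stopping_strategy="all_exhausted",
)
```

For the coder-specific mix the probabilities would just change to [0.225, 0.225, 0.45, 0.10].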
It will be 4-5 days before I have these two trained up for `r1` and `v3`, but so far it looks like this may be what I was missing, and that either using 2 stages or the lack of code data was causing the problems!
> It's interesting that it seems to work better on MLX!?
The only reference points that I have are my two machines (M1 Max 64GB and base M4 32GB): on the M1 Max I don't see much difference, but on the M4 I see a lot. I've heard that the M1 doesn't do well with speculative decoding, so that might be it.
> It will be 4-5 days before I have these two trained up
Interesting; I'm anxious to see by how much they outperform Qwen 2.5 0.5b.