`transplant_vocab` code improvements soon
Just saw this posted on Reddit and came to say that I'll be improving the `transplant_vocab` code over the next few days, so keep an eye on the repo!
Wow, that is great! I will definitely keep an eye on it.
I've merged all the changes for now, but not 100% sure if they will work for all models:
https://github.com/jukofyork/transplant-vocab/pulls?q=is%3Apr+is%3Aclosed
(I'm still working on the full `deepseek-r1` and `deepseek-v3` models currently.)
@jukofyork did a new run here for the QwenUnslothPhi-4 and I don't notice any significant change in acceptance or speed.
Yeah, it probably needs fine-tuning to get working properly then.
Out of interest, what percentage of the `lm_head` is getting `copy` vs `mean`?
For `deepseek-r1` it's around 65%/35%, but for the `cohere` models it was more like 30%/70%, and they didn't work at all.
I think that stat may be useful for diagnostics to see if the draft has a good chance of working or not without fine-tuning.
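For anyone following along, the split comes from how the new `lm_head` rows are initialised: tokens with an exact 1-to-1 donor equivalent get their row copied, and n-to-1 tokens get the mean of the donor rows. A rough sketch of the idea (not the exact code from the repo; `mapping` here is a hypothetical dict from each target token ID to the donor token IDs it decomposes into):

```python
import torch

def init_lm_head(donor_head: torch.Tensor, mapping: dict, target_vocab_size: int):
    """Initialise a target lm_head from a donor lm_head.

    donor_head : (donor_vocab, hidden) weight from the donor model.
    mapping    : target token id -> list of donor token ids it splits into.
    """
    hidden = donor_head.shape[1]
    new_head = torch.zeros(target_vocab_size, hidden, dtype=donor_head.dtype)
    copies = means = zeros = 0
    for tgt_id in range(target_vocab_size):
        donor_ids = mapping.get(tgt_id, [])
        if len(donor_ids) == 1:    # 1-to-1 mapping: copy the donor row
            new_head[tgt_id] = donor_head[donor_ids[0]]
            copies += 1
        elif donor_ids:            # n-to-1 mapping: average the donor rows
            new_head[tgt_id] = donor_head[donor_ids].mean(dim=0)
            means += 1
        else:                      # no mapping found: leave the row as zeros
            zeros += 1
    print(f"Copies: {copies}  Means: {means}  Zeros: {zeros}")
    return new_head
```

So a high "Copies" percentage means most target tokens have an exact donor equivalent, which is why it looks like a useful diagnostic.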
Loaded OK:
- Donor vocab size : 151936
- Target vocab size : 100352 (used = 100352, unused = 0)
- Donor hidden size : 896
Processing 3 automatic token overrides:
- 'bos_token_id' : 100257 '<|endoftext|>' → [151643] '<|endoftext|>'
- 'eos_token_id' : 100265 '<|im_end|>' → [151645] '<|im_end|>'
- 'pad_token_id' : 100351 '<|dummy_87|>' → [151643] '<|endoftext|>'
Transplanting tokens: 100%|██████████| 100352/100352 [00:03<00:00, 27595.94token/s]
Transplant mappings:
- 1 to 1 : 99084 (99%)
- 2 to 1 : 174 (0.17%)
- 3 to 1 : 1005 (1%)
- 6 to 1 : 1 (0.001%)
- 7 to 1 : 11 (0.011%)
- 8 to 1 : 77 (0.077%)
Head initialized with:
- Copies : 99084 (99%)
- Means : 1268 (1.3%)
- Zeros : 0 (0%)
@jukofyork maybe I wasn't clear: I mean that I don't notice any change in acceptance or speed relative to the draft made with the previous version. I still see great results in MLX (and not so great in GGUF), so it works just as well as before.
> Transplant mappings:
> - 1 to 1 : 99084 (99%)
> - 2 to 1 : 174 (0.17%)
> - 3 to 1 : 1005 (1%)
> - 6 to 1 : 1 (0.001%)
> - 7 to 1 : 11 (0.011%)
> - 8 to 1 : 77 (0.077%)
>
> Head initialized with:
> - Copies : 99084 (99%)
> - Means : 1268 (1.3%)
> - Zeros : 0 (0%)
Thanks - this explains why you are getting such good results I think! That's an almost perfect mapping for the tokeniser!
> @jukofyork maybe I wasn't clear: I mean that I don't notice any change in acceptance or speed relative to the draft made with the previous version. I still see great results in MLX (and not so great in GGUF), so it works just as well as before.
Yeah, I haven't had great success with my attempts using `llama.cpp` and `deepseek-r1` yet. I can get around a 15-20% increase after fine-tuning the `0.5b` model, but I was hoping for much more considering I used over 3B tokens of real `r1` data to fine-tune it on :/

Having one more try now with a `0.5b` model cut down to 16/24 layers and better token mapping/initialisation, but if this doesn't work any better I'm just gonna move on to trying a coder-specific draft for `deepseek-v3` and give up on `r1`.
I tried the trimming here, but I am hitting an error when converting to MLX (I was trying to convert it without further fine-tuning to see the impact):

```
raise ValueError(f"Received parameters not in model: {extras}.")
ValueError: Received parameters not in model: model.layers.22.self_attn.k_proj.bias model.layers.23.self_attn.q_proj.bias model.layers.22.self_attn.v_proj.bias model.layers.22.self_attn.q_proj.bias model.layers.23.mlp.up_proj.weight model.layers.23.mlp.down_proj.weight model.layers.23.post_attention_layernorm.weight model.layers.22.mlp.down_proj.weight model.layers.22.self_attn.v_proj.weight model.layers.23.input_layernorm.weight model.layers.22.post_attention_layernorm.weight model.layers.23.self_attn.k_proj.weight model.layers.22.input_layernorm.weight model.layers.23.self_attn.v_proj.weight model.layers.23.self_attn.k_proj.bias model.layers.22.mlp.gate_proj.weight model.layers.22.self_attn.o_proj.weight model.layers.23.self_attn.q_proj.weight model.layers.22.mlp.up_proj.weight model.layers.23.self_attn.o_proj.weight model.layers.23.mlp.gate_proj.weight model.layers.23.self_attn.v_proj.bias model.layers.22.self_attn.k_proj.weight model.layers.22.self_attn.q_proj.weight.
```

That was after `--trim-layers 14-21`; apparently the model still has the last two layers, but `mlx_lm.convert` was not expecting them?
> Yeah, I haven't had great success with my attempts using `llama.cpp` and `deepseek-r1` yet. I can get around a 15-20% increase after fine-tuning the `0.5b` model, but I was hoping for much more considering I used over 3B tokens of real `r1` data to fine-tune it on :/
Wow. I tried some small fine-tuning to see if there were any improvements, but it was just a few million tokens; clearly I would need way more.

@Echo9Zulu This discussion might be interesting for you; I wonder what kind of "Head initialized with" numbers you are getting with EXAONE.
> I tried the trimming here, but I am hitting an error when converting to MLX (I was trying to convert it without further fine-tuning to see the impact):
>
> ValueError: Received parameters not in model: model.layers.22.self_attn.k_proj.bias model.layers.23.self_attn.q_proj.bias …
>
> That was after `--trim-layers 14-21`; apparently the model still has the last two layers, but `mlx_lm.convert` was not expecting them?
It could be a bug, as I haven't tested the final model after trimming yet. If you upload the safetensors files after trimming to a private Hugging Face repo and then click on one in the files section, it should show you the tensors in the file. Can you check if there are still some layer 22 and 23 tensors hanging around in the file? They should all be renumbered 0 to 15, but it's possible I've messed the code up for this bit.
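If it's easier than uploading, a quick local check along these lines should also work (a minimal sketch using the `safetensors` library; the file path is just a placeholder):

```python
from safetensors import safe_open

# Placeholder path: point this at the trimmed model's safetensors shard(s).
path = "model.safetensors"

with safe_open(path, framework="pt") as f:
    names = sorted(f.keys())

# List any tensors that still reference the old (untrimmed) layer numbers.
leftovers = [n for n in names if ".layers.22." in n or ".layers.23." in n]
print(f"{len(names)} tensors total, {len(leftovers)} leftover layer-22/23 tensors")
for n in leftovers:
    print(" ", n)
```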
> > Yeah, I haven't had great success with my attempts using `llama.cpp` and `deepseek-r1` yet. I can get around a 15-20% increase after fine-tuning the `0.5b` model, but I was hoping for much more considering I used over 3B tokens of real `r1` data to fine-tune it on :/
>
> Wow. I tried some small fine-tuning to see if there were any improvements, but it was just a few million tokens; clearly I would need way more.
Yeah, was pretty disappointing as I was getting nearly that level of speedup without any fine-tuning :(
> > I tried the trimming here, but I am hitting an error when converting to MLX (I was trying to convert it without further fine-tuning to see the impact):
> >
> > ValueError: Received parameters not in model: model.layers.22.self_attn.k_proj.bias model.layers.23.self_attn.q_proj.bias …
> >
> > That was after `--trim-layers 14-21`; apparently the model still has the last two layers, but `mlx_lm.convert` was not expecting them?
>
> It could be a bug, as I haven't tested the final model after trimming yet. If you upload the safetensors files after trimming to a private Hugging Face repo and then click on one in the files section, it should show you the tensors in the file. Can you check if there are still some layer 22 and 23 tensors hanging around in the file? They should all be renumbered 0 to 15, but it's possible I've messed the code up for this bit.
Yeah, I have messed up. Using `safetensormetadump` from https://github.com/huggingface/safetensors/discussions/275:
"model.layers.22.input_layernorm.weight": {
"dtype": "F32",
"shape": [
896
],
"data_offsets": [
1463524864,
1463528448
]
},
"model.layers.22.mlp.down_proj.weight": {
"dtype": "F32",
"shape": [
896,
4864
],
"data_offsets": [
1463528448,
1480961024
]
},
"model.layers.22.mlp.gate_proj.weight": {
"dtype": "F32",
"shape": [
4864,
896
],
"data_offsets": [
1480961024,
1498393600
]
},
"model.layers.22.mlp.up_proj.weight": {
I'll try and fix it now.
Yeah, you were faster than me; I see the same.
I think I have fixed it now, but weirdly the `transformers` library didn't even seem to care that they were there :/
Shows how hard it is to do this sort of thing robustly!
Yes, so many different implementations in this AI world.
While you were working on the proper fix, I was able to get it working locally by adding `layer_match = match`:

```python
# Create a new tensor to avoid shared memory issues
new_state_dict[new_key] = tensor.clone()
layer_match = match
break
```
Anyway, I discarded the local changes, I am now in sync with your repo again, and I've re-generated the model.

I can confirm that I can now create the MLX model and that, as expected, it is terrible and will need fine-tuning :)) I will leave it training overnight on a dataset that I made earlier today.
Without any fine-tuning it will be absolutely horrible though (I get a starting perplexity of 400,000 vs 90 at the start of fine-tuning!).
The biggest problem is that the final hidden-state vectors' magnitudes (i.e. the ones that go into the `lm_head` tensor) will all be too small, and these get fixed in the first few steps of fine-tuning.
It doesn't take much fine-tuning to fix this part of the problem though and very quickly it will have a much more sensible perplexity and start to catch the non-trimmed model...
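If you want to see this effect directly, something along these lines should show it (a rough sketch, assuming the draft loads with `transformers`; the model path and prompt are just placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: the trimmed draft model produced by transplant_vocab.
path = "./DeepSeek-R1-DRAFT-0.25B"

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Mean L2 norm of the final hidden states that feed the lm_head; the claim
# above is that this starts out too small after trimming and recovers
# within the first few fine-tuning steps.
final_hidden = out.hidden_states[-1]
print("mean final hidden-state norm:", final_hidden.norm(dim=-1).mean().item())
```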
> I can confirm that I can now create the MLX model and that, as expected, it is terrible and will need fine-tuning :)) I will leave it training overnight on a dataset that I made earlier today.
Sorry, posted right at the same time as your reply :)
Yeah, it might be worth trying (and even removing more layers if you can), as at least for `llama.cpp`, the faster and smaller the speculative model, the better it works.
@rdsm thanks for the mention! @jukofyork you should check out the paper "FastDraft: How to Train Your Draft". Chances are this work was done before Qwen2 0.5b launched and it was feasible from a compute perspective to implement `transplant_vocab` without fine-tuning. My understanding is that IBM implemented this approach with their Granite accelerator series.
No, I don't have any progress yet lol and I will keep tabs on this.
By the time it's done 200-300 more steps it won't be that far behind the non-trimmed model in terms of loss/perplexity and top-1 hit-rate.
Great! I have started the training here; I will leave it running overnight, as it is already quite late for me, and resume playing with it tomorrow.

Thanks for the quick turnaround here! It was very fun.
@Echo9Zulu Thanks I'll give it a read tomorrow.
Yeah, I'm in the UK and the clocks have just gone forward and I was wondering where the last hour had gone :D
@jukofyork Wait, you have "daylight savings" in the UK like we have in the US? That's wild; I thought the acceptance rate would be lower on that sort of thing lol
@jukofyork oh, I am in Ireland, that explains it, the clocks also changed here. Forgot about that!
@Echo9Zulu I'm not sure about the UK, but here it is the opposite: the official time is the one ahead (Irish Standard Time), and we change to the other one, one hour behind, during the winter (not sure why…)
It recovered most of the performance (cyan line) vs the full model (grey line):
The magenta line is my new experiment that uses just 12 out of 24 layers, and also the `--trim-intermediate-size` option that I just merged:
```bash
python3 ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./DeepSeek-R1 \
    ./finetunes/STAGE1/DeepSeek-R1-DRAFT-0.25B \
    --overwrite \
    --use-cpu-only \
    --trim-layers 10-21 \
    --trim-intermediate-size 2432 \
    --override "<｜▁pad▁｜>" "<|endoftext|>" \
    --override "<｜fim▁hole｜>" "<|fim_middle|>" \
    --override "<｜fim▁begin｜>" "<|fim_prefix|>" \
    --override "<｜fim▁end｜>" "<|fim_suffix|>" \
    --override "<｜User｜>" "<|im_start|>user\\n" \
    --override "<｜Assistant｜>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<｜tool▁calls▁begin｜>" "<tool_call>" \
    --override "<｜tool▁call▁begin｜>" "<tool_call>" \
    --override "<｜tool▁outputs▁begin｜>" "<tool_call>" \
    --override "<｜tool▁output▁begin｜>" "<tool_call>" \
    --override "<｜tool▁calls▁end｜>" "</tool_call>" \
    --override "<｜tool▁call▁end｜>" "</tool_call>" \
    --override "<｜tool▁outputs▁end｜>" "</tool_call>" \
    --override "<｜tool▁output▁end｜>" "</tool_call>" \
    --override "<｜tool▁sep｜>" "</tool_call>"
```
The small `qwen` models seem to use quite a high `hidden:intermediate` ratio:
- Donor hidden size : 896
- Donor intermediate size : 4864 (ratio = 1:5.4)
compared to other small models, and I doubt it is all that useful if we mainly care about speculative decoding predicting summarised text or long variable names in code, etc.
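For illustration, trimming the intermediate size basically means slicing the `gate_proj`/`up_proj` output channels and the matching `down_proj` input channels consistently. A rough sketch of the idea (this just keeps the first `k` channels and is not necessarily how the repo actually selects them):

```python
import torch

def trim_mlp(gate_proj: torch.Tensor,   # (intermediate, hidden)
             up_proj: torch.Tensor,     # (intermediate, hidden)
             down_proj: torch.Tensor,   # (hidden, intermediate)
             new_intermediate: int):
    """Shrink the MLP intermediate dimension by keeping the first k channels.

    A real implementation might rank channels by weight magnitude instead;
    this only illustrates which axes have to be sliced consistently.
    """
    gate = gate_proj[:new_intermediate, :].clone()
    up = up_proj[:new_intermediate, :].clone()
    down = down_proj[:, :new_intermediate].clone()
    return gate, up, down

# Example with the Qwen2.5-0.5B shapes: 4864 -> 2432 intermediate channels.
g, u, d = trim_mlp(torch.randn(4864, 896), torch.randn(4864, 896),
                   torch.randn(896, 4864), 2432)
print(g.shape, u.shape, d.shape)  # (2432, 896) (2432, 896) (896, 2432)
```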
I've also tried to use a more robust way of saving the final model:
```python
model = type(model)(model.config)
model.load_state_dict(new_state_dict)
```
and added a debugging function that is commented out by default:
```python
# debug_model_tensors(model, new_state_dict)
```
and added the `safetensor_meta_dump.sh` script to the repo to help with debugging too:
```bash
#!/bin/bash

# Specify the file
FILE="$1"

# Extract the first 8 bytes and convert them to a decimal integer
HEADER_LENGTH=$(dd "if=$FILE" bs=1 count=8 2>/dev/null | od -An -vtu8)

# Extract the metadata, starting from the 9th byte
dd "if=$FILE" bs=1 skip=8 "count=$HEADER_LENGTH" 2>/dev/null | jq
```
I probably won't have time to do much more now, but I've implemented most of what I could think of (`hidden_size` reduction was too much hassle, as you have to deal with all the `layer_norm` tensors, the head sizes need reducing, and so on).

Again, I'm still only testing this on `deepseek-r1` and `qwen-2.5-instruct:0.5b`, so there may be bugs that aren't showing for me, and I'd appreciate it if you can test it on your models to check it works! :)
> @rdsm thanks for the mention! @jukofyork you should check out the paper "FastDraft: How to Train Your Draft".
Thanks for this reference! I think it may explain why I wasn't getting very good results:
I was only using a subset of the common-crawl dataset for the continued pre-training, so now I've added an equal amount of this dataset to see the effect.
@jukofyork Woah that's a cool finding. Definitely let us know your results
That is interesting. I tried it here: my dataset is divided into categories and I was not going through all of it, and shuffling it before training seems to have made the improvements come faster.
But clearly training on a couple of million tokens is not enough for trimmed models, so I think I will focus on trying to improve the untrimmed model.
@jukofyork you might find this interesting: almost 50/50, and it works fine (deephermes-3-mistral-24b), with a nice bump in performance from 7 tk/s to 10 tk/s on MLX.
Loaded OK:
- Donor vocab size : 151936
- Target vocab size : 131078 (used = 131078, unused = 0)
- Donor hidden size : 896
Processing 3 automatic token overrides:
- 'bos_token_id' : 1 '<s>' → [151643] '<|endoftext|>'
- 'eos_token_id' : 131072 '<|eot_id|>' → [151645] '<|im_end|>'
- 'pad_token_id' : 131077 '<|end_of_text|>' → [151643] '<|endoftext|>'
Transplanting tokens: 100%|██████████| 131078/131078 [00:35<00:00, 3641.54token/s]
Transplant mappings:
- 1 to 1 : 67905 (52%)
- 2 to 1 : 45124 (34%)
- 3 to 1 : 11133 (8.5%)
- 4 to 1 : 3358 (2.6%)
- 5 to 1 : 1261 (0.96%)
- 6 to 1 : 635 (0.48%)
- 7 to 1 : 1169 (0.89%)
- 8 to 1 : 186 (0.14%)
- 9 to 1 : 95 (0.072%)
- 10 to 1 : 66 (0.05%)
- 11 to 1 : 34 (0.026%)
- 12 to 1 : 28 (0.021%)
- 13 to 1 : 22 (0.017%)
- 14 to 1 : 15 (0.011%)
- 15 to 1 : 9 (0.0069%)
- 16 to 1 : 11 (0.0084%)
- 17 to 1 : 12 (0.0092%)
- 18 to 1 : 2 (0.0015%)
- 19 to 1 : 4 (0.0031%)
- 20 to 1 : 4 (0.0031%)
- 21 to 1 : 2 (0.0015%)
- 22 to 1 : 1 (0.00076%)
- 23 to 1 : 1 (0.00076%)
- 28 to 1 : 1 (0.00076%)
Head initialized with:
- Copies : 67905 (52%)
- Means : 63173 (48%)
- Zeros : 0 (0%)
@rdsm It's interesting that it seems to work better on MLX!? I tried twice to buy a 128GB M2 Ultra off eBay about a month ago and both sellers tried to scam me lol (well, the second guy tried to scam the postal insurance and not me directly), so sadly there's no way I can test or improve it for MLX yet.
Regarding the paper: that turned out to be super useful as I think the problem I was having all along was not including any raw code in the "pre-training" datasets! I'm now trying the following mixes with ~3B tokens all trained in one go (ie: rather than 2 stages):
For a general draft model:

- 30% `deepseek-r1` data that is mostly textual (the dolphin dataset).
- 30% `deepseek-r1` data that is mostly mathematics (the open thoughts dataset).
- 30% from the stack-v1-smol-xl dataset.
- 10% from the creative commons sample dataset.
For the coder-specific draft model I use 22.5% / 22.5% / 45% / 10% of the same datasets.
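As a rough sketch of how a mix like this could be put together (assuming the four sources are available as Hugging Face `datasets` objects; the file names here are placeholders, not the actual datasets used):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder files: substitute the actual local copies of the four sources.
text_r1 = load_dataset("json", data_files="r1_dolphin.jsonl", split="train")
math_r1 = load_dataset("json", data_files="r1_open_thoughts.jsonl", split="train")
code = load_dataset("json", data_files="stack_smol_xl.jsonl", split="train")
cc = load_dataset("json", data_files="common_crawl_sample.jsonl", split="train")

# General draft mix: 30% / 30% / 30% / 10%, sampled stochastically so the
# categories are interleaved rather than trained in separate stages.
mixed = interleave_datasets(
    [text_r1, math_r1, code, cc],
    probabilities=[0.30, 0.30, 0.30, 0.10],
    seed=42,
    stopping_strategy="all_exhausted",
)
```

For the coder-specific mix the probabilities would just change to [0.225, 0.225, 0.45, 0.10].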
It will be 4-5 days before I have these two trained up for `r1` and `v3`, but so far it looks like this may be what I was missing, and that either using 2 stages or the lack of code data was causing the problems!
> It's interesting that it seems to work better on MLX!?
The only reference points that I have are my two machines (M1 Max 64GB and base M4 32GB): on the M1 Max I don't see much difference, but on the M4 I see a lot. I've heard that the M1 doesn't do well with speculative decoding, so that might be it.
> It will be 4-5 days before I have these two trained up
Interesting; I'm anxious to see by how much they outperform Qwen 2.5 0.5b.