license: apache-2.0
QwQ-1.5B-Persona
Introduction
QwQ-1.5B-Persona is fine-tuned from Qwen2.5-1.5B-Instruct on 1 million math persona examples (see this paper for details on how the data is constructed).
Currently, QwQ-1.5B-Persona is meant to serve as a draft model for losslessly accelerating QwQ-32B inference with speculative decoding, but you may also use it as a standalone model.
Evaluation
We report the inference speedup of QwQ-32B on MATH (200 samples spanning levels 1-5), GPQA (diamond), and AIME (2023 and 2024). All experiments are run on two A100 GPUs with 40 GB of memory each.
Qwen2.5-1.5B-Instruct as draft model:
Draft Length Policy | MATH (l1) | MATH (l2) | MATH (l3) | MATH (l4) | MATH (l5) | GPQA | AIME | Avg |
---|---|---|---|---|---|---|---|---|
Constant | 1.18 | 1.16 | 1.22 | 1.29 | 1.30 | 1.28 | 1.13 | 1.22 |
Heuristics | 1.12 | 1.15 | 1.17 | 1.19 | 1.22 | 1.25 | 1.13 | 1.18 |
SVIP | 1.45 | 1.47 | 1.51 | 1.58 | 1.61 | 1.57 | 1.45 | 1.52 |
QwQ-1.5B-Persona as draft model:
Draft Length Policy | MATH (l1) | MATH (l2) | MATH (l3) | MATH (l4) | MATH (l5) | GPQA | AIME | Avg |
---|---|---|---|---|---|---|---|---|
Constant | 1.45 | 1.50 | 1.52 | 1.56 | 1.56 | 1.58 | 1.25 | 1.49 |
Heuristics | 1.29 | 1.26 | 1.27 | 1.30 | 1.33 | 1.34 | 1.18 | 1.28 |
SVIP | 1.65 | 1.68 | 1.75 | 1.78 | 1.82 | 1.77 | 1.52 | 1.71 |
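The Avg column can be reproduced as the unweighted mean of the seven benchmark columns. The short check below does this for the QwQ-1.5B-Persona table; the variable and function names are illustrative only and are not part of any released code.

```python
# Per-benchmark speedups copied from the QwQ-1.5B-Persona table above,
# in column order: MATH l1..l5, GPQA, AIME.
speedups = {
    "Constant":   [1.45, 1.50, 1.52, 1.56, 1.56, 1.58, 1.25],
    "Heuristics": [1.29, 1.26, 1.27, 1.30, 1.33, 1.34, 1.18],
    "SVIP":       [1.65, 1.68, 1.75, 1.78, 1.82, 1.77, 1.52],
}


def average_speedup(values):
    """Unweighted mean over the benchmark columns, rounded to two decimals."""
    return round(sum(values) / len(values), 2)


averages = {policy: average_speedup(values) for policy, values in speedups.items()}
print(averages)
```

The printed values should match the Avg column (1.49, 1.28, and 1.71).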
The three draft length policies are:
- Constant: A constant draft length of 5
- Heuristics: If all draft tokens are accepted in one round, draft length is increased by 2 in the next round; otherwise it's decreased by 1. This policy is implemented in the transformers library.
- SVIP: A dynamic draft length policy that adaptively decides when to stop drafting based on the draft model's entropy. See this paper for details.
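The update rules described above can be sketched in a few lines of plain Python. The heuristics rule follows the description in the list; the floor of 1 on the draft length is an assumption added here so the length never reaches zero. The entropy-based stop rule is only a simplified stand-in to illustrate the idea of stopping when the draft model is uncertain. It is not the exact SVIP criterion, and the threshold value is hypothetical.

```python
import math


def next_draft_length(current, all_accepted, minimum=1):
    """Heuristics policy: +2 after a fully accepted round, otherwise -1.

    The `minimum` floor is an assumption, not stated in the description above.
    """
    if all_accepted:
        return current + 2
    return max(minimum, current - 1)


def entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_stop_drafting(probs, threshold=1.0):
    """Simplified entropy-based stop rule (NOT the exact SVIP criterion).

    Stops drafting when the draft model's entropy exceeds a hypothetical threshold.
    """
    return entropy(probs) > threshold


# Heuristics policy starting from the constant policy's draft length of 5.
length = 5
history = []
for accepted in [True, True, False, False]:
    length = next_draft_length(length, accepted)
    history.append(length)
print(history)  # [7, 9, 8, 7]

# A peaked distribution keeps drafting; a flat one stops.
print(should_stop_drafting([0.97, 0.01, 0.01, 0.01]))  # False
print(should_stop_drafting([0.25, 0.25, 0.25, 0.25]))  # True
```

The intuition is that low draft-model entropy suggests the drafted token is likely to be accepted by the target model, so drafting can continue, while high entropy suggests handing control back for verification.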
Quickstart
The constant and heuristics draft length policies have been integrated into the transformers library. Here is a code snippet that uses QwQ-1.5B-Persona to accelerate the inference of QwQ-32B:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target model whose output distribution is preserved
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B-Preview",
    torch_dtype="auto",
    device_map={'': 0}
)
# Draft model that proposes tokens for the target model to verify
draft_model = AutoModelForCausalLM.from_pretrained(
    "Geralt-Targaryen/QwQ-1.5B-Persona",
    torch_dtype="auto",
    device_map={'': 0}
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Passing assistant_model enables assisted (speculative) generation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    assistant_model=draft_model
)
# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
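The snippet above does not say which draft length policy is used. To our understanding of the transformers generation settings, assisted generation reads `num_assistant_tokens` and `num_assistant_tokens_schedule` from the draft model's `generation_config`, and the defaults have changed between library versions, so it may be worth setting them explicitly. The helper below is a hypothetical convenience function (its name and the policy keys are assumptions); it is demonstrated on a stand-in object so it runs without downloading any model.

```python
from types import SimpleNamespace

# Map the policy names used in this card to transformers schedule names.
_SCHEDULES = {"constant": "constant", "heuristics": "heuristic"}


def set_draft_length_policy(draft_model, policy, draft_length=5):
    """Set the draft length policy on a draft model's generation config.

    Hypothetical helper: it only assigns two generation-config attributes.
    """
    if policy not in _SCHEDULES:
        raise ValueError(f"unknown policy: {policy!r}")
    config = draft_model.generation_config
    config.num_assistant_tokens = draft_length
    config.num_assistant_tokens_schedule = _SCHEDULES[policy]
    return draft_model


# Stand-in for a loaded draft model, used here only for demonstration.
stub_model = SimpleNamespace(generation_config=SimpleNamespace())
set_draft_length_policy(stub_model, "constant", draft_length=5)
print(stub_model.generation_config.num_assistant_tokens_schedule)  # constant
```

With a real draft model, calling `set_draft_length_policy(draft_model, "constant")` before `model.generate(...)` would be expected to select the constant policy with a draft length of 5; please verify the attribute names against the transformers version you have installed.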
For the more advanced SVIP draft length policy, please refer to this GitHub repo.
Citation
If you find QwQ-1.5B-Persona helpful, please cite the following paper:
@misc{zhang2024svip,
title={Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding},
author={Ziyin Zhang and Jiahao Xu and Tian Liang and Xingyu Chen and Zhiwei He and Rui Wang and Zhaopeng Tu},
year={2024},
eprint={2411.18462},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.18462},
}