# QwQ-1.5B-Persona

## Introduction
QwQ-1.5B-Persona is finetuned from Qwen2.5-1.5B-Instruct on 1 million persona-driven math samples (see this paper for details on how the data is constructed).
Currently, QwQ-1.5B-Persona is meant to serve as a draft model for losslessly accelerating the inference of QwQ-32B via speculative decoding, but you may also use it as a standalone model.
## Evaluation

We report the speedup of QwQ-32B on MATH (200 samples spanning levels 1-5), GPQA (diamond), and AIME (2023 and 2024). All experiments are run on two A100 GPUs with 40GB of memory each.
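Speedup here is the ratio of wall-clock generation time without a draft model to the time with speculative decoding. Below is a minimal measurement sketch (illustrative only; it reuses the `model`, `draft_model`, and `model_inputs` objects defined in the Quickstart section further down, and a real benchmark would warm up first and average over many prompts):

```python
import time

def timed_generate(model, model_inputs, **kwargs):
    # Wall-clock seconds for a single generate() call.
    start = time.perf_counter()
    model.generate(**model_inputs, max_new_tokens=512, **kwargs)
    return time.perf_counter() - start

t_baseline = timed_generate(model, model_inputs)                            # vanilla decoding
t_spec = timed_generate(model, model_inputs, assistant_model=draft_model)   # speculative decoding
print(f"speedup: {t_baseline / t_spec:.2f}x")
```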
Qwen2.5-1.5B-Instruct as draft model:

| Draft Length Policy | MATH (L1) | MATH (L2) | MATH (L3) | MATH (L4) | MATH (L5) | GPQA | AIME | Avg |
|---|---|---|---|---|---|---|---|---|
| Constant | 1.18 | 1.16 | 1.22 | 1.29 | 1.30 | 1.28 | 1.13 | 1.22 |
| Heuristics | 1.12 | 1.15 | 1.17 | 1.19 | 1.22 | 1.25 | 1.13 | 1.18 |
| SVIP | 1.45 | 1.47 | 1.51 | 1.58 | 1.61 | 1.57 | 1.45 | 1.52 |
QwQ-1.5B-Persona as draft model:

| Draft Length Policy | MATH (L1) | MATH (L2) | MATH (L3) | MATH (L4) | MATH (L5) | GPQA | AIME | Avg |
|---|---|---|---|---|---|---|---|---|
| Constant | 1.45 | 1.50 | 1.52 | 1.56 | 1.56 | 1.58 | 1.25 | 1.49 |
| Heuristics | 1.29 | 1.26 | 1.27 | 1.30 | 1.33 | 1.34 | 1.18 | 1.28 |
| SVIP | 1.65 | 1.68 | 1.75 | 1.78 | 1.82 | 1.77 | 1.52 | 1.71 |
The three draft length policies are (see the sketch after this list):

- Constant: a fixed draft length of 5 tokens per round.
- Heuristics: if all draft tokens are accepted in one round, the draft length is increased by 2 in the next round; otherwise, it is decreased by 1. This policy is implemented in the transformers library.
- SVIP: a dynamic draft length policy that adaptively determines when to stop drafting based on the draft model's entropy. See this paper for details.
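For concreteness, here is a minimal sketch of the heuristic update rule and of an entropy-based stopping check in the spirit of SVIP. The function names and the plain Shannon-entropy threshold are illustrative assumptions, not the exact implementations used in transformers or in the SVIP paper:

```python
import torch

def heuristic_next_draft_len(draft_len: int, n_accepted: int) -> int:
    # Heuristics policy: grow the draft window after a fully accepted round,
    # shrink it after any rejection (never below 1 token).
    if n_accepted == draft_len:
        return draft_len + 2
    return max(1, draft_len - 1)

def svip_should_stop(draft_logits: torch.Tensor, threshold: float) -> bool:
    # SVIP-style check (simplified): stop drafting once the draft model's
    # next-token distribution becomes too uncertain. See the SVIP paper
    # for the exact criterion and threshold.
    probs = torch.softmax(draft_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return bool(entropy > threshold)
```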
## Quickstart

The constant and heuristics draft length policies have been integrated into the transformers library. Here is a code snippet for using QwQ-1.5B-Persona to accelerate the inference of QwQ-32B:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the target model and the draft model on the same GPU
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/QwQ-32B-Preview",
    torch_dtype="auto",
    device_map={'': 0}
)
draft_model = AutoModelForCausalLM.from_pretrained(
    "Geralt-Targaryen/QwQ-1.5B-Persona",
    torch_dtype="auto",
    device_map={'': 0}
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")

prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Passing assistant_model enables assisted (speculative) decoding
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    assistant_model=draft_model
)
# Strip the prompt tokens from the output before decoding
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
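To switch between the constant and heuristics policies, transformers exposes the draft length settings on the assistant model's generation config. The attribute names below match recent transformers versions; check your installed version if they differ:

```python
# Constant policy: always draft 5 tokens per round
draft_model.generation_config.num_assistant_tokens = 5
draft_model.generation_config.num_assistant_tokens_schedule = "constant"

# Heuristics policy (the transformers default): adapt the draft length
# based on how many draft tokens were accepted in the previous round
# draft_model.generation_config.num_assistant_tokens_schedule = "heuristic"
```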
For the more advanced SVIP draft length policy, please refer to this GitHub repo.
## Citation

If you find QwQ-1.5B-Persona helpful, please cite the following paper:
```bibtex
@misc{zhang2024svip,
      title={Draft Model Knows When to Stop: A Self-Verification Length Policy for Speculative Decoding},
      author={Ziyin Zhang and Jiahao Xu and Tian Liang and Xingyu Chen and Zhiwei He and Rui Wang and Zhaopeng Tu},
      year={2024},
      eprint={2411.18462},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.18462},
}
```