arxiv:2601.05524

Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Published on Jan 9

Abstract

Double Retrieval Speculative Parallelism addresses limitations in parallel speculative decoding by enabling iterative retrieval speculations and authoritative retrieval for multi-token guidance, achieving significant speedup without training overhead.

AI-generated summary

Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) substantial wasted computation and pipeline stalls when an early error causes tokens to be rejected mid-sequence. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token guidance. Double is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedups of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming EAGLE-3, an advanced method that requires extensive model training.
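For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of plain greedy speculative decoding (draft, then verify), not of Double itself. The abstract does not specify the retrieval mechanisms, so none are modeled here. The toy `target_model` and `draft_model` functions, the vocabulary size of 50, and the window size `gamma` are all assumptions made for illustration. The sketch shows two properties the abstract relies on: the output matches target-only decoding (lossless), and an early draft error discards the rest of the drafted window.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token prefix to the next token (greedy).
Model = Callable[[List[int]], int]

def target_model(prefix: List[int]) -> int:
    # Hypothetical toy target: a deterministic function of the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_model(prefix: List[int]) -> int:
    # Hypothetical toy draft: agrees with the target except when the
    # prefix length is a multiple of 4, where it guesses wrong.
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Reference decoding: one model call per generated token.
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, gamma: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0  # each round stands in for one batched target verification pass
    while len(out) < goal:
        rounds += 1
        # Draft phase: propose gamma tokens autoregressively.
        proposal: List[int] = []
        for _ in range(gamma):
            proposal.append(draft(out + proposal))
        # Verify phase: keep the longest prefix the target agrees with.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: take the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Whole window accepted: the target contributes one bonus token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal], rounds

prompt = [1, 2, 3]
spec_out, rounds = speculative_decode(target_model, draft_model, prompt, n_new=20)
ref_out = greedy_decode(target_model, prompt, n_new=20)
print(spec_out == ref_out, rounds)
```

In this sequential form the target sits idle while the draft proposes and vice versa. PSD, as the abstract describes it, overlaps the two phases, which is where the draft-to-target speed ratio becomes the limiting factor.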
