
From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

Published on Feb 26 · Submitted by zlzheng on Mar 4

Abstract

Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management, and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3× speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
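The three bottlenecks above arise in the draft-then-verify paradigm that TokenSwift builds on. As a rough illustration of why verification keeps generation lossless, here is a minimal greedy draft-and-verify loop in plain PyTorch. The function names and structure are assumptions for exposition, not the paper's implementation; it assumes batch size 1 and deliberately omits the KV cache reuse whose management is one of TokenSwift's actual contributions.

```python
import torch

@torch.no_grad()
def draft_then_verify(target_model, propose_draft, input_ids, max_new_tokens):
    """Greedy draft-and-verify loop (illustrative sketch, not TokenSwift).
    `propose_draft` is any cheap proposer returning a (1, k) block of
    candidate token ids. Accepted tokens are exactly what greedy decoding
    of the target model would produce, which is what makes this family of
    methods lossless."""
    generated = input_ids
    prompt_len = input_ids.shape[1]
    while generated.shape[1] - prompt_len < max_new_tokens:
        draft = propose_draft(generated)                  # (1, k) candidates
        candidate = torch.cat([generated, draft], dim=1)
        # One target-model forward pass verifies the whole draft block.
        # Note: this naive version recomputes the entire prefix each step;
        # avoiding that is exactly the KV-management problem the paper names.
        logits = target_model(candidate).logits
        # Target's greedy prediction at each drafted position.
        preds = logits[:, generated.shape[1] - 1 : -1, :].argmax(-1)
        # Length of the longest draft prefix the target agrees with.
        n_ok = int((preds == draft).cumprod(dim=-1).sum())
        if n_ok == draft.shape[1]:                        # whole draft accepted
            bonus = logits[:, -1:, :].argmax(-1)          # free extra token
        else:
            bonus = preds[:, n_ok : n_ok + 1]             # target's correction
        generated = torch.cat([generated, draft[:, :n_ok], bonus], dim=1)
    return generated[:, : prompt_len + max_new_tokens]
```

Because every accepted token is checked against the target model's own greedy prediction, the loop can only go faster than vanilla decoding, never change its output; each iteration gains at least one token (the bonus/correction) even if the entire draft is rejected.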

Community

Paper author · Paper submitter

TokenSwift is a novel framework designed to substantially accelerate the generation process of ultra-long sequences, up to 100K tokens, while maintaining the target model's inherent quality.

| Highlights | Description | Emoji |
|---|---|---|
| ⚡ Speed | 3× faster than vanilla Transformers | ⏩ |
| 🎯 Lossless | Matches the original model's output quality | ✅ |
| 📈 Scalability | Linear time complexity for 100K+ sequences | 📏 |
| 🛠️ Plug & Play | Works with most HuggingFace models | 🤗 |

Code: https://github.com/bigai-nlco/TokenSwift
Paper: https://arxiv.org/abs/2502.18890
Model: https://huggingface.co/TokenSwift
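For context, this is what the vanilla baseline looks like with Hugging Face transformers, i.e., the setup TokenSwift reports over 3× speedups against. The model id is only an example; any causal LM from the Hub works the same way, and for TokenSwift's own entry points see the repo above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"  # example only; pick any causal LM
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Write a detailed technical report on long-context inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Plain autoregressive decoding: one forward pass per token, so wall-clock
# time grows with every generated token — hours at the 100K-token scale.
output = model.generate(**inputs, max_new_tokens=100_000, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```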



