Speculative
A collection by stereoplegic
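Most of the papers below build on the same draft-then-verify loop of speculative decoding (Leviathan et al., arXiv:2211.17192, listed below). As a point of reference, here is a minimal greedy sketch; the `draft_model`/`target_model` callables and their logits-returning contract are assumptions for illustration, not the API of any paper in this collection.

```python
import torch

def speculative_decode(draft_model, target_model, prefix, k=4, max_new_tokens=64):
    """Greedy draft-then-verify loop (illustrative sketch only).

    draft_model / target_model: assumed to be callables mapping a 1-D
    LongTensor of token ids to per-position logits of shape
    (seq_len, vocab_size) -- an assumption for this sketch, not any
    paper's actual interface.
    """
    tokens = prefix.clone()
    target_len = prefix.numel() + max_new_tokens
    while tokens.numel() < target_len:
        # 1. Draft: the small model proposes k tokens autoregressively.
        draft = tokens.clone()
        for _ in range(k):
            next_id = draft_model(draft)[-1].argmax()
            draft = torch.cat([draft, next_id.view(1)])
        # 2. Verify: one target-model forward pass scores all k proposed
        #    positions in parallel (logits[p - 1] predicts the token at p).
        logits = target_model(draft)
        n_accepted = 0
        for i in range(k):
            pos = tokens.numel() + i  # position of the i-th drafted token
            if logits[pos - 1].argmax() == draft[pos]:
                n_accepted += 1
            else:
                break
        # 3. Keep the verified prefix plus one "bonus" token from the target
        #    model, so every iteration emits at least one token.
        keep = tokens.numel() + n_accepted
        bonus = logits[keep - 1].argmax().view(1)
        tokens = torch.cat([draft[:keep], bonus])
    return tokens[:target_len]
```

The speculative sampling papers in the list generalize this greedy acceptance check to a rejection-sampling rule that preserves the target model's full output distribution.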
AutoMix: Automatically Mixing Language Models • arXiv:2310.12963 • 14 upvotes
Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning • arXiv:2310.03094 • 12 upvotes
MatFormer: Nested Transformer for Elastic Inference • arXiv:2310.07707 • 1 upvote
DistillSpec: Improving Speculative Decoding via Knowledge Distillation • arXiv:2310.08461 • 1 upvote
DialCoT Meets PPO: Decomposing and Exploring Reasoning Paths in Smaller Language Models • arXiv:2310.05074 • 1 upvote
Confident Adaptive Language Modeling • arXiv:2207.07061 • 1 upvote
LLMCad: Fast and Scalable On-device Large Language Model Inference • arXiv:2309.04255 • 1 upvote
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification • arXiv:2305.09781 • 4 upvotes
Future Lens: Anticipating Subsequent Tokens from a Single Hidden State • arXiv:2311.04897 • 1 upvote
Depth-Adaptive Transformer • arXiv:1910.10073 • 1 upvote
Fast Inference from Transformers via Speculative Decoding • arXiv:2211.17192 • 4 upvotes
Accelerating Large Language Model Decoding with Speculative Sampling • arXiv:2302.01318 • 2 upvotes
RecycleGPT: An Autoregressive Language Model with Recyclable Module • arXiv:2308.03421 • 7 upvotes
OrchestraLLM: Efficient Orchestration of Language Models for Dialogue State Tracking • arXiv:2311.09758 • 1 upvote
Small Language Models Improve Giants by Rewriting Their Outputs • arXiv:2305.13514 • 2 upvotes
SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks • arXiv:2309.00255 • 1 upvote
Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT) • arXiv:2309.08968 • 22 upvotes
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding • arXiv:2310.05424 • 1 upvote
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster • arXiv:2311.08263 • 15 upvotes
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models • arXiv:2401.12522 • 11 upvotes
Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding • arXiv:2401.07851 • 1 upvote
APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding • arXiv:2401.06761 • 1 upvote
Cascade Speculative Drafting for Even Faster LLM Inference • arXiv:2312.11462 • 8 upvotes
Speculative Contrastive Decoding • arXiv:2311.08981 • 2 upvotes
Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval • arXiv:2308.04711 • 1 upvote
Hydra: Sequentially-Dependent Draft Heads for Medusa Decoding • arXiv:2402.05109
Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding • arXiv:2402.12374 • 3 upvotes
Online Speculative Decoding • arXiv:2310.07177 • 1 upvote
Speculative Streaming: Fast LLM Inference without Auxiliary Models • arXiv:2402.11131 • 42 upvotes
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting • arXiv:2402.13720 • 6 upvotes
Recurrent Drafter for Fast Speculative Decoding in Large Language Models • arXiv:2403.09919 • 20 upvotes
Better & Faster Large Language Models via Multi-token Prediction • arXiv:2404.19737 • 73 upvotes
Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge • arXiv:2405.00263 • 14 upvotes
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding • arXiv:2404.11912 • 16 upvotes
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting • arXiv:2404.18911 • 29 upvotes
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens • arXiv:2402.15758
REST: Retrieval-Based Speculative Decoding • arXiv:2311.08252
Accelerating Production LLMs with Combined Token/Embedding Speculators • arXiv:2404.19124
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration • arXiv:2404.12022
Speculative Decoding with Big Little Decoder • arXiv:2302.07863
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation • arXiv:2405.05329
Accelerating Speculative Decoding using Dynamic Speculation Length • arXiv:2405.04304 • 2 upvotes
SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens • arXiv:2403.18647
On Speculative Decoding for Multimodal Large Language Models • arXiv:2404.08856 • 13 upvotes
You Only Cache Once: Decoder-Decoder Architectures for Language Models • arXiv:2405.05254 • 10 upvotes
Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding • arXiv:2404.08698
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference • arXiv:2405.18628
EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models • arXiv:2405.07542
OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure • arXiv:2406.17276
Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism • arXiv:2406.03853
Optimizing Speculative Decoding for Serving Large Language Models Using Goodput • arXiv:2406.14066 • 1 upvote
S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models • arXiv:2407.01955
Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training • arXiv:2406.17404 • 1 upvote
Adaptive Draft-Verification for Efficient Large Language Model Decoding • arXiv:2407.12021
SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding • arXiv:2406.18200 • 1 upvote
Parallel Speculative Decoding with Adaptive Draft Length • arXiv:2408.11850
Improving Multi-candidate Speculative Decoding • arXiv:2409.10644
Learning Harmonized Representations for Speculative Sampling • arXiv:2408.15766
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling • arXiv:2408.08696
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation • arXiv:2407.11798 • 1 upvote
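A second pattern running through the list (AutoMix, the Mixture-of-Thoughts cascades, OrchestraLLM) is the model cascade: answer with a cheap model and escalate to an expensive one only when a verifier is unsure. A minimal sketch of that routing idea; the `small_answer`/`large_answer`/`confidence` helpers and the 0.8 threshold are hypothetical placeholders, not any paper's actual method.

```python
from typing import Callable

def cascade(
    question: str,
    small_answer: Callable[[str], str],       # cheap model (hypothetical helper)
    large_answer: Callable[[str], str],       # expensive model (hypothetical helper)
    confidence: Callable[[str, str], float],  # verifier score in [0, 1] (hypothetical)
    threshold: float = 0.8,                   # placeholder escalation threshold
) -> str:
    """Two-model cascade router (illustrative sketch only)."""
    draft = small_answer(question)
    # Escalate only when the verifier doubts the cheap answer, so the large
    # model is invoked for just the hard fraction of queries.
    if confidence(question, draft) >= threshold:
        return draft
    return large_answer(question)
```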