⚡ FlashHead: Fast LM Head Inference - Now a Simple vLLM Plugin
flash-head replaces the dense LM head with a two-stage retrieval pipeline: up to 2x inference speedup, training-free. Previously this required custom Docker images; now it's just:
pip install flash-head
vllm serve embedl/Qwen3-1.7B-FlashHead-W4A16
✨ The plugin activates automatically via vLLM's vllm.general_plugins entry point. No source patches, no custom imports.
🧩 Supported models (full collection):
Qwen Qwen3,
meta-llama Llama3,
google Gemma3,
nvidia Cosmos-Reason2 - BF16 and W4A16 variants.
https://huggingface.co/collections/embedl/flashhead
embedl/Edge-Inference-Benchmarks
🔧 Benchmark it yourself:
vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
# Baseline comparison
FLASHHEAD_ENABLED=0 vllm bench latency --model embedl/Qwen3-1.7B-FlashHead-W4A16 --batch-size 1
FlashHead shines at low batch sizes, the typical real-time / on-device use case.
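For intuition about what a "two-stage retrieval" LM head can look like, here is a toy sketch in pure Python. It is not FlashHead's actual algorithm, which the post does not describe beyond that phrase. It assumes one common way to build such a pipeline: cluster the vocabulary rows offline, score the hidden state against cluster centroids first, then compute exact logits only for tokens in the best clusters. All names (`build_clusters`, `two_stage_argmax`, `n_probe`) are made up for the example.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_clusters(weights, n_clusters, n_iters=5, seed=0):
    """Offline step (assumed): group vocabulary rows with a few rounds of plain k-means."""
    rng = random.Random(seed)
    dim = len(weights[0])
    centroids = [list(row) for row in rng.sample(weights, n_clusters)]
    members = [[] for _ in range(n_clusters)]
    for _ in range(n_iters):
        members = [[] for _ in range(n_clusters)]
        for token_id, row in enumerate(weights):
            # Assign each token to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(n_clusters),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(row, centroids[c])),
            )
            members[nearest].append(token_id)
        for c, ids in enumerate(members):
            if ids:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(weights[i][d] for i in ids) / len(ids) for d in range(dim)]
    return centroids, members


def dense_argmax(hidden, weights):
    """Baseline: score every vocabulary row, as a dense LM head does."""
    return max(range(len(weights)), key=lambda t: dot(hidden, weights[t]))


def two_stage_argmax(hidden, weights, centroids, members, n_probe):
    # Stage 1: coarse scores against the centroids of non-empty clusters.
    non_empty = [c for c in range(len(centroids)) if members[c]]
    ranked = sorted(non_empty, key=lambda c: dot(hidden, centroids[c]), reverse=True)
    # Stage 2: exact logits, but only for tokens in the top n_probe clusters.
    candidates = sorted(t for c in ranked[:n_probe] for t in members[c])
    return max(candidates, key=lambda t: dot(hidden, weights[t]))


if __name__ == "__main__":
    rng = random.Random(1)
    vocab, dim, n_clusters = 512, 16, 16
    weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    centroids, members = build_clusters(weights, n_clusters)
    hidden = [rng.gauss(0, 1) for _ in range(dim)]
    print("dense    :", dense_argmax(hidden, weights))
    print("two-stage:", two_stage_argmax(hidden, weights, centroids, members, n_probe=4))
```

In this sketch the two-stage path scores `n_clusters` centroids plus the tokens in the probed clusters, rather than all `vocab` rows. With a small `n_probe` it can pick a different token than the dense baseline; probing every cluster makes the two agree exactly.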