Can you please add Nemotron 70B static?

#1
by nickandbro - opened

Unless I am mistaken, the static one is the only FP8 quantization method that achieves 2x throughput.
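A rough way to sanity-check throughput with vLLM's offline API, as a sketch (the FP8 repo id and tensor_parallel_size below are placeholders for whatever checkpoint and hardware you actually have, not confirmed names):

```python
# Rough throughput check: generate a batch of completions and time it.
import time
from vllm import LLM, SamplingParams

# Placeholder repo id and parallelism; substitute your own.
llm = LLM(
    model="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
    tensor_parallel_size=4,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the history of the Roman Empire."] * 64

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```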

Neural Magic org

Hi @nickandbro , both static and dynamic achieve fairly similar speedups (generally within 10%). Dynamic is generally preferred and the one we recommend because it maintains better accuracy recovery.
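For reference, a sketch of what the FP8-Dynamic recipe looks like with llm-compressor (based on the documented FP8_DYNAMIC scheme; the oneshot import path can differ between llm-compressor versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer versions: from llmcompressor import oneshot

MODEL_ID = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with static scales, FP8 activations with dynamic per-token scales.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```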

Is there a big performance difference when activations are dynamically quantized on a per-tensor basis vs. a per-token basis?

The vLLM docs on the quick FP8 quantization that is done when launching an engine suggest that there is limited latency (as in performance?) gain from dynamic per-tensor FP8 activation quantization compared to FP16 inference. LINK.

Which, if you don't focus too much on the flavor of dynamic quantization, would make one believe that static quants are faster.
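For completeness, a sketch of the engine-launch quantization path those docs describe, assuming the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF checkpoint; the CLI equivalent is --quantization fp8:

```python
# Online FP8 quantization at engine launch (no pre-quantized checkpoint needed):
# weights get per-tensor FP8 scales, activations get dynamic per-tensor scales
# computed during each forward pass.
from vllm import LLM

llm = LLM(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    quantization="fp8",
    tensor_parallel_size=4,  # adjust for your hardware
)
print(llm.generate("What is FP8 quantization?")[0].outputs[0].text)
```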

@adamo1139 That's awesome! Thanks!
