Can you please add Nemotron 70B static?

#1
by nickandbro - opened

Unless I am mistaken, the static one is the only FP8 quantization method that achieves 2x throughput.
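A rough way to sanity-check throughput with vLLM's offline API, as a sketch (the FP8 repo id and tensor_parallel_size below are placeholders for whatever checkpoint and hardware you actually have, not confirmed names):

```python
# Rough throughput check: generate a batch of completions and time it.
import time
from vllm import LLM, SamplingParams

# Placeholder repo id and parallelism; substitute your own.
llm = LLM(
    model="neuralmagic/Llama-3.1-Nemotron-70B-Instruct-HF-FP8-dynamic",
    tensor_parallel_size=4,
)
params = SamplingParams(temperature=0.0, max_tokens=256)
prompts = ["Summarize the history of the Roman Empire."] * 64

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```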

Neural Magic org

Hi @nickandbro , both static and dynamic achieve fairly similar speedups (generally within 10%). Dynamic is generally preferred and the one we recommend because it maintains better accuracy recovery.
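For reference, a sketch of what the FP8-Dynamic recipe looks like with llm-compressor (based on the documented FP8_DYNAMIC scheme; the oneshot import path can differ between llm-compressor versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # newer versions: from llmcompressor import oneshot

MODEL_ID = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with static scales, FP8 activations with dynamic per-token scales.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

SAVE_DIR = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```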

Is there a big performance difference when activations are dynamically quantized on a per-tensor basis vs. a per-token basis?

The vLLM docs on the quick FP8 quantization that is done when launching an engine suggest that there is limited latency (as in performance?) gain from dynamic per-tensor FP8 activation quantization compared to FP16 inference. LINK.

Which, if you don't focus too much on the flavor of dynamic quantization, would make one believe that static quants are faster.
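For completeness, a sketch of the engine-launch quantization path those docs describe, assuming the nvidia/Llama-3.1-Nemotron-70B-Instruct-HF checkpoint; the CLI equivalent is --quantization fp8:

```python
# Online FP8 quantization at engine launch (no pre-quantized checkpoint needed):
# weights get per-tensor FP8 scales, activations get dynamic per-tensor scales
# computed during each forward pass.
from vllm import LLM

llm = LLM(
    model="nvidia/Llama-3.1-Nemotron-70B-Instruct-HF",
    quantization="fp8",
    tensor_parallel_size=4,  # adjust for your hardware
)
print(llm.generate("What is FP8 quantization?")[0].outputs[0].text)
```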

@adamo1139 That's awesome! Thanks!
