Can you do a fp4?

#1
by etohimself - opened

I tried the FP16 model and it takes 5-10 seconds to generate 1-2 sentences. It's too slow. We are very close to
replicating real-time conversation; all we need now is a fast CSM. It would be amazing if you could quantize it further.

The lower you go with precision, the less accurate the model becomes, until it starts producing gibberish.

There doesn't seem to be an FP4 format, and FP8 is non-standard and not a native PyTorch dtype. The lowest I can go is INT8 (torch.qint8) and/or UINT8 (torch.uint8).
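For reference, this is roughly what INT8 dynamic quantization looks like in PyTorch. The tiny `nn.Sequential` here is just a hypothetical stand-in for the CSM network; the `quantize_dynamic` call works the same way on any `nn.Module`:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the real CSM network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Dynamic quantization: weights are stored as torch.qint8 and the
# Linear layers run INT8 matmuls. Note this path targets CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # only quantize Linear layers
    dtype=torch.qint8,  # the lowest precision available here
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])
```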

Hmm, okay, thank you. How long does it take you to generate 1-2 sentences? I see 50% GPU utilization on an H100, but it still takes 5-10 seconds :/

If you're seeing that on an H100, I think the FP16 build isn't optimized for your card; you may want BF16 instead.
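For anyone following along, a minimal sketch of the BF16 suggestion. The toy model is a hypothetical stand-in, not the actual CSM loading code:

```python
import torch
import torch.nn as nn

# Hypothetical toy model; the casting pattern is the same for CSM.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

if torch.cuda.is_available():
    # Option 1: cast the weights themselves to bfloat16.
    model_bf16 = model.to("cuda", dtype=torch.bfloat16)
    x = torch.randn(1, 512, device="cuda", dtype=torch.bfloat16)
    with torch.inference_mode():
        y = model_bf16(x)

    # Option 2: keep FP32 weights and let autocast pick BF16 kernels per-op.
    model_fp32 = model.to("cuda", dtype=torch.float32)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        with torch.inference_mode():
            y2 = model_fp32(torch.randn(1, 512, device="cuda"))
```

BF16 has the same memory footprint as FP16 but FP32's exponent range, which is why it's often the safer choice on Ampere/Hopper cards like the H100.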
