Can you do a fp4?

#1
by etohimself - opened

I tried the FP16 model and it takes 5-10 seconds to generate 1-2 sentences. It's too slow. We are very close to
replicating real-time conversation; all we need now is a fast CSM. It would be amazing if you could quantize it further.

The lower you go with precision, the less accurate the model becomes, until it starts producing gibberish.

There doesn't seem to be an FP4 format, and FP8 is non-standard and not a native PyTorch dtype. The lowest I can go is INT8 (torch.qint8) and/or UINT8 (torch.uint8).
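For reference, this is roughly what INT8 dynamic quantization looks like in PyTorch. The tiny `nn.Sequential` here is just a hypothetical stand-in for the CSM network; the `quantize_dynamic` call works the same way on any `nn.Module`:

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the real CSM network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Dynamic quantization: weights are stored as torch.qint8 and the
# Linear layers run INT8 matmuls. Note this path targets CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # only quantize Linear layers
    dtype=torch.qint8,  # the lowest precision available here
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])
```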

Hmm, okay, thank you. How long does it take you to generate 1-2 sentences? I see 50% GPU utilization on an H100, but it still takes 5-10 seconds :/

If you're seeing that on an H100, I think the FP16 build isn't optimized for your card; you may want BF16 instead.
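For anyone following along, a minimal sketch of the BF16 suggestion. The toy model is a hypothetical stand-in, not the actual CSM loading code:

```python
import torch
import torch.nn as nn

# Hypothetical toy model; the casting pattern is the same for CSM.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

if torch.cuda.is_available():
    # Option 1: cast the weights themselves to bfloat16.
    model_bf16 = model.to("cuda", dtype=torch.bfloat16)
    x = torch.randn(1, 512, device="cuda", dtype=torch.bfloat16)
    with torch.inference_mode():
        y = model_bf16(x)

    # Option 2: keep FP32 weights and let autocast pick BF16 kernels per-op.
    model_fp32 = model.to("cuda", dtype=torch.float32)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        with torch.inference_mode():
            y2 = model_fp32(torch.randn(1, 512, device="cuda"))
```

BF16 has the same memory footprint as FP16 but FP32's exponent range, which is why it's often the safer choice on Ampere/Hopper cards like the H100.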
