Can you do an FP4?

#1
by etohimself - opened

I tried the FP16 model and it takes 5-10 seconds to generate 1-2 sentences. It's too slow. We are very close to
replicating real-time conversation; all we need now is a fast CSM. It would be amazing if you could quantize it further.


The lower you go in precision, the less accurate the model becomes, and it will start talking in gibberish.

It doesn't seem like there is an FP4, and FP8 is non-standard and not a PyTorch format. I can only go down to INT8 (torch.qint8) and/or UINT8 (torch.uint8).
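For reference, going down to INT8 would look roughly like PyTorch's dynamic quantization. A minimal sketch only: `load_csm_model` is a placeholder for however the checkpoint is actually loaded, and dynamic quantization in PyTorch targets CPU inference:

```python
import torch
import torch.nn as nn

# Placeholder for the repo's actual loading code (assumption, not the real API).
model = load_csm_model()  # assumed to return an nn.Module

# Dynamic INT8 quantization: Linear weights are stored as qint8 and
# activations are quantized on the fly at inference time (CPU only).
quantized_model = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # the INT8 path mentioned above
)
```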

Hmm. Okay, thank you. How long does it take for you to generate 1-2 sentences? I see 50% GPU utilization on an H100, but it still takes 5-10 seconds :/

Owner

I think if you're getting that on an H100 then it's not optimized for your card; you may want BF16 instead.
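If it helps, running in BF16 is mostly a dtype change at load time. A rough sketch, assuming `load_csm_model` stands in for the repo's actual loading code:

```python
import torch

# Sketch only: run the model in bfloat16, which the H100 handles natively.
model = load_csm_model()  # placeholder for the actual loading code (assumption)
model = model.to(device="cuda", dtype=torch.bfloat16).eval()

# Alternatively, keep the weights as-is and autocast the forward pass:
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     output = model(...)
```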

Owner

More variants are now available.

While they are not traditional quantized models, they may work.

I forked an interesting repository with multiple training options for this model. Unfortunately its primary focus is MLX, and I've been unable to get the PyTorch training code it includes to work. I'm far from knowledgeable enough; it would be great if those who are would get involved and help make it possible. My fork: https://github.com/imaginateit/csm-train-pytorch.git
