Can you do an FP4?

#1
by etohimself - opened

I tried the FP16 model and it takes 5-10 seconds to generate 1-2 sentences. It's too slow. We are very close to
replicating real-time conversation; all we need now is a fast CSM. It would be amazing if you could quantize it further.


The lower you go in precision, the less accurate the model becomes, and it will start talking in gibberish.

It doesn't seem like there is an FP4, and FP8 is non-standard and not a PyTorch format. I can only go down to INT8 (torch.qint8) and/or UINT8 (torch.uint8).
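For reference, going down to INT8 would look roughly like PyTorch's dynamic quantization. A minimal sketch only: `load_csm_model` is a placeholder for however the checkpoint is actually loaded, and dynamic quantization in PyTorch targets CPU inference:

```python
import torch
import torch.nn as nn

# Placeholder for the repo's actual loading code (assumption, not the real API).
model = load_csm_model()  # assumed to return an nn.Module

# Dynamic INT8 quantization: Linear weights are stored as qint8 and
# activations are quantized on the fly at inference time (CPU only).
quantized_model = torch.ao.quantization.quantize_dynamic(
    model,
    {nn.Linear},        # layer types to quantize
    dtype=torch.qint8,  # the INT8 path mentioned above
)
```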

Hmm. Okay, thank you. How long does it take for you to generate 1-2 sentences? I see 50% GPU utilization on an H100, but it still takes 5-10 seconds :/

Owner

I think if you're getting that on an H100 then it's not optimized for your card; you may want BF16 instead.
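If it helps, running in BF16 is mostly a dtype change at load time. A rough sketch, assuming `load_csm_model` stands in for the repo's actual loading code:

```python
import torch

# Sketch only: run the model in bfloat16, which the H100 handles natively.
model = load_csm_model()  # placeholder for the actual loading code (assumption)
model = model.to(device="cuda", dtype=torch.bfloat16).eval()

# Alternatively, keep the weights as-is and autocast the forward pass:
# with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
#     output = model(...)
```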

Owner

More variants are now available.

While they are not traditional quantized models, they may work.

I forked an interesting repository with multiple training options for this model. Unfortunately its primary focus is MLX, and I've been unable to get the PyTorch training code it includes to work. I'm far from knowledgeable enough; it would be great if those who are would get involved and help make it possible. My fork: https://github.com/imaginateit/csm-train-pytorch.git
