FP16 LoRA on FP8 scaled model

#52
by NielsGx - opened

Hi, I am using Wan2.2 T2V/I2V in fp8 scaled format.
I was wondering whether using an FP16 LoRA (like the Wan Lightning ones) will slow things down or increase VRAM consumption, i.e. whether those parameters stay loaded in FP16 or get converted to FP8 (scaled?) on the fly.

If so, any info on how to convert the Wan2.2 Lightning LoRAs to fp8 scaled?

Thank you

When used unmerged, the fp8 scaled weights are first upcast to fp16 (or whatever precision you run the base model at) and the LoRA weight is added after that, so yes, it does have a small memory impact, but the quality is also higher. When merged, the memory use doesn't change, but I believe some quality is lost.
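
Roughly, the unmerged path looks like the minimal PyTorch sketch below; the names and the single per-tensor scale are illustrative assumptions, not the actual wrapper code:

```python
import torch

def lora_linear_unmerged(x, w_fp8, w_scale, lora_down, lora_up, alpha):
    # Dequantize the fp8-scaled base weight to fp16 just for this forward pass.
    w = w_fp8.to(torch.float16) * w_scale            # temporary fp16 copy -> small VRAM bump
    # LoRA delta stays in fp16: (out_features, rank) @ (rank, in_features)
    w = w + alpha * (lora_up.to(torch.float16) @ lora_down.to(torch.float16))
    return torch.nn.functional.linear(x, w)          # matmul runs at fp16 with the LoRA applied
```

The temporary fp16 copy of each weight while it is in use is where the small extra memory comes from, and since the LoRA delta is never rounded down to fp8, the quality holds up better.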

I am also about to implement Wan2.2 LoRA support. The Wan2.2 pipeline seems to have no load_lora method, so it will be tricky.

Pretty sure default ComfyUI runs inference at FP16?
If I set the model weight_dtype to fp8_e4m3fn_fast, it upcasts to bf16.
If I set it to auto, it loads as bf16 directly and doesn't upcast.
(The output looks a bit different but still really similar, and I don't know which one is better...)

So in that scenario it increases VRAM usage because the weights altered by the LoRA are now fp16 instead of fp8, while the inference itself is all in fp16 by default?

RTX 4xxx cards can run at fp8, but the quality loss is probably too big (if I understand correctly, model precision and inference precision are different things).
But VRAM usage for inference (the context or KV cache or whatever it's called) gets pretty significant for video models, since it's obviously way bigger than for image models 🤔 e.g. generating 5 seconds uses way more than 2 seconds.

Thanks

The weights are only temporarily upcast when they are used. Default Comfy behavior is to merge LoRAs into the model before inference, which doesn't change the precision or the inference-time memory use.
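
For contrast, merging into an fp8-scaled checkpoint looks roughly like the sketch below (the simple per-tensor max scaling is an assumption, not necessarily what Comfy does); the re-quantization at the end is why memory stays flat but a little quality can be lost:

```python
import torch

def merge_lora_into_fp8(w_fp8, w_scale, lora_down, lora_up, alpha):
    # Dequantize once, fold the LoRA delta in, then quantize back to fp8.
    w = w_fp8.to(torch.float32) * w_scale
    w = w + alpha * (lora_up.to(torch.float32) @ lora_down.to(torch.float32))
    # Simple per-tensor scale so the largest value fits the e4m3 range (~448).
    new_scale = w.abs().max() / 448.0
    w_merged = (w / new_scale).to(torch.float8_e4m3fn)   # rounding back to fp8 is where quality can drop
    return w_merged, new_scale                            # same storage size as before the merge
```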

When using GGUF in native Comfy or the unmerged LoRA mode in my wrapper, the LoRA weights are not merged and are instead handled the way I explained above.

The 4000 series and up can do the Linear layer matrix multiplications in fp8, yes, but that part is generally not where the peak VRAM usage happens during model inference, and the rest still has to run at higher precision.
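
To put a hedged number on that last point, here is a back-of-envelope sketch; every figure in it is an illustrative assumption, not a measured Wan2.2 value:

```python
# Rough illustration: fp16 activations scale with clip length, fp8 weights don't.
# All numbers are placeholder assumptions, not Wan2.2 measurements.
hidden = 5120                 # assumed transformer width
ffn = 4 * hidden              # assumed MLP expansion
bytes_fp16 = 2

def per_block_working_set(tokens):
    # hidden states + Q/K/V + MLP intermediate held in fp16 during one block's forward
    return tokens * (hidden + 3 * hidden + ffn) * bytes_fp16

for seconds, tokens in ((2, 30_000), (5, 75_000)):
    gb = per_block_working_set(tokens) / 1e9
    print(f"~{seconds}s clip, ~{tokens} latent tokens -> ~{gb:.1f} GB of fp16 activations per block")
```

The fp8 weights stay the same size regardless of clip length, but those fp16 working tensors grow with the token count, which is why longer videos cost so much more VRAM.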
