About memory usage

#3
by Kijai - opened

Hey,

Is this supposed to use a lot more VRAM with WanVideo? That's what I'm experiencing in my initial tests. I can get the speed gain on a 5090 at 1280x720x81 for sure, but unlike with sage2 I can't fit that without block offloading, ending up with barely any speedup in practice.

Seems that the delta_s tensor in the preprocess_qkv just becomes too big...
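For a rough sense of why it blows up at that resolution, here's a back-of-the-envelope sketch. The Wan 2.1 compression factors (4x temporal / 8x spatial VAE, (1, 2, 2) patchify) are its usual defaults; the delta_s-like shape below is a guess for illustration only, not the actual layout in preprocess_qkv:

```python
# Rough scale estimate, assuming Wan 2.1's 4x temporal / 8x spatial VAE
# compression and a (1, 2, 2) patchify -- adjust if your model differs.
frames, height, width = 81, 720, 1280

latent_t = (frames - 1) // 4 + 1               # 21 latent frames
tokens = latent_t * (height // 16) * (width // 16)
print(tokens)                                  # 75,600 tokens

# Hypothetical delta_s-like tensor: one fp32 value per (query token,
# 128-wide key block) pair, per head -- illustrative shape only.
heads, block = 40, 128
gib = tokens * (tokens // block) * heads * 4 / 2**30
print(f"~{gib:.1f} GiB")                       # ~6.6 GiB on top of weights/activations
```

Anything that scales with tokens² grows fast at video sequence lengths, which would match the offloading requirement Kijai describes.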

Maybe unrelated here - hi Kijai. I did a preliminary PR on Comfy adding sage3 support, but I get lower speeds vs sage2; do you have any idea why? I think it is quite possible my implementation is not correct.

For now I've tested it only with txt2img and img2img models: https://github.com/comfyanonymous/ComfyUI/pull/9047

Setting per_block_mean=False actually makes it run, though I don't know what the implications of that are.
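For anyone looking for where that flag lives, here's a minimal sketch of how it would be passed. The sageattn3 import name is an assumption based on this thread; check the installed sageattention package (or the patched ComfyUI files from the PR) for the real entry point:

```python
import torch
from sageattention import sageattn3  # entry-point name assumed, not verified

# (batch, heads, seq_len, head_dim) -- small shapes just to exercise the call
q = torch.randn(1, 40, 4096, 128, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# per_block_mean=False skips the per-block mean centering; whether that
# costs accuracy at video resolutions is the open question in this thread.
out = sageattn3(q, k, v, per_block_mean=False)
```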

At 832x480 it's not faster than sage2, but at 1280x720 it's considerably faster for me.

Perfect, many thanks! I will try it out. I'm not sure about the implications of per_block_mean either.

Applied the Comfy patch from Panchovix and tested the Wan2.1 I2V model in the Comfy native workflow. No speed difference compared to Sage 2 on an RTX 5080. I tested both 832x480 and 1280x720.

This was my test on my WanVideoWrapper, 1280x640x81, 6 steps with Lightx2v distill LoRA, fp8, per_block_mean=False

My test on the native workflow, 1280x720x81, 6 steps with LightX2v or FusioniX, fp8, per_block_mean=False:

RTX 5080 16GB VRAM + 64GB DDR5
Sage2: 26.10s/it (PyTorch 2.7.1, CUDA 12.9, Python 3.12.9)
Sage3: 24.74s/it (PyTorch 2.9.0, CUDA 12.9, Python 3.13.5)

I'm not sure about the state of the implementation yet, but I was really hoping Sage3 was going to be faster.
The slight speed boost I'm getting here could just be due to the newer Python/PyTorch stack.

It may be that my implementation is not correct. In the end I didn't see a difference on SDXL/Flux.

I'm not a Python dev, so I can't tell, but thank you for being the first one to provide a pull request. Let's see what the other Comfy devs say after the review.
Speaking of Flux / SDXL, I didn't try those here, but on my end Sage2 slows down SDXL by a lot. Flux is fine and gets the speedup, but SDXL runs 3-4 times faster on PyTorch attention for me.
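If anyone wants to reproduce that SDXL gap outside a full pipeline, something like the sketch below compares the two backends at an SDXL-ish shape (the batch/head numbers are illustrative; sageattn is the SageAttention 2 entry point):

```python
import torch
import torch.nn.functional as F
from sageattention import sageattn  # SageAttention 2 entry point

def bench(fn, *args, iters=50):
    # warm up, then time with CUDA events
    for _ in range(5):
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# SDXL-like shape: head_dim 64 and a few thousand tokens, where small
# head dims can favor cuDNN/Flash SDPA over SageAttention's kernels.
q, k, v = (torch.randn(2, 10, 4096, 64, dtype=torch.float16, device="cuda")
           for _ in range(3))

print(f"sdpa: {bench(F.scaled_dot_product_attention, q, k, v):.3f} ms/call")
print(f"sage: {bench(sageattn, q, k, v):.3f} ms/call")
```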

Is this going to work with RTX 30XX & 40XX GPUs? Thank you.

Hi everyone - how's it going?
Where would I find per_block_mean=?
I'd like to set per_block_mean=False.
Sincerely

I'm struggling now with a go-between alternative. It's not bad, but true FP4 optimization is my focus.
Please help me do better than this (blue dress, high heels, fresh rain, cobblestones; 10 steps, 5/5, 230 secs). This was my first ComfyUI generation.

Over 50 generations later I don't think I've gotten much better results (reads: phone camera pulls in revealing joy? dark pre-dawn sky? no propellers?!? no petrol motorcycles? 8/10, +1700! secs). Either the text encoder is limited or...

I found it - it was already set when I applied the proposed edits to the three .py files. Duhhh. I'm learning so much; thanks to all for allowing early access.
My personal experience with SageAttention 3 is limited by my below-average ComfyUI abilities. So far I can say it works, with mixed results. Here's an example of what SageAttention 3 does to the output in the same workflow (or any workflow, for that matter - this is my fault; I haven't learned how to swap Attention 2 for Attention 3 as suggested in the readme file).
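For the swap itself, the SageAttention readme's suggested plug-and-play route is to replace PyTorch's SDPA globally; a v3 kernel would slot in the same way (its exact import name is an assumption here):

```python
import torch.nn.functional as F
from sageattention import sageattn  # for v3, import the v3 kernel instead (name varies)

# Per the readme: every scaled_dot_product_attention call now routes to Sage.
# Note: sageattn does not accept attn_mask/dropout_p, so callers that pass
# those kwargs will break -- a targeted patch inside the model is safer.
F.scaled_dot_product_attention = sageattn
```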
