Suggestion for 6GB VRAM
I have a Ryzen 7 5700 with 16GB of RAM and an RTX 3050. Can anyone suggest an AI image editing model I can run locally to replace Photoshop editing? I am currently using the Flux Kontext Dev GGUF Q5, but it takes 5-6 minutes per image edit. Please suggest a better option, with a workflow.
Your hardware is not fit for your use case.
Qwen is probably not going to happen for you in 6GB, even at the lowest quant. The next best suggestion would be to use Kontext with Nunchaku INT4 instead of the GGUF Q5; that will give you a good speed boost with minimal quality drop.
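For anyone who wants to try that route from Python instead of a node workflow, here is a rough diffusers-side sketch of swapping in Nunchaku's INT4 transformer for Kontext. The transformer class follows Nunchaku's published diffusers examples, but the Kontext weight repo id is an assumption, so check mit-han-lab's model listings for the actual name.

```python
# Rough sketch only: Flux Kontext with Nunchaku's SVDQuant INT4 transformer in diffusers.
# The INT4 weight repo id below is assumed -- verify it against Nunchaku's releases.
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image
from nunchaku import NunchakuFluxTransformer2dModel

transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-kontext-dev"  # assumed repo id
)
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()  # keep components in system RAM until each is needed

image = load_image("input.png")
result = pipe(image=image, prompt="make the jacket red", num_inference_steps=20).images[0]
result.save("edited.png")
```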
I disagree that it won't fit. I am on an RTX 3060 laptop with 6GB VRAM and 16GB RAM, and both Flux Kontext and the Qwen Image Q4 quants work perfectly fine (if you are willing to wait a while). Memory optimizations are quite mature these days; as long as you are willing to wait, you can fit quite a lot into constrained hardware. Of course quants do degrade quality a bit, but I doubt that is visible for recreational or personal use. For production use you'd be better off just getting better hardware, and all these compromises go away.
PS: Qwen Image Edit with the Lightning LoRA can do an edit in as little as 1 step, so if you find 20 steps, or even the 4 steps recommended for the Lightning 4-step LoRA, too slow to wait for, you can reduce the step count even further. For better quality I would say even 2 steps is enough to get decently detailed output.
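For anyone who wants to reproduce this outside ComfyUI, a rough diffusers sketch is below. QwenImageEditPipeline ships in recent diffusers releases, but the Lightning LoRA repo and file name here are assumptions from memory, so verify them against the actual lightx2v release before running.

```python
# Rough sketch: Qwen Image Edit with a Lightning (few-step) LoRA in diffusers.
# LoRA repo and file names are assumed -- check the actual release for exact names.
import torch
from diffusers import QwenImageEditPipeline
from diffusers.utils import load_image

pipe = QwenImageEditPipeline.from_pretrained("Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16)
pipe.load_lora_weights(
    "lightx2v/Qwen-Image-Lightning",                             # assumed repo id
    weight_name="Qwen-Image-Lightning-4steps-V1.0.safetensors",  # assumed file name
)
pipe.enable_model_cpu_offload()  # mandatory with 6GB of VRAM

image = load_image("input.png")
out = pipe(
    image=image,
    prompt="remove the power lines from the sky",
    num_inference_steps=4,   # drop to 1-2 steps for speed, as described above
    true_cfg_scale=1.0,      # few-step LoRAs are meant to run without CFG
)
out.images[0].save("edited.png")
```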
The laptop 3050 has 4GB of VRAM, not 6, and at this point I am sure even Nunchaku is not going to save OP. Nunchaku uses a 4-bit quant type that is extremely accurate for its size, comparable to GGUF Q5 or Q6, but Qwen Image at 20B is still going to use 10GB for the weights alone.
@Able2 What quant do you use for the text encoder? With my 48GB of system RAM, the diffusers library seems to regularly glitch out and overflow into swap when using enable_model_cpu_offload(), even though it should be more than enough to hold all the components.
Edit: A 4-bit quant of Qwen Image should be 10GB, not 5 GB.
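For reference, the back-of-the-envelope arithmetic behind that correction:

```python
# Rough weight-size math for a ~20B-parameter diffusion transformer.
# Ignores quantization scales/metadata, which add a little on top.
params = 20e9
bytes_per_weight = {"bf16": 2.0, "fp8": 1.0, "4-bit": 0.5}

for name, b in bytes_per_weight.items():
    print(f"{name}: ~{params * b / 1e9:.0f} GB for the weights alone")
# bf16: ~40 GB, fp8: ~20 GB, 4-bit: ~10 GB -- well past what a 6GB card can hold
```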
The RTX 3050 laptop does indeed have only 4GB of VRAM, but it seems OP is on a desktop, since the title clearly states that OP wants suggestions for models that fit within 6GB of VRAM.
As for the text encoder, I am actually using the fp8 version, since the GGUF nodes weren't ready at the point I was testing it. I do get OOM when loading it from swap, but rerunning the same workflow solves the problem. (Fun fact: even with the Q4 quants of the text encoder I still get the same OOM issue when using Qwen Image.) BTW, I am on Comfy, and its memory management system might be quite different from diffusers' implementation.
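For the diffusers side of that question, the usual memory knobs look roughly like this; a sketch only, and not necessarily a cure for the swap issue, since both offload modes still keep the whole pipeline resident in system RAM.

```python
# Rough sketch of the diffusers memory-offload options mentioned above.
import torch
from diffusers import DiffusionPipeline

# Loading in bf16 halves the host-RAM footprint versus fp32; a 20B transformer
# plus a 7B-class text encoder still adds up to tens of GB of system RAM.
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)

# Moves whole components (text encoder, transformer, VAE) onto the GPU one at a time.
pipe.enable_model_cpu_offload()

# Alternative: stream individual submodules instead -- lowest VRAM use, much slower.
# pipe.enable_sequential_cpu_offload()
```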
Oops, I somehow thought the 5700 was a mobile CPU, which is why I assumed OP was on a laptop. And I just remembered there is a 6GB version of the desktop 3050, which has a cut-down bus width and even fewer CUDA cores. Ouch.
But actually, OP's main complaint was more about taking 5-6 minutes for a single image edit with Flux Kontext; he only mentioned the 6GB of VRAM as context. So in his case the raw compute power of his GPU (or rather the lack thereof) also matters. Even if he manages to squeeze everything into VRAM, the 3050 simply has too little power to finish in a reasonable time. My 4090 running the DF11 version of Flux Kontext (which is about 10% slower than BF16) already takes about a minute per run, and it is much more than 5 times as powerful as the 3050. Meanwhile the Q6_K version of Qwen Image takes something like 3-5 minutes to run. So I do not see much room for OP to shave his time down.
I guess the best OP can do is try running Nunchaku for Kontext, enable Sage Attention and torch compilation, load a speed LoRA (I am not sure which speed LoRAs exist for Kontext specifically; speed LoRAs for Dev might not work well with Kontext), and quantise the text encoder as much as possible to reduce swapping from his SSD.
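Of those pieces, torch compilation plus offloading is easy to sketch on the diffusers side; treat the following as a rough outline only, since torch.compile and CPU offload do not always play nicely together, and the Sage Attention and speed-LoRA parts are left out.

```python
# Rough sketch: Flux Kontext with bf16 weights, CPU offload, and a compiled transformer.
import torch
from diffusers import FluxKontextPipeline

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # required to fit within 6GB of VRAM

# torch.compile trades a slow first run for faster subsequent steps; it can
# trigger recompiles (or fail) when combined with offloading, so treat it as optional.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune")
```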
But I am certain Qwen Image can never run at any reasonable speed on a 3050. Being 20B in size means each iteration takes almost twice as long as Flux, even if all the weights are loaded into VRAM.
@Able2 I recommend trying a smaller GGUF quant of the text encoder on your system if ComfyUI handles swapping by writing to your SSD, because that reduces the lifespan of the SSD. If it only performs reads then everything is fine, except for the lack of speed. Since the text encoder is Qwen2.5 VL 7B, there already exist many excellent GGUF quants made for llama.cpp. I personally recommend IQ4_XS with imatrix, from either Unsloth or Bartowski.
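Rough sizes for context; the parameter count is approximate (the "7B" Qwen2.5 VL is closer to 8B including the vision tower) and the bits-per-weight figures are typical values, not exact file sizes.

```python
# Rough size comparison for the Qwen2.5-VL text encoder at different precisions.
params = 8e9                                   # approximate, including the vision tower
bits_per_weight = {"fp16": 16.0, "fp8": 8.0, "IQ4_XS-class GGUF": 4.3}

for name, bpw in bits_per_weight.items():
    print(f"{name}: ~{params * bpw / 8 / 1e9:.1f} GB")
# fp16: ~16 GB, fp8: ~8 GB, IQ4_XS-class: ~4.3 GB -- roughly half the swap traffic of fp8
```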