---
license: apache-2.0
language:
- zh
- en
- fr
- ko
- ja
- de
- it
- pt
base_model:
- black-forest-labs/FLUX.1-schnell
pipeline_tag: text-to-image
library_name: diffusers
---

![image](./chinese.png)

![FLUX.1 [schnell] Grid](./PEA-Diffusion.png)

`MultilingualFLUX.1-adapter` is a multilingual adapter tailored for the FLUX.1 series of models. Because it inherits ByT5 as its text encoder, it can in theory support more than 100 languages, with additional optimization for Chinese. It originates from the ECCV 2024 paper [PEA-Diffusion](https://arxiv.org/abs/2311.17086); the open-source code is available at https://github.com/OPPO-Mente-Lab/PEA-Diffusion.

# Usage
We use the multilingual encoder [byt5-xxl](https://huggingface.co/google/byt5-xxl/tree/main). The teacher model during adaptation was FLUX.1-schnell, and distillation training was carried out with a reverse denoising process. In theory the adapter can be applied to any FLUX.1 series model; the following application examples are provided.
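
Both examples below load the adapter weights (`diffusion_pytorch_model.bin`) from the local working directory. As a minimal sketch, assuming you fetch them with `huggingface_hub` and that the file is hosted in this model repository (the repository id below is a placeholder, not a real value):

```python
# Hypothetical download step; "<this-model-repo-id>" is a placeholder for this model card's repository id.
from huggingface_hub import hf_hub_download

proj_t5_save_path = hf_hub_download(
    repo_id="<this-model-repo-id>",          # replace with the actual repository id
    filename="diffusion_pytorch_model.bin",  # adapter weights loaded by the examples below
)
print("Adapter weights downloaded to:", proj_t5_save_path)
```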

## `MultilingualFLUX.1`
The example below demonstrates the adapter with FLUX.1-schnell. The same approach applies to other FLUX.1 series models; just remember to adjust `num_inference_steps` and `guidance_scale` as needed (see the sketch after the code block).

```python
from diffusers import FluxPipeline, AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Adapter that projects ByT5 hidden states into the FLUX.1 prompt-embedding spaces."""

    def __init__(self, in_dim=4096, out_dim=4096, hidden_dim=4096, out_dim1=768, use_residual=True):
        super().__init__()
        self.layernorm = nn.LayerNorm(in_dim)
        self.projector = nn.Sequential(
            nn.Linear(in_dim, hidden_dim, bias=False),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim, bias=False),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim, bias=False),
        )
        self.fc = nn.Linear(out_dim, out_dim1)

    def forward(self, x):
        x = self.layernorm(x)
        x = self.projector(x)
        x2 = nn.GELU()(x)        # sequence embeddings -> prompt_embeds
        x1 = self.fc(x2)
        x1 = torch.mean(x1, 1)   # pooled embedding -> pooled_prompt_embeds
        return x1, x2


dtype = torch.bfloat16
device = "cuda"
ckpt_id = "black-forest-labs/FLUX.1-schnell"
text_encoder_ckpt_id = "google/byt5-xxl"

# Multilingual ByT5-xxl encoder plus the projection adapter
proj_t5 = MLP(in_dim=4672, out_dim=4096, hidden_dim=4096, out_dim1=768).to(device=device, dtype=dtype)
text_encoder_t5 = T5ForConditionalGeneration.from_pretrained(text_encoder_ckpt_id).get_encoder().to(device=device, dtype=dtype)
tokenizer_t5 = AutoTokenizer.from_pretrained(text_encoder_ckpt_id)

# Load the adapter weights, stripping the "module." prefix left over from distributed training
proj_t5_save_path = "diffusion_pytorch_model.bin"
state_dict = torch.load(proj_t5_save_path, map_location="cpu")
state_dict_new = {}
for k, v in state_dict.items():
    k_new = k.replace("module.", "")
    state_dict_new[k_new] = v
proj_t5.load_state_dict(state_dict_new)

# Load the FLUX.1 pipeline without its original text encoders and VAE:
# prompts are encoded by ByT5 + adapter, and latents are decoded manually below.
pipeline = FluxPipeline.from_pretrained(
    ckpt_id, text_encoder=None, text_encoder_2=None,
    tokenizer=None, tokenizer_2=None, vae=None,
    torch_dtype=torch.bfloat16
).to(device)

vae = AutoencoderKL.from_pretrained(
    ckpt_id,
    subfolder="vae",
    torch_dtype=torch.bfloat16
).to(device)
vae_scale_factor = 2 ** (len(vae.config.block_out_channels))
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

while True:
    raw_text = input("\nPlease Input Query (stop to exit) >>> ")
    if not raw_text:
        print('Query should not be empty!')
        continue
    if raw_text == "stop":
        break

    with torch.no_grad():
        # Encode the (multilingual) prompt with ByT5 and project it into FLUX.1 embeddings
        text_inputs = tokenizer_t5(
            raw_text,
            padding="max_length",
            max_length=256,
            truncation=True,
            add_special_tokens=True,
            return_tensors="pt",
        ).input_ids.to(device)
        text_embeddings = text_encoder_t5(text_inputs)[0]
        pooled_prompt_embeds, prompt_embeds = proj_t5(text_embeddings)

        height, width = 1024, 1024
        latents = pipeline(
            prompt_embeds=prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
            num_inference_steps=4, guidance_scale=0,
            height=height, width=width,
            output_type="latent",
        ).images

        # Decode the latents manually since the pipeline was loaded without a VAE
        latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
        latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor
        image = vae.decode(latents, return_dict=False)[0]
        image = image_processor.postprocess(image, output_type="pil")
        image[0].save("MultilingualFLUX.jpg")
```
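
As noted above, other FLUX.1 series models mainly differ in their recommended sampling settings. A minimal sketch of the adjustment, assuming a guidance-distilled checkpoint such as `black-forest-labs/FLUX.1-dev` (the step count and guidance scale below are typical defaults for that model family, not values prescribed by this adapter):

```python
# Hypothetical adjustment for a FLUX.1 [dev]-style checkpoint: only the checkpoint id
# and the sampling settings change; the ByT5 encoder and the adapter stay the same.
ckpt_id = "black-forest-labs/FLUX.1-dev"

latents = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=28,   # dev-style models typically need more steps than schnell's 4
    guidance_scale=3.5,       # and a non-zero guidance scale
    height=height, width=width,
    output_type="latent",
).images
```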

## `MultilingualOpenFLUX.1`
[OpenFLUX.1](https://huggingface.co/ostris/OpenFLUX.1) is a fine-tune of the FLUX.1-schnell model that has had the distillation trained out of it.
Please be sure to update the path of the `fast-lora.safetensors` file you have downloaded in the code below.
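
A minimal sketch for fetching that LoRA file with `huggingface_hub` (the filename matches the one referenced in the example below; adjust it if the upstream repository changes):

```python
# Download the fast LoRA from the OpenFLUX.1 repository, then pass the local path
# to pipeline.load_lora_weights(...) in the example below.
from huggingface_hub import hf_hub_download

lora_path = hf_hub_download(
    repo_id="ostris/OpenFLUX.1",
    filename="openflux1-v0.1.0-fast-lora.safetensors",
)
print("Fast LoRA downloaded to:", lora_path)
```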

```python
from diffusers import FluxPipeline, AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from transformers import T5ForConditionalGeneration, AutoTokenizer
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Adapter that projects ByT5 hidden states into the FLUX.1 prompt-embedding spaces."""

    def __init__(self, in_dim=4096, out_dim=4096, hidden_dim=4096, out_dim1=768, use_residual=True):
        super().__init__()
        self.layernorm = nn.LayerNorm(in_dim)
        self.projector = nn.Sequential(
            nn.Linear(in_dim, hidden_dim, bias=False),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim, bias=False),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim, bias=False),
        )
        self.fc = nn.Linear(out_dim, out_dim1)

    def forward(self, x):
        x = self.layernorm(x)
        x = self.projector(x)
        x2 = nn.GELU()(x)        # sequence embeddings -> prompt_embeds
        x1 = self.fc(x2)
        x1 = torch.mean(x1, 1)   # pooled embedding -> pooled_prompt_embeds
        return x1, x2


dtype = torch.bfloat16
device = "cuda"
ckpt_id = "ostris/OpenFLUX.1"
text_encoder_ckpt_id = "google/byt5-xxl"

# Multilingual ByT5-xxl encoder plus the projection adapter
proj_t5 = MLP(in_dim=4672, out_dim=4096, hidden_dim=4096, out_dim1=768).to(device=device, dtype=dtype)
text_encoder_t5 = T5ForConditionalGeneration.from_pretrained(text_encoder_ckpt_id).get_encoder().to(device=device, dtype=dtype)
tokenizer_t5 = AutoTokenizer.from_pretrained(text_encoder_ckpt_id)

# Load the adapter weights, stripping the "module." prefix left over from distributed training
proj_t5_save_path = "diffusion_pytorch_model.bin"
state_dict = torch.load(proj_t5_save_path, map_location="cpu")
state_dict_new = {}
for k, v in state_dict.items():
    k_new = k.replace("module.", "")
    state_dict_new[k_new] = v
proj_t5.load_state_dict(state_dict_new)

# Load the OpenFLUX.1 pipeline without its original text encoders and VAE:
# prompts are encoded by ByT5 + adapter, and latents are decoded manually below.
pipeline = FluxPipeline.from_pretrained(
    ckpt_id, text_encoder=None, text_encoder_2=None,
    tokenizer=None, tokenizer_2=None, vae=None,
    torch_dtype=torch.bfloat16
).to(device)
# Update this to the local path of the fast-lora.safetensors file you downloaded
pipeline.load_lora_weights("ostris/OpenFLUX.1/openflux1-v0.1.0-fast-lora.safetensors")

vae = AutoencoderKL.from_pretrained(
    ckpt_id,
    subfolder="vae",
    torch_dtype=torch.bfloat16
).to(device)
vae_scale_factor = 2 ** (len(vae.config.block_out_channels))
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

while True:
    raw_text = input("\nPlease Input Query (stop to exit) >>> ")
    if not raw_text:
        print('Query should not be empty!')
        continue
    if raw_text == "stop":
        break

    with torch.no_grad():
        # Encode the (multilingual) prompt with ByT5 and project it into FLUX.1 embeddings
        text_inputs = tokenizer_t5(
            raw_text,
            padding="max_length",
            max_length=256,
            truncation=True,
            add_special_tokens=True,
            return_tensors="pt",
        ).input_ids.to(device)
        text_embeddings = text_encoder_t5(text_inputs)[0]
        pooled_prompt_embeds, prompt_embeds = proj_t5(text_embeddings)

        height, width = 1024, 1024
        latents = pipeline(
            prompt_embeds=prompt_embeds,
            pooled_prompt_embeds=pooled_prompt_embeds,
            num_inference_steps=4, guidance_scale=0,
            height=height, width=width,
            output_type="latent",
        ).images

        # Decode the latents manually since the pipeline was loaded without a VAE
        latents = FluxPipeline._unpack_latents(latents, height, width, vae_scale_factor)
        latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor
        image = vae.decode(latents, return_dict=False)[0]
        image = image_processor.postprocess(image, output_type="pil")
        image[0].save("MultilingualOpenFLUX.jpg")
```

To learn more, check out the [diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/flux) documentation.

# License
The adapter itself is released under the Apache License 2.0, but it must also follow the license of the base model it is applied to, such as the FLUX.1 [dev] Non-Commercial License.