|
|
--- |
|
|
license: mit |
|
|
datasets: |
|
|
- uwunish/ghibli-dataset |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- stabilityai/stable-diffusion-2-1-base |
|
|
pipeline_tag: text-to-image |
|
|
library_name: diffusers |
|
|
tags: |
|
|
- ghibli |
|
|
- text2image |
|
|
- finetune |
|
|
- sd-2.1 |
|
|
--- |
|
|
<div align="center"> |
|
|
<h1> |
|
|
Ghibli Fine-Tuned Stable Diffusion 2.1 |
|
|
</h1> |
|
|
</div> |
|
|
|
|
|
## Dataset |
|
|
|
|
|
The training data is available at https://huggingface.co/datasets/uwunish/ghibli-dataset.
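
To inspect the data before training or evaluation, it can be loaded with the `datasets` library. A minimal sketch; the `train` split name and the schema are assumptions, so check the dataset card to confirm:

```python
from datasets import load_dataset

# Load the Ghibli dataset from the Hugging Face Hub.
# The split name is an assumption; inspect the dataset card to confirm.
dataset = load_dataset("uwunish/ghibli-dataset", split="train")

print(dataset)            # number of rows and column names
print(dataset[0].keys())  # fields of a single example
```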
|
|
|
|
|
## Hyperparameters |
|
|
The fine-tuning run used the following hyperparameters:
|
|
|
|
|
| Hyperparameter | Value | |
|
|
| --- | --- | |
|
|
| `learning_rate` | 1e-05 | |
|
|
| `num_train_epochs` | 40 | |
|
|
| `train_batch_size` | 2 | |
|
|
| `gradient_accumulation_steps` | 2 | |
|
|
| `mixed_precision` | "fp16" | |
|
|
| `resolution` | 512 | |
|
|
| `max_grad_norm` | 1 | |
|
|
| `lr_scheduler` | "constant" | |
|
|
| `lr_warmup_steps` | 0 | |
|
|
| `checkpoints_total_limit` | 1 | |
|
|
| `use_ema` | True | |
|
|
| `use_8bit_adam` | True | |
|
|
| `center_crop` | True | |
|
|
| `random_flip` | True | |
|
|
| `gradient_checkpointing` | True | |
|
|
|
|
|
These settings balance training efficiency and model quality: mixed precision (fp16), 8-bit Adam, and gradient checkpointing reduce memory usage, while gradient accumulation yields an effective batch size of 4 (`train_batch_size` × `gradient_accumulation_steps`).
|
|
|
|
|
## Metrics |
|
|
|
|
|
The fine-tuning process reached a final training loss of **0.0345**, indicating good convergence on the training objective. Note that training loss alone does not measure stylistic fidelity, so generated samples should be inspected to confirm adherence to the Ghibli style.
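
Assuming the standard `diffusers` text-to-image training objective (the card does not state the loss function explicitly), this value is the mean-squared error between the Gaussian noise added to the latents and the noise predicted by the UNet:

$$
\mathcal{L} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\right]
$$

where $z_t$ is the noised latent at timestep $t$ and $c$ is the CLIP text embedding of the caption.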
|
|
|
|
|
## Usage |
|
|
|
|
|
### Step 1: Import Required Libraries |
|
|
|
|
|
Begin by importing the libraries required for the image generation pipeline.
|
|
|
|
|
```python |
|
|
import torch |
|
|
from PIL import Image |
|
|
import numpy as np |
|
|
from transformers import CLIPTextModel, CLIPTokenizer |
|
|
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler |
|
|
from tqdm import tqdm |
|
|
``` |
|
|
|
|
|
### Step 2: Configure the Model |
|
|
|
|
|
Set up the device and data type, then load the components of the fine-tuned Stable Diffusion model.
|
|
|
|
|
```python |
|
|
# Configure device and data type |
|
|
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") |
|
|
dtype = torch.float16 if torch.cuda.is_available() else torch.float32 |
|
|
|
|
|
# Model path |
|
|
model_name = "danhtran2mind/ghibli-fine-tuned-sd-2.1" |
|
|
|
|
|
# Load model components |
|
|
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", torch_dtype=dtype).to(device) |
|
|
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer") |
|
|
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", torch_dtype=dtype).to(device) |
|
|
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", torch_dtype=dtype).to(device) |
|
|
scheduler = PNDMScheduler.from_pretrained(model_name, subfolder="scheduler") |
|
|
``` |
|
|
|
|
|
### Step 3: Define the Image Generation Function |
|
|
|
|
|
Use the following function to generate Ghibli-style images based on your text prompts. |
|
|
|
|
|
```python |
|
|
def generate_image(prompt, height=512, width=512, num_inference_steps=50, guidance_scale=3.5, seed=42): |
|
|
"""Generate a Ghibli-style image from a text prompt.""" |
|
|
# Set random seed for reproducibility |
|
|
generator = torch.Generator(device=device).manual_seed(int(seed)) |
|
|
|
|
|
# Tokenize and encode the prompt |
|
|
text_input = tokenizer( |
|
|
[prompt], padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt" |
|
|
) |
|
|
with torch.no_grad(): |
|
|
text_embeddings = text_encoder(text_input.input_ids.to(device))[0].to(dtype=dtype) |
|
|
|
|
|
# Encode an empty prompt for classifier-free guidance |
|
|
uncond_input = tokenizer( |
|
|
[""], padding="max_length", max_length=text_input.input_ids.shape[-1], return_tensors="pt" |
|
|
) |
|
|
with torch.no_grad(): |
|
|
uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0].to(dtype=dtype) |
|
|
|
|
|
text_embeddings = torch.cat([uncond_embeddings, text_embeddings]) |
|
|
|
|
|
# Initialize latent representations |
|
|
latents = torch.randn( |
|
|
(1, unet.config.in_channels, height // 8, width // 8), |
|
|
generator=generator, |
|
|
dtype=dtype, |
|
|
device=device |
|
|
) |
|
|
|
|
|
# Configure scheduler timesteps |
|
|
scheduler.set_timesteps(num_inference_steps) |
|
|
latents = latents * scheduler.init_noise_sigma |
|
|
|
|
|
# Denoising loop |
|
|
for t in tqdm(scheduler.timesteps, desc="Generating image"): |
|
|
latent_model_input = torch.cat([latents] * 2) |
|
|
latent_model_input = scheduler.scale_model_input(latent_model_input, t) |
|
|
|
|
|
with torch.no_grad(): |
|
|
if device.type == "cuda": |
|
|
with torch.autocast(device_type="cuda", dtype=torch.float16): |
|
|
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample |
|
|
else: |
|
|
noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample |
|
|
|
|
|
# Apply classifier-free guidance |
|
|
noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) |
|
|
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) |
|
|
latents = scheduler.step(noise_pred, t, latents).prev_sample |
|
|
|
|
|
# Decode latents to image |
|
|
with torch.no_grad(): |
|
|
latents = latents / vae.config.scaling_factor |
|
|
image = vae.decode(latents).sample |
|
|
|
|
|
# Convert to PIL Image |
|
|
image = (image / 2 + 0.5).clamp(0, 1) |
|
|
image = image.detach().cpu().permute(0, 2, 3, 1).numpy() |
|
|
image = (image * 255).round().astype("uint8") |
|
|
return Image.fromarray(image[0]) |
|
|
``` |
|
|
|
|
|
### Step 4: Generate Your Image |
|
|
|
|
|
Craft a vivid prompt and generate a Ghibli-style image.
|
|
|
|
|
```python |
|
|
# Example prompt |
|
|
prompt = "a serene landscape in Ghibli style" |
|
|
|
|
|
# Generate the image |
|
|
image = generate_image( |
|
|
prompt=prompt, |
|
|
height=512, |
|
|
width=512, |
|
|
num_inference_steps=50, |
|
|
guidance_scale=3.5, |
|
|
seed=42 |
|
|
) |
|
|
|
|
|
# Display or save the image |
|
|
image.show() # Or image.save("ghibli_landscape.png") |
|
|
``` |
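
Since the repository uses the standard `diffusers` multi-folder layout (which the component loading in Step 2 relies on), the same model should also work through the high-level `StableDiffusionPipeline`. A minimal sketch, with the same generation parameters as above:

```python
import torch
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load all components (VAE, text encoder, UNet, scheduler) in one call
pipe = StableDiffusionPipeline.from_pretrained(
    "danhtran2mind/ghibli-fine-tuned-sd-2.1", torch_dtype=dtype
).to(device)

image = pipe(
    "a serene landscape in Ghibli style",
    num_inference_steps=50,
    guidance_scale=3.5,
    generator=torch.Generator(device=device).manual_seed(42),
).images[0]
image.save("ghibli_landscape.png")
```

The manual loop in Step 3 is useful for understanding or customizing the denoising process; for routine generation, the pipeline above is simpler and behaves equivalently.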
|
|
## Environment |
|
|
|
|
|
The project was developed and tested in the following environment: |
|
|
|
|
|
- **Python Version**: 3.11.11 |
|
|
- **Dependencies**: |
|
|
|
|
|
| Library | Version | |
|
|
| --- | --- | |
|
|
| huggingface-hub | 0.30.2 | |
|
|
| accelerate | 1.3.0 | |
|
|
| bitsandbytes | 0.45.5 | |
|
|
| torch | 2.5.1 | |
|
|
| Pillow | 11.1.0 | |
|
|
| numpy | 1.26.4 | |
|
|
| transformers | 4.51.1 | |
|
|
| torchvision | 0.20.1 | |
|
|
| diffusers | 0.33.1 | |
|
|
| gradio | Latest | |
|
|
|
|
|
Ensure your environment matches these specifications to avoid compatibility issues. |
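
A quick way to compare your installed versions against the table above, using only the standard library:

```python
import importlib.metadata as metadata

# Print the installed version of each dependency listed above
for package in ["torch", "transformers", "diffusers", "accelerate",
                "bitsandbytes", "huggingface-hub", "Pillow", "numpy"]:
    try:
        print(f"{package}: {metadata.version(package)}")
    except metadata.PackageNotFoundError:
        print(f"{package}: not installed")
```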