---
license: mit
datasets:
- uwunish/ghibli-dataset
language:
- en
base_model:
- stabilityai/stable-diffusion-2-1-base
pipeline_tag: text-to-image
library_name: diffusers
tags:
- ghibli
- text2image
- finetune
- sd-2.1
---
<div align="center">
<h1>
Ghibli Fine-Tuned Stable Diffusion 2.1
</h1>
</div>
## Dataset
The model was fine-tuned on the Ghibli dataset, available at https://huggingface.co/datasets/uwunish/ghibli-dataset.
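If you want to inspect the training data, it can be loaded with the `datasets` library. A minimal sketch (the split and column names are assumptions; check the dataset card for the actual schema):

```python
from datasets import load_dataset

# Load the Ghibli image-caption dataset from the Hugging Face Hub
ds = load_dataset("uwunish/ghibli-dataset", split="train")  # "train" split assumed

print(ds)            # inspect the number of rows and the available columns
print(ds[0].keys())  # e.g. image / caption fields, depending on the schema
```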
## Hyperparameters
The fine-tuning process was optimized with the following hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| `learning_rate` | 1e-05 |
| `num_train_epochs` | 40 |
| `train_batch_size` | 2 |
| `gradient_accumulation_steps` | 2 |
| `mixed_precision` | "fp16" |
| `resolution` | 512 |
| `max_grad_norm` | 1 |
| `lr_scheduler` | "constant" |
| `lr_warmup_steps` | 0 |
| `checkpoints_total_limit` | 1 |
| `use_ema` | True |
| `use_8bit_adam` | True |
| `center_crop` | True |
| `random_flip` | True |
| `gradient_checkpointing` | True |
These settings balance training speed against GPU memory: mixed precision, 8-bit Adam, and gradient checkpointing keep memory usage low, gradient clipping (`max_grad_norm`) guards against unstable updates, and EMA weights smooth the checkpoint used for inference.
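For context, these names mirror the flags of the 🤗 Diffusers text-to-image fine-tuning example script (`examples/text_to_image/train_text_to_image.py`). A hypothetical launch command using the values above might look like this; the script path, dataset argument, and output directory are assumptions, not the exact command used for this model:

```bash
accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1-base" \
  --dataset_name="uwunish/ghibli-dataset" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=2 --gradient_accumulation_steps=2 \
  --num_train_epochs=40 \
  --learning_rate=1e-05 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --mixed_precision="fp16" \
  --use_ema --use_8bit_adam --gradient_checkpointing \
  --checkpoints_total_limit=1 \
  --output_dir="ghibli-fine-tuned-sd-2.1"
```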
## Metrics
Fine-tuning converged to a final training loss of **0.0345**. Note that diffusion training loss is only a rough proxy for visual quality, so judge style fidelity from generated samples as well.
## Usage
### Step 1: Import Required Libraries
Begin by importing the necessary libraries to power the image generation pipeline.
```python
import torch
from PIL import Image
import numpy as np
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from tqdm import tqdm
```
### Step 2: Configure the Model
Set up the device, data type, and load the pre-trained Ghibli-fine-tuned Stable Diffusion model.
```python
# Configure device and data type
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Model path
model_name = "danhtran2mind/ghibli-fine-tuned-sd-2.1"
# Load model components
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", torch_dtype=dtype).to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", torch_dtype=dtype).to(device)
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", torch_dtype=dtype).to(device)
scheduler = PNDMScheduler.from_pretrained(model_name, subfolder="scheduler")
```
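Optionally, switch the modules to inference mode and sanity-check what was loaded before generating. This is a small convenience sketch, not required by the steps below:

```python
# Inference only: make sure no training-specific behavior is active
vae.eval()
text_encoder.eval()
unet.eval()

# Quick sanity check of model size, device, and precision
unet_params = sum(p.numel() for p in unet.parameters())
print(f"UNet parameters: {unet_params / 1e6:.0f}M | device: {device} | dtype: {dtype}")
```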
### Step 3: Define the Image Generation Function
Use the following function to generate Ghibli-style images based on your text prompts.
```python
def generate_image(prompt, height=512, width=512, num_inference_steps=50, guidance_scale=3.5, seed=42):
    """Generate a Ghibli-style image from a text prompt."""
    # Set random seed for reproducibility
    generator = torch.Generator(device=device).manual_seed(int(seed))

    # Tokenize and encode the prompt
    text_input = tokenizer(
        [prompt], padding="max_length", max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        text_embeddings = text_encoder(text_input.input_ids.to(device))[0].to(dtype=dtype)

    # Encode an empty prompt for classifier-free guidance
    uncond_input = tokenizer(
        [""], padding="max_length", max_length=text_input.input_ids.shape[-1], return_tensors="pt"
    )
    with torch.no_grad():
        uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0].to(dtype=dtype)
    text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

    # Initialize latent representations
    latents = torch.randn(
        (1, unet.config.in_channels, height // 8, width // 8),
        generator=generator,
        dtype=dtype,
        device=device,
    )

    # Configure scheduler timesteps
    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.init_noise_sigma

    # Denoising loop
    for t in tqdm(scheduler.timesteps, desc="Generating image"):
        latent_model_input = torch.cat([latents] * 2)
        latent_model_input = scheduler.scale_model_input(latent_model_input, t)
        with torch.no_grad():
            if device.type == "cuda":
                with torch.autocast(device_type="cuda", dtype=torch.float16):
                    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
            else:
                noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

        # Apply classifier-free guidance
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # Decode latents to image
    with torch.no_grad():
        latents = latents / vae.config.scaling_factor
        image = vae.decode(latents).sample

    # Convert to PIL Image
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    image = (image * 255).round().astype("uint8")
    return Image.fromarray(image[0])
```
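The guidance step inside the loop is standard classifier-free guidance: the UNet predicts noise for an unconditional (empty prompt) and a conditional branch, and the two predictions are combined as

$$
\hat{\epsilon} = \epsilon_{\text{uncond}} + s \cdot \left(\epsilon_{\text{text}} - \epsilon_{\text{uncond}}\right)
$$

where `s` is `guidance_scale`. Higher values follow the prompt more literally at the cost of diversity; the default of 3.5 used here is milder than the 7.5 commonly used with Stable Diffusion pipelines.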
### Step 4: Generate Your Image
Craft a vivid prompt and generate your Ghibli-style masterpiece.
```python
# Example prompt
prompt = "a serene landscape in Ghibli style"
# Generate the image
image = generate_image(
prompt=prompt,
height=512,
width=512,
num_inference_steps=50,
guidance_scale=3.5,
seed=42
)
# Display or save the image
image.show() # Or image.save("ghibli_landscape.png")
```
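### Alternative: Load the Full Pipeline
If the repository layout matches a complete Diffusers pipeline (as the component subfolders above suggest), the same model can be loaded in a single call with `StableDiffusionPipeline`. A minimal sketch:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load all components (VAE, text encoder, UNet, scheduler) in one step
pipe = StableDiffusionPipeline.from_pretrained(
    "danhtran2mind/ghibli-fine-tuned-sd-2.1",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

image = pipe(
    "a serene landscape in Ghibli style",
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("ghibli_landscape_pipeline.png")
```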
## Environment
The project was developed and tested in the following environment:
- **Python Version**: 3.11.11
- **Dependencies**:
| Library | Version |
| --- | --- |
| huggingface-hub | 0.30.2 |
| accelerate | 1.3.0 |
| bitsandbytes | 0.45.5 |
| torch | 2.5.1 |
| Pillow | 11.1.0 |
| numpy | 1.26.4 |
| transformers | 4.51.1 |
| torchvision | 0.20.1 |
| diffusers | 0.33.1 |
| gradio | Latest |
Ensure your environment matches these specifications to avoid compatibility issues.
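To reproduce it, the dependency table translates to a pin set like the following (a sketch; `gradio` is left unpinned, as in the table):

```text
huggingface-hub==0.30.2
accelerate==1.3.0
bitsandbytes==0.45.5
torch==2.5.1
Pillow==11.1.0
numpy==1.26.4
transformers==4.51.1
torchvision==0.20.1
diffusers==0.33.1
gradio
```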