---
license: mit
datasets:
- uwunish/ghibli-dataset
language:
- en
base_model:
- stabilityai/stable-diffusion-2-1-base
pipeline_tag: text-to-image
library_name: diffusers
tags:
- ghibli
- text2image
- finetune
- sd-2.1
---
<div align="center">
  <h1>
    Ghibli Fine-Tuned Stable Diffusion 2.1
  </h1>
</div>

## Dataset

The model was fine-tuned on the dataset available at: https://huggingface.co/datasets/uwunish/ghibli-dataset.

## Hyperparameters

Fine-tuning used the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| `learning_rate` | 1e-05 |
| `num_train_epochs` | 40 |
| `train_batch_size` | 2 |
| `gradient_accumulation_steps` | 2 |
| `mixed_precision` | "fp16" |
| `resolution` | 512 |
| `max_grad_norm` | 1 |
| `lr_scheduler` | "constant" |
| `lr_warmup_steps` | 0 |
| `checkpoints_total_limit` | 1 |
| `use_ema` | True |
| `use_8bit_adam` | True |
| `center_crop` | True |
| `random_flip` | True |
| `gradient_checkpointing` | True |

These settings trade training speed for lower memory use: mixed precision (`fp16`), 8-bit Adam, and gradient checkpointing reduce GPU memory, while a batch size of 2 with 2 gradient accumulation steps gives an effective batch size of 4 per device.
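
The arithmetic behind the batch settings can be sketched as follows. This is a minimal illustration, not the training script itself: `num_devices` and `num_train_images` are hypothetical placeholders, since the model card states neither the GPU count nor the dataset size.

```python
import math

# Values copied from the hyperparameter table above.
train_batch_size = 2
gradient_accumulation_steps = 2
num_train_epochs = 40

# Hypothetical values: neither is stated in this model card.
num_devices = 1
num_train_images = 1000


def effective_batch_size(per_device_batch, accumulation_steps, devices=1):
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * accumulation_steps * devices


def total_optimizer_steps(num_images, epochs, per_device_batch, accumulation_steps, devices=1):
    """Optimizer updates over the whole run, rounding up partial batches in each epoch."""
    batches_per_epoch = math.ceil(num_images / (per_device_batch * devices))
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    return updates_per_epoch * epochs


print(effective_batch_size(train_batch_size, gradient_accumulation_steps, num_devices))
print(total_optimizer_steps(num_train_images, num_train_epochs, train_batch_size,
                            gradient_accumulation_steps, num_devices))
```

Replace the placeholder values with your own dataset size and device count to estimate the length of a comparable run.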
  
## Metrics

Fine-tuning reached a final loss of **0.0345**. A low denoising loss suggests stable convergence, but it is not a direct measure of how faithfully generated images match the Ghibli art style.

## Usage

### Step 1: Import Required Libraries

Import the libraries used by the image generation pipeline.

```python
import torch
from PIL import Image
import numpy as np
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from tqdm import tqdm
```

### Step 2: Configure the Model

Select the device and data type, then load the components of the Ghibli fine-tuned Stable Diffusion model.

```python
# Configure device and data type
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Model path
model_name = "danhtran2mind/ghibli-fine-tuned-sd-2.1"

# Load model components
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", torch_dtype=dtype).to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", torch_dtype=dtype).to(device)
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", torch_dtype=dtype).to(device)
scheduler = PNDMScheduler.from_pretrained(model_name, subfolder="scheduler")
```

### Step 3: Define the Image Generation Function

Use the following function to generate Ghibli-style images based on your text prompts.

```python
def generate_image(prompt, height=512, width=512, num_inference_steps=50, guidance_scale=3.5, seed=42):
    """Generate a Ghibli-style image from a text prompt."""
    # Set random seed for reproducibility
    generator = torch.Generator(device=device).manual_seed(int(seed))

    # Tokenize and encode the prompt
    text_input = tokenizer(
        [prompt], padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        text_embeddings = text_encoder(text_input.input_ids.to(device))[0].to(dtype=dtype)

    # Encode an empty prompt for classifier-free guidance
    uncond_input = tokenizer(
        [""], padding="max_length", max_length=text_input.input_ids.shape[-1], return_tensors="pt"
    )
    with torch.no_grad():
        uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0].to(dtype=dtype)

    text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

    # Initialize latent representations
    latents = torch.randn(
        (1, unet.config.in_channels, height // 8, width // 8),
        generator=generator,
        dtype=dtype,
        device=device
    )

    # Configure scheduler timesteps
    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.init_noise_sigma

    # Denoising loop
    for t in tqdm(scheduler.timesteps, desc="Generating image"):
        latent_model_input = torch.cat([latents] * 2)
        latent_model_input = scheduler.scale_model_input(latent_model_input, t)

        with torch.no_grad():
            if device.type == "cuda":
                with torch.autocast(device_type="cuda", dtype=torch.float16):
                    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
            else:
                noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

        # Apply classifier-free guidance
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # Decode latents to image
    with torch.no_grad():
        latents = latents / vae.config.scaling_factor
        image = vae.decode(latents).sample

    # Convert to PIL Image
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    image = (image * 255).round().astype("uint8")
    return Image.fromarray(image[0])
```
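
Two steps of the function above are easy to check in isolation: the latent shape derived from `height` and `width`, and the classifier-free guidance update. The sketch below reproduces both in plain Python on small lists rather than tensors. The helper names `latent_shape` and `apply_guidance` are hypothetical, and the default of 4 latent channels is an assumption (the function above reads the real value from `unet.config.in_channels`).

```python
def latent_shape(height, width, channels=4, vae_scale_factor=8, batch_size=1):
    """Shape of the initial noise tensor for a given output resolution.

    The factor of 8 mirrors the `height // 8, width // 8` in generate_image.
    `channels=4` is an assumed default, not read from the model config.
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError(
            f"height and width must be multiples of {vae_scale_factor}, got {height}x{width}"
        )
    return (batch_size, channels, height // vae_scale_factor, width // vae_scale_factor)


def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance on flat lists of floats.

    Implements uncond + scale * (text - uncond), element by element.
    """
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


print(latent_shape(512, 512))                         # (1, 4, 64, 64)
print(apply_guidance([0.0, 1.0], [1.0, 3.0], 3.5))    # [3.5, 8.0]
print(apply_guidance([0.0, 1.0], [1.0, 3.0], 1.0))    # [1.0, 3.0]
```

With `guidance_scale=1.0` the result equals the text-conditioned prediction, so guidance has no effect; larger values push the prediction further from the unconditional one.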

### Step 4: Generate Your Image

Call `generate_image` with a text prompt. `guidance_scale` controls how strongly the output follows the prompt, and a fixed `seed` makes the result reproducible.

```python
# Example prompt
prompt = "a serene landscape in Ghibli style"

# Generate the image
image = generate_image(
    prompt=prompt,
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=3.5,
    seed=42
)

# Display or save the image
image.show()  # Or image.save("ghibli_landscape.png")
```

## Environment

The project was developed and tested in the following environment:

- **Python Version**: 3.11.11
- **Dependencies**:

| Library | Version |
| --- | --- |
| huggingface-hub | 0.30.2 |
| accelerate | 1.3.0 |
| bitsandbytes | 0.45.5 |
| torch | 2.5.1 |
| Pillow | 11.1.0 |
| numpy | 1.26.4 |
| transformers | 4.51.1 |
| torchvision | 0.20.1 |
| diffusers | 0.33.1 |
| gradio | Latest |

Match these versions where possible to avoid compatibility issues. `gradio` was not pinned to a specific version.
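
One way to compare an environment against the table is to query installed package versions with the standard library. The sketch below is an illustration only: `EXPECTED_VERSIONS` and `check_environment` are hypothetical names, and `gradio` is left out because the table does not pin it.

```python
from importlib import metadata

# Versions copied from the table above; gradio is omitted because it is not pinned.
EXPECTED_VERSIONS = {
    "huggingface-hub": "0.30.2",
    "accelerate": "1.3.0",
    "bitsandbytes": "0.45.5",
    "torch": "2.5.1",
    "Pillow": "11.1.0",
    "numpy": "1.26.4",
    "transformers": "4.51.1",
    "torchvision": "0.20.1",
    "diffusers": "0.33.1",
}


def check_environment(expected=EXPECTED_VERSIONS):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Some wheels carry a local suffix such as "+cu121"; ignore it when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in check_environment().items():
    print(f"{name}: expected {wanted}, found {installed or 'not installed'}")
```

An empty result means every pinned package matches; a differing version is not necessarily a problem, but it is the first thing to check if loading or generation fails.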