---
license: mit
datasets:
- uwunish/ghibli-dataset
language:
- en
base_model:
- stabilityai/stable-diffusion-2-1-base
pipeline_tag: text-to-image
library_name: diffusers
tags:
- ghibli
- text2image
- finetune
- sd-2.1
---
<div align="center">
<h1>
Ghibli Fine-Tuned Stable Diffusion 2.1
</h1>
</div>
## Dataset
The model was fine-tuned on the Ghibli dataset, available at https://huggingface.co/datasets/uwunish/ghibli-dataset.
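If you want to inspect the training data, it can be loaded with the `datasets` library. A minimal sketch (the split and column names are assumptions; check the dataset card for the actual schema):

```python
from datasets import load_dataset

# Load the Ghibli image-caption dataset from the Hugging Face Hub
ds = load_dataset("uwunish/ghibli-dataset", split="train")  # "train" split assumed

print(ds)            # inspect the number of rows and the available columns
print(ds[0].keys())  # e.g. image / caption fields, depending on the schema
```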
## Hyperparameters
The fine-tuning process was optimized with the following hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| `learning_rate` | 1e-05 |
| `num_train_epochs` | 40 |
| `train_batch_size` | 2 |
| `gradient_accumulation_steps` | 2 |
| `mixed_precision` | "fp16" |
| `resolution` | 512 |
| `max_grad_norm` | 1 |
| `lr_scheduler` | "constant" |
| `lr_warmup_steps` | 0 |
| `checkpoints_total_limit` | 1 |
| `use_ema` | True |
| `use_8bit_adam` | True |
| `center_crop` | True |
| `random_flip` | True |
| `gradient_checkpointing` | True |
These settings balance training speed against GPU memory: mixed precision, 8-bit Adam, and gradient checkpointing keep memory usage low, gradient clipping (`max_grad_norm`) guards against unstable updates, and EMA weights smooth the checkpoint used for inference.
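For context, these names mirror the flags of the 🤗 Diffusers text-to-image fine-tuning example script (`examples/text_to_image/train_text_to_image.py`). A hypothetical launch command using the values above might look like this; the script path, dataset argument, and output directory are assumptions, not the exact command used for this model:

```bash
accelerate launch train_text_to_image.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1-base" \
  --dataset_name="uwunish/ghibli-dataset" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=2 --gradient_accumulation_steps=2 \
  --num_train_epochs=40 \
  --learning_rate=1e-05 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --mixed_precision="fp16" \
  --use_ema --use_8bit_adam --gradient_checkpointing \
  --checkpoints_total_limit=1 \
  --output_dir="ghibli-fine-tuned-sd-2.1"
```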
## Metrics
Fine-tuning converged to a final training loss of **0.0345**. Note that diffusion training loss is only a rough proxy for visual quality, so judge style fidelity from generated samples as well.
## Usage
### Step 1: Import Required Libraries
Begin by importing the necessary libraries to power the image generation pipeline.
```python
import torch
from PIL import Image
import numpy as np
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
from tqdm import tqdm
```
### Step 2: Configure the Model
Set up the device, data type, and load the pre-trained Ghibli-fine-tuned Stable Diffusion model.
```python
# Configure device and data type
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Model path
model_name = "danhtran2mind/ghibli-fine-tuned-sd-2.1"
# Load model components
vae = AutoencoderKL.from_pretrained(model_name, subfolder="vae", torch_dtype=dtype).to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_name, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_name, subfolder="text_encoder", torch_dtype=dtype).to(device)
unet = UNet2DConditionModel.from_pretrained(model_name, subfolder="unet", torch_dtype=dtype).to(device)
scheduler = PNDMScheduler.from_pretrained(model_name, subfolder="scheduler")
```
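Optionally, switch the modules to inference mode and sanity-check what was loaded before generating. This is a small convenience sketch, not required by the steps below:

```python
# Inference only: make sure no training-specific behavior is active
vae.eval()
text_encoder.eval()
unet.eval()

# Quick sanity check of model size, device, and precision
unet_params = sum(p.numel() for p in unet.parameters())
print(f"UNet parameters: {unet_params / 1e6:.0f}M | device: {device} | dtype: {dtype}")
```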
### Step 3: Define the Image Generation Function
Use the following function to generate Ghibli-style images based on your text prompts.
```python
def generate_image(prompt, height=512, width=512, num_inference_steps=50, guidance_scale=3.5, seed=42):
    """Generate a Ghibli-style image from a text prompt."""
    # Set random seed for reproducibility
    generator = torch.Generator(device=device).manual_seed(int(seed))

    # Tokenize and encode the prompt
    text_input = tokenizer(
        [prompt], padding="max_length", max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt"
    )
    with torch.no_grad():
        text_embeddings = text_encoder(text_input.input_ids.to(device))[0].to(dtype=dtype)

    # Encode an empty prompt for classifier-free guidance
    uncond_input = tokenizer(
        [""], padding="max_length", max_length=text_input.input_ids.shape[-1], return_tensors="pt"
    )
    with torch.no_grad():
        uncond_embeddings = text_encoder(uncond_input.input_ids.to(device))[0].to(dtype=dtype)
    text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

    # Initialize latent representations
    latents = torch.randn(
        (1, unet.config.in_channels, height // 8, width // 8),
        generator=generator,
        dtype=dtype,
        device=device,
    )

    # Configure scheduler timesteps
    scheduler.set_timesteps(num_inference_steps)
    latents = latents * scheduler.init_noise_sigma

    # Denoising loop
    for t in tqdm(scheduler.timesteps, desc="Generating image"):
        latent_model_input = torch.cat([latents] * 2)
        latent_model_input = scheduler.scale_model_input(latent_model_input, t)
        with torch.no_grad():
            if device.type == "cuda":
                with torch.autocast(device_type="cuda", dtype=torch.float16):
                    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample
            else:
                noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

        # Apply classifier-free guidance
        noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
        noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    # Decode latents to image
    with torch.no_grad():
        latents = latents / vae.config.scaling_factor
        image = vae.decode(latents).sample

    # Convert to PIL Image
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    image = (image * 255).round().astype("uint8")
    return Image.fromarray(image[0])
```
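The guidance step inside the loop is standard classifier-free guidance: the UNet predicts noise for an unconditional (empty prompt) and a conditional branch, and the two predictions are combined as

$$
\hat{\epsilon} = \epsilon_{\text{uncond}} + s \cdot \left(\epsilon_{\text{text}} - \epsilon_{\text{uncond}}\right)
$$

where `s` is `guidance_scale`. Higher values follow the prompt more literally at the cost of diversity; the default of 3.5 used here is milder than the 7.5 commonly used with Stable Diffusion pipelines.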
### Step 4: Generate Your Image
Craft a vivid prompt and generate your Ghibli-style masterpiece.
```python
# Example prompt
prompt = "a serene landscape in Ghibli style"
# Generate the image
image = generate_image(
prompt=prompt,
height=512,
width=512,
num_inference_steps=50,
guidance_scale=3.5,
seed=42
)
# Display or save the image
image.show() # Or image.save("ghibli_landscape.png")
```
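### Alternative: Load the Full Pipeline
If the repository layout matches a complete Diffusers pipeline (as the component subfolders above suggest), the same model can be loaded in a single call with `StableDiffusionPipeline`. A minimal sketch:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load all components (VAE, text encoder, UNet, scheduler) in one step
pipe = StableDiffusionPipeline.from_pretrained(
    "danhtran2mind/ghibli-fine-tuned-sd-2.1",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

image = pipe(
    "a serene landscape in Ghibli style",
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("ghibli_landscape_pipeline.png")
```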
## Environment
The project was developed and tested in the following environment:
- **Python Version**: 3.11.11
- **Dependencies**:
| Library | Version |
| --- | --- |
| huggingface-hub | 0.30.2 |
| accelerate | 1.3.0 |
| bitsandbytes | 0.45.5 |
| torch | 2.5.1 |
| Pillow | 11.1.0 |
| numpy | 1.26.4 |
| transformers | 4.51.1 |
| torchvision | 0.20.1 |
| diffusers | 0.33.1 |
| gradio | Latest |
Ensure your environment matches these specifications to avoid compatibility issues.
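To reproduce it, the dependency table translates to a pin set like the following (a sketch; `gradio` is left unpinned, as in the table):

```text
huggingface-hub==0.30.2
accelerate==1.3.0
bitsandbytes==0.45.5
torch==2.5.1
Pillow==11.1.0
numpy==1.26.4
transformers==4.51.1
torchvision==0.20.1
diffusers==0.33.1
gradio
```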