paligemma-architecture

This model is a fine-tuned version of google/paligemma2-3b-pt-448, trained on a custom dataset of building architecture (700 image-description pairs). It is my first model uploaded to Hugging Face.

Training procedure

I followed the fine-tuning notebook from smol-vision, adjusting the dataset loading and some parameters.

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 1
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 8
total_train_batch_size: 8
optimizer: OptimizerNames.ADAMW_HF with betas=(0.9,0.999) and epsilon=1e-08, no additional optimizer arguments
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 2
num_epochs: 4
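
The batch-size and scheduler settings above are related by simple arithmetic. The sketch below is a minimal pure-Python illustration, not code from the training notebook. It assumes a single GPU, that incomplete accumulation windows at the end of an epoch are dropped, and that the linear scheduler warms up from zero and then decays linearly to zero. The helper names (`effective_batch_size`, `optimizer_steps`, `linear_lr`) are hypothetical.

```python
def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update."""
    return per_device * grad_accum * num_devices


def optimizer_steps(num_samples: int, batch_size: int, epochs: int) -> int:
    """Optimizer updates over the whole run (assumes partial windows are dropped)."""
    return (num_samples // batch_size) * epochs


def linear_lr(step: int, total_steps: int, base_lr: float = 1e-5, warmup_steps: int = 2) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batch = effective_batch_size(per_device=1, grad_accum=8)      # 8, matches total_train_batch_size
steps = optimizer_steps(num_samples=700, batch_size=batch, epochs=4)
print(batch, steps)
print([linear_lr(s, total_steps=352) for s in (0, 1, 2, 176, 352)])
```

With the 700 pairs quoted in the description this gives 348 updates, close to the global_step of 352 reported in the training results; the small gap presumably reflects the exact dataset size.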

Training used approximately 30 GB of GPU memory on a Google Colab A100.

Training results

TrainOutput(global_step=352, training_loss=7.797419488430023, metrics={ 'train_runtime': 1653.6164, 'train_samples_per_second': 1.705, 'train_steps_per_second': 0.213, 'total_flos': 5.772661476596784e+16, 'train_loss': 7.797419488430023, 'epoch': 3.9645390070921986})
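
As a quick consistency check, the throughput figures in this output can be re-derived from the step count and runtime. The snippet below is only a sketch: the `reported` dictionary is a hand-copied subset of the values above, and the variable names are illustrative.

```python
# Values copied by hand from the TrainOutput above.
reported = {
    "global_step": 352,
    "train_runtime": 1653.6164,          # seconds
    "train_samples_per_second": 1.705,
    "train_steps_per_second": 0.213,
}

# Optimizer steps per second, to compare with train_steps_per_second.
steps_per_second = reported["global_step"] / reported["train_runtime"]

# Average wall-clock time of one optimizer step (8 accumulated samples).
seconds_per_step = reported["train_runtime"] / reported["global_step"]

# Approximate number of samples processed over the whole run.
samples_processed = reported["train_samples_per_second"] * reported["train_runtime"]

print(round(steps_per_second, 3), round(seconds_per_step, 2), round(samples_processed))
```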

Usage

Using a CUDA-capable GPU:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
import torch
from PIL import Image
import requests

# Model, device, and dtype
model_id = "lmajnaric/paligemma448_arch_finetune"
device = "cuda"
dtype = torch.bfloat16

# Load image using path or url
url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open("building.jpg")


# Load model and processor with bfloat16 precision
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map=device,
).eval()

processor = AutoProcessor.from_pretrained(model_id)


# Create prompt
prompt = (
    "Describe this building's architectural style in detail. What are its key features? "
    "What period and region is this style associated with? What materials are predominantly "
    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
    "Describe the overall structure, including the shape, height, and any distinctive "
    "architectural elements like towers, domes, or facades. If the building has a name, "
    "please state it in the beginning."
)

# Process inputs
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# Generate text
with torch.inference_mode():
    generation = model.generate(
        **model_inputs, 
        max_new_tokens=256,
        do_sample=True,      # Enable sampling for more diverse outputs
        temperature=0.7,     # Control randomness (lower = more deterministic)
        top_p=0.9,
    )
    
    # Only decode the new tokens (not the prompt)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    
    print(decoded)

Or on a CPU:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
import torch
from PIL import Image
import requests

# Model
model_id = "lmajnaric/paligemma448_arch_finetune"

# Load image using path or url
url = "https://cms.guggenheim-bilbao.eus/uploads/2019/05/el-edificio-guggenheim-bilbao-1.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# image = Image.open("building.jpg")


# Load model and processor (default float32 precision on CPU)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)


# Create prompt
prompt = (
    "Describe this building's architectural style in detail. What are its key features? "
    "What period and region is this style associated with? What materials are predominantly "
    "used in this building? Describe any notable decorative elements, patterns, or ornaments. "
    "Describe the overall structure, including the shape, height, and any distinctive "
    "architectural elements like towers, domes, or facades. If the building has a name, "
    "please state it in the beginning."
)

# Process inputs
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

# Generate text
with torch.inference_mode():
    generation = model.generate(
        **model_inputs, 
        max_new_tokens=256,
        do_sample=True,      # Enable sampling for more diverse outputs
        temperature=0.7,     # Control randomness (lower = more deterministic)
        top_p=0.9,
    )
    
    # Only decode the new tokens (not the prompt)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    
    print(decoded)
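
Both snippets above sample with temperature 0.7 and top_p 0.9, so repeated runs on the same image can produce different descriptions. If reproducible output matters more than variety, greedy decoding is one alternative. The sketch below only builds the keyword arguments; the names `sampling_kwargs` and `greedy_kwargs` are illustrative, and either dictionary would be passed as `model.generate(**model_inputs, **kwargs)`.

```python
# Settings used in the examples above: stochastic, more varied descriptions.
sampling_kwargs = dict(
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Alternative (an assumption, not from the original card): greedy decoding,
# which returns the same text for the same image and prompt.
greedy_kwargs = dict(
    max_new_tokens=256,
    do_sample=False,
)

for name, kwargs in (("sampling", sampling_kwargs), ("greedy", greedy_kwargs)):
    print(name, kwargs)
```

Calling `torch.manual_seed(...)` before `generate` is another way to make sampled output repeatable on a given setup.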

Framework versions

Transformers 4.50.0.dev0
Pytorch 2.6.0+cu124
Datasets 3.4.0
Tokenizers 0.21.0
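
To compare a local environment against the versions listed above, something like the following sketch can help. It uses only the standard library; the `EXPECTED` mapping and the `installed_versions` helper are illustrative names, and the package names are assumed to be the usual PyPI distribution names.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed on this card, keyed by (assumed) PyPI distribution name.
EXPECTED = {
    "transformers": "4.50.0.dev0",
    "torch": "2.6.0+cu124",
    "datasets": "3.4.0",
    "tokenizers": "0.21.0",
}


def installed_versions(packages):
    """Return {package: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, have in installed_versions(EXPECTED).items():
    print(f"{name}: installed={have}, card={EXPECTED[name]}")
```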
