metadata

base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
  - HuggingFaceM4/WebSight

Model Card for Llama-3.2-11B-Vision-WebSight

Llama 3.2 Vision Instruct fine-tuned with LoRA on 10k samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight to generate web page code from a screenshot.

Model Details

Model Description

Developed by: pdufour
Model type: Vision Language Model
Language(s) (NLP): English
License: MIT
Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

How to Get Started with the Model

from transformers import AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
import torch

base_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
adapter_id = "pdufour/Llama-3.2-11B-Vision-WebSight"

# Load the base model in 4-bit (requires bitsandbytes), then attach the LoRA adapter.
base_model = AutoModelForVision2Seq.from_pretrained(
    base_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
processor = AutoProcessor.from_pretrained(base_id)

# The <|image|> token marks where the image is inserted into the prompt.
inputs = processor(
    text="Generate code for a web page that looks exactly like this. <|image|>",
    images=Image.open("fashion.jpg"),
    return_tensors="pt",
).to(model.device)

# Pass every processor output (pixel values and masks), not just input_ids;
# otherwise the image never reaches the model.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Training Details

Training Data

10k samples from HuggingFaceM4/WebSight, a synthetic dataset of web page screenshots paired with the HTML code that renders them.

Training Procedure

Training Hyperparameters

Training regime: Fine-tuning with LoRA
Learning rate: 0.0002
Batch size: 10
Gradient accumulation steps: 1
Number of epochs: 3.0
Optimizer: adamw_torch_fused
LR scheduler type: constant
Weight decay: 0.0
FP16 Training: False
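
The hyperparameters above can be written as keyword arguments using the field names of `transformers.TrainingArguments`. This is a minimal sketch, not the author's training script: the card does not say which trainer was used, `training_kwargs` is a hypothetical name, and "Batch size: 10" is assumed to be the per-device batch size (the card reports a single GPU).

```python
# Hyperparameters from this card, keyed by transformers.TrainingArguments field names.
# Assumption: "Batch size" means the per-device batch size.
training_kwargs = {
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 10,
    "gradient_accumulation_steps": 1,
    "num_train_epochs": 3.0,
    "optim": "adamw_torch_fused",
    "lr_scheduler_type": "constant",
    "weight_decay": 0.0,
    "fp16": False,
}

for name, value in training_kwargs.items():
    print(f"{name}: {value}")

# Possible usage (requires transformers and PyTorch >= 2.0 for the fused optimizer;
# the output directory is a placeholder):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="outputs", **training_kwargs)
```

The LoRA settings (rank, alpha, target modules) are not listed in this card, so they are left out rather than guessed.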

Speeds, Sizes, Times

Training Duration: Not recorded
Number of Trainable Parameters: Not recorded
Model Size: 0.08 GB (LoRA adapter weights only)
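
Although the trainable parameter count is not reported, the adapter size gives a rough bound. The sketch below is a back-of-the-envelope estimate only. It assumes the 0.08 GB file holds nothing but LoRA weights, that 1 GB means 10^9 bytes, and that the weights are stored either as float32 (4 bytes) or as bfloat16/float16 (2 bytes). The card states none of these.

```python
def estimate_param_count(size_gb: float, bytes_per_param: int) -> int:
    """Approximate number of parameters in a checkpoint of the given size."""
    return round(size_gb * 1e9 / bytes_per_param)

adapter_size_gb = 0.08  # "Model Size" reported in this card

# The storage dtype is not stated, so compute both plausible cases.
fp32_estimate = estimate_param_count(adapter_size_gb, 4)
half_estimate = estimate_param_count(adapter_size_gb, 2)

print(f"float32 storage:          ~{fp32_estimate / 1e6:.0f}M parameters")
print(f"bfloat16/float16 storage: ~{half_estimate / 1e6:.0f}M parameters")
```

For an exact figure, calling `print_trainable_parameters()` on the loaded `PeftModel` reports the trainable and total parameter counts.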

Evaluation

Metrics

Results

Values from the training log at step 900:

epoch: 0.9000
grad_norm: 0.2568
learning_rate: 0.0002
loss: 0.0791
step: 900
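
The relationship between the step and epoch values can be checked with a short calculation. This sketch assumes that all 10k samples were used for training and that the batch size of 10 applies to the single GPU with no gradient accumulation, as listed under Training Hyperparameters.

```python
import math

num_samples = 10_000   # "10k samples" (assumed to all be training samples)
batch_size = 10
grad_accum_steps = 1
num_gpus = 1
num_epochs = 3

# Samples consumed per optimizer step, and optimizer steps in one pass over the data.
samples_per_step = batch_size * grad_accum_steps * num_gpus
steps_per_epoch = math.ceil(num_samples / samples_per_step)

logged_step = 900
epoch_at_logged_step = logged_step / steps_per_epoch
total_steps = steps_per_epoch * num_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"epoch at step {logged_step}: {epoch_at_logged_step}")
print(f"total steps for {num_epochs} epochs: {total_steps}")
```

Under these assumptions one epoch is 1000 steps, so step 900 corresponds to epoch 0.9, which matches the log entry. The reported loss is therefore from partway through the first of the 3 configured epochs.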

Technical Specifications

Model Architecture and Objective

LoRA-tuned Vision-Language Model based on Llama architecture.

Compute Infrastructure

Hardware Type: GPU
Number of GPUs: 1

Software

Framework versions:
- PEFT 0.13.2
- PyTorch 2.5.0+cu121

Model Card Contact

For questions about this model, please file an issue on the GitHub repository.