base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
  - HuggingFaceM4/WebSight

Model Card for Llama-3.2-11B-Vision-WebSight

Llama 3.2 11B Vision Instruct, fine-tuned with LoRA on 10k samples from https://huggingface.co/datasets/HuggingFaceM4/WebSight to generate web page code from a screenshot.

Model Details

Model Description

  • Developed by: pdufour
  • Model type: Vision Language Model
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: meta-llama/Llama-3.2-11B-Vision-Instruct

How to Get Started with the Model

from transformers import AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
import torch

base_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
adapter_id = "pdufour/Llama-3.2-11B-Vision-WebSight"

# Load the base model in 4-bit (requires bitsandbytes), then attach the LoRA adapter.
base_model = AutoModelForVision2Seq.from_pretrained(
    base_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, adapter_id)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
processor = AutoProcessor.from_pretrained(base_id)

# fashion.jpg is the screenshot of the page to reproduce.
prompt = "Generate code for a web page that looks exactly like this. <|image|>"
inputs = processor(text=prompt, images=Image.open("fashion.jpg"), return_tensors="pt").to(model.device)

# Pass every processor output (pixel values, attention masks), not only input_ids,
# so that the model actually receives the image.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
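
For decoder-style models, `generate` typically returns the prompt tokens followed by the newly generated ones, so decoding `outputs[0]` prints the instruction text ahead of the generated markup. A minimal sketch of trimming the prompt, using plain Python lists in place of tensors; `strip_prompt` is a hypothetical helper, not part of transformers or of this repository:

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the token ids produced after the prompt."""
    if prompt_length < 0 or prompt_length > len(output_ids):
        raise ValueError("prompt_length must be between 0 and len(output_ids)")
    return output_ids[prompt_length:]

# Toy ids standing in for a real sequence: 3 prompt tokens, then 2 generated ones.
prompt_ids = [101, 7, 8]
output_ids = prompt_ids + [42, 43]
new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```

The same slice works on a tensor: taking `outputs[0][inputs["input_ids"].shape[-1]:]` before decoding should leave only the generated page code.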

Training Details

Training Data

10k samples from HuggingFaceM4/WebSight, a dataset of web page screenshots paired with the HTML code that renders them, used for instruction tuning.

Training Procedure

Training Hyperparameters

  • Training regime: Fine-tuning with LoRA
  • Learning rate: 0.0002
  • Batch size: 10
  • Gradient accumulation steps: 1
  • Number of epochs: 3.0
  • Optimizer: adamw_torch_fused
  • LR scheduler type: constant
  • Weight decay: 0.0
  • FP16 Training: False
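
The batch size, accumulation steps, and epoch count above determine how many optimizer steps a full run would take. A rough sketch of that arithmetic, assuming the "10k samples" in the introduction means exactly 10,000 and that training ran on the single GPU listed under Compute Infrastructure; `planned_steps` is an illustrative name, not something from the training script:

```python
import math

def planned_steps(num_samples, batch_size, grad_accum_steps, num_epochs, num_gpus=1):
    """Return (effective_batch_size, steps_per_epoch, total_steps)."""
    effective_batch = batch_size * grad_accum_steps * num_gpus
    # Round up so that a partial final batch still counts as a step.
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * num_epochs

effective_batch, steps_per_epoch, total_steps = planned_steps(
    num_samples=10_000,   # assumption: "10k samples" is exactly 10,000
    batch_size=10,
    grad_accum_steps=1,
    num_epochs=3,
)
print(effective_batch, steps_per_epoch, total_steps)  # 10 1000 3000
```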

Speeds, Sizes, Times

  • Training Duration: Unknown
  • Number of Trainable Parameters: Unknown
  • Model Size: 0.08 GB

Evaluation

Metrics

No held-out evaluation metrics are reported. The values under Results are training-log entries.

Results

  • epoch: 0.9000
  • grad_norm: 0.2568
  • learning_rate: 0.0002
  • loss: 0.0791
  • step: 900
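
The logged step and epoch can be cross-checked against the batch size. A small sketch, assuming exactly 10,000 training samples and the batch size of 10 with no gradient accumulation listed under Training Hyperparameters; `epoch_at_step` is an illustrative helper:

```python
def epoch_at_step(step, samples_per_step, num_samples):
    """Fraction of epochs completed after `step` optimizer steps."""
    return step * samples_per_step / num_samples

epoch = epoch_at_step(step=900, samples_per_step=10, num_samples=10_000)
print(f"{epoch:.4f}")  # 0.9000
```

Under these assumptions the logged values correspond to 0.9 of the first epoch, out of the 3 epochs configured.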

Technical Specifications

Model Architecture and Objective

LoRA-tuned Vision-Language Model based on Llama architecture.
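
LoRA keeps the base weights frozen and learns a low-rank update, so the effective weight of an adapted layer is W + (alpha / r) · B·A. A toy sketch in pure Python; the rank, alpha, and matrix sizes are illustrative assumptions, since this card does not state the adapter's LoRA configuration:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) without modifying W."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Frozen 2x3 base weight with a rank-1 adapter: B is 2x1, A is 1x3.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
lora_a = [[1.0, 2.0, 3.0]]
lora_b = [[0.5], [0.0]]
merged = lora_weight(w, lora_a, lora_b, alpha=2, rank=1)
print(merged)
```

Only A and B are trained and stored, which is why an adapter can be far smaller than the base model it modifies.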

Compute Infrastructure

  • Hardware Type: GPU
  • Number of GPUs: 1

Software

  • Framework versions:
    • PEFT 0.13.2
    • PyTorch 2.5.0+cu121

Model Card Contact

For questions about this model, please file an issue on the GitHub repository.