---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
- HuggingFaceM4/WebSight
---
# Model Card for Llama-3.2-11B-Vision-WebSight
Llama 3.2 Vision Instruct fine-tuned on 10k samples from [HuggingFaceM4/WebSight](https://huggingface.co/datasets/HuggingFaceM4/WebSight).
## Model Details
### Model Description
* **Developed by:** pdufour
* **Model type:** Vision Language Model
* **Language(s) (NLP):** English
* **License:** MIT
* **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
## How to Get Started with the Model
```python
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoProcessor
from peft import PeftModel
from PIL import Image
import torch

# Load the 4-bit base model and attach the WebSight LoRA adapter
# (load_in_4bit requires bitsandbytes)
model = PeftModel.from_pretrained(
    AutoModelForVision2Seq.from_pretrained(
        "meta-llama/Llama-3.2-11B-Vision-Instruct",
        device_map="auto",
        load_in_4bit=True,
    ),
    "pdufour/Llama-3.2-11B-Vision-WebSight",
)
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

inputs = processor(
    text="Generate code for a web page that looks exactly like this. <|image|>",
    images=Image.open("fashion.jpg"),
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    # Pass all processor outputs (pixel values included), not just input_ids
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
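The decoded output above still includes the prompt. A short follow-up sketch that keeps only the newly generated tokens and writes them to disk (the output file name is arbitrary):

```python
# Drop the prompt tokens and keep only the generated continuation
generated = outputs[0][inputs["input_ids"].shape[1]:]
html = tokenizer.decode(generated, skip_special_tokens=True)
with open("generated_page.html", "w") as f:  # arbitrary output path
    f.write(html)
```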
## Training Details
### Training Data
10,000 samples from [HuggingFaceM4/WebSight](https://huggingface.co/datasets/HuggingFaceM4/WebSight), a vision-language dataset pairing screenshots of synthetic web pages with the HTML code that renders them, used here for instruction tuning (a loading sketch follows).
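A minimal sketch of pulling a comparable subset, assuming the first 10,000 streamed rows and the WebSight column names `image` and `text` (the exact subset used for this adapter is not documented):

```python
from datasets import load_dataset

# Stream the dataset and take 10k rows; the exact 10k used for training
# is not documented, so this is only an illustrative subset.
ds = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)
samples = list(ds.take(10_000))

example = samples[0]
screenshot = example["image"]  # rendered page screenshot (PIL image)
html_source = example["text"]  # HTML code that produced the screenshot
```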
### Training Procedure
#### Training Hyperparameters
* **Training regime:** Fine-tuning with LoRA (see the configuration sketch after this list)
* **Learning rate:** 0.0002
* **Batch size:** 10
* **Gradient accumulation steps:** 1
* **Number of epochs:** 3.0
* **Optimizer:** adamw_torch_fused
* **LR scheduler type:** constant
* **Weight decay:** 0.0
* **FP16 Training:** False
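The card does not record the LoRA rank, alpha, or target modules. Below is a configuration sketch that maps the reported hyperparameters onto `transformers.TrainingArguments`, with placeholder LoRA settings that are hypothetical, not the values used for this adapter:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Placeholder LoRA settings -- the rank, alpha, and target modules used for
# this adapter are not reported in the card (see Technical Specifications
# for how to read the real values from the published adapter config).
lora_config = LoraConfig(
    r=8,                                  # hypothetical rank
    lora_alpha=16,                        # hypothetical scaling
    target_modules=["q_proj", "v_proj"],  # hypothetical targets
    task_type="CAUSAL_LM",
)

# The hyperparameters reported above, expressed as TrainingArguments
training_args = TrainingArguments(
    output_dir="llama-3.2-11b-vision-websight",  # arbitrary path
    learning_rate=2e-4,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=1,
    num_train_epochs=3.0,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    weight_decay=0.0,
    fp16=False,
)
```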
### Speeds, Sizes, Times
* **Training duration:** not recorded
* **Trainable parameters:** not recorded (the snippet below shows how to read the count)
* **Adapter size:** 0.08 GB
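With the adapter loaded as in the getting-started snippet above, PEFT can report the trainable-parameter count that is missing from this card:

```python
# Prints e.g. "trainable params: ... || all params: ... || trainable%: ..."
model.print_trainable_parameters()
```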
## Evaluation
### Metrics
#### Results
The values below are the final logged training metrics (step 900, epoch 0.9); no held-out evaluation results are reported.
* **epoch:** 0.90
* **grad_norm:** 0.2568
* **learning_rate:** 0.0002
* **loss:** 0.0791
* **step:** 900
## Technical Specifications
### Model Architecture and Objective
LoRA-tuned vision-language model based on the Llama 3.2 architecture; the adapter configuration can be inspected as sketched below.
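The exact LoRA settings are not listed in this card, but they can be read from the published adapter configuration. A sketch assuming the adapter is a standard LoRA config:

```python
from peft import PeftConfig

cfg = PeftConfig.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
# For a LoRA adapter these attributes hold the rank, scaling, and targets
print(cfg.r, cfg.lora_alpha, cfg.target_modules)
```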
### Compute Infrastructure
* **Hardware Type:** GPU
* **Number of GPUs:** 1
### Software
* **Framework versions:**
* PEFT 0.13.2
* PyTorch 2.5.0+cu121
## Model Card Contact
For questions about this model, please open a discussion on the model's Hugging Face repository.