---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: peft
datasets:
- HuggingFaceM4/WebSight
---

# Model Card for Llama-3.2-11B-Vision-WebSight

Llama 3.2 11B Vision Instruct fine-tuned on 10k samples from [HuggingFaceM4/WebSight](https://huggingface.co/datasets/HuggingFaceM4/WebSight), a dataset of website screenshots paired with the HTML/CSS code that renders them.

## Model Details

### Model Description

* **Developed by:** pdufour
* **Model type:** Vision Language Model
* **Language(s) (NLP):** English
* **License:** MIT
* **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

## How to Get Started with the Model

```python
from transformers import AutoModelForVision2Seq, AutoProcessor, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
import torch

# Load the base model in 4-bit and attach the LoRA adapter.
base_model = AutoModelForVision2Seq.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
model = PeftModel.from_pretrained(base_model, "pdufour/Llama-3.2-11B-Vision-WebSight")
tokenizer = AutoTokenizer.from_pretrained("pdufour/Llama-3.2-11B-Vision-WebSight")
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

# The <|image|> token marks where the screenshot is attached to the prompt.
inputs = processor(
    text="Generate code for a web page that looks exactly like this. <|image|>",
    images=Image.open("fashion.jpg"),
    return_tensors="pt",
).to(model.device)

# Pass every processor output (input_ids, pixel_values, masks) to generate;
# passing input_ids alone would drop the image from the forward pass.
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.7, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
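
The decoded string includes the prompt as well as the generated continuation. A minimal post-processing sketch (the `index.html` output path is illustrative, not part of the model card):

```python
# Keep only the newly generated tokens; generate() returns prompt + continuation.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
html = tokenizer.decode(new_tokens, skip_special_tokens=True)

# Write the generated markup to disk so it can be opened in a browser.
with open("index.html", "w") as f:
    f.write(html)
```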

## Training Details

### Training Data

Fine-tuned on 10,000 samples from [HuggingFaceM4/WebSight](https://huggingface.co/datasets/HuggingFaceM4/WebSight), a synthetic vision-language dataset that pairs website screenshots with the HTML/CSS code used to render them, formatted for instruction tuning.
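
A minimal sketch of pulling a 10k-sample subset of WebSight with `datasets`; the `image` and `text` column names follow the dataset card, so verify them against the version you download:

```python
from datasets import load_dataset

# Stream WebSight and take a 10k-sample subset, matching the size used here.
dataset = load_dataset("HuggingFaceM4/WebSight", split="train", streaming=True)
subset = dataset.take(10_000)

sample = next(iter(subset))
screenshot = sample["image"]  # PIL image of the rendered page
html_code = sample["text"]    # HTML/CSS source that produced the screenshot
```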

### Training Procedure

#### Training Hyperparameters

* **Training regime:** Fine-tuning with LoRA
* **Learning rate:** 0.0002
* **Batch size:** 10
* **Gradient accumulation steps:** 1
* **Number of epochs:** 3.0
* **Optimizer:** adamw_torch_fused
* **LR scheduler type:** constant
* **Weight decay:** 0.0
* **FP16 training:** False
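
For illustration, the hyperparameters above map onto a PEFT + `Trainer` setup roughly like the sketch below. The LoRA rank, alpha, dropout, and target modules are assumptions for the example; the adapter's `adapter_config.json` is the authoritative source.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Illustrative LoRA settings: r, lora_alpha, lora_dropout, and target_modules
# are assumed values, not ones reported by this model card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Hyperparameters as listed above.
training_args = TrainingArguments(
    output_dir="llama-3.2-11b-vision-websight",
    learning_rate=2e-4,
    per_device_train_batch_size=10,
    gradient_accumulation_steps=1,
    num_train_epochs=3.0,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    weight_decay=0.0,
    fp16=False,
)
```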

### Speeds, Sizes, Times

* **Training duration:** Not recorded
* **Trainable parameters:** Not recorded
* **Adapter size:** 0.08 GB (LoRA weights only)
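
The trainable-parameter count was not logged, but it can be recovered from the adapter itself. A rough sketch (this loads the full base model, so it needs the same hardware as inference):

```python
from peft import PeftModel
from transformers import AutoModelForVision2Seq

# Attach the adapter to the base model and count the LoRA parameters it adds.
base = AutoModelForVision2Seq.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct", device_map="auto")
model = PeftModel.from_pretrained(base, "pdufour/Llama-3.2-11B-Vision-WebSight")

lora_params = sum(p.numel() for name, p in model.named_parameters() if "lora_" in name)
print(f"LoRA parameters: {lora_params:,}")
```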

## Evaluation

### Metrics

#### Results

Metrics logged during training (no held-out evaluation set is reported):

* **epoch:** 0.9
* **grad_norm:** 0.2568
* **learning_rate:** 0.0002
* **loss:** 0.0791
* **step:** 900

## Technical Specifications

### Model Architecture and Objective

A LoRA-tuned vision-language model based on the Llama 3.2 Vision architecture, fine-tuned to generate HTML/CSS code from webpage screenshots.
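
Since only a LoRA adapter is distributed, the weights can optionally be folded back into the base model for deployment. A minimal sketch, assuming the base model is loaded unquantized; the output path is illustrative:

```python
from peft import PeftModel
from transformers import AutoModelForVision2Seq
import torch

# Load an unquantized base model, attach the adapter, and fold the LoRA
# deltas into the base weights for adapter-free deployment.
base = AutoModelForVision2Seq.from_pretrained(
    "meta-llama/Llama-3.2-11B-Vision-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "pdufour/Llama-3.2-11B-Vision-WebSight")
merged = model.merge_and_unload()
merged.save_pretrained("llama-3.2-11b-vision-websight-merged")
```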

### Compute Infrastructure

* **Hardware type:** GPU
* **Number of GPUs:** 1

### Software

* **Framework versions:**
  * PEFT 0.13.2
  * PyTorch 2.5.0+cu121

## Model Card Contact

For questions about this model, please file an issue on the GitHub repository.