reach-vb (HF Staff) committed
Commit 0ee50e5 · verified · 1 Parent(s): 0adf4cb

Update README.md

Files changed (1):
  1. README.md +55 -1
README.md CHANGED
@@ -3,6 +3,8 @@ license: apple-amlr
  license_name: apple-ascl
  license_link: https://github.com/apple/ml-fastvlm/blob/main/LICENSE_MODEL
  library_name: ml-fastvlm
+ tags:
+ - transformers
  ---
  # FastVLM: Efficient Vision Encoding for Vision Language Models

@@ -50,6 +52,58 @@ python predict.py --model-path /path/to/checkpoint-dir \
  --image-file /path/to/image.png \
  --prompt "Describe the image."
  ```
+ ### Run inference with Transformers (Remote Code)
+ To run inference with Transformers, we can leverage `trust_remote_code` together with the following snippet:
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ MID = "apple/FastVLM-7B"
+ IMAGE_TOKEN_INDEX = -200  # placeholder id the model's remote code looks for
+
+ # Load the tokenizer and model (trust_remote_code pulls in the FastVLM classes)
+ tok = AutoTokenizer.from_pretrained(MID, trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained(
+     MID,
+     torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+
+ # Build the chat, then render it to a string (not tokens) so we can place <image> exactly
+ messages = [
+     {"role": "user", "content": "<image>\nDescribe this image in detail."}
+ ]
+ rendered = tok.apply_chat_template(
+     messages, add_generation_prompt=True, tokenize=False
+ )
+ pre, post = rendered.split("<image>", 1)
+
+ # Tokenize the text *around* the image token (no extra special tokens)
+ pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
+ post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids
+
+ # Splice in the IMAGE token id (-200) at the placeholder position
+ img_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
+ input_ids = torch.cat([pre_ids, img_tok, post_ids], dim=1).to(model.device)
+ attention_mask = torch.ones_like(input_ids, device=model.device)
+
+ # Preprocess the image with the model's own image processor
+ img = Image.open("test-2.jpg").convert("RGB")
+ px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
+ px = px.to(model.device, dtype=model.dtype)
+
+ # Generate
+ with torch.no_grad():
+     out = model.generate(
+         inputs=input_ids,
+         attention_mask=attention_mask,
+         images=px,
+         max_new_tokens=128,
+     )
+ print(tok.decode(out[0], skip_special_tokens=True))
+ ```


  ## Citation
@@ -62,4 +116,4 @@ If you found this model useful, please cite the following paper:
  month = {June},
  year = {2025},
  }
- ```
+ ```
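
A quick usage sketch for the snippet added above (not part of the commit): once `tok`, `model`, `input_ids`, `attention_mask`, and `px` are built as shown, the response can be streamed token by token with transformers' `TextStreamer` instead of decoded at the end. That `generate()` forwards `streamer=` is an assumption here; it holds for LLaVA-style remote-code wrappers that pass extra kwargs through to the base HF `generate`.

```python
import torch
from transformers import TextStreamer

# Stream decoded text to stdout as tokens are produced, skipping the prompt.
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        inputs=input_ids,            # spliced ids from the snippet above
        attention_mask=attention_mask,
        images=px,                   # preprocessed pixel values
        max_new_tokens=128,
        streamer=streamer,           # assumption: remote code forwards this to HF generate()
    )
```

`TextStreamer` writes straight to stdout; for programmatic consumption, transformers' `TextIteratorStreamer` with a background generation thread is the usual alternative.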