Microsoft Phi-Ground-4B-7C
HomePage | Paper | arXiv | Model | Eval data
Phi-Ground-4B-7C is a member of the Phi-Ground model family, finetuned from microsoft/Phi-3.5-vision-instruct with a fixed input resolution of 1008x672. The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, scoring 43.2 on ScreenSpot-Pro and 27.2 on UI-Vision. We believe the details discussed in the tech report, along with our successes and failures, not only clarify how grounding models are built but can also benefit other perception tasks.
Main results
Usage
The installed transformers version can be verified with: pip list | grep transformers
Examples of required packages:
flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
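Versions can also be checked programmatically. A minimal sketch using the standard-library importlib.metadata (a convenience snippet, not part of the official setup instructions; exact distribution names may differ by install):

from importlib.metadata import version, PackageNotFoundError

# Print the installed version of each required package.
for pkg in ["transformers", "torch", "torchvision", "flash_attn",
            "Pillow", "Requests", "numpy", "accelerate"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")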
Input Formats
The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and a system prompt.
Input preprocessing
from PIL import Image

def process_image(img):
    # Fixed model input resolution: 1008 x 672 (336*3 x 336*2).
    target_width, target_height = 336 * 3, 336 * 2
    img_ratio = img.width / img.height
    target_ratio = target_width / target_height
    if img_ratio > target_ratio:
        # Image is wider than the canvas: fit to width.
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        # Image is taller than the canvas: fit to height.
        new_height = target_height
        new_width = int(new_height * img_ratio)
    # Scale factor between the original and resized image; useful for
    # mapping predicted coordinates back to the original screenshot.
    reshape_ratio = new_width / img.width
    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Letterbox: paste the resized image onto a white canvas, anchored top-left.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    new_img.paste(img, (0, 0))
    return new_img
instruction = "<your instruction>"
prompt = """<|user|>
The description of the element:
{RE}
Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
<|image_1|>
<|end|>
<|assistant|>""".format(RE=instruction)
image_path = "<your image path>"
image = process_image(Image.open(image_path))
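Since the prompt asks for a bounding box in relative coordinates multiplied by 1000, and process_image pastes the resized screenshot at the top-left of the canvas, a predicted box can be mapped back to pixel coordinates in the original screenshot. A minimal sketch, assuming the model returns four values (x1, y1, x2, y2) in that convention (verify against the model's actual outputs):

def box_to_original_pixels(box_1000, orig_width, orig_height,
                           target_width=336 * 3, target_height=336 * 2):
    # Same scale factor that process_image applies before padding.
    scale = min(target_width / orig_width, target_height / orig_height)
    x1, y1, x2, y2 = box_1000
    # Undo the x1000 relative encoding (canvas pixels), then the resize.
    return tuple(v / 1000 * size / scale
                 for v, size in ((x1, target_width), (y1, target_height),
                                 (x2, target_width), (y2, target_height)))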
Then you can run inference with the Hugging Face transformers model or vLLM. We also provide end-to-end examples and scripts for reproducing the benchmark results.
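For reference, here is a minimal transformers sketch following the base model's (microsoft/Phi-3.5-vision-instruct) documented usage; the repository id, device placement, and generation settings are assumptions to adapt as needed:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-Ground-4B-7C"  # hypothetical repo id; use the actual one
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2",
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `prompt` and `image` come from the preprocessing snippet above.
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=128,
                              eos_token_id=processor.tokenizer.eos_token_id)
# Strip the prompt tokens before decoding.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)

The decoded response can then be parsed and converted back to original-image pixels with box_to_original_pixels above.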