Microsoft Phi-Ground-4B-7C

🤖 HomePage | 📄 Paper | 📄 Arxiv | 😊 Model | 😊 Eval data

[Overview figure]

Phi-Ground-4B-7C is a member of the Phi-Ground model family, finetuned from microsoft/Phi-3.5-vision-instruct with a fixed input resolution of 1008x672. The Phi-Ground model family achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results, with scores of 43.2 on ScreenSpot-Pro and 27.2 on UI-Vision. We believe that the various details discussed in the tech report, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks.

Main results

[Main results figure]

Usage

The current transformers version can be verified with: pip list | grep transformers.

Examples of required packages:

flash_attn==2.5.8
numpy==1.24.4
Pillow==10.3.0
Requests==2.31.0
torch==2.3.0
torchvision==0.18.0
transformers==4.43.0
accelerate==0.30.0
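If any of these are missing, they can be installed with a pinned pip command such as the one below (a suggested command, not an official install script; adjust the torch and flash_attn pins to match your CUDA setup):

pip install flash_attn==2.5.8 numpy==1.24.4 Pillow==10.3.0 Requests==2.31.0 torch==2.3.0 torchvision==0.18.0 transformers==4.43.0 accelerate==0.30.0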

Input Formats

The model requires a strict input format, including a fixed image resolution, instruction-first ordering, and the system prompt.

Input preprocessing

from PIL import Image

def process_image(img):
    # The model expects a fixed 1008x672 canvas (3x2 tiles of 336 pixels).
    target_width, target_height = 336 * 3, 336 * 2

    img_ratio = img.width / img.height
    target_ratio = target_width / target_height

    # Scale the screenshot to fit inside the canvas while preserving aspect ratio.
    if img_ratio > target_ratio:
        new_width = target_width
        new_height = int(new_width / img_ratio)
    else:
        new_height = target_height
        new_width = int(new_height * img_ratio)
    # reshape_ratio maps coordinates on the resized image back to the original image.
    reshape_ratio = new_width / img.width

    img = img.resize((new_width, new_height), Image.LANCZOS)
    # Paste the resized image onto a white canvas, anchored at the top-left corner.
    new_img = Image.new("RGB", (target_width, target_height), (255, 255, 255))
    new_img.paste(img, (0, 0))
    return new_img
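Because process_image scales the screenshot by reshape_ratio and pads it at the bottom/right of the 1008x672 canvas, a predicted box can be mapped back to pixels in the original screenshot. The helper below is a minimal sketch, assuming the model's relative coordinates (multiplied by 1000, as requested in the prompt below) refer to the padded canvas; map_box_to_original is a hypothetical name, not part of the released code, and it ignores the integer rounding done during resizing.

def map_box_to_original(box_1000, original_img):
    # box_1000: (x1, y1, x2, y2) in relative coordinates multiplied by 1000,
    # assumed to refer to the padded 1008x672 canvas produced by process_image.
    target_width, target_height = 336 * 3, 336 * 2
    # Recompute the same scale factor used in process_image.
    img_ratio = original_img.width / original_img.height
    target_ratio = target_width / target_height
    if img_ratio > target_ratio:
        reshape_ratio = target_width / original_img.width
    else:
        reshape_ratio = target_height / original_img.height
    x1, y1, x2, y2 = box_1000
    # Per-mille canvas coordinates -> canvas pixels -> original-image pixels.
    to_orig_x = lambda x: x / 1000 * target_width / reshape_ratio
    to_orig_y = lambda y: y / 1000 * target_height / reshape_ratio
    return (to_orig_x(x1), to_orig_y(y1), to_orig_x(x2), to_orig_y(y2))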

instruction = "<your instruction>"
prompt = """<|user|>
The description of the element: 
{RE}

Locate the above described element in the image. The output should be bounding box using relative coordinates multiplying 1000.
<|image_1|>
<|end|>
<|assistant|>""".format(RE=instriuction)

image_path = "<your image path>"
image = process_image(Image.open(image_path))

You can then run inference with the Hugging Face transformers model or with vLLM. We also provide end-to-end examples and resources for reproducing the benchmark results.
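As a rough sketch of the transformers path (not the official example; the checkpoint id, attention implementation, and generation settings here are assumptions based on the usual Phi-3.5-vision-instruct usage pattern):

from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-Ground"  # assumed checkpoint id; replace with the one you downloaded
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",  # requires flash_attn; use "eager" otherwise
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# `prompt` and `image` come from the snippets above.
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
generate_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    eos_token_id=processor.tokenizer.eos_token_id,
)
# Strip the prompt tokens and decode only the newly generated text,
# which should contain the predicted bounding box.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)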
