Wrong coordinates returned

#3
by rdhoundiyal - opened

Hi, I am using this model and it's great, but I can't understand one thing: in your example you mention the 14-July coordinates as (352, 348), but if I manually check the 14-July coordinates, they are ~(418, 421). I have marked (352, 348) with a black mark. Please check.
holo-cord.jpg

H company org • edited 3 days ago

Hi @rdhoundiyal , glad you're experimenting with Holo1!

Regarding the coordinate mismatch: the Hugging Face multimodal processor resizes the image under the hood. To get matching coordinates, you also need to resize the original image.

The README has sample code to do so:

Let me know if it works :)

import requests
from PIL import Image
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# Prepare image and instruction
image_url = "https://huggingface.co/Hcompany/Holo1-3B/resolve/main/calendar_example.jpg" 
image = Image.open(requests.get(image_url, stream=True).raw)

# Resize the image so that predicted absolute coordinates match the size of the image.
image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
image = image.resize(size=(resized_width, resized_height), resample=None)  # type: ignore

instruction = "Select July 14th as the check-out date"
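If you instead want the click in the original image's coordinate frame (e.g. the ~(418, 421) you measured), you can scale the prediction back by the ratio between the original and resized sizes. A minimal sketch of the idea, assuming the model output looks like "Click(x, y)" and that you kept the original width/height from before the resize (the helper below is only illustrative, not part of the README):

import re

# Capture the original size before calling image.resize(...), e.g.:
# orig_width, orig_height = image.width, image.height
def to_original_coordinates(coordinates_str: str, orig_size: tuple[int, int], resized_size: tuple[int, int]) -> tuple[int, int]:
    # Parse "Click(x, y)" -> (x, y); the point is predicted in the resized image's frame.
    x, y = map(int, re.findall(r"\d+", coordinates_str)[:2])
    orig_w, orig_h = orig_size
    res_w, res_h = resized_size
    # Scale the point back to the original image's frame.
    return round(x * orig_w / res_w), round(y * orig_h / res_h)

# e.g. to_original_coordinates("Click(352, 348)", (orig_width, orig_height), (resized_width, resized_height))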
H company org

Interesting! Perhaps we could leverage the post-processing functions to translate back to the original coordinate system. Would that entail the creation of a custom processor class, @RaushanTurganbay @yonigozlan? cc @merve

Hi, I am adding my app.py; I am getting wrong coordinates: Click(350, 352). What can I do to correct it?
Thanks

import json
import os
from typing import Any, Literal
import requests
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# default: Load the model on the available device(s)
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-3B",
    torch_dtype="auto",
    # torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-3B")

# The default range for the number of visual tokens per image in the model is 4-1280.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

# Helper function to run inference
def run_inference(messages: list[dict[str, Any]]) -> str:
    # Preparation for inference
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[text],
        images=image,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)

# Prepare image and instruction
image_url = "https://huggingface.co/Hcompany/Holo1-3B/resolve/main/calendar_example.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Resize the image so that predicted absolute coordinates match the size of the image.
image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
image = image.resize(size=(resized_width, resized_height), resample=None)  # type: ignore

instruction = "Select July 14th as the check-out date"

def get_localization_prompt(image, instruction: str) -> list[dict[str, Any]]:
    guidelines: str = "Localize an element on the GUI image according to my instructions and output a click position as Click(x, y) with x num pixels from the left edge and y num pixels from the top edge."

    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": f"{guidelines}\n{instruction}"},
            ],
        }
    ]

messages = get_localization_prompt(image, instruction)
coordinates_str = run_inference(messages)[0]
print(coordinates_str)

# Expected Click(352, 348)

@pcuenq Yes, we could do something like processor.post_process which would take the point coordinates and scale them to the correct size. For example, in OwlViT we have a similar helper: https://github.com/huggingface/transformers/blob/ff3fad61e32af207cf83b687e6a038e4dd331234/src/transformers/models/owlvit/processing_owlvit.py#L228-L237
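As a rough illustration only (the class and method names below are hypothetical, and subclassing Qwen2_5_VLProcessor is just an assumption based on the model following the Qwen2.5-VL architecture), such a helper could simply rescale the predicted point from the smart_resize'd frame back to the original image:

from transformers import Qwen2_5_VLProcessor

class Holo1Processor(Qwen2_5_VLProcessor):
    # Hypothetical helper in the spirit of OwlViT's post-processing methods.
    def post_process_click(self, x: int, y: int, original_size: tuple[int, int], resized_size: tuple[int, int]) -> tuple[int, int]:
        orig_w, orig_h = original_size
        res_w, res_h = resized_size
        # Rescale a point predicted on the resized image back to the original image's frame.
        return round(x * orig_w / res_w), round(y * orig_h / res_h)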

H company org

@RaushanTurganbay Yes! :) My question was more about how to do it, given that this model is following Qwen2_5_VLForConditionalGeneration. I see we have some "models" with just a processor change (example), so I guess that'd be the way to go here as well.
