Wrong coordinates returned

#3
by rdhoundiyal - opened

Hi, I am using this model and it's great, but I can't understand one thing: in your example you mention the 14-July coordinates as (352, 348), but if I manually check the 14-July coordinates, they are ~(418, 421). I have marked (352, 348) with a black mark. Please check.
holo-cord.jpg

H company org • edited 3 days ago

Hi @rdhoundiyal , glad you're experimenting with Holo1!

Regarding the coordinate mismatch: the Hugging Face multimodal processor resizes the image under the hood. To get matching coordinates, you also need to resize the original image.

The README has sample code to do so:

Let me know if it works :)

import requests
from PIL import Image
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# Prepare image and instruction
image_url = "https://huggingface.co/Hcompany/Holo1-3B/resolve/main/calendar_example.jpg" 
image = Image.open(requests.get(image_url, stream=True).raw)

# Resize the image so that predicted absolute coordinates match the size of the image.
image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
image = image.resize(size=(resized_width, resized_height), resample=None)  # type: ignore

instruction = "Select July 14th as the check-out date"
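If you instead want the click in the original image's coordinate frame (e.g. the ~(418, 421) you measured), you can scale the prediction back by the ratio between the original and resized sizes. A minimal sketch of the idea, assuming the model output looks like "Click(x, y)" and that you kept the original width/height from before the resize (the helper below is only illustrative, not part of the README):

import re

# Capture the original size before calling image.resize(...), e.g.:
# orig_width, orig_height = image.width, image.height
def to_original_coordinates(coordinates_str: str, orig_size: tuple[int, int], resized_size: tuple[int, int]) -> tuple[int, int]:
    # Parse "Click(x, y)" -> (x, y); the point is predicted in the resized image's frame.
    x, y = map(int, re.findall(r"\d+", coordinates_str)[:2])
    orig_w, orig_h = orig_size
    res_w, res_h = resized_size
    # Scale the point back to the original image's frame.
    return round(x * orig_w / res_w), round(y * orig_h / res_h)

# e.g. to_original_coordinates("Click(352, 348)", (orig_width, orig_height), (resized_width, resized_height))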
H company org

Interesting! Perhaps we could leverage the post-processing functions to translate back to the original coordinate system. Would that entail the creation of a custom processor class, @RaushanTurganbay @yonigozlan? cc @merve

Hi, I am adding my app.py; I am getting wrong coordinates: Click(350, 352). What can I do to correct it?
Thanks

import json
import os
from typing import Any, Literal
import requests
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

# default: Load the model on the available device(s)
# We recommend enabling flash_attention_2 for better acceleration and memory saving.
model = AutoModelForImageTextToText.from_pretrained(
    "Hcompany/Holo1-3B",
    torch_dtype="auto",
    # torch_dtype=torch.bfloat16,
    # attn_implementation="flash_attention_2",
    device_map="auto",
)

# default processor
processor = AutoProcessor.from_pretrained("Hcompany/Holo1-3B")

# The default range for the number of visual tokens per image in the model is 4-1280.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# processor = AutoProcessor.from_pretrained(model_dir, min_pixels=min_pixels, max_pixels=max_pixels)

# Helper function to run inference
def run_inference(messages: list[dict[str, Any]]) -> str:
    # Preparation for inference
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[text],
        images=image,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    generated_ids = model.generate(**inputs, max_new_tokens=128)
    generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)

# Prepare image and instruction
image_url = "https://huggingface.co/Hcompany/Holo1-3B/resolve/main/calendar_example.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Resize the image so that predicted absolute coordinates match the size of the image.
image_processor = processor.image_processor
resized_height, resized_width = smart_resize(
    image.height,
    image.width,
    factor=image_processor.patch_size * image_processor.merge_size,
    min_pixels=image_processor.min_pixels,
    max_pixels=image_processor.max_pixels,
)
image = image.resize(size=(resized_width, resized_height), resample=None)  # type: ignore

instruction = "Select July 14th as the check-out date"

def get_localization_prompt(image, instruction: str) -> list[dict[str, Any]]:
    guidelines: str = "Localize an element on the GUI image according to my instructions and output a click position as Click(x, y) with x num pixels from the left edge and y num pixels from the top edge."

    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image,
                },
                {"type": "text", "text": f"{guidelines}\n{instruction}"},
            ],
        }
    ]

messages = get_localization_prompt(image, instruction)
coordinates_str = run_inference(messages)[0]
print(coordinates_str)

# Expected Click(352, 348)

@pcuenq Yes, we could do something like processor.post_process which would take the point coordinates and scale them to the correct size. For example, in OwlViT we have a similar helper: https://github.com/huggingface/transformers/blob/ff3fad61e32af207cf83b687e6a038e4dd331234/src/transformers/models/owlvit/processing_owlvit.py#L228-L237
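As a rough illustration only (the class and method names below are hypothetical, and subclassing Qwen2_5_VLProcessor is just an assumption based on the model following the Qwen2.5-VL architecture), such a helper could simply rescale the predicted point from the smart_resize'd frame back to the original image:

from transformers import Qwen2_5_VLProcessor

class Holo1Processor(Qwen2_5_VLProcessor):
    # Hypothetical helper in the spirit of OwlViT's post-processing methods.
    def post_process_click(self, x: int, y: int, original_size: tuple[int, int], resized_size: tuple[int, int]) -> tuple[int, int]:
        orig_w, orig_h = original_size
        res_w, res_h = resized_size
        # Rescale a point predicted on the resized image back to the original image's frame.
        return round(x * orig_w / res_w), round(y * orig_h / res_h)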

H company org

@RaushanTurganbay Yes! :) My question was more about how to do it, given that this model is following Qwen2_5_VLForConditionalGeneration. I see we have some "models" with just a processor change (example), so I guess that'd be the way to go here as well.
