Is it possible to only input text in LLaVa model?
#38, opened by Tizzzzy
Hi,
Currently I can successfully do image question answering with the LLaVa model using the following code:
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf", device_map="auto")

def llava_describe(image):
    question = "<image> Describe this image in as much detail as possible."
    # Process the image and the prompt together.
    inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=200)
    # Decode the first sequence, dropping its first two token ids.
    answer = processor.decode(generated_ids[0][2:], skip_special_tokens=True)
    return answer
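For reference, I call it roughly like this (a minimal sketch; "example.jpg" is just a placeholder path, not my real file):

from PIL import Image

image = Image.open("example.jpg")  # placeholder path
print(llava_describe(image))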
I also want to input only text into the model. However, my code doesn't work:
def llava_describe(image):
    question = "..."
    inputs = processor(images=None, text=question, return_tensors="pt").to(model.device)
    generated_ids = model.generate(**inputs, max_new_tokens=200)
    answer = processor.decode(generated_ids[0][2:], skip_special_tokens=True)
I keep getting this error:
Traceback (most recent call last):
File "/workspace/llava/model.py", line 138, in <module>
generated_text = llava_describe(image)
File "/workspace/llava/model.py", line 48, in llava_describe
generated_ids = model.generate(**inputs, max_new_tokens=200)
File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 2215, in generate
result = self._sample(
File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/generation/utils.py", line 3206, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 487, in forward
inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 303, in _merge_input_ids_with_image_features
num_images, num_image_patches, embed_dim = image_features.shape
AttributeError: 'NoneType' object has no attribute 'shape'
Note that this task is important to me, and I would really like LLaVa to support text-only input as well.
Thank you for your help!
Hey @Tizzzzy !
Currently, Llava models do not support text-only input. I have been changing a lot of things in the llava models lately and will bring back text-only inference soon. It was removed accidentally and shouldn't have been.
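In the meantime, one possible workaround is to bypass the vision path entirely and generate with the wrapped language model. This is only a sketch, not an official API: it assumes the `language_model` submodule and `processor.tokenizer` that the current Llava implementation in transformers exposes.

# Workaround sketch (not an official API): tokenize the text prompt yourself
# and call generate on the inner causal LM, so the image-merging step is never hit.
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = AutoModelForImageTextToText.from_pretrained("llava-hf/llava-1.5-7b-hf", device_map="auto")

def llava_describe_text_only(question):
    # Text-only prompt: no <image> placeholder, no pixel_values.
    inputs = processor.tokenizer(question, return_tensors="pt").to(model.device)
    # Generate with the wrapped language model instead of the multimodal wrapper.
    generated_ids = model.language_model.generate(**inputs, max_new_tokens=200)
    return processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True)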