BLIP2 for retrieval
Is there a way to use the Hugging Face model for cross-modal retrieval tasks?
There's an effort to add it: https://github.com/huggingface/transformers/pull/29261
but it seems there is no suitable checkpoint on the Hugging Face Hub for Blip2ForImageTextRetrieval; the existing models cannot be loaded correctly for retrieval tasks.
The PR above has been merged, so the Blip2ForImageTextRetrieval class is now available. There are two checkpoints available:
- https://huggingface.co/Salesforce/blip2-itm-vit-g
- https://huggingface.co/Salesforce/blip2-itm-vit-g-coco
As far as I know, blip2-itm-vit-g does not work well. Here are the logs:
Some weights of the model checkpoint at ../Salesforce/blip2-itm-vit-g were not used when initializing Blip2ForImageTextRetrieval: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'qformer.embeddings.LayerNorm.bias', 'qformer.embeddings.LayerNorm.weight', 'qformer.embeddings.position_embeddings.weight', 'qformer.embeddings.word_embeddings.weight', 'temp', 'text_proj.bias', 'text_proj.weight', 'vision_proj.bias', 'vision_proj.weight']
- This IS expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Blip2ForImageTextRetrieval from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Blip2ForImageTextRetrieval were not initialized from the model checkpoint at /home/pyr/pretrained_models/Salesforce/blip2-itm-vit-g and are newly initialized: ['embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'qformer.layernorm.bias', 'qformer.layernorm.weight', 'text_projection.bias', 'text_projection.weight', 'vision_projection.bias', 'vision_projection.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Hi,
It looks like you may need to update your Transformers version; the following code snippet works for me:
import torch
from PIL import Image
import requests
from transformers import AutoProcessor, Blip2ForImageTextRetrieval
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float16)
processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
model.to(device)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats laying on a pink blanket"
inputs = processor(images=image, text=text, return_tensors="pt").to(device, torch.float16)
itm_out = model(**inputs, use_image_text_matching_head=True)
logits_per_image = torch.nn.functional.softmax(itm_out.logits_per_image, dim=1)
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
print("Probs:", probs)
which prints:
Expanding inputs for image tokens in BLIP-2 should be done in processing. Please follow instruction here (https://gist.github.com/zucchini-nlp/e9f20b054fa322f84ac9311d9ab67042) to update your BLIP-2 model. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
Probs: tensor([[0.2693, 0.7305]], dtype=torch.float16, grad_fn=<SoftmaxBackward0>)
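For actual retrieval, i.e. scoring an image against several candidate texts, you can also use the contrastive (ITC) head by passing use_image_text_matching_head=False. Below is a minimal sketch continuing from the snippet above; treat the exact semantics of logits_per_image in this mode (one similarity score per image-text pair) as an assumption to verify against the Blip2ForImageTextRetrieval docs of your Transformers version:
texts = ["two cats laying on a pink blanket", "a photo of a dog"]
inputs = processor(images=image, text=texts, return_tensors="pt", padding=True).to(device, torch.float16)
itc_out = model(**inputs, use_image_text_matching_head=False)  # contrastive (ITC) head instead of ITM
logits_per_image = itc_out.logits_per_image  # assumed: one similarity score per candidate text
probs = logits_per_image.softmax(dim=1)  # relative probabilities over the candidate texts
print("ITC probs:", probs)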
Thank you for your reply; it was very helpful to me. It indeed seems to be a version issue.
Furthermore, may I ask another question? In the examples of the LAVIS library, the extracted image features and text features each consist of multiple low-dimensional vectors.
However, I noticed that using
image_emb = model.extract_features(sample, mode="image").image_embeds[:,0,:] # shape (batch, 768)
text_emb = model.extract_features(sample, mode="text").text_embeds[:,0,:] # shape (batch, 768)
seems to also work.
What is the difference between these two methods? Or is there detailed documentation available somewhere? Thank you.
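To make the comparison concrete, here is a sketch of the two kinds of features LAVIS exposes for BLIP-2. The attribute names (image_embeds vs. image_embeds_proj, text_embeds vs. text_embeds_proj), the shapes in the comments, and the max-over-query-tokens similarity follow the LAVIS feature-extraction example, so treat them as assumptions to verify against your LAVIS version; the image path is a placeholder.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("cat.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("two cats laying on a pink blanket")
sample = {"image": image, "text_input": [text]}

image_feats = model.extract_features(sample, mode="image")
text_feats = model.extract_features(sample, mode="text")

# (a) slicing the raw Q-Former outputs, as in the question: 768-d hidden states
image_emb = image_feats.image_embeds[:, 0, :]  # (1, 768), first of the 32 query outputs
text_emb = text_feats.text_embeds[:, 0, :]     # (1, 768), [CLS]-position text output

# (b) the projected, normalized 256-d embeddings used in the LAVIS retrieval example
image_proj = image_feats.image_embeds_proj        # (1, 32, 256), one vector per query token
text_proj = text_feats.text_embeds_proj[:, 0, :]  # (1, 256)

# LAVIS-style image-text similarity: take the max over the 32 query embeddings
sim = (image_proj @ text_proj.t()).squeeze(-1).max(dim=-1).values
print("ITC similarity:", sim.item())
The main difference is that (a) keeps only the first of the Q-Former's 32 output vectors in the unprojected 768-d space, while (b) keeps all 32 vectors after the learned projection and scores the text against the closest one, which is how the LAVIS example computes image-text similarity.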