Is the `get_vision_embedding` method published yet?

#19
by ronima - opened

GitHub suggests that the method exists, here:
https://github.com/OpenBMB/MiniCPM-V/blob/main/omnilmm/model/omnilmm.py#L108

But it is missing from the HF repo code here:
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/main/modeling_minicpmv.py#L64

Is there a way to get vision embeddings for an image using the model?
Thanks.

@Cuiunbo
What should the tgt_sizes be then?
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/b02e4d7872bafd5a376e604ee069c2342f11062d/modeling_minicpmv.py#L94

The updated version on GitHub has no such variable:
https://github.com/OpenBMB/MiniCPM-V/blob/main/omnilmm/model/resampler.py#L149

Thanks!

  • The output of line 93 (`vision_embedding = self.vpm(all_pixel_values.type(dtype), patch_attention_mask=patch_attn_mask).last_hidden_state`) has shape (1, 256, 1152) for an input image of shape (1, 3, 224, 224), whereas I need a single embedding vector per image. Is it possible to extract that?
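For reference, the 256 in that shape is consistent with one token per image patch. The sketch below only does the arithmetic; the patch size of 14 is an assumption (it is not stated in this thread), and `patch_grid` is a hypothetical helper, not part of the model code. If `tgt_sizes` holds the per-image patch grid as (rows, cols), which is also an assumption about the linked code, this image would correspond to (16, 16).

```python
def patch_grid(height, width, patch_size=14):
    """Return (rows, cols) of non-overlapping patches for an image.

    patch_size=14 is an assumed value, chosen because it reproduces
    the 256 tokens reported for a 224x224 input.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return height // patch_size, width // patch_size


rows, cols = patch_grid(224, 224)
num_patches = rows * cols
print(rows, cols, num_patches)  # 16 16 256
```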
OpenBMB org

Hi, the two models you are looking at do not use the same framework: one is OmniLMM, the other is MiniCPM-V.

@Cuiunbo Got it, thanks.
In that case, what is the proper way to reduce the (1, 256, 1152) output to a tensor of shape (1, X)?
I mean, how should one use `tgt_sizes` and the resampler to achieve that?

The image is sliced before the forward pass, and `tgt_sizes` is used to adjust the resampler's embedding. We didn't add image processing to the forward pass, but if you want to train the model, this is a viable way to do it; see our GitHub repo for the fine-tuning code as well.
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/b02e4d7872bafd5a376e604ee069c2342f11062d/modeling_minicpmv.py#L407
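
On the narrower question of turning a (1, 256, 1152) tensor into one vector per image, a common generic option is mean pooling over the patch axis, optionally ignoring padded patches. This is a minimal sketch and not a method confirmed in this thread: `pool_patch_embeddings` is a hypothetical helper, the random tensor only stands in for `vision_embedding`, and the mask is assumed to have shape (batch, num_patches). Whether the pooled vector is a useful image representation for this model has not been checked here.

```python
import torch


def pool_patch_embeddings(hidden, mask=None):
    """Average patch embeddings into one vector per image.

    hidden: float tensor of shape (batch, num_patches, dim)
    mask:   optional bool tensor of shape (batch, num_patches),
            True for real patches and False for padding (assumed layout)
    """
    if mask is None:
        return hidden.mean(dim=1)
    weights = mask.unsqueeze(-1).to(hidden.dtype)   # (batch, num_patches, 1)
    summed = (hidden * weights).sum(dim=1)          # (batch, dim)
    counts = weights.sum(dim=1).clamp(min=1)        # avoid division by zero
    return summed / counts


# Stand-in for the vision tower output discussed above.
vision_embedding = torch.randn(1, 256, 1152)
pooled = pool_patch_embeddings(vision_embedding)
print(pooled.shape)  # torch.Size([1, 1152])
```

The result has shape (1, 1152), so X equals the hidden size of the vision tower. Passing the output through the resampler instead would yield a fixed number of query tokens rather than a single vector, so some pooling step would still be needed afterwards.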

Cuiunbo changed discussion status to closed
