Is the `get_vision_embedding` method published yet?
GitHub suggests that the method exists, here:
https://github.com/OpenBMB/MiniCPM-V/blob/main/omnilmm/model/omnilmm.py#L108
But, it is missing from the HF repo code, here:
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/main/modeling_minicpmv.py#L64
Is there a way to get vision embeddings for an image using the model?
Thanks.
Thank you for your interest. Yes, it is possible; see the implementation here:
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/b02e4d7872bafd5a376e604ee069c2342f11062d/modeling_minicpmv.py#L93
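For example, something along these lines (a rough, untested sketch rather than an official API: it assumes the checkpoint's vpm attribute is the SigLIP vision tower and that pixel_values has already gone through the repo's own image preprocessing, including slicing and per-patch reshaping):

```python
import torch
from transformers import AutoModel

# Load the HF checkpoint together with its custom modeling code.
model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-Llama3-V-2_5",
    trust_remote_code=True,
    torch_dtype=torch.float16,
).eval()

# ASSUMPTION: `pixel_values` was produced by the repo's own image
# preprocessing, so its layout matches what the vision tower expects.
# The linked line additionally passes a `patch_attention_mask` built
# from `tgt_sizes`; without it, all patches are attended to.
with torch.no_grad():
    vision_embedding = model.vpm(pixel_values.type(model.dtype)).last_hidden_state
```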
@Cuiunbo
What should tgt_sizes be then?
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/b02e4d7872bafd5a376e604ee069c2342f11062d/modeling_minicpmv.py#L94
The updated resampler on GitHub has no such variable:
https://github.com/OpenBMB/MiniCPM-V/blob/main/omnilmm/model/resampler.py#L149
Thanks!
- The output of line 93 (
vision_embedding = self.vpm(all_pixel_values.type(dtype), patch_attention_mask=patch_attn_mask).last_hidden_state
) is a tensor of shape (1, 256, 1152) for an input image of shape (1, 3, 224, 224), whereas I need a single embedding vector. Is it possible to extract one?
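Would pooling over the 256 visual tokens be a reasonable way to get one vector? For example (a generic pooling sketch, not something from the repo):

```python
import torch

# vision_embedding has shape (1, 256, 1152): 256 visual tokens of width 1152.
# Mean-pool over the token dimension to get a single vector per image.
pooled = vision_embedding.mean(dim=1)  # (1, 1152)
image_vector = pooled.squeeze(0)       # (1152,)
```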
Hi, the two files you are looking at belong to different frameworks: one is OmniLMM and the other is MiniCPM-V.
The image is sliced before the forward pass, and tgt_sizes is used to adapt the resampler's positional embedding to each slice's patch grid. We didn't add image processing to forward, but if you want to train the model, this is a viable approach; see the fine-tuning code in our GitHub repo as well.
https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5/blob/b02e4d7872bafd5a376e604ee069c2342f11062d/modeling_minicpmv.py#L407
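For reference, tgt_sizes is essentially the patch grid of each slice. A minimal sketch (assuming the SigLIP tower's patch size of 14, consistent with the 256 tokens you saw for a 224x224 input; slices is a hypothetical list of preprocessed slice tensors of shape (3, H, W)):

```python
import torch

PATCH_SIZE = 14  # ASSUMPTION: patch size of the vision tower; check the model config.

def tgt_size_for_slice(slice_tensor: torch.Tensor) -> torch.Tensor:
    """Patch-grid size (H_patches, W_patches) for one slice of shape (3, H, W)."""
    _, h, w = slice_tensor.shape
    return torch.tensor([h // PATCH_SIZE, w // PATCH_SIZE], dtype=torch.int32)

# `slices` is hypothetical here: one preprocessed image slice per entry.
tgt_sizes = torch.stack([tgt_size_for_slice(s) for s in slices])  # (num_slices, 2)
```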