Is CONCHv1_5 the vision encoder used in PathChat MLLM?

#1 opened by lijinda

Hello Mahmood Lab team,
I'm exploring the components of the impressive PathChat model presented in your recent work ("A multimodal generative AI copilot for human pathology"). I've also looked at the related papers on UNI and CONCH.
I recently came across the CONCHv1_5 model card on Hugging Face. Based on the description, it seems like this model might be the specific vision encoder component used to build PathChat.
My understanding from the model card and the papers is:
- CONCHv1_5 starts from the UNI (ViT-L) checkpoint, which was trained via self-supervision on Mass-100K.
- It then undergoes further vision-language pretraining with a CoCa-style recipe ("fine-tuned in a similar fashion as CONCH"); I sketch my reading of that recipe just after this list.
- This second stage uses approximately 1.17 million pathology image-caption pairs.
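By "CoCa-style" I mean, roughly, a combined contrastive and captioning objective as in the original CoCa paper. The sketch below is only my interpretation of that recipe; the function, variable names, and loss weight are mine, not the actual CONCH training code:

```python
# My reading of a "CoCa-style" objective: contrastive image-text matching plus
# an autoregressive captioning loss. Names and defaults are placeholders.
import torch
import torch.nn.functional as F

def coca_style_loss(img_emb, txt_emb, caption_logits, caption_targets,
                    temperature=0.07, caption_weight=2.0):
    # Contrastive term: match each image to its paired caption within the batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels)) / 2
    # Captioning term: next-token cross-entropy over the caption tokens.
    captioning = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
    )
    return contrastive + caption_weight * captioning
```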
This preparation (UNI initialization followed by vision-language alignment) appears to match the paper's description of how the vision encoder was readied before being connected to the Llama 2 LLM and put through the final multimodal instruction tuning that produces the full PathChat model.
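To make sure I'm reading the architecture correctly, this is roughly the composition I have in mind; the class, module names, projector design, and input resolution below are placeholders of mine, not the actual PathChat implementation:

```python
# Illustrative composition only -- not the actual PathChat code.
import torch
import torch.nn as nn
import timm

class PathChatStyleMLLM(nn.Module):
    def __init__(self, llm: nn.Module, llm_dim: int = 4096):
        super().__init__()
        # Vision encoder: UNI-initialized ViT-L, further aligned with text
        # (CoCa-style) -- i.e. what I believe CONCHv1_5 is.
        self.vision_encoder = timm.create_model(
            "vit_large_patch16_224", pretrained=False, num_classes=0
        )
        # Projector mapping visual tokens into the LLM embedding space.
        self.projector = nn.Linear(self.vision_encoder.num_features, llm_dim)
        # Language model: Llama 2 in the paper, instruction-tuned in the last stage.
        self.llm = llm

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        tokens = self.vision_encoder.forward_features(pixel_values)  # (B, N, 1024)
        return self.projector(tokens)  # (B, N, llm_dim), fed to the LLM as visual tokens
```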
Could you please clarify whether the CONCHv1_5 checkpoint available on Hugging Face is the exact vision encoder used within the PathChat MLLM, i.e., its state after the vision-language alignment stage but before the final multimodal instruction tuning?
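For context, if the PathChat vision-tower weights were ever available side by side with CONCHv1_5, I imagine the check would look something like this; the file names and the key prefix are placeholders I made up, and I'm assuming both files hold flat state dicts:

```python
# Hypothetical weight comparison -- paths and key prefix are placeholders.
import torch

conch = torch.load("conch_v1_5.pt", map_location="cpu")
pathchat = torch.load("pathchat_mllm.pt", map_location="cpu")

# Strip off whatever prefix the MLLM uses for its vision encoder keys.
prefix = "vision_tower."
pathchat_vision = {k[len(prefix):]: v for k, v in pathchat.items() if k.startswith(prefix)}

shared = conch.keys() & pathchat_vision.keys()
identical = all(torch.equal(conch[k], pathchat_vision[k]) for k in shared)
print(f"{len(shared)} shared tensors; weights identical: {identical}")
```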
Confirming this would be very helpful for understanding the specific components and the step-by-step training pipeline of PathChat.
Thank you for developing these valuable models and for any clarification you can provide!
