The released checkpoint does not seem to work with Transformers

#1
by catherinexyz - opened

Hi, I am trying to load your released checkpoint using Transformers, but it fails with the following error: "The checkpoint you are trying to load has model type vica_qwen but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date. You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git."

I am just wondering how to solve this issue? Thanks!

Owner

Hi @catherinexyz ,

Thank you for your question! I'd be happy to help you with this issue.

"Hi, I am trying to load your released ckpt using transformers, but it says that The checkpoint you are trying to load has model type vica_qwen but Transformers does not recognize this architecture... I am just wondering how to solve this issue?"

This is an excellent question, and the error you're seeing is expected if you try to load the model directly with a standard Transformers installation. Here is the solution and a detailed explanation of why it happens.

Solution

The key is to use our project's code, which includes the custom model definition. Please follow the setup instructions from our README file.

Here are the step-by-step commands:

# 1. Clone our repository
git clone https://github.com/nkkbr/ViCA.git
cd ViCA

# 2. Create and activate a conda environment
conda create -n vica2 python=3.10 -y
conda activate vica2

# 3. Install dependencies (editable mode to include our code)
# This command uses CUDA 12.1, adjust if needed
pip install --extra-index-url https://download.pytorch.org/whl/cu121 -e .

# 4. Install FlashAttention (required)
pip install flash-attn==2.5.7

# 5. Install LLaVA-NeXT
# Although our repo includes the necessary LLaVA-NeXT code, 
# for simplicity and to ensure all dependencies are met, we recommend installing it this way.
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git

After completing these steps, you should be able to run the inference code provided on the model page without any issues. I have just re-tested this entire process in a clean environment and can confirm it works correctly.

Explanation: Why This Happens

The root cause of the error is that our model, VicaQwenForCausalLM, is a custom architecture.

While it's built upon the Hugging Face Transformers framework, it has not been officially merged into the main Transformers library. (This means you won't find its definition in the official Transformers models directory).

Our model, ViCA2, is custom-developed on top of the popular multimodal model LLaVA-NeXT. You can find our full implementation in our GitHub repository: https://github.com/nkkbr/ViCA.

Here’s how it works technically:

  1. Defining the Custom model_type: In our code, specifically in vica2/model/language_model/vica_qwen.py, we define a custom configuration class that specifies our model type on lines 37-38:

    class VicaQwenConfig(Qwen2Config):
        model_type = "vica_qwen"
    
  2. Dynamic Registration: At the end of that same file (lines 152-154), we dynamically register this custom architecture with the AutoModel classes from Transformers. This tells Transformers how to handle the "vica_qwen" type using our code.

    AutoConfig.register("vica_qwen", VicaQwenConfig)
    AutoModelForCausalLM.register(VicaQwenConfig, VicaQwenForCausalLM)
    print(f"VicaQwenConfig, VicaQwenForCausalLM registered!")
    
  3. How Registration is Triggered: When you run our inference example, the very first line is:

    from vica2.model.builder import load_pretrained_model
    

    This import statement triggers a chain reaction:

    • It executes the top-level code in vica2/model/builder.py.
    • That file, in turn, imports our custom model via from vica2.model.language_model.vica_qwen import VicaQwenForCausalLM.
    • This finally executes the registration code shown above, making Transformers aware of our model before you try to load it.

In fact, after you complete the installation, you can verify this yourself. Just open a Python shell and run from vica2.model.builder import load_pretrained_model. You will see the confirmation message:
VicaQwenConfig, VicaQwenForCausalLM registered!
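
For instance, a minimal check could look like the sketch below (nkkbr/ViCA2-init is just the checkpoint this thread was opened on; substitute whichever checkpoint you actually want to load):

from vica2.model.builder import load_pretrained_model  # triggers the registration described above
from transformers import AutoConfig

# Loading the config only works because the import above has already
# registered the "vica_qwen" architecture with Transformers.
config = AutoConfig.from_pretrained("nkkbr/ViCA2-init")
print(config.model_type)  # "vica_qwen"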

Additional Note

I noticed you posted this on the nkkbr/ViCA2-init model page. Just to clarify, that checkpoint is from an early stage of the project. The final, primary model is located here:

We have released several checkpoints from different fine-tuning stages. You can find them all in this collection:

For details about each stage, the data used, and the methodology, please refer to the GitHub README.

Hope this helps! Let me know if you have any other questions.

Owner

Just wanted to add a small note to make running the inference example as smooth as possible.

While all of the following is already present in the inference code on the model page, I thought it might be helpful to break down the key steps here for clarity:

  1. Load the dataset metadata: First, you'll load the dataset using the datasets library.

    from datasets import load_dataset
    
    vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
    vsi_bench = vsi_bench['test']
    
  2. Download and Extract Video Files: The load_dataset command above does not download the actual video files. To get them, you need to clone the dataset repository and unzip the video archives inside it.

    You can do this by running the following commands in your terminal:

    # Clone the dataset repository (it uses Git LFS)
    git clone https://huggingface.co/datasets/nyu-visionx/VSI-Bench
    
    # Navigate into the new directory and unzip the videos
    cd VSI-Bench
    unzip arkitscenes.zip
    unzip scannet.zip
    unzip scannetpp.zip
    
  3. Update the Code: Finally, modify the video_path variable in the example script to point to one of the videos you just unzipped.

    # Change this line in the example code to the actual path
    video_path = "VSI-Bench/videos/your_chosen_video.mp4" 
    

Of course, you are also welcome to test it with your own videos of indoor scenes and ask custom questions. The example is just a starting point.
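
If it helps, here is a small helper sketch (not part of the original example; it assumes the archives were extracted under the VSI-Bench directory as in step 2) for picking one of the unzipped videos:

import glob

# Search the cloned dataset directory recursively for extracted .mp4 files;
# adjust the pattern if your directory layout differs.
video_files = sorted(glob.glob("VSI-Bench/**/*.mp4", recursive=True))
print(f"Found {len(video_files)} video files")

video_path = video_files[0]  # or pick any video you like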

Hope this extra detail is helpful!


Okay, I think I have now fixed it. Thanks for your reply! It turns out that I need to run from vica2.model.builder import load_pretrained_model to register the model, and then the model can be loaded with transformers.AutoConfig successfully. Thanks!

Sorry, another question: I notice that in your experiment configuration you state that you used H100 and H200 GPUs to train your model. Can you state the GPU you used for each stage? And will the final results fluctuate a lot if another type of GPU is used (e.g., V100, 3090, A6000, etc.)? Thanks a lot for your help!

Hi @catherinexyz ,

Thanks again for the follow-up questions! I'm happy to provide all the details.

GPU Configuration for Training

Our training process was conducted in four stages. Here are the specific GPUs used for each stage:

  • Stage 1 (Pre-training): Four NVIDIA H100 SXM 80GB
  • Stage 2: Four NVIDIA H100 SXM 80GB
  • Stage 3: Four NVIDIA H200 SXM 141GB. The result of this stage is ViCA2, our main model.
  • Stage 4 (ViCA2-thinking): We also performed a final fine-tuning step on a small dataset (nkkbr/ViCA-thinking-2.68k) to enable the model to output its "thought process." This stage also used Four NVIDIA H200 SXM 141GB.

Training Details & Logs

For more comprehensive information about the training strategy—including datasets, trainable modules, learning rates, and other hyperparameters—please refer to this table in our README:

Furthermore, for full transparency, we have made the training logs for all four stages publicly available via Weights & Biases:

On Result Fluctuation with Different GPUs

Regarding whether results will fluctuate on different hardware (like V100, 3090, A6000), this is a great question.

First, it's important to note that our provided inference example uses deterministic settings:

cont = model.generate(
    input_ids,
    images=video1,
    images_for_sam=video2,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=1024,
)

With do_sample=False and temperature=0, the model's output should be deterministic and not subject to randomness.

However, it's difficult to guarantee that there will be absolutely zero difference across different hardware types. Here are my thoughts and educated guesses on this topic:

  1. Numerical Precision: Different GPUs (and even different CUDA versions) can sometimes produce minor floating-point differences in continuous numerical computations. While these tiny variations might affect the raw logit scores, the final choice of a token is a discrete action (an argmax operation). For the output token to change, the numerical difference would need to be significant enough to alter which token has the highest logit score. This is generally unlikely (see the short sketch after this list).

  2. Scope of Impact: In the worst-case scenario, any potential discrepancies would likely be very subtle and affect only a very small number of samples. It is highly improbable that you would see widespread or significant deviations in the model's overall behavior or performance.
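
To make the argmax point above concrete, here is a toy illustration (made-up numbers, not ViCA-specific):

import torch

# Hypothetical logits for four candidate tokens at one decoding step.
logits = torch.tensor([2.31, 5.87, 3.10, 0.40])

# Simulate tiny hardware/precision noise on the order of 1e-4.
perturbed = logits + torch.randn_like(logits) * 1e-4

# Greedy decoding only changes if the noise flips which token has the highest
# score, which requires the top-2 gap to be comparable to the noise magnitude.
print(torch.argmax(logits).item(), torch.argmax(perturbed).item())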

In summary, the results should be highly consistent, and any potential variations are expected to be minimal to non-existent.

Quick Guide to Inference

Just in case you missed it, you can directly use the inference code we provide on GitHub to test the model:

You simply need to modify the video file path and the question within the script to generate an output.

A Note on the nkkbr/ViCA2-init Checkpoint

Finally, I'd like to clarify the role of nkkbr/ViCA2-init, since you first asked your question there.

This checkpoint represents the initial weights right at the start of training. Specifically:

  • It initializes most of its parameters using weights from lmms-lab/LLaVA-Video-7B-Qwen2 and nkkbr/hiera-base-plus-in-sam2.1.
  • However, the projector (an MLP that connects the Hiera visual encoder to the language model) has randomly initialized weights.

This checkpoint serves as the starting point for our fine-tuning process, and we released it for full transparency. You might also notice its file size is much larger; this is because we saved it with torch.float32 precision, whereas the subsequent fine-tuned models are saved in the more efficient torch.bfloat16 format.
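
As a rough back-of-the-envelope sketch (the ~7B parameter count is an approximation for illustration, not an exact figure), the dtype alone accounts for roughly a 2x difference in file size:

# float32 stores 4 bytes per parameter, bfloat16 stores 2 bytes per parameter.
n_params = 7e9  # approximate parameter count, for illustration only
print(f"float32:  ~{n_params * 4 / 1e9:.0f} GB")   # ~28 GB
print(f"bfloat16: ~{n_params * 2 / 1e9:.0f} GB")   # ~14 GB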

While you can technically run inference with this initial checkpoint, its performance is expected to be low because the visual and language components are not yet well "aligned."

I hope this detailed explanation is helpful! Please feel free to ask if anything is unclear.

Got it! Thanks again for your detailed reply!
