Good folks at @nvidia have just released NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results across vision-language tasks.

Here is how they did it:

1. Model Architecture Design:
- Developed three model architectures (contrasted in the toy sketch below):
a) NVLM-D: Decoder-only architecture (projected image tokens go straight into the LLM's self-attention)
b) NVLM-X: Cross-attention-based architecture (image tokens are consumed via gated cross-attention layers)
c) NVLM-H: Novel hybrid architecture (thumbnail tokens flow through the decoder, high-resolution tiles through cross-attention)
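
To make the contrast concrete, here is a toy sketch (not NVIDIA's code; the shapes, the thumbnail/tile split, and the cross-attention module are placeholders) of where the projected image features enter the LLM in each variant:

```python
# Toy sketch of how visual features reach the LLM in the three NVLM variants.
# All shapes and modules here are illustrative placeholders.
import torch
import torch.nn as nn

d_model, n_img, n_txt = 512, 256, 32             # hidden size, image tokens, text tokens
img_feats = torch.randn(1, n_img, d_model)       # vision-encoder output after the projector
txt_embeds = torch.randn(1, n_txt, d_model)      # embedded text tokens

# NVLM-D (decoder-only): projected image tokens are concatenated with the text
# tokens and processed by the LLM's ordinary self-attention layers.
decoder_input = torch.cat([img_feats, txt_embeds], dim=1)   # (1, n_img + n_txt, d_model)

# NVLM-X (cross-attention): the LLM input stays text-only; image tokens are
# consumed through gated cross-attention layers interleaved with the LLM blocks.
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
x_out, _ = cross_attn(query=txt_embeds, key=img_feats, value=img_feats)

# NVLM-H (hybrid): thumbnail tokens join the text in the decoder path, while the
# high-resolution tile tokens are attended to via cross-attention.
thumb, tiles = img_feats[:, :64], img_feats[:, 64:]         # illustrative split
h_input = torch.cat([thumb, txt_embeds], dim=1)
h_out, _ = cross_attn(query=h_input, key=tiles, value=tiles)
```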

2. Vision Encoder:
- Used InternViT-6B-448px-V1-5 as the vision encoder
- Implemented dynamic high-resolution (DHR) input handling, splitting each image into 448x448 tiles plus a global thumbnail (rough sketch below)
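
Roughly, DHR tiling works like this (an approximation of the InternVL-style scheme the paper builds on, not the released preprocessing code; the grid search and `max_tiles` value are assumptions):

```python
# Rough sketch of dynamic high-resolution (DHR) tiling: pick a tile grid close
# to the image's aspect ratio, resize, split into 448x448 tiles, and keep a
# global 448x448 thumbnail. Illustration only, not the released code.
from PIL import Image

TILE = 448

def dhr_tiles(img: Image.Image, max_tiles: int = 6):
    w, h = img.size
    aspect = w / h
    # Enumerate candidate grids (cols x rows) and pick the one whose aspect
    # ratio best matches the input image.
    cols, rows = min(
        ((c, r) for c in range(1, max_tiles + 1)
                for r in range(1, max_tiles + 1) if c * r <= max_tiles),
        key=lambda cr: abs(cr[0] / cr[1] - aspect),
    )
    resized = img.resize((cols * TILE, rows * TILE))
    tiles = [
        resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
        for r in range(rows) for c in range(cols)
    ]
    thumbnail = img.resize((TILE, TILE))      # global view of the whole image
    return tiles, thumbnail
```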

3. Language Model:
- Used Qwen2-72B-Instruct as the base LLM

4. Training Data Curation:
- Carefully curated high-quality pretraining and supervised fine-tuning datasets
- Included diverse task-oriented datasets spanning captioning, VQA, OCR, document/chart understanding, and math reasoning

5. Pretraining:
- Froze both the LLM and the vision encoder
- Trained only the modality-alignment modules (e.g., the MLP projector, the gated cross-attention layers), as sketched below
- Used a large batch size of 2048
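
A minimal sketch of that freezing scheme (module names, optimizer choice, and learning rate are placeholders, not the paper's exact settings):

```python
# Stage-1 (pretraining) setup as described above: freeze the vision encoder and
# the LLM, train only the modality-alignment module (e.g., the MLP projector).
import torch
import torch.nn as nn

def configure_pretraining(vision_encoder: nn.Module, llm: nn.Module,
                          alignment_module: nn.Module) -> torch.optim.Optimizer:
    for p in vision_encoder.parameters():
        p.requires_grad = False                  # vision encoder stays frozen
    for p in llm.parameters():
        p.requires_grad = False                  # LLM stays frozen
    for p in alignment_module.parameters():
        p.requires_grad = True                   # only alignment weights train
    # Only the alignment module's parameters are handed to the optimizer.
    return torch.optim.AdamW(alignment_module.parameters(), lr=1e-3)
```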

6. Supervised Fine-Tuning (SFT):
- Unfroze the LLM while keeping the vision encoder frozen
- Trained on multimodal SFT datasets blended with high-quality text-only SFT data to preserve text-only performance
- Implemented 1-D tile tagging for dynamic high-resolution inputs (see the sketch below)
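
The 1-D tile-tagging idea: each DHR tile's features are preceded by a text tag, so the LLM can tell the tiles (and the global thumbnail) apart once the 2-D layout is flattened into a sequence. The tag strings and the `<image>` placeholder below are my assumptions for illustration:

```python
def build_tagged_prompt(num_tiles: int, question: str) -> str:
    """Prefix each image-feature block with a 1-D tile tag (illustrative tags)."""
    parts = ["<tile_global_thumbnail><image>"]                  # global view first
    parts += [f"<tile_{i}><image>" for i in range(1, num_tiles + 1)]
    return "".join(parts) + "\n" + question

# e.g. build_tagged_prompt(6, "What is the total on this receipt?")
```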

7. Evaluation:
- Evaluated on multiple vision-language benchmarks (e.g., MMMU, MathVista, OCRBench, DocVQA) as well as text-only benchmarks
- Compared performance to leading proprietary and open-source models

8. Optimization:
- Iterated on model designs and training approaches
- Used smaller 34B models for faster experimentation before scaling to 72B

9. Now comes the best part... Open-Sourcing:
- Released the model weights and full technical details to the research community (a quick-start loading snippet is below)
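
If you want to poke at the released weights, something like this should work with Hugging Face Transformers (the repo id is my assumption for the decoder-only checkpoint; the 72B model needs multiple GPUs and `accelerate` for `device_map="auto"`):

```python
from transformers import AutoModel, AutoTokenizer

repo_id = "nvidia/NVLM-D-72B"    # assumed repo id for the decoder-only variant

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    torch_dtype="auto",          # load in the checkpoint's native precision
    device_map="auto",           # shard across available GPUs (requires accelerate)
    trust_remote_code=True,      # the repo ships custom modeling code
).eval()
```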

The paper provides fascinating insights into architecture design, training data curation, and achieving production-grade multimodality. A must-read for anyone working on multimodal AI!