Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io

Posts 62

Post
It's not every day that you see a research paper titled "Alice's Adventures in a Differentiable Wonderland," and when you open it, it turns out to be a 281-page book!

I haven't finished it yet, but this work by Simone Scardapane is a fascinating introduction to deep neural networks and differentiable programming.

Some key technical highlights:

• Covers core concepts like automatic differentiation, stochastic optimization, and activation functions in depth

• Explains modern architectures like convolutional networks, transformers, and graph neural networks

• Provides mathematical foundations including linear algebra, gradients, and probability theory

• Discusses implementation details in PyTorch and JAX

• Explores advanced topics like Bayesian neural networks and neural scaling laws

The book takes a distinctive approach, framing neural networks as compositions of differentiable primitives rather than as biological analogs. It provides both theoretical insights and practical coding examples.
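
To make the "compositions of differentiable primitives" idea concrete, here is a minimal sketch in plain Python. It is not taken from the book: the `PRIMITIVES` table and the `value_and_grad` helper are names I made up for illustration, and everything is scalar, so each vector-Jacobian product reduces to a multiplication.

```python
import math

# Each primitive is a pair (forward, vjp). For scalars, vjp(x, g) is simply
# g times the local derivative at x. The names are illustrative assumptions.
PRIMITIVES = {
    "square": (lambda x: x * x, lambda x, g: g * 2.0 * x),
    "sin": (math.sin, lambda x, g: g * math.cos(x)),
    "scale3": (lambda x: 3.0 * x, lambda x, g: g * 3.0),
}


def value_and_grad(names, x):
    """Run the chain forward, then pull the gradient back in reverse order."""
    saved_inputs = []
    for name in names:
        forward, _ = PRIMITIVES[name]
        saved_inputs.append(x)  # each vjp needs the input its primitive saw
        x = forward(x)
    value = x

    grad = 1.0  # d(output)/d(output)
    for name, saved in zip(reversed(names), reversed(saved_inputs)):
        _, vjp = PRIMITIVES[name]
        grad = vjp(saved, grad)
    return value, grad


# f(x) = 3 * sin(x^2), so f'(x) = 6 * x * cos(x^2)
value, grad = value_and_grad(["square", "sin", "scale3"], 0.5)
```

The backward loop is reverse-mode differentiation in miniature: the forward pass stores inputs, and the backward pass applies each primitive's vjp from last to first.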

I especially enjoyed the sections on:

• Vector-Jacobian products and reverse-mode autodiff
• Stochastic gradient descent and mini-batch optimization
• ReLU, GELU, and other modern activation functions
• Universal approximation capabilities of MLPs
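
As a rough illustration of the mini-batch optimization topic above, here is a small sketch of mini-batch SGD fitting a line in plain Python. The function name, learning rate, batch size, and toy data are all my own assumptions, not examples from the book.

```python
import random


def minibatch_sgd(xs, ys, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y ≈ w * x + b by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # a fresh random partition into batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                # Gradient of the batch-averaged squared error.
                grad_w += 2.0 * err * xs[i] / len(batch)
                grad_b += 2.0 * err / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
xs = [i / 10.0 for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Because the toy data has no noise, every batch agrees on the same optimum, so the estimates settle close to w = 2 and b = 1.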

Whether you're new to deep learning or an experienced practitioner, this book offers valuable insights into the fundamentals and latest developments. Highly recommended for anyone working with neural networks!
Post
The good folks at @nvidia have just released NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results on vision-language tasks.

Here is how they did it:

1. Model Architecture Design:
- Developed three model architectures:
a) NVLM-D: Decoder-only architecture
b) NVLM-X: Cross-attention-based architecture
c) NVLM-H: Novel hybrid architecture

2. Vision Encoder:
- Used InternViT-6B-448px-V1-5 as the vision encoder
- Implemented dynamic high-resolution (DHR) input handling

3. Language Model:
- Used Qwen2-72B-Instruct as the base LLM

4. Training Data Curation:
- Carefully curated high-quality pretraining and supervised fine-tuning datasets
- Included diverse task-oriented datasets for various capabilities

5. Pretraining:
- Froze LLM and vision encoder
- Trained only modality-alignment modules (e.g., MLP projector, cross-attention layers)
- Used a large batch size of 2048

6. Supervised Fine-Tuning (SFT):
- Unfroze LLM while keeping the vision encoder frozen
- Trained on multimodal SFT datasets and high-quality text-only SFT data
- Implemented 1-D tile tagging for dynamic high-resolution inputs

7. Evaluation:
- Evaluated on multiple vision-language benchmarks
- Compared performance to leading proprietary and open-source models

8. Optimization:
- Iterated on model designs and training approaches
- Used smaller 34B models for faster experimentation before scaling to 72B

9. Now comes the best part... Open-Sourcing:
- Released model weights and full technical details to the research community
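
The freezing schedule from steps 5 and 6 can be summarised as a small lookup table. This is only a sketch: the component names are illustrative labels rather than identifiers from the NVLM code, and keeping the alignment modules trainable during SFT is my assumption, since the steps above only mention the LLM and the vision encoder for that stage.

```python
# Which components receive gradient updates in each training stage.
# Names are illustrative labels, not identifiers from the NVLM release.
TRAINABLE = {
    "pretraining": {
        "vision_encoder": False,    # frozen
        "alignment_modules": True,  # e.g. MLP projector, cross-attention layers
        "llm": False,               # frozen
    },
    "sft": {
        "vision_encoder": False,    # stays frozen
        "alignment_modules": True,  # assumption: still trained during SFT
        "llm": True,                # unfrozen
    },
}


def trainable_components(stage):
    """Return the sorted names of components updated in the given stage."""
    return sorted(name for name, on in TRAINABLE[stage].items() if on)
```

In a real training script, a table like this would typically drive which parameters are handed to the optimizer at each stage.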

The paper provides fascinating insights into architecture design, training data curation, and achieving production-grade multimodality. A must-read for anyone working on multimodal AI!
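
For a feel of what dynamic high-resolution input handling with 1-D tile tags might involve, here is a hedged sketch. The grid-selection rule, the `max_tiles=6` default, the tag strings, and their ordering are all assumptions made for illustration, so check the paper for the actual scheme.

```python
def choose_grid(width, height, max_tiles=6):
    """Pick the (cols, rows) tile grid whose aspect ratio best matches the image."""
    target = width / height
    candidates = [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]
    # Simplification: ties are broken in favour of fewer tiles.
    return min(
        candidates,
        key=lambda g: (abs(g[0] / g[1] - target), g[0] * g[1]),
    )


def tile_tags(width, height, max_tiles=6):
    """Return one text tag per tile, numbered in flattened (1-D) order."""
    cols, rows = choose_grid(width, height, max_tiles)
    tags = [f"<tile_{i}>" for i in range(1, cols * rows + 1)]
    # Hypothetical tag for a downscaled view of the whole image.
    tags.append("<tile_global_thumbnail>")
    return tags


# A 896x448 image maps to a 2x1 grid of 448px tiles plus a thumbnail.
tags = tile_tags(896, 448)
```

The point of a 1-D tag is that each tile's tokens are prefixed with a plain text marker, so the language model can tell tiles apart without a 2-D positional scheme.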
