nvidia/MambaVision-L3-512-21K · What's new? What has actually changed since Jul 2024?

Hi,

Thanks for your interest in our work.

Since the initial release, we have scaled MambaVision up to 740 million parameters. We also now train on the larger ImageNet-21K dataset and have added native support for higher resolutions: the models handle images at 256 and 512 pixels, compared to the original 224 pixels.
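To give a rough sense of what the resolution change means for input size, here is a minimal sketch that compares pixel counts. It assumes each figure (224, 256, 512) is the side length of a square image; the helper name `pixel_ratio` is only illustrative and is not part of the MambaVision code.

```python
# Assumption: each resolution is the side length of a square input image.
RESOLUTIONS = (224, 256, 512)
BASELINE = 224  # original MambaVision input resolution


def pixel_ratio(side: int, baseline: int = BASELINE) -> float:
    """Return how many times more pixels a side x side image has
    than a baseline x baseline image."""
    return (side * side) / (baseline * baseline)


for side in RESOLUTIONS:
    print(f"{side}x{side}: {side * side} pixels, "
          f"{pixel_ratio(side):.2f}x the {BASELINE}x{BASELINE} input")
```

Under that assumption, a 512-pixel input carries a little over five times as many pixels as a 224-pixel one.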

The L/L2/L3 variants are scaled-up versions of their smaller counterparts, T/T2, which were initially trained on ImageNet-1K.

These changes have improved accuracy both on ImageNet image classification and on downstream tasks such as semantic segmentation (ADE20K) and object detection (COCO). Our newest variant, MambaVision-L3-512-21K, has 740M parameters and achieves a Top-1 accuracy of 88.1% solely through pre-training on ImageNet-21K.

MambaVision is also the first Mamba-based model to be scaled successfully to this size while achieving state-of-the-art results.

Kind Regards,
Ali Hatamizadeh