Searching for better (Full) ImageNet ViT Baselines
`timm` 1.0.9 was just released. Included are a few new ImageNet-12k and ImageNet-12k -> ImageNet-1k weights in my Searching for Better ViT Baselines series.
| model | top1 (%) | top5 (%) | param_count (M) | img_size (px) |
|---|---|---|---|---|
| vit_mediumd_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 87.438 | 98.256 | 64.11 | 384 |
| vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 86.608 | 97.934 | 64.11 | 256 |
| vit_betwixt_patch16_reg4_gap_384.sbb2_e200_in12k_ft_in1k | 86.594 | 98.02 | 60.4 | 384 |
| vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k | 85.734 | 97.61 | 60.4 | 256 |
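These weights load like any other `timm` model from the Hugging Face Hub. A minimal sketch (model name taken from the table above; the data-config helpers assume a recent `timm` release):

```python
import timm
import torch

# Create one of the new in12k -> in1k fine-tuned ViTs and pull its pretrained weights.
model = timm.create_model(
    'vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k',
    pretrained=True,
).eval()

# Build matching eval preprocessing from the model's pretrained config.
data_cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_cfg, is_training=False)

# Dummy forward pass at the native 256x256 resolution.
x = torch.randn(1, 3, 256, 256)
with torch.inference_mode():
    logits = model(x)
print(logits.shape)  # -> torch.Size([1, 1000])
```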
I'd like to highlight these models as they're on the Pareto front for ImageNet-12k / ImageNet-22k pretrained models. It's interesting to compare them against models with comparable ImageNet-22k fine-tunes to see how competitive (near) vanilla ViTs are with other architectures. With optimized attention kernels enabled (the default in `timm`), they are well ahead of Swin and holding up just fine relative to ConvNeXt, etc.
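For context, "optimized attention kernels" here means routing ViT attention through PyTorch's fused `scaled_dot_product_attention`. A quick way to check what a given `timm` install will do, as a sketch using the `use_fused_attn` helper in recent `timm` versions:

```python
import timm
from timm.layers import use_fused_attn

# Reports whether timm's ViT blocks will use torch.nn.functional.scaled_dot_product_attention
# (fused / flash kernels) instead of the reference attention implementation.
print('fused attention:', use_fused_attn())

# The flag is read when attention modules are constructed, so it applies to models built afterwards.
model = timm.create_model('vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k')
```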
Something else worth pointing out: the `deit3` model weights are a remarkable and underappreciated set of weights. The upper end of my `sbb` weights match `deit3` at equivalent compute -- it's also a great recipe. However, one of my goals with the `sbb` recipes was to allow easier fine-tuning. By opting for a less exotic augmentation scheme, sticking with AdamW, and sacrificing some top-1 (via higher weight decay), I feel that was achieved. Through several fine-tune trials I've found the `sbb` ViT weights easier to fit to other, especially smaller, datasets (Oxford Pets, RESISC, etc.) with short runs, as in the sketch below.
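A minimal fine-tuning sketch along those lines (the dataset, class count, and hyper-parameters here are illustrative assumptions, not the exact settings from those trials):

```python
import timm
import torch

# Start from an sbb in12k -> in1k checkpoint and swap the head for a 37-class dataset
# (e.g. Oxford-IIIT Pets); num_classes re-initializes the classifier.
model = timm.create_model(
    'vit_betwixt_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k',
    pretrained=True,
    num_classes=37,
)

# Plain AdamW with a modest LR and weight decay -- hypothetical values for a short run.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    """One standard supervised step; wrap in your usual dataloader/epoch loop."""
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```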
NOTE: all throughput measurements were done on an RTX 4090 with AMP and torch.compile() enabled, PyTorch 2.4, CUDA 12.4.
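For anyone wanting to run a comparable measurement, a rough sketch of that setup (batch size and timing loop are my own assumptions, not the exact benchmark script):

```python
import time
import timm
import torch

model = timm.create_model(
    'vit_mediumd_patch16_reg4_gap_256.sbb2_e200_in12k_ft_in1k',
    pretrained=False,
).cuda().eval()

# Same ingredients as the note above: torch.compile plus AMP autocast on CUDA.
model = torch.compile(model)
x = torch.randn(256, 3, 256, 256, device='cuda')  # batch size is an arbitrary choice

with torch.inference_mode(), torch.autocast('cuda', dtype=torch.float16):
    for _ in range(10):  # warmup, lets compilation/autotuning settle
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f'{(50 * x.shape[0]) / elapsed:.1f} images/sec')
```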