Reproduce the results for training with DataCompDR_12M
Hi, thanks for the great work!
I'm trying to reproduce the results following the training section on GitHub: I'm running `configs/run_datacompdr12m.sh` to replicate the paper's ViT-B/16 results with the DataCompDR-12M dataset.
Currently, my best result is 56.7% accuracy on ImageNet, while the paper reports 61.7%.
The only difference on my end is that about 30% of the image links in the DataCompDR-12M dataset couldn't be downloaded, so to reach 12M samples I added around 3.5M samples from the DataCompDR-1B dataset.
Thanks
Thanks @Amitshomer for your interest. The gap you are observing seems too large. There are two differences between the 12M and 1B sets that may have an impact:
- First, the `text_emb` in the 12M set has an additional repeated row: rows 0 and 1 are identical, and both are the embedding of the ground-truth caption. Depending on how you have combined the two datasets, this may or may not have an impact (see the layout sketch after this list).
- The other difference is the number of image augmentations per sample (10 for DataCompDR-1B vs. 30 for DataCompDR-12M), but this one should account for less than a 1% difference, as Table 4.a suggests.
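For concreteness, here is a minimal sketch of the per-sample text-embedding layout implied by the first point. The 7x1536 shape and the row ordering are inferred from this thread rather than from a documented schema, and the array below is dummy data:

```python
import numpy as np

# Assumed layout of text_emb for one DataCompDR-12M sample (before the fix
# discussed later in this thread): 7 rows of 1536-d embeddings.
#   row 0:    embedding of the ground-truth caption
#   row 1:    duplicate of row 0 (the redundant row described above)
#   rows 2-6: embeddings of the 5 synthetic captions
text_emb_all = np.random.randn(7, 1536).astype(np.float32)  # stand-in values
text_emb_all[1] = text_emb_all[0]  # rows 0 and 1 are identical in the 12M set
```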
If investigating these issues doesn't turn up anything, I'd suggest repeating the ablations of Table 2 to see where the drop appears: first disable all DR losses and synthetic captions and train with the CLIP loss alone, then add synthetic captions, and finally add the KD loss with lambda=1 (a sketch of this loss composition is below).
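To make the ablation steps concrete, here is a hedged sketch of how such a loss composition is typically wired up. This is generic distillation code, not the repository's implementation; the function name and the `lmbda`/`temperature` parameters are illustrative:

```python
import torch.nn.functional as F

def composite_loss(clip_logits, labels, student_logits, teacher_logits,
                   lmbda=1.0, temperature=1.0):
    """Illustrative (1 - lambda) * CLIP + lambda * KD objective.

    lmbda=0.0 -> plain CLIP contrastive loss (ablation steps 1 and 2),
    lmbda=1.0 -> pure distillation against the teacher (ablation step 3).
    """
    # Standard contrastive (InfoNCE) loss over in-batch image-text pairs.
    clip_loss = F.cross_entropy(clip_logits, labels)
    # KL divergence between student and teacher similarity distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1.0 - lmbda) * clip_loss + lmbda * kd_loss
```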
Hi, thanks for the quick reply.
About the first point: as you said, the 12M set includes a repeated row. Correct me if I'm wrong, but it looks like there's a bug in the dataloader's handling of DataCompDR-12M.
Here, `len(texts) == 1`, and in the lines below the selected `syn_text` and its `syn_text_emb` don't align:
```python
syn_text = sample["syn.json"]["syn_text"][scapi]
syn_text_emb = text_emb_all[len(texts) + scapi]
```
For example, when `scapi == 0`, the `syn_text_emb` that gets pulled is actually the embedding of the GT caption (rows 0 and 1 are both the GT caption embedding).
To clarify, only one text embedding is repeated, so a hacky fix is to replace the second line above with:
```python
is_duplicate = (len(text_emb_all) == 7)
syn_text_emb = text_emb_all[len(texts) + is_duplicate + scapi]
```
This skips the duplicate text embedding. We will fix the dataset soon and remove the redundant text embedding, but you can use the hack above in the meantime (a quick sanity check of the indexing follows).
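As a sanity check, the following self-contained snippet reproduces the misalignment on dummy data and shows that the hack selects the intended row. The shapes and the 5-synthetic-caption count are inferred from this thread:

```python
import numpy as np

texts = ["ground-truth caption"]  # len(texts) == 1 for the 12M set
syn_texts = [f"synthetic caption {i}" for i in range(5)]

gt_emb = np.zeros((1, 1536), dtype=np.float32)
syn_embs = np.stack([np.full(1536, i + 1, dtype=np.float32) for i in range(5)])
# 12M layout before the dataset revision: the GT embedding appears twice.
text_emb_all = np.concatenate([gt_emb, gt_emb, syn_embs])  # shape (7, 1536)

scapi = 0  # index of the sampled synthetic caption
is_duplicate = (len(text_emb_all) == 7)

buggy = text_emb_all[len(texts) + scapi]                 # row 1: GT embedding, misaligned
fixed = text_emb_all[len(texts) + is_duplicate + scapi]  # row 2: embedding of syn_texts[0]
assert (buggy == gt_emb[0]).all() and (fixed == syn_embs[0]).all()
```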
This issue was also reported here: https://huggingface.co/datasets/apple/DataCompDR-12M/discussions/4
One more note: please make sure to download the original ground-truth captions for the 12M set, as discussed here:
https://huggingface.co/datasets/apple/DataCompDR-12M/discussions/5
https://github.com/apple/ml-mobileclip/tree/main/training
Hi @Amitshomer,
We have uploaded a revision of the dataset in which the duplicate features are removed; the text embeddings per sample are now 6x1536. Please let us know if you test it and observe any issues.
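For anyone verifying the revised shards, a minimal check along these lines should pass; how `text_emb_all` is decoded from each webdataset sample depends on your pipeline, so treat this as a sketch:

```python
import numpy as np

def check_sample(text_emb_all: np.ndarray) -> None:
    # Revised layout: row 0 is the ground-truth caption embedding,
    # rows 1-5 are the 5 synthetic caption embeddings.
    assert text_emb_all.shape == (6, 1536), text_emb_all.shape
    # The is_duplicate hack above now evaluates to False and becomes a no-op.
    assert not (len(text_emb_all) == 7)
```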
Thanks