Leveraging Contextual Web Data for Fine-tuning Vision Language Models (https://arxiv.org/abs/2502.10250)