arXiv:2506.14842

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

Published on Jun 16 · Submitted by cwolff on Jun 19

Abstract

AI-generated summary

PictSure is an in-context learning framework that improves few-shot image classification by optimizing the embedding model's architecture, pretraining, and fine-tuning strategy, with a focus on out-of-domain performance.

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that both training success and out-of-domain performance depend strongly on how the embedding models are pretrained. Consequently, PictSure outperforms existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.

Community

Paper author · Paper submitter

TL;DR of "PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers"

The paper introduces PictSure, a vision-only in-context learning (ICL) framework for few-shot image classification (FSIC) that emphasizes the critical role of image embedding models. Unlike prior ICL methods that rely on language-supervised embeddings (like CLIP), PictSure uses purely visual features and transformer-based inference to classify images without any fine-tuning.

Key contributions include:

  • A systematic analysis of how embedding architecture (ResNet vs. ViT), pretraining strategies (e.g., triplet loss), and training dynamics affect FSIC performance.
  • Evidence that pretrained, frozen encoders, in particular ViTs trained with a triplet loss, generalize better to out-of-domain datasets (e.g., medical imagery); a pretraining sketch follows this list.
  • PictSure, despite being significantly smaller, outperforms models like CAML on out-of-domain tasks while maintaining competitive in-domain performance.
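For concreteness, here is a minimal sketch of triplet-loss pretraining for a ViT encoder, as referenced in the second bullet above. The model choice (torchvision's vit_b_16) and all hyperparameters are illustrative assumptions, not the authors' actual setup.

```python
# Sketch: pretraining a visual encoder with a triplet objective.
# vit_b_16 and the hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

encoder = vit_b_16(weights=None)
encoder.heads = nn.Identity()  # drop the classification head; keep the embedding

triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

def training_step(anchor, positive, negative):
    """anchor/positive share a class; negative comes from a different class."""
    optimizer.zero_grad()
    loss = triplet_loss(encoder(anchor), encoder(positive), encoder(negative))
    loss.backward()
    optimizer.step()
    return loss.item()
```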

The study highlights that embedding quality is more critical than model size or semantic alignment for generalization in low-data visual classification scenarios.
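To make the transformer-based inference described at the top of this TL;DR concrete, below is a minimal sketch of the general pattern: a frozen encoder embeds the labeled support images and the query, a transformer reads the whole sequence, and the query's label is predicted in a single forward pass with no gradient-based adaptation. The dimensions, label-embedding scheme, and architecture details are assumptions for illustration and do not reproduce PictSure's exact design.

```python
# Sketch of ICL-style few-shot inference over frozen image embeddings.
# All sizes and the label-injection scheme are illustrative assumptions.
import torch
import torch.nn as nn

class ICLClassifier(nn.Module):
    def __init__(self, embed_dim=768, n_classes=5, depth=4, heads=8):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes, embed_dim)  # inject support labels
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, n_classes)

    def forward(self, support_emb, support_labels, query_emb):
        # support_emb: (B, K, D) frozen-encoder embeddings; query_emb: (B, 1, D)
        tokens = torch.cat([support_emb + self.label_embed(support_labels), query_emb], dim=1)
        out = self.transformer(tokens)
        return self.head(out[:, -1])  # logits for the query token

# Usage with random tensors standing in for frozen-encoder outputs:
B, K, D = 2, 10, 768
logits = ICLClassifier()(torch.randn(B, K, D), torch.randint(0, 5, (B, K)), torch.randn(B, 1, D))
```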


Models citing this paper: 0
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 1