PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers
Abstract
PictSure is an in-context learning framework that enhances few-shot image classification by optimizing embedding models' architecture, pretraining, and fine-tuning strategies to improve out-of-domain performance.
Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) has emerged as a promising paradigm for few-shot image classification (FSIC), enabling models to generalize across domains without gradient-based adaptation. However, prior work has largely overlooked a critical component of ICL-based FSIC pipelines: the role of image embeddings. In this work, we present PictSure, an ICL framework that places the embedding model -- its architecture, pretraining, and training dynamics -- at the center of analysis. We systematically examine the effects of different visual encoder types, pretraining objectives, and fine-tuning strategies on downstream FSIC performance. Our experiments show that training success and out-of-domain performance depend strongly on how the embedding models are pretrained. Consequently, PictSure outperforms existing ICL-based FSIC models on out-of-domain benchmarks that differ significantly from the training distribution, while maintaining comparable results on in-domain tasks. Code can be found at https://github.com/PictSure/pictsure-library.
Community
TL;DR of "PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers"
The paper introduces PictSure, a vision-only in-context learning (ICL) framework for few-shot image classification (FSIC) that emphasizes the critical role of image embedding models. Unlike prior ICL methods that rely on language-supervised embeddings (like CLIP), PictSure uses purely visual features and transformer-based inference to classify images without any fine-tuning.
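To make the vision-only ICL setup concrete, here is a minimal PyTorch sketch (module names, dimensions, and the exact way labels are injected are assumptions for illustration, not the authors' architecture): a frozen, pretrained image encoder embeds support and query images, label embeddings are added to the support tokens, and a transformer predicts the query label without any gradient-based adaptation.

```python
# Illustrative sketch of a vision-only in-context classifier (assumed design,
# not the authors' exact architecture). A frozen encoder embeds support and
# query images; a transformer over the resulting token sequence predicts the
# query label with no fine-tuning at inference time.
import torch
import torch.nn as nn


class InContextClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, num_way: int,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        self.encoder = encoder.eval()            # frozen, pretrained image encoder
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        # num_way class embeddings plus one "unknown" embedding for the query token
        self.label_embed = nn.Embedding(num_way + 1, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(embed_dim, num_way)

    def forward(self, support_imgs, support_labels, query_img):
        # support_imgs: (B, K, C, H, W), support_labels: (B, K), query_img: (B, C, H, W)
        B, K = support_labels.shape
        with torch.no_grad():
            sup = self.encoder(support_imgs.flatten(0, 1)).view(B, K, -1)
            qry = self.encoder(query_img).unsqueeze(1)
        unknown = torch.full((B, 1), self.label_embed.num_embeddings - 1,
                             dtype=torch.long, device=query_img.device)
        tokens = torch.cat([sup + self.label_embed(support_labels),
                            qry + self.label_embed(unknown)], dim=1)
        out = self.transformer(tokens)
        return self.head(out[:, -1])             # logits for the query image
```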
Key contributions include:
- A systematic analysis of how embedding architecture (ResNet vs. ViT), pretraining strategies (e.g., triplet loss), and training dynamics affect FSIC performance.
- Evidence that pretrained, frozen encoders, in particular ViTs trained with a triplet loss, generalize better, notably to out-of-domain datasets such as medical imagery (a minimal triplet-loss sketch appears below).
- Despite being significantly smaller, PictSure outperforms models such as CAML on out-of-domain tasks while maintaining competitive in-domain performance.
The study highlights that embedding quality is more critical than model size or semantic alignment for generalization in low-data visual classification scenarios.
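For the triplet-loss pretraining mentioned above, a minimal sketch could look like the following (this assumes a standard PyTorch encoder and pre-mined anchor/positive/negative batches; it is not the authors' training code):

```python
# Minimal sketch of one triplet-loss pretraining step for the image encoder
# (assumed setup; the encoder and the triplet mining are placeholders).
# Anchors and positives share a class; negatives come from a different class.
import torch
import torch.nn as nn


def triplet_pretrain_step(encoder: nn.Module,
                          optimizer: torch.optim.Optimizer,
                          anchor: torch.Tensor,
                          positive: torch.Tensor,
                          negative: torch.Tensor,
                          margin: float = 1.0) -> float:
    """Pull same-class embeddings together and push different-class ones apart."""
    criterion = nn.TripletMarginLoss(margin=margin)
    loss = criterion(encoder(anchor), encoder(positive), encoder(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper's framing, an encoder pretrained this way is then frozen and reused inside the in-context classifier.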
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models (2025)
- Generalizable Vision-Language Few-Shot Adaptation with Predictive Prompts and Negative Learning (2025)
- E-InMeMo: Enhanced Prompting for Visual In-Context Learning (2025)
- Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation (2025)
- Beginning with You: Perceptual-Initialization Improves Vision-Language Representation and Alignment (2025)
- HierVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment (2025)
- FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens (2025)