Abstract
JAFAR is a lightweight feature upsampler that uses an attention-based module with Spatial Feature Transform modulation to produce high-resolution features from Foundation Vision Encoders without high-resolution supervision.
Foundation Vision Encoders have become essential for a wide range of dense vision tasks. However, their low-resolution spatial feature outputs necessitate feature upsampling to produce the high-resolution modalities required for downstream tasks. In this work, we introduce JAFAR, a lightweight and flexible feature upsampler that enhances the spatial resolution of visual features from any Foundation Vision Encoder to an arbitrary target resolution. JAFAR employs an attention-based module designed to promote semantic alignment between high-resolution queries, derived from low-level image features, and semantically enriched low-resolution keys, using Spatial Feature Transform (SFT) modulation. Notably, despite the absence of high-resolution supervision, we demonstrate that learning at low upsampling ratios and resolutions generalizes remarkably well to significantly higher output scales. Extensive experiments show that JAFAR effectively recovers fine-grained spatial details and consistently outperforms existing feature upsampling methods across a diverse set of downstream tasks. Project page at https://jafar-upsampler.github.io
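To make the described mechanism concrete, below is a minimal, hypothetical PyTorch sketch of an attention-based upsampler in the spirit of the abstract: high-resolution queries come from a shallow image stem, keys come from the low-resolution encoder features, and SFT-style modulation injects semantics into the queries before cross-attention. All module names, dimensions, and the exact placement of the modulation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an attention-based feature upsampler with SFT-style
# modulation, loosely following the abstract above. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFTModulation(nn.Module):
    """Spatial Feature Transform: per-pixel scale/shift predicted from a guidance map."""
    def __init__(self, guide_dim, feat_dim):
        super().__init__()
        self.to_scale = nn.Conv2d(guide_dim, feat_dim, 1)
        self.to_shift = nn.Conv2d(guide_dim, feat_dim, 1)

    def forward(self, feats, guide):
        # Resize the guidance map to the spatial size of the modulated features.
        guide = F.interpolate(guide, size=feats.shape[-2:],
                              mode="bilinear", align_corners=False)
        return feats * (1 + self.to_scale(guide)) + self.to_shift(guide)

class AttentionUpsampler(nn.Module):
    """Cross-attention from high-res queries (low-level image features) to
    low-res keys; values are the raw encoder features. Single head for brevity."""
    def __init__(self, feat_dim=384, qk_dim=64):
        super().__init__()
        self.img_stem = nn.Conv2d(3, qk_dim, 3, padding=1)  # low-level query features
        # Assumed placement: semantics modulate the queries for alignment.
        self.sft = SFTModulation(guide_dim=feat_dim, feat_dim=qk_dim)
        self.to_q = nn.Conv2d(qk_dim, qk_dim, 1)
        self.to_k = nn.Conv2d(feat_dim, qk_dim, 1)

    def forward(self, image, feats, out_size):
        # image: (B, 3, H, W); feats: (B, C, h, w); out_size: (H_out, W_out)
        B, C, _, _ = feats.shape
        q_feats = F.interpolate(self.img_stem(image), size=out_size,
                                mode="bilinear", align_corners=False)
        q_feats = self.sft(q_feats, feats)                  # inject semantics into queries
        q = self.to_q(q_feats).flatten(2).transpose(1, 2)   # (B, HW, d)
        k = self.to_k(feats).flatten(2).transpose(1, 2)     # (B, hw, d)
        v = feats.flatten(2).transpose(1, 2)                # (B, hw, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, C, *out_size)

# Example: upsample 16x16 ViT features to an arbitrary 128x128 target resolution.
upsampler = AttentionUpsampler(feat_dim=384)
image = torch.randn(1, 3, 224, 224)
feats = torch.randn(1, 384, 16, 16)
hi_res = upsampler(image, feats, out_size=(128, 128))      # (1, 384, 128, 128)
```

Because the queries are generated at an arbitrary `out_size`, a model of this shape trained at small upsampling ratios can be queried at much higher output resolutions at inference, which matches the generalization behavior the abstract reports.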
Community
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation (2025)
- Leveraging Depth and Language for Open-Vocabulary Domain-Generalized Semantic Segmentation (2025)
- Vision Transformers with Self-Distilled Registers (2025)
- DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception (2025)
- ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction (2025)
- REN: Fast and Efficient Region Encodings from Patch-Based Image Encoders (2025)
- DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation (2025)