FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Abstract
FG-CLIP 2 is a bilingual vision-language model that enhances fine-grained alignment for English and Chinese through rich fine-grained supervision and a new Textual Intra-modal Contrastive (TIC) loss, achieving state-of-the-art performance across multiple datasets and tasks.
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
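The abstract describes the TIC loss only at a high level, so the following is a minimal sketch of one plausible InfoNCE-style formulation that operates purely within the text modality: each anchor caption is pulled toward a semantically equivalent rewrite and pushed away from mined hard-negative captions. The function name `tic_loss`, the tensor shapes, the paraphrase positives, and the temperature value are assumptions for illustration only and may differ from the actual FG-CLIP 2 objective.

```python
# Hypothetical sketch of a textual intra-modal contrastive (TIC) objective.
# Assumes each anchor caption is paired with one semantically equivalent
# rewrite (positive) and K mined hard-negative captions; the exact loss used
# in FG-CLIP 2 may differ from this illustration.
import torch
import torch.nn.functional as F

def tic_loss(anchor_emb, positive_emb, negative_embs, temperature=0.07):
    """InfoNCE-style contrastive loss computed over text embeddings only.

    anchor_emb:    (B, D)    embeddings of the original captions
    positive_emb:  (B, D)    embeddings of semantically equivalent rewrites
    negative_embs: (B, K, D) embeddings of K hard-negative captions per anchor
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    negatives = F.normalize(negative_embs, dim=-1)

    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)

    # The positive caption occupies index 0 of the logits for every anchor.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, targets)
```

In this sketch, hard negatives would typically be captions that differ from the anchor only in fine-grained details such as object attributes or spatial relations, which is exactly the kind of distinction the abstract says the TIC loss is meant to sharpen.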
Community
The FG-CLIP series is a new generation of text-image cross-modal models designed for fine-grained understanding. FG-CLIP 2, the latest model in the series, supports both Chinese and English. Across 29 datasets spanning 8 task categories, it outperforms strong baselines including SigLIP 2 and MetaCLIP 2, achieving state-of-the-art results in both languages.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval (2025)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception (2025)
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions (2025)
- Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM (2025)
- SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval (2025)