FG-CLIP 2: A Bilingual Fine-grained Vision-Language Alignment Model
Abstract
FG-CLIP 2 is a bilingual vision-language model that enhances fine-grained alignment for English and Chinese through rich fine-grained supervision and a new Textual Intra-modal Contrastive (TIC) loss, achieving state-of-the-art performance across multiple datasets and tasks.
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. Our approach leverages rich fine-grained supervision, including region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages. We release the model, code, and benchmark to facilitate future research on bilingual fine-grained alignment.
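The abstract describes the TIC loss only at a high level, so the following is a minimal sketch of one plausible InfoNCE-style formulation that operates purely within the text modality: each anchor caption is pulled toward a semantically equivalent rewrite and pushed away from mined hard-negative captions. The function name `tic_loss`, the tensor shapes, the paraphrase positives, and the temperature value are assumptions for illustration only and may differ from the actual FG-CLIP 2 objective.

```python
# Hypothetical sketch of a textual intra-modal contrastive (TIC) objective.
# Assumes each anchor caption is paired with one semantically equivalent
# rewrite (positive) and K mined hard-negative captions; the exact loss used
# in FG-CLIP 2 may differ from this illustration.
import torch
import torch.nn.functional as F

def tic_loss(anchor_emb, positive_emb, negative_embs, temperature=0.07):
    """InfoNCE-style contrastive loss computed over text embeddings only.

    anchor_emb:    (B, D)    embeddings of the original captions
    positive_emb:  (B, D)    embeddings of semantically equivalent rewrites
    negative_embs: (B, K, D) embeddings of K hard-negative captions per anchor
    """
    anchor = F.normalize(anchor_emb, dim=-1)
    positive = F.normalize(positive_emb, dim=-1)
    negatives = F.normalize(negative_embs, dim=-1)

    pos_sim = (anchor * positive).sum(dim=-1, keepdim=True)      # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)      # (B, K)

    # The positive caption occupies index 0 of the logits for every anchor.
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # (B, 1+K)
    targets = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, targets)
```

In this sketch, hard negatives would typically be captions that differ from the anchor only in fine-grained details such as object attributes or spatial relations, which is exactly the kind of distinction the abstract says the TIC loss is meant to sharpen.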
Community
The FG-CLIP series is a new generation of text-image cross-modal models designed for fine-grained understanding. FG-CLIP 2, the latest model in the series, supports both Chinese and English. Across 29 datasets spanning 8 task categories, it outperforms strong baselines including SigLIP 2 and MetaCLIP 2, achieving state-of-the-art results in both languages.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval (2025)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception (2025)
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions (2025)
- Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval (2025)
- Visual Representation Alignment for Multimodal Large Language Models (2025)
- WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM (2025)
- SETR: A Two-Stage Semantic-Enhanced Framework for Zero-Shot Composed Image Retrieval (2025)