Add comprehensive model card for MLLMSeg_InternVL2_5_4B_RES (#1)

Co-authored-by: Niels Rogge <[email protected]>

README.md

---
license: mit
pipeline_tag: image-segmentation
library_name: transformers
---

# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

This repository contains the `MLLMSeg_InternVL2_5_4B_RES` model, which was presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).

**MLLMSeg** is a framework for Referring Expression Segmentation (RES) that fully exploits the visual detail features already encoded by the vision encoder of a Multimodal Large Language Model (MLLM), without introducing an extra visual encoder. A detail-enhanced and semantic-consistent feature fusion (DSFF) module combines these detail features with the semantic features produced by the LLM, and a light-weight mask decoder with only 34M parameters predicts the final mask. This design strikes a better balance between segmentation accuracy and computational cost than existing methods.
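
For intuition only, here is a minimal PyTorch-style sketch of that idea: detail features from the MLLM's vision encoder are fused with a semantic embedding produced by the LLM, and a small decoder turns the fused feature map into mask logits. All module names, dimensions, and layer choices below are illustrative assumptions, not the actual MLLMSeg/DSFF implementation; see the repository linked below for the real code.

```python
# Illustrative sketch only -- NOT the actual MLLMSeg implementation.
# Dimensions and layers are hypothetical placeholders.
import torch
import torch.nn as nn

class LightMaskDecoderSketch(nn.Module):
    """Toy detail+semantic fusion followed by a small upsampling decoder."""
    def __init__(self, vis_dim=1024, llm_dim=2048, hidden=256):
        super().__init__()
        self.proj_vis = nn.Linear(vis_dim, hidden)   # project vision-encoder detail features
        self.proj_sem = nn.Linear(llm_dim, hidden)   # project the LLM's semantic embedding
        self.fuse = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.GELU())
        self.decode = nn.Sequential(                 # light-weight decoder -> 1-channel mask logits
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, vis_tokens, sem_token, grid=32):
        # vis_tokens: (B, N, vis_dim) patch features from the vision encoder, N = grid * grid
        # sem_token:  (B, llm_dim) semantic embedding from the LLM
        v = self.proj_vis(vis_tokens)                            # (B, N, hidden)
        s = self.proj_sem(sem_token).unsqueeze(1).expand_as(v)   # broadcast semantics to every patch
        fused = self.fuse(torch.cat([v, s], dim=-1))             # (B, N, hidden)
        fmap = fused.transpose(1, 2).reshape(v.size(0), -1, grid, grid)
        return self.decode(fmap)                                 # (B, 1, 4*grid, 4*grid) mask logits

decoder = LightMaskDecoderSketch()
logits = decoder(torch.randn(1, 1024, 1024), torch.randn(1, 2048))
print(logits.shape)  # torch.Size([1, 1, 128, 128])
```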

Code: https://github.com/jcwang0602/MLLMSeg

## Quick Start (Inference)

### Installation

First, install the `transformers` library and other dependencies. For a complete installation guide, please refer to the [official GitHub repository](https://github.com/jcwang0602/MLLMSeg) and the [InternVL2 documentation](https://internvl.readthedocs.io/en/latest/get_started/installation.html).

```bash
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt  # requirements.txt is in the MLLMSeg GitHub repository
pip install flash-attn==2.3.6 --no-build-isolation  # Note: a GPU is required to build flash-attn
```
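
Optionally, you can sanity-check the environment before downloading the model weights (this check is not part of the official instructions):

```python
# Quick environment check inside the mllmseg environment.
import torch
print(torch.__version__, torch.version.cuda, torch.cuda.is_available())

try:
    import flash_attn  # only needed for FlashAttention kernels
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed")
```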

### Inference Example

Here's an example of how to perform inference with the model:

```python
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

# Load the model and tokenizer
path = 'jcwang0602/MLLMSeg_InternVL2_5_4B_RES'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Prepare image and question (replace './examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png' with your image path)
pixel_values = load_image('./examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', max_num=6).to(torch.bfloat16).cuda()
generation_config = dict(max_new_tokens=1024, do_sample=True)

question = "In the screenshot of this web page, please give me the coordinates of the element I want to click on according to my instructions(with point).\n\"'Champions League' link\""

# Chat with the model
response, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
```
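
Note that `model.chat` returns a textual response; refer to the [GitHub repository](https://github.com/jcwang0602/MLLMSeg) for the code that produces the actual segmentation masks. As an illustration only, the sketch below shows one way to overlay a predicted binary mask on the input image for inspection; the `mask` variable is a hypothetical `H x W` NumPy array obtained from such a script, and `overlay_mask` is not part of the released code.

```python
# Hypothetical helper: overlay a binary mask (H x W NumPy array) on the input image.
# `mask` is assumed to come from the MLLMSeg segmentation scripts; it is not returned by model.chat.
import numpy as np
from PIL import Image

def overlay_mask(image_path, mask, color=(255, 0, 0), alpha=0.5):
    image = Image.open(image_path).convert('RGB')
    mask_img = Image.fromarray((np.asarray(mask) > 0).astype(np.uint8) * 255).resize(image.size, Image.NEAREST)
    overlay = Image.new('RGB', image.size, color)
    # paint `color` (at the given alpha) only where the mask is set
    return Image.composite(Image.blend(image, overlay, alpha), image, mask_img)

# Example call with a dummy mask, just to show the usage:
# dummy = np.zeros((448, 448), dtype=np.uint8); dummy[100:300, 150:350] = 1
# overlay_mask('./examples/images/web_dfacd48d-d2c2-492f-b94c-41e6a34ea99f.png', dummy).save('overlay.png')
```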

## Performance Metrics

The following tables showcase the performance of MLLMSeg on various benchmarks, as presented in the original repository:

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_gres.png" width="800">

## Visualization

Visual examples of MLLMSeg's performance:

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/gres.png" width="800">

## Citation

If our work is useful for your research, please consider citing:

```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
      title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder},
      author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
      year={2025},
      eprint={2508.04107},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.04107},
}
```