---
license: mit
pipeline_tag: image-segmentation
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-2B
---

# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder

This repository contains the `MLLMSeg_InternVL2_5_1B_RES` model, which was presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).

**MLLMSeg** segments image regions specified by referring expressions. While Multimodal Large Language Models (MLLMs) excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction tasks such as segmentation. MLLMSeg addresses this with a framework that fully exploits the visual detail features already encoded by the MLLM's vision encoder, eliminating the need for an extra visual encoder. A detail-enhanced and semantic-consistent feature fusion (DSFF) module integrates these visual details with the semantic features produced by the Large Language Model (LLM), and a lightweight mask decoder with only 34M parameters predicts precise masks from the fused features. This design strikes a better balance between performance and computational cost than existing SAM-based and SAM-free methods.
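As a rough illustration of this pipeline, the sketch below uses hypothetical module names, feature dimensions, and a stand-in fusion block; it is not the released implementation (see the GitHub repository for that).

```python
# Conceptual sketch only: vision-encoder detail features and an LLM semantic
# feature are fused, then a small decoder predicts mask logits. Shapes, the
# fusion block, and the "segmentation-token" semantic vector are assumptions.
import torch
import torch.nn as nn

class LightweightMaskDecoder(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=2048, hidden=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)   # project vision-encoder detail features
        self.sem_proj = nn.Linear(llm_dim, hidden)   # project LLM semantic feature
        self.fuse = nn.Sequential(                   # stand-in for the DSFF fusion module
            nn.Linear(2 * hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.mask_head = nn.Conv2d(hidden, 1, kernel_size=1)

    def forward(self, vis_feats, sem_feat):
        # vis_feats: (B, H*W, vis_dim) patch features from the MLLM's vision encoder
        # sem_feat:  (B, llm_dim) semantic vector from the LLM (assumed one per image)
        B, N, _ = vis_feats.shape
        H = W = int(N ** 0.5)
        v = self.vis_proj(vis_feats)                              # (B, N, hidden)
        s = self.sem_proj(sem_feat).unsqueeze(1).expand(-1, N, -1)
        fused = self.fuse(torch.cat([v, s], dim=-1))              # (B, N, hidden)
        fused = fused.transpose(1, 2).reshape(B, -1, H, W)
        return self.mask_head(fused)                              # (B, 1, H, W) mask logits
```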

The official code is available on GitHub: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg)

## Model Architecture
<p align="center">
  <img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/method.png" width="800">
</p>

## Quick Start / How to Use

This section shows how to run inference with the pre-trained model. Our models accept images of any size as input. Coordinate outputs are normalized to a 0-1000 range relative to the image (e.g., a bounding box given by its top-left and bottom-right corners), so for visualization you need to rescale them back to the original image dimensions, as illustrated below.
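For example (hypothetical helper, not part of the released code), converting a normalized bounding box back to pixel coordinates of a 1280x720 image:

```python
# Rescale a box from the model's normalized 0-1000 coordinate space to pixel
# coordinates of the original image (illustrative helper).
def denormalize_box(box, width, height):
    x1, y1, x2, y2 = box
    return (x1 / 1000 * width, y1 / 1000 * height,
            x2 / 1000 * width, y2 / 1000 * height)

print(denormalize_box((250, 100, 750, 900), width=1280, height=720))
# (320.0, 72.0, 960.0, 648.0)
```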

### Installation

First, clone the repository and install the dependencies (including `transformers`). Note that `flash-attn` requires a GPU at install time.

```bash
git clone https://github.com/jcwang0602/MLLMSeg.git
cd MLLMSeg
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118  # adjust for your CUDA version
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation  # requires a GPU to install
```

### Usage

For the full inference and visualization scripts, refer to the GitHub README: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg). The model's response contains the segmentation mask coordinates in a normalized 0-1000 format; parse these coordinates and visualize the mask following the paper's methodology or the repository's example scripts.
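Below is a minimal inference sketch. It assumes the InternVL-style `model.chat` API inherited from the base model, a simplified single-tile preprocessing instead of InternVL's dynamic tiling, and an illustrative prompt and repository id; the scripts in the GitHub repository are the authoritative reference.

```python
# Minimal inference sketch; repo id, prompt format, preprocessing, and generation
# settings are illustrative assumptions, not the official pipeline.
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoModel, AutoTokenizer

path = "MLLMSeg_InternVL2_5_1B_RES"  # replace with this repository's full Hub id or a local path

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# ImageNet-normalized single 448x448 tile; the input image itself may be any size.
preprocess = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nPlease segment: the person on the left."  # illustrative prompt
response = model.chat(tokenizer, pixel_values, question, dict(max_new_tokens=512, do_sample=False))

# `response` contains coordinates normalized to 0-1000; rescale them to the
# original image size before drawing the mask (see the helper above).
print(response)
```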

## Performance Metrics

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_gres.png" width="800">

## Visualization

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/gres.png" width="800">

## Citation
If our work is useful for your research, please consider citing:

```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
      title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder}, 
      author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
      year={2025},
      eprint={2508.04107},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.04107}, 
}
```