---
license: mit
pipeline_tag: image-segmentation
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-2B
---

# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
This repository contains the `MLLMSeg_InternVL2_5_1B_RES` model, which was presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).

**MLLMSeg** segments the image region specified by a referring expression. While Multimodal Large Language Models (MLLMs) excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction tasks such as segmentation. MLLMSeg addresses this with a framework that fully exploits the visual detail features already encoded by the MLLM's vision encoder, eliminating the need for an extra visual encoder. It further introduces a detail-enhanced and semantic-consistent feature fusion (DSFF) module that integrates these visual details with the semantic features produced by the Large Language Model (LLM). Finally, a lightweight mask decoder with only 34M parameters turns the fused features into precise mask predictions. This design strikes a better balance between performance and computational cost than existing SAM-based and SAM-free methods.

The official code is available on GitHub: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg)
## Model Architecture

<p align="center">
  <img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/method.png" width="800">
</p>
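
To make the architecture above more concrete, here is a rough conceptual sketch of the two key components, the DSFF fusion module and the lightweight mask decoder. This is not the official implementation: the module names, channel sizes, token shapes, and fusion strategy are illustrative assumptions; see the GitHub repository for the real code.

```python
# Conceptual sketch only: all names, dimensions, and the fusion strategy are
# illustrative assumptions, not the official MLLMSeg implementation.
import torch
import torch.nn as nn


class DSFFBlock(nn.Module):
    """Detail-enhanced, semantic-consistent feature fusion (conceptual).

    Fuses patch-level detail tokens from the MLLM's vision encoder with
    semantic tokens taken from the LLM's hidden states.
    """

    def __init__(self, detail_dim: int, semantic_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.detail_proj = nn.Linear(detail_dim, hidden_dim)
        self.semantic_proj = nn.Linear(semantic_dim, hidden_dim)
        self.fuse = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.GELU())

    def forward(self, detail_tokens, semantic_tokens):
        # detail_tokens: (B, N, detail_dim); semantic_tokens: (B, N, semantic_dim)
        d = self.detail_proj(detail_tokens)
        s = self.semantic_proj(semantic_tokens)
        return self.fuse(torch.cat([d, s], dim=-1))  # (B, N, hidden_dim)


class LightweightMaskDecoder(nn.Module):
    """Upsamples fused tokens into dense mask logits (conceptual, not 34M-exact)."""

    def __init__(self, hidden_dim: int = 256, patch_grid: int = 32):
        super().__init__()
        self.patch_grid = patch_grid
        self.head = nn.Sequential(
            nn.ConvTranspose2d(hidden_dim, hidden_dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(hidden_dim // 2, 1, kernel_size=2, stride=2),
        )

    def forward(self, fused_tokens):
        # fused_tokens: (B, N, hidden_dim) with N = patch_grid ** 2
        b, n, c = fused_tokens.shape
        feat = fused_tokens.transpose(1, 2).reshape(b, c, self.patch_grid, self.patch_grid)
        return self.head(feat)  # (B, 1, 4 * patch_grid, 4 * patch_grid) mask logits
```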

## Quick Start / How to Use
This section provides instructions on how to use our pre-trained model for inference. Our models accept images of any size as input. Model outputs are expressed as relative coordinates normalized to a 0-1000 range (e.g., a bounding box given by its top-left and bottom-right corners), so for visualization you will need to convert these relative coordinates back to the original image dimensions.
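
As a reference for that conversion, here is a minimal sketch of mapping a predicted box from the normalized 0-1000 space back to pixel coordinates; the `[x1, y1, x2, y2]` layout and the helper name are illustrative, not a fixed output format.

```python
def denormalize_box(box_0_1000, image_width, image_height):
    """Map an [x1, y1, x2, y2] box from the 0-1000 space to pixel coordinates."""
    x1, y1, x2, y2 = box_0_1000
    return [
        x1 / 1000.0 * image_width,
        y1 / 1000.0 * image_height,
        x2 / 1000.0 * image_width,
        y2 / 1000.0 * image_height,
    ]


# Example: a box predicted on a 1280x720 image.
print(denormalize_box([125, 250, 500, 750], 1280, 720))  # [160.0, 180.0, 640.0, 540.0]
```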

### Installation

First, create the environment, clone the repository, and install the necessary dependencies, including `transformers`. Note that `flash-attn` requires a GPU for installation.
```bash
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
git clone https://github.com/jcwang0602/MLLMSeg.git
cd MLLMSeg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118  # Adjust for your CUDA version
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation  # Note: requires a GPU to install
```
## Usage

For the complete inference pipeline, refer to the GitHub README. The model's response contains the segmentation output in a specific normalized 0-1000 coordinate format; you will need to parse these coordinates and visualize the mask as per the paper's methodology or the example scripts in the repository.
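
As a starting point, the snippet below sketches a minimal inference call. It assumes the checkpoint follows the standard InternVL2.5 `model.chat` interface via `trust_remote_code=True`; the repository id, the single-tile 448x448 preprocessing, and the prompt wording are assumptions here, so consult the official example scripts for the exact interface and prompt template.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode
from transformers import AutoModel, AutoTokenizer

# Assumed repository id; replace it with the actual id of this model card if it differs.
path = "jcwang0602/MLLMSeg_InternVL2_5_1B_RES"

model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Single-tile 448x448 preprocessing with ImageNet statistics (illustrative; the
# official scripts may use InternVL's dynamic multi-tile preprocessing instead).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
transform = T.Compose([
    T.Lambda(lambda img: img.convert("RGB")),
    T.Resize((448, 448), interpolation=InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
])
pixel_values = transform(Image.open("example.jpg")).unsqueeze(0).to(torch.bfloat16).cuda()

# The prompt wording is an assumption; see the GitHub README for the exact template.
question = "<image>\nPlease segment the person on the left."
generation_config = dict(max_new_tokens=256, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
# 'response' encodes the prediction in the normalized 0-1000 coordinate format
# described above; rescale it to the original image size for visualization.
```

For the exact prompt template and the routine that turns the response into a binary mask overlay, follow the example scripts in the GitHub repository.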

## Performance Metrics

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_gres.png" width="800">
## Visualization

### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/res.png" width="800">

### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/rec.png" width="800">

### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/gres.png" width="800">
## Citation

If our work is useful for your research, please consider citing:

```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
      title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder},
      author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
      year={2025},
      eprint={2508.04107},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.04107},
}
```