---
license: mit
pipeline_tag: image-segmentation
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-2B
---
# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
This repository contains the `MLLMSeg_InternVL2_5_1B_RES` model, which was presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).
**MLLMSeg** targets referring expression segmentation: segmenting the image region specified by a natural-language referring expression. While Multimodal Large Language Models (MLLMs) excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction tasks such as segmentation. MLLMSeg addresses this with a framework that fully exploits the visual detail features already encoded by the MLLM's vision encoder, removing the need for an extra visual encoder. A detail-enhanced and semantic-consistent feature fusion module (DSFF) integrates these visual details with the semantic features produced by the Large Language Model (LLM), and a lightweight mask decoder (only 34M parameters) turns the fused features into precise mask predictions. This design strikes a better balance between performance and computational cost than existing SAM-based and SAM-free methods.
The official code is available on GitHub: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg)
## Model Architecture
<p align="center">
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/method.png" width="800">
</p>
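To make the data flow in the figure concrete, the sketch below shows one way a lightweight decoder could fuse vision-encoder detail features with an LLM semantic feature and upsample them into a mask. It is illustrative only: the layer choices, channel sizes, and fusion rule are assumptions for exposition, not the released DSFF or decoder implementation (see the GitHub code for that).
```python
# Conceptual sketch only: layers, dimensions, and the fusion rule are assumptions,
# not the released MLLMSeg implementation.
import torch
import torch.nn as nn

class LightweightMaskDecoder(nn.Module):
    """Fuses vision-encoder detail features with an LLM semantic feature,
    then upsamples the fused map into mask logits."""
    def __init__(self, vis_dim=1024, llm_dim=2048, hidden=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hidden, kernel_size=1)   # detail branch
        self.sem_proj = nn.Linear(llm_dim, hidden)                  # semantic branch
        self.fuse = nn.Sequential(                                  # detail-semantic fusion
            nn.Conv2d(hidden * 2, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
        )
        self.head = nn.Sequential(                                  # mask prediction head
            nn.ConvTranspose2d(hidden, hidden // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(hidden // 2, hidden // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(hidden // 4, 1, kernel_size=1),
        )

    def forward(self, vis_feats, sem_token):
        # vis_feats: (B, vis_dim, H, W) spatial features from the MLLM's vision encoder
        # sem_token: (B, llm_dim) segmentation-related hidden state from the LLM
        v = self.vis_proj(vis_feats)
        s = self.sem_proj(sem_token)[..., None, None].expand_as(v)
        fused = self.fuse(torch.cat([v, s], dim=1))
        return self.head(fused)  # (B, 1, 4H, 4W) mask logits

decoder = LightweightMaskDecoder()
mask_logits = decoder(torch.randn(1, 1024, 32, 32), torch.randn(1, 2048))
print(mask_logits.shape)  # torch.Size([1, 1, 128, 128])
```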
## Quick Start / How to Use
This section shows how to run inference with our pre-trained model. The model accepts images of any size as input and reports locations as relative coordinates normalized to a 0-1000 range (e.g., a bounding box given by its top-left and bottom-right corners). For visualization, convert these relative coordinates back to the original image dimensions, as sketched in the helper below.
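The following snippet is an illustrative helper (not part of the released code) that rescales a 0-1000-normalized box back to pixel coordinates; predicted points or polygons can be rescaled the same way.
```python
def denormalize_box(box, image_width, image_height, scale=1000):
    """Convert an [x1, y1, x2, y2] box in 0-1000 relative coordinates to pixels."""
    x1, y1, x2, y2 = box
    return (
        x1 / scale * image_width,
        y1 / scale * image_height,
        x2 / scale * image_width,
        y2 / scale * image_height,
    )

# Example: a predicted box rescaled to a 640x480 original image.
print(denormalize_box([250, 100, 750, 900], 640, 480))
# -> (160.0, 48.0, 480.0, 432.0)
```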
### Installation
First, install the `transformers` library and other necessary dependencies. Note that `flash-attn` requires a GPU for installation.
```bash
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
git clone https://github.com/jcwang0602/MLLMSeg.git
cd MLLMSeg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118  # Adjust for your CUDA version
pip install -r requirements.txt  # Dependencies listed in the cloned repository
pip install flash-attn==2.3.6 --no-build-isolation  # Note: requires a GPU to install
```
## Usage
For the full inference and evaluation scripts, refer to the GitHub README: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg). The model's response contains the segmentation result as coordinates normalized to the 0-1000 range described above; parse these coordinates and visualize the mask following the paper's methodology or the repository's example scripts.
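As a starting point, here is a minimal inference sketch. It assumes the checkpoint exposes the InternVL2.5-style `chat()` interface through `trust_remote_code=True` and uses a simplified single-tile 448x448 preprocessing; the prompt wording, image tiling, and mask-decoding step are assumptions, so follow the repository's example scripts for the official pipeline.
```python
# Minimal inference sketch (not the official script). Assumes the InternVL2.5-style
# chat() API is preserved by this checkpoint's trust_remote_code implementation.
import torch
from PIL import Image
from torchvision import transforms as T
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jcwang0602/MLLMSeg_InternVL2_5_1B_RES"  # adjust if your repo id differs

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# Single 448x448 tile with ImageNet normalization (the repo's loader may use dynamic tiling).
preprocess = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nPlease segment the dog on the left."  # example referring expression
generation_config = dict(max_new_tokens=1024, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)  # coordinates are normalized to 0-1000; rescale to the original image size
```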
## Performance Metrics
### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_res.png" width="800">
### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_rec.png" width="800">
### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_gres.png" width="800">
## Visualization
### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/res.png" width="800">
### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/rec.png" width="800">
### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/gres.png" width="800">
## Citation
If our work is useful for your research, please consider citing:
```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder},
author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
year={2025},
eprint={2508.04107},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.04107},
}
```