---
license: mit
pipeline_tag: image-segmentation
library_name: transformers
base_model:
- OpenGVLab/InternVL2_5-2B
---
# MLLMSeg: Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
This repository contains the `MLLMSeg_InternVL2_5_1B_RES` model, which was presented in the paper [Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder](https://huggingface.co/papers/2508.04107).
**MLLMSeg** targets referring expression segmentation: segmenting the image region specified by a natural-language referring expression. While Multimodal Large Language Models (MLLMs) excel at semantic understanding, their token-generation paradigm struggles with pixel-level dense prediction tasks such as segmentation. MLLMSeg addresses this with a framework that fully exploits the visual detail features already encoded by the MLLM's vision encoder, removing the need for an extra visual encoder. A detail-enhanced and semantic-consistent feature fusion module (DSFF) integrates these visual details with the semantic features produced by the Large Language Model (LLM), and a lightweight mask decoder (only 34M parameters) turns the fused features into precise mask predictions. This design strikes a better balance between performance and computational cost than existing SAM-based and SAM-free methods.
The official code is available on GitHub: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg)
## Model Architecture
<p align="center">
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/method.png" width="800">
</p>
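To make the data flow in the figure concrete, the sketch below shows one way a lightweight decoder could fuse vision-encoder detail features with an LLM semantic feature and upsample them into a mask. It is illustrative only: the layer choices, channel sizes, and fusion rule are assumptions for exposition, not the released DSFF or decoder implementation (see the GitHub code for that).
```python
# Conceptual sketch only: layers, dimensions, and the fusion rule are assumptions,
# not the released MLLMSeg implementation.
import torch
import torch.nn as nn

class LightweightMaskDecoder(nn.Module):
    """Fuses vision-encoder detail features with an LLM semantic feature,
    then upsamples the fused map into mask logits."""
    def __init__(self, vis_dim=1024, llm_dim=2048, hidden=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hidden, kernel_size=1)   # detail branch
        self.sem_proj = nn.Linear(llm_dim, hidden)                  # semantic branch
        self.fuse = nn.Sequential(                                  # detail-semantic fusion
            nn.Conv2d(hidden * 2, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
        )
        self.head = nn.Sequential(                                  # mask prediction head
            nn.ConvTranspose2d(hidden, hidden // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(hidden // 2, hidden // 4, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(hidden // 4, 1, kernel_size=1),
        )

    def forward(self, vis_feats, sem_token):
        # vis_feats: (B, vis_dim, H, W) spatial features from the MLLM's vision encoder
        # sem_token: (B, llm_dim) segmentation-related hidden state from the LLM
        v = self.vis_proj(vis_feats)
        s = self.sem_proj(sem_token)[..., None, None].expand_as(v)
        fused = self.fuse(torch.cat([v, s], dim=1))
        return self.head(fused)  # (B, 1, 4H, 4W) mask logits

decoder = LightweightMaskDecoder()
mask_logits = decoder(torch.randn(1, 1024, 32, 32), torch.randn(1, 2048))
print(mask_logits.shape)  # torch.Size([1, 1, 128, 128])
```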
## Quick Start / How to Use
This section shows how to run inference with our pre-trained model. The model accepts images of any size as input and reports locations as relative coordinates normalized to a 0-1000 range (e.g., a bounding box given by its top-left and bottom-right corners). For visualization, convert these relative coordinates back to the original image dimensions, as sketched in the helper below.
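The following snippet is an illustrative helper (not part of the released code) that rescales a 0-1000-normalized box back to pixel coordinates; predicted points or polygons can be rescaled the same way.
```python
def denormalize_box(box, image_width, image_height, scale=1000):
    """Convert an [x1, y1, x2, y2] box in 0-1000 relative coordinates to pixels."""
    x1, y1, x2, y2 = box
    return (
        x1 / scale * image_width,
        y1 / scale * image_height,
        x2 / scale * image_width,
        y2 / scale * image_height,
    )

# Example: a predicted box rescaled to a 640x480 original image.
print(denormalize_box([250, 100, 750, 900], 640, 480))
# -> (160.0, 48.0, 480.0, 432.0)
```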
### Installation
First, install the `transformers` library and other necessary dependencies. Note that `flash-attn` requires a GPU for installation.
```bash
conda create -n mllmseg python==3.10.18 -y
conda activate mllmseg
git clone https://github.com/jcwang0602/MLLMSeg.git
cd MLLMSeg
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu118  # Adjust for your CUDA version
pip install -r requirements.txt  # Dependencies listed in the cloned repository
pip install flash-attn==2.3.6 --no-build-isolation  # Note: requires a GPU to install
```
## Usage
For the full inference and evaluation scripts, refer to the GitHub README: [https://github.com/jcwang0602/MLLMSeg](https://github.com/jcwang0602/MLLMSeg). The model's response contains the segmentation result as coordinates normalized to the 0-1000 range described above; parse these coordinates and visualize the mask following the paper's methodology or the repository's example scripts.
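As a starting point, here is a minimal inference sketch. It assumes the checkpoint exposes the InternVL2.5-style `chat()` interface through `trust_remote_code=True` and uses a simplified single-tile 448x448 preprocessing; the prompt wording, image tiling, and mask-decoding step are assumptions, so follow the repository's example scripts for the official pipeline.
```python
# Minimal inference sketch (not the official script). Assumes the InternVL2.5-style
# chat() API is preserved by this checkpoint's trust_remote_code implementation.
import torch
from PIL import Image
from torchvision import transforms as T
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "jcwang0602/MLLMSeg_InternVL2_5_1B_RES"  # adjust if your repo id differs

model = AutoModel.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True, use_fast=False)

# Single 448x448 tile with ImageNet normalization (the repo's loader may use dynamic tiling).
preprocess = T.Compose([
    T.Resize((448, 448), interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("example.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nPlease segment the dog on the left."  # example referring expression
generation_config = dict(max_new_tokens=1024, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)  # coordinates are normalized to 0-1000; rescale to the original image size
```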
## Performance Metrics
### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_res.png" width="800">
### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_rec.png" width="800">
### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/tab_gres.png" width="800">
## Visualization
### Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/res.png" width="800">
### Referring Expression Comprehension
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/rec.png" width="800">
### Generalized Referring Expression Segmentation
<img src="https://github.com/jcwang0602/MLLMSeg/raw/main/assets/gres.png" width="800">
## Citation
If our work is useful for your research, please consider citing:
```bibtex
@misc{wang2025unlockingpotentialmllmsreferring,
title={Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder},
author={Jingchao Wang and Zhijian Wu and Dingjiang Huang and Yefeng Zheng and Hong Wang},
year={2025},
eprint={2508.04107},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.04107},
}
```