LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
This repository contains the LLaVA-SP model, presented in the paper LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs.
LLaVA-SP proposes a novel approach to enhancing visual representation in Multimodal Large Language Models (MLLMs). By adding only six visual spatial tokens to the original visual tokens, it addresses the limitation of the standard CLIP-ViT encoder in modeling local relationships between adjacent patches, enabling more detailed visual understanding.
Key innovations and advantages of LLaVA-SP include:
- A novel Projector utilizing convolutional kernels to derive visual spatial tokens from ViT patch features, simulating "from central region to global" and "from abstract to specific" visual spatial ordering.
- The application of a cross-attention mechanism to fuse fine-grained visual information, enriching the overall visual representation.
- Two model variants: LLaVA-SP-Cropping (focusing on detail features through progressive cropping) and LLaVA-SP-Pooling (capturing global semantics through adaptive pooling), enabling the model to handle diverse visual understanding tasks; an illustrative sketch of the pooling idea follows this list.
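The following is a minimal, hypothetical sketch of the idea described above, not the authors' implementation: it assumes 24x24 CLIP-ViT patch features and shows one plausible way to derive six spatial tokens via multi-scale pooling and fuse fine-grained detail back in with cross-attention. The class name, pooling scales, and dimensions are illustrative assumptions; refer to the official repository for the actual projector design.

```python
# Illustrative sketch only -- not the LLaVA-SP code.
import torch
import torch.nn as nn


class SpatialTokenSketch(nn.Module):
    def __init__(self, dim: int = 1024, num_tokens: int = 6, grid: int = 24):
        super().__init__()
        self.grid = grid
        # Pool the patch grid at six scales, from a single global summary
        # down to progressively finer spatial resolutions.
        self.pool_sizes = [1, 2, 4, 6, 12, 24][:num_tokens]
        # One projection per scale, producing one spatial token each.
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in self.pool_sizes])
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: [B, grid*grid, dim] from the ViT encoder
        b, n, d = patch_feats.shape
        fmap = patch_feats.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        spatial_tokens = []
        for size, proj in zip(self.pool_sizes, self.proj):
            pooled = nn.functional.adaptive_avg_pool2d(fmap, size)  # [B, d, s, s]
            token = pooled.flatten(2).mean(-1)                      # [B, d]
            spatial_tokens.append(proj(token))
        q = torch.stack(spatial_tokens, dim=1)                      # [B, 6, d]
        # Cross-attention: the spatial tokens query the full patch sequence
        # to pull in fine-grained visual information.
        fused, _ = self.cross_attn(q, patch_feats, patch_feats)
        return fused                                                # [B, 6, d]
```

The six fused tokens would then be appended to the original visual tokens before the language model, which is why the added inference cost stays small.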
Extensive experiments demonstrate that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming state-of-the-art models like LLaVA-1.5 with nearly identical inference latency.
Code and Usage
The official code and models are available at the GitHub repository: https://github.com/CnFaker/LLaVA-SP
You can load and use this model with the transformers library. Please refer to the official GitHub repository for detailed installation instructions and sample usage.
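As a rough starting point, a loading sketch might look like the following. The checkpoint id "CnFaker/LLaVA-SP" is a placeholder assumption, not a confirmed Hugging Face model id, and the exact loading path may differ; follow the instructions in the GitHub repository.

```python
# Minimal loading sketch, assuming a transformers-compatible checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CnFaker/LLaVA-SP"  # placeholder -- check the official repo for the real id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,  # LLaVA-style models often ship custom modeling code
)
```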
Citation
If you find this work useful, please consider citing the paper:
@article{lou2025llava,
  title={LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs},
  author={Lou, Haoran and Fan, Chunxiao and Liu, Ziyan and Wu, Yuexin and Wang, Xinliang},
  journal={arXiv preprint arXiv:2507.00505},
  year={2025}
}