---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

This repository contains the LLaVA-SP model, presented in the paper [LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs](https://huggingface.co/papers/2507.00505).

LLaVA-SP proposes a novel approach to enhancing the visual representation in Multimodal Large Language Models (MLLMs). It addresses the limitation of the standard CLIP-ViT encoder in modeling local relationships between adjacent patches, enabling more fine-grained visual understanding. This is achieved by **adding only six visual spatial tokens** to the original visual tokens.

Key innovations and advantages of LLaVA-SP include:

- A novel projector that uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial orderings: "from central region to global" and "from abstract to specific".
- A cross-attention mechanism that fuses fine-grained visual information into the spatial tokens, enriching the overall visual representation.
- Two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. A conceptual sketch of the projector idea appears at the end of this card.

Extensive experiments demonstrate that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming state-of-the-art models such as LLaVA-1.5 with nearly identical inference latency.

## Code and Usage

The official code and models are available at the GitHub repository: [https://github.com/CnFaker/LLaVA-SP](https://github.com/CnFaker/LLaVA-SP)

You can load and use this model with the `transformers` library; a hedged loading sketch is included at the end of this card. Please refer to the official GitHub repository for detailed installation instructions and sample usage.

## Citation

If you find this work useful, please consider citing the paper:

```bibtex
@article{lou2025llava,
  title={LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs},
  author={Lou, Haoran and Fan, Chunxiao and Liu, Ziyan and Wu, Yuexin and Wang, Xinliang},
  journal={arXiv preprint arXiv:2507.00505},
  year={2025}
}
```
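
## Projector Concept (Unofficial Sketch)

The snippet below is a minimal, unofficial sketch of the projector idea described above: six extra tokens are summarized from nested central crops of the ViT feature grid ("from central region to global") and then refined with cross-attention over all patch features. It loosely mirrors the LLaVA-SP-Cropping variant, with average pooling standing in for the learned convolutional kernels; every layer choice, size, and name here is an assumption, not the authors' implementation (see the official repository for that).

```python
import torch
import torch.nn as nn

class SpatialTokenSketch(nn.Module):
    # Hypothetical illustration only, not the official LLaVA-SP code:
    # derive six "visual spatial tokens" from nested centered crops of
    # the patch-feature grid, then fuse fine-grained detail back in via
    # cross-attention over the full set of ViT patch features.
    def __init__(self, dim=1024, grid=24, crops=(4, 8, 12, 16, 20, 24)):
        super().__init__()
        self.grid = grid
        self.crops = crops                       # crop sizes, center -> global
        self.pool = nn.AdaptiveAvgPool2d(1)      # one summary vector per crop
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patches):                  # patches: (B, grid*grid, dim)
        b, n, d = patches.shape
        fmap = patches.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        tokens = []
        for s in self.crops:                     # progressively larger crops
            o = (self.grid - s) // 2             # offset of the centered crop
            crop = fmap[:, :, o:o + s, o:o + s]
            tokens.append(self.pool(crop).flatten(1))   # (B, dim)
        queries = torch.stack(tokens, dim=1)     # (B, 6, dim)
        fused, _ = self.attn(queries, patches, patches) # fuse fine detail
        return fused                             # six spatial tokens

# Shapes for a CLIP-ViT-L/14 grid at 336px: 24x24 = 576 patches, dim 1024.
tokens = SpatialTokenSketch()(torch.randn(2, 576, 1024))
print(tokens.shape)  # torch.Size([2, 6, 1024])
```

The pooling variant would instead adaptively pool the full feature map at several resolutions rather than cropping it.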
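
## Example Usage (Sketch)

As a rough starting point, here is one plausible way to load a LLaVA-style checkpoint with `transformers`. The repository ID, model class, and prompt template below are assumptions based on the standard LLaVA-1.5 loading path and are not confirmed for LLaVA-SP; the official GitHub repository documents the supported setup.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical repository ID -- check the official GitHub repo for the
# actual released checkpoints.
model_id = "CnFaker/LLaVA-SP"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw)
# LLaVA-1.5-style prompt template (an assumption for this checkpoint).
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```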