---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs

This repository contains the LLaVA-SP model, presented in the paper [LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs](https://huggingface.co/papers/2507.00505).

LLaVA-SP proposes a novel approach to enhancing the visual representation in Multimodal Large Language Models (MLLMs). It addresses the limitation of the standard CLIP-ViT encoder in modeling local relationships between adjacent patches, enabling more fine-grained visual understanding. This is achieved by **adding only six visual spatial tokens** to the original visual tokens.

Key innovations and advantages of LLaVA-SP include:

- A novel projector that uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial orderings: "from central region to global" and "from abstract to specific".
- A cross-attention mechanism that fuses fine-grained visual information into the spatial tokens, enriching the overall visual representation.
- Two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. A conceptual sketch of the projector idea appears at the end of this card.

Extensive experiments demonstrate that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming state-of-the-art models such as LLaVA-1.5 with nearly identical inference latency.

## Code and Usage

The official code and models are available at the GitHub repository: [https://github.com/CnFaker/LLaVA-SP](https://github.com/CnFaker/LLaVA-SP)

You can load and use this model with the `transformers` library; a hedged loading sketch is included at the end of this card. Please refer to the official GitHub repository for detailed installation instructions and sample usage.

## Citation

If you find this work useful, please consider citing the paper:

```bibtex
@article{lou2025llava,
  title={LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs},
  author={Lou, Haoran and Fan, Chunxiao and Liu, Ziyan and Wu, Yuexin and Wang, Xinliang},
  journal={arXiv preprint arXiv:2507.00505},
  year={2025}
}
```
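
## Projector Concept (Unofficial Sketch)

The snippet below is a minimal, unofficial sketch of the projector idea described above: six extra tokens are summarized from nested central crops of the ViT feature grid ("from central region to global") and then refined with cross-attention over all patch features. It loosely mirrors the LLaVA-SP-Cropping variant, with average pooling standing in for the learned convolutional kernels; every layer choice, size, and name here is an assumption, not the authors' implementation (see the official repository for that).

```python
import torch
import torch.nn as nn

class SpatialTokenSketch(nn.Module):
    # Hypothetical illustration only, not the official LLaVA-SP code:
    # derive six "visual spatial tokens" from nested centered crops of
    # the patch-feature grid, then fuse fine-grained detail back in via
    # cross-attention over the full set of ViT patch features.
    def __init__(self, dim=1024, grid=24, crops=(4, 8, 12, 16, 20, 24)):
        super().__init__()
        self.grid = grid
        self.crops = crops                       # crop sizes, center -> global
        self.pool = nn.AdaptiveAvgPool2d(1)      # one summary vector per crop
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patches):                  # patches: (B, grid*grid, dim)
        b, n, d = patches.shape
        fmap = patches.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        tokens = []
        for s in self.crops:                     # progressively larger crops
            o = (self.grid - s) // 2             # offset of the centered crop
            crop = fmap[:, :, o:o + s, o:o + s]
            tokens.append(self.pool(crop).flatten(1))   # (B, dim)
        queries = torch.stack(tokens, dim=1)     # (B, 6, dim)
        fused, _ = self.attn(queries, patches, patches) # fuse fine detail
        return fused                             # six spatial tokens

# Shapes for a CLIP-ViT-L/14 grid at 336px: 24x24 = 576 patches, dim 1024.
tokens = SpatialTokenSketch()(torch.randn(2, 576, 1024))
print(tokens.shape)  # torch.Size([2, 6, 1024])
```

The pooling variant would instead adaptively pool the full feature map at several resolutions rather than cropping it.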
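
## Example Usage (Sketch)

As a rough starting point, here is one plausible way to load a LLaVA-style checkpoint with `transformers`. The repository ID, model class, and prompt template below are assumptions based on the standard LLaVA-1.5 loading path and are not confirmed for LLaVA-SP; the official GitHub repository documents the supported setup.

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Hypothetical repository ID -- check the official GitHub repo for the
# actual released checkpoints.
model_id = "CnFaker/LLaVA-SP"

model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw)
# LLaVA-1.5-style prompt template (an assumption for this checkpoint).
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```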