Model Card for SpaceQwen2.5-VL-3B-Instruct

SpaceQwen2.5-VL-3B-Instruct uses LoRA to fine-tune Qwen2.5-VL-3B-Instruct on a dataset generated with VQASynth, with the goal of enhancing spatial reasoning as described in SpatialVLM.

Model Details

Model Description

This model uses data synthesis techniques and publicly available models to reproduce the work described in SpatialVLM, which enhances the spatial reasoning of multimodal models. A pipeline of expert models infers spatial relationships between objects in a scene, and those relationships are used to create a VQA dataset for spatial reasoning.
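As a rough illustration of the last step, the sketch below turns per-object 3D positions into question/answer pairs. It is not the VQASynth API: `SceneObject`, `make_distance_qa`, and `make_left_right_qa` are hypothetical names, and the sketch assumes the expert models have already produced a caption and a 3D centroid in meters for each object, in a camera frame where +x points right.

```python
import math
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical container for one detected object."""
    caption: str     # short description, e.g. from a captioning model
    centroid: tuple  # (x, y, z) in meters, camera frame (assumed)


def distance_m(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between the two object centroids."""
    return math.dist(a.centroid, b.centroid)


def make_distance_qa(a: SceneObject, b: SceneObject) -> dict:
    """Build one quantitative question/answer pair from templates."""
    d = distance_m(a, b)
    return {
        "question": f"How far is the {a.caption} from the {b.caption}?",
        "answer": f"The {a.caption} is about {d:.2f} meters from the {b.caption}.",
    }


def make_left_right_qa(a: SceneObject, b: SceneObject) -> dict:
    """Build one qualitative pair; assumes +x points right.

    Objects with equal x are reported as "right" in this simple sketch.
    """
    side = "left" if a.centroid[0] < b.centroid[0] else "right"
    return {
        "question": f"Is the {a.caption} to the left or right of the {b.caption}?",
        "answer": f"The {a.caption} is to the {side} of the {b.caption}.",
    }


if __name__ == "__main__":
    chair = SceneObject("chair", (0.0, 0.0, 2.0))
    table = SceneObject("table", (3.0, 0.0, 6.0))
    print(make_distance_qa(chair, table))
    print(make_left_right_qa(chair, table))
```

Template-based generation like this keeps the answers consistent with the estimated geometry; the quality of the resulting dataset then depends mainly on how accurate the upstream position estimates are.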

  • Developed by: remyx.ai
  • Model type: Multimodal Model, Vision Language Model, Qwen2.5-VL-3B-Instruct
  • License: Apache-2.0
  • Finetuned from model: Qwen2.5-VL-3B-Instruct

Model Sources

Citation

@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

Safetensors · Model size: 3.75B params · Tensor type: FP16