---
base_model:
- remyxai/SpaceOm
datasets:
- remyxai/SpaceThinker
language:
- en
library_name: llama.cpp
license: apache-2.0
pipeline_tag: image-text-to-text
paper: 2506.07966
tags:
- gguf
- remyx
- SpatialReasoning
- spatial-reasoning
- test-time-compute
- thinking
- reasoning
- multimodal
- vlm
- vision-language
- distance-estimation
- quantitative-spatial-reasoning
task_categories:
- visual-question-answering
pretty_name: SpaceOm-GGUF
model-index:
- name: SpaceOm
  results:
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: 3DSRBench
      type: benchmark
    metrics:
    - type: success_rate
      value: 0.5419
      name: Overall Success Rate
    - type: success_rate
      value: 0.599
      name: Overall Success Rate
    - type: success_rate
      value: 0.388
      name: Overall Success Rate
    - type: success_rate
      value: 0.5833
      name: Overall Success Rate
    - type: success_rate
      value: 0.4455
      name: Overall Success Rate
    - type: success_rate
      value: 0.4876
      name: Overall Success Rate
    - type: success_rate
      value: 0.6105
      name: Overall Success Rate
    - type: success_rate
      value: 0.7043
      name: Overall Success Rate
    - type: success_rate
      value: 0.3504
      name: Overall Success Rate
    - type: success_rate
      value: 0.2558
      name: Overall Success Rate
    - type: success_rate
      value: 0.8085
      name: Overall Success Rate
    - type: success_rate
      value: 0.6839
      name: Overall Success Rate
    - type: success_rate
      value: 0.6553
      name: Overall Success Rate
---

# SpaceOm

This model is evaluated in the paper [SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence](https://huggingface.co/papers/2506.07966).
The code for the SpaCE-10 benchmark is available at: https://github.com/Cuzyoung/SpaCE-10.

**Model creator:** [remyxai](https://huggingface.co/remyxai)<br>
**Original model**: [SpaceOm](https://huggingface.co/remyxai/SpaceOm)<br>
**GGUF quantization:** `llama.cpp` commit [2baf07727f921d9a4a1b63a2eff941e95d0488ed](https://github.com/ggerganov/llama.cpp/tree/2baf07727f921d9a4a1b63a2eff941e95d0488ed)<br>
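
The GGUF weights can be pulled from the Hub and loaded with `llama.cpp` or its Python bindings. Below is a minimal, text-only sketch using `llama-cpp-python`; the repo id and quant filename are assumptions (check the repository file list for the actual names), and image inputs additionally require the model's `mmproj` projector file together with llama.cpp's multimodal tooling.

```python
# Minimal sketch: download one quant and run a text-only completion with llama-cpp-python.
# NOTE: repo_id and filename below are assumptions; check the repo's file list for the
# actual GGUF names. Image inputs also need the mmproj file and llama.cpp's multimodal path.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="remyxai/SpaceOm-GGUF",   # assumed repo id
    filename="SpaceOm-Q4_K_M.gguf",   # hypothetical quant filename
)

llm = Llama(model_path=model_path, n_ctx=4096)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Roughly how far apart are the two pallets, in meters?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```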

## Description

<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/5cPsHwrmzqPOjd7zUgzss.gif"  width="500"/>

## Model Overview

**SpaceOm** improves over **SpaceThinker** by adding:

* the target module `o_proj` in LoRA fine-tuning
* **SpaceOm** [dataset](https://huggingface.co/datasets/salma-remyx/SpaceOm) for longer reasoning traces
* **Robo2VLM-Reasoning** [dataset](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) for additional robotics-domain and MCVQA examples


The choice to include `o_proj` among the target modules in LoRA fine-tuning was inspired by [this study](https://arxiv.org/pdf/2505.20993v1), which argues for the importance of that module in reasoning models.
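
For illustration, a LoRA configuration along these lines lists `o_proj` alongside the usual attention projections. This is a hedged sketch using the `peft` library; the rank, alpha, and full module list are assumptions, not the published training recipe.

```python
# Illustrative PEFT LoraConfig that includes o_proj among the target modules.
# The rank, alpha, dropout, and module list are assumptions, not the exact SpaceOm recipe.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # o_proj added per the cited study
    task_type="CAUSAL_LM",
)
```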

The reasoning traces in the SpaceThinker dataset average roughly 200 "thinking" tokens, so we included longer reasoning traces in the training data to help the model use more tokens during reasoning.

To improve alignment for robotics applications, we also trained on synthetic reasoning traces derived from the **Robo2VLM-1** [dataset](https://huggingface.co/datasets/keplerccc/Robo2VLM-1).

## Model Evaluation

### SpatialScore - 3B and 4B models

| **Model**              | **Overall** | **Count.** | **Obj.-Loc.** | **Pos.-Rel.** | **Dist.** | **Obj.-Prop.** | **Cam.&IT.** | **Tracking** | **Others** |
|------------------------|-------------|------------|----------------|----------------|-----------|----------------|---------------|---------------|------------|
| SpaceQwen2.5-VL-3B     | 42.31       | 45.01      | 49.78          | 57.88          | 27.36     | 34.11          | 26.34         | 26.44         | 43.58      |
| SpatialBot-Phi2-3B     | 41.65       | 53.23      | 54.32          | 55.40          | 27.12     | 26.10          | 24.21         | 27.57         | 41.66      |
| Kimi-VL-3B             | 51.48       | 49.22      | 61.99          | 61.34          | 38.27     | 46.74          | 33.75         | 56.28         | 47.23      |
| Kimi-VL-3B-Thinking    | 52.60       | 52.66      | 58.93          | 63.28          | 39.38     | 42.57          | 32.00         | 46.97         | 42.73      |
| Qwen2.5-VL-3B          | 47.90       | 46.62      | 55.55          | 62.23          | 32.39     | 32.97          | 30.66         | 36.90         | 42.19      |
| InternVL2.5-4B         | 49.82       | 53.32      | 62.02          | 62.02          | 32.80     | 27.00          | 32.49         | 37.02         | 48.95      |
| **SpaceOm (3B)**       | 49.00   | **56.00**      | 54.00          | **65.00**          | **41.00**     | **50.00**          | **36.00**         | 42.00         | 47.00      |

See [all results](https://huggingface.co/datasets/salma-remyx/SpaceOm_SpatialScore) for evaluating **SpaceOm** on the **SpatialScore** [benchmark](https://haoningwu3639.github.io/SpatialScore/).

Compared to **SpaceQwen**, this model outperforms it in all categories:


<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/tyrLNKsW3PAuZ8t7pCKU6.png" width="800">

And compared to **SpaceThinker**:


<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/TWRLWismj3-HduHUkTAuM.png" width="800">

### SpaCE-10 Benchmark Comparison

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YpIOjJFZ-Zaomg77ImeQHSqYBLB8T1Ce?usp=sharing)

This table compares `SpaceOm` evaluated using GPT scoring against several top models from the SpaCE-10 benchmark leaderboard. Top scores in each category are **bolded**.

| Model                  | EQ    | SQ    | SA    | OO    | OS    | EP    | FR    | SP    | Source    |
|------------------------|-------|-------|-------|-------|-------|-------|-------|-------|-----------|
| **SpaceOm** | 32.47 | 24.81 | **47.63** | **50.00** | **32.52** |  9.12 | 37.04 | 25.00 | GPT Eval  |
| Qwen2.5-VL-7B-Instruct | 32.70 | 31.00 | 41.30 | 32.10 | 27.60 | 15.40 | 26.30 | 27.50 | Table      |
| LLaVA-OneVision-7B     | **37.40** | 36.20 | 42.90 | 44.20 | 27.10 | 11.20 | **45.60** | 27.20 | Table      |
| VILA1.5-7B             | 30.20 | **38.60** | 39.90 | 44.10 | 16.50 | **35.10** | 30.10 | **37.60** | Table      |
| InternVL2.5-4B         | 34.30 | 34.40 | 43.60 | 44.60 | 16.10 | 30.10 | 33.70 | 36.70 | Table      |

**Legend:**
- EQ: Entity Quantification
- SQ: Scene Quantification
- SA: Size Assessment
- OO: Object-Object spatial relations
- OS: Object-Scene spatial relations
- EP: Entity Presence
- FR: Functional Reasoning
- SP: Spatial Planning

> ℹ️ Note: Scores for SpaceOm are generated via `gpt_eval_score` on single-choice (`*-single`) versions of the SpaCE-10 benchmark tasks. Other entries reflect leaderboard accuracy scores from the official SpaCE-10 evaluation table.
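
As a point of reference, a success-rate style metric over single-choice questions is simply the fraction answered correctly. The sketch below only illustrates that reduction; it is not the official `gpt_eval_score` or SpaCE-10 scoring code, and the choice-letter answer format is a hypothetical convention.

```python
# Plain exact-match success rate over single-choice answers (illustrative only;
# not the official gpt_eval_score / SpaCE-10 implementation, which uses GPT scoring).
def success_rate(predictions: list[str], answers: list[str]) -> float:
    """Fraction of questions where the predicted choice letter matches the gold letter."""
    assert len(predictions) == len(answers) and answers
    correct = sum(p.strip().upper() == a.strip().upper() for p, a in zip(predictions, answers))
    return correct / len(answers)

print(success_rate(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # 0.75
```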

Read more about the [SpaCE-10 benchmark](https://arxiv.org/pdf/2506.07966v1).

## Limitations

- Performance may degrade in cluttered scenes or under unusual camera perspectives.
- This model was fine-tuned using synthetic reasoning over an internet image dataset.
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.
- Not intended for use in safety-critical or legal decision-making.

> Users are encouraged to evaluate outputs critically and to consider fine-tuning for domain-specific safety and performance. Distances estimated by autoregressive
> transformers may help with higher-order reasoning for planning and behavior, but they may not be suitable replacements for measurements taken with high-precision sensors,
> calibrated stereo vision systems, or specialist monocular depth estimation models capable of more accurate, pixel-wise predictions and real-time performance.


## Citation

```
@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month = {January},
  year = {2025}
}

@misc{vl-thinking2025,
  title = {SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models},
  author = {Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/UCSC-VLAA/VLAA-Thinking}},
}


@article{wu2025spatialscore,
    author    = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
    title     = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding},
    journal   = {arXiv preprint arXiv:2505.17012},
    year      = {2025},
}

@article{gong2025space10,
  title     = {SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence},
  author    = {Ziyang Gong and Wenhao Li and Oliver Ma and Songyuan Li and Jiayi Ji and Xue Yang and Gen Luo and Junchi Yan and Rongrong Ji},
  journal   = {arXiv preprint arXiv:2506.07966},
  year      = {2025},
  url       = {https://arxiv.org/abs/2506.07966}
}

```