---
task_categories:
- visual-question-answering
language:
- en
tags:
- remyx
- SpatialReasoning
- spatial-reasoning
- test-time-compute
- thinking
- reasoning
- multimodal
- vlm
- vision-language
- distance-estimation
- quantitative-spatial-reasoning
pretty_name: SpaceOm
license: apache-2.0
datasets:
- remyxai/SpaceThinker
base_model:
- UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B
pipeline_tag: image-text-to-text
library_name: transformers
model-index:
- name: SpaceOm
  results:
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: 3DSRBench
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.5419
    results_by_subcategory:
    - name: 3D Positional Relation / Orientation
      success_rate: 0.4877
    - name: Object Localization / 3D Localization
      success_rate: 0.6337
    - name: Object Properties / Size
      success_rate: 0.5043
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: BLINK
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.599
    results_by_subcategory:
    - name: 3D Positional Relation / Orientation
      success_rate: 0.7972
    - name: Counting / Object Counting
      success_rate: 0.6167
    - name: Depth and Distance / Relative
      success_rate: 0.621
    - name: Object Localization / 2D Localization
      success_rate: 0.582
    - name: Point and Object Tracking / Point Correspondence
      success_rate: 0.3779
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: MMIU
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.388
    results_by_subcategory:
    - name: Camera and Image Transformation / 2D Transformation
      success_rate: 0.255
    - name: Camera and Image Transformation / 3D Camera Pose
      success_rate: 0.4
    - name: Camera and Image Transformation / Camera Motion
      success_rate: 0.4436
    - name: Depth and Distance / Absolute
      success_rate: 0.265
    - name: Object Localization / 3D Localization
      success_rate: 0.3625
    - name: Point and Object Tracking / 3D Tracking
      success_rate: 0.725
    - name: Point and Object Tracking / Point Correspondence
      success_rate: 0.265
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: MMVP
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.5833
    results_by_subcategory:
    - name: Others / Miscellaneous
      success_rate: 0.5833
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: QSpatialBench-Plus
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.4455
    results_by_subcategory:
    - name: Depth and Distance / Absolute
      success_rate: 0.4455
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: QSpatialBench-ScanNet
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.4876
    results_by_subcategory:
    - name: Depth and Distance / Absolute
      success_rate: 0.464
    - name: Object Properties / Size
      success_rate: 0.5111
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: RealWorldQA
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.6105
    results_by_subcategory:
    - name: Others / Miscellaneous
      success_rate: 0.6105
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: SpatialSense
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.7043
    results_by_subcategory:
    - name: 3D Positional Relation / Orientation
      success_rate: 0.7043
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: VGBench
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.3504
    results_by_subcategory:
    - name: Camera and Image Transformation / 2D Transformation
      success_rate: 0.2568
    - name: Camera and Image Transformation / 3D Camera Pose
      success_rate: 0.4371
    - name: Depth and Distance / Absolute
      success_rate: 0.3339
    - name: Depth and Distance / Relative
      success_rate: 0.32
    - name: Object Localization / 3D Localization
      success_rate: 0.4283
    - name: Point and Object Tracking / 3D Tracking
      success_rate: 0.3264
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: VSI-Bench_8
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.2558
    results_by_subcategory:
    - name: 3D Positional Relation / Orientation
      success_rate: 0.3998
    - name: Counting / Object Counting
      success_rate: 0.229
    - name: Depth and Distance / Absolute
      success_rate: 0.1562
    - name: Depth and Distance / Relative
      success_rate: 0.3648
    - name: Object Properties / Size
      success_rate: 0.1645
    - name: Others / Miscellaneous
      success_rate: 0.2204
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: VSR-ZeroShot
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.8085
    results_by_subcategory:
    - name: 3D Positional Relation / Orientation
      success_rate: 0.8085
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: cvbench
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.6839
    results_by_subcategory:
    - name: Counting / Object Counting
      success_rate: 0.6294
    - name: Depth and Distance / Relative
      success_rate: 0.7408
    - name: Object Localization / 3D Localization
      success_rate: 0.6815
  - task:
      type: visual-question-answering
      name: Spatial Reasoning
    dataset:
      name: spatialbench
      type: benchmark
    metrics:
    - type: success_rate
      name: Overall Success Rate
      value: 0.6553
    results_by_subcategory:
    - name: 3D Positional Relation / Orientation
      success_rate: 0.6765
    - name: Counting / Object Counting
      success_rate: 0.75
    - name: Object Properties / Existence
      success_rate: 0.925
    - name: Object Properties / Reachability
      success_rate: 0.55
    - name: Object Properties / Size
      success_rate: 0.375

---
[![Official](https://img.shields.io/badge/Official-%239a0018.svg?logo=)](https://remyx.ai/?model_id=SpaceThinker-Qwen2.5VL-3B&sha256=abc123def4567890abc123def4567890abc123def4567890abc123def4567890)

# SpaceOm 

<img src="https://cdn-uploads.huggingface.co/production/uploads/647777304ae93470ffc28913/5cPsHwrmzqPOjd7zUgzss.gif"  width="500"/>

## 📚 Contents

- [🧠 Model Overview](#model-overview)
- [📊 Evaluation & Benchmarks](#model-evaluation)
- [🏃‍♀️ Running SpaceOm](#running-spaceom)
- [🏋️‍♂️ Training Configuration](#training-spaceom)
- [📂 Dataset Info](#dataset-info)
- [⚠️ Limitations](#limitations)
- [📜 Citation](#citation)

## Model Overview

**SpaceOm** improves over **SpaceThinker** by adding:

* the target module `o_proj` in LoRA fine-tuning
* **SpaceOm** [dataset](https://huggingface.co/datasets/salma-remyx/SpaceOm) for longer reasoning traces
* **Robo2VLM-Reasoning** [dataset](https://huggingface.co/datasets/salma-remyx/Robo2VLM-Reasoning) for more robotics domain and MCVQA examples


Including `o_proj` among the LoRA target modules was inspired by [this study](https://arxiv.org/pdf/2505.20993v1), which argues for
the importance of this module in reasoning models.

The reasoning traces in the SpaceThinker dataset average ~200 "thinking" tokens, so we added longer reasoning traces to the training data
to help the model use more tokens when reasoning.

To improve alignment for robotics applications, we also trained with synthetic reasoning traces derived from the **Robo2VLM-1** [dataset](https://huggingface.co/datasets/keplerccc/Robo2VLM-1).


## Running SpaceOm

### Ollama
To launch with ollama, run:
```bash
ollama run hf.co/remyxai/SpaceOm:latest
```
or 
```bash
ollama run remyxai/spaceom
```

### llama.cpp
To run locally with **llama.cpp**, install and build this [branch](https://github.com/HimariO/llama.cpp.qwen2.5vl/tree/qwen25-vl), then download the [.gguf weights here](https://huggingface.co/remyxai/SpaceThinker-Qwen2.5VL-3B/tree/main/gguf).

```bash
./llama-qwen2vl-cli -m spaceom-F16.gguf \
  --mmproj spaceom-vision.gguf \
  --image images/example_1.jpg --threads 24 -ngl 9 \
  -p "Does the man in blue shirt working have a greater height compared to the wooden pallet with boxes on floor?"
```

### Transformers
Run locally using **Transformers**

```python
import torch
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import requests
from io import BytesIO

# Configuration
model_id = "remyxai/SpaceOm"
image_path = "images/example_1.jpg"  # local path or http(s) URL
prompt = "What can you infer from this image about the environment?"
system_message = (
  "You are VL-Thinking 🤔, a helpful assistant with excellent reasoning ability. "
  "You should first think about the reasoning process and then provide the answer. "
  "Use <think>...</think> and <answer>...</answer> tags."
)

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

# Load and preprocess image
if image_path.startswith("http"):
    image = Image.open(BytesIO(requests.get(image_path).content)).convert("RGB")
else:
    image = Image.open(image_path).convert("RGB")
if image.width > 512:
    ratio = image.height / image.width
    image = image.resize((512, int(512 * ratio)), Image.Resampling.LANCZOS)

# Format input
chat = [
    {"role": "system", "content": [{"type": "text", "text": system_message}]},
    {"role": "user", "content": [{"type": "image", "image": image},
                                {"type": "text", "text": prompt}]}
]
text_input = processor.apply_chat_template(chat, tokenize=False,
                                                  add_generation_prompt=True)

# Tokenize
# Move inputs to wherever device_map="auto" placed the model
inputs = processor(text=[text_input], images=[image],
                   return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print("Response:\n", output)
```
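
The system message above asks the model to wrap its output in `<think>` and `<answer>` tags, so a small helper can separate the reasoning trace from the final answer. The following is a minimal sketch, not part of the original card: the function name `split_reasoning` and the example string are hypothetical, and it assumes the tags appear as plain text in the decoded output. It takes the last match for each tag because the decoded sequence may also echo the system message, which itself contains the literal tags.

```python
import re


def split_reasoning(response: str) -> dict:
    """Return the last <think> and <answer> spans in a response (None if absent)."""
    parts = {}
    for tag in ("think", "answer"):
        # DOTALL lets a reasoning trace span multiple lines.
        matches = re.findall(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
        parts[tag] = matches[-1].strip() if matches else None
    return parts


# Hypothetical decoded output that echoes the system message before the reply.
example = (
    "system\nUse <think>...</think> and <answer>...</answer> tags.\n"
    "assistant\n<think>The man stands well above the pallet.</think>\n"
    "<answer>Yes</answer>"
)
parsed = split_reasoning(example)
print(parsed["answer"])
```

In practice the same helper could be applied to the `output` string produced by the Transformers example.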

## Dataset Info

The [SpaceThinker](https://huggingface.co/datasets/remyxai/SpaceThinker) dataset includes over 12K samples synthesized using VQASynth on a subset of images in the localized narratives split of [the cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron).
**SpaceThinker** is formatted similarly to the [Llama-Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset) so that reasoning can be toggled.

The [SpaceOm](https://huggingface.co/datasets/remyxai/SpaceOm) dataset includes ~1K samples synthesized using VQASynth to include longer reasoning traces.

The [Robo2VLM-Reasoning](https://huggingface.co/datasets/remyxai/Robo2VLM-Reasoning) dataset is a subset of the original [Robo2VLM-1](https://huggingface.co/datasets/keplerccc/Robo2VLM-1) dataset, modified to include reasoning traces.

These datasets were combined to create the final training data for this model.

The model builds on ideas from [SpatialVLM (Chen et al., 2024)](https://spatial-vlm.github.io/), introducing synthetic reasoning traces grounded in a 3D scene reconstruction pipeline using **Molmo, VGGT, SAM2**.


## Training SpaceOm

**PEFT Configuration**
- Architecture: Qwen2.5-VL-3B
- Base model: UCSC-VLAA/VLAA-Thinker-Qwen2.5VL-3B
- Method: LoRA finetuning (PEFT)
- LoRA Alpha: 256
- LoRA Rank: 128
- Target Modules: q_proj, v_proj, o_proj
- Optimizer: AdamW (lr=2e-5), batch size = 1, epochs = 3
- Max input length: 1024 tokens

Reproduce the LoRA SFT training with the included script:
```bash
python train.py
```
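
To make the PEFT settings listed above more concrete, here is a minimal sketch of what they imply numerically. It is not taken from `train.py`: the `LoraSettings` class is hypothetical, and the 2048 x 2048 layer shape is an illustrative assumption rather than a confirmed dimension of the base model. It relies only on the standard LoRA formulation, in which the low-rank update is scaled by `alpha / rank`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hypothetical container mirroring the hyperparameters listed in this card."""
    rank: int = 128
    alpha: int = 256
    target_modules: tuple = ("q_proj", "v_proj", "o_proj")

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the update B @ A by alpha / rank.
        return self.alpha / self.rank

    def added_params(self, d_in: int, d_out: int) -> int:
        # A has shape (rank, d_in) and B has shape (d_out, rank).
        return self.rank * (d_in + d_out)


settings = LoraSettings()
print(settings.scaling)                   # 2.0
print(settings.added_params(2048, 2048))  # 524288 (assumed layer shape)
```

With the `peft` library, these values would typically be passed as `r`, `lora_alpha`, and `target_modules` to `LoraConfig`; whether the included script does exactly that is an assumption.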

## Model Evaluation


### OmniSpatial

Benchmark leaderboard with **SpaceOm** highlighted.

| Model                        | Avg   | Manip | Motion | Traffic | Locate | Geospatial | Pattern | Geometric | Ego   | Allo  | Hypo  |
|-----------------------------|--------|--------|--------|---------|--------|------------|---------|-----------|--------|--------|--------|
| 🥇 o3-2025-04-16            | 56.33 | 71.89  | 66.18  | 61.18   | 68.57  | 65.45      | 40.21   | 29.68     | 77.06 | 48.40 | 48.19 |
| 🥈 Gemini-2.5-pro-preview-05-06 | 55.19 | 67.57  | 71.39  | 62.35   | 75.24  | 64.55      | 43.30   | 34.84     | 74.51 | 38.03 | 37.35 |
| 🥉 Gemini-2.5-flash-thinking-05-20 | 53.16 | 70.27  | 64.74  | 61.18   | 72.38  | 58.18      | 35.05   | 36.13     | 74.12 | 40.96 | 32.53 |
| o4-mini-04-16               | 52.77 | 72.97  | 59.83  | 60.00   | 73.33  | 61.82      | 34.02   | 36.77     | 73.53 | 40.69 | 40.96 |
| Gemini-2.5-flash-preview-05-20 | 52.12 | 67.57  | 62.72  | 68.24   | 73.33  | 60.91      | 38.14   | 34.19     | 75.49 | 35.90 | 33.73 |
| GPT-4.1-2025-04-14          | 51.78 | 66.22  | 64.74  | 60.00   | 65.33  | 60.18      | 31.75   | 30.06     | 70.98 | 40.64 | 39.04 |
| o1-2024-12-17               | 50.36 | 71.62  | 60.98  | 57.65   | 63.81  | 60.00      | 39.18   | 27.10     | 71.57 | 38.03 | 36.14 |
| InternVL3-78B              | 49.33 | 63.78  | 63.12  | 56.24   | 59.24  | 51.45      | 27.63   | 30.19     | 74.51 | 38.46 | 35.90 |
| GPT-4.1-mini-2025-04-14     | 48.87 | 64.32  | 56.53  | 59.06   | 60.19  | 56.36      | 29.28   | 30.19     | 72.55 | 39.57 | 39.28 |
| Claude-3-7-thinking-20250219| 48.62 | 57.21  | 59.73  | 53.73   | 67.94  | 57.27      | 30.24   | 28.17     | 68.63 | 37.94 | 36.95 |
| InternVL3-38B              | 48.48 | 63.42  | 63.58  | 54.59   | 58.29  | 50.55      | 29.90   | 28.52     | 72.16 | 36.76 | 33.49 |
| Gemini-2.0-flash-exp       | 48.40 | 61.89  | 56.01  | 51.76   | 63.43  | 59.09      | 20.82   | 33.81     | 72.75 | 39.20 | 39.28 |
| Qwen-VL2.5-72B             | 47.85 | 58.38  | 60.12  | 50.12   | 59.81  | 53.64      | 26.19   | 33.03     | 71.37 | 36.81 | 36.39 |
| GPT-4o-2024-11-20          | 47.81 | 65.54  | 57.23  | 56.47   | 52.38  | 54.09      | 26.29   | 25.48     | 75.98 | 39.49 | 39.76 |
| Claude-3-7-sonnet-20250219 | 47.53 | 57.57  | 55.95  | 56.71   | 63.81  | 59.09      | 29.48   | 28.39     | 72.16 | 36.06 | 36.63 |
| Qwen-VL2.5-32B             | 47.36 | 63.06  | 55.09  | 51.76   | 66.29  | 56.91      | 26.39   | 27.48     | 68.04 | 37.50 | 40.24 |
| Claude-3-5-sonnet-20241022 | 46.86 | 54.05  | 54.57  | 58.12   | 68.38  | 53.09      | 26.60   | 31.74     | 70.00 | 34.79 | 39.52 |
| InternVL3-14B              | 45.94 | 54.32  | 60.17  | 50.35   | 51.81  | 51.45      | 28.04   | 28.26     | 68.04 | 35.37 | 34.46 |
| LLaVA-onevision-qwen2-72B  | 45.66 | 62.16  | 50.29  | 54.12   | 60.95  | 56.36      | 22.68   | 25.81     | 76.47 | 37.23 | 33.73 |
| SoFar-Qwen2.5-3B           | 45.14 | 56.49  | 51.16  | 54.12   | 53.14  | 52.73      | 31.75   | 22.88     | 71.60 | 36.56 | 41.69 |
| Gemma-3-27B                | 44.75 | 56.76  | 55.78  | 57.65   | 50.48  | 52.73      | 27.84   | 29.03     | 64.71 | 33.51 | 32.53 |
| Gemini-2.0-flash-lite      | 44.03 | 59.19  | 46.71  | 60.24   | 49.52  | 53.27      | 21.65   | 31.23     | 66.47 | 36.81 | 38.80 |
| Gemma-3-12B                | 43.71 | 54.05  | 54.91  | 54.12   | 47.62  | 45.45      | 16.49   | 30.32     | 63.73 | 36.70 | 33.73 |
| GPT-4o-mini-2024-07-18     | 42.64 | 55.95  | 50.29  | 54.59   | 43.43  | 44.91      | 22.47   | 29.42     | 61.57 | 36.76 | 34.22 |
| GPT-4.1-nano-2025-04-14    | 42.62 | 50.90  | 53.85  | 54.90   | 40.95  | 42.42      | 24.40   | 30.11     | 53.59 | 37.23 | 33.73 |
| 🧘‍♂️ **SpaceOm**                | 41.79 | 51.89  | 47.98  | 50.82   | 39.62  | 43.64      | 27.63   | 27.61     | 70.00 | 35.74 | 33.73 |
| InternVL3-8B               | 41.60 | 52.43  | 40.87  | 48.94   | 51.05  | 44.77      | 24.95   | 28.63     | 64.20 | 38.62 | 40.96 |
| SpaceThinker-Qwen2.5-3B    | 40.42 | 47.84  | 53.06  | 43.29   | 35.43  | 38.73      | 24.33   | 28.00     | 58.04 | 35.11 | 31.08 |
| Qwen-VL2.5-3B              | 40.30 | 55.41  | 47.51  | 46.12   | 42.29  | 44.73      | 32.16   | 23.87     | 59.41 | 33.30 | 30.84 |
| SpaceQwen2.5-VL-3B         | 40.25 | 58.11  | 39.88  | 41.18   | 40.95  | 40.91      | 29.90   | 25.81     | 63.73 | 38.83 | 39.76 |
| Gemma-3-4B                 | 39.79 | 41.89  | 49.71  | 56.47   | 27.62  | 36.36      | 23.71   | 24.52     | 59.80 | 36.17 | 38.55 |
| Qwen-VL2.5-7B              | 39.18 | 58.38  | 35.09  | 50.12   | 45.33  | 44.00      | 31.13   | 29.42     | 64.51 | 33.19 | 37.35 |
| InternVL3-2B               | 37.98 | 50.00  | 40.58  | 43.29   | 40.00  | 40.55      | 21.86   | 28.52     | 55.49 | 35.11 | 33.01 |
| SpaceMantis-13B            | 36.36 | 47.03  | 36.59  | 40.94   | 34.86  | 33.09      | 22.27   | 24.39     | 49.22 | 38.25 | 39.28 |
| RoboPoint-vicuna-7B        | 35.85 | 57.03  | 28.61  | 34.82   | 37.33  | 40.55      | 29.90   | 22.71     | 50.20 | 38.72 | 40.96 |
| LLaVA-onevision-qwen2-7B   | 35.68 | 43.24  | 38.15  | 32.94   | 29.52  | 41.82      | 28.87   | 22.58     | 47.06 | 36.17 | 37.35 |
| SpatialBot-3B              | 35.68 | 43.24  | 38.15  | 32.94   | 29.52  | 41.82      | 28.87   | 22.58     | 47.06 | 36.17 | 37.35 |
| LLaVA-1.5-vicuna-7B        | 34.97 | 54.46  | 31.23  | 35.29   | 36.19  | 33.94      | 29.01   | 24.18     | 55.60 | 34.66 | 36.14 |
| RoboPoint-vicuna-13B       | 34.60 | 55.68  | 28.15  | 42.82   | 32.19  | 32.55      | 24.12   | 27.74     | 49.02 | 37.66 | 33.49 |

See full **SpaceOm** [results here](https://huggingface.co/datasets/salma-remyx/SpaceOm_OmniSpatial/blob/main/OmniSpatial_spaceom_results.json) for the 
**OmniSpatial** [benchmark](https://qizekun.github.io/omnispatial/).

### SpatialScore 

Top scores in each category are **bolded** in this partial table of 3B/4B models.


| **Model**              | **Overall** | **Count.** | **Obj.-Loc.** | **Pos.-Rel.** | **Dist.** | **Obj.-Prop.** | **Cam.&IT.** | **Tracking** | **Others** |
|------------------------|-------------|------------|----------------|----------------|-----------|----------------|---------------|---------------|------------|
| InternVL2.5-4B         | 49.82       | **53.32**      | **62.02**          | **62.82**          | **42.30**     | 27.00          | 32.49         | 37.02         | **48.95**      |
| 🧘‍♂️ **SpaceOm**       | 48.15   | 47.84      | 55.24          | 61.83          | 41.48     | 30.97          | 32.94         | **37.20**         | 43.74      |
| Qwen2.5-VL-3B          | 47.90       | 46.62      | 55.55          | 62.23          | 37.53     | 32.59          | **35.85**         | 36.90         | 42.19      |
| SpaceQwen2.5-VL-3B     | 42.31       | 45.01      | 49.78          | 57.88          | 27.36     | **34.11**          | 26.34         | 26.44         | 43.58      |
| SpatialBot-Phi2-3B     | 41.65       | 53.25      | 54.32          | 55.40          | 27.12     | 26.10          | 24.21         | 27.57         | 41.66      |


See [all results](https://huggingface.co/datasets/salma-remyx/SpaceOm_SpatialScore) for evaluating **SpaceOm** on the **SpatialScore** [benchmark](https://haoningwu3639.github.io/SpatialScore/).


### SpaCE-10 

Top scores in each category are **bolded** in this partial table of 3B/4B models.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1YpIOjJFZ-Zaomg77ImeQHSqYBLB8T1Ce?usp=sharing)


| **Model**                | **Overall** | **EQ**   | **SQ**   | **SA**   | **OO**   | **OS**   | **EP**   | **FR**   | **SP**   | **Source**  |
|--------------------------|-------------|----------|----------|----------|----------|----------|----------|----------|----------|-------------|
| InternVL2.5-4B           | **36.01**   | **34.30**| 34.40    | 43.60    | 44.40    | 16.50    | **31.10**| **50.10**| **33.70**| Table       |
| SpaceThinker             | 32.72       | 32.73    | 24.81    | 47.26    | 50.33    | 33.63    | 9.25     | 37.54    | 26.25    | GPT Eval    |
| 🧘‍♂️ **SpaceOm**              | 32.32       | 32.47    | 24.81    | **47.63**| 50.00    | 32.52    | 9.12     | 37.04    | 25.00    | GPT Eval    |
| SpaceQwen                | 31.98       | 31.19    | 25.89    | 41.61    | **51.98**| **35.18**| 10.97    | 36.54    | 22.50    | GPT Eval    |
| Qwen2.5-VL-3B-Instruct   | 30.00       | 31.70    | **45.50**| 39.00    | 43.00    | 25.30    | 11.50    | 22.80    | 21.20    | Table       |



**Legend:**
- EQ: Entity Quantification
- SQ: Scene Quantification
- SA: Size Assessment
- OO: Object-Object spatial relations
- OS: Object-Scene spatial relations
- EP: Entity Presence
- FR: Functional Reasoning
- SP: Spatial Planning

> ℹ️ Note: Scores for SpaceQwen, SpaceThinker, SpaceOm are generated via `gpt_eval_score` on single-choice (`*-single`) versions of the SpaCE-10 benchmark tasks. Other entries reflect leaderboard accuracy scores from the official SpaCE-10 evaluation table.

Read more about the [SpaCE-10 benchmark](https://arxiv.org/pdf/2506.07966v1) or see [results here](https://huggingface.co/datasets/salma-remyx/SpaceOm_SpaCE-10_Results/blob/main/20250611_041721_results.json)
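
In the SpaCE-10 table, the **Overall** column appears to match the unweighted mean of the eight category scores. This is an observation from the reported values, not a documented definition of the metric. The sketch below checks it for the SpaceOm row; the dictionary name `space10_spaceom` is hypothetical and the scores are copied from the table.

```python
from statistics import mean

# Category scores for SpaceOm, copied from the SpaCE-10 table in this card.
space10_spaceom = {
    "EQ": 32.47, "SQ": 24.81, "SA": 47.63, "OO": 50.00,
    "OS": 32.52, "EP": 9.12, "FR": 37.04, "SP": 25.00,
}

# Assumption: Overall is the plain average of the eight categories.
overall = round(mean(list(space10_spaceom.values())), 2)
print(overall)  # 32.32, matching the Overall column
```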


## Limitations

- Performance may degrade in cluttered environments or under challenging camera perspectives.
- This model was fine-tuned using synthetic reasoning over an internet image dataset.
- Multimodal biases inherent to the base model (Qwen2.5-VL) may persist.
- Not intended for use in safety-critical or legal decision-making.

> Users are encouraged to evaluate outputs critically and consider fine-tuning for domain-specific safety and performance. Distances estimated using autoregressive
> transformers may help in higher-order reasoning for planning and behavior but may not be suitable replacements for measurements taken with high-precision sensors,
> calibrated stereo vision systems, or specialist monocular depth estimation models capable of more accurate, pixel-wise predictions and real-time performance.


## Citation


```
@article{chen2024spatialvlm,
  title = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
  author = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Driess, Danny and Florence, Pete and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
  journal = {arXiv preprint arXiv:2401.12168},
  year = {2024},
  url = {https://arxiv.org/abs/2401.12168},
}

@misc{qwen2.5-VL,
  title = {Qwen2.5-VL},
  url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month = {January},
  year = {2025}
}

@misc{vl-thinking2025,
  title={SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models },
  author={Hardy Chen and Haoqin Tu and Fali Wang and Hui Liu and Xianfeng Tang and Xinya Du and Yuyin Zhou and Cihang Xie},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/UCSC-VLAA/VLAA-Thinking}},
}


@article{wu2025spatialscore,
    author    = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
    title     = {SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding},
    journal   = {arXiv preprint arXiv:2505.17012},
    year      = {2025},
}

```