Tags: Robotics · Transformers · Safetensors · qwen2 · text-generation · text-generation-inference

AlphaSpace-1.5B

Introduction

"AlphaSpace: (Paper), a novel methodology designed to enhance the spatial reasoning capabilities of language models for robotic manipulation in 3D Cartesian space. AlphaSpace employs a hierarchical semantics-based tokenization strategy that encodes spatial information at both coarse and fine-grained levels. Our approach represents objects with their attributes, positions, and height information through structured tokens, enabling precise spatial reasoning without relying on traditional vision-based embeddings. This approach enables LLMs to accurately manipulate objects by positioning them at specific [x, y, z] coordinates.

Code: https://github.com/AlanDao/AlphaSpace
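
The repository's tokenize_desk utility implements the actual encoding. As a purely illustrative sketch of the coarse-plus-fine idea, a desk of objects could be flattened into structured tokens like the ones below; the token names, grid size, and exact format here are assumptions, not the model's real vocabulary.

# Illustrative sketch only (not the official tokenizer): encode each object as a
# coarse grid-cell token plus a fine-grained offset token and a height token,
# mirroring the hierarchical coarse/fine idea described above.
def encode_object(name, x, y, z, cell=25):
    coarse = f"<cell_{x // cell}_{y // cell}>"   # coarse region of the table
    fine = f"<off_{x % cell}_{y % cell}>"        # fine-grained offset within the region
    height = f"<h_{z}>"                          # height information
    return f"<obj:{name}>{coarse}{fine}{height}"

objects = [
    {"red-cube": [51, 43, 17]},
    {"black-cube": [44, 58, 17]},
]
desk_tokens = "".join(
    encode_object(name, *pos) for obj in objects for name, pos in obj.items()
)
print(desk_tokens)
# <obj:red-cube><cell_2_1><off_1_18><h_17><obj:black-cube><cell_1_2><off_19_8><h_17>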

Model Details

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from utils import tokenize_desk, SYSTEM_PROMPT

# Load the model and tokenizer
model_path = "Menlo/AlphaSpace-1.5B"
device = "cuda"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Define your workspace
objects = [
    {"red-cube": [51, 43, 17]},
    {"black-cube": [44, 58, 17]},
    {"purple-cube": [74, 59, 17]},
    {"green-cube": [65, 82, 17]},
]

# Give a natural language instruction
instruction = "Throw the red cube on top of the green cube"
desk, object_height = tokenize_desk(objects)
final_instruction = SYSTEM_PROMPT.format(object_height=object_height, instruction=instruction, TABLE_MAP=desk)
chat = [
    {"role": "user", "content": final_instruction.strip()}
]
tokenized_chat = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, use_system_prompt=False, return_tensors="pt")
# Generate deterministically (greedy decoding)
generated_ids = model.generate(
    tokenized_chat.to(device),
    max_new_tokens=2048,
    do_sample=False,
)
# Get the solution
result = tokenizer.decode(generated_ids[0][tokenized_chat.shape[1]:], skip_special_tokens=True)
print(result)
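
The decoded result is expected to contain the target [x, y, z] position for the manipulation. The bracketed-triple format assumed below is an illustration, not a documented output contract; inspect your actual generations before relying on it.

import re

# Hedged post-processing sketch: assumes the answer contains integer
# coordinates written as [x, y, z]; extracts every such triple.
def extract_coordinates(text):
    triples = re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", text)
    return [tuple(int(v) for v in t) for t in triples]

print(extract_coordinates(result))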

Hardware

GPU Configuration: Cluster of 8x NVIDIA H200-SXM-140GB.

GPU Usage:

  • SFT: 40 mins.

Training Arguments

We use the Llama-Factory library to train the model.

Continual training settings:

  • Epochs: 1
  • Global batch size: 128
  • Learning rate: 1e-4
  • LR scheduler: cosine with warmup
  • Optimizer: AdamW (fused)
  • Warmup ratio: 0.1
  • Max length: 4096
  • Precision: bf16
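
For reference, the hyperparameters above map roughly onto a Llama-Factory / Hugging Face TrainingArguments-style configuration. The key names and the per-device batch split below are assumptions for illustration only; consult the Llama-Factory documentation for the exact schema.

# Illustrative only: key names follow common Llama-Factory / TrainingArguments
# conventions and are an assumption, not the exact config used for this run.
training_config = {
    "stage": "sft",
    "num_train_epochs": 1,
    "per_device_train_batch_size": 16,  # assumed split: 16 per GPU x 8 GPUs = 128 global
    "learning_rate": 1e-4,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,
    "optim": "adamw_torch_fused",
    "cutoff_len": 4096,                 # max sequence length
    "bf16": True,
}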

Citation

More Information
