---
license: mit
datasets:
- allenai/PRISM
language:
- en
base_model:
- allenai/Molmo-7B-D-0924
pipeline_tag: robotics
tags:
- robotics
- grasping
- task-oriented-grasping
- manipulation
---

# GraspMolmo

[[Paper]](https://arxiv.org/pdf/2505.13441) [[arXiv]](https://arxiv.org/abs/2505.13441) [[Project Website]](https://abhaybd.github.io/GraspMolmo/) [[Data]](https://huggingface.co/datasets/allenai/PRISM)

GraspMolmo is a generalizable, open-vocabulary task-oriented grasping (TOG) model for robotic manipulation. Given an image and a task to complete (e.g., "Pour me some tea"), GraspMolmo points to the most appropriate grasp location, which can then be matched to the closest stable grasp candidate.

## Code Sample

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

img = Image.open("<path_to_image>")
task = "Pour coffee from the blue mug."

# Load the processor and model (remote code is required for the Molmo architecture)
processor = AutoProcessor.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)

# Build the grasp-pointing prompt and preprocess the image/text pair
prompt = f"Point to where I should grasp to accomplish the following task: {task}"
inputs = processor.process(images=img, text=prompt, return_tensors="pt")
# Move tensors to the model's device and add a batch dimension
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

# Generate, then decode only the newly generated tokens
output = model.generate_from_batch(inputs, GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer)
generated_tokens = output[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```

Running the above code should produce output similar to the following:
```
In order to accomplish the task "Pour coffee from the blue mug.", the optimal grasp is described as follows: "The grasp is on the middle handle of the blue mug, with fingers grasping the sides of the handle.".

<point x="28.6" y="20.7" alt="Where to grasp the object">Where to grasp the object</point>
```
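The predicted point is returned as a Molmo-style `<point>` tag, with `x` and `y` typically given as percentages (0-100) of the image width and height. Below is a minimal sketch of extracting the point and converting it to pixel coordinates; the `parse_point` helper and its regular expression are illustrative, not part of the released API:

```python
import re

def parse_point(generated_text: str, img_width: int, img_height: int):
    """Extract the first <point x="..." y="..."> tag and return pixel coordinates.

    Assumes Molmo-style points, where x and y are percentages (0-100) of the
    image width and height, respectively.
    """
    match = re.search(r'<point x="([\d.]+)" y="([\d.]+)"', generated_text)
    if match is None:
        return None
    x_pct, y_pct = float(match.group(1)), float(match.group(2))
    return x_pct / 100.0 * img_width, y_pct / 100.0 * img_height

# Example usage with the output above:
# grasp_px = parse_point(generated_text, *img.size)
```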

## Grasp Inference

To predict a grasp point *and* match it to one of the candidate grasps, refer to the [GraspMolmo](https://github.com/abhaybd/GraspMolmo/blob/main/graspmolmo/inference/grasp_predictor.py) class.
First, install `graspmolmo` with

```bash
pip install "git+https://github.com/abhaybd/GraspMolmo.git#egg=graspmolmo[infer]"
```

and then inference can be run as follows:

```python
import numpy as np

from graspmolmo.inference.grasp_predictor import GraspMolmo

task = "..."
# get_image(), backproject(), and predict_grasps() are placeholders for your own
# camera interface, depth backprojection, and grasp-prediction pipeline.
rgb, depth = get_image()
camera_intrinsics = np.array(...)

point_cloud = backproject(rgb, depth, camera_intrinsics)
# Grasps are expressed in the camera reference frame
grasps = predict_grasps(point_cloud)  # Using your favorite grasp predictor (e.g. M2T2)

gm = GraspMolmo()
idx = gm.pred_grasp(rgb, point_cloud, task, grasps)

print(f"Predicted grasp: {grasps[idx]}")
```
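
The `backproject` helper above is not part of the `graspmolmo` package; it stands in for whatever routine you use to lift the RGB-D frame into a point cloud. Below is a minimal sketch using the pinhole camera model, under the assumption that depth is in meters and the intrinsics are a standard 3x3 matrix; adapt it to your own camera conventions:

```python
import numpy as np

def backproject(rgb: np.ndarray, depth: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """Backproject a depth image into an (N, 3) point cloud in the camera frame.

    Assumes depth has shape (H, W) in meters and intrinsics is the 3x3 pinhole
    matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]. rgb is unused in this sketch;
    some pipelines also attach per-point colors.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[z.reshape(-1) > 0]  # drop invalid (zero-depth) pixels
```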