|
---
license: mit
datasets:
- allenai/PRISM
language:
- en
base_model:
- allenai/Molmo-7B-D-0924
pipeline_tag: robotics
tags:
- robotics
- grasping
- task-oriented-grasping
- manipulation
---
|
|
|
# GraspMolmo |
|
|
|
[[Paper]](https://arxiv.org/pdf/2505.13441) [[arXiv]](https://arxiv.org/abs/2505.13441) [[Project Website]](https://abhaybd.github.io/GraspMolmo/) [[Data]](https://huggingface.co/datasets/allenai/PRISM) |
|
|
|
GraspMolmo is a generalizable open-vocabulary task-oriented grasping (TOG) model for robotic manipulation. Given an image and a task to complete (e.g. "Pour me some tea"), GraspMolmo will point to the most appropriate grasp location, which can then be matched to the closest stable grasp. |
|
|
|
## Code Sample |
|
|
|
```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

img = Image.open("<path_to_image>")
task = "Pour coffee from the blue mug."

# Load the processor and model from the Hub
processor = AutoProcessor.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("allenai/GraspMolmo", torch_dtype="auto", device_map="auto", trust_remote_code=True)

# Build the task-conditioned grasping prompt and preprocess the inputs
prompt = f"Point to where I should grasp to accomplish the following task: {task}"
inputs = processor.process(images=img, text=prompt, return_tensors="pt")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}  # add batch dimension

# Generate the grasp description and point, then decode only the newly generated tokens
output = model.generate_from_batch(inputs, GenerationConfig(max_new_tokens=256, stop_strings="<|endoftext|>"), tokenizer=processor.tokenizer)
generated_tokens = output[0, inputs["input_ids"].size(1):]
generated_text = processor.tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(generated_text)
```
|
|
|
Running the above code could result in the following output: |
|
```
In order to accomplish the task "Pour coffee from the blue mug.", the optimal grasp is described as follows: "The grasp is on the middle handle of the blue mug, with fingers grasping the sides of the handle.".

<point x="28.6" y="20.7" alt="Where to grasp the object">Where to grasp the object</point>
```
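
The `<point>` tag carries the predicted grasp point. As a minimal sketch (assuming Molmo's convention that point coordinates are percentages of the image width and height, 0-100), you could parse it back into pixel coordinates as below; the regular expression and `parse_point` helper are illustrative, not part of the released API:

```python
import re
from PIL import Image

def parse_point(generated_text: str, image: Image.Image):
    """Extract the predicted <point> and convert it to pixel coordinates.

    Assumes x/y are given as percentages (0-100) of the image width/height;
    returns None if no point tag is found in the generated text.
    """
    match = re.search(r'<point x="([\d.]+)" y="([\d.]+)"', generated_text)
    if match is None:
        return None
    x_pct, y_pct = float(match.group(1)), float(match.group(2))
    return x_pct / 100.0 * image.width, y_pct / 100.0 * image.height

# e.g. parse_point(generated_text, img) -> approximate pixel location of the grasp point
```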
|
|
|
## Grasp Inference |
|
|
|
To predict a grasp point *and* match it to one of the candidate grasps, refer to the [`GraspMolmo`](https://github.com/abhaybd/GraspMolmo/blob/main/graspmolmo/inference/grasp_predictor.py) class.
|
First, install `graspmolmo` with |
|
|
|
```bash |
|
pip install "git+https://github.com/abhaybd/GraspMolmo.git#egg=graspmolmo[infer]" |
|
``` |
|
|
|
and then inference can be run as follows: |
|
|
|
```python
import numpy as np

from graspmolmo.inference.grasp_predictor import GraspMolmo

task = "..."
# RGB image, depth map, and camera intrinsics from your sensor
rgb, depth = get_image()
camera_intrinsics = np.array(...)

# Back-project the RGB-D observation into a point cloud in the camera frame
point_cloud = backproject(rgb, depth, camera_intrinsics)
# Candidate grasps are in the camera reference frame
grasps = predict_grasps(point_cloud)  # using your favorite grasp predictor (e.g. M2T2)

gm = GraspMolmo()
idx = gm.pred_grasp(rgb, point_cloud, task, grasps)

print(f"Predicted grasp: {grasps[idx]}")
```
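
The snippet above leaves `get_image`, `backproject`, and `predict_grasps` to your own pipeline. For reference, a minimal sketch of the back-projection step might look like the following, assuming a pinhole camera model with intrinsics `[[fx, 0, cx], [0, fy, cy], [0, 0, 1]]` and depth in meters; check the GraspMolmo repository for the exact point-cloud format that `GraspMolmo.pred_grasp` expects.

```python
import numpy as np

def backproject(rgb: np.ndarray, depth: np.ndarray, intrinsics: np.ndarray) -> np.ndarray:
    """Back-project a depth map into an (N, 3) point cloud in the camera frame.

    Illustrative sketch only: assumes a pinhole camera model, depth in meters,
    and invalid pixels marked with depth <= 0. The rgb argument mirrors the
    call above but is not used in this minimal version.
    """
    fx, fy = intrinsics[0, 0], intrinsics[1, 1]
    cx, cy = intrinsics[0, 2], intrinsics[1, 2]

    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy

    valid = depth > 0
    return np.stack([x, y, depth], axis=-1)[valid]
```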