RDT2-FM: Flow-Matching Action Expert for RDT 2
RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective. By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups. This repository specifically provides the action expert component of RDT2-FM.
Table of contents
- Highlights
- Model details
- Hardware & software requirements
- Quickstart (inference)
- Precision settings
- Intended uses & limitations
- Troubleshooting
- Changelog
- Citation
- Contact
Highlights
- Low-latency control: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
- Zero-shot cross-embodiment: Designed to work with any bimanual platform (e.g., UR5e, Franka FR3) after proper calibration.
- Scales with RDT2-VQ: Pairs with the VLM backbone (RDT2-VQ) trained on 10k+ hours and 100+ scenes of UMI manipulation.
Model details
Architecture
- Backbone: Vision-language backbone such as RDT2-VQ (Qwen2.5-VL-7B based).
- Action head: Flow-Matching (FM) expert mapping observations + instruction → continuous actions.
- Observation: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics (a preprocessing sketch follows this list).
- Instruction: Short imperative text, recommended format “Verb + Object.” (e.g., “Pick up the apple.”).
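The sketch below illustrates the expected image format only; it assumes OpenCV-style BGR camera frames, while the exact preprocessing pipeline is defined in the official repository.

import cv2
import numpy as np

def preprocess_wrist_image(frame_bgr: np.ndarray) -> np.ndarray:
    """Resize a raw wrist-camera frame to the 384x384 RGB uint8 format the model expects.

    `frame_bgr` is assumed to be an OpenCV-style BGR frame; drop the color conversion
    if your camera driver already yields RGB.
    """
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(frame_rgb, (384, 384), interpolation=cv2.INTER_AREA)
    return resized.astype(np.uint8)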
Action representation (UMI bimanual, per 24-step chunk)
- 20-D per step = right (10) + left (10):
  - pos (x, y, z): 3
  - rot (6D rotation): 6
  - gripper width: 1
- Output tensor shape: (T=24, D=20), relative deltas, float32 (a decoding sketch follows this list).
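For illustration, a minimal sketch of splitting a (24, 20) chunk into per-arm components and converting the 6D rotation into a rotation matrix via Gram–Schmidt; the column convention here is an assumption, so confirm against the official repository before relying on it.

import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Gram-Schmidt a 6D rotation vector into a 3x3 rotation matrix (columns b1, b2, b3)."""
    a1, a2 = rot6d[:3], rot6d[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)

def split_action_chunk(chunk: np.ndarray) -> dict:
    """Split a (24, 20) chunk into right/left pos (3), 6D rot (6), and gripper width (1)."""
    arms = {}
    for name, offset in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": chunk[:, offset:offset + 3],        # relative xyz deltas
            "rot6d": chunk[:, offset + 3:offset + 9],  # 6D rotation per step
            "gripper_width": chunk[:, offset + 9],     # meters
        }
    return arms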
Hardware & software requirements
Approximate single-GPU requirements:
| Mode | RAM | VRAM | Example GPU | 
|---|---|---|---|
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 | 
| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 | 
For deployment on real robots, match your platform's end-effector and camera choices, and complete the hardware setup and calibration (camera stand/pose, flange, etc.) before running closed-loop policies.
Tested OS: Ubuntu 24.04.
Quickstart (inference)
# Run under root directory of RDT2 GitHub Repo: https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration
import numpy as np
import torch
import yaml

from models.rdt_inferencer import RDTInferencer
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # TODO: modify `normalizer_path` to your own downloaded normalizer path
    # download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",  # use RDT2-VQ as the VLM backbone
    device="cuda:0",
    dtype=torch.bfloat16,
)
result = model.step(
    observations={
        'images': {
            # 'exterior_rs': np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            'left_stereo': ..., # left arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
            'right_stereo': ..., # right arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
        },
        # Use a zero current state for now; the input interface is preserved for future fine-tuning.
        'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
    },
    # Language instruction; we suggest the "Verb + Object." format with a
    # capitalized first letter and a trailing period.
    instruction="Pick up the apple.",
)
# relative action chunk in np.ndarray of shape (24, 20) with dtype=np.float32
# with the same format as RDT2-VQ
action_chunk = result.detach().cpu().numpy()
# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
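Below is a minimal closed-loop usage sketch built on the Quickstart objects above; get_wrist_frames() and execute_relative_chunk() are hypothetical placeholders for your camera driver and robot controller, not part of this repository.

instruction = "Pick up the apple."  # "Verb + Object." format

while True:  # loop until your own task-termination condition
    # Hypothetical helper: returns two 384x384x3 uint8 RGB wrist frames.
    left_img, right_img = get_wrist_frames()
    result = model.step(
        observations={
            'images': {'left_stereo': left_img, 'right_stereo': right_img},
            'state': np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
        },
        instruction=instruction,
    )
    action_chunk = result.detach().cpu().numpy()  # (24, 20) relative chunk
    for robot_idx in range(2):  # rescale gripper widths as above
        action_chunk[:, robot_idx * 10 + 9] *= 0.1 / 0.088
    # Hypothetical helper: applies the relative deltas on your controller.
    execute_relative_chunk(action_chunk)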
For guides on installation and fine-tuning, please refer to the official GitHub repository.
Precision settings
- RDT2-FM (action expert): bfloat16 for training and inference.
- RDT2-VQ (VLM backbone): bfloat16 by default (following Qwen2.5-VL practices).
Intended uses & limitations
Intended uses
- Research in robot manipulation and VLA modeling.
- Low-latency, short-horizon control on bimanual systems following hardware calibration steps.
Limitations
- Performance depends on calibration quality, camera placement, and correct normalization.
- Dataset or action-statistics shift can degrade behavior; verify bounds and reconstruction when adapting (a quick bounds check is sketched below).
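As one example, a minimal bounds check, assuming your actions are stored as an (N, 24, 20) NumPy array in the chunk format above and that gripper widths lie in [0, 0.088] m as in the Quickstart rescaling.

import numpy as np

def check_action_bounds(actions: np.ndarray) -> None:
    """Print per-dimension ranges and flag gripper widths outside [0, 0.088] m."""
    flat = actions.reshape(-1, actions.shape[-1])  # (N * 24, 20)
    print("per-dim min:", flat.min(axis=0))
    print("per-dim max:", flat.max(axis=0))
    for arm, offset in (("right", 0), ("left", 10)):
        widths = flat[:, offset + 9]
        if widths.min() < 0.0 or widths.max() > 0.088:
            print(f"warning: {arm} gripper widths outside [0, 0.088]:"
                  f" [{widths.min():.4f}, {widths.max():.4f}]")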
Safety & responsible use
- Always test with hardware limits engaged (reduced speed, gravity compensation, E-stop within reach).
Troubleshooting
| Symptom | Likely cause | Suggested fix | 
|---|---|---|
| Drifting / unstable gripper widths | Scale mismatch | Apply LinearNormalizer; rescale widths ([0,0.088] → [0,0.1]). | 
| Poor instruction following | Prompt format / backbone config | Use “Verb + Object.”; ensure the backbone is loaded on the same device. | 
Changelog
- 2025-09: Initial release of RDT2-FM on Hugging Face.
Citation
@software{rdt2,
    title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
    author={RDT Team},
    url={https://github.com/thu-ml/RDT2},
    month={September},
    year={2025}
}
Contact
- Project page: https://rdt-robotics.github.io/rdt2/
- Organization: https://huggingface.co/robotics-diffusion-transformer
- Discord: https://discord.gg/vsZS3zmf9A