RDT2-FM: Flow-Matching Action Expert for RDT 2
RDT2-FM builds on a vision-language backbone (RDT2-VQ) and predicts short-horizon relative action chunks through an action expert that integrates an improved RDT architecture with a flow-matching objective. By leveraging flow matching, RDT2-FM achieves lower inference latency while maintaining strong instruction following and cross-embodiment generalization on UMI-style bimanual setups. This repository specifically provides the action expert component of RDT2-FM.
Table of contents
- Highlights
- Model details
- Hardware & software requirements
- Quickstart (inference)
- Precision settings
- Intended uses & limitations
- Troubleshooting
- Changelog
- Citation
- Contact
Highlights
- Low-latency control: Flow-matching policy head (no iterative denoising) for fast closed-loop actions.
- Zero-shot cross-embodiment: Designed to work with any bimanual platform (e.g., UR5e, Franka FR3) after proper calibration.
- Scales with RDT2-VQ: Pairs with the VLM backbone (RDT2-VQ) trained on 10k+ hours and 100+ scenes of UMI manipulation.
Model details
Architecture
- Backbone: Vision-language backbone such as RDT2-VQ (Qwen2.5-VL-7B based).
- Action head: Flow-Matching (FM) expert mapping observations + instruction → continuous actions.
- Observation: Two wrist-camera RGB images (right/left), 384×384, JPEG-like statistics (a preprocessing sketch follows this list).
- Instruction: Short imperative text, recommended format “Verb + Object.” (e.g., “Pick up the apple.”).
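The sketch below illustrates the expected image format only; it assumes OpenCV-style BGR camera frames, while the exact preprocessing pipeline is defined in the official repository.

import cv2
import numpy as np

def preprocess_wrist_image(frame_bgr: np.ndarray) -> np.ndarray:
    """Resize a raw wrist-camera frame to the 384x384 RGB uint8 format the model expects.

    `frame_bgr` is assumed to be an OpenCV-style BGR frame; drop the color conversion
    if your camera driver already yields RGB.
    """
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    resized = cv2.resize(frame_rgb, (384, 384), interpolation=cv2.INTER_AREA)
    return resized.astype(np.uint8)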
Action representation (UMI bimanual, per 24-step chunk)
- 20-D per step = right (10) + left (10):
  - pos (x, y, z): 3
  - rot (6D rotation): 6
  - gripper width: 1
- Output tensor shape: (T=24, D=20), relative deltas, float32 (a decoding sketch follows this list).
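For illustration, a minimal sketch of splitting a (24, 20) chunk into per-arm components and converting the 6D rotation into a rotation matrix via Gram–Schmidt; the column convention here is an assumption, so confirm against the official repository before relying on it.

import numpy as np

def rot6d_to_matrix(rot6d: np.ndarray) -> np.ndarray:
    """Gram-Schmidt a 6D rotation vector into a 3x3 rotation matrix (columns b1, b2, b3)."""
    a1, a2 = rot6d[:3], rot6d[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - np.dot(b1, a2) * b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)
    return np.stack([b1, b2, b3], axis=-1)

def split_action_chunk(chunk: np.ndarray) -> dict:
    """Split a (24, 20) chunk into right/left pos (3), 6D rot (6), and gripper width (1)."""
    arms = {}
    for name, offset in (("right", 0), ("left", 10)):
        arms[name] = {
            "pos": chunk[:, offset:offset + 3],        # relative xyz deltas
            "rot6d": chunk[:, offset + 3:offset + 9],  # 6D rotation per step
            "gripper_width": chunk[:, offset + 9],     # meters
        }
    return arms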
Hardware & software requirements
Approximate single-GPU requirements:
| Mode | RAM | VRAM | Example GPU | 
|---|---|---|---|
| Inference (FM head + VLM) | ≥ 32 GB | ~ 16 GB | RTX 4090 | 
| Fine-tuning FM head | – | ~ 16 GB | RTX 4090 | 
For deployment on real robots, match your platform's end-effector and camera choices, and complete the hardware setup and calibration (camera stand/pose, flange, etc.) before running closed-loop policies.
Tested OS: Ubuntu 24.04.
Quickstart (inference)
# Run under root directory of RDT2 GitHub Repo: https://github.com/thu-ml/RDT2/tree/main?tab=readme-ov-file#1-important-hard-ware-set-up-and-calibration
import numpy as np
import torch
import yaml

from models.rdt_inferencer import RDTInferencer
with open("configs/rdt/post_train.yaml", "r") as f:
    model_config = yaml.safe_load(f)

model = RDTInferencer(
    config=model_config,
    pretrained_path="robotics-diffusion-transformer/RDT2-FM",
    # TODO: modify `normalizer_path` to your own downloaded normalizer path
    # download from http://ml.cs.tsinghua.edu.cn/~lingxuan/rdt2/umi_normalizer_wo_downsample_indentity_rot.pt
    normalizer_path="umi_normalizer_wo_downsample_indentity_rot.pt",
    pretrained_vision_language_model_name_or_path="robotics-diffusion-transformer/RDT2-VQ",  # use RDT2-VQ as the VLM backbone
    device="cuda:0",
    dtype=torch.bfloat16,
)
result = model.step(
    observations={
        'images': {
            # 'exterior_rs': np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            'left_stereo': ..., # left arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
            'right_stereo': ..., # right arm RGB image in np.ndarray of shape (384, 384, 3) with dtype=np.uint8
        },
        # Use a zero current state for now; the input interface is preserved for future fine-tuning.
        'state': np.zeros(model_config["common"]["state_dim"]).astype(np.float32)
    },
    # Language instruction; we suggest the "Verb + Object." format with a
    # capitalized first letter and a trailing period.
    instruction="Pick up the apple.",
)
# relative action chunk in np.ndarray of shape (24, 20) with dtype=np.float32
# with the same format as RDT2-VQ
action_chunk = result.detach().cpu().numpy()
# rescale gripper width from [0, 0.088] to [0, 0.1]
for robot_idx in range(2):
    action_chunk[:, robot_idx * 10 + 9] = action_chunk[:, robot_idx * 10 + 9] / 0.088 * 0.1
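Below is a minimal closed-loop usage sketch built on the Quickstart objects above; get_wrist_frames() and execute_relative_chunk() are hypothetical placeholders for your camera driver and robot controller, not part of this repository.

instruction = "Pick up the apple."  # "Verb + Object." format

while True:  # loop until your own task-termination condition
    # Hypothetical helper: returns two 384x384x3 uint8 RGB wrist frames.
    left_img, right_img = get_wrist_frames()
    result = model.step(
        observations={
            'images': {'left_stereo': left_img, 'right_stereo': right_img},
            'state': np.zeros(model_config["common"]["state_dim"], dtype=np.float32),
        },
        instruction=instruction,
    )
    action_chunk = result.detach().cpu().numpy()  # (24, 20) relative chunk
    for robot_idx in range(2):  # rescale gripper widths as above
        action_chunk[:, robot_idx * 10 + 9] *= 0.1 / 0.088
    # Hypothetical helper: applies the relative deltas on your controller.
    execute_relative_chunk(action_chunk)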
For guides on installation and fine-tuning, please refer to the official GitHub repository.
Precision settings
- RDT2-FM (action expert): bfloat16 for training and inference.
- RDT2-VQ (VLM backbone): bfloat16 by default (following Qwen2.5-VL practices).
Intended uses & limitations
Intended uses
- Research in robot manipulation and VLA modeling.
- Low-latency, short-horizon control on bimanual systems following hardware calibration steps.
Limitations
- Performance depends on calibration quality, camera placement, and correct normalization.
- Dataset or action-statistics shift can degrade behavior; verify bounds and reconstruction when adapting (a quick bounds check is sketched below).
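As one example, a minimal bounds check, assuming your actions are stored as an (N, 24, 20) NumPy array in the chunk format above and that gripper widths lie in [0, 0.088] m as in the Quickstart rescaling.

import numpy as np

def check_action_bounds(actions: np.ndarray) -> None:
    """Print per-dimension ranges and flag gripper widths outside [0, 0.088] m."""
    flat = actions.reshape(-1, actions.shape[-1])  # (N * 24, 20)
    print("per-dim min:", flat.min(axis=0))
    print("per-dim max:", flat.max(axis=0))
    for arm, offset in (("right", 0), ("left", 10)):
        widths = flat[:, offset + 9]
        if widths.min() < 0.0 or widths.max() > 0.088:
            print(f"warning: {arm} gripper widths outside [0, 0.088]:"
                  f" [{widths.min():.4f}, {widths.max():.4f}]")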
Safety & responsible use
- Always test with hardware limits engaged (reduced speed, gravity compensation, E-stop within reach).
Troubleshooting
| Symptom | Likely cause | Suggested fix | 
|---|---|---|
| Drifting / unstable gripper widths | Scale mismatch | Apply LinearNormalizer; rescale widths ([0,0.088] → [0,0.1]). | 
| Poor instruction following | Prompt format / backbone config | Use “Verb + Object.”; ensure the backbone is loaded on the same device. | 
Changelog
- 2025-09: Initial release of RDT2-FM on Hugging Face.
Citation
@software{rdt2,
    title={RDT2: Enabling Zero-Shot Cross-Embodiment Generalization by Scaling Up UMI Data},
    author={RDT Team},
    url={https://github.com/thu-ml/RDT2},
    month={September},
    year={2025}
}
Contact
- Project page: https://rdt-robotics.github.io/rdt2/
- Organization: https://huggingface.co/robotics-diffusion-transformer
- Discord: https://discord.gg/vsZS3zmf9A