Model Card for MBDPO
Official release of MBDPO model checkpoints for the paper "Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization" by Xiaoyuan Cheng*, Wenxuan Yuan*, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun†, and Che Liu†.
Quick links: [Website] [Paper] [Code]
Note: Some checkpoints were lost during saving because of limited storage capacity. This repository therefore provides only the offline-trained checkpoints and the offline-to-online pretrained checkpoints.
Model Details
We release model checkpoints for MBDPO, a model-based reinforcement learning framework that unifies search and policy optimization through diffusion policy optimization inside a learned latent world model.
MBDPO reformulates policy optimization as a diffusion process over imagined trajectories. The diffusion score field is corrected by model-based returns and anchored to the behavior distribution through an implicit energy function. This design removes the need for an explicit planner on top of the world model and addresses the structural mismatch between search and value learning in prior world-model reinforcement learning methods.
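To give some intuition for a score field that is "corrected by returns and anchored to the behavior distribution", here is a minimal 1-D toy sketch. It is not the MBDPO algorithm or its objective: it assumes a standard-normal behavior distribution, a hand-written quadratic return, a temperature `alpha`, and a single-noise-level Langevin sampler instead of a full diffusion schedule. All function names are hypothetical.

```python
import random


def behavior_score(a):
    """Score (gradient of the log-density) of the toy behavior distribution N(0, 1)."""
    return -a


def return_gradient(a, target=2.0):
    """Gradient of the toy return Q(a) = -(a - target)^2 / 2."""
    return -(a - target)


def guided_score(a, alpha=1.0):
    """Behavior score plus the return gradient scaled by 1 / alpha.

    This is the score of the density proportional to N(a; 0, 1) * exp(Q(a) / alpha).
    """
    return behavior_score(a) + return_gradient(a) / alpha


def analytic_mean(alpha=1.0, target=2.0):
    """Closed-form mean of the guided density for this Gaussian toy case."""
    return target / (alpha + 1.0)


def langevin_sample(n_particles=500, n_steps=500, step=0.01, alpha=1.0, seed=0):
    """Draw approximate samples from the guided density with unadjusted Langevin dynamics."""
    rng = random.Random(seed)
    # Start every particle from the behavior distribution.
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    noise_scale = (2.0 * step) ** 0.5
    for _ in range(n_steps):
        particles = [
            a + step * guided_score(a, alpha) + noise_scale * rng.gauss(0.0, 1.0)
            for a in particles
        ]
    return particles


if __name__ == "__main__":
    samples = langevin_sample()
    print("sample mean:", sum(samples) / len(samples))
    print("analytic mean:", analytic_mean())
```

In this toy, a smaller `alpha` pulls samples toward the return maximum at `target`, while a larger `alpha` keeps them closer to the behavior distribution centered at zero.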
Model Description
- Developed by: Xiaoyuan Cheng*, Wenxuan Yuan*, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, and Che Liu
- Model type: Model-based reinforcement learning checkpoints with diffusion policy optimization
- Framework: PyTorch
- Task type: Continuous control
- License: MIT
Model Sources
- Repository: https://github.com/Edmond1Cheng/MBDPO
- Paper: http://arxiv.org/abs/2605.26282
Uses
These checkpoints are intended for researchers interested in model-based reinforcement learning, world models, diffusion policies, offline reinforcement learning, and offline-to-online fine-tuning.
They can be used for reproducing MBDPO results, evaluating pretrained agents, analyzing learned world models and policies, and initializing offline-to-online fine-tuning experiments.
Direct Use
Model checkpoints can be loaded and evaluated using the official implementation.
Example evaluation command:
python scripts/evaluate.py \
task=mt80 \
checkpoint=/path/to/checkpoint.pt \
eval_episodes=10
Out-of-Scope Use
These checkpoints are research artifacts trained and evaluated in simulated continuous control environments. They are not intended for direct deployment in real-world robotics systems or safety-critical applications without additional validation.
We do not expect checkpoints to generalize reliably to unseen tasks or substantially different environments without fine-tuning or further training.
How to Get Started with the Models
Please first install the official implementation:
git clone https://github.com/Edmond1Cheng/MBDPO.git
cd MBDPO
Create the corresponding Conda environment. For example, for MT80 experiments:
conda env create -f conda_envs/mbdpo-mt80.yml
conda activate mbdpo-mt80
Other environment files are also provided for different experiment suites, such as ManiSkill2 and MyoSuite.
After downloading a checkpoint from this repository, run evaluation with:
python scripts/evaluate.py \
task=mt80 \
checkpoint=/path/to/checkpoint.pt \
eval_episodes=10
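When several checkpoints have been downloaded, the evaluation command above can be generated once per file. The following is a minimal sketch and not part of the official repository: it assumes the checkpoints are `.pt` files in a single local directory and that `scripts/evaluate.py` accepts the `task`, `checkpoint`, and `eval_episodes` overrides exactly as shown above. The helper name `build_eval_commands` is hypothetical, and the commands are only printed, not executed.

```python
import shlex
from pathlib import Path


def build_eval_commands(ckpt_dir, task="mt80", eval_episodes=10):
    """Return one evaluation command (as an argv list) per .pt file in ckpt_dir."""
    commands = []
    # Sort so the order is stable across runs and file systems.
    for ckpt in sorted(Path(ckpt_dir).glob("*.pt")):
        commands.append([
            "python", "scripts/evaluate.py",
            f"task={task}",
            f"checkpoint={ckpt}",
            f"eval_episodes={eval_episodes}",
        ])
    return commands


if __name__ == "__main__":
    # "/path/to/checkpoints" is a placeholder; replace it with a real directory.
    for cmd in build_eval_commands("/path/to/checkpoints"):
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell from the repository root, or the argv lists can be passed to `subprocess.run` once the paths are real.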
For offline-to-online fine-tuning:
python scripts/offline_to_online.py \
checkpoint=/path/to/checkpoint.pt \
save_path=/path/to/output_dir \
off2on_task="walker-run" \
steps=40000
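To fine-tune the same pretrained checkpoint on more than one task, the command above can be composed per task with a separate output directory for each run. This is a hedged sketch rather than an official utility: it assumes `scripts/offline_to_online.py` accepts only the four overrides shown above, the helper name `build_finetune_commands` is hypothetical, and `"cheetah-run"` in the usage section is a hypothetical second task name that should be checked against the repository's supported tasks. Nothing is executed; the commands are only printed.

```python
import shlex
from pathlib import Path


def build_finetune_commands(checkpoint, output_root, tasks, steps=40000):
    """Return one offline-to-online command (as an argv list) per task name."""
    commands = []
    for task in tasks:
        # Give each task its own output directory so runs do not overwrite each other.
        save_path = Path(output_root) / task
        commands.append([
            "python", "scripts/offline_to_online.py",
            f"checkpoint={checkpoint}",
            f"save_path={save_path}",
            f"off2on_task={task}",
            f"steps={steps}",
        ])
    return commands


if __name__ == "__main__":
    # Both paths are placeholders; "cheetah-run" is a hypothetical task name.
    for cmd in build_finetune_commands(
        "/path/to/checkpoint.pt", "/path/to/output_dir", ["walker-run", "cheetah-run"]
    ):
        print(shlex.join(cmd))
```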
Please refer to the official repository for detailed installation instructions, configuration files, and experiment scripts.
Training Details
MBDPO supports three main experimental settings:
- Online training from scratch
- Multi-task offline pretraining
- Offline-to-online fine-tuning
Training Data
For multi-task offline pretraining, MBDPO uses replay buffer data from the open-source TD-MPC2 dataset. The relevant subsets are mt30 and mt80.
Supported Tasks
MBDPO supports 121 continuous control tasks across the following domains:
| Domain | Number of Tasks |
|---|---|
| DMControl | 39 |
| MetaWorld | 50 |
| ManiSkill2 | 5 |
| MyoSuite | 10 |
| Locomotion | 7 |
| Visual RL | 10 |
| Total | 121 |
In the DMControl domain, MBDPO follows the TD-MPC2 setting and includes 11 custom tasks.
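As a quick sanity check, the per-domain counts in the table can be tallied in a few lines. The dictionary below simply copies the table. The last figure assumes the 11 custom DMControl tasks are counted within the 39 listed for that domain, which the text implies but does not state outright.

```python
# Task counts per domain, copied from the table above.
TASKS_PER_DOMAIN = {
    "DMControl": 39,
    "MetaWorld": 50,
    "ManiSkill2": 5,
    "MyoSuite": 10,
    "Locomotion": 7,
    "Visual RL": 10,
}

# Assumption: the 11 custom tasks are part of the 39 DMControl tasks.
CUSTOM_DMCONTROL_TASKS = 11

total = sum(TASKS_PER_DOMAIN.values())
non_custom_dmcontrol = TASKS_PER_DOMAIN["DMControl"] - CUSTOM_DMCONTROL_TASKS

print(total)                 # 121
print(non_custom_dmcontrol)  # 28
```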
Citation
If you find our work useful, please consider citing the paper as follows:
BibTeX:
@misc{cheng2026scalingworldmodelreinforcementlearning,
title={Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization},
author={Xiaoyuan Cheng and Wenxuan Yuan and Zhancun Mu and Yuanzhao Zhang and Yiming Yang and Hai Wang and Zhuo Sun and Che Liu},
year={2026},
eprint={2605.26282},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={http://arxiv.org/abs/2605.26282}
}
Contact
For questions about the paper, please contact:
- Xiaoyuan Cheng: ucesxc4@ucl.ac.uk
- Wenxuan Yuan: YUAN0186@e.ntu.edu.sg
For bugs, feature requests, or contributions, please open an issue or pull request in the official GitHub repository: https://github.com/Edmond1Cheng/MBDPO