Model Card for MBDPO

Official release of MBDPO model checkpoints for the paper "Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization" by Xiaoyuan Cheng*, Wenxuan Yuan*, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, and Che Liu.

Quick links: [Website] [Paper] [Code]

Note: Some checkpoints were lost during saving because of limited storage capacity. This repository therefore provides only the offline-trained checkpoints and the offline-to-online pretrained checkpoints.

Model Details

We release model checkpoints for MBDPO, a model-based reinforcement learning framework that unifies search and policy optimization through diffusion policy optimization inside a learned latent world model.

MBDPO reformulates policy optimization as a diffusion process over imagined trajectories. The diffusion score field is corrected by model-based returns and anchored to the behavior distribution through an implicit energy function. This design removes the need for an explicit planner on top of the world model and addresses the structural mismatch between search and value learning in prior world-model reinforcement learning methods.
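
One generic way to read the return correction and behavior anchoring described above is as energy-guided sampling. The following is an illustrative sketch only, not the objective from the paper: the symbols $\mu$ (behavior distribution), $Q$ (model-based return estimate), $\lambda$ (temperature), and $z$ (latent state) are assumptions introduced here.

```latex
% Illustrative only: a standard energy-guided form, not the paper's exact objective.
% Target policy: behavior distribution reweighted by an exponentiated return.
\pi^{*}(a \mid z) \;\propto\; \mu(a \mid z)\,\exp\!\big(Q(z, a)/\lambda\big)

% Taking \nabla_a \log of both sides (the normalizer does not depend on a)
% gives a behavior score plus a return correction.
\nabla_a \log \pi^{*}(a \mid z)
  \;=\; \nabla_a \log \mu(a \mid z) \;+\; \frac{1}{\lambda}\,\nabla_a Q(z, a)
```

In this reading, the first term keeps samples close to the behavior data and the second pushes them toward higher predicted return; a smaller $\lambda$ weights the return term more heavily.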

Model Description

  • Developed by: Xiaoyuan Cheng*, Wenxuan Yuan*, Zhancun Mu, Yuanzhao Zhang, Yiming Yang, Hai Wang, Zhuo Sun, and Che Liu
  • Model type: Model-based reinforcement learning checkpoints with diffusion policy optimization
  • Framework: PyTorch
  • Task type: Continuous control
  • License: MIT

Model Sources

  • Repository: https://github.com/Edmond1Cheng/MBDPO
  • Paper: http://arxiv.org/abs/2605.26282

Uses

These checkpoints are intended for researchers interested in model-based reinforcement learning, world models, diffusion policies, offline reinforcement learning, and offline-to-online fine-tuning.

They can be used for reproducing MBDPO results, evaluating pretrained agents, analyzing learned world models and policies, and initializing offline-to-online fine-tuning experiments.

Direct Use

Model checkpoints can be loaded and evaluated using the official implementation.

Example evaluation command:

python scripts/evaluate.py \
  task=mt80 \
  checkpoint=/path/to/checkpoint.pt \
  eval_episodes=10

Out-of-Scope Use

These checkpoints are research artifacts trained and evaluated in simulated continuous control environments. They are not intended for direct deployment in real-world robotics systems or safety-critical applications without additional validation.

We do not expect checkpoints to generalize reliably to unseen tasks or substantially different environments without fine-tuning or further training.

How to Get Started with the Models

Please first clone the official implementation:

git clone https://github.com/Edmond1Cheng/MBDPO.git
cd MBDPO

Create the corresponding Conda environment. For example, for MT80 experiments:

conda env create -f conda_envs/mbdpo-mt80.yml
conda activate mbdpo-mt80

Other environment files are also provided for different experiment suites, such as ManiSkill2 and MyoSuite.

After downloading a checkpoint from this repository, run evaluation with:

python scripts/evaluate.py \
  task=mt80 \
  checkpoint=/path/to/checkpoint.pt \
  eval_episodes=10
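
To evaluate several downloaded checkpoints in one pass, the command above can be wrapped in a small script. The following is a minimal sketch, not part of the official repository: the helper names `build_eval_command` and `evaluation_commands` and the `checkpoints` directory are hypothetical, and it assumes `scripts/evaluate.py` accepts exactly the `task`, `checkpoint`, and `eval_episodes` overrides shown above.

```python
import shlex
from pathlib import Path


def build_eval_command(checkpoint, task="mt80", eval_episodes=10):
    """Return the evaluation command from this card as an argument list."""
    return [
        "python",
        "scripts/evaluate.py",
        f"task={task}",
        f"checkpoint={checkpoint}",
        f"eval_episodes={eval_episodes}",
    ]


def evaluation_commands(checkpoint_dir, task="mt80", eval_episodes=10):
    """Build one command per *.pt file in checkpoint_dir, sorted by name."""
    checkpoints = sorted(Path(checkpoint_dir).glob("*.pt"))
    return [build_eval_command(p, task, eval_episodes) for p in checkpoints]


if __name__ == "__main__":
    # "checkpoints" is a placeholder directory for downloaded files.
    for command in evaluation_commands("checkpoints"):
        print(shlex.join(command))
```

The script only prints the commands. To execute them, pass each argument list to `subprocess.run` from the root of the cloned repository with the Conda environment activated.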

For offline-to-online fine-tuning:

python scripts/offline_to_online.py \
  checkpoint=/path/to/checkpoint.pt \
  save_path=/path/to/output_dir \
  off2on_task="walker-run" \
  steps=40000

Please refer to the official repository for detailed installation instructions, configuration files, and experiment scripts.

Training Details

MBDPO supports three main experimental settings:

  1. Online training from scratch
  2. Multi-task offline pretraining
  3. Offline-to-online fine-tuning

Training Data

For multi-task offline pretraining, MBDPO uses replay buffer data from the open-source TD-MPC2 dataset. The relevant subsets are mt30 and mt80.

Supported Tasks

MBDPO supports 121 continuous control tasks across the following domains:

Domain       Number of Tasks
DMControl    39
MetaWorld    50
ManiSkill2   5
MyoSuite     10
Locomotion   7
Visual RL    10
Total        121

In the DMControl domain, MBDPO follows the TD-MPC2 setting and includes 11 custom tasks.
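
As a quick consistency check on the task table, the per-domain counts can be summed and compared against the stated total. This is a minimal sketch; the dictionary name `TASKS_PER_DOMAIN` is hypothetical and the numbers are copied from the table in this card.

```python
# Task counts copied from the table in this model card.
TASKS_PER_DOMAIN = {
    "DMControl": 39,
    "MetaWorld": 50,
    "ManiSkill2": 5,
    "MyoSuite": 10,
    "Locomotion": 7,
    "Visual RL": 10,
}

total = sum(TASKS_PER_DOMAIN.values())
assert total == 121, f"expected 121 tasks, got {total}"

# Share of each domain in the full task set, largest first.
for domain, count in sorted(TASKS_PER_DOMAIN.items(), key=lambda kv: -kv[1]):
    print(f"{domain:<12}{count:>4}  {count / total:6.1%}")
```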

Citation

If you find our work useful, please consider citing the paper as follows:

BibTeX:

@misc{cheng2026scalingworldmodelreinforcementlearning,
      title={Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization}, 
      author={Xiaoyuan Cheng and Wenxuan Yuan and Zhancun Mu and Yuanzhao Zhang and Yiming Yang and Hai Wang and Zhuo Sun and Che Liu},
      year={2026},
      eprint={2605.26282},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={http://arxiv.org/abs/2605.26282}
}

Contact

For questions about the paper, please contact the authors.

For bugs, feature requests, or contributions, please open an issue or pull request in the official GitHub repository:

https://github.com/Edmond1Cheng/MBDPO
