ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation
Abstract
ChronoEdit addresses physical consistency in image editing by reframing it as a video generation problem, leveraging pretrained video models and temporal reasoning tokens.
Recent advances in large generative models have significantly improved image editing and in-context image generation, yet a critical gap remains in ensuring physical consistency, where edited objects must remain coherent with their surroundings. This capability is especially vital for tasks related to world simulation. In this paper, we present ChronoEdit, a framework that reframes image editing as a video generation problem. First, ChronoEdit treats the input and edited images as the first and last frames of a video, allowing it to leverage large pretrained video generative models that capture not only object appearance but also the implicit physics of motion and interaction through learned temporal consistency. Second, ChronoEdit introduces a temporal reasoning stage that explicitly performs the edit at inference time. Under this setting, the target frame is jointly denoised with reasoning tokens that imagine a plausible editing trajectory, constraining the solution space to physically viable transformations. The reasoning tokens are then dropped after a few steps to avoid the high computational cost of rendering a full video. To validate ChronoEdit, we introduce PBench-Edit, a new benchmark of image-prompt pairs for contexts that require physical consistency, and demonstrate that ChronoEdit surpasses state-of-the-art baselines in both visual fidelity and physical plausibility. Code and models for both the 14B and 2B variants of ChronoEdit will be released on the project page: https://research.nvidia.com/labs/toronto-ai/chronoedit
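To make the two-stage inference described above concrete, here is a minimal sketch of how joint denoising with reasoning tokens followed by their removal might look. This is not ChronoEdit's released implementation: the denoiser is a stand-in stub, and the tensor shapes, token count, and step counts (`num_reasoning_tokens`, `reasoning_steps`, `total_steps`) are illustrative assumptions, not values from the paper.

```python
# Assumption-heavy sketch of the two-stage sampling loop from the abstract:
# the edited target frame is first denoised jointly with intermediate "reasoning"
# frame tokens, which are dropped after a few steps so only the target frame is
# denoised to completion (avoiding the cost of rendering a full video).
import torch


def toy_denoiser(latents: torch.Tensor, t: float) -> torch.Tensor:
    """Stand-in for a pretrained video diffusion model (hypothetical)."""
    return latents * 0.1  # returns a small correction; a real model predicts noise/velocity


def chronoedit_style_sampling(
    cond_frame: torch.Tensor,        # latent of the input image (first frame)
    num_reasoning_tokens: int = 3,   # assumed number of intermediate trajectory latents
    reasoning_steps: int = 8,        # assumed number of steps with reasoning tokens attached
    total_steps: int = 40,           # assumed total number of denoising steps
) -> torch.Tensor:
    target = torch.randn_like(cond_frame)                               # noisy edited-frame latent
    reasoning = torch.randn(num_reasoning_tokens, *cond_frame.shape)    # noisy trajectory latents

    for step in range(total_steps):
        t = 1.0 - step / total_steps
        if step < reasoning_steps:
            # Stage 1: jointly denoise [condition | reasoning trajectory | target],
            # letting the imagined trajectory constrain the target toward
            # physically plausible edits.
            seq = torch.cat([cond_frame[None], reasoning, target[None]], dim=0)
            seq = seq - toy_denoiser(seq, t)
            reasoning, target = seq[1:-1], seq[-1]
        else:
            # Stage 2: drop the reasoning tokens and finish denoising only the
            # target frame conditioned on the input frame.
            seq = torch.cat([cond_frame[None], target[None]], dim=0)
            seq = seq - toy_denoiser(seq, t)
            target = seq[-1]
    return target


edited_latent = chronoedit_style_sampling(torch.randn(16, 32, 32))
print(edited_latent.shape)  # torch.Size([16, 32, 32])
```

The split between the two stages reflects the abstract's key trade-off: the reasoning tokens are only needed long enough to steer the target frame onto a plausible trajectory, after which they can be discarded.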
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration (2025)
- Factuality Matters: When Image Generation and Editing Meet Structured Visuals (2025)
- Query-Kontext: An Unified Multimodal Model for Image Generation and Editing (2025)
- VChain: Chain-of-Visual-Thought for Reasoning in Video Generation (2025)
- Tinker: Diffusion's Gift to 3D-Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization (2025)
- FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction (2025)
- UniVid: The Open-Source Unified Video Model (2025)