ReMoMask: Retrieval-Augmented Masked Motion Generation
Abstract
ReMoMask is a unified framework that addresses limitations in text-to-motion generation by integrating a Bidirectional Momentum Text-Motion Model, Semantic Spatio-temporal Attention, and RAG-Classifier-Free Guidance, achieving state-of-the-art performance on the HumanML3D and KIT-ML benchmarks.
Text-to-Motion (T2M) generation aims to synthesize realistic and semantically aligned human motion sequences from natural language descriptions. However, current approaches face dual challenges: generative models (e.g., diffusion models) suffer from limited diversity, error accumulation, and physical implausibility, while Retrieval-Augmented Generation (RAG) methods exhibit diffusion inertia, partial-mode collapse, and asynchronous artifacts. To address these limitations, we propose ReMoMask, a unified framework integrating three key innovations: 1) a Bidirectional Momentum Text-Motion Model that decouples the negative sample scale from the batch size via momentum queues, substantially improving cross-modal retrieval precision; 2) a Semantic Spatio-temporal Attention mechanism that enforces biomechanical constraints during part-level fusion to eliminate asynchronous artifacts; 3) RAG-Classifier-Free Guidance, which incorporates a small amount of unconditional generation to enhance generalization. Built upon MoMask's RVQ-VAE, ReMoMask efficiently generates temporally coherent motions in a minimal number of steps. Extensive experiments on standard benchmarks demonstrate the state-of-the-art performance of ReMoMask, with FID improvements of 3.88% on HumanML3D and 10.97% on KIT-ML over the previous SOTA method RAG-T2M. Code: https://github.com/AIGeeksGroup/ReMoMask. Website: https://aigeeksgroup.github.io/ReMoMask.
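As a rough illustration of the momentum-queue idea behind the Bidirectional Momentum Text-Motion Model, the sketch below shows a MoCo-style setup: a momentum (EMA) key encoder fills a feature queue, so the pool of negatives for the text-motion contrastive loss grows far beyond the batch size. All names, dimensions, and the InfoNCE formulation here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class MomentumQueue:
    """MoCo-style queue of momentum-encoded motion features used as negatives (illustrative)."""

    def __init__(self, dim=512, queue_size=8192, momentum=0.999):
        self.momentum = momentum
        self.ptr = 0
        # Queue of L2-normalized motion features acting as extra negatives.
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=1)

    @torch.no_grad()
    def update_key_encoder(self, query_encoder, key_encoder):
        # EMA update: the key (momentum) encoder slowly tracks the query encoder.
        for q, k in zip(query_encoder.parameters(), key_encoder.parameters()):
            k.data.mul_(self.momentum).add_(q.data, alpha=1.0 - self.momentum)

    @torch.no_grad()
    def enqueue(self, keys):
        # Overwrite the oldest entries with the newest momentum-encoded features.
        n = keys.shape[0]
        idx = torch.arange(self.ptr, self.ptr + n) % self.queue.shape[0]
        self.queue[idx] = F.normalize(keys, dim=1)
        self.ptr = (self.ptr + n) % self.queue.shape[0]


def text_to_motion_infonce(text_feat, motion_feat, queue, temperature=0.07):
    """InfoNCE over the in-batch positive pair plus all queued negatives:
    the negative pool size is set by the queue, not by the batch size."""
    text_feat = F.normalize(text_feat, dim=1)
    motion_feat = F.normalize(motion_feat, dim=1)
    l_pos = (text_feat * motion_feat).sum(dim=1, keepdim=True)   # (B, 1) positive logits
    l_neg = text_feat @ queue.queue.t()                          # (B, K) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(text_feat.shape[0], dtype=torch.long, device=text_feat.device)
    return F.cross_entropy(logits, labels)
```

On the generation side, RAG-Classifier-Free Guidance can be read as the standard CFG interpolation, logits_guided = (1 + s) * logits_cond - s * logits_uncond, applied to a condition augmented with retrieved motion features; the exact weighting and where the retrieved features enter the model are design choices of the paper and are not shown in this sketch.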
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- SnapMoGen: Human Motion Generation from Expressive Texts (2025)
- PlanMoGPT: Flow-Enhanced Progressive Planning for Text to Motion Synthesis (2025)
- Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space (2025)
- MotionGPT3: Human Motion as a Second Modality (2025)
- DuetGen: Music Driven Two-Person Dance Generation via Hierarchical Masked Modeling (2025)
- Music-Aligned Holistic 3D Dance Generation via Hierarchical Motion Modeling (2025)
- Toward Rich Video Human-Motion2D Generation (2025)