Abstract
Foundation models for embodied artificial intelligence (embodied AI) increasingly adopt diffusion modules as the action-generation core of vision-language-action (VLA) policies, but the diffusion module's iterative denoising imposes prohibitive inference latency for real-time deployment. We address this bottleneck in isolation by rethinking the diffusion action-generation module itself. We present Fast Robot Motion Diffusion (FRMD), a framework that (i) operates in trajectory-parameter space by predicting movement-primitive coefficients on a low-dimensional manifold, and (ii) collapses multi-step sampling into a single inference step via trajectory-level consistency distillation over the probability-flow ordinary differential equation (ODE). Concretely, FRMD replaces stepwise action generation with a one-pass mapping from noise to full trajectories, followed by a fixed-cost basis expansion; this reduces policy latency from hundreds of milliseconds to tens of milliseconds without modifying upstream vision or language encoders. On standard robotic manipulation benchmarks, FRMD runs 7 times faster than the vanilla diffusion policy and 10 times faster than the state-of-the-art MPD method, while matching the task success rates of multi-step diffusion policies. By targeting the diffusion component used throughout VLA systems, FRMD provides a plug-in, latency-optimized motion generator that preserves the advantages of diffusion and makes real-time embodied AI feasible.
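To make the one-pass generation pipeline concrete, the following is a minimal sketch, not the paper's implementation: all class and function names (OneStepMotionHead, rbf_basis), the choice of a radial-basis expansion, and the tensor shapes are illustrative assumptions. It shows a distilled consistency network mapping a single noise draw and a frozen observation embedding directly to movement-primitive coefficients, which a fixed basis expansion converts into a dense trajectory in one pass.

    import torch
    import torch.nn as nn

    # Hypothetical sketch of FRMD-style one-step sampling; names, shapes, and
    # the RBF basis are illustrative assumptions, not the paper's exact design.

    class OneStepMotionHead(nn.Module):
        """Distilled consistency model: one forward pass from noise to
        movement-primitive (MP) coefficients, conditioned on the observation."""

        def __init__(self, obs_dim: int, n_basis: int, action_dim: int,
                     hidden: int = 256):
            super().__init__()
            self.n_basis, self.action_dim = n_basis, action_dim
            self.net = nn.Sequential(
                nn.Linear(obs_dim + n_basis * action_dim, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, n_basis * action_dim),
            )

        def forward(self, obs_emb: torch.Tensor,
                    noise: torch.Tensor) -> torch.Tensor:
            # Single consistency step: (noise, condition) -> MP coefficients.
            x = torch.cat([obs_emb, noise.flatten(1)], dim=-1)
            return self.net(x).view(-1, self.n_basis, self.action_dim)

    def rbf_basis(n_basis: int, horizon: int) -> torch.Tensor:
        """Fixed radial-basis features Phi[t, b] over a normalized time grid."""
        t = torch.linspace(0.0, 1.0, horizon).unsqueeze(1)        # (T, 1)
        centers = torch.linspace(0.0, 1.0, n_basis).unsqueeze(0)  # (1, B)
        phi = torch.exp(-0.5 * ((t - centers) * n_basis) ** 2)
        return phi / phi.sum(dim=1, keepdim=True)                 # per-step norm

    # One-pass inference: noise -> coefficients -> trajectory (basis expansion).
    head = OneStepMotionHead(obs_dim=128, n_basis=10, action_dim=7)
    obs_emb = torch.randn(1, 128)               # frozen upstream encoder output
    noise = torch.randn(1, 10, 7)               # single noise draw, no iteration
    weights = head(obs_emb, noise)              # (1, B, A) MP coefficients
    traj = rbf_basis(10, horizon=64) @ weights  # (T, B) @ (1, B, A) -> (1, T, A)

The basis expansion at the end is a fixed matrix multiply, so its cost does not grow with the number of denoising steps the teacher used; the entire sampling cost is the one network forward pass.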