Abstract
For humanoid robots to interact naturally with humans and integrate seamlessly into daily life, natural language is an essential communication medium. While recent advances in imitation learning have enabled robots to acquire complex motions from expert demonstrations, traditional approaches often rely on rigid task specifications or single-modal inputs, limiting their ability to interpret high-level semantic instructions (e.g., natural language commands) or to switch dynamically between actions. Directly translating natural language into executable control commands remains a significant challenge. To address this, we propose ToggleMimic, an end-to-end imitation learning framework that generates robotic motions from textual instructions, enabling language-driven multi-task control. In contrast to end-to-end methods that struggle to generalize and single-action models that lack flexibility, ToggleMimic combines three components: (1) a two-stage policy distillation scheme that efficiently bridges the sim-to-real gap, (2) a lightweight cross-attention mechanism for interpretable text-to-action mapping, and (3) a gating network that improves robustness to linguistic variation. Extensive simulation and real-world experiments demonstrate the framework's effectiveness, generalization capability, and robust text-guided control performance. This work establishes an efficient, interpretable, and scalable learning paradigm for cross-modal, semantics-driven autonomous robot control.
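To make the cross-attention and gating ideas concrete, the following is a minimal illustrative sketch, not the authors' implementation: the class name TextGatedCrossAttentionPolicy, all layer sizes, and the fusion scheme are assumptions chosen only to show how instruction tokens could attend over the robot state, how the attention weights expose an interpretable text-to-action mapping, and how a scalar gate could down-weight noisy or paraphrased language.

```python
import torch
import torch.nn as nn

class TextGatedCrossAttentionPolicy(nn.Module):
    """Illustrative sketch (hypothetical, not the paper's architecture):
    cross-attention maps instruction tokens onto the robot state, and a
    gating network controls how strongly language modulates the action."""

    def __init__(self, state_dim=48, action_dim=19, text_dim=384, hidden=256, heads=4):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, hidden)   # proprioceptive state encoder
        self.text_proj = nn.Linear(text_dim, hidden)     # projects pre-computed text-encoder tokens
        # Query = robot state, Key/Value = instruction tokens.
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        # Gating network: scalar in (0, 1) per sample, intended to temper
        # spurious responses to paraphrased or noisy instructions.
        self.gate = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1), nn.Sigmoid())
        self.action_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                         nn.Linear(hidden, action_dim))

    def forward(self, state, text_tokens):
        # state: (B, state_dim); text_tokens: (B, T, text_dim)
        q = self.state_proj(state).unsqueeze(1)              # (B, 1, hidden)
        kv = self.text_proj(text_tokens)                     # (B, T, hidden)
        attended, attn_weights = self.cross_attn(q, kv, kv)  # attn_weights: interpretable text-to-action map
        g = self.gate(torch.cat([q, attended], dim=-1))      # (B, 1, 1)
        fused = q + g * attended                             # gated residual fusion of language into state
        return self.action_head(fused.squeeze(1)), attn_weights, g


if __name__ == "__main__":
    policy = TextGatedCrossAttentionPolicy()
    state = torch.randn(2, 48)
    text_tokens = torch.randn(2, 8, 384)        # e.g. token embeddings from a frozen sentence encoder
    action, attn, gate = policy(state, text_tokens)
    print(action.shape, attn.shape, gate.shape)  # (2, 19) (2, 1, 8) (2, 1, 1)
```

In this sketch the attention weights and the gate value are returned alongside the action so they can be inspected, which mirrors the abstract's claims of interpretability and robustness; how the actual framework realizes these properties is described in the main text.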