Abstract
When human instructors guide learners through motor tasks, they seamlessly coordinate physical touch with verbal explanation: a dance teacher positions a student's arms while describing the movement; a therapist supports a patient's limb while offering encouragement. In contrast, a robot applying physical forces without verbal context can feel invasive or unsettling. We present a robot guidance controller that learns to coordinate physical and verbal guidance the way human instructors naturally do. Our system adaptively balances the two modalities based on real-time estimation of human compliance: when learners struggle, it provides firmer physical corrections paired with explicit instructions; as they improve, it transitions to a lighter touch with encouraging phrases. Our method comprises three components: (1) an estimator that infers physical and verbal compliance from tracking errors, (2) an optimizer that dynamically allocates guidance effort between force and language, and (3) a force-to-language model that generates contextually appropriate utterances. User studies (N=12) demonstrate that adaptive coordination significantly outperforms single-modality guidance and fixed-combination baselines, with up to a 50% reduction in tracking error, a 39% improvement in movement smoothness, and 27% faster task completion. While validated in rehabilitation therapy, our approach is applicable to other human-robot collaborative learning scenarios.
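To make the adaptive allocation concrete, the sketch below illustrates one plausible reading of the control loop implied by the abstract: a compliance estimate derived from recent tracking errors drives a trade-off between impedance-style stiffness and the explicitness of the verbal prompt. All names, gains, and thresholds (estimate_compliance, allocate_guidance, k_min, k_max, the 0.5 cutoff) are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical sketch of the adaptive physical/verbal allocation loop.
# Every name and constant here is an assumption for illustration only.

def estimate_compliance(tracking_error, window, max_len=20):
    """Estimate human compliance in [0, 1] from recent tracking errors:
    small, shrinking errors suggest the learner is following well."""
    window.append(abs(tracking_error))
    if len(window) > max_len:          # keep a short sliding window
        window.pop(0)
    mean_err = np.mean(window)
    return float(np.clip(1.0 - mean_err / (mean_err + 0.05), 0.0, 1.0))

def allocate_guidance(compliance, k_max=40.0, k_min=5.0):
    """Trade off physical stiffness against verbal explicitness:
    low compliance -> firm correction plus explicit instruction,
    high compliance -> light touch plus encouragement."""
    stiffness = k_min + (1.0 - compliance) * (k_max - k_min)
    utterance = ("Let me guide your arm: move it slowly to the left."
                 if compliance < 0.5 else
                 "Nice, you're getting it. Keep going!")
    return stiffness, utterance

# One example control step: an impedance-style corrective force
# computed from the allocated stiffness and the current tracking error.
window = []
error = 0.12                           # tracking error (m), e.g. from motion capture
c = estimate_compliance(error, window)
k, phrase = allocate_guidance(c)
force = -k * error                     # corrective force along the error direction
print(f"compliance={c:.2f}, stiffness={k:.1f} N/m, say: {phrase}")
```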