Abstract
Safe physical human-robot interaction (pHRI) in rehabilitation requires reliable perception and low-latency decision making under heterogeneous and unreliable sensor inputs. This paper presents a multimodal sensor-fusion safety framework that integrates physical state estimation, semantic information fusion, and an edge-deployed large language model (LLM) for real-time pHRI safety control. A dynamics-based virtual sensing method estimates internal joint torques from external force-torque measurements, achieving a normalized mean absolute error of 18.5% in real-world experiments. An asynchronous semantic state pool with a time-to-live mechanism fuses visual, force, posture, and human semantic cues while remaining robust to sensor delays and dropouts. From structured multimodal tokens, an instruction-tuned edge LLM outputs discrete safety decisions, which are then mapped to continuous compliant control parameters. The framework is trained on a hybrid dataset of limited real-world samples and LLM-augmented synthetic data, and evaluated on unseen real and mixed-condition scenarios. Experimental results show reliable detection of safety-critical events with a low emergency misdetection rate, at an end-to-end decision latency of approximately 223 ms on edge hardware. Real-world experiments on a rehabilitation robot demonstrate effective responses to impacts, user instability, and visual occlusions, indicating the practical applicability of the proposed approach to real-time pHRI safety monitoring.
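To make the asynchronous semantic state pool concrete, the following is a minimal Python sketch of the general idea, assuming per-modality time-to-live (TTL) values and a decision loop that reads fused snapshots; the names used here (`SemanticStatePool`, `PoolEntry`, `snapshot`) are hypothetical illustrations, not the paper's implementation.

```python
import time
from dataclasses import dataclass
from threading import Lock
from typing import Any, Dict, Optional

# Sketch of a TTL-based asynchronous state pool: each modality writes
# timestamped entries at its own rate, and reads treat expired entries
# as missing, so a delayed or dropped sensor degrades to "stale" instead
# of blocking the fusion and decision loop.

@dataclass
class PoolEntry:
    value: Any
    timestamp: float
    ttl: float  # seconds this entry stays valid after being written

class SemanticStatePool:
    def __init__(self) -> None:
        self._entries: Dict[str, PoolEntry] = {}
        self._lock = Lock()  # writers are asynchronous sensor callbacks

    def update(self, key: str, value: Any, ttl: float) -> None:
        """Called from each sensor's own thread/callback at its own rate."""
        with self._lock:
            self._entries[key] = PoolEntry(value, time.monotonic(), ttl)

    def snapshot(self) -> Dict[str, Optional[Any]]:
        """Consistent read for the decision loop; expired cues become None."""
        now = time.monotonic()
        with self._lock:
            return {
                key: (e.value if now - e.timestamp <= e.ttl else None)
                for key, e in self._entries.items()
            }

# Example usage: per-modality TTLs reflect how quickly each cue goes stale.
pool = SemanticStatePool()
pool.update("vision", {"person_visible": True}, ttl=0.5)
pool.update("force", {"ext_torque_nm": 3.2}, ttl=0.1)
pool.update("posture", {"stable": True}, ttl=1.0)
print(pool.snapshot())  # dropped-out cues read back as None
```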