Abstract
Stroke remains a leading cause of long-term disability, limiting survivors' ability to perform activities of daily living. While wearable sensors offer scalable solutions for monitoring movement during rehabilitation, they struggle to distinguish functional from non-functional movements, and manual annotation of sensor data is labor-intensive and prone to inconsistency. In this paper, we propose a novel framework that uses large language models (LLMs) to generate activity descriptions from video frames of therapy sessions; these descriptions are then aligned with concurrently recorded accelerometer signals to create labeled training data. Through exploratory analysis, we demonstrate that accelerometer signals exhibit distinct temporal and statistical patterns corresponding to specific activities, supporting the feasibility of generating natural-language narratives directly from sensor data. Our findings lay the foundation for sensor-to-text models that enable automated, non-intrusive, and scalable stroke rehabilitation monitoring without manual or video-based annotation.