Abstract
Radio frequency (RF)-based human activity sensing is an essential component of ubiquitous computing, and WiFi sensing in particular offers a practical, low-cost solution for gesture and activity recognition. However, costly manual data collection, multipath interference, and poor cross-domain generalization hinder real-world deployment, and existing data augmentation approaches often neglect the biomechanical structure underlying RF signals. To address these limitations, we present CM-GR, a cross-modal gesture recognition framework that integrates semantic learning with generative modeling. CM-GR leverages 3D skeletal points extracted from vision data as semantic priors to guide the synthesis of realistic WiFi signals, thereby incorporating biomechanical constraints without extensive manual labeling. In addition, dynamic conditional vectors constructed from inter-subject skeletal differences enable user-specific WiFi data generation without dedicated data collection and annotation for each new user. Extensive experiments on the public MM-Fi dataset and our SelfSet dataset demonstrate that CM-GR substantially improves cross-subject gesture recognition accuracy, with gains of up to 10.26% and 9.5%, respectively. These results confirm the effectiveness of CM-GR in synthesizing personalized WiFi data and highlight its potential for robust, scalable gesture recognition in practical settings.