Abstract
While sign language combines sequential hand motions with concurrent non-manual cues (e.g., mouth shapes and head tilts), current recognition systems lack multimodal annotation methods capable of capturing their hierarchical semantics. To bridge this gap, we propose GSP-MC, the first method to integrate generative large language models (LLMs) into sign language recognition. It leverages retrieval-augmented generation with domain-specific LLMs and expert-validated corpora to produce precise multi-part sign descriptions. A dual-encoder architecture bidirectionally aligns hierarchical skeleton features with multi-level text descriptions (global, synonym, and part level) through probabilistic matching. The approach combines global and part-level losses with KL-divergence optimization, ensuring robust alignment across all relevant text-skeleton pairs while capturing both sign semantics and fine-grained dynamics. Experiments demonstrate state-of-the-art performance: 97.1% accuracy on the Chinese SLR500 dataset (surpassing SSRL's 96.9%) and 97.07% on the Turkish AUTSL dataset (exceeding SML's 96.85%), confirming the method's cross-lingual potential for inclusive communication technologies.
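The bidirectional probabilistic matching with KL-divergence optimization mentioned above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function name, the soft target matrix (which lets several synonymous descriptions match one sign), and the temperature value are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_kl_alignment_loss(skel_emb, text_emb, targets, temperature=0.07):
    """KL divergence between predicted and target matching distributions,
    averaged over the skeleton-to-text and text-to-skeleton directions.

    skel_emb: (N, D) skeleton features; text_emb: (N, D) text features.
    targets:  (N, N) ground-truth matching distribution (rows sum to 1);
              soft targets allow multiple relevant text-skeleton pairs.
    """
    # Cosine similarities scaled by a temperature.
    s = skel_emb / np.linalg.norm(skel_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature
    p_s2t = softmax(logits, axis=1)      # skeleton -> text matching probabilities
    p_t2s = softmax(logits.T, axis=1)    # text -> skeleton matching probabilities
    eps = 1e-12
    kl_s2t = np.sum(targets * (np.log(targets + eps) - np.log(p_s2t + eps)), axis=1).mean()
    kl_t2s = np.sum(targets.T * (np.log(targets.T + eps) - np.log(p_t2s + eps)), axis=1).mean()
    return 0.5 * (kl_s2t + kl_t2s)
```

When skeleton and text embeddings agree with the target matching, the loss is near zero; shuffling the text embeddings against fixed targets increases it, which is the behavior the alignment objective relies on.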