Abstract
The ability to generate dynamic, expressive dance routines that adapt to various musical compositions has broad applications in activity recognition, the performing arts, entertainment, virtual reality, and interactive media, offering new avenues for creative professionals and audiences alike. In this article, a deep learning framework is developed for music-synchronized dance choreography through modified vision transformers and graph convolutional networks based on the Mexican hat wavelet function, for position quantization and motion forecasting. More explicitly, high-dimensional pose characteristics are extracted from dance video frames using a modified vision transformer to generate a skeletal graph, while a modified graph convolutional network captures the spatial and temporal relationships between human joints. Continuous pose data are discretized using K-means clustering and vector-quantized variational autoencoders, respectively. The beat-aligned music-synchronization loss is optimized, and the best-tuned weight coefficients are found, using two variants of the differential evolution algorithm based on controlled mutation factors defined by a log-sigmoid function and by rand(). The proposed architecture with the log-sigmoid mutation factor achieves the lowest Fréchet inception distance (FIDk = 32.451, FIDg = 11.219) and a music-motion correlation of 0.341, demonstrating enhanced motion synthesis in comparison to existing state-of-the-art techniques. A mean fitness value of 6.0294 × 10⁻¹⁰ is obtained, with an overall classification accuracy of 97.019% at 0.8431 GFLOPs, for the differential evolution algorithm with the log-sigmoid mutation factor. The framework may be utilized in AI-generated choreography, virtual dance instruction, and interactive entertainment.
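As a rough illustration of the optimizer summarized above, the sketch below implements one generation of a standard DE/rand/1/bin differential evolution step in which the mutation factor (conventionally denoted F) is set either by a log-sigmoid (logistic) schedule or by a uniform random draw. The schedule argument, the crossover rate CR, and the placeholder fitness_fn (standing in for the beat-aligned music-synchronization loss over the weight coefficients) are assumptions made for illustration only, not the authors' implementation.

```python
import numpy as np

def logsig(x):
    """Log-sigmoid (logistic) transfer function, bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def de_generation(population, fitness_fn, mode="logsig", t=0, CR=0.9, rng=None):
    """One DE/rand/1/bin generation over a (pop_size, dim) population of weight coefficients."""
    rng = np.random.default_rng() if rng is None else rng
    pop_size, dim = population.shape
    next_pop = population.copy()
    for i in range(pop_size):
        # Controlled mutation factor: a log-sigmoid schedule over the generation
        # index t, or a fresh uniform draw (both exact forms are assumptions).
        F = logsig(t / 10.0) if mode == "logsig" else rng.uniform(0.0, 1.0)
        # Pick three distinct donor individuals, none equal to i.
        a, b, c = rng.choice([j for j in range(pop_size) if j != i], 3, replace=False)
        mutant = population[a] + F * (population[b] - population[c])
        # Binomial crossover with at least one coordinate taken from the mutant.
        mask = rng.random(dim) < CR
        mask[rng.integers(dim)] = True
        trial = np.where(mask, mutant, population[i])
        # Greedy selection on the loss to be minimized.
        if fitness_fn(trial) < fitness_fn(population[i]):
            next_pop[i] = trial
    return next_pop
```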