Abstract
Human Action Recognition (HAR) is a core problem in computer vision with applications in video surveillance and human-computer interaction (HCI), where efficient and accurate models are needed to improve the user experience. Traditional HAR methods often rely on hand-crafted features and shallow learning techniques, which limit their ability to capture complex patterns. In contrast, this study proposes an efficient HAR model that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) to support HCI through AI-powered action understanding. The model uses a pre-trained EfficientNetB7 network to extract spatial features from individual video frames and passes them to a Long Short-Term Memory (LSTM) network that captures long-range temporal dependencies across frames. This architecture improves recognition accuracy while reducing computational complexity, making it well suited to HCI applications. In experiments, the model achieves a classification accuracy of 97.8% on the UCF101 dataset and 80.1% on the HMDB51 dataset, outperforming state-of-the-art HAR models. It does so without auxiliary techniques such as data augmentation, which underlines its efficiency and its potential for real-world HCI applications that depend on accurate and efficient recognition of human actions.
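The CNN-plus-LSTM pipeline described above can be illustrated with a minimal Keras sketch. It is not the authors' implementation: the abstract does not state the clip length, frame resolution, LSTM width, or whether the backbone is fine-tuned, so `clip_len`, `frame_size`, `lstm_units`, the frozen backbone, and the hypothetical helper `sample_frame_indices` (uniform frame sampling) are all assumptions chosen for illustration.

```python
def sample_frame_indices(num_frames, clip_len):
    """Pick clip_len evenly spaced frame indices from a video.

    Hypothetical helper: uniform sampling is an assumption, not a detail
    given in the abstract. Short videos repeat frames to fill the clip.
    """
    if num_frames <= 0 or clip_len <= 0:
        raise ValueError("num_frames and clip_len must be positive")
    return [(i * num_frames) // clip_len for i in range(clip_len)]


def build_har_model(num_classes, clip_len=16, frame_size=224,
                    lstm_units=256, weights="imagenet"):
    """Per-frame EfficientNetB7 features followed by an LSTM classifier.

    All default hyperparameters are illustrative assumptions.
    """
    # Imported lazily so the sampling helper works without TensorFlow.
    import tensorflow as tf

    # Pre-trained backbone without its classification head; global average
    # pooling yields one feature vector per frame. The Keras EfficientNet
    # models rescale inputs internally and expect raw pixels in [0, 255].
    backbone = tf.keras.applications.EfficientNetB7(
        include_top=False,
        weights=weights,
        pooling="avg",
        input_shape=(frame_size, frame_size, 3),
    )
    backbone.trainable = False  # assumption: backbone used as a fixed extractor

    clip = tf.keras.Input(shape=(clip_len, frame_size, frame_size, 3))
    # Apply the same CNN to every frame: (batch, clip_len, feature_dim).
    features = tf.keras.layers.TimeDistributed(backbone)(clip)
    # The LSTM summarizes the frame sequence into one clip-level vector.
    temporal = tf.keras.layers.LSTM(lstm_units)(features)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(temporal)
    return tf.keras.Model(clip, outputs)
```

Under these assumptions, `build_har_model(num_classes=101)` would give a classifier sized for UCF101 and `build_har_model(num_classes=51)` one sized for HMDB51, while `sample_frame_indices(total_frames, 16)` selects which frames of a video form the input clip.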