Abstract
The integration of Internet of Things (IoT) technologies with deep learning has introduced powerful opportunities for advancing cross-media art and design. This paper proposes DeepFusionNet, an IoT-driven multimodal classification framework developed to process real-time visual, auditory, and motion data acquired from distributed sensor networks. Rather than generating new content, the system classifies contextual input states to activate predefined artistic modules within interactive multimedia environments. The architecture of DeepFusionNet integrates Convolutional Neural Networks (CNNs) for spatial feature extraction, Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) layers for modeling temporal dependencies in auditory and motion data, and fully connected layers for multimodal feature fusion and final classification. Input data undergoes comprehensive preprocessing, including normalization, imputation, noise filtering, and augmentation, to ensure consistent, high-quality multimodal representations. Features extracted from each modality are fused within the network to identify user interaction contexts that guide adaptive system responses. Unlike existing multimodal transformer-based frameworks, DeepFusionNet prioritizes low-latency, synchronized IoT processing, offering a lightweight yet robust alternative for real-time interaction. Employing deep multimodal fusion rather than simple rule-based triggers ensures contextual awareness, scalability, and resilience in interactive art installations. Experimental evaluations demonstrate that DeepFusionNet achieves high performance, with 94.2% accuracy, 92.5% sensitivity, 96.1% specificity, 93.8% F1-score, 95.0% precision, an MCC of 0.846, and an AUC of 0.96, along with a 15% reduction in latency compared to baseline frameworks. DeepFusionNet thus offers a scalable, real-time infrastructure for user-aware, IoT-enhanced cross-media art applications.
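For illustration only, the sketch below shows one way the fusion architecture summarized above could be organized in PyTorch: a CNN branch for visual frames, recurrent branches for auditory and motion sequences, and fully connected layers over the concatenated features. The layer sizes, input shapes, class count, and the assignment of the GRU to audio and the LSTM to motion are assumptions made for this sketch and are not values or design choices reported in the paper.

```python
# Minimal sketch of a CNN + GRU/LSTM multimodal fusion classifier.
# All layer sizes, input shapes, and the class count are illustrative
# assumptions, not parameters reported for DeepFusionNet.
import torch
import torch.nn as nn


class MultimodalFusionSketch(nn.Module):
    def __init__(self, num_classes: int = 8):
        super().__init__()
        # Visual branch: CNN for spatial feature extraction from camera frames.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (batch, 32)
        )
        # Auditory branch: GRU over audio feature sequences (e.g. MFCC frames).
        self.audio = nn.GRU(input_size=40, hidden_size=64, batch_first=True)
        # Motion branch: LSTM over IoT motion-sensor sequences.
        self.motion = nn.LSTM(input_size=9, hidden_size=64, batch_first=True)
        # Fusion head: fully connected layers over concatenated modality features.
        self.classifier = nn.Sequential(
            nn.Linear(32 + 64 + 64, 128), nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, num_classes),
        )

    def forward(self, frames, audio_seq, motion_seq):
        v = self.visual(frames)                  # (batch, 32)
        _, h_a = self.audio(audio_seq)           # final GRU hidden state
        _, (h_m, _) = self.motion(motion_seq)    # final LSTM hidden state
        fused = torch.cat([v, h_a[-1], h_m[-1]], dim=1)
        return self.classifier(fused)            # interaction-context logits


# Example forward pass with synthetic batch data.
model = MultimodalFusionSketch()
logits = model(
    torch.randn(4, 3, 64, 64),   # visual frames
    torch.randn(4, 100, 40),     # audio feature sequence
    torch.randn(4, 100, 9),      # motion sensor sequence
)
print(logits.shape)              # torch.Size([4, 8])
```

In this reading, each branch reduces its modality to a fixed-length feature vector before fusion, so the classification head sees a single synchronized representation of the interaction context rather than per-modality rules or triggers.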