Abstract
Augmented and virtual reality (AR/VR) and other immersive technologies are creating dynamic, learner-centred, and engaging language-learning environments. Assessing a learner's language proficiency in such rapidly changing settings is difficult: the most prominent challenges are managing multimodal learner inputs, interpreting model predictions, and protecting user data across distributed systems. To address these challenges, this paper proposes TriNet-AQ, a federated, interpretable deep learning architecture for classifying English competency on AR/VR platforms. The architecture combines Quantum Sinusoidal Encoding (QSE), Triaxial Attention Fusion (TAF) for multimodal feature alignment, and Quantum Modulated Integration (QMI) to enhance context-aware learning by optimizing temporal representation. Hybrid Slime Gorilla Optimisation (HSGO) is used for optimization; it accelerates convergence and improves both performance and efficiency. Through federated learning, TriNet-AQ supports decentralized training across many clients, which enhances privacy and flexibility. On real-world AR/VR instructional datasets, TriNet-AQ outperforms classical, fuzzy, and hybrid baselines, achieving an accuracy of 98.5%, an AUC of 0.95, and an EPES of 0.89. It also generalizes well, losing only 3.5% accuracy on new data. SHAP-based interpretability analysis shows clear feature attributions and consistent feature relevance across users. Statistical analysis, including Cohen's d = 0.89 (p < 0.001), confirms the significance and reliability of the model's results. TriNet-AQ thus provides robust, interpretable, and privacy-preserving language evaluation that is real-time and tailored to the learner in next-generation immersive learning environments.