Abstract
The widespread adoption of Android has made it a primary target for increasingly sophisticated malware, posing a significant challenge to mobile security. Traditional static or behavioural approaches often struggle with obfuscation and fail to integrate context across multiple feature domains. In this work, we propose GIT-GuardNet, a novel Graph-Informed Transformer Network that leverages multi-modal learning to detect Android malware with high precision and robustness. GIT-GuardNet fuses three complementary perspectives: (i) static code attributes captured through a Transformer encoder, (ii) call graph structures modelled via a Graph Attention Network (GAT), and (iii) temporal behaviour traces learned using a Temporal Transformer. The outputs of these encoders are integrated by a cross-attention fusion mechanism that dynamically weighs inter-modal dependencies, enabling more informed decisions under both benign and adversarial conditions. We conducted comprehensive experiments on a large-scale dataset of 15,036 Android applications, including 5,560 malware samples from the Drebin project. GIT-GuardNet achieves state-of-the-art performance, reaching 99.85% accuracy, 99.89% precision, and 99.94% AUC, and outperforms traditional machine learning models, single-view deep networks, and recent hybrid approaches such as DroidFusion. Ablation studies confirm that each modality contributes complementary information and that the cross-attention design is effective. Our results demonstrate that GIT-GuardNet generalises well to obfuscated and stealthy threats, incurs low inference overhead, and is practical for real-world mobile threat detection. This study provides a powerful and extensible framework for future research in secure mobile computing and intelligent malware defence.
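To make the idea of cross-attention fusion concrete, the following is a minimal illustrative sketch, not the GIT-GuardNet implementation. It assumes each encoder has already produced one fixed-size embedding per application, and it omits the learned query, key and value projections that a trained model would use. The names `cross_attend` and `fuse`, the residual concatenation, and the toy input vectors are hypothetical choices made only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(query, contexts):
    """Scaled dot-product attention of one query vector over context vectors.

    Returns the attention-weighted sum of the contexts and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, c) / scale for c in contexts])
    attended = [
        sum(w * c[i] for w, c in zip(weights, contexts))
        for i in range(len(query))
    ]
    return attended, weights


def fuse(static_vec, graph_vec, temporal_vec):
    """Let each modality attend over the other two, then concatenate.

    Assumption: all three embeddings share the same dimensionality.
    """
    views = {"static": static_vec, "graph": graph_vec, "temporal": temporal_vec}
    fused, all_weights = [], {}
    for name, query in views.items():
        others = [v for n, v in views.items() if n != name]
        attended, weights = cross_attend(query, others)
        # Residual connection: keep the modality's own signal plus its context.
        fused.extend(q + a for q, a in zip(query, attended))
        all_weights[name] = weights
    return fused, all_weights


# Toy embeddings (hypothetical values, not taken from the paper's data).
fused, weights = fuse([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In this toy input the static and graph embeddings are identical, so the static view places more attention on the graph view than on the temporal view. In a trained model the weights would instead be shaped by learned projections, and the fused vector would feed a classification head.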