Abstract
This paper presents a modular and scalable intrusion detection framework that combines graph-based feature extraction, Transformer-based autoencoding, and contrastive learning to improve detection accuracy in cloud environments. Network flows are modeled as graphs to capture relational patterns among IP addresses and services, and a Graph Neural Network (GNN) extracts structured embeddings from these graphs. A Transformer-based autoencoder then refines the embeddings to preserve contextual information, while contrastive learning enforces clear class separation during classification. The system is evaluated on the NSL-KDD and CIC-IDS2018 datasets in both binary and multi-class settings. Experimental results show an average accuracy of 99.97%, with high precision and recall across all attack types, including minority classes such as U2R and R2L. The model also achieves low false-positive rates and real-time inference with modest resource requirements. Key contributions include an interpretable pipeline that uses SHAP for feature attribution, a strategy for mitigating class imbalance, and validation across datasets with detailed security and generalizability analyses. These results support the practical applicability of the proposed approach in high-throughput, cloud-based network environments.
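
To make the graph-modeling step concrete, the following is a minimal sketch of how flow records might be aggregated into a graph whose nodes are IP addresses and whose edges are labeled by service. It is an illustration under stated assumptions, not the construction used in this work: the `Flow` schema, the helper names `build_flow_graph` and `out_degree`, and the choice of edge statistics (flow count and total bytes) are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    """One network flow record (hypothetical schema, not taken from the paper)."""
    src_ip: str
    dst_ip: str
    service: str
    n_bytes: int


def build_flow_graph(flows):
    """Aggregate flows into a directed, service-labeled graph.

    Returns (nodes, edges): nodes is a sorted list of IP addresses, and
    edges maps (src_ip, dst_ip, service) to [flow_count, total_bytes].
    """
    edges = defaultdict(lambda: [0, 0])
    nodes = set()
    for f in flows:
        nodes.update((f.src_ip, f.dst_ip))
        stats = edges[(f.src_ip, f.dst_ip, f.service)]
        stats[0] += 1          # number of flows on this edge
        stats[1] += f.n_bytes  # total bytes carried on this edge
    return sorted(nodes), dict(edges)


def out_degree(edges):
    """Count distinct (destination, service) pairs contacted by each source IP."""
    degree = defaultdict(int)
    for (src, _dst, _service) in edges:
        degree[src] += 1
    return dict(degree)


# Toy input: assumed example values, not drawn from NSL-KDD or CIC-IDS2018.
flows = [
    Flow("10.0.0.1", "10.0.0.2", "http", 500),
    Flow("10.0.0.1", "10.0.0.2", "http", 300),
    Flow("10.0.0.1", "10.0.0.3", "ssh", 120),
    Flow("10.0.0.4", "10.0.0.2", "http", 80),
]
nodes, edges = build_flow_graph(flows)
```

An edge dictionary of this kind could be converted into the edge list and edge-attribute arrays that a GNN library expects; the node and edge features actually used by the framework may differ.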