Abstract
As Internet of Things (IoT) technology is widely adopted in smart agriculture, smart healthcare, and smart cities, IoT systems are increasingly exposed to complex and dynamic security threats. Intrusion Detection Systems (IDSs), a key network security technology, strengthen IoT security by monitoring for and detecting anomalous activity. However, IDSs built on traditional Machine Learning (ML) techniques show limited effectiveness in classifying malicious traffic. In recent years, approaches that convert network security data into image sets and classify them with Deep Transfer Learning (DTL) have rapidly gained popularity. Although these methods substantially improve detection accuracy, they also increase training time and resource consumption. To balance high detection accuracy against training cost, this study introduces an efficient intrusion detection approach based on the Vision Transformer (ViT), which exploits the model's strong feature extraction capabilities. The proposed High-performance ViT Intrusion Detection System (HiViT-IDS) first transforms one-dimensional network traffic data into RGB images and then classifies them with a ViT model, leveraging its representational power. Experiments on the ToN-IoT and Edge-IIoTset datasets yield classification accuracies of 99.70% and 100%, respectively. Compared with mainstream DTL approaches, the proposed model substantially reduces training time while maintaining high detection performance. These findings suggest that HiViT-IDS is a promising and competitive option for intrusion detection in complex and dynamic network environments.