Abstract
Malicious network attacks are becoming more complex and diverse, which threatens the effectiveness of traditional security defense mechanisms. To address this, this paper presents a new three-stage network security situation awareness algorithm that uses machine learning and big data processing to identify threats in network information transmission. The first stage constructs a rich feature vector from network traffic flows, comprising statistical features in both flow directions together with temporal, flow-based, and relational features. The second stage applies a hybrid feature selection approach to improve the efficiency and accuracy of the model: a Distributed K-Means (D-KMeans) algorithm clusters the features, and a Mutual Information (MI) analysis selects the most informative, non-redundant subset. The third stage uses a Distributed K-Nearest Neighbor (D-KNN) model to perform robust and scalable network traffic classification. The proposed algorithm was rigorously evaluated on the CICIDS2017 dataset, where it achieved 98.91% accuracy, 93.71% precision, 98.95% recall, and a 96.00% F-Measure. This is a statistically significant increase of at least 1.2 percent in accuracy (p < 0.05) over existing state-of-the-art approaches. The findings confirm that the proposed solution is a highly efficient and effective means of identifying security threats in large-scale network environments.
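The second and third stages can be illustrated with a minimal single-machine sketch. It is not the distributed D-KMeans or D-KNN implementation evaluated in this paper, and the abstract does not fix the details assumed here: feature columns are clustered with plain K-Means, MI is estimated by equal-width binning, one feature is kept per cluster, and classification is a majority vote over Euclidean neighbors. The toy data and all function names are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def discretize(col, bins=4):
    """Equal-width binning so MI can be estimated by counting (assumption)."""
    lo, hi = min(col), max(col)
    width = (hi - lo) / bins or 1.0  # constant column: everything lands in bin 0
    return [min(int((v - lo) / width), bins - 1) for v in col]

def kmeans(points, k, iters=20):
    """Lloyd's algorithm, seeded with the first k points for determinism."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(d) / len(members) for d in zip(*members)]
    return labels

def select_features(X, y, k):
    """Cluster feature columns, then keep the highest-MI column per cluster."""
    cols = [list(c) for c in zip(*X)]
    labels = kmeans(cols, k)
    best = {}  # cluster label -> (MI score, column index)
    for j, (col, lab) in enumerate(zip(cols, labels)):
        score = mutual_information(discretize(col), y)
        if lab not in best or score > best[lab][0]:
            best[lab] = (score, j)
    return sorted(j for _, j in best.values())

def project(row, idx):
    """Restrict a sample to the selected feature indices."""
    return [row[j] for j in idx]

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training samples (Euclidean)."""
    nearest = sorted(zip(X_train, y_train),
                     key=lambda t: math.dist(t[0], x))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]

# Hypothetical toy flows: columns 0 and 2 are near-duplicates that track the
# label; columns 1 and 3 are near-duplicates that do not.
X = [
    [0.1, 0.5, 0.15, 0.50], [0.2, 0.1, 0.20, 0.15],
    [0.1, 0.9, 0.10, 0.90], [0.3, 0.4, 0.25, 0.45],
    [0.9, 0.6, 0.95, 0.60], [0.8, 0.0, 0.80, 0.05],
    [1.0, 0.8, 1.00, 0.80], [0.9, 0.3, 0.85, 0.30],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = benign, 1 = attack

selected = select_features(X, y, k=2)
X_sel = [project(row, selected) for row in X]
prediction = knn_predict(X_sel, y, project([0.85, 0.4, 0.9, 0.4], selected))
print(selected, prediction)
```

On this toy input, clustering groups each pair of near-duplicate columns, so only one column from each pair is passed to the classifier. Keeping exactly one feature per cluster is one simple way to read "most informative, non-redundant"; a thresholded MI ranking within each cluster would be an equally plausible alternative.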