Abstract
Video surveillance is a core component of smart-city and public-security systems in the Internet of Things (IoT), and it must detect abnormal behavior efficiently in massive video streams. Existing weakly supervised video anomaly detection methods, however, are limited by the scarcity of abnormal samples, the similarity between normal and abnormal segments, and insufficient modeling of temporal dependencies. To address these limitations, this paper proposes an approach that integrates temporal structural attention with contrastive learning. First, causal masks and temporal decay weights are incorporated into the attention mechanism to explicitly constrain temporal relations and prevent leakage of future information. Second, positive/negative offsets and a contrastive learning strategy are employed to make abnormal segments more discriminable in the latent space. Experiments on multiple public video anomaly detection datasets show that the proposed method outperforms existing mainstream models, reaching an AUC of 98.1%, an accuracy (ACC) of 96.1%, and an F1-score of 94.5%. These results indicate that the method can provide more efficient and reliable anomaly detection for IoT-based video surveillance, with implications for public safety and intelligent monitoring.
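
As an illustration of the first component, the following minimal sketch shows one way a causal mask and a temporal decay weight could be combined in scaled dot-product attention over video segments. It is not the paper's implementation: the function name `causal_decay_attention`, the exponential form of the decay, and the `decay` parameter are assumptions made for this example only.

```python
import numpy as np

def causal_decay_attention(q, k, v, decay=0.1):
    """Single-head attention over T segments with a causal mask and
    an exponential temporal decay (both illustrative assumptions).

    q, k, v: arrays of shape (T, d); decay: hypothetical decay rate >= 0.
    Returns (output of shape (T, d), attention weights of shape (T, T)).
    """
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)            # raw similarity, shape (T, T)

    idx = np.arange(T)
    dist = idx[:, None] - idx[None, :]       # dist[i, j] = i - j
    causal = dist >= 0                       # segment i may attend to j <= i only

    # Subtracting decay * (i - j) before the softmax multiplies the
    # unnormalized weight of segment j by exp(-decay * (i - j)).
    # Future positions get -inf, so their weight is exactly zero.
    scores = np.where(causal, scores - decay * dist, -np.inf)

    # Numerically stable row-wise softmax (the diagonal is always finite).
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights
```

With this construction, changing the features of a later segment cannot alter the output for any earlier segment, which is the "no future information leakage" property described above; a larger `decay` concentrates attention on the most recent segments.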