Abstract
Cyber bots have become prevalent across the Internet ecosystem, making an understanding of their behavior essential for threat intelligence. Because bot behaviors are encoded in traffic payloads that blend with normal traffic, honeypot sensors are widely adopted to capture and analyze them. However, prior work struggles to adapt to the large-scale, diverse payloads produced by evolving bot techniques. In this paper, we conduct an 11-month measurement study that maps cyber bot behaviors through payload pattern analysis of honeypot traffic. We propose TrafficPrint, a pattern extraction framework that enables adaptable analysis of diverse honeypot payloads. TrafficPrint combines representation learning with clustering to automatically extract human-understandable payload patterns without requiring protocol-specific expertise. Our globally distributed honeypot sensors collected 21.5 M application-layer payloads. Starting from only 168 K labeled payloads (0.8% of the data), TrafficPrint extracted 296 patterns that automatically labeled 83.57% of previously unknown payloads. Our pattern-based analysis reveals actionable threat intelligence: 82% of patterns employ semi-customized structures that balance automation with targeted modifications; 13% contain distinctive identity markers that enable threat actor attribution, including CENSYS's unique signature; and bots rely on techniques such as masquerading as crawlers, embedding commands in brute-force attacks, and using base64 encoding to evade detection.