Semantic lossless encoded image representation for malware classification

Abstract

Combining artificial intelligence with static analysis is an effective method for classifying malicious code. Due to the development of anti-analysis techniques, malicious code commonly employs obfuscation methods like packing, which result in garbled assembly code and the loss of original semantics. Consequently, existing pre-trained code language models are rendered ineffective in such scenarios. Current research addresses this issue by converting malicious bytecode into grayscale images and extracting visual features for classification. However, this process truncates the original sequence, compromising its coherence and structure. Furthermore, the image dimensions undergo compression and cropping based on the model's input requirements, leading to the loss of intricate details. Our solution is a lossless encoding method for the visual structure of code, enabling unrestricted processing of malicious code images of any size. We convert bytecode files into semantically lossless images with proportional width. Then, we use image interleaving encoding to address semantic truncation issues caused by traditional image preprocessing methods. This method also prevents the loss of original code information due to image cropping or compression. For feature extraction, our goal is to combine the lossless encoding results with both local receptive field features and global contextual features. For local features, we achieve uniform embedding of variably sized input samples into equally sized feature maps using a multi-scale feature extraction module. For global contextual features, we reframe the feature maps along the row dimension, treating them as long-text sequences embedded in a matrix. We segment the feature maps into multiple row patch blocks and modify the Transformer's input components to cache and merge the hidden states of each block. 
Comparative experiments on various malware datasets demonstrate the effectiveness of our method, consistently achieving outstanding performance across classification metrics.
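
The first and last steps described in the abstract (turning a bytecode file into an image without dropping any bytes, then cutting the result into row patch blocks) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `bytes_to_image` and `row_patches`, the power-of-two width rule, the zero padding of the last row, and the block height are all assumptions made for illustration. The abstract does not specify the width rule or the interleaving encoding, so the latter is not reproduced here.

```python
import math
from typing import List, Optional

import numpy as np


def bytes_to_image(data: bytes, width: Optional[int] = None) -> np.ndarray:
    """Reshape raw bytes into a 2-D uint8 grayscale array, keeping every byte.

    Assumption: when no width is given, it is the power of two nearest above
    sqrt(len(data)), with a floor of 16, so the width grows with file size.
    """
    n = len(data)
    if n == 0:
        raise ValueError("empty input")
    if width is None:
        width = 2 ** max(4, math.ceil(math.log2(math.sqrt(n))))
    height = math.ceil(n / width)
    # Pad only the tail of the last row with zeros; no byte is cropped.
    buf = np.zeros(height * width, dtype=np.uint8)
    buf[:n] = np.frombuffer(data, dtype=np.uint8)
    return buf.reshape(height, width)


def row_patches(img: np.ndarray, rows_per_patch: int) -> List[np.ndarray]:
    """Split an image into consecutive blocks of rows.

    The final block may hold fewer rows; nothing is resized or discarded.
    """
    return [img[i:i + rows_per_patch]
            for i in range(0, img.shape[0], rows_per_patch)]


if __name__ == "__main__":
    sample = bytes(range(250)) * 4          # 1000 synthetic bytes, not real malware
    image = bytes_to_image(sample)
    blocks = row_patches(image, rows_per_patch=5)
    print(image.shape, len(blocks))
```

Because the image height simply follows the file length, inputs of different sizes yield different numbers of row blocks. In the approach the abstract describes, that variable-length sequence of blocks is what the modified Transformer consumes by caching and merging per-block hidden states.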
