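A malicious PDF detection method based on feature agglomeration and convolutional neural network - AET - Application of Electronic Technique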

A malicious PDF detection method based on feature agglomeration and convolutional neural network

Information Technology and Network Security

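Yu Yuanzhe, Wang Jinshuang, Zou Xia

(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210001, Jiangsu, China)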

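Abstract: Existing malicious PDF document detection methods suffer from high feature dimensionality and from model under-fitting caused by small datasets. To address these problems, this paper proposes a malicious PDF document detection method based on feature agglomeration and a convolutional neural network (CNN). Building on the bag-of-words model, the method extracts regular features and structural features from PDF documents. It then applies Ward's minimum variance clustering method to agglomerate the features, with the objective of minimizing the variance of the merged feature clusters. Finally, the agglomerated features are fed into a CNN classification model for training. The optimal number of agglomerated features is chosen by comparing model performance across different numbers of agglomerated features. Experimental results show that the method reduces the feature dimension, improves the model's recall, and mitigates under-fitting. In a longitudinal comparison, using the optimal number of agglomerated features found by traversal under different ratios of benign to malicious samples, recall improved by 53% on average, the F-score improved by 0.44 on average, and running time fell by 27% on average. In a cross-tool comparison with the three detection tools PJScan, PDFrate, and Luxor, overall detection performance improved by 5% on average.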

Keywords: malicious PDF document; feature agglomeration; static detection; convolutional neural network

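CLC number: TP309
Document code: A
DOI: 10.19358/j.issn.2096-5133.2021.08.006
Citation format: Yu Yuanzhe, Wang Jinshuang, Zou Xia. A malicious PDF detection method based on feature agglomeration and convolutional neural network[J]. Information Technology and Network Security, 2021, 40(8): 35-41.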

A malicious PDF detection method based on feature agglomeration and convolutional neural network

Yu Yuanzhe, Wang Jinshuang, Zou Xia

(Command & Control Engineering College, Army Engineering University of PLA, Nanjing 210001, China)

Abstract: To address high feature dimensionality and the under-fitting caused by small datasets, a malicious PDF document detection method based on feature agglomeration and a CNN is proposed. Based on the bag-of-words model, regular and structural features are extracted from PDF documents. Ward's minimum variance clustering method is then used to agglomerate the features, with the objective of minimizing the variance of the merged feature clusters. The agglomerated features are then fed into the CNN classification model for training and evaluation. The optimal number of agglomerated features is determined by comparing model performance under different numbers of agglomerated features. Experiments show that the proposed method reduces the feature dimension, improves the model's recall, and mitigates under-fitting. Across different proportions of benign and malicious samples, recall increases by 53%, the F-score increases by 0.44, and running time falls by 27% on average. Compared with the detection tools PJScan, PDFrate and Luxor, overall detection performance improves by 5% on average.

Key words: malicious PDF document; feature agglomeration; static detection; Convolutional Neural Network (CNN)

0 Introduction

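PDF (Portable Document Format) documents are very widely used. As the format has evolved through successive versions, however, PDF documents have gained a wide variety of features, and some little-known ones (such as file embedding, JavaScript code execution, and dynamic forms) are increasingly exploited by attackers to carry out malicious network attacks[1]. APT (Advanced Persistent Threat) attacks[2] often use malicious PDF documents as a vehicle: through social engineering, watering-hole attacks, phishing, and similar techniques, attackers craft carefully disguised malicious documents and trick victims into downloading them, thereby intruding into or damaging computer systems. Compared with traditional attacks based on malicious executables, malicious-document attacks are more deceptive.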

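In recent years, machine-learning-based techniques for detecting malicious PDF documents have been widely adopted. Compared with traditional signature matching, they can detect new types of malicious documents promptly, and the detection model can be updated quickly and conveniently. Among them, machine learning methods based on static detection are efficient, low-cost, and highly interpretable. Deep learning, compared with conventional machine learning algorithms, places more emphasis on learning hidden information in the data, such as correlations among features.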
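
The feature-agglomeration step named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal pure-Python version of Ward's criterion applied to the columns of a bag-of-words matrix, where merging two feature clusters costs n_a * n_b / (n_a + n_b) times the squared distance between their centroids. The function names, the toy keyword columns, the counts, and the choice to pool each cluster by averaging its columns are all assumptions made for illustration.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = a["size"], b["size"]
    dist2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return na * nb / (na + nb) * dist2


def agglomerate_features(X, n_clusters):
    """Group the columns of X into n_clusters using Ward's criterion.

    X is a list of rows (one row per document, one column per feature).
    Returns a sorted list of clusters, each a sorted list of column indices.
    """
    n_features = len(X[0])
    # Each feature starts as its own cluster; its "point" is the column vector.
    clusters = [
        {"members": [j], "size": 1, "centroid": [row[j] for row in X]}
        for j in range(n_features)
    ]
    while len(clusters) > n_clusters:
        # Pick the pair whose merge raises the total variance the least.
        i, k = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]),
        )
        a, b = clusters[i], clusters[k]
        size = a["size"] + b["size"]
        merged = {
            "members": sorted(a["members"] + b["members"]),
            "size": size,
            # Size-weighted mean of the two centroids.
            "centroid": [
                (a["size"] * x + b["size"] * y) / size
                for x, y in zip(a["centroid"], b["centroid"])
            ],
        }
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, k)]
        clusters.append(merged)
    return sorted(c["members"] for c in clusters)


def pool(X, groups):
    """Replace each feature group by the mean of its columns (an assumption)."""
    return [[sum(row[j] for j in g) / len(g) for g in groups] for row in X]


# Toy bag-of-words counts: rows are documents, columns are hypothetical
# keywords (/JS, /JavaScript, /Page, /Font). All values are invented.
X = [
    [3, 3, 1, 1],
    [0, 0, 5, 4],
    [2, 2, 1, 2],
    [0, 0, 6, 5],
]
groups = agglomerate_features(X, n_clusters=2)
X_reduced = pool(X, groups)
print(groups)     # column indices grouped into two agglomerated features
print(X_reduced)  # one pooled value per group for each document
```

In this toy data the two JavaScript-related columns carry identical counts, so they merge at zero cost and the four features collapse to two. The reduced matrix is what a downstream classifier, a CNN in the paper's setting, would consume, and the number of clusters plays the role of the agglomerated-feature count that the abstract says is tuned by comparing model performance.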

For the full text of this article, please download: http://www.chinaaet.com/resource/share/2000003722

Original content statement: This content is original to the AET website; reproduction without authorization is prohibited.