
Web spam detection based on semantic features from current page

Application of Electronic Technique

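Chen Musheng1,2, Gao Fei1, Wu Junhua1

(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, Jiangxi, China; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang 330013, Jiangxi, China)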

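Abstract: To address the difficulty and computational cost of feature extraction in web spam detection, this paper proposes a method that extracts semantic features solely from the HTML script of the current page. First, a memoized search algorithm combining depth-first search and dynamic programming segments the domain name into words, latent Dirichlet allocation (LDA) extracts topic words, and three single-page semantic similarity features are computed from Word2Vec word vectors and the word mover's distance. The single-page semantic similarity features are then fused with single-page statistical features, and classification algorithms such as random forest are used to build models for web spam detection. Experimental results show that classification using semantic features extracted from single-page content, fused with single-page statistical features, reaches an AUC of 88.0%, about 4% higher than the comparison method.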

Keywords: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; word vectors

Chinese Library Classification (CLC) number: TP391.6
Document code: A
DOI: 10.16157/j.issn.0258-7998.223376
Chinese citation format: 陈木生，高斐，吴俊华. 基于单页语义特征的垃圾网页检测[J]. 电子技术应用，2023，49(6)：24-29.
English citation format: Chen Musheng, Gao Fei, Wu Junhua. Web spam detection based on semantic features from current page[J]. Application of Electronic Technique, 2023, 49(6): 24-29.

Web spam detection based on semantic features from current page

Chen Musheng1,2, Gao Fei1, Wu Junhua1

(1. School of Software Engineering, Jiangxi University of Science and Technology, Nanchang 330013, China; 2. Nanchang Key Laboratory of Virtual Digital Engineering and Cultural Communication, Nanchang 330013, China)

Abstract: To address the high difficulty and heavy computation of feature extraction in web spam detection, a method is proposed that extracts semantic features based only on the HTML script of the current page. First, the domain name is segmented into words by a memoized search algorithm that combines depth-first search and dynamic programming. Second, latent Dirichlet allocation (LDA) is used to extract topic words from the web page. Last, three single-page semantic similarity features are calculated based on Word2Vec and the word mover's distance. The single-page semantic similarity features are then combined with single-page statistical features, and classification algorithms such as random forest are used to build classification models for web spam detection. The experimental results show that classification with semantic features extracted from single-page content, combined with statistical features, reaches an AUC of 88.0%, about 4% higher than that of the control method.

Keywords: web spam detection; feature extraction; memoized search; latent Dirichlet allocation; Word2Vec; word mover's distance; random forest
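The abstract names a memoized search (depth-first search combined with dynamic programming) for splitting a domain name into words but does not give its details. The following is only a minimal sketch of that general idea, not the authors' implementation. The vocabulary `VOCAB`, the penalty `UNKNOWN_COST`, the function name `segment`, and the "fewest words wins" cost are all assumptions made for illustration.

```python
from functools import lru_cache
from typing import List, Tuple

# Hypothetical toy vocabulary; a real system would load a large word list.
VOCAB = {"cheap", "car", "cars", "insurance", "online", "on", "line", "in", "sure"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)
# Assumed penalty for a character that no vocabulary word covers.
UNKNOWN_COST = 10


def segment(domain: str) -> List[str]:
    """Split a domain label into words, minimising an assumed cost.

    Each vocabulary word costs 1 and each uncovered character costs
    UNKNOWN_COST, so the search prefers few, known words.
    """

    @lru_cache(maxsize=None)
    def best(start: int) -> Tuple[int, Tuple[str, ...]]:
        # Returns (cost, words) for the suffix domain[start:].
        # lru_cache memoizes each start index, so every suffix is
        # solved once even though the DFS reaches it along many paths.
        if start == len(domain):
            return 0, ()
        # Fallback: treat the single character as an unknown token.
        tail_cost, tail_words = best(start + 1)
        result = (UNKNOWN_COST + tail_cost, (domain[start],) + tail_words)
        # Depth-first search over every vocabulary word starting here.
        limit = min(len(domain), start + MAX_WORD_LEN)
        for end in range(start + 1, limit + 1):
            word = domain[start:end]
            if word in VOCAB:
                tail_cost, tail_words = best(end)
                cost = 1 + tail_cost
                if cost < result[0]:
                    result = (cost, (word,) + tail_words)
        return result

    return list(best(0)[1])
```

With this toy vocabulary, `segment("cheapcarinsurance")` returns `['cheap', 'car', 'insurance']`. Memoizing on the start index bounds the work at roughly `len(domain) * MAX_WORD_LEN` substring lookups, whereas a plain depth-first search can revisit the same suffix exponentially often.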

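0 Introduction

With the rapid growth of information on the Internet, search engines are regarded as a key tool for reaching websites, and their users account for more than 80% of all Internet users [1]. However, studies show that about 60% of users look only at the first five results on the first page [2]. Pages that rank near the top of the search results therefore attract more visitors and, in turn, more revenue. Because raising a page's ranking by legitimate means is very difficult, some websites resort to illegitimate means and techniques to deceive search engines into ranking their pages higher; such pages are called web spam [3]. Web spam lowers the quality of search results, wastes users' time, and encroaches on the legitimate interests of search engine companies and other content websites [4]. Although search engine companies have deployed a variety of countermeasures, web spam detection remains a hard problem that search engines must focus on solving, as well as a frontier topic in academic research. Detecting web spam efficiently and accurately is therefore of great importance.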

The full text of this paper can be downloaded at: https://www.chinaaet.com/resource/share/2000005343

