Deep Web Structured Data Extraction Based on Minimal DFS

Citation:	Liu Ronghui, Zheng Jianguo, Wang Xiang. Deep Web Structured Data Extraction Based on Minimal DFS [J]. Library and Information Service (图书情报工作), 2010, 54(14): 126-130.

Authors:	Liu Ronghui (刘荣辉), Zheng Jianguo (郑建国), Wang Xiang (王翔)

Institution:	School of Management, Donghua University (东华大学管理学院)

Abstract:	This paper analyzes how dynamically generated data is laid out in Web pages and proposes a new automated method for extracting structured data from the Deep Web. First, a DOM-based algorithm quickly locates the data region, which avoids processing large amounts of noisy data. Second, DOM subtrees are represented by minimal DFS codes and clustered to distinguish the data record regions. Finally, extraction rules are learned from a small number of sample pages and then applied to data extraction. Experiments with a prototype system on pages from real websites show that the method achieves high accuracy and efficiency.

Keywords:	Deep Web; structured data; minimal DFS; edit distance (Levenshtein distance); information extraction

Received:	2010-03-16
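
The abstract names the building blocks (a DOM tree, a minimal DFS code per subtree, clustering, and edit distance as a keyword) but does not give the encoding or the clustering procedure. The following Python sketch is only an illustration under stated assumptions, not the authors' algorithm: the `Node` class, the `"$"` backtrack marker, the sorted-children canonical form, the greedy grouping, and the 0.3 threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Node:
    """Hypothetical DOM-like element: a tag name plus child elements."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def min_dfs_code(node: Node) -> Tuple[str, ...]:
    """Canonical depth-first code of the subtree rooted at `node`.

    Assumption: child codes are sorted before concatenation, so two
    subtrees that differ only in sibling order get the same code.
    "$" marks the return from a subtree.
    """
    code = [node.tag]
    for child_code in sorted(min_dfs_code(c) for c in node.children):
        code.extend(child_code)
    code.append("$")
    return tuple(code)


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def cluster_siblings(nodes: List[Node], threshold: float = 0.3) -> List[List[Node]]:
    """Greedily group sibling subtrees whose codes are similar.

    A node joins the first cluster whose representative code lies within
    `threshold` normalized edit distance; otherwise it starts a new cluster.
    """
    clusters: List[Tuple[Tuple[str, ...], List[Node]]] = []
    for n in nodes:
        code = min_dfs_code(n)
        for rep, members in clusters:
            if edit_distance(code, rep) / max(len(code), len(rep)) <= threshold:
                members.append(n)
                break
        else:
            clusters.append((code, [n]))
    return [members for _, members in clusters]


# Three record-like <li> subtrees and one unrelated <div> subtree.
siblings = [
    Node("li", [Node("a"), Node("span")]),
    Node("li", [Node("span"), Node("a")]),
    Node("li", [Node("a"), Node("span"), Node("b")]),
    Node("div", [Node("script")]),
]
print([len(group) for group in cluster_siblings(siblings)])
```

In this sketch the largest cluster of similar sibling subtrees would be treated as the candidate data records. A real extractor would still need the data-region location step and the rule-learning step that the abstract mentions, neither of which is modeled here.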
This article is indexed in Wanfang Data and other databases.