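Research on WNBTE, a Method for Extracting the Main Text of Web Pages

Cite this article:	LI Gang, DAI Qiang-bin. Research on WNBTE, a method for extracting the main text of web pages [J]. Information Science (情报科学), 2008, 26(3): 333-336.

Authors:	LI Gang (李纲), DAI Qiang-bin (戴强斌)

Affiliation:	Center for Studies of Information Resources, Wuhan University, Wuhan, Hubei 430072

Abstract:	WNBTE is a method that extracts the main text of a web page from statistics on the number of characters in its text. It analyzes the kinds of text that appear on a web page and their characteristics, finds the node in the page that contains the most characters, and removes the layout text and descriptive text inside that node to obtain the main text. The method requires no manual intervention and no learning from samples, which overcomes a limitation of traditional web content extraction methods: the need to build a different extractor for each data source.
Keywords:	information processing; web page main-text extraction; automatic recognition
Article ID:	1007-7634(2008)03-0333-04
Revised:	7 December 2007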
An Approach Based on Words Numbers for Extracting Text from Web Pages

LI Gang, DAI Qiang-bin. An Approach Based on Words Numbers for Extracting Text from Web Pages [J]. Information Science, 2008, 26(3): 333-336.

Authors:	LI Gang, DAI Qiang-bin

Institution:	Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China

Abstract:	WNBTE is a method for extracting the main text from web pages based on word-count statistics. By analyzing the characteristics of the text on a web page, WNBTE selects the node that contains the most words. To obtain the main text, the layout text and descriptive text within that node are then removed. Unlike traditional text extraction methods, it requires neither user intervention nor training samples.

Keywords:	information mining; text extraction; automatic recognition
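
The two steps named in the abstract (pick the node with the most characters, then strip layout and descriptive text from it) can be illustrated with a minimal Python sketch. It is not the authors' implementation; the abstract does not say how nodes are delimited or how layout and descriptive text are recognized, so the following are assumptions made only for illustration: a "node" is a body, div, td, article, or section element, and each text chunk is credited to its nearest enclosing node; link text stands in for layout text; chunks shorter than a hypothetical threshold (min_chunk_len) stand in for descriptive text. The names CharCountParser and extract_main_text are likewise made up here.

```python
from html.parser import HTMLParser

CONTAINERS = {"body", "div", "td", "article", "section"}  # assumed node types
IGNORED = {"script", "style"}                              # never visible text


class CharCountParser(HTMLParser):
    """Credit every text chunk to its nearest enclosing container node."""

    def __init__(self):
        super().__init__()
        self.nodes = []    # per container: list of (text, is_link_text)
        self.stack = []    # (tag, node index) for containers still open
        self.ignored = 0   # nesting depth inside <script>/<style>
        self.in_link = 0   # nesting depth inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.nodes.append([])
            self.stack.append((tag, len(self.nodes) - 1))
        elif tag in IGNORED:
            self.ignored += 1
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            # Pop back to the matching open container, tolerating sloppy nesting.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i][0] == tag:
                    del self.stack[i:]
                    break
        elif tag in IGNORED and self.ignored:
            self.ignored -= 1
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or self.ignored or not self.stack:
            return
        self.nodes[self.stack[-1][1]].append((text, bool(self.in_link)))


def extract_main_text(html, min_chunk_len=20):
    """Return the filtered text of the container holding the most characters."""
    parser = CharCountParser()
    parser.feed(html)
    parser.close()
    if not parser.nodes:
        return ""
    # Step 1: choose the node with the largest total character count.
    best = max(parser.nodes, key=lambda chunks: sum(len(t) for t, _ in chunks))
    # Step 2 (assumed rules): drop link text and very short chunks.
    return "\n".join(
        t for t, is_link in best if not is_link and len(t) >= min_chunk_len
    )


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content">
  <p>WNBTE selects the node that holds the most characters on the page.</p>
  <p>It then strips layout text and descriptive text from that node.</p>
  <a href="/next">Next page</a>
  <p>Source: demo</p>
</div>
<div id="footer">Copyright notice</div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On the sample page the content div wins the character count, and the link and the short "Source" line are filtered out, leaving the two sentences. The threshold of 20 characters is arbitrary; a real page mix would need it tuned, and the paper's own rules for layout and descriptive text may differ.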
This article is indexed in CNKI, VIP, Wanfang Data, and other databases.
