On the Classification Technology in the Search Engine
Cite this article: WAN Xiao-rong, MA Shuai, LIU Li-jun. On the Classification Technology in the Search Engine [J]. Journal of Ningbo Radio & TV University, 2008, 6(2): 116-118.
Authors: WAN Xiao-rong  MA Shuai  LIU Li-jun
Institution: Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650051, Yunnan, China
Funding: Fund of the Faculty of Information Engineering and Automation, Kunming University of Science and Technology
Abstract: This paper proposes an automatic Web document classification algorithm based on focused (topic-specific) crawling; the algorithm improves on the naive Bayes classification model. Using the algorithm, we implemented a web page classification system based on topic-specific information collection. The paper concentrates on three modules of the system: page parsing, Chinese word segmentation, and text classification. It then evaluates the improved Bayes classification method. Experimental results show that the algorithm achieves relatively high accuracy in web page classification.

Keywords: Topic collecting  Spider crawling  Chinese word segmentation  Text classification  Bayes classifier
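
For orientation, the kind of classifier the abstract builds on can be sketched as below. This is a minimal sketch of a standard multinomial naive Bayes text classifier with add-one (Laplace) smoothing. It is not the authors' improved variant, whose modifications the abstract does not describe. The function names `train_naive_bayes` and `classify`, the class labels, and the toy training data are all hypothetical, and the input is assumed to be already segmented into words (the Chinese word segmentation step is not shown).

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect counts from (tokens, label) pairs; tokens are pre-segmented words."""
    class_doc_counts = Counter()        # number of training documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()                       # all words seen during training
    for tokens, label in docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, word_counts, vocab


def classify(tokens, model):
    """Return the class with the highest log posterior under naive Bayes."""
    class_doc_counts, word_counts, vocab = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_doc_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothed log likelihood of the word given the class.
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: each document is a list of segmented words.
training = [
    (["足球", "比赛", "进球"], "sports"),
    (["篮球", "比赛", "球员"], "sports"),
    (["股票", "市场", "上涨"], "finance"),
    (["银行", "利率", "市场"], "finance"),
]
model = train_naive_bayes(training)
print(classify(["比赛", "球员"], model))
print(classify(["市场", "利率"], model))
```

In a system like the one the abstract outlines, the token lists would come from the page parsing and Chinese word segmentation modules rather than being written by hand.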
This article is indexed in databases including VIP (维普) and Wanfang Data (万方数据).