
Research on a Topical Search Engine Based on Heritrix and Lucene
Citation: JIA Chao, WEI Wen-xue. Research on the topical search engine based on Heritrix and Lucene[J]. China Science and Technology Information (中国科技信息), 2012(10): 95-96.
Authors: JIA Chao (贾超), WEI Wen-xue (卫文学)
Institution: College of Information Science & Engineering, Shandong University of Science & Technology, Qingdao, Shandong 266590, China
Abstract: A topical search engine, also known as a vertical search engine, is mainly used to meet the needs of users in a specific domain. Heritrix is an open-source web crawler, but its WebUI launch mode is not easy for most users to work with. This paper departs from the usual way of using Heritrix: it abandons the WebUI launch mode, modifies the Heritrix source code, and integrates Lucene into Heritrix to build a complete search engine. A listener monitors the state of the search engine so that it can crawl and update its data automatically. The paper also adds a web page filtering module and improves the algorithm used to sort (rank) query results, which improves the usability of the search engine and the accuracy of its queries.

Keywords: topical search engine; Heritrix; Lucene; sorting algorithm
This article is indexed in CNKI, Wanfang Data (万方数据), and other databases.