Similar literature
 19 similar documents found
1.
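Organization techniques for search engine retrieval results   Total citations: 9; self-citations: 0; citations by others: 9
赵荣, 黄燕云, 张露. 《情报学报》 2004, 23(1): 69-72
This paper surveys the principles and applications of the main techniques for ranking and organizing search engine results, including keyword frequency and position, link-based page ranking algorithms, and classified organization of results.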

2.
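A Web information retrieval method based on web page segmentation   Total citations: 2; self-citations: 0; citations by others: 2
Proposes a Web information retrieval algorithm based on segmenting page content. Exploiting the semi-structured nature of web pages, the algorithm divides a page into regions according to its HTML tags and its content. After an HTML tag tree is built, nodes are merged on the basis of content similarity and visual similarity. During retrieval and ranking, the region information is used, together with the user's query, to rank the relevant results.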

3.
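Search engines that rely on keyword-based full-text retrieval suffer from low precision, and subject information gateways describe resources only at the site level. To address both problems, the authors propose combining the two to build an intelligent subject gateway search engine: pages are collected automatically from the high-quality sites listed in a subject gateway's directory, which subject experts have already screened, to form a page index; the collected pages are indexed and classified with automatic indexing and automatic classification methods; and users locate academic resource pages through a classified browsing directory and subject-term search. The article focuses on the design and implementation of page collection, automatic indexing and classification, and the user interface, and analyzes and discusses the remaining problems of the search engine.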

4.
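Traditional search engines usually analyze keywords harvested from full text, which leads to three major shortcomings: low precision caused by the lack of semantic description; low retrieval efficiency caused by redundant and vague results; and too few access points. Building on the advantages of DC (Dublin Core) metadata for describing Web resources, the project team designed DCSE, a Web search engine system based on DC metadata, to overcome these shortcomings. DCSE automatically crawls Web pages that contain DC descriptions, stores the DC description information in a database, and makes it searchable after sorting and indexing. The search interface offers multi-field Boolean combination search over the 15 DC elements, and results are displayed using the content of each DC element, such as title, creator, description, and date. Users improve precision through multi-field combination search and can quickly judge and select the information they need from the clear result display, which raises retrieval efficiency.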

5.
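Design of a deep Web crawler based on the ID3 classification algorithm   Total citations: 1; self-citations: 0; citations by others: 1
To address the low information coverage of current Web information mining, this paper studies Web crawler systems and proposes a page collection method for the deep Web based on the ID3 classification algorithm. The features of Web pages are analyzed, processed, and classified; forms that lead to deep Web pages are extracted; and these forms are submitted automatically to reach pages more deeply and broadly. Experiments show that the method can effectively reduce the blind spots of existing search engines and improve search results.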
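
The abstract does not give the features or training data the authors used, so the following is only a minimal sketch of the attribute-selection step at the core of ID3: choose the feature with the highest information gain. The feature names (`has_text_input`, `has_password`) and the toy labels are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: is a form a deep-web query form or something else?
rows = [
    {"has_text_input": True, "has_password": False},
    {"has_text_input": True, "has_password": True},
    {"has_text_input": False, "has_password": False},
    {"has_text_input": True, "has_password": False},
    {"has_text_input": True, "has_password": True},
]
labels = ["query_form", "other", "other", "query_form", "other"]

# ID3 picks the feature with the largest information gain as the next split.
best_feature = max(rows[0], key=lambda f: information_gain(rows, labels, f))
print(best_feature)
```

A full ID3 tree would apply this selection recursively to each resulting group until the labels in a group are pure or no features remain.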

6.
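A web page deduplication method based on user query keywords   Total citations: 2; self-citations: 0; citations by others: 2
Building on a study of traditional fingerprint-based deduplication algorithms, and targeting the duplicate pages that appear in metasearch engines, this paper proposes a page deduplication method based on the user's query keywords to improve metasearch retrieval quality. It describes how the algorithm is implemented and verifies its effectiveness experimentally.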

7.
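A method for cleaning HTML web pages   Total citations: 35; self-citations: 1; citations by others: 35
张志刚, 陈静, 李晓明. 《情报学报》 2004, 23(4): 387-393
"Noise" in Web pages is an important factor degrading the quality of Web applications that depend on page content, and removing it quickly and accurately is one of the key techniques for improving the quality of Web application services. This paper proposes a page cleaning method and the corresponding algorithm. Based on a set of heuristic rules, the method uses information retrieval techniques and the characteristics of Web pages to extract a page's topic and the content related to that topic, thereby cleaning the page. The method has been applied in the duplicate-removal process of the Tianwang (天网) search engine system and in an automatic web page classification system. The improvement in quality that page cleaning brought to these existing systems confirms the correctness and effectiveness of the proposed method.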

8.
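A near-duplicate web page removal algorithm based on feature vectors   Total citations: 1; self-citations: 0; citations by others: 1
Search engine result pages often contain duplicate pages with similar content, most of which result from sites reprinting one another's material. To improve retrieval efficiency and user satisfaction, this paper proposes DDW (Detect near-Duplicate Web Pages), a feature-vector-based algorithm for detecting near-duplicate Chinese web pages at large scale. Experiments show that, compared with another deduplication algorithm (I-Match), DDW resists noise well, has near-linear time and space complexity, and achieves good results in large-scale tests.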
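
The abstract does not describe how DDW builds or compares its feature vectors. As a generic illustration of feature-vector comparison, and not the DDW algorithm itself, the sketch below represents each page as a term-frequency vector and flags two pages as near-duplicates when their cosine similarity exceeds a threshold. The whitespace tokenizer and the 0.9 threshold are assumptions; Chinese text would need a word segmenter instead.

```python
import math
from collections import Counter


def feature_vector(text):
    """Term-frequency vector; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_near_duplicate(a, b, threshold=0.9):
    """Hypothetical decision rule: similarity at or above the threshold."""
    return cosine_similarity(feature_vector(a), feature_vector(b)) >= threshold


original = "search engines return many reprinted pages with the same content"
reprint = original + " source"  # a reprint with one extra word of noise
unrelated = "hyperlink analysis ranks pages"

print(is_near_duplicate(original, reprint))
print(is_near_duplicate(original, unrelated))
```

Comparing every pair of pages this way is quadratic in the number of pages, so a large-scale system would need an indexing or hashing step on top of it to approach the near-linear cost the abstract reports.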

9.
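Automatic classification of web pages in virtual libraries   Total citations: 1; self-citations: 0; citations by others: 1
Summarizes research and experiments, in China and abroad, on the automatic classification of electronic texts and Web pages; discusses how automatic page classification in a virtual library differs from that in a general search engine; proposes a method for automatically classifying web pages in a virtual library; describes the automatic classification system of a "Library and Information Science" virtual library built with this method; and analyzes the classification results.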

10.
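A multidimensional Web analysis model and its application   Total citations: 1; self-citations: 0; citations by others: 1
朱家稷, 闫宏飞. 《情报学报》 2004, 23(5): 553-560
Pages on the Web are growing and changing at an astonishing rate, which poses many new problems and challenges for the efficiency and quality of traditional search engines. There is a pressing need for a research method that can effectively analyze the massive volume of pages collected by search engines, so as to maintain a complete and clear picture of the Web and guide search engines toward more effective service. This paper proposes a three-dimensional Web analysis model based on time, space, and content, which supports multidimensional, multilevel analysis of massive page data and offers a new perspective for understanding the Web. In the experiments the authors built a simple implementation of the model and analyzed three batches of page data, obtaining data such as page change rates, the spatial distribution of pages, and the characteristics of heavily replicated pages, as well as some characteristics of the Internet as a "fourth medium" for information dissemination.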

11.
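Building on existing research, this paper analyzes the basic principles of surfacing the deep Web through general-purpose search engines and discusses several key issues related to deep Web surfacing, including determining the value ranges of form fields, query processing, and setting up hyperlinks to query results.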

12.
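Taking the PageRank and HITS algorithms as examples, this paper analyzes search engine ranking algorithms based on hyperlink analysis and summarizes the limitations of applying hyperlink analysis to the ranking of search engine results.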
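
For readers unfamiliar with hyperlink-based ranking, here is a minimal sketch of PageRank computed by power iteration. It is a textbook formulation, not code from the paper; the damping factor of 0.85, the iteration count, and the four-page toy graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """PageRank by power iteration.

    `links` maps each page to the list of pages it links to; every link
    target must also appear as a key. Scores sum to 1.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the uniform "random jump" share first.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical toy graph: "c" is linked to by three of the four pages.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because the score depends only on link structure and not on the query or page content, a heavily linked page ranks high regardless of topic, which is one commonly cited limitation of pure link analysis.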

13.
This paper examines how Taiwan is linked to from South Korean pages on the World Wide Web. The Web may represent a new channel of communication among the members of a global society and a reflection of international relations. Thus, it is necessary to explore both the distribution of the relations formed and maintained on the Web and the contents of those relations. This paper used a search engine to trace South Korean Web pages that hyperlink to pages hosted in Taiwan. The context in which Taiwan appears in South Korean pages was also examined. Specifically, the structure of hyperlink connectivity from South Korea to Taiwan was analyzed. It was found that the hyperlink network was very sparsely connected in terms of the number of South Korean Web pages hyperlinking to the pages of the other country. The contents of hyperlink-connected information were categorized and analyzed. The most frequently occurring content category was 'Computers & Internet' in Taiwan. This suggests that South Korean Web users, including organizations, are more interested in computer-related products in Taiwan than in anything else. The paper thus characterizes the state and form of international information flow from South Korea to Taiwan, based on the patterns of hyperlink relations inscribed on South Korean Web pages and the type and content of the information.

14.
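A hybrid Web information retrieval system   Total citations: 5; self-citations: 0; citations by others: 5
向桂林. 《情报学报》 2003, 22(5): 545-549
This paper first analyzes the drawbacks of three common types of search engine (those based on content analysis, on hyperlink analysis, and on feedback analysis), then proposes a hybrid Web information retrieval system that combines the strengths of all three, and describes in detail how the system is implemented.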

15.
Web hyperlink analysis has been a key topic of Webometric research. However, inlink data collection from commercial search engines has been limited to only one source in recent years, which is not a promising prospect for the future development of the field. We need to tap into other Web data sources and to develop new methods. Toward this end, we propose a new Webometrics concept that is based on words rather than inlinks on Webpages. We propose that word co-occurrences on Webpages can be a measure of the relatedness of organizations. Word co-occurrence data can be collected from both general search engines and blog search engines, which expands data sources greatly. The proposed concept is tested in a group of companies in the LTE and WiMax sectors of the telecommunications industry. Data on the co-occurrences of company names on Webpages were collected from Google and Google Blog. The co-occurrence matrices were analyzed using MDS. The resulting MDS maps were compared with industry reality and with the MDS maps from co-link analysis. Results show that Web co-word analysis could potentially be as useful as Web co-link analysis. Google Blog seems to be a better source than Google for co-word data collection.

16.
Anchor texts complement Web page content and have been used extensively in commercial Web search engines. Existing methods for anchor text weighting rely on hyperlink information created by page content editors. Since anchor texts are created to help users browse the Web, the browsing behavior of Web users may also provide useful or complementary information for anchor text weighting. In this paper, we discuss the possibility and effectiveness of incorporating the browsing activities of Web users into anchor texts for Web search. We first analyze the effectiveness of anchor texts with browsing activities. We then propose two new anchor models that incorporate browsing activities. To deal with the data sparseness problem of user-clicked anchor texts, two features of users' browsing behavior are explored and analyzed. Based on these features, a smoothing method for the new anchor models is proposed. Experimental results show that, by incorporating browsing activities, the new anchor models outperform the state-of-the-art anchor models that use only hyperlink information. This study demonstrates that Web browsing activities benefit anchor text weighting.

17.
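Ranking algorithms for specialized search engines   Total citations: 5; self-citations: 0; citations by others: 5
Discusses the general factors that influence search engine ranking: term frequency and term position, user behavior, links between pages, and so on. On this basis, it proposes topic relevance as a ranking factor for specialized search engines and tests it on a search engine for basic education. The experimental results show that appropriate use of topic relevance in a specialized search engine clearly improves the ranking of results.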

18.
Web search algorithms that rank Web pages by examining the link structure of the Web are attractive from both theoretical and practical aspects. Today's prevailing link-based ranking algorithms rank Web pages by using the dominant eigenvector of certain matrices, such as the co-citation matrix or variations thereof. Recent analyses of ranking algorithms have focused attention on the case where the corresponding matrices are irreducible, thus avoiding singularities of reducible matrices. Consequently, rank analysis has been concentrated on authority connected graphs, which are graphs whose co-citation matrix is irreducible (after deleting zero rows and columns). Such graphs conceptually correspond to thematically related collections, in which most pages pertain to a single, dominant topic of interest. A link-based search algorithm A is rank-stable if minor changes in the link structure of the input graph, which is usually a subgraph of the Web, do not affect the ranking it produces; algorithms A, B are rank-similar if they produce similar rankings. These concepts were introduced and studied recently for various existing search algorithms. This paper studies the rank-stability and rank-similarity of three link-based ranking algorithms (PageRank, HITS and SALSA) in authority connected graphs. For this class of graphs, we show that neither HITS nor PageRank is rank-stable. We then show that HITS and PageRank are not rank-similar on this class, nor is either of them rank-similar to SALSA. This research was supported by the Fund for the Promotion of Research at the Technion, and by the Barnard Elkin Chair in Computer Science.
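
To make "ranking by the dominant eigenvector of the co-citation matrix" concrete, the sketch below computes HITS-style authority scores by power iteration on C = AᵀA, where A is the adjacency matrix. It is a standard formulation and not the paper's own code; the four-page graph is a hypothetical example, chosen so that C is irreducible once zero rows and columns are deleted. The function assumes the graph has at least one link.

```python
def hits_authority(adjacency, iterations=100):
    """Authority scores: dominant eigenvector of the co-citation matrix.

    adjacency[k][i] is 1 if page k links to page i, else 0.
    Assumes at least one link exists, so the normalization never divides by 0.
    """
    n = len(adjacency)
    # Co-citation matrix C = A^T A: C[i][j] counts pages linking to both i and j.
    cocitation = [
        [sum(adjacency[k][i] * adjacency[k][j] for k in range(n))
         for j in range(n)]
        for i in range(n)
    ]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [sum(cocitation[i][j] * scores[j] for j in range(n))
                  for i in range(n)]
        norm = sum(s * s for s in scores) ** 0.5
        scores = [s / norm for s in scores]  # keep the vector at unit length
    return scores


# Hypothetical graph: pages 0 and 1 both link to page 2; page 0 also links to 3.
adjacency = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
authority = hits_authority(adjacency)
print([round(s, 4) for s in authority])
```

Rank-stability in the sense of the abstract could be probed by flipping a single entry of `adjacency` and checking whether the ordering of `authority` changes.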

19.
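Measuring the timeliness of scientific information on the Web   Total citations: 3; self-citations: 0; citations by others: 3
Timeliness is an important factor affecting the quality of online information. Taking publicly accessible scientific information on the Web as its object, this study uses the analytic hierarchy process (AHP) to assign weights to the indicators used to measure timeliness, runs tracking queries on 32 subject terms from 8 disciplines, including mathematics, life sciences, physics, and materials science, and takes the top 50 pages returned by the Google, Yahoo, and Altavista search engines as the measurement sample. The results: the mean timeliness score of scientific information on the Web is 2.6482 (total sample of 2,814 pages), and only 34.90% of pages score above the mean. Among domains, .gov performs best; among resource types, virtual research communities and blogs are the most timely. Timeliness, however, is only one quality characteristic of online information, and quality cannot be judged on timeliness alone. Overall, the timeliness of scientific information on the Web needs improvement. The timeliness evaluation framework and method proposed in this study can help researchers and the public make a preliminary judgment about timeliness when searching for information.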
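
The abstract does not list the timeliness indicators or the pairwise judgments the authors used. As a hedged illustration of how AHP turns pairwise comparisons into weights, the sketch below applies the row geometric-mean method, a common approximation of the principal-eigenvector weights. The 3×3 comparison matrix and the three unnamed indicators are hypothetical.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] states how much more important indicator i is than
    indicator j; the matrix is reciprocal (pairwise[j][i] == 1 / pairwise[i][j]).
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


# Hypothetical judgments for three unnamed timeliness indicators.
comparisons = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 2.0],
    [1 / 5, 1 / 2, 1.0],
]
weights = ahp_weights(comparisons)
print([round(w, 3) for w in weights])

# A page's overall score is then the weighted sum of its indicator scores.
indicator_scores = [4.0, 2.0, 3.0]  # hypothetical per-indicator scores
overall = sum(w * s for w, s in zip(weights, indicator_scores))
print(round(overall, 4))
```

A complete AHP analysis would also compute a consistency ratio for the comparison matrix before accepting the weights; that step is omitted here.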
