首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
This paper is concerned with a framework to compute the importance of webpages by using real browsing behaviors of Web users. In contrast, many previous approaches like PageRank compute page importance through the use of the hyperlink graph of the Web. Recently, people have realized that the hyperlink graph is incomplete and inaccurate as a data source for determining page importance, and proposed using the real behaviors of Web users instead. In this paper, we propose a formal framework to compute page importance from user behavior data (which covers some previous works as special cases). First, we use a stochastic process to model the browsing behaviors of Web users. According to the analysis on hundreds of millions of real records of user behaviors, we justify that the process is actually a continuous-time time-homogeneous Markov process, and its stationary probability distribution can be used as the measure of page importance. Second, we propose a number of ways to estimate parameters of the stochastic process from real data, which result in a group of algorithms for page importance computation (all referred to as BrowseRank). Our experimental results have shown that the proposed algorithms can outperform the baseline methods such as PageRank and TrustRank in several tasks, demonstrating the advantage of using our proposed framework.  相似文献   

2.
This paper examines the way in which Taiwan is connected to on the World Wide Web in South Korea. The Web may represent a new channel for the communication among a global society's members and a reflection of international relations. Thus, it is necessary to explore the distribution of relations formed and maintained on the Web and the contents of those relations as well. This paper traced South Korean Web pages hyperlinking pages hosted in Taiwan, using a search engine. The context in which Taiwan appears in South Korean pages was also examined. Specifically, the structure of hyperlink connectivity from South Korea and Taiwan was analyzed. It was found that the hyperlink network was very sparsely connected in terms of the number of South Korean Web pages hyperlinking to the pages of the other country. The contents of hyperlink-connected information were categorized and analyzed. The most often occurring content category was ‘Computers & Internet’ in Taiwan. This suggests that South Korean Web users including organizations are more interested in computer-related products in Taiwan than any other things. The implication of this paper is to examine the state and form of international information flow from South Korea to Taiwan based on the patterns of hyperlink relations inscribed on South Korean Web pages and the type and content of information.  相似文献   

3.
主要介绍了设计开发Web主题信息采集系统的一个核心算法——超链接主题预测算法。文章在已有理论的基础上,通过实验分析,发现超链接的主题主要取决于三个因素:父网页的主题相关度、锚文本的主题相关度和Web子图的链接结构特性,从而提出了基于Web页面内容和链接结构的超链接主题预测算法,系统评价结果显示该算法有很好的效果。  相似文献   

4.
提出人才网页自动识别系统设计,实现对Nutch定向采集系统抓取的高校网站页面进行人才描述网页自动识别。识别过程中使用自动获取的网页的URL特征、网页Title标签特征、链接文字特征以及网页文本内容特征,使用人名词表、正面特征词表、负面特征词表对各项识别特征进行匹配以计算特征值,借助开源软件LibSVM实现基于多特征值的人才网页自动识别。  相似文献   

5.
目前,在网页分类中,对HTML主要结构特征进行加权的常用方法是绝对数值加权方法.这种方法的缺点是加权系数为定值,其对长文本和短文本所起的作用不同,使得结构特征对正文的影响随着正文长度的增加而削弱.针对该缺点,本文提出了一种改进型加权方法,即相对数值加权方法.通过网页层次分类的实验,比较了这两种方法对单个标签域以及多个标签域结合的分类性能.实验结果表明,相对数值加权方法能有效提高分类的精确度,并且效果优于绝对数值加权方法.  相似文献   

6.
The Institute for Scientific Information (ISI), part of Thompson Scientific, produces the Web of Science database as part of its Web of Knowledge. Recently, ISI introduced some new features, among them a new Author Finder feature, which allows users to zero in on a specific author in a very guided way. In addition, search results may be analyzed and reports created by users, both at the click of a button. This column focuses on these recently introduced features.  相似文献   

7.
ABSTRACT

Adding multiple sources of information in the display of Web search results may negatively affect users’ perceptual experience and information-seeking behavior. This claim was established by investigating the impact of different Web search compositions on users’ ability to extract specific information. In this article, we assumed that the quantity and order of different compositions (areas) in the Web search results page may contribute to individual’s ability to find information relevant to their search queries. An eye-tracking device was used to observe and compare the perceptual behavior of 14 users in an information-seeking task. The results showed that the use of different compositions in the display of Web results page significantly influenced users’ perceptual experience by reducing their attention to the organic results area. The quantity of these compositions was found to greatly increase the cognitive load of users when attempting to retrieve information from the organic area, which negatively affects their information-seeking performance. Our finding provides a rationale for further studies to consider the impact of quantity and order of Web page compositions on individuals’ perceptual attention and cognitive load in information-seeking task settings.  相似文献   

8.
The collective feedback of the users of an Information Retrieval (IR) system has been shown to provide semantic information that, while hard to extract using standard IR techniques, can be useful in Web mining tasks. In the last few years, several approaches have been proposed to process the logs stored by Internet Service Providers (ISP), Intranet proxies or Web search engines. However, the solutions proposed in the literature only partially represent the information available in the Web logs. In this paper, we propose to use a richer data structure, which is able to preserve most of the information available in the Web logs. This data structure consists of three groups of entities: users, documents and queries, which are connected in a network of relations. Query refinements correspond to separate transitions between the corresponding query nodes in the graph, while users are linked to the queries they have issued and to the documents they have selected. The classical query/document transitions, which connect a query to the documents selected by the users’ in the returned result page, are also considered. The resulting data structure is a complete representation of the collective search activity performed by the users of a search engine or of an Intranet. The experimental results show that this more powerful representation can be successfully used in several Web mining tasks like discovering semantically relevant query suggestions and Web page categorization by topic.  相似文献   

9.
一种基于源网页质量的锚文本相似度计算方法--LAAT   总被引:8,自引:0,他引:8  
陆一鸣  胡健  马范援 《情报学报》2005,24(5):548-554
锚文本作为对目标网页的描述,往往分布在不同的源网页上,质量也参差不齐。本文利用了超链接分析算法的成果,提出一种基于源网页质量的锚文本相似度计算方法——LAAT(Link Aid Anchor Text)。实验表明,利用源网页质量能够有效地综合各源网页上的锚文本组成,从而能够提高检索性能。  相似文献   

10.
Web页面中文文本主题的自动提取研究   总被引:14,自引:1,他引:13  
韩客松  王永成  滕伟 《情报学报》2001,20(2):217-223
Internet上的内容日益增多 ,搜索引擎返回的结果往往冗长。本文首先讨论Web页面文本与一般文本的四个不同点 ,然后介绍一种以统计方法为主、以匹配校验为辅的Web页面中文文本主题自动提取方法 ,它能帮助用户在最短时间内了解当前页面的主题。实验显示 ,所提取的前15个字串 ,反映主题的平均正确率在 85%以上 ,而处理时间仅为几十到几百毫秒。  相似文献   

11.
12.
Web 信息检索(Information Retrieval)技术研究是应用文本检索研究的成果,它结合Web图论的思想,研究Web上的信息检索,是行之有效的Web知识发现的途径。传统HITS方法所获得的信息精确度相当低,而PageRank作为一通用的搜索方法,不能够应用于特定主题的信息获取。在充分分析了PageRank、HITS等现有算法和Web文档的相似度计算方法的基础上,提出了Web上查询特定主题相关信息发现的RG-HITS算法。它结合了Web超链接、网页知识表示的信息相关度以及HITS方法来搜索Web上特定主题的相关知识。  相似文献   

13.
Searches with learning intent typically require the users to interact with the searching environment and perform knowledge acquisition features such as scan, read, and process the online content to fulfill their information needs. To capture indicators from searching behaviors that could account for the knowledge gained during a Web search, a qualitative study was performed using the Concurrent Think-Aloud protocol to observe the mechanisms of transfer and map knowledge flows during 78 search sessions. Findings indicate evidence of transfer of learning in the form of sixteen online information searching strategy indicators. This research aids the understanding of how knowledge is gained during search sessions and how to identify behaviors that could indicate that learning has occurred, which could be used to represent knowledge gain on Web search engines. In this way, it can aid search engines to become not only better tools of searching, but also tools of learning.  相似文献   

14.
Political candidates have responded to the public's desire to use the Internet as an interactive information source by creating their own online presence. This study is a content analysis of the Web sites and blogs of the 10 Americans vying to be the Democratic candidate for the 2004 presidential election. Focusing on interactivity, data indicated front pages hyperlink to participation areas such as Donation or Volunteer sections and rarely linked to external content. Blogs used hyperlinks at a rate less than Web sites. Interactivity was encouraged through text, as 83.7% of Web sites asked voters to become more involved. Blog posts discussed issues and attacked the opponents, including President Bush. For the most part, blog posts were personal in nature and used direct address. The tactical use of advanced Web site features showed a technological progression of political campaigning and an overall increase in interactivity through technology and text.  相似文献   

15.
复合型Web信息检索系统   总被引:5,自引:0,他引:5  
向桂林 《情报学报》2003,22(5):545-549
本文首先分析了常见的三种搜索引擎 :基于内容分析的搜索引擎、基于超链分析的搜索引擎、基于反馈分析的搜索引擎的弊端 ,提出了一种能够集三种搜索引擎优点于一身的复合型Web信息检索系统 ,并详细阐述了该系统的实现方法  相似文献   

16.
基于超链分析的Web资源自动发现技术   总被引:2,自引:0,他引:2  
传统的Web资源自动发现是基于Web页面内容实现的。本文试图从超链分析的角度探讨Web资源的自动发现技术。超链分析技术起源于社会网络分析和科学引文分析理论,它只分析页面之间的关系,而不关心页面本身的属性。通过试验证明,单纯使用超链,根据用户提供的网页实例,我们能够自动发现与学科资源相关的网站。该技术可以有效的减少网络爬行器的无谓爬行,提高采集效率,减轻网络负担,在学科资源建设中起了重要的作用。  相似文献   

17.
We investigate temporal factors in assessing the authoritativeness of web pages. We present three different metrics related to time: age, event, and trend. These metrics measure recentness, special event occurrence, and trend in revisions, respectively. An experimental dataset is created by crawling selected web pages for a period of several months. This data is used to compare page rankings by human users with rankings computed by the standard PageRank algorithm (which does not include temporal factors) and three algorithms that incorporate temporal factors, including the Time-Weighted PageRank (TWPR) algorithm introduced here. Analysis of the rankings shows that all three temporal-aware algorithms produce rankings more like those of human users than does the PageRank algorithm. Of these, the TWPR algorithm produces rankings most similar to human users’, indicating that all three temporal factors are relevant in page ranking. In addition, analysis of parameter values used to weight the three temporal factors reveals that age factor has the most impact on page rankings, while trend and event factors have the second and the least impact. Proper weighting of the three factors in TWPR algorithm provides the best ranking results.  相似文献   

18.
PageRank算法的原理简介   总被引:9,自引:0,他引:9  
在介绍PageRank算法基本思想、基本公式和计算实例的基础上,介绍如何利用PageR- ank算法提高网页PR的方法,最后指出PageRank算法存在的不足,并对其发展趋势进行分析。  相似文献   

19.
虚拟图书馆中网页的自动分类研究   总被引:1,自引:0,他引:1  
概括了国内外对电子文本及Web网页进行自动分类的研究和试验,论述了虚拟图书馆中对网页进行自动分类与一般搜索引擎中对网页进行自动分类的区别,提出了一种用于虚拟图书馆中对网页进行自动分类的方法,并描述了按照此方法建立的“图书馆学情报学”虚拟图书馆的自动分类系统,对分类结果进行了分析。  相似文献   

20.
Significant progress has been made in information retrieval covering text semantic indexing and multilingual analysis. However, developments in Arabic information retrieval did not follow the extraordinary growth of Arabic usage in the Web during the ten last years. In the tasks relating to semantic analysis, it is preferable to directly deal with texts in their original language. Studies on topic models, which provide a good way to automatically deal with semantic embedded in texts, are not complete enough to assess the effectiveness of the approach on Arabic texts. This paper investigates several text stemming methods for Arabic topic modeling. A new lemma-based stemmer is described and applied to newspaper articles. The Latent Dirichlet Allocation model is used to extract latent topics from three Arabic real-world corpora. For supervised classification in the topics space, experiments show an improvement when comparing to classification in the full words space or with root-based stemming approach. In addition, topic modeling with lemma-based stemming allows us to discover interesting subjects in the press articles published during the 2007–2009 period.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号