Similar Literature
1.
This paper is concerned with Markov processes for computing page importance. Page importance is a key factor in Web search. Many algorithms such as PageRank and its variations have been proposed for computing the quantity in different scenarios, using different data sources, and with different assumptions. Then a question arises, as to whether these algorithms can be explained in a unified way, and whether there is a general guideline to design new algorithms for new scenarios. In order to answer these questions, we introduce a General Markov Framework in this paper. Under the framework, a Web Markov Skeleton Process is used to model the random walk conducted by the web surfer on a given graph. Page importance is then defined as the product of two factors: page reachability, the average possibility that the surfer arrives at the page, and page utility, the average value that the page gives to the surfer in a single visit. These two factors can be computed as the stationary probability distribution of the corresponding embedded Markov chain and the mean staying time on each page of the Web Markov Skeleton Process respectively. We show that this general framework can cover many existing algorithms including PageRank, TrustRank, and BrowseRank as its special cases. We also show that the framework can help us design new algorithms to handle more complex problems, by constructing graphs from new data sources, employing new family members of the Web Markov Skeleton Process, and using new methods to estimate these two factors. In particular, we demonstrate the use of the framework with the exploitation of a new process, named Mirror Semi-Markov Process. In the new process, the staying time on a page, as a random variable, is assumed to be dependent on both the current page and its inlink pages. 
Our experimental results on both the user browsing graph and the mobile web graph validate that the Mirror Semi-Markov Process is more effective than previous models in several tasks, even when web spam is present and when the preferential-attachment assumption does not hold.
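
The two-factor definition in this abstract (importance as reachability times utility) can be illustrated with a minimal Python sketch. The function names, the toy transition matrix, and the final normalization to a sum of 1 are assumptions made for illustration; the abstract does not specify how the factors are estimated in practice.

```python
def stationary_distribution(P, iterations=1000, tol=1e-12):
    """Power iteration for the stationary distribution of a row-stochastic matrix P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # One step of the embedded Markov chain: pi <- pi * P
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        converged = max(abs(a - b) for a, b in zip(nxt, pi)) < tol
        pi = nxt
        if converged:
            break
    return pi


def page_importance(P, mean_staying_time):
    """Importance = reachability (stationary probability) * utility (mean staying time).

    Normalizing the product so that it sums to 1 is an assumption of this sketch.
    """
    reachability = stationary_distribution(P)
    raw = [r * u for r, u in zip(reachability, mean_staying_time)]
    total = sum(raw)
    return [x / total for x in raw]


# Hypothetical three-page graph; each row holds the transition probabilities of one page.
P = [[0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4],
     [0.5, 0.3, 0.2]]
print(page_importance(P, [1.0, 1.0, 5.0]))
```

With equal staying times on every page the result reduces to the stationary distribution alone, which is how a PageRank-style score appears as a special case of this sketch.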

2.
Anchor texts complement Web page content and have been used extensively in commercial Web search engines. Existing methods for anchor text weighting rely on hyperlink information created by page content editors. Since anchor texts are created to help users browse the Web, the browsing behavior of Web users may also provide useful or complementary information for anchor text weighting. In this paper, we discuss the possibility and effectiveness of incorporating the browsing activities of Web users into anchor texts for Web search. We first analyze the effectiveness of anchor texts with browsing activities. We then propose two new anchor models that incorporate browsing activities. To deal with the data sparseness problem of user-clicked anchor texts, two features of users’ browsing behavior are explored and analyzed. Based on these features, a smoothing method for the new anchor models is proposed. Experimental results show that, by incorporating browsing activities, the new anchor models outperform state-of-the-art anchor models that use only hyperlink information. This study demonstrates the benefits of using Web browsing activities in anchor text weighting.

3.
We present a framework for approximating random-walk based probability distributions over Web pages using graph aggregation. The basic idea is to partition the graph into classes of quasi-equivalent vertices, to project the page-based random walk to be approximated onto those classes, and to compute the stationary probability distribution of the resulting class-based random walk. From this distribution we can quickly reconstruct a distribution on pages. In particular, our framework can approximate the well-known PageRank distribution by setting the classes according to the set of pages on each Web host. We experimented on a Web-graph containing over 1.4 billion pages and over 6.6 billion links from a crawl of the Web conducted by AltaVista in September 2003. We were able to produce a ranking that has Spearman rank-order correlation of 0.95 with respect to PageRank. The clock time required by a simplistic implementation of our method was less than half the time required by a highly optimized implementation of PageRank, implying that larger speedup factors are probably possible. Significant portions of the work presented here were done while A. Broder and R. Lempel were employed by the AltaVista corporation.
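
The aggregation idea can be sketched roughly as follows, in Python. This is not the authors' implementation: the projection used here (host-to-host transitions weighted by link counts), the damping factor, and the reconstruction step (spreading each host's mass uniformly over its pages) are all assumptions chosen to keep the sketch short, and `host_level_walk` and `spread_to_pages` are hypothetical names.

```python
def host_level_walk(links, host_of, damping=0.85, iterations=200):
    """Stationary distribution of a damped random walk over hosts instead of pages."""
    hosts = sorted(set(host_of.values()))
    # Project page links onto hosts by counting links between each pair of hosts.
    counts = {h: {} for h in hosts}
    for src, targets in links.items():
        for dst in targets:
            row = counts[host_of[src]]
            row[host_of[dst]] = row.get(host_of[dst], 0) + 1
    n = len(hosts)
    score = {h: 1.0 / n for h in hosts}
    for _ in range(iterations):
        new = {h: (1.0 - damping) / n for h in hosts}
        for h in hosts:
            total = sum(counts[h].values())
            if total == 0:
                # A host without out-links spreads its mass evenly over all hosts.
                for g in hosts:
                    new[g] += damping * score[h] / n
            else:
                for g, c in counts[h].items():
                    new[g] += damping * score[h] * c / total
        score = new
    return score


def spread_to_pages(host_score, host_of):
    """Reconstruct a page-level distribution by splitting each host's mass evenly."""
    sizes = {}
    for h in host_of.values():
        sizes[h] = sizes.get(h, 0) + 1
    return {p: host_score[h] / sizes[h] for p, h in host_of.items()}


# Hypothetical toy graph: pages a1 and a2 live on host A, b1 on host B, c1 on host C.
links = {"a1": ["a2", "b1"], "a2": ["a1"], "b1": ["a1"], "c1": []}
host_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
print(spread_to_pages(host_level_walk(links, host_of), host_of))
```

The iteration runs over hosts rather than pages, which is where a time saving of the kind the abstract reports would come from when hosts are far fewer than pages.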

4.
Online display advertising is a multi-billion dollar industry where advertisers promote their products to users by having publishers display their advertisements on popular Web pages. An important problem in online advertising is how to forecast the number of user visits for a Web page during a particular period of time. Prior research addressed the problem by using traditional time-series forecasting techniques on historical data of user visits (e.g., via a single regression model built for forecasting based on historical data for all Web pages) and did not fully exploit the fact that different types of Web pages and different time stamps have different patterns of user visits. In this paper, we propose a series of probabilistic latent class models to automatically learn the underlying user visit patterns among multiple Web pages and multiple time stamps. The last (and the most effective) proposed model identifies latent groups/classes of (i) Web pages and (ii) time stamps with similar user visit patterns, and learns a specialized forecast model for each latent Web page and time stamp class. Compared with a single regression model as well as several other baselines, the proposed latent class model approach can differentiate the importance of different types of information across different classes of Web pages and time stamps, and therefore has much better modeling flexibility. An extensive set of experiments, along with detailed analysis carried out on real-world data from Yahoo!, demonstrates the advantage of the proposed latent class models in forecasting online user visits in online display advertising.

5.
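An Introduction to the Principles of the PageRank Algorithm   Total citations: 9 (self-citations: 0; citations by others: 9)
After introducing the basic idea, basic formula, and a worked example of the PageRank algorithm, the paper describes how the algorithm can be used to raise a web page's PR value, points out the algorithm's shortcomings, and analyzes its development trends.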
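
The abstract does not reproduce the formula itself. For orientation, a minimal Python sketch of the commonly cited form, PR(p) = (1 - d)/N + d · Σ PR(q)/L(q) over the pages q that link to p, is given below. The damping factor of 0.85, the handling of pages without out-links, and the toy graph are assumptions of this sketch rather than details taken from the paper.

```python
def pagerank(links, damping=0.85, iterations=200, tol=1e-10):
    """Iteratively compute PR(p) = (1 - d)/N + d * sum(PR(q)/L(q) for q linking to p)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is redistributed evenly (an assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new[q] += damping * rank[p] / len(outs)
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical four-page graph: each key links to the pages in its list.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

In this toy graph, page D has no in-links and so keeps only the baseline (1 - d)/N, while C, which every other page links to, ranks highest.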

6.
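This paper introduces a core algorithm in the design and development of a Web topic-specific information collection system: the hyperlink topic prediction algorithm. Building on existing theory and on experimental analysis, it finds that the topic of a hyperlink depends mainly on three factors: the topical relevance of the parent page, the topical relevance of the anchor text, and the link-structure properties of the Web subgraph. On this basis it proposes a hyperlink topic prediction algorithm based on Web page content and link structure. System evaluation results show that the algorithm performs well.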

7.
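[Purpose/Significance] With the rapid development of Internet technology, the Zhihu platform has gradually become a venue for discussing topics of public interest and for sharing knowledge and experience. Analyzing the influence of key users on Zhihu and identifying its key opinion leaders therefore plays an important role in studying how information spreads through social networks. [Method/Process] Improved versions of the PageRank and HITS algorithms are proposed and used to build user influence mining models on the Zhihu user social network and the question-answering network respectively, so that key users and opinion leaders can be identified accurately and objectively. [Result/Conclusion] Experimental results show that the proposed PageRank and HITS algorithms effectively identify key opinion leaders with prominent characteristics on Zhihu, converge quickly, and are reusable and transferable. By processing and analyzing a Zhihu user dataset, models for mining user influence and key opinion leaders were built and then validated on specific topics. The authors therefore infer that the model has considerable application value and commercial potential.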

8.
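Based on how multimedia links are distributed within web pages, the parameters of two typical topic-search algorithms, PageRank and Shark-Search, are improved, and the two improved algorithms are used to compute the similarity between multimedia web pages and a topic from the perspectives of page content and page links. Experimental results show that the improved Shark-Search multimedia topic-search algorithm improves the efficiency of multimedia topic search more effectively than the improved PageRank algorithm, and is also better suited to topic search over online multimedia resources.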

9.
The collective feedback of the users of an Information Retrieval (IR) system has been shown to provide semantic information that, while hard to extract using standard IR techniques, can be useful in Web mining tasks. In the last few years, several approaches have been proposed to process the logs stored by Internet Service Providers (ISP), Intranet proxies or Web search engines. However, the solutions proposed in the literature only partially represent the information available in the Web logs. In this paper, we propose to use a richer data structure, which is able to preserve most of the information available in the Web logs. This data structure consists of three groups of entities: users, documents and queries, which are connected in a network of relations. Query refinements correspond to separate transitions between the corresponding query nodes in the graph, while users are linked to the queries they have issued and to the documents they have selected. The classical query/document transitions, which connect a query to the documents selected by the users in the returned result page, are also considered. The resulting data structure is a complete representation of the collective search activity performed by the users of a search engine or of an Intranet. The experimental results show that this more powerful representation can be successfully used in several Web mining tasks, such as discovering semantically relevant query suggestions and categorizing Web pages by topic.

10.
In the past, recursive algorithms, such as PageRank originally conceived for the Web, have been successfully used to rank nodes in the citation networks of papers, authors, or journals. They have been shown to measure prestige rather than popularity, unlike citation counts. However, bibliographic networks, in contrast to the Web, have some specific features that enable the assigning of different weights to citations, thus adding more information to the process of finding prominence. For example, a citation between two authors may be weighted according to whether and when those two authors collaborated with each other, which is information that can be found in the co-authorship network. In this study, we define a couple of PageRank modifications that weight citations between authors differently based on the information from the co-authorship graph. In addition, we put emphasis on the time of publications and citations. We test our algorithms on the Web of Science data of computer science journal articles and determine the most prominent computer scientists in the 10-year period of 1996–2005. Besides a correlation analysis, we also compare our rankings to the lists of ACM A. M. Turing Award and ACM SIGMOD E. F. Codd Innovations Award winners and find that the new time-aware methods outperform standard PageRank and its time-unaware weighted variants.

11.
张星  吴忧  刘汕 《图书情报工作》2019,63(6):103-115
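[Purpose/Significance] How to meet the needs of short-video users, stimulate user participation, and retain users has become a pressing problem for the short-video industry. The existing literature lacks research on the participation behavior of short-video users. This paper therefore builds a model based on socio-technical theory to study the factors that influence the browsing and creation behavior of male and female users of mobile short video. [Method/Process] 877 valid questionnaires were collected through a survey, and the proposed hypotheses were tested with SPSS 24.0 and AMOS 23.0. [Result/Conclusion] The results show that, for both male and female users, individual extraversion and the entertainment function of short video positively affect usage behavior; users' narcissism and need to belong positively affect creation behavior; and users' need for popularity and the information-recording function of short video positively affect browsing behavior. In addition, narcissism negatively affects the browsing behavior of male users but has no significant relationship with that of female users, and the information-recording function has no significant effect on male users' browsing behavior. The results provide a theoretical basis for studying user behavior in mobile short video and suggest strategies for the development of short video.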

12.
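By mining the semantic relations among query terms in web logs, the semantic knowledge of HowNet (《知网》) is incorporated into a clustering algorithm to optimize search engines. The method uses machine learning algorithms to mine query logs in depth, computing concept similarity and semantic clusters for the query strings they contain, so that the returned pages are more reasonable, more accurate results are presented to users, and user needs are better met.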

13.
王建雄 《图书情报工作》2012,56(21):114-118
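Building on the traditional PageRank algorithm, this paper makes several optimizations and improvements and proposes a new topic-sensitive PageRank algorithm. It computes the similarity between each hyperlink and a domain vector to differentiate how much each hyperlink contributes to a page, which effectively suppresses topic drift. It also adds a time factor to counter PageRank's bias toward old pages, and a factor distinguishing intra-site from inter-site links to deter attempts to game the PageRank algorithm. The improved algorithm remedies shortcomings of the original and improves the efficiency of topic search.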

14.
We investigate temporal factors in assessing the authoritativeness of web pages. We present three different metrics related to time: age, event, and trend. These metrics measure recentness, special event occurrence, and trend in revisions, respectively. An experimental dataset is created by crawling selected web pages for a period of several months. This data is used to compare page rankings by human users with rankings computed by the standard PageRank algorithm (which does not include temporal factors) and three algorithms that incorporate temporal factors, including the Time-Weighted PageRank (TWPR) algorithm introduced here. Analysis of the rankings shows that all three temporal-aware algorithms produce rankings more like those of human users than does the PageRank algorithm. Of these, the TWPR algorithm produces rankings most similar to human users’, indicating that all three temporal factors are relevant in page ranking. In addition, analysis of the parameter values used to weight the three temporal factors reveals that the age factor has the most impact on page rankings, while the trend and event factors have the second most and the least impact, respectively. Proper weighting of the three factors in the TWPR algorithm provides the best ranking results.

15.
We evaluate author impact indicators and ranking algorithms on two publication databases using large test data sets of well-established researchers. The test data consists of (1) ACM fellowship and (2) various life-time achievement awards. We also evaluate different approaches of dividing credit of papers among co-authors and analyse the impact of self-citations. Furthermore, we evaluate different graph normalisation approaches for when PageRank is computed on author citation graphs. We find that PageRank outperforms citation counts in identifying well-established researchers. This holds true when PageRank is computed on author citation graphs but also when PageRank is computed on paper graphs and paper scores are divided among co-authors. In general, the best results are obtained when co-authors receive an equal share of a paper's score, independent of which impact indicator is used to compute paper scores. The results also show that removing author self-citations improves the results of most ranking metrics. Lastly, we find that it is more important to personalise the PageRank algorithm appropriately on the paper level than to decide whether to include or exclude self-citations. However, on the author level, we find that author graph normalisation is more important than personalisation.
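
The equal-share credit division that this abstract reports as working best can be illustrated with a small Python sketch. The function name `author_scores` and the example papers and scores are hypothetical; the paper scores could come from any impact indicator, such as PageRank computed on a paper graph.

```python
def author_scores(paper_scores, paper_authors):
    """Give each co-author an equal share of every paper's score and sum per author."""
    totals = {}
    for paper, score in paper_scores.items():
        authors = paper_authors.get(paper, [])
        if not authors:
            continue  # papers without a known author list contribute nothing
        share = score / len(authors)
        for author in authors:
            totals[author] = totals.get(author, 0.0) + share
    return totals


# Hypothetical example: p1 has two authors, p2 has one.
paper_scores = {"p1": 0.6, "p2": 0.4}
paper_authors = {"p1": ["Ada", "Ben"], "p2": ["Ada"]}
print(author_scores(paper_scores, paper_authors))
```

Because every paper's score is split completely, the author totals sum to the same value as the paper scores, so the division neither creates nor loses credit.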

16.
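An Adaptive News Recommendation Model Based on User Behavior Analysis   Total citations: 4 (self-citations: 0; citations by others: 4)
Because the preferences of news readers change easily, the model measures online users' clicking and reading behavior and, according to their different types of reading strategy, analyzes their page preferences; it then combines the page preferences and news preferences and represents them as a keyword preference table. An adaptive scoring-based recommendation mechanism is designed to analyze user interests and their shifts dynamically. A learning mechanism adjusts the keyword preferences according to the news users actually read, and fuzzy similarity is used to analyze how closely the structure of user preferences matches the structure of the news, from which recommendations are generated. Experiments show that the model provides good personalized news recommendation.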

17.
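Microblogs are an important network service of the Web 2.0 era. As user-centered platforms for publishing, spreading, and sharing information, they contain very rich user information. In microblogs, tags can be used to represent a user's interests and attributes, which are usually contained in the user's text information and network information. This paper analyzes the tags of microblog users and proposes a network-regularized tag dispatch model (NTDM) to recommend tags to users. NTDM models the relationship between the words in a user's profile description and tags, and uses the user's social network structure as a regularization factor for the model. Experiments on real datasets show that NTDM outperforms other methods in both effectiveness and efficiency.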

18.
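This paper proposes a personalized recommendation model that combines an ontology with a universal personal profile. First, a web classification service is used as the ontology to interpret users' web browsing behavior and thereby mine their preferences. Second, Web usage mining techniques filter out redundant browsing records to improve the accuracy of personalization. Finally, the hierarchical structure of the ontology is used to mine users' latent preferences from their preferred categories, producing a universal personal profile that matches each user's characteristics.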

19.
We propose a method for evaluating relevance feedback based on simulating real users. The user simulation applies a model defining the user’s relevance threshold to accept individual documents as feedback in a graded relevance environment; the user’s patience to browse the initial list of retrieved documents; and his/her effort in providing the feedback. We evaluate the result by using cumulated gain-based evaluation together with freezing all documents seen by the user in order to simulate the point of view of a user who is browsing the documents during the retrieval process. We demonstrate the method by performing a simulation in the laboratory setting and present the “branching” curve sets characteristic for the presented evaluation method. Both the average and topic-by-topic results indicate that if the freezing approach is adopted, giving feedback of mixed quality makes sense for various usage scenarios even though the modeled users prefer finding especially the most relevant documents.
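
A minimal Python sketch of how freezing can be combined with cumulated gain follows. It assumes that freezing means the documents the user has already browsed keep their original rank positions, with the feedback run filling the ranks after them, and it uses plain (undiscounted) cumulated gain; the function names, document identifiers, and graded gain values are hypothetical.

```python
def frozen_ranking(initial, reranked, seen_count):
    """Keep the first `seen_count` documents of the initial list in place,
    then append the remaining documents in the order of the feedback run."""
    frozen = initial[:seen_count]
    seen = set(frozen)
    return frozen + [doc for doc in reranked if doc not in seen]


def cumulated_gain(ranking, gains):
    """Running sum of graded relevance gains down the ranked list."""
    total, curve = 0, []
    for doc in ranking:
        total += gains.get(doc, 0)  # unjudged documents count as gain 0
        curve.append(total)
    return curve


# Hypothetical run: the user browsed two documents before giving feedback.
initial = ["d1", "d2", "d3", "d4"]
reranked = ["d4", "d1", "d3", "d2"]
gains = {"d1": 0, "d2": 1, "d3": 2, "d4": 3}
final = frozen_ranking(initial, reranked, seen_count=2)
print(final, cumulated_gain(final, gains))
```

Evaluating the frozen list instead of the raw feedback run avoids rewarding the system for re-ranking documents the user has already looked at.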

20.
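Automatic Web Resource Discovery Based on Hyperlink Analysis   Total citations: 2 (self-citations: 0; citations by others: 2)
Traditional automatic discovery of Web resources is based on the content of Web pages. This paper explores automatic Web resource discovery from the perspective of hyperlink analysis. Hyperlink analysis originates in social network analysis and scientific citation analysis; it examines only the relationships between pages and disregards the attributes of the pages themselves. Experiments show that, using hyperlinks alone and starting from example pages supplied by the user, websites related to a subject's resources can be discovered automatically. The technique effectively reduces needless crawling by web crawlers, improves collection efficiency, lightens network load, and plays an important role in building subject resource collections.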
