Similar Documents (20 results)
1.
In order to evaluate the effectiveness of Information Retrieval (IR) systems, it is essential to collect relevance judgments from human assessors. Crowdsourcing has successfully been used as a method to scale up the collection of manual relevance judgments, and previous research has investigated the impact of different judgment task design elements (e.g., highlighting query keywords in the document) on judgment quality and efficiency. In this work we investigate the positive and negative impacts of presenting crowd assessors with more than just the topic and the document to be judged. We deploy different variants of crowdsourced relevance judgment tasks following a between-subjects design in which we present different types of metadata to the human assessor. Specifically, we investigate the effect of human metadata (e.g., what other human assessors think of the current document, such as the relevance level already selected by the majority of crowd workers) and machine metadata (e.g., how IR systems scored this document, such as its average position in ranked lists, and statistics about the document such as term frequencies). We look at the impact of metadata on judgment quality (i.e., the level of agreement with trained assessors) and cost (i.e., the time it takes for workers to complete the judgments), as well as at how metadata quality positively or negatively impacts the collected judgments.
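Judgment quality in studies of this kind is typically quantified as agreement between crowd labels and trained-assessor labels. The snippet below is a minimal sketch of that computation using raw agreement and Cohen's kappa; the label arrays and the three-level scale are hypothetical, since the abstract does not state the exact agreement measure used.

```python
# Minimal sketch: agreement between crowd and trained-assessor judgments.
# The labels and the 0/1/2 relevance scale below are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

trained = [2, 0, 1, 2, 0, 1, 1, 0, 2, 0]   # trained-assessor labels
crowd   = [2, 0, 1, 1, 0, 1, 2, 0, 2, 1]   # crowd labels for the same documents

raw_agreement = sum(t == c for t, c in zip(trained, crowd)) / len(trained)
kappa = cohen_kappa_score(trained, crowd)  # chance-corrected agreement
print(f"raw agreement = {raw_agreement:.2f}, Cohen's kappa = {kappa:.2f}")
```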

2.
Concurrent concepts of specificity are discussed and differentiated from each other in order to investigate the relationship between index term specificity and users' relevance judgments. The identified concepts are term-document specificity, hierarchical specificity, statement specificity, and posting specificity. Among them, term-document specificity, the relationship between an index term and the document indexed with that term, is regarded as a fruitful research area. In an experiment involving three searches with 175 retrieved documents from 356 matched index terms, the impact of specificity on relevance judgments is analyzed and found to be statistically significant. Implications for indexing practice and for future research are discussed.

3.
Some of the most popular measures for evaluating information filtering systems are independent of the users, because they are based on relevance judgments obtained from experts. User-centred evaluation, on the other hand, reveals the impressions users form while actually using the system. This work discusses user-centred versus system-centred evaluation of a Web content personalization system in which personalization is based on a user model that stores long-term interests (sections, categories and keywords) and short-term interests (adapted from user-provided feedback). The user-centred evaluation is based on questionnaires filled in by the users before and after using the system; the system-centred evaluation compares the ranking of documents, obtained from a multi-tier selection process, against binary relevance judgments collected beforehand from real users. The user-centred and system-centred evaluations, performed with 106 users over 14 working days, provide valuable data on user behaviour with respect to issues such as document relevance and the relative importance attributed to the different forms of personalization. The results show general satisfaction with both the personalization processes (selection, adaptation and presentation) and the system as a whole.

4.
Although relevance judgments are fundamental to the design and evaluation of all information retrieval systems, information scientists have not reached a consensus in defining the central concept of relevance. In this paper we ask two questions: What is the meaning of relevance? and What role does relevance play in information behavior? We attempt to address these questions by reviewing literature over the last 30 years that presents various views of relevance as topical, user-oriented, multidimensional, cognitive, and dynamic. We then discuss traditional assumptions on which most research in the field has been based and begin building a case for an approach to the problem of definition based on alternative assumptions. The dynamic, situational approach we suggest views the user — regardless of system — as the central and active determinant of the dimensions of relevance. We believe that relevance is a multidimensional concept; that it is dependent on both internal (cognitive) and external (situational) factors; that it is based on a dynamic human judgment process; and that it is a complex but systematic and measurable phenomenon.

5.
This paper addresses the problem of how to rank retrieval systems without the need for human relevance judgments, which are very resource-intensive to obtain. Using TREC 3, 6, 7 and 8 data, it is shown how the overlap structure between the search results of multiple systems can be used to infer relative performance differences. In particular, the overlap structures for random groupings of five systems are computed, so that each system is selected an equal number of times. It is shown that the average percentage of a system's documents that are found only by it and by no other system is strongly and negatively correlated with its retrieval effectiveness, such as its mean average precision or precision at 1000. The presented method uses the degree of consensus or agreement a retrieval system can generate to infer its quality. The paper also addresses the question of how many documents in a ranked list need to be examined in order to rank the systems. It is shown that the overlap structure of the top 50 documents can be used to rank the systems, often producing the best results. The presented method significantly improves upon previous attempts to rank retrieval systems without human relevance judgments. This “structure of overlap” method can be of value to communities that need to identify the best experts or rank them, but do not have the resources to evaluate the experts' recommendations, since it does not require knowledge about the domain being searched or the information being requested.
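To make the overlap idea concrete, here is an illustrative sketch (not the authors' code): for random groups of five systems, it computes the share of each system's retrieved documents found by no other system in the group and reports it alongside a known effectiveness score. All run lists and MAP values are hypothetical toy data.

```python
# Illustrative sketch of the "structure of overlap" idea: the share of documents
# that only one system retrieves tends to be high for weak systems.
# All run data and MAP scores below are hypothetical placeholders.
import random
from statistics import mean

runs = {                      # system -> set of top-k retrieved doc ids (toy data)
    "sysA": set(range(0, 50)),
    "sysB": set(range(5, 55)),
    "sysC": set(range(10, 60)),
    "sysD": set(range(20, 70)),
    "sysE": set(range(40, 90)),
    "sysF": set(range(60, 110)),
}
map_scores = {"sysA": 0.31, "sysB": 0.29, "sysC": 0.27,
              "sysD": 0.22, "sysE": 0.15, "sysF": 0.09}   # hypothetical MAP values

def uniqueness(system, group):
    """Fraction of `system`'s documents retrieved by no other system in the group."""
    others = set().union(*(runs[s] for s in group if s != system))
    return len(runs[system] - others) / len(runs[system])

uniq = {s: [] for s in runs}
for _ in range(200):                          # random groupings of five systems
    group = random.sample(list(runs), 5)
    for s in group:
        uniq[s].append(uniqueness(s, group))

for s in sorted(runs, key=lambda s: -map_scores[s]):
    print(f"{s}: MAP={map_scores[s]:.2f}  avg. unique-share={mean(uniq[s]):.2f}")
```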

6.
A new approach to the solicitation and measurement of relevance judgments is presented, which attempts to resolve some of the difficulties inherent in the nature of relevance and human judgment, and which further seeks to examine how users' judgments of document representations change as more information about documents is revealed to them. Subjects (university faculty and doctoral students) viewed three incremental versions of documents, and recorded ratio-level relevance judgments for each version. These judgments were analyzed by a variety of methods, including graphical inspection and examination of the number and degree of changes of judgments as new information is seen. A post-task questionnaire was also administered to obtain subjects' perceptions of the process and of the individual fields of information presented. A consistent pattern of perception and importance of these fields is seen: abstracts are by far the most important field and have the greatest impact, followed by titles, bibliographic information, and indexing.

7.
While test collections provide the cornerstone for Cranfield-based evaluation of information retrieval (IR) systems, it has become practically infeasible to rely on traditional pooling techniques to construct test collections at the scale of today’s massive document collections (e.g., ClueWeb12’s 700M+ webpages). This has motivated a flurry of studies proposing more cost-effective yet reliable IR evaluation methods. In this paper, we propose a new intelligent topic selection method which reduces the number of search topics (and thereby costly human relevance judgments) needed for reliable IR evaluation. To rigorously assess our method, we integrate previously disparate lines of research on intelligent topic selection and deep vs. shallow judging (i.e., whether it is more cost-effective to collect many relevance judgments for a few topics or a few judgments for many topics). While prior work on intelligent topic selection has never been evaluated against shallow judging baselines, prior work on deep vs. shallow judging has largely argued for shallow judging, but assuming random topic selection. We argue that for evaluating any topic selection method, ultimately one must ask whether it is actually useful to select topics, or should one simply perform shallow judging over many topics? In seeking a rigorous answer to this over-arching question, we conduct a comprehensive investigation over a set of relevant factors never previously studied together: 1) the method of topic selection; 2) the effect of topic familiarity on human judging speed; and 3) how different topic generation processes (requiring varying human effort) impact (i) budget utilization and (ii) the resultant quality of judgments. Experiments on the NIST TREC Robust 2003 and Robust 2004 test collections show that not only can we reliably evaluate IR systems with fewer topics, but also that: 1) when topics are intelligently selected, deep judging is often more cost-effective than shallow judging in terms of evaluation reliability; and 2) topic familiarity and topic generation costs greatly impact the evaluation cost vs. reliability trade-off. Our findings challenge conventional wisdom in showing that deep judging is often preferable to shallow judging when topics are selected intelligently.
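As a rough back-of-the-envelope illustration of the budget trade-off the paper studies (all costs below are invented for illustration, not figures from the paper): once topic creation has a non-trivial cost, judging fewer topics more deeply can fit more total judgments into the same budget than judging many topics shallowly.

```python
# Hypothetical budget arithmetic for deep vs. shallow judging; none of these
# numbers come from the paper.
BUDGET_MINUTES = 3000        # total assessor time available (assumed)
TOPIC_CREATION_MIN = 20      # time to develop and vet one topic (assumed)
JUDGMENT_MIN = 0.5           # time per relevance judgment (assumed)

def affordable_topics(judgments_per_topic):
    per_topic = TOPIC_CREATION_MIN + judgments_per_topic * JUDGMENT_MIN
    return int(BUDGET_MINUTES // per_topic)

for depth in (10, 50, 100, 200):   # shallow ... deep judging
    topics = affordable_topics(depth)
    print(f"{depth:>3} judgments/topic -> {topics:>3} topics, "
          f"{topics * depth} judgments in total")
```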

8.
This paper describes our novel retrieval model that is based on the contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes the document contexts into account instead of implicitly using them to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as a log-odds and uses smoothing techniques, as found in language models, to avoid the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., the aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show whether the model obeys the probability ranking principle. Our model is promising, as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators that are consistent with the aggregate relevance principle were effective in combining the estimated preferences, and (b) estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.
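A rough sketch of the kind of estimate described (not the authors' exact formulation): score a query-term context by the smoothed log-odds of its terms under relevant vs. non-relevant language models, then combine per-context scores with an aggregation operator. All inputs (rel_counts, coll_probs, etc.) are hypothetical.

```python
# Rough sketch only: Dirichlet-smoothed log-odds preference for one document
# context, then aggregation across contexts. Not the paper's exact estimator;
# the count dictionaries and collection probabilities are hypothetical inputs.
import math

def smoothed_prob(count, total, collection_prob, mu=2000):
    """Dirichlet smoothing keeps unseen terms from yielding zero probability."""
    return (count + mu * collection_prob) / (total + mu)

def context_log_odds(context_terms, rel_counts, rel_len,
                     nonrel_counts, nonrel_len, coll_probs):
    """Log-odds that this query-term context signals a 'relevant' decision."""
    score = 0.0
    for w in context_terms:
        p_rel = smoothed_prob(rel_counts.get(w, 0), rel_len, coll_probs[w])
        p_non = smoothed_prob(nonrel_counts.get(w, 0), nonrel_len, coll_probs[w])
        score += math.log(p_rel / p_non)
    return score

def aggregate(context_scores, how="sum"):
    """Combine per-context preferences: a sum follows an aggregate-relevance style,
    a max keeps only the single strongest context."""
    return sum(context_scores) if how == "sum" else max(context_scores)
```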

9.
In this paper, results from three studies examining 1295 relevance judgments by 36 information retrieval (IR) system end-users are reported. Both the region of the relevance judgments, from non-relevant to highly relevant, and the motivations or levels for the relevance judgments are examined. Three major findings emerge. First, the frequency distributions of relevance judgments by IR system end-users tend to take on a bi-modal shape, with peaks at the extremes (non-relevant/relevant) and a flatter middle range. Second, the different type of scale (interval or ordinal) used in each study did not alter the shape of the relevance frequency distributions. Third, on an interval scale, the median point of relevance judgment distributions correlates with the point where relevant and partially relevant items begin to be retrieved. The median point of a distribution of relevance judgments may therefore provide a measure of user/IR system interaction to supplement precision/recall measures. The implications of this investigation for relevance theory and IR systems evaluation are discussed.

10.
Classical test theory offers theoretically derived reliability measures such as Cronbach’s alpha, which can be applied to measure the reliability of a set of Information Retrieval test results. The theory also supports item analysis, which identifies queries that are hampering the test’s reliability, and which may be candidates for refinement or removal. A generalization of Classical Test Theory, called Generalizability Theory, provides an even richer set of tools. It allows us to estimate the reliability of a test as a function of the number of queries, assessors (relevance judges), and other aspects of the test’s design. One novel aspect of Generalizability Theory is that it allows this estimation of reliability even before the test collection exists, based purely on the numbers of queries and assessors that it will contain. These calculations can help test designers in advance, by allowing them to compare the reliability of test designs with various numbers of queries and relevance assessors, and to spend their limited budgets on a design that maximizes reliability. Empirical analysis shows that in cases for which our data is representative, having more queries is more helpful for reliability than having more assessors. It also suggests that reliability may be improved with a per-document performance measure, as opposed to a document-set based performance measure, where appropriate. The theory also clarifies the implicit debate in IR literature regarding the nature of error in relevance judgments.
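For readers unfamiliar with the measure, Cronbach's alpha for an IR test can be computed from a system-by-topic score matrix, treating topics as the test "items" and systems as the "subjects". The sketch below uses the standard formula with hypothetical per-topic scores; it is not tied to the paper's data.

```python
# Minimal sketch of Cronbach's alpha for an IR test collection, treating topics
# as "items" and systems as "subjects". The score matrix below is hypothetical.
import numpy as np

scores = np.array([          # rows = systems, columns = topics (e.g., AP per topic)
    [0.42, 0.31, 0.55, 0.28],
    [0.38, 0.27, 0.49, 0.22],
    [0.25, 0.18, 0.33, 0.15],
    [0.51, 0.36, 0.61, 0.30],
])

def cronbach_alpha(x):
    k = x.shape[1]                                  # number of topics (items)
    item_vars = x.var(axis=0, ddof=1).sum()         # variance of each topic's scores
    total_var = x.sum(axis=1).var(ddof=1)           # variance of per-system totals
    return (k / (k - 1)) * (1 - item_vars / total_var)

print(f"alpha = {cronbach_alpha(scores):.3f}")
```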

11.
Ranking aggregation is the task of combining multiple ranking lists given by several experts or simple rankers into a hopefully better ranking. It is applicable in fields such as meta-search and collaborative filtering. Most existing work operates in an unsupervised framework, and its performance is often limited, especially when the input rankers are unreliable, because no labeled information is involved. In this paper, we propose a semi-supervised ranking aggregation method in which preference constraints on several item pairs are given. In our method, the aggregation function is learned based on the ordering agreement of different rankers. The ranking scores assigned by this function on the labeled data should be consistent with the given pairwise order constraints, while the ranking scores on the unlabeled data should obey the intrinsic manifold structure of the ranked items. Experimental results on toy data and the OHSUMED data are presented to illustrate the validity of our method.
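The following toy sketch illustrates the general idea (it is not the paper's optimisation): learn a linear combination of base-ranker scores so that labelled pairwise preferences are respected while scores vary smoothly over a similarity graph of items; all data is synthetic.

```python
# Toy sketch of semi-supervised ranking aggregation: a linear aggregation of
# base-ranker scores fitted to labelled pairwise preferences plus a graph
# Laplacian smoothness term on unlabeled items. Synthetic data throughout.
import numpy as np

rng = np.random.default_rng(1)
n_items, n_rankers = 30, 4
X = rng.random((n_items, n_rankers))                      # base-ranker scores per item
S = np.exp(-np.square(X[:, None] - X[None, :]).sum(-1))   # item-item similarity
L = np.diag(S.sum(1)) - S                                 # graph Laplacian
prefs = [(0, 1), (2, 5), (7, 3)]        # labelled pairs: first item should outrank second
lam, lr, w = 0.1, 0.05, np.zeros(n_rankers)

for _ in range(500):
    f = X @ w
    grad = 2 * lam * X.T @ (L @ f)      # manifold-smoothness gradient
    for i, j in prefs:                  # squared hinge on margin violations
        viol = 1 - (f[i] - f[j])
        if viol > 0:
            grad += -2 * viol * (X[i] - X[j])
    w -= lr * grad

print("learned ranker weights:", np.round(w, 3))
```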

12.
Relevance judgments occur within an information search process, where time, context and situation can impact the judgments. The determination of relevance is dependent on a number of factors and variables, which include the criteria used to determine relevance. The relevance judgment process and the criteria used to make those judgments are manifestations of the cognitive changes which occur during the information search process. Understanding why these relevance criteria choices are made, and how they vary over the information search process, can provide important information about the dynamic relevance judgment process. This information can be used to guide the development of more adaptive information retrieval systems which respond to the cognitive changes of users during the information search process. The research data analyzed here was collected in two separate studies which examined subjects' relevance judgments over an information search process. Statistical analysis was used to examine these results and determine whether there were relationships between criteria selections, relevance judgments, and the subject's progression through the information search process. The findings confirm and extend those of previous studies, providing strong statistical evidence of an association between the information search process and users' choices of relevance criteria, and identifying specific changes in user preferences for specific criteria over the course of the information search process.

13.
Li Zhaohui. 《科教文汇》 2012, (21): 142-142.
Drawing on many years of practical experience with volleyball courses, this paper analyses the refereeing situation and students' in-class performance, and argues for the importance, in university physical education, of cultivating students' refereeing ability in volleyball courses and of developing them into higher-level referees, and …

14.
In this paper, we focus on the problem of discovering internally connected communities in event-based social networks (EBSNs) and propose a community detection method that utilizes social influences between users. Unlike traditional social networks, EBSNs contain different types of entities and links, and users in EBSNs exhibit more complex behaviours. This leads to poor performance of traditional social influence computation methods in EBSNs. Therefore, to quantify pairwise social influence accurately in EBSNs, we first propose to compute two types of social influence, i.e., structure-based social influence and behaviour-based social influence, by utilizing the online social network structure and offline social behaviours of users. In particular, based on the specific features of EBSNs, the similarities of user preferences on three aspects (i.e., topics, regions and organizers) are utilized to measure the behaviour-based social influence. Then, we obtain the unified pairwise social influence by combining these two types of social influence through a weight function. Next, we present a social influence based community detection algorithm, referred to as SICD. In SICD, inspired by the nonlinear feature learning ability of the autoencoder, we first devise a neighborhood-based deep autoencoder algorithm to obtain nonlinear, community-oriented latent representations of users, and then apply the k-means algorithm for community detection. Experimental results on a real-world dataset show the effectiveness of the proposed algorithm.
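The following is a heavily simplified sketch of the pipeline the abstract outlines: fuse the two influence matrices with a weight, learn a low-dimensional user representation, then cluster with k-means. A truncated-SVD embedding stands in for the paper's neighborhood-based deep autoencoder, and the influence matrices are random placeholders.

```python
# Simplified sketch of an SICD-style pipeline: fuse structure-based and
# behaviour-based influence, embed users, cluster into communities.
# TruncatedSVD stands in for the paper's deep autoencoder; all matrices are toy data.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_users = 100
structure_inf = rng.random((n_users, n_users))   # online-structure influence (toy)
behaviour_inf = rng.random((n_users, n_users))   # offline-behaviour influence (toy)

alpha = 0.6                                      # weight between the two signals
influence = alpha * structure_inf + (1 - alpha) * behaviour_inf

embed = TruncatedSVD(n_components=16, random_state=0).fit_transform(influence)
communities = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(embed)
print(np.bincount(communities))                  # community sizes
```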

15.
Based on the characteristics of intellectual-property-backed financing, this paper describes an emerging form of intellectual property cluster mutual-guarantee financing, covering its research trends, business model, credit evaluation features and outstanding problems, and then proposes a credit evaluation method that combines the extension analytic hierarchy process (extension AHP) with fuzzy comprehensive evaluation. The method constructs the extension judgment matrix using interval numbers in place of traditional point-valued expert scores, which overcomes the problems of fuzzy expert judgment and judgment-matrix inconsistency in classical AHP and avoids a large amount of trial calculation, while yielding a more reasonable credit rating for intellectual property cluster mutual-guarantee financing projects. Finally, the method is applied to the credit evaluation of a trademark-based cluster mutual-guarantee financing project, demonstrating its practical effectiveness.
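As a very rough illustration of combining interval-valued expert judgments with fuzzy comprehensive evaluation (the paper's extension-AHP procedure is more involved), the sketch below derives criterion weights from the midpoints of interval pairwise comparisons and then scores a project as the weighted fuzzy membership vector; the criteria and every number are made up.

```python
# Rough illustration only: interval pairwise comparisons -> midpoint AHP weights
# -> fuzzy comprehensive evaluation. Not the paper's extension-AHP method;
# every value here is a made-up placeholder.
import numpy as np

# Interval judgments (lower, upper) for 3 hypothetical criteria,
# e.g. patent quality, cluster strength, financial standing.
lower = np.array([[1, 2, 3],
                  [1/3, 1, 2],
                  [1/4, 1/3, 1]])
upper = np.array([[1, 3, 5],
                  [1/2, 1, 3],
                  [1/3, 1/2, 1]])

mid = (lower + upper) / 2                                # collapse intervals to midpoints
weights = np.prod(mid, axis=1) ** (1 / mid.shape[0])     # geometric-mean method
weights /= weights.sum()

# Fuzzy evaluation matrix: each row gives a criterion's membership in the
# grades (excellent, good, fair, poor); hypothetical expert values.
R = np.array([[0.5, 0.3, 0.2, 0.0],
              [0.2, 0.5, 0.2, 0.1],
              [0.1, 0.3, 0.4, 0.2]])

evaluation = weights @ R                                 # weighted membership per grade
print("weights:", np.round(weights, 3))
print("grade memberships:", np.round(evaluation, 3))
```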

16.
This paper studies how to learn accurate ranking functions from noisy training data for information retrieval. Most previous work on learning to rank assumes that the relevance labels in the training data are reliable. In reality, however, the labels usually contain noise due to the difficulties of relevance judgments and several other reasons. To tackle the problem, in this paper we propose a novel approach to learning to rank, based on a probabilistic graphical model. Considering that the observed label might be noisy, we introduce a new variable to indicate the true label of each instance. We then use a graphical model to capture the joint distribution of the true labels and observed labels given features of documents. The graphical model distinguishes the true labels from observed labels, and is specially designed for ranking in information retrieval. Therefore, it helps to learn a more accurate model from noisy training data. Experiments on a real dataset for web search show that the proposed approach can significantly outperform previous approaches.

17.
Accurate term discrimination in information retrieval is essential for identifying important terms in specific documents. In addition to the widely known inverse document frequency (IDF) method, alternative approaches such as the residual inverse document frequency (RIDF) scheme have been introduced for term discrimination. However, the performance of existing methods is not consistently convincing. We propose a new collection frequency weighting scheme derived from the negative binomial distribution model of term occurrences. Factorial experiments were performed to examine the potential interaction effect between collection frequency weighting methods and term frequency weighting methods, using mean average precision and normalized discounted cumulative gain as performance measures. The results indicate that our proposed term discrimination method offers a significant gain in accuracy compared to the IDF and RIDF schemes. This finding is reinforced by the fact that the results show no interaction effects among the factors.
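For background on the two baselines mentioned (this is not the paper's proposed negative-binomial weight): IDF penalises terms that occur in many documents, while residual IDF measures how far the observed IDF departs from what a Poisson model of term occurrences would predict, so bursty, content-bearing terms score highly.

```python
# Standard IDF and residual IDF (RIDF) as commonly defined; shown as background
# for the baselines, not the paper's negative-binomial weighting scheme.
import math

def idf(df, n_docs):
    """Inverse document frequency: -log2 of the fraction of documents containing the term."""
    return -math.log2(df / n_docs)

def ridf(df, cf, n_docs):
    """Residual IDF: observed IDF minus the IDF expected under a Poisson model
    with mean cf / n_docs."""
    expected_df_fraction = 1 - math.exp(-cf / n_docs)   # P(term occurs at least once)
    return -math.log2(df / n_docs) + math.log2(expected_df_fraction)

# Toy numbers: a bursty term occurs 200 times but in only 40 of 10,000 documents.
print(idf(df=40, n_docs=10_000), ridf(df=40, cf=200, n_docs=10_000))
```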

18.
This paper is concerned with the quality of training data in learning to rank for information retrieval. While many data selection techniques have been proposed to improve the quality of training data for classification, study of the same issue for ranking appears to be insufficient. As pointed out in this paper, it is inappropriate to directly extend techniques developed for classification to ranking, and the development of novel techniques is sorely needed. In this paper, we study the development of such techniques. To begin with, we propose the concept of “pairwise preference consistency” (PPC) to describe the quality of a training data collection from the ranking point of view. PPC takes into consideration the ordinal relationship between documents as well as the hierarchical structure over queries and documents, which are both unique properties of ranking. We then select a subset of the original training documents by maximizing the PPC of the selected subset, and we propose an efficient solution to the maximization problem. Empirical results on the LETOR benchmark datasets and a web search engine dataset show that with the subset of training data selected by our approach, the performance of the learned ranking model can be significantly improved.

19.
Zhang Lili. 《科教文汇》 2014, (1): 57-57, 59.
This paper analyses the use of elementary transformations to compute the rank of a matrix, and the common principle that links rank computation by elementary transformations, the solution of linear systems by Gaussian elimination, the linear representation of a set of vectors, and the linear dependence of a set of vectors. Applying elementary-transformation rank computation to these topics both settles how each of the three problems is solved and judged and ties the knowledge together, laying a foundation for the later study of related material.
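As a tiny worked illustration of the connection described (my own example, not taken from the paper): a single row reduction of the coefficient matrix simultaneously gives its rank, decides whether the corresponding linear system is consistent, and shows whether its column vectors are linearly dependent.

```python
# Small worked example (not from the paper): one row reduction answers three
# questions at once - the rank, the solvability of Ax = b, and the linear
# dependence of the columns of A.
from sympy import Matrix

A = Matrix([[1, 2, 3],
            [2, 4, 6],
            [1, 0, 1]])
b = Matrix([1, 2, 3])

rank_A = A.rank()                              # rank via elementary row operations
rank_Ab = A.row_join(b).rank()                 # rank of the augmented matrix
print("rank(A) =", rank_A, " rank([A|b]) =", rank_Ab)
print("consistent system:", rank_A == rank_Ab) # Gaussian elimination viewpoint
print("columns dependent:", rank_A < A.cols)   # linear-dependence viewpoint
```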

20.
Most existing search engines focus on document retrieval. However, information needs are certainly not limited to finding relevant documents. Instead, a user may want to find relevant entities such as persons and organizations. In this paper, we study the problem of related entity finding. Our goal is to rank entities based on their relevance to a structured query, which specifies an input entity, the type of related entities and the relation between the input and related entities. We first discuss a general probabilistic framework, derive six possible retrieval models to rank the related entities, and then compare these models both analytically and empirically. To further improve performance, we study the problem of feedback in the context of related entity finding. Specifically, we propose a mixture model based feedback method that can utilize the pseudo feedback entities to estimate an enriched model for the relation between the input and related entities. Experimental results over two standard TREC collections show that the derived relation generation model combined with a relation feedback method performs better than other models.
