Similar Documents
10 similar documents retrieved.
1.
Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs—the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I Error, and on Kendall’s rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, Q′, nDCG′ and AP′ proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.
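As a rough illustration of one of the metrics compared in this abstract, the following is a minimal sketch of Buckley and Voorhees' bpref for a single topic, assuming a ranked list of document ids plus sets of judged relevant and judged nonrelevant ids. It is a simplified reading of the usual formula, not the authors' evaluation code.

```python
def bpref(ranking, relevant, nonrelevant):
    """Simplified bpref sketch for one topic (illustrative, not from the paper).

    ranking      -- list of doc ids in rank order
    relevant     -- set of judged relevant doc ids (R = len(relevant))
    nonrelevant  -- set of judged nonrelevant doc ids (N = len(nonrelevant))
    """
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N) if min(R, N) > 0 else 1
    nonrel_seen = 0   # judged-nonrelevant docs ranked above the current position
    score = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            # contribution of a retrieved relevant doc: 1 - (nonrelevant above it) / min(R, N)
            score += 1.0 - min(nonrel_seen, denom) / denom
    return score / R

# e.g. one nonrelevant doc ranked above two relevant docs
print(bpref(["d3", "d1", "d7"], {"d1", "d7"}, {"d3"}))
```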

2.
In Information Retrieval, since it is hard to identify users’ information needs, many approaches have been tried to solve this problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments. Although relevance feedback is most effective when relevance information about retrieved documents is provided by users, it is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach is how to construct the term-term correlations that can be used effectively to improve retrieval performance. In this study, we try to construct query concepts that denote users’ information needs from a document space, rather than to reformulate initial queries using the term correlations and/or users’ relevance feedback. To form query concepts, we extract features from each document, and then cluster the features into primitive concepts that are then used to form query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation shows that our proposed framework called QCM (Query Concept Method) outperforms a baseline probabilistic retrieval model on TREC retrieval.
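As a generic illustration of the clustering step described above (grouping document features into primitive concepts), the sketch below clusters TF-IDF term vectors with k-means. The toy documents, the choice of k, and the use of scikit-learn are illustrative assumptions; this is not the paper's QCM procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus standing in for a document space
docs = [
    "oil prices rise sharply",
    "crude oil exports fall",
    "stock market rallies on oil news",
]

# Represent each vocabulary term by the documents it occurs in
# (term vectors are the columns of the TF-IDF matrix).
vec = TfidfVectorizer()
X = vec.fit_transform(docs)            # docs x terms
term_vectors = X.T.toarray()           # terms x docs
terms = vec.get_feature_names_out()

# Group co-occurring terms into a handful of "primitive concepts" (k=3 is arbitrary here).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(term_vectors)
for c in range(3):
    print("concept", c, [t for t, lab in zip(terms, km.labels_) if lab == c])
```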

3.
In this paper, we evaluate a number of machine learning techniques for the task of ranking answers to why-questions. We use TF-IDF together with a set of 36 linguistically motivated features that characterize questions and answers. We experiment with a number of machine learning techniques (including several classifiers, regression techniques, Ranking SVM, and SVM^map) in various settings. The purpose of the experiments is to assess how the different machine learning approaches can cope with our highly imbalanced binary relevance data, with and without hyperparameter tuning. We find that with all machine learning techniques, we can obtain an MRR score that is significantly above the TF-IDF baseline of 0.25 and not significantly lower than the best score of 0.35. We provide an in-depth analysis of the effect of data imbalance and hyperparameter tuning, and we relate our findings to previous research on learning to rank for Information Retrieval.
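The abstract reports MRR (mean reciprocal rank) scores. For readers unfamiliar with the measure, here is a minimal sketch of how MRR is typically computed from per-question ranked answer lists with binary relevance labels; it is illustrative, not the authors' evaluation code.

```python
def mean_reciprocal_rank(ranked_labels_per_question):
    """ranked_labels_per_question: one list of 0/1 labels per question, in rank order.

    Each question contributes 1/rank of its first relevant answer,
    or 0 if no relevant answer was retrieved.
    """
    total = 0.0
    for labels in ranked_labels_per_question:
        rr = 0.0
        for rank, rel in enumerate(labels, start=1):
            if rel:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_labels_per_question)

# First relevant answer at ranks 2 and 1 -> MRR = (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
```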

4.
In a recent paper, Egghe [Egghe, L. (in press). Mathematical derivation of the impact factor distribution. Journal of Informetrics] presents a mathematical analysis of the rank-order distribution of journal impact factors. The analysis is based on the central limit theorem. We criticize the empirical relevance of Egghe's analysis. More specifically, we argue that Egghe's analysis relies on an unrealistic assumption and we show that the analysis is not in agreement with empirical data.

5.
This paper investigates the impact of three approaches to XML retrieval: using Zettair, a full-text information retrieval system; using eXist, a native XML database; and using a hybrid system that takes full article answers from Zettair and uses eXist to extract elements from those articles. For the content-only topics, we undertake a preliminary analysis of the INEX 2003 relevance assessments in order to identify the types of highly relevant document components. Further analysis identifies two complementary sub-cases of relevance assessments (General and Specific) and two categories of topics (Broad and Narrow). We develop a novel retrieval module that for a content-only topic utilises the information from the resulting answer list of a native XML database and dynamically determines the preferable units of retrieval, which we call Coherent Retrieval Elements. The results of our experiments show that—when each of the three systems is evaluated against different retrieval scenarios (such as different cases of relevance assessments, different topic categories and different choices of evaluation metrics)—the XML retrieval systems exhibit varying behaviour and the best performance can be reached for different values of the retrieval parameters. In the case of INEX 2003 relevance assessments for the content-only topics, our newly developed hybrid XML retrieval system is substantially more effective than either Zettair or eXist, and yields robust and very effective XML retrieval.

6.
Direct optimization of evaluation measures has become an important branch of learning to rank for information retrieval (IR). Since IR evaluation measures are difficult to optimize due to their non-continuity and non-differentiability, most direct optimization methods optimize some surrogate functions instead, which we call surrogate measures. A critical issue regarding these methods is whether the optimization of the surrogate measures can really lead to the optimization of the original IR evaluation measures. In this work, we perform formal analysis on this issue. We propose a concept named “tendency correlation” to describe the relationship between a surrogate measure and its corresponding IR evaluation measure. We show that when a surrogate measure has arbitrarily strong tendency correlation with an IR evaluation measure, optimizing it will lead to the effective optimization of the original IR evaluation measure. Then, we analyze the tendency correlations of the surrogate measures optimized in a number of direct optimization methods. We prove that the surrogate measures in SoftRank and ApproxRank can have arbitrarily strong tendency correlation with the original IR evaluation measures, regardless of the data distribution, when some parameters are appropriately set. However, the surrogate measures in SVM^MAP, DORM^NDCG, PermuRank^MAP, and SVM^NDCG cannot have arbitrarily strong tendency correlation with the original IR evaluation measures on certain distributions of data. Therefore SoftRank and ApproxRank are theoretically sounder than SVM^MAP, DORM^NDCG, PermuRank^MAP, and SVM^NDCG, and are expected to result in better ranking performances. Our theoretical findings can explain the experimental results observed on public benchmark datasets.
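To make the idea of a surrogate measure concrete, the sketch below shows an ApproxRank-style smoothing: the non-differentiable rank is replaced by a sigmoid-smoothed "soft rank", which is then plugged into the usual NDCG discount. This is a generic illustration of the smoothing technique, with an arbitrary temperature parameter, not the exact surrogates analyzed in the paper.

```python
import numpy as np

def soft_ranks(scores, temperature=1.0):
    """Differentiable rank approximation: rank_i ~= 1 + sum_j sigmoid((s_j - s_i) / T)."""
    s = np.asarray(scores, dtype=float)
    diff = (s[None, :] - s[:, None]) / temperature   # diff[i, j] = s_j - s_i
    sig = 1.0 / (1.0 + np.exp(-diff))
    np.fill_diagonal(sig, 0.0)
    return 1.0 + sig.sum(axis=1)

def approx_ndcg(scores, gains, temperature=1.0):
    """Smoothed NDCG surrogate: soft ranks inside the standard DCG discount."""
    ranks = soft_ranks(scores, temperature)
    dcg = np.sum(gains / np.log2(ranks + 1.0))
    ideal = np.sum(np.sort(gains)[::-1] / np.log2(np.arange(2, len(gains) + 2)))
    return dcg / ideal if ideal > 0 else 0.0

# Model scores for three documents with graded gains 3, 0, 1
print(approx_ndcg([2.0, 0.5, 1.0], np.array([3.0, 0.0, 1.0]), temperature=0.5))
```

With a small temperature the soft ranks approach the true ranks, which is the intuition behind the "arbitrarily strong tendency correlation" results quoted above.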

7.
Applying Machine Learning to Text Segmentation for Information Retrieval
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary based, character based and mutual information based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text. Our research suggests machine learning techniques can play an important role in building adaptable information retrieval systems and different evaluation standards for word segmentation should be given to different applications.
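The abstract mentions mutual-information-based segmentation as one ingredient. A toy sketch of that idea is shown below: estimate pointwise mutual information (PMI) between adjacent characters from an unsegmented corpus and place a boundary wherever the PMI falls below a threshold. The counts, add-one smoothing, and threshold are illustrative assumptions, not the paper's self-supervised method.

```python
import math
from collections import Counter

def train_counts(corpus):
    """Character unigram and adjacent-bigram counts from an unsegmented corpus (toy estimate)."""
    uni, bi = Counter(), Counter()
    for text in corpus:
        uni.update(text)
        bi.update(text[i:i + 2] for i in range(len(text) - 1))
    return uni, bi

def segment(text, uni, bi, threshold=0.0):
    """Insert a boundary between adjacent characters whose PMI is below the threshold."""
    n_uni, n_bi = sum(uni.values()), max(sum(bi.values()), 1)
    out = [text[0]] if text else []
    for a, b in zip(text, text[1:]):
        p_ab = (bi[a + b] + 1) / (n_bi + 1)                       # add-one smoothing
        p_a = (uni[a] + 1) / (n_uni + 1)
        p_b = (uni[b] + 1) / (n_uni + 1)
        pmi = math.log(p_ab / (p_a * p_b))
        out.append(("" if pmi >= threshold else " ") + b)
    return "".join(out)

uni, bi = train_counts(["信息检索系统", "检索系统评价", "信息系统"])
print(segment("信息检索评价", uni, bi))
```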

8.
We consider text retrieval applications that assign query-specific relevance scores to documents drawn from particular collections. Such applications represent a primary focus of the annual Text Retrieval Conference (TREC), where the participants compare the empirical performance of different approaches. P(K), the proportion of the top K documents that are relevant, is a popular measure of retrieval effectiveness. Participants in the TREC Very Large Corpus track have observed that when the target is a random sample from a collection, P(K) is substantially smaller than when the target is the entire collection. Hawking and Robertson (2003) confirmed this finding in a number of experimental settings. Hawking et al. (1999) posed as an open research question the cause of this phenomenon and proposed five possible explanatory hypotheses. In this paper, we present a mathematical analysis that sheds some light on these hypotheses and complements the experimental work of Hawking and Robertson (2003). We will also introduce C(L), contamination at L, the number of irrelevant documents amongst the top L relevant documents, and describe its properties. Our analysis shows that while P(K) typically will increase with collection size, the phenomenon is not universal. That is, the asymptotic behavior of P(K) and C(L) depends on the score distributions and relative proportions of relevant and irrelevant documents in the collection. As this article went to press, Yehuda Vardi passed away. We dedicate the paper to his memory.
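Using the definitions quoted in the abstract, P(K) and C(L) can be sketched directly from a ranked list of binary relevance labels. The code below is illustrative, not from the paper; in particular, "the number of irrelevant documents amongst the top L relevant documents" is read here as the number of irrelevant documents encountered before L relevant ones have been seen.

```python
def precision_at_k(labels, k):
    """P(K): proportion of the top K retrieved documents that are relevant."""
    top = labels[:k]
    return sum(top) / k if k > 0 else 0.0

def contamination_at_l(labels, l):
    """C(L): irrelevant documents ranked above the L-th relevant document
    (one reading of 'amongst the top L relevant documents')."""
    relevant_seen = 0
    irrelevant_seen = 0
    for rel in labels:
        if rel:
            relevant_seen += 1
            if relevant_seen == l:
                break
        else:
            irrelevant_seen += 1
    return irrelevant_seen

labels = [1, 0, 1, 1, 0, 0, 1]          # 1 = relevant, in rank order
print(precision_at_k(labels, 5))         # 0.6
print(contamination_at_l(labels, 3))     # 1
```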

9.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short) parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits in using the document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.
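A minimal sketch of the interpolation idea described above follows: a passage language model that mixes passage-level and document-level maximum-likelihood estimates, with the mixing weight standing in for the estimated document homogeneity. The homogeneity value is taken as given here; the paper's specific homogeneity estimators and smoothing details are not reproduced.

```python
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram model over a token list."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def passage_lm_prob(word, passage_tokens, doc_tokens, homogeneity):
    """p(w | passage, doc) = (1 - h) * p_ML(w | passage) + h * p_ML(w | doc).

    'homogeneity' in [0, 1] plays the role of the document-homogeneity weight:
    the more homogeneous the document, the more the passage model leans on it.
    """
    p_pass = mle(passage_tokens)
    p_doc = mle(doc_tokens)
    return (1 - homogeneity) * p_pass.get(word, 0.0) + homogeneity * p_doc.get(word, 0.0)

doc = "the league announced the new season schedule and ticket prices".split()
passage = "ticket prices".split()
print(passage_lm_prob("schedule", passage, doc, homogeneity=0.4))
```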

10.
The retrieval of sentences that are relevant to a given information need is a challenging passage retrieval task. In this context, the well-known vocabulary mismatch problem arises severely because of the fine granularity of the task. Short queries, which are usually the rule rather than the exception, aggravate the problem. Consequently, effective sentence retrieval methods tend to apply some form of query expansion, usually based on pseudo-relevance feedback. Nevertheless, there are no extensive studies comparing different statistical expansion strategies for sentence retrieval. In this work we thoroughly study the effect of distinct statistical expansion methods on sentence retrieval. We start from a set of retrieved documents in which relevant sentences have to be found. In our experiments different term selection strategies are evaluated and we provide empirical evidence to show that expansion before sentence retrieval yields competitive performance. This is particularly novel because expansion for sentence retrieval is often done after sentence retrieval (i.e. expansion terms are mined from a ranked set of sentences) and there are no comparative results available between both types of expansion. Furthermore, this comparison is particularly valuable because there are important implications in time efficiency. We also carefully analyze expansion on weak and strong queries and demonstrate clearly that expanding queries before sentence retrieval is not only more convenient for efficiency purposes, but also more effective when handling poor queries.
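As a generic illustration of pseudo-relevance-feedback expansion before sentence retrieval, the sketch below selects the most frequent non-query terms from the top retrieved documents and appends them to the query. Raw frequency is only one of many possible term-selection strategies (and no stop-word filtering is applied); the helper name and parameters are illustrative, not the study's methods.

```python
from collections import Counter

def expand_query(query_terms, top_documents, n_expansion_terms=5):
    """Pick frequent terms from pseudo-relevant documents and add them to the query."""
    counts = Counter()
    for doc in top_documents:
        counts.update(t for t in doc.lower().split() if t not in query_terms)
    expansion = [t for t, _ in counts.most_common(n_expansion_terms)]
    return list(query_terms) + expansion

query = ["oil", "spill"]
top_docs = [
    "Crude oil spill reaches the coast",
    "Cleanup crews contain the crude slick",
]
print(expand_query(query, top_docs, 3))
```

The expanded query would then be run against the sentence index, which is the "expansion before sentence retrieval" configuration the abstract argues for.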

