A goodness of fit test approach in information retrieval |
| |
Authors: | Kostas Fragos Yannis Maistros |
| |
Institution: | (1) Department of Electrical and Computer Engineers, National Technical University of Athens Iroon, Polytexneniou 9, 15780 Zografou, Greece |
| |
Abstract: | In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model
“fits” the user’s information need (query model). On the other hand in statistics, goodness of fit tests are well established
techniques for assessing the assumptions about the underlying distribution of a data set. Supposing that the query terms are
randomly distributed in the various documents of the collection, we actually want to know whether the occurrences of the query
terms are more frequently distributed by chance in a particular document. This can be quantified by the so-called goodness
of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given
the null hypothesis that there is no association between the query terms q and the document d irrespective of any chance occurrences, we perform a Chi-square goodness of fit test for assessing this hypothesis and calculate
the corresponding Chi-square values. Our retrieval formula is based on ranking the documents in the collection according to
these calculated Chi-square values. The method was evaluated over the entire test collection of TREC data, on disks 4 and
5, using the topics of TREC-7 and TREC-8 (50 topics each) conferences. It performs well, outperforming steadily the classical
OKAPI term frequency weighting formula but below that of KL-Divergence from language modeling approach. Despite this, we believe
that the technique is an important non-parametric way of thinking of retrieval, offering the possibility to try simple alternative
retrieval formulas within goodness-of-fit statistical tests’ framework, modeling the data in various ways estimating or assigning any arbitrary theoretical distribution
in terms. |
| |
Keywords: | goodness of fit tests Information Retrieval |
本文献已被 SpringerLink 等数据库收录! |
|