Similar literature
 Found 20 similar documents (search time: 187 ms)
1.
This paper discusses various issues concerning the rank equivalence, due to Lafferty and Zhai, between the log-odds ratio and the query likelihood of probabilistic retrieval models. It highlights that Robertson’s concerns about this equivalence may arise when multiple probability distributions are assumed to be uniformly distributed, after assuming that the marginal probability logically follows from Kolmogorov’s probability axioms. It also clarifies that there are two types of rank equivalence relations between probabilistic models, namely strict and weak rank equivalence. This paper focuses on strict rank equivalence, which requires the event spaces of the participating probabilistic models to be identical. Two probabilistic models can be strict rank equivalent even when they use different probability estimation methods. This paper shows that the query likelihood, p(q|d, r), is strict rank equivalent to p(q|d) of the language model of Ponte and Croft by applying assumptions 1 and 2 of Lafferty and Zhai. In addition, some statistical component language model may be strict rank equivalent to the log-odds ratio, and some statistical component model using the log-odds ratio may be strict rank equivalent to the query likelihood. Finally, we suggest adding a random variable for the user information need to the probabilistic retrieval models for clarification when these models deal with multiple requests.
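The rank equivalence mentioned in this abstract can be sketched as follows. This is only an outline under stated assumptions: the symbols $r$ and $\bar r$ (relevance and non-relevance) and the wording of the two assumptions are our paraphrase of Lafferty and Zhai, not text taken from the abstract, and the paper's own treatment of event spaces is more careful than this.

```latex
% Bayes' rule applied to the log-odds of relevance for query q and document d:
\begin{align}
\log \frac{p(r \mid q, d)}{p(\bar r \mid q, d)}
  &= \log \frac{p(q \mid d, r)}{p(q \mid d, \bar r)}
   + \log \frac{p(r \mid d)}{p(\bar r \mid d)} .
\end{align}
% Assumption 1 (paraphrased): given non-relevance, q is independent of d,
%   so p(q | d, \bar r) = p(q | \bar r), which is constant across documents.
% Assumption 2 (paraphrased): d and r are independent,
%   so p(r | d) = p(r), which is also constant across documents.
% Dropping the document-independent terms leaves the query likelihood:
\begin{align}
\log \frac{p(r \mid q, d)}{p(\bar r \mid q, d)}
  &\stackrel{\text{rank}}{=} \log p(q \mid d, r) .
\end{align}
```

For a fixed query, only terms that vary with $d$ affect the ordering, which is why the two constant terms can be discarded without changing the ranking.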

2.
A solid research path towards new information retrieval models is to further develop the theory behind existing models. A profound understanding of these models is therefore essential. In this paper, we revisit probability ranking principle (PRP)-based models, probability of relevance (PR) models, and language models, finding conceptual differences in their definition and interrelationships. The probabilistic model of the PRP has not been explicitly defined previously, but doing so leads to the formulation of two actual principles with different objectives. First, the belief probability ranking principle (BPRP), which considers uncertain relevance between known documents and the current query, and second, the popularity probability ranking principle (PPRP), which considers the probability of relevance of documents among multiple queries with the same features. Our analysis shows how some of the discussed PR models implement the BPRP or the PPRP while others do not. However, for some models the parameter estimation is challenging. Finally, language models are often presented as related to PR models. However, we find that language models differ from PR models in every aspect of a probabilistic model and the effectiveness of language models cannot be explained by the PRP.

3.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short) parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits of using the document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.

4.
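Recent progress in information retrieval model research abroad   Total citations: 2, self-citations: 0, citations by others: 2
The information retrieval model is the core of information retrieval. In recent years, research abroad on the Boolean model has mainly concerned improvements to the Boolean model and further optimization of the extended Boolean model. Research on the vector space model has concentrated on extensions of the model and on its applications. Work on the probabilistic model has concentrated on continued study of the model itself, its combination with other information retrieval models, and the study and development of language models. Recent research on the emerging ontology-based information retrieval model has concentrated on the theory of the model, its integration with other retrieval models, and its applications. The latest results of information retrieval model research abroad provide cutting-edge reference information for domestic research in this area.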

5.
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record. Documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using a test set of car-ads Web documents and another of obituary Web documents, our probabilistic model achieves an average recall ratio of 100%, a precision ratio of 83.3%, and an accuracy ratio of 92.5%.

6.
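Research on an interactive information retrieval model from the perspective of cognitive construction   Total citations: 1, self-citations: 0, citations by others: 1
[Purpose/Significance] Information retrieval is essentially a cognitive process, so studying interactive information retrieval models that promote user cognition is of great significance. [Research design/Method] Guided by constructivist theory and aiming to promote users' cognitive development, the study builds an interactive information retrieval model whose top-level analytical framework consists of an information space layer, a user space layer, and an interface interaction layer, and develops a prototype system. [Conclusions/Findings] Experimental results show that the prototype system effectively promotes users' exploration and mining of the information space and helps users engage actively in cognitive construction and develop their cognitive space. [Innovation/Value] The study applies cognitive construction theory to the field of information retrieval and proposes improvements to retrieval systems from the standpoint of interaction design, so as to provide better cognitive support.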

7.
In this paper, we first present an analytical view of heuristic retrieval constraints which yields simple tests to determine whether a retrieval function satisfies the constraints or not. We then review empirical findings on word frequency distributions and the central role played by burstiness in this context. This leads us to propose a formal definition of burstiness which can be used to characterize probability distributions with respect to this phenomenon. We then introduce the family of information-based IR models, which naturally captures heuristic retrieval constraints when the underlying probability distribution is bursty, and propose a new IR model within this family, based on the log-logistic distribution. The experiments we conduct on several collections illustrate the good behavior of the log-logistic IR model: it significantly outperforms the Jelinek-Mercer and Dirichlet prior language models on most collections we have used, with both short and long queries and for both the MAP and the precision at 10 documents. It also compares favorably to BM25 and has similar performance to classical DFR models such as InL2 and PL2.
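For readers unfamiliar with the two language-model baselines named in this abstract, the following minimal sketch shows query-likelihood scoring with Jelinek-Mercer and Dirichlet prior smoothing in their commonly stated forms. It is not the paper's code: the function names, the toy documents, and the default values of `lam` and `mu` are all assumptions made for illustration.

```python
import math
from collections import Counter

def collection_stats(docs):
    """Return (term -> collection frequency, total number of tokens)."""
    tf = Counter(term for doc in docs for term in doc)
    return tf, sum(tf.values())

def jelinek_mercer_score(query, doc, coll_tf, coll_len, lam=0.5):
    """Log query likelihood with p(w|d) = (1 - lam) * tf/|d| + lam * p(w|C)."""
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_doc = tf[w] / len(doc) if doc else 0.0
        p_col = coll_tf.get(w, 0) / coll_len
        p = (1.0 - lam) * p_doc + lam * p_col
        if p == 0.0:          # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

def dirichlet_score(query, doc, coll_tf, coll_len, mu=2000.0):
    """Log query likelihood with p(w|d) = (tf + mu * p(w|C)) / (|d| + mu)."""
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_col = coll_tf.get(w, 0) / coll_len
        p = (tf[w] + mu * p_col) / (len(doc) + mu)
        if p == 0.0:          # term unseen in the whole collection
            return float("-inf")
        score += math.log(p)
    return score

# Toy, made-up collection: two tokenized documents.
docs = [["log", "logistic", "model"], ["language", "model", "smoothing"]]
coll_tf, coll_len = collection_stats(docs)
query = ["logistic"]
jm = [jelinek_mercer_score(query, d, coll_tf, coll_len) for d in docs]
dp = [dirichlet_score(query, d, coll_tf, coll_len) for d in docs]
```

Both functions rank the first toy document above the second for this query, because only the first contains the query term; the smoothing term keeps the second document's score finite.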

8.
A probability ranking principle for interactive information retrieval   Total citations: 1, self-citations: 1, citations by others: 0
The classical Probability Ranking Principle (PRP) forms the theoretical basis for probabilistic Information Retrieval (IR) models, which have dominated IR theory for about 20 years. However, the assumptions underlying the PRP often do not hold, and its view is too narrow for interactive information retrieval (IIR). In this article, a new theoretical framework for interactive retrieval is proposed. The basic idea is that during IIR, a user moves between situations. In each situation, the system presents the user with a list of choices about which s/he has to decide, and the first positive decision moves the user to a new situation. Each choice is associated with a number of cost and probability parameters. Based on these parameters, an optimum ordering of the choices can be derived: the PRP for IIR. The relationship of this rule to the classical PRP is described, and issues for further research are pointed out.
Norbert Fuhr

9.
Information Retrieval systems typically sort the results with respect to document retrieval status values (RSV). According to the Probability Ranking Principle, this ranking ensures optimum retrieval quality if the RSVs are monotonically increasing with the probabilities of relevance (as is the case, e.g., for probabilistic IR models). However, advanced applications like filtering or distributed retrieval require estimates of the actual probability of relevance. The relationship between the RSV of a document and its probability of relevance can be described by a normalisation function which maps the retrieval status value onto the probability of relevance (a mapping function). In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (only merging, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
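The idea of a mapping function from RSV to probability of relevance can be illustrated with a small sketch. Everything here is an assumption for illustration: the parameter names `b0`, `b1`, `c0`, `c1`, the clipping of the linear map to [0, 1], the plain gradient-descent fit, and the toy data are not taken from the paper, which does not describe its fitting procedure in this abstract.

```python
import math

def logistic_map(rsv, b0, b1):
    """Logistic mapping: p(rel) = 1 / (1 + exp(-(b0 + b1 * rsv)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * rsv)))

def linear_map(rsv, c0, c1):
    """Linear mapping, clipped to [0, 1] (the clipping is our assumption)."""
    return min(1.0, max(0.0, c0 + c1 * rsv))

def fit_logistic(rsvs, labels, lr=0.1, epochs=2000):
    """Fit b0, b1 by batch gradient descent on the average log loss."""
    b0 = b1 = 0.0
    n = len(rsvs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(rsvs, labels):
            err = logistic_map(x, b0, b1) - y   # gradient of log loss w.r.t. the logit
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Toy, made-up training data: RSVs with binary relevance judgments.
rsvs = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]
labels = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(rsvs, labels)
probs = [logistic_map(x, b0, b1) for x in rsvs]
```

Because the fitted slope is positive on this data, the mapping preserves the original ranking while turning scores into values that can be compared across collections, which is what applications such as result merging need.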

10.
In Information Retrieval, since it is hard to identify users’ information needs, many approaches have been tried to solve this problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments. Although relevance feedback is most effective when relevance information about retrieved documents is provided by users, it is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach is how to construct the term-term correlations that can be used effectively to improve retrieval performance. In this study, we try to construct query concepts that denote users’ information needs from a document space, rather than to reformulate initial queries using the term correlations and/or users’ relevance feedback. To form query concepts, we extract features from each document, and then cluster the features into primitive concepts that are then used to form query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation shows that our proposed framework, called QCM (Query Concept Method), outperforms a baseline probabilistic retrieval model on TREC retrieval.

11.
We propose a hybrid information retrieval (IR) procedure that builds on two well-known IR approaches: data fusion and query expansion via relevance feedback. This IR procedure is designed to exploit the strengths of data fusion and relevance feedback and to avoid some weaknesses of these approaches. We show that our IR procedure is built on postulates that can be justified analytically and empirically. Additionally, we offer an empirical investigation of the procedure, showing that it is superior to relevance feedback on some dimensions and comparable on other dimensions. The empirical investigation also verifies the conditions under which the use of our IR procedure could be beneficial.

12.
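A comparative and developmental study of modern information retrieval model theory   Total citations: 9, self-citations: 0, citations by others: 9
Measuring relevance has always been the core problem of information retrieval, and a series of retrieval models has been proposed to address it. From the perspective of comparison and development, this paper introduces, in chronological order, the Boolean model, the vector space model, the probabilistic model, the fuzzy model, the logical model, the conceptual model, the network model, and others. On the basis of analysis, comparison, and evaluation, it then offers some predictions about future trends in information retrieval model research.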

13.
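[Purpose/Significance] Information retrieval deals with the uncertainty of relevance, but at the technical level this uncertainty is usually converted into deterministic processing, and little attention is paid to the uncertain semantics present in information content. In some information retrieval application scenarios this problem may significantly affect retrieval results, so targeted methods for handling such uncertain semantics need to be considered. [Method/Process] The paper proposes a representation of uncertain semantics based on D-S (Dempster-Shafer) evidence theory and a retrieval model that fuses these uncertain semantic features with text features and topic features, and evaluates the proposed model experimentally on a public dataset. [Results/Conclusions] The evidence interval concept in D-S theory can describe the above uncertainty, and multi-source evidence fusion can combine the uncertain semantic features with text and topic features. Suitable parameters are obtained through model training, which in turn improves retrieval results. The model is theoretically inclusive and extensible, and fusing other retrieval methods on the basis of this model remains a topic for further research.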

14.
In information retrieval research, models and systems traditionally assume that a single person is querying and reviewing the results. However, several empirical studies of professional practice have identified collaboration during IR as an everyday work pattern, used to resolve a shared information need and to benefit from the diverse expertise and experience of the team members. Moreover, most IR systems that are employed in professional work routines are designed for individual use, and prototype collaborative systems are too limited to support use in today's work practice. To bridge this gap, this paper develops and formalizes a decision theoretic approach towards supporting a team of people that explicitly set out together to resolve a shared information need. We develop a formal cost model for collaborative IR that considers the trade-off between the estimated relevance of a document and the estimated document redundancy. From this cost model, we use a decision theoretic approach to derive the notion of activity suggestions, that is, a formal optimum criterion that describes optimum collaboration strategies in IR as the solution of an integer linear program. Those collaboration strategies are suggested to team members with the aim of facilitating the collaborative performance of information retrieval tasks. We demonstrate the application of our model by means of search result division in two collaborative search tasks. In the conducted experiments, we study the effects of different domain knowledge and resulting relevance assessments of team members in four different conditions. The gathered results indicate that our approach can improve the retrieval effectiveness of teams in recall-oriented tasks.

15.
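This paper introduces and evaluates three mathematical models of information retrieval systems designed by the information retrieval pioneer Salton (索顿) and others: the set-theoretic model, the algebraic model, and the probabilistic model. It points out that each of the three models has its own irreplaceable strengths and weaknesses, and that future research should focus on comprehensive models that exploit the strengths and avoid the weaknesses.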

16.
This paper reports on the underlying IR problems encountered when indexing and searching with the Bulgarian language. For this language we propose a general light stemmer and demonstrate that it can be quite effective, producing significantly better MAP (around +34%) than an approach not applying stemming. We implement the GL2 model derived from the Divergence from Randomness paradigm and find its retrieval effectiveness better than other probabilistic, vector-space and language models. The resulting MAP is found to be about 50% better than the classical tf-idf approach. Moreover, increasing the query size enhances the MAP by around 10% (from T to TD). In order to compare the retrieval effectiveness of our suggested stopword list and the light stemmer developed for the Bulgarian language, we conduct a set of experiments on another stopword list and also a more complex and aggressive stemmer. Results tend to indicate that there is no statistically significant difference between these variants and our suggested approach. This paper also evaluates other indexing strategies such as 4-gram indexing and indexing based on the automatic decompounding of compound words. Finally, we analyze certain queries to discover why we obtained poor results when indexing Bulgarian documents using the suggested word-based approach.

17.
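Design and optimization of an active push system based on the vector space model   Total citations: 3, self-citations: 0, citations by others: 3
Active information service is one of the development directions of information retrieval. Using the traditional vector space model to design an active push system has certain advantages, but it still cannot overcome the problem of irrelevant retrieval results. This paper proposes a series of optimization measures and designs a prototype active push system based on the vector space model, with the aim of improving the efficiency of online information retrieval.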
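As background for the vector space model this abstract builds on, here is a minimal sketch of tf-idf weighting with cosine similarity, used to score documents against a user-interest profile. It shows only the traditional baseline, not the paper's optimizations; the function names, the `+ 1.0` idf smoothing, and the toy documents and profile are assumptions made for illustration.

```python
import math
from collections import Counter

def build_idf(docs):
    """Inverse document frequency, log(N / df) + 1 so common terms keep some weight."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) + 1.0 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms unknown to the collection get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy, made-up documents and a user-interest profile.
docs = [
    ["retrieval", "model", "vector"],
    ["push", "service", "news"],
    ["vector", "space", "model", "retrieval"],
]
profile = ["vector", "space"]

idf = build_idf(docs)
profile_vec = tfidf_vector(profile, idf)
scores = [cosine(tfidf_vector(doc, idf), profile_vec) for doc in docs]
# Documents would be pushed to the user in descending order of score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

A push system in this style would deliver only documents whose score exceeds some threshold; choosing that threshold is one place where irrelevant results can slip through.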

18.
This study develops regression models for predicting the performance of cross-language information retrieval (CLIR). The model assumes that CLIR performance can be explained by two factors: (1) the ease of search inherent in each query and (2) the translation quality in the process of CLIR systems. As operational variables, monolingual information retrieval (IR) performance is used for measuring the ease of search, and the well-known evaluation metric BLEU is used to measure the translation quality. This study also proposes an alternative metric, weighted average for matched unigrams (WAMU), which is tailored to gauging translation quality for special IR purposes. The data for regression analysis are obtained from a retrieval experiment of English-to-Italian bilingual searches using the CLEF 2003 test collection. The CLIR and monolingual IR performances are measured by average precision score. The result shows that the proposed regression model can explain about 60% of the variation in CLIR performance, and WAMU has more predictive power than BLEU. A back translation method for applying the regression model to operational CLIR systems in real situations is discussed.

19.
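When searching massive amounts of information, information relevant to the user's query is often missed, while information irrelevant to the query (information garbage) appears in large quantities in the retrieval results. Improving the quality of text information retrieval systems and raising retrieval effectiveness has become an urgent problem. This paper examines an easily overlooked factor that can affect retrieval effectiveness, the modifier, and studies its role in text information retrieval. To this end, a Modified Vector Space Model (MVSM) is constructed and tested on English text, illustrating the role of modifiers.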

20.
Applying Machine Learning to Text Segmentation for Information Retrieval   Total citations: 2, self-citations: 0, citations by others: 2
We propose a self-supervised word segmentation technique for text segmentation in Chinese information retrieval. This method combines the advantages of traditional dictionary-based, character-based and mutual-information-based approaches, while overcoming many of their shortcomings. Experiments on TREC data show this method is promising. Our method is completely language independent and unsupervised, which provides a promising avenue for constructing accurate multi-lingual or cross-lingual information retrieval systems that are flexible and adaptive. We find that although the segmentation accuracy of self-supervised segmentation is not as high as some other segmentation methods, it is enough to give good retrieval performance. It is commonly believed that word segmentation accuracy is monotonically related to retrieval performance in Chinese information retrieval. However, for Chinese, we find that the relationship between segmentation and retrieval performance is in fact nonmonotonic; that is, at around 70% word segmentation accuracy an over-segmentation phenomenon begins to occur which leads to a reduction in information retrieval performance. We demonstrate this effect by presenting an empirical investigation of information retrieval on Chinese TREC data, using a wide variety of word segmentation algorithms with word segmentation accuracies ranging from 44% to 95%, including 70% word segmentation accuracy from our self-supervised word-segmentation approach. It appears that the main reason for the drop in retrieval performance is that correct compounds and collocations are preserved by accurate segmenters, while they are broken up by less accurate (but reasonable) segmenters, to a surprising advantage. This suggests that words themselves might be too broad a notion to conveniently capture the general semantic meaning of Chinese text.
Our research suggests that machine learning techniques can play an important role in building adaptable information retrieval systems, and that different evaluation standards for word segmentation should be applied to different applications.
