Similar Documents
20 similar documents found (search time: 15 ms)
1.
2.
This paper presents a probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models, user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they relax the traditional assumption of independent relevance of documents.
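As a concrete illustration of the language-modeling side of such a framework, the sketch below scores documents by query likelihood with Dirichlet-smoothed document models, a well-known special case that risk-minimization frameworks of this kind subsume. The function names, smoothing parameter, and toy data are assumptions for illustration, not the paper's own implementation.

```python
import math
from collections import Counter

def query_likelihood_score(query_terms, doc_terms, collection_counts, collection_len, mu=2000.0):
    """Log query likelihood under a Dirichlet-smoothed document language model:
    p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = collection_counts.get(w, 0) / collection_len  # background model p(w|C)
        p_wd = (doc_counts.get(w, 0) + mu * p_wc) / (doc_len + mu)
        if p_wd > 0:
            score += math.log(p_wd)
    return score

# Toy usage with assumed collection statistics:
collection = ["risk", "minimization", "retrieval", "language", "model", "retrieval"]
coll_counts = Counter(collection)
print(query_likelihood_score(["retrieval", "model"],
                             ["language", "model", "retrieval"],
                             coll_counts, len(collection)))
```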

3.
The classical probabilistic models attempt to capture the ad hoc information retrieval problem within a rigorous probabilistic framework. It has long been recognized that the primary obstacle to the effective performance of the probabilistic models is the need to estimate a relevance model. The Dirichlet compound multinomial (DCM) distribution based on the Polya Urn scheme, which can also be considered as a hierarchical Bayesian model, is a more appropriate generative model than the traditional multinomial distribution for text documents. We explore a new probabilistic model based on the DCM distribution, which enables efficient retrieval and accurate ranking. Because the DCM distribution captures the dependency of repetitive word occurrences, the new probabilistic model based on this distribution is able to model the concavity of the score function more effectively. To avoid the empirical tuning of retrieval parameters, we design several parameter estimation algorithms to automatically set model parameters. Additionally, we propose a pseudo-relevance feedback algorithm based on the mixture modeling of the Dirichlet compound multinomial distribution to further improve retrieval accuracy. Finally, our experiments show that both the baseline probabilistic retrieval algorithm based on the DCM distribution and the corresponding pseudo-relevance feedback algorithm outperform the existing language modeling systems on several TREC retrieval tasks. The main objective of this research is to develop an effective probabilistic model based on the DCM distribution. A secondary objective is to provide a thorough understanding of the probabilistic retrieval model through a theoretical analysis of various text distribution assumptions.
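For reference, the sketch below computes the log-likelihood of a document's term counts under the Dirichlet compound multinomial (Polya urn) distribution in its standard form; the pseudo-count values, the restriction of the vocabulary to the supplied dictionaries, and the toy data are assumptions for illustration and not the paper's estimation algorithms.

```python
from math import lgamma

def dcm_log_likelihood(term_counts, alphas):
    """log P(x | alpha) under the Dirichlet compound multinomial distribution.

    term_counts: dict term -> count in the document.
    alphas: dict term -> Dirichlet pseudo-count (assumed to cover the vocabulary considered).
    The Polya-urn form discounts repeated occurrences of the same word relative to
    the plain multinomial, capturing word burstiness."""
    n = sum(term_counts.values())
    A = sum(alphas.values())
    # multinomial coefficient: log n! - sum_w log x_w!
    ll = lgamma(n + 1) - sum(lgamma(x + 1) for x in term_counts.values())
    # Polya-urn normalization: log Gamma(A) - log Gamma(n + A)
    ll += lgamma(A) - lgamma(n + A)
    # per-term contributions: log Gamma(x_w + alpha_w) - log Gamma(alpha_w)
    for w, x in term_counts.items():
        a = alphas.get(w, 1e-3)  # assumed small default pseudo-count for unseen terms
        ll += lgamma(x + a) - lgamma(a)
    return ll

# Toy example with assumed pseudo-counts:
print(dcm_log_likelihood({"retrieval": 3, "model": 1}, {"retrieval": 0.5, "model": 0.2}))
```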

4.
In this paper, we propose a new language model, namely, a dependency structure language model, for topic detection and tracking (TDT) to compensate for the weaknesses of unigram and bigram language models. The dependency structure language model is based on the Chow expansion theory and the dependency parse tree generated by a linguistic parser, so long-distance dependencies can be naturally captured by the dependency structure language model. We carried out extensive experiments to verify the proposed model on topic tracking and link detection in TDT. In both cases, the dependency structure language models perform better than strong baseline approaches.

5.
[Purpose/Significance] The big-data era places high demands on the retrieval precision of information retrieval systems across domains. However, research on traditional retrieval models has reached a bottleneck: models proposed in recent years offer only marginal precision gains and cannot adequately satisfy users' needs for precise search, so high-precision retrieval models remain to be explored. Recently, a new retrieval model framework based on digital signal processing theory (Digital Signal Processing Framework, DSPF) has been proposed, and retrieval models built on this framework have been shown to achieve significantly higher precision than traditional models. [Method/Process] Building on the digital signal processing framework, this study introduces the term-weighting schemes of the classical probabilistic models F2LOG and F2EXP and proposes the models DSPF-F2LOG and DSPF-F2EXP. To verify their precision, we experimentally compare them with several classical retrieval models on multiple standard test collections of different types, using several precision metrics. [Results/Conclusions] The experimental results show that the proposed models generally achieve higher precision than the classical retrieval models and perform at least on par with the most precise retrieval models based on digital signal processing theory to date. The two high-precision DSP models proposed here can effectively improve the precision of current information retrieval systems on unstructured text. …
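For context, the sketch below shows the F2EXP scoring function in the form commonly given in the axiomatic retrieval literature, with the F2LOG variant noted in a comment; how these term weights are embedded into the DSPF framework is not reproduced here, and the parameter defaults and toy data are assumptions.

```python
import math

def f2exp_score(query_tf, doc_tf, doc_len, avg_doc_len, df, N, s=0.5, k=0.35):
    """F2EXP-style axiomatic scoring (a sketch; s and k defaults are assumptions).

    query_tf, doc_tf: dicts term -> frequency; df: dict term -> document frequency;
    N: number of documents in the collection."""
    score = 0.0
    for t, qtf in query_tf.items():
        dtf = doc_tf.get(t, 0)
        if dtf == 0:
            continue
        tf_part = dtf / (dtf + s + s * doc_len / avg_doc_len)
        idf_part = ((N + 1) / df[t]) ** k  # F2LOG would instead use log((N + 1) / df[t])
        score += qtf * tf_part * idf_part
    return score

# Toy usage with assumed statistics:
print(f2exp_score({"dsp": 1}, {"dsp": 2, "model": 5}, 7, 10, {"dsp": 50}, 10000))
```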

6.
Term weighting for document ranking and retrieval has been an important research topic in information retrieval for decades. We propose a novel term weighting method based on the hypothesis that a term’s role in accumulated past retrieval sessions affects its general importance. It utilizes past retrieval results consisting of the queries that contain a particular term, the retrieved documents, and their relevance judgments. A term’s evidential weight, as proposed in this paper, depends on the degree to which the mean frequency values of the relevant and non-relevant document distributions in the past differ. More precisely, it takes into account the rankings and similarity values of the relevant and non-relevant documents. Our experimental results using standard test collections show that the proposed term weighting scheme improves conventional TF*IDF and language model based schemes, indicating that evidential term weights bring in a new aspect of term importance and complement collection-statistics-based TF*IDF. We also show how the proposed term weighting scheme based on the notion of evidential weights is related to the well-known weighting schemes based on language modeling and probabilistic models.
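The evidential weights themselves depend on past relevance judgments that the abstract does not specify in enough detail to reproduce; as a point of reference, the sketch below shows the conventional TF*IDF weighting that the proposed scheme is reported to improve upon and complement. The log-scaled variant and the names are assumptions for illustration.

```python
import math

def tf_idf_weight(tf, df, num_docs):
    """Conventional TF*IDF term weight: log-scaled term frequency times inverse document frequency."""
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(num_docs / df)

# e.g. a term occurring 3 times in a document and appearing in 100 of 10,000 documents:
print(tf_idf_weight(3, 100, 10_000))
```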

7.
Two probabilistic approaches to cross-lingual retrieval are in wide use today: those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries: one derived from a conventional dictionary and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus. We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval effectiveness and that good translation probabilities confer a small but significant advantage.

8.
In the KL divergence framework, the extended language modeling approach has a critical problem of estimating a query model, which is the probabilistic model that encodes the user’s information need. For query expansion in initial retrieval, the translation model had been proposed to incorporate term co-occurrence statistics. However, the translation model was difficult to apply, because the term co-occurrence statistics must be constructed offline. Especially in a large collection, constructing such a large matrix of term co-occurrence statistics prohibitively increases time and space complexity. In addition, reliable retrieval performance cannot be guaranteed because the translation model may comprise noisy non-topical terms in documents. To resolve these problems, this paper investigates an effective method to construct co-occurrence statistics and eliminate noisy terms by employing a parsimonious translation model. The parsimonious translation model is a compact version of a translation model that can reduce the number of terms containing non-zero probabilities by eliminating non-topical terms in documents. Through experimentation on seven different test collections, we show that the query model estimated from the parsimonious translation model significantly outperforms not only the baseline language modeling, but also the non-parsimonious models.
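To make the ranking side of the KL divergence framework concrete, the sketch below ranks documents by negative KL divergence between an estimated query model and a Dirichlet-smoothed document model, which is rank-equivalent to the cross-entropy term shown in the code. How the query model itself is estimated from the (parsimonious) translation model is not reproduced; names, smoothing, and parameters are assumptions.

```python
import math
from collections import Counter

def kl_divergence_score(query_model, doc_terms, collection_counts, collection_len, mu=2000.0):
    """Score a document by sum_w p(w|Q) * log p(w|D), the document-dependent part of
    -KL(theta_Q || theta_D), with a Dirichlet-smoothed document model.

    query_model: dict term -> p(w|Q), e.g. an expanded query model estimated elsewhere."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for w, p_wq in query_model.items():
        p_wc = collection_counts.get(w, 0) / collection_len
        p_wd = (doc_counts.get(w, 0) + mu * p_wc) / (doc_len + mu)
        if p_wd > 0:
            score += p_wq * math.log(p_wd)
    return score
```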

9.
Interdocument similarities are the fundamental information source required in cluster-based retrieval, which is an advanced retrieval approach that significantly improves performance during information retrieval (IR). An effective similarity metric is query-sensitive similarity, which was introduced by Tombros and Rijsbergen as a method to more directly satisfy the cluster hypothesis that forms the basis of cluster-based retrieval. Although this method is reported to be effective, existing applications of query-sensitive similarity are still limited to vector space models wherein there is no connection to probabilistic approaches. We suggest a probabilistic framework that defines query-sensitive similarity based on probabilistic co-relevance, where the similarity between two documents is proportional to the probability that they are both co-relevant to a specific given query. We further simplify the proposed co-relevance-based similarity by decomposing it into two separate relevance models. We then formulate all the requisite components for the proposed similarity metric in terms of scoring functions used by language modeling methods. Experimental results obtained using standard TREC test collections consistently showed that the proposed query-sensitive similarity measure performs better than term-based similarity and existing query-sensitive similarity in the context of Voorhees’ nearest neighbor test (NNT).

10.
This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations carried out on two general stemming strategies for this language, and also demonstrates that a light stemming approach could be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP. When compared to an IR scheme without stemming or one based on only a light stemmer, we find the differences to be statistically significant. When compared with probabilistic, vector-space and language models, we find that the Okapi model results in the best retrieval effectiveness. The resulting MAP is found to be about 35% better than the classical tf·idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure for both queries and documents significantly improves IR performance (+10%), compared to word-based indexing strategies.
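To illustrate what a "light" stemming approach of the kind evaluated here looks like, the sketch below strips a handful of common Hungarian noun endings. The suffix list and minimum stem length are assumptions chosen for illustration only; they are not the stemmer used in the reported experiments.

```python
# Illustrative light suffix stripping; the suffix list covers only a few common
# Hungarian noun endings and is an assumption, not the paper's stemmer.
LIGHT_SUFFIXES = ["ban", "ben", "nak", "nek", "val", "vel", "ok", "ek", "ak", "t", "k"]

def light_stem(word, min_stem_len=3):
    """Remove at most one known suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suf in sorted(LIGHT_SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= min_stem_len:
            return word[: -len(suf)]
    return word

print(light_stem("házban"))  # strips the inessive ending "-ban"
```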

11.
Multimodal retrieval performs information retrieval across different modalities of multimedia information and has become increasingly important in the information science field. However, bridging the meanings of different multimedia modalities is difficult, which degrades the performance of multimodal retrieval. In this paper, we propose a new mechanism to build the relationship between the visual and textual modalities and to verify multimodal retrieval results. Specifically, this mechanism relies on multimodal binary classifiers based on the Extreme Learning Machine (ELM) to verify whether the answers are related to the query examples. First, we propose a multimodal probabilistic semantic model to rank the answers according to their generative probabilities. We then build the multimodal binary classifiers, called word classifiers, to filter out unrelated answers and improve the performance of the multimodal probabilistic semantic model. The experimental results show that the multimodal probabilistic semantic model and the word classifiers are effective and efficient. They also demonstrate that the ELM-based word classifiers not only improve the performance of the probabilistic semantic model but can also be easily applied to other probabilistic semantic models.

12.
Research on a position-weighted model for Web-based text retrieval   Cited by: 4 (self-citations: 0, citations by others: 4)
刘海峰  王倩  王元元 《情报科学》2007,25(3):451-455
This paper analyzes the shortcomings of the traditional vector space retrieval model in the Web environment and proposes two improved models: a linear weighting model and a nonlinear weighting model. Experimental results show that both models substantially improve text retrieval effectiveness over the traditional vector space model.

13.
Collaborative information retrieval involves retrieval settings in which a group of users collaborates to satisfy the same underlying need. One core issue of collaborative IR models involves either supporting collaboration with adapted tools or developing IR models for a multiple-user context and providing a ranked list of documents adapted for each collaborator. In this paper, we introduce the first document-ranking model supporting collaboration between two users characterized by roles relying on different domain expertise levels. Specifically, we propose a two-step ranking model: we first compute a document-relevance score, taking into consideration domain expertise-based roles. We introduce specificity and novelty factors into language-model smoothing, and then we assign, via an Expectation–Maximization algorithm, documents to the best-suited collaborator. Our experiments employ a simulation-based framework of collaborative information retrieval and show the significant effectiveness of our model at different search levels.

14.
In this article we investigate the expressions of collaborative activities within information seeking and retrieval processes (IS&R). Generally, information seeking and retrieval is regarded as an individual and isolated process in IR research. We assume that an IS&R situation is not merely an individual effort, but inherently involves various collaborative activities. We present empirical results from a real-life and information-intensive setting within the patent domain, showing that the patent task performance process involves highly collaborative aspects throughout the stages of the information seeking and retrieval process. Furthermore, we show that these activities may be categorised and related to different stages in an information seeking and retrieval process. Therefore, the assumption that information retrieval performance is purely individual needs to be reconsidered. Finally, we also propose a refined IR framework involving collaborative aspects.

15.
In this paper we present a language-independent probabilistic model which can automatically generate stemmers. Stemmers can improve the retrieval effectiveness of information retrieval systems; however, designing and implementing stemmers requires considerable effort because documents and queries are often written or spoken in several different languages. The probabilistic model proposed in this paper aims at the development of stemmers for several languages. The proposed model describes the mutual reinforcement relationship between stems and derivations and then provides a probabilistic interpretation. A series of experiments shows that the stemmers generated by the probabilistic model are as effective as the ones based on linguistic knowledge.

16.
In this paper, we present a well-defined general matrix framework for modelling Information Retrieval (IR). In this framework, collections, documents and queries correspond to matrix spaces. Retrieval aspects, such as content, structure and semantics, are expressed by matrices defined in these spaces and by matrix operations applied on them. The dualities of these spaces are identified through the application of frequency-based operations on the proposed matrices and through the investigation of the meaning of their eigenvectors. This allows term weighting concepts used for content-based retrieval, such as term frequency and inverse document frequency, to translate directly to concepts for structure-based retrieval. In addition, concepts such as pagerank, authorities and hubs, determined by exploiting the structural relationships between linked documents, can be defined with respect to the semantic relationships between terms. Moreover, this mathematical framework can be used to express classical and alternative evaluation measures, involving, for instance, the structure of documents, and to further explain and relate IR models and theory. The high level of reusability and abstraction of the framework leads to a logical layer for IR that makes system design and construction significantly more efficient, and thus, better and increasingly personalised systems can be built at lower costs.

17.
This paper examines the estimation of global term weights (such as IDF) in information retrieval scenarios where a global view on the collection is not available. In particular, the two options of either sampling documents or of using a reference corpus independent of the target retrieval collection are compared using standard IR test collections. In addition, the possibility of pruning term lists based on frequency is evaluated.
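The sketch below illustrates the sampling option: IDF values are estimated from a random sample of documents rather than the full collection, with add-one smoothing so that terms unseen in the sample still receive a finite weight. The function name, smoothing choice, and toy data are assumptions for illustration.

```python
import math
import random

def estimate_idf_from_sample(documents, sample_size, seed=0):
    """Estimate IDF from a random document sample instead of the full collection.
    Returns a function term -> log((n + 1) / (df + 1)) over the sample (add-one smoothed)."""
    rng = random.Random(seed)
    sample = rng.sample(documents, min(sample_size, len(documents)))
    df = {}
    for doc in sample:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    n = len(sample)
    return lambda term: math.log((n + 1) / (df.get(term, 0) + 1))

# Toy usage:
docs = [["a", "b"], ["b", "c"], ["a", "c", "d"], ["d"]]
idf = estimate_idf_from_sample(docs, sample_size=3)
print(idf("b"), idf("z"))
```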

18.
Most existing search engines focus on document retrieval. However, information needs are certainly not limited to finding relevant documents. Instead, a user may want to find relevant entities such as persons and organizations. In this paper, we study the problem of related entity finding. Our goal is to rank entities based on their relevance to a structured query, which specifies an input entity, the type of related entities and the relation between the input and related entities. We first discuss a general probabilistic framework, derive six possible retrieval models to rank the related entities, and then compare these models both analytically and empirically. To further improve performance, we study the problem of feedback in the context of related entity finding. Specifically, we propose a mixture model based feedback method that can utilize the pseudo feedback entities to estimate an enriched model for the relation between the input and related entities. Experimental results over two standard TREC collections show that the derived relation generation model combined with a relation feedback method performs better than other models.

19.
We investigate a novel perspective on the development of effective algorithms for contact recommendation in social networks, where the problem consists of automatically predicting people that a given user may wish or benefit from connecting to in the network. Specifically, we explore the connection between contact recommendation and the text information retrieval (IR) task by investigating the adaptation of IR models (classical and supervised) for recommending people in social networks, using only the structure of these networks. We first explore the use of adapted unsupervised IR models as direct standalone recommender systems. Seeking additional effectiveness enhancements, we further explore the use of IR models as neighbor selection methods, in place of common similarity measures, in user-based and item-based nearest-neighbors (kNN) collaborative filtering approaches. On top of this, we investigate the application of learning to rank approaches borrowed from text IR to achieve additional improvements. We report thorough experiments over data obtained from Twitter and Facebook where we observe that IR models, particularly BM25, are competitive compared to state-of-the-art contact recommendation methods. We provide further empirical analysis of the additional effectiveness that can be achieved by the integration of IR models into kNN and learning to rank schemes. Our research shows that the IR models are effective in three roles: as direct contact recommenders, as neighbor selectors in collaborative filtering and as samplers and features in learning to rank.
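To give a flavour of how a text IR model can be reused on network structure alone, the sketch below applies the standard BM25 formula where the target user's neighbours play the role of the query, a candidate user's neighbours play the role of the document, and the "document frequency" of a user is its in-degree. The exact adaptation, edge weighting, and parameter values in the paper may differ; everything here is an illustrative assumption.

```python
import math

def bm25_contact_score(target_neighbors, candidate_neighbors, in_degree, num_users,
                       avg_degree, k1=1.2, b=0.75):
    """BM25 reused for contact recommendation on an unweighted network (a sketch)."""
    doc_len = len(candidate_neighbors)
    score = 0.0
    for w in target_neighbors:
        if w not in candidate_neighbors:
            continue
        tf = 1  # unweighted edge; a weighted network would use the edge weight here
        df = in_degree.get(w, 0)
        idf = math.log((num_users - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_degree))
    return score

# Toy usage with assumed in-degrees:
print(bm25_contact_score({"u2", "u3"}, {"u3", "u4"}, {"u3": 12, "u4": 40},
                         num_users=1000, avg_degree=25.0))
```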

20.
Experimental results of cross-language information retrieval (CLIR) do not indicate why a model fails or how a model could be improved. One basic research question is thus whether it is possible to provide conditions by which any existing or new CLIR strategy can be evaluated analytically and the design of CLIR models can be improved. Inspired by the heuristics in monolingual IR, we introduce in this paper Dilution/Concentration (D/C) conditions to characterize good CLIR models based on direct intuitions under artificial settings. The conditions, derived from first principles in CLIR, generalize the idea of the query structuring approach. Empirical results with state-of-the-art CLIR models show that when a condition is not satisfied, it often indicates non-optimality of the method. In general, we find that the empirical performance of a retrieval formula is tightly related to how well it satisfies the conditions. Lastly, following the D/C conditions, we propose several novel CLIR models based on the information-based models, which again shows that the D/C conditions are effective for characterizing good CLIR models.
