Similar literature
 Found 20 similar documents (search time: 734 ms)
1.
The retrieval of sentences that are relevant to a given information need is a challenging passage retrieval task. In this context, the well-known vocabulary mismatch problem arises severely because of the fine granularity of the task. Short queries, which are usually the rule rather than the exception, aggravate the problem. Consequently, effective sentence retrieval methods tend to apply some form of query expansion, usually based on pseudo-relevance feedback. Nevertheless, there are no extensive studies comparing different statistical expansion strategies for sentence retrieval. In this work we study thoroughly the effect of distinct statistical expansion methods on sentence retrieval. We start from a set of retrieved documents in which relevant sentences have to be found. In our experiments different term selection strategies are evaluated and we provide empirical evidence to show that expansion before sentence retrieval yields competitive performance. This is particularly novel because expansion for sentence retrieval is often done after sentence retrieval (i.e. expansion terms are mined from a ranked set of sentences) and there are no comparative results available between both types of expansion. Furthermore, this comparison is particularly valuable because there are important implications in time efficiency. We also carefully analyze expansion on weak and strong queries and demonstrate clearly that expanding queries before sentence retrieval is not only more convenient for efficiency purposes, but also more effective when handling poor queries.
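As a rough illustration of the pseudo-relevance feedback idea mentioned in the abstract above, the sketch below adds the most frequent new terms from the top-ranked documents to the query. The function name `expand_query`, its parameters, and the raw-frequency term selection are assumptions made for illustration; the abstract does not name the term selection strategies it evaluates.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_docs=2, num_terms=2):
    """Hypothetical helper: append frequent terms from the top-ranked docs.

    ranked_docs is a list of tokenized documents, best-ranked first.
    Raw term frequency is an assumed selection criterion, not the paper's.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        # Only count terms that are not already in the query.
        counts.update(t for t in doc if t not in query_terms)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    candidates = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in candidates[:num_terms]]


ranked_docs = [
    ["sentence", "retrieval", "passage", "ranking"],
    ["passage", "retrieval", "feedback", "ranking"],
    ["video", "news"],
]
expanded = expand_query(["sentence", "retrieval"], ranked_docs)
print(expanded)
```

The expanded query would then be used to rank sentences. Mining terms from retrieved documents, as here, corresponds to expansion before sentence retrieval; mining them from a ranked set of sentences would be the alternative the abstract compares against.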

2.
A review of text and image retrieval approaches for broadcast news video   Total citations: 1, self-citations: 0, citations by others: 1
The effectiveness of a video retrieval system largely depends on the choice of underlying text and image retrieval components. The unique properties of video collections (e.g., multiple sources, noisy features and temporal relations) suggest we examine the performance of these retrieval methods in such a multimodal environment, and identify the relative importance of the underlying retrieval components. In this paper, we review a variety of text/image retrieval approaches as well as their individual components in the context of broadcast news video. Numerous components of text/image retrieval are discussed in detail, including retrieval models, text sources, temporal expansion methods, query expansion methods, image features, and similarity measures. For each component, we conduct a series of retrieval experiments on TRECVID video collections to identify their advantages and disadvantages. To provide a more complete coverage of video retrieval, we briefly discuss an emerging approach called concept-based video retrieval, and review strategies for combining multiple retrieval outputs.

3.
Both English and Chinese ad-hoc information retrieval were investigated in this Tipster 3 project. One of our objectives is to study the use of various term-level and phrase-level evidence to improve retrieval accuracy. For short queries, we studied five term-level techniques that together can lead to good improvements over standard ad-hoc 2-stage retrieval for TREC5-8 experiments. For long queries, we studied the use of linguistic phrases to re-rank retrieval lists. Its effect is small but consistently positive. For Chinese IR, we investigated three simple representations for documents and queries: short-words, bigrams and characters. Both approximate short-word segmentation and bigrams, augmented with characters, give highly effective results. Accurate word segmentation does not appear to be crucial for the overall result of a query set. Character indexing by itself is not competitive. Additional improvements may be obtained using collection enrichment and combination of retrieval lists. Our PIRCS document-focused retrieval is also shown to have similarity with a simple language model approach to IR.

4.
The application of word sense disambiguation (WSD) techniques to information retrieval (IR) has yet to provide convincing retrieval results. Major obstacles to effective WSD in IR include coverage and granularity problems of word sense inventories, sparsity of document context, and limited information provided by short queries. In this paper, to alleviate these issues, we propose the construction of latent context models for terms using latent Dirichlet allocation. We propose building one latent context per word, using a well-principled representation of local context based on word features. In particular, context words are weighted using a decaying function according to their distance to the target word, which is learnt from data in an unsupervised manner. The resulting latent features are used to discriminate word contexts, so as to constrict the query’s semantic scope. Consistent and substantial improvements, including on difficult queries, are observed on TREC test collections, and the technique combines well with blind relevance feedback. Compared to traditional topic modeling, WSD and positional indexing techniques, the proposed retrieval model is more effective and scales well on large-scale collections.

5.
Blog feed search aims to identify a blog feed of recurring interest to users on a given topic. A blog feed, the retrieval unit for blog feed search, comprises blog posts of diverse topics. This topical diversity of blog feeds often causes performance deterioration of blog feed search. To alleviate the problem, this paper proposes several approaches based on passage retrieval, widely regarded as effective to handle topical diversity at document level in ad-hoc retrieval. We define the global and local evidence for blog feed search, which correspond to the document-level and passage-level evidence for passage retrieval, respectively, and investigate their influence on blog feed search, in terms of both initial retrieval and pseudo-relevance feedback. For initial retrieval, we propose a retrieval framework to integrate global evidence with local evidence. For pseudo-relevance feedback, we gather feedback information from the local evidence of the top K ranked blog feeds to capture diverse and accurate information related to a given topic. Experimental results show that our approaches using local evidence consistently and significantly outperform traditional ones.

6.
We first present in this paper an analytical view of heuristic retrieval constraints which yields simple tests to determine whether a retrieval function satisfies the constraints or not. We then review empirical findings on word frequency distributions and the central role played by burstiness in this context. This leads us to propose a formal definition of burstiness which can be used to characterize probability distributions with respect to this phenomenon. We then introduce the family of information-based IR models which naturally captures heuristic retrieval constraints when the underlying probability distribution is bursty and propose a new IR model within this family, based on the log-logistic distribution. The experiments we conduct on several collections illustrate the good behavior of the log-logistic IR model: It significantly outperforms the Jelinek-Mercer and Dirichlet prior language models on most collections we have used, with both short and long queries and for both the MAP and the precision at 10 documents. It also compares favorably to BM25 and has similar performance to classical DFR models such as InL2 and PL2.

7.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short) parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits in using the document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.

8.
The combination of evidence can increase retrieval effectiveness. In this paper, we investigate the effectiveness of a decision mechanism for the selective combination of evidence for Web Information Retrieval and particularly for topic distillation. We introduce two measures of a query’s broadness and use them to select an appropriate combination of evidence for each query. The results from our experiments show that there is a statistically significant association between the output of the decision mechanism and the relative effectiveness of the different combinations of evidence. Moreover, we show that the proposed methodology can be applied in an operational setting, where relevance information is not available, by setting the decision mechanism’s thresholds automatically.

9.
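[Purpose/Significance] Information retrieval deals with uncertainty about relevance, but at the technical level this uncertainty is usually handled by converting it into deterministic processing, and little attention is paid to the uncertain semantics present in the information content itself. In some retrieval scenarios this problem can significantly affect retrieval results, so targeted methods for handling such uncertain semantics are needed. [Method/Process] We propose a representation of uncertain semantics based on D-S (Dempster-Shafer) evidence theory, together with a retrieval model that fuses these uncertain semantic features with text features and topic features, and we evaluate the proposed model experimentally on public datasets. [Result/Conclusion] The concept of the evidence interval in D-S theory can describe this uncertainty, and multi-source evidence fusion can combine the uncertain semantic features with text and topic features; model training yields suitable parameters, which in turn improves retrieval results. The model is theoretically inclusive and extensible, and fusing other retrieval methods on top of it is a topic for further research.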
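The evidence fusion and evidence interval that the abstract above relies on can be sketched with Dempster's rule of combination from D-S theory. This is a generic textbook illustration, not the paper's retrieval model: the two-element frame (`R` for relevant, `N` for non-relevant), the mass values, and the function names are all assumptions.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule: fuse two mass functions keyed by frozenset hypotheses."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to contradictory hypotheses is treated as conflict.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalize so the remaining masses sum to 1.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


def belief(m, hypothesis):
    """Lower bound of the evidence interval: mass of all subsets of hypothesis."""
    return sum(w for h, w in m.items() if h <= hypothesis)


def plausibility(m, hypothesis):
    """Upper bound of the evidence interval: mass of all sets intersecting it."""
    return sum(w for h, w in m.items() if h & hypothesis)


R, N = frozenset({"R"}), frozenset({"N"})
EITHER = R | N  # mass on the whole frame expresses ignorance

m_text = {R: 0.6, EITHER: 0.4}           # assumed evidence from text features
m_topic = {R: 0.5, N: 0.3, EITHER: 0.2}  # assumed evidence from topic features

fused = combine(m_text, m_topic)
interval = (belief(fused, R), plausibility(fused, R))
print(interval)
```

The pair `(belief, plausibility)` is the evidence interval for the hypothesis "relevant"; the gap between the two bounds is the uncertainty that a single probability would hide.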

10.
In this paper, a novel neighborhood based document smoothing model for information retrieval has been proposed. Lexical association between terms is used to provide a context sensitive indexing weight to the document terms, i.e. the term weights are redistributed based on the lexical association with the context words. A generalized retrieval framework has been presented and it has been shown that the vector space model (VSM), divergence from randomness (DFR), Okapi Best Matching 25 (BM25) and the language model (LM) based retrieval frameworks are special cases of this generalized framework. Being proposed in the generalized retrieval framework, the neighborhood based document smoothing model is applicable to all the indexing models that use the term-document frequency scheme. The proposed smoothing model is as efficient as the baseline retrieval frameworks at runtime. Experiments over the TREC datasets show that the neighborhood based document smoothing model consistently improves the retrieval performance of VSM, DFR, BM25 and LM and the improvements are statistically significant.

11.
Many queries have multiple interpretations; they are ambiguous or underspecified. This is especially true in the context of Web search. To account for this, much recent research has focused on creating systems that produce diverse ranked lists. In order to validate these systems, several new evaluation measures have been created to quantify diversity. Ideally, diversity evaluation measures would distinguish between systems by the amount of diversity in the ranked lists they produce. Unfortunately, diversity is also a function of the collection over which the system is run and a system’s performance at ad-hoc retrieval. A ranked list built from a collection that does not cover multiple subtopics cannot be diversified; neither can a ranked list that contains no relevant documents. To ensure that we are assessing systems by their diversity, we develop (1) a family of evaluation measures that take into account the diversity of the collection and (2) a meta-evaluation measure that explicitly controls for performance. We demonstrate experimentally that our new measures can achieve substantial improvements in sensitivity to diversity without reducing discriminative power.

12.
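Application of language models in information retrieval   Total citations: 1, self-citations: 0, citations by others: 1
Retrieval methods based on language models have opened up a promising and also quite challenging direction in information retrieval. Compared with traditional retrieval models, language models not only have a sound theoretical foundation but are also very flexible: other classical retrieval models can easily be derived from them through simple transformations. In addition, a large body of experimental results shows that the approach retrieves more effectively than other retrieval models, so it has been popular with researchers ever since it was proposed. However, current research on language modeling approaches concentrates on monolingual retrieval tasks, and few studies address their application to cross-language retrieval. To address this, this paper gives a systematic introduction to language-model-based retrieval and then extends the approach to cross-language retrieval, presenting two cross-language retrieval models: the statistical translation model and the cross-language relevance language model.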
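For readers unfamiliar with the monolingual language modeling approach that the abstract above builds on, a minimal query-likelihood sketch follows. Dirichlet prior smoothing and the parameter `mu` are assumptions chosen for illustration; the abstract does not say which smoothing method the paper uses, and the cross-language models it introduces are not shown.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_counts, collection_len, mu=2000.0):
    """Score a tokenized document by log P(query | document language model).

    Uses Dirichlet prior smoothing (an assumed choice):
        P(w | d) = (tf(w, d) + mu * P(w | C)) / (|d| + mu)
    """
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_collection = collection_counts.get(w, 0) / collection_len
        p = (tf[w] + mu * p_collection) / (len(doc) + mu)
        if p == 0.0:
            # The term never occurs in the collection, so it cannot be smoothed.
            return float("-inf")
        score += math.log(p)
    return score


docs = {
    "d1": ["language", "model", "retrieval"],
    "d2": ["video", "news", "retrieval"],
}
collection_counts = Counter(t for d in docs.values() for t in d)
collection_len = sum(collection_counts.values())

query = ["language", "model"]
scores = {
    name: query_log_likelihood(query, d, collection_counts, collection_len, mu=1.0)
    for name, d in docs.items()
}
best = max(scores, key=scores.get)
print(best, scores)
```

Documents are ranked by descending score. The tiny `mu=1.0` only suits this toy collection; values in the hundreds or thousands are more typical for real collections.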

13.
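Application of an iterative nearest-neighbor method based on KNN and automatic retrieval to automatic classification   Total citations: 8, self-citations: 3, citations by others: 8
Yang Jianliang, Wang Yongcheng. 《情报学报》 (Journal of the China Society for Scientific and Technical Information), 2004, 23(2): 137-141
This paper studies an automatic classification algorithm based on KNN and automatic retrieval, the iterative nearest-neighbor method (Iterative KNN, I-KNN), to address the poor classification performance of KNN when the sample library is small. When not enough samples are available to determine the class, retrieval is used to amplify the local topical features of the sample to be classified, which yields enough similar samples to determine its class. Experiments show that the iterative nearest-neighbor method increases the chance of obtaining similar samples while effectively controlling the classification noise that may be introduced when the sample-similarity constraint is relaxed, and in practical use it improves the recall and precision of an automatic classification system.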

14.
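Based on the relevant international standards, this paper summarizes the evidence-management requirements needed to resolve disputes in e-commerce transactions and proposes a conceptual framework for evidence management. It also builds the basic process and a generalized reference model for evidence management, based on the cryptographic methods in use and on a model involving a trusted third party. The framework can generate, record, transmit, store and verify evidence when a transaction event occurs, and can retrieve that evidence as the basis for resolution when a dispute arises.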

15.
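A new method for improving the precision of Web information retrieval systems   Total citations: 2, self-citations: 1, citations by others: 1
As the amount of information on the Internet grows rapidly, helping users obtain useful information has become an urgent problem for Web information retrieval research. This paper proposes a new method, Improveaccuracy, which combines a series of measures. It addresses the poor precision caused by failing to process the user's query request accurately, and avoids the high cost that some current systems incur by post-processing search results to raise precision, thereby improving both search precision and search efficiency.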

16.
Due to the great variation of biological names in biomedical text, appropriate tokenization is an important preprocessing step for biomedical information retrieval. Despite its importance, there has been little study on the evaluation of various tokenization strategies for biomedical text. In this work, we conducted a careful, systematic evaluation of a set of tokenization heuristics on all the available TREC biomedical text collections for ad hoc document retrieval, using two representative retrieval methods and a pseudo-relevance feedback method. We also studied the effect of stemming and stop word removal on the retrieval performance. As expected, our experimental results show that tokenization can significantly affect the retrieval accuracy; appropriate tokenization can improve the performance by up to 96%, measured by mean average precision (MAP). In particular, it is shown that different query types require different tokenization heuristics, stemming is effective only for certain queries, and stop word removal in general does not improve the retrieval performance on biomedical text.
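Since the abstract above reports its gains in mean average precision (MAP), a short sketch of the standard definition may help. The document identifiers and relevance judgments below are made up for illustration, and the function names are assumptions.

```python
def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k at which a relevant doc appears.

    Relevant documents that are never retrieved contribute zero, because the
    sum is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # relevant docs at ranks 1 and 3
    (["d5", "d6"], {"d6"}),                    # relevant doc at rank 2
]
map_score = mean_average_precision(runs)
print(round(map_score, 4))
```

Here the first query has average precision (1/1 + 2/3) / 2 and the second has 1/2, so MAP is their mean.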

17.
There have been a number of linear, feature-based models proposed by the information retrieval community recently. Although each model is presented differently, they all share a common underlying framework. In this paper, we explore and discuss the theoretical issues of this framework, including a novel look at the parameter space. We then detail supervised training algorithms that directly maximize the evaluation metric under consideration, such as mean average precision. We present results that show training models in this way can lead to significantly better test set performance compared to other training methods that do not directly maximize the metric. Finally, we show that linear feature-based models can consistently and significantly outperform current state of the art retrieval models with the correct choice of features.

18.
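A well-constructed query is a basic guarantee of information retrieval quality. Semantic query expansion addresses the inability of traditional information retrieval systems to understand the user's query intent well, improving recall while maintaining precision. Taking the semantic associations among query keywords as the starting point, this paper uses implicit feedback to obtain disambiguation context, uses the WordNet ontology and the WordNet Domains extension as disambiguation data sources, maps query keywords to ontology concepts with two kinds of unsupervised word sense disambiguation methods (one based on local context, one graph-based), and finally performs semantic query expansion based on concept-word associations. Combining the knowledge sources in WordNet and WordNet Domains to determine query word senses ensures disambiguation accuracy; using unsupervised word sense disambiguation to determine query word senses quickly keeps retrieval real-time; and proposing two disambiguation methods according to the number of query keywords meets a range of query needs.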

19.
User generated content forms an important domain for mining knowledge. In this paper, we address the task of blog feed search: to find blogs that are principally devoted to a given topic, as opposed to blogs that merely happen to mention the topic in passing. The large number of blogs makes the blogosphere a challenging domain, both in terms of effectiveness and of storage and retrieval efficiency. We examine the effectiveness of an approach to blog feed search that is based on individual posts as indexing units (instead of full blogs). Working in the setting of a probabilistic language modeling approach to information retrieval, we model the blog feed search task by aggregating over a blogger’s posts to collect evidence of relevance to the topic and persistence of interest in the topic. This approach achieves state-of-the-art performance in terms of effectiveness. We then introduce a two-stage model where a pre-selection of candidate blogs is followed by a ranking step. The model integrates aggressive pruning techniques as well as very lean representations of the contents of blog posts, resulting in substantial gains in efficiency while maintaining effectiveness at a very competitive level.

20.
In the field of scientometrics, impact indicators and ranking algorithms are frequently evaluated using unlabelled test data comprising relevant entities (e.g., papers, authors, or institutions) that are considered important. The rationale is that the higher some algorithm ranks these entities, the better its performance. To compute a performance score for an algorithm, an evaluation measure is required to translate the rank distribution of the relevant entities into a single-value performance score. Until recently, it was simply assumed that taking the average rank (of the relevant entities) is an appropriate evaluation measure when comparing ranking algorithms or fine-tuning algorithm parameters. With this paper we propose a framework for evaluating the evaluation measures themselves. Using this framework the following questions can now be answered: (1) which evaluation measure should be chosen for an experiment, and (2) given an evaluation measure and corresponding performance scores for the algorithms under investigation, how significant are the observed performance differences? Using two publication databases and four test data sets we demonstrate the functionality of the framework and analyse the stability and discriminative power of the most common information retrieval evaluation measures. We find that there is no clear winner and that the performance of the evaluation measures is highly dependent on the underlying data. Our results show that the average rank is indeed an adequate and stable measure. However, we also show that relatively large performance differences are required to confidently determine if one ranking algorithm is significantly superior to another. Lastly, we list alternative measures that also yield stable results and highlight measures that should not be used in this context.
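The average-rank measure discussed in the abstract above can be sketched in a few lines. The paper identifiers and the two example rankings are invented for illustration, and the function name `average_rank` is an assumption; how the paper treats ties or relevant entities missing from a ranking is not stated in the abstract, so this sketch simply rejects missing entities.

```python
def average_rank(ranking, relevant):
    """Mean 1-based position of the relevant entities in a ranking (lower is better)."""
    positions = {entity: rank for rank, entity in enumerate(ranking, start=1)}
    missing = [e for e in relevant if e not in positions]
    if missing:
        # Assumed policy: every relevant entity must appear in the ranking.
        raise ValueError(f"relevant entities not ranked: {sorted(missing)}")
    return sum(positions[e] for e in relevant) / len(relevant)


relevant = {"p1", "p3"}                 # entities considered important
ranking_a = ["p1", "p2", "p3", "p4"]    # output of a hypothetical algorithm A
ranking_b = ["p2", "p4", "p1", "p3"]    # output of a hypothetical algorithm B

score_a = average_rank(ranking_a, relevant)
score_b = average_rank(ranking_b, relevant)
print(score_a, score_b)
```

Algorithm A places the relevant papers at ranks 1 and 3, algorithm B at ranks 3 and 4, so A has the lower (better) average rank. Whether such a difference is significant is exactly the question the abstract's framework is meant to answer.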

Copyright © 北京勤云科技发展有限公司 (Beijing Qinyun Technology Development Co., Ltd.)  京ICP备09084417号