共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
3.
An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved in an initial search. Standard PRF approaches utilize the information contained in these top ranked documents from the initial search with the assumption that documents as a whole are relevant to the information need. However, in practice, documents are often multi-topical where only a portion of the documents may be relevant to the query. In this situation, exploitation of the topical composition of the top ranked documents, estimated with statistical topic modeling based approaches, can potentially be a useful cue to improve PRF effectiveness. The key idea behind our PRF method is to use the term-topic and the document-topic distributions obtained from topic modeling over the set of top ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our RF model can further be improved by making use of non-parametric topic modeling, where the number of topics can grow according to the document contents, thus giving the RF model the capability to adjust the number of topics based on the content of the top ranked documents. We empirically validate our topic model based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6-8 and the TREC Robust ’04 dataset, and (2) tweet retrieval using the TREC Microblog ’11 dataset. Results indicate that our proposed approach increases MAP by up to 9% in comparison to the results obtained with an LDA based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our proposed approach is shown to be more effective than its parametric counterpart due to its advantage of adapting the number of topics, improving results by up to 5.6% of MAP compared to the parametric version. 相似文献
4.
Multilingual information retrieval is generally understood to mean the retrieval of relevant information in multiple target
languages in response to a user query in a single source language. In a multilingual federated search environment, different
information sources contain documents in different languages. A general search strategy in multilingual federated search environments
is to translate the user query to each language of the information sources and run a monolingual search in each information
source. It is then necessary to obtain a single ranked document list by merging the individual ranked lists from the information
sources that are in different languages. This is known as the results merging problem for multilingual information retrieval.
Previous research has shown that the simple approach of normalizing source-specific document scores is not effective. On the
other side, a more effective merging method was proposed to download and translate all retrieved documents into the source
language and generate the final ranked list by running a monolingual search in the search client. The latter method is more
effective but is associated with a large amount of online communication and computation costs. This paper proposes an effective
and efficient approach for the results merging task of multilingual ranked lists. Particularly, it downloads only a small
number of documents from the individual ranked lists of each user query to calculate comparable document scores by utilizing
both the query-based translation method and the document-based translation method. Then, query-specific and source-specific
transformation models can be trained for individual ranked lists by using the information of these downloaded documents. These
transformation models are used to estimate comparable document scores for all retrieved documents and thus the documents can
be sorted into a final ranked list. This merging approach is efficient as only a subset of the retrieved documents are downloaded
and translated online. Furthermore, an extensive set of experiments on the Cross-Language Evaluation Forum (CLEF) () data has demonstrated the effectiveness of the query-specific and source-specific results merging algorithm against other
alternatives. The new research in this paper proposes different variants of the query-specific and source-specific results
merging algorithm with different transformation models. This paper also provides thorough experimental results as well as
detailed analysis. All of the work substantially extends the preliminary research in (Si and Callan, in: Peters (ed.) Results
of the cross-language evaluation forum-CLEF 2005, 2005).
相似文献
Hao YuanEmail: |
5.
6.
文章通过对常用中文搜索引擎中奥运运动相关术语的检索与结果比较,分析目前常用中文搜索引擎专业运动术语信息查询能力.为深层次的专业信息查询提供借鉴,并促进中文搜索引擎文献信息专业服务的发展. 相似文献
7.
丁一 《现代图书情报技术》2005,21(6):26-29
Web 信息检索(Information Retrieval)技术研究是应用文本检索研究的成果,它结合Web图论的思想,研究Web上的信息检索,是行之有效的Web知识发现的途径。传统HITS方法所获得的信息精确度相当低,而PageRank作为一通用的搜索方法,不能够应用于特定主题的信息获取。在充分分析了PageRank、HITS等现有算法和Web文档的相似度计算方法的基础上,提出了Web上查询特定主题相关信息发现的RG-HITS算法。它结合了Web超链接、网页知识表示的信息相关度以及HITS方法来搜索Web上特定主题的相关知识。 相似文献
8.
In this paper we evaluate the application of data fusion or meta-search methods, combining different algorithms and XML elements, to content-oriented retrieval of XML structured data. The primary approach is the combination of a probabilistic methods using Logistic regression and the Okapi BM-25 algorithm for estimation of document relevance or XML element relevance, in conjunction with Boolean approaches for some query elements. In the evaluation we use the INEX XML test collection to examine the relative performance of individual algorithms and elements and compare these to the performance of the data fusion approaches. 相似文献
9.
Christoph Baumgarten 《Information Retrieval》2000,3(3):253-271
This paper describes a probabilistic model for optimum information retrieval in a distributed heterogeneous environment.The model assumes the collection of documents offered by the environment to be partitioned into subcollections. Documents as well as subcollections have to be indexed, where indexing methods using different indexing vocabularies can be employed. A query provided by a user is answered in terms of a ranked list of documents. The model determines a procedure for ranking the documents that stems from the Probability Ranking Principle: For each subcollection, the subcollection's documents are ranked; the resulting ranked lists are combined into a final ranked list of documents, where the ordering is determined by the documents' probabilities of being relevant with respect to the user's query. Various probabilistic ranking methods may be involved in the distributed ranking process. A criterion for effectively limiting the ranking process to a subset of subcollections extends the model.The property that different ranking methods and indexing vocabularies can be used is important when the subcollections are heterogeneous with respect to their content.The model's applicability is experimentally confirmed. When exploiting the degrees of freedom provided by the model, experiments showed evidence that the model even outperforms comparable models for the non-distributed case with respect to retrieval effectiveness. 相似文献
10.
We present a system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages. The system is based on a process of document level alignments, where documents of different languages are paired according to their similarity. The resulting mapping allows us to produce a multilingual comparable corpus. Such a corpus has multiple interesting applications. It allows us to build a data structure for query translation in cross-language information retrieval (CLIR). Moreover, we also perform pseudo relevance feedback on the alignments to improve our retrieval results. And finally, multiple retrieval runs can be merged into one unified result list. The resulting system is inexpensive, adaptable to domain-specific collections and new languages and has performed very well at the TREC-7 conference CLIR system comparison. 相似文献
11.
User queries to the Web tend to have more than one interpretation due to their ambiguity and other characteristics. How to diversify the ranking results to meet users’ various potential information needs has attracted considerable attention recently. This paper is aimed at mining the subtopics of a query either indirectly from the returned results of retrieval systems or directly from the query itself to diversify the search results. For the indirect subtopic mining approach, clustering the retrieval results and summarizing the content of clusters is investigated. In addition, labeling topic categories and concept tags on each returned document is explored. For the direct subtopic mining approach, several external resources, such as Wikipedia, Open Directory Project, search query logs, and the related search services of search engines, are consulted. Furthermore, we propose a diversified retrieval model to rank documents with respect to the mined subtopics for balancing relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in the TREC09 and TREC10 Web Track diversity tasks. The best performance our proposed algorithm achieves is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on the TREC09, as well as α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on the TREC10. The results conclude that the subtopic mining technique with the up-to-date users’ search query logs is the most effective way to generate the subtopics of a query, and the proposed subtopic-based diversification algorithm can select the documents covering various subtopics. 相似文献
12.
13.
14.
Relevance feedback is an effective technique for improving search accuracy in interactive information retrieval. In this paper, we study an interesting optimization problem in interactive feedback that aims at optimizing the tradeoff between presenting search results with the highest immediate utility to a user (but not necessarily most useful for collecting feedback information) and presenting search results with the best potential for collecting useful feedback information (but not necessarily the most useful documents from a user’s perspective). Optimizing such an exploration–exploitation tradeoff is key to the optimization of the overall utility of relevance feedback to a user in the entire session of relevance feedback. We formally frame this tradeoff as a problem of optimizing the diversification of search results since relevance judgments on more diversified results have been shown to be more useful for relevance feedback. We propose a machine learning approach to adaptively optimizing the diversification of search results for each query so as to optimize the overall utility in an entire session. Experiment results on three representative retrieval test collections show that the proposed learning approach can effectively optimize the exploration–exploitation tradeoff and outperforms the traditional relevance feedback approach which only does exploitation without exploration. 相似文献
15.
As the volume and variety of information sources continues to grow, there is increasing difficulty with respect to obtaining
information that accurately matches user information needs. A number of factors affect information retrieval effectiveness
(the accuracy of matching user information needs against the retrieved information). First, users often do not present search
queries in the form that optimally represents their information need. Second, the measure of a document’s relevance is often
highly subjective between different users. Third, information sources might contain heterogeneous documents, in multiple formats
and the representation of documents is not unified. This paper discusses an approach for improvement of information retrieval
effectiveness from document databases. It is proposed that retrieval effectiveness can be improved by applying computational
intelligence techniques for modelling information needs, through interactive reinforcement learning. The method combines qualitative
(subjective) user relevance feedback with quantitative (algorithmic) measures of the relevance of retrieved documents. An
information retrieval is developed whose retrieval effectiveness is evaluated using traditional precision and recall. 相似文献
16.
17.
交互式跨语言信息检索是信息检索的一个重要分支。在分析交互式跨语言信息检索过程、评价指标、用户行为进展等理论研究基础上,设计一个让用户参与跨语言信息检索全过程的用户检索实验。实验结果表明:用户检索词主要来自检索主题的标题;用户判断文档相关性的准确率较高;目标语言文档全文、译文摘要、译文全文都是用户认可的判断依据;翻译优化方法以及翻译优化与查询扩展的结合方法在用户交互环境下非常有效;用户对于反馈后的翻译仍然愿意做进一步选择;用户对于与跨语言信息检索系统进行交互是有需求并认可的。用户行为分析有助于指导交互式跨语言信息检索系统的设计与实践。 相似文献
18.
Social tagging systems have gained increasing popularity as a method of annotating and categorizing a wide range of different web resources. Web search that utilizes social tagging data suffers from an extreme example of the vocabulary mismatch problem encountered in traditional information retrieval (IR). This is due to the personalized, unrestricted vocabulary that users choose to describe and tag each resource. Previous research has proposed the utilization of query expansion to deal with search in this rather complicated space. However, non-personalized approaches based on relevance feedback and personalized approaches based on co-occurrence statistics only showed limited improvements. This paper proposes a novel query expansion framework based on individual user profiles mined from the annotations and resources the user has marked. The underlying theory is to regularize the smoothness of word associations over a connected graph using a regularizer function on terms extracted from top-ranked documents. The intuition behind the model is the prior assumption of term consistency: the most appropriate expansion terms for a query are likely to be associated with, and influenced by terms extracted from the documents ranked highly for the initial query. The framework also simultaneously incorporates annotations and web documents through a Tag-Topic model in a latent graph. The experimental results suggest that the proposed personalized query expansion method can produce better results than both the classical non-personalized search approach and other personalized query expansion methods. Hence, the proposed approach significantly benefits personalized web search by leveraging users’ social media data. 相似文献
19.
Measuring Search Engine Quality 总被引:12,自引:3,他引:9
The effectiveness of twenty public search engines is evaluated using TREC-inspired methods and a set of 54 queries taken from real Web search logs. The World Wide Web is taken as the test collection and a combination of crawler and text retrieval system is evaluated. The engines are compared on a range of measures derivable from binary relevance judgments of the first seven live results returned. Statistical testing reveals a significant difference between engines and high intercorrelations between measures. Surprisingly, given the dynamic nature of the Web and the time elapsed, there is also a high correlation between results of this study and a previous study by Gordon and Pathak. For nearly all engines, there is a gradual decline in precision at increasing cutoff after some initial fluctuation. Performance of the engines as a group is found to be inferior to the group of participants in the TREC-8 Large Web task, although the best engines approach the median of those systems. Shortcomings of current Web search evaluation methodology are identified and recommendations are made for future improvements. In particular, the present study and its predecessors deal with queries which are assumed to derive from a need to find a selection of documents relevant to a topic. By contrast, real Web search reflects a range of other information need types which require different judging and different measures. 相似文献
20.
In the information retrieval process, functions that rank documents according to their estimated relevance to a query typically regard query terms as being independent. However, it is often the joint presence of query terms that is of interest to the user, which is overlooked when matching independent terms. One feature that can be used to express the relatedness of co-occurring terms is their proximity in text. In past research, models that are trained on the proximity information in a collection have performed better than models that are not estimated on data. We analyzed how co-occurring query terms can be used to estimate the relevance of documents based on their distance in text, which is used to extend a unigram ranking function with a proximity model that accumulates the scores of all occurring term combinations. This proximity model is more practical than existing models, since it does not require any co-occurrence statistics, it obviates the need to tune additional parameters, and has a retrieval speed close to competing models. We show that this approach is more robust than existing models, on both Web and newswire corpora, and on average performs equal or better than existing proximity models across collections. 相似文献