Similar literature
 Found 20 similar documents; search took 15 ms
1.
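Research on XML Search Engines   Total citations: 1, self-citations: 0, citations by others: 1
This paper first analyzes why traditional search engines have low precision, then introduces XML and the current state of research on XML search engines, and discusses in detail the key technologies involved, including document storage, indexing, and querying. On this basis, it designs an XML search engine model for the current network environment. The authors argue that the model can make full use of the DTD schema information of XML documents and can substantially improve query precision.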

2.
3.
An information retrieval (IR) system can often fail to retrieve relevant documents due to the incomplete specification of information need in the user’s query. Pseudo-relevance feedback (PRF) aims to improve IR effectiveness by exploiting potentially relevant aspects of the information need present in the documents retrieved in an initial search. Standard PRF approaches utilize the information contained in these top ranked documents from the initial search with the assumption that documents as a whole are relevant to the information need. However, in practice, documents are often multi-topical where only a portion of the documents may be relevant to the query. In this situation, exploitation of the topical composition of the top ranked documents, estimated with statistical topic modeling based approaches, can potentially be a useful cue to improve PRF effectiveness. The key idea behind our PRF method is to use the term-topic and the document-topic distributions obtained from topic modeling over the set of top ranked documents to re-rank the initially retrieved documents. The objective is to improve the ranks of documents that are primarily composed of the relevant topics expressed in the information need of the query. Our RF model can further be improved by making use of non-parametric topic modeling, where the number of topics can grow according to the document contents, thus giving the RF model the capability to adjust the number of topics based on the content of the top ranked documents. We empirically validate our topic model based RF approach on two document collections of diverse length and topical composition characteristics: (1) ad-hoc retrieval using the TREC 6-8 and the TREC Robust ’04 dataset, and (2) tweet retrieval using the TREC Microblog ’11 dataset. 
Results indicate that our proposed approach increases MAP by up to 9% in comparison to the results obtained with an LDA-based language model (for initial retrieval) coupled with the relevance model (for feedback). Moreover, the non-parametric version of our proposed approach is shown to be more effective than its parametric counterpart because it can adapt the number of topics, improving MAP by up to 5.6% compared to the parametric version.

4.
Multilingual information retrieval is generally understood to mean the retrieval of relevant information in multiple target languages in response to a user query in a single source language. In a multilingual federated search environment, different information sources contain documents in different languages. A general search strategy in multilingual federated search environments is to translate the user query to each language of the information sources and run a monolingual search in each information source. It is then necessary to obtain a single ranked document list by merging the individual ranked lists from the information sources that are in different languages. This is known as the results merging problem for multilingual information retrieval. Previous research has shown that the simple approach of normalizing source-specific document scores is not effective. On the other hand, a more effective merging method was proposed to download and translate all retrieved documents into the source language and generate the final ranked list by running a monolingual search in the search client. The latter method is more effective but is associated with a large amount of online communication and computation costs. This paper proposes an effective and efficient approach for the results merging task of multilingual ranked lists. In particular, it downloads only a small number of documents from the individual ranked lists of each user query to calculate comparable document scores by utilizing both the query-based translation method and the document-based translation method. Then, query-specific and source-specific transformation models can be trained for individual ranked lists by using the information of these downloaded documents. These transformation models are used to estimate comparable document scores for all retrieved documents and thus the documents can be sorted into a final ranked list. 
This merging approach is efficient as only a subset of the retrieved documents are downloaded and translated online. Furthermore, an extensive set of experiments on the Cross-Language Evaluation Forum (CLEF) data has demonstrated the effectiveness of the query-specific and source-specific results merging algorithm against other alternatives. The new research in this paper proposes different variants of the query-specific and source-specific results merging algorithm with different transformation models. This paper also provides thorough experimental results as well as detailed analysis. All of the work substantially extends the preliminary research in (Si and Callan, in: Peters (ed.) Results of the cross-language evaluation forum-CLEF 2005, 2005).
Hao Yuan
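
The merging step described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which transformation model is used, so a plain least-squares linear map is assumed here, and the names `fit_linear` and `merge_ranked_lists` are hypothetical.

```python
from typing import Dict, List, Tuple


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y ~ a * x + b (assumed transformation model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y  # degenerate sample: fall back to a constant
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var_x
    return a, mean_y - a * mean_x


def merge_ranked_lists(
    ranked_lists: Dict[str, List[Tuple[str, float]]],
    samples: Dict[str, List[Tuple[float, float]]],
) -> List[Tuple[str, float]]:
    """Map source-specific scores to comparable scores, then sort.

    ranked_lists: source -> [(doc_id, source_specific_score)]
    samples: source -> [(source_specific_score, comparable_score)] for the
             few documents that were downloaded and scored centrally.
    """
    merged: List[Tuple[str, float]] = []
    for source, docs in ranked_lists.items():
        xs = [s for s, _ in samples[source]]
        ys = [c for _, c in samples[source]]
        a, b = fit_linear(xs, ys)  # one model per query and per source
        merged.extend((doc, a * score + b) for doc, score in docs)
    return sorted(merged, key=lambda pair: pair[1], reverse=True)
```

Only the sampled documents need to be downloaded and translated; every other document gets an estimated comparable score from the fitted model.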

5.
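This paper discusses the problems of current search engines and the requirements for semantic functionality in search engines. Building on Web search engines and the Semantic Web, it proposes a functional flow diagram for search engines in the Semantic Web environment and analyzes search engine functions in terms of the crawler, ontologies and knowledge bases, semantic annotation, filtering and inference, semantic indexing, and semantic retrieval. Search engines in the Semantic Web environment will allow information and knowledge needs to be expressed and satisfied semantically with greater precision, and will promote efficient information and knowledge management.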

6.
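By searching for Olympic sports terms in commonly used Chinese search engines and comparing the results, this article analyzes how well these engines currently handle queries for specialized sports terminology. It offers a reference for deeper specialized information retrieval and aims to promote the development of specialized literature and information services in Chinese search engines.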

7.
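Research on Web information retrieval applies results from text retrieval research and combines them with ideas from Web graph theory to study information retrieval on the Web; it is an effective route to Web knowledge discovery. The traditional HITS method yields information of rather low precision, while PageRank, as a general-purpose search method, cannot be applied to retrieving information on a specific topic. Based on a thorough analysis of existing algorithms such as PageRank and HITS and of similarity measures for Web documents, the paper proposes the RG-HITS algorithm for discovering information related to a specific query topic on the Web. The algorithm combines Web hyperlinks, the information relevance derived from the knowledge representation of Web pages, and the HITS method to search for knowledge related to a specific topic on the Web.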
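
For orientation, the baseline that RG-HITS builds on can be sketched as below. This is plain HITS by power iteration, not RG-HITS itself, whose relevance weighting the abstract does not specify; the function name `hits` and the adjacency-list input format are assumptions made for the example.

```python
import math
from typing import Dict, List, Tuple


def hits(
    graph: Dict[str, List[str]], iterations: int = 50
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Standard HITS: returns (hub_scores, authority_scores).

    graph maps each page to the list of pages it links to.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score of a page: sum of authority scores of pages it links to.
        hub = {n: sum(auth[t] for t in graph.get(n, [])) for n in nodes}
        # L2-normalize both vectors so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

A topic-sensitive variant would presumably weight each link by the relevance of the pages involved, but that step is left out here because the source gives no formula for it.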

8.
In this paper we evaluate the application of data fusion or meta-search methods, combining different algorithms and XML elements, to content-oriented retrieval of XML structured data. The primary approach is the combination of a probabilistic method using logistic regression and the Okapi BM-25 algorithm for estimation of document relevance or XML element relevance, in conjunction with Boolean approaches for some query elements. In the evaluation we use the INEX XML test collection to examine the relative performance of individual algorithms and elements and compare these to the performance of the data fusion approaches.
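
As a rough illustration of score-level data fusion, the sketch below applies CombSUM with min-max normalization to two runs. CombSUM is a common fusion rule chosen here as an assumption; the abstract does not state which combination operator the authors used, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one run's scores to [0, 1] so runs become comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(runs: List[Dict[str, float]]) -> List[Tuple[str, float]]:
    """CombSUM: add up the normalized scores each document gets per run."""
    fused: Dict[str, float] = {}
    for run in runs:
        for doc, score in min_max(run).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)
```

Documents retrieved by several runs accumulate score from each, which is why fusion tends to favor items on which the individual algorithms agree.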

9.
This paper describes a probabilistic model for optimum information retrieval in a distributed heterogeneous environment. The model assumes the collection of documents offered by the environment to be partitioned into subcollections. Documents as well as subcollections have to be indexed, where indexing methods using different indexing vocabularies can be employed. A query provided by a user is answered in terms of a ranked list of documents. The model determines a procedure for ranking the documents that stems from the Probability Ranking Principle: For each subcollection, the subcollection's documents are ranked; the resulting ranked lists are combined into a final ranked list of documents, where the ordering is determined by the documents' probabilities of being relevant with respect to the user's query. Various probabilistic ranking methods may be involved in the distributed ranking process. A criterion for effectively limiting the ranking process to a subset of subcollections extends the model. The property that different ranking methods and indexing vocabularies can be used is important when the subcollections are heterogeneous with respect to their content. The model's applicability is experimentally confirmed. When exploiting the degrees of freedom provided by the model, experiments showed evidence that the model even outperforms comparable models for the non-distributed case with respect to retrieval effectiveness.
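
The final combination step can be sketched as a merge of already-sorted lists. The example assumes each subcollection has produced comparable probability-of-relevance estimates, which is the hard part the model addresses and is not shown here; `merge_by_probability` is a hypothetical name.

```python
import heapq
from typing import List, Tuple


def merge_by_probability(
    ranked_lists: List[List[Tuple[str, float]]]
) -> List[Tuple[str, float]]:
    """Merge per-subcollection rankings into one ranking.

    Each input list holds (doc_id, probability_of_relevance) pairs and
    must already be sorted by descending probability.
    """
    return list(heapq.merge(*ranked_lists, key=lambda pair: pair[1], reverse=True))
```

Because every input is already ranked, the merge never needs to re-sort the whole result set.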

10.
We present a system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages. The system is based on a process of document level alignments, where documents of different languages are paired according to their similarity. The resulting mapping allows us to produce a multilingual comparable corpus. Such a corpus has multiple interesting applications. It allows us to build a data structure for query translation in cross-language information retrieval (CLIR). Moreover, we also perform pseudo relevance feedback on the alignments to improve our retrieval results. And finally, multiple retrieval runs can be merged into one unified result list. The resulting system is inexpensive, adaptable to domain-specific collections and new languages and has performed very well at the TREC-7 conference CLIR system comparison.

11.
User queries to the Web tend to have more than one interpretation due to their ambiguity and other characteristics. How to diversify the ranking results to meet users’ various potential information needs has attracted considerable attention recently. This paper is aimed at mining the subtopics of a query either indirectly from the returned results of retrieval systems or directly from the query itself to diversify the search results. For the indirect subtopic mining approach, clustering the retrieval results and summarizing the content of clusters is investigated. In addition, labeling topic categories and concept tags on each returned document is explored. For the direct subtopic mining approach, several external resources, such as Wikipedia, Open Directory Project, search query logs, and the related search services of search engines, are consulted. Furthermore, we propose a diversified retrieval model to rank documents with respect to the mined subtopics for balancing relevance and diversity. Experiments are conducted on the ClueWeb09 dataset with the topics of the TREC09 and TREC10 Web Track diversity tasks. Experimental results show that the proposed subtopic-based diversification algorithm significantly outperforms the state-of-the-art models in the TREC09 and TREC10 Web Track diversity tasks. The best performance our proposed algorithm achieves is α-nDCG@5 0.307, IA-P@5 0.121, and α#-nDCG@5 0.214 on the TREC09, as well as α-nDCG@10 0.421, IA-P@10 0.201, and α#-nDCG@10 0.311 on the TREC10. The results indicate that the subtopic mining technique with the up-to-date users’ search query logs is the most effective way to generate the subtopics of a query, and the proposed subtopic-based diversification algorithm can select the documents covering various subtopics.
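
To make the α-nDCG figures easier to read, here is a small sketch of the unnormalized quantity behind them, α-DCG@k, in its usual formulation: each subtopic a document covers contributes a gain that shrinks by a factor of (1 - α) for every earlier document covering the same subtopic. Normalization by the ideal ranking is omitted, and the input format (one set of subtopic ids per ranked document) is an assumption for the example.

```python
import math
from typing import Dict, Hashable, List, Set


def alpha_dcg(ranking: List[Set[Hashable]], k: int, alpha: float = 0.5) -> float:
    """alpha-DCG@k for a ranking given as the subtopics each document covers."""
    seen: Dict[Hashable, int] = {}  # how often each subtopic was covered so far
    total = 0.0
    for rank, subtopics in enumerate(ranking[:k], start=1):
        # Redundant coverage of a subtopic is discounted geometrically.
        gain = sum((1 - alpha) ** seen.get(s, 0) for s in subtopics)
        total += gain / math.log2(rank + 1)
        for s in subtopics:
            seen[s] = seen.get(s, 0) + 1
    return total
```

With this measure, a ranking that covers a second subtopic early scores higher than one that repeats the first subtopic, which is the behavior a diversification algorithm is rewarded for.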

12.
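An Exploration of XML Information Retrieval   Total citations: 4, self-citations: 0, citations by others: 4
Liao Shumei (廖述梅), Wan Changxuan (万常选), Xu Shenghua (徐升华). 《情报学报》 (Journal of the China Society for Scientific and Technical Information), 2007, 381(2): 229-234
XML documents are semi-structured data with both hierarchical structure and textual content. Existing Web information retrieval is keyword-based full-text retrieval over HTML documents and cannot handle retrieval at the granularity of XML elements; meanwhile, XML database retrieval performs exact lookup and offers no ranking of results. Combining information retrieval and database techniques to study XML retrieval has therefore become necessary. Starting from the problem domain of XML retrieval, this paper describes the state and characteristics of XML information retrieval (XML IR) research in China and abroad, and analyzes the current hot topics and difficulties in XML IR.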

13.
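Design of a Lucene-Based FTP Search Engine   Total citations: 2, self-citations: 0, citations by others: 2
Database-backed FTP search engines currently in use have no standard resource documents and do not support Chinese word segmentation or dynamic data updates. To address these shortcomings, this paper proposes a design for an FTP search engine built on Lucene, a powerful full-text indexing toolkit. The engine automatically generates XML resource documents in a standard format and uses dictionary-based forward maximum matching for Chinese word segmentation to update the full-text index in Lucene dynamically. The design also supports mixed Chinese-English analysis and retrieval of search keywords.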
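
Dictionary-based forward maximum matching, the segmentation method named in this abstract, can be sketched in a few lines. The dictionary contents and the function name `forward_max_match` are illustrative assumptions, and the sketch is independent of Lucene.

```python
from typing import List, Optional, Set


def forward_max_match(
    text: str, dictionary: Set[str], max_len: Optional[int] = None
) -> List[str]:
    """Segment text left to right, always taking the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


words = {"搜索", "搜索引擎", "引擎", "设计"}
print(forward_max_match("搜索引擎设计", words))  # ['搜索引擎', '设计']
```

The greedy longest-match rule is fast and simple, but it can split ambiguous strings incorrectly, which is one reason real analyzers often combine it with other heuristics.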

14.
Relevance feedback is an effective technique for improving search accuracy in interactive information retrieval. In this paper, we study an interesting optimization problem in interactive feedback that aims at optimizing the tradeoff between presenting search results with the highest immediate utility to a user (but not necessarily most useful for collecting feedback information) and presenting search results with the best potential for collecting useful feedback information (but not necessarily the most useful documents from a user’s perspective). Optimizing such an exploration–exploitation tradeoff is key to the optimization of the overall utility of relevance feedback to a user in the entire session of relevance feedback. We formally frame this tradeoff as a problem of optimizing the diversification of search results since relevance judgments on more diversified results have been shown to be more useful for relevance feedback. We propose a machine learning approach to adaptively optimizing the diversification of search results for each query so as to optimize the overall utility in an entire session. Experimental results on three representative retrieval test collections show that the proposed learning approach can effectively optimize the exploration–exploitation tradeoff and outperforms the traditional relevance feedback approach which only does exploitation without exploration.

15.
As the volume and variety of information sources continues to grow, it becomes increasingly difficult to obtain information that accurately matches user information needs. A number of factors affect information retrieval effectiveness (the accuracy of matching user information needs against the retrieved information). First, users often do not present search queries in the form that optimally represents their information need. Second, the measure of a document’s relevance is often highly subjective between different users. Third, information sources might contain heterogeneous documents, in multiple formats and the representation of documents is not unified. This paper discusses an approach for improvement of information retrieval effectiveness from document databases. It is proposed that retrieval effectiveness can be improved by applying computational intelligence techniques for modelling information needs, through interactive reinforcement learning. The method combines qualitative (subjective) user relevance feedback with quantitative (algorithmic) measures of the relevance of retrieved documents. An information retrieval system is developed whose retrieval effectiveness is evaluated using traditional precision and recall.

16.
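Wang Shumei (王树梅), Jin Peng (金澎), Wu Huizhong (吴慧中). 《情报学报》 (Journal of the China Society for Scientific and Technical Information), 2003, 22(5): 557-560
How to weight index terms is a key technique in information retrieval. Because search engines retrieve Web pages, this paper analyzes the presentational role of HTML tags and combines it with the traditional tf-idf weighting formula to build a weighted index of Web pages. Experiments show that, for exact matching, the system's precision improves considerably at low recall levels.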
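
One way to combine HTML tag emphasis with tf-idf is sketched below. The tag weights in `TAG_WEIGHTS` are purely hypothetical values, since the abstract does not give the ones used in the paper, and the input format (each page as a list of `(tag, term)` pairs) is an assumption for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical emphasis weights; the source does not specify actual values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "body": 1.0}


def weighted_tf(page: List[Tuple[str, str]]) -> Dict[str, float]:
    """Term frequency where each occurrence counts by its enclosing tag."""
    tf: Dict[str, float] = defaultdict(float)
    for tag, term in page:
        tf[term] += TAG_WEIGHTS.get(tag, 1.0)
    return dict(tf)


def tag_weighted_tf_idf(pages: List[List[Tuple[str, str]]]) -> List[Dict[str, float]]:
    """Multiply the tag-weighted tf by the usual idf = log(N / df)."""
    n = len(pages)
    tfs = [weighted_tf(page) for page in pages]
    df: Dict[str, int] = defaultdict(int)
    for tf in tfs:
        for term in tf:
            df[term] += 1
    return [{t: w * math.log(n / df[t]) for t, w in tf.items()} for tf in tfs]
```

A term that appears in a title therefore outweighs the same term in running text, while terms present in every page still receive an idf of zero.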

17.
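Interactive cross-language information retrieval is an important branch of information retrieval. Based on an analysis of theoretical research on the interactive cross-language retrieval process, evaluation measures, and user behavior, this paper designs a user search experiment in which users take part in the entire cross-language retrieval process. The results show that: users' search terms come mainly from the title of the search topic; users judge document relevance with fairly high accuracy; the full text of target-language documents, translated abstracts, and translated full text are all accepted by users as a basis for judgment; translation refinement, alone and combined with query expansion, is very effective in an interactive setting; users remain willing to make further selections among translations after feedback; and users both need and welcome interaction with a cross-language retrieval system. Analysis of user behavior helps guide the design and practice of interactive cross-language information retrieval systems.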

18.
Social tagging systems have gained increasing popularity as a method of annotating and categorizing a wide range of different web resources. Web search that utilizes social tagging data suffers from an extreme example of the vocabulary mismatch problem encountered in traditional information retrieval (IR). This is due to the personalized, unrestricted vocabulary that users choose to describe and tag each resource. Previous research has proposed the utilization of query expansion to deal with search in this rather complicated space. However, non-personalized approaches based on relevance feedback and personalized approaches based on co-occurrence statistics only showed limited improvements. This paper proposes a novel query expansion framework based on individual user profiles mined from the annotations and resources the user has marked. The underlying theory is to regularize the smoothness of word associations over a connected graph using a regularizer function on terms extracted from top-ranked documents. The intuition behind the model is the prior assumption of term consistency: the most appropriate expansion terms for a query are likely to be associated with, and influenced by, terms extracted from the documents ranked highly for the initial query. The framework also simultaneously incorporates annotations and web documents through a Tag-Topic model in a latent graph. The experimental results suggest that the proposed personalized query expansion method can produce better results than both the classical non-personalized search approach and other personalized query expansion methods. Hence, the proposed approach significantly benefits personalized web search by leveraging users’ social media data.

19.
Measuring Search Engine Quality   Total citations: 12, self-citations: 3, citations by others: 9
The effectiveness of twenty public search engines is evaluated using TREC-inspired methods and a set of 54 queries taken from real Web search logs. The World Wide Web is taken as the test collection and a combination of crawler and text retrieval system is evaluated. The engines are compared on a range of measures derivable from binary relevance judgments of the first seven live results returned. Statistical testing reveals a significant difference between engines and high intercorrelations between measures. Surprisingly, given the dynamic nature of the Web and the time elapsed, there is also a high correlation between results of this study and a previous study by Gordon and Pathak. For nearly all engines, there is a gradual decline in precision at increasing cutoff after some initial fluctuation. Performance of the engines as a group is found to be inferior to the group of participants in the TREC-8 Large Web task, although the best engines approach the median of those systems. Shortcomings of current Web search evaluation methodology are identified and recommendations are made for future improvements. In particular, the present study and its predecessors deal with queries which are assumed to derive from a need to find a selection of documents relevant to a topic. By contrast, real Web search reflects a range of other information need types which require different judging and different measures.
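
Precision at a cutoff, one of the measures that can be derived from binary judgments of the first seven results, is simple enough to sketch directly. The judgment list in the usage lines is made-up illustrative data, not data from the study.

```python
from typing import List


def precision_at_k(judgments: List[int], k: int) -> float:
    """Fraction of the top k results judged relevant (1) rather than not (0).

    Missing results below rank len(judgments) count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(judgments[:k]) / k


# Hypothetical judgments for the first seven results of one query.
example = [1, 1, 0, 1, 0, 0, 0]
curve = [precision_at_k(example, k) for k in range(1, 8)]
```

Computing the value for k = 1 through 7 gives the kind of precision-versus-cutoff curve whose gradual decline the abstract describes.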

20.
In the information retrieval process, functions that rank documents according to their estimated relevance to a query typically regard query terms as being independent. However, it is often the joint presence of query terms that is of interest to the user, which is overlooked when matching independent terms. One feature that can be used to express the relatedness of co-occurring terms is their proximity in text. In past research, models that are trained on the proximity information in a collection have performed better than models that are not estimated on data. We analyzed how co-occurring query terms can be used to estimate the relevance of documents based on their distance in text, which is used to extend a unigram ranking function with a proximity model that accumulates the scores of all occurring term combinations. This proximity model is more practical than existing models, since it does not require any co-occurrence statistics, it obviates the need to tune additional parameters, and has a retrieval speed close to competing models. We show that this approach is more robust than existing models, on both Web and newswire corpora, and on average performs as well as or better than existing proximity models across collections.
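
The general idea of accumulating a distance-based score over all co-occurring query term pairs can be illustrated as follows. The abstract does not give the actual scoring function, so the inverse-distance kernel used here is an assumption chosen only because it needs no collection statistics or tuned parameters; `proximity_score` is a hypothetical name.

```python
from itertools import combinations
from typing import List, Set


def proximity_score(doc_tokens: List[str], query_terms: Set[str]) -> float:
    """Sum 1 / distance over every pair of occurrences of distinct query terms."""
    positions = [(i, tok) for i, tok in enumerate(doc_tokens) if tok in query_terms]
    score = 0.0
    # positions is in text order, so j > i for every pair below.
    for (i, a), (j, b) in combinations(positions, 2):
        if a != b:
            score += 1.0 / (j - i)
    return score
```

In a full ranking function this value would be added to a unigram score such as a language-model or BM25 estimate, so that documents where the query terms appear close together are promoted.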
