首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Interdocument similarities are the fundamental information source required in cluster-based retrieval, which is an advanced retrieval approach that significantly improves performance during information retrieval (IR). An effective similarity metric is query-sensitive similarity, which was introduced by Tombros and Rijsbergen as method to more directly satisfy the cluster hypothesis that forms the basis of cluster-based retrieval. Although this method is reported to be effective, existing applications of query-specific similarity are still limited to vector space models wherein there is no connection to probabilistic approaches. We suggest a probabilistic framework that defines query-sensitive similarity based on probabilistic co-relevance, where the similarity between two documents is proportional to the probability that they are both co-relevant to a specific given query. We further simplify the proposed co-relevance-based similarity by decomposing it into two separate relevance models. We then formulate all the requisite components for the proposed similarity metric in terms of scoring functions used by language modeling methods. Experimental results obtained using standard TREC test collections consistently showed that the proposed query-sensitive similarity measure performs better than term-based similarity and existing query-sensitive similarity in the context of Voorhees’ nearest neighbor test (NNT).  相似文献   

2.
Contextual document clustering is a novel approach which uses information theoretic measures to cluster semantically related documents bound together by an implicit set of concepts or themes of narrow specificity. It facilitates cluster-based retrieval by assessing the similarity between a query and the cluster themes’ probability distribution. In this paper, we assess a relevance feedback mechanism, based on query refinement, that modifies the query’s probability distribution using a small number of documents that have been judged relevant to the query. We demonstrate that by providing only one relevance judgment, a performance improvement of 33% was obtained.  相似文献   

3.
4.
The indirect retrieval method proposed by Goffman is outlined and some similarities to other retrieval methods are indicated. The method is then evaluated and the results are compared with those obtained on the same document collection with cluster-based retrieval using single-link clustering.The comparisons show that although the effectiveness of the indirect retrieval method can be comparable to cluster-based retrieval, the efficiency is lower.  相似文献   

5.
To resolve some of lexical disagreement problems between queries and FAQs, we propose a reliable FAQ retrieval system using query log clustering. On indexing time, the proposed system clusters the logs of users’ queries into predefined FAQ categories. To increase the precision and the recall rate of clustering, the proposed system adopts a new similarity measure using a machine readable dictionary. On searching time, the proposed system calculates the similarities between users’ queries and each cluster in order to smooth FAQs. By virtue of the cluster-based retrieval technique, the proposed system could partially bridge lexical chasms between queries and FAQs. In addition, the proposed system outperforms the traditional information retrieval systems in FAQ retrieval.  相似文献   

6.
The term mismatch problem in information retrieval is a critical problem, and several techniques have been developed, such as query expansion, cluster-based retrieval and dimensionality reduction to resolve this issue. Of these techniques, this paper performs an empirical study on query expansion and cluster-based retrieval. We examine the effect of using parsimony in query expansion and the effect of clustering algorithms in cluster-based retrieval. In addition, query expansion and cluster-based retrieval are compared, and their combinations are evaluated in terms of retrieval performance by performing experimentations on seven test collections of NTCIR and TREC.  相似文献   

7.
How to merge and organise query results retrieved from different resources is one of the key issues in distributed information retrieval. Some previous research and experiments suggest that cluster-based document browsing is more effective than a single merged list. Cluster-based retrieval results presentation is based on the cluster hypothesis, which states that documents that cluster together have a similar relevance to a given query. However, while this hypothesis has been demonstrated to hold in classical information retrieval environments, it has never been fully tested in heterogeneous distributed information retrieval environments. Heterogeneous document representations, the presence of document duplicates, and disparate qualities of retrieval results, are major features of an heterogeneous distributed information retrieval environment that might disrupt the effectiveness of the cluster hypothesis. In this paper we report on an experimental investigation into the validity and effectiveness of the cluster hypothesis in highly heterogeneous distributed information retrieval environments. The results show that although clustering is affected by different retrieval results representations and quality, the cluster hypothesis still holds and that generating hierarchical clusters in highly heterogeneous distributed information retrieval environments is still a very effective way of presenting retrieval results to users.  相似文献   

8.
Term weighting for document ranking and retrieval has been an important research topic in information retrieval for decades. We propose a novel term weighting method based on a hypothesis that a term’s role in accumulated retrieval sessions in the past affects its general importance regardless. It utilizes availability of past retrieval results consisting of the queries that contain a particular term, retrieved documents, and their relevance judgments. A term’s evidential weight, as we propose in this paper, depends on the degree to which the mean frequency values for the relevant and non-relevant document distributions in the past are different. More precisely, it takes into account the rankings and similarity values of the relevant and non-relevant documents. Our experimental result using standard test collections shows that the proposed term weighting scheme improves conventional TF*IDF and language model based schemes. It indicates that evidential term weights bring in a new aspect of term importance and complement the collection statistics based on TF*IDF. We also show how the proposed term weighting scheme based on the notion of evidential weights are related to the well-known weighting schemes based on language modeling and probabilistic models.  相似文献   

9.
一种基于主题和分众分类的信息检索优化方法   总被引:1,自引:0,他引:1  
本文针对目前搜索引擎存在的检索结果缺乏组织导致检准率不高的问题,提出一种基于主题和分众分类的信息检索优化方法.首先对用户检索主题进行获取和表达,然后以社会标签为聚类项,采用向量空间模型实现基于分众分类的文档主题聚类,并将检索结果按相似度和标签"受欢迎度"复合排序,达到提高检索准确率和优化检索的效果.  相似文献   

10.
The Internet, together with the large amount of textual information available in document archives, has increased the relevance of information retrieval related tools. In this work we present an extension of the Gambal system for clustering and visualization of documents based on fuzzy clustering techniques. The tool allows to structure the set of documents in a hierarchical way (using a fuzzy hierarchical structure) and represent this structure in a graphical interface (a 3D sphere) over which the user can navigate.Gambal allows the analysis of the documents and the computation of their similarity not only on the basis of the syntactic similarity between words but also based on a dictionary (Wordnet 1.7) and latent semantics analysis.  相似文献   

11.
Conventional information retrieval technology (i.e. VSM) faces many difficulties when being implemented in complex P2P systems for the lack of global statistic information (e.g. IDF) and central services. In this paper, we suggest a novel query optimization scheme (Semantic Dual Query Expansion, SDQE) that makes full use of the context information supplied by the local document collection. Latent Semantic Indexing (LSI) is used to explore the local context information. By comparing the different local context information hidden in different document collections, it is possible to solve the synonymy–polysemy problem in VSM. The experiments prove that our scheme is effective to improve the retrieval performance in P2P systems without knowing the global statistic information.  相似文献   

12.
This paper presents a robust and comprehensive graph-based rank aggregation approach, used to combine results of isolated ranker models in retrieval tasks. The method follows an unsupervised scheme, which is independent of how the isolated ranks are formulated. Our approach is able to combine arbitrary models, defined in terms of different ranking criteria, such as those based on textual, image or hybrid content representations.We reformulate the ad-hoc retrieval problem as a document retrieval based on fusion graphs, which we propose as a new unified representation model capable of merging multiple ranks and expressing inter-relationships of retrieval results automatically. By doing so, we claim that the retrieval system can benefit from learning the manifold structure of datasets, thus leading to more effective results. Another contribution is that our graph-based aggregation formulation, unlike existing approaches, allows for encapsulating contextual information encoded from multiple ranks, which can be directly used for ranking, without further computations and post-processing steps over the graphs. Based on the graphs, a novel similarity retrieval score is formulated using an efficient computation of minimum common subgraphs. Finally, another benefit over existing approaches is the absence of hyperparameters.A comprehensive experimental evaluation was conducted considering diverse well-known public datasets, composed of textual, image, and multimodal documents. Performed experiments demonstrate that our method reaches top performance, yielding better effectiveness scores than state-of-the-art baseline methods and promoting large gains over the rankers being fused, thus demonstrating the successful capability of the proposal in representing queries based on a unified graph-based model of rank fusions.  相似文献   

13.
We are interested in how ideas from document clustering can be used to improve the retrieval accuracy of ranked lists in interactive systems. In particular, we are interested in ways to evaluate the effectiveness of such systems to decide how they might best be constructed. In this study, we construct and evaluate systems that present the user with ranked lists and a visualization of inter-document similarities. We first carry out a user study to evaluate the clustering/ranked list combination on instance-oriented retrieval, the task of the TREC-6 Interactive Track. We find that although users generally prefer the combination, they are not able to use it to improve effectiveness. In the second half of this study, we develop and evaluate an approach that more directly combines the ranked list with information from inter-document similarities. Using the TREC collections and relevance judgments, we show that it is possible to realize substantial improvements in effectiveness by doing so, and that although users can use the combined information effectively, the system can provide hints that substantially improve on the user's solo effort. The resulting approach shares much in common with an interactive application of incremental relevance feedback. Throughout this study, we illustrate our work using two prototype systems constructed for these evaluations. The first, AspInQuery, is a classic information retrieval system augmented with a specialized tool for recording information about instances of relevance. The other system, Lighthouse, is a Web-based application that combines a ranked list with a portrayal of inter-document similarity. Lighthouse can work with collections such as TREC, as well as the results of Web search engines.  相似文献   

14.
The inverted file is the most popular indexing mechanism for document search in an information retrieval system. Compressing an inverted file can greatly improve document search rate. Traditionally, the d-gap technique is used in the inverted file compression by replacing document identifiers with usually much smaller gap values. However, fluctuating gap values cannot be efficiently compressed by some well-known prefix-free codes. To smoothen and reduce the gap values, we propose a document-identifier reassignment algorithm. This reassignment is based on a similarity factor between documents. We generate a reassignment order for all documents according to the similarity to reassign closer identifiers to the documents having closer relationships. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and the compression rate of d-gapped inverted file with prefix-free codes can be improved by 15%.  相似文献   

15.
Language modeling is an effective and theoretically attractive probabilistic framework for text information retrieval. The basic idea of this approach is to estimate a language model of a given document (or document set), and then do retrieval or classification based on this model. A common language modeling approach assumes the data D is generated from a mixture of several language models. The core problem is to find the maximum likelihood estimation of one language model mixture, given the fixed mixture weights and the other language model mixture. The EM algorithm is usually used to find the solution.  相似文献   

16.
Hierarchic document clustering has been widely applied to information retrieval (IR) on the grounds of its potential improved effectiveness over inverted file search (IFS). However, previous research has been inconclusive as to whether clustering does bring improvements. In this paper we take the view that if hierarchic clustering is applied to search results (query-specific clustering), then it has the potential to increase the retrieval effectiveness compared both to that of static clustering and of conventional IFS. We conducted a number of experiments using five document collections and four hierarchic clustering methods. Our results show that the effectiveness of query-specific clustering is indeed higher, and suggest that there is scope for its application to IR.  相似文献   

17.
In this paper, we propose a re-ranking algorithm using post-retrieval clustering for content-based image retrieval (CBIR). In conventional CBIR systems, it is often observed that images visually dissimilar to a query image are ranked high in retrieval results. To remedy this problem, we utilize the similarity relationship of the retrieved results via post-retrieval clustering. In the first step of our method, images are retrieved using visual features such as color histogram. Next, the retrieved images are analyzed using hierarchical agglomerative clustering methods (HACM) and the rank of the results is adjusted according to the distance of a cluster from a query. In addition, we analyze the effects of clustering methods, query-cluster similarity functions, and weighting factors in the proposed method. We conducted a number of experiments using several clustering methods and cluster parameters. Experimental results show that the proposed method achieves an improvement of retrieval effectiveness of over 10% on average in the average normalized modified retrieval rank (ANMRR) measure.  相似文献   

18.
The large amount of information available and the difficulty on processing it has made knowledge management a promising area of research. Several topics are related to it, for example distributed and intelligent information retrieval, information filtering and information evaluation, which became crucial. In this paper, we focus our attention on the knowledge evaluation problem. With the aim of evaluating information coded in the standard non-proprietary format SGML (as also in XML), we propose some evaluation methods based on L-grammars which are fuzzy grammars. In particular we apply these methods to the evaluation of documents in SGML-format and to the evaluation of HTML-pages in the World Wide Web. L-grammars generate recursively enumerable L-languages, as it has been proved in Gerla ((1991), Information Sciences 53), and so they can be used to generate fuzzy languages based on extensions of the document type definitions (DTD) involved by SGML. Given a DTD, we extend its associated language by adding a judgement label. By selecting a particular label and by taking the start symbol of the grammar associated to the DTD, we can generate any DTD-compliant document with a fuzzy degree of membership derived from the judgement label. In this way we fit the computational model underlying the recursively enumerable L-languages to the process of collecting different evaluations of the same document. Finally, we outline how the generalization of these methods of evaluation can be applied in different contexts and for different roles, as for example for information filtering.  相似文献   

19.
Privacy-preserving collaborative filtering is an emerging web-adaptation tool to cope with information overload problem without jeopardizing individuals’ privacy. However, collaborative filtering with privacy schemes commonly suffer from scalability and sparseness as the content in the domain proliferates. Moreover, applying privacy measures causes a distortion in collected data, which in turn defects accuracy of such systems. In this work, we propose a novel privacy-preserving collaborative filtering scheme based on bisecting k-means clustering in which we apply two preprocessing methods. The first preprocessing scheme deals with scalability problem by constructing a binary decision tree through a bisecting k-means clustering approach while the second produces clones of users by inserting pseudo-self-predictions into original user profiles to boost accuracy of scalability-enhanced structure. Sparse nature of collections are handled by transforming ratings into item features-based profiles. After analyzing our scheme with respect to privacy and supplementary costs, we perform experiments on benchmark data sets to evaluate it in terms of accuracy and online performance. Our empirical outcomes verify that combined effects of the proposed preprocessing schemes relieve scalability and augment accuracy significantly.  相似文献   

20.
Clusters of queries submitted to a given information retrieval system can be used as a basis for an effective method of clustering documents. This indirect procedure of document clustering requires the availability of a similarity measure for queries. Research carried out along these lines has resulted in the development of some methodologies for estimating such query similarities, applicable both in the case of queries characterized by sets of weighted or unweighted index terms and in the case of queries represented by Boolean combinations of index terms. This paper reports the results of further research by the author into a methodology of the latter type, i.e. a methodology for determining the similarity between queries characterized by Boolean search request formulations. The novelty of the presented approach, as compared with the methodology introduced in an earlier paper by the author, is that some relations among index terms are now taken into account. A number of similarity measures for Boolean combinations of index terms are discussed here in some detail. The rationale behind these measures is outlined, and the conditions to be met for ensuring their equivalence are identified. Moreover, the results of an experiment concerning two of the similarity measures introduced are given.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号