首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short) parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits in using the document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.  相似文献   

2.
Evaluating the effectiveness of content-oriented XML retrieval methods   总被引:1,自引:0,他引:1  
Content-oriented XML retrieval approaches aim at a more focused retrieval strategy: Instead of retrieving whole documents, document components that are exhaustive to the information need while at the same time being as specific as possible should be retrieved. In this article, we show that the evaluation methods developed for standard retrieval must be modified in order to deal with the structure of XML documents. More precisely, the size and overlap of document components must be taken into account. For this purpose, we propose a new effectiveness metric based on the definition of a concept space defined upon the notions of exhaustiveness and specificity of a search result. We compare the results of this new metric by the results obtained with the official metric used in INEX, the evaluation initiative for content-oriented XML retrieval.
Gabriella KazaiEmail:
  相似文献   

3.
Most recent document standards like XML rely on structured representations. On the other hand, current information retrieval systems have been developed for flat document representations and cannot be easily extended to cope with more complex document types. The design of such systems is still an open problem. We present a new model for structured document retrieval which allows computing scores of document parts. This model is based on Bayesian networks whose conditional probabilities are learnt from a labelled collection of structured documents—which is composed of documents, queries and their associated assessments. Training these models is a complex machine learning task and is not standard. This is the focus of the paper: we propose here to train the structured Bayesian Network model using a cross-entropy training criterion. Results are presented on the INEX corpus of XML documents.  相似文献   

4.
Information Retrieval from Documents: A Survey   总被引:4,自引:0,他引:4  
Given the phenomenal growth in the variety and quantity of data available to users through electronic media, there is a great demand for efficient and effective ways to organize and search through all this information. Besides speech, our principal means of communication is through visual media, and in particular, through documents. In this paper, we provide an update on Doermann's comprehensive survey (1998) of research results in the broad area of document-based information retrieval. The scope of this survey is also somewhat broader, and there is a greater emphasis on relating document image analysis methods to conventional IR methods.Documents are available in a wide variety of formats. Technical papers are often available as ASCII files of clean, correct, text. Other documents may only be available as hardcopies. These documents have to be scanned and stored as images so that they may be processed by a computer. The textual content of these documents may also be extracted and recognized using OCR methods. Our survey covers the broad spectrum of methods that are required to handle different formats like text and images. The core of the paper focuses on methods that manipulate document images directly, and perform various information processing tasks such as retrieval, categorization, and summarization, without attempting to completely recognize the textual content of the document. We start, however, with a brief overview of traditional IR techniques that operate on clean text. We also discuss research dealing with text that is generated by running OCR on document images. Finally, we also briefly touch on the related problem of content-based image retrieval.  相似文献   

5.
We adapt the cluster hypothesis for score-based information retrieval by claiming that closely related documents should have similar scores. Given a retrieval from an arbitrary system, we describe an algorithm which directly optimizes this objective by adjusting retrieval scores so that topically related documents receive similar scores. We refer to this process as score regularization. Because score regularization operates on retrieval scores, regardless of their origin, we can apply the technique to arbitrary initial retrieval rankings. Document rankings derived from regularized scores, when compared to rankings derived from un-regularized scores, consistently and significantly result in improved performance given a variety of baseline retrieval algorithms. We also present several proofs demonstrating that regularization generalizes methods such as pseudo-relevance feedback, document expansion, and cluster-based retrieval. Because of these strong empirical and theoretical results, we argue for the adoption of score regularization as general design principle or post-processing step for information retrieval systems.
Fernando DiazEmail:
  相似文献   

6.
Document clustering of scientific texts using citation contexts   总被引:3,自引:0,他引:3  
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document’s textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. More specifically, this document representation strategy when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets.  相似文献   

7.
在海量信息中检索时,与用户查询相关的信息常常被漏掉,而与查询无关的信息———信息垃圾,却大量地出现在检索结果中。改进文本信息检索系统的质量,提高检索效能,已成为亟待解决的问题。本文针对能够影响检索效力的一个易被忽略的因素———修饰语,研究其在文本信息检索中的作用。为此,构建了修正的向量空间模型(Modified Vector Space Model,MVSM),并以英文文本进行试验,进而说明修饰语的作用。  相似文献   

8.
As the volume and variety of information sources continues to grow, there is increasing difficulty with respect to obtaining information that accurately matches user information needs. A number of factors affect information retrieval effectiveness (the accuracy of matching user information needs against the retrieved information). First, users often do not present search queries in the form that optimally represents their information need. Second, the measure of a document’s relevance is often highly subjective between different users. Third, information sources might contain heterogeneous documents, in multiple formats and the representation of documents is not unified. This paper discusses an approach for improvement of information retrieval effectiveness from document databases. It is proposed that retrieval effectiveness can be improved by applying computational intelligence techniques for modelling information needs, through interactive reinforcement learning. The method combines qualitative (subjective) user relevance feedback with quantitative (algorithmic) measures of the relevance of retrieved documents. An information retrieval is developed whose retrieval effectiveness is evaluated using traditional precision and recall.  相似文献   

9.
Information Retrieval Systeme haben in den letzten Jahren nur geringe Verbesserungen in der Retrieval Performance erzielt. Wir arbeiten an neuen Ans?tzen, dem sogenannten Collaborativen Information Retrieval (CIR), die das Potential haben, starke Verbesserungen zu erreichen. CIR ist die Methode, mit der durch Ausnutzen von Informationen aus früheren Anfragen die Retrieval Peformance für die aktuelle Anfrage verbessert wird. Wir haben ein eingeschr?nktes Szenario, in dem nur alte Anfragen und dazu relevante Antwortdokumente zur Verfügung stehen. Neue Ans?tze für Methoden der Query Expansion führen unter diesen Bedingungen zu Verbesserungen der Retrieval Performance . The accuracy of ad-hoc document retrieval systems has reached a stable plateau in the last few years. We are working on so-called collaborative information retrieval (CIR) systems which have the potential to overcome the current limits. We define CIR as a task, where an information retrieval (IR) system uses information gathered from previous search processes from one or several users to improve retrieval performance for the current user searching for information. We focus on a restricted setting in CIR in which only old queries and correct answer documents to these queries are available for improving a new query. For this restricted setting we propose new approaches for query expansion procedures. We show how CIR methods can improve overall IR performance.
CR Subject Classification H.3.3  相似文献   

10.
Genetic Approach to Query Space Exploration   总被引:2,自引:0,他引:2  
This paper describes a genetic algorithm approach for intelligent information retrieval. The goal is to find an optimal set of documents which best matches the user's needs by exploring and exploiting the document space. More precisely, we define a specific genetic algorithm for information retrieval based on knowledge based operators and guided by a heuristic for relevance multi-modality problem solving. Experiments with TREC-6 French data and queries show the effectiveness of our approach.  相似文献   

11.
Automatic detection of source code plagiarism is an important research field for both the commercial software industry and within the research community. Existing methods of plagiarism detection primarily involve exhaustive pairwise document comparison, which does not scale well for large software collections. To achieve scalability, we approach the problem from an information retrieval (IR) perspective. We retrieve a ranked list of candidate documents in response to a pseudo-query representation constructed from each source code document in the collection. The challenge in source code document retrieval is that the standard bag-of-words (BoW) representation model for such documents is likely to result in many false positives being retrieved, because of the use of identical programming language specific constructs and keywords. To address this problem, we make use of an abstract syntax tree (AST) representation of the source code documents. While the IR approach is efficient, it is essentially unsupervised in nature. To further improve its effectiveness, we apply a supervised classifier (pre-trained with features extracted from sample plagiarized source code pairs) on the top ranked retrieved documents. We report experiments on the SOCO-2014 dataset comprising 12K Java source files with almost 1M lines of code. Our experiments confirm that the AST based approach produces significantly better retrieval effectiveness than a standard BoW representation, i.e., the AST based approach is able to identify a higher number of plagiarized source code documents at top ranks in response to a query source code document. The supervised classifier, trained on features extracted from sample plagiarized source code pairs, is shown to effectively filter and thus further improve the ranked list of retrieved candidate plagiarized documents.  相似文献   

12.
This paper presents four novel techniques for open-vocabulary spoken document retrieval: a method to detect slots that possibly contain a query feature; a method to estimate occurrence probabilities; a technique that we call collection-wide probability re-estimation and a weighting scheme which takes advantage of the fact that long query features are detected more reliably. These four techniques have been evaluated using the TREC-6 spoken document retrieval test collection to determine the improvements in retrieval effectiveness with respect to a baseline retrieval method. Results show that the retrieval effectiveness can be improved considerably despite the large number of speech recognition errors.  相似文献   

13.
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to using single documents to this end.
Oren KurlandEmail:
  相似文献   

14.
We present a system for multilingual information retrieval that allows users to formulate queries in their preferred language and retrieve relevant information from a collection containing documents in multiple languages. The system is based on a process of document level alignments, where documents of different languages are paired according to their similarity. The resulting mapping allows us to produce a multilingual comparable corpus. Such a corpus has multiple interesting applications. It allows us to build a data structure for query translation in cross-language information retrieval (CLIR). Moreover, we also perform pseudo relevance feedback on the alignments to improve our retrieval results. And finally, multiple retrieval runs can be merged into one unified result list. The resulting system is inexpensive, adaptable to domain-specific collections and new languages and has performed very well at the TREC-7 conference CLIR system comparison.  相似文献   

15.
Multilingual information retrieval is generally understood to mean the retrieval of relevant information in multiple target languages in response to a user query in a single source language. In a multilingual federated search environment, different information sources contain documents in different languages. A general search strategy in multilingual federated search environments is to translate the user query to each language of the information sources and run a monolingual search in each information source. It is then necessary to obtain a single ranked document list by merging the individual ranked lists from the information sources that are in different languages. This is known as the results merging problem for multilingual information retrieval. Previous research has shown that the simple approach of normalizing source-specific document scores is not effective. On the other side, a more effective merging method was proposed to download and translate all retrieved documents into the source language and generate the final ranked list by running a monolingual search in the search client. The latter method is more effective but is associated with a large amount of online communication and computation costs. This paper proposes an effective and efficient approach for the results merging task of multilingual ranked lists. Particularly, it downloads only a small number of documents from the individual ranked lists of each user query to calculate comparable document scores by utilizing both the query-based translation method and the document-based translation method. Then, query-specific and source-specific transformation models can be trained for individual ranked lists by using the information of these downloaded documents. These transformation models are used to estimate comparable document scores for all retrieved documents and thus the documents can be sorted into a final ranked list. This merging approach is efficient as only a subset of the retrieved documents are downloaded and translated online. Furthermore, an extensive set of experiments on the Cross-Language Evaluation Forum (CLEF) () data has demonstrated the effectiveness of the query-specific and source-specific results merging algorithm against other alternatives. The new research in this paper proposes different variants of the query-specific and source-specific results merging algorithm with different transformation models. This paper also provides thorough experimental results as well as detailed analysis. All of the work substantially extends the preliminary research in (Si and Callan, in: Peters (ed.) Results of the cross-language evaluation forum-CLEF 2005, 2005).
Hao YuanEmail:
  相似文献   

16.
In this paper we evaluate the application of data fusion or meta-search methods, combining different algorithms and XML elements, to content-oriented retrieval of XML structured data. The primary approach is the combination of a probabilistic methods using Logistic regression and the Okapi BM-25 algorithm for estimation of document relevance or XML element relevance, in conjunction with Boolean approaches for some query elements. In the evaluation we use the INEX XML test collection to examine the relative performance of individual algorithms and elements and compare these to the performance of the data fusion approaches.  相似文献   

17.
文章对ISO、IEC、ITU、CEN等主要国际标准文献检索平台进行了系统调研,并从收录内容、检索方式、著录方式及检索结果四个方面作了分析、评价与比较,以便用户选择使用。最后,针对我国标准检索平台重质量轻数量、检索字段少、检准率和检全率低、收录不全、更新慢等不足给出了相关建议,包括提高著录质量、增加检索字段、提供多语言检索和检索结果的多种排序方式、提高时效性和连续性、整合标准文献。  相似文献   

18.
This study introduces a novel framework for evaluating passage and XML retrieval. The framework focuses on a user’s effort to localize relevant content in a result document. Measuring the effort is based on a system guided reading order of documents. The effort is calculated as the quantity of text the user is expected to browse through. More specifically, this study seeks evaluation metrics for retrieval methods following a specific fetch and browse approach, where in the fetch phase documents are ranked in decreasing order according to their document score, like in document retrieval. In the browse phase, for each retrieved document, a set of non-overlapping passages representing the relevant text within the document is retrieved. In other words, the passages of the document are re-organized, so that the best matching passages are read first in sequential order. We introduce an application scenario motivating the framework, and propose sample metrics based on the framework. These metrics give a basis for the comparison of effectiveness between traditional document retrieval and passage/XML retrieval and illuminate the benefit of passage/XML retrieval.  相似文献   

19.
基于主题地图的文献组织方法研究   总被引:2,自引:1,他引:2  
吴江宁  田海燕 《情报学报》2007,26(3):323-331
网络信息的膨胀导致了文献检索的困难,而检索效率的提高则依赖于其组织方法的改善。主题地图是适应信息资源的网络化而出现的一种新的组织技术,可以较好地解决大量无序、非结构化信息的组织问题。本文提出一个基于主题地图的多层文献组织模型(TMDOM),通过从文献内容中概化出主题并定义主题之间的关联,将领域内主要的概念及其关联以合理的层次结构体现出来,以实现对文献资源的有效组织。试验结果证明了这种组织方法的优越性,通过主题之间的各种关联,实现了有效的信息导航。  相似文献   

20.
引文信息在科技文献检索中的应用   总被引:1,自引:0,他引:1  
科技文献之间的相互引证关系反映了一种科学交流活动 ,显示了科学文献之间 (甚至是学科之间 )的内在联系 ,而通过追溯文献之间的这种关系 ,可以改善传统的基于内容的科技文献检索的方法。本文应用一篇文章的被引用信息提出了文献的“重要性”的概念 ,并将它应用到基于内容的检索结果的排序中。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号