Similar documents (20 results)
1.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short) parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits of using the document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.
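The homogeneity-controlled passage model described in this abstract can be sketched as a simple linear interpolation between a passage's own language model and that of its containing document. The function names and the exact interpolation form below are illustrative assumptions, not the paper's formulation:

```python
from collections import Counter

def lm(tokens):
    """Maximum-likelihood unigram language model of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def passage_lm(passage, document, homogeneity):
    """Interpolate a passage model with its containing document's model.

    `homogeneity` in [0, 1]: 1 means the document is judged homogeneous,
    so the passage borrows heavily from the document model; 0 keeps the
    pure passage model. The interpolation direction is our assumption.
    """
    p_g = lm(passage)          # passage model
    p_d = lm(document)         # containing-document model
    vocab = set(p_g) | set(p_d)
    return {w: (1 - homogeneity) * p_g.get(w, 0.0)
               + homogeneity * p_d.get(w, 0.0)
            for w in vocab}
```

With `homogeneity = 0` this degenerates to the standard passage model; with `homogeneity = 1` every passage inherits the document model wholesale.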

2.
We present a novel approach to re-ranking a document list that was retrieved in response to a query so as to improve precision at the very top ranks. The approach is based on utilizing a second list that was retrieved in response to the query by using, for example, a different retrieval method and/or query representation. In contrast to commonly-used methods for fusion of retrieved lists that rely solely on retrieval scores (ranks) of documents, our approach also exploits inter-document-similarities between the lists—a potentially rich source of additional information. Empirical evaluation shows that our methods are effective in re-ranking TREC runs; the resultant performance also favorably compares with that of a highly effective fusion method. Furthermore, we show that our methods can potentially help to tackle a long-standing challenge, namely, integration of document-based and cluster-based retrieved results.

3.
The history of the creation and development of the VINITI RAS “Geography” reference journal from 1954 to 2008 is considered. The changes in the back issues and the dynamics of the distribution of the overall number of documents in the reference journal/database are traced in relation to changes in the content of the issues during the period under consideration. The document information flow of the “Geography” database during 1991–2008 was analyzed statistically.

4.
Document clustering of scientific texts using citation contexts
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document’s textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is, in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. More specifically, this document representation strategy, when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets.

5.
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to using single documents to this end.

6.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper, we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures, our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine applications.
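A minimal sketch of the idea of propagating term counts over a document graph, assuming a row-stochastic similarity matrix and a fixed mixing weight `alpha` (both our assumptions; the paper's probabilistic propagation is more elaborate). Iterating lets counts reach remotely related documents, "filling in" terms a relevant document never mentions:

```python
def propagate_counts(counts, sim, alpha=0.5, iters=3):
    """Iteratively smooth per-document term-count vectors over a
    document-similarity graph.

    counts: list of {term: count} dicts, one per document.
    sim:    row-stochastic matrix; sim[i][j] is the weight doc i
            gives to doc j's counts.
    """
    cur = [dict(c) for c in counts]
    for _ in range(iters):
        nxt = []
        for i, row in enumerate(sim):
            # counts received from neighbours in this round
            neigh = {}
            for j, w in enumerate(row):
                for t, c in cur[j].items():
                    neigh[t] = neigh.get(t, 0.0) + w * c
            # mix the document's original counts with propagated ones
            vocab = set(neigh) | set(counts[i])
            nxt.append({t: alpha * counts[i].get(t, 0.0)
                           + (1 - alpha) * neigh.get(t, 0.0)
                        for t in vocab})
        cur = nxt
    return cur
```

After one round, a document acquires (down-weighted) counts for terms that only its neighbours contain; further rounds pull in counts from documents two or more hops away.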

7.
In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model “fits” the user’s information need (query model). In statistics, on the other hand, goodness-of-fit tests are well-established techniques for assessing assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed across the documents of the collection, we want to know whether the occurrences of the query terms in a particular document are more frequent than chance would predict. This can be quantified by goodness-of-fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness-of-fit tests. Given the null hypothesis that there is no association between the query terms q and the document d beyond chance occurrences, we perform a Chi-square goodness-of-fit test for assessing this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula ranks the documents in the collection according to these calculated Chi-square values. The method was evaluated over the entire TREC test collection on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, steadily outperforming the classical OKAPI term-frequency weighting formula, though it falls below the KL-divergence method from the language-modeling approach. Despite this, we believe the technique is an important non-parametric way of thinking about retrieval, offering the possibility of trying simple alternative retrieval formulas within the framework of goodness-of-fit statistical tests, modeling the data in various ways by estimating or assigning an arbitrary theoretical distribution to the terms.
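The ranking idea can be illustrated with a toy Chi-square score over the query terms, where expected counts come from collection-wide term proportions scaled to the document's length. This is a simplified reading of the approach, not the authors' exact statistic:

```python
def chi_square_score(query_terms, doc, collection):
    """Score a document by a Chi-square statistic over the query terms.

    doc, collection: {term: count} dicts; `collection` aggregates counts
    over all documents. Under the null hypothesis of no association,
    the expected count of term t in doc is |doc| * P(t | collection).
    """
    doc_len = sum(doc.values())
    coll_len = sum(collection.values())
    score = 0.0
    for t in query_terms:
        observed = doc.get(t, 0)
        expected = doc_len * collection.get(t, 0) / coll_len
        if expected > 0:
            # classic (O - E)^2 / E contribution
            score += (observed - expected) ** 2 / expected
    return score
```

Note that the raw statistic is two-sided: a document in which a query term is markedly under-represented also gets a large value, so a practical ranking formula would need to account for the direction of the deviation.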

8.
From work to text to document
The defining trope for the humanities in the last 30 years has been typified by the move from “work” to “text.” The signature text defining this move has been Roland Barthes's seminal essay, “From Work to Text.” But the current move in library, archival and information studies toward the “document” as the key term offers challenges for contemporary humanities research. In making our own movement from work to text to document, we can explicate fully the complexity of conducting archival humanistic research within disciplinary and institutional contexts in the twenty-first century. This essay calls for a complex perspective, one that demands that we understand that the raw materials of scholarship are processed by disciplines, by institutions, and by the work of the scholar. When we understand our materials as constrained by disciplines, we understand them as “works.” When we understand them as constrained by the institutions of memory that preserve and grant access to them, we understand them as “documents.” And when we understand them as the ground for our own interpretive activity, we understand them as “texts.” When we understand that humanistic scholarship requires an awareness of all three perspectives simultaneously (an understanding demonstrated by case studies in historical studies of the discipline of rhetoric), we will be ready for a richer historical scholarship as well as a richer collaboration between humanists and archivists.

9.
This paper considers the history of the creation and development of the VINITI RAS AJ in the field of mechanics from 1953 to 2008. The changes in the back issues and dynamics of the distribution of the total number of documents in the Mechanics AJ/DB are traced. The document information flow of the “Mechanics” DB from 1953 to 2008 is statistically analyzed.

10.
The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam: pages designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the TREC 2009 web ad hoc and relevance feedback tasks. We show that a simple content-based classifier with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision (estP10) as well as rank measures (estR-Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of “honeypot” queries, the labeling of training data may be reduced to an entirely automatic process. Filtering particularly enhances the results of classical information retrieval methods, moving them from among the worst to among the best.

11.
In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single- and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods is used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form “for document d_i, category c′ is preferred to category c″”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.

12.
Upon reviewing the Preliminary Draft of the Report of the Working Group on Intellectual Property Rights, given the title Intellectual Property and the National Information Infrastructure, one immediately confronts the grand ambiguity that resides in the two words: “intellectual property.” That the task force on the information infrastructure, enshrined with the acronym NII, had to locate precedent for its mission in Supreme Court Justice Story's 1841 observations on copyright issues as an area involving the “metaphysics of the law” indicates what a long reach the very notion of intellectual property entails in a democratic society. He is the author of Communicating Ideas: The Politics of Publishing and has published widely in the journal literature, including Scholarly Publishing, Logos, Publishing Research Quarterly, and the Journal of the American Society for Information Science, among others.

13.
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record. This type of document is referred to as a multiple-record document. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of the training documents are discrete. Using one test set of car-ads and another of obituary Web documents, our probabilistic model achieves an average recall of 100%, precision of 83.3%, and accuracy of 92.5%.
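A generic sketch of the training step: fitting a logistic model on per-document feature vectors such as [term-frequency score, density heuristic, grouping heuristic]. The plain gradient-descent optimizer and its hyperparameters are our assumptions, not the paper's procedure:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic-regression weights by stochastic gradient descent.

    X: list of feature vectors (one per training document).
    y: list of 0/1 relevance labels.
    Returns (weights, bias).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi          # gradient of log loss w.r.t. the logit
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def relevance_probability(x, w, b):
    """Predicted probability that the document described by x is relevant."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
```

Because the model outputs a probability rather than a fitted value, it handles the discrete 0/1 relevance labels directly, which is the reason the abstract gives for preferring logistic over linear regression.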

14.
15.
In Information Retrieval, since it is hard to identify users’ information needs, many approaches have tried to solve this problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments. Although relevance feedback is most effective when relevance information about retrieved documents is provided by users, it is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach is how to construct the term-term correlations that can be used effectively to improve retrieval performance. In this study, we try to construct query concepts that denote users’ information needs from a document space, rather than to reformulate initial queries using the term correlations and/or users’ relevance feedback. To form query concepts, we extract features from each document, and then cluster the features into primitive concepts that are then used to form query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation shows that our proposed framework, called QCM (Query Concept Method), outperforms a baseline probabilistic retrieval model on TREC retrieval.

16.
The self-help book in America appears to occupy a social niche roughly on a par with that of the legendary oracle at Delphi. Offering wisdom and enlightenment at discount prices, it speaks to a vast audience on a variety of topics, and provides specific directions for achieving love, health, wealth, peace of mind, and any number of practical skills. It is too prevalent and powerful a phenomenon to overlook, despite its belonging to “pop” culture. Inasmuch as self-help books are dispensing advice to millions on matters physical, psychological, and spiritual, they cannot responsibly be ignored by social scientists and health care practitioners. Questions regarding their relative merits and potential dangers deserve careful consideration. This article is an excerpt from chapter 1 of Oracle at the Supermarket, published by Transaction Publishers.

17.
Archivists and historians usually consider archives as repositories of historical sources and the archivist as a neutral custodian. Sociologists and anthropologists see “the archive” also as a system of collecting, categorizing, and exploiting memories. Archivists are hesitantly acknowledging their role in shaping memories. I advocate that archival fonds, archival documents, archival institutions, and archival systems contain tacit narratives which must be deconstructed in order to understand the meanings of archives. Revision of a paper presented, on the invitation of the Master's Programme in Archival Studies, Department of History, University of Manitoba, in the History Department Colloquium series of the University of Manitoba, Winnipeg, 20 February, 2001. Some of the arguments were used earlier in two papers I presented in the seminar “Archives, Documentation and the Institutions of Social Memory”, organized by the Bentley Historical Library and the International Institute of the University of Michigan, Ann Arbor, 14 February, 2001.

18.
The collective feedback of the users of an Information Retrieval (IR) system has been shown to provide semantic information that, while hard to extract using standard IR techniques, can be useful in Web mining tasks. In the last few years, several approaches have been proposed to process the logs stored by Internet Service Providers (ISPs), Intranet proxies or Web search engines. However, the solutions proposed in the literature only partially represent the information available in the Web logs. In this paper, we propose to use a richer data structure, which is able to preserve most of the information available in the Web logs. This data structure consists of three groups of entities: users, documents and queries, which are connected in a network of relations. Query refinements correspond to separate transitions between the corresponding query nodes in the graph, while users are linked to the queries they have issued and to the documents they have selected. The classical query/document transitions, which connect a query to the documents selected by the users in the returned result page, are also considered. The resulting data structure is a complete representation of the collective search activity performed by the users of a search engine or of an Intranet. The experimental results show that this more powerful representation can be successfully used in several Web mining tasks like discovering semantically relevant query suggestions and Web page categorization by topic.

19.
Searching online information resources using mobile devices is affected by small screens which can display only a fraction of ranked search results. In this paper we investigate whether the search effort can be reduced by means of a simple user feedback: for a screenful of search results the user is encouraged to indicate a single most relevant document. In our approach we exploit the fact that, for small display sizes and limited user actions, we can construct a user decision tree representing all possible outcomes of the user interaction with the system. Examining the trees, we can compute an upper limit on relevance feedback performance. In this study we consider three standard feedback algorithms: Rocchio, Robertson/Sparck-Jones (RSJ) and a Bayesian algorithm. We evaluate them in conjunction with two strategies for presenting search results: a document ranking that attempts to maximize information gain from the user’s choices, and the top-D ranked documents. Experimental results indicate that for RSJ feedback, which involves an explicit feature selection policy, the greedy top-D display is more appropriate. For the other two algorithms, the exploratory display that maximizes information gain produces better results. We conducted a user study to compare the performance of the relevance feedback methods with real users and compared the results with the findings from the tree analysis. This comparison between the simulations and real user behaviour indicates that the Bayesian algorithm, coupled with the sampled display, is the most effective. Extended version of “Evaluating Relevance Feedback Algorithms for Searching on Small Displays,” by Vishwa Vinay, Ingemar J. Cox, Natasa Milic-Frayling, and Ken Wood, published in the proceedings of ECIR 2005, David E. Losada and Juan M. Fernández-Luna (Eds.), Springer 2005, ISBN 3-540-25295-9.
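Of the three feedback algorithms named above, Rocchio is the simplest to sketch. In the single-choice setting the abstract describes, the one tapped document would serve as the relevant example and the rest of the screenful as non-relevant; the weights below are conventional textbook defaults, not the paper's settings:

```python
def rocchio(query_vec, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Classic Rocchio update: move the query vector toward the centroid
    of relevant documents and away from the non-relevant centroid.
    All vectors are sparse {term: weight} dicts.
    """
    def scaled_centroid(docs, weight):
        out = {}
        if not docs:
            return out
        for d in docs:
            for t, v in d.items():
                out[t] = out.get(t, 0.0) + v
        return {t: weight * v / len(docs) for t, v in out.items()}

    pos = scaled_centroid(relevant, beta)
    neg = scaled_centroid(nonrelevant, gamma)
    vocab = set(query_vec) | set(pos) | set(neg)
    return {t: alpha * query_vec.get(t, 0.0)
               + pos.get(t, 0.0) - neg.get(t, 0.0)
            for t in vocab}
```

Because every screenful yields exactly one relevant and a handful of non-relevant examples, the update can be applied after each user action, which is what makes the decision-tree enumeration of all possible interaction outcomes tractable.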

20.
Martin Amis’ novel The Information was published in paperback in May 1995. For a number of convergent reasons, the publication became, in itself, a major media event. In examining this occasion, the economic and cultural imperatives that shaped the marketing of The Information, and the wider context of contemporary book publishing and its relation to other media, this paper problematises the relationship between cultural and economic value. It considers the discourse around what an author is “worth” in a late capitalist society of fiercely competitive consumer choice, and how the representation of an avowedly “literary” author is mobilised in the marketplace in ways that aim not to compromise the investment in the difference between literary and popular fiction.
