20 similar documents found; search time: 843 ms.
1.
To cope with the fact that, in the ad hoc retrieval setting, documents relevant to a query could contain very few (short)
parts (passages) with query-related information, researchers proposed passage-based document ranking approaches. We show that several of
these retrieval methods can be understood, and new ones can be derived, using the same probabilistic model. We use language-model
estimates to instantiate specific retrieval algorithms, and in doing so present a novel passage language model that integrates information from the containing document to an extent controlled by the estimated document homogeneity. Several document-homogeneity measures that we present yield passage language models that are more effective than the standard
passage model for basic document retrieval and for constructing and utilizing passage-based relevance models; these relevance models also outperform a document-based relevance model. Finally, we demonstrate the merits in using the
document-homogeneity measures for integrating document-query and passage-query similarity information for document retrieval.
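As a sketch of the kind of passage language model described above, the following interpolates a passage's maximum-likelihood model with that of its containing document. The function names and the single homogeneity weight are illustrative assumptions; the paper's actual homogeneity estimators are not reproduced here.

```python
from collections import Counter

def mle_lm(tokens):
    """Maximum-likelihood unigram language model over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def passage_lm(passage_tokens, doc_tokens, homogeneity):
    """Passage model interpolated with its containing document's model.

    `homogeneity` in [0, 1]: the more homogeneous the document is judged
    to be, the more the passage model borrows from the whole document.
    """
    p_psg = mle_lm(passage_tokens)
    p_doc = mle_lm(doc_tokens)
    vocab = set(p_psg) | set(p_doc)
    return {t: (1 - homogeneity) * p_psg.get(t, 0.0)
               + homogeneity * p_doc.get(t, 0.0)
            for t in vocab}
```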
2.
We present a novel approach to re-ranking a document list that was retrieved in response to a query so as to improve precision
at the very top ranks. The approach is based on utilizing a second list that was retrieved in response to the query by using,
for example, a different retrieval method and/or query representation. In contrast to commonly-used methods for fusion of retrieved lists that rely solely on retrieval scores (ranks) of documents, our approach also exploits inter-document-similarities between the lists—a potentially rich source of additional information. Empirical evaluation shows that our methods are effective
in re-ranking TREC runs; the resultant performance also favorably compares with that of a highly effective fusion method.
Furthermore, we show that our methods can potentially help to tackle a long-standing challenge, namely, integration of document-based
and cluster-based retrieved results.
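One way inter-document similarities between two retrieved lists might be folded into re-ranking is sketched below, with term sets standing in for document content and Jaccard overlap as the similarity measure; both choices are assumptions for illustration, not the authors' actual method.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rerank_with_second_list(list_a, list_b, alpha=0.5):
    """Re-score each document in list_a using its own retrieval score
    plus similarity-weighted support from documents in list_b.

    Each list holds (doc_id, term_set, score) tuples.
    """
    rescored = []
    for doc_id, terms, score in list_a:
        support = sum(jaccard(terms, terms_b) * score_b
                      for _, terms_b, score_b in list_b)
        rescored.append((doc_id, (1 - alpha) * score + alpha * support))
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored
```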
3.
K. S. Losev V. M. Efremovich E. A. Kryuchkova V. K. Lyuboshchinskaya 《Scientific and Technical Information Processing》2009,36(4):209-218
The history of the creation and development of the VINITI RAS “Geography” reference journal from 1954 to 2008 is considered. The changes in the retrospective holdings and the dynamics of the distribution of the overall quantity of documents in the reference journal/database are traced in relation to the changes in the content of the issues during the period under consideration. The document information flow of the “Geography” database during 1991–2008 was analyzed statistically.
4.
Document clustering of scientific texts using citation contexts (total citations: 3; self-citations: 0; citations by others: 3)
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document
clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is
only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms.
In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications
using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific
documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference
markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and
related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate
the power of these citation-specific word features, and compare them with the original document’s textual representation in
a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy
Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which
determines the similarity between documents based on the number of co-citations, that is, in-links represented by citing documents
and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined
with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered
by journal articles. More specifically, this document representation strategy, when used by the clustering algorithm investigated
in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific
journal datasets.
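A toy illustration of why citation-context terms can bridge synonymy in a bag-of-words representation: two documents sharing no lexical terms become similar once terms from the contexts citing them are mixed in. All names and the unit context weight are assumptions, not the paper's implementation.

```python
from collections import Counter
import math

def doc_vector(full_text_terms, citation_context_terms, context_weight=1.0):
    """Bag-of-words vector augmented with terms drawn from the contexts
    in which other papers cite this document."""
    vec = Counter(full_text_terms)
    for term in citation_context_terms:
        vec[term] += context_weight
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```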
5.
Oren Kurland 《Information Retrieval》2009,12(4):437-460
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based
re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using
information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking
of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent
clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing
method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking
approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further
exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering
algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to
using single documents to this end.
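A minimal sketch of cluster-based smoothing of a document language model in this re-ranking setting: the document's maximum-likelihood model backs off first to its query-specific cluster's model and then to the collection. The mixture form and the particular weights are illustrative assumptions; the paper evaluates several variants.

```python
def cluster_smoothed_lm(doc_lm, cluster_lm, collection_lm, lam=0.6, beta=0.5):
    """Smooth a document language model with its query-specific cluster
    before backing off to the whole collection:

        p(w|d) = lam*p_ml(w|d)
                 + (1-lam)*(beta*p(w|cluster) + (1-beta)*p(w|collection))
    """
    vocab = set(doc_lm) | set(cluster_lm) | set(collection_lm)
    return {w: lam * doc_lm.get(w, 0.0)
               + (1 - lam) * (beta * cluster_lm.get(w, 0.0)
                              + (1 - beta) * collection_lm.get(w, 0.0))
            for w in vocab}
```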
6.
Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper,
we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph
of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively
propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the
simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures,
our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms
in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine
applications.
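The iterative propagation idea can be sketched as follows: each round keeps a fraction of a document's original term counts and mixes in counts flowing from its graph neighbors, so after several rounds mass also arrives from remotely related documents and "fills in" missing query terms. The update rule, damping factor, and graph encoding are assumptions for illustration, not the paper's exact algorithm.

```python
def propagate_term_counts(counts, neighbors, alpha=0.8, iterations=3):
    """Iteratively propagate term counts through a document graph.

    counts:    {doc_id: {term: count}}           original term counts
    neighbors: {doc_id: {neighbor_id: weight}}   weights summing to 1 per doc
    """
    current = {d: dict(c) for d, c in counts.items()}
    for _ in range(iterations):
        updated = {}
        for d, original in counts.items():
            # Collect similarity-weighted counts from neighbors.
            incoming = {}
            for n, w in neighbors.get(d, {}).items():
                for term, c in current[n].items():
                    incoming[term] = incoming.get(term, 0.0) + w * c
            # Blend original counts with the propagated mass.
            updated[d] = {term: alpha * original.get(term, 0.0)
                                + (1 - alpha) * incoming.get(term, 0.0)
                          for term in set(original) | set(incoming)}
        current = updated
    return current
```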
7.
In many probabilistic modeling approaches to Information Retrieval we are interested in estimating how well a document model
“fits” the user’s information need (query model). On the other hand in statistics, goodness of fit tests are well established
techniques for assessing the assumptions about the underlying distribution of a data set. Supposing that the query terms are
randomly distributed in the various documents of the collection, we actually want to know whether the occurrences of the query
terms are more frequently distributed by chance in a particular document. This can be quantified by the so-called goodness
of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given
the null hypothesis that there is no association between the query terms q and the document d irrespective of any chance occurrences, we perform a Chi-square goodness of fit test for assessing this hypothesis and calculate
the corresponding Chi-square values. Our retrieval formula is based on ranking the documents in the collection according to
these calculated Chi-square values. The method was evaluated over the entire test collection of TREC data, on disks 4 and
5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, consistently outperforming the classical Okapi term-frequency weighting formula, though falling below the KL-divergence method from the language-modeling approach. Despite this, we believe that the technique is an important non-parametric way of thinking about retrieval, offering the possibility of trying simple alternative retrieval formulas within the framework of goodness-of-fit statistical tests, modeling the data in various ways by estimating or assigning an arbitrary theoretical distribution to the terms.
8.
From work to text to document (total citations: 1; self-citations: 1; citations by others: 0)
David Beard 《Archival Science》2008,8(3):217-226
The defining trope for the humanities in the last 30 years has been typified by the move from “work” to “text.” The signature text defining this move has been Roland Barthes’ seminal essay, “From Work to Text.” But the current move
in library, archival and information studies toward the “document” as the key term offers challenges for contemporary humanities research. In making our own movement from work to text to document, we can explicate fully the complexity of conducting archival humanistic research within disciplinary and institutional contexts
in the twenty-first century. This essay calls for a complex perspective, one that demands that we understand the raw materials
of scholarship are processed by disciplines, by institutions, and by the work of the scholar. When we understand our materials
as constrained by disciplines, we understand them as “works.” When we understand them as constrained by the institutions of
memory that preserve and grant access to them, we understand them as “documents.” And when we understand them as the ground
for our own interpretive activity, we understand them as “texts.” When we understand that humanistic scholarship requires
an awareness of all three perspectives simultaneously (an understanding demonstrated by case studies in historical studies
of the discipline of rhetoric), we will be ready for a richer historical scholarship as well as a richer collaboration between
humanists and archivists.
9.
V. M. Yefremenkova M. V. Voinova E. N. Nikanorova 《Scientific and Technical Information Processing》2010,37(1):19-32
This paper considers the history of the creation and development of the VINITI RAS AJ in the field of mechanics from 1953
to 2008. The changes in the back issues and dynamics of the distribution of the total number of documents in the Mechanics
AJ/DB are traced. The document information flow of the “Mechanics” DB from 1953 to 2008 is statistically analyzed.
10.
The TREC 2009 web ad hoc and relevance feedback tasks used a new document collection, the ClueWeb09 dataset, which was crawled
from the general web in early 2009. This dataset contains 1 billion web pages, a substantial fraction of which are spam—pages
designed to deceive search engines so as to deliver an unwanted payload. We examine the effect of spam on the results of the
TREC 2009 web ad hoc and relevance feedback tasks, which used the ClueWeb09 dataset. We show that a simple content-based classifier
with minimal training is efficient enough to rank the “spamminess” of every page in the dataset using a standard personal
computer in 48 hours, and effective enough to yield significant and substantive improvements in the fixed-cutoff precision
(estP10) as well as rank measures (estR-Precision, StatMAP, MAP) of nearly all submitted runs. Moreover, using a set of “honeypot”
queries, the labeling of training data can be reduced to an entirely automatic process. The results of classical information retrieval methods are particularly enhanced by filtering, rising from among the worst to among the best.
11.
Fabio Aiolli Riccardo Cardin Fabrizio Sebastiani Alessandro Sperduti 《Information Retrieval》2009,12(5):559-580
In many applicative contexts in which textual documents are labelled with thematic categories, a distinction is made between
the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which
represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization
research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose
an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact
on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification
in which the distinction between primary and secondary categories is present; these results are obtained by reformulating
the preferential text categorization task in terms of well established classification problems, such as single and/or multi-label
multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods are used. Third, we
improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data
expressed in preferential form, i.e., in the form “for document d_i, category c′ is preferred to category c″”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
12.
Irving Louis Horowitz 《Publishing Research Quarterly》1995,11(1):40-45
Upon reviewing the Preliminary Draft of the Report of the Working Group on Intellectual Property Rights, given the title Intellectual Property and the National Information Infrastructure, one immediately confronts the grand ambiguity that resides in the two words: “intellectual property.” That the task force on the information infrastructure, enshrined with the acronym NII, had to locate precedent for its mission in Supreme Court Justice Story's 1841 observations on copyright issues as an area involving the “metaphysics of the law” indicates what a long reach the very notion of intellectual property entails in a democratic society.
He is the author of Communicating Ideas: The Politics of Publishing and has published widely in the journal literature, including Scholarly Publishing; Logos; Publishing Research Quarterly; and the Journal of the American Society of Information Science, among others.
13.
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a record. Such documents are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. The relevance probability of each test document is then interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling important dependent relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using one test set of car-ad Web documents and another of obituaries, our probabilistic model achieves an average recall of 100%, precision of 83.3%, and accuracy of 92.5%.
14.
15.
In Information Retrieval, since it is hard to identify users’ information needs, many approaches have been tried to solve
this problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments.
Although relevance feedback is most effective when relevance information about retrieved documents is provided by users, it
is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach
is how to construct the term-term correlations that can be used effectively to improve retrieval performance. In this study,
we try to construct query concepts that denote users’ information needs from a document space, rather than to reformulate initial queries using the term correlations
and/or users’ relevance feedback. To form query concepts, we extract features from each document, and then cluster the features into primitive concepts that are then used to form
query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation
shows that our proposed framework, called QCM (Query Concept Method), outperforms a baseline probabilistic retrieval model on TREC retrieval.
16.
Steven Starker 《Publishing Research Quarterly》1988,4(2):26-32
The self-help book in America appears to occupy a social niche roughly on a par with that of the legendary oracle at Delphi.
Offering wisdom and enlightenment at discount prices, it speaks to a vast audience on a variety of topics, and provides specific
directions for achieving love, health, wealth, peace of mind, and any number of practical skills. It is too prevalent and
powerful a phenomenon to overlook, despite its belonging to “pop” culture. Inasmuch as self-help books are dispensing advice
to millions on matters physical, psychological, and spiritual, they cannot responsibly be ignored by social scientists and
health care practitioners. Questions regarding their relative merits and potential dangers deserve careful consideration.
This article is an excerpt from chapter 1 of Oracle at the Supermarket, published by Transaction Publishers.
17.
Eric Ketelaar 《Archival Science》1987,1(2):131-141
Archivists and historians usually consider archives as repositories of historical sources and the archivist as a neutral custodian.
Sociologists and anthropologists see “the archive” also as a system of collecting, categorizing, and exploiting memories.
Archivists are hesitantly acknowledging their role in shaping memories. I advocate that archival fonds, archival documents,
archival institutions, and archival systems contain tacit narratives which must be deconstructed in order to understand the
meanings of archives.
Revision of a paper presented, on the invitation of the Master's Programme in Archival Studies, Department of History, University
of Manitoba, in the History Department Colloquium series of the University of Manitoba, Winnipeg, 20 February, 2001. Some
of the arguments were used earlier in two papers I presented in the seminar “Archives, Documentation and the Institutions
of Social Memory”, organized by the Bentley Historical Library and the International Institute of the University of Michigan,
Ann Arbor, 14 February, 2001.
18.
The collective feedback of the users of an Information Retrieval (IR) system has been shown to provide semantic information
that, while hard to extract using standard IR techniques, can be useful in Web mining tasks. In the last few years, several
approaches have been proposed to process the logs stored by Internet Service Providers (ISP), Intranet proxies or Web search
engines. However, the solutions proposed in the literature only partially represent the information available in the Web logs.
In this paper, we propose to use a richer data structure, which is able to preserve most of the information available in the
Web logs. This data structure consists of three groups of entities: users, documents and queries, which are connected in a
network of relations. Query refinements correspond to separate transitions between the corresponding query nodes in the graph,
while users are linked to the queries they have issued and to the documents they have selected. The classical query/document
transitions, which connect a query to the documents selected by the users in the returned result page, are also considered.
The resulting data structure is a complete representation of the collective search activity performed by the users of a search
engine or of an Intranet. The experimental results show that this more powerful representation can be successfully used in
several Web mining tasks like discovering semantically relevant query suggestions and Web page categorization by topic.
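The three-entity log structure might be encoded as follows; the class and method names are hypothetical, and the suggestion heuristic shown is just one simple use of the graph, not the paper's full mining procedure.

```python
from collections import defaultdict

class SearchLogGraph:
    """Network of users, queries, and documents mined from search logs,
    with query refinements as transitions between query nodes."""

    def __init__(self):
        self.issued = defaultdict(set)       # user  -> queries issued
        self.selected = defaultdict(set)     # query -> documents clicked
        self.refinements = defaultdict(set)  # query -> follow-up queries

    def record(self, user, query, clicked_doc=None, refined_to=None):
        self.issued[user].add(query)
        if clicked_doc is not None:
            self.selected[query].add(clicked_doc)
        if refined_to is not None:
            self.refinements[query].add(refined_to)

    def suggestion_candidates(self, query):
        """Queries that share a clicked document with `query`
        or follow it as refinements."""
        docs = self.selected[query]
        shared = {q for q, ds in self.selected.items()
                  if q != query and ds & docs}
        return shared | self.refinements[query]
```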
19.
Searching online information resources using mobile devices is affected by small screens which can display only a fraction
of ranked search results. In this paper we investigate whether the search effort can be reduced by means of a simple user
feedback: for a screenful of search results the user is encouraged to indicate a single most relevant document. In our approach
we exploit the fact that, for small display sizes and limited user actions, we can construct a user decision tree representing
all possible outcomes of the user interaction with the system. Examining the trees we can compute an upper limit on relevance
feedback performance. In this study we consider three standard feedback algorithms: Rocchio, Robertson/Sparck-Jones (RSJ)
and a Bayesian algorithm. We evaluate them in conjunction with two strategies for presenting search results: a document ranking
that attempts to maximize information gain from the user’s choices and the top-D ranked documents. Experimental results indicate
that for RSJ feedback which involves an explicit feature selection policy, the greedy top-D display is more appropriate. For
the other two algorithms, the exploratory display that maximizes information gain produces better results. We conducted a
user study to compare the performance of the relevance feedback methods with real users and compare the results with the findings
from the tree analysis. This comparison between the simulations and real user behaviour indicates that the Bayesian algorithm,
coupled with the sampled display, is the most effective.
Extended version of “Evaluating Relevance Feedback Algorithms for Searching on Small Displays,” by Vishwa Vinay, Ingemar J. Cox, Natasa Milic-Frayling, and Ken Wood, published in the proceedings of ECIR 2005, David E. Losada and Juan M. Fernández-Luna (Eds.), Springer 2005, ISBN 3-540-25295-9.
20.
Juliet Gardiner 《Publishing Research Quarterly》2000,16(1):63-76
Martin Amis’ novel The Information was published in paperback in May 1995. For a number of convergent reasons, the publication became, in itself, a major media event. In examining this occasion, the economic and cultural imperatives that shaped the marketing of The Information, and the wider context of contemporary book publishing and its relation to other media, this paper problematises the relationship between cultural and economic value. It considers the discourse around what an author is “worth” in a late capitalist society of fiercely competitive consumer choice, and how the representation of an avowedly “literary” author is mobilised in the marketplace in ways that aim not to threaten or compromise the investment in the difference between literary and popular fiction.