1.

A new unsupervised method for document clustering by using WordNet lexical and conceptual relations

Diego Reforgiato Recupero 《Information Retrieval》2007,10(6):563-579

Text document clustering provides an effective and intuitive navigation mechanism to organize a large amount of retrieval results by grouping documents in a small number of meaningful classes. Many well-known methods of text clustering use a long list of words as the vector space, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms, such as synonyms or antonyms. Our unsupervised method solves both problems by using ANNIE and WordNet lexical categories and the WordNet ontology to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.

2.

Methods for automatically evaluating answers to complex questions (total citations: 1; self-citations: 0; citations by others: 1)

Jimmy Lin Dina Demner-Fushman 《Information Retrieval》2006,9(5):565-587

Evaluation is a major driving force in advancing the state of the art in language technologies. In particular, methods for automatically assessing the quality of machine output are the preferred means of measuring progress, provided that these metrics have been validated against human judgments. Following recent developments in the automatic evaluation of machine translation and document summarization, we present a similar approach, implemented in a measure called POURPRE, an automatic technique for evaluating answers to complex questions based on n-gram co-occurrences between machine output and a human-generated answer key. Until now, the only way to assess the correctness of answers to such questions has involved manual determination of whether an information “nugget” appears in a system's response. The lack of automatic methods for scoring system output is an impediment to progress in the field, which we address with this work. Experiments with the TREC 2003, TREC 2004, and TREC 2005 QA tracks indicate that rankings produced by our metric correlate highly with official rankings, and that POURPRE outperforms direct application of existing metrics.
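
The idea of scoring by n-gram co-occurrence can be illustrated with a minimal Python sketch. This is not the POURPRE definition from the article: it only shows the simplest case (unigrams), and the function name `unigram_overlap`, the whitespace tokenization, and the sample strings are assumptions made for illustration.

```python
def unigram_overlap(nugget: str, response: str) -> float:
    """Fraction of the nugget's terms that also occur in the system response.

    Assumption: lowercased whitespace tokenization, with no stemming
    or stopword handling.
    """
    nugget_terms = nugget.lower().split()
    response_terms = set(response.lower().split())
    if not nugget_terms:
        return 0.0
    matched = sum(1 for term in nugget_terms if term in response_terms)
    return matched / len(nugget_terms)


# Hypothetical answer-key nugget and system response.
score = unigram_overlap("the cat sat", "a cat sat down")
print(score)
```

A score near 1 suggests that the response covers the nugget's wording; a real metric would also need to combine scores across many nuggets and weigh response length.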

3.

On information retrieval metrics designed for evaluation with incomplete relevance assessments (total citations: 1; self-citations: 0; citations by others: 1)

Tetsuya Sakai Noriko Kando 《Information Retrieval》2008,11(5):447-470

Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs—the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I Error, and on Kendall’s rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, Q′, nDCG′ and AP′ proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.
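
Kendall's rank correlation between two system rankings can be sketched in a few lines of Python. This is a minimal illustration of the tau-a variant, assuming each metric assigns one score per system and ignoring any special treatment of ties; the function name `kendall_tau` and the sample scores are hypothetical and do not come from the article.

```python
from itertools import combinations


def kendall_tau(scores_a: dict, scores_b: dict) -> float:
    """Kendall's tau-a over the systems scored by both metrics.

    A pair of systems is concordant when both metrics order it the same way
    and discordant when they disagree; tied pairs count as neither.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        sign = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for three systems under two metrics.
metric_1 = {"A": 0.30, "B": 0.20, "C": 0.10}
metric_2 = {"A": 0.25, "B": 0.28, "C": 0.05}
print(kendall_tau(metric_1, metric_2))
```

A value of 1 means the two rankings agree on every pair of systems, and -1 means they are exact reverses of each other.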

4.

Chinese Publishing Industry Going Global: Background and Performance

Lifang Xu Qing Fang 《Publishing Research Quarterly》2008,24(1):64-72

To put an end to the large copyright trade deficit, both Chinese government agencies and publishing houses have been striving to enter the international publication market. The article analyzes the background of the going-global strategy and sums up the performance of both Chinese administrations and publishers.

Qing Fang (corresponding author)

5.

How to manage an information state: Jean-Baptiste Colbert’s archives and the education of his son

Jacob Soll 《Archival Science》2007,7(4):331-342

This article examines the archival methods developed by Colbert to train his son in state administration. Based on Colbert’s correspondence with his son, it reveals the practices Colbert thought necessary to collect and manage information in his state encyclopedic archive during the last half of the 17th century.

6.

Nearest-biclusters collaborative filtering based on constant and coherent values

Panagiotis Symeonidis Alexandros Nanopoulos Apostolos N. Papadopoulos Yannis Manolopoulos 《Information Retrieval》2008,11(1):51-75

Collaborative Filtering (CF) systems have been studied extensively for more than a decade to confront the “information overload” problem. Nearest-neighbor CF is based on similarities either between users or between items, to form a neighborhood of users or items, respectively. Recent research has tried to combine the two aforementioned approaches to improve effectiveness. Traditional clustering approaches (k-means or hierarchical clustering) have also been used to speed up the recommendation process. In this paper, we use biclustering to disclose this duality between users and items, by grouping them in both dimensions simultaneously. We propose a novel nearest-biclusters algorithm, which uses a new similarity measure that achieves partial matching of users’ preferences. We apply nearest-biclusters in combination with two different types of biclustering algorithms—Bimax and xMotif—for constant and coherent biclustering, respectively. Extensive performance evaluation results on three real-life data sets are provided, which show that the proposed method substantially improves the performance of the CF process.
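
For orientation, the user-based nearest-neighbor CF baseline mentioned in the abstract can be sketched as follows. This is not the nearest-biclusters algorithm proposed in the article; it is a minimal sketch assuming cosine similarity over sparse rating dictionaries, and the names `cosine`, `nearest_users`, and the sample ratings are hypothetical.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse rating vectors (item -> rating)."""
    common = set(u) & set(v)
    dot = sum(u[item] * v[item] for item in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_users(ratings: dict, target: str, k: int = 1) -> list:
    """Return the k users most similar to `target`, most similar first."""
    sims = [
        (cosine(ratings[target], user_ratings), name)
        for name, user_ratings in ratings.items()
        if name != target
    ]
    sims.sort(reverse=True)
    return [name for _, name in sims[:k]]


# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann": {"x": 5, "y": 1},
    "bob": {"x": 4, "y": 1},
    "cat": {"x": 1, "y": 5},
}
print(nearest_users(ratings, "ann"))
```

The neighborhood returned here is built along the user dimension only, which is the limitation that grouping users and items simultaneously is meant to address.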

7.

Re-ranking search results using language models of query-specific clusters

Oren Kurland 《Information Retrieval》2009,12(4):437-460

To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and using information induced from these (often called) query-specific clusters for re-ranking the list. However, results concerning the effectiveness of various automatic cluster-based re-ranking methods have been inconclusive. We show that using query-specific clusters for automatic re-ranking of top-retrieved documents is effective with several methods in which clusters play different roles, among which is the smoothing of document language models. We do so by adapting previously-proposed cluster-based retrieval approaches, which are based on (static) query-independent clusters for ranking all documents in a corpus, to the re-ranking setting wherein clusters are query-specific. The best performing method that we develop outperforms both the initial document-based ranking and some previously proposed cluster-based re-ranking approaches; furthermore, this algorithm consistently outperforms a state-of-the-art pseudo-feedback-based approach. In further exploration we study the performance of cluster-based smoothing methods for re-ranking with various (soft and hard) clustering algorithms, and demonstrate the importance of clusters in providing context from the initial list through a comparison to using single documents to this end.

8.

The Identification of Digital Book Content

Andy Weissberg 《Publishing Research Quarterly》2008,24(4):255-260

This article analyzes current industry practices toward the identification of digital book content. It highlights key technology trends, workflow considerations and supply chain behaviors, and examines the implications of these trends and behaviors on the production, discoverability, purchasing and consumption of digital book products.

9.

International Investments and Acquisitions in India: Tax and Regulatory Aspects

Sandeep Chaufla 《Publishing Research Quarterly》2008,24(3):187-201

A review and analysis of the rules and regulations including the tax aspects of making an investment in India is presented. The full range from Foreign Direct Investment to different forms of doing business with specific examples from the publishing industry is explored to help understand current policies and regulations.

10.

Chinese Children’s Book Market and the German Experiences in Cooperation with Chinese Publishers

Jing Bartz 《Publishing Research Quarterly》2008,24(1):73-78

This article gives a summary overview of the children’s and young adult publishing industry in China, focusing on the size of the market, ten major publishing houses, copyright, and trends. Special emphasis is placed on specific transactions for the sale of translation rights from German-language publishers to China and on the minimal activity in German rights sold to Chinese publishers.

11.

Participatory archive: towards decentralised curation, radical user orientation, and broader contextualisation of records management (total citations: 2; self-citations: 2; citations by others: 0)

Isto Huvila 《Archival Science》2008,8(1):15-36

User perspective and user studies have received noticeably little practical attention in archives and archival science. The purpose of this article is to address the issues of communication and user participation in archival contexts. Two digital archives based on action research projects are discussed. The insights gained during the research and development work are used to formulate a new approach to a participatory archive. In spite of the historical nature of the archives discussed, the suggested ways of interacting with an archive are not specific to historical records. The fundamental characteristics of the proposed approach are decentralised curation, radical user orientation, and contextualisation of both records and the entire archival process.

12.

Canadian Social Science and Humanities Online Journal Publishing, the Synergies Project, and the Creation and Representation of Knowledge

Rowland Lorimer John Maxwell 《Publishing Research Quarterly》2007,23(3):175-193

13.

On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages (total citations: 1; self-citations: 1; citations by others: 0)

Jakub Piskorski Karol Wieloch Marcin Sydow 《Information Retrieval》2009,12(3):275-299

Web person search is one of the most common activities of Internet users. Recently, a vast amount of work on applying various NLP techniques to person name disambiguation in large web document collections has been reported, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of the aforementioned techniques. Furthermore, we carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of on-line news articles, revealed that achieving lemmatization accuracy figures greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6–99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem for other highly inflectional languages with similar phenomena.
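
As an illustration of what a well-established string distance metric looks like in this setting, the Python sketch below computes the Levenshtein edit distance and a length-normalized similarity between two name forms. The abstract does not list which metrics were evaluated, so the choice of Levenshtein, the normalization, and the names `levenshtein` and `name_similarity` are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


# A Polish surname in its base form and in an inflected form.
print(levenshtein("Kowalski", "Kowalskiego"))
print(name_similarity("Kowalski", "Kowalskiego"))
```

Inflected forms of the same surname often share a long prefix and differ only in the ending, which is why edit-distance-style metrics and suffix-based patterns are natural candidates to combine.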

14.

The ADA approach: retro-archiving data in an academic environment

Marjan Balkestein Heiko Tjalsma 《Archival Science》2007,7(1):89-105

This article concentrates on the retro-archiving of older digital research data. The ADA approach was developed and used to retro-archive older data files, most of which were between 10 and 30 years old. The origin and main characteristics of the ADA approach are described in the second section of the article. The third section discusses two recent data-archiving pilot projects that were conducted in the Netherlands. The first of these projects, the ADA project, laid the foundation for the ADA approach, which was subsequently applied and tested again in the second project, eDNA, which focused on archaeological data. The final section of the article provides a comparison of the results of these two projects.

15.

Publishing in Scotland: Reviewing the Fragile Revival

Alistair McCleery Marion Sinclair Linda Gunn 《Publishing Research Quarterly》2008,24(2):87-97

A comparison of analyses of the Scottish publishing industry carried out in 1992, 2002 and 2007 underscores the fragility of the sector within a small country within the English-language community. A number of indices reveal either stability or stagnation and the picture emerges of the remarkable tenacity of publishing in Scotland. Although there is already a significant and vital element of state support for publishing in Scotland, further intervention will be necessary to ensure fulfilment of its potential.

16.

Consumer Magazines in Argentina: A Market to Recover

Ethel Alejandra Pis Diez 《Publishing Research Quarterly》2007,23(3):194-209

17.

Australian Small and Independent Publishing: The Freeth Report

Nathan Hollier 《Publishing Research Quarterly》2008,24(3):165-174

This article provides a summary of and commentary on ‘A Lovely Kind of Madness: Small and Independent Publishing in Australia’, an unpublished report by Kate Freeth, commissioned by the Small Press Underground Networking Community (SPUNC), the representative body for small and independent publishers in Australia, and released in November 2007. Freeth’s 14,000-word report constitutes the most detailed and comprehensive study of Australian small and independent publishing since the second volume of Michael Denholm’s Small Press Publishing in Australia (1991) and provides much primary material for policy makers, scholars, and people working in and around the publishing industry.

18.

On rank-based effectiveness measures and optimization (total citations: 1; self-citations: 0; citations by others: 1)

Stephen Robertson Hugo Zaragoza 《Information Retrieval》2007,10(3):321-339

Many current retrieval models and scoring functions contain free parameters which need to be set—ideally, optimized. The process of optimization normally involves some training corpus of the usual document-query-relevance judgement type, and some choice of measure that is to be optimized. The paper proposes a way to think about the process of exploring the space of parameter values, and how moving around in this space might be expected to affect different measures. One result, concerning local optima, is demonstrated for a range of rank-based evaluation measures.

19.

Smoothing document language models with probabilistic term count propagation

Azadeh Shakery ChengXiang Zhai 《Information Retrieval》2008,11(2):139-164

Smoothing of document language models is critical in language modeling approaches to information retrieval. In this paper, we present a novel way of smoothing document language models based on propagating term counts probabilistically in a graph of documents. A key difference between our approach and previous approaches is that our smoothing algorithm can iteratively propagate counts and achieve smoothing with remotely related documents. Evaluation results on several TREC data sets show that the proposed method significantly outperforms the simple collection-based smoothing method. Compared with those other smoothing methods that also exploit local corpus structures, our method is especially effective in improving precision in top-ranked documents through “filling in” missing query terms in relevant documents, which is attractive since most users only pay attention to the top-ranked documents in search engine applications.
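
To make the collection-based baseline concrete, the Python sketch below smooths a document language model by interpolating it with a collection model. The abstract does not say which collection-based method was used as the baseline; Jelinek-Mercer interpolation is assumed here purely for illustration, and the function name `jelinek_mercer`, the parameter `lam`, and the toy token lists are hypothetical. The article's count-propagation method is not shown.

```python
from collections import Counter


def jelinek_mercer(doc_tokens, collection_tokens, lam=0.5):
    """Return a function p(word | doc) smoothed with the collection model.

    p(w | d) = (1 - lam) * c(w, d) / |d| + lam * c(w, C) / |C|
    Assumption: `lam` in [0, 1] is the weight given to the collection model.
    """
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(collection_tokens)
    doc_len = len(doc_tokens)
    coll_len = len(collection_tokens)

    def prob(word):
        p_doc = doc_counts[word] / doc_len if doc_len else 0.0
        p_coll = coll_counts[word] / coll_len if coll_len else 0.0
        return (1 - lam) * p_doc + lam * p_coll

    return prob


# Toy document and collection; "c" is missing from the document.
p = jelinek_mercer(["a", "b"], ["a", "b", "c", "c"], lam=0.5)
print(p("a"), p("c"))
```

The query term missing from the document still receives a non-zero probability, but only from collection-wide statistics rather than from related documents.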

20.

Multilingual phrase-based concordance generation in real-time

Kumiko Tanaka-Ishii Yuichiro Ishii 《Information Retrieval》2007,10(3):275-295

We present software that generates phrase-based concordances in real-time based on Internet searching. When a user enters a string of words for which he wants to find concordances, the system sends this string as a query to a search engine and obtains search results for the string. The concordances are extracted by performing statistical analysis on search results and then fed back to the user. Unlike existing tools, this concordance consultation tool is language-independent, so concordances can be obtained even in a language for which there are no well-established analytical methods. Our evaluation has revealed that concordances can be obtained more effectively than by only using a search engine directly.
