Document clustering of scientific texts using citation contexts |
| |
Authors: | Bader Aljaber, Nicola Stokes, James Bailey, Jian Pei |
| |
Institution: | (1) Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia; (2) School of Computer Science and Informatics, University College Dublin, Dublin, Ireland; (3) NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia; (4) School of Computing Science, Simon Fraser University, Burnaby, Canada |
| |
Abstract: | Document clustering has many important applications in the area of data mining and information retrieval. Many existing document
clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is
only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms.
In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications
using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific
documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference
markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and
related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate
the power of these citation-specific word features, and compare them with the original document’s textual representation in
a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy
Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which
determines the similarity between documents based on the number of co-citations, that is, in-links represented by citing documents
and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined
with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered
by journal articles. More specifically, this document representation strategy, when used by the clustering algorithm investigated
in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific
journal datasets. |
| |
Keywords: | |
This article is indexed in SpringerLink and other databases! |
|
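
The idea described in the abstract, enriching a document's bag-of-words with vocabulary taken from citation contexts, can be illustrated with a minimal sketch. It is not the authors' implementation. The numeric marker pattern, the fixed word window, the function names, and the choice to add the contexts in which other papers cite a document to that document's bag are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumption: reference markers look like [12] or [3, 7].
MARKER = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]")
TOKEN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return TOKEN.findall(text.lower())


def citation_contexts(text, window=5):
    """For each reference marker, collect up to `window` words on either side."""
    contexts = []
    for match in MARKER.finditer(text):
        before = tokenize(text[:match.start()])
        after = tokenize(text[match.end():])
        start = max(0, len(before) - window)
        contexts.append(before[start:] + after[:window])
    return contexts


def bag_of_words(full_text, contexts=()):
    """Term counts of the full text, plus the terms of any citation contexts."""
    bag = Counter(tokenize(MARKER.sub(" ", full_text)))
    for context in contexts:
        bag.update(context)
    return bag


def cosine(a, b):
    """Cosine similarity between two term-count bags."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

In this sketch, two documents that share no full-text terms can still obtain a non-zero similarity once a citation context contributes a related term, which is the effect the abstract hypothesizes. Any clustering algorithm that works on pairwise similarities could then consume the output of `cosine`.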