An unsupervised heuristic-based approach for bibliographic metadata deduplication

Authors:	Eduardo N. Borges, Moisés G. de Carvalho, Renata Galante, Marcos André Gonçalves, Alberto H. F. Laender

Institution:	1. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil; 2. Computer Science Dept., Federal University of Minas Gerais, Belo Horizonte, Brazil

Abstract:	Digital libraries of scientific articles contain collections of digital objects that are usually described by bibliographic metadata records. These records can be acquired from different sources and represented using several metadata standards, which may be heterogeneous in both content and structure. As a result, many records may be duplicated in the repository, degrading the quality of services such as searching and browsing. In this article we present an approach that identifies duplicated bibliographic metadata records efficiently and effectively. We propose similarity functions designed specifically for the digital library domain and evaluate them experimentally. Our results show that the proposed functions improve the quality of metadata deduplication by up to 188% compared to four different baselines. We also show that our approach achieves statistically equivalent results when compared to a state-of-the-art method for replica identification based on genetic programming, without the burden and cost of any training process.

Keywords:	Digital libraries; Metadata; Deduplication; Similarity
This article is indexed in ScienceDirect and other databases.
