Inverted file compression through document identifier reassignment
Institution:1. Department of Laboratory Medicine, The Hospital Of Hangzhou Dianzi University, Hangzhou, Zhejiang, China;2. Department of Pathology, The First People's Hospital of Yancheng City, Yancheng, Jiangsu, China;3. Department of Pathology, Yancheng Maternity and Child Health Care Hospital, Affiliated of Yangzhou University Medical College, Yancheng, China;4. Department of General Surgery, The Hospital Of Hangzhou Dianzi University, Hangzhou, Zhejiang, China;5. Department of Pathology, The People's Hospital of Tinghu District, Yancheng, China;6. Department of Laboratory Medicine, Yancheng TCM Hospital Affiliated to Nanjing University of Chinese Medicine, Yancheng, Jiangsu, China;1. Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA;2. Center of Toxins, Immune-response and Cell Signaling (CeTICS), LECC, Instituto Butantan, São Paulo, Brazil;3. Center for Bioinformatics and Genomics Systems Engineering, TEES, College Station, TX, USA;4. Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
Abstract:The inverted file is the most popular indexing mechanism for document search in information retrieval systems. Compressing an inverted file can greatly improve the document search rate. Traditionally, inverted file compression uses the d-gap technique, which replaces document identifiers with gap values that are usually much smaller. However, fluctuating gap values cannot be compressed efficiently by some well-known prefix-free codes. To smooth and reduce the gap values, we propose a document-identifier reassignment algorithm based on a similarity factor between documents. We generate a reassignment order for all documents according to this similarity, so that documents with closer relationships receive closer identifiers. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and that the compression rate of d-gapped inverted files with prefix-free codes can be improved by 15%.
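The idea in the abstract can be illustrated with a small sketch. The abstract does not specify the similarity factor, the ordering procedure, or the prefix-free code, so everything below is an assumption made for illustration: Jaccard similarity over each document's term set, a greedy nearest-neighbour ordering (the hypothetical `greedy_order`), and Elias gamma code lengths as the cost measure. It is not the authors' algorithm.

```python
def d_gaps(postings):
    """Turn a sorted list of positive document ids into d-gaps.

    The first id is kept as-is; every later entry is the difference
    from its predecessor.
    """
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def gamma_bits(n):
    """Bit length of the Elias gamma code of a positive integer n."""
    return 2 * (n.bit_length() - 1) + 1


def index_bits(index):
    """Total gamma-coded size, in bits, of all d-gapped posting lists."""
    return sum(gamma_bits(g) for p in index.values() for g in d_gaps(p))


def doc_terms_from(index):
    """Invert term -> doc ids into doc id -> set of terms."""
    docs = {}
    for term, postings in index.items():
        for doc in postings:
            docs.setdefault(doc, set()).add(term)
    return docs


def jaccard(a, b):
    """Assumed similarity factor: shared terms over all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_order(doc_terms):
    """Hypothetical ordering: start at the smallest id, then repeatedly
    append the unvisited document most similar to the last one chosen
    (ties go to the smallest id)."""
    remaining = set(doc_terms)
    current = min(remaining)
    remaining.remove(current)
    order = [current]
    while remaining:
        current = max(
            sorted(remaining),
            key=lambda d: jaccard(doc_terms[order[-1]], doc_terms[d]),
        )
        remaining.remove(current)
        order.append(current)
    return order


def reassign(index, order):
    """Renumber documents 1..N in the given order and rebuild the index."""
    new_id = {old: i + 1 for i, old in enumerate(order)}
    return {t: sorted(new_id[d] for d in p) for t, p in index.items()}


# Toy inverted file: similar documents are interleaved, so gaps are large.
index = {"apple": [1, 4, 7], "pear": [2, 5, 8], "plum": [3, 6, 9]}
order = greedy_order(doc_terms_from(index))
new_index = reassign(index, order)
```

On this toy index, reassignment places documents that share a term next to each other, so most gaps collapse to 1 and the gamma-coded size falls from 25 to 17 bits. Real collections will not cluster this cleanly, and a greedy ordering of this kind costs quadratic time in the number of documents.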
This article is indexed in ScienceDirect and other databases.