1.

Interpolative coding of integer sequences supporting log-time random access

J. Teuhola 《Information processing & management》2011

Sequences of integers are common data types, occurring either as primary data or ancillary structures. The sizes of sequences can be large, making compression an interesting option. Effective compression presupposes variable-length coding, which destroys the regular alignment of values. Yet it would often be desirable to access only a small subset of the entries, either by position (ordinal number) or by content (element value), without having to decode most of the sequence from the start. Here such a random access technique for compressed integers is described, with the special feature that no auxiliary index is needed. The solution applies a method called interpolative coding, which is one of the most efficient non-statistical codes for integers. Indexing is avoided by address calculation guaranteeing sufficient space for codes even in the worst case. The additional redundancy, compared to regular interpolative coding, is only about 1 bit per source integer for uniform distribution. The time complexity of random access is logarithmic with respect to the source size for both position-based and content-based retrieval. According to experiments, random access is faster than full decoding when the number of accessed integers is not more than approximately 0.75 · n/log₂n for sequence length n. The tests also confirm that the method is quite competitive with other approaches to random access coding, suggested in the literature.
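
As background for the abstract above, the following is a minimal sketch of plain binary interpolative coding, the base method the paper builds on. It is not the paper's address-calculated random-access variant. The function names (`code_width`, `encode`, `decode`), the use of fixed-width binary codes, and the sample sequence are all assumptions made for illustration.

```python
def code_width(range_size):
    """Bits in a fixed-width binary code for `range_size` possible values."""
    return (range_size - 1).bit_length()


def encode(values, lo, hi, bits):
    """Append the code of strictly increasing `values` (all in lo..hi) to `bits`."""
    n = len(values)
    if n == 0:
        return
    mid = n // 2
    low = lo + mid              # `mid` smaller values must still fit below
    high = hi - (n - 1 - mid)   # the remaining larger values must fit above
    width = code_width(high - low + 1)
    if width:                   # a single candidate value costs no bits
        bits.append(format(values[mid] - low, "0{}b".format(width)))
    encode(values[:mid], lo, values[mid] - 1, bits)
    encode(values[mid + 1:], values[mid] + 1, hi, bits)


def decode(stream, n, lo, hi):
    """Rebuild n values from an iterator over '0'/'1' characters."""
    if n == 0:
        return []
    mid = n // 2
    low = lo + mid
    high = hi - (n - 1 - mid)
    width = code_width(high - low + 1)
    offset = int("".join(next(stream) for _ in range(width)), 2) if width else 0
    middle = low + offset
    left = decode(stream, mid, lo, middle - 1)
    right = decode(stream, n - 1 - mid, middle + 1, hi)
    return left + [middle] + right


values = [3, 8, 9, 11, 12, 13, 17]   # hypothetical sorted sequence in 1..20
bits = []
encode(values, 1, 20, bits)
decoded = decode(iter("".join(bits)), len(values), 1, 20)
assert decoded == values
```

The decoder has to know the sequence length and the bounds `lo` and `hi` in advance. Because each value is coded relative to a range narrowed by its already-coded neighbours, clustered values need few bits; in this example the value 12 is pinned between 11 and 13 and costs no bits at all.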

2.

Inverted file compression through document identifier reassignment

《Information processing & management》2003,39(1):117-131

The inverted file is the most popular indexing mechanism for document search in an information retrieval system. Compressing an inverted file can greatly improve document search rate. Traditionally, the d-gap technique is used in inverted file compression, replacing document identifiers with gap values that are usually much smaller. However, fluctuating gap values cannot be efficiently compressed by some well-known prefix-free codes. To smooth and reduce the gap values, we propose a document-identifier reassignment algorithm. This reassignment is based on a similarity factor between documents. We generate a reassignment order for all documents according to this similarity, so that documents with closer relationships receive closer identifiers. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and the compression rate of d-gapped inverted files with prefix-free codes can be improved by 15%.
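
The d-gap technique mentioned in this abstract can be illustrated with a minimal sketch. The function names, the sample posting lists, and the choice of the Elias gamma code as the prefix-free code are assumptions for illustration. The "reassigned" list is hand-picked to show the effect of clustering and is not produced by the paper's similarity-based algorithm.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Turn sorted document identifiers into the first id followed by successive gaps."""
    gaps = []
    prev = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_dgaps(gaps):
    """Recover the document identifiers by taking running sums of the gaps."""
    return list(accumulate(gaps))


def gamma_bits(x):
    """Length in bits of the Elias gamma code for a positive integer x."""
    return 2 * (x.bit_length() - 1) + 1


original = [3, 8, 9, 21, 22, 40]       # hypothetical posting list
reassigned = [10, 11, 12, 13, 14, 15]  # same term after a hypothetical id reassignment

for postings in (original, reassigned):
    gaps = to_dgaps(postings)
    assert from_dgaps(gaps) == postings
    print(gaps, sum(gamma_bits(g) for g in gaps), "bits")
```

When the documents containing a term receive neighbouring identifiers, most gaps collapse to 1, the cheapest value for a code such as Elias gamma. This is the effect that identifier reassignment aims for.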

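3.

"Julia curves" and fractal image compression coding

朱志良 赵德平 朱伟勇 《中国科学院研究生院学报》2002,19(2):177-181

The "Julia curve" is quantized in several ways according to a square shape, and the quantized "Julia curve" is applied to fractal image compression coding. This overcomes the drawback that fractal image compression coding encodes with a varying compression codebook. In addition, a small dictionary of commonly used entries is built to speed up fractal image compression coding. Experimental results show that the "Julia curve" can collage the image to be encoded well and retains the decoding advantages of fractal images.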

4.

Compression of large inverted files with hyperbolic term distribution

E. J. Schuegraf 《Information processing & management》1976,12(6):377-384

The storage requirements for retrieval systems utilizing inverted files are calculated assuming different storage modes. Various methods for compression of these large files are analyzed. Binary vectors compressed by run-length coding as well as lists of document numbers were found to be suitable. The problem of minimal storage requirements for the inverted file is solved for different assumptions about index term distributions. A representation combining run-length coded binary vectors with lists of document numbers was found to be the most economical. Parameter values for this minimum storage form are calculated and specified in tables as well as displayed graphically.

5.

Comparing inverted files and signature files for searching a large lexicon

《Information processing & management》2005,41(3):613-633

Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.

6.

Hybrid compression of inverted lists for reordered document collections

Diego Arroyuelo Mauricio Oyarzún Senén González Victor Sepulveda 《Information processing & management》2018,54(6):1308-1324

7.

The effectiveness of query-specific hierarchic clustering in information retrieval

《Information processing & management》2002,38(4):559-582

Hierarchic document clustering has been widely applied to information retrieval (IR) on the grounds of its potential improved effectiveness over inverted file search (IFS). However, previous research has been inconclusive as to whether clustering does bring improvements. In this paper we take the view that if hierarchic clustering is applied to search results (query-specific clustering), then it has the potential to increase the retrieval effectiveness compared both to that of static clustering and of conventional IFS. We conducted a number of experiments using five document collections and four hierarchic clustering methods. Our results show that the effectiveness of query-specific clustering is indeed higher, and suggest that there is scope for its application to IR.

8.

A statistics-based approach to incrementally update inverted files

《Information processing & management》2005,41(2):275-288

Many information retrieval systems use the inverted file as their indexing structure. The inverted file, however, requires inefficient reorganization when new documents are to be added to an existing collection. Most studies suggest dealing with this problem by reserving free space in an inverted file for incremental updates. In this paper, we propose a run-time statistics-based approach to allocate the spare space. This approach estimates the space requirements in an inverted file using only a small amount of the most recent statistical data on space usage and document update request rate. For best indexing speed and space efficiency, the amount of spare space to be allocated is determined by adaptively balancing the trade-off between reorganization reduction and space utilization. Experimental results show that the proposed space-sparing approach significantly reduces reorganization when updating an inverted file, and that at the same time unused free space can be well controlled so that file access speed is not affected.

9.

Inverted file organization in the information retrieval system based on thesaurus with weights

Zygmunt Mazur 《Information processing & management》1979,15(5):227-234

10.

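A joint source-channel coding and decoding method with unequal error protection for space communications

刘晓敏 涂国防 《中国科学院研究生院学报》2009,26(5):663-669

A joint source-channel coding and decoding method with unequal error protection for space communications is proposed. The method gives priority protection to the DC coefficients by means of a reversible coding scheme: the coded output bitstream of the DC coefficients is binary run-length coded, and the run lengths are encoded with reversible variable-length codes. The less important AC coefficients are coded with variable-length codes. Simulation results and analysis show that the method effectively resolves error propagation, improves the error resilience of the DC coefficients, and improves image transmission quality.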

11.

Efficient immediate-access dynamic indexing

《Information processing & management》2023,60(3):103248

In a dynamic retrieval system, documents must be ingested as they arrive, and be immediately findable by queries. Our purpose in this paper is to describe an index structure and processing regime that accommodates that requirement for immediate access, seeking to make the ingestion process as streamlined as possible, while at the same time seeking to make the growing index as small as possible, and seeking to make term-based querying via the index as efficient as possible. We describe a new compression operation and a novel approach to extensible lists which together facilitate that triple goal. In particular, the structure we describe provides incremental document-level indexing using as little as two bytes per posting and only a small amount more for word-level indexing; provides fast document insertion; supports immediate and continuous queryability; provides support for fast conjunctive queries and similarity score-based ranked queries; and facilitates fast conversion of the dynamic index to a “normal” static compressed inverted index structure. Measurement of our new mechanism confirms that in-memory dynamic document-level indexes for collections into the gigabyte range can be constructed at a rate of two gigabytes/minute using a typical server architecture, that multi-term conjunctive Boolean queries can be resolved in just a few milliseconds each on average even while new documents are being concurrently ingested, and that the net memory space required for all of the required data structures amounts to an average of as little as two bytes per stored posting, less than half the space required by the best previous mechanism.

12.

A fast procedure for the calculation of similarity coefficients in automatic classification

Peter Willett 《Information processing & management》1981,17(2):53-60

A fast algorithm is described for comparing the lists of terms representing documents in automatic classification experiments. The speed of the procedure arises from the fact that all of the non-zero-valued coefficients for a given document are identified together, using an inverted file to the terms in the document collection. The complexity and running time of the algorithm are compared with previously described procedures.
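
The central idea of this abstract, finding all non-zero coefficients for one document in a single pass over an inverted file, might be sketched as follows. The function names, the toy collection, and the choice of the Dice coefficient are assumptions for illustration; the abstract does not say which coefficient or data layout the paper uses.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the list of documents that contain it."""
    inverted = defaultdict(list)
    for doc_id, terms in docs.items():
        for term in set(terms):
            inverted[term].append(doc_id)
    return inverted


def nonzero_similarities(doc_id, docs, inverted):
    """Dice coefficients between `doc_id` and every document sharing a term with it."""
    own_terms = set(docs[doc_id])
    shared = defaultdict(int)
    # Only postings of the document's own terms are visited, so documents
    # with no term in common (coefficient zero) are never touched.
    for term in own_terms:
        for other in inverted[term]:
            if other != doc_id:
                shared[other] += 1
    return {
        other: 2 * count / (len(own_terms) + len(set(docs[other])))
        for other, count in shared.items()
    }


docs = {  # hypothetical collection of term lists
    "d1": ["index", "file", "search"],
    "d2": ["index", "search", "cluster"],
    "d3": ["image", "coding"],
}
inverted = build_inverted_file(docs)
print(nonzero_similarities("d1", docs, inverted))
```

In a sparse collection most document pairs share no terms, so skipping them entirely is where the saving over an exhaustive pairwise comparison comes from.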

13.

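Research on fine granular scalable coding of digital video

瞿晓 陈伟 《科技通报》2008,24(6)

With the development of the Internet and wireless communication, large volumes of video data must be transmitted over networks, which shifts the goal of video compression coding from the traditional storage orientation to a transmission orientation. However, in the face of two main problems, changes in network bandwidth and packet errors during transmission, compression coding needs to be adaptive. Fine granular scalable coding provides a fully scalable enhancement-layer bitstream that can be truncated at any point and adapts well to network bandwidth. This paper analyzes, compares and studies fine granular scalable (FGS) coding. Experiments show that FGS offers relatively high coding efficiency, good image quality and strong adaptability.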

14.

Distributed search based on self-indexed compressed text

Diego Arroyuelo Veronica Gil-Costa Senén González Mauricio Marin Mauricio Oyarzún 《Information processing & management》2012

Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text and posting lists caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that within the space of the compressed document collection, one can carry out the posting lists generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space.

15.

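A new fast fractal-mapping image compression algorithm

赵德平 李金 朱伟勇 苏畅 陈绯 《中国科学院研究生院学报》1999,(2)

To address drawbacks of fractal coding algorithms such as excessively long encoding time and the need for subdivision to control precision, a new fractal coding algorithm that approximates the image to be encoded level by level is proposed. The feasibility of this idea is explored theoretically and the theoretical basis of the algorithm is given. A new constructive proof is obtained that, for any given image, a set of contractive transformations can be found such that, starting from an arbitrary image, iterating this set of transformations reconstructs the original image. A new concrete algorithm for implementing fractal coding is given. Experiments show that the compression ratio and the quality of the restored image improve while the computation time is also greatly reduced.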

16.

A prefix trie index for inverted files

Michael J. Nelson 《Information processing & management》1997,33(6):739-744

A prefix trie index (originally called trie hashing) is applied to the problem of providing fast search times, fast load times and fast update properties in a bibliographic or full text retrieval system. For all but the largest dictionaries a single key search in the dictionary under trie hashing takes exactly one disk read. Front compression of search keys is used to enhance performance. Partial combining of the postings into the dictionary is analyzed as a method to give both faster retrieval and improved update properties for the trie hashing inverted file. Statistics are given for a test database consisting of an online catalog at the Graduate School of Library and Information Science Library of the University of Western Ontario. The effects of changing various parameters of prefix tries are tested in this application.
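
Front compression of sorted keys, which this abstract mentions, can be sketched generically as follows. The abstract does not describe the paper's exact scheme, so the pair layout (shared-prefix length, remaining suffix), the function names, and the sample keys are assumptions for illustration.

```python
def front_compress(sorted_keys):
    """Store each key as (length of prefix shared with the previous key, remaining suffix)."""
    pairs = []
    prev = ""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        pairs.append((shared, key[shared:]))
        prev = key
    return pairs


def front_decompress(pairs):
    """Rebuild the keys by reusing the stated prefix of the previously rebuilt key."""
    keys = []
    prev = ""
    for shared, suffix in pairs:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


keys = ["inform", "information", "informed", "invert", "inverted"]  # hypothetical dictionary
pairs = front_compress(keys)
assert front_decompress(pairs) == keys
print(pairs)
```

Because neighbouring keys in a sorted dictionary tend to share long prefixes, most entries shrink to a small count plus a short suffix. The cost is that a key can only be rebuilt by decoding forward from the start of its block.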

17.

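Multi-level quantization block truncation coding with motion compensation

王业奎 涂国防 《中国科学院研究生院学报》1999,(1)

A new coding algorithm for moving images, called multi-level quantization block truncation coding with motion compensation, is reported. Simulation results show that the new algorithm performs better than other existing algorithms of the same kind, with a compression ratio of 75 and a signal-to-noise ratio of 35 dB, and that it can be implemented in software in real time on current personal computers.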

18.

Neural embedding-based indices for semantic search

Fatemeh Lashkari Ebrahim Bagheri Ali A. Ghorbani 《Information processing & management》2019,56(3):733-755

Traditional information retrieval techniques that primarily rely on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply because different terminology is used to describe the same concepts. Semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while considering semantic information alone might not improve retrieval performance over keyword-based search, it enables the retrieval of a set of relevant documents that cannot be retrieved by keyword-based methods. As such, building indices that store and provide access to semantic information during the retrieval process is important. While the process for building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices is still an open challenge. Existing work has proposed building one unified index encompassing both textual and semantic information, or building separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embedding-based representations of terms, semantic entities, semantic types and documents within the same embedding space to facilitate the development of a unified search index that would consist of these four information types. We perform experiments on standard and widely used document collections, including Clueweb09-B and Robust04, to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives. Based on our experiments, we find that when neural embeddings are used to build inverted indices, hence relaxing the requirement to explicitly observe the posting list key in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, reducing the index size and query processing time, and (b) while retrieval efficiency, which is the main objective of an efficient indexing mechanism, improves using our proposed method, retrieval effectiveness also remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.

19.

Bidirectional delta files

Dana Shapira Michael Kats 《Information processing & management》2012

A bidirectional delta file is a novel concept, introduced in this paper, for a two-way delta file. Previous work focuses on one-way differential compression, called forwards and backwards delta files. Here we suggest efficiently combining them into a single file so that the combined file is smaller than the combination of the two individual ones. Given the bidirectional delta file of two files S and T and the original file S, one can decode it in order to produce T. The same bidirectional delta file is used together with the file T in order to reconstruct S. This paper presents two main strategies for producing a bidirectional delta file that is efficient in terms of the memory storage it requires: a quadratic time, optimal, dynamic programming algorithm, and a linear time, greedy algorithm. Although the dynamic programming algorithm often produces better results than the greedy algorithm, it is impractical for large files, and it is only used for theoretical comparisons. Experiments comparing the implemented algorithms with the traditional way of using both forwards and backwards delta files are presented, covering their processing time and their compression performance. These experiments show memory storage savings of about 25% using this bidirectional delta approach as compared to the compressed delta file constructed using the traditional way, while preserving approximately the same processing time for decoding.

20.

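SAR image compression based on an improved SPIHT algorithm

王璐 张志敏 《中国科学院研究生院学报》2012,29(2):220-226

The wavelet transform and the set partitioning in hierarchical trees (SPIHT) algorithm have achieved good results in synthetic aperture radar (SAR) image compression, but the complexity of SPIHT coding limits improvements in compression speed. To address the slow coding speed and large memory footprint of SPIHT, an improved list-free SPIHT algorithm is proposed that increases coding speed and reduces resource usage, making it suitable for hardware implementation. Experimental results show that the method achieves the same compression performance as the original algorithm with greatly increased speed, making it suitable for real-time implementation.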