首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
李伟 《高校图书馆工作》2005,25(4):29-31,87
文章介绍了一种基于Web Services的分布式图情数据库集成查询系统的设计及其实现。系统通过中间代理层向用户提供了统一的数据库视图,并采用了一种基于关键字的倒排索引方法以提高系统的查询性能。参考文献4。  相似文献   

2.
一种支持高效检索的实时更新倒排索引策略   总被引:5,自引:0,他引:5  
李栋  史晓东 《情报学报》2006,25(1):16-20
最近的研究使得搜索引擎中搜取的网页文档与万维网的变化越来越同步。为使用户通过搜索引擎获取网络上的最新信息,必须加快倒排索引的更新。本文介绍了使用界标和增加/删除网页文档两种典型的倒排索引更新策略,并分析了它们的优缺点,提出了一种支持高效检索的实时更新倒排索引策略。这种策略综合了减少更新操作、加快实时更新和缩短用户查询响应时间等方面的优点,较好地适应了当前网络内容变化的特点。最后通过实验对这种策略进行了验证。  相似文献   

3.
Compressing Inverted Files   总被引:2,自引:0,他引:2  
Research into inverted file compression has focused on compression ratio—how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read, the smaller the index, the faster the search.The premise smaller is better may not be true. To truly build faster indexes it is often necessary to forfeit compression. For inverted lists consisting of only 128 occurrences compression may only add overhead. Perhaps the inverted list could be stored in 128 bytes in place of 128 words, but it must still be stored on disk. If the minimum disk sector read size is 512 bytes and the word size is 4 bytes, then both the compressed and raw postings would require one disk seek and one disk sector read. A less efficient compression technique may increase the file size, but decrease load/decompress time, thereby increasing throughput.Examined here are five compression techniques, Golomb, Elias gamma, Elias delta, Variable Byte Encoding and Binary Interpolative Coding. The effect on file size, file seek time, and file read time are all measured as is decompression time. A quantitative measure of throughput is developed and the performance of each method is determined.  相似文献   

4.
Binary Interpolative Coding for Effective Index Compression   总被引:5,自引:1,他引:4  
Information retrieval systems contain large volumes of text, and currently have typical sizes into the gigabyte range. Inverted indexes are one important method for providing search facilities into these collections, but unless compressed require a great deal of space. In this paper we introduce a new method for compressing inverted indexes that yields excellent compression, fast decoding, and exploits clustering—the tendency for words to appear relatively frequently in some parts of the collection and infrequently in others. We also describe two other quite separate applications for the same compression method: representing the MTF list positions generated by the Burrows-Wheeler Block Sorting transformation; and transmitting the codebook for semi-static block-based minimum-redundancy coding.  相似文献   

5.
为提高多关键词查询的效率并减少多关键词查询的开销,提出一种基于语义聚类的多关键词查询算法——MKQBSC。该算法使得语义相似的节点聚为一类,节点加入、退出或节点的语义改变时,聚类将相应改变。查询请求在相邻的语义聚类之间转发,直至到达语义相似的聚类。仿真实验结果表明:与传统的基于对倒排表求交集的多关键词查询算法相比,MKQBSC算法所需的路由跳数和所产生的消息数更少。  相似文献   

6.
针对信息检索角度的XML的结构化检索问题,利用基于倒排文件的方法,使用NEXI作为检索语言,在基于XML的数字图书馆检索实验系统WHU-XML上对其进行实现,并具体分析查询语言的解析方法以及所采用的结构化检索算法。  相似文献   

7.
Index maintenance strategies employed by dynamic text retrieval systems based on inverted files can be divided into two categories: merge-based and in-place update strategies. Within each category, individual update policies can be distinguished based on whether they store their on-disk posting lists in a contiguous or in a discontiguous fashion. Contiguous inverted lists, in general, lead to higher query performance, by minimizing the disk seek overhead at query time, while discontiguous inverted lists lead to higher update performance, requiring less effort during index maintenance operations. In this paper, we focus on retrieval systems with high query load, where the on-disk posting lists have to be stored in a contiguous fashion at all times. We discuss a combination of re-merge and in-place index update, called Hybrid Immediate Merge. The method performs strictly better than the re-merge baseline policy used in our experiments, as it leads to the same query performance, but substantially better update performance. The actual time savings achievable depend on the size of the text collection being indexed; a larger collection results in greater savings. In our experiments, variations of Hybrid Immediate Merge were able to reduce the total index update overhead by up to 73% compared to the re-merge baseline.
Stefan BüttcherEmail:
  相似文献   

8.
倒排文档是信息检索系统中最普遍使用的索引机制,而索引文件的压缩能大大提高检索速度和节约磁盘空间。倒排文件压缩的传统做法是文档(标识号)间距法(d-gaps)。然而,剧烈变化的间距值并不能被著名的前缀自由代码有效编码压缩。为了使间距值得到有效的压缩,本文设计了一个文档标识号重置法。模拟试验表明能更有效压缩d-gaps倒排文档。  相似文献   

9.
This paper explores the performance of top k document retrieval with score-at-a-time query evaluation on impact-ordered indexes in main memory. To better understand execution efficiency in the context of modern processor architectures, we examine the role of index compression on query evaluation latency. Experiments include compressing postings with variable byte encoding, Simple-8b, variants of the QMX compression scheme, as well as a condition that is less often considered—no compression. Across four web test collections, we find that the highest query evaluation speed is achieved by simply leaving the postings lists uncompressed, although the performance advantage over a state-of-the-art compression scheme is relatively small and the index is considerably larger. We explain this finding in terms of the design of modern processor architectures: Index segments with high impact scores are usually short and inherently benefit from cache locality. Index segments with lower impact scores may be quite long, but modern architectures have sufficient memory bandwidth (coupled with prefetching) to “keep up” with the processor. Our results highlight the importance of “architecture affinity” when designing high-performance search engines.  相似文献   

10.
Adding Compression to Block Addressing Inverted Indexes   总被引:8,自引:1,他引:7  
Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains significant reduction of its original size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions and pay the reduction in space with some sequential text scanning.In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches.  相似文献   

11.
Word-based byte-oriented compression has succeeded on large natural language text databases, by providing competitive compression ratios, fast random access, and direct sequential searching. We show that by just rearranging the target symbols of the compressed text into a tree-shaped structure, and using negligible additional space, we obtain a new implicitly indexed representation of the compressed text, where search times are drastically improved. The occurrences of a word can be listed directly, without any text scanning, and in general any inverted-index-like capability, such as efficient phrase searches, can be emulated without storing any inverted list information. We experimentally show that our proposal performs not only much more efficiently than sequential searches over compressed text, but also than explicit inverted indexes and other types of indexes, when using little extra space. Our representation is especially successful when searching for single words and short phrases.  相似文献   

12.
在分词技术、索引技术、结构化查询语言技术的基础上,提出了一个基于XML文档数据库的信息检索系统,这一系统模型主要由分词模块、索引模块及查询模块组成。  相似文献   

13.
论述基于内容的图像检索(CBIR)的3类标准:图像压缩标准、查询语言和信息内容描述标准,分析其存在的问题并提出解决方法,最后重点剖析多媒体信息描述国际标准MPEG-7的内容及其在CBIR中的应用。  相似文献   

14.
A pipelined architecture for distributed text query evaluation   总被引:1,自引:0,他引:1  
Two principal query-evaluation methodologies have been described for cluster-based implementation of distributed information retrieval systems: document partitioning and term partitioning. In a document-partitioned system, each of the processors hosts a subset of the documents in the collection, and executes every query against its local sub-collection. In a term-partitioned system, each of the processors hosts a subset of the inverted lists that make up the index of the collection, and serves them to a central machine as they are required for query evaluation. In this paper we introduce a pipelined query-evaluation methodology, based on a term-partitioned index, in which partially evaluated queries are passed amongst the set of processors that host the query terms. This arrangement retains the disk read benefits of term partitioning, but more effectively shares the computational load. We compare the three methodologies experimentally, and show that term distribution is inefficient and scales poorly. The new pipelined approach offers efficient memory utilization and efficient use of disk accesses, but suffers from problems with load balancing between nodes. Until these problems are resolved, document partitioning remains the preferred method. Alistair Moffat was supported by the Australian Research Council, the ARC Special Research Center for Perceptive and Intelligent Machines in Complex Environments, and the NICTA Victoria Laboratory. William Webber and Justin Zobel were supported by the Australian Research Council. Ricardo Baeza-Yates was supported by Grant P01-029-F from Millennium Initiative of Mideplan, Chile; and by the University of Melbourne as a visiting scholar at the time this project was undertaken.  相似文献   

15.
在信息技术中,将每条记录的中文可检词或标题、内容经过独特处理,做成“全息压缩码”做为唯一标识。该技术运用在计算机文献检索系统中可大大提高检索速度,优化空间配置,广泛用于倒排文档、查重、记录对比等多个领域。  相似文献   

16.
We consider the following autocompletion search scenario: imagine a user of a search engine typing a query; then with every keystroke display those completions of the last query word that would lead to the best hits, and also display the best such hits. The following problem is at the core of this feature: for a fixed document collection, given a set D of documents, and an alphabetical range W of words, compute the set of all word-in-document pairs (w, d) from the collection such that w W and d ∈ D. We present a new data structure with the help of which such autocompletion queries can be processed, on the average, in time linear in the input plus output size, independent of the size of the underlying document collection. At the same time, our data structure uses no more space than an inverted index. Actual query processing times on a large test collection correlate almost perfectly with our theoretical bound.
Ingmar WeberEmail:
  相似文献   

17.
[目的/意义]了解、分析和识别用户学术搜索时所表达的信息需求是优化查询结果、提高学术搜索引擎用户体验的首要步骤,而用户进行学术搜索时通过查询表达式所表达的用户表意信息需求及潜在信息需求可称之为学术查询意图。本文总结学术查询意图类目体系有助于学术查询意图识别和检索结果页面的呈现。[方法/过程]在A.Broder的查询意图类目体系的基础上,结合百度学术搜索查询日志中查询表达式实例,构建学术查询意图的类目体系。以此为基础,总结不同类别的学术查询意图,并分析不同类别学术查询意图下查询表达式的特点。[结果/结论]学术查询意图主要分为学术文献类、学术实体类、学术探索类、知识问答类和非学术文献类五大类;得出不同类别学术查询意图在学术搜索中的大致比例;给出每类学术查询意图的查询表达式特征、查询情景和查询结果页。  相似文献   

18.
In the patent domain significant efforts are invested to assist researchers in formulating better queries, preferably via automated query expansion. Currently, automatic query expansion in patent search is mostly limited to computing co-occurring terms for the searchable features of the invention. Additional query terms are extracted automatically from patent documents based on entropy measures. Learning synonyms in the patent domain for automatic query expansion has been a difficult task. No dedicated sources providing synonyms for the patent domain, such as patent domain specific lexica or thesauri, are available. In this paper we focus on the highly professional search setting of patent examiners. In particular, we use query logs to learn synonyms for the patent domain. For automatic query expansion, we create term networks based on the query logs specifically for several USPTO patent classes. Experiments show good performance in automatic query expansion using these automatically generated term networks. Specifically, with a larger number of query logs for a specific patent US class available the performance of the learned term networks increases.  相似文献   

19.
信息搜寻中用户查询重构研究综述   总被引:1,自引:0,他引:1  
李纲  胡蓉 《图书情报工作》2014,58(11):123-129
基于信息搜寻中人机交互行为,从查询重构类型与模式、查询重构绩效、查询重构影响因素及查询式扩展技术4个方面综述国内外关于用户查询重构的研究。得出结论:基于查询重构模式研究,可获悉用户偏向使用的查询重构序列;结合查询重构影响因素和重构序列,可向不同群体用户推荐高概率的查询词,但该研究也存在一定的局限;尽管从检索系统角度,查询重构得到不少学者的广泛关注,但关于用户如何重构查询的研究在中文文献中尚未见到。  相似文献   

20.
The query language OVAL which is intended for the integration with the database programming language based on C++ is proposed in this paper. The work addresses the impedance mismatch problem [1] between the syntax and the semantics of the programming and query language. The query language OVAL is based on the functional query language FQL [3] extending it for the manipulation of complex objects. The salient features of the OVAL query language are: (i) functional nature of the query language, which makes the language suitable for the integration with the procedural programming languages and provides modular style of query definition, (ii) the use of schema information for expressing queries and (iii) recursive evaluation of the algebraic operations on set structured complex objects.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号