
1.

A five-level static cache architecture for web search engines

Rifat Ozcan I. Sengor Altingovde B. Barla Cambazoglu Flavio P. Junqueira Özgür Ulusoy 《Information processing & management》2012

Caching is a crucial performance component of large-scale web search engines, as it greatly helps reduce average query response times and query processing workloads on backend search clusters. In this paper, we describe a multi-level static cache architecture that stores five different item types: query results, precomputed scores, posting lists, precomputed intersections of posting lists, and documents. Moreover, we propose a greedy heuristic to prioritize items for caching, based on gains computed by using items’ past access frequencies, estimated computational costs, and storage overheads. This heuristic takes into account the inter-dependency between individual items when making its caching decisions, i.e., after a particular item is cached, gains of all items that are affected by this decision are updated. Our simulations under realistic assumptions reveal that the proposed heuristic performs better than dividing the entire cache space among particular item types at fixed proportions.
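The greedy, dependency-aware selection described in this abstract can be illustrated with a minimal sketch. The abstract gives no formula, so everything here is an assumption made for illustration: the gain `freq * cost / size`, the `Item` class, and the `savings` map, which models how caching one item lowers the miss cost of others. It is not the authors' actual heuristic.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    name: str
    freq: int      # past access frequency
    cost: float    # estimated cost of computing the item on a cache miss
    size: int      # storage overhead in the cache
    # Hypothetical dependency model: caching this item reduces the miss
    # cost of each named item by the given amount.
    savings: dict = field(default_factory=dict)


def gain(item):
    # Assumed gain: expected cost saved per unit of cache space.
    return item.freq * item.cost / item.size


def greedy_fill(items, capacity):
    """Pick items by highest gain, updating dependent gains after each pick.

    Note: the cost fields of the given items are modified in place.
    """
    remaining = {it.name: it for it in items}
    chosen = []
    while remaining:
        fitting = [it for it in remaining.values() if it.size <= capacity]
        if not fitting:
            break
        best = max(fitting, key=gain)
        if gain(best) <= 0:
            break
        chosen.append(best.name)
        capacity -= best.size
        del remaining[best.name]
        # Caching `best` makes its dependents cheaper, so their gains drop.
        for name, saved in best.savings.items():
            if name in remaining:
                dependent = remaining[name]
                dependent.cost = max(0.0, dependent.cost - saved)
    return chosen
```

With a posting list whose caching makes one query result cheap to recompute, the update step demotes that result, and a different result takes the remaining space. A fixed, one-shot ranking by initial gain would have chosen the demoted result instead.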

2.

Efficient immediate-access dynamic indexing

《Information processing & management》2023,60(3):103248

In a dynamic retrieval system, documents must be ingested as they arrive, and be immediately findable by queries. Our purpose in this paper is to describe an index structure and processing regime that accommodates that requirement for immediate access, seeking to make the ingestion process as streamlined as possible, while at the same time seeking to make the growing index as small as possible, and seeking to make term-based querying via the index as efficient as possible. We describe a new compression operation and a novel approach to extensible lists which together facilitate that triple goal. In particular, the structure we describe provides incremental document-level indexing using as little as two bytes per posting and only a small amount more for word-level indexing; provides fast document insertion; supports immediate and continuous queryability; provides support for fast conjunctive queries and similarity score-based ranked queries; and facilitates fast conversion of the dynamic index to a “normal” static compressed inverted index structure. Measurement of our new mechanism confirms that in-memory dynamic document-level indexes for collections into the gigabyte range can be constructed at a rate of two gigabytes/minute using a typical server architecture, that multi-term conjunctive Boolean queries can be resolved in just a few milliseconds each on average even while new documents are being concurrently ingested, and that the net memory space required for all of the required data structures amounts to an average of as little as two bytes per stored posting, less than half the space required by the best previous mechanism.

3.

Hybrid compression of inverted lists for reordered document collections

Diego Arroyuelo Mauricio Oyarzún Senén González Victor Sepulveda 《Information processing & management》2018,54(6):1308-1324


4.

Neural embedding-based indices for semantic search

Fatemeh Lashkari Ebrahim Bagheri Ali A. Ghorbani 《Information processing & management》2019,56(3):733-755

Traditional information retrieval techniques that primarily rely on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply due to the use of different terminology for describing the same concepts. As such, semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while the sole consideration of semantic information might not lead to improved retrieval performance over keyword-based search, its consideration enables the retrieval of a set of relevant documents that cannot be retrieved by keyword-based methods. As such, building indices that store and provide access to semantic information during the retrieval process is important. While the process for building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices is still an open challenge. Existing work has proposed to build one unified index encompassing both textual and semantic information or to build separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embedding-based representations of terms, semantic entities, semantic types and documents within the same embedding space to facilitate the development of a unified search index that would consist of these four information types. We perform experiments on standard and widely used document collections including Clueweb09-B and Robust04 to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives.
Based on our experiments, we find that when neural embeddings are used to build inverted indices, thereby relaxing the requirement to explicitly observe the posting list key in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, reducing both index size and query processing time, and (b) while retrieval efficiency, which is the main objective of an efficient indexing mechanism, improves with our proposed method, retrieval effectiveness remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.

5.

Analyzing imbalance among homogeneous index servers in a web search system

C.S. Badue R. Baeza-Yates B. Ribeiro-Neto A. Ziviani N. Ziviani 《Information processing & management》2007

The performance of parallel query processing in a cluster of index servers is crucial for modern web search systems. In such a scenario, the response time basically depends on the execution time of the slowest server to generate a partial ranked answer. Previous approaches investigate performance issues in this context using simulation, analytical modeling, experimentation, or a combination of them. Nevertheless, these approaches simply assume balanced execution times among homogeneous servers (by uniformly distributing the document collection among them, for instance), a scenario that we did not observe in our experimentation. On the contrary, we found that even with a balanced distribution of the document collection among index servers, correlations between the frequency of a term in the query log and the size of its corresponding inverted list lead to imbalances in query execution times at these same servers, because these correlations affect disk caching behavior. Further, the relative sizes of the main memory at each server (with regard to disk space usage) and the number of servers participating in the parallel query processing also affect imbalance of local query execution times. These are relevant findings that have not been reported before and that, we understand, are of interest to the research community.

6.

Query expansion with terms selected using lexical cohesion analysis of documents

Olga Vechtomova Murat Karamuftuoglu 《Information processing & management》2007

We present new methods of query expansion using terms that form lexical cohesive links between the contexts of distinct query terms in documents (i.e., words surrounding the query terms in text). The link-forming terms (link-terms) and short snippets of text surrounding them are evaluated in both interactive and automatic query expansion (QE). We explore the effectiveness of snippets in providing context in interactive query expansion, compare query expansion from snippets vs. whole documents, and query expansion following snippet selection vs. full document relevance judgements. The evaluation, conducted on the HARD track data of TREC 2005, suggests that there are considerable advantages in using link-terms and their surrounding short text snippets in QE compared to terms selected from full-texts of documents.

7.

Efficient online index maintenance for contiguous inverted lists

Nicholas Lester Justin Zobel Hugh Williams 《Information processing & management》2006

Search engines and other text retrieval systems use high-performance inverted indexes to provide efficient text query evaluation. Algorithms for fast query evaluation and index construction are well-known, but relatively little has been published concerning update. In this paper, we experimentally evaluate the two main alternative strategies for index maintenance in the presence of insertions, with the constraint that inverted lists remain contiguous on disk for fast query evaluation. The in-place and re-merge strategies are benchmarked against the baseline of a complete re-build. Our experiments with large volumes of web data show that re-merge is the fastest approach if large buffers are available, but that even a simple implementation of in-place update is suitable when the rate of insertion is low or memory buffer size is limited. We also show that with careful design of aspects of implementation such as free-space management, in-place update can be improved by around an order of magnitude over a naïve implementation.

8.

On document relevance and lexical cohesion between query terms

Olga Vechtomova Murat Karamuftuoglu Stephen E. Robertson 《Information processing & management》2006

Lexical cohesion is a property of text, achieved through lexical-semantic relations between words in text. Most information retrieval systems make use of lexical relations in text only to a limited extent. In this paper we empirically investigate whether the degree of lexical cohesion between the contexts of query terms’ occurrences in a document is related to its relevance to the query. Lexical cohesion between distinct query terms in a document is estimated on the basis of the lexical-semantic relations (repetition, synonymy, hyponymy and sibling) that exist between their collocates, i.e., words that co-occur with them in the same windows of text. Experiments suggest that significant differences exist between the lexical cohesion in relevant and non-relevant document sets. A document ranking method based on lexical cohesion shows some performance improvements.
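As a rough illustration of the collocate idea in this abstract, the sketch below collects the words that fall within a fixed window around each occurrence of a query term and reports the collocates two query terms share. It covers only the repetition relation; synonymy, hyponymy and sibling links would need a lexical resource and are left out. The function names, the window size, and the lack of stopword removal are assumptions, not details from the paper.

```python
def collocates(tokens, term, window=3):
    """Words within `window` positions of any occurrence of `term`."""
    found = set()
    for i, tok in enumerate(tokens):
        if tok == term:
            lo = max(0, i - window)
            found.update(tokens[lo:i])
            found.update(tokens[i + 1:i + 1 + window])
    found.discard(term)
    return found


def repetition_links(tokens, term_a, term_b, window=3):
    """Collocates shared by two query terms (repetition relation only)."""
    shared = collocates(tokens, term_a, window) & collocates(tokens, term_b, window)
    return shared - {term_a, term_b}
```

The size of the returned set could serve as a crude per-document cohesion score between the two query terms; in practice function words would likely be filtered out first.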

9.

Performance of query processing implementations in ranking-based text retrieval systems using inverted indices

B. Barla Cambazoglu Cevdet Aykanat 《Information processing & management》2006

Similarity calculations and document ranking form the computationally expensive parts of query processing in ranking-based text retrieval. In this work, for these calculations, 11 alternative implementation techniques are presented under four different categories, and their asymptotic time and space complexities are investigated. To our knowledge, six of these techniques have not been discussed in any previous publication. Furthermore, analytical experiments are carried out on a 30 GB document collection to evaluate the practical performance of different implementations in terms of query processing time and space consumption. Advantages and disadvantages of each technique are illustrated under different querying scenarios, and several experiments that investigate the scalability of the implementations are presented.

10.

Towards a unified approach to document similarity search using manifold-ranking of blocks

Xiaojun Wan Jianwu Yang Jianguo Xiao 《Information processing & management》2008

Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute the pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. Cosine), and then rank the documents by the similarity scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. a block of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach can make full use of the intrinsic global manifold structure of the document blocks by propagating the ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are respectively employed to segment text documents and web pages into blocks. Then, each block is assigned a ranking score by the manifold-ranking algorithm. Lastly, a document gets its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach can significantly improve retrieval performance over baseline approaches. The document block is validated as a better unit than the whole document in the manifold-ranking process.
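The score propagation step mentioned in this abstract can be sketched with the commonly used manifold-ranking iteration f ← αSf + (1 − α)y, where S is the symmetrically normalized weight matrix and y marks the query. Whether the paper uses exactly this form, and the values of `alpha` and `iterations`, are assumptions; block segmentation and score fusion are not shown.

```python
import math


def manifold_rank(weights, query_scores, alpha=0.5, iterations=100):
    """Propagate scores over a weighted graph of blocks.

    weights: symmetric matrix (list of lists) of block-to-block similarities
             with a zero diagonal.
    query_scores: initial vector y, e.g. 1.0 for the query block, else 0.0.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Symmetric normalization: S = D^(-1/2) W D^(-1/2).
    s = [
        [
            weights[i][j] / math.sqrt(degree[i] * degree[j])
            if degree[i] > 0 and degree[j] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    f = list(query_scores)
    for _ in range(iterations):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n))
            + (1 - alpha) * query_scores[i]
            for i in range(n)
        ]
    return f
```

On a three-block chain where only the first block is the query, the scores decay with graph distance from the query, which is the behavior the propagation is meant to capture.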

11.

Document replication strategies for geographically distributed web search engines

Enver Kayaaslan B. Barla Cambazoglu Cevdet Aykanat 《Information processing & management》2013

Large-scale web search engines are composed of multiple data centers that are geographically distant from each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine.

12.

Compression of large inverted files with hyperbolic term distribution

E. J. Schuegraf 《Information processing & management》1976,12(6):377-384

The storage requirements for retrieval systems utilizing inverted files are calculated assuming different storage modes. Various methods for compression of these large files are analyzed. Binary vectors compressed by run-length coding as well as lists of document numbers were found to be suitable. The problem of minimal storage requirements for the inverted file is solved for different assumptions about index term distributions. A representation combining run-length coded binary vectors with lists of document numbers was found to be the most economical. Parameter values for this minimum storage form are calculated and specified in tables as well as displayed graphically.
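The two storage modes compared in this abstract can be made concrete with a small sketch: a term's postings are either a list of document numbers or a binary vector over all documents, and the vector can be stored as run lengths. The encoding below (alternating runs, starting with a run of zeros) is one simple convention chosen for illustration; the paper's exact coding scheme and parameters are not given here.

```python
def run_lengths(doc_ids, num_docs):
    """Run-length encode the term's bit vector over documents 0..num_docs-1.

    Returns lengths of alternating runs, starting with a run of 0s
    (which has length 0 if document 0 contains the term).
    """
    present = set(doc_ids)
    runs = []
    current_bit, run = 0, 0
    for d in range(num_docs):
        bit = 1 if d in present else 0
        if bit == current_bit:
            run += 1
        else:
            runs.append(run)
            current_bit, run = bit, 1
    runs.append(run)
    return runs


def decode(runs):
    """Recover the sorted list of document numbers from the run lengths."""
    doc_ids, pos, bit = [], 0, 0
    for r in runs:
        if bit:
            doc_ids.extend(range(pos, pos + r))
        pos += r
        bit ^= 1
    return doc_ids
```

For a term whose documents cluster together, the run-length form needs fewer numbers than the explicit list, while for a rare, scattered term the list is shorter. That trade-off is the intuition behind combining the two representations.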

13.

Effective top-k computation with term-proximity support

Mingjie Zhu Shuming Shi Mingjing Li Ji-Rong Wen 《Information processing & management》2009

Modern web search engines are expected to return the top-k results efficiently. Although many dynamic index pruning strategies have been proposed for efficient top-k computation, most of them are prone to ignoring some especially important factors in ranking functions, such as term-proximity (the distance relationship between query terms in a document). In our recent work [Zhu, M., Shi, S., Li, M., & Wen, J. (2007). Effective top-k computation in retrieving structured documents with term-proximity support. In Proceedings of 16th CIKM conference (pp. 771–780)], we demonstrated that, when term-proximity is incorporated into ranking functions, most existing index structures and top-k strategies become quite inefficient. To solve this problem, we built the inverted index based on web page structure and proposed the query processing strategies accordingly. The experimental results indicate that the proposed index structures and query processing strategies significantly improve the top-k efficiency. In this paper, we study the possibility of adopting additional techniques to further improve top-k computation efficiency. We propose a Proximity-Probe Heuristic to make our top-k algorithms more efficient. We also test the efficiency of our approaches on various settings (linear or non-linear ranking functions, exact or approximate top-k processing, etc.).

14.

Incorporating compactness to generate term-association view snippets for ontology search

Weiyi Ge Gong Cheng Huiying Li Yuzhong Qu 《Information processing & management》2013

A query-relevant snippet for ontology search is useful for deciding if an ontology fits users’ needs. In this paper, we argue that a good snippet in a keyword-based ontology search engine should present a term-association view and be compact, and we propose an approach to generate it. To obtain term-association view snippets, a model of term association graph for ontology is proposed, and a concept of maximal r-radius subgraph is introduced to decompose the term association graph into connected subgraphs, which preserve close relations between terms. To achieve compactness, in a query-relevant maximal r-radius subgraph, a connected subgraph thereof with a small graph weight is extracted as a sub-snippet. Finally, a greedy method is used to select sub-snippets to form a snippet in consideration of query relevance and compactness without violating the length constraint. An empirical study on our implementation shows that our approach is feasible. An evaluation on effectiveness shows that the term-association view snippet is favored by users, and that compactness helps reading and judgment.

15.

Lexical cohesion and term proximity in document ranking

Olga Vechtomova Murat Karamuftuoglu 《Information processing & management》2008

We demonstrate effective new methods of document ranking based on lexical cohesive relationships between query terms. The proposed methods rely solely on the lexical relationships between original query terms, and do not involve query expansion or relevance feedback. Two types of lexical cohesive relationship information between query terms are used in document ranking: short-distance collocation relationship between query terms, and long-distance relationship, determined by the collocation of query terms with other words. The methods are evaluated on TREC corpora, and show improvements over baseline systems.

16.

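Research on an inverted index algorithm based on a Native-XML database

Wang Hongyu 《情报科学》 (Information Science) 2006,24(7):1062-1065

This paper briefly introduces a full-text retrieval technique based on a Native-XML database. It takes the content of XML documents as the indexing target, defines documents and document attributes, uses the BACI inverted-index algorithm to build an index over the information, and implements Web-based hybrid retrieval. It provides a technical reference for the low-level implementation of full-text databases.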

17.

Document indexing: a concept-based approach to term weight estimation

《Information processing & management》2005,41(5):1065-1080

Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.

18.

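A brief discussion of "common ailments" in official document formats

Hu Jiaqiong Xu Xingyang Chen Qun 《科教文汇》 2012,(23):78-79

The format of official documents has taken shape gradually through the process of drafting and issuing them, and it reflects the characteristics and authority of state administrative documents. This paper analyzes the "common ailments" found in official document formats across three parts of a document, namely the header, the body, and the imprint, so that the format can convey the issuing authority's intent more accurately.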

19.

Using query expansion in graph-based approach for query-focused multi-document summarization

Lin Zhao Lide Wu Xuanjing Huang 《Information processing & management》2009

This paper presents a novel query expansion method, which is integrated into a graph-based algorithm for query-focused multi-document summarization, so as to address the problem of limited information in the original query. Our approach makes use of both the sentence-to-sentence relations and the sentence-to-word relations to select query-biased informative words from the document set and use them as query expansions to improve the sentence ranking result. Compared to previous query expansion approaches, our approach can capture more relevant information with less noise. We performed experiments on the data of the Document Understanding Conference (DUC) 2005 and DUC 2006, and the evaluation results show that the proposed query expansion method can significantly improve the system performance and make our system comparable to the state-of-the-art systems.

20.

Adapting information retrieval to query contexts (cited by: 1; self-citations: 0; citations by others: 1)

Jing Bai Jian-Yun Nie 《Information processing & management》2008,44(6):1901

In current IR approaches documents are retrieved only according to the terms specified in the query. The same answers are returned for the same query whatever the user and the search goal are. In reality, many other contextual factors strongly influence a document’s relevance and they should be taken into account in IR operations. This paper proposes a method, based on language modeling, to integrate several contextual factors so that document ranking will be adapted to the specific query contexts. We will consider three contextual factors in this paper: the topic domain of the query, the characteristics of the document collection, as well as context words within the query. Each contextual factor is used to generate a new query language model to specify some aspect of the information need. All these query models are then combined to produce a more complete model for the underlying information need. Our experiments on TREC collections show that each contextual factor can positively influence the IR effectiveness and the combined model results in the highest effectiveness. This study shows that it is both beneficial and feasible to integrate more contextual factors in the current IR practice.
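One simple way to picture the combination of several query language models is a weighted linear mixture of term distributions, sketched below. The abstract does not state how the models are combined, so the interpolation, the function name `combine_models`, and the example weights are all assumptions made for illustration.

```python
def combine_models(models, weights):
    """Linearly interpolate query language models.

    models: list of dicts mapping term -> probability (each sums to 1).
    weights: non-negative mixing weights, one per model; they are
             normalized here so the result is again a distribution.
    """
    total = sum(weights)
    combined = {}
    for model, weight in zip(models, weights):
        for term, prob in model.items():
            combined[term] = combined.get(term, 0.0) + (weight / total) * prob
    return combined
```

For example, mixing an original query model with a hypothetical domain model at equal weights adds probability mass for domain terms that the user never typed, which is how a contextual factor could influence document ranking.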