Similar literature
 Found 20 similar articles; search took 31 ms
1.
Pseudo-relevance feedback (PRF) is a classical technique to improve search engine retrieval effectiveness, by closing the vocabulary gap between users’ query formulations and the relevant documents. While PRF is typically applied on the same target corpus as the final retrieval, in the past, external expansion techniques have sometimes been applied to obtain a high-quality pseudo-relevant feedback set using an external corpus. However, such external expansion approaches have only been studied for sparse (BoW) retrieval methods, and their effectiveness for recent dense retrieval methods remains under-investigated. Indeed, dense retrieval approaches such as ANCE and ColBERT, which conduct similarity search based on encoded contextualised query and document embeddings, are of increasing importance. Moreover, pseudo-relevance feedback mechanisms have been proposed to further enhance dense retrieval effectiveness. In this work, we examine the application of dense external expansion to improve zero-shot retrieval effectiveness, i.e. evaluation on corpora without further training. Zero-shot retrieval experiments with six datasets, including two TREC datasets and four BEIR datasets, using the MSMARCO passage collection as the external corpus, indicate that obtaining external feedback documents using ColBERT can significantly improve NDCG@10 for sparse retrieval (by up to 28%) and dense retrieval (by up to 12%). In addition, using ANCE on the external corpus brings up to 30% NDCG@10 improvements for sparse retrieval and up to 29% for dense retrieval.

2.
This paper presents a robust and comprehensive graph-based rank aggregation approach, used to combine results of isolated ranker models in retrieval tasks. The method follows an unsupervised scheme, which is independent of how the isolated ranks are formulated. Our approach is able to combine arbitrary models, defined in terms of different ranking criteria, such as those based on textual, image or hybrid content representations. We reformulate the ad-hoc retrieval problem as a document retrieval based on fusion graphs, which we propose as a new unified representation model capable of merging multiple ranks and expressing inter-relationships of retrieval results automatically. By doing so, we claim that the retrieval system can benefit from learning the manifold structure of datasets, thus leading to more effective results. Another contribution is that our graph-based aggregation formulation, unlike existing approaches, allows for encapsulating contextual information encoded from multiple ranks, which can be directly used for ranking, without further computations and post-processing steps over the graphs. Based on the graphs, a novel similarity retrieval score is formulated using an efficient computation of minimum common subgraphs. Finally, another benefit over existing approaches is the absence of hyperparameters. A comprehensive experimental evaluation was conducted considering diverse well-known public datasets, composed of textual, image, and multimodal documents. The experiments demonstrate that our method reaches top performance, yielding better effectiveness scores than state-of-the-art baseline methods and promoting large gains over the rankers being fused, thus demonstrating the successful capability of the proposal in representing queries based on a unified graph-based model of rank fusions.

3.
The estimation of the query model is an important task in language modeling (LM) approaches to information retrieval (IR). The ideal estimation is expected to be not only effective in terms of high mean retrieval performance over all queries, but also stable in terms of low variance of retrieval performance across different queries. In practice, however, improving effectiveness can sacrifice stability, and vice versa. In this paper, we propose to study this tradeoff from a new perspective, i.e., the bias–variance tradeoff, which is a fundamental theory in statistics. We formulate the notion of bias–variance regarding retrieval performance and estimation quality of query models. We then investigate several estimated query models, by analyzing when and why the bias–variance tradeoff will occur, and how the bias and variance can be reduced simultaneously. A series of experiments on four TREC collections has been conducted to systematically evaluate our bias–variance analysis. Our approach and results will potentially form an analysis framework and a novel evaluation strategy for query language modeling.

4.
Croplands are the single largest anthropogenic source of nitrous oxide (N₂O) globally, yet their estimates remain difficult to verify when using Tier 1 and 3 methods of the Intergovernmental Panel on Climate Change (IPCC). Here, we re-evaluate global cropland-N₂O emissions in 1961–2014, using N-rate-dependent emission factors (EFs) upscaled from 1206 field observations at 180 globally distributed sites and high-resolution N inputs disaggregated from sub-national surveys covering 15593 administrative units. Our results confirm IPCC Tier 1 default EFs for upland crops in 1990–2014, but give a ∼15% lower EF in 1961–1989 and a ∼67% larger EF for paddy rice over the full period. Associated emissions (0.82 ± 0.34 Tg N yr⁻¹) are probably one-quarter lower than IPCC Tier 1 global inventories but close to Tier 3 estimates. The use of survey-based gridded N-input data contributes 58% of this emission reduction, the rest being explained by the use of observation-based non-linear EFs. We conclude that upscaling N₂O emissions from site-level observations to global croplands provides a new benchmark for constraining IPCC Tier 1 and 3 methods. The detailed spatial distribution of emission data is expected to inform advancement towards more realistic and effective mitigation pathways.

5.
This paper focuses on personalized outfit generation, aiming to generate compatible fashion outfits catering to given users. Personalized recommendation by generating outfits of compatible items is an emerging task in the recommendation community, with great commercial value but little explored so far. The task requires exploring both user-outfit personalization and outfit compatibility, either of which is challenging due to the huge learning space resulting from the large number of items, users, and possible outfit options. To specify the user preference on outfits and regulate the outfit compatibility modeling, we propose to incorporate coordination knowledge in fashion. Inspired by the fact that users might have coordination preferences in terms of category combination, we first define category combinations as templates and propose to model the user-template relationship to capture users’ coordination preferences. Moreover, since a small number of templates can cover the majority of fashion outfits, leveraging templates is also promising for guiding the outfit generation process. In this paper, we propose the Template-guided Outfit Generation (TOG) framework, which unifies the learning of user-template interaction, user–item interaction and outfit compatibility modeling. The personal preference modeling and outfit generation are organically blended together in our problem formulation, and therefore can be achieved simultaneously. Furthermore, we propose new evaluation protocols to evaluate different models from both the personalization and compatibility perspectives. Extensive experiments on two public datasets have demonstrated that the proposed TOG can achieve preferable performance in both evaluation perspectives, namely outperforming the most competitive baseline BGN by 7.8% and 10.3% in terms of personalization precision on the iFashion and Polyvore datasets, respectively, and improving the compatibility of the generated outfits by over 2%.

6.
GPS-enabled devices and social media popularity have created an unprecedented opportunity for researchers to collect, explore, and analyze text data with fine-grained spatial and temporal metadata. In this sense, text, time and space are different domains with their own representation scales and methods. This poses the challenge of detecting relevant patterns that may only arise from the combination of text with spatio-temporal elements. In particular, spatio-temporal textual data representation has relied on feature embedding techniques. This can limit a model’s expressiveness for representing certain patterns extracted from the sequence structure of textual data. To deal with the aforementioned problems, we propose an Acceptor recurrent neural network model that jointly models spatio-temporal textual data. Our goal is to focus on representing the mutual influence and relationships that can exist between written language and the time and place where it was produced. We represent space, time, and text as tuples, and use pairs of elements to predict the third one. This results in three predictive tasks that are trained simultaneously. We conduct experiments on two social media datasets and on a crime dataset; we use Mean Reciprocal Rank as the evaluation metric. Our experiments show that our model outperforms state-of-the-art methods, with improvements ranging from 5.5% to 24.7% for location and time prediction.
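
The evaluation metric named above, Mean Reciprocal Rank, can be sketched as follows. This is a minimal illustration of the standard definition, not the authors' evaluation code; the function name and the toy rankings are hypothetical, and a query whose target is absent from the ranking is assumed to contribute a reciprocal rank of 0.

```python
def mean_reciprocal_rank(rankings, targets):
    """Average of 1/rank of the correct item over all queries.

    rankings: list of ranked candidate lists, one per query (best first).
    targets:  list of the correct item for each query.
    A target missing from its ranking contributes 0 (an assumption of this sketch).
    """
    total = 0.0
    for ranked, target in zip(rankings, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)  # ranks are 1-based
    return total / len(rankings)


# Hypothetical example: three location-prediction queries.
rankings = [["x", "y", "z"], ["y", "x", "z"], ["z", "y", "x"]]
targets = ["x", "x", "q"]
mrr = mean_reciprocal_rank(rankings, targets)  # (1 + 1/2 + 0) / 3
```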

7.
The inverted file is the most popular indexing mechanism for document search in an information retrieval system. Compressing an inverted file can greatly improve document search rate. Traditionally, the d-gap technique is used in inverted file compression, replacing document identifiers with gap values that are usually much smaller. However, fluctuating gap values cannot be efficiently compressed by some well-known prefix-free codes. To smooth and reduce the gap values, we propose a document-identifier reassignment algorithm. This reassignment is based on a similarity factor between documents. We generate a reassignment order for all documents according to their similarity, so that documents with closer relationships receive closer identifiers. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and the compression rate of d-gapped inverted files with prefix-free codes can be improved by 15%.
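
The effect of d-gaps on a prefix-free code can be sketched as follows. This is only an illustration under stated assumptions: Elias gamma is used as one example of a well-known prefix-free code (the abstract does not name a specific code), the posting lists are made up, and the similarity-based reassignment algorithm itself is not reproduced, only its intended outcome of closer identifiers.

```python
def to_dgaps(doc_ids):
    """Turn a strictly increasing posting list into gaps (the first id is kept as-is)."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Recover the document identifiers by taking running sums of the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def elias_gamma(n):
    """Elias gamma code of n >= 1: (bit length - 1) zeros, then n in binary."""
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def encoded_bits(doc_ids):
    """Total bits needed to gamma-code the d-gaps of one posting list."""
    return sum(len(elias_gamma(gap)) for gap in to_dgaps(doc_ids))


# Hypothetical posting list for one term, before and after identifier reassignment.
scattered = [5, 120, 903]  # similar documents received distant identifiers
clustered = [5, 6, 7]      # the same documents after receiving close identifiers
bits_before = encoded_bits(scattered)
bits_after = encoded_bits(clustered)
```

Smaller and smoother gaps yield shorter codewords, which is why assigning nearby identifiers to similar documents helps compression.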

8.
Recent developments have shown that entity-based models that rely on information from the knowledge graph can improve document retrieval performance. However, given the non-transitive nature of relatedness between entities on the knowledge graph, the use of semantic relatedness measures can lead to topic drift. To address this issue, we propose a relevance-based model for entity selection based on pseudo-relevance feedback, which is then used to systematically expand the input query, leading to improved retrieval performance. We perform our experiments on the widely used TREC Web corpora and empirically show that our proposed approach to entity selection significantly improves ad hoc document retrieval compared to strong baselines. More concretely, the contributions of this work are as follows: (1) We introduce a graphical probability model that captures dependencies between entities within the query and documents. (2) We propose an unsupervised entity selection method based on the graphical model for query entity expansion and then for ad hoc retrieval. (3) We thoroughly evaluate our method and compare it with the state-of-the-art keyword and entity based retrieval methods. We demonstrate that the proposed retrieval model shows improved performance over all the other baselines on ClueWeb09B and ClueWeb12B, two widely used Web corpora, on the [email protected], and [email protected] metrics. We also show that the proposed method is most effective on the difficult queries. In addition, we compare our proposed entity selection with a state-of-the-art entity selection technique within the context of ad hoc retrieval using a basic query expansion method and illustrate that it provides more effective retrieval for all expansion weights and different numbers of expansion entities.

9.
Re-using research resources is essential for advancing knowledge and developing repeatable, empirically solid experiments in scientific fields, including interactive information retrieval (IIR). Despite recent efforts on standardizing research re-use and documentation, how to quantitatively measure the reusability of IIR resources remains an open challenge. Inspired by the reusability evaluations on Cranfield experiments, our work proactively explores the problem of measuring IIR test collection reusability and makes threefold contributions: (1) constructing a novel usefulness-oriented framework with specific analytical methods for evaluating the reusability of IIR test collections consisting of query sets, document/page sets, and sets of task-document usefulness (tuse); (2) explaining the potential impacts of varying IIR-specific factors (e.g. search tasks, sessions, user characteristics) on test collection reusability; (3) proposing actionable methods for building reusable test collections in IIR and thereby amortizing the true cost of user-oriented evaluations. The Cranfield-inspired reusability assessment framework serves as an initial step towards accurately evaluating the reusability of IIR research resources and measuring the reproducibility of IIR evaluation results. It also demonstrates an innovative approach to integrating the insights from individual heterogeneous user studies with the evaluation techniques developed in standardized ad hoc retrieval experiments, which will facilitate the maturation of IIR fields and eventually benefit both sides of research.

10.
11.
Diversification of web search results aims to promote documents with diverse content (i.e., covering different aspects of a query) to the top-ranked positions, to satisfy more users, enhance fairness and reduce bias. In this work, we focus on explicit diversification methods, which assume that the query aspects are known at diversification time, and leverage supervised learning methods to improve their performance in three different frameworks with different features and goals. First, in the LTRDiv framework, we focus on applying typical learning to rank (LTR) algorithms to obtain a ranking where each top-ranked document covers as many aspects as possible. We argue that such rankings optimize various diversification metrics (under certain assumptions), and hence, are likely to achieve diversity in practice. Second, in the AspectRanker framework, we apply LTR for ranking the aspects of a query with the goal of more accurately setting the aspect importance values for diversification. As features, we exploit several pre- and post-retrieval query performance predictors (QPPs) to estimate how well a given aspect is covered among the candidate documents. Finally, in the LmDiv framework, we cast the diversification problem into an alternative fusion task, namely, the supervised merging of rankings per query aspect. We again use QPPs computed over the candidate set for each aspect, and optimize an objective function that is tailored for the diversification goal. We conduct thorough comparative experiments using both the basic systems (based on the well-known BM25 matching function) and the best-performing systems (with more sophisticated retrieval methods) from previous TREC campaigns. Our findings reveal that the proposed frameworks, especially AspectRanker and LmDiv, outperform both non-diversified rankings and two strong diversification baselines (i.e., xQuAD and its variant) in terms of various effectiveness metrics.

12.
This work reports experimental and theoretical studies of the hydrodynamic behaviour of deformable objects such as droplets and cells in a microchannel. Effects of mechanical properties including size and viscosity of these objects on their deformability, mobility, and induced hydrodynamic resistance are investigated. The experimental results revealed that the deformability of droplets, which is quantified in terms of a deformability index (D.I.), depends on the droplet-to-channel size ratio ρ and droplet-to-medium viscosity ratio λ. Using a large set of experimental data, for the first time, we provide a mathematical formula that correlates the induced hydrodynamic resistance of a single droplet ΔRd with the droplet size ρ and viscosity λ. A simple theoretical model is developed to obtain closed form expressions for droplet mobility ? and ΔRd. The predictions of the theoretical model agree well with the experimental results in terms of the droplet mobility ? and induced hydrodynamic resistance ΔRd. Numerical simulations are carried out using a volume-of-fluid model to predict droplet generation and deformation of droplets of different size ratio ρ and viscosity ratio λ, and the predictions compare well with those obtained from the experiments. In a novel effort, we performed experiments to measure the bulk induced hydrodynamic resistance ΔR of different biological cells (yeast, L6, and HEK 293). The results reveal that the bulk induced hydrodynamic resistance ΔR is related to the cell concentration and apparent viscosity of the cells.

13.
Opinion summarization can facilitate users’ decision-making by mining the salient review information. However, due to the lack of sufficient annotated data, most of the early works are based on extractive methods, which restricts the performance of opinion summarization. In this work, we aim to improve the informativeness of opinion summarization to provide better guidance to users. We consider the setting with only reviews without corresponding summaries, and propose an aspect-augmented model for unsupervised abstractive opinion summarization, denoted as AsU-OSum. We first employ an aspect-based sentiment analysis system to extract opinion phrases from reviews. Then, we construct a heterogeneous graph consisting of reviews and opinion clusters as nodes, which is used to enhance the Transformer-based encoder–decoder framework. Furthermore, we design a novel cascaded attention mechanism to prompt the decoder to pay more attention to the aspects that are more likely to appear in the summary. During training, we introduce a sentiment accuracy reward that further enhances the learning ability of our model. We conduct comprehensive experiments on the Yelp, Amazon, and Rotten Tomatoes datasets. Automatic evaluation results show that our model is competitive and performs better than the state-of-the-art (SOTA) models on some ROUGE metrics. Human evaluation results further verify that our model can generate more informative summaries and reduce redundancy.

14.
Traditional information retrieval techniques that primarily rely on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply because different terminology is used to describe the same concepts. Semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while the sole consideration of semantic information might not lead to improved retrieval performance over keyword-based search, its consideration enables the retrieval of a set of relevant documents that cannot be retrieved by keyword-based methods. As such, building indices that store and provide access to semantic information during the retrieval process is important. While the process for building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices is still an open challenge. Existing work has proposed to build one unified index encompassing both textual and semantic information, or to build separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embeddings-based representations of terms, semantic entities, semantic types and documents within the same embedding space to facilitate the development of a unified search index that would consist of these four information types. We perform experiments on standard and widely used document collections including Clueweb09-B and Robust04 to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives.
Based on our experiments, we find that when neural embeddings are used to build inverted indices, hence relaxing the requirement to explicitly observe the posting list key in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, reducing the index size and query processing time, and (b) while retrieval efficiency, which is the main objective of an efficient indexing mechanism, improves using our proposed method, retrieval effectiveness also remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.

15.
Learning low dimensional dense representations of the vocabularies of a corpus, known as neural embeddings, has gained much attention in the information retrieval community. While there have been several successful attempts at integrating embeddings within the ad hoc document retrieval task, no systematic study has been reported that explores the various aspects of neural embeddings and how they impact retrieval performance. In this paper, we perform a methodical study on how neural embeddings influence the ad hoc document retrieval task. More specifically, we systematically explore the following research questions: (i) do methods solely based on neural embeddings perform competitively with state of the art retrieval methods with and without interpolation? (ii) is there any statistically significant difference between the performance of retrieval models when based on word embeddings compared to when knowledge graph entity embeddings are used? and (iii) is there a significant difference between using locally trained neural embeddings compared to when globally trained neural embeddings are used? We examine these three research questions across both hard and all queries. Our study finds that word embeddings do not show competitive performance to any of the baselines. In contrast, entity embeddings show competitive performance to the baselines and when interpolated, outperform the best baselines for both hard and soft queries.

16.
Content-based image retrieval (CBIR) with global features is notoriously noisy, especially for image queries with low percentages of relevant images in a collection. Moreover, CBIR typically ranks the whole collection, which is inefficient for large databases. We experiment with a method for image retrieval from multimedia databases, which improves both the effectiveness and efficiency of traditional CBIR by exploring secondary media. We perform retrieval in a two-stage fashion: first rank by a secondary medium, and then perform CBIR only on the top-K items. Thus, effectiveness is improved by performing CBIR on a ‘better’ subset. Using a relatively ‘cheap’ first stage, efficiency is also improved via the fewer CBIR operations performed. Our main novelty is that K is dynamic, i.e. estimated per query to optimize a predefined effectiveness measure. We show that our dynamic two-stage method can be significantly more effective and robust than similar setups with static thresholds previously proposed. In additional experiments using local feature derivatives in the visual stage instead of global ones, such as the emerging visual codebook approach, we find that two-stage retrieval does not work very well. We attribute the weaker performance of the visual codebook to the enhanced visual diversity produced by the textual stage, which diminishes the codebook’s advantage over global features. Furthermore, we compare dynamic two-stage retrieval to traditional score-based fusion of results retrieved visually and textually. We find that fusion is also significantly more effective than single-medium baselines. Although there is no clear winner between two-stage retrieval and fusion, and the methods exhibit different robustness features, two-stage retrieval provides efficiency benefits over fusion.
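
The two-stage scheme described above can be sketched as follows. This is a minimal illustration, not the authors' system: the score dictionaries and function name are hypothetical, and K is passed in as a plain parameter because the abstract does not say how the per-query K is estimated.

```python
def two_stage_rank(text_scores, visual_scores, k):
    """Rank by the cheap secondary medium first, then re-rank only the top-k visually.

    text_scores:   dict mapping item id -> first-stage (e.g. textual) score.
    visual_scores: dict mapping item id -> CBIR score; only k items are looked up.
    k:             cutoff for the first stage (assumed given; in the described
                   method it is estimated per query).
    """
    first_stage = sorted(text_scores, key=text_scores.get, reverse=True)
    candidates = first_stage[:k]  # CBIR is restricted to this subset
    return sorted(candidates, key=lambda item: visual_scores[item], reverse=True)


# Hypothetical scores for four images.
text_scores = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
visual_scores = {"a": 0.2, "b": 0.6, "c": 0.99, "d": 0.4}
ranking = two_stage_rank(text_scores, visual_scores, k=3)
```

In this toy example item "c" never reaches the visual stage despite its high visual score, which is the intended filtering effect of the first stage.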

17.
XML is a pervasive technology for representing and accessing semi-structured data. XPath is the standard language for navigational queries on XML documents and there is a growing demand for its efficient processing. In order to increase the efficiency in executing four navigational XML query primitives, namely descendants, ancestors, children and parent, we introduce a new paradigm where traditional approaches based on the efficient traversing of nodes and edges to reconstruct the requested subtrees are replaced by a brand new one based on basic set operations, which allow us to directly return the desired subtree without constructing it by passing through nodes and edges. Our solution stems from the NEsted SeTs for Object hieRarchies (NESTOR) formal model, which makes use of set-inclusion relations for representing and providing access to hierarchical data. We define in-memory efficient data structures to implement NESTOR, we develop algorithms to perform the descendants, ancestors, children and parent query primitives and we study their computational complexity. We conduct an extensive experimental evaluation by using several datasets: digital archives (EAD collections), the INEX 2009 Wikipedia collection, and two widely-used synthetic datasets (XMark and XGen). We show that NESTOR-based data structures and query primitives consistently outperform state-of-the-art solutions for XPath processing at execution time and they are competitive in terms of both memory occupation and pre-processing time.
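
The idea of answering navigational queries with set operations rather than tree traversal can be sketched as follows. This is only an illustration of set-inclusion over a hierarchy, assuming each node is represented by the set containing itself and everything below it; it is not the NESTOR data structures or algorithms, and all function names and the sample hierarchy are hypothetical.

```python
def build_nested_sets(children):
    """Map every node to the frozenset of itself plus all nodes beneath it.

    children: dict mapping a node to the list of its child nodes (leaves may be absent).
    """
    nested = {}

    def collect(node):
        members = {node}
        for child in children.get(node, []):
            members |= collect(child)
        nested[node] = frozenset(members)
        return nested[node]

    all_children = {c for kids in children.values() for c in kids}
    for root in set(children) - all_children:
        collect(root)
    return nested


def descendants(nested, node):
    """Everything in the node's set except the node itself."""
    return set(nested[node]) - {node}


def ancestors(nested, node):
    """Every other node whose set contains this node."""
    return {other for other, members in nested.items()
            if node in members and other != node}


def parent(nested, node):
    """The ancestor with the smallest set, or None for a root."""
    ups = ancestors(nested, node)
    return min(ups, key=lambda a: len(nested[a])) if ups else None


def children_of(nested, node):
    """Descendants whose closest enclosing set is this node's set."""
    return {d for d in descendants(nested, node) if parent(nested, d) == node}


# Hypothetical hierarchy: root -> (a -> a1, a2), b
nested = build_nested_sets({"root": ["a", "b"], "a": ["a1", "a2"]})
```

Once the sets are built, the four primitives reduce to membership tests and set differences, at the cost of the preprocessing step that materializes the sets.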

18.
One of the best known measures of information retrieval (IR) performance is the F-score, the harmonic mean of precision and recall. In this article we show that the curve of the F-score as a function of the number of retrieved items is always of the same shape: a fast concave increase to a maximum, followed by a slow decrease. In other words, there exists a single maximum, referred to as the tipping point, where the retrieval situation is ‘ideal’ in terms of the F-score. The tipping point thus indicates the optimal number of items to be retrieved, with more or fewer items resulting in a lower F-score. This empirical result is found in IR and link prediction experiments and can be partially explained theoretically, expanding on earlier results by Egghe. We discuss the implications and argue that, when comparing F-scores, one should compare the F-score curves’ tipping points.
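
The F-score curve and its tipping point can be computed as in the following sketch. The ranked relevance list is a made-up example and the function names are hypothetical; on such a short toy list the curve is jagged rather than the smooth shape reported for real experiments.

```python
def f_score_curve(relevance, total_relevant):
    """F-score after retrieving the top k items, for k = 1..len(relevance).

    relevance:      list of 0/1 flags for the ranked items (1 = relevant).
    total_relevant: number of relevant items in the whole collection.
    With h hits among k items, precision = h/k and recall = h/total_relevant,
    so their harmonic mean simplifies to 2h / (k + total_relevant).
    """
    scores, hits = [], 0
    for k, rel in enumerate(relevance, start=1):
        hits += rel
        scores.append(2 * hits / (k + total_relevant))
    return scores


def tipping_point(scores):
    """Return (k, F) for the cutoff k with the highest F-score (first one on ties)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores[best]


# Hypothetical ranking with 4 relevant items in the collection.
ranked = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
curve = f_score_curve(ranked, total_relevant=4)
best_k, best_f = tipping_point(curve)
```

Comparing two systems at their respective tipping points, rather than at an arbitrary fixed cutoff, is the comparison the abstract argues for.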

19.
Cell migration is an essential process involved in the development and maintenance of multicellular organisms. Electric fields (EFs) are one of the many physical and chemical factors known to affect cell migration, a phenomenon termed electrotaxis or galvanotaxis. In this paper, a microfluidics chip was developed to study the migration of cells under different electrical and chemical stimuli. This chip is capable of providing four different strengths of EFs in combination with two different chemicals via one simple set of agar salt bridges and Ag/AgCl electrodes. NIH 3T3 fibroblasts were seeded inside this chip to study their migration and reactive oxygen species (ROS) production in response to different EF strengths and the presence of β-lapachone. We found that both the EF and β-lapachone level increased the cell migration rate and the production of ROS in an EF-strength-dependent manner. A strong linear correlation between the cell migration rate and the amount of intracellular ROS suggests that ROS are an intermediate product by which EF and β-lapachone enhance cell migration. Moreover, an anti-oxidant, α-tocopherol, was found to quench the production of ROS, resulting in a decrease in the migration rate.

20.