首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Traditional information retrieval techniques that primarily rely on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem where relevant documents to a given query might not be retrieved simply due to the use of different terminology for describing the same concepts. As such, semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while the sole consideration of semantic information might not lead to improved retrieval performance over keyword-based search, their consideration enables the retrieval of a set of relevant documents that cannot be retrieved by keyword-based methods. As such, building indices that store and provide access to semantic information during the retrieval process is important. While the process for building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices is still an open challenge. Existing work have proposed to build one unified index encompassing both textual and semantic information or to build separate yet integrated indices for each information type but they face limitations such as increased query process time. In this paper, we propose to use neural embeddings-based representations of term, semantic entity, semantic type and documents within the same embedding space to facilitate the development of a unified search index that would consist of these four information types. We perform experiments on standard and widely used document collections including Clueweb09-B and Robust04 to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives. Based on our experiments, we find that when neural embeddings are used to build inverted indices; hence relaxing the requirement to explicitly observe the posting list key in the indexed document: (a) retrieval efficiency will increase compared to a standard inverted index, hence reduces the index size and query processing time, and (b) while retrieval efficiency, which is the main objective of an efficient indexing mechanism improves using our proposed method, retrieval effectiveness also retains competitive performance compared to the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.  相似文献   

2.
一种基于本体的语义标引方法   总被引:4,自引:0,他引:4  
传统的采用主题词和关键词对文档进行标引的方法,由于不能提供语义推理而越来越不适合目前的网络环境。由于本体具有良好的概念层次结构和对逻辑推理的支持,在信息检索领域将有很大的应用价值。本文首先介绍本体的基本概念和领域本体的组成部分,然后提出了一种基于领域本体的语义标引方法,采用本体中的概念对文档进行语义层面的标引,为检索的智能推理提供基础。  相似文献   

3.
孟旭阳  白海燕  梁冰  王莉 《情报杂志》2021,40(3):125-131,7
[目的/意义]资源数字化时代文献服务向知识服务方向转变,高质量的文献自动标引是文献知识服务能力提升的基础和关键,针对目前英文科技文献自动标引准确率不高的问题,提出了基于语义感知的概念遴选优化方法。[方法/过程]基于知识组织系统的自动主题标引,采用自然语言处理中的神经网络词向量技术,对概念和英文文献内容语义进行表示并进行语义感知与评估,实现概念标引结果在语义层面的遴选。该方法采用基于知识组织系统与自然语言处理技术相结合的方法,弥补了在语义层面上的不足,从而进一步降低不相关概念的影响,提高概念标引结果的准确率。[结果/结论]实验结果表明,该方法具有较好的语义感知性能,在概念遴选上有效降低了不相关概念,大大提高了标引结果的文献相关性,为科技文献资源知识化服务建设和相关研究提供有价值的参考和支持。  相似文献   

4.
In this paper, we present a well-defined general matrix framework for modelling Information Retrieval (IR). In this framework, collections, documents and queries correspond to matrix spaces. Retrieval aspects, such as content, structure and semantics, are expressed by matrices defined in these spaces and by matrix operations applied on them. The dualities of these spaces are identified through the application of frequency-based operations on the proposed matrices and through the investigation of the meaning of their eigenvectors. This allows term weighting concepts used for content-based retrieval, such as term frequency and inverse document frequency, to translate directly to concepts for structure-based retrieval. In addition, concepts such as pagerank, authorities and hubs, determined by exploiting the structural relationships between linked documents, can be defined with respect to the semantic relationships between terms. Moreover, this mathematical framework can be used to express classical and alternative evaluation measures, involving, for instance, the structure of documents, and to further explain and relate IR models and theory. The high level of reusability and abstraction of the framework leads to a logical layer for IR that makes system design and construction significantly more efficient, and thus, better and increasingly personalised systems can be built at lower costs.  相似文献   

5.
Pseudo-relevance feedback (PRF) is a well-known method for addressing the mismatch between query intention and query representation. Most current PRF methods consider relevance matching only from the perspective of terms used to sort feedback documents, thus possibly leading to a semantic gap between query representation and document representation. In this work, a PRF framework that combines relevance matching and semantic matching is proposed to improve the quality of feedback documents. Specifically, in the first round of retrieval, we propose a reranking mechanism in which the information of the exact terms and the semantic similarity between the query and document representations are calculated by bidirectional encoder representations from transformers (BERT); this mechanism reduces the text semantic gap by using the semantic information and improves the quality of feedback documents. Then, our proposed PRF framework is constructed to process the results of the first round of retrieval by using probability-based PRF methods and language-model-based PRF methods. Finally, we conduct extensive experiments on four Text Retrieval Conference (TREC) datasets. The results show that the proposed models outperform the robust baseline models in terms of the mean average precision (MAP) and precision P at position 10 (P@10), and the results also highlight that using the combined relevance matching and semantic matching method is more effective than using relevance matching or semantic matching alone in terms of improving the quality of feedback documents.  相似文献   

6.
A new dictionary-based text categorization approach is proposed to classify the chemical web pages efficiently. Using a chemistry dictionary, the approach can extract chemistry-related information more exactly from web pages. After automatic segmentation on the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and the relevant categories are finally assigned to the test document by using the k-NN text categorization algorithm. The effects of the characteristics of chemistry dictionary and test collection on the categorization efficiency are discussed in this paper, and a new voting method is also introduced to improve the categorization performance further based on the collection characteristics. The experimental results show that the proposed approach has the superior performance to the traditional categorization method and is applicable to the classification of chemical web pages.  相似文献   

7.
An integrated information retrieval system generally contains multiple databases that are inconsistent in terms of their content and indexing. This paper proposes a rough set-based transfer (RST) model for integration of the concepts of document databases using various indexing languages, so that users can search through the multiple databases using any of the current indexing languages. The RST model aims to effectively create meaningful transfer relations between the terms of two indexing languages, provided a number of documents are indexed with them in parallel. In our experiment, the indexing concepts of two databases respectively using the Thesaurus of Social Science (IZ) and the Schlagwortnormdatei (SWD) are integrated by means of the RST model. Finally, this paper compares the results achieved with a cross-concordance method, a conditional probability based method and the RST model.  相似文献   

8.
针对图书、期刊论文等数字文献文本特征较少而导致特征向量语义表达不够准确、分类效果差的问题,本文提出一种基于特征语义扩展的数字文献分类方法。该方法首先利用TF-IDF方法获取对数字文献文本表示能力较强、具有较高TF-IDF值的核心特征词;其次分别借助知网(Hownet)语义词典以及开放知识库维基百科(Wikipedia)对核心特征词集进行语义概念的扩展,以构建维度较低、语义丰富的概念向量空间;最后采用MaxEnt、SVM等多种算法构造分类器实现对数字文献的自动分类。实验结果表明:相比传统基于特征选择的短文本分类方法,该方法能有效地实现对短文本特征的语义扩展,提高数字文献分类的分类性能。  相似文献   

9.
In this paper, we propose a document reranking method for Chinese information retrieval. The method is based on a term weighting scheme, which integrates local and global distribution of terms as well as document frequency, document positions and term length. The weight scheme allows randomly setting a larger portion of the retrieved documents as relevance feedback, and lifts off the worry that very fewer relevant documents appear in top retrieved documents. It also helps to improve the performance of maximal marginal relevance (MMR) in document reranking. The method was evaluated by MAP (mean average precision), a recall-oriented measure. Significance tests showed that our method can get significant improvement against standard baselines, and outperform relevant methods consistently.  相似文献   

10.
In ad hoc querying of document collections, current approaches to ranking primarily rely on identifying the documents that contain the query terms. Methods such as query expansion, based on thesaural information or automatic feedback, are used to add further terms, and can yield significant though usually small gains in effectiveness. Another approach to adding terms, which we investigate in this paper, is to use natural language technology to annotate - and thus disambiguate - key terms by the concept they represent. Using biomedical research documents, we quantify the potential benefits of tagging users’ targeted concepts in queries and documents in domain-specific information retrieval. Our experiments, based on the TREC Genomics track data, both on passage and full-text retrieval, found no evidence that automatic concept recognition in general is of significant value for this task. Moreover, the issues raised by these results suggest that it is difficult for such disambiguation to be effective.  相似文献   

11.
微出版及其应用探析   总被引:1,自引:0,他引:1  
牛丽慧 《现代情报》2018,38(6):86-92
本文对语义出版中的一种代表性出版模式——微出版(Micropublication)进行了介绍和分析。首先介绍了微出版物的概念及其本体;然后对微出版的应用现状进行述评;最后,尝试将微出版应用于心理学领域,以一篇心理学科学文献为例对其利用微出版模型进行语义化描述,并在此基础上对微出版的应用特点进行了分析。研究结果表明:微出版模型是一种以论证为基础,对科学文献中以文献结论为论点,以陈述、数据、方法等作为证据的论证过程进行语义化表示的语义出版模型,但微出版模型无法表示对科学文献内的具体组块,需要结合其他概念模型实现对科学文献不同程度的语义化描述。  相似文献   

12.
13.
14.
Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text and posting lists caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that within the space of the compressed document collection, one can carry out the posting lists generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space.  相似文献   

15.
Existing pseudo-relevance feedback (PRF) methods often divide an original query into individual terms for processing and select expansion terms based on the term frequency, proximity, position, etc. This process may lose some contextual semantic information from the original query. In this work, based on the classic Rocchio model, we propose a probabilistic framework that incorporates sentence-level semantics via Bidirectional Encoder Representations from Transformers (BERT) into PRF. First, we obtain the importance of terms at the term level. Then, we use BERT to interactively encode the query and sentences in the feedback document to acquire the semantic similarity score of a sentence and the query. Next, the semantic scores of different sentences are summed as the term score at the sentence level. Finally, we balance the term-level and sentence-level weights by adjusting factors and combine the terms with the top-k scores to form a new query for the next-round processing. We apply this method to three Rocchio-based models (Rocchio, PRoc2, and KRoc). A series of experiments are conducted based on six official TREC data sets. Various evaluation indicators suggest that the improved models achieve a significant improvement over the corresponding baseline models. Our proposed models provide a promising avenue for incorporating sentence-level semantics into PRF, which is feasible and robust. Through comparison and analysis of a case study, expansion terms obtained from the proposed models are shown to be more semantically consistent with the query.  相似文献   

16.
相关概念的关联参照检索是概念检索的重要研究内容。本文提出了一种基于主题的语义关联的参照检索模型,通过融合语义网、本体论的相关知识及信息提取等语言处理技术,提取关于特定主题的文档的主题概念及概念之间的关联构成该主题的语义关联模型,并辅助于参照检索过程。  相似文献   

17.
廖开际  杨彬彬 《情报杂志》2012,31(7):182-186
基于词频统计思想的传统文本相似度算法,往往只考虑特征项在文本中的权重,而忽视了特征项之间的语义关系.综合考虑了特征项在文本中的重要程度以及特征项之间的语义关系,提出构建文本特征项的加权语义网模型来计算文本之间的相似度,并在模型构建的过程中,对特征项的选取、权值计算做了适当的改进.最后用实验验证了基于加权语义网的文本相似度算法相较于传统的算法,相似度计算的精确度有了进一步的提高.  相似文献   

18.
基于术语间本体关联度的文档相关度研究   总被引:1,自引:0,他引:1  
提出了一种基于术语间本体关联度的文档相关度计算方法,该方法利用树状本体结构计算术语间基于本体的关联关系,通过术语组间的本体关联度得到两组词语的本体关联关系,最后结合文档标引词的权重计算两个文档的相关度。新方法从本体的角度将语义信息融入传统向量空间模型,提高了文档相关度计算的准确性。实验选取计算机领域本体作为实验数据,对新方法和传统方法进行综合对比评测,实验结果验证了新方法的有效性和合理性。  相似文献   

19.
【目的/意义】通过概念层次关系自动抽取可以快速地在大数据集上进行细粒度的概念语义层次自动划分, 为后续领域本体的精细化构建提供参考。【方法/过程】首先,在由复合术语和关键词组成的术语集上,通过词频、篇 章频率和语义相似度进行筛选,得到学术论文评价领域概念集;其次,考虑概念共现关系和上下文语义信息,前者 用文献-概念矩阵和概念共现矩阵表达,后者用word2vec词向量表示,通过余弦相似度进行集成,得到概念相似度 矩阵;最后,以关联度最大的概念为聚类中心,利用谱聚类对相似度矩阵进行聚类,得到学术论文评价领域概念层 次体系。【结果/结论】经实验验证,本研究提出的模型有较高的准确率,构建的领域概念层次结构合理。【创新/局限】 本文提出了一种基于词共现与词向量的概念层次关系自动抽取模型,可以实现概念层次关系的自动抽取,但类标 签确定的方法比较简单,可以进一步探究。  相似文献   

20.
Most previous information retrieval (IR) models assume that terms of queries and documents are statistically independent from each other. However, conditional independence assumption is obviously and openly understood to be wrong, so we present a new method of incorporating term dependence into a probabilistic retrieval model by adapting a dependency structured indexing system using a dependency parse tree and Chow Expansion to compensate the weakness of the assumption. In this paper, we describe a theoretic process to apply the Chow Expansion to the general probabilistic models and the state-of-the-art 2-Poisson model. Through experiments on document collections in English and Korean, we demonstrate that the incorporation of term dependences using Chow Expansion contributes to the improvement of performance in probabilistic IR systems.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号