Similar literature
1.
Unknown words such as proper nouns, abbreviations, and acronyms are a major obstacle in text processing. Abbreviations in particular are difficult to read and process because they are often domain specific. In this paper, we propose a method for automatically expanding abbreviations by using context and character information. Previous studies searched dictionaries for abbreviation expansion candidates (candidate words for the original form of an abbreviation). Instead of a dictionary, we use a corpus from the same field that contains few abbreviations. We calculate the adequacy of each expansion candidate from the similarity between the context of the target abbreviation and the context of the candidate. The similarity is calculated using a vector space model in which the vector elements are the words surrounding the target abbreviation and those surrounding its expansion candidate. Experiments using approximately 10,000 documents in the field of aviation showed that the accuracy of the proposed method is 10% higher than that of previously developed methods.
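The context comparison described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the window size, the raw-count weighting, and the names `context_vector`, `cosine`, and `rank_expansions` are all assumptions, and the character-information part of the method is not modelled.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Count the words within `window` positions of every occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Skip the target occurrence itself; keep its neighbours.
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank_expansions(abbrev_tokens, abbrev, corpus_tokens, candidates, window=2):
    """Rank candidate expansions by how similar their contexts are to the abbreviation's."""
    target_vec = context_vector(abbrev_tokens, abbrev, window)
    scored = [
        (cand, cosine(target_vec, context_vector(corpus_tokens, cand, window)))
        for cand in candidates
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: a text containing the abbreviation and a reference corpus.
abbrev_doc = "check the alt before landing".split()
corpus = "check the altitude before landing and order an alternate meal".split()
ranked = rank_expansions(abbrev_doc, "alt", corpus, ["altitude", "alternate"])
```

In this toy example the candidate whose surrounding words match those of the abbreviation ranks first; a real system would presumably weight terms (for example with tf-idf) and combine the score with character-level evidence.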

2.
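Full-text literature management based on the JATS data standard   Total citations: 1, self-citations: 1, citations by others: 0
[Purposes] To provide a reference for developing and using standards for the exchange and storage of electronic documents of scientific journals, and to promote full-text management of literature. [Methods] We introduce the features and practice of the JATS (Journal Article Tag Suite) standard and compare the application scenarios of its three tag sets. [Findings] Publishing groups, data repositories, libraries, and article authors can choose one JATS tag set according to their needs to convert, store, and manage documents. Managing Chinese-language literature according to the JATS standard enabled full-text reading, personalized annotation, and full-text search. [Conclusions] The JATS standard defines three tag sets for different application scenarios. Full-text literature management and medical book management based on the JATS standard provide evidence of feasibility for the localized promotion and application of the standard.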

3.
Various keyword network methods are used to map scientific fields, but few studies have considered the semantic roles of keywords in such networks. This study proposes a term function-aware keyword citation network to address this gap. Specifically, we first used a term function identification method to identify research questions and methods from scientific articles. Then, we constructed a question-method term citation network to represent the correlation structure of keywords. Next, we explored the topology characteristics, question-method bipartite network, and knowledge community structure of the generated network to validate its superiority in science mapping analysis. A dataset of 299,567 conference proceedings collected from the Association for Computing Machinery (ACM) digital library is used to evaluate the effectiveness of our methods. The results show that the term function identification model based on Bidirectional Encoder Representations from Transformers (BERT) achieves an F1 score of 0.90, and that the question-method term citation network outperforms existing keyword citation methods in revealing association patterns between scientific knowledge and improving the interpretability of the knowledge structure of the computing field. We believe that our work expands the methodology of keyword citation networks and science mapping analysis and provides guidance for considering the term function in various scenarios.

4.
Entity disambiguation is a fundamental task of semantic Web annotation. Entity Linking (EL) is an essential procedure in entity disambiguation, which aims to link a mention appearing in plain text to a structured or semi-structured knowledge base, such as Wikipedia. Existing research on EL usually annotates the mentions in a text one by one and treats entities as independent of each other. However, this might not be true in many application scenarios. For example, if two mentions appear in one text, they are likely to have certain intrinsic relationships. In this paper, we first propose a novel query expansion method for candidate generation that uses information about the co-occurrence of mentions. We further propose a re-ranking model which can be iteratively adjusted based on the prediction in the previous round. Experiments on real-world data demonstrate the effectiveness of our proposed methods for entity disambiguation.

5.
Query translation is a viable method for cross-language information retrieval (CLIR), but it suffers from translation ambiguities caused by multiple translations of individual query terms. Previous research has employed various methods for disambiguation, including selecting an individual target query term from multiple candidates by comparing their statistical associations with the candidate translations of other query terms. This paper proposes a new method in which we examine all combinations of target query term translations corresponding to the source query terms, instead of looking at the candidates for each query term and selecting the best one at a time. The goodness value for a combination of target query terms is computed from the association value between each pair of terms in the combination. We tested our method using the NTCIR-3 English–Korean CLIR test collection. The results show some improvements regardless of the association measures we used.
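The idea of scoring whole combinations of translations, rather than one term at a time, can be illustrated with the sketch below. It is an assumption-laden toy: the abstract does not say how pairwise association values are aggregated, so summation is used here as one plausible choice, and `best_translation_combination` and the labels in the example are hypothetical.

```python
from itertools import combinations, product


def best_translation_combination(candidates, association):
    """Pick the combination of translations with the highest total pairwise association.

    candidates:  one list of candidate translations per source query term.
    association: dict mapping frozenset({term_a, term_b}) to an association value;
                 missing pairs count as 0.0.
    """
    best_combo, best_score = None, float("-inf")
    # Enumerate every way of choosing one translation per source term.
    for combo in product(*candidates):
        # Assumed aggregation: sum the association of each pair in the combination.
        score = sum(association.get(frozenset(pair), 0.0) for pair in combinations(combo, 2))
        if score > best_score:
            best_combo, best_score = combo, score
    return best_combo, best_score


# Hypothetical sense-labelled translations standing in for real target-language terms.
candidates = [["bank_river", "bank_money"], ["interest_rate", "interest_hobby"]]
association = {
    frozenset({"bank_money", "interest_rate"}): 0.9,
    frozenset({"bank_river", "interest_hobby"}): 0.1,
}
combo, score = best_translation_combination(candidates, association)
```

Exhaustive enumeration grows multiplicatively with the number of candidates per term, so a practical system would likely need pruning for long queries.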

6.
Measuring the similarity between the semantic relations that exist between words is an important step in numerous natural language processing tasks, such as answering word analogy questions, classifying compound nouns, and word sense disambiguation. Given two word pairs (A, B) and (C, D), we propose a method to measure the relational similarity between the semantic relations that exist between the two words in each word pair. Typically, a high degree of relational similarity can be observed between proportional analogies (i.e., analogies that exist among the four words: A is to B as C is to D). We describe eight different types of relational symmetries that are frequently observed in proportional analogies and use those symmetries to robustly and accurately estimate the relational similarity between two given word pairs. We use automatically extracted lexical-syntactic patterns to represent the semantic relations that exist between two words and then match those patterns in Web search engine snippets to find candidate words that form proportional analogies with the original word pair. We define eight types of relational symmetries for proportional analogies and use those as features in a supervised learning approach. We evaluate the proposed method using the Scholastic Aptitude Test (SAT) word analogy benchmark dataset. Our experimental results show that the proposed method can accurately measure relational similarity between word pairs by exploiting the symmetries that exist in proportional analogies. The proposed method achieves an SAT score of 49.2% on the benchmark dataset, which is comparable to the best results reported on this dataset.

7.
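[Purpose/Significance] Using low-level semantic feature data composed of external features such as color and texture together with local visual features, this paper applies random forests to the automatic semantic annotation of medical images. The aim is to support clinical decisions by medical staff, to help the general public understand medical knowledge and their own health, to broaden the scope of information organization and processing for library and information science researchers in a big-data environment, to promote interdisciplinary integration, to advance smart medicine, and to provide intellectual and technical support for the Healthy China strategy. [Method/Process] Combining library and information science knowledge with medical knowledge, image semantic annotation is treated as a multi-class classification problem. First, low-level semantic features are extracted, including external features such as color and texture and local visual features; then a random-forest-based automatic annotation scheme for medical images is designed. [Result/Conclusion] Compared with a random tree annotation scheme, the automatic annotation scheme that fuses low-level semantic features performs better. [Innovation/Limitation] A visual semantic dictionary is introduced into image annotation as a low-level semantic feature of medical images, and the annotation scheme is built with random forests; the limitation is that only the BreaKHis dataset was used as experimental data.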

8.
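Wang Ying, Yu Gaihong, Xie Jing. 《情报科学》 (Information Science), 2021, 39(8): 67-77
[Purpose/Significance] To discover associations among academic resources and the knowledge inside them through deep mining and semantic organization of those resources. [Method/Process] This paper proposes a method for discovering associations among academic resources based on a full-text knowledge network, and designs the model and construction workflow of such a network. Using the full text of 520 journal articles on Arabidopsis from the PubMed Central database as experimental data, the articles are decomposed into fine-grained knowledge through full-text parsing and mining to form a full-text knowledge network. SPARQL queries and the RelFinder visualization tool are then used to run association discovery experiments at three levels: the digital resource layer, the knowledge unit layer, and the knowledge object layer. [Result/Conclusion] Building a full-text knowledge network to organize and mine academic resources at a fine granularity helps reveal latent associations among different academic resources and the knowledge inside them, which is of great significance for the in-depth use of academic resources. [Innovation/Limitation] The innovation lies in building a full-text knowledge network to reveal and organize academic resources at a fine granularity and then discover latent associations; the limitation is that no large-scale application has yet been carried out.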

9.
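Using a large number of Chinese-English patent texts as a parallel corpus, this paper proposes a method for automatically extracting a Chinese-English dictionary. A seed bilingual dictionary is first built from the external semantic resource Wikipedia; candidate Chinese-English word pairs are then obtained by computing pointwise mutual information, and a threshold is set to select the word pairs used to supplement the seed dictionary. Experimental results show that grouping the words of English documents into phrases helps improve the overall performance of automatic extraction; on the other hand, although sentence alignment can improve the precision of the extraction results, it has a negative effect on their recall. The patent bilingual dictionary built with this method can play a positive role in constructing multilingual versions of technical knowledge graphs.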
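The pointwise mutual information (PMI) step can be sketched as below. This is only an illustration under stated assumptions: probabilities are estimated from how many aligned units contain each term, the seed dictionary from Wikipedia is not modelled, and the function name `pmi_pairs`, the threshold value, and the toy data are hypothetical.

```python
import math
from collections import Counter


def pmi_pairs(aligned_docs, threshold=0.0):
    """Score Chinese-English term pairs by PMI over aligned text units.

    aligned_docs: list of (zh_terms, en_terms) tuples, one per aligned unit.
    Returns a dict {(zh, en): pmi} containing only pairs with PMI above `threshold`.
    """
    n = len(aligned_docs)
    zh_count, en_count, pair_count = Counter(), Counter(), Counter()
    for zh_terms, en_terms in aligned_docs:
        zh_set, en_set = set(zh_terms), set(en_terms)
        zh_count.update(zh_set)
        en_count.update(en_set)
        pair_count.update((z, e) for z in zh_set for e in en_set)

    scores = {}
    for (z, e), c in pair_count.items():
        # PMI = log2( p(z, e) / (p(z) * p(e)) ) with p estimated as count / n.
        pmi = math.log2(c * n / (zh_count[z] * en_count[e]))
        if pmi > threshold:
            scores[(z, e)] = pmi
    return scores


# Hypothetical aligned units (already segmented into terms).
docs = [
    (["电池"], ["battery"]),
    (["电池", "电极"], ["battery", "electrode"]),
    (["电极"], ["electrode"]),
    (["传感器"], ["sensor"]),
]
scores = pmi_pairs(docs, threshold=0.0)
```

Raising `threshold` trades recall for precision, which mirrors the precision/recall tension the abstract reports for sentence alignment.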

10.
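Improving methods for subject topic evolution based on co-word analysis   Total citations: 2, self-citations: 0, citations by others: 2
Subject topic evolution refers to intelligence analysts using information technology methods to observe how topics develop and change over time and how different topics interact; it has become an important part of intelligence research. Co-word analysis based on word frequency or co-occurrence frequency struggles to reflect the deeper semantic relationships between pairs of subject terms. To address this, an improved co-word analysis method is proposed that captures the hierarchical semantic relationships among subject terms, topics, and documents, and presents the topic evolution process at a finer-grained and more precise semantic level.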

11.
The effectiveness of query expansion methods depends essentially on identifying good candidates, or prospects, semantically related to query terms. Word embeddings have been used recently in an attempt to address this problem. Nevertheless, query disambiguation is still necessary: the semantic relatedness of each word in the corpus is modeled, but choosing the right terms for expansion from the standpoint of the un-modeled query semantics remains an open issue. In this paper we propose a novel query expansion method using word embeddings that models the global query semantics from the standpoint of prospect vocabulary terms. The proposed method makes it possible to explore query-vocabulary semantic closeness in such a way that new terms, semantically related to more relevant topics, are elicited and added as a function of the query as a whole. The method includes candidate pooling strategies that address disambiguation issues without using exogenous resources. We tested our method with three topic sets over CLEF corpora and compared it across different Information Retrieval models and against another expansion technique that also uses word embeddings. Our experiments indicate that our method achieves significant results that outperform the baselines, improving both recall and precision metrics without relevance feedback.

12.
Dictionary-based query translation for cross-language information retrieval often yields various translation candidates having different meanings for a source term in the query. This paper examines methods for resolving the ambiguity of translations based only on the target document collections. First, we discuss two kinds of disambiguation technique: (1) a method using term co-occurrence statistics in the collection, and (2) a technique based on pseudo-relevance feedback (PRF). Next, these techniques are empirically compared using the CLEF 2003 test collection for German to Italian bilingual searches, which are executed using English as a pivot language. The experiments showed that a variation of term co-occurrence based techniques, in which the best sequence algorithm for selecting translations is used with the Cosine coefficient, is dominant, and that the PRF method shows comparably high search performance, although statistical tests did not sufficiently support these conclusions. Furthermore, we repeat the same experiments for French to Italian (pivot) and English to Italian (non-pivot) searches on the same CLEF 2003 test collection in order to verify our findings. Again, similar results were observed, except that the Dice coefficient slightly outperforms the Cosine coefficient in the case of disambiguation based on term co-occurrence for English to Italian searches.
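For readers unfamiliar with the two association measures compared in this abstract, the sketch below computes them from document frequencies. It uses the common set-based definitions of the Dice and Cosine coefficients; whether the paper uses exactly these forms is an assumption, and the function names are illustrative.

```python
import math


def dice(n_a, n_b, n_ab):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    return 2 * n_ab / (n_a + n_b) if (n_a + n_b) else 0.0


def cosine_coeff(n_a, n_b, n_ab):
    """Cosine coefficient: |A and B| / sqrt(|A| * |B|)."""
    return n_ab / math.sqrt(n_a * n_b) if n_a and n_b else 0.0


# n_a, n_b: number of documents containing each term; n_ab: documents containing both.
balanced = (dice(4, 4, 2), cosine_coeff(4, 4, 2))
skewed = (dice(1, 9, 1), cosine_coeff(1, 9, 1))
```

The two measures agree when both terms are equally frequent and diverge when one term is much rarer than the other, which is one reason their rankings of translation candidates can differ.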

13.
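Li Hui. 《现代情报》 (Journal of Modern Information), 2015, 35(4): 172-177
Word similarity computation is widely used in natural language processing areas such as information retrieval, word sense disambiguation, and machine translation. Existing word similarity algorithms fall into two main classes: statistics-based methods and methods based on semantic resources. The former compute similarity from contextual information that co-occurs with words in large-scale corpora, while the latter compute similarity using manually built semantic dictionaries or semantic networks. This paper compares and analyzes the two classes of algorithms, focuses on algorithms based on Web corpora and on Wikipedia, and summarizes their respective characteristics and shortcomings. Finally, it suggests that, under the influence of information technology, word similarity algorithms based on Wikipedia and on hybrid techniques, as well as similarity computation driven by linked data, are likely directions of development.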

14.
Towards mapping library and information science   Total citations: 3, self-citations: 1, citations by others: 3
In an earlier study by the authors, full-text analysis and traditional bibliometric methods were combined to map research papers published in the journal Scientometrics. The main objective was to develop appropriate techniques of full-text analysis and to improve the efficiency of the individual methods in the mapping of science. The number of papers was, however, rather limited. In the present study, we extend the quantitative linguistic part of the previous studies to a set of five journals representing the field of Library and Information Science (LIS). Almost 1000 articles and notes published in the period 2002–2004 have been selected for this exercise. The optimum solution for clustering LIS is found for six clusters. The combination of different mapping techniques, applied to the full text of scientific publications, results in a characteristic tripod pattern. Besides two clusters in bibliometrics, one cluster in information retrieval and one containing general issues, webometrics and patent studies are identified as small but emerging clusters within LIS. The study concludes with an analysis of how the clusters are represented in the selected journals.

15.
Authorship disambiguation is an urgent issue that affects the quality of digital library services and for which supervised solutions have been proposed, delivering state-of-the-art effectiveness. However, particular challenges may prevent such techniques from reaching their full potential: the prohibitive cost of labeling vast amounts of examples (there are many ambiguous authors), the huge hypothesis space (there are several features and authors from which many different disambiguation functions may be derived), and the skewed author popularity distribution (few authors are very prolific, while most appear in only a few citations). In this article, we introduce an associative author name disambiguation approach that identifies authorship by extracting, from training examples, rules associating citation features (e.g., coauthor names, work title, publication venue) with specific authors. As our main contribution we propose three associative author name disambiguators: (1) EAND (Eager Associative Name Disambiguation), our basic method that explores association rules for name disambiguation; (2) LAND (Lazy Associative Name Disambiguation), which extracts rules on a demand-driven basis at disambiguation time, reducing the hypothesis space by focusing on examples that are most suitable for the task; and (3) SLAND (Self-Training LAND), which extends LAND with self-training capabilities, thus drastically reducing the number of examples required for building effective disambiguation functions, besides being able to detect novel/unseen authors in the test set. Experiments demonstrate that all our disambiguators are effective and that, in particular, SLAND is able to outperform state-of-the-art supervised disambiguators, providing gains that range from 12% to more than 400%, being extremely effective and practical.

16.
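Structural analysis and construction of semantic knowledge networks   Total citations: 1, self-citations: 0, citations by others: 0
Combining knowledge network and ontology theory, this paper gives a formal definition and structural analysis of the semantic knowledge network and uses ontology technology to construct the network. The network treats knowledge separately from the objects it belongs to, comprises an ontology layer and a knowledge network layer, and covers knowledge at multiple levels. Through this network, ontology functions such as retrieval and reasoning can be applied to mine, analyze, and manage organizational knowledge.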

17.
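Zhan Ci, Xiong Huixiang, Jiang Wuxuan, Li Yan. 《情报科学》 (Information Science), 2022, 39(1): 121-129
[Purpose/Significance] Effective organization of online health information has important social value for improving public health. [Method/Process] Based on an analysis of health information topics, association relationships, and resource indexing, a topic-map-based semantic mining model for online health information tags is constructed; a topic map of health information tags is then built, with visual navigation, browsing, and retrieval functions. [Result/Conclusion] The topic-map-based semantic mining model can accurately discover the deep relationships between online health information and information tags and better reveal the semantic associations among online health information tags. It provides users with visual browsing and navigation, improves the organization of health information, and helps users obtain health information. [Innovation/Limitation] This paper combines topic maps with health information tags, improving the retrieval and utilization efficiency of health information. Its shortcomings include a small tag sample size and sample range, and the lack of participation by professional medical researchers.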

18.
In ad hoc querying of document collections, current approaches to ranking primarily rely on identifying the documents that contain the query terms. Methods such as query expansion, based on thesaural information or automatic feedback, are used to add further terms, and can yield significant though usually small gains in effectiveness. Another approach to adding terms, which we investigate in this paper, is to use natural language technology to annotate, and thus disambiguate, key terms by the concept they represent. Using biomedical research documents, we quantify the potential benefits of tagging users' targeted concepts in queries and documents in domain-specific information retrieval. Our experiments, based on the TREC Genomics track data, both on passage and full-text retrieval, found no evidence that automatic concept recognition in general is of significant value for this task. Moreover, the issues raised by these results suggest that it is difficult for such disambiguation to be effective.

19.
Bibliometric mapping of computer and information ethics   Total citations: 1, self-citations: 0, citations by others: 0
This paper presents the first bibliometric mapping analysis of the field of computer and information ethics (C&IE). It provides a map of the relations between 400 key terms in the field. This term map can be used to get an overview of concepts and topics in the field and to identify relations between information and communication technology concepts on the one hand and ethical concepts on the other hand. To produce the term map, a data set of over a thousand articles published in leading journals and conference proceedings in the C&IE field was constructed. With the help of various computer algorithms, key terms were identified in the titles and abstracts of the articles and co-occurrence frequencies of these key terms were calculated. Based on the co-occurrence frequencies, the term map was constructed. This was done using a computer program called VOSviewer. The term map provides a visual representation of the C&IE field and, more specifically, of the organization of the field around three main concepts, namely privacy, ethics, and the Internet.
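The co-occurrence counting step that feeds the term map can be sketched as follows. This is a simplified stand-in, not what VOSviewer does internally: key terms are assumed to be given in lowercase and are matched by plain substring search, and the function name `cooccurrence_counts` and the sample titles are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(docs, key_terms):
    """Count, for each pair of key terms, the number of documents mentioning both.

    docs:      iterable of strings (e.g. title plus abstract).
    key_terms: lowercase terms; matching here is naive substring search.
    """
    key_terms = set(key_terms)
    counts = Counter()
    for text in docs:
        lowered = text.lower()
        # Sort so each unordered pair is always stored under the same key.
        present = sorted(t for t in key_terms if t in lowered)
        counts.update(combinations(present, 2))
    return counts


# Hypothetical article titles.
docs = [
    "Privacy and ethics on the Internet",
    "Internet privacy law",
    "Ethics of care",
]
counts = cooccurrence_counts(docs, ["privacy", "ethics", "internet"])
```

The resulting pair counts form a symmetric co-occurrence matrix, which a mapping tool can then normalize and lay out in two dimensions.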

20.
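Qi Hong. 《情报科学》 (Information Science), 2021, 39(6): 177-184
[Purpose/Significance] Electronic health record (EHR) information is an important part of medical and health information resources, and its sharing and integration has long been both a priority and a difficulty in overcoming the "information island" phenomenon in medical information resources and in delivering medical knowledge services. This paper reviews research progress abroad on the semantic integration of electronic health records, with the aim of providing a reference for subsequent research in China. [Method/Process] Using the literature survey method, and taking the transformation of EHR information resources into knowledge resources as the main thread, the paper outlines the thematic framework and development trends of EHR semantic integration. [Result/Conclusion] Semantic integration of electronic health records is a dynamic knowledge organization process that is both highly specialized and highly social. Issues that future research may focus on include the development of interoperability standards for medical subspecialties, the openness and transparency of semantic association methods, models for providing knowledge services on demand, and the balance of interests between open access and privacy protection. [Innovation/Limitation] The paper analyzes and comments on recent research topics abroad in EHR semantic integration and identifies the key points and trends in the research progress.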
