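New Term Selection of Thesaurus Based on Time Series Clustering (基于时间序列聚类算法的叙词表新术语遴选研究)
Cite this article: LEI Xiao, CHANG Chun, LIU Wei. New Term Selection of Thesaurus Based on Time Series Clustering[J]. Information Science (情报科学), 2021, 39(1): 135-141.
Authors: LEI Xiao (雷晓)  CHANG Chun (常春)  LIU Wei (刘伟)
Affiliation: Resource Center, Institute of Scientific and Technical Information of China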
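Funding: National Social Science Fund of China project "Research on New Term Extraction for Knowledge Organization Systems" (16BTQ087); National Science and Technology Library (NSTL) "Next-Generation National Science and Technology Innovation Knowledge Service Open System" preliminary R&D task, topic "Research on the Content Construction Mechanism and Development of the STKOS Super Science and Technology Vocabulary (Engineering Part)" (XQYF0101-2).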
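Abstract: 【Purpose/Significance】To keep the term coverage of a thesaurus complete, new terms that have appeared in a field but are not yet included must be added promptly. Combining the temporal and document-frequency features of candidate words, and examining how new terms are distributed from a time series perspective in order to guide new term selection, is a question worth studying. 【Method/Process】The paper studies the time series formed by each word's document frequency. The series are preprocessed by normalizing the frequencies and equalizing the series lengths, and the k-means clustering algorithm is then applied to cluster the candidate words by the trend of their time series. The trend patterns of terms and non-terms are examined, and the trend characteristics that a new term should exhibit are summarized. 【Result/Conclusion】The clustering study finds that new terms are generally on a growing trend. In the empirical study, candidate words in a growth state were selected; expert judgment showed that the method effectively picks out, from the candidate words, new terms that can be added to the thesaurus, with relatively high accuracy. 【Innovation/Limitation】The innovation is that new term selection for thesauri considers both change over time and document frequency. Because of the limited scale of data processing, the empirical study counted only the frequencies of paper keywords.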

Keywords: thesaurus  new term selection  document frequency  time series clustering  k-means

New Term Selection of Thesaurus Based on Time Series Clustering
LEI Xiao, CHANG Chun, LIU Wei. New Term Selection of Thesaurus Based on Time Series Clustering[J]. Information Science, 2021, 39(1): 135-141.
Authors:LEI Xiao  CHANG Chun  LIU Wei
Institution: Resource Center, Institute of Scientific and Technical Information of China, Beijing 100038, China
Abstract: 【Purpose/significance】To keep a thesaurus complete, new terms that appear in a field but are not yet included in the thesaurus need to be added in a timely manner. Combining the time and document-frequency characteristics of candidate words, it is worthwhile to explore the distribution of new terms from a time series perspective to guide new term selection. 【Method/process】This paper studies the time series of the document frequency of words. It normalizes the frequencies and equalizes the lengths of the series as preprocessing, then introduces the k-means clustering algorithm to cluster candidate words by the trend of their time series. It explores the trend patterns of terms and non-terms and, in turn, summarizes the trend characteristics that new terms should satisfy. 【Result/conclusion】The clustering study concludes that new terms are generally on a growing trend. In the empirical study, candidate words in a growth state were selected; expert judgment showed that the method can effectively select, from the candidate vocabulary, new terms that can be added to the thesaurus, with relatively high accuracy. 【Innovation/limitation】The innovation is that the selection of new thesaurus terms takes into account both variation over time and document frequency. Limited by the scale of data processing, the empirical study counted only the frequencies of paper keywords.
Keywords:thesaurus  new term selection  word frequency of document  time series clustering  k-means
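
The pipeline outlined in the abstract (normalize each document-frequency series, bring the series to equal length, cluster with k-means, keep candidates whose trend is growing) might be sketched roughly as follows. The abstract does not specify the normalization scheme, the length-equalization method, the value of k, or how a "growth state" is decided, so peak scaling, linear interpolation, k=2, and a positive least-squares slope of the cluster centroid are all assumptions made here for illustration. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import random


def normalize(series):
    """Scale a document-frequency series by its peak (assumed scheme)."""
    peak = max(series)
    return [v / peak for v in series] if peak > 0 else [0.0] * len(series)


def resample(series, length):
    """Linearly interpolate a series onto `length` >= 2 points (assumed scheme)."""
    if len(series) == 1:
        return [float(series[0])] * length
    out = []
    for i in range(length):
        pos = i * (len(series) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(series) - 1)
        frac = pos - lo
        out.append(series[lo] * (1 - frac) + series[hi] * frac)
    return out


def slope(series):
    """Least-squares slope of a series against its index."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means with squared Euclidean distance and seeded random init."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: mean of each non-empty cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_growing(candidates, k=2, length=5):
    """Return candidates in clusters whose centroid slopes upward (assumed rule)."""
    names = list(candidates)
    points = [resample(normalize(candidates[n]), length) for n in names]
    labels, centroids = kmeans(points, k)
    growing = {c for c in range(k) if slope(centroids[c]) > 0}
    return [n for n, lab in zip(names, labels) if lab in growing]


# Toy yearly document frequencies (invented for illustration only).
candidates = {
    "term_a": [1, 2, 4, 7, 10],
    "term_b": [2, 3, 5, 9],
    "term_c": [10, 8, 5, 2, 1],
    "term_d": [9, 6, 4, 1],
}
print(select_growing(candidates))  # ['term_a', 'term_b']
```

In this toy run the two rising series end up in one cluster and the two falling series in the other, so only the rising candidates are returned. In practice the selected candidates would still go to domain experts for judgment, as the abstract describes.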
This article is indexed in VIP (维普) and other databases.