
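Automatic Selection of Domain-Specific Stopwords in Topic Model of Patent Text (专利文本主题建模中领域停用词自动选取研究)
Cite this article: Yu Yan, Zhao Nianxuan. Automatic Selection of Domain-Specific Stopwords in Topic Model of Patent Text[J]. Library and Information Service (图书情报工作), 2018, 62(11): 120-126.
Authors: Yu Yan (俞琰)  Zhao Nianxuan (赵乃瑄)
Affiliations: 1. Information Service Department, Nanjing Tech University, Nanjing 210009; 2. School of Electronics and Computer, Southeast University Chengxian College, Nanjing 211816
Funding: This paper is one of the research outcomes of the Ministry of Education Humanities and Social Sciences Planning Project "Research on the Construction of Skill Knowledge Graphs in the Big Data Era" (Grant No. 16YJAZH073) and the National Social Science Fund of China General Project "Research on Multi-Dimensional and Multi-Level Patent Text Mining to Support Innovative Design in the Big Data Era" (Grant No. 17BTQ059).
Abstract: [Purpose/Significance] Automatic selection of domain-specific stopwords for topic modeling of patent text has not yet been adequately studied. This paper proposes a new method for selecting such stopwords automatically, for use in topic model analysis of patent text, to improve the discriminability and modeling quality of patent topic models. [Method/Process] Domain-specific stopwords are, in essence, words that carry little information and discriminate poorly between patent texts of different categories. The method therefore introduces an auxiliary patent text collection, uses category entropy to measure how each word is distributed across categories, ranks the words by category entropy, and selects the words with the highest category entropy as domain-specific stopwords. [Result/Conclusion] Experiments on patent text data verify the feasibility and effectiveness of the method, which effectively improves the discriminability of patent topic models.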

Keywords: patent text  topic modeling  domain-specific stopword  automatic selection
Received: 2017-11-09
Revised: 2018-03-05

Automatic Selection of Domain-Specific Stopwords in Topic Model of Patent Text
Yu Yan, Zhao Nianxuan. Automatic Selection of Domain-Specific Stopwords in Topic Model of Patent Text[J]. Library and Information Service, 2018, 62(11): 120-126.
Authors: Yu Yan  Zhao Nianxuan
Institution: 1. Information Service Department, Nanjing Tech University, Nanjing 211816; 2. Computer Science Department, Southeast University Chengxian College, Nanjing 211816
Abstract: [Purpose/significance] Because automatic selection of domain-specific stopwords for topic models of patent text has not been sufficiently studied, this paper proposes a new method for selecting domain-specific stopwords automatically, for use in topic model analysis of patent text, in order to improve the differentiation and modeling quality of the patent topic model. [Method/process] In essence, domain-specific stopwords are words that carry relatively little information and differentiate poorly between different categories of patents. Therefore, this paper introduces an auxiliary multi-category patent text dataset and measures the distribution of each word across categories using category entropy. The words are then ranked by category entropy, and those with the highest category entropy are chosen as domain-specific stopwords. [Result/conclusion] Experimental results on patent text data show the feasibility and validity of the proposed method, which can improve the differentiation and quality of topic models for patent text analysis.
Keywords:patent text  topic model  domain-specific stopword  automatic selection  
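
The selection procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definition of category entropy, so the sketch assumes the Shannon entropy of a word's distribution over categories, H(w) = -Σ_c p(c|w) log2 p(c|w), with p(c|w) estimated from raw token counts. The function names, the `(category, tokens)` input format, and the `top_k` parameter are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def category_entropy(docs):
    """Return {word: entropy of the word's distribution over categories}.

    docs is an iterable of (category, tokens) pairs drawn from an auxiliary
    multi-category patent corpus (hypothetical input format).
    """
    counts = defaultdict(Counter)  # word -> Counter mapping category -> frequency
    for category, tokens in docs:
        for token in tokens:
            counts[token][category] += 1

    entropy = {}
    for word, per_category in counts.items():
        total = sum(per_category.values())
        # Shannon entropy of p(category | word); an assumed definition.
        entropy[word] = -sum(
            (n / total) * math.log2(n / total) for n in per_category.values()
        )
    return entropy


def select_domain_stopwords(docs, top_k):
    """Pick the top_k words with the highest category entropy."""
    entropy = category_entropy(docs)
    # Sort by entropy, highest first; break ties alphabetically for determinism.
    ranked = sorted(entropy, key=lambda w: (-entropy[w], w))
    return ranked[:top_k]


# Toy auxiliary corpus: three categories, already tokenized.
toy_docs = [
    ("A", ["device", "method", "battery"]),
    ("B", ["device", "method", "antenna"]),
    ("C", ["device", "method", "polymer"]),
]
stopwords = select_domain_stopwords(toy_docs, top_k=2)
```

In this toy corpus, "device" and "method" occur equally in all three categories and so receive the maximum entropy, while "battery", "antenna", and "polymer" each occur in a single category and receive zero. A practical version would probably also need a minimum-frequency filter and some normalization for unequal category sizes, neither of which the abstract specifies.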