Similar literature
 Found 20 similar documents; search took 31 ms
1.
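Wang Bo (王博), Liu Shengbo (刘盛博), Ding Kun (丁堃), Liu Zeyuan (刘则渊). 《科研管理》 (Science Research Management), 2015, 36(3): 111-117
Topic models are an effective way to extract the latent topics of large text collections. This paper introduces the Latent Dirichlet Allocation (LDA) topic model into patent content analysis to partition patents by topic, addressing the overly coarse, outdated, and unscientific patent topic classifications used previously. Building on the original model, it constructs an LDA institution-topic model that jointly models the subjects and objects of patent knowledge, enabling analysis of the intrinsic relationships between patent topics and institutions. Finally, taking LTE technology in the telecommunications industry as an example, it verifies that the model can effectively partition patent topics and measure the competitive position of patent knowledge subjects under each topic.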

2.
Towards mapping library and information science   Total citations: 3, self-citations: 1, citations by others: 3
In an earlier study by the authors, full-text analysis and traditional bibliometric methods were combined to map research papers published in the journal Scientometrics. The main objective was to develop appropriate techniques of full-text analysis and to improve the efficiency of the individual methods in the mapping of science. The number of papers was, however, rather limited. In the present study, we extend the quantitative linguistic part of the previous studies to a set of five journals representing the field of Library and Information Science (LIS). Almost 1000 articles and notes published in the period 2002–2004 have been selected for this exercise. The optimum solution for clustering LIS is found for six clusters. The combination of different mapping techniques, applied to the full text of scientific publications, results in a characteristic tripod pattern. Besides two clusters in bibliometrics, one cluster in information retrieval and one containing general issues, webometrics and patent studies are identified as small but emerging clusters within LIS. The study concludes with an analysis of how the selected journals are represented in the clusters.

3.
Automated legal text classification is a prominent research topic in the legal field. It lays the foundation for building an intelligent legal system. Current literature focuses on international legal texts, such as Chinese cases, European cases, and Australian cases. Little attention is paid to text classification for U.S. legal texts. Deep learning has been applied to improve text classification performance, but its effectiveness needs further exploration in domains such as the legal field. This paper investigates legal text classification with a large collection of labeled U.S. case documents by comparing the effectiveness of different text classification techniques. We propose a machine learning algorithm using domain concepts as features and random forests as the classifier. Our experimental results on 30,000 full U.S. case documents in 50 categories demonstrate that our approach significantly outperforms a deep learning system built on multiple pre-trained word embeddings and deep neural networks. In addition, using only the top 400 domain concepts as features for building the random forests achieved the best performance. This study provides a reference for selecting machine learning techniques to build high-performance text classification systems in the legal domain or other fields.

4.
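To address the lack of semantic context, weak interpretability, and vague topic boundaries in existing methods for identifying patent technology topics, this paper proposes a technology topic identification method that fuses structured patent data with text semantics to tackle these problems together, helping domain practitioners grasp the content of technical research and providing scientific support for R&D decisions. The method uses patent IPC codes as structured data to improve plain-text topic modeling, obtaining topic word vectors guided by IPC codes and expert classification opinions. It also uses word2vec to obtain semantic word vectors and concatenates the two sets of vectors, yielding precise, easily interpretable technology topics that meet the needs of fine-grained analysis. Finally, an empirical study in the field of non-small cell lung cancer treatment confirms that the method is scientifically sound and practical.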

5.
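Application of the LDA model in patent text classification   Total citations: 1, self-citations: 0, citations by others: 1
To address the problems of the vector space model text representation used in traditional automatic patent text classification, a patent text classification method based on the LDA model is proposed. The method models the patent text corpus with the LDA topic model and extracts the document-topic and topic-term matrices, which reduces dimensionality and captures semantic links between documents. It introduces a class-topic matrix to extend each class semantically with topics, builds a hierarchical classification using topic similarity, and applies KNN classification at the subclass level. Experimental results: compared with KNN patent text classification based on the vector space text representation, this method achieves higher classification evaluation scores.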

6.
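[Purpose/Significance] To study topic evolution in patent literature comprehensively with a probabilistic topic model, and to analyze the development process and trends of patented technology. [Method/Process] An LDA model is fitted to patent texts by time window; perplexity determines the optimal number of topics; topic vectors are extracted according to the structural characteristics of patent texts; JS divergence measures the association between topics; IPC classification codes are introduced to measure technology topic strength; finally, evolution is studied in three respects: topic strength, topic content, and technology topic strength. [Results/Conclusions] Experiments show that the method can mine the topics of patent literature in depth and analyzes well how patented technology evolves over time, helping practitioners understand the evolution and trends of patented technology.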
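
The abstract above measures the association between topics with JS (Jensen-Shannon) divergence. As a minimal sketch of that measure only, and not the authors' implementation, the snippet below compares topic-word distributions; the three distributions and their names are hypothetical, and in practice they would come from a fitted LDA model.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and within [0, 1] for log base 2."""
    # m is the pointwise average of the two distributions.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a shared 4-word vocabulary.
topic_window_1 = [0.5, 0.3, 0.1, 0.1]
topic_window_2 = [0.4, 0.4, 0.1, 0.1]
topic_unrelated = [0.0, 0.0, 0.5, 0.5]

near = js_divergence(topic_window_1, topic_window_2)
far = js_divergence(topic_window_1, topic_unrelated)
print(f"adjacent windows: {near:.4f}, unrelated topic: {far:.4f}")
```

A lower value indicates more closely related topics, so linking each topic to its lowest-divergence counterpart in the next time window is one plausible way to trace topic evolution.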

7.
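Research on technology evolution can trace the development path of a technical field and the history and current state of its internal technical activities, and is important for the science and technology strategy of governments and enterprises. Patent citation analysis has shortcomings in technology evolution research: it is hard to judge accurately how similar the technical topics of patents are, and the scope of analysis and the richness of latent information are limited. Text mining can analyze the textual content of patents in depth and can partly compensate for these shortcomings. This work therefore explores combining patent citation analysis with text mining: it builds a C-T (Citation-Text) patent network from the patent citation matrix and the patent text similarity matrix, then applies cluster analysis and visualization to the C-T patent network to study how a technology evolves, with the aim of innovating in, and enriching, the methods of technology evolution research.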

8.
In this era, the proliferating role of social media in our lives has popularized the posting of short texts. Short texts contain limited context and have unique characteristics, which makes them difficult to handle. Every day billions of short texts are produced in the form of tags, keywords, tweets, phone messages, messenger conversations, social network posts, etc. The analysis of these short texts is imperative in the field of text mining and content analysis. The extraction of precise topics from large-scale short text documents is a critical and challenging task. Conventional approaches fail to obtain word co-occurrence patterns in topics due to the sparsity problem in short texts, such as text over the web, social media like Twitter, and news headlines. Therefore, in this paper, the sparsity problem is ameliorated by presenting a novel fuzzy topic modeling (FTM) approach for short text from a fuzzy perspective. In this research, the local and global term frequencies are computed through a bag-of-words (BOW) model. To remove the negative impact of high dimensionality on the global term weighting, principal component analysis is adopted; thereafter the fuzzy c-means algorithm is employed to retrieve the semantically relevant topics from the documents. The experiments are conducted over three real-world short text datasets: the snippets dataset is small, whereas the other two datasets, Twitter and questions, are larger. Experimental results show that the proposed approach discovered the topics more precisely and performed better than other state-of-the-art baseline topic models such as GLTM, CSTM, LTM, LDA, Mix-gram, BTM, SATM, and DREx+LDA. The performance of FTM is also demonstrated in classification, clustering, topic coherence and execution time. FTM classification accuracy is 0.95, 0.94, 0.91, 0.89 and 0.87 on the snippets dataset with 50, 75, 100, 125 and 200 topics. The classification accuracy of FTM on the questions dataset is 0.73, 0.74, 0.70, 0.68 and 0.78 with 50, 75, 100, 125 and 200 topics. The classification accuracies of FTM on the snippets and questions datasets are higher than those of state-of-the-art baseline topic models.

9.
Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data as such text is notoriously short and noisy, and often results are not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term-frequency inverse-document-frequency (tf-idf) matrices and word embedding models combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over data sets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all data sets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words based approach using tf-idf weights combined with embedding distance measures.

10.
11.
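[Purpose/Significance] An R&D commercialization opportunity (R&BD) refers to the process of developing technology by integrating market and innovation in order to create valuable technology. Identifying different R&BD opportunities and formulating different strategies based on different mapping scenarios between trademarks and patents not only puts trademarks, an important intellectual property strategy, to full use, but also allows targeted strategies to be formulated for the R&BD opportunities in each scenario. [Method/Process] To reduce the failure rate of technology commercialization, a research framework is proposed that takes trademark and patent data as its basis, uses text mining to identify trademark and patent gaps, uses a text similarity algorithm and inverse mapping to find correspondences between trademarks and patents, and combines scenario analysis to identify potential R&BD opportunities scenario by scenario. Trademarks and patents in the telemedicine field are taken as the objects of analysis to predict potential R&BD opportunities in that field. [Results/Conclusions] The empirical results show that 29 potential trademark gaps were found in the selected technical field. Inverse mapping between trademarks and patents identified 11 trademark gaps that correspond to existing patents, along with 24 existing patented technologies with commercialization potential; summarizing topics with the LDA topic model and manual analysis shows that the potential R&BD opportunities in telemedicine under this scenario fall mainly into 4 areas. Seven trademark gaps were found to correspond to 3 patent gaps, and 3 potential R&BD opportunities were derived under this scenario. Eleven trademark gaps correspond to neither existing patents nor patent gaps; this scenario currently has no supporting data and cannot be studied in detail.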

12.
Addressed here is the issue of ‘topic analysis’ which is used to determine a text’s topic structure, a representation indicating what topics are included in a text and how those topics change within the text. Topic analysis consists of two main tasks: topic identification and text segmentation. While topic analysis would be extremely useful in a variety of text processing applications, no previous study has so far sufficiently addressed it. A statistical learning approach to the issue is proposed in this paper. More specifically, topics here are represented by means of word clusters, and a finite mixture model, referred to as a stochastic topic model (STM), is employed to represent a word distribution within a text. In topic analysis, a given text is segmented by detecting significant differences between STMs, and topics are identified by means of estimation of STMs. Experimental results indicate that the proposed method significantly outperforms methods that combine existing techniques.

13.
A methodology for automatically identifying and clustering semantic features or topics in a heterogeneous text collection is presented. Textual data is encoded using a low rank nonnegative matrix factorization algorithm to retain natural data nonnegativity, thereby eliminating the need to use subtractive basis vector and encoding calculations present in other techniques such as principal component analysis for semantic feature abstraction. Existing techniques for nonnegative matrix factorization are reviewed and a new hybrid technique for nonnegative matrix factorization is proposed. Performance evaluations of the proposed method are conducted on a few benchmark text collections used in standard topic detection studies.

14.
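Liu Xin (刘鑫), Yu Xiang (余翔). 《科研管理》 (Science Research Management), 2016, 37(11): 150-158
After reviewing progress in patent text mining research in China and abroad, this paper explores a patent function analysis method based on mining specific verb-object (AO) structures in patent text. By defining, extracting, and analyzing patent functions, it links patented technologies to related industries and identifies potential fields of industrialization from patent text. More importantly, the paper proposes an S-curve and an S-index describing the functional utility of patented technology, refines the quantitative evaluation model for the industrialization suitability of patented technology, and defines three evaluation indicators in that model: the S-index, the absolute importance index (AI) of a patent function, and the relative importance index (RI) of a patent function. Finally, taking patents with the function "reduce PM2.5" as an example, it verifies the feasibility of the function-analysis-based evaluation model and offers a new approach to choosing industrialization paths for Chinese patented technology.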

15.
The paper proposes a new approach to creating a patent classification system to replace the IPC or UPC system for conducting patent analysis and management. The new approach is based on bibliometric co-citation analysis. The traditional approach to patent management, which is based on either the IPC or UPC, is too general to meet the needs of specific industries. In addition, some patents are placed in incorrect categories, making it difficult for enterprises to carry out R&D planning, technology positioning, patent strategy-making and technology forecasting. Therefore, it is essential to develop a patent classification system that is adaptive to the characteristics of a specific industry. The analysis in this approach is divided into three phases. Phase I selects appropriate databases in which to conduct patent searches according to the subject and objective of the study and then selects basic patents. Phase II uses the co-cited frequency of the basic patent pairs to assess their similarity. Phase III uses factor analysis to establish a classification system and assess the efficiency of the proposed approach. The main contribution of this approach is a patent classification system based on patent similarities that helps patent managers understand the basic patents for a specific industry, the relationships among categories of technologies and the evolution of a technology category.
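
Phase II in the abstract above relies on the co-cited frequency of basic patent pairs. The following is a minimal sketch of how such counts could be tallied, assuming a mapping from each citing patent to the basic patents it cites; the function name `cocitation_counts` and the patent identifiers are hypothetical, and the paper's own data and any normalization it applies are not reproduced here.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(citing_to_cited):
    """Count how often each pair of patents is cited together.

    citing_to_cited maps a citing patent ID to the list of patents it cites.
    Each pair is stored as a sorted tuple so (A, B) and (B, A) coincide.
    """
    pair_counts = Counter()
    for cited in citing_to_cited.values():
        # set() guards against a patent being listed twice by one citer.
        for a, b in combinations(sorted(set(cited)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

# Hypothetical citation data: C* are citing patents, P* are basic patents.
citations = {
    "C1": ["P1", "P2", "P3"],
    "C2": ["P1", "P2"],
    "C3": ["P2", "P3"],
    "C4": ["P4"],
}

counts = cocitation_counts(citations)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

The resulting pair counts form a symmetric similarity matrix, which is the kind of input a factor analysis step such as Phase III would take.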

16.
Text categorization is an important research area and has been receiving much attention due to the growth of online information and of the Internet. Automated text categorization is generally cast as a multi-class classification problem. Much previous work focused on binary document classification problems. Support vector machines (SVMs) excel in binary classification, but the elegant theory behind the large-margin hyperplane cannot be easily extended to multi-class text classification. In addition, training time and scaling are also important concerns. On the other hand, other techniques that extend naturally to multi-class classification are generally not as accurate as SVMs. This paper presents a simple and efficient solution to multi-class text categorization. Classification problems are first formulated as optimization via discriminant analysis. Text categorization is then cast as the problem of finding coordinate transformations that reflect the inherent similarity in the data. While most previous approaches decompose a multi-class classification problem into multiple independent binary classification tasks, the proposed approach enables direct multi-class classification. By using generalized singular value decomposition (GSVD), a coordinate transformation that reflects the inherent class structure indicated by the generalized singular values is identified. Extensive experiments demonstrate the efficiency and effectiveness of the proposed approach.

17.
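Patent information analysis can be divided into analysis of external information (bibliographic data) and analysis of internal information (the technical text of the patent). Based on data mining theory, a new patent information identification and classification method is designed using five bibliographic attributes: IPC classification, patent title, abstract field, designer, and patentee. An empirical test on shipbuilding industry technology achieved good analytical results.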

18.
Through the recent NTCIR workshops, patent retrieval has posed many challenging issues for the information retrieval community. Unlike newspaper articles, patent documents are very long and well structured. These characteristics make it necessary to reassess existing retrieval techniques, which have mainly been developed for short, unstructured documents such as newspapers. This study investigates cluster-based retrieval in the context of the invalidity search task of patent retrieval. Cluster-based retrieval assumes that clusters provide additional evidence for matching the user’s information need. Thus far, cluster-based retrieval approaches have relied on automatically created clusters. Fortunately, all patents have manually assigned cluster information: international patent classification codes. International patent classification is a standard taxonomy for classifying patents, and currently has about 69,000 nodes organized into a five-level hierarchical system. Thus, patent documents could provide the best test bed for developing and evaluating cluster-based retrieval techniques. Experiments using the NTCIR-4 patent collection showed that the cluster-based language model can help improve on the cluster-less baseline language model.

19.
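Zhou Lixia (周丽霞), Yang Zhihe (杨志和). 《现代情报》 (Journal of Modern Information), 2016, 36(3): 115-120
The article analyzes the patent research literature statistically and identifies three stages in the development of patent research in China: a preparatory stage before 1970, a slow-growth stage from 1971 to 1999, and an absolute-growth stage after 2000. It discusses in depth the progress of China's patent legal system in each stage. It analyzes the distribution of patent research across sectors, covering ten major fields including scientific research management, with research topics concentrated on patent analysis, patent strategy, and the patent system. The disciplines engaged in patent research are represented by management, law, economics, and library and information science, and the source journals of patent research literature are mostly in intellectual property, science management, and library and information science. Doctoral and master's theses choose relatively specific topics, including patent infringement and patent protection.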

20.
The vector space model (VSM) is a textual representation method that is widely used in document classification. However, it remains space-intensive. One attempt to alleviate the space problem is to use dimensionality reduction techniques; however, such techniques have deficiencies such as losing some important information. In this paper, we propose a novel text classification method that uses neither VSM nor dimensionality reduction techniques. The proposed method is a space-efficient method that utilizes the first-order Markov model for hierarchical Arabic text classification. For each category and sub-category, a Markov chain model is prepared based on sequences of neighboring characters. The prepared models are then used to score documents for classification purposes. For evaluation, we used a hierarchical Arabic text data collection that contains 11,191 documents belonging to eight topics distributed over 3 levels. The experimental results show that the Markov chain based method significantly outperforms the baseline system that employs the latent semantic indexing (LSI) method. That is, the proposed method enhances the F1-measure by 3.47%. The novelty of this work lies in the idea of decomposing words into sequences of characters, which was found to be a promising approach in terms of space and accuracy. To the best of our knowledge, this is the first attempt to conduct research on hierarchical Arabic text classification with such a relatively large data collection.
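
The abstract above scores documents with a per-category first-order Markov chain over neighboring characters. The snippet below is a rough, flat (non-hierarchical) sketch of that idea under stated assumptions: the class name `CharMarkovClassifier`, the add-alpha smoothing, and the tiny English training strings are all illustrative choices, not details taken from the paper.

```python
import math
from collections import defaultdict

class CharMarkovClassifier:
    """One first-order character transition table per category."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha      # add-alpha smoothing constant (an assumption)
        self.transitions = {}   # category -> prev char -> next char -> count
        self.alphabet = set()   # every character seen during training

    def fit(self, labeled_texts):
        for category, text in labeled_texts:
            table = self.transitions.setdefault(
                category, defaultdict(lambda: defaultdict(int)))
            self.alphabet.update(text)
            # Count each pair of neighboring characters.
            for prev, nxt in zip(text, text[1:]):
                table[prev][nxt] += 1
        return self

    def log_score(self, category, text):
        """Smoothed log-probability of the text's character transitions."""
        table = self.transitions[category]
        size = len(self.alphabet) + 1  # one extra slot for unseen characters
        total = 0.0
        for prev, nxt in zip(text, text[1:]):
            row = table.get(prev, {})
            count = row.get(nxt, 0)
            row_total = sum(row.values())
            total += math.log(
                (count + self.alpha) / (row_total + self.alpha * size))
        return total

    def predict(self, text):
        return max(self.transitions, key=lambda c: self.log_score(c, text))

# Hypothetical toy training data, one string per category.
model = CharMarkovClassifier().fit([
    ("sports", "goal goal match goal"),
    ("finance", "bank rate bank loan"),
])
print(model.predict("goal match"), model.predict("bank loan"))
```

Each category only stores a table of character-pair counts, which is why this kind of model can be compact compared with a document-term matrix; a hierarchical variant would train one such table per node in the category tree.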
