Similar documents
 20 similar documents found.
1.
Based on an extensive literature survey, this paper systematically reviews the progress of knowledge discovery research at home and abroad over the past five years. It summarizes the methods involved in the knowledge discovery process from several angles, including association mining algorithms, intermediate-term/target-term ranking methods, and filtering methods, and finally proposes directions for improving and developing knowledge discovery in the future.

2.
This paper introduces the concept of the annual author-keyword transaction sequence and treats it as a new research object. It explains the significance of sequential pattern mining over sets of annual author-keyword transaction sequences and presents the main mining steps, including a preprocessing step that generalizes author keywords into hypernyms representing particular research directions. Taking China's information science field as an example, an empirical study is conducted on source-literature data indexed by CSSCI for 1998-2011 to mine sequential patterns of individual scholars switching research directions. The mining results are interpreted from the perspective of the information science discipline, and the proposed method is summarized.
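As a rough illustration of the idea only (not the authors' method or data), the following Python sketch counts frequent direction switches in toy per-author yearly topic sequences; full sequential pattern mining would also consider non-adjacent subsequences.

```python
# Toy sketch: count adjacent research-direction switches in per-author yearly
# topic sequences; the sequences and the support threshold are invented.
from collections import Counter

author_sequences = [
    ["information retrieval", "text mining", "knowledge graph"],
    ["bibliometrics", "text mining", "knowledge graph"],
    ["information retrieval", "knowledge graph"],
]

pairs = Counter()
for seq in author_sequences:
    for a, b in zip(seq, seq[1:]):   # adjacent year-to-year transitions only
        pairs[(a, b)] += 1

min_support = 2
print([(p, c) for p, c in pairs.items() if c >= min_support])
# e.g. [(('text mining', 'knowledge graph'), 2)]
```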

3.
Research progress on knowledge structure partitioning methods that fuse document content and links
Knowledge structure partitioning is an important method for determining research topics, discovering emerging research areas, and identifying key research content. This paper reviews new knowledge structure partitioning methods that fuse document content features with document link information, surveying current fusion approaches at four levels: raw-level information fusion via database extension, the combination of text mining and bibliometric methods, lexical citation graphs, and word-reference co-occurrence. It analyzes their main research content and methods, and synthesizes the clustering methods currently used for knowledge structure partitioning, in order to offer new ideas for this line of research from a fresh perspective.

4.
Considering the characteristics of English scientific literature, this paper proposes a key-content identification method that combines rules and statistics. The method first marks features in the source document and converts it into an intermediate document that is easier to process; it then uses feature restoration, cue-word matching, topic identification, and proximity analysis to extract the main information representing the text from the intermediate document and generate a target document. The method can effectively help researchers read large volumes of English scientific literature and improve reading efficiency.

5.
[Purpose/Significance] In the big-data era, text topic mining plays an increasingly important role in intelligence analysis. By comparing the features of two mainstream text topic mining methods, co-word analysis and LDA modeling, this study examines their specific characteristics and provides a reference for practitioners choosing a suitable topic mining method for their data. [Method/Process] Two comparative settings are studied: first, the two methods' ability to mine topic distribution information and topic evolution information from the same type of text data across different time periods; second, their ability to extract correct topics from different types of text data within the same period. [Result/Conclusion] Across time periods, LDA modeling steadily improves over co-word analysis in mining topic distribution information and can uncover more fine-grained topic evolution information; within the same period, LDA modeling clearly outperforms co-word analysis in extracting correct topics from texts with vague semantic relations and rough logical structure.
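For context, a minimal LDA topic-modeling sketch in Python using gensim is shown below; the toy token lists, topic count, and parameters are placeholders, not the corpora or settings compared in the paper.

```python
# Minimal LDA sketch with gensim; corpus and hyperparameters are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

docs = [["text", "mining", "topic", "model"],
        ["keyword", "coword", "analysis", "network"],
        ["topic", "evolution", "lda", "model"]]

dictionary = corpora.Dictionary(docs)          # token -> integer id
bow = [dictionary.doc2bow(d) for d in docs]    # bag-of-words corpus

lda = LdaModel(bow, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, terms in lda.print_topics():
    print(topic_id, terms)                     # per-topic word distribution
```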

6.
王佳敏  吴乐艳  李鹏程  熊资  陆伟  杜佳 《情报科学》2021,39(11):173-179
[Purpose/Significance] This paper builds a large-scale dataset of acknowledgement functions in academic literature and proposes a SciBERT-based acknowledgement function recognition model, providing high-quality data support and an effective recognition method for mining and analyzing acknowledgement texts. [Method/Process] Acknowledgement function classification rules were manually extended and refined to generate rule templates for automatic indexing, which were used to label the functions of 1,750,275 acknowledgement texts. On this basis, the SciBERT model was used to obtain vector representations of acknowledgement sentences, a Softmax regression layer was introduced for automatic function classification, a warmup strategy was used for model tuning, and the results were compared against baseline experiments. [Result/Conclusion] A large-scale, high-quality acknowledgement function dataset was obtained, with a manually verified accuracy of 93%; the SciBERT-based model outperforms the baseline models, achieving an F1 above 98% on the extended dataset and improvements of varying degrees across individual categories. [Innovation/Limitation] The recognition model does not yet consider or integrate features unique to acknowledgement texts.
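To make the general pattern concrete, here is a hedged Python sketch of SciBERT sentence vectors followed by a softmax classification layer; the checkpoint name is the public SciBERT release, while the label count and example sentence are placeholders rather than the authors' dataset or trained model.

```python
# Hedged sketch: SciBERT [CLS] vectors fed to a linear + softmax classifier.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
enc = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

sentences = ["We thank the National Science Foundation for financial support."]
batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    cls_vec = enc(**batch).last_hidden_state[:, 0]   # sentence representations

num_functions = 7                                     # placeholder label count
classifier = torch.nn.Linear(enc.config.hidden_size, num_functions)
probs = torch.softmax(classifier(cls_vec), dim=-1)    # acknowledgement-function probabilities
print(probs)
```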

7.
To deliver personalized, proactive information services, Web mining has become a new research topic in recent years. Mining usually involves processing input text, and Chinese word segmentation is the foundation of Chinese information processing: Chinese text is character-based, written Chinese takes the character as its smallest unit, and there are no explicit boundaries between words. Whether segmentation is accurate often directly affects the relevance ranking of search results, so word segmentation is the first problem to solve in Chinese text analysis. This paper discusses Chinese word segmentation techniques and, taking the 2-gram model as an example, studies how to implement Chinese word segmentation in Java.
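The abstract's original implementation is in Java; as a language-agnostic illustration, the following Python sketch segments a short string with a dictionary plus a smoothed bigram language model, with the vocabulary and counts invented for the example.

```python
# Toy bigram segmentation sketch (vocabulary and counts are invented).
from math import log

vocab = {"中文", "分词", "技术", "中", "文", "分", "词"}
bigram = {("<s>", "中文"): 3, ("中文", "分词"): 4, ("分词", "技术"): 2}
unigram = {"<s>": 10, "中文": 5, "分词": 5, "技术": 3, "中": 1, "文": 1, "分": 1, "词": 1}

def score(prev, word):
    # Add-one smoothed bigram log-probability P(word | prev).
    return log((bigram.get((prev, word), 0) + 1) / (unigram.get(prev, 0) + len(vocab)))

def segment(text, max_len=4):
    best = {0: (0.0, ["<s>"])}                    # best[i] = (log-prob, path) for text[:i]
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            word = text[j:i]
            if word in vocab and j in best:
                prob = best[j][0] + score(best[j][1][-1], word)
                if i not in best or prob > best[i][0]:
                    best[i] = (prob, best[j][1] + [word])
    return best[len(text)][1][1:]

print(segment("中文分词技术"))                      # -> ['中文', '分词', '技术']
```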

8.
A text vector feature mining algorithm based on multi-factor analysis of variance
Text vector feature mining is applied in information resource organization and management and has considerable value in big-data mining, but traditional algorithms suffer from poor accuracy. This paper proposes a text vector feature mining algorithm based on multi-factor analysis of variance (ANOVA). Multi-factor ANOVA is used to derive feature mining patterns across several corpora; combined with an ant colony algorithm and a fitness-probability-regularized training transfer rule, the maximum probability of effective features in the dataset at the latest stage of population evolution is obtained. An optimal-partition-based algorithm for selecting initial K-means cluster centers first partitions the data samples and then determines the initial centers from the sample distribution, improving text feature mining performance. Simulation results show that the algorithm improves the clustering of text vector features and hence feature mining performance, with high feature recall and detection rates and low time cost, giving it broad application value in data mining and related fields.
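As a simplified illustration of the initialization idea only (the ANOVA and ant colony components are not reproduced), the sketch below partitions synthetic samples along the first principal axis and uses the per-partition means as initial K-means centers; the data and parameters are invented.

```python
# Hedged sketch: partition-based initial centers for K-means on synthetic data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                     # placeholder text feature vectors

k = 5
axis = PCA(n_components=1).fit_transform(X).ravel()
parts = np.array_split(np.argsort(axis), k)        # k contiguous partitions along the axis
centers = np.vstack([X[p].mean(axis=0) for p in parts])

km = KMeans(n_clusters=k, init=centers, n_init=1, random_state=0).fit(X)
print(km.inertia_)                                 # within-cluster sum of squares
```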

9.
This study combines text mining with co-word analysis for patent research, aiming to gain a deeper understanding of the current state and development trends of different technical topics through content analysis of patents. Taking the radio frequency identification (RFID) field as the research object, text mining is applied to patent abstracts in this field to extract keywords that reflect its characteristics; cluster analysis based on keyword co-occurrence yields six current technical topics in the RFID field. These six topics are further analyzed with a strategic diagram to explore the development trend of each, providing decision-making references for enterprise technological innovation and the formulation of industrial development strategy.
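A compact sketch of the co-word step follows: a keyword co-occurrence matrix is built from toy keyword lists and the terms are clustered hierarchically; the keywords, linkage method, and cluster count are illustrative, not the paper's RFID patent data or six-topic result.

```python
# Hedged sketch: keyword co-occurrence matrix + hierarchical clustering of terms.
import numpy as np
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

doc_keywords = [["RFID", "tag", "antenna"],
                ["RFID", "reader", "protocol"],
                ["tag", "antenna", "chip"],
                ["reader", "protocol", "security"]]

terms = sorted({t for kws in doc_keywords for t in kws})
idx = {t: i for i, t in enumerate(terms)}
co = np.zeros((len(terms), len(terms)))
for kws in doc_keywords:                           # count pairwise co-occurrences
    for a, b in combinations(sorted(set(kws)), 2):
        co[idx[a], idx[b]] += 1
        co[idx[b], idx[a]] += 1

Z = linkage(co, method="ward")                     # cluster terms by co-occurrence profile
print(dict(zip(terms, fcluster(Z, t=2, criterion="maxclust"))))
```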

10.
[Purpose/Significance] Text mining is used to automatically discover more representative topic terms from document content, locate them within chapters, and index topics with visualization techniques, helping readers discover potential relationships among document topics intuitively and efficiently. [Method/Process] Topic terms are mined at the content level of documents using text mining, and the extracted information is presented with visualization tools; on this basis, a visual automatic topic indexing system is built and its indexing performance is verified on several topics in the Gesar domain. [Result/Conclusion] The results show that the method achieves content-level visual automatic topic indexing in the Gesar domain, locating chapters, paragraphs, and sentences quickly and accurately. The process of obtaining indexing information is intuitive, visual, and interactive, which improves user experience and engagement. The system is validated on Hero Gesar (《英雄格萨尔》) as an example, but the indexing method itself is not domain-specific and can be applied to literature in other fields.

11.
Latent Semantic Indexing (LSI) uses the singular value decomposition to reduce noisy dimensions and improve the performance of text retrieval systems. Preliminary results have shown modest improvements in retrieval accuracy and recall, but these studies have mainly explored small collections. In this paper we investigate text retrieval on a larger document collection (TREC) and focus on the distribution of word norm (magnitude). Our results indicate the inadequacy of word representations in LSI space on large collections. We emphasize the query-expansion interpretation of LSI and propose an LSI term normalization that achieves better performance on larger collections.
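For readers unfamiliar with LSI, a minimal Python sketch of the pipeline (TF-IDF, truncated SVD, and unit-normalized term vectors in the reduced space) is given below; the toy corpus and dimensionality stand in for the TREC collection and settings studied in the paper, and the normalization shown is only a simple variant of the idea.

```python
# Hedged LSI sketch: truncated SVD on TF-IDF, with unit-norm term vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

docs = ["latent semantic indexing reduces noisy dimensions",
        "singular value decomposition for text retrieval",
        "query expansion improves retrieval recall"]

X = TfidfVectorizer().fit_transform(docs)          # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0).fit(X)
doc_vectors = svd.transform(X)                     # documents in the reduced LSI space
term_vectors = normalize(svd.components_.T)        # terms in LSI space, rescaled to unit norm
print(doc_vectors.shape, term_vectors.shape)
```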

12.
As the number of text documents on the Internet grows explosively, hierarchical document clustering has proven useful for grouping similar documents in versatile applications. However, most document clustering methods still face challenges of high dimensionality, scalability, accuracy, and meaningful cluster labels. In this paper, we present an effective Fuzzy Frequent Itemset-Based Hierarchical Clustering (F2IHC) approach, which uses a fuzzy association rule mining algorithm to improve the clustering accuracy of the Frequent Itemset-Based Hierarchical Clustering (FIHC) method. In our approach, key terms are extracted from the document set, and each document is pre-processed into the designated representation for the subsequent mining process. Then, a fuzzy association rule mining algorithm for text is employed to discover a set of highly related fuzzy frequent itemsets, which contain key terms regarded as the labels of the candidate clusters. Finally, the documents are clustered into a hierarchical cluster tree by referring to these candidate clusters. We have conducted experiments to evaluate performance on the Classic4, Hitech, Re0, Reuters, and Wap datasets. The experimental results show that our approach not only retains the merits of FIHC but also improves its clustering accuracy.

13.
Given that the innovation paths identified in current technology opportunity analysis are often abstract and vague and, in many cases, require domain experts to interpret, this study takes cold storage technology as an example and builds a technology innovation path identification model based on text mining, machine learning algorithms, and a multidimensional patent map. First, an identification framework is built: relevant patent documents are preprocessed by word segmentation and cleaning, and a knowledge graph is constructed. Second, term frequency-inverse document frequency (TF-IDF) text mining is used to extract keywords from patent documents, and latent Dirichlet allocation (LDA) is then applied to cluster topics, reduce dimensionality, and distill innovation dimensions. Third, based on the target technical problems and preferred innovation principles, coupled transformations on the multidimensional patent map render concrete innovation paths of practical significance and promising value. Finally, extenics is used to compute a comprehensive relatedness score for each innovation path for ranking and selection. The aim is to reduce innovation cost, improve innovation efficiency, and provide decision-making references for enterprises to carry out precise technological innovation and continuously strengthen their core competitiveness.
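The keyword-extraction step can be illustrated with a few lines of Python (toy patent-like texts; the knowledge graph, LDA clustering, patent map, and extenics scoring stages are not shown).

```python
# Hedged sketch of TF-IDF keyword extraction on toy cold-storage-like texts.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cold storage refrigeration compressor control system",
        "insulation panel structure for cold storage warehouse",
        "refrigeration cycle energy saving control method"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = np.array(vec.get_feature_names_out())

for i, row in enumerate(X):
    top = terms[row.argsort()[::-1][:3]]           # top-3 TF-IDF keywords per document
    print(i, list(top))
```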

14.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. It is one of the most important methods for organizing and making use of the enormous amounts of information that exist in unstructured textual form, and it is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words in which the words (terms) are cut off from their finer context, i.e., their location in a sentence or in a document. Only the broader context of the document is used, with some form of term frequency information in the vector space. Consequently, the semantics of words that can be inferred from the finer context of their location in a sentence and their relations with neighboring words are usually ignored. However, the meaning of words and the semantic connections between words, documents, and even classes are clearly important, since methods that capture semantics generally reach better classification performance. Several surveys have analyzed diverse approaches to traditional text classification, and most of them cover the application of different semantic term relatedness methods to some degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. To fill this gap, we undertake a comprehensive discussion of semantic versus traditional text classification. This survey explores past and recent advances in semantic text classification and organizes existing approaches under five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning-based approaches, word/character sequence-enhanced approaches, and linguistically enriched approaches. Furthermore, it highlights the advantages of semantic text classification algorithms over traditional ones.

15.
Query response times within a fraction of a second are feasible in Web search engines due to indexing and caching techniques devised for large text collections partitioned and replicated across a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, based on a combination of self-indexed compressed text and posting-list caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking, and snippet extraction time. The advantage is that posting-list generation, document ranking, and snippet extraction can all be carried out within the space of the compressed document collection. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach, which treats inverted indexes and the corresponding documents as two separate entities in terms of processors and memory space.

16.
Research on the optimal time to transfer low-use documents to a storage repository
Based on a study of the current usage of low-use documents, this paper analyzes, from the empirical perspective of collection use, how low-use documents in different disciplines are distributed within the collection, and, drawing on an analysis of collection renewal rates, proposes an optimal model for transferring low-use documents to a storage repository.

17.
闫盛枫 《情报科学》2021,39(9):146-154
[Purpose/Significance] To detect the semantic topics of policy texts in a specific domain and reveal the areas of policy deployment in China and their future development trends. [Method/Process] A temporal modeling and visualization method for public policy texts is proposed that combines word-vector semantic enhancement with the DTM model. The DTM model is used to slice policy texts over time and perform topic modeling, while the Skip-gram word embedding technique of the deep learning Word2vec algorithm is used to effectively predict context words, enhancing semantic expressiveness and policy interpretability so as to reveal the deployment priorities of China's public policies more accurately. [Result/Conclusion] Experiments show that the proposed method offers better knowledge extraction and semantic expression for public policy topic identification and policy text quantification, and reveals information effectively for Chinese public policy mining. [Innovation/Limitation] The proposed method combining word-vector semantic enhancement with the DTM model improves the topical semantic expression of policy texts to a certain extent; future work will consider deep learning techniques such as LSTM and BERT to identify domain knowledge units and syntactic structures in policies.
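The Skip-gram embedding step can be sketched with gensim as below (sg=1 selects Skip-gram); the toy policy-like sentences and hyperparameters are placeholders, and the DTM stage is not shown.

```python
# Hedged sketch of Skip-gram word embeddings with gensim (toy corpus).
from gensim.models import Word2Vec

sentences = [["digital", "economy", "policy", "support"],
             ["innovation", "policy", "funding", "support"],
             ["digital", "innovation", "platform", "development"]]

w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100, seed=0)
print(w2v.wv.most_similar("policy", topn=3))       # nearest terms in embedding space
```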

18.
To address the problem that digital documents such as books and journal papers have few textual features, which makes feature vectors semantically inaccurate and classification poor, this paper proposes a digital document classification method based on feature semantic expansion. The method first uses TF-IDF to obtain core feature words with high TF-IDF values that strongly represent the document text; it then expands the core feature word set into semantic concepts with the HowNet semantic dictionary and the open knowledge base Wikipedia, building a low-dimensional, semantically rich concept vector space; finally, classifiers built with algorithms such as MaxEnt and SVM perform automatic classification of digital documents. Experimental results show that, compared with traditional feature-selection-based short-text classification methods, this method effectively expands short-text features semantically and improves classification performance for digital documents.
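A toy Python sketch of the expansion idea follows; a small hand-made concept dictionary stands in for HowNet/Wikipedia, and the texts, mapping, and top-k choice are invented for illustration.

```python
# Hedged sketch: expand top TF-IDF terms into broader concepts via a toy dictionary.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["neural network image classification model",
        "library catalog retrieval system design"]
concept_map = {"neural": "machine learning", "network": "machine learning",
               "catalog": "information organization", "retrieval": "information retrieval"}

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = np.array(vec.get_feature_names_out())

for i, row in enumerate(X):
    core = terms[row.argsort()[::-1][:3]]          # core feature words with high TF-IDF
    expanded = sorted({concept_map.get(t, t) for t in core})
    print(i, expanded)                             # low-dimensional concept features
```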

19.
[Purpose/Significance] Accurately grasping the public opinions reflected in Weibo comments and summarizing the focal points of public opinion helps capture and guide social opinion trends in a timely manner, and supports improvements in government credibility, rapid response capability, and execution. [Method/Process] Addressing the practical requirements of the social function of government Weibo comments and the technical needs of mining their textual features, and starting from deep-learning-based intelligent semantic understanding and mining of text, the paper proposes a suitable fine-grained quadruple annotation strategy and builds POF-BiLSTM-CRF, a deep learning model for opinion extraction and focus presentation from government Weibo comments: the fine-grained annotation strategy is defined, Word2vec trains word vectors, a BiLSTM learns comment features and outputs labels with their probabilities, a CRF learns context to optimize comment labeling, and opinion clustering and topic term extraction finally present the focal points of public opinion. [Result/Conclusion] Experiments on comments from the "中国警方在线" (China Police Online) Weibo account show that the proposed framework and model can intelligently extract public opinions, providing a reference for quickly grasping public views and for government decision-making.
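As a rough structural sketch of the tagging backbone only (the CRF layer, Word2vec initialization, and the four-tuple label scheme of POF-BiLSTM-CRF are omitted), a minimal PyTorch BiLSTM tagger is shown below with placeholder sizes and random toy input.

```python
# Minimal BiLSTM token tagger sketch in PyTorch (CRF layer omitted; sizes are placeholders).
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)       # per-token tag scores

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h)                                   # (batch, seq_len, num_tags)

model = BiLSTMTagger(vocab_size=1000, embed_dim=64, hidden_dim=64, num_tags=9)
tokens = torch.randint(0, 1000, (2, 12))                     # toy batch of comment token ids
print(model(tokens).argmax(-1))                              # predicted tag ids per token
```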

20.
Social emotion refers to the emotion evoked in the reader by a textual document. In contrast to the emotion cause extraction task, which analyzes the causes of the author's sentiments based on expressions in the text, identifying the causes of the social emotion evoked in the reader has not been explored previously. Social emotion mining and its cause analysis are not only an important research topic in Web-based social media analytics and text mining but also have applications in multiple domains. Because social emotion cause identification analyzes the causes of reader emotions elicited by a text that are not explicitly or implicitly expressed, it is a challenging task fundamentally different from previous research, and it requires a deeper understanding of the cognitive process underlying the inference of social emotion and its causes. In this paper, we propose the new task of social emotion cause identification (SECI). Inspired by the cognitive structure of emotions (OCC) theory, we present a Cognitive Emotion model Enhanced Sequential (CogEES) method for SECI. Specifically, based on the implications of the OCC model, our method first establishes the correspondence between words/phrases in text and the emotional dimensions identified in OCC and builds emotional dimension lexicons with 1,676 distinct words/phrases. Then, our method utilizes lexicon information and discourse coherence for the semantic segmentation of the document and the enhancement of clause representation learning. Finally, our method combines text segmentation and clause representation into a sequential model for cause clause prediction. We construct the SECI dataset for this new task and conduct experiments to evaluate CogEES. Our method outperforms the baselines and achieves over 10% F1 improvement on average, with better interpretability of the prediction results.
