首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到19条相似文献,搜索用时 234 毫秒
1.
针对向量空间模型中语义缺失问题,将语义词典(知网)应用到文本分类的过程中以提高文本分类的准确度。对于中文文本中的一词多义现象,提出改进的词汇语义相似度计算方法,通过词义排歧选取义项进行词语的相似度计算,将相似度大于阈值的词语进行聚类,对文本特征向量进行降维,给出基于语义的文本分类算法,并对该算法进行实验分析。结果表明,该算法可有效提高中文文本分类效果。  相似文献   

2.
基于SVM的多类文本分类研究   总被引:9,自引:0,他引:9  
基于统计学习理论.构建了SVM文本分类模型,并给出了模型参数的100自动选择算法,解决了以往参数靠经验确定的弊端。传统的文本分类系统不能处理一篇文档同属多类别的情形,论文将该情形归结为多类文本分类问题,提出二叉决策树SVM模型,并就农业机械化工程文档进行了实证分析。结果表明,该算法具有较好的分类效果。  相似文献   

3.
本文依据反馈学习的思想和支持向量机分类算法,在分析中文文本分类过程的基础上,给出了基于反馈学习的中文文本分类模型,通过实验研究了反馈学习对中文文本分类模型性能的影响.结果表明,反馈学习对分类性能的提高有明显作用,它是对实时变化信息的有效解决方法.  相似文献   

4.
基于SVM与KNN的中文文本分类比较实证研究   总被引:1,自引:0,他引:1  
本文详细介绍了中文文本分类过程以及SVM和KNN两种方法在中文文本分类中的具体步骤,给出了中文文本分类的模型。通过实验对SVM算法和传统的KNN算法应用于文本分类效果进行了比较性实证研究。研究表明,SVM分类器较KNN在处理中文文本分类问题上有更良好的分类效果,有较高的查全率和查准率。  相似文献   

5.
赖娟 《科技通报》2012,28(2):152-154
研究了中文词自动分类问题。针对传统的蚁群算法中文词语分类精确度低等问题,提出了一种将蚁群算法应用到了中文词语自动分类中。方法建立在首先对大规模语料文本进行统计和计算的基础上,得到词的一元和二元信息,然后采用了蚁群算法对该信息进行词的分类。实验结果表明,提出的算法有效提高了词语分类的精确度。  相似文献   

6.
陈旭毅 《情报科学》2007,25(10):1530-1533
自动文本分类方法是文本分类中非常重要的一种分类方法,本文着重从模型与方法的角度进行探讨。首先给出了一个自动文本分类的形式化定义,然后提出了自动文本分类的流程模型。接着,对流程中的四个部分进行具体讨论。自动文本分类的应用非常广泛,为了叙述方便,以商务数据为例进行讨论,并且选择实例作为典型案例对自动文本分类后的可视化进行分析和具体研究。  相似文献   

7.
一种基于TFIDF方法的中文关键词抽取算法   总被引:4,自引:1,他引:3  
本文在海量智能分词基础之上,提出了一种基于向量空间模型和TFIDF方法的中文关键词抽取算法.该算法在对文本进行自动分词后,用TFIDF方法对文献空间中的每个词进行权重计算,然后根据计算结果抽取出科技文献的关键词.通过自编软件进行的实验测试表明该算法对中文科技文献的关键词自动抽取成效显著.  相似文献   

8.
[目的/意义]科技论文是学术界传递和交流知识的重要方式。科技论文评审是对科技论文承载的知识的价值衡量,高效准确的科技论文评审分类预测可以快速判断论文价值,加速有价值的知识传播进程。[方法/过程]本文讨论开放同行评审中自动评审分类方法,利用科技论文语义信息和开放同行评审中的专家评分,分别构建基于传统机器学习和基于深度学习的科技论文文本表示及分类模型,提供自动评审分类结果。[结果/结论]实验结果表明,融合语义信息和评分信息的评审分类模型比单纯依靠评分均值进行评审判断更为有效,以评分+均值为评分信息输入、基于SCIBERT的质量评审分类模型准确率最高,达到90.17%。本文提出的自动评审分类方法具有可用性,准确率较高,可以辅助期刊编辑快速筛选有潜力的科技论文,促进科技论文智能评审的发展。  相似文献   

9.
根据软件工程的基本原理在Ubuntu操作系统环境下使用Eclipse开发工具,设计并实现了基于Hadoop系统架构的NaiveBayes算法文本分类系统。系统将大量中文文本数据集存储在分布式文件系统HDFS上,通过MapReduce并行计算模型和Ansj中文分词库对中文数据集进行分词,采用TF-IDF算法进行文本特征抽取,最后基于Spark并行计算框架和NaiveBayes算法对特征数据集进行模型训练,得到文本分类模型,将文本分类服务集成到Web页面。系统基本实现了文本的正确分类。  相似文献   

10.
黄莉  李湘东 《情报杂志》2012,31(7):177-181,176
KNN最邻近算法是文本自动分类中最基本且常用的算法,该算法中需要计算文本之间的相似度.以Jensen-Shannon散度为例,在推导和说明其基本原理的基础之上,将其用于计算文本之间的相似度;作为对比,也使用常规的余弦值方法计算文本之间的相似度,并进而使用KNN最邻近算法对文本进行分类,以探讨不同的相似度计算方法对使用KNN最邻近算法进行文本自动分类效果的影响.多种试验材料的实证研究说明,较之于余弦值方法,基于Jensen-Shannon散度计算文本相似度的自动分类会使分类正确率更高,但会花费更长的时间.  相似文献   

11.
【目的/意义】深度学习是近几年来人工智能领域的研究热点之一,了解深度学习在信息组织与检索方面的研究现状,能为信息组织与检索的深入研究提供参考和借鉴。【方法/内容】通过对国内基于深度学习的信息组织与检索方向的相关文献进行梳理,剖析深度学习相关模型、阐述深度学习在信息组织与检索中的研究热点主题,并结合深度学习技术的特点和信息组织与检索的研究内容,对深度学习在信息组织与检索方向的应用前景进行预测。【结果/结论】研究表明,当前深度学习在信息组织与检索中的研究热点主要集中在智能信息抽取、自动文本分类、情感分析和文本聚类这四个主题,预测未来深度学习在信息组织与检索方向会朝着对异构信息处理、智能信息检索、个性化信息推荐等方向发展。  相似文献   

12.
Automated legal text classification is a prominent research topic in the legal field. It lays the foundation for building an intelligent legal system. Current literature focuses on international legal texts, such as Chinese cases, European cases, and Australian cases. Little attention is paid to text classification for U.S. legal texts. Deep learning has been applied to improving text classification performance. Its effectiveness needs further exploration in domains such as the legal field. This paper investigates legal text classification with a large collection of labeled U.S. case documents through comparing the effectiveness of different text classification techniques. We propose a machine learning algorithm using domain concepts as features and random forests as the classifier. Our experiment results on 30,000 full U.S. case documents in 50 categories demonstrated that our approach significantly outperforms a deep learning system built on multiple pre-trained word embeddings and deep neural networks. In addition, applying only the top 400 domain concepts as features for building the random forests could achieve the best performance. This study provides a reference to select machine learning techniques for building high-performance text classification systems in the legal domain or other fields.  相似文献   

13.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. Generally speaking, it is one of the most important methods to organize and make use of the gigantic amounts of information that exist in unstructured textual format. Text classification is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words where the words in other words terms are cut from their finer context i.e. their location in a sentence or in a document. Only the broader context of document is used with some type of term frequency information in the vector space. Consequently, semantics of words that can be inferred from the finer context of its location in a sentence and its relations with neighboring words are usually ignored. However, meaning of words, semantic connections between words, documents and even classes are obviously important since methods that capture semantics generally reach better classification performances. Several surveys have been published to analyze diverse approaches for the traditional text classification methods. Most of these surveys cover application of different semantic term relatedness methods in text classification up to a certain degree. However, they do not specifically target semantic text classification algorithms and their advantages over the traditional text classification. In order to fill this gap, we undertake a comprehensive discussion of semantic text classification vs. traditional text classification. This survey explores the past and recent advancements in semantic text classification and attempts to organize existing approaches under five fundamental categories; domain knowledge-based approaches, corpus-based approaches, deep learning based approaches, word/character sequence enhanced approaches and linguistic enriched approaches. Furthermore, this survey highlights the advantages of semantic text classification algorithms over the traditional text classification algorithms.  相似文献   

14.
Section identification is an important task for library science, especially knowledge management. Identifying the sections of a paper would help filter noise in entity and relation extraction. In this research, we studied the paper section identification problem in the context of Chinese medical literature analysis, where the subjects, methods, and results are more valuable from a physician's perspective. Based on previous studies on English literature section identification, we experiment with the effective features to use with classic machine learning algorithms to tackle the problem. It is found that Conditional Random Fields, which consider sentence interdependency, is more effective in combining different feature sets, such as bag-of-words, part-of-speech, and headings, for Chinese literature section identification. Moreover, we find that classic machine learning algorithms are more effective than generic deep learning models for this problem. Based on these observations, we design a novel deep learning model, the Structural Bidirectional Long Short-Term Memory (SLSTM) model, which models word and sentence interdependency together with the contextual information. Experiments on a human-curated asthma literature dataset show that our approach outperforms the traditional machine learning methods and other deep learning methods and achieves close to 90% precision and recall in the task. The model shows good potential for use in other text mining tasks. The research has significant methodological and practical implications.  相似文献   

15.
借助文本分类系统软件,采用来自10个大类的中文文本数据,按照训练集与测试集2:1的比例,使用KNN和SVM分类算法,对数据集进行自动分类的实验。旨在通过具体的语料库实验,探讨文本自动分类的关键技术,分析、比较与评价实验结果,探讨文本分类中具体参数的设置和不同分类算法之优劣。  相似文献   

16.
一种基于词上下文向量的文本自动分类方法   总被引:1,自引:0,他引:1  
分析了传统文本自动分类方法的不足、词上下文向量的含义及其在自动分类中的作用,提出了一种基于词上下文向量的文本自动分类方法,该方法利用词上下文向量来生成分类器的类别中心向量和待分类文本的文本向量,使分类质量有所提高。  相似文献   

17.
With the advent of Web 2.0, there exist many online platforms that results in massive textual data production such as social networks, online blogs, magazines etc. This textual data carries information that can be used for betterment of humanity. Hence, there is a dire need to extract potential information out of it. This study aims to present an overview of approaches that can be applied to extract and later present these valuable information nuggets residing within text in brief, clear and concise way. In this regard, two major tasks of automatic keyword extraction and text summarization are being reviewed. To compile the literature, scientific articles were collected using major digital computing research repositories. In the light of acquired literature, survey study covers early approaches up to all the way till recent advancements using machine learning solutions. Survey findings conclude that annotated benchmark datasets for various textual data-generators such as twitter and social forms are not available. This scarcity of dataset has resulted into relatively less progress in many domains. Also, applications of deep learning techniques for the task of automatic keyword extraction are relatively unaddressed. Hence, impact of various deep architectures stands as an open research direction. For text summarization task, deep learning techniques are applied after advent of word vectors, and are currently governing state-of-the-art for abstractive summarization. Currently, one of the major challenges in these tasks is semantic aware evaluation of generated results.  相似文献   

18.
【目的/意义】学术论文的结构功能是学术论文篇章结构和语义内容的集中体现,目前针对学术论文结构功 能的研究主要集中在对学术论文不同层次的识别以及从学科差异性视角探讨模型算法的适用性两方面,缺少模 型、学科、层次之间内在联系的比较研究。【方法/过程】选择中医学、图书情报、计算机、环境科学、植物学等学科中 文权威刊物发表的学术论文作为实验语料集,在引入CNN、LSTM、BERT等深度学习模型的基础上,分别从句子、 段落、章节内容等层次对学术论文进行结构功能识别。【结果/结论】实验结果表明,BERT模型对于不同学科学术论 文以及学术论文的不同层次的结构功能识别效果最优,各个模型对于不同学科学术论文篇章内容层次的识别效果 均最优,中医学较之其他学科的学术论文结构功能识别效果最优。此外,利用混淆矩阵给出不同学科学术论文结 构功能误识的具体情形并分析了误识原因。【创新/局限】本文研究为学术论文结构功能识别研究提供了第一手的 实证资料。  相似文献   

19.
[目的/意义]实体语义关系分类是信息抽取重要任务之一,将非结构化文本转化成结构化知识,是构建领域本体、知识图谱、开发问答系统、信息检索系统的基础工作。[方法/过程]本文详细梳理了实体语义关系分类的发展历程,从技术方法、应用领域两方面回顾和总结了近5年国内外的最新研究成果,并指出了研究的不足及未来的研究方向。[结果/结论]热门的深度学习方法抛弃了传统浅层机器学习方法繁琐的特征工程,自动学习文本特征,实验发现,在神经网络模型中融入词法、句法特征、引入注意力机制能有效提升关系分类性能。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号