Similar literature
 Found 20 similar articles (search time: 187 ms)
1.
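Texts are represented with the vector space model (VSM), latent semantic indexing (LSI) is used for feature reconstruction and dimensionality reduction, and a BP neural network text classifier is built. A naive Bayes classifier is then combined with it to form a hybrid text classifier. Experiments show that the hybrid classifier improves both classification accuracy and classification speed.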

2.
《科技风》2020,(14)
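With the arrival of the networked information age and the steady growth of news data, classifying news has become increasingly difficult, which raises the question of whether news can be classified effectively and automatically. Among text classification algorithms, naive Bayes is one that performs well. This study crawls a small amount of news data from a news website with a web crawler, applies simple preprocessing and Chinese word segmentation, and builds a naive Bayes classifier to classify the news.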
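
The pipeline in this abstract ends in a naive Bayes classifier over segmented text. A minimal sketch of that last step follows, assuming a multinomial model with add-one (Laplace) smoothing and documents that have already been tokenized; the function names `train_nb` and `predict_nb` and the toy data are illustrative and are not taken from the paper, which does not state its exact model or smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Collect counts from (tokens, label) pairs. Hypothetical helper, not the paper's code."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # token counts per class
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # assumption: tokens never seen in training are ignored
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, already-segmented training data (illustrative only).
train = [
    (["team", "wins", "match"], "sports"),
    (["player", "scores", "goal"], "sports"),
    (["stock", "market", "falls"], "finance"),
    (["bank", "raises", "rates"], "finance"),
]
model = train_nb(train)
print(predict_nb(model, ["team", "scores"]))
```

Working in log space avoids underflow when many small probabilities are multiplied, which matters once documents are longer than a few tokens.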

3.
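Building on an introduction to and analysis of Bayesian theory, this paper presents the Bayesian algorithm and the naive Bayes classifier and describes their application to spam filtering.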

4.
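[Purpose] To improve the efficiency of manual classification and reduce the error rate caused by classification staff's subjective knowledge structure and by objective environmental factors, this study builds a patent text classification model based on hierarchical classifiers. [Method] 4,000 Chinese invention patents were randomly sampled from four sections (A, D, E and H), and their titles and abstracts were used as experimental data. After text preprocessing and feature representation, four machine learning models (KNN, support vector machine, Rocchio and naive Bayes) were used to explore the best classification model, and the best combination of models, at the IPC section, class, subclass and main group levels. [Conclusion] The experiments show that the hierarchical structure effectively improves on flat classification models, and that hierarchical combinations of models achieve higher accuracy than a single model used at every level. The best models at each level are: support vector machine (section); Rocchio + support vector machine (class); Rocchio + naive Bayes + support vector machine (subclass); KNN + naive Bayes + support vector machine + support vector machine (main group).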

5.
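A Bayesian classifier can be reduced to estimating the prior probabilities of terms; current classifiers generally compute them from a term's document count and term frequency. This paper proposes a weight-based naive Bayes classifier that not only improves how the prior probabilities of terms in a text are computed but also lets term weights influence the computation. The classifier is designed using the TFIDF model and an improved variant of it. Experiments show that it performs considerably better than the traditional algorithm.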
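
The abstract relies on TFIDF term weights but does not give the formula or the improved variant it uses. As a point of reference only, the sketch below computes the textbook weight, relative term frequency times log(N / df); the function name `tfidf` is illustrative, and documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter

def tfidf(docs):
    """Textbook TF-IDF per document: (count / doc length) * log(N / df).

    Assumes every document is a non-empty list of tokens. This is the
    standard formula, not the improved variant the paper refers to.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

corpus = [["bayes", "text"], ["bayes", "spam"]]
for w in tfidf(corpus):
    print(w)
```

A term that appears in every document gets weight zero under this formula, which is the usual motivation for using such weights to damp uninformative terms in a classifier.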

6.
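Taking book reviews on the Douban website as the object of analysis, this article uses the Chinese Affective Lexicon Ontology to identify and weight sentiment elements and combines it with the naive Bayes algorithm to classify the sentiment of user reviews automatically, then examines how well the algorithm classifies. The study finds that naive Bayes can classify review sentiment with fairly good results, but that rule matching and manual proofreading are still needed to improve them.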

7.
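In the big data era the volume of Chinese text has grown markedly, and classifying large text collections effectively is an important problem. The traditional naive Bayes algorithm assumes that all feature attributes contribute equally to the classification decision, and it also performs poorly on large data sets. To address these problems, this paper proposes a parallelized naive Bayes text classification algorithm with TFIDFCF feature weighting, implemented on the MapReduce parallel framework. In text classification experiments on the THUCNews news corpus, the TFIDFCF-weighted naive Bayes algorithm running on the parallel framework improved both training speed and prediction accuracy.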

8.
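Classifiers based on Bayesian networks handle uncertainty well and therefore have distinctive advantages in CRM customer modeling. Building on an analysis of the strengths and weaknesses of the naive Bayes classifier and the general Bayesian classifier, this paper introduces the augmented BN classifier and the Bayesian multi-net classifier, describes the latter's algorithm in detail, and applies it to CRM customer modeling in a real telecom setting with good results.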

9.
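Classifying public opinion information by topic promptly and accurately makes it possible to follow changes in public opinion in real time and lays a foundation for anticipating trends and guiding public discussion. This paper proposes a topic classification method for online public opinion based on an ontology and weighted naive Bayes, using the ontology to bring domain knowledge and domain text features into the classification process. Applied to public opinion topic classification in the animal health domain, the method reaches a precision of 0.9402 and a Macro_F1 of 0.9339. Comparative experiments against naive Bayes (NB) and THUCTC show that the proposed method is effective and feasible, although the completeness of the domain ontology's concepts and relations affects classification efficiency.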

10.
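This paper proposes a machine-learning-based architecture for the automatic classification of Web text and sets out the main technical problems in automatically classifying Chinese Web documents. It describes the overall design of a Chinese Web document classification tool, whose main functional modules are a web spider, Chinese word segmentation, feature selection and a Bayesian classifier. Finally, experiments on the classifier are reported.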

11.
Feature selection, which can reduce the dimensionality of the vector space without sacrificing the performance of the classifier, is widely used in text categorization. In this paper, we propose a new feature selection algorithm, named CMFS, which comprehensively measures the significance of a term both inter-category and intra-category. We evaluated CMFS on three benchmark document collections, 20-Newsgroups, Reuters-21578 and WebKB, using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVMs). The experimental results, comparing CMFS with six well-known feature selection algorithms, show that the proposed method CMFS is significantly superior to Information Gain (IG), Chi statistic (CHI), Document Frequency (DF), Orthogonal Centroid Feature Selection (OCFS) and DIA association factor (DIA) when the Naïve Bayes classifier is used and significantly outperforms IG, DF, OCFS and DIA when Support Vector Machines are used.
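
The abstract does not define CMFS itself, so it is not reproduced here. For orientation, the sketch below computes one of the baselines it is compared against, the Chi statistic (CHI) of a term with respect to a category, from a 2x2 table of document counts. The function name `chi_square` and its parameter names are illustrative.

```python
def chi_square(a, b, c, d):
    """Chi statistic for one (term, category) pair from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # assumption: treat a degenerate table as carrying no signal
    return n * (a * d - c * b) ** 2 / denom

# A term that occurs in every in-category document and nowhere else.
print(chi_square(10, 0, 0, 10))
# A term spread evenly across categories.
print(chi_square(5, 5, 5, 5))
```

Terms are then typically ranked by this score per category and the top-scoring ones kept as features.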

12.
Practical classification problems often involve some kind of trade-off between the decisions a classifier may take. Indeed, it may be the case that decisions are not equally good or costly; therefore, it is important for the classifier to be able to predict the risk associated with each classification decision. Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification. The objective is to quantify the trade-off between various classification decisions using probability and the costs that accompany such decisions. Within this framework, a loss function measures the rates of the costs and the risk in taking one decision over another.
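
The trade-off described here is usually formalized as choosing the action with the smallest conditional risk, the loss of each action averaged over the posterior class probabilities. The sketch below illustrates that rule; the function name `min_risk_decision`, the spam/ham labels and the loss values are all made up for illustration and do not come from the paper.

```python
def min_risk_decision(posteriors, loss):
    """Pick the action with the lowest expected loss.

    posteriors: dict mapping class -> P(class | x)
    loss: dict mapping action -> dict mapping class -> cost of that action
    """
    risks = {
        action: sum(costs[cls] * p for cls, p in posteriors.items())
        for action, costs in loss.items()
    }
    best = min(risks, key=risks.get)
    return best, risks

# Hypothetical numbers: wrongly flagging a legitimate mail costs ten times
# more than letting a spam mail through.
posteriors = {"spam": 0.6, "ham": 0.4}
loss = {
    "flag":    {"spam": 0.0, "ham": 10.0},
    "deliver": {"spam": 1.0, "ham": 0.0},
}
action, risks = min_risk_decision(posteriors, loss)
print(action, risks)
```

With these numbers the lowest-risk action is to deliver the message even though spam is the more probable class, which is the kind of asymmetry a plain most-probable-class rule cannot express.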

13.
萧莉明  于宽  蔡珣 《现代情报》2007,27(4):146-147,150
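This paper designs an effective automatic classification system for Chinese journals based on a Bayesian classifier. First, the system uses the journal title as the only indexed content and applies automatic word segmentation to turn the titles into the set of samples to be classified. Second, using a classification base built by training on the library's sample data, it applies a Bayesian classifier to classify Chinese journals automatically. Experiments show that the classifier is both efficient and accurate on Chinese journals.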

14.
徐波  孙李哲 《大众科技》2014,(11):18-20
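To identify criminal gangs from their communications, this article uses data from the Shenzhen Cup summer camp mathematical modeling contest and validates the identification results with two methods, a Bayesian network classifier and a rough set model. The results show that the groups identified by the Bayesian network classifier agree with those obtained from the rough set model. The approach can therefore serve as a reference for public security departments investigating criminal gangs and help raise case clearance rates.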

15.
This paper presents a classifier for text data samples consisting of main text and additional components, such as Web pages and technical papers. We focus on multiclass and single-labeled text classification problems and design the classifier based on a hybrid composed of probabilistic generative and discriminative approaches. Our formulation considers individual component generative models and constructs the classifier by combining these trained models based on the maximum entropy principle. We use naive Bayes models as the component generative models for the main text and additional components such as titles, links, and authors, so that we can apply our formulation to document and Web page classification problems. Our experimental results for four test collections confirmed that our hybrid approach effectively combined main text and additional components and thus improved classification performance.

16.
This paper examines the feasibility of discovering “title-like” terms using a decision tree classifier from the document. The premise of discovering title-like terms is that title terms and title-like terms should behave similarly in the document. This behavior is characterized by a set of distributional and linguistic features. By training the classifier to observe the behavior of title terms in a balanced manner using 25,000 titles in Reuters articles, other terms with similar behavior would also be discovered. Based on 5000 unseen titles, the recall of title terms was 83%, similar to the manual identification of title terms. The precision of finding title terms is low (i.e., 32%) because some non-title but title-like terms should have been identified as well. Seven subjects were asked to rate, on a scale of between 1 and 5, whether the identified term is a topical/thematic/title term. If a rating of 2.5 is used to determine whether a term is judged to be a “title-like” term, then the mean precision is increased to 58%, or the headline/title is expanded with twice the average number of terms. Since this precision (i.e., 58%) is similar to the mean precision of manually identified title terms averaged across different subjects, we conclude that the discovery of title-like terms using classifiers is a promising approach.

17.
Textual entailment is a task for which the application of supervised learning mechanisms has received considerable attention as driven by successive Recognizing Data Entailment data challenges. We developed a linguistic analysis framework in which a number of similarity/dissimilarity features are extracted for each entailment pair in a data set and various classifier methods are evaluated based on the instance data derived from the extracted features. The focus of the paper is to compare and contrast the performance of single and ensemble based learning algorithms for a number of data sets. We showed that there is some benefit to the use of ensemble approaches but, based on the extracted features, Naïve Bayes proved to be the strongest learning mechanism. Only one ensemble approach demonstrated a slight improvement over the technique of Naïve Bayes.

18.
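The variables in the mechanisms that influence organizational innovation are uncertain and dynamic, which makes Bayesian network analysis worth attempting. Following the basic principles of Bayesian networks, this paper builds a Bayesian-network model of the influence mechanism of organizational innovation and discusses how to simplify the computation of complex Bayesian networks. A worked application shows that the method overcomes the restriction of traditional methods to linear, static analysis and reflects the dynamic relationships among the variables fairly accurately.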

19.
This research has investigated the feasibility of using a distance measure, called the Bayesian distance, for automatic sequential document classification. It has been shown that by observing the variation of this distance measure as keywords are extracted sequentially from a document, the occurrence of noisy keywords may be detected. This property of the distance measure has been utilized to design a sequential classification algorithm which works in two phases. In the first phase keywords extracted from a document are partitioned into two groups: the good keyword group and the noisy keyword group. In the second phase these two groups of keywords are analyzed separately to assign primary and secondary classes to a document. The algorithm has been applied to several data bases of documents and very encouraging results have been obtained.

20.
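In automatic text classification there are currently two ways of estimating probabilities, one based on term frequency and one based on document frequency, and whether the chosen estimate is appropriate directly affects the quality of feature extraction and the accuracy of classification. This paper implements a Chinese text classifier with the K-nearest-neighbor algorithm and runs training and classification experiments on balanced and unbalanced Chinese training corpora. The results indicate that term-frequency-based estimation suits unbalanced corpora and document-frequency-based estimation suits balanced corpora; chosen this way, the estimates extract high-quality text features and thereby improve classification accuracy.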
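
The two estimation methods contrasted in this abstract can be illustrated on a toy class. The sketch below assumes the simplest unsmoothed forms, occurrences of a term over all tokens versus documents containing the term over all documents; the paper does not spell out its exact estimators, and the names `tf_estimate` and `df_estimate` are illustrative.

```python
def tf_estimate(class_docs, term):
    """Term-frequency estimate: occurrences of the term / all tokens in the class."""
    total_tokens = sum(len(doc) for doc in class_docs)
    return sum(doc.count(term) for doc in class_docs) / total_tokens

def df_estimate(class_docs, term):
    """Document-frequency estimate: documents containing the term / all documents."""
    return sum(1 for doc in class_docs if term in doc) / len(class_docs)

# Two tokenized documents from one class (illustrative only).
class_docs = [["a", "a", "a", "b"], ["b", "c", "c", "c"]]
for term in ("a", "b"):
    print(term, tf_estimate(class_docs, term), df_estimate(class_docs, term))
```

The term-frequency estimate rewards a term that is repeated heavily in a single document, while the document-frequency estimate rewards a term that appears across many documents, so the two can rank the same terms differently.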
