共查询到17条相似文献,搜索用时 100 毫秒
1.
文本分类中一种基于密度的KNN改进方法 总被引:2,自引:1,他引:1
特征降维与分类算法的性能是文本自动分类的两个主要问题.KNN算法以其简单、有效、非参数特点常用于文本分类,但是训练文本分布的不均匀对KNN的分类效果产生负面影响,而在实际应用中训练文本分布不均是常见现象.本文针对这种分类环境,首先提出了一种改进的tf-idf赋权方法用于特征降维,在此基础上进一步提出了一种基于密度的改进KNN方法用于文本分类, 使处于样本点分布较密集区域的样本点之间的距离增大.随后的文本分类试验表明,本文提出的方法基于密度的KNN方法具有较好的文本分类效果. 相似文献
2.
3.
4.
5.
6.
7.
文本分类是信息检索领域的重要应用之一,由于采用统一特征向量形式表示所有文档,导致针对每个文档的特征向量具有高维性和稀疏性,从而影响文档分类的性能和精度。为有效提升文本特征选择的准确度,本文首先提出基于信息增益的特征选择函数改进方法,提高特征选择的精度。KNN(K-Nearest Neighbor)算法是文本分类中广泛应用的算法,本文针对经典KNN计算量大、类别标定函数精度不高的问题,提出基于训练集裁剪的加权KNN算法。该算法通过对训练集进行裁剪提升了分类算法的计算效率,通过模糊集的隶属度函数提升分类算法的准确性。在公开数据上的实验结果及实验分析证明了算法的有效性。 相似文献
8.
基于潜在语义分析和改进的HS-SVM的文本分类模型研究 总被引:1,自引:0,他引:1
9.
本文依据KNN分类算法和反馈学习的思想,在分析中文文本分类过程的基础上,给出了基于反馈学习的中文文本分类模型和基于KNN的中文文本分类反馈学习过程。通过实验研究了反馈学习对中文文本分类模型性能的影响。结果表明,反馈学习是实时变化信息的一种有效的学习方法,它对训练不充分的文本分类器具有很大的改善作用。 相似文献
10.
基于Ontology的Web文本分类法 总被引:2,自引:0,他引:2
传统方法处理文本分类时都需要进行文本训练,并且在文本表示时需要抽取特征项。搜集训练文本的过程需要费时费力的人工参与,而且中文信息的特征项抽取工作难度较大。为了解决这些问题,本文探讨了一种新的文本分类法———基于Ontology的Web文本分类法。该方法首先通过“知网”建立一个Ontology,然后根据分类体系建立每个类的Ontology,最后根据每个类的Ontology对文本进行分类。试验表明这种分类法与KNN分类法在准确率上相当,但比KNN方法稳定,在召回率上优于KNN方法。 相似文献
11.
12.
Hierarchical Text Categorization Using Neural Networks 总被引:8,自引:1,他引:7
This paper presents the design and evaluation of a text categorization method based on the Hierarchical Mixture of Experts model. This model uses a divide and conquer principle to define smaller categorization problems based on a predefined hierarchical structure. The final classifier is a hierarchical array of neural networks. The method is evaluated using the UMLS Metathesaurus as the underlying hierarchical structure, and the OHSUMED test set of MEDLINE records. Comparisons with an optimized version of the traditional Rocchio's algorithm adapted for text categorization, as well as flat neural network classifiers are provided. The results show that the use of the hierarchical structure improves text categorization performance with respect to an equivalent flat model. The optimized Rocchio algorithm achieves a performance comparable with that of the hierarchical neural networks. 相似文献
13.
Although considerable research has been conducted in the field of hierarchical text categorization, little has been done on automatically collecting labeled corpus for building hierarchical taxonomies. In this paper, we propose an automatic method of collecting training samples to build hierarchical taxonomies. In our method, the category node is initially defined by some keywords, the web search engine is then used to construct a small set of labeled documents, and a topic tracking algorithm with keyword-based content normalization is applied to enlarge the training corpus on the basis of the seed documents. We also design a method to check the consistency of the collected corpus. The above steps produce a flat category structure which contains all the categories for building the hierarchical taxonomy. Next, linear discriminant projection approach is utilized to construct more meaningful intermediate levels of hierarchies in the generated flat set of categories. Experimental results show that the training corpus is good enough for statistical classification methods. 相似文献
14.
基于字频向量的中文文本自动分类系统 总被引:15,自引:3,他引:12
本文提出了一种根据汉字统计特性和基于实例映射的中文文本自动分类方法。该方法采用汉字字频向量作为文本的表示方法。它的显著特点是引入线性最小二乘方估计技术建立文本分类器模型,通过对训练集语料的手工分类标引以及对文本和类别间的相关性判定的学习,实现了基于全局最小错误率的汉字一类别两个向量空间的映射函数,并用该函数对测试文本进行分类。 相似文献
15.
16.
为减少一词多义现象及训练样本的类偏斜问题对分类性能的影响,提出一种基于语义网络社团划分的中文文本分类算法。通过维基百科知识库对文本特征词进行消歧,构建出训练语义复杂网络以表示文本间的语义关系,再次结合节点特性采用K-means算法对训练集进行社团划分以改善类偏斜问题,进而查找待分类文本的最相近社团并以此为基础进行文本分类。实验结果表明,本文所提出的中文文本分类算法是可行的,且具有较好的分类效果。 相似文献
17.
Rafael Guzmán-Cabrera Manuel Montes-y-Gómez Paolo Rosso Luis Villaseñor-Pineda 《Information Retrieval》2009,12(3):400-415
Most current methods for automatic text categorization are based on supervised learning techniques and, therefore, they face the problem of requiring a great number of training instances to construct an accurate classifier. In order to tackle this problem, this paper proposes a new semi-supervised method for text categorization, which considers the automatic extraction of unlabeled examples from the Web and the application of an enriched self-training approach for the construction of the classifier. This method, even though language independent, is more pertinent for scenarios where large sets of labeled resources do not exist. That, for instance, could be the case of several application domains in different non-English languages such as Spanish. The experimental evaluation of the method was carried out in three different tasks and in two different languages. The achieved results demonstrate the applicability and usefulness of the proposed method. 相似文献