Found 19 similar articles (search time: 252 ms)
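TF-IDF is a classic method for computing feature weights in text classification, but it does not take into account how feature terms are distributed across the document collection, which leads to weak discrimination between classes. The IDF function is improved by computing the ratio of a feature term's within-class density to the overall average density the term would have if it were uniformly distributed across the samples. Experimental results show that the improved TF-IDF accounts for both the within-class distribution of feature terms and their distribution over the whole document collection, which strengthens class discrimination and effectively improves text classification.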
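
The density ratio described in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption: documents are token lists, "density" is taken to be the fraction of documents containing the term, and the function names `class_density_ratio` and `weight` as well as the `log(1 + ratio)` damping are hypothetical choices, not the authors' method.

```python
import math
from collections import Counter

def class_density_ratio(term, label, docs, labels):
    """Share of documents in `label` that contain `term`, divided by the
    share of all documents that contain it (assumed reading of the abstract)."""
    in_class = [d for d, y in zip(docs, labels) if y == label]
    if not in_class:
        return 0.0
    class_density = sum(term in d for d in in_class) / len(in_class)
    overall_density = sum(term in d for d in docs) / len(docs)
    return class_density / overall_density if overall_density else 0.0

def weight(term, doc, label, docs, labels):
    """Term frequency scaled by a damped density ratio (hypothetical weighting)."""
    tf = Counter(doc)[term] / len(doc)
    return tf * math.log(1.0 + class_density_ratio(term, label, docs, labels))

docs = [["ball", "goal"], ["ball", "team"], ["vote", "law"], ["vote", "ball"]]
labels = ["sport", "sport", "politics", "politics"]
ratio_sport = class_density_ratio("ball", "sport", docs, labels)
ratio_politics = class_density_ratio("ball", "politics", docs, labels)
```

In this toy corpus "ball" gets a ratio above 1 for the sport class and below 1 for the politics class, which is the kind of class-sensitive signal that plain IDF cannot provide.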

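TF-IDF is a commonly used method for representing document feature weights, but it does not truly reflect how much a feature term contributes to distinguishing each class. To address the problems of feature selection methods in web page classification, web page tag feature weights are added to improve the TF-IDF formula, and a relatively effective web page classification algorithm is proposed. Experimental results show that the method selects features well and can effectively improve classification accuracy.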

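Short texts are brief, carry little information, and have sparse features. To address this, a short text classification method based on feature expansion with the LDA (Latent Dirichlet Allocation) topic model is proposed. The method uses the LDA model to obtain the topic distribution of each document, adds the words under the corresponding topics to the features of the original short text as new feature terms, and finally classifies with an SVM. Experimental results show that, compared with traditional classification based on the VSM model, the LDA feature expansion method overcomes feature sparsity and improves precision, recall, and F1 in every category, which confirms that the method is feasible for short text classification.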

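To address the problems of representing documents of different sizes, selecting document features, and encoding document features during document classification, a rough-set-based corner classification neural network, Rough-CC4, is proposed. Near-synonyms are grouped into equivalence classes that are used to represent documents, which reduces the dimensionality of the document representation, resolves the accuracy problems caused by differing document sizes, and blurs the differences between near-synonyms. Encoding document features with a binary encoding method improves the accuracy of Rough-CC4 and reduces its space complexity. Rough-CC4 can be widely applied to the automatic classification of large document collections.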

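Combining the strengths of the ant colony algorithm in classification problems with the discrete nature of the content feature values of Chinese web pages, an improved web page classification method based on the ant colony algorithm is proposed. Ants in the colony carry category information and crawl to find, over successive iterations, a best-matching path, which yields the classification of the web page. The best path is obtained by computing the covering set between the test document and each category and then comparing the optimal covering sets. The category weight calculation incorporates the text-to-link ratio and tag weights, which further improves classification accuracy. Experiments show that the ant colony classification algorithm with category covering sets achieves better classification results.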

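Documents of different categories may be represented by the same vector. To address this, building on a study of common document feature weighting methods, the relative positions of feature terms within a document are analyzed and a document structure matrix, DS, is introduced. DS is combined with three common weighting algorithms to construct three new models, and classification experiments are run with all six models on a real corpus. The results show that the DS-based weighting algorithms improve text classification compared with the original weighting algorithms.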

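Traditional duplicate document detection methods extract features at the level of words or n-grams, which makes the feature set excessively large. To address this drawback, a method is proposed that uses sentence blocks as document features: each document is represented as a sequence of sentence lengths, and a suffix tree is used to match common substrings quickly. In the experiments, the method is compared with three classic methods on two standard document collections in terms of effectiveness and efficiency, and the results show that the new algorithm achieves high accuracy and efficiency.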
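
The sentence-length representation described above can be sketched as follows. This is a simplified illustration under stated assumptions: sentence length is measured in whitespace-separated words (for Chinese text a character count would be the more natural choice), and the suffix tree named in the abstract is replaced here by a plain dynamic-programming search for the longest common contiguous run, which is slower but shorter to write. The function names are hypothetical.

```python
import re

def sentence_lengths(text):
    """Represent a document as the sequence of its sentence lengths in words."""
    sentences = [s.strip() for s in re.split(r"[.!?。!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def longest_common_run(a, b):
    """Length of the longest contiguous run shared by sequences a and b
    (dynamic programming stand-in for the suffix tree in the abstract)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

doc_a = sentence_lengths("A b c. D e. F g h i.")
doc_b = sentence_lengths("X. D e. F g h i. Y z.")
shared = longest_common_run(doc_a, doc_b)
```

A long shared run relative to the document lengths would flag the pair as candidate duplicates; any threshold for that decision is left open here because the abstract does not state one.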

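Most current image set classification methods make certain prior assumptions when representing an image set. In many practical applications these assumptions may not hold, especially when the set contains large and complex data variations. In addition, learning a model under these assumptions may lose some discriminative information. To address this problem, this paper proposes an image set classification method based on feature representation and learning. For each image set, its multi-order statistics are first computed as the feature representation. For each order of statistics, a kernel matrix is computed to measure the similarity between two image sets. A distance metric is then learned with the localized multi-kernel metric learning (LMKML) method, which combines the statistics of different orders. Finally, a nearest neighbor classifier performs the classification. Experimental results on four commonly used image set databases verify the effectiveness of the proposed algorithm.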

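This work studies text classification techniques. It first introduces the document frequency method for evaluating feature terms, then proposes a feature term selection method based on evaluating how evenly a word is distributed, and finally analyzes a support vector machine text classification algorithm based on this word distribution evenness measure and demonstrates its superiority experimentally.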

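In automatic text classification, TF-IDF is the most commonly used feature weighting method. The algorithm is widely applied but has a shortcoming: it considers only the frequency of a feature term and the number of documents containing it, and ignores the influence of the term's within-class and between-class behavior on the weight. The feature term weighting method is therefore improved. To account for whether a feature term is uniformly distributed within a class and for its proportion across classes, a correction function, TF-DFI-DFO, is proposed. Experimental comparison shows that the new weighting algorithm reflects the distribution of feature terms more precisely and that, compared with the traditional TF-IDF algorithm, it yields substantial gains in recall, precision, and macro-average.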

This article investigates the effect of the number of item response categories on chi-square statistics for confirmatory factor analysis to assess whether a greater number of categories increases the likelihood of identifying spurious factors, as previous research had concluded. Four types of continuous single-factor data were simulated for a 20-item test: (a) uniform for all items, (b) symmetric unimodal for all items, (c) negatively skewed for all items, or (d) negatively skewed for 10 items and positively skewed for 10 items. For each of the 4 types of distributions, item responses were divided to yield item scores with 2, 4, or 6 categories. The results indicated that the chi-square statistic for evaluating a single-factor model was most inflated (suggesting spurious factors) for 2-category responses and became less inflated as the number of categories increased. However, the Satorra-Bentler scaled chi-square tended not to be inflated even for 2-category responses, except if the continuous item data had both negatively and positively skewed distributions.

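Current text sentiment analysis methods based on extensions of the LDA model fail to consider two issues: the same word can have different sentiment polarity in different contexts, and non-feature sentiment words affect the sentiment polarity of microblog text. To address both, a microblog sentiment analysis method based on context classification and a genetic algorithm is proposed. The method first uses the LDA model to construct a set of microblog topics and a set of topic words for each topic. It then uses labeled microblog data and applies a genetic algorithm to each topic word set in turn, iterating automatically to compute the sentiment value of each word in the set. Finally, it uses these word sentiment values to compute the sentiment polarity of the microblog text. Experimental results show that the method improves precision by 3.12% over LDA, reaches a recall of 87.32% and an F1 of 73.79%, can capture microblog sentiment information from context and from non-feature sentiment words, and effectively improves sentiment classification accuracy.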

The purpose of this study was to investigate the impact of imagery interventions on the vocabulary acquisition abilities of second grade students. A total of 15 students were randomly assigned to three different intervention conditions: Word Only, which involves the simple verbal presentation of a vocabulary word; Dual Coding, in which a picture was paired with the vocabulary word; and Image Creation, in which students were told to create a mental picture of the vocabulary word in their mind and draw it on paper. These students were taught a total of 21 vocabulary words: seven animal and habitat words, seven musical instrument terms, and seven science terms. A Latin square design was used, in which each group of students rotated through each of the interventions, being exposed to a different treatment condition for each category of words. Participants were measured on the number of words they were successfully able to acquire through the use of experimenter-designed comprehension measures. While no statistical significance was shown between the interventions across the word categories, a significant difference was found between the Image Creation and Word Only interventions within the science terms category. Students also reported that the imagery interventions facilitated the ease with which they learned the words. The findings have implications for increasing the success of classroom instruction, specifically for presenting novel vocabulary words to early elementary learners using imagery methods.

The purpose of this study was to identify the writing errors made by 310 first year university students in 13 disciplines using a checklist of five writing error categories, each with a number of sub-categories. The overall median error rate was 3S.0 errors per 1000 words. Punctuation and capitalisation was by far the most common category of error, and sentence structure, word usage, spelling and vocabulary followed in descending order of frequency. Law and Economics students exhibited the highest error rates while Geography, Mechanical Engineering, Philosophy, English, Statistics, Linguistics, Geology, History and Sociology students had error rates near the median value, and French and Psychology students made fewest errors. The major error sub-categories that best indicated difference between good and poor writers were the use of commas and some aspects of sentence construction.

Six match-to-sample picture/object selection experiments were designed to explore children's knowledge about superordinate words (e.g., "food") and how they acquire this knowledge. Three factors were found to influence the learning and extension of superordinate words in 3- to 5-year-old children (N = 230): the number of standards (one versus two), the type of standards presented (from different basic-level categories versus from the same basic-level category), and the nature of the object representations used (pictures versus objects). A different pattern of superordinate word acquisition was found between 3-year-olds and 4- and 5-year-olds. Although 4- and 5-year-olds could learn and extend novel words to superordinate categories in the presence of two picture exemplars from different categories or a single three-dimensional (3-D) exemplar, 3-year-olds could do so only in the presence of two 3-D exemplars. These findings indicate that young children's acquisition of superordinate words is influenced by multiple factors and that there is a developmental progression from multiple exemplars to single exemplars in superordinate word learning.

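This paper studies automatic CAPTCHA recognition. Isolated noise is removed with a recursive method for eliminating discrete points, and characters are segmented by scanning their boundaries. Selecting and extracting feature vectors that are stable and easy to represent is one of the core tasks of the system. The paper proposes a simple character feature extraction method: grid gray-level features are extracted and transformed with linear discriminant analysis (LDA), and a minimum distance classifier then completes character recognition. Increasing the number of training samples effectively resolves the low recognition rate for visually similar characters, and good recognition results are obtained.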
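
The grid gray-level features and minimum distance classifier mentioned above can be sketched in a few lines. This is an illustrative sketch only: it assumes images are 2-D lists of gray values with at least as many pixels as grid cells in each direction, it omits the linear discriminant analysis transform described in the abstract, and the function names are hypothetical.

```python
def grid_features(image, rows, cols):
    """Mean gray level of each cell in a rows x cols grid laid over the image."""
    h, w = len(image), len(image[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = [image[y][x]
                    for y in range(r * h // rows, (r + 1) * h // rows)
                    for x in range(c * w // cols, (c + 1) * w // cols)]
            feats.append(sum(cell) / len(cell))
    return feats

def fit_centroids(features, labels):
    """Average the training feature vectors of each class."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(f, centroids):
    """Minimum distance rule: pick the class whose centroid is nearest."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(f, centroids[y])))

right_bright = grid_features([[0, 255], [0, 255]], 1, 2)
left_bright = grid_features([[255, 0], [255, 0]], 1, 2)
centroids = fit_centroids([right_bright, left_bright], ["right", "left"])
predicted = classify([10.0, 240.0], centroids)
```

With more training samples per character, each centroid averages over more variation, which is consistent with the abstract's remark that adding training samples helps separate visually similar characters.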

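Feature selection is a data preprocessing technique for avoiding the curse of dimensionality. In multivariate time series forecasting, to find both the variables most relevant to the problem and their corresponding time lags, a supervised feature selection method based on multiple attention is proposed. The method uses a deep learning model with attention modules and a learning module. The original two-dimensional time series data is split orthogonally into two groups of one-dimensional data, which are fed into two attention generation modules for different dimensions to obtain attention weights over the feature dimension and the time dimension. The attention weights of the two dimensions are combined by a dot product into a global attention score used for feature selection; the score is applied to the original data, which is then fed into the learning module, and it is updated throughout training until convergence. Experimental results show that the proposed method matches the performance of training on the full data with fewer than 10 features and achieves the best accuracy among several existing baseline methods.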
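
How two one-dimensional attention vectors can be combined into a global score over (variable, time lag) pairs is sketched below. This is only a guess at the combination step: the raw scores are made-up inputs standing in for the outputs of trained attention modules, the softmax normalization and the outer-product combination are assumptions, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into attention weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_scores(feature_scores, time_scores):
    """Outer product of feature-wise and time-wise attention weights."""
    fw, tw = softmax(feature_scores), softmax(time_scores)
    return [[f * t for t in tw] for f in fw]

def top_pairs(feature_scores, time_scores, k):
    """Return the k (feature index, time index) pairs with the highest score."""
    grid = global_scores(feature_scores, time_scores)
    scored = [(grid[i][j], i, j)
              for i in range(len(grid)) for j in range(len(grid[0]))]
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:k]]

selected = top_pairs([2.0, 0.0, 1.0], [0.0, 3.0], 2)
```

In the method described by the abstract the scores are learned jointly with the forecasting model; this sketch covers only the selection step once the scores exist.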
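
"Minimizer" negative polarity items are quantity-denoting expressions that represent the smallest possible quantity. They co-occur with negators, and their lexical function is assigned to a category described in models of English grammar. By briefly describing the semantic behavior of English minimizer negative polarity items, and on the basis of the change in their lexical category, the paper explains the relationship between minimizer negative polarity items and negators. It shows that the combination of a minimizer negative polarity item with a negator is equivalent to the negation of a universal quantifier, represents a qualitative negation, and is one way of expressing an extreme.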

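A method for constructing dynamic word clusters is proposed that draws its material from a large-scale corpus and expresses semantic components as sets of words. The dimension feature values are extracted objectively, and the procedure is simple to carry out. Representing the range of each dimension feature value as a set of words avoids two shortcomings of traditional semantic features (sememes): they generalize too strongly and capture too little of a word's individual character. The method is well suited to building large-scale dynamic word clusters for practical applications.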

