1.
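An Improved Text Classification Algorithm Based on the Vector Space Model. Total citations: 2; self-citations: 0; citations by others: 2
This paper examines text classification techniques based on the vector space model. After standardizing the terminology of the vector space model, it discusses the shortcomings of TF-IDF document vectorization in that model and proposes an improved weighting algorithm that takes position and other factors into account. It uses KLSC, an extended latent semantic indexing algorithm, together with an auxiliary thesaurus to address polysemy and synonymy, which the model handles poorly. Finally, it offers suggestions for personalized services based on users' individual service needs.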
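The abstract does not give the improved weighting formula, so the following is only a minimal sketch of what position-aware TF-IDF weighting could look like. It assumes that "position" means title versus body, and the function name `tfidf_with_position` and the parameter `title_boost` are hypothetical names chosen for illustration, not the authors' method.

```python
import math
from collections import Counter


def tfidf_with_position(docs, title_boost=2.0):
    """Return one {term: weight} dict per document.

    docs is a list of (title_tokens, body_tokens) pairs. title_boost is a
    hypothetical parameter: each title occurrence adds this much to the
    term frequency, instead of 1 for a body occurrence.
    """
    n_docs = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for title, body in docs:
        df.update(set(title) | set(body))

    vectors = []
    for title, body in docs:
        # Position-aware term frequency (assumption: title terms count more).
        tf = Counter(body)
        for term in title:
            tf[term] += title_boost
        total = sum(tf.values())
        # Smoothed IDF, so that a term found in every document keeps weight > 0.
        vectors.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors
```

With this sketch, a term that appears in a document's title outweighs an equally rare term that appears only in the body, which is the effect a position factor is meant to produce.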
2.
3.
4.
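This paper studies the strategy for generating a query semantic tree, the mechanism for extracting the semantics of a user query, and the method for determining semantic boundaries within the query semantic tree. Candidate expansion terms are generated from the query semantic tree, and the co-occurrence degree between each candidate and all query terms is then computed over the local document set returned by the initial retrieval. This score is used to assess expansion-term quality, so that the expansion terms are strongly semantically related to the topic implied by the user's query. Experimental results show that query performance improves markedly compared with the traditional vector space model retrieval algorithm.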
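The co-occurrence scoring step can be sketched as follows. The abstract does not define the co-occurrence measure, so this sketch assumes a simple one: the fraction of local documents containing both the candidate and a query term, averaged over the query terms. The query semantic tree is not modeled; candidates are passed in as a plain list, and both function names are hypothetical.

```python
def cooccurrence_score(candidate, query_terms, local_docs):
    """Average share of local documents where the candidate co-occurs with each query term."""
    doc_sets = [set(doc) for doc in local_docs]
    if not doc_sets or not query_terms:
        return 0.0
    total = 0.0
    for q in query_terms:
        both = sum(1 for d in doc_sets if candidate in d and q in d)
        total += both / len(doc_sets)
    return total / len(query_terms)


def rank_candidates(candidates, query_terms, local_docs, top_k=3):
    """Return up to top_k candidates with a nonzero score, best first."""
    scored = [(cooccurrence_score(c, query_terms, local_docs), c) for c in candidates]
    # Sort by descending score; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [c for score, c in scored[:top_k] if score > 0]
```

Scoring against every query term, rather than against any single one, favors candidates tied to the query's overall topic.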
5.
6.
7.
8.
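Text classification is an active research topic in Chinese information processing, and semantics is the basis on which a text is assigned to a category. This paper proposes a semantics-guided feature selection method that, during feature selection, assigns extra weight to typical category-discriminating words so that they play a larger role in classification. Experiments with support vector machines show that feature selection backed by a semantic knowledge base improves text classification performance.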
9.
10.
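A vector space model is built for speech documents after recognition, and a speech document classification algorithm based on locality-sensitive hashing is proposed for the resulting high-dimensional sparse matrix. The algorithm classifies directly on the high-dimensional sparse matrix, with no dimensionality reduction required. In addition, stable distributions are incorporated when constructing the locality-sensitive hash functions. Experiments show that the locality-sensitive hashing algorithm classifies speech documents reasonably and effectively while achieving low time complexity.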
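The abstract gives no implementation details, so the sketch below only illustrates the general idea under stated assumptions. It uses the standard hash family built on a 2-stable (Gaussian) distribution, h(v) = floor((a · v + b) / w), and classifies by majority vote among training labels that share a bucket with the query. The class name, the voting rule, and all parameter values are assumptions for illustration, not the paper's algorithm.

```python
import math
import random
from collections import Counter, defaultdict


class StableLSHClassifier:
    """Bucket-voting classifier over sparse vectors stored as {index: value} dicts."""

    def __init__(self, dim, n_tables=8, n_hashes=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.tables = []
        for _ in range(n_tables):
            # Each hash is h(v) = floor((a . v + b) / width), with a drawn from
            # a Gaussian (a 2-stable distribution) and b uniform in [0, width).
            funcs = [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
                     for _ in range(n_hashes)]
            self.tables.append((funcs, defaultdict(list)))

    def _key(self, funcs, vec):
        # Only the nonzero entries of the sparse vector contribute to a . v,
        # so no dense representation or dimensionality reduction is needed.
        return tuple(
            math.floor((sum(a[i] * x for i, x in vec.items()) + b) / self.width)
            for a, b in funcs
        )

    def fit(self, vectors, labels):
        for vec, label in zip(vectors, labels):
            for funcs, buckets in self.tables:
                buckets[self._key(funcs, vec)].append(label)
        return self

    def predict(self, vec):
        # Collect the labels of all training vectors that collide with the query.
        votes = Counter()
        for funcs, buckets in self.tables:
            votes.update(buckets.get(self._key(funcs, vec), []))
        return votes.most_common(1)[0][0] if votes else None
```

A typical call would be `StableLSHClassifier(dim=8).fit(vectors, labels).predict(query)`. The cost of hashing a vector scales with its number of nonzero entries, which is why this style of method suits high-dimensional sparse data.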
11.
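Digital documents such as books and journal articles offer few text features, so their feature vectors express semantics imprecisely and classification suffers. To address this, this paper proposes a digital document classification method based on feature semantic expansion. The method first uses TF-IDF to obtain core feature words that represent the document text well and have high TF-IDF values. It then expands the core feature word set with semantic concepts drawn from the Hownet semantic dictionary and the open knowledge base Wikipedia, building a low-dimensional, semantically rich concept vector space. Finally, it builds classifiers with several algorithms, including MaxEnt and SVM, to classify digital documents automatically. Experimental results show that, compared with traditional short-text classification methods based on feature selection, the method effectively expands the semantics of short-text features and improves classification performance on digital documents.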
12.
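[Purpose/Significance] To address the main problems and weak points in constructing technology-effect maps, this paper proposes a method for building patent technology-effect maps based on SAO structures and word vectors. [Method/Process] A Python program extracts SAO structures from patent abstracts and identifies technology terms and effect terms in them. Combining a domain dictionary with a patent-domain corpus, Word2Vec and WordNet are used to compute semantic similarity between terms. A topic clustering algorithm based on network relations labels topics automatically, and the technology-effect matrix is built from co-occurrence relations in the SAO structures. [Results/Conclusions] The method constructs technology-effect maps automatically from SAO structures and word vectors, makes the resulting technology-effect topics more reasonable and patent classification labels more accurate, and offers a new approach to automated construction of technology-effect maps.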
13.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. It is one of the most important methods for organizing and making use of the gigantic amounts of information that exist in unstructured textual format. Text classification is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words in which the words (that is, the terms) are cut off from their finer context, i.e., their location in a sentence or in a document. Only the broader context of the document is used, with some type of term frequency information in the vector space. Consequently, the semantics of words that can be inferred from the finer context of their location in a sentence and their relations with neighboring words are usually ignored. However, the meanings of words and the semantic connections between words, documents, and even classes are important, since methods that capture semantics generally achieve better classification performance. Several surveys have been published to analyze diverse approaches to traditional text classification. Most of these surveys cover the application of different semantic term relatedness methods in text classification to a certain degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. To fill this gap, we undertake a comprehensive discussion of semantic text classification vs. traditional text classification. This survey explores past and recent advancements in semantic text classification and attempts to organize existing approaches under five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning based approaches, word/character sequence enhanced approaches, and linguistic enriched approaches. Furthermore, this survey highlights the advantages of semantic text classification algorithms over traditional text classification algorithms.
14.
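Word similarity computation is widely used in natural language processing areas such as information retrieval, word sense disambiguation, and machine translation. Existing word similarity algorithms fall into two main classes: statistical methods and methods based on semantic resources. The former compute similarity from context information that co-occurs with a word in a large corpus, while the latter use manually built semantic dictionaries or semantic networks. This paper compares the two classes of algorithms, focusing on those based on Web corpora and on Wikipedia, and summarizes their characteristics and shortcomings. It concludes that, under the influence of information technology, word similarity algorithms based on Wikipedia and on hybrid techniques, as well as similarity computation driven by linked data, are promising directions.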
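As a minimal illustration of the statistical class of methods described here, the sketch below builds a context co-occurrence vector for each word from a tokenized corpus and compares two words by cosine similarity. The window size, the raw-count weighting, and the function names are assumptions made for the example; the surveyed algorithms may weight contexts differently.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Words that share contexts score closer to 1, and words with no shared context score 0. A resource-based method would replace `context_vectors` with lookups in a dictionary or semantic network.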
15.
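[Purpose/Significance] To classify the large volume of text data generated on the Internet effectively, improve text processing efficiency, and provide advice for enterprise users' decision-making. [Method/Process] Traditional word-vector feature embedding cannot capture polysemy and suffers from sparse features and difficult feature extraction. To address these problems, this paper proposes a sentence-feature-based multi-channel hierarchical feature text classification model (SFM-DCNN). First, the model builds Bert sentence vectors, upgrading feature embedding from traditional word-level embedding to sentence-level embedding and effectively capturing semantic features such as polysemy, word position, and relations between words. Second, it builds a multi-channel deep convolutional model that extracts hidden features from the sentence features at multiple levels, yielding features closer to the original semantics. [Results/Conclusions] The model was validated on three different datasets and compared with related classification methods. SFM-DCNN achieved higher accuracy than the other models, which suggests that the model is of some reference value. [Innovation/Limitations] To address polysemy and feature sparsity in text classification, the work makes novel use of Bert to extract global semantic information and combines it with multi-channel deep convolution to obtain local hierarchical features. Owing to limits on time and equipment, however, the model was not further pre-trained, and the experimental datasets are not sufficiently extensive.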
16.
Information Processing & Management, 2023, 60(2): 103192
The replies of people seeking support in online mental health communities can be analyzed to discover if they feel better after receiving support; feeling better indicates a cognitive change. Most research uses key phrase matching and word frequency statistics to identify psychological cognitive change, methods that result in omissions and inaccuracy. This study constructs an intelligent method for identifying psychological cognitive change based on natural language processing technology. It incorporates information related to emotions that appears in reply text to help identify whether psychological cognitive change has occurred. The model first encodes the emotion information based on rule matching and manual annotation, then adds the encoded emotion lexicon and a cognitive change lexicon to the training of word2vec high-dimensional semantic word vectors, converts the annotated cognitive change recognition text into a vector matrix using the trained model, and trains on the annotated text using TextCNN. To compare the results with those of the traditional methods (key phrase matching and sentiment word frequency statistics), this study uses a semi-automated approach to construct a lexicon of psychological cognitive change, as well as a keyword lexicon without cognitive change, based on word vectors and similarity. We compare the performance of the classifier before and after the fusion of the graphical emotion information, compare the LSTM and Transformer as baselines, and compare traditional word frequency statistics methods. The experimental results show that our proposed classification model performs better than the others; it achieves 84.38% precision, 84.09% recall, and an 84.17% F1 score. Our work bears methodological implications for online mental health platforms.
17.
Information Processing & Management, 2022, 59(3): 102925
We propose bidirectional imparting or BiImp, a generalized method for aligning embedding dimensions with concepts during the embedding learning phase. While preserving the semantic structure of the embedding space, BiImp makes dimensions interpretable, which plays a critical role in deciphering the black-box behavior of word embeddings. BiImp separately utilizes both directions of a vector space dimension: each direction can be assigned to a different concept. This increases the number of concepts that can be represented in the embedding space. Our experimental results demonstrate the interpretability of BiImp embeddings without making compromises on the semantic task performance. We also use BiImp to reduce gender bias in word embeddings by encoding gender-opposite concepts (e.g., male–female) in a single embedding dimension. These results highlight the potential of BiImp in reducing biases and stereotypes present in word embeddings. Furthermore, task or domain-specific interpretable word embeddings can be obtained by adjusting the corresponding word groups in embedding dimensions according to task or domain. As a result, BiImp offers wide latitude in studying word embeddings without any further effort.