Similar Literature
Found 20 similar documents (search time: 275 ms)
1.
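Research on a Text Structure Analysis Method Based on Latent Semantic Indexing   Total citations: 4, self-citations: 0, citations by others: 4
Text structure analysis is an important part of text processing and can effectively improve the precision of text retrieval, text filtering, and text summarization. After describing the physical and logical structure of text and the background of text analysis, the paper introduces latent semantic indexing into text structure analysis and proposes a hierarchical analysis method based on latent semantic indexing. The method preserves the ordering and cohesion of the hierarchical partition, is practical to apply, and is easy to interpret. Applications to text retrieval, text filtering, and text summarization are also given.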

2.
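A Text Feature Extraction Method Based on Latent Semantic Indexing and Genetic Algorithms   Total citations: 9, self-citations: 0, citations by others: 9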
郝占刚  王正欧 《情报科学》2006,24(1):104-107
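This paper uses latent semantic indexing (LSI) and a genetic algorithm (GA) for text feature extraction. LSI captures semantic relationships in the vector space model (VSM), and singular value decomposition (SVD) effectively reduces the dimensionality of the vector space. However, the text features still number in the hundreds of dimensions after this reduction, so the paper applies a genetic algorithm to reduce the dimensionality further. Experimental results show that combining the two methods greatly reduces the dimensionality of the text vector space and improves classification accuracy.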
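
The SVD step in this abstract can be written out as follows. This is the standard LSI formulation, not something taken from the paper itself; the symbols $A$, $k$, $d$, and $q$ are notation introduced here for illustration.

```latex
\begin{aligned}
A &= U \Sigma V^{\top}, & A &\in \mathbb{R}^{m \times n} \\
A_k &= U_k \Sigma_k V_k^{\top}, & k &\ll \min(m, n) \\
\hat{d} &= \Sigma_k^{-1} U_k^{\top} d, & \hat{q} &= \Sigma_k^{-1} U_k^{\top} q
\end{aligned}
```

Here $A$ is assumed to be the term-document matrix ($m$ terms, $n$ documents), $U_k$ and $V_k$ hold the first $k$ left and right singular vectors, and $\Sigma_k$ holds the $k$ largest singular values. A document vector $d$ or query vector $q$ is folded into the $k$-dimensional space by the last line. The genetic algorithm stage described in the abstract would then operate on these $k$-dimensional vectors.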

3.
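An Improved Text Classification Algorithm Based on the Vector Space Model   Total citations: 2, self-citations: 0, citations by others: 2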
牛玲 《情报杂志》2006,25(6):63-64,67
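The paper examines text classification based on the vector space model. After standardizing the terminology of the vector space model, it discusses the shortcomings of TF-IDF document vectorization, proposes an improved weighting algorithm based on factors such as term position, and uses an extended latent semantic indexing algorithm, KLSC, together with an auxiliary subject thesaurus to address polysemy and synonymy, which the model handles poorly. In light of users' personalized service needs, it also offers suggestions for personalized services.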
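
For reference, the baseline TF-IDF weighting that this abstract criticizes can be sketched as below. This is a minimal illustration that assumes raw term frequency and `idf = log(N / df)`; it does not reproduce the paper's position-based weighting, and the function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes raw term frequency and idf = log(N / df); other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical, already-tokenized documents.
docs = [
    ["text", "classification", "vector"],
    ["vector", "space", "model", "vector"],
    ["text", "retrieval"],
]
weights = tf_idf(docs)
```

Every occurrence of a term receives the same weight regardless of where it appears in the document, which is the kind of limitation a position-aware scheme would try to address.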

4.
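Latent semantic indexing is an unsupervised learning method that can automatically learn data for morphological analysis from raw, unprocessed text. It improves learning by computing the semantic relatedness between words. The paper first briefly describes the concepts of morphological analysis and morphology learning, along with earlier morphology-learning methods. It then describes a morphology-learning method based on this theory, followed by several improvements to the method and an evaluation, and closes with conclusions and an outlook.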

5.
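Based on the concept of Chinese information extraction, the paper proposes a technical scheme for automatically plotting earthquake emergency text information on maps using semantic templates. Taking into account the characteristics of earthquake emergency text and of the Chinese language, the scheme applies word segmentation, part-of-speech tagging, and semantic analysis to the emergency text, extracts earthquake and disaster information according to predefined semantic templates, converts it into structured information, links it to spatial locations, and plots an earthquake emergency situation map. The scheme replaces traditional manual plotting with automatic plotting and improves the efficiency of producing earthquake emergency situation maps.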

6.
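Text classification is one of the key techniques for processing and organizing large volumes of text data. To classify text more effectively, this paper proposes a graph-model-based text feature extraction method. The method uses class information to construct a weighted adjacency graph and its complement graph on the training set, so that the projections of samples from the same class are as close as possible and those of samples from different classes are as far apart as possible. The method captures the global structure of the text space while preserving local structure. Finally, a K-nearest-neighbor classifier is trained and tested on the standard 20Newsgroups dataset and compared with a text classification method based on latent semantic indexing. The experimental results show that the proposed method effectively improves text classification performance.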
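
The final classification stage mentioned in this abstract can be sketched as a K-nearest-neighbor vote over projected vectors. This is only an illustration: it assumes cosine similarity and majority voting, it does not implement the paper's graph-based projection, and the vectors, labels, and function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(train_vectors, train_labels, query, k=3):
    """Return the majority label among the k training vectors most similar to query."""
    ranked = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(pair[0], query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical low-dimensional projections of four training documents.
train_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
train_labels = ["sports", "sports", "politics", "politics"]
predicted = knn_predict(train_vectors, train_labels, [0.8, 0.3], k=3)
```

A projection that pulls same-class samples together and pushes different-class samples apart makes this vote more reliable, because the nearest neighbors are then more likely to share the query's class.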

7.
赵展一  钟永恒  王辉  刘佳 《现代情报》2023,(10):152-163+177
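[Purpose/Significance] Technological relatedness and matching are the intrinsic drivers of R&D cooperation between enterprises. This paper reviews methods for identifying potential enterprise R&D partners based on technological relatedness, summarizes the shortcomings of existing research, and offers suggestions for development, in order to improve the system of intelligence methods for identifying potential partners. [Method/Process] Based on 136 key publications, the paper summarizes methods for identifying potential R&D partners based on technology inheritance, co-occurrence, structure, and application relationships as well as composite technology relationships, compares the strengths and weaknesses of each method, and proposes directions for future research. [Result/Conclusion] Existing studies achieve good identification results by mining useful information from citation relationships, co-occurrence relationships, text semantics, and complex networks, and by combining statistical and semantic features. The shortcomings are that data sources and data scope are limited, semantic analysis methods for technical text are flawed, and the relationship between technological relatedness and cooperative behavior has not been clearly worked out. Future directions: incorporate multiple types of data and ensure that the identification scope is complete; improve the theory and methods for semantic analysis and computation on technical text; systematically clarify the relationship between technological relatedness and cooperative behavior, and improve indicators for measuring cooperation potential.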

8.
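[Purpose/Significance] Vector representations of documents are important for research on topic aggregation, clustering, and classification of literature. The co-occurrence latent semantic vector space model (CLSVSM), which is based on pairwise co-occurrence information, mines the latent semantic relationships between words in text and substantially improves text clustering accuracy compared with the basic text vector representation model, the vector space model (VSM). [Method/Process] To enable CLSVSM to extract the latent semantic information of documents more effectively, this paper introduces three-way co-occurrence information on top of the pairwise CLSVSM to mine latent semantics more deeply. By studying the representation of the three-way co-occurrence matrix and methods for computing three-way co-occurrence frequency and relative co-occurrence strength, it builds a weighted co-occurrence latent semantic vector space model (weighted CLSVSM). The pairwise CLSVSM and the weighted CLSVSM are then compared experimentally on Chinese and English document data. [Result/Conclusion] The results show that the new model clusters English documents about as well as the pairwise CLSVSM but clusters Chinese documents by topic markedly better than the pairwise CLSVSM.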
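
The pairwise and three-way co-occurrence frequencies referred to in this abstract can be sketched as simple counts. This assumes that terms co-occur when they appear in the same document, which is an assumption made here; the paper's definition of relative co-occurrence strength is not given in the abstract and is not reproduced. The function name and sample documents are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs, order=2):
    """Count how many documents contain each unordered group of `order` distinct terms."""
    counts = Counter()
    for doc in docs:
        # Sort the distinct terms so each group has one canonical tuple key.
        terms = sorted(set(doc))
        counts.update(combinations(terms, order))
    return counts

# Hypothetical, already-tokenized documents.
docs = [
    ["latent", "semantic", "index"],
    ["latent", "semantic", "model"],
    ["semantic", "index"],
]
pairs = cooccurrence_counts(docs, order=2)
triples = cooccurrence_counts(docs, order=3)
```

With `order=3` the same routine yields three-way counts, which are sparser than pairwise counts because fewer documents contain any given triple.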

9.
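To identify potential technology opportunities in a timely and effective way, the paper uses text mining and outlier detection to propose a patent-text-based method for identifying technology opportunities. First, the Doc2vec text representation model is used to model patent abstracts so as to represent the semantic information of the text at a deeper level. Then a density-based outlier detection algorithm identifies patent directions with potential technology opportunities. Finally, taking the identification of potential technologies in the deep learning field as an example, a patent search query is constructed and 458 patent documents are collected as the dataset. The empirical results yield 10 potential technology opportunities across 4 topic categories, which supports the effectiveness of the method and can inform enterprises' technology application, R&D, and innovation.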
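
The outlier-detection step can be illustrated with a deliberately simplified score. The abstract does not name the specific density-based algorithm, so the sketch below substitutes the mean distance to the k nearest neighbors as a stand-in for a local density estimate; the points are hypothetical placeholders for Doc2vec embeddings of patent abstracts.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors.

    A larger score means a sparser neighborhood, i.e. a more likely outlier.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical 2-D embeddings: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
scores = knn_outlier_scores(points, k=2)
most_isolated = max(range(len(points)), key=scores.__getitem__)
```

In the setting the abstract describes, patents with the highest scores would be the candidates examined for emerging technology directions.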

10.
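[Purpose/Significance] Existing text-mining-based studies of the diffusion characteristics of policy topics suffer from randomness in topic identification and heavy reliance on manual work. To address these shortcomings, this paper proposes a policy topic analysis framework based on innovation value chain theory, together with a corresponding text mining method, in order to better identify how policy content changes during policy diffusion and the mechanisms behind those changes. [Method/Process] Taking China's artificial intelligence policies as the empirical subject, the paper builds, on the theoretical side, a topic analysis framework for policy texts based on the innovation value chain. On the methodological side, it extracts key phrase structures from policy texts using dependency syntax and semantic information, and computes the topic diffusion distribution of policy texts by constructing a dictionary that maps framework topics one-to-one to phrase-structure vocabulary. [Result/Conclusion] The paper collects 110 artificial intelligence policy texts issued since 2017 and analyzes the temporal distribution, spatial hierarchy, degree of content diffusion, and topic diffusion distribution of AI policy diffusion. On this basis, it treats governments' topic diffusion tendency and the development-stage tier they belong to as ordinal variables and analyzes the relationship between the two and the underlying mechanisms. This shows that the proposed method can effectively combine text mining with policy analysis theory and helps move from descriptive analysis of policy diffusion characteristics toward explanatory analysis of policy diffusion mechanisms.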

11.
This paper proposes a method to improve the retrieval performance of the vector space model (VSM), in part by utilizing user-supplied information about which documents are relevant to the query in question. In addition to the user's relevance feedback, information such as the original document similarities is incorporated into the retrieval model, which is built using a sequence of linear transformations. High-dimensional, sparse vectors are then reduced by singular value decomposition (SVD) and transformed into a low-dimensional vector space, namely the space representing the latent semantic meanings of words. The method has been tested on two test collections, the Medline collection and the Cranfield collection. To train the model, multiple partitions are created for each collection. The improvements in average precision, averaged over all partitions and compared with the latent semantic indexing (LSI) model, are 20.57% (Medline) and 22.23% (Cranfield) on the training data, and 0.47% (Medline) and 4.78% (Cranfield) on the test data. The proposed method makes it possible to preserve user-supplied relevance information in the system over the long term for later use.

12.
陈立华 《现代情报》2010,30(3):26-28,31
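Latent semantic analysis is the theoretical basis for using natural language in information retrieval systems, and the vector space model built on this theory is a knowledge tool for judging the performance of a retrieval system. The paper describes the basic content of latent semantic indexing (LSI), the factors that affect the precision of natural-language retrieval under LSI, and the operating mechanism of vector space model retrieval software. The review offers a useful reference for the development of networked information retrieval technology.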

13.
Traditional information retrieval techniques that primarily rely on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply because they use different terminology to describe the same concepts. Semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while considering semantic information alone might not improve retrieval performance over keyword-based search, it enables the retrieval of a set of relevant documents that cannot be retrieved by keyword-based methods. Building indices that store and provide access to semantic information during the retrieval process is therefore important. While the process for building and querying keyword-based indices is quite well understood, incorporating semantic information within search indices is still an open challenge. Existing work has proposed building one unified index encompassing both textual and semantic information, or building separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embeddings-based representations of terms, semantic entities, semantic types, and documents within the same embedding space to facilitate the development of a unified search index that consists of these four information types. We perform experiments on standard and widely used document collections, including Clueweb09-B and Robust04, to evaluate our proposed indexing strategy from both effectiveness and efficiency perspectives. Based on our experiments, we find that when neural embeddings are used to build inverted indices, thereby relaxing the requirement that the posting list key be explicitly observed in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, reducing index size and query processing time, and (b) while retrieval efficiency, which is the main objective of an efficient indexing mechanism, improves with our proposed method, retrieval effectiveness remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.

14.
Similarity search with hashing has become one of the fundamental research topics in computer vision and multimedia. Current research on semantic-preserving hashing mainly focuses on exploring the semantic similarities between pointwise or pairwise samples in the visual space to generate discriminative hash codes. However, such learning schemes fail to explore the intrinsic latent features embedded in the high-dimensional feature space and struggle to capture the underlying topological structure of the data, yielding low-quality hash codes for image retrieval. In this paper, we propose an ordinal-preserving latent graph hashing (OLGH) method, which derives the objective hash codes from the latent space and preserves the high-order local topological structure of the data in the learned hash codes. Specifically, we design a triplet-constrained topology-preserving loss to uncover the ordinal-inferred local features in binary representation learning. As a result, the learning system can implicitly capture the high-order similarities among samples during feature learning. Moreover, latent subspace learning is used to acquire noise-free latent features based on sparse constrained supervised learning, so that the latent, under-explored characteristics of the data are fully used in subspace construction. Furthermore, the latent ordinal graph hashing is formulated by jointly exploiting latent space construction and ordinal graph learning. An efficient optimization algorithm is developed to solve the resulting problem. Extensive experiments conducted on diverse datasets show the effectiveness and superiority of the proposed method when compared to some advanced learning-to-hash algorithms for fast image retrieval. The source code for this paper is available at https://github.com/DarrenZZhang/OLGH .

15.
Rocchio relevance feedback and latent semantic indexing (LSI) are well-known extensions of the vector space model for information retrieval (IR). This paper analyzes the statistical relationship between these extensions. The analysis focuses on each method’s basis in least-squares optimization. Noting that LSI and Rocchio relevance feedback both alter the vector space model in a way that is in some sense least-squares optimal, we ask: what is the relationship between LSI’s and Rocchio’s notions of optimality? What does this relationship imply for IR? Using an analytical approach, we argue that Rocchio relevance feedback is optimal if we understand retrieval as a simplified classification problem. On the other hand, LSI’s motivation comes to the fore if we understand it as a biased regression technique, where projection onto a low-dimensional orthogonal subspace of the documents reduces model variance.
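
For readers unfamiliar with Rocchio relevance feedback, the classic query update can be sketched as below. This is the textbook form, not the paper's statistical analysis; the weights `alpha`, `beta`, and `gamma` are illustrative defaults, and the vectors are hypothetical.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the relevant centroid and away from the nonrelevant one."""
    def centroid(vectors):
        # An empty feedback set contributes a zero vector.
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, non_c)]

# Hypothetical three-term vocabulary.
new_query = rocchio(
    query=[1.0, 0.0, 0.0],
    relevant=[[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    nonrelevant=[[0.0, 0.0, 1.0]],
)
```

Some implementations clip negative weights to zero after the update; this sketch leaves them as computed.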

16.
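Ontology-based language annotation can effectively avoid lexical ambiguity and improve the efficiency of information retrieval for digital library users. Taking biology as an example, this paper uses ontology-based semantic discrimination to analyze user-supplied keywords and reveal the concepts those keywords refer to, describes work on building domain ontologies in bioinformatics, and discusses semantic annotation methods used in building domain ontologies.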

17.
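Associative cross-reference retrieval of related concepts is an important topic in concept retrieval research. This paper proposes a topic-based semantic association model for cross-reference retrieval. By combining knowledge from the Semantic Web and ontology with language processing techniques such as information extraction, it extracts the topic concepts of documents on a specific topic and the associations among those concepts to form a semantic association model for that topic, which then assists the cross-reference retrieval process.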

18.
Hashing is an emerging topic that has recently attracted widespread attention in multi-modal similarity search applications. However, most existing approaches rely on relaxation schemes to generate binary codes, leading to large quantization errors. In addition, many existing approaches embed labels into the pairwise similarity matrix, which incurs high time and space costs and loses category information. To address these issues, we propose Efficient Discrete Matrix factorization Hashing (EDMH). Specifically, EDMH first learns the latent subspace for each modality through a matrix factorization strategy, which preserves the semantic structure of each modality. In particular, we develop a semantic label offset embedding learning strategy that improves the stability of label embedding regression. Furthermore, we design an efficient discrete optimization scheme to generate compact binary codes discretely. Finally, we present two efficient learning strategies, EDMH-L and EDMH-S, to obtain high-quality hash functions. Extensive experiments on various widely used databases verify that the proposed algorithms outperform some state-of-the-art approaches, with average improvements of 2.50% (Wiki), 2.66% (MIRFlickr), and 2.25% (NUS-WIDE) over the best available results.
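
The quantization error that this abstract attributes to relaxation schemes can be illustrated with a small sketch. It shows the relax-then-binarize pattern being criticized, not EDMH itself: real-valued vectors are thresholded with the sign function, and retrieval ranks items by Hamming distance. All vectors and function names are hypothetical.

```python
def binarize(vector):
    """Quantize a real-valued vector to a {-1, +1} code with the sign function."""
    return [1 if x >= 0 else -1 for x in vector]

def quantization_error(vector, code):
    """Squared Euclidean distance between the relaxed vector and its binary code."""
    return sum((x - b) ** 2 for x, b in zip(vector, code))

def hamming(code_a, code_b):
    """Number of positions at which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

# Hypothetical relaxed (real-valued) solutions for three database items.
relaxed = [[0.9, -0.8, 0.1], [0.7, -0.1, -0.6], [-0.9, 0.8, 0.2]]
codes = [binarize(v) for v in relaxed]
errors = [quantization_error(v, c) for v, c in zip(relaxed, codes)]

# Rank database items by Hamming distance to a query code.
query_code = binarize([0.8, -0.7, 0.3])
ranking = sorted(range(len(codes)), key=lambda i: hamming(query_code, codes[i]))
```

Components near zero, such as the `0.1` in the first vector, contribute most of the error after thresholding, which is the motivation for optimizing the binary codes discretely instead.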

19.
A Review of Research on Ontology   Total citations: 3, self-citations: 0, citations by others: 3
顾金睿  王芳 《情报科学》2007,25(6):949-956
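This is a review article on ontology. It introduces the concept of ontology and theoretical research on ontology, including ontology modeling primitives, classification, representation languages, construction rules, and the leading institutions currently studying ontology. It also introduces concepts related to ontology, discussing the relationships between ontology and semantic networks, ontology and the Semantic Web, and ontology and thesauri. Finally, it describes applications of ontology in information retrieval and several other fields.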

20.
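Research on Chinese Concept Retrieval Technology Based on Vector Space   Total citations: 2, self-citations: 1, citations by others: 2
As informatization deepens, full-text retrieval systems are used ever more widely across society, and the vast majority of them are based on word matching. In theory, information retrieval is language processing whose basic unit is the concept; words are only the surface form of concepts. Because the same concept can usually be expressed by different words, and because concepts can have complex semantic relationships with one another, many retrieval systems often return incomplete or off-target results. Concept retrieval has been proposed in response. The author holds that concept retrieval has at least two meanings. First, it is an idea: to overcome the limitation that mechanical matching is confined to surface forms, it understands and handles the user's retrieval request at the level of the conceptual meaning expressed by words, so as to better satisfy the user. Second, it …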
