Similar Documents
20 similar documents found (search time: 109 ms)
1.
A new dictionary-based text categorization approach is proposed to classify chemical web pages efficiently. Using a chemistry dictionary, the approach extracts chemistry-related information from web pages more accurately. After automatically segmenting the documents to find dictionary terms for document expansion, the approach adopts latent semantic indexing (LSI) to produce the final document vectors, and relevant categories are assigned to the test document with the k-NN text categorization algorithm. The effects of the characteristics of the chemistry dictionary and the test collection on categorization efficiency are discussed in this paper, and a new voting method is introduced to further improve categorization performance based on the collection characteristics. The experimental results show that the proposed approach outperforms the traditional categorization method and is applicable to the classification of chemical web pages.
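A minimal sketch of this kind of pipeline, assuming scikit-learn: the chemistry dictionary, corpus, and labels below are hypothetical stand-ins, and the expansion step simply repeats dictionary hits rather than reproducing the paper's segmentation.

```python
# Sketch: dictionary-expanded TF-IDF -> LSI -> k-NN (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

chem_dict = {"benzene", "polymer", "catalyst"}   # stand-in chemistry dictionary

def expand(doc):
    # Repeat dictionary terms found in the document so they gain weight.
    hits = [t for t in doc.lower().split() if t in chem_dict]
    return doc + " " + " ".join(hits)

train_docs = ["catalyst surface reaction study",
              "benzene ring substitution effects",
              "football match report tonight"]
train_labels = ["chemistry", "chemistry", "other"]

model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2),        # LSI: project into a latent space
    KNeighborsClassifier(n_neighbors=1),
)
model.fit([expand(d) for d in train_docs], train_labels)
print(model.predict([expand("polymer catalyst synthesis")]))
```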

2.
Digital documents such as books and journal papers offer few text features, so their feature vectors express semantics inaccurately and classification performs poorly. To address this, this paper proposes a digital-document classification method based on feature semantic expansion. The method first uses TF-IDF to obtain core feature terms with high TF-IDF values that represent the document text well; it then expands the semantic concepts of the core term set with the HowNet semantic lexicon and the open knowledge base Wikipedia, building a low-dimensional, semantically rich concept vector space; finally, it builds classifiers with algorithms such as MaxEnt and SVM to classify digital documents automatically. Experimental results show that, compared with traditional short-text classification methods based on feature selection, the method effectively expands the semantics of short-text features and improves classification performance on digital documents.
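A rough illustration of the first two steps, assuming scikit-learn; the HowNet/Wikipedia lookup is replaced by an invented concept_table, since the real expansion would query those resources.

```python
# Sketch: pick high-TF-IDF core terms, then map them to broader concepts
# (the concept_table is a stand-in for HowNet / Wikipedia lookups).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

concept_table = {"retrieval": "information-access", "index": "information-access"}

docs = ["index structures for retrieval", "retrieval models survey", "cooking recipes"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
terms = np.array(vec.get_feature_names_out())

top_k = 2
for row in X.toarray():
    core = terms[np.argsort(row)[::-1][:top_k]]          # top TF-IDF terms
    concepts = {concept_table.get(t, t) for t in core}   # expand to concepts
    print(list(core), "->", sorted(concepts))
```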

3.
Abstractive summarization aims to generate a concise summary covering salient content from single or multiple text documents. Many recent abstractive summarization methods are built on the transformer model to capture long-range dependencies in the input text and achieve parallelization. In the transformer encoder, calculating attention weights is a crucial step for encoding input documents. Input documents usually contain key phrases conveying salient information, and it is important to encode these phrases completely. However, existing transformer-based summarization works did not consider key phrases in the input when determining attention weights. Consequently, some of the tokens within key phrases receive only small attention weights, which is not conducive to encoding the semantic information of input documents. In this paper, we introduce prior knowledge of key phrases into the transformer-based summarization model and guide the model to encode key phrases. For the contextual representation of each token in a key phrase, we assume the tokens within the same key phrase make larger contributions than other tokens in the input sequence. Based on this assumption, we propose the Key Phrase Aware Transformer (KPAT), a model with a highlighting mechanism in the encoder that assigns greater attention weights to tokens within key phrases. Specifically, we first extract key phrases from the input document and score the phrases' importance. Then we build a block diagonal highlighting matrix to indicate these phrases' importance scores and positions. To combine self-attention weights with key phrases' importance scores, we design two highlighting-attention structures: one for each attention head and one for multi-head highlighting attention. Experimental results on two datasets (Multi-News and PubMed) from different summarization tasks and domains show that our KPAT model significantly outperforms advanced summarization baselines. We conduct further experiments to analyze the impact of each part of our model on summarization performance and verify the effectiveness of the proposed highlighting mechanism.
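A toy numpy sketch of the highlighting idea: a block-diagonal matrix carrying a phrase's importance score is added to the self-attention logits before softmax. The shapes, scores, and phrase span are invented, not taken from KPAT.

```python
# Sketch: self-attention logits boosted by a block-diagonal highlight matrix,
# giving tokens inside a key phrase extra attention weight.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d = 6, 8
rng = np.random.default_rng(0)
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
logits = Q @ K.T / np.sqrt(d)

# Hypothetical key phrase covering tokens 2..4 with importance score 1.5.
H = np.zeros((n, n))
H[2:5, 2:5] = 1.5             # diagonal block over the phrase span
attn = softmax(logits + H)    # highlighted attention weights
print(attn.round(2))
```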

4.
Lately there has been intensive research into using additional information about documents (such as hyperlinks) to improve retrieval effectiveness. This is called data fusion, and it rests on the intuitive principle that different document and query representations, or different methods, lead to a better estimation of the documents' relevance scores. In this paper we propose a new method of document re-ranking that improves document scores using inter-document relationships. These relationships are expressed as distances and can be obtained from the text, hyperlinks or other information. The method formalizes the intuition that strongly related documents should not be assigned very different weights.
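One possible way to formalize that intuition, as a hedged sketch: pull each score toward a similarity-weighted average of the other documents' scores. The distance matrix, similarity kernel, and alpha are assumptions, not the paper's exact method.

```python
# Sketch: score smoothing over inter-document distances, so documents that
# are close to each other end up with similar weights.
import numpy as np

scores = np.array([0.9, 0.2, 0.85])          # initial retrieval scores
dist = np.array([[0.0, 0.8, 0.1],
                 [0.8, 0.0, 0.7],
                 [0.1, 0.7, 0.0]])           # inter-document distances

sim = np.exp(-dist)                          # turn distances into similarities
np.fill_diagonal(sim, 0.0)
neighbor_avg = sim @ scores / sim.sum(axis=1)

alpha = 0.3                                  # smoothing strength (assumed)
new_scores = (1 - alpha) * scores + alpha * neighbor_avg
print(new_scores.round(3))
```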

5.
On the web, a huge variety of text collections contain knowledge in different expertise domains, such as technology or medicine. The texts are written for different uses and thus for people with different levels of expertise in the domain. Texts intended for professionals may not be understandable at all to a lay person, and texts for lay people may not contain all the detailed information a professional needs. Many information retrieval applications, such as search engines, would offer a better user experience if they were able to select the text sources that best fit the expertise level of the user. In this article, we propose a novel approach for assessing the difficulty level of a document: our method assesses difficulty for each user separately. The method enables, for instance, offering information in a personalised manner based on the user's knowledge of different domains. The method is based on comparing the terms appearing in a document with the terms known by the user. We present two ways to collect information about the terminology the user knows: directly, by asking users the difficulty of terms, or, as a novel automatic approach, indirectly, by analysing texts written by the users. We examine the applicability of the methodology with text documents in the medical domain. The results show that the method is able to distinguish between documents written for lay people and documents written for experts.
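A minimal sketch of the term-comparison idea: difficulty is the fraction of document terms missing from the user's known vocabulary. The vocabularies are invented, and the real method's term collection is more elaborate.

```python
# Sketch: per-user difficulty as the share of document terms the user
# does not know (toy vocabularies).
def difficulty(doc_terms, known_terms):
    doc_terms = set(doc_terms)
    unknown = doc_terms - set(known_terms)
    return len(unknown) / len(doc_terms) if doc_terms else 0.0

user_vocab = {"fever", "aspirin", "dose"}
lay_doc = ["fever", "aspirin", "dose", "rest"]
expert_doc = ["pyrexia", "acetylsalicylic", "pharmacokinetics", "dose"]
print(difficulty(lay_doc, user_vocab), difficulty(expert_doc, user_vocab))
```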

6.
This work addresses the information retrieval problem of auto-indexing Arabic documents. Auto-indexing a text document refers to automatically extracting words that are suitable for building an index for the document. In this paper, we propose an auto-indexing method for Arabic text documents. The method is based on morphological analysis and on a technique for assigning weights to words. The morphological analysis uses a number of grammatical rules to extract stem words that become candidate index words. The weight assignment technique computes weights for these words relative to the containing document. The weight is based not only on a word's rate of occurrence but also on how widely it is spread across the document. The candidate index words are then sorted in descending order by weight so that information retrievers can select the more important index words. We empirically verify the usefulness of our method using several examples, obtaining an average recall of 46% and an average precision of 64%.
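An illustrative (not the paper's) weighting along these lines, combining occurrence count with the span of positions a word covers:

```python
# Sketch: weight a word by its frequency and by how widely its occurrences
# are spread across the document (invented formula for illustration).
def spread_weight(positions, doc_len):
    if not positions:
        return 0.0
    spread = (max(positions) - min(positions) + 1) / doc_len
    return len(positions) * spread     # frequency x coverage of the document

tokens = "the system indexes arabic text and the system ranks arabic words".split()
positions = [i for i, t in enumerate(tokens) if t == "arabic"]
print(spread_weight(positions, len(tokens)))
```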

7.
Social emotion refers to the emotion a textual document evokes in the reader. In contrast to the emotion cause extraction task, which analyzes the causes of the author's sentiments based on the expressions in a text, identifying the causes of the social emotion evoked in the reader has not been explored previously. Social emotion mining and its cause analysis is not only an important research topic in Web-based social media analytics and text mining but also has a number of applications in multiple domains. Because social emotion cause identification analyzes the causes of reader emotions that are not explicitly or implicitly expressed in the text, it is a challenging task fundamentally different from previous research, and it requires a deeper understanding of the cognitive process underlying the inference of social emotion and its causes. In this paper, we propose the new task of social emotion cause identification (SECI). Inspired by the cognitive structure of emotions (OCC) theory, we present a Cognitive Emotion model Enhanced Sequential (CogEES) method for SECI. Specifically, based on the implications of the OCC model, our method first establishes the correspondence between words/phrases in text and the emotional dimensions identified in OCC, building emotional dimension lexicons with 1,676 distinct words/phrases. It then uses the lexicon information and discourse coherence for the semantic segmentation of the document and for enhancing clause representation learning. Finally, it combines text segmentation and clause representation into a sequential model for cause clause prediction. We construct the SECI dataset for this new task and conduct experiments to evaluate CogEES. Our method outperforms the baselines and achieves over 10% F1 improvement on average, with better interpretability of the prediction results.
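A tiny sketch of the lexicon-matching step, with an invented two-entry lexicon standing in for the 1,676-entry OCC dimension lexicons:

```python
# Sketch: mapping clause words to OCC-style emotional dimensions via a
# lexicon; entries and dimension names here are invented.
occ_lexicon = {"award": "desirability", "blame": "blameworthiness"}

def clause_dimensions(clause):
    return [occ_lexicon[w] for w in clause.lower().split() if w in occ_lexicon]

print(clause_dimensions("The committee chose to blame the vendor"))
```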

8.
9.
10.
The Internet, together with the large amount of textual information available in document archives, has increased the relevance of information-retrieval tools. In this work we present an extension of the Gambal system for clustering and visualization of documents based on fuzzy clustering techniques. The tool makes it possible to structure the set of documents hierarchically (using a fuzzy hierarchical structure) and to represent this structure in a graphical interface (a 3D sphere) over which the user can navigate. Gambal analyzes the documents and computes their similarity not only from the syntactic similarity between words but also using a dictionary (WordNet 1.7) and latent semantic analysis.
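A compact numpy sketch of fuzzy c-means, the family of fuzzy clustering behind such tools; the data, fuzzifier m, and iteration count are arbitrary, and Gambal's hierarchical variant is not reproduced.

```python
# Sketch: a few iterations of fuzzy c-means on 2-D points (stand-in for
# fuzzy clustering of document vectors; m is the usual fuzzifier).
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(3, 0.3, (5, 2))])
c, m = 2, 2.0
U = rng.dirichlet(np.ones(c), size=len(X))          # fuzzy memberships

for _ in range(20):
    W = U ** m
    centers = (W.T @ X) / W.sum(axis=0)[:, None]    # weighted centroids
    d = np.linalg.norm(X[:, None] - centers[None], axis=2) + 1e-9
    U = 1.0 / (d ** (2 / (m - 1)))                  # standard FCM update
    U /= U.sum(axis=1, keepdims=True)               # renormalize memberships

print(U.round(2))
```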

11.
This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.
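A minimal sketch of field-wise stacking with scikit-learn: one TF-IDF classifier per field, with a meta classifier over their probability outputs. Real stacking would use out-of-fold predictions for the meta features; the toy data and direct refit here are simplifications.

```python
# Sketch: stacking over structural components -- one text model per field,
# a meta classifier combining their predicted probabilities.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = ["svm kernels", "football results", "topic models", "match highlights"]
bodies = ["margin maximization study", "league table update",
          "dirichlet priors explored", "goals scored summary"]
y = np.array([1, 0, 1, 0])                    # 1 = research, 0 = sports

field_models = [make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(f, y)
                for f in (titles, bodies)]
meta_X = np.hstack([m.predict_proba(f)[:, 1:] for m, f in
                    zip(field_models, (titles, bodies))])
meta = LogisticRegression().fit(meta_X, y)    # meta classifier on field outputs
print(meta.predict(meta_X))
```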

12.
Traditional index weighting approaches for information retrieval from texts depend on term-frequency-based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of terms in a document, is their limited ability to extract semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially for the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce document search time in a real environment.
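A small sketch of folding term weights into concept weights; the concept clusters are invented, and real concept assignment would be derived from the corpus.

```python
# Sketch: aggregate term weights into concept-cluster weights, so the index
# lives in a lower-dimensional concept space.
concept_of = {"stock": "finance", "bond": "finance", "gene": "biology"}

def concept_vector(term_weights):
    cv = {}
    for term, w in term_weights.items():
        c = concept_of.get(term)
        if c:
            cv[c] = cv.get(c, 0.0) + w     # sum term weight per concept
    return cv

print(concept_vector({"stock": 0.7, "bond": 0.4, "gene": 0.1}))
```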

13.
Wen Youkui, 《情报科学》 (Information Science), 2002, 20(3): 259-261, 292
This paper proposes a concept hierarchy network method for text knowledge analysis, a new approach to text concept analysis. It explores combining object-oriented techniques with frame techniques to automatically analyse the abstract and concrete concepts in text knowledge, organize that knowledge, and make retrieval more intelligent.
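A minimal sketch of a frame-style concept node with is-a links, one plausible shape (assumed, not from the paper) for such a concept hierarchy network:

```python
# Sketch: a frame-style concept node with an is-a chain and attribute slots.
class Concept:
    def __init__(self, name, parent=None, slots=None):
        self.name, self.parent = name, parent
        self.slots = slots or {}             # frame slots: attribute -> value

    def ancestors(self):
        node, chain = self.parent, []
        while node:
            chain.append(node.name)
            node = node.parent
        return chain

vehicle = Concept("vehicle")
car = Concept("car", parent=vehicle, slots={"wheels": 4})
print(car.ancestors(), car.slots)
```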

14.
In current technology opportunity identification, the innovation paths produced are often abstract and vague, and in many cases domain experts are needed to interpret them. Taking cold-storage technology as an example, this study builds a technology innovation path identification model based on text mining, machine learning algorithms, and a multi-dimensional patent map. First, it constructs an identification framework that preprocesses the relevant patent literature (word segmentation, cleaning, etc.) and builds a knowledge graph. Second, it extracts keywords from the patent documents with a term frequency-inverse document frequency (TF-IDF) text mining method, then applies latent Dirichlet allocation (LDA) to cluster topics, reduce dimensionality, and extract innovation dimensions. Third, guided by the target technical problem and preferred innovation principles, it maps these onto the multi-dimensional patent map to produce concrete innovation paths with realistic, promising prospects. Finally, it uses extenics to compute a comprehensive relevance rating for each innovation path and select the best ones. The aim is to reduce innovation cost, improve innovation efficiency, and provide decision support for enterprises to innovate precisely and keep strengthening their core competitiveness.
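A sketch of the keyword-extraction and topic-reduction steps with scikit-learn; the patent snippets, top-k, and topic count are placeholders:

```python
# Sketch: TF-IDF keyword extraction, then LDA topic reduction on
# patent-like snippets (toy corpus and parameters).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

patents = ["cold storage insulation panel design",
           "refrigeration compressor energy control",
           "insulation foam panel manufacturing",
           "compressor control for cold chain logistics"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(patents)
terms = np.array(tfidf.get_feature_names_out())
for row in X.toarray():
    print("keywords:", terms[np.argsort(row)[::-1][:3]])   # top TF-IDF terms

counts = CountVectorizer().fit_transform(patents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print("doc-topic mix:\n", lda.transform(counts).round(2))  # innovation dimensions
```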

15.
This study proposes a probabilistic model for automatically extracting English noun phrases without part-of-speech tagging or any syntactic analysis. The technique is based on a Markov model whose initial parameters are estimated by a phrase lookup program with a phrase dictionary and then optimized by a set of maximum entropy (ME) parameters over a set of morphological features. Using the Viterbi algorithm with the trained Markov model, the program can dynamically extract noun phrases from input text. Experiments show that this technique is comparable in effectiveness to the best existing techniques.
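A self-contained Viterbi sketch over two states (inside/outside a noun phrase); the transition and emission tables are toy numbers rather than the ME-optimized parameters described above.

```python
# Sketch: Viterbi decoding over two hidden states, O (outside) and NP
# (inside a noun phrase), with toy transition/emission probabilities.
import numpy as np

states = ["O", "NP"]
start = np.log([0.7, 0.3])
trans = np.log([[0.8, 0.2],      # O  -> O, NP
                [0.4, 0.6]])     # NP -> O, NP
emit = {"the": [0.3, 0.7], "dog": [0.2, 0.8], "ran": [0.9, 0.1]}

obs = ["the", "dog", "ran"]
V = start + np.log(emit[obs[0]])
back = []
for w in obs[1:]:
    cand = V[:, None] + trans          # scores for every previous state
    back.append(cand.argmax(axis=0))   # best predecessor per current state
    V = cand.max(axis=0) + np.log(emit[w])

path = [int(V.argmax())]
for ptr in reversed(back):             # trace back the best state sequence
    path.append(int(ptr[path[-1]]))
print([states[s] for s in reversed(path)])
```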

16.
This paper explores the incorporation of prior knowledge into support vector machines as a means of compensating for a shortage of training data in text categorization. The prior knowledge about transformation invariance is generated by a virtual document method. The method applies a simple transformation to documents: it makes virtual documents by combining relevant document pairs for a topic in the training set. The virtual document thus created is not only expected to preserve the topic, but even to improve the topical representation by exploiting relevant terms that are not given high importance in individual real documents. Artificially generated documents change the distribution of the training data without randomization. Experiments with support vector machines based on linear, polynomial and radial-basis-function kernels showed the effectiveness of the method on the Reuters-21578 collection for topics with a small number of relevant documents. The proposed method achieved 131%, 34% and 12% improvements in micro-averaged F1 for the 25, 46 and 58 topics with fewer than 10, 30 and 50 relevant documents in learning, respectively. The result analysis indicates that incorporating virtual documents contributes to a steady improvement in performance.
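The core transformation is simple enough to sketch directly; here a virtual document is formed from each pair of relevant documents for a topic (toy texts):

```python
# Sketch: make virtual documents by combining relevant document pairs for a
# topic, then enlarge the positive training set with them.
from itertools import combinations

relevant = ["wheat exports rose sharply", "wheat harvest beats forecast"]
virtual_docs = [a + " " + b for a, b in combinations(relevant, 2)]
train_pos = relevant + virtual_docs     # enlarged positive set for the SVM
print(virtual_docs)
```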

17.
To process Chinese Web text automatically, quickly, and effectively, this paper proposes a topic feature extraction method based on a domain ontology. Tailored to the characteristics of Web text, it introduces a semi-automatic way of building a domain dictionary. The text is segmented with this dictionary, terms are mapped to topics, and the text vector is represented with ontology concepts, which effectively reduces the dimensionality of the text feature vector and improves the quality of topic extraction. Topic feature weights are computed from the position and frequency of the text information, and the weights of topic concepts are then adjusted and ranked using the ontology structure. A worked example verifies the effectiveness of the method.
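A rough sketch of position-aware concept weighting; the ontology, fields, and title boost are assumptions for illustration:

```python
# Sketch: map segmented terms to ontology concepts and weight them by
# frequency and position (title terms boosted; invented ontology).
ontology = {"neural": "machine-learning", "network": "machine-learning"}

def topic_weights(title, body, title_boost=2.0):
    w = {}
    for section, boost in ((title, title_boost), (body, 1.0)):
        for term in section.lower().split():
            c = ontology.get(term)
            if c:
                w[c] = w.get(c, 0.0) + boost   # frequency weighted by position
    return w

print(topic_weights("Neural network pruning", "the network is trained again"))
```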

18.
Transfer learning utilizes labeled data available from a related (source) domain to achieve effective knowledge transfer to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationships among them. In this paper, we propose a novel cross-domain document classification approach called the Link-Bridged Topic model (LBT). LBT consists of two key steps. First, LBT utilizes an auxiliary link network to discover direct or indirect co-citation relationships among documents by embedding the background knowledge into a graph kernel; the mined co-citation relationships are leveraged to bridge the gap across domains. Second, LBT combines content information and link structures into a unified latent topic model. The model is based on the assumption that documents of the source and target domains share some common topics from the point of view of both content information and link structure. By mapping both domains' data into the latent topic spaces, LBT encodes the knowledge about domain commonality and difference as shared topics with associated differential probabilities. The learned latent topics must be consistent with the source and target data, as well as with content and link statistics. The shared topics then act as the bridge to facilitate knowledge transfer from the source to the target domain. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.
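A small sketch of mining indirect co-citation links, the kind of relationship LBT's auxiliary link network exploits (documents and references invented):

```python
# Sketch: derive co-citation links between documents that cite a common
# paper; link strength = number of shared references.
from itertools import combinations
from collections import Counter

cites = {"d1": {"p1", "p2"}, "d2": {"p2", "p3"}, "d3": {"p4"}}
co_cite = Counter()
for a, b in combinations(cites, 2):
    shared = cites[a] & cites[b]
    if shared:
        co_cite[(a, b)] = len(shared)
print(dict(co_cite))
```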

19.
We propose a new finite mixture model for clustering multiple-field documents, such as scientific literature with distinct fields: title, abstract, keywords, main text and references. This probabilistic model, which we call the field independent clustering model (FICM), incorporates the distinct word distribution of each field to integrate the discriminative abilities of each field and to select the most suitable component probabilistic model for each field. We evaluated the performance of FICM by applying it to the problem of clustering three-field (title, abstract and MeSH) biomedical documents from the TREC 2004 and 2005 Genomics tracks, and two-field (title and abstract) news reports from Reuters-21578. Experimental results showed that FICM outperformed the classical multinomial model and the multivariate Bernoulli model at a statistically significant level on all three collections. These results indicate that FICM outperforms widely used probabilistic models for document clustering by considering the characteristics of each field. We further show that a component model consistent with the nature of the corresponding field achieves better performance, and that considering the diversity of model settings gives a further improvement. An extended abstract of parts of this work appeared in Zhu, S., Takigawa, I., Zhang, S., & Mamitsuka, H. (2007). A probabilistic model for clustering text documents with multiple fields. In Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), Lecture Notes in Computer Science (Vol. 4425, pp. 331-342).
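A toy analogue of per-field scoring: a cluster's log-likelihood for a document is the sum of field-specific word log-probabilities. The field models, smoothing, and mixture weights are invented simplifications of FICM:

```python
# Sketch: score one cluster for a multi-field document by summing per-field
# word log-likelihoods, each field with its own word distribution.
import math

field_models = {                 # cluster "bio": per-field word probabilities
    "title":    {"gene": 0.5, "protein": 0.5},
    "abstract": {"gene": 0.3, "cell": 0.4, "protein": 0.3},
}

def cluster_loglik(doc):
    ll = 0.0
    for field, words in doc.items():
        probs = field_models[field]
        for w in words:
            ll += math.log(probs.get(w, 1e-6))   # crude smoothing for unseen words
    return ll

doc = {"title": ["gene"], "abstract": ["gene", "cell"]}
print(cluster_loglik(doc))
```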

20.
In this paper, we describe a model of an information retrieval system based on a document re-ranking method that uses document clusters. In the first step, we retrieve documents with the inverted-file method. Next, we analyze the retrieved documents using document clusters and re-rank them. In this step, we use static clusters and a dynamic cluster view, so we can produce clusters tailored to the characteristics of the query. We combine the merits of the inverted-file method and cluster analysis: documents are retrieved with the inverted-file method, and all terms in a document are analyzed by cluster analysis. These two steps yield retrieval results that take into account the context of all terms in a document, not just the query terms. We show that our method achieves significant improvements over ranking based on similarity search alone.
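A toy end-to-end sketch: an inverted file for first-pass retrieval, then a re-ranking that favours documents from the cluster of the top hit. The clustering and tie-breaking are invented simplifications:

```python
# Sketch: first-pass retrieval via an inverted file, then cluster-aware
# re-ranking of the hits (toy documents and fixed cluster labels).
docs = {0: "wheat price export", 1: "wheat farm", 2: "stock price index"}
clusters = {0: "agriculture", 1: "agriculture", 2: "finance"}

inverted = {}
for d, text in docs.items():
    for t in text.split():
        inverted.setdefault(t, set()).add(d)       # inverted-file index

query = ["wheat", "price"]
hits = set().union(*(inverted.get(t, set()) for t in query))
scores = {d: sum(t in docs[d].split() for t in query) for d in hits}

best_cluster = clusters[max(scores, key=scores.get)]
reranked = sorted(hits, key=lambda d: (clusters[d] == best_cluster, scores[d]),
                  reverse=True)                    # cluster-aware re-ranking
print(reranked)
```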
