Similar documents (20 results)
1.
Transfer learning utilizes labeled data available from a related (source) domain to achieve effective knowledge transfer to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationships existing among the documents. In this paper, we propose a novel cross-domain document classification approach called the Link-Bridged Topic model (LBT). LBT consists of two key steps. First, LBT utilizes an auxiliary link network to discover direct or indirect co-citation relationships among documents by embedding this background knowledge into a graph kernel. The mined co-citation relationships are leveraged to bridge the gap across different domains. Second, LBT simultaneously combines the content information and link structures into a unified latent topic model. The model is based on the assumption that the documents of the source and target domains share some common topics from the point of view of both content information and link structure. By mapping the data of both domains into the latent topic spaces, LBT encodes the knowledge about domain commonality and difference as shared topics with associated differential probabilities. The learned latent topics must be consistent with the source and target data, as well as with the content and link statistics. The shared topics then act as a bridge to facilitate knowledge transfer from the source to the target domain. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.

2.
This paper is an interim report on our efforts at NIST to construct an information discovery tool through the fusion of hypertext and information retrieval (IR) technologies. The tool works by parsing a contiguous document base into smaller documents and inserting semantic links between these documents using document–document similarity measures based on IR techniques. The focus of the paper is a case study in which domain experts evaluate the utility of the tool in the performance of information discovery tasks on a large, dynamic procedural manual. The results of the case study are discussed, and their implications for the design of large-scale automatic hypertext generation systems are described.

3.
By applying the functional-equivalence method, the migration from traditional documents to networked documents is realized. On the surface, the method is a transfer of similar functions; in essence, it grafts traditional documents onto the network environment. It accommodates the flexible and changeable characteristics of networked documents while preserving the citation value of traditional documents, making it a useful attempt at standardizing the citation of networked documents.

4.
Document similarity search (i.e. query by example) aims to retrieve a ranked list of documents similar to a query document in a text corpus or on the Web. Most existing approaches to similarity search first compute a pairwise similarity score between each document and the query using a retrieval function or similarity measure (e.g. cosine), and then rank the documents by these scores. In this paper, we propose a novel retrieval approach based on manifold-ranking of document blocks (i.e. blocks of coherent text about a subtopic) to re-rank a small set of documents initially retrieved by some existing retrieval function. The proposed approach makes full use of the intrinsic global manifold structure of the document blocks by propagating ranking scores between the blocks on a weighted graph. First, the TextTiling algorithm and the VIPS algorithm are employed to segment text documents and web pages, respectively, into blocks. Then, each block is assigned a ranking score by the manifold-ranking algorithm. Lastly, a document receives its final ranking score by fusing the scores of its blocks. Experimental results on the TDT data and the ODP data demonstrate that the proposed approach significantly improves retrieval performance over baseline approaches, validating the document block as a better unit than the whole document in the manifold-ranking process.
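The score-propagation step described in item 4 can be sketched as a small manifold-ranking iteration. Everything below (the block-similarity matrix, the seed scores, the parameter values) is an invented toy for illustration, not data or code from the paper:

```python
# Minimal manifold-ranking sketch: f <- alpha * S f + (1 - alpha) * y,
# where S is the symmetrically normalized block-similarity graph and y
# holds the initial retrieval scores.

def manifold_rank(W, init_scores, alpha=0.85, iters=50):
    n = len(W)
    # Symmetric normalization: S = D^{-1/2} W D^{-1/2}
    deg = [sum(row) or 1.0 for row in W]
    S = [[W[i][j] / ((deg[i] * deg[j]) ** 0.5) for j in range(n)]
         for i in range(n)]
    f = list(init_scores)
    for _ in range(iters):
        # Propagate neighbor scores, pulled back toward the prior y.
        f = [alpha * sum(S[i][j] * f[j] for j in range(n))
             + (1 - alpha) * init_scores[i] for i in range(n)]
    return f

# Three "blocks": block 0 is the seed; block 1 is highly similar to it,
# block 2 only weakly, so propagation should rank block 1 above block 2.
W = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
scores = manifold_rank(W, [1.0, 0.0, 0.0])
```

In the paper's full pipeline these nodes would be TextTiling/VIPS blocks, and a document's final score would fuse the scores of its blocks.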

5.
An empirical analysis of highly cited papers based on citation networks: the case of knowledge services
魏瑞斌  陈丹丹 《现代情报》2011,31(3):117-121
A citation network is centered on node documents and links related documents into a network through their citation relationships. This paper uses the h-b index to select 30 highly cited papers in the field of knowledge services and applies cluster analysis to six indicators of their citation networks. The study finds that the six indicators of these papers exhibit four distinct distribution patterns, reflecting that highly cited papers play different roles in the absorption and diffusion of knowledge.

6.
[Purpose/Significance] This paper uses network analysis methods to systematically study the patterns and modes of knowledge flow in citation networks enriched with citation content, so as to provide theoretical and empirical support for knowledge diffusion, transformation, and innovation in citation networks. [Method/Process] Descriptive statistics and network analysis indicators are selected to characterize and analyze in depth the knowledge-flow capability and roles of knowledge nodes, the knowledge-flow types and structures of knowledge communities, and the distribution and structural characteristics of knowledge flow across the whole network. [Result/Conclusion] Using CNKI journal papers as measurement data, citation networks are constructed for three topics, "think tanks", "digital humanities", and "data governance", and the similarities and differences of their knowledge-flow characteristics are compared using the proposed method. The method can deeply mine the knowledge associations between academic documents and remedy the defects of earlier citation-network knowledge-flow research that ignored deep-level citation information. [Innovation/Limitation] This paper applies multiple indicators and methods to systematically study knowledge-flow patterns in citation networks from the perspective of citation content, but it does not extract, from the overall citation network, individual citation networks reflecting particular knowledge attributes for analysis.

7.
The Internet, together with the large amount of textual information available in document archives, has increased the relevance of information retrieval tools. In this work we present an extension of the Gambal system for clustering and visualization of documents based on fuzzy clustering techniques. The tool allows the user to structure the set of documents hierarchically (using a fuzzy hierarchical structure) and represents this structure in a graphical interface (a 3D sphere) over which the user can navigate. Gambal supports the analysis of documents and the computation of their similarity based not only on the syntactic similarity between words but also on a dictionary (WordNet 1.7) and latent semantic analysis.

8.
Automated legal text classification is a prominent research topic in the legal field and lays the foundation for building intelligent legal systems. Current literature focuses on international legal texts, such as Chinese, European, and Australian cases; little attention has been paid to text classification for U.S. legal texts. Deep learning has been applied to improving text classification performance, but its effectiveness needs further exploration in domains such as the legal field. This paper investigates legal text classification with a large collection of labeled U.S. case documents by comparing the effectiveness of different text classification techniques. We propose a machine learning approach using domain concepts as features and random forests as the classifier. Our experimental results on 30,000 full U.S. case documents in 50 categories demonstrate that our approach significantly outperforms a deep learning system built on multiple pre-trained word embeddings and deep neural networks. In addition, applying only the top 400 domain concepts as features for building the random forests achieved the best performance. This study provides a reference for selecting machine learning techniques to build high-performance text classification systems in the legal domain and other fields.
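The concept-as-feature idea in item 8 can be illustrated with a toy feature extractor. The concept list, the document, and the helper function below are all hypothetical; the random-forest classifier that would consume these vectors is left to a library:

```python
# Hypothetical sketch: represent a case document as a vector of domain-concept
# frequencies, the kind of feature vector a random-forest classifier consumes.
# The concept list and document are invented for illustration.

DOMAIN_CONCEPTS = ["negligence", "contract", "damages", "liability"]

def concept_features(text, concepts=DOMAIN_CONCEPTS):
    """Count occurrences of each domain concept in a lowercased document."""
    tokens = text.lower().split()
    return [tokens.count(c) for c in concepts]

doc = "The court found negligence and awarded damages for breach of contract"
features = concept_features(doc)  # -> [1, 1, 1, 0]
```

In the paper's setting, the feature space is limited to the top few hundred domain concepts rather than the full vocabulary, which is what keeps the random forests compact.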

9.
孙海生 《现代情报》2019,39(4):134-142
[Purpose/Significance] Few studies have compared bibliographic coupling and co-citation relationships. This paper compares the differences between the two relationships in establishing links between documents, compares the correlation of coupling/co-citation strength with document similarity, and analyzes which applications each is better suited to. [Method/Process] Based on complex network theory, bibliographic coupling networks and co-citation networks are constructed, and their topological properties are empirically compared. QAP correlation analysis is used to study the relationships between coupling, co-citation, and document content similarity. [Result/Conclusion] Topological analysis shows that coupling establishes more widespread and more stable links between documents, which helps retrieve the majority of documents that have few citations; co-citation establishes closer links among highly cited documents, which helps retrieve and identify core documents in a field. QAP analysis shows that coupling strength correlates more strongly with document similarity, so coupling strength is more reliable when clustering documents to study research topics.

10.
One of the most important problems in information retrieval is determining the order of documents in the answer returned to the user. Many methods and algorithms for document ordering have been proposed. The method introduced in this paper differs from them especially in that it uses a probabilistic model of the document set. In this model, documents are regarded as states of a Markov chain, where transition probabilities are directly proportional to similarities between documents. Steady-state probabilities reflect the similarity of particular documents to the whole answer set. If documents are ordered according to these probabilities, the top of the list holds the documents that are the best representatives of the set, and the bottom those that are the worst. The method was tested against the INSPEC database and the Networked Computer Science Technical Reference Library (NCSTRL). Test results are positive: values of the Kendall rank correlation coefficient indicate high similarity between rankings generated by the proposed method and rankings produced by experts, and the results are comparable with rankings generated by the vector model using the standard tf·idf weighting scheme.
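The Markov-chain ranking in item 10 can be sketched in a few lines: row-normalize a similarity matrix into transition probabilities, then power-iterate to the steady-state distribution. The similarity values below are invented, and the sketch assumes the similarity graph is connected and aperiodic so a unique steady state exists:

```python
# Toy sketch of ranking by Markov-chain steady state: documents are states,
# transition probabilities are proportional to pairwise similarities, and
# the stationary distribution scores how representative each document is.

def steady_state(sim, iters=100):
    """Power iteration on the row-normalized similarity matrix."""
    n = len(sim)
    # Row-normalize similarities into transition probabilities.
    P = [[sim[i][j] / (sum(sim[i]) or 1.0) for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Document 0 is similar to both others, so it should rank highest.
sim = [[0.0, 0.8, 0.7],
       [0.8, 0.0, 0.1],
       [0.7, 0.1, 0.0]]
pi = steady_state(sim)
ranking = sorted(range(3), key=lambda d: -pi[d])
```

Sorting by `pi` puts the best representative of the answer set first and the worst last, which is the ordering the paper compares against expert rankings.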

11.
Patent documents are an ample source of technical and commercial knowledge, and patent analysis has thus long been considered a useful vehicle for R&D management and techno-economic analysis. Among techniques for patent analysis, citation analysis has been the most frequently adopted tool. In this research, we note that citation analysis is subject to some crucial drawbacks and propose a network-based analysis as an alternative. Using an illustrative data set, the overall process of developing a patent network is described. Furthermore, new indexes such as the technology centrality index, the technology cycle index, and technology keyword clusters are suggested for in-depth quantitative analysis. Although network analysis shares some commonality with conventional citation analysis, its relative advantage is substantial: it shows the overall relationship among patents as a visual network, and it provides richer information that enables deeper analysis, since it takes more diverse keywords into account and produces more meaningful indexes. These visuals and indexes can be used to analyze up-to-date trends in high technologies and to identify promising avenues for new product development.

12.
This study proposes a novel extended co-citation search technique: graph-based document retrieval on a co-citation network containing citation context information. The proposed search expands the scope of the target documents by repeatedly spreading co-citation relationships in order to obtain relevant documents that are not identified by traditional co-citation searches. Specifically, the technique combines (a) a graph-based algorithm that computes similarity scores on a complicated network, and (b) co-citation contexts incorporated into the calculation of those scores to reduce the negative effects of an increasing number of irrelevant documents. To evaluate search performance, 10 proposed methods (five representative graph-based algorithms applied to co-citation networks weighted with/without contexts) are compared with two baselines (a traditional co-citation search with/without contexts) in information retrieval experiments on two test collections (biomedicine and computational linguistics articles). The results showed that the normalized discounted cumulative gain (nDCG) scores of the proposed methods using co-citation contexts tended to be higher than those of the baselines. In addition, the combination of the random walk with restart (RWR) algorithm and the context-weighted network achieved the best search performance among the 10 proposed methods. Thus, the combination of graph-based algorithms and co-citation contexts is effective in improving the performance of co-citation search techniques, and sole use of a graph-based algorithm is not enough to improve search performance over the baselines.
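The random walk with restart (RWR) used by the best-performing method in item 12 can be sketched on a toy weighted graph. The edge weights below, standing in for context-weighted co-citation strength, are invented; this is not the paper's implementation:

```python
# RWR sketch: p <- (1 - c) * P^T p + c * e_seed, where c is the restart
# probability and P row-normalizes the (context-weighted) co-citation graph.
# The resulting vector scores each document's proximity to the seed.

def rwr(adj, seed, restart=0.15, iters=100):
    """Return RWR proximity of every node to `seed`."""
    n = len(adj)
    # Row-normalize edge weights into transition probabilities.
    P = [[adj[i][j] / (sum(adj[i]) or 1.0) for j in range(n)]
         for i in range(n)]
    p = [0.0] * n
    p[seed] = 1.0
    for _ in range(iters):
        p = [(1 - restart) * sum(p[i] * P[i][j] for i in range(n))
             + (restart if j == seed else 0.0) for j in range(n)]
    return p

# Documents 0 and 1 are strongly co-cited; document 2 is weakly attached,
# so its proximity to the seed should be lowest.
adj = [[0.0, 1.0, 0.2],
       [1.0, 0.0, 0.2],
       [0.2, 0.2, 0.0]]
prox = rwr(adj, seed=0)
```

Repeated spreading is what lets the extended search reach documents that share no direct co-citation with the seed.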

13.
[Purpose/Significance] Effectively fusing multiple data sources in citation networks, such as citation relationships and textual attributes, strengthens the semantic associations between document nodes and thereby supports tasks such as data mining and knowledge discovery. [Method/Process] A knowledge representation method for citation networks is proposed: a neural network model first learns the k-order proximity structure of the citation network; the doc2vec model then learns textual attributes such as titles and abstracts; finally, a cross-learning mechanism based on vector sharing is given for fusing the multiple data sources. [Result/Conclusion] Tests on a CNKI citation dataset in the stem-cell field achieved good performance on link prediction, demonstrating the validity and soundness of the method.

14.
Opinion mining is one of the most important research tasks in the information retrieval community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach for opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage, which itself focuses on detecting opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information, hypothesizing that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches treat and choose their opinion sources according to their test collection, then calculate an opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The analyzed document and the reference collection are represented by different language models (the Dirichlet, Jelinek-Mercer, and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries; in our study, we modify them to represent opinionated documents. We carry out several experiments using the Text REtrieval Conference (TREC) Blogs 06 collection as our analysis collection and the Internet Movie Database (IMDB), Multi-Perspective Question Answering (MPQA), and CHESLY corpora as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and reference collection, alongside different combinations of opinion and retrieval scores, and deduce the best opinion detection models. Using the best models, our approach improves on the best TREC Blog baseline (baseline4) by 30%.
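A hedged sketch of the language-model scoring in item 14: score a document by the likelihood of its terms under a Dirichlet-smoothed model of an opinionated reference collection. The tiny collections, the uniform background model, and `mu=10` are simplifications for illustration; real systems use a corpus-level background model and a smoothing parameter on the order of thousands:

```python
# Sketch of Dirichlet-smoothed language-model opinion scoring: a document
# whose terms are likely under the opinionated reference collection's model
# gets a higher (less negative) log-likelihood.

import math
from collections import Counter

def dirichlet_lm_score(doc_tokens, ref_tokens, mu=10):
    """Log-likelihood of doc under a Dirichlet-smoothed model of ref."""
    ref_tf = Counter(ref_tokens)
    ref_len = len(ref_tokens)
    # Simplified background model: uniform over the observed vocabulary.
    vocab = set(ref_tokens) | set(doc_tokens)
    bg = 1.0 / len(vocab)
    score = 0.0
    for t in doc_tokens:
        p = (ref_tf[t] + mu * bg) / (ref_len + mu)
        score += math.log(p)
    return score

# Invented reference collection and documents for illustration.
opinionated_ref = "great terrible love hate awful wonderful boring".split()
doc_a = "i love this wonderful film".split()    # opinion-heavy
doc_b = "the film runs ninety minutes".split()  # mostly factual

score_a = dirichlet_lm_score(doc_a, opinionated_ref)
score_b = dirichlet_lm_score(doc_b, opinionated_ref)
```

Because `doc_a` shares subjective terms with the reference collection, it scores higher than the factual `doc_b`, which is the signal the opinion-detection stage ranks on.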

15.
Text clustering is a well-known method for information retrieval, and numerous methods for classifying words, documents, or both together have been proposed. Frequently, textual data are encoded using vector models, so the corpus is transformed into a matrix of terms by documents; using this representation, text clustering generates groups of similar objects on the basis of the presence/absence of words in the documents. An alternative way to work with texts is to represent them as a network where nodes are entities connected by the presence and distribution of words in the documents. In this work, after summarising the state of the art of text clustering, we present a new network approach to textual data. We undertake text co-clustering using methods developed for social network analysis. Several experimental results are presented to demonstrate the validity of the approach and its advantages compared to existing methods.

16.
In this paper, we describe a model of an information retrieval system based on a document re-ranking method using document clusters. In the first step, we retrieve documents with the inverted-file method. Next, we analyze the retrieved documents using document clusters and re-rank them. In this step, we use static clusters and a dynamic cluster view, which together produce clusters tailored to the characteristics of the query. We focus on the merits of the inverted-file method and cluster analysis: we retrieve documents based on the inverted-file method and analyze all terms in a document based on cluster analysis. Through these two steps, we obtain retrieval results that take into account the context of all terms in a document as well as the query terms. We show that our method achieves significant improvements over a method based on similarity-search ranking alone.

17.
This paper presents a robust and comprehensive graph-based rank aggregation approach, used to combine the results of isolated ranker models in retrieval tasks. The method follows an unsupervised scheme that is independent of how the isolated ranks are formulated. Our approach is able to combine arbitrary models defined in terms of different ranking criteria, such as those based on textual, image, or hybrid content representations. We reformulate the ad-hoc retrieval problem as document retrieval based on fusion graphs, which we propose as a new unified representation model capable of merging multiple ranks and automatically expressing the inter-relationships of retrieval results. By doing so, we claim that the retrieval system can benefit from learning the manifold structure of datasets, leading to more effective results. Another contribution is that our graph-based aggregation formulation, unlike existing approaches, allows for encapsulating contextual information encoded from multiple ranks, which can be used directly for ranking without further computations or post-processing steps over the graphs. Based on the graphs, a novel similarity retrieval score is formulated using an efficient computation of minimum common subgraphs. A further benefit over existing approaches is the absence of hyperparameters. A comprehensive experimental evaluation was conducted on diverse well-known public datasets composed of textual, image, and multimodal documents. The experiments demonstrate that our method reaches top performance, yielding better effectiveness scores than state-of-the-art baselines and promoting large gains over the rankers being fused, demonstrating the capability of the proposal to represent queries based on a unified graph-based model of rank fusions.

18.
Interdocument similarities are the fundamental information source required in cluster-based retrieval, an advanced retrieval approach that significantly improves performance during information retrieval (IR). An effective similarity metric is query-sensitive similarity, introduced by Tombros and van Rijsbergen as a method to more directly satisfy the cluster hypothesis that forms the basis of cluster-based retrieval. Although this method is reported to be effective, existing applications of query-sensitive similarity are still limited to vector space models with no connection to probabilistic approaches. We suggest a probabilistic framework that defines query-sensitive similarity based on probabilistic co-relevance, where the similarity between two documents is proportional to the probability that they are both co-relevant to a specific given query. We further simplify the proposed co-relevance-based similarity by decomposing it into two separate relevance models, and then formulate all the requisite components of the proposed similarity metric in terms of scoring functions used by language modeling methods. Experimental results obtained using standard TREC test collections consistently showed that the proposed query-sensitive similarity measure performs better than term-based similarity and existing query-sensitive similarity in the context of Voorhees' nearest neighbor test (NNT).
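The co-relevance decomposition in item 18 can be caricatured as a product of two per-document relevance scores. The overlap-based relevance function below is a crude stand-in for the paper's language-modeling scoring functions, and the query and documents are invented:

```python
# Toy sketch of co-relevance-based query-sensitive similarity:
# sim(d1, d2 | q) ~ P(R | d1, q) * P(R | d2, q), decomposing co-relevance
# into two separate (here, independence-assumed) relevance models.

def relevance(doc_tokens, query_tokens):
    """Stand-in relevance model: fraction of query terms present in the doc."""
    hits = sum(1 for t in query_tokens if t in doc_tokens)
    return hits / len(query_tokens)

def co_relevance_sim(d1, d2, query):
    """Similarity of d1 and d2 conditioned on the query."""
    return relevance(d1, query) * relevance(d2, query)

q = ["neural", "ranking"]
d1 = ["neural", "ranking", "models", "survey"]
d2 = ["neural", "networks", "overview"]
d3 = ["cooking", "recipes"]
```

Under this query, `d1` and `d2` come out more similar than `d1` and `d3`, even though term overlap between `d1` and `d3` is not the quantity being measured: the similarity is driven entirely by each document's relevance to the query, which is the defining property of query-sensitive similarity.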

19.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. Generally speaking, it is one of the most important methods for organizing and making use of the gigantic amounts of information that exist in unstructured textual format, and it is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words, where the words (i.e., terms) are cut off from their finer context, namely their location in a sentence or in a document. Only the broader context of the document is used, with some type of term frequency information, in the vector space. Consequently, the semantics of words that can be inferred from the finer context of their location in a sentence and their relations with neighboring words are usually ignored. However, the meaning of words and the semantic connections between words, documents, and even classes are clearly important, since methods that capture semantics generally reach better classification performance. Several surveys have been published analyzing diverse approaches to traditional text classification, and most cover the application of different semantic term-relatedness methods in text classification to some degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. To fill this gap, we undertake a comprehensive discussion of semantic versus traditional text classification. This survey explores past and recent advancements in semantic text classification and organizes existing approaches under five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning-based approaches, word/character sequence-enhanced approaches, and linguistically enriched approaches. Furthermore, it highlights the advantages of semantic text classification algorithms over traditional text classification algorithms.

20.
Traditional information retrieval techniques that rely primarily on keyword-based linking of the query and document spaces face challenges such as the vocabulary mismatch problem, where documents relevant to a given query might not be retrieved simply because they use different terminology to describe the same concepts. Semantic search techniques aim to address such limitations of keyword-based retrieval models by incorporating semantic information from standard knowledge bases such as Freebase and DBpedia. The literature has already shown that while the sole consideration of semantic information might not improve retrieval performance over keyword-based search, it enables the retrieval of relevant documents that keyword-based methods cannot retrieve. As such, building indices that store and provide access to semantic information during the retrieval process is important. While the process of building and querying keyword-based indices is quite well understood, the incorporation of semantic information within search indices remains an open challenge. Existing work has proposed either building one unified index encompassing both textual and semantic information or building separate yet integrated indices for each information type, but these approaches face limitations such as increased query processing time. In this paper, we propose to use neural embedding-based representations of terms, semantic entities, semantic types, and documents within the same embedding space to facilitate the development of a unified search index consisting of these four information types. We perform experiments on standard and widely used document collections, including Clueweb09-B and Robust04, to evaluate the proposed indexing strategy from both effectiveness and efficiency perspectives. Based on our experiments, we find that when neural embeddings are used to build inverted indices, thereby relaxing the requirement to explicitly observe the posting-list key in the indexed document: (a) retrieval efficiency increases compared to a standard inverted index, reducing index size and query processing time; and (b) while retrieval efficiency, the main objective of an efficient indexing mechanism, improves under our proposed method, retrieval effectiveness also remains competitive with the baseline in terms of retrieving a reasonable number of relevant documents from the indexed corpus.
