Similar Documents
20 similar documents found.
1.
Computing Semantic Similarity (SS) between concepts is one of the most critical issues in many domains such as Natural Language Processing and Artificial Intelligence. Over the years, several SS measurement methods have been proposed that exploit different knowledge resources. Wikipedia provides a large domain-independent encyclopedic repository and a semantic network for computing SS between concepts. Traditional feature-based measures rely on linear combinations of different properties and suffer from two main limitations: insufficient information and the loss of semantic information. In this paper, we propose several hybrid SS measurement approaches that use both the Information Content (IC) and the features of concepts, thereby avoiding the limitations above. To integrate discrete properties into a single component, we present two models of semantic representation, called CORM and CARM. We then compute SS based on these models and use the IC of categories as a supplement to the SS measurement. The evaluation, based on several widely used benchmarks and a benchmark we developed ourselves, supports the intuitions with respect to human judgments. In summary, our approaches determine SS between concepts more efficiently and correlate better with human judgments than previous methods such as Word2Vec and NASARI.
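The abstract does not spell out the CORM/CARM formulas, but the role Information Content plays in such hybrid measures can be illustrated with a classic IC-based score. The sketch below implements Lin's similarity on assumed toy inputs: the concept frequency table and the least common subsumer passed in are made up for illustration and are not taken from the paper.

```python
import math

# Hypothetical inputs: concept frequencies from some corpus, and the least common
# subsumer of two concepts in a taxonomy (e.g., Wikipedia categories) assumed given.
freq = {"dog": 120, "cat": 150, "mammal": 800, "animal": 2000, "entity": 10000}
total = sum(freq.values())

def information_content(concept):
    """IC(c) = -log p(c), with p(c) estimated from the (assumed) corpus counts."""
    return -math.log(freq[concept] / total)

def lin_similarity(c1, c2, lcs):
    """Lin's IC-based similarity: 2 * IC(lcs) / (IC(c1) + IC(c2))."""
    return 2 * information_content(lcs) / (information_content(c1) + information_content(c2))

print(round(lin_similarity("dog", "cat", lcs="mammal"), 3))
```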

2.
Wikipedia provides a huge, collaboratively built semi-structured taxonomy called the Wikipedia category graph (WCG), which can be used as a Knowledge Graph (KG) to measure the semantic similarity (SS) between Wikipedia concepts. Previously, several Most Informative Common Ancestor-based (MICA-based) SS methods have been proposed that intrinsically exploit the taxonomic structure of the WCG. However, basic structural issues of the WCG, such as its huge size, high branching factor, and multiple-inheritance relations, hamper the applicability of traditional MICA-based and multiple inheritance-based approaches. Therefore, in this paper, we propose a solution to handle these structural issues and present a new multiple inheritance-based SS approach, called Neighborhood Ancestor Semantic Contribution (NASC). In this approach, we first define the neighborhood of a category (a taxonomic concept in the WCG) to delimit its semantic space. Second, we describe the semantic value of a category by aggregating the intrinsic IC-based semantic contribution weights of its semantically relevant multiple ancestors. Third, based on our approach, we propose six different methods to compute the SS between Wikipedia concepts. Finally, we evaluate our methods on gold-standard word similarity benchmarks for English, German, Spanish and French. The experimental evaluation demonstrates that the proposed NASC-based methods remarkably outperform traditional MICA-based and multiple inheritance-based approaches.
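The exact NASC contribution weights are not given in the abstract; as a rough illustration of the ingredients it mentions, the sketch below computes a Seco-style intrinsic IC from a toy category graph and aggregates the IC of a category's ancestors into a simple "semantic value". The toy taxonomy and the aggregation rule are assumptions, not the paper's definitions.

```python
import math

# Hypothetical toy category graph: category -> set of parent categories
# (multiple inheritance allowed).
parents = {
    "Dogs": {"Domesticated animals", "Canids"},
    "Canids": {"Mammals"},
    "Domesticated animals": {"Animals"},
    "Mammals": {"Animals"},
    "Animals": set(),
}
N = len(parents)  # total number of categories in the toy graph

def ancestors(cat):
    """All transitive ancestors of a category."""
    seen, stack = set(), list(parents[cat])
    while stack:
        a = stack.pop()
        if a not in seen:
            seen.add(a)
            stack.extend(parents[a])
    return seen

def intrinsic_ic(cat):
    """Seco-style intrinsic IC: 1 - log(hypo(c) + 1) / log(N)."""
    hypo = sum(1 for c in parents if cat in ancestors(c))  # transitive descendants
    return 1.0 - math.log(hypo + 1) / math.log(N)

def semantic_value(cat):
    """Toy semantic value: IC of the category plus the IC contributions of its ancestors."""
    return intrinsic_ic(cat) + sum(intrinsic_ic(a) for a in ancestors(cat))

print(round(semantic_value("Dogs"), 3))
```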

3.
Automatic text summarization has been an active field of research for many years. Several approaches have been proposed, ranging from simple position and word-frequency methods to learning and graph-based algorithms. The advent of human-generated knowledge bases like Wikipedia offers a further possibility for text summarization: they can be used to understand the input text in terms of salient concepts from the knowledge base. In this paper, we study a novel approach that leverages Wikipedia in conjunction with graph-based ranking. Our approach is to first construct a bipartite sentence–concept graph, and then rank the input sentences using iterative updates on this graph. We consider several models for the bipartite graph and derive convergence properties under each model. We then take up personalized and query-focused summarization, where the sentence ranks additionally depend on user interests and queries, respectively. Finally, we present a Wikipedia-based multi-document summarization algorithm. An important feature of the proposed algorithms is that they enable real-time incremental summarization: users can first view an initial summary and then request additional content if interested. We evaluate the performance of our proposed summarizer using the ROUGE metric, and the results show that leveraging Wikipedia can significantly improve summary quality. We also present results from a user study, which suggests that incremental summarization can help in better understanding news articles.
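The abstract does not give the exact update equations, so the following is only a minimal sketch of the general idea of ranking sentences by iterative updates on a bipartite sentence–concept graph. The incidence matrix, damping factor and iteration count are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical weighted sentence-concept incidence matrix W (rows: sentences, cols: concepts),
# e.g., W[i, j] = strength with which sentence i mentions Wikipedia concept j.
W = np.array([
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 1.0],
    [0.5, 0.0, 1.0],
])

def rank_sentences(W, damping=0.85, iters=50):
    """HITS-style mutual reinforcement on the bipartite graph:
    sentence and concept scores update each other for a fixed number of iterations."""
    n_sent, _ = W.shape
    s = np.ones(n_sent) / n_sent
    for _ in range(iters):
        c = W.T @ s                                      # concepts scored by the sentences mentioning them
        c /= c.sum()
        s = (1 - damping) / n_sent + damping * (W @ c)   # sentences scored by their concepts
        s /= s.sum()
    return s

print(np.round(rank_sentences(W), 3))
```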

4.
Identifying and extracting user communities is an important step towards understanding social network dynamics from a macro perspective. For this reason, the work in this paper explores various aspects related to the identification of user communities. To date, user community detection methods employ either explicit links between users (link analysis), or users' topics of interest in posted content (content analysis), or both in tandem. Little work has considered temporal evolution when identifying user communities, in a way that groups together those users who share not only similar topical interests but also similar temporal behavior towards their topics of interest. In this paper, we identify user communities through multimodal feature learning (embeddings). Our core contributions are as follows: (a) we propose a new method for learning neural embeddings for users based on their temporal content similarity; (b) we learn user embeddings based on their social network connections (links) through neural graph embeddings; (c) we systematically interpolate temporal content-based embeddings and social link-based embeddings to capture both social network connections and temporal content evolution for representing users; and (d) we systematically evaluate the quality of each embedding type in isolation and when interpolated together, and demonstrate their performance on a Twitter dataset under two different application scenarios, namely news recommendation and user prediction. We find that (1) content-based methods produce higher quality communities compared to link-based methods; (2) methods that consider the temporal evolution of content, our proposed method in particular, perform better than their non-temporal counterparts; (3) communities produced when time is explicitly incorporated in user vector representations have higher quality than those produced when time is incorporated into a generative process; and (4) while link-based methods are weaker than content-based methods, their interpolation with content-based methods leads to improved quality of the identified communities.
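As a hedged illustration of contribution (c), interpolating two embedding spaces before community detection might look like the sketch below. The embeddings are random stand-ins, the mixing weight alpha is arbitrary, and k-means is used only as a generic stand-in for whatever community detection step the paper actually applies.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical pre-trained embeddings for 100 users: one matrix learned from temporal
# content, one from social links (dimensions and values are assumptions).
content_emb = rng.normal(size=(100, 64))
link_emb = rng.normal(size=(100, 64))

def interpolate(content, link, alpha=0.5):
    """Linear interpolation of the two (L2-normalised) user representations;
    alpha is a tunable mixing weight."""
    c = content / np.linalg.norm(content, axis=1, keepdims=True)
    l = link / np.linalg.norm(link, axis=1, keepdims=True)
    return alpha * c + (1 - alpha) * l

users = interpolate(content_emb, link_emb, alpha=0.7)
communities = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(users)
print(np.bincount(communities))  # community sizes
```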

5.
Entity disambiguation is a fundamental task of Semantic Web annotation. Entity Linking (EL) is an essential procedure in entity disambiguation, which aims to link a mention appearing in plain text to a structured or semi-structured knowledge base, such as Wikipedia. Existing research on EL usually annotates the mentions in a text one by one and treats entities as independent of each other. However, this assumption might not hold in many application scenarios. For example, if two mentions appear in one text, they are likely to have certain intrinsic relationships. In this paper, we first propose a novel query expansion method for candidate generation that utilizes the information of co-occurring mentions. We further propose a re-ranking model which can be iteratively adjusted based on the predictions of the previous round. Experiments on real-world data demonstrate the effectiveness of our proposed methods for entity disambiguation.

6.
Cross-Lingual Link Discovery (CLLD) is a new problem in Information Retrieval. The aim is to automatically identify meaningful and relevant hypertext links between documents in different languages. This is particularly helpful for knowledge discovery when a multilingual knowledge base is sparse in one language or another, or when the topical coverage in each language is different; such is the case with Wikipedia. Techniques for identifying new and topically relevant cross-lingual links are a current topic of interest at NTCIR, where the CrossLink task has been running since NTCIR-9 in 2011. This paper presents the evaluation framework used to benchmark cross-lingual link discovery algorithms in the context of NTCIR-9.

7.
Dynamic link prediction is a critical task in network research that seeks to predict future network links based on the behavior of prior network changes. However, most existing methods overlook mutual interactions between neighbors and long-distance interactions, and lack interpretability of the model's predictions. To tackle these issues, in this paper we propose a temporal group-aware graph diffusion network (TGGDN). First, we construct a group affinity matrix to describe mutual interactions between neighbors, i.e., group interactions. Then, we merge the group affinity matrix into the graph diffusion to form a group-aware graph diffusion, which simultaneously captures group interactions and long-distance interactions in dynamic networks. Additionally, we present a transformer block that models the temporal information of dynamic networks using self-attention, allowing the TGGDN to pay greater attention to task-related snapshots while also providing interpretability for better understanding the network's evolutionary patterns. We compare the proposed TGGDN with state-of-the-art methods on five real-world datasets of different sizes, ranging from 1k to 20k nodes. Experimental results show that TGGDN achieves average improvements of 8.3% and 3.8% in terms of ACC and AUC, respectively, across all datasets, demonstrating its superiority in the dynamic link prediction task.
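The paper's precise diffusion and affinity definitions are not given in the abstract; the sketch below only illustrates the general pattern of blending an adjacency matrix with a group-affinity matrix and running a truncated, personalized-PageRank-style diffusion over the result. The affinity definition (shared-neighbour counts), the blend weight and the diffusion coefficients are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6

# Hypothetical symmetric adjacency matrix of one network snapshot and a toy
# group-affinity matrix (shared-neighbour counts; an assumption, not the paper's definition).
A = (rng.random((n, n)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T
G = A @ A

def group_aware_diffusion(A, G, beta=0.5, alpha=0.15, K=10):
    """Truncated personalized-PageRank-style diffusion over a transition matrix that
    blends the adjacency with the group-affinity matrix (all weights are assumptions)."""
    M = beta * A + (1 - beta) * G
    M = M / np.maximum(M.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
    S = np.zeros_like(M)
    P = np.eye(M.shape[0])
    for k in range(K):
        S += alpha * (1 - alpha) ** k * P
        P = P @ M
    return S  # S[i, j] serves as a diffusion score for ranking candidate links (i, j)

scores = group_aware_diffusion(A, G)
print(np.round(scores[0], 3))
```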

8.
Automatic text summarization attempts to provide an effective solution to today's unprecedented growth of textual data. This paper proposes an innovative graph-based text summarization framework for generic single- and multi-document summarization. The summarizer benefits from two well-established text semantic representation techniques, Semantic Role Labelling (SRL) and Explicit Semantic Analysis (ESA), as well as the constantly evolving collective human knowledge in Wikipedia. SRL is used for sentence-level semantic parsing, and the word tokens are represented as vectors of weighted Wikipedia concepts using the ESA method. The essence of the developed framework is to construct a unique concept-graph representation underpinned by semantic role-based multi-node (below sentence level) vertices for summarization. We empirically evaluated the summarization system using the standard publicly available dataset from the Document Understanding Conference 2002 (DUC 2002). Experimental results indicate that the proposed summarizer outperforms all state-of-the-art comparators in single-document summarization on the ROUGE-1 and ROUGE-2 measures, while ranking second on the ROUGE-1 and ROUGE-SU4 scores for multi-document summarization. The testing also demonstrates the scalability of the system: varying the evaluation data size is shown to have little impact on the summarizer's performance, particularly for the single-document summarization task. In a nutshell, the findings demonstrate the power of the role-based and vectorial semantic representation when combined with the crowd-sourced knowledge base of Wikipedia.

9.
This paper describes the development and testing of a novel Automatic Search Query Enhancement (ASQE) algorithm, the Wikipedia N Sub-state Algorithm (WNSSA), which utilises Wikipedia as the sole data source for prior knowledge. The algorithm is built upon the concept of iterative states and sub-states, harnessing Wikipedia's data set and link information to identify and utilise recurring terms that aid term selection and weighting during enhancement. It is designed to prevent query drift by making callbacks to the user's original search intent, persisting the original query between internal states alongside the additional selected enhancement terms. The developed algorithm has been shown to improve both short and long queries by providing a better understanding of the query and the available data. The proposed algorithm was compared against five existing ASQE algorithms that utilise Wikipedia as the sole data source, showing an average Mean Average Precision (MAP) improvement of 0.273 over the tested existing ASQE algorithms.
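Since the headline result is a MAP improvement, a minimal reference implementation of Mean Average Precision may be useful. The toy rankings and relevance judgments below are made up and have nothing to do with the WNSSA experiments.

```python
def average_precision(ranked_docs, relevant):
    """AP for one query: mean of precision@k at each rank k where a relevant document appears."""
    hits, precisions = 0, []
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a set of queries; `runs` maps query -> (ranking, relevant set)."""
    aps = [average_precision(ranking, rel) for ranking, rel in runs.values()]
    return sum(aps) / len(aps)

# Fabricated example data for two queries.
runs = {
    "q1": (["d3", "d1", "d7"], {"d1", "d7"}),
    "q2": (["d2", "d5", "d4"], {"d5"}),
}
print(round(mean_average_precision(runs), 3))
```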

10.
This study examined: (1) whether a peripheral cue and subject knowledge influenced credibility judgments in the context of Wikipedia; and (2) whether certain factors affected heuristic processing in the context of Wikipedia. The theory of bounded rationality and the heuristic-systematic model serve as the basis of this study. Data were collected employing a quasi-experiment and a web survey at a large public university in the Midwestern United States in the fall of 2011. The study participants consisted of undergraduate students from nine courses whose instructors agreed to their participation. A total of 142 students participated in the study, yielding 138 usable surveys. The major findings of this study include the following: a peripheral cue and knowledge influenced the credibility judgments of college students concerning Wikipedia. The effect of a peripheral cue on credibility judgments did not differ between those with high versus low knowledge. Finally, perceived credibility was positively related to heuristic processing, but knowledge, cognitive workload, and involvement in a topic were not. This study suggests that educators and librarians need to integrate heuristic approaches into their literacy programs, guiding students to use cues effectively rather than accept them blindly. Wikipedia needs to offer noticeable cues that can help readers assess the credibility of information. The role of perceptions in heuristic processing needs further investigation. Further, this study demonstrates the strength of a peripheral cue on credibility judgments, suggesting that further research is needed into when cues lead to effective credibility judgments and when they lead to biased ones. Finally, this study suggests an integrated model of the theory of bounded rationality and the heuristic-systematic model that can enhance our understanding of heuristics in relation to credibility judgments.

11.
Wikipedia is known as a free online encyclopedia. Wikipedia uses largely transparent writing and editing processes, which aim at providing the user with quality information through a democratic collaborative system. However, one aspect of these processes is not transparent: the identity of contributors, editors, and administrators. We argue that this particular lack of transparency jeopardizes the validity of the information being produced by Wikipedia. We analyze the social and ethical consequences of this lack of transparency in Wikipedia for all users, but especially students; we assess the corporate social performance issues involved, and we propose courses of action to compensate for the potential problems. We show that Wikipedia has the appearance, but not the reality, of responsible, transparent information production. This paper's authors are the same as those who authored Wood, D. J. and Queiroz, A. 2008. Information versus knowledge: Transparency and social responsibility issues for Wikipedia. In Antonino Vaccaro, Hugo Horta, and Peter Madsen (Eds.), Transparency, Information, and Communication Technology (pp. 261–283). Charlottesville, VA: Philosophy Documentation Center. Adele has changed her surname from Queiroz to Santana.

12.
赵辉, 刘怀亮. 《现代情报》 (Journal of Modern Information), 2013, 33(10): 70–74
To address the problem that question short texts in community question answering systems contain few feature words and carry weak descriptive information, this paper uses Wikipedia for feature expansion to assist the classification of short Chinese question texts. First, sets of concepts related to each term are extracted from Wikipedia concepts and link information, and the relatedness between concepts is computed by jointly using the link structure and the category hierarchy. Then, feature expansion is performed on the basis of the related-concept sets to supplement the semantic information of the text features. Experimental results show that the proposed feature-expansion-based short-text classification algorithm effectively improves the classification of short question texts.
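A minimal sketch of the feature-expansion idea described here, under the assumption that a term-to-related-concept mapping with relatedness weights has already been derived from Wikipedia. The mapping, the threshold and the concept names below are invented purely for illustration.

```python
from collections import Counter

# Hypothetical mapping from terms to related Wikipedia concepts with relatedness weights
# (in the paper these come from Wikipedia links and the category hierarchy; here they are made up).
related_concepts = {
    "python": {"Programming language": 0.9, "Scripting language": 0.7},
    "pandas": {"Data analysis": 0.8, "Python library": 0.9},
}

def expand_features(tokens, threshold=0.6):
    """Bag-of-words features plus weighted related-concept features above a relatedness threshold."""
    features = Counter(tokens)
    for tok in tokens:
        for concept, weight in related_concepts.get(tok, {}).items():
            if weight >= threshold:
                features[f"CONCEPT::{concept}"] += weight
    return dict(features)

print(expand_features(["how", "to", "use", "pandas", "in", "python"]))
```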

13.
Recent developments have shown that entity-based models that rely on information from the knowledge graph can improve document retrieval performance. However, given the non-transitive nature of relatedness between entities in the knowledge graph, the use of semantic relatedness measures can lead to topic drift. To address this issue, we propose a relevance-based model for entity selection based on pseudo-relevance feedback, which is then used to systematically expand the input query, leading to improved retrieval performance. We perform our experiments on the widely used TREC Web corpora and empirically show that our proposed approach to entity selection significantly improves ad hoc document retrieval compared to strong baselines. More concretely, the contributions of this work are as follows: (1) we introduce a graphical probability model that captures dependencies between entities within the query and documents; (2) we propose an unsupervised entity selection method based on the graphical model for query entity expansion and then for ad hoc retrieval; (3) we thoroughly evaluate our method and compare it with state-of-the-art keyword- and entity-based retrieval methods. We demonstrate that the proposed retrieval model shows improved performance over all the other baselines on ClueWeb09B and ClueWeb12B, two widely used Web corpora, on the [email protected] and [email protected] metrics. We also show that the proposed method is most effective on difficult queries. In addition, we compare our proposed entity selection with a state-of-the-art entity selection technique within the context of ad hoc retrieval using a basic query expansion method, and illustrate that it provides more effective retrieval for all expansion weights and different numbers of expansion entities.
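The paper's graphical model is not described in enough detail in the abstract to reproduce, so the sketch below only shows the generic pseudo-relevance-feedback pattern it builds on: weighting candidate expansion entities by their score-weighted frequency in the top-ranked feedback documents. The feedback documents, scores and entity annotations are fabricated examples.

```python
from collections import Counter, defaultdict

# Hypothetical top-ranked feedback documents for a query, each with a retrieval score
# and a list of entities annotated in it (all data here is made up).
feedback_docs = [
    (2.1, ["barack_obama", "white_house", "election"]),
    (1.7, ["election", "senate", "white_house"]),
    (1.2, ["barack_obama", "chicago"]),
]

def prf_entity_weights(feedback_docs, k=3):
    """Relevance-model-style pseudo-relevance feedback: weight each candidate entity by the
    score-weighted relative frequency with which it occurs in the top feedback documents."""
    weights = defaultdict(float)
    total_score = sum(score for score, _ in feedback_docs)
    for score, entities in feedback_docs:
        counts = Counter(entities)
        doc_len = sum(counts.values())
        for ent, c in counts.items():
            weights[ent] += (score / total_score) * (c / doc_len)
    top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    norm = sum(w for _, w in top)
    return [(e, round(w / norm, 3)) for e, w in top]

print(prf_entity_weights(feedback_docs))
```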

14.
Semantic information in judgement documents has been an important resource for Artificial Intelligence and Law. Sequential representation is the traditional structure for analyzing judgement documents and supporting the legal charge prediction task; its main problem is that it is not effective at representing criminal semantic information. In this paper, to represent and verify criminal semantic information such as multi-linked legal features, we propose a novel criminal semantic representation model, which constructs a Criminal Action Graph (CAG) by extracting criminal actions linked by two types of temporal relationship. Based on the CAG, a Graph Convolutional Network is adopted as the predictor for legal charge prediction. We evaluate the validity of the CAG on confusing charges, using 32,000 judgement documents across five confusing-charge sets. The CAG reaches about 88% accuracy on average, more than 3% above the compared model. The experimental standard deviation, about 0.0032 on average and thus nearly zero, also shows the stability of our model. The results show the effectiveness of our model for representing and using the semantic information in judgement documents.
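For readers unfamiliar with the predictor, a single graph-convolution layer of the standard Kipf–Welling form applied to a toy action graph might look like the sketch below. The adjacency matrix, feature dimensions and mean-pooling readout are assumptions; they are not the paper's CAG construction or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny action graph: adjacency over 4 criminal-action nodes and random node features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = rng.normal(size=(4, 8))          # node features (made up)
W = rng.normal(size=(8, 5))          # layer weights (random here; learned in practice)

def gcn_layer(A, X, W):
    """One graph-convolution layer: relu(D^-1/2 (A + I) D^-1/2 X W)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

H = gcn_layer(A, X, W)
graph_repr = H.mean(axis=0)          # mean-pooled graph representation feeding a charge classifier (sketch only)
print(graph_repr.shape)
```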

15.
Machine reading comprehension (MRC) is a challenging task in the field of artificial intelligence. Most existing MRC works contain a semantic matching module, either explicitly or intrinsically, to determine whether a piece of context answers a question. However, there is scant work that systematically evaluates different paradigms of using semantic matching in MRC. In this paper, we conduct a systematic empirical study on semantic matching. We formulate a two-stage framework, consisting of a semantic matching model and a reading model, based on pre-trained language models. We compare and analyze the effectiveness and efficiency of semantic matching modules with different setups on four types of MRC datasets. We verify that using semantic matching before a reading model improves both the effectiveness and efficiency of MRC. Compared with answering questions by extracting information from concise context, we observe that semantic matching yields larger improvements for answering questions with noisy and adversarial context. Matching coarse-grained context (e.g., paragraphs) to questions is more effective than matching fine-grained context (e.g., sentences and spans). We also find that semantic matching is helpful for answering who/where/when/what/how/which questions, whereas it decreases MRC performance on why questions. This may imply that semantic matching helps to answer a question whose necessary information can be retrieved from a single sentence. The above observations demonstrate the advantages and disadvantages of using semantic matching in different scenarios.

16.
This paper addresses the problem of the automatic recognition and classification of temporal expressions and events in human language. Efficacy in these tasks is crucial if the broader task of temporal information processing is to be performed successfully. We analyze whether the application of semantic knowledge to these tasks improves the performance of current approaches. We therefore present and evaluate a data-driven approach as part of a system: TIPSem. Our approach uses lexical semantics and semantic roles as additional information to extend classical approaches, which are principally based on morphosyntax. The results obtained for English show that semantic knowledge aids temporal expression and event recognition, achieving error reductions of 59% and 21%, respectively, while in classification the contribution is limited. From the analysis of the results it may be concluded that the application of semantic knowledge leads to more general models and aids in the recognition of temporal entities that are ambiguous at shallower levels of language analysis. We also found that lexical semantics and semantic roles have complementary advantages and that it is useful to combine them. Finally, we carried out the same analysis for Spanish, and the results show comparable advantages. This supports the hypothesis that applying the proposed semantic knowledge may be useful for different languages.

17.
In this work, we present the first quality-flaw prediction study for articles containing the two most frequent verifiability flaws in Spanish Wikipedia: articles that do not cite any references or sources at all (termed Unreferenced) and articles that need additional citations for verification (so-called Refimprove). Based on the underlying characteristics of each flaw, different state-of-the-art approaches were evaluated. For articles not citing any references, a well-established rule-based approach was evaluated, and interesting findings show that some of these articles suffer from the Refimprove flaw instead. Likewise, for articles that need additional citations for verification, the well-known PU learning and one-class classification approaches were evaluated. In addition, new methods were compared and a new feature was proposed to model the latter flaw. The results showed that new methods, such as under-bagged decision trees with sum or majority voting rules, biased SVM, and centroid-based balanced SVM, performed best in comparison with previously published ones.
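One common way to realize a biased SVM for PU-style learning is to penalize errors on the reliable positive class more heavily than on the noisy "negative" (unlabelled) class, for example via asymmetric class weights in a standard SVM. The sketch below illustrates that idea with synthetic data; the features, weights and kernel are assumptions and do not reproduce the paper's setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical feature vectors: a few labelled flawed articles (1) and many
# unlabelled articles treated as negatives (0). All values are synthetic.
X = np.vstack([rng.normal(1.0, 1.0, size=(30, 5)),     # flawed (positive)
               rng.normal(-1.0, 1.0, size=(300, 5))])  # unlabelled, treated as negative
y = np.array([1] * 30 + [0] * 300)

# Biased SVM idea: a much higher misclassification cost on the positive class
# than on the noisy negative class.
clf = SVC(kernel="rbf", C=1.0, class_weight={1: 10.0, 0: 1.0})
clf.fit(X, y)

# Predicted flaw labels for three new (synthetic) article vectors.
print(clf.predict(rng.normal(1.0, 1.0, size=(3, 5))))
```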

18.
This paper proposes collaborative filtering as a means to predict semantic preferences by combining information on social ties with information on links between actors and semantics. First, the authors present an overview of the most relevant collaborative filtering approaches, showing how they work and how they differ. They then compare three different collaborative filtering algorithms, using articles published by New York Times journalists from 2003 to 2005 to predict preferences, where preferences refer to journalists' inclination to use certain words in their writing. Results show that while preference-profile similarities in an actor's neighbourhood are a good predictor of her semantic preferences, information on her social network adds little to prediction accuracy.
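As a hedged illustration of the neighbourhood-based prediction being compared here, a user-based collaborative filtering step over a journalist-by-word preference matrix could look like the sketch below. The matrix values, the cosine similarity and the choice of k are invented for illustration and are not the paper's algorithms.

```python
import numpy as np

# Hypothetical journalist-by-word preference matrix (e.g., relative usage frequencies);
# row = journalist, column = word; zeros mean "not yet observed".
P = np.array([
    [0.9, 0.1, 0.0, 0.4],
    [0.8, 0.2, 0.3, 0.5],
    [0.1, 0.9, 0.7, 0.0],
])

def predict_preference(P, user, item, k=2):
    """User-based collaborative filtering: predict P[user, item] as a similarity-weighted
    average of the k most similar other users' observed preferences for that item."""
    norms = np.linalg.norm(P, axis=1)
    sims = (P @ P[user]) / (norms * np.linalg.norm(P[user]) + 1e-12)  # cosine similarities
    sims[user] = -np.inf                                              # exclude the user themself
    neighbours = np.argsort(sims)[::-1][:k]
    weights = np.clip(sims[neighbours], 0, None)
    if weights.sum() == 0:
        return 0.0
    return float(weights @ P[neighbours, item] / weights.sum())

print(round(predict_preference(P, user=0, item=2), 3))
```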

19.
Studying conflict and coordination in Wikipedia editing helps deepen our understanding of the coordination mechanisms inherent in crowd-based innovation. Building on the wiki-editing dynamics model proposed by Török et al., this paper proposes a collaborative-editing conflict dynamics model that accounts for the influence of zealots. Through agent-based modelling and analysis, the proposed model reproduces the phase transitions from "single conflict" to "intermittent conflict" to "persistent conflict". Furthermore, the model exhibits richer dynamical regimes than the model of Török et al. This work offers a new explanation for the three conflict patterns observed in Wikipedia editing and has implications for coordinating conflicting opinions in crowd-based innovation projects.

20.
Online collaboration, exemplified by Wikipedia, has created an entirely new mode of knowledge production and raises new questions and challenges for epistemology. This article focuses on these questions and offers an in-depth analysis of, and reflection on, the epistemology of collective collaboration. It identifies four differences in epistemic culture between Wikipedia and science, and argues that Wikipedia knowledge can be regarded as a form of collective testimony in epistemological research, one that is strongly defensible. Wikipedia cannot replace the role of experts, but it can generate a model of epistemic egalitarianism that breaks down knowledge privilege and enables the transfer and circulation of epistemic authority.
