Similar literature
 20 similar documents found
1.
Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process, that is, documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, a significant research effort has been invested into transferring the probabilistic topic modeling concept from monolingual to multilingual settings. Novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages and limitations in MuPTM. As a representative example, we choose a natural extension of the omnipresent LDA model to multilingual settings called bilingual LDA (BiLDA). We provide a thorough overview of this representative multilingual model from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how to use the data representation by means of output sets of (i) per-topic word distributions and (ii) per-document topic distributions coming from a multilingual probabilistic topic model in various real-life cross-lingual tasks involving different languages, without any external language pair dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications present in the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article encompasses the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we reveal the importance of the language-independent and language pair independent data representations by means of MuPTM. 
We provide clear directions for future research in the field by providing a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.
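As a rough illustration of how per-document topic distributions can support cross-lingual semantic similarity, the sketch below compares documents in a shared topic space. It is not the BiLDA training procedure: the topic vectors, the document ids (`nl_1`, `nl_2`), and the choice of cosine similarity are all assumptions made for the example, and a trained multilingual topic model would supply the real distributions.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length topic distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_by_topic_similarity(query_theta, candidate_thetas):
    """Rank candidate document ids by similarity to the query, best first."""
    scored = [
        (doc_id, cosine_similarity(query_theta, theta))
        for doc_id, theta in candidate_thetas.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-document topic distributions over 3 shared cross-lingual topics.
english_doc = [0.7, 0.2, 0.1]
dutch_docs = {"nl_1": [0.6, 0.3, 0.1], "nl_2": [0.1, 0.1, 0.8]}

ranking = rank_by_topic_similarity(english_doc, dutch_docs)
print(ranking[0][0])
```

Because both languages share the same topic indices, no translation dictionary is needed at comparison time; any other measure over probability vectors could replace cosine similarity here.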

2.
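[Purpose/Significance] To explore, from a cross-lingual perspective, how to better address entity extraction for low-resource languages. [Method/Process] With English as the source language and Spanish and Dutch as the target languages, and drawing on ideas from transfer learning and deep learning, an unsupervised cross-lingual entity extraction method is proposed that combines self-learning with a GRU-LSTM-CRF network. [Results/Conclusions] Compared with supervised cross-lingual entity extraction methods, the proposed unsupervised method achieves better results, with an F1 score of 0.6419 on Spanish and 0.6557 on Dutch. Cross-lingual knowledge is used to build a bridge between the source and target languages and to improve entity extraction for low-resource languages.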

3.
We study the selection of transfer languages for automatic abusive language detection. Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection. This way we can use existing data from higher-resource languages to build better detection systems for low-resource languages. Our datasets are from seven different languages from three language families. We measure the distance between the languages using several language similarity measures, especially by quantifying the World Atlas of Language Structures. We show that there is a correlation between linguistic similarity and classifier performance. This discovery allows us to choose an optimal transfer language for zero-shot abusive language detection.

4.
Text categorization pertains to the automatic learning of a text categorization model from a training set of preclassified documents on the basis of their contents and the subsequent assignment of unclassified documents to appropriate categories. Most existing text categorization techniques deal with monolingual documents (i.e., written in the same language) during the learning of the text categorization model and category assignment (or prediction) for unclassified documents. However, with the globalization of business environments and advances in Internet technology, an organization or individual may generate and organize into categories documents in one language and subsequently archive documents in different languages into existing categories, which necessitates cross-lingual text categorization (CLTC). Specifically, cross-lingual text categorization deals with learning a text categorization model from a set of training documents written in one language (e.g., L1) and then classifying new documents in a different language (e.g., L2). Motivated by the significance of this demand, this study aims to design a CLTC technique with two different category assignment methods, namely, individual- and cluster-based. Using monolingual text categorization as a performance reference, our empirical evaluation results demonstrate the cross-lingual capability of the proposed CLTC technique. Moreover, the classification accuracy achieved by the cluster-based category assignment method is statistically significantly higher than that attained by the individual-based method.

5.
We study the selection of transfer languages for different Natural Language Processing tasks, specifically sentiment analysis, named entity recognition and dependency parsing. In order to select an optimal transfer language, we propose to utilize different linguistic similarity metrics to measure the distance between languages and make the choice of transfer language based on this information instead of relying on intuition. We demonstrate that linguistic similarity correlates with cross-lingual transfer performance for all of the proposed tasks. We also show that there is a statistically significant difference in choosing the optimal language as the transfer source instead of English. This allows us to select a more suitable transfer language which can be used to better leverage knowledge from high-resource languages in order to improve the performance of language applications lacking data. For the study, we used datasets from eight different languages from three language families.

6.
A comparative study of two types of patent retrieval tasks, technology survey and invalidity search, using the NTCIR-3 and -4 test collections is described, with a focus on pseudo-feedback effectiveness and different retrieval models. Invalidity searches are peculiar to patent retrieval tasks and feature small numbers of relevant documents and long queries. Different behaviors of effectiveness are observed when applying different retrieval models and pseudo-feedback. These different behaviors are analyzed in terms of the “weak cluster hypothesis”, i.e., terminological cohesiveness through relevant documents.

7.
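Lin Deming (林德明), Liu Zeyuan (刘则渊). 《科学学研究》, 2010, 28(8): 1141-1147
Using bibliometric methods combined with knowledge visualization, this paper draws scientific knowledge maps to quantitatively analyze the relationship between "literature" and "discovery", and systematically presents the development and current state of literature-based scientific discovery. It identifies the practical foundations for making literature-based scientific discovery computable: research on the theory of knowledge discovery from unrelated literatures proposed by Swanson, and research on scientific discovery from related literatures, especially work that applies scientometric methods to detect and identify research fronts and hot topics. These studies provide a reference for realizing computable scientific discovery. Finally, building on the analysis of the current state of research, the paper proposes a new perspective that combines computer simulation with visualization to measure the literature, and outlines a path toward computing literature-based scientific discovery.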

8.
Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative use of language. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting. In this work, we explore hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. We experiment with traditional and recent neural architectures, and propose two joint-learning models, using different multilingual language representations to transfer knowledge between pairs of languages. We also evaluate the impact of additional knowledge in our experiment, by incorporating information from a multilingual lexicon of abusive words. The results show that our joint-learning models achieve the best performance on most languages. However, a simple approach that uses machine translation and a pre-trained English language model achieves robust performance. In contrast, Multilingual BERT fails to obtain good performance in cross-lingual hate speech detection. We also found experimentally that the external knowledge from a multilingual abusive lexicon is able to improve the models’ performance, specifically in detecting the positive class. The results of our experimental evaluation highlight a number of challenges and issues in this particular task. One of the main challenges is related to the issue of current benchmarks for hate speech detection, in particular how bias related to the topical focus in the datasets influences the classification performance. The insufficient ability of current multilingual language models to transfer knowledge between languages in the specific hate speech detection task also remains an open problem. 
However, our experimental evaluation and our qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive language lexicon helps to alleviate this issue.

9.
This article proposes a syntactic parsing strategy based on a dependency grammar containing formal rules and a compression technique that reduces the complexity of those rules. Compression parsing is mainly driven by the ‘single-head’ constraint of Dependency Grammar, and can be seen as an alternative method to the well-known constructive strategy. The compression algorithm simplifies the input sentence by progressively removing from it the dependent tokens as soon as binary syntactic dependencies are recognized. This strategy is thus similar to that used in deterministic dependency parsing. A compression parser was implemented and released under the General Public License, as well as a cross-lingual grammar with Universal Dependencies, containing only broad-coverage rules applied to Romance languages. The system is an almost delexicalized parser which does not need training data to analyze Romance languages. The rule-based cross-lingual parser was submitted to the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. The performance of our system was compared to the other supervised systems participating in the competition, paying special attention to the parsing of different treebanks of the same language. We also trained a supervised delexicalized parser for Romance languages in order to compare it to our rule-based system. The results show that the performance of our cross-lingual method does not change across related languages and across different treebanks, while most supervised methods turn out to be very dependent on the text domain used to train the system.
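The compression idea described in this abstract (remove a dependent token as soon as a binary dependency with an adjacent head is recognized) can be sketched in a few lines. This is a toy illustration only, not the released parser: the `RULES` set, the tag names, and the greedy left-to-right rule application are hypothetical choices made for the example.

```python
# Hypothetical toy rules: (head_tag, dependent_tag, side of the head where
# the dependent sits). A real grammar would be far richer.
RULES = {
    ("NOUN", "DET", "left"),    # determiner before its noun
    ("VERB", "NOUN", "left"),   # subject before the verb
    ("VERB", "NOUN", "right"),  # object after the verb
}


def compress_parse(tagged_tokens, rules):
    """Return (dependencies, remaining) for a list of (word, tag) pairs.

    dependencies is a list of (head_index, dependent_index) pairs;
    remaining holds the indices that were never attached (the roots).
    """
    remaining = list(range(len(tagged_tokens)))
    dependencies = []
    reduced = True
    while reduced:
        reduced = False
        for pos in range(len(remaining) - 1):
            left, right = remaining[pos], remaining[pos + 1]
            left_tag = tagged_tokens[left][1]
            right_tag = tagged_tokens[right][1]
            if (right_tag, left_tag, "left") in rules:
                # Left token depends on the right one: attach and remove it.
                dependencies.append((right, left))
                del remaining[pos]
            elif (left_tag, right_tag, "right") in rules:
                # Right token depends on the left one: attach and remove it.
                dependencies.append((left, right))
                del remaining[pos + 1]
            else:
                continue
            reduced = True
            break  # rescan the shortened sentence from the start
    return dependencies, remaining


sentence = [("the", "DET"), ("cat", "NOUN"), ("eats", "VERB"), ("fish", "NOUN")]
deps, roots = compress_parse(sentence, RULES)
print(deps, roots)
```

Each token is removed at most once, so every dependent ends up with exactly one head, which mirrors the single-head constraint mentioned above.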

10.
This paper describes our novel retrieval model that is based on contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes the document contexts into account instead of implicitly using them to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as the log-odds and uses smoothing techniques as found in language models to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of the parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show whether the model obeys the probability ranking principle. Our model is promising as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators that are consistent with the aggregate relevance principle were effective in combining the estimated preferences, and (b) estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.

11.
Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being thus spelling variants of each other. In this paper we present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first step, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second step, the intermediate forms obtained in the first step are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The two-step technique performed better, in some cases considerably better, than fuzzy matching alone. Even the first step used on its own showed promising results.
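The two steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the single Spanish-to-English rule in `RULES` is hand-written and hypothetical (the paper generates its rules automatically from translation dictionaries), and Python's `difflib.get_close_matches` stands in for whatever fuzzy matching method the authors actually used.

```python
import difflib

# Hypothetical hand-written transformation rule (source substring -> target
# substring); the paper derives such rules automatically from dictionaries.
RULES = [("ción", "tion")]


def apply_rules(word, rules=RULES):
    """Step 1: rewrite the source word toward target-language spelling."""
    for source_part, target_part in rules:
        word = word.replace(source_part, target_part)
    return word


def fuzzy_translate(word, target_vocabulary, cutoff=0.6):
    """Step 2: fuzzy-match the intermediate form against target words.

    Returns the closest target word, or None if nothing reaches the cutoff.
    """
    intermediate = apply_rules(word)
    matches = difflib.get_close_matches(
        intermediate, target_vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


vocabulary = ["classification", "clarification", "station"]
print(fuzzy_translate("clasificación", vocabulary))
```

The rule step narrows the spelling gap before matching, so the fuzzy matcher has fewer near-miss candidates to confuse with the intended target word.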

12.
This paper presents a laboratory-based evaluation study of cross-language information retrieval technologies, utilizing partially parallel test collections, NTCIR-2 (used together with NTCIR-1), where Japanese–English parallel document collections, parallel topic sets and their relevance judgments are available. These enable us to observe and compare monolingual retrieval processes in two languages as well as retrieval across languages. Our experiments focused on (1) the Rosetta stone question (whether or not a partially parallel collection helps in cross-language information access) and (2) two aspects of retrieval difficulties, namely “collection discrepancy” and “query discrepancy”. Japanese and English monolingual retrieval systems are combined by dictionary-based query translation modules so that a symmetrical bilingual evaluation environment is implemented.

13.
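[Purpose/Significance] To explore a new development model for scientific and technical information work in the information age. [Method/Process] The paper analyzes the development of scientific and technical information work in China and the challenges it faces, discusses the necessity and feasibility of integrating it into scientific and technical archives work, analyzes its role in the knowledge management of scientific and technical archives, and on this basis proposes a model of scientific and technical information work integrated into archives work. [Results/Conclusions] Scientific and technical information work needs to be integrated into each stage of archives knowledge management (knowledge accumulation, organization, evaluation, discovery, development, and service), promoting the use of archival knowledge resources while raising the quality and level of the information work itself.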

14.
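A brief analysis of enterprise knowledge alliances   Total citations: 2, self-citations: 0, citations by others: 2
Tian Min (田敏). 《情报科学》, 2001, 19(2): 187-190
This paper briefly analyzes the necessity, characteristics, and role of knowledge alliances, a new model of enterprise cooperation in the knowledge economy. It examines the problems Chinese enterprises face in forming knowledge alliances and proposes countermeasures.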

15.
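Research on an incentive model based on intra-enterprise knowledge transfer and sharing   Total citations: 12, self-citations: 2, citations by others: 12
Feng Tianxue (冯天学), Tian Jinxin (田金信). 《预测》, 2005, 24(5): 9-13
Knowledge transfer and sharing within an enterprise is the key link in the sustained appreciation and value realization of knowledge capital. The efficiency of knowledge transfer and sharing depends on mutual trust among individual employees, voluntary cooperation, dedication, and the environment. To address the limitations of traditional incentive models, this paper draws on the theories of human nature summarized by Charles Ehin to propose the concept of environmental incentives and to construct a new incentive model.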

16.
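Liu Ying (刘颖). 《科教文汇》, 2011, (20): 133-134
This paper focuses on how English reading instruction can explore and make use of the cultural content in textbooks and weave in relevant cultural information, so that students, while learning language knowledge and improving their reading skills, also strengthen their humanistic literacy, global awareness, and sensitivity to cultural differences. This lays a solid foundation for developing appropriate intercultural communicative competence.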

17.
Recently, sentiment classification has received considerable attention within the natural language processing research community. However, since most recent work on sentiment classification has been done in English, sentiment resources in other languages remain scarce. Manual construction of reliable sentiment resources is a very difficult and time-consuming task. Cross-lingual sentiment classification aims to utilize annotated sentiment resources in one language (typically English) for sentiment classification of text documents in another language. Most existing research works rely on automatic machine translation services to directly project information from one language to another. However, differing term distributions between original and translated text documents and translation errors are two main problems faced when using only machine translation. To overcome these problems, we propose a novel learning model based on active learning and semi-supervised co-training to incorporate unlabelled data from the target language into the learning process in a bi-view framework. This model attempts to enrich training data by adding the most confident automatically-labelled examples, as well as a few of the most informative manually-labelled examples from unlabelled data in an iterative process. Further, in this model, we consider the density of unlabelled data so as to select more representative unlabelled examples in order to avoid outlier selection in active learning. The proposed model was applied to book review datasets in three different languages. Experiments showed that our model can effectively improve the cross-lingual sentiment classification performance and reduce labelling efforts in comparison with some baseline methods.

18.
Through the recent NTCIR workshops, patent retrieval has posed many challenging issues for the information retrieval community. Unlike newspaper articles, patent documents are very long and well structured. These characteristics raise the necessity to reassess existing retrieval techniques that have been mainly developed for structure-less and short documents such as newspapers. This study investigates cluster-based retrieval in the context of the invalidity search task of patent retrieval. Cluster-based retrieval assumes that clusters would provide additional evidence to match a user’s information need. Thus far, cluster-based retrieval approaches have relied on automatically created clusters. Fortunately, all patents have manually assigned cluster information in the form of international patent classification codes. International patent classification is a standard taxonomy for classifying patents, and currently has about 69,000 nodes which are organized into a five-level hierarchical system. Thus, patent documents could provide the best test bed to develop and evaluate cluster-based retrieval techniques. Experiments using the NTCIR-4 patent collection showed that the cluster-based language model can help improve on the cluster-less baseline language model.

19.
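Peng Bo (彭博). 《情报科学》, 2021, 39(9): 162-169
[Purpose/Significance] How to distill the different knowledge contained in online cultural relics information resources and recommend it to relevant users is a key problem in developing and using these resources. [Method/Process] A topic-knowledge association model is used to build a cultural relics knowledge network and to identify the topic words in the text of the cultural relics information resources in that network. The coupled knowledge is then ranked by the importance of the knowledge and of the topic words, and knowledge is recommended according to the degree of association between knowledge and topics. [Results/Conclusions] The experiments implemented knowledge recommendation for different online cultural relics information resources and compared knowledge discovery performance under different numbers of topic words. The method was found to work well for knowledge discovery and recommendation in academic cultural relics information resources. [Innovation/Limitations] A knowledge network is built from a knowledge base and the content of the information resources, and knowledge is recommended by computing the importance of network nodes, which offers a new method for using cultural relics information resources. However, because the method is constrained by the knowledge held in the knowledge base, it may not uncover the full picture of the knowledge in the information resources.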

20.
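Research on knowledge discovery and its development trends   Total citations: 2, self-citations: 0, citations by others: 2
With the rapid development of computer and information science and technology, knowledge discovery has attracted widespread attention as a new discipline. This paper introduces the definition, tasks, process, and techniques of knowledge discovery, and finally describes its development trends.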
