Similar literature
 20 similar documents found; search took 31 ms
1.
We will explore various ways to apply query structuring in cross-language information retrieval. In the first test, English queries were translated into Finnish using an electronic dictionary, and were run in a Finnish newspaper database of 55,000 articles. Queries were structured by combining the Finnish translation equivalents of the same English query key using the syn-operator of the InQuery retrieval system. Structured queries performed markedly better than unstructured queries. Second, the effects of compound-based structuring using a proximity operator for the translation equivalents of query language compound components were tested. The method was not useful in syn-based queries; instead, it resulted in a decrease in retrieval effectiveness. Proper names are often non-identical spelling variants in different languages. This allows n-gram based translation of names not included in a dictionary. In the third test, a query structuring method where the Boolean and-operator was used to assign more weight to keys translated through n-gram matching gave good results.
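
The n-gram matching of proper names mentioned in this abstract can be illustrated with a minimal sketch. It assumes character bigrams and the Dice coefficient, neither of which the abstract specifies; the function names and the example spellings are hypothetical.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a lowercased word."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two strings (assumed measure)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def best_match(source_name, target_index_terms, n=2):
    """Pick the target-language index term most similar to the source name."""
    return max(target_index_terms, key=lambda t: dice(source_name, t, n))


# Illustrative (hypothetical) index terms; the name is absent from the dictionary.
index_terms = ["Gorbatshov", "Helsinki", "Jeltsin"]
match = best_match("Gorbachev", index_terms)
```

A key matched this way could then be given extra weight in the structured query, as the third test describes.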

2.
A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from available translation resources, since translation quality directly affects the retrieval performance. Among different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora, without requiring any additional linguistic resources, for the CLIR task. In the first step, translations are extracted by deriving correlations between source–target word pairs. These correlations are used to estimate word translation probabilities in the second step. We propose a language modeling approach for the first step, where modeling based on probability distribution provides two key advantages. First, our approach can be tuned more easily than previous, heuristically adjusted work. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations from comparable corpora. As an indication, we integrate monolingual relations of word co-occurrences into the process of translation extraction, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English–Persian comparable corpus show that our method outperforms the previous approaches in terms of both translation quality and the performance of CLIR. Indeed, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of word translation probabilities, estimated in the second step of our approach, on the performance of CLIR.
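
The second step described above, turning word-pair correlations into translation probabilities, might look like the sketch below. Simple per-source-word normalization is an assumption made here for illustration; the abstract does not state the estimator, and the word pairs and scores are made up.

```python
def translation_probabilities(scores):
    """Normalize nonnegative correlation scores into P(target | source).

    scores: {source_word: {target_word: correlation}}
    Plain normalization is an assumed stand-in for the paper's estimator.
    """
    probs = {}
    for src, candidates in scores.items():
        total = sum(candidates.values())
        if total > 0:
            probs[src] = {tgt: s / total for tgt, s in candidates.items()}
    return probs


# Hypothetical correlation scores for one English word and two candidates.
scores = {"book": {"ketab": 3.0, "daftar": 1.0}}
probs = translation_probabilities(scores)
```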

3.
In this paper, we propose a new learning method for extracting bilingual word pairs from parallel corpora in various languages. In cross-language information retrieval, the system must deal with various languages. Therefore, automatic extraction of bilingual word pairs from parallel corpora with various languages is important. However, previous works based on statistical methods are insufficient because of the sparse data problem. Our learning method automatically acquires rules, which are effective in solving the sparse data problem, only from parallel corpora without any prior preparation of a bilingual resource (e.g., a bilingual dictionary, a machine translation system). We call this learning method Inductive Chain Learning (ICL). Moreover, the system using ICL can extract bilingual word pairs even from bilingual sentence pairs for which the grammatical structures of the source language differ from the grammatical structures of the target language because the acquired rules have the information to cope with the different word orders of source language and target language in local parts of bilingual sentence pairs. Evaluation experiments demonstrated that the recall of systems based on several statistical approaches was improved through the use of ICL.

4.
For historical and cultural reasons, English phrases, especially proper nouns and new words, frequently appear in Web pages written primarily in East Asian languages such as Chinese, Korean, and Japanese. Although such English terms and their equivalents in these East Asian languages refer to the same concept, they are often erroneously treated as independent index units in traditional Information Retrieval (IR). This paper describes the degree to which the problem arises in IR and proposes a novel technique to solve it. Our method first extracts English terms from native Web documents in an East Asian language, and then unifies the extracted terms and their equivalents in the native language as one index unit. For Cross-Language Information Retrieval (CLIR), one of the major hindrances to achieving retrieval performance at the level of Mono-Lingual Information Retrieval (MLIR) is the translation of terms in search queries which cannot be found in a bilingual dictionary. The Web mining approach proposed in this paper for concept unification of terms in different languages can also be applied to solve this well-known challenge in CLIR. Experimental results based on NTCIR and KT-Set test collections show that the high translation precision of our approach greatly improves performance of both Mono-Lingual and Cross-Language Information Retrieval.

5.
Knowledge acquisition and bilingual terminology extraction from multilingual corpora are challenging tasks for cross-language information retrieval. In this study, we propose a novel method for mining high quality translation knowledge from our constructed Persian–English comparable corpus, University of Tehran Persian–English Comparable Corpus (UTPECC). We extract translation knowledge based on a Term Association Network (TAN) constructed from term co-occurrences in the same language as well as term associations in different languages. We further propose a post-processing step that checks term translation validity by detecting mistranslated terms as outliers. Evaluation results on two different data sets show that translating queries using UTPECC and the proposed methods significantly outperforms simple dictionary-based methods. Moreover, the experimental results show that our methods are especially effective in translating Out-Of-Vocabulary terms and also expanding query words based on their associated terms.

6.
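[Purpose/Significance] To explore, from a cross-lingual perspective, how to better address entity extraction for low-resource languages. [Method/Process] With English as the source language and Spanish and Dutch as the target languages, and drawing on ideas from transfer learning and deep learning, we propose an unsupervised cross-lingual entity extraction method that combines self-learning with a GRU-LSTM-CRF network. [Result/Conclusion] Compared with supervised cross-lingual entity extraction methods, the proposed unsupervised method achieves better results, with an F1 score of 0.6419 on Spanish and 0.6557 on Dutch. Cross-lingual knowledge is used to build a bridge between the source and target languages, improving entity extraction for low-resource languages.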

7.
Subjectivity detection is a task of natural language processing that aims to remove ‘factual’ or ‘neutral’ content, i.e., objective text that does not contain any opinion, from online product reviews. Such a pre-processing step is crucial to increase the accuracy of sentiment analysis systems, as these are usually optimized for the binary classification task of distinguishing between positive and negative content. In this paper, we extend the extreme learning machine (ELM) paradigm to a novel framework that exploits the features of both Bayesian networks and fuzzy recurrent neural networks to perform subjectivity detection. In particular, Bayesian networks are used to build a network of connections among the hidden neurons of the conventional ELM configuration in order to capture dependencies in high-dimensional data. Next, a fuzzy recurrent neural network inherits the overall structure generated by the Bayesian networks to model temporal features in the predictor. Experimental results confirmed the ability of the proposed framework to deal with standard subjectivity detection problems and also proved its capacity to address portability across languages in translation tasks.

8.
With the rapid evolution of the mobile environment, the demand for natural language applications on mobile devices is increasing. This paper proposes an automatic word spacing system, the first step module of natural language processing (NLP) for many languages with their own word spacing rules, that is designed for mobile devices with limited hardware resources. The proposed system uses two stages. In the first stage, it preliminarily corrects word spacing errors by using a modified hidden Markov model based on character unigrams. In the second stage, the proposed system re-corrects the miscorrected word spaces by using lexical rules based on character bigrams or longer combinations. By using this hybrid method, the proposed system improves the robustness against unknown word patterns, reduces memory usage, and increases accuracy. To evaluate the proposed system in a realistic mobile environment, we constructed a mobile-style colloquial corpus using a simple simulation method. In experiments with a commercial mobile phone, the proposed system performed well on the various evaluation measures (a response time of 0.20 s per sentence, a memory usage of 2.04 MB, and an accuracy of 92–95%).

9.
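Wei Yu 《情报科研学报》2013,(6):615-617,624
How can the informative function of a text be effectively foregrounded in translation so that the information conveyed is truthful and accurate? Nida holds that "the accuracy of content should not be judged by the (translation's) 'faithfulness' to the original author, but by whether the information conveyed is not misunderstood by readers of the translation." Therefore, when translating informative texts, the translator should aim to let readers understand the source information objectively and accurately, and should apply translation techniques flexibly in both linguistic expression and stylistic form. The artifact introductions at the Qin Terracotta Warriors Museum are "informative" texts, and their English translations contain many problems. Guided by translation theory, the author applies translation techniques and methods to give an objective assessment of the problems in the English translations of the museum's artifact introductions.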

10.
Recently, sentiment classification has received considerable attention within the natural language processing research community. However, since most recent works regarding sentiment classification have been done in the English language, there are accordingly not enough sentiment resources in other languages. Manual construction of reliable sentiment resources is a very difficult and time-consuming task. Cross-lingual sentiment classification aims to utilize annotated sentiment resources in one language (typically English) for sentiment classification of text documents in another language. Most existing research works rely on automatic machine translation services to directly project information from one language to another. However, different term distribution between original and translated text documents and translation errors are two main problems faced in the case of using only machine translation. To overcome these problems, we propose a novel learning model based on active learning and semi-supervised co-training to incorporate unlabelled data from the target language into the learning process in a bi-view framework. This model attempts to enrich training data by adding the most confident automatically-labelled examples, as well as a few of the most informative manually-labelled examples from unlabelled data in an iterative process. Further, in this model, we consider the density of unlabelled data so as to select more representative unlabelled examples in order to avoid outlier selection in active learning. The proposed model was applied to book review datasets in three different languages. Experiments showed that our model can effectively improve the cross-lingual sentiment classification performance and reduce labelling efforts in comparison with some baseline methods.

11.
Cluster analysis using multiple representations of data is known as multi-view clustering and has attracted much attention in recent years. The major drawback of existing multi-view algorithms is that their clustering performance depends heavily on hyperparameters which are difficult to set. In this paper, we propose the Multi-View Normalized Cuts (MVNC) approach, a two-step algorithm for multi-view clustering. In the first step, an initial partitioning is performed using a spectral technique. In the second step, a local search procedure is used to refine the initial clustering. MVNC has been evaluated and compared to state-of-the-art multi-view clustering approaches using three real-world datasets. Experimental results have shown that MVNC significantly outperforms existing algorithms in terms of clustering quality and computational efficiency. In addition to its superior performance, MVNC is parameter-free which makes it easy to use.

12.
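Chen Ying 《科教文汇》2014,(25):159-160
Translation is not only a conversion between two languages but also an exchange between two cultures; translation and culture are closely related. Of all linguistic elements, vocabulary is the most closely tied to culture, and culture-loaded words are an important part of it. From the perspective of cultural translation, this article compares, evaluates, and analyzes in detail typical examples from two English translations. When translating classical Chinese literary works, translators should, guided by the cultural view of translation, adopt varied strategies to reproduce the features of the source culture as fully as possible and thereby achieve equivalent translation between cultures. This requires translators to use foreignization as much as possible when translating cultural elements, provided that the original meaning is conveyed accurately. Where necessary, "China English" rather than "Chinglish" may be used appropriately to translate works with deep cultural connotations.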

13.
14.
We study the selection of transfer languages for different Natural Language Processing tasks, specifically sentiment analysis, named entity recognition and dependency parsing. In order to select an optimal transfer language, we propose to utilize different linguistic similarity metrics to measure the distance between languages and make the choice of transfer language based on this information instead of relying on intuition. We demonstrate that linguistic similarity correlates with cross-lingual transfer performance for all of the proposed tasks. We also show that there is a statistically significant difference in choosing the optimal language as the transfer source instead of English. This allows us to select a more suitable transfer language which can be used to better leverage knowledge from high-resource languages in order to improve the performance of language applications lacking data. For the study, we used datasets from eight different languages from three language families.
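
Choosing a transfer language by linguistic similarity, as this abstract proposes, can be sketched as follows. The sketch assumes each language is represented by a numeric typological feature vector and uses cosine similarity; the abstract does not say which metrics were used, and the language labels and vectors below are invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_transfer_language(target_vec, candidates):
    """Return the candidate language whose features are closest to the target."""
    return max(candidates, key=lambda name: cosine(target_vec, candidates[name]))


# Hypothetical binary typological features for two candidate source languages.
candidates = {"lang_A": [1, 0, 1, 1], "lang_B": [0, 1, 0, 0]}
chosen = pick_transfer_language([1, 0, 1, 0], candidates)
```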

15.
Eliminating noisy information and extracting informative content have become important issues for web mining, search and accessibility. This extraction process can employ automatic techniques and hand-crafted rules. Automatic extraction techniques focus on various machine learning methods, but implementing these techniques increases time complexity of the extraction process. Conversely, extraction through hand-crafted rules is an efficient technique that uses string manipulation functions, but preparing these rules is difficult and cumbersome for users. In this paper, we present a hybrid approach that contains two steps that can invoke each other. The first step discovers informative content using Decision Tree Learning as an appropriate machine learning method and creates rules from the results of this learning method. The second step extracts informative content using rules obtained from the first step. However, if the second step does not return an extraction result, the first step gets invoked. In our experiments, the first step achieves a high accuracy of 95.76% in extracting the informative content. Moreover, 71.92% of the rules can be used in the extraction process, and the second step is approximately 240 times faster than the first step.

16.
We study the selection of transfer languages for automatic abusive language detection. Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection. This way we can use existing data from higher-resource languages to build better detection systems for low-resource languages. Our datasets are from seven different languages from three language families. We measure the distance between the languages using several language similarity measures, especially by quantifying the World Atlas of Language Structures. We show that there is a correlation between linguistic similarity and classifier performance. This discovery allows us to choose an optimal transfer language for zero-shot abusive language detection.

17.
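Liu Qinglian 《科教文汇》2014,(19):133-134
Starting from the recent announcement by Yunnan Baiyao that its formula contains caowu (草乌), i.e. duanchangcao (断肠草), this article introduces the English names of caowu and duanchangcao, and then discusses the English translation of Chinese medicines and the problems that deserve attention in translating their names. Through discussion and analysis, we conclude that domestication and foreignization should be combined when translating Chinese medicine names and traditional Chinese medicine terminology. We hope that appropriate translations of these terms will promote better academic and cultural exchange between China and the rest of the world.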

18.
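Peng Zhiying 《科教文汇》2011,(35):122-124
Subtitle translation is a special type of code conversion, characterized by condensed language and dialogue tailored to each character. In subtitle translation, cultural presupposition manifests as the link between film dialogue and cultural reality, and effective interpretation of the cultural presuppositions of the source dialogue is a prerequisite for successful subtitle translation. When the source and target languages share a cultural presupposition, the translator uses "form-meaning correspondence" encoding to preserve the exotic flavor of the source dialogue. When the two languages do not share a cultural presupposition, the translator breaks up and recombines the text with creative condensation, using strategies such as explicitation and explanation, substitution and reconstruction, and addition and deletion.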

19.
This paper presents a Foreign-Language Search Assistant that uses noun phrases as fundamental units for document translation and query formulation, translation and refinement. The system (a) supports the foreign-language document selection task providing a cross-language indicative summary based on noun phrase translations, and (b) supports query formulation and refinement using the information displayed in the cross-language document summaries. Our results challenge two implicit assumptions in most cross-language Information Retrieval research: first, that once documents in the target language are found, Machine Translation is the optimal way of informing the user about their contents; and second, that in an interactive setting the optimal way of formulating and refining the query is helping the user to choose appropriate translations for the query terms.

20.
Two probabilistic approaches to cross-lingual retrieval are in wide use today, those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries: one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus. We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.
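
The effect of a synonym operator in structured query translation can be sketched as below. The sketch assumes the common reading in which all translations of one query term are treated as a single term, so term frequency is summed per document and document frequency counts documents containing any member. The function name and the toy documents are hypothetical, and this is not INQUERY's actual implementation.

```python
def syn_stats(translations, docs):
    """Compute statistics for a synonym class of translation alternatives.

    translations: iterable of target-language translations of one query term
    docs: list of documents, each a list of tokens
    Returns (tfs, df): per-document term frequency of the whole class, and
    the number of documents containing at least one member.
    """
    members = set(translations)
    tfs = [sum(1 for tok in doc if tok in members) for doc in docs]
    df = sum(1 for tf in tfs if tf > 0)
    return tfs, df


# Toy collection: two translation alternatives of one source-language term.
docs = [["banco", "rio"], ["banco", "banca", "dinero"], ["casa"]]
tfs, df = syn_stats(["banco", "banca"], docs)
```

Treating the alternatives as one class keeps a term with many translations from dominating the query simply because it contributes more keys.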
