首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We will explore various ways to apply query structuring in cross-language information retrieval. In the first test, English queries were translated into Finnish using an electronic dictionary, and were run in a Finnish newspaper database of 55,000 articles. Queries were structured by combining the Finnish translation equivalents of the same English query key using the syn-operator of the InQuery retrieval system. Structured queries performed markedly better than unstructured queries. Second, the effects of compound-based structuring using a proximity operator for the translation equivalents of query language compound components were tested. The method was not useful in syn-based queries but resulted in decrease in retrieval effectiveness. Proper names are often non-identical spelling variants in different languages. This allows n-gram based translation of names not included in a dictionary. In the third test, a query structuring method where the Boolean and-operator was used to assign more weight to keys translated through n-gram matching gave good results.  相似文献   

2.
Many operational IR indexes are non-normalized, i.e. no lemmatization or stemming techniques, etc. have been employed in indexing. This poses a challenge for dictionary-based cross-language retrieval (CLIR), because translations are mostly lemmas. In this study, we face the challenge of dictionary-based CLIR in a non-normalized index. We test two optional approaches: FCG (Frequent Case Generation) and s-gramming. The idea of FCG is to automatically generate the most frequent inflected forms for a given lemma. FCG has been tested in monolingual retrieval and has been shown to be a good method for inflected retrieval, especially for highly inflected languages. S-gramming is an approximate string matching technique (an extension of n-gramming). The language pairs in our tests were English–Finnish, English–Swedish, Swedish–Finnish and Finnish–Swedish. Both our approaches performed quite well, but the results varied depending on the language pair. S-gramming and FCG performed quite equally in all the other language pairs except Finnish–Swedish, where s-gramming outperformed FCG.  相似文献   

3.
This paper analyzes the features of the Swedish language from the viewpoint of mono- and cross-language information retrieval (CLIR). The study was motivated by the fact that Swedish is known poorly from the IR perspective. This paper shows that Swedish has unique features, in particular gender features, the use of fogemorphemes in the formation of compound words, and a high frequency of homographic words. Especially in dictionary-based CLIR, correct word normalization and compound splitting are essential. It was shown in this study, however, that publicly available morphological analysis tools used for normalization and compound splitting have pitfalls that might decrease the effectiveness of IR and CLIR. A comparative study was performed to test the degree of lexical ambiguity in Swedish, Finnish and English. The results suggest that part-of-speech tagging might be useful in Swedish IR due to the high frequency of homographic words.  相似文献   

4.
For historical and cultural reasons, English phases, especially proper nouns and new words, frequently appear in Web pages written primarily in East Asian languages such as Chinese, Korean, and Japanese. Although such English terms and their equivalences in these East Asian languages refer to the same concept, they are often erroneously treated as independent index units in traditional Information Retrieval (IR). This paper describes the degree to which the problem arises in IR and proposes a novel technique to solve it. Our method first extracts English terms from native Web documents in an East Asian language, and then unifies the extracted terms and their equivalences in the native language as one index unit. For Cross-Language Information Retrieval (CLIR), one of the major hindrances to achieving retrieval performance at the level of Mono-Lingual Information Retrieval (MLIR) is the translation of terms in search queries which can not be found in a bilingual dictionary. The Web mining approach proposed in this paper for concept unification of terms in different languages can also be applied to solve this well-known challenge in CLIR. Experimental results based on NTCIR and KT-Set test collections show that the high translation precision of our approach greatly improves performance of both Mono-Lingual and Cross-Language Information Retrieval.  相似文献   

5.
Two probabilistic approaches to cross-lingual retrieval are in wide use today, those based on probabilistic models of relevance, as exemplified by INQUERY, and those based on language modeling. INQUERY, as a query net model, allows the easy incorporation of query operators, including a synonym operator, which has proven to be extremely useful in cross-language information retrieval (CLIR), in an approach often called structured query translation. In contrast, language models incorporate translation probabilities into a unified framework. We compare the two approaches on Arabic and Spanish data sets, using two kinds of bilingual dictionaries––one derived from a conventional dictionary, and one derived from a parallel corpus. We find that structured query processing gives slightly better results when queries are not expanded. On the other hand, when queries are expanded, language modeling gives better results, but only when using a probabilistic dictionary derived from a parallel corpus.We pursue two additional issues inherent in the comparison of structured query processing with language modeling. The first concerns query expansion, and the second is the role of translation probabilities. We compare conventional expansion techniques (pseudo-relevance feedback) with relevance modeling, a new IR approach which fits into the formal framework of language modeling. We find that relevance modeling and pseudo-relevance feedback achieve comparable levels of retrieval and that good translation probabilities confer a small but significant advantage.  相似文献   

6.
This paper explores the integration of textual and visual information for cross-language image retrieval. An approach which automatically transforms textual queries into visual representations is proposed. First, we mine the relationships between text and images and employ the mined relationships to construct visual queries from textual ones. Then, the retrieval results of textual and visual queries are combined. To evaluate the proposed approach, we conduct English monolingual and Chinese–English cross-language retrieval experiments. The selection of suitable textual query terms to construct visual queries is the major issue. Experimental results show that the proposed approach improves retrieval performance, and use of nouns is appropriate to generate visual queries.  相似文献   

7.
8.
Arabic is a morphologically rich language that presents significant challenges to many natural language processing applications because a word often conveys complex meanings decomposable into several morphemes (i.e. prefix, stem, suffix). By segmenting words into morphemes, we could improve the performance of English/Arabic translation pair’s extraction from parallel texts. This paper describes two algorithms and their combination to automatically extract an English/Arabic bilingual dictionary from parallel texts that exist in the Internet archive after using an Arabic light stemmer as a preprocessing step. Before using the Arabic light stemmer, the total system precision and recall were 88.6% and 81.5% respectively, then the system precision an recall increased to 91.6% and 82.6% respectively after applying the Arabic light stemmer on the Arabic documents.  相似文献   

9.
In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include:
  • •One can achieve an acceptable CLIR performance using only a bilingual term list (70–80% on Chinese and Arabic corpora).
  • •However, if a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
  • •If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially overcome the lack of parallel text.
  • •While stemming is useful normally, with a very large parallel corpus for Arabic–English, stemming hurt performance in our empirical studies with Arabic, a highly inflected language.
  相似文献   

10.
Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research as the information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish, and French, has been widely explored; however, CLIR across European languages and Oriental languages is still in the initial stage. To cross language boundary, corpus-based approach is promising to overcome the limitation of the knowledge-based and controlled vocabulary approaches but collecting parallel corpora between European language and Oriental language is not an easy task. Length-based and text-based approaches are two major approaches to align parallel documents. In this paper, we investigate several techniques using these approaches and compare their performances in aligning English and Chinese titles of parallel documents available on the Web.  相似文献   

11.
Cross-language information retrieval (CLIR) systems allow users to find documents written in different languages from that of their query. Simple knowledge structures such as bilingual term lists have proven to be a remarkably useful basis for bridging that language gap. A broad array of dictionary-based techniques have demonstrated utility, but comparison across techniques has been difficult because evaluation results often span only a limited range of conditions. This article identifies the key issues in dictionary-based CLIR, develops unified frameworks for term selection and term translation that help to explain the relationships among existing techniques, and illustrates the effect of those techniques using four contrasting languages for systematic experiments with a uniform query translation architecture. Key results include identification of a previously unseen dependence of pre- and post-translation expansion on orthographic cognates and development of a query-specific measure for translation fanout that helps to explain the utility of structured query methods.  相似文献   

12.
In contrast with their monolingual counterparts, little attention has been paid to the effects that misspelled queries have on the performance of Cross-Language Information Retrieval (CLIR) systems. The present work makes a first attempt to fill this gap by extending our previous work on monolingual retrieval in order to study the impact that the progressive addition of misspellings to input queries has, this time, on the output of CLIR systems. Two approaches for dealing with this problem are analyzed in this paper. Firstly, the use of automatic spelling correction techniques for which, in turn, we consider two algorithms: the first one for the correction of isolated words and the second one for a correction based on the linguistic context of the misspelled word. The second approach to be studied is the use of character n-grams both as index terms and translation units, seeking to take advantage of their inherent robustness and language-independence. All these approaches have been tested on a from-Spanish-to-English CLIR system, that is, Spanish queries on English documents. Real, user-generated spelling errors have been used under a methodology that allows us to study the effectiveness of the different approaches to be tested and their behavior when confronted with different error rates. The results obtained show the great sensitiveness of classic word-based approaches to misspelled queries, although spelling correction techniques can mitigate such negative effects. On the other hand, the use of character n-grams provides great robustness against misspellings.  相似文献   

13.
Query translation is a viable method for cross-language information retrieval (CLIR), but it suffers from translation ambiguities caused by multiple translations of individual query terms. Previous research has employed various methods for disambiguation, including the method of selecting an individual target query term from multiple candidates by comparing their statistical associations with the candidate translations of other query terms. This paper proposes a new method where we examine all combinations of target query term translations corresponding to the source query terms, instead of looking at the candidates for each query term and selecting the best one at a time. The goodness value for a combination of target query terms is computed based on the association value between each pair of the terms in the combination. We tested our method using the NTCIR-3 English–Korean CLIR test collection. The results show some improvements regardless of the association measures we used.  相似文献   

14.
A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from available translation resources, since translation quality directly affects the retrieval performance. Among different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora, without requiring any additional linguistic resources, for the CLIR task. In the first step, translations are extracted by deriving correlations between source–target word pairs. These correlations are used to estimate word translation probabilities in the second step. We propose a language modeling approach for the first step, where modeling based on probability distribution provides two key advantages. First, our approach can be tuned easier in comparison with heuristically adjusted previous work. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations from comparable corpora. As an indication, we integrate monolingual relations of word co-occurrences into the process of translation extraction, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English–Persian comparable corpus show that our method outperforms the previous approaches in terms of both translation quality and the performance of CLIR. Indeed, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of word translation probabilities, estimated in the second step of our approach, on the performance of CLIR.  相似文献   

15.
This paper reviews state-of-the-art techniques and methods for enhancing effectiveness of cross-language information retrieval (CLIR). The following research issues are covered: (1) matching strategies and translation techniques, (2) methods for solving the problem of translation ambiguity, (3) formal models for CLIR such as application of the language model, (4) the pivot language approach, (5) methods for searching multilingual document collection, (6) techniques for combining multiple language resources, etc.  相似文献   

16.
Technical terms and proper names constitute a major problem in dictionary-based cross-language information retrieval (CLIR). However, technical terms and proper names in different languages often share the same Latin or Greek origin, being thus spelling variants of each other. In this paper we present a novel two-step fuzzy translation technique for cross-lingual spelling variants. In the first step, transformation rules are applied to source words to render them more similar to their target language equivalents. The rules are generated automatically using translation dictionaries as source data. In the second step, the intermediate forms obtained in the first step are translated into a target language using fuzzy matching. The effectiveness of the technique was evaluated empirically using five source languages and English as a target language. The two-step technique performed better, in some cases considerably better, than fuzzy matching alone. Even using the first step as such showed promising results.  相似文献   

17.
白杨 《科教文汇》2014,(28):122-123
公示语翻译对高校英语教学至关重要,它在一定程度上反映了高校英语教学的水平以及不足之处。由于中西文化的差异,公示语在表达、翻译策略和跨文化意识方面容易存在偏差。针对公示语的翻译在高校英语教学中存在的各种问题,还需制定有效的教学策略。本文结合我国高校英语教学中的公示语翻译状况,根据公示语的定义和分类,阐述公示语的语用功能和示意功能,探讨公示语翻译在高效英语教学中的贯穿策略。  相似文献   

18.
潘毓卿 《大众科技》2014,(11):171-172
翻译不仅是一种语言活动,同样也是一种思维活动,思维是翻译活动的基础,对翻译活动能够产生重要的影响。英汉思维方式对比在中英翻译教学中的应用,能够有效提升学生的翻译质量,培养学生的英语思维能力。文章主要对英汉思维方式对比在中英翻译教学中应用的意义和措施进行分析,以期能够帮助学生更好的学习英语,提升学生的英语翻译水平。  相似文献   

19.
邵玲玲 《科教文汇》2011,(7):104-105
翻译是英语专业高年级的必修课程,对于翻译教学的研究还远远少于对翻译本身的研究探索。针对现在翻译教学中的普遍现象.即把翻译教学当做外语课来教授.把对外语的讲解多于翻译知识及翻译技能的传授.出现了主次不分.甚至本末倒置的现象.本文将比较翻译教学和外语教学的异同.总结出本科英语专业翻译教学课堂策略。  相似文献   

20.
In the KL divergence framework, the extended language modeling approach has a critical problem of estimating a query model, which is the probabilistic model that encodes the user’s information need. For query expansion in initial retrieval, the translation model had been proposed to involve term co-occurrence statistics. However, the translation model was difficult to apply, because the term co-occurrence statistics must be constructed in the offline time. Especially in a large collection, constructing such a large matrix of term co-occurrences statistics prohibitively increases time and space complexity. In addition, reliable retrieval performance cannot be guaranteed because the translation model may comprise noisy non-topical terms in documents. To resolve these problems, this paper investigates an effective method to construct co-occurrence statistics and eliminate noisy terms by employing a parsimonious translation model. The parsimonious translation model is a compact version of a translation model that can reduce the number of terms containing non-zero probabilities by eliminating non-topical terms in documents. Through experimentation on seven different test collections, we show that the query model estimated from the parsimonious translation model significantly outperforms not only the baseline language modeling, but also the non-parsimonious models.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号