Empirical studies on the impact of lexical resources on CLIR performance
Institution:1. The StoryLab@Texas A&M, Texas A&M University, United States;2. Department of Visualization, Texas A&M University, United States;3. TAMU Embodied Interaction Lab, Texas A&M University, United States;1. Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China;2. School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China;3. Key Laboratory of Network Information System Technology (NIST), Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China;4. Key Laboratory for Information Science of Electromagnetic Waves (MoE), Fudan University, Shanghai, China;5. Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China;6. Beijing Remote Sensing Information Institute, Beijing 100011, China;7. Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology, Karlsruhe, Germany;8. Fujian Key Laboratory of Sensing and Computing for Smart Cities, School of Information Science and Engineering, Xiamen University, Xiamen 361005, China;9. Fujian Collaborative Innovation Center for Big Data Applications in Governments, Fuzhou 350003, China
Abstract: In this paper, we compile and review several experiments measuring cross-lingual information retrieval (CLIR) performance as a function of the following resources: bilingual term lists, parallel corpora, machine translation (MT), and stemmers. Our CLIR system uses a simple probabilistic language model; the studies used TREC test corpora over Chinese, Spanish and Arabic. Our findings include:
  • One can achieve acceptable CLIR performance using only a bilingual term list (70–80% on Chinese and Arabic corpora).
  • However, if both a bilingual term list and parallel corpora are available, CLIR performance can rival monolingual performance.
  • If no parallel corpus is available, pseudo-parallel texts produced by an MT system can partially compensate for the lack of parallel text.
  • While stemming is normally useful, it hurt performance in our empirical studies with Arabic, a highly inflected language, when a very large Arabic–English parallel corpus was available.
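The abstract does not spell out the retrieval model, so the following is only a rough sketch of how a lexical resource can plug into a probabilistic CLIR scorer. It assumes a generic query-likelihood formulation in which each English query term is generated either from a background English model or by translating a term in the foreign-language document. The function name `clir_score`, the mixing weight `alpha`, the floor value for unseen terms, and the toy Spanish data are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def clir_score(query_terms, doc_terms, term_list, background, alpha=0.3):
    """Log-likelihood of an English query given a foreign-language document.

    term_list maps each foreign term to {english_term: P(english | foreign)}.
    background maps each English term to its general-English probability.
    alpha is an assumed smoothing weight, not a value from the paper.
    """
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for q in query_terms:
        # Probability of producing q by translating some term t of the document:
        # sum over t of P(q | t) * P(t | doc).
        translated = 0.0
        for t, count in doc_counts.items():
            translated += term_list.get(t, {}).get(q, 0.0) * count / doc_len
        # Smooth with the background model so unmatched terms do not give log(0).
        p = alpha * background.get(q, 1e-6) + (1 - alpha) * translated
        score += math.log(p)
    return score


# Toy data (hypothetical). With only a bilingual term list, translation
# probabilities would typically be uniform over the listed translations.
term_list = {
    "gato": {"cat": 1.0},
    "perro": {"dog": 1.0},
    "banco": {"bank": 0.5, "bench": 0.5},
}
background = {"cat": 0.01, "dog": 0.01, "bank": 0.01, "bench": 0.01}
doc_a = ["gato", "gato", "banco"]
doc_b = ["perro", "banco"]

ranked = sorted(
    [("doc_a", doc_a), ("doc_b", doc_b)],
    key=lambda item: clir_score(["cat"], item[1], term_list, background),
    reverse=True,
)
```

In a sketch like this, the resources compared in the findings differ mainly in where `term_list` comes from: a bilingual term list alone gives uniform translation probabilities, while a parallel (or MT-generated pseudo-parallel) corpus would allow them to be estimated from data.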
This article is indexed in ScienceDirect and other databases.