Language models and fusion for authorship attribution
Institution:1. Japan Advanced Institute of Science and Technology (JAIST) Japan;2. Toshiba Research & Development Center, Japan;1. School of Data and Computer Science, Sun Yat-sen University, China;2. School of Computing Science, University of Glasgow, Glasgow, UK;3. School of Computer Science, The University of Adelaide, Adelaide, Australia;1. School of Management and Economics, University of Electronic Science and Technology of China, Chengdu 611731, China;2. School of Business, Yunnan University of Finance and Economics, Kunming 650221, China;3. School of Business and Management, Shanghai International Studies University, Shanghai 201620, China
Abstract: We address the task of authorship attribution, i.e., identifying the author of an unknown document, and propose the use of part-of-speech (POS) tags as features for language modeling. The experiments are carried out on corpora that are atypical for the task, i.e., documents written by non-professional writers, such as movie reviews and tweets. The former corpus is homogeneous with respect to topic, which makes the task more challenging. The latter corpus places language models in a setting of continuously and rapidly evolving language, unique and noisy writing style, and the limited length of social media messages. Although we find that language models based on POS tags are competitive on only one of the corpora (movie reviews), they generally provide efficiency benefits and robustness against data sparsity. Furthermore, we experiment with model fusion, in which language models based on different modalities are combined. By linearly combining three language models, based on characters, words, and POS trigrams, respectively, we achieve the best generalization accuracy of 96% on movie reviews, while the combination of language models based on characters and POS trigrams yields 54% accuracy on the Twitter corpus. In fusion, POS language models prove to be essential and effective components.
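The linear fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not specify the smoothing method, the scoring function, or the fusion weights, so the add-one smoothing, the length-normalized log-probability score, the equal weights, the use of trigrams for every view, and all names (`NGramLM`, `fuse`, `attribute`) are assumptions made for illustration. POS tags are supplied here as precomputed sequences; in practice they would come from a tagger.

```python
import math
from collections import Counter


class NGramLM:
    """N-gram language model with add-one smoothing over any symbol sequence.

    Assumption: add-one smoothing and n=3 are illustrative choices only.
    """

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of (context + symbol)
        self.contexts = Counter()  # counts of context alone
        self.vocab = set()

    def _pad(self, seq):
        return ["<s>"] * (self.n - 1) + list(seq) + ["</s>"]

    def fit(self, sequences):
        for seq in sequences:
            padded = self._pad(seq)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                ctx = tuple(padded[i - self.n + 1:i])
                self.ngrams[ctx + (padded[i],)] += 1
                self.contexts[ctx] += 1
        return self

    def avg_logprob(self, seq):
        """Length-normalized log-probability of seq under this model."""
        padded = self._pad(seq)
        v = len(self.vocab) + 1  # one extra slot for unseen symbols
        total = 0.0
        count = 0
        for i in range(self.n - 1, len(padded)):
            ctx = tuple(padded[i - self.n + 1:i])
            num = self.ngrams[ctx + (padded[i],)] + 1
            den = self.contexts[ctx] + v
            total += math.log(num / den)
            count += 1
        return total / count


def fuse(scores_by_view, weights):
    """Weighted linear combination of per-view scores (author -> score)."""
    authors = list(next(iter(scores_by_view.values())))
    return {
        a: sum(weights[view] * scores[a] for view, scores in scores_by_view.items())
        for a in authors
    }


def attribute(doc_views, models, weights):
    """Score a document under each author's model per view, then fuse."""
    scores = {
        view: {author: lm.avg_logprob(doc_views[view])
               for author, lm in per_author.items()}
        for view, per_author in models.items()
    }
    fused = fuse(scores, weights)
    return max(fused, key=fused.get), fused


# Toy data (hypothetical): two authors, a character view and a POS view.
train = {
    "chars": {"A": [list("aaaa aaab")], "B": [list("bbbb bbba")]},
    "pos": {"A": [["DT", "NN", "VB", "DT", "NN"]],
            "B": [["VB", "VB", "RB", "VB", "RB"]]},
}
models = {
    view: {author: NGramLM().fit(seqs) for author, seqs in per_author.items()}
    for view, per_author in train.items()
}
doc = {"chars": list("aaab aaaa"), "pos": ["DT", "NN", "VB", "DT", "NN"]}
weights = {"chars": 0.5, "pos": 0.5}  # assumed equal weights

pred, fused = attribute(doc, models, weights)
print(pred, fused)
```

Normalizing each score by sequence length keeps the views comparable before they are summed, since a document has far more characters than POS tags. The weights would normally be tuned on held-out data rather than fixed by hand.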
This article is indexed in ScienceDirect and other databases.