Using the Web as corpus for self-training text categorization
Authors: Rafael Guzmán-Cabrera; Manuel Montes-y-Gómez; Paolo Rosso; Luis Villaseñor-Pineda
Institution: (1) Facultad de Ingeniería Mecánica, Eléctrica y Electrónica, Universidad de Guanajuato, Guanajuato, Mexico; (2) Natural Language Engineering Lab., Polytechnic University of Valencia, Valencia, Spain; (3) Laboratorio de Tecnologías del Lenguaje, Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Mexico
Abstract: Most current methods for automatic text categorization are based on supervised learning and therefore require a large number of training instances to construct an accurate classifier. To tackle this problem, this paper proposes a new semi-supervised method for text categorization that automatically extracts unlabeled examples from the Web and applies an enriched self-training approach to construct the classifier. Although language independent, the method is particularly pertinent to scenarios where large sets of labeled resources do not exist, as may be the case for several application domains in non-English languages such as Spanish. The method was evaluated experimentally on three different tasks and in two different languages. The results demonstrate the applicability and usefulness of the proposed method.
Keywords: Text categorization; Semi-supervised learning; Self-training; Web as corpus; Authorship attribution
This article is indexed in SpringerLink and other databases.
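
The abstract names self-training as the core of the method but gives no algorithmic detail. The following is a minimal sketch of a generic self-training loop, not the authors' enriched variant. It assumes a small multinomial Naive Bayes classifier, documents that are already tokenized, and an unlabeled pool that has already been downloaded (the Web retrieval step is not modeled). All function names, the `per_class` and `min_margin` parameters, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return (best_label, margin); margin is the log-score gap to the runner-up."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    margin = scores[ranked[0]] - scores[ranked[1]] if len(ranked) > 1 else float("inf")
    return ranked[0], margin


def self_train(labeled_docs, labels, unlabeled_docs, per_class=1, min_margin=0.0, max_iters=5):
    """Generic self-training: repeatedly promote the most confident pool documents."""
    docs, labs = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    for _ in range(max_iters):
        if not pool:
            break
        model = train_nb(docs, labs)
        predictions = [(predict_nb(model, d), i) for i, d in enumerate(pool)]
        chosen = set()
        for label in sorted(set(labs)):
            # Rank the pool documents predicted as this class by confidence margin.
            candidates = sorted(
                ((margin, i) for (pred, margin), i in predictions
                 if pred == label and margin > min_margin),
                reverse=True,
            )
            for _margin, i in candidates[:per_class]:
                chosen.add(i)
                docs.append(pool[i])
                labs.append(label)  # the predicted label is treated as if it were true
        if not chosen:
            break  # nothing confident enough to add
        pool = [d for i, d in enumerate(pool) if i not in chosen]
    return train_nb(docs, labs)


# Toy data (hypothetical): one labeled document per class plus a small unlabeled pool.
labeled = [["goal", "match", "team"], ["election", "vote", "party"]]
labels = ["sports", "politics"]
unlabeled = [["team", "coach", "goal"], ["vote", "senate", "party"],
             ["coach", "league"], ["senate", "law"]]

baseline = train_nb(labeled, labels)
model = self_train(labeled, labels, unlabeled)
```

With the toy data, the baseline model has never seen "coach" or "league", so it cannot separate the classes for that document. After self-training, those words have been absorbed from the pool along with their predicted labels. Promoting only the highest-margin documents per class in each round is one common way to limit the label noise that self-training can introduce.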