

An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit
Institution: 1. CERTH-ITI, 6th km Charilaou-Thermi Rd, 57001 Thermi, Thessaloniki, Greece; 2. AUTH, Department of Electrical and Computer Engineering, 54124 Thessaloniki, Greece; 1. Software and Information Systems Engineering Department, Ben-Gurion University, Beer-Sheva, Israel; 2. Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA; 1. School of Computer Science and Engineering, Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, China; 2. Department of Computer Science, University of Warwick, UK
Abstract: Methods for document clustering and topic modelling in online social networks (OSNs) offer a means of categorising, annotating and making sense of large volumes of user-generated content. Many techniques have been developed over the years, ranging from text mining and clustering methods to latent topic models and neural embedding approaches. However, many of these methods deliver poor results when applied to OSN data, because such text is notoriously short and noisy, and results are often not comparable across studies. In this study we evaluate several techniques for document clustering and topic modelling on three datasets from Twitter and Reddit. We benchmark four different feature representations derived from term frequency-inverse document frequency (tf-idf) matrices and word embedding models, combined with four clustering methods, and we include a Latent Dirichlet Allocation topic model for comparison. Several different evaluation measures are used in the literature, so we provide a discussion and recommendation for the most appropriate extrinsic measures for this task. We also demonstrate the performance of the methods over datasets with different document lengths. Our results show that clustering techniques applied to neural embedding feature representations delivered the best performance over all datasets using appropriate extrinsic evaluation measures. We also demonstrate a method for interpreting the clusters with a top-words-based approach using tf-idf weights combined with embedding distance measures.
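To make the notion of an extrinsic evaluation measure concrete, the following is a minimal sketch of normalised mutual information (NMI), which compares predicted cluster assignments against ground-truth labels. The abstract does not state which measures the authors recommend, so the choice of NMI, the arithmetic-mean normalisation, and the function name `normalized_mutual_information` are assumptions made purely for illustration.

```python
import math
from collections import Counter


def normalized_mutual_information(true_labels, cluster_labels):
    """Illustrative NMI (assumed measure, not taken from the paper).

    Returns a score in [0, 1]: 1.0 when the two partitions agree up to a
    renaming of cluster ids, 0.0 when they are statistically independent.
    """
    n = len(true_labels)
    if n == 0 or n != len(cluster_labels):
        raise ValueError("label sequences must be non-empty and equally long")

    true_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    joint_counts = Counter(zip(true_labels, cluster_labels))

    def entropy(counts):
        # Shannon entropy of a partition, in nats.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    # Mutual information between the two partitions.
    mutual_info = sum(
        (c / n) * math.log(c * n / (true_counts[t] * cluster_counts[k]))
        for (t, k), c in joint_counts.items()
    )

    # Normalise by the arithmetic mean of the two entropies.
    denom = (entropy(true_counts) + entropy(cluster_counts)) / 2
    # Both partitions trivial (one group each): treat as perfect agreement.
    return mutual_info / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Hypothetical toy labels: topics of six posts and two clusterings.
    topics = ["sport", "sport", "sport", "tech", "tech", "tech"]
    good_clusters = [1, 1, 1, 0, 0, 0]   # same grouping, ids renamed
    noisy_clusters = [0, 0, 1, 1, 1, 0]  # partly mixed grouping
    print(normalized_mutual_information(topics, good_clusters))
    print(normalized_mutual_information(topics, noisy_clusters))
```

Because the score depends only on how documents are grouped, not on the cluster ids themselves, it can be used to compare outputs of different clustering methods and feature representations against the same labelled dataset.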
This article has been indexed by ScienceDirect and other databases.