TSVFN: Two-Stage Visual Fusion Network for multimodal relation extraction
Abstract:Multimodal relation extraction is a key task in information extraction that aims to predict the relation class between head and tail entities from a linguistic sequence and its associated image. However, current approaches are vulnerable to less-relevant visual objects detected in images and cannot sufficiently fuse visual information into text pre-trained models. To overcome these problems, we propose a Two-Stage Visual Fusion Network (TSVFN) that applies multimodal fusion to vision-enhanced entity relation extraction. In the first stage, we design multimodal graphs whose novelty lies mainly in transforming sequence learning into graph learning. In the second stage, we merge the transformer-based visual representation into the text pre-trained model through a multi-scale cross-modal projector; specifically, two multimodal fusion operations are applied inside the pre-trained model. The two fusion stages together achieve deep interaction between multimodal, multi-structured data. Extensive experiments on the MNRE dataset show that our model outperforms the current state-of-the-art method by 1.76%, 1.52%, 1.29%, and 1.17% in accuracy, precision, recall, and F1 score, respectively. Moreover, our model also achieves strong results when fewer training samples are available.
Keywords:Multimodal relation extraction  Two-stage visual fusion  Multimodal pretrained model  Multimodal graph  Graph convolution networks  Vision transformer  Information extraction
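The abstract describes a two-stage architecture: graph learning over a joint text-and-visual-object graph, followed by injection of multi-scale visual features into a text pre-trained model via a cross-modal projector. The following is a minimal illustrative sketch of that idea in PyTorch; all module names, dimensions, the toy adjacency construction, and the use of a generic TransformerEncoderLayer in place of the actual pre-trained text model are assumptions for illustration, not the authors' implementation.

# Minimal sketch of the two-stage fusion idea (illustrative assumptions throughout).
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """Stage 1 (assumed form): one graph-convolution step over a joint text + visual-object graph."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (num_nodes, dim); adj: (num_nodes, num_nodes) with self-loops included
        deg = adj.sum(-1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.linear((adj / deg) @ x))  # row-normalized message passing


class CrossModalProjector(nn.Module):
    """Stage 2 (assumed form): project multi-scale ViT-like features into the text hidden space."""
    def __init__(self, vis_dims, txt_dim):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, txt_dim) for d in vis_dims)

    def forward(self, vis_feats):
        # vis_feats: list of (num_patches_i, vis_dims[i]) tensors from different visual scales
        fused = torch.stack([p(v).mean(0) for p, v in zip(self.projs, vis_feats)])
        return fused.mean(0)  # single visual vector to inject into the text model


# Toy usage with random tensors (shapes and graph structure are hypothetical).
dim, num_text, num_objs = 64, 12, 3
nodes = torch.randn(num_text + num_objs, dim)                 # text tokens + detected objects
adj = torch.eye(num_text + num_objs)
adj[:num_text, num_text:] = adj[num_text:, :num_text] = 1.0   # connect tokens to visual objects

stage1 = GCNLayer(dim)
graph_out = stage1(nodes, adj)                                # stage 1: graph learning

projector = CrossModalProjector(vis_dims=[96, 192], txt_dim=dim)
vis_token = projector([torch.randn(49, 96), torch.randn(196, 192)])

text_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
fused_text = text_layer(graph_out[:num_text].unsqueeze(0) + vis_token)  # stage 2: fuse visual cue
print(fused_text.shape)  # (1, 12, 64)

In this sketch the projected visual vector is simply added to the token representations before a transformer layer; where exactly the fusion operations sit inside the pre-trained model is a design choice of the paper and is not reproduced here.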
This article is indexed in ScienceDirect and other databases.