Found 20 similar articles (search time: 218 ms)
Multimodal fake news detection methods based on semantic information have achieved great success. However, these methods exploit only the deep features of multimodal information, which leads to a large loss of valid shallow-level information. To address this problem, we propose a progressive fusion network (MPFN) for multimodal disinformation detection, which captures the representational information of each modality at different levels and uses a mixer to fuse modalities both at the same level and across levels, establishing a strong connection between the modalities. Specifically, we use a transformer structure, which is effective in computer vision tasks, as a visual feature extractor to gradually sample features at different levels, and we combine these with features obtained from a text feature extractor and with image frequency-domain information at different levels for fine-grained modeling. In addition, we design a feature fusion approach to better establish connections between modalities, which further improves performance and surpasses other network structures in the literature. We conducted extensive experiments on two real datasets, Weibo and Twitter; our method achieved 83.3% accuracy on the Twitter dataset, an improvement of at least 4.3% over other state-of-the-art methods. This demonstrates the effectiveness of MPFN for identifying fake news: the method reaches a relatively advanced level by combining different levels of information from each modality with a powerful modality fusion method.

This paper focuses on personalized outfit generation, which aims to generate compatible fashion outfits tailored to given users. Personalized recommendation by generating outfits of compatible items is an emerging task in the recommendation community with great commercial value, but it remains little explored. The task requires modeling both user-outfit personalization and outfit compatibility, each of which is challenging because of the huge learning space resulting from the large number of items, users, and possible outfit options. To specify user preferences on outfits and regularize outfit compatibility modeling, we propose to incorporate coordination knowledge in fashion. Inspired by the fact that users might have coordination preferences in terms of category combinations, we first define category combinations as templates and propose to model the user-template relationship to capture users’ coordination preferences. Moreover, since a small number of templates can cover the majority of fashion outfits, templates are also promising for guiding the outfit generation process. In this paper, we propose the Template-guided Outfit Generation (TOG) framework, which unifies the learning of user-template interaction, user–item interaction, and outfit compatibility modeling. Personal preference modeling and outfit generation are blended together in our problem formulation and can therefore be achieved simultaneously. Furthermore, we propose new evaluation protocols to evaluate different models from both the personalization and compatibility perspectives. Extensive experiments on two public datasets demonstrate that the proposed TOG performs better from both evaluation perspectives, outperforming the most competitive baseline, BGN, by 7.8% and 10.3% in personalization precision on the iFashion and Polyvore datasets, respectively, and improving the compatibility of the generated outfits by over 2%.

Graph neural networks (GNNs) have shown great potential for personalized recommendation. The core idea is to reorganize interaction data as a user-item bipartite graph and exploit high-order connectivity among user and item nodes to enrich their representations. While achieving great success, most existing works build the interaction graph from ID information only, forgoing item contents from multiple modalities (e.g., visual, acoustic, and textual features of micro-video items). Distinguishing personal interests on different modalities at a granular level was not explored until the recently proposed MMGCN (Wei et al., 2019). However, it simply employs GNNs on parallel interaction graphs and treats information propagated from all neighbors equally, failing to capture user preference adaptively. Hence, the obtained representations might preserve redundant or even noisy information, leading to non-robustness and suboptimal performance. In this work, we investigate how to adopt GNNs on multimodal interaction graphs to adaptively capture user preference on different modalities and to offer in-depth analysis of why an item is suitable for a user. To this end, we propose a new Multimodal Graph Attention Network, MGAT for short, which disentangles personal interests at the granularity of modality. In particular, built upon multimodal interaction graphs, MGAT conducts information propagation within individual graphs while leveraging a gated attention mechanism to identify the varying importance of different modalities to user preference. As such, it is able to capture more complex interaction patterns hidden in user behaviors and provide more accurate recommendations. Empirical results on two micro-video recommendation datasets, Tiktok and MovieLens, show that MGAT exhibits substantial improvements over state-of-the-art baselines such as NGCF (Wang, He, et al., 2019) and MMGCN (Wei et al., 2019). Further analysis of a case study illustrates how MGAT generates attentive information flow over multimodal interaction graphs.

As one of the challenging cross-modal tasks, video question answering (VideoQA) aims to fully understand video content and answer relevant questions. The mainstream approach in current work extracts appearance and motion features separately to characterize videos, ignoring the interactions between them and with the question. Furthermore, some crucial semantic interaction details between visual objects are overlooked. In this paper, we propose a novel Relation-aware Graph Reasoning (ReGR) framework for video question answering, which first combines appearance–motion and location–semantic multiple interaction relations between visual objects. For the interaction between appearance and motion, we design the Appearance–Motion Block, which is question-guided to capture the interdependence between appearance and motion. For the interaction between location and semantics, we design the Location–Semantic Block, which utilizes the constructed Multi-Relation Graph Attention Network to capture the geometric position and semantic interaction between objects. Finally, the question-driven Multi-Visual Fusion captures more accurate multimodal representations. Extensive experiments on three benchmark datasets, TGIF-QA, MSVD-QA, and MSRVTT-QA, demonstrate the superiority of our proposed ReGR over state-of-the-art methods.

Knowledge graphs are sizeable graph-structured knowledge bases that represent both abstract and concrete concepts in the form of entities and relations. Recently, convolutional neural networks have achieved outstanding results in learning more expressive representations of knowledge graphs. However, existing deep learning-based models exploit semantic information from single-level feature interaction, potentially limiting expressiveness. To alleviate this issue, we propose ConvHLE, a knowledge graph embedding model with an attention-based convolutional network for high-low level feature interaction. This model effectively harvests richer semantic information and generates more expressive representations. Concretely, a multilayer convolutional neural network is utilized to fuse high- and low-level features. Then, features in the fused feature maps interact with other informative neighbors through the criss-cross attention mechanism, which expands the receptive fields and boosts the quality of interactions. Finally, a plausibility score function is proposed for the evaluation of our model. The performance of ConvHLE is experimentally investigated on six benchmark datasets with individual characteristics. Extensive experimental results show that ConvHLE learns more expressive and discriminative feature representations and outperforms other state-of-the-art baselines on most metrics in link prediction tasks. On FB15K-237, our model outperforms the baseline ConvE by 13.5% in MRR and 16.0% in Hits@1.

The matrix factorization model based on user-item rating data has been widely studied and applied in recommender systems. However, data sparsity, the cold-start problem, and poor explainability have restricted its performance. Textual reviews usually contain rich information about items’ features and users’ sentiments and preferences, which can compensate for the insufficient information available from user ratings alone. However, most recommendation algorithms that take sentiment analysis of review texts into account are either fine- or coarse-grained, but not both, leading to uncertain accuracy and comprehensiveness regarding user preference. This study proposes a deep learning recommendation model, DeepCGSR, that integrates textual review sentiments and the rating matrix. DeepCGSR uses the review sets of users and items as a corpus to perform cross-grained sentiment analysis, combining fine- and coarse-grained levels to extract sentiment feature vectors for users and items. Deep learning is used to map between the extracted feature vectors and the latent factors of the rating-based matrix factorization model and to obtain deep, nonlinear features for predicting the user's rating of an item. Iterative experiments on e-commerce datasets from Amazon show that DeepCGSR consistently outperforms the recommendation models LFM, SVD++, DeepCoNN, TOPICMF, and NARRE. Overall, compared with the other recommendation models, DeepCGSR improved evaluation results by 14.113% over LFM, 13.786% over SVD++, 9.920% over TOPICMF, 5.122% over DeepCoNN, and 2.765% over NARRE. Meanwhile, DeepCGSR has great potential for mitigating the overfitting and cold-start problems. Built upon previous studies and findings, DeepCGSR is the state of the art, moving the design and development of recommendation algorithms forward with improved recommendation accuracy.

Recommendation is an effective marketing tool widely used in e-commerce, and recommendations can be made based on ratings predicted from the rating data of purchased items. To improve the accuracy of rating prediction, user reviews or product images have been used separately as side information to learn the latent features of users (items). In this study, we developed a hybrid approach that analyzes both user sentiments from review texts and user preferences from item images to make item recommendations more personalized. The hybrid model consists of two parallel modules that perform a procedure named multiscale semantic and visual analyses (MSVA). The first module conducts semantic analysis of review documents in various aspects with word-aware and scale-aware attention mechanisms, while the second module extracts visual features with block-aware and visual-aware attention mechanisms. The MSVA model was trained, validated, and tested using Amazon Product Data containing sampled reviews varying from 492,970 to 1 million records across 22 different domains. Three state-of-the-art recommendation models were used as baselines for performance comparison. On average, MSVA reduced the mean squared error (MSE) of predicted ratings by 6.00%, 3.14%, and 3.25% relative to the three baselines. The results demonstrate that combining semantic and visual analyses enhanced MSVA's performance across a wide variety of products, and that the multiscale scheme used in both the review and visual modules of MSVA contributed significantly to rating prediction.

Sequential recommendation models a user’s historical sequence to predict future items. Existing studies utilize deep learning methods and contrastive learning for data augmentation to alleviate data sparsity. However, these methods cannot learn accurate, high-quality item representations while augmenting data. In addition, they usually ignore data noise and user cold-start issues. To solve these issues, we investigate combining a Generative Adversarial Network (GAN) with contrastive learning for sequential recommendation to balance data sparsity and noise. Specifically, we propose a new framework, Enhanced Contrastive Learning with Generative Adversarial Network for Sequential Recommendation (ECGAN-Rec), which models the training process as a GAN and treats the recommendation task as the main task of the discriminator. We design a sequence augmentation module and a contrastive GAN module to implement both data-level and model-level augmentations. In addition, the contrastive GAN learns more accurate, high-quality item representations to alleviate data noise after data augmentation. Furthermore, we propose an enhanced Transformer recommender based on GAN to optimize the performance of the model. Experimental results on three open datasets validate the efficiency and effectiveness of the proposed model and its ability to balance data noise and data sparsity. Specifically, the improvements of ECGAN-Rec in two evaluation metrics (HR@N and NDCG@N) over the state-of-the-art model on the Beauty, Sports, and Yelp datasets are 34.95%, 36.68%, and 13.66%, respectively. Our implemented model is available via https://github.com/nishawn/ECGANRec-master.
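
The two metrics named in this abstract, HR@N and NDCG@N, can be sketched as follows. This is a minimal illustration under the common next-item setting, assuming exactly one held-out target item per user; the function names are hypothetical and are not taken from the ECGAN-Rec repository.

```python
import math
from typing import List


def hit_rate_at_n(ranked: List[str], target: str, n: int) -> float:
    """HR@N for one user: 1.0 if the target appears in the top-n, else 0.0."""
    return 1.0 if target in ranked[:n] else 0.0


def ndcg_at_n(ranked: List[str], target: str, n: int) -> float:
    """NDCG@N for one user, assuming a single relevant item.

    With one relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1), where rank is the 1-based position of the target.
    """
    for pos, item in enumerate(ranked[:n]):  # pos is 0-based
        if item == target:
            return 1.0 / math.log2(pos + 2)
    return 0.0
```

In practice both values are averaged over all test users; the per-user form above keeps the definitions easy to check by hand.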

Graph neural networks have been frequently applied in recommender systems due to their powerful representation abilities for irregular data. However, these methods still suffer from difficulties such as inflexible graph structure, sparse and highly imbalanced data, and relatively shallow networks, which limit their rating prediction ability for recommendations. This paper presents a novel deep dynamic graph attention framework based on influence and preference relationship reconstruction (DGA-IPR) for recommender systems to learn optimal latent representations of users and items. The entire framework involves a user branch and an item branch. An influence-based dynamic graph attention (IDGA) module, a preference-based dynamic graph attention (PDGA) module, and an adaptive fine feature extraction (AFFE) module are constructed for each branch. Concretely, the first two attention modules concentrate on reconstructing influence and preference relationship graphs, breaking the imbalanced and fixed constraints of graph structures. Then a deep feature aggregation block and an adaptive feature fusion operation are built, increasing the network depth and capturing potential high-order information. In addition, AFFE is designed to acquire finer latent features for users and items. The DGA-IPR architecture is formed by integrating IDGA, PDGA, and AFFE for users and items, respectively. Experiments reveal the superiority of DGA-IPR over existing recommendation models.

This paper presents a robust and comprehensive graph-based rank aggregation approach, used to combine the results of isolated ranker models in retrieval tasks. The method follows an unsupervised scheme, which is independent of how the isolated ranks are formulated. Our approach is able to combine arbitrary models, defined in terms of different ranking criteria, such as those based on textual, image, or hybrid content representations. We reformulate the ad-hoc retrieval problem as document retrieval based on fusion graphs, which we propose as a new unified representation model capable of merging multiple ranks and expressing inter-relationships of retrieval results automatically. By doing so, we claim that the retrieval system can benefit from learning the manifold structure of datasets, thus leading to more effective results. Another contribution is that our graph-based aggregation formulation, unlike existing approaches, allows for encapsulating contextual information encoded from multiple ranks, which can be directly used for ranking, without further computations and post-processing steps over the graphs. Based on the graphs, a novel similarity retrieval score is formulated using an efficient computation of minimum common subgraphs. Finally, another benefit over existing approaches is the absence of hyperparameters. A comprehensive experimental evaluation was conducted on diverse well-known public datasets, composed of textual, image, and multimodal documents. The experiments demonstrate that our method reaches top performance, yielding better effectiveness scores than state-of-the-art baseline methods and large gains over the rankers being fused, thus demonstrating the capability of the proposal to represent queries based on a unified graph-based model of rank fusions.

Detecting sentiments in natural language is tricky even for humans, making its automated detection more complicated. This research proffers a hybrid deep learning model for fine-grained sentiment prediction in real-time multimodal data. It combines the strengths of deep learning nets with machine learning to deal with two specific semiotic systems, namely the textual (written text) and visual (still images), and their combination within online content using decision-level multimodal fusion. The proposed contextual ConvNet-SVMBoVW model has four modules, namely, the discretization, text analytics, image analytics, and decision modules. The input to the model is multimodal text, m ∈ {text, image, info-graphic}. The discretization module uses Google Lens to separate the text from the image; the two are then processed as discrete entities and sent to the text analytics and image analytics modules, respectively. The text analytics module determines the sentiment using a hybrid of a convolution neural network (ConvNet) enriched with the contextual semantics of SentiCircle. An aggregation scheme is introduced to compute the hybrid polarity. A support vector machine (SVM) classifier is trained using bag-of-visual-words (BoVW) to predict the sentiment of the visual content. A Boolean decision module with a logical OR operation is added to the architecture; it validates and categorizes the output on the basis of five fine-grained sentiment categories (truth values), namely ‘highly positive,’ ‘positive,’ ‘neutral,’ ‘negative’ and ‘highly negative.’ The accuracy achieved by the proposed model is nearly 91%, which is an improvement over the accuracy obtained by the text and image modules individually.

Multi-feature fusion has achieved gratifying performance in image retrieval. However, some existing fusion mechanisms can make the result worse than expected due to the domain and visual diversity of images. As a result, a pressing problem in applying feature fusion is how to identify and improve the complementarity of multi-level heterogeneous features. To this end, this paper proposes an adaptive multi-feature fusion method via cross-entropy normalization for effective image retrieval. First, various low-level features (e.g., SIFT) and high-level semantic features based on deep learning are extracted. Under each level of feature representation, the initial similarity scores of the query image w.r.t. the target dataset are calculated. Second, we use an independent reference dataset to approximate the tail of the resulting initial similarity score ranking curve by cross-entropy normalization. Then the area under the ranking curve is calculated as an indicator of the merit of the corresponding feature (i.e., a smaller area indicates a more suitable feature). Finally, the fusion weight of each feature is assigned adaptively according to these areas. Extensive experiments on three public benchmark datasets demonstrate that the proposed method achieves superior performance compared with existing methods, with relative improvements in mAP of 1.04% (Holidays) and 1.22% (Oxf5k), and an improvement in N-S of 0.04 (UKbench).
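
The area-based weighting idea in this abstract can be sketched roughly as follows. This is a sketch under stated assumptions, not the paper's method: min-max normalization stands in for the cross-entropy normalization against a reference dataset, and inverse-area weighting is an assumed rule, since the abstract only says that a smaller area indicates a more suitable feature. All names are hypothetical.

```python
from typing import Dict, List


def curve_area(scores: List[float]) -> float:
    """Area under the descending, min-max normalized score curve.

    Assumes a non-empty score list. Min-max normalization is a stand-in
    for the paper's cross-entropy normalization.
    """
    ranked = sorted(scores, reverse=True)
    lo, hi = ranked[-1], ranked[0]
    if hi == lo:
        # Flat curve: every normalized score is treated as 1 (largest area).
        return float(len(ranked) - 1)
    norm = [(s - lo) / (hi - lo) for s in ranked]
    # Trapezoid rule with unit spacing between consecutive ranks.
    return sum((a + b) / 2.0 for a, b in zip(norm, norm[1:]))


def fusion_weights(scores_by_feature: Dict[str, List[float]]) -> Dict[str, float]:
    """Assumed rule: weight is proportional to 1 / area, normalized to sum to 1."""
    inv = {name: 1.0 / (curve_area(s) + 1e-9) for name, s in scores_by_feature.items()}
    total = sum(inv.values())
    return {name: v / total for name, v in inv.items()}
```

A feature whose scores drop off sharply after the first few results yields a small area and therefore a larger weight, which matches the intuition that it separates relevant from irrelevant images more cleanly for that query.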

As an emerging task in opinion mining, End-to-End Multimodal Aspect-Based Sentiment Analysis (MABSA) aims to extract all the aspect-sentiment pairs mentioned in a sentence-image pair. Most existing MABSA methods do not explicitly incorporate aspect and sentiment information in their textual and visual representations and fail to consider the different contributions of visual representations to each word or aspect in the text. To tackle these limitations, we propose a multi-task learning framework named Cross-Modal Multitask Transformer (CMMT), which incorporates two auxiliary tasks to learn aspect/sentiment-aware intra-modal representations and introduces a Text-Guided Cross-Modal Interaction Module to dynamically control the contribution of the visual information to the representation of each word in the inter-modal interaction. Experimental results demonstrate that CMMT consistently outperforms the state-of-the-art approach JML by 3.1, 3.3, and 4.1 absolute percentage points, respectively, on three Twitter datasets for the End-to-End MABSA task. Moreover, further analysis shows that CMMT is superior to comparison systems in both aspect extraction (AE) and sentiment classification (SC), which should move the development of multimodal AE and SC algorithms forward with improved performance.

Content-based image retrieval (CBIR) with global features is notoriously noisy, especially for image queries with low percentages of relevant images in a collection. Moreover, CBIR typically ranks the whole collection, which is inefficient for large databases. We experiment with a method for image retrieval from multimedia databases that improves both the effectiveness and efficiency of traditional CBIR by exploiting secondary media. We perform retrieval in a two-stage fashion: first rank by a secondary medium, and then perform CBIR only on the top-K items. Thus, effectiveness is improved by performing CBIR on a ‘better’ subset. With a relatively ‘cheap’ first stage, efficiency is also improved because fewer CBIR operations are performed. Our main novelty is that K is dynamic, i.e., estimated per query to optimize a predefined effectiveness measure. We show that our dynamic two-stage method can be significantly more effective and robust than similar, previously proposed setups with static thresholds. In additional experiments using local feature derivatives in the visual stage instead of global ones, such as the emerging visual codebook approach, we find that two-stage retrieval does not work very well. We attribute the weaker performance of the visual codebook to the enhanced visual diversity produced by the textual stage, which diminishes the codebook’s advantage over global features. Furthermore, we compare dynamic two-stage retrieval to traditional score-based fusion of results retrieved visually and textually. We find that fusion is also significantly more effective than single-medium baselines. Although there is no clear winner between two-stage retrieval and fusion, the methods exhibit different robustness features; nevertheless, two-stage retrieval provides efficiency benefits over fusion.
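
The two-stage procedure described in this abstract can be sketched as follows. This is a minimal illustration, assuming the secondary medium is text with precomputed scores; the per-query estimation of K is not described in the abstract, so `k` is simply passed in, and discarding items below the cutoff is an assumption. All names are hypothetical.

```python
from typing import Callable, Dict, List


def two_stage_retrieval(
    text_scores: Dict[str, float],
    visual_score: Callable[[str], float],
    k: int,
) -> List[str]:
    """Rank by a cheap secondary medium, then re-rank only the top-k visually."""
    # Stage 1: rank the whole collection by the secondary (textual) score.
    first_stage = sorted(text_scores, key=lambda item: text_scores[item], reverse=True)
    top_k = first_stage[:k]
    # Stage 2: run the expensive visual comparison on the top-k items only.
    # Items below the cutoff are dropped (an assumption of this sketch).
    return sorted(top_k, key=visual_score, reverse=True)


# Usage with toy scores: "d" is visually the best match but is filtered
# out by the textual first stage, so it never reaches the visual stage.
text_scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
visual = {"a": 0.2, "b": 0.5, "c": 0.9, "d": 1.0}
result = two_stage_retrieval(text_scores, lambda item: visual[item], k=3)
```

The efficiency gain comes from calling `visual_score` only `k` times instead of once per item in the collection.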

In recent years, fake news detection has been a significant task attracting much attention. However, most current approaches utilize features from a single modality, such as text or image, while the comprehensive fusion of features from different modalities has been ignored. To deal with this problem, we propose a novel model named Bidirectional Cross-Modal Fusion (BCMF), which comprehensively integrates the textual and visual representations in a bidirectional manner. Specifically, the proposed model is decomposed into four submodules, i.e., the input embedding, the image2text fusion, the text2image fusion, and the prediction module. We conduct intensive experiments on four real-world datasets, i.e., Weibo, Twitter, Politi, and Gossip. The results show improvements in classification accuracy of 2.2, 2.5, 4.9, and 3.1 percentage points over the state-of-the-art methods on Weibo, Twitter, Politi, and Gossip, respectively. The experimental results suggest that the proposed model better captures integrated information from different modalities and generalizes well across datasets. Further experiments suggest that the bidirectional fusions, the number of attention heads, and the aggregating function can affect the performance of cross-modal fake news detection. The research sheds light on the role of bidirectional cross-modal fusion in leveraging multi-modal information to improve fake news detection.

Visual dialog, a visual-language task, enables an AI agent to engage in conversation with humans grounded in a given image. To generate appropriate answers for a series of questions in the dialog, the agent is required to understand the comprehensive visual content of an image and the fine-grained textual context of the dialog. However, previous studies typically utilized object-level visual features to represent a whole image, which focuses only on the local perspective of an image and ignores the importance of its global information. In this paper, we propose a novel model, Human-Like Visual Cognitive and Language-Memory Network for Visual Dialog (HVLM), to simulate global and local dual-perspective cognition in the human visual system and understand an image comprehensively. HVLM consists of two key modules, Local-to-Global Graph Convolutional Visual Cognition (LG-GCVC) and Question-guided Language Topic Memory (T-Mem). Specifically, in the LG-GCVC module, we design a question-guided dual-perspective reasoning scheme to jointly learn visual content from both local and global perspectives through a simple spectral graph convolution network. Furthermore, in the T-Mem module, we design an iterative learning strategy to gradually enhance fine-grained textual context details via an attention mechanism. Experimental results demonstrate the superiority of our proposed model, which obtains comparable performance on the benchmark datasets VisDial v1.0 and VisDial v0.9.

Image–text matching is a crucial branch of multimedia retrieval that relies on learning inter-modal correspondences. Most existing methods focus on global or local correspondence and fail to explore fine-grained global–local alignment. Moreover, the issue of how to infer more accurate similarity scores remains unresolved. In this study, we propose a novel unifying knowledge iterative dissemination and relational reconstruction (KIDRR) network for image–text matching. In particular, the knowledge graph iterative dissemination module is designed to iteratively broadcast global semantic knowledge, enabling relevant nodes to be associated and resulting in fine-grained intra-modal correlations and features. Vector-based similarity representations are then learned from multiple perspectives to model multi-level alignments comprehensively. The relation graph reconstruction module is further developed to enhance cross-modal correspondences by constructing similarity relation graphs and adaptively reconstructing them. We conducted experiments on the datasets Flickr30K and MSCOCO, which have 31,783 and 123,287 images, respectively. Experiments show that KIDRR achieves relative improvements in Recall@1 of nearly 2.2% and 1.6% on Flickr30K and MSCOCO, respectively, compared to the current state-of-the-art baselines.

This paper constructs a novel enhanced latent semantic model based on users’ comments and employs regularization factors to capture the temporal evolution of users’ latent topics for each commodity, so as to improve the accuracy of recommendation. The adaptive temporal weighting of multiple preference features is also improved: the preferences of different users in different time periods are calculated using human forgetting features, item interest overlap, and similarity at the semantic level of the review text, which improves accuracy on sparse rating data. The paper conducts comparison experiments with six temporal matrix decomposition-based baseline methods on nine datasets. The results show that the accuracy is 31.64% better than TimeSVD++, 21.08% better than BTMF, 15.51% better than TMRevCo, 13.99% better than BPTF, 9.24% better than TCMF, and 3.19% better than MUTPD, which indicates that the model is more effective in capturing users’ temporal interest drift and better reflects the evolutionary relationship between users’ latent topics and item ratings.

Recommender systems deal with information overload by retrieving the most relevant sources from the wide range of web services. They help users by predicting their interests in many domains such as e-government, social networks, e-commerce, and entertainment. Collaborative Filtering (CF) is the most promising technique used in recommender systems to give suggestions based on like-minded users’ preferences. Despite the widespread use of CF in providing personalized recommendations, this technique has problems including cold start, data sparsity, and gray sheep. These problems ultimately degrade the efficiency of CF. Many existing recommendation methods have been proposed to overcome the problems of CF. However, they fail to suggest top-n recommendations based on the ordering of users’ priorities. In this research, to overcome the shortcomings of CF and current recommendation methods in ranking preference datasets, we use a new graph-based structure to model users’ priorities and capture the association between users and items. Users’ profiles are created based on their past and current interests, because their interests can change over time. Our proposed algorithm keeps the preferred items of the active user at the beginning of the recommendation list. These items therefore fall within the top-n recommendations, which results in user satisfaction. The experimental results demonstrate that our algorithm achieves a significant improvement over CF and other proposed recommendation methods in terms of recall, precision, F-measure, and MAP on two benchmark datasets, MovieLens and Superstore.
