Similar Documents
20 similar documents found (search time: 33 ms)
1.
Sentiment analysis on Twitter has attracted much attention recently due to its wide applications in both commercial and public sectors. In this paper we present SentiCircles, a lexicon-based approach for sentiment analysis on Twitter. Unlike typical lexicon-based approaches, which assign words fixed, static prior sentiment polarities regardless of context, SentiCircles takes into account the co-occurrence patterns of words in different contexts in tweets to capture their semantics, and updates their pre-assigned strength and polarity in sentiment lexicons accordingly. Our approach allows for the detection of sentiment at both entity level and tweet level. We evaluate the proposed approach on three Twitter datasets using three different sentiment lexicons to derive word prior sentiments. Results show that our approach significantly outperforms the baselines in accuracy and F-measure for entity-level subjectivity (neutral vs. polar) and polarity (positive vs. negative) detection. For tweet-level sentiment detection, our approach outperforms the state-of-the-art SentiStrength by 4–5% in accuracy on two datasets, but falls marginally behind by 1% in F-measure on the third.
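
The core SentiCircle idea can be rendered as a short sketch. In this simplified version, each context term of a target word becomes a point whose angle encodes the context term's prior polarity and whose radius encodes a TDOC-style correlation with the target; the median of the sentiment-axis coordinates yields the updated polarity. The function name, weighting, and toy inputs are illustrative assumptions, not the paper's exact formulation:

```python
import math
from collections import Counter

def senti_circle_polarity(term, tweets, prior_lexicon):
    """Simplified SentiCircle sketch: context terms of `term` are points
    (r*cos(theta), r*sin(theta)); theta scales the context term's prior
    polarity in [-1, 1] to [-pi, pi], r is a TDOC-style correlation weight.
    The median y-coordinate is the context-adjusted polarity of `term`."""
    n_docs, cooc, df = len(tweets), Counter(), Counter()
    for tokens in tweets:
        for t in set(tokens):
            df[t] += 1
        if term in tokens:
            cooc.update(set(tokens) - {term})
    ys = []
    for c, f in cooc.items():
        r = f * math.log(n_docs / df[c])             # TDOC radius
        theta = prior_lexicon.get(c, 0.0) * math.pi  # prior-polarity angle
        ys.append(r * math.sin(theta))               # sentiment axis
    ys.sort()
    return ys[len(ys) // 2] if ys else prior_lexicon.get(term, 0.0)

tweets = [["battery", "life", "amazing"], ["battery", "died", "fast"]]
print(senti_circle_polarity("battery", tweets, {"amazing": 0.9, "died": -0.7}))
```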

2.
In recent years, fake news detection has been a significant task attracting much attention. However, most current approaches utilize features from a single modality, such as text or image, and comprehensive fusion between the features of different modalities has been ignored. To address this problem, we propose a novel model named Bidirectional Cross-Modal Fusion (BCMF), which comprehensively integrates textual and visual representations in a bidirectional manner. Specifically, the proposed model is decomposed into four submodules: the input embedding, the image2text fusion, the text2image fusion, and the prediction module. We conduct intensive experiments on four real-world datasets: Weibo, Twitter, Politi, and Gossip. The results show improvements in classification accuracy of 2.2, 2.5, 4.9, and 3.1 percentage points over state-of-the-art methods on Weibo, Twitter, Politi, and Gossip, respectively. The experimental results suggest that the proposed model better captures the integrated information of different modalities and generalizes well across datasets. Further experiments suggest that the bidirectional fusions, the number of multi-attention heads, and the aggregating function all affect the performance of cross-modal fake news detection. The research sheds light on the role of bidirectional cross-modal fusion in leveraging multi-modal information to improve fake news detection.
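
A minimal sketch of the bidirectional idea, assuming pre-extracted token and image-patch features of a common dimension (the class name, dimensions, and mean-pooling are illustrative choices, not BCMF's exact architecture): text features attend to image features (image2text) while image features attend to text features (text2image), and the two pooled views are concatenated for prediction:

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalFusion(nn.Module):
    """Illustrative two-way cross-attention: text queries attend to image
    patches (image2text) and image queries attend to tokens (text2image)."""
    def __init__(self, dim=256, heads=8, n_classes=2):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_classes)  # real vs. fake

    def forward(self, text_feats, image_feats):
        # each modality is enriched with evidence from the other
        t, _ = self.img2txt(text_feats, image_feats, image_feats)
        v, _ = self.txt2img(image_feats, text_feats, text_feats)
        fused = torch.cat([t.mean(dim=1), v.mean(dim=1)], dim=-1)
        return self.classifier(fused)
```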

3.
Optical Character Recognition (OCR) errors in text harm information retrieval performance, and much research has been reported on modelling and correcting them. Most prior work employs language-dependent resources or training texts to study the nature of the errors; little research focuses on improving retrieval from erroneous text when no training data is available. We propose a novel approach that detects OCR errors and improves retrieval performance from an erroneous corpus in situations where no training samples are available to model the errors. Our method automatically identifies erroneous term variants in the noisy corpus, in the absence of clean text, and uses them for query expansion, employing an effective combination of contextual information and string-matching techniques. The automatically identified erroneous variants of query terms consequently improve retrieval performance through query expansion. The approach uses no training data and no language-specific resources such as a thesaurus, and assumes nothing about the language except that words are delimited by blank spaces. We have tested the approach on erroneous Bangla (Bengali) and Hindi FIRE collections, as well as on the TREC Legal IIT CDIP and TREC 5 Confusion track English corpora, achieving statistically significant improvements over state-of-the-art baselines on most of the datasets.
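
The variant-detection step can be sketched as follows. The thresholds, the difflib surface-similarity measure, and the Jaccard context overlap are illustrative assumptions standing in for the paper's specific combination of string matching and contextual evidence:

```python
from difflib import SequenceMatcher

def expand_query(query_term, vocabulary, contexts, surf_t=0.8, ctx_t=0.3):
    """Collect likely OCR-corrupted variants of query_term: terms that are
    close in surface form AND share co-occurrence context, then expand the
    query with them. `contexts` maps a term to its set of co-occurring terms."""
    q_ctx = contexts.get(query_term, set())
    variants = []
    for term in vocabulary:
        if term == query_term:
            continue
        surface = SequenceMatcher(None, query_term, term).ratio()
        t_ctx = contexts.get(term, set())
        overlap = len(q_ctx & t_ctx) / max(len(q_ctx | t_ctx), 1)
        if surface >= surf_t and overlap >= ctx_t:
            variants.append(term)
    return [query_term] + variants

contexts = {"government": {"policy", "election"},
            "govcrnment": {"policy", "election"},  # OCR confusion: e -> c
            "garment": {"fabric", "cotton"}}
print(expand_query("government", list(contexts), contexts))
```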

4.
Stance detection identifies a person’s evaluation of a subject and is a crucial component of many downstream applications. In practice, stance detection requires training a machine learning model on an annotated dataset and applying the model to another to predict the stances of text snippets. This cross-dataset model generalization poses three central questions, which we investigate using stance classification models on 7 publicly available English Twitter datasets ranging from 297 to 48,284 instances. (1) Are stance classification models generalizable across datasets? We train single-dataset models and test them dataset-against-dataset, finding that models do not generalize well (avg F1=0.33). (2) Can we improve generalizability by aggregating datasets? We find that a multi-dataset model built on the aggregation of datasets performs better (avg F1=0.69). (3) Given a model built on multiple datasets, how much additional data is required to fine-tune it? We find it challenging to ascertain a minimum number of data points due to the lack of a clear pattern in performance. Investigating possible reasons for the erratic model performance, we find that texts are not easily differentiable by stance, nor are annotations consistent within and across datasets. Our observations emphasize the need for an aggregated dataset as well as consistent labels for model generalizability.

5.
Paraphrase detection is an important task in text analytics with numerous applications such as plagiarism detection, duplicate question identification, and enhanced customer support helpdesks. Deep models have been proposed for representing and classifying paraphrases, but they require large quantities of human-labeled data, which is expensive to obtain. In this work, we present a data augmentation strategy and a multi-cascaded model for improved paraphrase detection in short texts. Our data augmentation strategy treats the notions of paraphrase and non-paraphrase as binary relations over the set of texts and uses graph-theoretic concepts to efficiently generate additional paraphrase and non-paraphrase pairs in a sound manner. Our multi-cascaded model employs three supervised feature learners (cascades) based on CNN and LSTM networks, with and without soft attention. The learned features, together with hand-crafted linguistic features, are then forwarded to a discriminator network for final classification. The model is both wide and deep, providing greater robustness across clean and noisy short texts. We evaluate our approach on three benchmark datasets and show that it produces comparable or state-of-the-art performance on all three.
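
The graph-theoretic augmentation lends itself to a short sketch; the component-based rules and the networkx implementation below are a hedged reconstruction of how such sound closure could be realized, not the paper's exact procedure. Texts in the same connected component of the paraphrase graph become new paraphrase pairs; a known non-paraphrase edge between two components makes every cross-component pair a new non-paraphrase:

```python
import networkx as nx
from itertools import combinations, product

def norm(pair):
    return tuple(sorted(pair))

def augment(pos_pairs, neg_pairs):
    """Derive new labeled pairs from the paraphrase graph: transitivity
    within a connected component yields positives; one negative edge
    between two components yields negatives for their cross product."""
    g = nx.Graph(pos_pairs)
    comps = list(nx.connected_components(g))
    comp_of = {t: i for i, c in enumerate(comps) for t in c}
    new_pos = {norm(p) for c in comps for p in combinations(c, 2)}
    new_neg = set()
    for a, b in neg_pairs:
        ca = comps[comp_of[a]] if a in comp_of else {a}
        cb = comps[comp_of[b]] if b in comp_of else {b}
        new_neg |= {norm(p) for p in product(ca, cb)}
    return (new_pos - {norm(p) for p in pos_pairs},
            new_neg - {norm(p) for p in neg_pairs})

pos = [("t1", "t2"), ("t2", "t3")]  # t1~t2 and t2~t3  =>  new positive t1~t3
neg = [("t1", "t4")]                # => t2 and t3 are also non-paraphrases of t4
print(augment(pos, neg))
```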

6.
Due to natural language morphology, words can take on various morphological forms. Morphological normalisation – often used in information retrieval and text mining systems – conflates the morphological variants of a word to a single representative form. In this paper, we describe an approach to lexicon-based inflectional normalisation. This approach lies between stemming and lemmatisation and is suitable for the morphological normalisation of inflectionally complex languages. To eliminate the immense effort required to compile the lexicon by hand, we focus on the problem of automatically acquiring an inflectional morphological lexicon from raw corpora. We propose a convenient and highly expressive morphology representation formalism on which the acquisition procedure is based. Our approach is applied to the morphologically complex Croatian language, but it should be equally applicable to other languages of similar morphological complexity. Experimental results show that our approach can be used to acquire a lexicon whose linguistic quality allows for rather good normalisation performance.

7.
The rapid development of online social media makes Abusive Language Detection (ALD) a hot topic in the field of affective computing. However, most methods for ALD in social networks do not take into account the interactive relationships among user posts and simply regard ALD as a text-context representation learning task. To solve this problem, we propose a pipeline approach that considers both the context of a post and the characteristics of the interaction network in which it is posted. Specifically, our method is divided into pre-training and downstream tasks. First, to capture fine-grained contextual features of the posts, we use Bidirectional Encoder Representations from Transformers (BERT) as the encoder to generate sentence representations. Then, we build a Relation-Special Network according to the semantic similarity between posts as well as the structural information of the interaction network. On this basis, we design a Relation-Special Graph Neural Network (RSGNN) to spread effective information through the interaction network and learn text classification. Experiments show that our method effectively improves the detection of abusive posts on three public datasets, demonstrating that injecting interaction network structure into the abusive language detection task can significantly improve detection results.
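
A sketch of the graph-construction step, assuming sentence embeddings are already available; the similarity threshold, the symmetric normalization, and the single-layer classifier are illustrative simplifications of the RSGNN design:

```python
import torch
import torch.nn.functional as F

def build_relation_graph(post_emb, reply_edges, sim_thresh=0.85):
    """Adjacency combining semantic-similarity edges (cosine over sentence
    embeddings, e.g. from BERT) with explicit interaction edges (replies),
    returned in symmetrically normalized form D^-1/2 (A+I) D^-1/2."""
    z = F.normalize(post_emb, dim=1)
    adj = (z @ z.T > sim_thresh).float()
    for i, j in reply_edges:
        adj[i, j] = adj[j, i] = 1.0
    adj.fill_diagonal_(1.0)
    d = adj.sum(dim=1).rsqrt()
    return d[:, None] * adj * d[None, :]

class SimpleRelationGNN(torch.nn.Module):
    """One propagation step over the interaction graph, then classify."""
    def __init__(self, dim, n_classes=2):
        super().__init__()
        self.lin = torch.nn.Linear(dim, n_classes)

    def forward(self, x, norm_adj):
        return self.lin(norm_adj @ x)  # neighbors' evidence informs each post
```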

8.
Due to the particularities of Chinese word formation, the Chinese Named Entity Recognition (NER) task has attracted extensive attention in recent years. Some researchers have tried to solve this problem with multimodal methods that combine acoustic and text features. However, the text-speech data pairs required by such methods are lacking in real-world scenarios, which makes them difficult to apply widely. To address this, we propose a multimodal Chinese NER method called USAF, which uses synthesized acoustic features instead of actual human speech. USAF aligns text and acoustic features through unique position embeddings and uses a multi-head attention mechanism to fuse the features of the two modalities, which stably improves the performance of Chinese named entity recognition. To evaluate USAF, we implemented it on three Chinese NER datasets. Experimental results show that USAF achieves a stable improvement over text-only methods on each dataset and outperforms the SOTA external-vocabulary-based method on two datasets. Specifically, compared to the SOTA external-vocabulary-based method, the F1 score of USAF improves by 1.84 and 1.24 points on CNERTA and Aishell3-NER, respectively.
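
A sketch of the fusion step under stated assumptions: a shared learned position embedding for alignment, a single cross-attention block, and a residual connection. The class name and dimensions are illustrative, not USAF's exact architecture:

```python
import torch
import torch.nn as nn

class TextAcousticFusion(nn.Module):
    """Hypothetical fusion in the spirit of USAF: shared position embeddings
    align the two sequences, and multi-head attention lets text tokens query
    the (synthesized) acoustic frames."""
    def __init__(self, dim=256, heads=4, max_len=128):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, acoustic):
        t_idx = torch.arange(text.size(1), device=text.device)
        a_idx = torch.arange(acoustic.size(1), device=acoustic.device)
        q = text + self.pos(t_idx)          # position-tagged text queries
        k = acoustic + self.pos(a_idx)      # position-tagged acoustic keys
        fused, _ = self.attn(q, k, k)
        return fused + text                 # residual keeps the text signal
```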

9.
Despite growing efforts to halt distasteful content on social media, multilingualism has added a new dimension to this problem, and the scarcity of resources makes the challenge even greater for low-resource languages. This work provides a novel method for abusive content detection in multiple low-resource Indic languages. Our observations indicate that a post’s tendency to attract abusive comments, together with features such as user history and social context, significantly aids the detection of abusive content. The proposed method first learns social-context and text-context features in two separate modules; the integrated representation from these modules is then learned and used for the final prediction. To evaluate the performance of our method against classical and state-of-the-art methods, we performed extensive experiments on the SCIDN and MACI datasets, consisting of 1.5M and 665K multilingual comments, respectively. Our proposed method outperforms state-of-the-art baseline methods with average increases of 4.08% and 9.52% in F1-score on the SCIDN and MACI datasets, respectively.

10.
Since previous studies in cognitive psychology show that individuals’ affective states can help analyze and predict their future behaviors, researchers have explored emotion mining for predicting online activities, firm profitability, and so on. Existing emotion mining methods fall into two categories: feature-based approaches that rely on handcrafted annotations, and deep learning-based methods that thrive on computational resources and big data. However, neither category can effectively detect emotional expressions captured in text (e.g., social media postings), and the utilization of these methods in downstream explanatory and predictive applications is rare. To fill these research gaps, we develop a novel deep learning-based emotion detector named DeepEmotionNet that simultaneously leverages contextual, syntactic, semantic, and document-level features together with lexicon-based linguistic knowledge to bootstrap overall emotion detection performance. On three emotion detection benchmark corpora, our experimental results confirm that DeepEmotionNet outperforms state-of-the-art baseline methods by 4.9% to 29.8% in macro-averaged F-score. For the downstream application of DeepEmotionNet to a real-world financial problem, our econometric analysis highlights that top executives’ emotions of fear and anger embedded in their social media postings are significantly associated with corporate financial performance. Furthermore, these two emotions significantly improve the predictive power for corporate financial performance compared to sentiments. To the best of our knowledge, this is the first study to develop a deep learning-based emotion detection method and successfully apply it to enhance corporate performance prediction.

11.
Multimodal fake news detection methods based on semantic information have achieved great success. However, these methods exploit only the deep features of multimodal information, which leads to a large loss of valid information at the shallow level. To address this problem, we propose a multimodal progressive fusion network (MPFN) for disinformation detection, which captures the representational information of each modality at different levels and achieves fusion between modalities at the same level and across levels by means of a mixer, establishing a strong connection between the modalities. Specifically, we use a transformer structure, which is effective in computer vision tasks, as the visual feature extractor, gradually sampling features at different levels and combining them with features from a text feature extractor and with image frequency-domain information at different levels for fine-grained modeling. In addition, we design a feature fusion approach to better establish connections between modalities, which further improves performance and thus surpasses other network structures in the literature. We conducted extensive experiments on two real datasets, Weibo and Twitter; our method achieved 83.3% accuracy on the Twitter dataset, an improvement of at least 4.3% over other state-of-the-art methods. This demonstrates the effectiveness of MPFN for identifying fake news by combining different levels of information from each modality with a powerful modality fusion method.
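
Two ingredients lend themselves to a short sketch: a shallow frequency-domain feature (the log-amplitude FFT below is a common stand-in, assumed here rather than taken from the paper) and a same-level mixer fusing text and image features of matching dimension (names and sizes are illustrative):

```python
import torch

def frequency_features(images):
    """Shallow frequency-domain cue (sketch): log-amplitude of the 2-D FFT,
    often used to expose re-compression artifacts in tampered images.
    `images` is a real-valued (batch, channels, height, width) tensor."""
    spec = torch.fft.fft2(images)
    return torch.log1p(spec.abs())

class LevelMixer(torch.nn.Module):
    """Mix two same-level modality features; stacking mixers across levels
    gives a progressive fusion in the spirit of MPFN."""
    def __init__(self, dim):
        super().__init__()
        self.proj = torch.nn.Linear(2 * dim, dim)

    def forward(self, text_level, image_level):
        return torch.relu(self.proj(torch.cat([text_level, image_level], dim=-1)))
```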

12.
Sentiment analysis concerns automatically identifying the sentiment or opinion expressed in a given piece of text. Most prior work either uses prior lexical knowledge, defined as the sentiment polarity of words, or views the task as a text classification problem and relies on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort.

13.
In this era, the proliferating role of social media in our lives has popularized the posting of short texts. Short texts contain limited context and have unique characteristics that make them difficult to handle. Every day, billions of short texts are produced in the form of tags, keywords, tweets, phone messages, messenger conversations, social network posts, etc. The analysis of these short texts is imperative in the fields of text mining and content analysis, and the extraction of precise topics from large-scale short-text documents is a critical and challenging task. Conventional approaches fail to obtain word co-occurrence patterns in topics due to the sparsity problem in short texts, such as text over the web, social media like Twitter, and news headlines. In this paper, the sparsity problem is ameliorated by a novel fuzzy topic modeling (FTM) approach for short text. Local and global term frequencies are computed through a bag-of-words (BOW) model; to remove the negative impact of high dimensionality on global term weighting, principal component analysis is adopted, and thereafter the fuzzy c-means algorithm is employed to retrieve semantically relevant topics from the documents. Experiments are conducted on three real-world short-text datasets: the snippets dataset is small, whereas the other two, Twitter and questions, are larger. Experimental results show that the proposed approach discovers topics more precisely and performs better than state-of-the-art baseline topic models such as GLTM, CSTM, LTM, LDA, Mix-gram, BTM, SATM, and DREx+LDA. The performance of FTM is also demonstrated in classification, clustering, topic coherence, and execution time. FTM's classification accuracy is 0.95, 0.94, 0.91, 0.89, and 0.87 on the snippets dataset with 50, 75, 100, 125, and 200 topics, and 0.73, 0.74, 0.70, 0.68, and 0.78 on the questions dataset with the same numbers of topics; both are higher than the state-of-the-art baseline topic models.
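
The BOW, PCA, and fuzzy c-means pipeline can be sketched end to end. The fuzzy c-means below is a minimal textbook implementation (fuzzifier m=2, random initialization), and the toy corpus, component count, and topic count are illustrative assumptions rather than the paper's settings:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns soft memberships U (n_docs x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        U = 1.0 / d ** (2 / (m - 1))
        U /= U.sum(axis=1, keepdims=True)
    return U, centers

docs = ["stock markets fell sharply", "team wins the cup final",
        "markets rally on rate cut", "coach praises the team"]
X = CountVectorizer().fit_transform(docs).toarray().astype(float)  # BOW counts
X = PCA(n_components=2).fit_transform(X)   # fight high dimensionality
U, _ = fuzzy_cmeans(X, c=2)                # soft document-topic memberships
print(U.round(2))
```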

14.
With the onset of COVID-19, the pandemic aroused huge discussions on social media like Twitter, followed by many social media analyses concerning it. Despite such an abundance of studies, however, little work has been done on reactions from the public and officials on social networks and their associations, especially during the early outbreak stage. In this paper, a total of 9,259,861 COVID-19-related English tweets published from 31 December 2019 to 11 March 2020 are accumulated to explore the participatory dynamics of public attention and news coverage during the early stage of the pandemic. An easy numeric data augmentation (ENDA) technique is proposed for generating new samples while preserving label validity; it attains superior performance on text classification tasks with deep models (BERT) compared to a simpler data augmentation method. To further demonstrate the efficacy of ENDA, experiments and ablation studies were also conducted on other benchmark datasets. The classification results on COVID-19 tweets show tweet peaks triggered by momentous events and a strong positive correlation between the daily number of personal narratives and news reports. We argue that there were three periods divided by the turning points on January 20 and February 23, and that the low level of news coverage suggests missed windows for government response in early January and February. Our study not only contributes to a deeper understanding of the dynamic patterns and relationships between public attention and news coverage on social media during the pandemic but also sheds light on early emergency management and government response on social media during global health crises.

15.
The detection and identification of traffic signs is a fundamental function of an intelligent transportation system. The extraction or identification of a road sign poses the same problems as object identification in natural contexts: illumination conditions are variable and uncontrollable, and various objects frequently surround road signs, making feature extraction difficult. Fusing the temporal and spatial features of traffic signs is important for improving recognition performance. Deep learning-based algorithms require large amounts of data and are time-consuming to train, and they are difficult to deploy on resource-constrained portable devices for real-time sign detection. The accuracy of sign detection should be further improved, as it relates to the safety of traffic participants. To improve the accuracy of traffic sign feature extraction and classification, we propose MKL-SING, a hybrid approach based on a multi-kernel support vector machine (MKL-SVM) for public transportation sign recognition. It contains three main components: a principal component analysis for image dimension reduction, a fused feature extractor, and a multi-kernel SVM-based classifier. The fused feature extractor extracts and fuses the temporal and spatial features of traffic signs, and the multi-kernel SVM then classifies the signs based on the fused features. The different kernel functions in the multi-kernel SVM are combined through a feature weighting procedure. Compared with a single-kernel SVM, a multi-kernel SVM can better process massive data because it projects each kernel function into a high-dimensional feature space to obtain global solutions. Finally, performance is validated on three traffic sign datasets; experiment results show that our approach performs better than state-of-the-art methods in dynamic traffic sign identification and recognition.
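
A multi-kernel SVM can be sketched with scikit-learn's precomputed-kernel interface: build a weighted sum of base kernels and hand the Gram matrix to the classifier. The RBF/linear pair, the weights, and the synthetic data are illustrative assumptions; in the paper the feature-weighting procedure would set the weights:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, linear_kernel

def combined_kernel(A, B, weights=(0.6, 0.4), gamma=0.5):
    """Weighted sum of an RBF and a linear kernel: a simple multi-kernel
    construction in which the weights play the feature-weighting role."""
    w_rbf, w_lin = weights
    return w_rbf * rbf_kernel(A, B, gamma=gamma) + w_lin * linear_kernel(A, B)

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(100, 16)), rng.integers(0, 3, 100)
X_test = rng.normal(size=(10, 16))

clf = SVC(kernel="precomputed")
clf.fit(combined_kernel(X_train, X_train), y_train)
# prediction needs the kernel between test rows and training columns
pred = clf.predict(combined_kernel(X_test, X_train))
```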

16.
In synthetic aperture radar (SAR) image change detection, deep learning has attracted increasing attention because the difference images (DIs) of traditional unsupervised techniques are vulnerable to speckle noise. However, most existing deep networks do not constrain the distributional characteristics of the hidden space, which may limit feature representation performance. This paper proposes a variational autoencoder (VAE) network with a siamese structure to detect changes in SAR images. The VAE encodes the input as a probability distribution in the hidden space to obtain regular hidden-layer features with good representation ability, and subnetworks with the same parameters and structure extract the spatially consistent features of the original images, which benefits the subsequent classification. The proposed method includes three main steps. First, training samples are selected based on pseudo-labels generated by a clustering algorithm. Then, the model is trained with a semisupervised learning strategy, comprising unsupervised feature learning and supervised network fine-tuning. Finally, the original data, rather than DIs, are fed into the trained network to obtain the change detection results. Experimental results on four real SAR datasets show the effectiveness and robustness of the proposed method.
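
A sketch of the shared (siamese) VAE encoder with the reparameterization trick; the MLP backbone, layer sizes, and the use of |z1 - z2| as the change feature are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SiameseVAEEncoder(nn.Module):
    """One VAE encoder applied to both SAR acquisitions with shared weights.
    Encoding each patch as a Gaussian regularizes the latent space; the
    shared weights give the siamese spatial-consistency property."""
    def __init__(self, in_dim, z_dim=16):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, z_dim)
        self.logvar = nn.Linear(64, z_dim)

    def encode(self, x):
        h = self.backbone(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

    def forward(self, x1, x2):
        z1, mu1, lv1 = self.encode(x1)   # patch at time t1
        z2, mu2, lv2 = self.encode(x2)   # patch at time t2, same weights
        kl = sum(-0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(1).mean()
                 for mu, lv in [(mu1, lv1), (mu2, lv2)])
        return (z1 - z2).abs(), kl       # change feature + KL regularizer
```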

17.
Stance detection distinguishes whether the author of a text supports, opposes, or maintains a neutral stance towards a given target. In most real-world scenarios, stance detection needs to work in a zero-shot manner, i.e., predicting stances for unseen targets without labeled data. One critical challenge of zero-shot stance detection is the absence of contextual information about the targets. Current work mostly concentrates on introducing external knowledge to supplement information about targets, but the noisy schema-linking process hinders performance in practice. To combat this issue, we argue that previous studies have ignored the extensive target-related information contained in the unlabeled data during the training phase, and propose a simple yet efficient Multi-Perspective Contrastive Learning Framework for zero-shot stance detection, capable of leveraging information from both labeled data and extensive unlabeled data. To this end, we design target-oriented contrastive learning and label-oriented contrastive learning to capture more comprehensive target representations and more distinguishable stance features. We conduct extensive experiments on three widely adopted datasets (from 4,870 to 33,090 instances): SemEval-2016, WT-WT, and VAST. Our framework achieves 53.6%, 77.1%, and 72.4% macro-average F1 scores on these datasets, showing improvements of 2.71% and 0.25% over state-of-the-art baselines on SemEval-2016 and WT-WT and comparable results on the more challenging VAST dataset.
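
The label-oriented half can be sketched as a SupCon-style objective; the temperature and the exact positive-set definition are assumptions, and the target-oriented loss would follow the same pattern with target identity defining the positives:

```python
import torch
import torch.nn.functional as F

def label_contrastive_loss(z, labels, tau=0.1):
    """SupCon-style sketch: pull together embeddings that share a stance
    label, push apart the rest. z: (batch, dim), labels: (batch,)."""
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))       # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    mean_pos = log_prob.masked_fill(~pos, 0.0).sum(1) / pos.sum(1).clamp(min=1)
    return -mean_pos.mean()

z = torch.randn(8, 32, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
label_contrastive_loss(z, labels).backward()  # usable as an auxiliary loss
```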

18.
Text summarization is the process of generating a brief version of a document while preserving as much of its fundamental information as possible. Although most text summarization research has focused on supervised learning solutions, few datasets have been generated specifically for summarization tasks, and most existing summarization datasets lack human-generated goal summaries, which are vital for both summary generation and evaluation. Therefore, this study presents a new dataset for abstractive and extractive summarization tasks, containing academic publications, the abstracts written by their authors, and extracts in two sizes generated by human readers in this research; the resulting extracts were evaluated to ensure the validity of the human extract-production process. The extractive summarization problem was then reinvestigated on the proposed dataset, with the main focus on analyzing the feature vector to generate more informative summaries. To that end, a comprehensive syntactic feature space was generated for the proposed dataset, and the impact of these features on the informativeness of the resulting summary was investigated. The summarization capability of semantic features was also examined using GloVe and word2vec embeddings. Finally, an ensembled feature space, the joint use of syntactic and semantic features, was employed in a long short-term memory-based neural network model. ROUGE metrics evaluated the model summaries, and the results showed that the proposed ensemble feature space remarkably improved on the single use of syntactic or semantic features. Additionally, the resulting summaries prominently outperformed, or performed comparably to, summaries obtained by state-of-the-art models for extractive summarization.
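
The ensembled-feature idea reduces to concatenating per-sentence syntactic vectors with pretrained embeddings (e.g., averaged GloVe vectors) before an LSTM scorer; the dimensions, bidirectionality, and scoring head below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EnsembleFeatureSummarizer(nn.Module):
    """Sketch of the ensembled-feature approach: per-sentence syntactic
    feature vectors are concatenated with semantic embeddings and scored
    by an LSTM that decides each sentence's extract-worthiness."""
    def __init__(self, syn_dim=10, emb_dim=100, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(syn_dim + emb_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, syntactic, semantic):
        # syntactic: (batch, n_sentences, syn_dim)
        # semantic:  (batch, n_sentences, emb_dim)
        x = torch.cat([syntactic, semantic], dim=-1)  # ensembled features
        h, _ = self.lstm(x)
        return self.score(h).squeeze(-1)  # one inclusion score per sentence
```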

19.
The proliferation of false information is a growing problem in today's dynamic online environment, requiring automated fake news detection to reduce its harmful effects on society. Although various methods are used to detect fake news, most consider only data-oriented text features, ignoring dual emotion features (publisher emotions and social emotions), and thus fall short of higher levels of accuracy. This study addresses this issue by utilizing dual emotion features to detect fake news. It proposes a Deep Normalized Attention-based mechanism for enriched extraction of dual emotion features and an Adaptive Genetic Weight Update-Random Forest (AGWu-RF) for classification. First, the deep normalized attention-based mechanism incorporates a BiGRU, which improves feature quality by extracting long-range context information while mitigating gradient explosion. The genetic algorithm then adjusts and updates the RF weights to reach optimized hyperparameter values that support the classifier's detection accuracy. The proposed model outperforms baseline methods on standard benchmark metrics across three real-world datasets, exceeding state-of-the-art approaches by 5%, 11%, and 14% in accuracy and highlighting the significance of dual emotion features and the optimizations in improving fake news detection.
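
The attention-over-BiGRU pattern behind such emotion feature extractors is easy to sketch; the dimensions and the single attention head are illustrative, and the paper's specific normalization scheme and genetic RF update are not reproduced here:

```python
import torch
import torch.nn as nn

class BiGRUAttentionEncoder(nn.Module):
    """Sketch of attention over a BiGRU (the general pattern behind
    normalized-attention emotion extractors): attention weights are
    softmax-normalized over time steps to pool a sequence into one vector."""
    def __init__(self, emb_dim=100, hidden=64):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, x):                        # x: (batch, seq, emb_dim)
        h, _ = self.gru(x)                       # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # normalized attention weights
        return (w * h).sum(dim=1)                # pooled emotion feature vector
```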

20.
As a prevalent problem in online advertising, CTR prediction has attracted considerable attention from both academia and industry. Recent studies have established CTR prediction models in the graph neural network (GNN) framework. However, most GNN-based models handle feature interactions over a complete graph while ignoring causal relationships among features, which results in a large performance drop on out-of-distribution data. This paper develops a causality-based CTR prediction model in the GNN framework (Causal-GNN) that integrates representations of a feature graph, a user graph, and an ad graph in the context of online advertising. In our model, a structured representation learning method (GraphFwFM) is designed to capture high-order representations on the feature graph based on causal discovery among field features in gated graph neural networks (GGNNs), and GraphSAGE is employed to obtain graph representations of users and ads. Experiments conducted on three public datasets demonstrate the superiority of Causal-GNN in AUC and Logloss and the effectiveness of GraphFwFM in capturing high-order representations on the causal feature graph.
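
The GraphSAGE side is standard and can be sketched with PyTorch Geometric (an assumed dependency; the two-layer depth and layer sizes are illustrative, and the GGNN/GraphFwFM branch is not reproduced here):

```python
import torch
from torch_geometric.nn import SAGEConv

class UserAdEncoder(torch.nn.Module):
    """Sketch of the GraphSAGE branch: two SAGEConv layers produce node
    embeddings for users (or ads); edge_index holds the behavior graph
    as a (2, n_edges) tensor of source/target node indices."""
    def __init__(self, in_dim, hid=64, out=32):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hid)
        self.conv2 = SAGEConv(hid, out)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))  # aggregate 1-hop neighbors
        return self.conv2(h, edge_index)           # 2-hop node embeddings
```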
