Similar Documents
20 similar documents found; search time 156 ms.
1.
With the rapid development of mobile computing and Web technologies, online hate speech has spread increasingly across social network platforms, since it is easy to post opinions of any kind. Previous studies confirm that exposure to online hate speech has serious offline consequences for historically deprived communities. Research on automated hate speech detection has therefore attracted much attention. However, the role of social networks in identifying communities vulnerable to hatred has not been well investigated. Hate speech can affect all population groups, but some are more vulnerable to its impact than others. For example, for ethnic groups whose languages have few computational resources, automatically collecting and processing online texts is already a challenge, let alone automatic hate speech detection on social media. In this paper, we propose a hate speech detection approach to identify hatred against vulnerable minority groups on social media. First, within the Spark distributed processing framework, posts are automatically collected and pre-processed, and features are extracted using word n-grams and word embedding techniques such as Word2Vec. Second, deep learning classification algorithms such as the Gated Recurrent Unit (GRU), a variant of the Recurrent Neural Network (RNN), are used for hate speech detection. Finally, hate words are clustered with methods such as Word2Vec to predict the ethnic groups most likely to be targeted by hatred. In our experiments, we use the Amharic language of Ethiopia as an example. Since there was no publicly available dataset of Amharic texts, we crawled Facebook pages to prepare the corpus. Since data annotation could be biased by culture, we recruited annotators from different cultural backgrounds and achieved better inter-annotator agreement. In our experimental results, feature extraction using word embedding techniques such as Word2Vec performs better with both classical and deep learning-based classification algorithms for hate speech detection, among which GRU achieves the best result. Our proposed approach successfully identifies the Tigre ethnic group as the community most vulnerable to hatred, compared with the Amhara and Oromo. Identifying groups vulnerable to hatred is thus vital for protecting them: automatic hate speech detection models can remove content that aggravates psychological harm and physical conflict. This can also pave the way for the development of policies, strategies, and tools to empower and protect vulnerable communities.
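A minimal sketch of the kind of pipeline described above (Word2Vec features feeding a GRU classifier), using gensim and Keras; the corpus, labels, vocabulary handling, and hyperparameters are illustrative placeholders, not the authors' actual configuration.

```python
# Illustrative sketch only: Word2Vec features feeding a GRU classifier for binary
# hate speech detection. Corpus, labels and hyperparameters are placeholders.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.initializers import Constant

tokenized_posts = [["some", "tokenized", "post"], ["another", "example", "post"]]  # placeholder corpus
labels = np.array([1, 0])  # 1 = hate, 0 = non-hate (placeholder labels)

# Learn word embeddings on the (unlabeled) corpus.
w2v = Word2Vec(sentences=tokenized_posts, vector_size=100, window=5, min_count=1)

# Build a vocabulary index and an embedding matrix initialised from Word2Vec.
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}  # index 0 reserved for padding/OOV
emb_matrix = np.zeros((len(vocab) + 1, 100))
for w, i in vocab.items():
    emb_matrix[i] = w2v.wv[w]

max_len = 30
def encode(tokens):
    ids = [vocab.get(t, 0) for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))

X = np.array([encode(p) for p in tokenized_posts])

model = Sequential([
    Embedding(input_dim=emb_matrix.shape[0], output_dim=100,
              embeddings_initializer=Constant(emb_matrix), trainable=False),
    GRU(64),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=2, verbose=0)
```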

2.
Hate speech detection refers broadly to the automatic identification of language that may be considered discriminatory against certain groups of people. The goal is to help online platforms identify and remove harmful content. Humans are usually capable of detecting hatred in critical cases, such as when the hatred is non-explicit, but how do computer models address this situation? In this work, we aim to contribute to the understanding of ethical issues related to hate speech by analysing two transformer-based models trained to detect hate speech. Our study focuses on the relationship between these models and a set of hateful keywords extracted from three well-known datasets. For the extraction of the keywords, we propose a metric that takes the division among classes into account in order to favour the words most common in hateful contexts. In our experiments, we first compare the overlap between the extracted keywords and the words to which the models pay the most attention in decision-making. We then investigate the bias of the models towards the extracted keywords. For the bias analysis, we define and use two metrics and evaluate two strategies for mitigating the bias. Surprisingly, we show that over 50% of the salient words of the models are not hateful and that there is a higher proportion of hateful words among the extracted keywords. Nevertheless, the models appear to be biased towards the extracted keywords. Experimental results suggest that fitting the models with hateful texts that do not contain any of the keywords can reduce bias and improve the performance of the models.
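The abstract does not spell out the keyword-extraction metric; the sketch below shows one plausible class-aware score (frequency in hateful texts weighted by how strongly a word skews toward the hateful class), purely as an illustration of the idea.

```python
# Hypothetical class-aware keyword score: favours words that are both frequent
# and disproportionately common in hateful texts. Not the paper's actual metric.
from collections import Counter

def hateful_keywords(hate_texts, non_hate_texts, top_k=10, smoothing=1.0):
    hate_counts = Counter(w for t in hate_texts for w in t.lower().split())
    other_counts = Counter(w for t in non_hate_texts for w in t.lower().split())
    scores = {}
    for w, c in hate_counts.items():
        ratio = (c + smoothing) / (other_counts[w] + smoothing)  # class skew
        scores[w] = c * ratio                                    # frequency * skew
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with placeholder texts:
print(hateful_keywords(["they are vermin", "vermin everywhere"],
                       ["nice weather today", "they are here"]))
```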

3.
4.
Social networks have grown into a widespread form of communication that allows a large number of users to participate in conversations and consume information at any time. The casual nature of social media allows for nonstandard terminology, some of which may be considered rude and derogatory. As a result, a significant portion of social media users is found to use disrespectful language. The problem may intensify in certain developing countries where young children are granted unsupervised access to social media platforms. Furthermore, the sheer amount of social media data generated daily by millions of users makes it impractical for humans to monitor and regulate inappropriate content. If adolescents are exposed to these harmful language patterns without adequate supervision, they may feel obliged to adopt them. In addition, unrestricted aggression in online forums may result in cyberbullying and other dreadful occurrences. While computational linguistics research has addressed the difficulty of detecting abusive dialogue, open questions remain for low-resource languages with little annotated data, where the majority of supervised techniques perform poorly. In addition, social media content is often presented in complex, context-rich formats that encourage creative user involvement. We therefore propose to improve abusive language detection and classification in a low-resource setting by exploiting both the abundant unlabeled data and the context features via the co-training protocol, which enables two machine learning models, each learning from an orthogonal set of features, to teach each other, resulting in an overall performance improvement. Empirical results reveal that our proposed framework achieves F1 values of 0.922 and 0.827, surpassing the state-of-the-art baselines by 3.32% and 45.85% for the binary and fine-grained classification tasks, respectively. Besides demonstrating the efficacy of co-training for abusive language detection and classification in a low-resource situation, the findings shed light on several opportunities to use unlabeled data and contextual characteristics of social networks in a variety of social computing applications.
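A compact sketch of the co-training protocol described above: two classifiers trained on disjoint feature views label confident unlabeled examples for each other. The models, feature views, confidence threshold, and stopping rule here are illustrative assumptions, not the paper's exact setup.

```python
# Illustrative co-training loop: two models trained on disjoint feature views
# (e.g., text features vs. context features) label confident unlabeled examples
# for each other. All details here are assumptions, not the paper's exact setup.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=5, conf=0.9):
    """X1_*/X2_* are numpy feature matrices for the two views; y_l holds the labels."""
    m1, m2 = LogisticRegression(max_iter=1000), LogisticRegression(max_iter=1000)
    unlabeled = np.arange(len(X1_u))
    for _ in range(rounds):
        m1.fit(X1_l, y_l)
        m2.fit(X2_l, y_l)
        if len(unlabeled) == 0:
            break
        p1 = m1.predict_proba(X1_u[unlabeled]).max(axis=1)
        p2 = m2.predict_proba(X2_u[unlabeled]).max(axis=1)
        mask = (p1 >= conf) | (p2 >= conf)          # examples either view is sure about
        confident = unlabeled[mask]
        if len(confident) == 0:
            break
        # Take the prediction of whichever view is more confident on each example.
        new_y = np.where(p1[mask] >= p2[mask],
                         m1.predict(X1_u[confident]),
                         m2.predict(X2_u[confident]))
        X1_l = np.vstack([X1_l, X1_u[confident]])
        X2_l = np.vstack([X2_l, X2_u[confident]])
        y_l = np.concatenate([y_l, new_y])
        unlabeled = unlabeled[~mask]
    return m1, m2
```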

5.
6.
The possibilities that emerge from user-generated micro-blogging content in crisis-related situations make automatic crisis management using natural language processing techniques a hot research topic. Our aim here is to contribute to this line of research, focusing for the first time on French tweets related to ecological crises in order to support the French Civil Security and Crisis Management Department in providing immediate feedback on the expectations of the populations involved in the crisis. We propose a new dataset manually annotated along three dimensions: relatedness, urgency, and intention to act. We then experiment with binary classification (useful vs. non-useful), three-class classification (non-useful vs. urgent vs. non-urgent), and multiclass classification (i.e., intention-to-act categories), relying on traditional feature-based machine learning with both state-of-the-art and new features. We also explore several deep learning models trained with pre-trained word embeddings as well as contextual embeddings. We then investigate three transfer learning strategies to adapt these models to the crisis domain. We finally experiment with multi-input architectures that incorporate different metadata extra-features into the network. Our deep models, evaluated in random sampling, out-of-event, and out-of-type configurations, show very good performance, outperforming several competitive baselines. Our results constitute the first contribution to the field of crisis management in French social media.
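A minimal sketch of the kind of multi-input architecture mentioned at the end of the abstract: a text branch built on word embeddings merged with a metadata branch before classification, using Keras. Layer sizes, input shapes, and the three-class output are assumptions, not the authors' exact network.

```python
# Illustrative multi-input model: an embedding-based text branch plus a metadata branch.
# Shapes, sizes and the three-class output are assumptions, not the authors' exact setup.
from tensorflow.keras import layers, Model, Input

max_len, vocab_size, emb_dim, n_meta = 50, 20000, 300, 8

text_in = Input(shape=(max_len,), name="tokens")
x = layers.Embedding(vocab_size, emb_dim)(text_in)    # could be initialised from pre-trained embeddings
x = layers.Bidirectional(layers.LSTM(64))(x)

meta_in = Input(shape=(n_meta,), name="metadata")     # e.g., tweet metadata extra-features
m = layers.Dense(16, activation="relu")(meta_in)

merged = layers.concatenate([x, m])
out = layers.Dense(3, activation="softmax")(merged)   # e.g., three urgency/usefulness classes

model = Model(inputs=[text_in, meta_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```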

7.
With the onset of COVID-19, the pandemic sparked extensive discussion on social media platforms such as Twitter, followed by many analyses of that discussion. Despite this abundance of studies, little work has examined the reactions of the public and of officials on social networks, and the associations between them, especially during the early stage of the outbreak. In this paper, a total of 9,259,861 COVID-19-related English tweets published from 31 December 2019 to 11 March 2020 were collected to explore the participatory dynamics of public attention and news coverage during the early stage of the pandemic. We propose an easy numeric data augmentation (ENDA) technique for generating new samples while preserving label validity; with deep models (BERT), it attains better performance on text classification tasks than a simpler data augmentation method. To further demonstrate the efficacy of ENDA, experiments and ablation studies were also conducted on other benchmark datasets. The classification results of COVID-19 tweets show tweet peaks triggered by momentous events and a strong positive correlation between the daily numbers of personal narratives and news reports. We argue that there were three periods, divided by turning points on 20 January and 23 February, and that the low level of news coverage suggests missed windows for government response in early January and February. Our study not only contributes to a deeper understanding of the dynamic patterns and relationships of public attention and news coverage on social media during the pandemic but also sheds light on early emergency management and government response on social media during global health crises.
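The reported positive correlation between daily personal narratives and news reports is the kind of relationship that can be checked with a plain Pearson correlation over the two daily time series; the sketch below uses invented counts purely for illustration.

```python
# Toy illustration: correlation between two daily count series
# (personal-narrative tweets vs. news-report tweets). Numbers are invented, not the paper's data.
import numpy as np

narratives = np.array([120, 340, 560, 900, 1500, 2200, 1800])   # daily personal-narrative tweets
news       = np.array([ 80, 300, 500, 850, 1400, 2100, 1700])   # daily news-report tweets

r = np.corrcoef(narratives, news)[0, 1]
print(f"Pearson correlation: {r:.3f}")   # close to 1 for strongly co-moving series
```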

8.
Warning: This paper contains abusive samples that may cause discomfort to readers. Abusive language on social media reinforces prejudice against an individual or a specific group of people, which greatly hampers freedom of expression. With the rise of large-scale pre-trained language models, classification based on pre-trained language models has gradually become a paradigm for automatic abusive language detection. However, the effect of stereotypes inherent in language models on the detection of abusive language remains unknown, even though such stereotypes may further reinforce biases against minorities. To this end, in this paper we use multiple metrics to measure the presence of bias in language models and analyze the impact of these inherent biases on automatic abusive language detection. On the basis of this quantitative analysis, we propose two debiasing strategies, token debiasing and sentence debiasing, which are jointly applied to reduce the bias of language models in abusive language detection without degrading classification performance. Specifically, for the token debiasing strategy, we reduce the discrimination of the language model against protected attribute terms of a certain group through random probability estimation. For the sentence debiasing strategy, we replace protected attribute terms and augment the original text through counterfactual augmentation to obtain debiased samples, and use consistency regularization between the original data and the augmented samples to eliminate bias at the sentence level of the language model. The experimental results confirm that our method not only reduces the bias of the language model in the abusive language detection task, but also effectively improves the performance of abusive language detection.
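A toy sketch of the sentence-level counterfactual augmentation idea described above: protected attribute terms are swapped with counterpart terms to produce a counterfactual copy of a training sentence. The term pairs are placeholders, not the paper's actual protected-attribute lexicon.

```python
# Toy counterfactual augmentation: replace protected attribute terms with counterparts
# so the model sees the same sentence across groups. Term pairs are placeholders.
import re

COUNTERFACTUAL_PAIRS = {"women": "men", "men": "women",
                        "muslims": "christians", "christians": "muslims"}

def counterfactual_augment(sentence):
    def swap(match):
        word = match.group(0)
        repl = COUNTERFACTUAL_PAIRS.get(word.lower(), word)
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(COUNTERFACTUAL_PAIRS) + r")\b"
    return re.sub(pattern, swap, sentence, flags=re.IGNORECASE)

print(counterfactual_augment("Women are always complaining"))
# -> "Men are always complaining"; training on both copies with a consistency loss
#    pushes the classifier to treat the two groups alike.
```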

9.
Rumour stance classification, defined as classifying the stance of specific social media posts into one of supporting, denying, querying or commenting on an earlier post, is attracting increasing interest from researchers. While most previous work has focused on using individual tweets as classifier inputs, here we report on the performance of sequential classifiers that exploit the discourse features inherent in social media interactions, or ‘conversational threads’. Testing the effectiveness of four sequential classifiers – Hawkes Processes, Linear-Chain Conditional Random Fields (Linear CRF), Tree-Structured Conditional Random Fields (Tree CRF) and Long Short-Term Memory networks (LSTM) – on eight datasets associated with breaking news stories, and looking at different types of local and contextual features, our work sheds new light on the development of accurate stance classifiers. We show that sequential classifiers that exploit the discourse properties of social media conversations, even while using only local features, outperform non-sequential classifiers. Furthermore, we show that an LSTM using a reduced set of features can outperform the other sequential classifiers; this performance is consistent across datasets and across types of stance. Finally, our work also analyses the different features under study, identifying those that best help characterise and distinguish between stances, such as supporting tweets being more likely to be accompanied by evidence than denying tweets. We also set forth a number of directions for future research.

10.
Climate change has become one of the most significant crises of our time. Public opinion on climate change is influenced by social media platforms such as Twitter and is often divided into believers and deniers. In this paper, we propose a framework to classify a tweet’s stance on climate change (denier/believer). Existing approaches to stance detection and classification of climate change tweets have either paid little attention to the characteristics of deniers’ tweets or lack an appropriate architecture. The relevant literature reveals that the sentimental aspects and time perspective of climate change conversations on Twitter have a major impact on public attitudes and environmental orientation. Therefore, in our study, we focus on exploring the role of temporal orientation and sentiment analysis (auxiliary tasks) in detecting the attitude of tweets towards climate change (main task). Our proposed framework STASY integrates word- and sentence-based feature encoders with intra-task and shared-private attention frameworks to better encode the interactions between task-specific and shared features. We conducted our experiments on our newly curated climate change CLiCS dataset (2,465 denier and 7,235 believer tweets), two publicly available climate change datasets (ClimateICWSM-2022 and ClimateStance-2022), and two benchmark stance detection datasets (SemEval-2016 and COVID-19-Stance). Experiments show that our proposed approach improves stance detection performance over the baseline methods by benefiting from the auxiliary tasks, with average F1 improvements of 12.14% on our climate change dataset, 15.18% on ClimateICWSM-2022, 12.94% on ClimateStance-2022, 19.38% on SemEval-2016, and 35.01% on COVID-19-Stance.

11.
Stance is defined as the expression of a speaker's standpoint towards a given target or entity. To date, the most reliable method for measuring stance is the opinion survey. However, people's increased reliance on social media makes these online platforms an essential source of complementary information about public opinion. Our study contributes to the discussion surrounding replicable methods for reliable stance detection by establishing a rule-based model, which we replicated independently for several targets. To test our model, we relied on a widely used dataset of annotated tweets, the SemEval Task 6A dataset, which contains 5 targets with 4,163 manually labelled tweets. We relied on “off-the-shelf” sentiment lexica to expand the scope of our custom dictionaries, while also integrating linguistic markers and using word-pair dependency information to conduct stance classification. While positive and negative evaluative words are the clearest markers of the expression of stance, we demonstrate the added value of linguistic markers for identifying the direction of the stance more precisely. Our model achieves an average classification accuracy of 75% (ranging from 67% to 89% across targets). The study concludes by discussing practical implications and outlooks for future research, while highlighting that each target poses specific challenges for stance detection.
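A deliberately simplified sketch of the lexicon-plus-rules flavour of stance classification described above: positive and negative evaluative words vote on the stance, and a preceding negation flips the vote. The lexica and negation handling are toy placeholders, far smaller than the custom dictionaries and dependency information the study actually uses.

```python
# Simplified rule-based stance scorer: evaluative words vote for FAVOR/AGAINST,
# a preceding negation flips the vote. Lexica here are tiny placeholders.
POSITIVE = {"great", "support", "love", "good"}
NEGATIVE = {"terrible", "oppose", "hate", "bad"}
NEGATIONS = {"not", "never", "no"}

def stance(tweet):
    tokens = tweet.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity          # "not good" counts as negative
        score += polarity
    return "FAVOR" if score > 0 else "AGAINST" if score < 0 else "NONE"

print(stance("I do not support this policy"))   # -> AGAINST
```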

12.
This article describes in-depth research on machine learning methods for sentiment analysis of Czech social media. Whereas in English, Chinese, or Spanish this field has a long history and evaluation datasets for various domains are widely available, in the case of the Czech language no systematic research has yet been conducted. We tackle this issue and establish a common ground for further research by providing a large human-annotated Czech social media corpus. Furthermore, we evaluate state-of-the-art supervised machine learning methods for sentiment analysis. We explore different pre-processing techniques and employ various features and classifiers. We also experiment with five different feature selection algorithms and investigate the influence of named entity recognition and preprocessing on sentiment classification performance. Moreover, in addition to our newly created social media dataset, we also report results for other popular domains, such as movie and product reviews. We believe that this article will not only extend the current sentiment analysis research to another family of languages, but will also encourage competition, potentially leading to the production of high-end commercial solutions.

13.
Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative language use. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting. In this work, we explore hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. We experiment with traditional and recent neural architectures, and propose two joint-learning models using different multilingual language representations to transfer knowledge between pairs of languages. We also evaluate the impact of additional knowledge by incorporating information from a multilingual lexicon of abusive words. The results show that our joint-learning models achieve the best performance on most languages. However, a simple approach that uses machine translation and a pre-trained English language model achieves robust performance. In contrast, Multilingual BERT fails to obtain good performance in cross-lingual hate speech detection. We also found experimentally that the external knowledge from a multilingual abusive lexicon improves the models’ performance, specifically in detecting the positive class. The results of our experimental evaluation highlight a number of challenges and issues in this particular task. One of the main challenges relates to current benchmarks for hate speech detection, in particular how bias arising from the topical focus of the datasets influences classification performance. The insufficient ability of current multilingual language models to transfer knowledge between languages in the specific hate speech detection task also remains an open problem. However, our experimental evaluation and qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive language lexicon helps to alleviate this issue.

14.
邵春发 《科教文汇》2012,(20):82-83
The spread of the Internet has given rise to a new mode of language: Internet language. Internet language reflects people's social and cultural psychology in real time. However, the great freedom of expression online has produced Internet usage of very mixed quality, where the refined and the vulgar coexist. Starting from the relationship between culture and society, this paper discusses the "rights and wrongs" of Internet language.

15.
Ethnicity-targeted hate speech has been widely shown to influence on-the-ground inter-ethnic conflict and violence, especially in multi-ethnic societies such as Russia. Detecting ethnicity-targeted hate speech in user texts is therefore becoming an important task. However, it faces a number of unresolved problems: the difficulty of reliable mark-up; informal and indirect ways of expressing negativity in user texts (such as irony, false generalization and attribution of unfavored actions to targeted groups); users’ inclination to express opposite attitudes to different ethnic groups in the same text; and, finally, a lack of research on languages other than English. In this work we address several of these problems in the task of ethnicity-targeted hate speech detection in Russian-language social media texts. Our approach allows us to differentiate between attitudes towards different ethnic groups mentioned in the same text, a task that has never been addressed before. We use a dataset of over 2.6M user messages mentioning ethnic groups to construct a representative sample of 12K (ethnic group, text) instances that are further thoroughly annotated via a special procedure. In contrast to many previous collections, which usually comprise extreme cases of toxic speech, the representativeness of our sample secures a realistic and therefore much higher proportion of subtle negativity, which additionally complicates its automatic detection. We then experiment with four types of machine learning models, from traditional classifiers such as SVM to deep learning approaches, notably the recently introduced BERT architecture, and interpret their predictions in terms of various linguistic phenomena. In addition to hate speech detection with a text-level two-class approach (hate, no hate), we also justify and implement a unique instance-based three-class approach (positive, neutral or negative attitude, the latter implying hate speech). Our best results are achieved with a pre-trained and fine-tuned RuBERT combined with linguistic features, reaching F1-hate = 0.760 and F1-macro = 0.833 on the text-level two-class problem, comparable to previous studies, and F1-hate = 0.813 and F1-macro = 0.824 on our unique instance-based three-class hate speech detection task. Finally, we perform an error analysis, which reveals that further improvement could be achieved by handling complex and creative language more accurately, i.e., by detecting irony and unconventional forms of obscene lexicon.

16.
Taking the literature on student school weariness retrieved from CNKI (中国知网) as source material, this paper analyses the current state of domestic research on school weariness in terms of research subjects, research methods and research content. It finds that existing research suffers from imprecise definition of the concept, non-standardized survey instruments, insufficient analysis of the formation mechanism, and a lack of systematic and sustained study. On this basis, it suggests directions for further research: defining school weariness precisely through dedicated studies, developing suitable research instruments, analysing the formation mechanism of school weariness in depth, exploring methods and techniques for addressing the problem, and strengthening the systematicity and continuity of research on school weariness.

17.
The research field of crisis informatics examines, amongst other things, the potentials and barriers of social media use during disasters and emergencies. Social media allow emergency services to receive valuable information (e.g., eyewitness reports, pictures, or videos). However, the vast amount of data generated during large-scale incidents can lead to information overload. Research indicates that supervised machine learning techniques are suitable for identifying relevant messages and filtering out irrelevant ones, thus mitigating information overload. Still, they require a considerable amount of labeled data, clear criteria for relevance classification, a usable interface to facilitate the labeling process, and a mechanism to rapidly deploy retrained classifiers. To overcome these issues, we present (1) a system for social media monitoring, analysis and relevance classification; (2) abstract and precise criteria for relevance classification in social media during disasters and emergencies; (3) the evaluation of a well-performing Random Forest algorithm for relevance classification that incorporates social media metadata into a batch learning approach (e.g., 91.28%/89.19% accuracy, 98.3%/89.6% precision and 80.4%/87.5% recall, with fast training times using feature subset selection, on the European floods / BASF SE incident datasets); and (4) an approach, with a preliminary evaluation, for relevance classification that includes active, incremental and online learning to reduce the amount of required labeled data and to correct misclassifications of the algorithm through feedback classification. Using the latter approach, we achieved a well-performing classifier on the European floods dataset while requiring only a quarter of the labeled data needed by the traditional batch learning approach. Although the effect was smaller on the BASF SE incident dataset, a substantial improvement could still be observed.
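A small sketch in the spirit of the batch-learning setup described above: bag-of-words text features combined with simple metadata features and fed to a Random Forest relevance classifier. The data and the specific metadata features are placeholders; the active, incremental and online learning parts are not shown.

```python
# Illustrative relevance classifier: bag-of-words text features plus simple metadata
# features, fed to a Random Forest. Data and feature choices are placeholders.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

texts = ["Street flooded near the station, need sandbags",
         "Good morning everyone, have a nice day",
         "Evacuation ordered for the riverside district",
         "Check out my new playlist"]
metadata = np.array([[1, 120], [0, 3], [1, 450], [0, 10]])  # e.g., [has_photo, retweet_count]
labels = np.array([1, 0, 1, 0])                             # 1 = relevant, 0 = irrelevant

vec = CountVectorizer()
X = hstack([vec.fit_transform(texts), csr_matrix(metadata)])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)

new_text = vec.transform(["Water rising fast, bridge closed"])
new_meta = csr_matrix([[1, 80]])
print(clf.predict(hstack([new_text, new_meta])))   # expected: [1] (relevant)
```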

18.
Users’ ability to retweet information has made Twitter one of the most prominent social media platforms for disseminating emergency information during disasters. However, few studies have examined how Twitter’s features can support the different communication patterns that occur during different phases of disaster events. Based on the disaster communication literature and Media Synchronicity Theory, we identify distinct disaster phases and the two communication types—crisis communication and risk communication—that occur during those phases. We investigate how Twitter’s representational features, including words, URLs, hashtags, and hashtag importance, influence the average retweet time—that is, the average time it takes for a retweet to occur—as well as how such effects differ depending on the type of disaster communication. Our analysis of tweets from the 2013 Colorado floods found that adding more URLs to tweets increases the average retweet time more in risk-related tweets than in crisis-related tweets. Further, including key disaster-related hashtags contributed to faster retweets in crisis-related tweets than in risk-related tweets. Our findings suggest that the influence of Twitter’s media capabilities on rapid tweet propagation during disasters may differ based on the communication processes.

19.
Interest in real-time syndromic surveillance based on social media data has greatly increased in recent years. The ability to detect disease outbreaks earlier than traditional methods allow would be highly useful for public health officials. This paper describes a software system built upon recent developments in machine learning and data processing to achieve this goal. The system is composed of reusable modules integrated into data processing pipelines that are easily deployable and configurable. It applies deep learning to the problem of classifying health-related tweets and is able to do so with high accuracy. It can detect illness outbreaks from Twitter data and then build up and display information about these outbreaks, including relevant news articles, to provide situational awareness. It also provides nowcasting of current disease levels from previous clinical data combined with Twitter data. The preliminary results are promising, with the system able to detect outbreaks of influenza-like illness symptoms that could then be confirmed by existing official sources. The Nowcasting module shows that using social media data can improve prediction for multiple diseases over simply using traditional data sources.

20.
The emergence of social media and the huge number of opinions posted every day have influenced online reputation management. Reputation experts need to filter and control what is posted online and, more importantly, determine whether an online post will have positive or negative implications for the entity of interest. This task is challenging, considering that there are posts that have implications for an entity's reputation but do not express any sentiment. In this paper, we propose two approaches for propagating sentiment signals to estimate the reputation polarity of tweets. The first approach is based on sentiment lexicon augmentation, whereas the second is based on direct propagation of sentiment signals to tweets that discuss the same topic. In addition, we present a polar fact filter that is able to differentiate between reputation-bearing and reputation-neutral tweets. Our experiments indicate that weakly supervised annotation of reputation polarity is feasible and that sentiment signals can be propagated to effectively estimate the reputation polarity of tweets. Finally, we show that learning PMI values from the training data is the most effective approach for reputation polarity analysis.
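As an illustration of the final point, a toy sketch of learning PMI values from training data: each term is scored by its pointwise mutual information with the positive-reputation class. The texts, labels, and smoothing are placeholders, not the paper's actual procedure.

```python
# Toy PMI between a term and the positive-reputation class:
# PMI(term, pos) = log( P(term, pos) / (P(term) * P(pos)) ). Data are placeholders.
import math
from collections import Counter

def pmi_scores(texts, labels, smoothing=1.0):
    n = len(texts)
    n_pos = sum(labels)
    term_counts, term_pos_counts = Counter(), Counter()
    for text, label in zip(texts, labels):
        for w in set(text.lower().split()):
            term_counts[w] += 1
            if label == 1:
                term_pos_counts[w] += 1
    scores = {}
    for w, c in term_counts.items():
        p_term = (c + smoothing) / (n + smoothing)
        p_pos = (n_pos + smoothing) / (n + smoothing)
        p_joint = (term_pos_counts[w] + smoothing) / (n + smoothing)
        scores[w] = math.log(p_joint / (p_term * p_pos))
    return scores

scores = pmi_scores(["record profits announced", "ceo resigns amid scandal"], [1, 0])
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])   # terms most associated with positive reputation
```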
