Similar Literature
20 similar documents retrieved (search time: 78 ms)
1.
Social networks have grown into a widespread form of communication that allows a large number of users to participate in conversations and consume information at any time. The casual nature of social media allows for nonstandard terminology, some of which may be considered rude and derogatory. As a result, a significant portion of social media users is found to express disrespectful language. This problem may intensify in certain developing countries where young children are granted unsupervised access to social media platforms. Furthermore, the sheer amount of social media data generated daily by millions of users makes it impractical for humans to monitor and regulate inappropriate content. If adolescents are exposed to these harmful language patterns without adequate supervision, they may feel obliged to adopt them. In addition, unrestricted aggression in online forums may result in cyberbullying and other dreadful occurrences. While computational linguistics research has addressed the difficulty of detecting abusive dialogues, questions remain open for low-resource languages with little annotated data, where the majority of supervised techniques perform poorly. In addition, social media content is often presented in complex, context-rich formats that encourage creative user involvement. We therefore propose to improve abusive language detection and classification in a low-resource setting by using both the abundant unlabeled data and context features via a co-training protocol, in which two machine learning models, each learning from an orthogonal set of features, teach each other, improving overall performance. Empirical results reveal that our proposed framework achieves F1 values of 0.922 and 0.827, surpassing the state-of-the-art baselines by 3.32% and 45.85% for binary and fine-grained classification tasks, respectively. In addition to proving the efficacy of co-training in a low-resource setting for abusive language detection and classification, the findings shed light on several opportunities to use unlabeled data and the contextual characteristics of social networks in a variety of social computing applications.
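A minimal sketch of such a co-training loop, assuming the two orthogonal views arrive as separate feature matrices (`X_text`, `X_ctx`); the classifier choice, number of rounds, and per-round growth `k` are illustrative assumptions, not the authors' configuration:

```python
# Co-training sketch: each view's classifier pseudo-labels the most confident
# unlabeled examples, which then become training data for both views.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X_text, X_ctx, y, labeled_idx, unlabeled_idx, rounds=5, k=20):
    labeled, pool, y = list(labeled_idx), list(unlabeled_idx), y.copy()
    clf_text = LogisticRegression(max_iter=1000)
    clf_ctx = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        if not pool:
            break
        clf_text.fit(X_text[labeled], y[labeled])
        clf_ctx.fit(X_ctx[labeled], y[labeled])
        for clf, X in ((clf_text, X_text), (clf_ctx, X_ctx)):
            proba = clf.predict_proba(X[pool])
            top = np.argsort(proba.max(axis=1))[-k:]             # most confident
            chosen = [pool[i] for i in top]
            y[chosen] = clf.classes_[proba[top].argmax(axis=1)]  # pseudo-labels
            labeled.extend(chosen)                                # teach the other view
            pool = [i for i in pool if i not in set(chosen)]
    return clf_text, clf_ctx
```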

2.
We study the selection of transfer languages for automatic abusive language detection. Instead of preparing a dataset for every language, we demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection. This way we can use existing data from higher-resource languages to build better detection systems for low-resource languages. Our datasets cover seven languages from three language families. We measure the distance between the languages using several language similarity measures, notably by quantifying features from the World Atlas of Language Structures (WALS). We show that there is a correlation between linguistic similarity and classifier performance. This discovery allows us to choose an optimal transfer language for zero-shot abusive language detection.
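A toy sketch of this selection idea, encoding each language as a vector of WALS-style feature values and picking the typologically closest candidate; the feature vectors below are invented for illustration:

```python
# Transfer-language selection via typological (WALS-style) similarity.
import numpy as np

wals = {  # hypothetical, tiny feature vectors standing in for real WALS values
    "english": np.array([1, 0, 2, 1, 0]),
    "german":  np.array([1, 0, 2, 0, 0]),
    "finnish": np.array([0, 1, 1, 0, 1]),
    "turkish": np.array([0, 1, 1, 0, 1]),
}

def wals_distance(a, b):
    """Fraction of features on which two languages disagree (Hamming)."""
    return np.mean(wals[a] != wals[b])

def best_transfer_language(target, candidates):
    # The paper reports that classifier performance correlates with linguistic
    # similarity, so the typologically closest candidate is a sensible choice.
    return min(candidates, key=lambda c: wals_distance(target, c))

print(best_transfer_language("turkish", ["english", "german", "finnish"]))
```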

3.
4.
Sentiment analysis (SA) is a continuing field of research that lies at the intersection of many fields such as data mining, natural language processing and machine learning. It is concerned with the automatic extraction of opinions conveyed in a given text. Due to its vast applications, many studies have been conducted in the area of SA, especially on English texts, while other languages such as Arabic have received less attention. This survey presents a comprehensive overview of the work done so far on Arabic SA (ASA). The survey groups published papers based on the SA-related problems they address and identifies gaps in the current literature, laying a foundation for future studies in this field.

5.
Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative language use. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting. In this work, we explore hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. We experiment with traditional and recent neural architectures, and propose two joint-learning models, using different multilingual language representations to transfer knowledge between pairs of languages. We also evaluate the impact of additional knowledge by incorporating information from a multilingual lexicon of abusive words. The results show that our joint-learning models achieve the best performance on most languages. However, a simple approach that uses machine translation and a pre-trained English language model achieves robust performance. In contrast, Multilingual BERT fails to obtain good performance in cross-lingual hate speech detection. We also found experimentally that external knowledge from a multilingual abusive lexicon improves the models' performance, specifically in detecting the positive class. The results of our experimental evaluation highlight a number of challenges and issues in this particular task. One of the main challenges concerns current benchmarks for hate speech detection, in particular how bias related to the topical focus of the datasets influences classification performance. The insufficient ability of current multilingual language models to transfer knowledge between languages in the specific task of hate speech detection also remains an open problem. However, our experimental evaluation and qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive language lexicon helps to alleviate this issue.
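A small sketch of the lexicon-injection idea described above: the count of matches against a multilingual abusive-word lexicon is appended as an explicit feature next to a sentence representation. The lexicon entries and dimensions here are invented stand-ins:

```python
# Injecting external lexicon knowledge as an extra feature.
import numpy as np

ABUSIVE_LEXICON = {"idiota", "stupido", "imbécil"}  # hypothetical multilingual entries

def with_lexicon_feature(sentence_embedding: np.ndarray, text: str) -> np.ndarray:
    """Append the abusive-lexicon hit count to the sentence embedding."""
    hits = sum(tok.lower() in ABUSIVE_LEXICON for tok in text.split())
    return np.concatenate([sentence_embedding, [float(hits)]])

print(with_lexicon_feature(np.zeros(4), "sei un idiota").shape)  # (5,)
```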

6.
Due to the vast volumes of newly streamed data on the Internet and social media, the use of sentiment analysis (SA) to extract information and analyze people's opinions has become a trendy topic. Yet the majority of research has focused on English, despite the fact that other languages, such as Arabic, are among the most popular on the Internet. Given the language's numerous dialects and the varied ways their data have been annotated and processed, the scarcity of research in this field is evident, and understanding these initiatives merits a great deal of attention in Arabic SA research. To the best of our knowledge, this domain has not been surveyed before, and thus the aim of this study is to perform a systematic review of SA and data annotation for Arabic dialects published between 2015 and 2023. The outcomes of this research offer a refined taxonomy of data annotation methods classified into three categories: (1) manual, (2) automatic, and (3) hybrid methods. In addition, a discussion of the research challenges, motivations, and recommendations is presented, with a detailed taxonomy analysis of current research trends; from this, we identify new research gaps and propose research implications and future directions that will encourage more scholars to contribute to Arabic SA research, facilitate more successful multilingual SA applications, and provide insights regarding Arabic SA in different contexts.

7.
With the explosion of multilingual content on the Web, particularly on social media platforms, identifying the languages present in a text is becoming an important task for various applications. While automatic language identification (ALI) in social media text is considered a non-trivial task due to the presence of slang words, misspellings, creative spellings and special elements such as hashtags and user mentions, ALI in a multilingual environment is an even more challenging task. In a highly multilingual society, code-mixing without affecting the underlying language sense has become a natural phenomenon. In such a dynamic environment, conversational text alone often fails to identify the underlying languages present in the text. This paper proposes various methods of exploiting social conversational features to enhance ALI performance. Although social conversational features for ALI have been explored previously using methods like probabilistic language modeling, these models often fail to address issues related to code-mixing, phonetic typing, and out-of-vocabulary words, which are prevalent in a highly multilingual environment. This paper differs in the way social conversational features are used, proposing text refinement strategies suitable for ALI in a highly multilingual environment. The contributions of this paper therefore include the following. First, it analyzes the characteristics of various social conversational features by exploiting language usage patterns. Second, various methods of text refinement suitable for language identification are proposed. Third, the effects of the proposed refinement methods are investigated using various sentence-level language identification frameworks. Experimental observations over three conversational datasets collected from the Facebook, YouTube and Twitter social media platforms show that the proposed method of ALI using social conversational features outperforms its baseline counterparts.
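An illustrative text-refinement pass in this spirit; the specific rules (mention and URL removal, hashtag splitting, elongation squeezing) are assumptions for illustration, not the paper's exact pipeline:

```python
# Social media text refinement before language identification.
import re

def refine(text: str) -> str:
    text = re.sub(r"@\w+", " ", text)            # drop user mentions
    text = re.sub(r"https?://\S+", " ", text)    # drop URLs
    text = re.sub(r"#(\w+)", r"\1", text)        # keep the hashtag word, drop '#'
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)  # squeeze elongations: "soooo" -> "soo"
    return re.sub(r"\s+", " ", text).strip()

print(refine("sooooo coool!!! check https://t.co/x #mixedLang @friend"))
```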

8.
With the rapid development of mobile computing and Web technologies, online hate speech has spread increasingly through social network platforms, since it is easy to post any opinion. Previous studies confirm that exposure to online hate speech has serious offline consequences for historically deprived communities. Thus, research on automated hate speech detection has attracted much attention. However, the role of social networks in identifying hate-related vulnerable communities is not well investigated. Hate speech can affect all population groups, but some are more vulnerable to its impact than others. For example, for ethnic groups whose languages have few computational resources, it is a challenge to automatically collect and process online texts, let alone detect hate speech on social media automatically. In this paper, we propose a hate speech detection approach to identify hatred against vulnerable minority groups on social media. Firstly, in the Spark distributed processing framework, posts are automatically collected and pre-processed, and features are extracted using word n-grams and word embedding techniques such as Word2Vec. Secondly, deep learning classification algorithms, such as the Gated Recurrent Unit (GRU), a variant of the Recurrent Neural Network (RNN), are used for hate speech detection. Finally, hate words are clustered with methods such as Word2Vec to predict the ethnic group potentially targeted by hatred. In our experiments, we use the Amharic language in Ethiopia as an example. Since there was no publicly available dataset of Amharic texts, we crawled Facebook pages to prepare the corpus, and since data annotation can be biased by culture, we recruited annotators from different cultural backgrounds and achieved better inter-annotator agreement. In our experimental results, feature extraction using word embedding techniques such as Word2Vec performs better with both classical and deep learning-based classification algorithms, among which GRU achieves the best result. Our proposed approach successfully identifies the Tigre ethnic group as the community most vulnerable to hatred, compared with the Amhara and Oromo. Identifying vulnerable groups is therefore vital to protecting them: automatic hate speech detection models can remove content that aggravates psychological harm and physical conflict, and can also pave the way toward policies, strategies, and tools to empower and protect vulnerable communities.
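A compact sketch of the Word2Vec-embeddings-plus-GRU classifier idea in PyTorch; the vocabulary size, dimensions, and layer sizes are illustrative assumptions, and in the paper's setting the embedding layer would be initialized from Word2Vec vectors trained on the crawled corpus:

```python
# GRU classifier over (Word2Vec-initializable) token embeddings.
import torch
import torch.nn as nn

class GRUHateSpeechClassifier(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=100, hidden=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # load Word2Vec weights here
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):          # token_ids: (batch, seq_len)
        _, h = self.gru(self.emb(token_ids))
        return self.out(h[-1])             # logits from the last hidden state

model = GRUHateSpeechClassifier()
logits = model(torch.randint(0, 20000, (4, 30)))  # 4 posts, 30 tokens each
print(logits.shape)                               # torch.Size([4, 2])
```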

9.
Named entity recognition aims to detect pre-determined entity types in unstructured text. There are only a limited number of studies on this task for low-resource languages such as Turkish. We provide a comprehensive study of Turkish named entity recognition by comparing the performance of existing state-of-the-art models on datasets from varying domains, in order to understand their generalization capability and to analyze why such models fail or succeed in this task. Our experimental results, supported by statistical tests, show that the highest weighted F1 scores are obtained by Transformer-based language models, varying from 80.8% on tweets to 96.1% on news articles. We find that Transformer-based language models are more robust than traditional models to entity types with a small sample size and to longer named entities, yet all models perform poorly on longer named entities in social media. Moreover, when we shuffle 80% of the words in a sentence to imitate the flexible word order of Turkish, we observe greater performance deterioration in well-written texts (12%) than in noisy text (7%).
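A sketch of the perturbation just described: shuffle 80% of the words in a sentence to imitate Turkish's flexible word order and probe NER robustness. The example sentence is invented:

```python
# Shuffle 80% of the word positions in a sentence, leaving the rest in place.
import random

def shuffle_80_percent(sentence: str, seed: int = 0) -> str:
    words = sentence.split()
    rng = random.Random(seed)
    idx = rng.sample(range(len(words)), k=int(0.8 * len(words)))  # positions to disturb
    picked = [words[i] for i in idx]
    rng.shuffle(picked)
    for i, w in zip(idx, picked):
        words[i] = w
    return " ".join(words)

print(shuffle_80_percent("Mustafa Kemal yarın Ankara'da bir konuşma yapacak ."))
```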

10.
Despite growing efforts to halt distasteful content on social media, multilingualism has added a new dimension to this problem, and the scarcity of resources makes the challenge even greater for low-resource languages. This work provides a novel method for abusive content detection in multiple low-resource Indic languages. Our observations indicate that a post's tendency to attract abusive comments, together with features such as user history and social context, significantly aids the detection of abusive content. The proposed method first learns social-context and text-context features in two separate modules; the integrated representation from these modules is then learned and used for the final prediction. To evaluate our method against classical and state-of-the-art methods, we performed extensive experiments on the SCIDN and MACI datasets, consisting of 1.5M and 665K multilingual comments, respectively. Our proposed method outperforms state-of-the-art baseline methods with average F1-score increases of 4.08% and 9.52% on the SCIDN and MACI datasets, respectively.
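A minimal sketch of that two-module design: one module encodes the text, another encodes social-context features (e.g. user history), and their concatenated representation feeds the final prediction. All sizes and the fusion-by-concatenation choice are illustrative assumptions:

```python
# Two-module abusive-content detector: text branch + social-context branch.
import torch
import torch.nn as nn

class ContextAwareAbuseDetector(nn.Module):
    def __init__(self, text_dim=768, social_dim=32, hidden=128, n_classes=2):
        super().__init__()
        self.text_mod = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.social_mod = nn.Sequential(nn.Linear(social_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, text_feats, social_feats):
        z = torch.cat([self.text_mod(text_feats),
                       self.social_mod(social_feats)], dim=-1)
        return self.head(z)  # prediction from the integrated representation

model = ContextAwareAbuseDetector()
print(model(torch.randn(8, 768), torch.randn(8, 32)).shape)  # torch.Size([8, 2])
```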

11.
Conceptual metaphor detection is a well-researched topic in Natural Language Processing. At the same time, analysis of conceptual metaphor use produces unique insight into individual psychological processes and characteristics, as demonstrated by research in cognitive psychology. Although state-of-the-art language models allow for highly effective automatic detection of conceptual metaphor in benchmark datasets, these models have never been applied to psychological tasks. The benchmark datasets differ considerably from experimental texts recorded or produced in a psychological setting, in their domain, genre, and the scope of metaphoric expressions covered. We present the first experiment to apply NLP metaphor detection methods to a psychological task, specifically the analysis of individual differences. For that, we annotate MetPersonality, a dataset of Russian texts written in a psychological experiment setting, with conceptual metaphor. With a widely used conceptual metaphor annotation procedure, we obtain low annotation quality, which arises from dataset characteristics uncommon in typical automatic metaphor detection tasks. We propose a novel conceptual metaphor annotation procedure that mitigates these quality issues, increasing inter-annotator agreement to a moderately high level. We leverage the annotated dataset and existing Russian metaphor datasets to select, train and evaluate state-of-the-art metaphor detection models, obtaining acceptable results on the metaphor detection task. The most effective model is then used to detect conceptual metaphor automatically in RusPersonality, a larger dataset containing meta-information on the psychological traits of the participant authors. Finally, we analyze correlations of automatically detected metaphor use with psychological traits encoded in the Freiburg Personality Inventory (FPI). Our pioneering work on automatically detected metaphor use and individual differences demonstrates the possibility of unprecedented large-scale research on the relation between metaphor use and personality traits and dispositions, and cognitive and emotional processing.
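A minimal sketch of the final analysis step described above: correlate each author's automatically detected metaphor rate with a personality scale score. The numbers below are synthetic placeholders, not the study's data:

```python
# Correlating per-author metaphor rate with a personality scale.
import numpy as np
from scipy.stats import pearsonr

metaphor_rate = np.array([0.04, 0.11, 0.07, 0.02, 0.09])  # metaphors per token
fpi_scale = np.array([3.0, 6.5, 5.0, 2.5, 6.0])           # e.g. one FPI dimension

r, p = pearsonr(metaphor_rate, fpi_scale)
print(f"r={r:.2f}, p={p:.3f}")
```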

12.
[Purpose/Significance] To explore, from a cross-lingual perspective, how to better solve entity extraction for low-resource languages. [Method/Process] Taking English as the source language and Spanish and Dutch as the target languages, and drawing on the ideas of transfer learning and deep learning, we propose an unsupervised cross-lingual entity extraction method that combines self-learning with a GRU-LSTM-CRF network. [Results/Conclusions] Compared with supervised cross-lingual entity extraction methods, the proposed unsupervised method achieves better results: an F1 value of 0.6419 on Spanish and 0.6557 on Dutch. Cross-lingual knowledge builds a bridge between the source and target languages, improving entity extraction for low-resource languages.
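A schematic self-learning loop of the kind described above; `tagger` is a hypothetical stand-in for the paper's GRU-LSTM-CRF network, and its `fit`/`predict_with_confidence` interface and the confidence threshold are assumptions for illustration:

```python
# Self-learning: iteratively absorb confidently tagged target-language sentences.
def self_learning(tagger, labeled, unlabeled, threshold=0.9, rounds=5):
    for _ in range(rounds):
        tagger.fit(labeled)                        # train on source + pseudo-labeled data
        still_unlabeled = []
        for sent in unlabeled:
            tags, conf = tagger.predict_with_confidence(sent)
            if conf >= threshold:                  # keep only confident predictions
                labeled.append((sent, tags))       # pseudo-label becomes training data
            else:
                still_unlabeled.append(sent)
        unlabeled = still_unlabeled
    return tagger
```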

13.
The wide spread of false information has detrimental effects on society, and false information detection has received wide attention. When new domains appear, the relevant labeled data is scarce, which poses severe challenges for detection. Previous work mainly leverages additional data or domain adaptation technology to assist detection. The former leads to a heavy data burden; the latter underutilizes the pre-trained language model, because there is a gap between the downstream task and the pre-training task, and it is also inefficient in model storage, since a separate set of parameters must be stored for each domain. To this end, we propose a meta-prompt based learning (MAP) framework for low-resource false information detection. We tap the potential of pre-trained language models by constructing templates that transform the detection task into the pre-training task. Because a randomly initialized template hinders this, we learn optimal initialization parameters by borrowing meta learning's strength in fast parameter training. The combination of meta learning and prompt learning for detection is non-trivial: constructing meta tasks that yield initialization parameters suitable for different domains, and setting up the prompt model's verbalizer for classification in a noisy low-resource scenario, are both challenging. For the former, we propose a multi-domain meta task construction method to learn domain-invariant meta knowledge. For the latter, we propose a prototype verbalizer to summarize category information and design a noise-resistant prototyping strategy to reduce the influence of noisy data. Extensive experiments on real-world data demonstrate the superiority of MAP in new domains of false information detection.
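A sketch of the prototype-verbalizer idea: summarize each class by the mean embedding of its few labeled examples, then classify a new embedding by nearest prototype. The random embeddings are stand-ins for masked-position representations from a pre-trained language model:

```python
# Prototype verbalizer: nearest class prototype by cosine similarity.
import numpy as np

def build_prototypes(embeddings, labels):
    """Class prototype = mean embedding of that class's labeled examples."""
    return {c: embeddings[labels == c].mean(axis=0) for c in np.unique(labels)}

def classify(x, prototypes):
    sims = {c: x @ p / (np.linalg.norm(x) * np.linalg.norm(p))
            for c, p in prototypes.items()}
    return max(sims, key=sims.get)

emb = np.random.randn(10, 16)                      # placeholder embeddings
lab = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
protos = build_prototypes(emb, lab)
print(classify(np.random.randn(16), protos))
```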

14.
Cross-lingual semantic interoperability has drawn significant attention in recent digital library and World Wide Web research, as information in languages other than English has grown exponentially. Cross-lingual information retrieval (CLIR) across different European languages, such as English, Spanish, and French, has been widely explored; however, CLIR between European and Oriental languages is still in its initial stage. To cross the language boundary, the corpus-based approach is promising for overcoming the limitations of the knowledge-based and controlled-vocabulary approaches, but collecting parallel corpora between European and Oriental languages is not an easy task. Length-based and text-based approaches are the two major approaches to aligning parallel documents. In this paper, we investigate several techniques using these approaches and compare their performance in aligning the English and Chinese titles of parallel documents available on the Web.
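A toy length-based alignment score in the spirit of the approaches compared above: English and Chinese titles of the same document tend to have a stable length ratio, so pairs whose ratio deviates least are the best candidates. The expected ratio is an assumption for illustration:

```python
# Length-based scoring of candidate English-Chinese title pairs.
def length_score(en_title: str, zh_title: str, expected_ratio: float = 3.0) -> float:
    """Lower is better; expected_ratio (EN chars per ZH char) is an assumption."""
    ratio = len(en_title) / max(len(zh_title), 1)
    return abs(ratio - expected_ratio)

pairs = [("Cross-lingual information retrieval", "跨语言信息检索"),
         ("Cross-lingual information retrieval", "今日天气")]
print(min(pairs, key=lambda p: length_score(*p)))  # best-aligned candidate
```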

15.
Recently, sentiment classification has received considerable attention within the natural language processing research community. However, since most recent work on sentiment classification has been done in English, there are not enough sentiment resources in other languages, and manual construction of reliable sentiment resources is a very difficult and time-consuming task. Cross-lingual sentiment classification aims to utilize annotated sentiment resources in one language (typically English) for sentiment classification of text documents in another language. Most existing work relies on automatic machine translation services to directly project information from one language to another; however, differing term distributions between original and translated documents, and translation errors, are the two main problems with using machine translation alone. To overcome these problems, we propose a novel learning model based on active learning and semi-supervised co-training that incorporates unlabelled data from the target language into the learning process in a bi-view framework. The model enriches the training data by iteratively adding the most confident automatically-labelled examples, as well as a few of the most informative manually-labelled examples drawn from the unlabelled data. Further, the model considers the density of the unlabelled data so as to select more representative unlabelled examples and avoid outlier selection in active learning. The proposed model was applied to book review datasets in three different languages. Experiments showed that it can effectively improve cross-lingual sentiment classification performance and reduce labelling effort in comparison with several baseline methods.
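A sketch of the density-aware selection just described: prefer unlabelled documents that are both uncertain (informative) and lie in dense regions, so outliers are not handed to the annotator. The scoring product and pool sizes are illustrative assumptions:

```python
# Density-weighted uncertainty sampling for active learning.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def select_for_annotation(X_unlabeled, proba, n=5):
    uncertainty = 1.0 - proba.max(axis=1)                   # low confidence = informative
    density = cosine_similarity(X_unlabeled).mean(axis=1)   # avg similarity to the pool
    return np.argsort(uncertainty * density)[-n:]           # informative *and* representative

X = np.random.randn(50, 8)                     # placeholder document vectors
proba = np.random.dirichlet([1, 1], size=50)   # placeholder classifier probabilities
print(select_for_annotation(X, proba))
```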

16.
The automatic classification of Arabic dialects is an ongoing research challenge, which has been explored in recent work that defines dialects based on increasingly limited geographic areas like cities and provinces. This paper focuses on a related, yet relatively unexplored topic: the effect of the geographical proximity of cities located in Arab countries on their dialectal similarity. Our work is twofold, relying on: (1) comparing the textual similarity between dialects using cosine similarity and (2) measuring the geographical distance between locations. We study MADAR and NADI, two established datasets of Arabic dialects from many cities and provinces. Our results indicate that cities located in different countries may in fact have more dialectal similarity than cities within the same country, depending on their geographical proximity. The correlation between dialectal similarity and city proximity suggests that cities that are closer together are more likely to share dialectal attributes, regardless of country borders. This nuance offers the potential for important advances in Arabic dialect research, as it indicates that a more granular approach to dialect classification is essential to framing the problem of Arabic dialect identification.
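A small sketch of the paper's two measurements: textual similarity between city dialects (here cosine similarity over character n-gram counts, an assumed featurization) and geographical distance (haversine). The city texts and coordinates are rough placeholders:

```python
# Dialectal similarity vs. geographical distance between cities.
import math
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def haversine_km(lat1, lon1, lat2, lon2):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

texts = {"Cairo": "ezzayak ya basha", "Alexandria": "ezzayak ya brins",
         "Rabat": "labas alik"}                       # placeholder dialect samples
vec = CountVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(texts.values())
sim = cosine_similarity(vec)                          # pairwise dialectal similarity
print(sim.round(2))
print(haversine_km(30.04, 31.24, 31.20, 29.92))       # approx. Cairo-Alexandria km
```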

17.
Lacking a universal working language, information managers around the world cannot now deal reliably and efficiently with multilingual documentation. Language mismatch paralyzes international cooperative efforts such as multinational bibliographic standardization, linking of collections, and sharing the work of classification and indexing. Knowledge of the same second language by all information managers can open the communication channels needed for worldwide cooperation. Ethnic and ideological rivalries preclude success in this role by any of the conventional languages. The planned language, Esperanto, is the logical choice because of its neutrality, rational structure, clarity and expressive power. Pioneering projects in automatic language processing, not possible in English, are feasible in Esperanto.

18.
Warning: This paper contains examples of language and images which may be offensive. Misogyny is a form of hate against women that has been spreading exponentially through the Web, especially on social media platforms. Hateful content towards women can be conveyed not only by text but also through visual and/or audio sources or their combination, highlighting the need to address it from a multimodal perspective. One of the predominant forms of multimodal content against women is the meme: an image characterized by pictorial content with an overlaying text introduced a posteriori. A meme's original aim is to be funny and/or ironic, which makes misogyny recognition in memes even more challenging. In this paper, we investigate 4 unimodal and 3 multimodal approaches to determine which source of information contributes more to the detection of misogynous memes. Moreover, we propose a bias estimation technique to identify the specific elements composing a meme that could lead to unfair models, together with a bias mitigation strategy based on Bayesian Optimization. The proposed method is able to push the prediction probabilities towards the correct class in up to 61.43% of the cases. Finally, we identify the most challenging archetypes of memes that are still far from being properly recognized, highlighting the most relevant open research directions.
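A minimal late-fusion sketch for the unimodal-versus-multimodal comparison above: combine text-branch and image-branch probabilities with a weight that could, as in the paper's mitigation strategy, be tuned (e.g. by Bayesian Optimization). The probabilities and weight are placeholders:

```python
# Weighted late fusion of per-modality class probabilities.
import numpy as np

def late_fusion(p_text, p_image, w=0.6):
    """Weighted average of the two modalities' probability vectors."""
    return w * p_text + (1 - w) * p_image

p_text = np.array([0.3, 0.7])   # P(not misogynous), P(misogynous) from the text branch
p_image = np.array([0.8, 0.2])  # ... from the visual branch
print(late_fusion(p_text, p_image).argmax())
```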

19.
农秋红. 《大众科技》, 2012, (6): 314-315, 299
Taking several institutions in Guangxi that offer ASEAN less-commonly-taught language programs as survey subjects, this study investigates each institution's talent cultivation model through questionnaires and interviews, identifies the main problems currently facing higher vocational education in ASEAN minor languages, and distills experience and insights for cultivating professionals in these language programs.

20.
The automatic language profiling technique proposed in this article, based on a triple comparable corpus, broadens the scope of this research field to include application-oriented natural language processing research. With engineering feasibility in mind, it innovatively proposes constructing a triple comparable corpus and applying automatic extraction techniques such as word n-grams, keyword clusters, and semantic multiword expressions; by contrasting Chinglish (Chinese-English) expressions against this corpus, it uncovers native-English language models, with the goal of improving and advancing natural language processing applications such as machine translation and cross-language information retrieval.
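A toy sketch of the n-gram extraction step described above: count word n-grams in a native-English sample and a Chinese-English (translated) sample and surface strings over-represented in the latter. Both corpora here are single invented sentences:

```python
# Surface n-grams over-represented in a Chinglish corpus vs. a native one.
from collections import Counter

def ngrams(text, n=2):
    toks = text.lower().split()
    return zip(*(toks[i:] for i in range(n)))

native = Counter(ngrams("we look forward to hearing from you soon"))
chinglish = Counter(ngrams("we very much hope to get your reply very much soon"))
overused = {g: c for g, c in chinglish.items() if c > native.get(g, 0)}
print(overused)
```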
