Similar Literature
20 similar documents retrieved.
1.
赵月华  朱思成  苏新宁 《情报科学》2021,39(12):165-173
【Purpose/Significance】To address the lack of domain expertise when building datasets of false online medical information, and to help construct false medical information recognition models in small-sample settings.【Method/Process】This paper proposes constructing an online false medical information dataset by transforming and extracting authoritative rumor-refutation information, and then builds traditional machine learning models, a CNN model, and a BERT model for classification.【Result/Conclusion】The results show that rumor-refutation information makes it possible to build a false medical information dataset at low cost and without expert annotation. Comparative experiments show that a BERT model pre-trained on Weibo data achieves 95.91% accuracy and an F1 score of 94.57%, improvements of nearly 6% and 4% over the traditional machine learning models and the CNN model, respectively, indicating that the pre-trained BERT model built in this paper performs better on the task of recognizing false online medical information.【Innovation/Limitation】The proposed method can build domain-specific false information datasets at low cost, and the constructed BERT-based false medical information recognition model is also of practical value in small-sample domains; however, the dataset size, the range of deep learning models compared, and the evaluation metrics still need to be extended.
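A minimal sketch of the kind of BERT fine-tuning setup described above, for orientation only; the checkpoint name, toy texts, labels, and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch: fine-tuning a Chinese BERT for false-medical-information detection.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

texts = ["喝醋可以治疗新冠肺炎", "规范佩戴口罩可以降低呼吸道传染病风险"]
labels = torch.tensor([1, 0])  # 1 = false information, 0 = reliable information (toy labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

enc = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few illustrative passes over the toy batch
    out = model(**enc, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)
print(pred.tolist())
```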

2.
Multimodal relation extraction is a critical task in information extraction, aiming to predict the class of relation between head and tail entities from linguistic sequences and related images. However, current works are vulnerable to less relevant visual objects detected in images and are unable to fuse visual information into text pre-trained models sufficiently. To overcome these problems, we propose a Two-Stage Visual Fusion Network (TSVFN) that employs a multimodal fusion approach for vision-enhanced entity relation extraction. In the first stage, we design multimodal graphs, whose novelty lies mainly in transforming sequence learning into graph learning. In the second stage, we merge the transformer-based visual representation into the text pre-trained model through a multi-scale cross-modal projector, with two multimodal fusion operations implemented inside the pre-trained model. We thereby accomplish deep interaction of multimodal, multi-structured data across the two fusion stages. Extensive experiments on the MNRE dataset show that our model outperforms the current state-of-the-art method by 1.76%, 1.52%, 1.29%, and 1.17% in terms of accuracy, precision, recall, and F1 score, respectively. Moreover, our model also achieves excellent results when fewer samples are available.
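The multi-scale cross-modal projector itself is not specified in this abstract; the sketch below shows one generic way to inject transformer-based visual features into text token representations with cross-attention. All module names, dimensions, and tensors are illustrative assumptions, not the TSVFN design.

```python
# Generic cross-attention fusion of visual patch features into text token states.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, text_dim=768, vis_dim=1024, num_heads=8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, text_dim)  # project ViT features into text space
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_feats, vis_feats):
        # text_feats: (B, T, text_dim) from a text PLM; vis_feats: (B, P, vis_dim) from a ViT
        vis = self.vis_proj(vis_feats)
        fused, _ = self.attn(query=text_feats, key=vis, value=vis)
        return self.norm(text_feats + fused)  # residual connection keeps the text signal

fusion = CrossModalFusion()
text = torch.randn(2, 32, 768)    # e.g. BERT token states
image = torch.randn(2, 49, 1024)  # e.g. ViT patch embeddings
print(fusion(text, image).shape)  # torch.Size([2, 32, 768])
```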

3.
[Purpose/Significance] To help job seekers with an information science background understand the market's specific demand for information science talent, and to guide information science educators in designing curricula and talent-cultivation goals. [Method/Process] Job postings related to information science were collected from major domestic recruitment websites to build an information science recruitment corpus; a CRF machine learning model and Bi-LSTM-CRF, BERT, and BERT-Bi-LSTM-CRF deep learning models were then used to extract five types of recruitment entities from the corpus for mining and analysis. [Result/Conclusion] In comparative experiments on automatic entity extraction over an existing corpus of 2,000 annotated job postings, the best-performing CRF model achieved an overall F1 of 85.07%, including an F1 of 91.67% for the "professional requirement" entity, while the BERT model reached an F1 of 92.10% on that entity type. The CRF model was then used to extract entities from all 5,287 qualifying job postings, a social network of information science recruitment entities was built, and implicit knowledge was mined through informetric analysis and social network analysis.
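As a rough illustration of a classic CRF sequence labeler with hand-crafted features, here is a minimal sklearn-crfsuite sketch. The feature functions, the toy sentence, and the hypothetical SKILL tag are assumptions for illustration only, not the paper's actual feature set or five entity types.

```python
# Toy character-level CRF for tagging entities in a Chinese job-posting snippet.
import sklearn_crfsuite

def char_features(sent, i):
    ch = sent[i]
    feats = {"char": ch, "is_digit": ch.isdigit()}
    if i > 0:
        feats["prev_char"] = sent[i - 1]
    if i < len(sent) - 1:
        feats["next_char"] = sent[i + 1]
    return feats

# toy training sample: "熟悉Python" with a hypothetical SKILL entity
sents = ["熟悉Python"]
tags = [["O", "O", "B-SKILL", "I-SKILL", "I-SKILL", "I-SKILL", "I-SKILL", "I-SKILL"]]

X = [[char_features(s, i) for i in range(len(s))] for s in sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, tags)
print(crf.predict(X))
```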

4.
Section identification is an important task for library science, especially knowledge management. Identifying the sections of a paper helps filter noise in entity and relation extraction. In this research, we study the paper section identification problem in the context of Chinese medical literature analysis, where the subjects, methods, and results are the more valuable sections from a physician's perspective. Based on previous studies on English literature section identification, we experiment with effective features to use with classic machine learning algorithms to tackle the problem. We find that Conditional Random Fields, which consider sentence interdependency, are more effective at combining different feature sets, such as bag-of-words, part-of-speech, and headings, for Chinese literature section identification. Moreover, we find that classic machine learning algorithms are more effective than generic deep learning models for this problem. Based on these observations, we design a novel deep learning model, the Structural Bidirectional Long Short-Term Memory (SLSTM) model, which models word and sentence interdependency together with contextual information. Experiments on a human-curated asthma literature dataset show that our approach outperforms traditional machine learning methods and other deep learning methods, achieving close to 90% precision and recall on the task. The model shows good potential for use in other text mining tasks. The research has significant methodological and practical implications.
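The SLSTM architecture is only named in this abstract; as a rough stand-in, the sketch below labels a sequence of sentences with a word-level BiLSTM feeding a sentence-level BiLSTM, so each sentence's section label can depend on its neighbours. All sizes and the four section labels are assumptions, not the paper's model.

```python
# Simplified stand-in for sentence-interdependent section labelling.
import torch
import torch.nn as nn

class SentenceSequenceLabeler(nn.Module):
    def __init__(self, word_vocab=20000, word_dim=100, hidden=128, n_sections=4):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)
        self.word_lstm = nn.LSTM(word_dim, hidden, bidirectional=True, batch_first=True)
        self.sent_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_sections)  # e.g. subject/method/result/other

    def forward(self, docs):
        # docs: (B, n_sent, n_word) word IDs for each sentence of each paper
        B, S, W = docs.shape
        words = self.word_emb(docs.view(B * S, W))
        _, (h, _) = self.word_lstm(words)                       # h: (2, B*S, hidden)
        sent_vecs = torch.cat([h[0], h[1]], dim=-1).view(B, S, -1)
        ctx, _ = self.sent_lstm(sent_vecs)                      # sentences see their neighbours
        return self.classifier(ctx)                             # (B, n_sent, n_sections)

model = SentenceSequenceLabeler()
print(model(torch.randint(0, 20000, (2, 10, 30))).shape)  # torch.Size([2, 10, 4])
```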

5.
An idiom is a common phrase that means something other than its literal meaning. Detecting idioms automatically is a serious challenge in natural language processing (NLP) applications such as information retrieval (IR), machine translation, and chatbots, and automatic idiom detection plays an important role in all of them. A fundamental NLP task is text classification, which categorizes text into structured categories, also known as text labeling or categorization. This paper treats idiom identification as a text classification task. Pre-trained deep learning models have been used for several text classification tasks, though models like BERT and RoBERTa have not been used exclusively for idiom versus literal classification. We propose a predictive ensemble model to classify idioms and literals using BERT and RoBERTa, fine-tuned with the TroFi dataset. The model is tested on a newly created in-house dataset of idioms and literal expressions, numbering 1,470 in all and annotated by domain experts. Our model outperforms the baseline models in terms of the metrics considered, such as F-score and accuracy, with a 2% improvement in accuracy.
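One simple way to ensemble the two classifiers is to average their softmax probabilities, sketched below. The checkpoint names, the label convention (1 = idiomatic), and the example sentence are placeholders; the authors' fine-tuned models and exact ensembling strategy are not given in the abstract.

```python
# Probability-averaging ensemble over a BERT and a RoBERTa sentence classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def probs(model_name, text):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
    enc = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**enc).logits.softmax(dim=-1)

text = "He finally kicked the bucket after a long illness."
p = (probs("bert-base-uncased", text) + probs("roberta-base", text)) / 2
print("idiomatic" if p[0, 1] > p[0, 0] else "literal")  # assumes label 1 = idiomatic
```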

6.
The spread of fake news has become a significant social problem, drawing great concern for fake news detection (FND). Pretrained language models (PLMs), such as BERT and RoBERTa, can benefit this task greatly, leading to state-of-the-art performance. The common paradigm for utilizing these PLMs is fine-tuning, in which a linear classification layer is built upon the well-initialized PLM network, resulting in an FND model, and the full model is then tuned on a training corpus. Although great successes have been achieved, this paradigm still involves a significant gap between language model pretraining and target-task fine-tuning. Fortunately, prompt learning, a new alternative for exploiting PLMs, can handle this issue naturally, showing potential for further performance improvements. To this end, we propose knowledgeable prompt learning (KPL) for this task. First, we apply prompt learning to FND by carefully designing a sophisticated prompt template and the corresponding verbalizer words for the task. Second, we incorporate external knowledge into the prompt representation, making the representation more expressive for predicting the verbalizer words. Experimental results on two benchmark datasets demonstrate that prompt learning is better than the baseline fine-tuning paradigm for FND and can outperform all previous representative methods. Our final knowledgeable model (i.e., KPL) provides further improvements; in particular, it achieves an average increase of 3.28% in F1 score under low-resource conditions compared with fine-tuning.
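A minimal prompt-learning sketch in the spirit described above: the claim is wrapped in a cloze template and the masked-LM scores of two verbalizer words are compared. The template, the verbalizer ("real"/"fake"), and the example claim are illustrative assumptions rather than the KPL design, and the external-knowledge component is omitted.

```python
# Prompt-based zero-shot scoring of a claim with a masked language model.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

claim = "Drinking bleach cures the flu."
prompt = f"{claim} This news is {tokenizer.mask_token}."
enc = tokenizer(prompt, return_tensors="pt")
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]

with torch.no_grad():
    logits = model(**enc).logits[0, mask_pos].squeeze(0)

verbalizer = {"real": tokenizer.convert_tokens_to_ids("real"),
              "fake": tokenizer.convert_tokens_to_ids("fake")}
scores = {label: logits[idx].item() for label, idx in verbalizer.items()}
print(max(scores, key=scores.get))  # predicted label from the verbalizer words
```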

7.
Existing approaches to identifying answer quality in online health question answering (HQA) communities either address it subjectively through human assessment or rely mainly on textual features. This process can be time-consuming and lose the semantic information of answers. We present an automatic approach for predicting answer quality that combines sentence-level semantics with textual and non-textual features in the context of online healthcare. First, we extend the knowledge adoption model (KAM) theory to obtain six dimensions of quality measures for textual and non-textual features. Then we apply the Bidirectional Encoder Representations from Transformers (BERT) model to extract semantic features. Next, the multi-dimensional features are reduced in dimensionality using linear discriminant analysis (LDA). Finally, we feed the preprocessed features into the proposed BK-XGBoost method to automatically predict answer quality. The proposed method is validated on a real-world dataset of 48,121 question-answer pairs crawled from the most popular online HQA communities in China. The experimental results indicate that our method compares favorably with the baseline models on various evaluation metrics, with improvements in AUC of up to 2.9% and 5.7% over the BERT and XGBoost models, respectively.
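A rough sketch of the described feature pipeline, combining BERT sentence vectors with non-textual features, reducing them with LDA, and classifying with XGBoost. The toy answers, the choice of non-textual features, and all hyperparameters are assumptions, not the paper's BK-XGBoost configuration.

```python
# BERT [CLS] vectors + non-textual features -> LDA -> XGBoost quality classifier.
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")

def sentence_vec(text):
    enc = tok(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        return bert(**enc).last_hidden_state[:, 0].squeeze(0).numpy()  # [CLS] vector

answers = ["建议规律作息并遵医嘱服药。", "多喝热水就行。", "请尽快到医院做进一步检查。", "不用管它。"]
non_text = np.array([[120, 5], [15, 0], [90, 3], [8, 0]])  # e.g. answer length, thank-you count
y = np.array([1, 0, 1, 0])                                  # 1 = high-quality answer (toy labels)

X = np.hstack([np.vstack([sentence_vec(a) for a in answers]), non_text])
X_low = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)  # dimensionality reduction
clf = XGBClassifier(n_estimators=50, max_depth=3).fit(X_low, y)
print(clf.predict(X_low))
```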

8.
马达  卢嘉蓉  朱侯 《情报科学》2023,41(2):60-68
【Purpose/Significance】To explore effective deep-learning-based emotion classification methods for Weibo text, and to study the relationship between the emotion types of users' reposted comments during trending Weibo events and the spread of private information.【Method/Process】Four deep learning classification models, BERT, BERT+CNN, BERT+RNN, and ERNIE, were compared, and the best-performing model was validated on a newly constructed seven-class emotion corpus. Four trending Weibo cases were selected for empirical analysis in terms of emotion distribution, sentiment-word frequency, reposting time, and reposting count.【Result/Conclusion】The empirical study finds that users spread private information rapidly and briefly, that negative emotions represented by "anger" and "disgust" dominate during dissemination, and that the emotion types and modes of expression differ depending on the subject of the private information.【Innovation/Limitation】The study characterizes users' emotions when spreading private information and the link between the two, providing valuable theoretical and practical grounds for protecting the privacy of social network users; however, the constructed corpus is not large enough to train a high-accuracy deep learning model, and the model performs poorly on irony and sarcasm.

9.
任妮  鲍彤  沈耕宇  郭婷 《情报科学》2021,39(11):96-102
【Purpose/Significance】Domain-oriented fine-grained named entity recognition is important for improving text mining accuracy. Taking tomato pest and disease entities as an example, this paper explores deep learning methods for domain-oriented fine-grained named entity recognition.【Method/Process】Using e-books, papers, and web pages as data sources, six entity types were annotated: variety, pest/disease, symptom, time, plant part, and control agent. BERT and CBOW pre-trained character embeddings were each fed into a BiLSTM-CRF model for training, and supplementary rules were applied after recognition to control entity boundaries.【Result/Conclusion】Combining BERT pre-trained character embeddings with BiLSTM-CRF and the supplementary boundary rules achieved an F1 of 81.03%, outperforming the other models and showing good performance for entity recognition in the tomato pest and disease domain.【Innovation/Limitation】BERT pre-trained character embeddings effectively reduce the impact of word segmentation errors on entities in this domain, and rules tailored to the characteristics of different entities effectively control entity boundaries and improve recognition accuracy. However, the supplementary rules are applied only at test time rather than during training, and overall accuracy still needs improvement.

10.
11.
With the advent of Web 2.0, many online platforms, such as social networks, online blogs, and magazines, produce massive amounts of textual data. This textual data carries information that can be used for the betterment of humanity, so there is a dire need to extract the potentially valuable information from it. This study presents an overview of approaches that can be applied to extract such information nuggets from text and present them in a brief, clear, and concise way. In this regard, two major tasks are reviewed: automatic keyword extraction and text summarization. To compile the literature, scientific articles were collected from major digital computing research repositories. In light of the acquired literature, the survey covers early approaches all the way to recent advancements based on machine learning solutions. The survey finds that annotated benchmark datasets for various textual data generators, such as Twitter and social forums, are not available, and this scarcity of datasets has resulted in relatively little progress in many domains. Applications of deep learning techniques to automatic keyword extraction are also relatively unaddressed, so the impact of various deep architectures stands as an open research direction. For text summarization, deep learning techniques have been applied since the advent of word vectors and currently govern the state of the art for abstractive summarization. One of the major remaining challenges in these tasks is the semantics-aware evaluation of generated results.

12.
【Purpose/Significance】Entity-relation extraction in the financial domain is the foundation of financial knowledge base construction and plays an important role in exploiting financial text. This paper proposes a joint entity-relation extraction model for the financial domain that adds recognition of complex overlapping relations in financial text and effectively avoids the propagation of recognition errors between tasks in traditional pipeline models.【Method/Process】A high-quality financial text corpus was built, and a new sequence labeling scheme and entity-relation matching rules were proposed. On top of the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers), Bidirectional Gated Recurrent Units (BiGRU) and a Conditional Random Field (CRF) were combined to build an end-to-end sequence labeling model that jointly extracts entities and relations.【Result/Conclusion】Experiments on financial text data show that the proposed joint extraction model achieves F1 scores of 0.627 on relation extraction and 0.543 on overlapping-relation extraction, preliminarily verifying its effectiveness for entity-relation extraction in the Chinese financial domain.【Innovation/Limitation】A new sequence labeling scheme tailored to the characteristics of financial text is proposed, and a BERT-based joint entity-relation extraction model for the financial domain is built that can recognize overlapping relations between entities in financial text.
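The paper's new labeling scheme is not given in this abstract; the snippet below illustrates the general idea of a joint tagging scheme in which each tag encodes a relation type and whether the token belongs to the head (role 1) or tail (role 2) entity, together with a simple decoder. The sentence, relation type, and tag format are assumed for illustration only.

```python
# Illustrative joint entity-relation tagging scheme and a toy decoder.
sentence = ["腾", "讯", "收", "购", "了", "某", "游", "戏", "公", "司"]
tags     = ["B-收购-1", "I-收购-1", "O", "O", "O", "B-收购-2", "I-收购-2",
            "I-收购-2", "I-收购-2", "I-收购-2"]

def decode(chars, tags):
    """Pair head (role 1) and tail (role 2) entities that share a relation type."""
    spans = {}
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            _, rel, role = tag.split("-")
            j = i + 1
            while j < len(tags) and tags[j] == f"I-{rel}-{role}":
                j += 1
            spans.setdefault(rel, {})[role] = "".join(chars[i:j])
    return [(v["1"], rel, v["2"]) for rel, v in spans.items() if "1" in v and "2" in v]

print(decode(sentence, tags))  # [('腾讯', '收购', '某游戏公司')]
```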

13.
Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative use of language. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting. In this work, we explore hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. We experiment with traditional and recent neural architectures and propose two joint-learning models, using different multilingual language representations to transfer knowledge between pairs of languages. We also evaluate the impact of additional knowledge by incorporating information from a multilingual lexicon of abusive words. The results show that our joint-learning models achieve the best performance on most languages, although a simple approach that uses machine translation and a pre-trained English language model achieves robust performance. In contrast, Multilingual BERT fails to obtain good performance in cross-lingual hate speech detection. We also found experimentally that the external knowledge from a multilingual abusive lexicon is able to improve the models' performance, specifically in detecting the positive class. The results of our experimental evaluation highlight a number of challenges and issues in this task. One of the main challenges relates to current benchmarks for hate speech detection, in particular how bias related to the topical focus of the datasets influences classification performance. The insufficient ability of current multilingual language models to transfer knowledge between languages on the specific task of hate speech detection also remains an open problem. However, our experimental evaluation and qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive language lexicon helps to alleviate this issue.

14.
The impact of crisis events can be devastating in a multitude of ways, many of which are unpredictable due to the suddenness with which they occur. The evolution of social media (for example, Twitter) has given directly affected individuals, or those with valuable information, a platform to share their stories effectively with the masses. As a result, these platforms have become vast repositories of helpful information for emergency organizations. However, different crisis events often contain event-specific keywords, which makes it difficult to extract useful information with a single model. In this paper, we put forward TASR, which stands for Topic-Agnostic Stylometric Representations, a novel deep learning architecture that uses stylometric and adversarial learning to remove topical bias and thus better manage the unknowns surrounding unseen events. As an alternative to domain-adaptive approaches that require data from the unseen event, it reduces the work for those responding to the onset of a crisis. Overall, we conduct a comprehensive study of the situational properties of TASR, the benefits of its architecture, including its topic-agnostic and explainable properties, and how it improves upon comparable models in past research. Across two experiments, TASR outperforms state-of-the-art methods such as transfer learning and domain adaptation by 11% in AUC on average. An ablation study illustrates how different architectural choices of TASR impact the results and shows that TASR has been optimized for this task. Finally, we conduct a case study to show that the explainable results from our model can be used to help guide human analysts through crisis information extraction.
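The abstract does not detail how the adversarial removal of topical bias is implemented; a common pattern for this kind of objective is a gradient reversal layer in front of a topic classifier, sketched below under that assumption. The layer sizes and the two heads are illustrative, not TASR's actual components.

```python
# Gradient-reversal pattern for learning topic-invariant (topic-agnostic) features.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # flip gradients flowing into the encoder

encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU())  # shared stylometric encoder
task_head = nn.Linear(128, 2)                            # crisis-relevance classifier
topic_head = nn.Linear(128, 10)                          # adversarial topic classifier

x = torch.randn(8, 300)
h = encoder(x)
task_logits = task_head(h)
topic_logits = topic_head(GradReverse.apply(h, 1.0))
# total loss = task loss + topic loss; reversed gradients push the encoder
# to discard topic-specific signal while keeping task-relevant style features.
```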

15.
In recent years, large-scale Pre-trained Language Models (PLMs) like BERT have achieved state-of-the-art results on many NLP tasks. We explore whether BERT understands deontic logic, which is important for the fields of legal AI and digital government. We measure BERT's understanding of deontic logic through the Deontic Modality Classification (DMC) task. Experiments show that without fine-tuning, or with fine-tuning on only a small amount of data, BERT cannot achieve good performance on the DMC task. We therefore propose a new method for BERT fine-tuning and prediction, called DeonticBERT. The method incorporates heuristic knowledge from deontic logic theory as an inductive bias into BERT, through a template function and a mapping between category labels and predicted words, to steer BERT toward understanding the DMC task. This can also stimulate BERT to recall the deontic logic knowledge learned during pre-training. We conduct experiments on a widely used English dataset as well as a Chinese dataset we constructed. Experimental results show that on the DMC task, DeonticBERT achieves 66.9% and 91% accuracy under zero-shot and few-shot conditions, respectively, far exceeding other baselines. This demonstrates that DeonticBERT does enable BERT to understand deontic logic and can handle related tasks without much fine-tuning data. Our research helps facilitate applying large-scale PLMs like BERT to legal AI and digital government.

16.
Multimodal fake news detection methods based on semantic information have achieved great success. However, these methods exploit only the deep features of multimodal information, which leads to a large loss of valid information at the shallow level. To address this problem, we propose a Multimodal Progressive Fusion Network (MPFN) for multimodal disinformation detection, which captures the representational information of each modality at different levels and achieves fusion between modalities at the same level and across different levels by means of a mixer, establishing a strong connection between the modalities. Specifically, we use a transformer structure, which is effective in computer vision tasks, as a visual feature extractor to progressively sample features at different levels, and we combine them with features obtained from a text feature extractor and with image frequency-domain information at different levels for fine-grained modeling. In addition, we design a feature fusion approach to better establish connections between modalities, which further improves performance and thus surpasses other network structures in the literature. We conducted extensive experiments on two real datasets, Weibo and Twitter; our method achieved 83.3% accuracy on the Twitter dataset, an improvement of at least 4.3% over other state-of-the-art methods. This demonstrates the effectiveness of MPFN for identifying fake news: the method reaches a relatively advanced level by combining different levels of information from each modality with a powerful modality fusion method.

17.
Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, comprising consecutive stages of language identification, tokenisation, part-of-speech tagging, named entity recognition, and entity disambiguation (e.g., with respect to DBpedia). In this work, we describe a new Twitter entity disambiguation dataset and conduct an empirical analysis of named entity recognition and disambiguation, investigating how robust a number of state-of-the-art systems are on such noisy texts, what the main sources of error are, and which problems should be further investigated to improve the state of the art.

18.
Depression is a widespread and intractable problem in modern society, which may lead to suicidal ideation and behavior. Analyzing depression or suicide risk based on posts on social media such as Twitter or Reddit has made great progress in recent years. However, most work focuses on English social media, and depression prediction is typically formalized as a binary present-or-absent decision. In this paper, we construct a human-annotated dataset for depression analysis of Chinese microblog reviews, which includes 6,100 manually annotated posts. Our dataset covers two fine-grained tasks, namely depression degree prediction and depression cause prediction. The objective of the former task is to classify a microblog post into one of 5 categories based on the depression degree, while the objective of the latter is to select one or multiple reasons that cause the depression from 7 predefined categories. To set up a benchmark, we design a neural model for joint depression degree and cause prediction and compare it with several widely used neural models such as TextCNN, BiLSTM, and BERT. Our model outperforms the baselines and achieves up to 65+% F1 for depression degree prediction and 70+% F1 and 90+% AUC for depression cause prediction, which shows that neural models achieve promising results but there is still room for improvement. Our work extends the area of social-media-based depression analysis, and our annotated data and code can also facilitate related research.
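A minimal sketch of one way to set up the joint benchmark model described above: a shared encoder with a 5-way softmax head for depression degree and a 7-way multi-label sigmoid head for depression cause. The encoder checkpoint and head layout are assumptions, not the authors' exact model.

```python
# Joint depression degree (single-label) and cause (multi-label) prediction.
import torch
import torch.nn as nn
from transformers import AutoModel

class JointDepressionModel(nn.Module):
    def __init__(self, plm="bert-base-chinese"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm)
        hidden = self.encoder.config.hidden_size
        self.degree_head = nn.Linear(hidden, 5)  # one of 5 depression degrees
        self.cause_head = nn.Linear(hidden, 7)   # one or more of 7 predefined causes

    def forward(self, input_ids, attention_mask, degree=None, causes=None):
        cls = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        degree_logits = self.degree_head(cls)
        cause_logits = self.cause_head(cls)
        loss = None
        if degree is not None and causes is not None:
            # sum of a softmax cross-entropy and a multi-label sigmoid loss
            loss = (nn.functional.cross_entropy(degree_logits, degree)
                    + nn.functional.binary_cross_entropy_with_logits(cause_logits, causes))
        return degree_logits, cause_logits, loss
```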

19.
Nowadays a large amount of knowledge is created on the Internet, and the way knowledge graphs are constructed is not uniform. Due to recent outbreaks of numerous diseases, the community has placed more importance on the healthcare system. Diabetes is a severe disease that affects people's health. To assist the health sector in combating this deadly disease, the authors developed a deep learning strategy for diabetes named entity extraction based on the fusion of multiple text features, together with relationship extraction, using text data as the object of study. The aim is a multi-feature entity recognition model that accounts for the differences in text features across different fields. First, in the word embedding layer, a multi-feature word embedding algorithm is proposed that integrates Pinyin, radicals, and the meaning of the character itself, so that the embedding vector reflects the characteristics of Chinese characters and of diabetes text. Then, in the model, a CNN and a BiLSTM are used to extract the local and global features of the text sequence, respectively, addressing the inability of traditional methods to capture dependencies across the sequence. Finally, a CRF outputs the predicted tag sequence. The experimental results show that the multi-feature embedding algorithm and the local features extracted by the CNN effectively improve the performance of the entity recognition model.
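A sketch of the multi-feature representation described above: character, Pinyin, and radical IDs are embedded separately, concatenated, and passed through a CNN for local features and a BiLSTM for global features (a CRF layer would follow in the full model). Vocabulary sizes and dimensions are illustrative assumptions, not the paper's settings.

```python
# Character + Pinyin + radical embeddings -> CNN (local) -> BiLSTM (global) encoder.
import torch
import torch.nn as nn

class MultiFeatureEncoder(nn.Module):
    def __init__(self, n_chars=5000, n_pinyin=500, n_radicals=300, dim=64):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.pinyin_emb = nn.Embedding(n_pinyin, dim)
        self.radical_emb = nn.Embedding(n_radicals, dim)
        self.cnn = nn.Conv1d(3 * dim, 128, kernel_size=3, padding=1)                 # local features
        self.bilstm = nn.LSTM(128, 64, bidirectional=True, batch_first=True)         # global features

    def forward(self, chars, pinyin, radicals):
        x = torch.cat([self.char_emb(chars), self.pinyin_emb(pinyin),
                       self.radical_emb(radicals)], dim=-1)        # (B, T, 3*dim)
        x = self.cnn(x.transpose(1, 2)).transpose(1, 2)            # (B, T, 128)
        out, _ = self.bilstm(x)                                    # (B, T, 128)
        return out  # per-character emissions, fed to a CRF in the full model

enc = MultiFeatureEncoder()
ids = torch.randint(0, 300, (2, 20))
print(enc(ids, ids, ids).shape)  # torch.Size([2, 20, 128])
```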

20.
In this work, we propose BERT-WMAL, a hybrid model that brings together information learned from data through the recent transformer deep learning model and information obtained from a polarized lexicon. The result is a sentence polarity model whose performance is comparable with the state of the art, but with the advantage of being able to provide the end-user with an explanation of the most important terms involved in the provided prediction. The model has been evaluated on three Italian polarity detection datasets, i.e., SENTIPOLC, AGRITREND, and ABSITA. The first contains 7,410 tweets released for training and 2,000 for testing, the second includes 1,000 tweets without a split, and the third includes 2,365 reviews for training and 1,171 for testing. The use of lexicon-based information proves effective in terms of the F1 measure, showing an improvement on all the observed datasets: from 0.664 to 0.669 (i.e., 0.772%) on AGRITREND, from 0.728 to 0.734 (i.e., 0.854%) on SENTIPOLC, and from 0.904 to 0.921 (i.e., 1.873%) on ABSITA. The usefulness of this model depends not only on its effectiveness in terms of the F1 measure, but also on its ability to generate predictions that are more explainable and, especially, convincing for end-users. We evaluated this aspect through a user study involving four native Italian speakers, each evaluating 64 sentences with associated explanations. The results demonstrate the validity of this approach, based on a combination of attention weights extracted from the deep learning model and the linguistic knowledge stored in the WMAL lexicon. These considerations allow us to regard the approach provided in this paper as a promising starting point for further work in this research area.

