Similar Documents
Found 20 similar documents (search time: 31 ms)
1.
Quality Evaluation of Text Clustering Algorithms   (Cited: 4 total, 0 self-citations, 4 by others)
Text clustering is an effective means of building instances of classification schemes over large-scale text collections. This paper discusses quantitative evaluation of clustering quality using standard categorized test collections, and experimentally compares the k-Means algorithm, the STC (Suffix Tree Clustering) algorithm, and an Ant-based clustering algorithm. Analysis of the experimental results shows that STC clusters well because it fully exploits the phrase properties of text; the results of the Ant-based algorithm are strongly affected by its input parameters; and introducing textual features into the Ant algorithm improves the quality of its clustering results.
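
As a rough illustration of the kind of quantitative evaluation the paper describes, the following sketch clusters a tiny labelled collection and scores the result with purity, one simple cluster-quality measure; the toy documents, scikit-learn pipeline, and choice of purity are illustrative assumptions, not the paper's setup.

    # Hypothetical sketch: cluster a labelled test collection, then score the
    # clustering against the reference categories with purity.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["stock market trading", "market prices fall",
            "football match tonight", "the team won the match"]
    gold = np.array([0, 0, 1, 1])          # reference categories of the test set

    X = TfidfVectorizer().fit_transform(docs)
    pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # purity: each cluster is credited with its majority reference category
    purity = sum(np.bincount(gold[pred == k]).max() for k in set(pred)) / len(docs)
    print(f"purity = {purity:.2f}")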

2.
Natural language inference (NLI) is an increasingly important task in natural language processing, and explainable NLI generates natural language explanations (NLEs) in addition to label predictions, making NLI explainable and acceptable. However, NLEs generated by current models often violate commonsense or lack informativeness. In this paper, we propose a knowledge-enhanced explainable NLI framework (KxNLI) that leverages a Knowledge Graph (KG) to address these problems. Subgraphs of the KG are constructed from the concept set of the input sequence. Contextual embeddings of the input and graph embeddings of the subgraphs are used to guide NLE generation through a copy mechanism. Furthermore, the generated NLEs are used to augment the original data. Experimental results show that KxNLI achieves state-of-the-art (SOTA) results on the SNLI dataset when the pretrained model is fine-tuned on the augmented data. Moreover, the proposed mechanisms for knowledge enhancement and rationale utilization perform well on a vanilla seq2seq model and transfer better to the MultiNLI dataset. To comprehensively evaluate the generated NLEs, we design two metrics that measure NLE quality from the perspectives of accuracy and informativeness, respectively. The results show that KxNLI provides high-quality NLEs while making accurate predictions.
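
A minimal sketch of the subgraph-construction step described above, using a toy knowledge graph in networkx; the graph contents, one-hop expansion, and node names are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch: build a KG subgraph from the concept set of the input.
    import networkx as nx

    kg = nx.Graph()                        # toy stand-in for a real KG
    kg.add_edges_from([("dog", "animal"), ("animal", "living_thing"),
                       ("park", "place"), ("dog", "pet")])

    premise = "a dog runs in the park"
    concepts = {w for w in premise.split() if w in kg}   # concept set of the input

    neighborhood = set(concepts)
    for c in concepts:                     # expand one hop around each concept
        neighborhood.update(kg.neighbors(c))

    subgraph = kg.subgraph(neighborhood)   # this subgraph would then be embedded
    print(sorted(subgraph.edges()))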

3.
石华  张素菊 《情报科学》2012,(7):1057-1060
Computers are involved in language translation mainly in two areas: machine-aided translation and machine translation. Machine-aided translation operates mostly at the lexical or cultural level, whereas machine translation operates at the discourse level and is thus more significant for translation. Machine translation uses software and the network as a medium to produce translations and help people overcome language barriers. However, language conversion is not merely a mapping between the vocabularies of different languages; it also involves syntactic and semantic integration at the discourse level. Building a sound discourse corpus is one way to address this problem, and advances in artificial intelligence will also improve machine translation.

4.
薛调 《现代情报》2017,37(10):72
以"清博指数"微信总榜中的17所高校图书馆为研究样本,利用统计方法和定性研究方法对头条文章从推送频率、标题字数、标题特征和标题内容4个方面进行了分析,构建了头条文章标题内容的主题模型。提出了树立头条意识、提高读者认知,确定合理的推送频率及推送方式,选择恰当的标题特征,甄选契合读者需求的标题内容四点增强高校图书馆微信公众号信息传播效果的建议。  相似文献   

5.
王丽峰 《科教文汇》2013,(22):117-117,121
Language is the carrier of culture; it is shaped by culture and reflects it. Every language corresponds to a particular culture, and its structure, communication patterns, and rhetorical principles are largely influenced or even constrained by cultural conceptions. Reading, as a key component of language skills, holds a primary place in English learning and is an important way to master language knowledge, obtain information, and improve language proficiency.

6.
Second foreign language teaching is a very important part of university foreign language instruction, but many problems remain in current practice at Chinese universities. Taking Japanese as a second foreign language as an example, this paper introduces educational psychology into foreign language teaching, focusing on how teachers should respond to individual differences among second-foreign-language learners. Starting from teachers' own teaching strategies and abilities, it further argues that besides imparting textbook and subject knowledge, teachers should also make appropriate use of their positive influence on learners.

7.
8.
Estimating the similarity between two legal case documents is an important and challenging problem, with various downstream applications such as prior-case retrieval and citation recommendation. There are two broad approaches to the task: citation network-based and text-based. Prior citation network-based approaches consider citations only to prior cases (also called precedents) (PCNet). This approach misses important signals inherent in Statutes (the written laws of a jurisdiction). In this work, we propose Hier-SPCNet, which augments PCNet with a heterogeneous network of Statutes. We incorporate domain knowledge for legal document similarity into Hier-SPCNet, thereby obtaining state-of-the-art results for network-based legal document similarity. Both textual and network similarity provide important signals for legal case similarity, but until now only trivial attempts have been made to unify the two. In this work, we apply several methods for combining textual and network information to estimate legal case similarity. We perform extensive experiments on legal case documents from the Indian judiciary, where the gold-standard similarity between document pairs is judged by law experts from two reputed law institutes in India. Our experiments establish that our proposed network-based methods significantly improve the correlation with domain experts' opinion compared to existing methods for network-based legal document similarity. Our best-performing combination method (which combines network-based and text-based similarity) improves the correlation with domain experts' opinion by 11.8% over the best text-based method and by 20.6% over the best network-based method. We also establish that our best-performing method can recommend and retrieve citable and similar cases for a source (query) case, which are well appreciated by legal experts.
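
One simple way to combine the two signals, shown here purely as an illustration and not as the paper's actual combination method, is a convex combination of text-based and network-based similarity scores whose weight could be tuned against expert judgments.

    # Hypothetical sketch: convex combination of two similarity signals.
    def combined_similarity(text_sim, network_sim, alpha=0.5):
        # alpha weights the text signal; (1 - alpha) weights the network signal
        return alpha * text_sim + (1 - alpha) * network_sim

    pairs = {("case_A", "case_B"): (0.72, 0.40),   # (text_sim, network_sim)
             ("case_A", "case_C"): (0.35, 0.90)}
    for pair, (t, n) in pairs.items():
        print(pair, round(combined_similarity(t, n, alpha=0.6), 3))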

9.
Pre-trained language models (PLMs) such as BERT have been successfully employed in two-phase ranking pipelines for information retrieval (IR). Meanwhile, recent studies have reported that BERT is vulnerable to imperceptible textual perturbations on quite a few natural language processing (NLP) tasks. For IR tasks, the established BERT re-ranker is mainly trained on large-scale and relatively clean datasets such as MS MARCO, but noisy text is far more common in real-world scenarios such as web search. In addition, the impact of within-document textual noise (perturbations) on retrieval effectiveness remains to be investigated, especially on the ranking quality of the BERT re-ranker, given its contextualized nature. To bridge this gap, we carry out exploratory experiments on the MS MARCO dataset to examine whether the BERT re-ranker can still perform well when ranking noisy text. Unfortunately, we observe non-negligible effectiveness degradation of the BERT re-ranker over ten different types of synthetic within-document textual noise. To address these effectiveness losses, we propose a novel noise-tolerant model, De-Ranker, which is learned by minimizing the distance between noisy text and its original clean version. Our evaluation on the MS MARCO and TREC 2019–2020 DL datasets demonstrates that De-Ranker handles synthetic textual noise more effectively, with 3%–4% performance improvement over the vanilla BERT re-ranker. Meanwhile, extensive zero-shot transfer experiments on 18 widely used IR datasets show that De-Ranker not only tackles natural noise in real-world text but also achieves a 1.32% average improvement in cross-domain generalization on the BEIR benchmark.
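
The core training idea, keeping the model's treatment of noisy text close to that of the clean original, can be sketched as follows; the tiny linear scorer, loss weighting, and random features are illustrative assumptions standing in for the BERT re-ranker and MS MARCO data.

    # Hypothetical sketch: ranking loss on noisy input plus a consistency term
    # that pulls the noisy-text score toward the clean-text score.
    import torch
    import torch.nn as nn

    class TinyReRanker(nn.Module):         # stand-in for the BERT re-ranker
        def __init__(self, dim=32):
            super().__init__()
            self.scorer = nn.Linear(dim, 1)

        def forward(self, x):              # x: (batch, dim) pooled text features
            return self.scorer(x).squeeze(-1)

    model = TinyReRanker()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    clean = torch.randn(8, 32)                 # pooled features of clean passages
    noisy = clean + 0.1 * torch.randn(8, 32)   # synthetically perturbed versions
    labels = torch.randint(0, 2, (8,)).float() # relevance labels

    for _ in range(10):
        rank_loss = nn.functional.binary_cross_entropy_with_logits(model(noisy), labels)
        consistency = nn.functional.mse_loss(model(noisy), model(clean).detach())
        loss = rank_loss + consistency
        opt.zero_grad(); loss.backward(); opt.step()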

10.
GPS-enabled devices and the popularity of social media have created an unprecedented opportunity for researchers to collect, explore, and analyze text data with fine-grained spatial and temporal metadata. Text, time, and space are different domains, each with its own representation scales and methods. This poses the challenge of detecting relevant patterns that may only arise from the combination of text with spatio-temporal elements. In particular, spatio-temporal textual data representation has relied on feature embedding techniques, which can limit a model's expressiveness for representing certain patterns extracted from the sequence structure of textual data. To deal with these problems, we propose an Acceptor recurrent neural network model that jointly models spatio-temporal textual data. Our goal is to represent the mutual influence and relationships between written language and the time and place where it was produced. We represent space, time, and text as tuples, and use pairs of elements to predict the third one. This yields three predictive tasks that are trained simultaneously. We conduct experiments on two social media datasets and on a crime dataset, using Mean Reciprocal Rank (MRR) as the evaluation metric. Our experiments show that our model outperforms state-of-the-art methods, with improvements ranging from 5.5% to 24.7% for location and time prediction.
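
For reference, Mean Reciprocal Rank, the evaluation metric named above, can be computed as in this small self-contained sketch (the toy queries are illustrative).

    # Mean Reciprocal Rank: average of 1 / rank of the correct item per query.
    def mean_reciprocal_rank(ranked_lists, gold):
        rr = []
        for cands, g in zip(ranked_lists, gold):
            rr.append(1.0 / (cands.index(g) + 1) if g in cands else 0.0)
        return sum(rr) / len(rr)

    # correct item ranked 2nd for query 1 and 1st for query 2: (0.5 + 1.0) / 2
    print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], ["b", "x"]))  # 0.75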

11.
12.
The presentation of search results on the web has been dominated by the textual form of document representation. By contrast, documents' visual aspects, such as layout, colour scheme, or the presence of images, have been studied only in a limited context with regard to their effectiveness in search result presentation. This article presents a comparative evaluation of textual and visual forms of document representation as additional components of document surrogates. A total of 24 people were recruited for our task-based user study. The experimental results suggest that an increased level of document representation in the search results can facilitate users' interaction with a search interface. The results also suggest that the two forms of additional representation are likely to benefit users' information-searching process in different contexts.

13.
Existing approaches to identifying answer quality in online health question answering (HQA) communities either address it subjectively through human assessment or rely mainly on textual features. This process can be time-consuming and can lose the semantic information of answers. We present an automatic approach for predicting answer quality that combines sentence-level semantics with textual and non-textual features in the context of online healthcare. First, we extend the knowledge adoption model (KAM) theory to obtain six dimensions of quality measures for textual and non-textual features. Then we apply the Bidirectional Encoder Representations from Transformers (BERT) model to extract semantic features. Next, the multi-dimensional features are reduced in dimensionality using linear discriminant analysis (LDA). Finally, we feed the preprocessed features into the proposed BK-XGBoost method to automatically predict answer quality. The proposed method is validated on a real-world dataset of 48,121 question-answer pairs crawled from the most popular online HQA communities in China. The experimental results indicate that our method outperforms the baseline models on various evaluation metrics, with up to 2.9% and 5.7% improvement in AUC compared with the BERT and XGBoost models, respectively.
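
The pipeline described above (semantic features plus KAM-style quality measures, LDA dimensionality reduction, then XGBoost) might be wired together roughly as follows; the random vectors stand in for BERT embeddings and real feature values, so everything except the library calls is an illustrative assumption.

    # Hypothetical sketch of the features -> LDA -> XGBoost pipeline
    # (assumes scikit-learn and xgboost are installed).
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    bert_feats = rng.normal(size=(200, 768))   # stand-in for BERT [CLS] vectors
    extra_feats = rng.normal(size=(200, 6))    # six KAM-style quality measures
    y = rng.integers(0, 2, size=200)           # high/low answer quality labels

    X = np.hstack([bert_feats, extra_feats])
    X_low = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

    clf = XGBClassifier(n_estimators=50, eval_metric="logloss")
    clf.fit(X_low, y)
    print(clf.predict_proba(X_low[:3]))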

14.
Automated keyphrase extraction is a fundamental textual information processing task concerned with selecting representative phrases from a document that summarize its content. This work presents a novel unsupervised method for keyphrase extraction whose main innovation is the use of local word embeddings (in particular GloVe vectors), i.e., embeddings trained on the single document under consideration. We argue that such local representations of words and keyphrases can accurately capture their semantics in the context of the document they belong to, and can therefore improve keyphrase extraction quality. Empirical results offer evidence that local representations indeed lead to better keyphrase extraction than embeddings trained on very large third-party corpora, than embeddings trained on larger corpora consisting of several documents from the same scientific field, and than other state-of-the-art unsupervised keyphrase extraction methods.
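
The local-embedding idea can be sketched as follows; gensim's Word2Vec is used here as an easily available stand-in for the locally trained GloVe vectors of the paper, and the document, candidates, and centroid scoring are illustrative assumptions.

    # Hypothetical sketch: train word vectors on the single document itself, then
    # rank candidate phrases by similarity to the document centroid.
    import numpy as np
    from gensim.models import Word2Vec

    doc_sentences = [
        "keyphrase extraction selects representative phrases from a document".split(),
        "local word embeddings capture document specific semantics".split(),
        "local embeddings improve keyphrase extraction quality".split(),
    ]
    model = Word2Vec(doc_sentences, vector_size=16, min_count=1, window=3, seed=1)

    doc_vec = np.mean([model.wv[w] for s in doc_sentences for w in s], axis=0)

    def phrase_score(phrase):
        vecs = [model.wv[w] for w in phrase.split() if w in model.wv]
        v = np.mean(vecs, axis=0)
        return float(v @ doc_vec / (np.linalg.norm(v) * np.linalg.norm(doc_vec)))

    candidates = ["keyphrase extraction", "local word embeddings", "document"]
    print(sorted(candidates, key=phrase_score, reverse=True))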

15.
A Quantitative Analysis of Research Papers on Digital Reference Services in the LISA Database over the Past Five Years   (Cited: 1 total, 0 self-citations, 1 by others)
Using the LISA database from the Cambridge Scientific Abstracts (CSA) as the literature source, this paper applies bibliometric methods combined with content analysis to statistically analyze research papers on digital reference services published between 2000 and 2004, covering publication volume, language, authors, journals, and topics, in the hope of offering some reference and insight for future research in this field.

16.
Automatic text summarization attempts to provide an effective solution to today's unprecedented growth of textual data. This paper proposes an innovative graph-based text summarization framework for generic single- and multi-document summarization. The summarizer benefits from two well-established text semantic representation techniques, Semantic Role Labelling (SRL) and Explicit Semantic Analysis (ESA), as well as the constantly evolving collective human knowledge in Wikipedia. SRL is used for sentence-level semantic parsing, and each sentence's word tokens are represented as a vector of weighted Wikipedia concepts using the ESA method. The essence of the framework is to construct a unique concept-graph representation underpinned by semantic role-based multi-node (sub-sentence level) vertices for summarization. We empirically evaluated the summarization system using the standard publicly available dataset from the Document Understanding Conference 2002 (DUC 2002). Experimental results indicate that the proposed summarizer outperforms all related state-of-the-art comparators in single-document summarization on the ROUGE-1 and ROUGE-2 measures, while ranking second in the ROUGE-1 and ROUGE-SU4 scores for multi-document summarization. The testing also demonstrates the scalability of the system: varying the evaluation data size has little impact on summarizer performance, particularly for the single-document summarization task. In a nutshell, the findings demonstrate the power of role-based and vectorial semantic representation when combined with the crowd-sourced knowledge base in Wikipedia.
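
For reference, the ROUGE-1 recall measure used in the evaluation above reduces to unigram overlap between a system summary and a reference summary, as in this small sketch.

    # ROUGE-1 recall: clipped unigram overlap divided by reference length.
    from collections import Counter

    def rouge1_recall(system, reference):
        sys_counts = Counter(system.lower().split())
        ref_counts = Counter(reference.lower().split())
        overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
        return overlap / sum(ref_counts.values())

    print(rouge1_recall("the cat sat on the mat", "the cat lay on the mat"))  # 5/6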

17.
18.
The identification of knowledge graph entity mentions in textual content has already attracted much attention. The major assumption of existing work is that entities are explicitly mentioned in text and only need to be disambiguated and linked. However, this assumption does not necessarily hold for social content, where a significant portion of information is implied. The focus of our work in this paper is to identify whether textual social content includes implicit mentions of knowledge graph entities or not, hence forming a two-class classification problem. To this end, we adopt the systemic functional linguistics framework, which allows for capturing meaning expressed through language. Based on this theoretical framework, we systematically introduce two classes of features, namely syntagmatic and paradigmatic features, for implicit entity recognition. In our experiments, we show the utility of these features for the task, report on ablation studies, measure the impact of each feature subset on the others, and provide a detailed error analysis of our technique.
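
As a generic illustration of the two-class setup (does a post implicitly mention a given entity or not), the following sketch trains a plain text classifier; the TF-IDF features are merely a stand-in for the paper's syntagmatic and paradigmatic features, and the toy posts and labels are invented.

    # Hypothetical sketch: binary classifier for implicit entity mentions.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    posts = ["new phone from cupertino looks great",  # implicit mention -> 1
             "i love fresh apples from the market",   # no mention -> 0
             "their latest keynote was impressive",   # implicit mention -> 1
             "picked fruit at the orchard today"]     # no mention -> 0
    labels = [1, 0, 1, 0]

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(posts, labels)
    print(clf.predict(["the keynote in cupertino was packed"]))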

19.
Warning: This paper contains examples of offensive language, including insulting or objectifying expressions. Various existing studies have analyzed what social biases are inherited by NLP models. These biases may directly or indirectly harm people; previous studies have therefore focused only on human attributes. However, until recently no research existed on social biases in NLP regarding nonhumans. In this paper, we analyze bias toward nonhuman animals, i.e., speciesist bias, inherent in English masked language models such as BERT. We analyzed speciesist bias against 46 animal names using template-based and corpus-extracted sentences containing speciesist (or non-speciesist) language. We found that pre-trained masked language models tend to associate harmful words with nonhuman animals and are biased toward using speciesist language for some nonhuman animal names. Our code for reproducing the experiments will be made available on GitHub.
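
Template-based probing of a masked language model, as described above, can be done with the Hugging Face fill-mask pipeline; the template, animal names, and model choice below are illustrative, not the paper's exact protocol (assumes transformers is installed).

    # Hypothetical sketch: probe a masked LM with an animal-name template.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")
    for animal in ["dog", "pig", "chicken"]:
        preds = fill(f"The {animal} is [MASK].", top_k=3)
        print(animal, [p["token_str"] for p in preds])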

20.
From the perspective of knowledge units, this article proposes a knowledge-unit mining method based on the knowledge structure of patent documents. Combining natural language processing techniques such as maximum string matching, stopword removal, and part-of-speech tagging preprocessing, together with position weights for knowledge units, it implements knowledge-unit-based mining of Chinese patents in software. Comparative experiments show that this is an effective method for analyzing the technical details of patent documents.
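
Forward maximum matching, the string-matching segmentation technique mentioned above, can be sketched as follows; the toy vocabulary is illustrative, and the paper's actual matcher and position weighting are not reproduced here.

    # Hypothetical sketch: forward maximum matching over a term vocabulary.
    def forward_max_match(text, vocab, max_len=5):
        tokens, i = [], 0
        while i < len(text):
            for j in range(min(max_len, len(text) - i), 0, -1):  # longest first
                if text[i:i + j] in vocab or j == 1:
                    tokens.append(text[i:i + j])
                    i += j
                    break
        return tokens

    vocab = {"知识", "单元", "知识单元", "挖掘", "专利"}
    print(forward_max_match("知识单元挖掘专利", vocab))  # ['知识单元', '挖掘', '专利']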
