Similar literature
 20 similar documents found (search time: 31 ms)
1.
Automatic document classification can be used to organize documents in a digital library, construct on-line directories, improve the precision of web searching, or help the interactions between users and search engines. In this paper we explore how linkage information inherent to different document collections can be used to enhance the effectiveness of classification algorithms. We have experimented with three link-based bibliometric measures (co-citation, bibliographic coupling and Amsler) on three different document collections: a digital library of computer science papers, a web directory and an on-line encyclopedia. Results show that both hyperlink and citation information can be used to learn reliable and effective classifiers based on a kNN classifier. In one of the test collections used, we obtained improvements of up to 69.8% in macro-averaged F1 over the traditional text-based kNN classifier, considered as the baseline measure in our experiments. We also present alternative ways of combining bibliometric-based classifiers with text-based classifiers. Finally, we conducted studies to analyze the situations in which the bibliometric-based classifiers failed and show that in such cases it is hard to reach consensus regarding the correct classes, even for human judges.
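The three link-based measures named in this abstract can be illustrated with a minimal sketch. It assumes a Jaccard-style normalization (shared links divided by all links of the two documents), which is a common formulation but is not confirmed by the abstract itself; the function names and the toy edge list are hypothetical.

```python
from collections import defaultdict


def build_link_sets(edges):
    """Turn (citing, cited) pairs into parent and child sets per document."""
    parents = defaultdict(set)   # parents[d]  = documents that cite d
    children = defaultdict(set)  # children[d] = documents that d cites
    for citing, cited in edges:
        children[citing].add(cited)
        parents[cited].add(citing)
    return parents, children


def jaccard(a, b):
    """Shared elements over all elements; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cocitation(parents, d1, d2):
    """Similarity from shared citing documents (assumed Jaccard normalization)."""
    return jaccard(parents.get(d1, set()), parents.get(d2, set()))


def bibliographic_coupling(children, d1, d2):
    """Similarity from shared cited documents (assumed Jaccard normalization)."""
    return jaccard(children.get(d1, set()), children.get(d2, set()))


def amsler(parents, children, d1, d2):
    """Similarity over the union of citing and cited documents."""
    n1 = parents.get(d1, set()) | children.get(d1, set())
    n2 = parents.get(d2, set()) | children.get(d2, set())
    return jaccard(n1, n2)


# Hypothetical toy citation graph: (citing, cited)
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("c", "y"), ("x", "z"), ("y", "z")]
parents, children = build_link_sets(edges)
```

In a kNN setting, any of these scores could stand in for the text similarity between a test document and the training documents.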

2.
In this paper, we evaluate a number of machine learning techniques for the task of ranking answers to why-questions. We use TF-IDF together with a set of 36 linguistically motivated features that characterize questions and answers. We experiment with a number of machine learning techniques (including several classifiers and regression techniques, Ranking SVM and SVMmap) in various settings. The purpose of the experiments is to assess how the different machine learning approaches can cope with our highly imbalanced binary relevance data, with and without hyperparameter tuning. We find that with all machine learning techniques, we can obtain an MRR score that is significantly above the TF-IDF baseline of 0.25 and not significantly lower than the best score of 0.35. We provide an in-depth analysis of the effect of data imbalance and hyperparameter tuning, and we relate our findings to previous research on learning to rank for Information Retrieval.
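For readers unfamiliar with the MRR scores quoted above, the following is a minimal sketch of mean reciprocal rank over binary relevance labels. The function names and the input layout (one list of 0/1 labels per question, in ranked order) are assumptions made for illustration, not the authors' evaluation code.

```python
def reciprocal_rank(ranked_labels):
    """Return 1/rank of the first relevant answer, or 0.0 if none is relevant."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings):
    """Average the reciprocal rank over all questions."""
    rankings = list(rankings)
    if not rankings:
        return 0.0
    return sum(reciprocal_rank(r) for r in rankings) / len(rankings)


# Three hypothetical questions: first relevant answer at rank 2, at rank 1, and never.
example = [[0, 1, 0], [1, 0], [0, 0, 0]]
score = mean_reciprocal_rank(example)
```

Questions with no relevant answer in the ranked list contribute zero, which is one reason heavily imbalanced relevance data keeps MRR low.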

3.
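Automatically extracting semantic information from Chinese patent texts using graph representations   Total citations: 1 (self-citations: 0; citations by others: 1)
Jiang Chuntao (姜春涛). Library and Information Service (《图书情报工作》), 2015, 59(21): 115-122
[Purpose/Significance] This paper proposes using graph-structured representations to automatically mine semantic information from Chinese patent texts, in order to provide semantic support for content-based intelligent patent analysis. [Method/Process] Two graph-based models are designed: (1) a keyword-based text graph model and (2) a text graph model based on dependency trees. The first is defined by computing similarity relations between keywords; the second is defined by the grammatical relations extracted from sentences. In a case study, a frequent subgraph mining algorithm is applied to the constructed graph models, and text classifiers that use the mined subgraphs as features are built to test the expressiveness and effectiveness of the models. [Results/Conclusions] The graph-based text classifiers were applied to patent text datasets from four different technical fields and compared with classic text classifiers: the former improved classification performance by 2.1%-10.5% over the latter while using markedly fewer features. This suggests that the semantic information extracted from patent texts with graph representations combined with graph mining techniques is effective and can support further patent text analysis.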

4.
In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single-label and/or multi-label multiclass classification; state-of-the-art learning technology such as SVMs and kernel-based methods is used. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form “for document d_i, category c′ is preferred to category c′′”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.

5.
This paper presents a method for comparing the subject headings of Scopus and WoS classifiers that has been tested based on examples in the field of mathematical disciplines. Semantic relationships of subject headings are explored by using intelligent analysis of keyword and expression clustering. The results are presented in the form of a correspondence table for the subject headings of the classifiers.

6.
Efficient algorithms for ranking with SVMs   Total citations: 1 (self-citations: 0; citations by others: 1)
RankSVM (Herbrich et al. in Advances in large margin classifiers. MIT Press, Cambridge, MA, 2000; Joachims in Proceedings of the ACM conference on knowledge discovery and data mining (KDD), 2002) is a pairwise method for designing ranking models. SVMLight is the only publicly available software for RankSVM. It is slow and, due to incomplete training with it, previous evaluations show RankSVM to have inferior ranking performance. We propose new methods based on the primal Newton method to speed up RankSVM training and show that they are 5 orders of magnitude faster than SVMLight. Evaluation on the LETOR benchmark datasets after complete training using such methods shows that the performance of RankSVM is excellent.
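The pairwise idea behind RankSVM can be sketched as follows: documents retrieved for the same query are turned into difference vectors, and a linear model is trained so that the more relevant document scores higher by a margin. This sketch deliberately uses plain subgradient descent on a regularized hinge loss rather than the primal Newton method the abstract proposes; all names, hyperparameters and the toy data are hypothetical.

```python
def pairwise_differences(features, relevance, query_ids):
    """Build x_i - x_j for every same-query pair where i is more relevant than j."""
    diffs = []
    n = len(features)
    for i in range(n):
        for j in range(n):
            if query_ids[i] == query_ids[j] and relevance[i] > relevance[j]:
                diffs.append([a - b for a, b in zip(features[i], features[j])])
    return diffs


def train_pairwise_hinge(diffs, dim, epochs=100, lr=0.1, reg=0.01):
    """Subgradient descent on reg/2 * |w|^2 + max(0, 1 - w.d) per difference d.

    Stand-in optimizer for illustration only; not the primal Newton method.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            for k in range(dim):
                # The hinge term contributes -d[k] only while the margin is violated.
                grad = reg * w[k] - (d[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w


def score(w, x):
    """Linear ranking score of a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical data: two queries, feature 0 signals relevance, feature 1 does not.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
relevance = [1, 0, 1, 0]
query_ids = [1, 1, 2, 2]
diffs = pairwise_differences(features, relevance, query_ids)
w = train_pairwise_hinge(diffs, dim=2)
```

The number of difference vectors grows quadratically with the number of documents per query, which is the cost that faster training methods aim to reduce.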

7.
In attempting to move questionnaire design from art to science, researchers use different evaluation techniques to help determine how well questions are working. Techniques such as behavior coding, respondent debriefing, interviewer debriefing, cognitive interviewing, and nonresponse analysis all provide information to help the questionnaire designer assess whether respondents understand questions as intended and whether they are able to provide adequate answers to them. However, these techniques do not actually measure question reliability. It is assumed that questions that pass the screen of the questionnaire evaluation techniques described above are also more likely to produce data that are reliable and valid. In this paper, we use behavior coding data to predict test-retest reliability. Respondent behavior codes significantly predict such reliability, whereas interviewer codes, at least in this survey, do not. We also report the results of sensitivity testing to determine what percentage of adequate respondent answers best predicts test-retest reliability.

8.
Moral Foundations Theory (MFT) and the Model of Intuitive Morality and Exemplars (MIME) contend that moral judgments are built on a universal set of basic moral intuitions. A large body of research has supported many of MFT’s and the MIME’s central hypotheses. Yet an important prerequisite of this research, the ability to extract latent moral content represented in media stimuli with a reliable procedure, has not been systematically studied. In this article, we subject different extraction procedures to rigorous tests, underscore challenges by identifying a range of reliabilities, develop new reliability tests and coding procedures employing computational methods, and provide solutions that maximize the reliability and validity of moral intuition extraction. In six content analytical studies, including a large crowd-based study, we demonstrate that: (1) traditional content analytical approaches lead to rather low reliabilities; (2) variation in coding reliabilities can be predicted by both text features and characteristics of the human coders; and (3) reliability is largely unaffected by the detail of coder training. We show that a coding task with simplified training and a coding technique that treats moral foundations as fast, spontaneous intuitions lead to acceptable inter-rater agreement, and potentially to more valid moral intuition extractions. While this study was motivated by issues related to MFT and MIME research, the methods and findings in this study have implications for extracting latent content from text narratives that go beyond moral information. Accordingly, we provide a tool for researchers interested in applying this new approach in their own work.

9.
Duplicate content on the Web occurs within the same website or across multiple websites. The latter is mainly associated with the existence of website replicas, that is, sites that are perceptibly similar. Replication may be accidental, intentional or malicious, but no matter the reason, search engines suffer greatly either from unnecessarily storing and moving duplicate data, or from providing search results that do not offer real value to the users. In this paper, we model the detection of website replicas as a pairwise classification problem with distant supervision. That is, (heuristically) finding obvious replica and non-replica cases is trivial, but learning effective classifiers requires a representative set of non-obvious labeled examples, which are hard to obtain. We employ efficient Expectation-Maximization (EM) algorithms in order to find non-obvious examples from obvious ones, enlarging the training set and improving the classifiers iteratively. Our classifiers employ association rules and are thus incrementally updated as the EM process iterates, making our algorithms time-efficient. Experiments show that: (1) replicas are fully eliminated at a false-positive rate lower than 0.005, yielding a 19% reduction in the number of duplicate URLs; (2) the reduction increases to 21% when our site-level algorithms are used in conjunction with existing URL-level algorithms; and (3) our classifiers are more than two orders of magnitude faster than semi-supervised alternative solutions.

10.
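Research on a text classification method based on rough-set weighting   Total citations: 6 (self-citations: 0; citations by others: 6)
Automatic text classification is an important research topic in intelligent information processing. This paper analyzes the basic characteristics of text classification based on statistical theory and proposes a new formula for computing feature-term weights, built on the quality of classification in the variable precision rough set model. Compared with the widely used inverse document frequency weighting, the new weighting method greatly improves the distribution of text samples across the feature space, reducing intra-class distances and increasing inter-class distances, which in theory should improve the separability of the samples. Finally, experiments with two classifiers, support vector machines and k-nearest neighbors, confirm that the new weighting method does improve classification performance.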

11.
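[Purpose/Significance] In the era of big data, building an effective method for evaluating the ecology of public opinion on social networks from objective data is important for the governance and healthy development of the online ecosystem. [Method/Process] Grounded in information ecology theory, this paper builds an algorithm for evaluating the ecological state of social network public opinion using natural language processing techniques such as machine learning, sensitive-content detection and keyword extraction. For data processing, an AdaBoost-based ensemble learning approach is adopted: it exploits the complementarity among classifiers constructed with different methods and feature sets and, by effectively aggregating multiple statistics-based and rule-based sentiment analyzers, yields a sentiment analysis model that supports the evaluation index system. At the practical level, several representative regions in the northeast, along the coast and in the west are selected, and the proposed algorithm is used to evaluate and analyze their regional ecological state. [Results/Conclusions] The evaluation method offers important guidance for governments, websites and internet users working together to clean up social network spaces, and provides an important theoretical and practical basis for constructing topic maps of social network public opinion and for research on regulation strategies.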

12.
This paper describes a method for constructing a network of classifiers that forms a multidimensional representation of the ontology of scientific and technical information, as well as illustrating a method for investigating the effectiveness of automated determination of semantic relationships between classification headings.

13.
We present a novel approach to re-ranking a document list that was retrieved in response to a query so as to improve precision at the very top ranks. The approach is based on utilizing a second list that was retrieved in response to the query by using, for example, a different retrieval method and/or query representation. In contrast to commonly used methods for fusion of retrieved lists that rely solely on retrieval scores (ranks) of documents, our approach also exploits inter-document similarities between the lists, a potentially rich source of additional information. Empirical evaluation shows that our methods are effective in re-ranking TREC runs; the resultant performance also compares favorably with that of a highly effective fusion method. Furthermore, we show that our methods can potentially help to tackle a long-standing challenge, namely, integration of document-based and cluster-based retrieved results.

14.
15.
The retrieval of sentences that are relevant to a given information need is a challenging passage retrieval task. In this context, the well-known vocabulary mismatch problem arises severely because of the fine granularity of the task. Short queries, which are usually the rule rather than the exception, aggravate the problem. Consequently, effective sentence retrieval methods tend to apply some form of query expansion, usually based on pseudo-relevance feedback. Nevertheless, there are no extensive studies comparing different statistical expansion strategies for sentence retrieval. In this work we study thoroughly the effect of distinct statistical expansion methods on sentence retrieval. We start from a set of retrieved documents in which relevant sentences have to be found. In our experiments different term selection strategies are evaluated and we provide empirical evidence to show that expansion before sentence retrieval yields competitive performance. This is particularly novel because expansion for sentence retrieval is often done after sentence retrieval (i.e. expansion terms are mined from a ranked set of sentences) and there are no comparative results available between both types of expansion. Furthermore, this comparison is particularly valuable because there are important implications in time efficiency. We also carefully analyze expansion on weak and strong queries and demonstrate clearly that expanding queries before sentence retrieval is not only more convenient for efficiency purposes, but also more effective when handling poor queries.

16.
This paper presents data regarding the publication of Chinese English-language journals (CELJs), building on previously published information to investigate the status, growth, and international penetration of these journals. The article also presents three case studies of CELJs to demonstrate different strategies for achieving internationalization. We find that there has been rapid growth in CELJs between 2006 and 2011, but mostly in the science, technology and medicine disciplines. There are now 435 CELJs, of which 62.3% are published in association with a western publisher. Partnership has been shown to provide immediate benefits to an established successful journal (Cell Research), has helped to relaunch an established title in English (Bamboo and Silk), and has enabled the successful launch of a new journal (Global Health Research and Policy). The authors conclude that there are three criteria for successful international CELJs: increased visibility, good editorial boards, and international publishing partnerships.

17.
An Evaluation of Statistical Approaches to Text Categorization   Total citations: 122 (self-citations: 4; citations by others: 118)
This paper focuses on a comparative evaluation of a wide range of text categorization methods, including previously published results on the Reuters corpus and new results of additional experiments. A controlled study using three classifiers, kNN, LLSF and WORD, was conducted to examine the impact of configuration variations in five versions of Reuters on the observed performance of classifiers. Analysis and empirical evidence suggest that the evaluation results on some versions of Reuters were significantly affected by the inclusion of a large portion of unlabelled documents, making those results difficult to interpret and leading to considerable confusion in the literature. Using the results evaluated on the other versions of Reuters, which exclude the unlabelled documents, the performance of twelve methods is compared directly or indirectly. For indirect comparisons, kNN, LLSF and WORD were used as baselines, since they were evaluated on all versions of Reuters that exclude the unlabelled documents. As a global observation, kNN, LLSF and a neural network method had the best performance; except for a Naive Bayes approach, the other learning algorithms also performed relatively well.
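As a reminder of how the kNN baseline mentioned above works in text categorization, here is a minimal sketch. It assumes raw term-frequency vectors, cosine similarity and a similarity-weighted vote among the k nearest training documents; the paper's exact term weighting and thresholding are not given in the abstract, and the toy training set and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term-frequency vector (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(train, text, k=3):
    """Label a text by a similarity-weighted vote of its k nearest training documents."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), label) for doc, label in train),
        reverse=True,
    )
    votes = defaultdict(float)
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return max(votes, key=votes.get)


# Hypothetical miniature training set with two categories.
train = [
    ("wheat corn grain harvest", "grain"),
    ("corn wheat export", "grain"),
    ("oil crude barrel price", "crude"),
    ("crude oil opec", "crude"),
]
```

Because every training document takes part in the neighbor search, unlabelled documents left in the collection can distort the vote, which is consistent with the concern the abstract raises about some Reuters versions.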

18.
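Qiao Jianzhong (乔建忠). Library and Information Service (《图书情报工作》), 2013, 57(14): 114-120
To address the limited generalization ability of a single classification algorithm in topical crawling when faced with multi-topic Web crawling and classification needs, this paper designs a combination of classifiers formed from several strong classification algorithms. The topical crawler evaluates and ranks the classifiers online according to the current topic task and selects the best one for classification. Classification experiments were carried out on multiple topical crawling tasks, comparing the accuracy of each classification algorithm with the average accuracy after combination and jointly analyzing evaluation measures such as classification efficiency. The results show that the strategy mitigates domain locality to some extent and is broadly applicable.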

19.
Although a growing body of literature points to the particular media diet of populist voters, we know too little about what specific media preferences characterize citizens with populist attitudes. This article investigates to what extent citizens with antiestablishment and exclusionist populist attitudes are attracted to attitudinal-congruent media content. We collected survey data using a nationally representative sample (N = 809) and found that citizens’ preferences for media content are in sync with their populist attitudes. Beyond having a tabloidized and entertainment-based media diet, populist voters self-select media content that actively articulates the divide between the “innocent” people and “culprit” others. These findings provide new insights into the appeal of different types of media populism among citizens with populist attitudes on different dimensions.

20.
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier which allows for a local Markov dependence among observations, a model we refer to as the Chain Augmented Naive Bayes (CAN) classifier. CAN models have two advantages over standard naive Bayes classifiers. First, they relax some of the independence assumptions of naive Bayes (allowing a local Markov chain dependence in the observed variables) while still permitting efficient inference and learning. Second, they permit straightforward application of sophisticated smoothing techniques from statistical language modeling, which allows one to obtain better parameter estimates than the standard Laplace smoothing used in naive Bayes classification. In this paper, we introduce CAN models and apply them to various text classification problems. To demonstrate the language-independent and task-independent nature of these classifiers, we present experimental results on several text classification problems (authorship attribution, text genre classification, and topic detection) in several languages (Greek, English, Japanese and Chinese). We then systematically study the key factors in the CAN model that can influence the classification performance, and analyze the strengths and weaknesses of the model.
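The chain-augmented idea can be sketched as one bigram language model per class, with a document assigned to the class whose model and prior give it the highest log-probability. This sketch assumes word-level bigrams and add-one (Laplace) smoothing purely for brevity, whereas the abstract argues for more sophisticated language-model smoothing; the class name, method names and toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


class ChainAugmentedNB:
    """Per-class bigram language models combined with class priors."""

    def __init__(self):
        self.bigrams = defaultdict(Counter)   # class -> counts of (prev, word)
        self.contexts = defaultdict(Counter)  # class -> counts of prev as a context
        self.class_counts = Counter()         # class -> number of training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = ["<s>"] + text.lower().split()
            self.class_counts[label] += 1
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[label][(prev, word)] += 1
                self.contexts[label][prev] += 1
        return self

    def log_score(self, text, label):
        """log P(class) + sum of log P(word | previous word, class)."""
        tokens = ["<s>"] + text.lower().split()
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[label][(prev, word)] + 1  # add-one smoothing
            denominator = self.contexts[label][prev] + vocab_size
            score += math.log(numerator / denominator)
        return score

    def predict(self, text):
        return max(self.class_counts, key=lambda c: self.log_score(text, c))


# Hypothetical toy corpus with two classes.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock prices rose sharply today",
    "stock markets fell sharply today",
]
labels = ["pets", "pets", "finance", "finance"]
model = ChainAugmentedNB().fit(docs, labels)
```

Replacing the bigram probabilities with unigram ones recovers a standard naive Bayes classifier, which makes the relaxed independence assumption easy to see.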
