Similar Literature
20 similar documents found (search time: 78 ms)
1.
Textual entailment is a task for which the application of supervised learning mechanisms has received considerable attention, driven by the successive Recognizing Textual Entailment (RTE) challenges. We developed a linguistic analysis framework in which a number of similarity/dissimilarity features are extracted for each entailment pair in a data set, and various classifier methods are evaluated on the instance data derived from the extracted features. The focus of the paper is to compare and contrast the performance of single and ensemble-based learning algorithms on a number of data sets. We show that there is some benefit to ensemble approaches but, on the extracted features, Naïve Bayes proved to be the strongest learning mechanism; only one ensemble approach demonstrated a slight improvement over it.
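A minimal sketch of the kind of pipeline this abstract describes: hand-crafted similarity features per entailment pair, fed to a Naïve Bayes learner. The two overlap features, the toy pairs, and the Gaussian NB implementation are illustrative assumptions, not the paper's actual feature set.

```python
import numpy as np

def overlap_features(text, hypothesis):
    """Toy similarity features for an entailment pair (illustrative only)."""
    t, h = set(text.lower().split()), set(hypothesis.lower().split())
    inter = len(t & h)
    return np.array([inter / len(h), inter / len(t | h)])  # coverage, Jaccard

class GaussianNB:
    """Minimal Gaussian Naive Bayes (not the paper's implementation)."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.prior = np.array([(y == c).mean() for c in self.classes])
        return self
    def predict(self, X):
        # log p(c) + sum over features of log N(x_f; mu_cf, var_cf)
        ll = (-0.5 * ((X[:, None, :] - self.mu) ** 2 / self.var)
              - 0.5 * np.log(2 * np.pi * self.var)).sum(axis=2)
        return self.classes[np.argmax(ll + np.log(self.prior), axis=1)]

pairs = [("a cat sat on the mat", "a cat sat", 1),
         ("a cat sat on the mat", "dogs fly south", 0),
         ("birds can fly", "birds fly", 1),
         ("birds can fly", "cats swim daily", 0)]
X = np.array([overlap_features(t, h) for t, h, _ in pairs])
y = np.array([lab for _, _, lab in pairs])
model = GaussianNB().fit(X, y)
print(model.predict(X).tolist())
```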

2.
Because of the rapid increase of data in the Amazon Web Services (AWS) cloud, traditional methods of analysis are no longer adequate, so data scientists have proposed unconventional concurrent/parallel techniques to meet the performance and scalability requirements of such big data analyses. In this paper we use the Hadoop MapReduce system, which comprises the Hadoop Distributed File System (HDFS) and a Hadoop cluster, and optimize it by combining it with five efficient Data Mining (DM) algorithms, namely Support Vector Machine (SVM), Decision Tree (DT), Random Forest (RF), Correlative Naïve Bayes classifier (CNB), and Fuzzy CNB (FCNB), for strong analytics of cloud big data. The proposed system is applied to product review data taken from the AWS cloud. Hadoop MapReduce is evaluated with standard benchmarks, namely Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and runtime for word count, sort, and inverted index. The DM models running on Hadoop MapReduce are evaluated using accuracy, sensitivity, specificity, memory, and running time. Experiments show that FCNB is effective in addressing the big data problem.
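To make the word-count benchmark mentioned above concrete, here is a single-process toy version of the three MapReduce phases (map, shuffle, reduce); Hadoop distributes exactly these phases across a cluster. The review lines are invented sample data.

```python
from collections import defaultdict
from itertools import chain

# Toy stand-in for the Hadoop word-count benchmark: the mapper emits
# (word, 1) pairs, the shuffle groups pairs by key, and the reducer sums.
def mapper(line):
    return [(w, 1) for w in line.lower().split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reducer(groups):
    return {k: sum(vs) for k, vs in groups.items()}

lines = ["good product fast shipping", "good price good product"]
counts = reducer(shuffle(chain.from_iterable(mapper(l) for l in lines)))
print(counts["good"], counts["product"])
```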

3.
We study several machine learning algorithms for cross-language patent retrieval and classification. In contrast with most other studies involving machine learning for cross-language information retrieval, which basically use learning techniques for monolingual sub-tasks, our learning algorithms exploit the bilingual training documents and learn a semantic representation from them. We study Japanese–English cross-language patent retrieval using Kernel Canonical Correlation Analysis (KCCA), a method for finding correlated linear relationships between two sets of variables in kernel-defined feature spaces. The results are quite encouraging and are significantly better than those obtained by other state-of-the-art methods. We also investigate learning algorithms for cross-language document classification, based on KCCA and Support Vector Machines (SVM). In particular, we study two ways of combining KCCA and SVM and find that one particular combination, called SVM_2k, achieves better results than the other learning algorithms on both bilingual and monolingual test documents.
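The kernel-free core of KCCA is ordinary CCA: find directions in the two views whose projections are maximally correlated. A sketch under assumed toy data, where the second "language" view is an exact linear re-encoding of the first, so the top canonical correlation is close to 1 (the whitening-plus-SVD formulation below is one standard way to compute CCA, not necessarily the paper's):

```python
import numpy as np

# Linear CCA sketch: maximize corr(X @ wx, Y @ wy).
# Canonical correlations are the singular values of the whitened
# cross-covariance matrix.
def cca_top(X, Y, reg=1e-6):
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X) + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))   # whitening for view X
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))   # whitening for view Y
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    return s[0], Wx.T @ U[:, 0], Wy.T @ Vt[0]     # top corr and directions

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = X @ rng.normal(size=(3, 3))   # second "language": linear view of X
rho, wx, wy = cca_top(X, Y)
print(round(rho, 3))
```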

4.
Because of the large volume of marketing data, a human analyst cannot uncover the useful information that would aid marketing decision making. Smart Data Mining (SDM), an important field within Artificial Intelligence (AI), greatly assists business management analytics and the handling of marketing information. In this study, six of the most reliable SDM algorithms, namely Naïve Bayes (NB), Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), ID3, and C4.5, are applied to actual bank marketing data taken from a Cloud Internet of Things (CIoT) platform. The objectives of this study are to build an efficient framework that improves banks' marketing campaigns by identifying the main characteristics that affect success, and to test the performance of the CIoT and of the SDM algorithms. The study is expected to enhance scientific contributions to investigating marketing information capacities by integrating SDM with CIoT. The performance of the SDM algorithms is measured by eight metrics: accuracy, balanced accuracy, precision, mean absolute error, root mean absolute error, recall, F1-score, and running time. The experimental findings show that the proposed framework is successful, with high accuracy and good performance. The results reveal that customer service and marketing tactics are essential for a company's success and survival, and that C4.5 achieves better results than SVM, RF, LR, NB, and ID3. Finally, the CIoT platform is evaluated by response time, request rate, and bank-data processing.

5.
Research on an SVM-Based Classification Model of Stock Investment Value
Following the value-investing philosophy, this paper builds a support vector machine (SVM) based classification model of stock investment value. We first randomly sample 500 A-share stocks and construct the sample feature set from financial indicators that significantly influence investment value; we then build the SVM classification model, and finally compare it with BP and RBF neural networks. The results show that the SVM achieves the best classification performance and generalization ability.
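A minimal sketch of the classifier family the abstract uses: a linear soft-margin SVM trained by batch sub-gradient descent on toy 2-D data. The 500-stock sample and its financial indicators are not reproduced; the data, hyper-parameters, and training loop below are illustrative assumptions.

```python
import numpy as np

# Linear soft-margin SVM via batch sub-gradient descent on the
# regularized hinge loss: lam/2 * |w|^2 + mean(max(0, 1 - y(w.x + b))).
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # margin violators drive the hinge gradient
        gw = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
        gb = -y[viol].sum() / len(X)
        w, b = w - lr * gw, b - lr * gb
    return w, b

X = np.array([[2.0, 2], [3, 3], [-2, -2], [-3, -1]])
y = np.array([1.0, 1, -1, -1])             # +1: has investment value (toy labels)
w, b = train_linear_svm(X, y)
print((np.sign(X @ w + b) == y).all())
```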

6.
The support vector machine is a machine learning method grounded in statistical learning theory that performs especially well on small samples; it is now widely used in pattern recognition, function regression, fault diagnosis, and other areas. This paper focuses on the SVM classification problem and discusses the following. It first introduces the SVM classifier algorithm and applies it to data classification on UCI data sets, achieving high accuracy. Simulation results show that the algorithm converges quickly and computes with high precision.

7.
This paper presents a binary classification of entrepreneurs in British historical data based on the recent availability of big data from the I-CeM dataset. The main task of the paper is to attribute an employment status to individuals who did not fully report entrepreneur status in the earlier censuses (1851–1881). The paper assesses the accuracy of different classifiers and machine learning algorithms, including Deep Learning, for this classification problem. We first adopt a ground-truth dataset from the later censuses to train the computer with a Logistic Regression (standard in the literature for this kind of binary classification) to recognize entrepreneurs as distinct from non-entrepreneurs (i.e. workers). Our initial accuracy for this baseline method is 0.74. We compare the Logistic Regression with ten optimized machine learning algorithms: Nearest Neighbors, Linear and Radial Support Vector Machine, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis. The best results come from boosting and ensemble methods: AdaBoost achieves an accuracy of 0.95. Deep Learning, as a standalone category of algorithms, further improves accuracy to 0.96 without using the rich text data of the OccString feature, a string of up to 500 characters with the full occupational statement of each individual collected in the earlier censuses. Finally, using this OccString feature, we implement both shallow learning (a bag-of-words algorithm) and Deep Learning (a Recurrent Neural Network with a Long Short-Term Memory layer). These methods all achieve accuracies above 0.99, with the Deep Learning Recurrent Neural Network as the best model at 0.9978. The results show that standard classification algorithms can be outperformed by machine learning algorithms, confirming the value of extending the techniques traditionally used in the literature for this type of classification problem.

8.
9.
Cognitive impairments such as memory disorders and depressive disorders can lead to fatal consequences if such health hazards are not given proper attention. Their impact extends to the socioeconomic status of developed and low- or middle-income countries through the loss of talented and skilled people; countries additionally bear the financial burden of extra health budget allotment. This paper presents a novel strategy for early detection of cognitive deficiency, aiming to eliminate the economic repercussions caused by memory disorder and depressive disorders. In this work, electroencephalography (EEG) and a word-learning neuropsychological test, the California Verbal Learning Task (CVLT), are used conjunctively for memory assessment. The EEG features and CVLT scores are modeled with several machine learning techniques, namely K-Nearest Neighbor (KNN), Gaussian Naive Bayes (GNB), Decision Tree (DT), Random Forest (RF), and Support Vector Machine (SVM). The experimental results show better classification accuracy than existing schemes that consider EEG for estimating cognitive heuristics. More specifically, SVM attains the highest accuracy, 81.56%, among all the machine learning algorithms, which can assist in the early detection of cognitive impairments. The proposed strategy can be helpful in the clinical diagnosis of psychological health and in improving quality of life as a whole.

10.
Research on the Application of Least Squares Support Vector Machines in Data Mining
蔡冬松  靖继鹏 《情报科学》2005,23(12):1877-1880
With the development of data warehousing and online analytical processing technology, database-based data mining has become an important means of data processing. The least squares support vector machine (LS-SVM) is a new machine learning method with global convergence and good generalization ability. This paper applies it to classification and prediction in data mining. Through kernel function selection and parameter optimization, and through a comparative study against the standard support vector machine, a multilayer perceptron neural network, and discriminant analysis, we show that the LS-SVM is an effective data mining algorithm with high accuracy.
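Part of the LS-SVM's appeal is that training reduces to solving a single linear system instead of a quadratic program: with kernel matrix K, targets y, and regularization γ, one solves [[0, 1ᵀ], [1, K + I/γ]] [b; α] = [0; y]. A sketch on toy data (the RBF kernel, γ, and the data are illustrative assumptions):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    """RBF kernel matrix between row sets A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def lssvm_fit(X, y, gam=10.0):
    """LS-SVM training: one (n+1)x(n+1) linear system."""
    n = len(X)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X) + np.eye(n) / gam
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]                  # bias b, dual weights alpha

def lssvm_predict(X, Xtr, b, alpha):
    return np.sign(rbf(X, Xtr) @ alpha + b)

Xtr = np.array([[0.0, 0], [0, 1], [4, 4], [4, 5]])
ytr = np.array([-1.0, -1, 1, 1])
b, alpha = lssvm_fit(Xtr, ytr)
print(lssvm_predict(Xtr, Xtr, b, alpha).tolist())
```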

11.
Multi-class support vector machines are important in many application areas (e.g. OCR and face recognition). Widely used multi-class SVM schemes include one-versus-one, one-versus-rest, and DAG. Many experiments show that one-versus-one usually attains the highest classification accuracy, but the long test time of the traditional one-versus-one scheme limits its use in large-scale recognition tasks. This paper proposes an improved one-versus-one multi-class SVM: a coarse classification stage quickly selects candidate classes, and the original one-versus-one voting is then applied only among the candidates. Experimental results show that the method not only improves classification efficiency but also, to some extent, improves classification accuracy.
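The coarse-then-vote idea can be sketched as follows: shortlist the k classes whose centroids are nearest to the sample, then run one-versus-one voting only among the shortlist. In this sketch the pairwise decisions are simulated with a nearest-centroid rule rather than trained binary SVMs, and the centroids are invented.

```python
import numpy as np

def predict(x, centroids, k=2):
    labels = list(centroids)
    # coarse stage: keep only the k nearest candidate classes
    d = {c: np.linalg.norm(x - centroids[c]) for c in labels}
    cand = sorted(labels, key=d.get)[:k]
    # fine stage: one-vs-one voting restricted to the candidates
    votes = {c: 0 for c in cand}
    for i, a in enumerate(cand):
        for b in cand[i + 1:]:
            winner = a if d[a] <= d[b] else b   # stand-in for a binary SVM
            votes[winner] += 1
    return max(votes, key=votes.get)

centroids = {"A": np.array([0.0, 0]), "B": np.array([5.0, 0]),
             "C": np.array([0.0, 5]), "D": np.array([5.0, 5])}
print(predict(np.array([4.6, 0.3]), centroids))  # prints B
```

With k classes reduced to a short candidate list, the number of pairwise evaluations at test time drops from k(k-1)/2 to a small constant, which is the efficiency gain the abstract describes.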

12.
The growth of the Internet has gradually changed the way people live, and e-mail has won favor for its convenience and speed. At the same time, a great deal of spam spreads across the network, occupying large amounts of mail-server storage and forcing users to spend considerable time deleting it, so research on automatic mail filtering is of real importance. Automatic mail filtering is done in two main ways: rule-based and statistics-based. Among current statistical filters, the commonly used Bayesian methods are built on empirical risk minimization and generalize poorly. The support vector machine (SVM) is a newer pattern recognition method developed from statistical learning theory; it shows particular advantages on problems with limited samples, nonlinearity, and high dimensionality. It not only accounts for generalization ability but also pursues the optimal result under limited information. This paper therefore applies the SVM to mail filtering, and experiments confirm that it filters well.
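A compact sketch of such a filter's pipeline: bag-of-words vectors plus a linear classifier. A perceptron stands in for the SVM here to keep the sketch short; its decision function sign(w·x + b) has the same form as a linear SVM's. The four training mails and the two test mails are invented.

```python
import numpy as np

def vectorize(doc, vocab):
    """Binary bag-of-words vector over a fixed vocabulary."""
    words = doc.lower().split()
    return np.array([float(w in words) for w in vocab])

train = [("win cash prize now", 1), ("meeting agenda attached", -1),
         ("claim your free prize", 1), ("lunch at noon tomorrow", -1)]
vocab = sorted({w for d, _ in train for w in d.split()})
X = np.array([vectorize(d, vocab) for d, _ in train])
y = np.array([lab for _, lab in train])

w, b = np.zeros(len(vocab)), 0.0
for _ in range(10):                       # perceptron epochs
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:        # misclassified -> update
            w += yi * xi
            b += yi

spam = np.sign(vectorize("free cash prize", vocab) @ w + b)
ham = np.sign(vectorize("agenda for the meeting", vocab) @ w + b)
print(spam, ham)
```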

13.
This paper proposes a novel hierarchical learning strategy to deal with the data sparseness problem in semantic relation extraction by modeling the commonality among related classes. For each class in the hierarchy, whether manually predefined or automatically clustered, a discriminative function is determined in a top-down way. Because an upper-level class normally has many more positive training examples than a lower-level class, its discriminative function can be determined more reliably and can guide the learning of the lower-level discriminative function, which might otherwise suffer from limited training data. Two classifier learning approaches, the simple perceptron algorithm and state-of-the-art Support Vector Machines, are applied using the hierarchical learning strategy. Moreover, several kinds of class hierarchies, both manually predefined and automatically clustered, are explored and compared. Evaluation on the ACE RDC 2003 and 2004 corpora shows that the hierarchical learning strategy significantly improves performance on the least frequent and medium-frequent relations.
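The top-down guidance can be sketched with perceptrons: train the upper-level class on its plentiful examples, then start the sparse lower-level class from the parent's weights instead of zeros, so the child needs few or no updates. The data and class structure below are illustrative assumptions, not the ACE RDC setup.

```python
import numpy as np

def perceptron(X, y, w0, epochs=20):
    """Plain perceptron starting from the given initial weights."""
    w = w0.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w) <= 0:
                w += yi * xi
    return w

# Upper-level relation: plentiful training examples.
Xu = np.array([[1.0, 1], [2, 1], [1, 2], [-1, -1], [-2, -1], [-1, -2]])
yu = np.array([1, 1, 1, -1, -1, -1])
wp = perceptron(Xu, yu, np.zeros(2))

# Lower-level relation: only two examples; initialize from the parent,
# which already separates them, so no updates are needed.
Xl = np.array([[1.0, 0.5], [-1.0, -0.5]])
yl = np.array([1, -1])
wc = perceptron(Xl, yl, wp)
print(wc.tolist())
```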

14.
To classify red-tide organisms automatically, in real time, and with high accuracy, this paper proposes ReliefF-SBS for feature selection: the original feature set of red-tide organism images is analyzed, and on that basis irrelevant and redundant features are removed from the original feature set to obtain an optimal feature subset, reducing their impact on classifier accuracy. Experimental results and analysis are given, and the effect on the classification performance of k-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers is verified.
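To illustrate the first half of that pipeline, here is a simplified binary-class Relief weighting (the multi-class ReliefF refinements and the SBS wrapper stage are omitted): a feature's weight grows when it separates a sample from its nearest miss and shrinks when it differs from the nearest hit. The toy data is invented; feature 0 is informative and feature 1 is noise.

```python
import numpy as np

def relief(X, y):
    """Simplified Relief feature weights for a binary-class data set."""
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        dists = np.abs(X - X[i]).sum(axis=1)   # L1 distances to all samples
        dists[i] = np.inf                       # exclude the sample itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dists, np.inf))
        miss = np.argmin(np.where(diff, dists, np.inf))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

X = np.array([[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
w = relief(X, y)
print(w[0] > w[1])
```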

15.
We present a Life-Long Learning from Mistakes (3LM) algorithm for document classification, which could be used in various scenarios such as spam filtering, blog classification, and web resource categorization. We extend the ideas of online clustering and batch-mode centroid-based classification to online learning with negative feedback. The 3LM is a competitive learning algorithm, which avoids over-smoothing, characteristic of the centroid-based classifiers, by using a different class representative, which we call clusterhead. The clusterheads competing for vector-space dominance are drawn toward misclassified documents, eventually bringing the model to a “balanced state” for a fixed distribution of documents. Subsequently, the clusterheads oscillate between the misclassified documents, heuristically minimizing the rate of misclassifications, an NP-complete problem. Further, the 3LM algorithm prevents over-fitting by “leashing” the clusterheads to their respective centroids. A clusterhead provably converges if its class can be separated by a hyper-plane from all other classes. Lifelong learning with fixed learning rate allows 3LM to adapt to possibly changing distribution of the data and continually learn and unlearn document classes. We report on our experiments, which demonstrate high accuracy of document classification on Reuters21578, OHSUMED, and TREC07p-spam datasets. The 3LM algorithm did not show over-fitting, while consistently outperforming centroid-based, Naïve Bayes, C4.5, AdaBoost, kNN, and SVM whose accuracy had been reported on the same three corpora.
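The two forces on a clusterhead can be sketched in a single update step: a pull toward a misclassified document, then a "leash" that caps how far the clusterhead may drift from its class centroid. The learning rate, leash length, and vectors are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def update_clusterhead(head, centroid, doc, lr=0.5, leash=2.0):
    """One 3LM-style step: pull toward the misclassified doc, then leash."""
    head = head + lr * (doc - head)           # drawn toward the document
    off = head - centroid
    dist = np.linalg.norm(off)
    if dist > leash:                          # cap the drift from the centroid
        head = centroid + off * (leash / dist)
    return head

centroid = np.array([0.0, 0.0])
head = np.array([0.0, 0.0])
doc = np.array([10.0, 0.0])                   # a misclassified document
head = update_clusterhead(head, centroid, doc)
print(head)
```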

16.
To address the problems of existing methods for evaluating national innovation capability, this paper proposes a nonlinear evaluation method that accounts for the time lag between science and technology innovation inputs and outputs. Evaluation indicators are carefully selected, and an assessment model based on the time-lag effect and the least squares support vector machine is constructed. An empirical study of the G20 member states is conducted, and comparing the evaluation results with those of related research reports confirms the accuracy and feasibility of the method; the innovation capability of the G20 members is then evaluated and analyzed. The proposed method helps reflect a country's level of innovation development truthfully and accurately, supporting decision-makers in formulating and implementing innovation strategies.

17.
Research on Credit Risk Assessment in Supply Chain Finance Based on Support Vector Machines
胡海青  张琅  张道宏  陈亮 《软科学》2011,25(5):26-30,36
This paper studies credit risk assessment under the supply chain finance model and proposes an assessment indicator system that jointly considers the credit standing of the core enterprise and the state of supply chain relationships. A credit risk assessment model is built with the support vector machine (SVM), a machine learning method. Empirical comparison with assessment models built using principal component analysis and Logistic regression confirms that the SVM-based credit risk assessment system is more effective and superior.

18.
In recent years, most content-based spam filters have been implemented using Machine Learning (ML) approaches by means of token-based representations of textual contents. After the introduction of multiple performance enhancements, their impact has been marginal. Recent studies have introduced synset-based content representations as a reliable way to improve classification, as well as different ways of exploiting semantic information to address problems such as dimensionality reduction.

These preliminary solutions have some limitations and enforce simplifications that must be gradually redefined in order to obtain significant improvements in spam content filtering. This study addresses the problem of feature reduction by introducing a new semantic-based proposal (SDRS) that avoids losing knowledge (lossless). Synset features can be semantically grouped by taking advantage of the taxonomic relations (mainly hypernyms) provided by the BabelNet ontological dictionary (e.g. “Viagra” and “Cialis” can be summarized into the single feature “anti-impotence drug”, “drug” or “chemical substance”, depending on whether generalization uses 1, 2 or 3 levels).

To decide how many levels should be used to generalize each synset of a dataset, our proposal takes advantage of Multi-Objective Evolutionary Algorithms (MOEA), and in particular of the Non-dominated Sorting Genetic Algorithm (NSGA-II). We compared the performance achieved by a Naïve Bayes classifier using both token-based and synset-based dataset representations, with and without dimensionality reduction. As a result, our lossless semantic reduction strategy was able to find optimal semantic-based feature grouping strategies for the input texts, leading to better performance of the Naïve Bayes classifiers.
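The generalization-level mechanism from the Viagra/Cialis example can be sketched with a tiny hand-written taxonomy (an illustrative assumption, not BabelNet): each feature is replaced by its hypernym k steps up, and NSGA-II would search over the per-synset levels.

```python
# Minimal hypernym taxonomy; keys point to their one-step hypernym.
TAXONOMY = {"viagra": "anti-impotence_drug", "cialis": "anti-impotence_drug",
            "anti-impotence_drug": "drug", "drug": "chemical_substance"}

def generalize(feature, level):
    """Walk `level` hypernym steps up the taxonomy (stops at the root)."""
    for _ in range(level):
        feature = TAXONOMY.get(feature, feature)
    return feature

merged = {generalize(f, 1) for f in ["viagra", "cialis"]}
print(merged)   # both features collapse into one at level 1
```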

19.
Breast cancer is one of the leading causes of death among women worldwide. Accurate and early detection of breast cancer can ensure long-term survival for patients. However, traditional classification algorithms usually aim only to maximize classification accuracy, failing to take into consideration the misclassification costs between different categories. Furthermore, the cost of missing a cancer case (a false negative) is clearly much higher than that of mislabeling a benign one (a false positive). To overcome this drawback and further improve the classification accuracy of breast cancer diagnosis, this work proposes a novel intelligent breast cancer diagnosis approach that employs an information-gain-directed simulated annealing genetic algorithm wrapper (IGSAGAW) for feature selection. In this process, features are ranked according to the information gain (IG) algorithm, and the top m optimal features are extracted and used by a cost-sensitive support vector machine (CSSVM) learning algorithm. Our feature selection approach not only helps reduce the complexity of the SAGASW algorithm and effectively extract an optimal feature subset, but also achieves the maximum classification accuracy and minimum misclassification cost. The efficacy of the proposed approach is tested on the Wisconsin Original Breast Cancer (WBC) and Wisconsin Diagnostic Breast Cancer (WDBC) data sets, and the results demonstrate that the proposed hybrid algorithm outperforms the comparison methods. The main objective of this study was to apply our research in real clinical diagnostic systems and thereby assist clinical physicians in making correct and effective decisions in the future. The proposed method could also be applied to the diagnosis of other illnesses.
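The asymmetric-cost idea can be made concrete with a small scoring function: a false negative (missed cancer) is charged more than a false positive. The 10:1 cost ratio and the toy predictions are assumptions for illustration, not the paper's setting; note that the cost metric can prefer a model with lower plain accuracy.

```python
def misclassification_cost(y_true, y_pred, c_fn=10.0, c_fp=1.0):
    """Total cost with asymmetric penalties (1 = malignant, 0 = benign)."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += c_fn                 # missed malignant case
        elif t == 0 and p == 1:
            cost += c_fp                 # benign flagged as malignant
    return cost

y_true = [1, 1, 0, 0, 1]
model_a = [1, 0, 0, 0, 1]                # one false negative, higher accuracy
model_b = [1, 1, 1, 1, 1]                # two false positives, lower accuracy
print(misclassification_cost(y_true, model_a),
      misclassification_cost(y_true, model_b))
```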

20.
梁明江  庄宇 《软科学》2012,26(4):114-117
Using listed Chinese manufacturing companies as sample data, an ensemble learning method with support vector machines as base classifiers is used to predict corporate financial distress. Experimental analysis shows that the ensemble improves prediction accuracy by four percentage points over a single base classifier and is more stable, effectively raising the model's prediction precision and making it more accurate and applicable. The SVM-based ensemble learning method is thus effective for building a financial distress early-warning model for listed Chinese manufacturing companies and achieves a useful early-warning effect.
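The ensemble's prediction step can be sketched as majority voting over base classifiers. Plain decision stumps stand in for the trained SVM base learners to keep the sketch self-contained, and the "financial ratios" are invented.

```python
def stump(feature, threshold):
    """Decision stump: +1 (distressed) if the feature exceeds the threshold."""
    return lambda x: 1 if x[feature] > threshold else -1

# Three base classifiers (stand-ins for trained SVMs on different features).
base = [stump(0, 0.5), stump(1, 0.4), stump(2, 0.6)]

def ensemble_predict(x):
    """Majority vote of the base classifiers."""
    votes = sum(clf(x) for clf in base)
    return 1 if votes > 0 else -1

firm = [0.7, 0.3, 0.9]                     # three invented financial ratios
print(ensemble_predict(firm))
```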


Copyright © 北京勤云科技发展有限公司  京ICP备09084417号