首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 640 毫秒
1.
A large number of extractive summarization techniques have been developed in the past decade, but very few enquiries have been made as to how these differ from each other or what are the factors that actually affect these systems. Such meaningful comparison if available can be used to create a robust ensemble of these approaches, which has the possibility to consistently outperform each individual summarization system. In this work we examine the roles of three principle components of an extractive summarization technique: sentence ranking algorithm, sentence similarity metric and text representation scheme. We show that using a combination of several different sentence similarity measures, rather than only one, significantly improves performance of the resultant meta-system. Even simple ensemble techniques, when used in an informed manner, prove to be very effective in improving the overall performance and consistency of summarization systems. A statistically significant improvement of about 5% to 10% in ROUGE-1 recall was achieved by aggregating various sentence similarity measures. As opposed to this aggregation of several ranking algorithms did not show a significant improvement in ROUGE score, but even in this case the resultant meta-systems were more robust than candidate systems. The results suggest that new extractive summarization techniques should particularly focus on defining a better sentence similarity metric and use multiple sentence similarity scores and ranking algorithms in favour of a particular combination.  相似文献   

2.
Imbalanced sample distribution is usually the main reason for the performance degradation of machine learning algorithms. Based on this, this study proposes a hybrid framework (RGAN-EL) combining generative adversarial networks and ensemble learning method to improve the classification performance of imbalanced data. Firstly, we propose a training sample selection strategy based on roulette wheel selection method to make GAN pay more attention to the class overlapping area when fitting the sample distribution. Secondly, we design two kinds of generator training loss, and propose a noise sample filtering method to improve the quality of generated samples. Then, minority class samples are oversampled using the improved RGAN to obtain a balanced training sample set. Finally, combined with the ensemble learning strategy, the final training and prediction are carried out. We conducted experiments on 41 real imbalanced data sets using two evaluation indexes: F1-score and AUC. Specifically, we compare RGAN-EL with six typical ensemble learning; RGAN is compared with three typical GAN models. The experimental results show that RGAN-EL is significantly better than the other six ensemble learning methods, and RGAN is greatly improved compared with three classical GAN models.  相似文献   

3.
Students use general web search engines as their primary source of research while trying to find answers to school-related questions. Although search engines are highly relevant for the general population, they may return results that are out of educational context. Another rising trend; social community question answering websites are the second choice for students who try to get answers from other peers online. We attempt discovering possible improvements in educational search by leveraging both of these information sources. For this purpose, we first implement a classifier for educational questions. This classifier is built by an ensemble method that employs several regular learning algorithms and retrieval based approaches that utilize external resources. We also build a query expander to facilitate classification. We further improve the classification using search engine results and obtain 83.5% accuracy. Although our work is entirely based on the Turkish language, the features could easily be mapped to other languages as well. In order to find out whether search engine ranking can be improved in the education domain using the classification model, we collect and label a set of query results retrieved from a general web search engine. We propose five ad-hoc methods to improve search ranking based on the idea that the query-document category relation is an indicator of relevance. We evaluate these methods for overall performance, varying query length and based on factoid and non-factoid queries. We show that some of the methods significantly improve the rankings in the education domain.  相似文献   

4.
Automatic text summarization has been an active field of research for many years. Several approaches have been proposed, ranging from simple position and word-frequency methods, to learning and graph based algorithms. The advent of human-generated knowledge bases like Wikipedia offer a further possibility in text summarization – they can be used to understand the input text in terms of salient concepts from the knowledge base. In this paper, we study a novel approach that leverages Wikipedia in conjunction with graph-based ranking. Our approach is to first construct a bipartite sentence–concept graph, and then rank the input sentences using iterative updates on this graph. We consider several models for the bipartite graph, and derive convergence properties under each model. Then, we take up personalized and query-focused summarization, where the sentence ranks additionally depend on user interests and queries, respectively. Finally, we present a Wikipedia-based multi-document summarization algorithm. An important feature of the proposed algorithms is that they enable real-time incremental summarization – users can first view an initial summary, and then request additional content if interested. We evaluate the performance of our proposed summarizer using the ROUGE metric, and the results show that leveraging Wikipedia can significantly improve summary quality. We also present results from a user study, which suggests that using incremental summarization can help in better understanding news articles.  相似文献   

5.
针对图书、期刊论文等数字文献文本特征较少而导致特征向量语义表达不够准确、分类效果差的问题,本文提出一种基于特征语义扩展的数字文献分类方法。该方法首先利用TF-IDF方法获取对数字文献文本表示能力较强、具有较高TF-IDF值的核心特征词;其次分别借助知网(Hownet)语义词典以及开放知识库维基百科(Wikipedia)对核心特征词集进行语义概念的扩展,以构建维度较低、语义丰富的概念向量空间;最后采用MaxEnt、SVM等多种算法构造分类器实现对数字文献的自动分类。实验结果表明:相比传统基于特征选择的短文本分类方法,该方法能有效地实现对短文本特征的语义扩展,提高数字文献分类的分类性能。  相似文献   

6.
Weighted consensus multi-document summarization   总被引:1,自引:0,他引:1  
Multi-document summarization is a fundamental tool for document understanding and has received much attention recently. Given a collection of documents, a variety of summarization methods based on different strategies have been proposed to extract the most important sentences from the original documents. However, very few studies have been reported on aggregating different summarization methods to possibly generate better summary results. In this paper, we propose a weighted consensus summarization method to combine the results from single summarization systems. We evaluate and compare our proposed weighted consensus method with various baseline combination methods. Experimental results on DUC2002 and DUC2004 data sets demonstrate the performance improvement by aggregating multiple summarization systems, and our proposed weighted consensus summarization method outperforms other combination methods.  相似文献   

7.
In this paper, we define and present a comprehensive classification of user intent for Web searching. The classification consists of three hierarchical levels of informational, navigational, and transactional intent. After deriving attributes of each, we then developed a software application that automatically classified queries using a Web search engine log of over a million and a half queries submitted by several hundred thousand users. Our findings show that more than 80% of Web queries are informational in nature, with about 10% each being navigational and transactional. In order to validate the accuracy of our algorithm, we manually coded 400 queries and compared the results from this manual classification to the results determined by the automated method. This comparison showed that the automatic classification has an accuracy of 74%. Of the remaining 25% of the queries, the user intent is vague or multi-faceted, pointing to the need for probabilistic classification. We discuss how search engines can use knowledge of user intent to provide more targeted and relevant results in Web searching.  相似文献   

8.
Opinion summarization can facilitate user’s decision-making by mining the salient review information. However, due to the lack of sufficient annotated data, most of the early works are based on extractive methods, which restricts the performance of opinion summarization. In this work, we aim to improve the informativeness of opinion summarization to provide better guidance to users. We consider the setting with only reviews without corresponding summaries, and propose an aspect-augmented model for unsupervised abstractive opinion summarization, denoted as AsU-OSum. We first employ an aspect-based sentiment analysis system to extract opinion phrases from reviews. Then, we construct a heterogeneous graph consisting of reviews and opinion clusters as nodes, which is used to enhance the Transformer-based encoder–decoder framework. Furthermore, we design a novel cascaded attention mechanism to prompt the decoder to pay more attention to the aspects that are more likely to appear in summary. During training, we introduce a sentiment accuracy reward that further enhances the learning ability of our model. We conduct comprehensive experiments on the Yelp, Amazon, and Rotten Tomatoes datasets. Automatic evaluation results show that our model is competitive and performs better than the state-of-the-art (SOTA) models on some ROUGE metrics. Human evaluation results further verify that our model can generate more informative summaries and reduce redundancy.  相似文献   

9.
This paper examines several different approaches to exploiting structural information in semi-structured document categorization. The methods under consideration are designed for categorization of documents consisting of a collection of fields, or arbitrary tree-structured documents that can be adequately modeled with such a flat structure. The approaches range from trivial modifications of text modeling to more elaborate schemes, specifically tailored to structured documents. We combine these methods with three different text classification algorithms and evaluate their performance on four standard datasets containing different types of semi-structured documents. The best results were obtained with stacking, an approach in which predictions based on different structural components are combined by a meta classifier. A further improvement of this method is achieved by including the flat text model in the final prediction.  相似文献   

10.
Textual entailment is a task for which the application of supervised learning mechanisms has received considerable attention as driven by successive Recognizing Data Entailment data challenges. We developed a linguistic analysis framework in which a number of similarity/dissimilarity features are extracted for each entailment pair in a data set and various classifier methods are evaluated based on the instance data derived from the extracted features. The focus of the paper is to compare and contrast the performance of single and ensemble based learning algorithms for a number of data sets. We showed that there is some benefit to the use of ensemble approaches but, based on the extracted features, Naïve Bayes proved to be the strongest learning mechanism. Only one ensemble approach demonstrated a slight improvement over the technique of Naïve Bayes.  相似文献   

11.
The purpose of extractive speech summarization is to automatically select a number of indicative sentences or paragraphs (or audio segments) from the original spoken document according to a target summarization ratio and then concatenate them to form a concise summary. Much work on extractive summarization has been initiated for developing machine-learning approaches that usually cast important sentence selection as a two-class classification problem and have been applied with some success to a number of speech summarization tasks. However, the imbalanced-data problem sometimes results in a trained speech summarizer with unsatisfactory performance. Furthermore, training the summarizer by improving the associated classification accuracy does not always lead to better summarization evaluation performance. In view of such phenomena, we present in this paper an empirical investigation of the merits of two schools of training criteria to alleviate the negative effects caused by the aforementioned problems, as well as to boost the summarization performance. One is to learn the classification capability of a summarizer on the basis of the pair-wise ordering information of sentences in a training document according to a degree of importance. The other is to train the summarizer by directly maximizing the associated evaluation score or optimizing an objective that is linked to the ultimate evaluation. Experimental results on the broadcast news summarization task suggest that these training criteria can give substantial improvements over a few existing summarization methods.  相似文献   

12.
Financial decisions are often based on classification models which are used to assign a set of observations into predefined groups. Different data classification models were developed to foresee the financial crisis of an organization using their historical data. One important step towards the development of accurate financial crisis prediction (FCP) model involves the selection of appropriate variables (features) which are relevant for the problems at hand. This is termed as feature selection problem which helps to improve the classification performance. This paper proposes an Ant Colony Optimization (ACO) based financial crisis prediction (FCP) model which incorporates two phases: ACO based feature selection (ACO-FS) algorithm and ACO based data classification (ACO-DC) algorithm. The proposed ACO-FCP model is validated using a set of five benchmark dataset includes both qualitative and quantitative. For feature selection design, the developed ACO-FS method is compared with three existing feature selection algorithms namely genetic algorithm (GA), Particle Swarm Optimization (PSO) algorithm and Grey Wolf Optimization (GWO) algorithm. In addition, a comparison of classification results is also made between ACO-DC and state of art methods. Experimental analysis shows that the ACO-FCP ensemble model is superior and more robust than its counterparts. In consequence, this study strongly recommends that the proposed ACO-FCP model is highly competitive than traditional and other artificial intelligence techniques.  相似文献   

13.
提出基于集成学习的项目绩效预测方法,利用多分类集成监督学习算法,对网络爬虫得到的已结题项目数据中隐含的关于项目绩效的信息进行有效挖掘,形成项目绩效预测模型.基于国家自然科学基金项目数据,利用多种指标对模型的性能进行评估,将模型对项目的绩效预测结果与专家的评估结果进行比较,结果显示模型的有效性.  相似文献   

14.
Performance of text classification models tends to drop over time due to changes in data, which limits the lifetime of a pretrained model. Therefore an ability to predict a model’s ability to persist over time can help design models that can be effectively used over a longer period of time. In this paper, we provide a thorough discussion into the problem, establish an evaluation setup for the task. We look at this problem from a practical perspective by assessing the ability of a wide range of language models and classification algorithms to persist over time, as well as how dataset characteristics can help predict the temporal stability of different models. We perform longitudinal classification experiments on three datasets spanning between 6 and 19 years, and involving diverse tasks and types of data. By splitting the longitudinal datasets into years, we perform a comprehensive set of experiments by training and testing across data that are different numbers of years apart from each other, both in the past and in the future. This enables a gradual investigation into the impact of the temporal gap between training and test sets on the classification performance, as well as measuring the extent of the persistence over time. Through experimenting with a range of language models and algorithms, we observe a consistent trend of performance drop over time, which however differs significantly across datasets; indeed, datasets whose domain is more closed and language is more stable, such as with book reviews, exhibit a less pronounced performance drop than open-domain social media datasets where language varies significantly more. We find that one can estimate how a model will retain its performance over time based on (i) how well the model performs over a restricted time period and its extrapolation to a longer time period, and (ii) the linguistic characteristics of the dataset, such as the familiarity score between subsets from different years. Findings from these experiments have important implications for the design of text classification models with the aim of preserving performance over time.  相似文献   

15.
Recently, the Transformer model architecture and the pre-trained Transformer-based language models have shown impressive performance when used in solving both natural language understanding and text generation tasks. Nevertheless, there is little research done on using these models for text generation in Arabic. This research aims at leveraging and comparing the performance of different model architectures, including RNN-based and Transformer-based ones, and different pre-trained language models, including mBERT, AraBERT, AraGPT2, and AraT5 for Arabic abstractive summarization. We first built an Arabic summarization dataset of 84,764 high-quality text-summary pairs. To use mBERT and AraBERT in the context of text summarization, we employed a BERT2BERT-based encoder-decoder model where we initialized both the encoder and decoder with the respective model weights. The proposed models have been tested using ROUGE metrics and manual human evaluation. We also compared their performance on out-of-domain data. Our pre-trained Transformer-based models give a large improvement in performance with ~79% less data. We found that AraT5 scores ~3 ROUGE higher than a BERT2BERT-based model that is initialized with AraBERT, indicating that an encoder-decoder pre-trained Transformer is more suitable for summarizing Arabic text. Also, both of these two models perform better than AraGPT2 by a clear margin, which we found to produce summaries with high readability but with relatively lesser quality. On the other hand, we found that both AraT5 and AraGPT2 are better at summarizing out-of-domain text. We released our models and dataset publicly1,.2  相似文献   

16.
This paper presents a binary classification of entrepreneurs in British historical data based on the recent availability of big data from the I-CeM dataset. The main task of the paper is to attribute an employment status to individuals that did not fully report entrepreneur status in earlier censuses (1851–1881). The paper assesses the accuracy of different classifiers and machine learning algorithms, including Deep Learning, for this classification problem. We first adopt a ground-truth dataset from the later censuses to train the computer with a Logistic Regression (which is standard in the literature for this kind of binary classification) to recognize entrepreneurs distinct from non-entrepreneurs (i.e. workers). Our initial accuracy for this base-line method is 0.74. We compare the Logistic Regression with ten optimized machine learning algorithms: Nearest Neighbors, Linear and Radial Support Vector Machine, Gaussian Process, Decision Tree, Random Forest, Neural Network, AdaBoost, Naive Bayes, and Quadratic Discriminant Analysis. The best results are boosting and ensemble methods. AdaBoost achieves an accuracy of 0.95. Deep-Learning, as a standalone category of algorithms, further improves accuracy to 0.96 without using the rich text-data that characterizes the OccString feature, a string of up to 500 characters with the full occupational statement of each individual collected in the earlier censuses. Finally, and now using this OccString feature, we implement both shallow (bag-of-words algorithm) learning and Deep Learning (Recurrent Neural Network with a Long Short-Term Memory layer) algorithms. These methods all achieve accuracies above 0.99 with Deep Learning Recurrent Neural Network as the best model with an accuracy of 0.9978. The results show that standard algorithms for classification can be outperformed by machine learning algorithms. This confirms the value of extending the techniques traditionally used in the literature for this type of classification problem.  相似文献   

17.
建立在小波分析基础上的综合脉冲星时算法,能把脉冲星的观测计时残差在小波域分解,提取出不同频率范围的分量,然后用小波方差表征脉冲星在不同频率范围的稳定度来对单脉冲星时进行加权平均,得到综合脉冲星时;脉冲星的计时残差包括了计时参考的原子钟的误差和与脉冲星本身有关的计时误差两部分,用维纳滤波的方法可以将两者进行一定区分,并消除掉估计的参考钟误差,将剩余部分作为计时残差实现对脉冲星计时的综合。实验证明,小波分析和维纳滤波方法比经典的加权算法更好,得到的综合脉冲星时的长期稳定度有了较大提高。  相似文献   

18.
This paper presents a classifier for text data samples consisting of main text and additional components, such as Web pages and technical papers. We focus on multiclass and single-labeled text classification problems and design the classifier based on a hybrid composed of probabilistic generative and discriminative approaches. Our formulation considers individual component generative models and constructs the classifier by combining these trained models based on the maximum entropy principle. We use naive Bayes models as the component generative models for the main text and additional components such as titles, links, and authors, so that we can apply our formulation to document and Web page classification problems. Our experimental results for four test collections confirmed that our hybrid approach effectively combined main text and additional components and thus improved classification performance.  相似文献   

19.
20.
作为一种工业产权的外观专利在产业振兴中起着重要作用。外观专利采用的洛迦诺分类标准不支持图像分类,而AI算法直接在洛迦诺分类标准图像数据集并不能有效提升分类精度。因此,对外观专利审查员来说,外观专利的分类检索具有相当大的挑战。为此,本文提出先领域、再功能、后视觉的四级外观专利图像分类新标准,在此分类标准基础上构建了Patent-MNIST外观专利标准测试集。利用典型的AI算法来测试验证,结果表明,VGG19等经典算法在Patent-MNIST数据集中的分类准确度比直接使用洛迦诺分类标准数据集有显著的性能提升。同时,在本文的分类标准数据集的基础上,还需要进一步地针对外观专利来改进算法。  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号