Similar Documents
20 similar documents found; search time: 519 ms
1.
With the rapid evolution of the mobile environment, the demand for natural language applications on mobile devices is increasing. This paper proposes an automatic word spacing system, the first-step module of natural language processing (NLP) for many languages with their own word spacing rules, designed for mobile devices with limited hardware resources. The proposed system uses two stages. In the first stage, it preliminarily corrects word spacing errors using a modified hidden Markov model based on character unigrams. In the second stage, it re-corrects the miscorrected word spaces using lexical rules based on character bigrams or longer combinations. This hybrid method improves robustness against unknown word patterns, reduces memory usage, and increases accuracy. To evaluate the system in a realistic mobile environment, we constructed a mobile-style colloquial corpus using a simple simulation method. In experiments on a commercial mobile phone, the proposed system showed good performance across the various evaluation measures (a response time of 0.20 s per sentence, memory usage of 2.04 MB, and an accuracy of 92–95%).
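The first stage described above is essentially sequence labeling with an HMM. As an illustration only (not the paper's implementation), the following is a minimal Viterbi decoder over two hypothetical tag states, `B` (a space precedes this character) and `I` (word-internal); all probabilities are made-up toy values:

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for a discrete-emission HMM (log domain)."""
    # V[t][s]: best log-probability of any path ending in state s at step t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s].get(obs[0], 1e-9))
          for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s].get(obs[t], 1e-9)))
            back[t][s] = prev
    # Backtrack from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

A real word-spacing model would estimate the transition and emission tables from a corpus of correctly spaced text; the decoder itself stays the same.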

2.
An automatic method for correcting spelling and typing errors in teletypewriter keyboard input is proposed. The computerized correction process is presented as a heuristic tree search. The correct spellings are stored character by character in a pseudo-binary tree. The search examines a small subset of the database (selected branches of the tree) while checking for insertion, substitution, deletion, and transposition errors. The correction procedure exploits the inherent redundancy of natural language. Multiple errors can be handled if at least two correct characters appear between errors. Test results indicate that this approach achieves the highest error correction accuracy to date.
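The four error types checked by the tree search (insertion, substitution, deletion, transposition) are exactly those counted by the Damerau-Levenshtein distance. As a sketch of that distance measure only, not of the paper's tree-search algorithm:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, substitutions, and
    adjacent transpositions needed to turn string a into string b."""
    m, n = len(a), len(b)
    # dp[i][j] = edit distance between a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
            # adjacent transposition, e.g. "teh" -> "the"
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[m][n]
```

A dictionary-tree search like the one described prunes candidates whose prefix distance already exceeds a threshold, rather than scoring every dictionary word in full as this sketch would.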

3.
OCR errors in text harm information retrieval performance. Much research has been reported on modelling and correcting Optical Character Recognition (OCR) errors, but most prior work employs language-dependent resources or training texts to study the nature of the errors; little research focuses on improving retrieval performance from erroneous text in the absence of training data. We propose a novel approach that detects OCR errors and improves retrieval performance from an erroneous corpus when no training samples are available to model errors. Our method automatically identifies erroneous term variants in the noisy corpus, in the absence of clean text, and uses them for query expansion, employing an effective combination of contextual information and string matching techniques. It uses no training data and no language-specific resources such as a thesaurus, and assumes no knowledge about the language except that the word delimiter is a blank space. We tested the approach on erroneous Bangla (Bengali) and Hindi FIRE collections, and also on the TREC Legal IIT CDIP and TREC-5 Confusion track English corpora, achieving statistically significant improvements over state-of-the-art baselines on most of the datasets.

4.
Correction of a Sunshine Duration Calculation Model for Rugged Terrain
Pan Yongdi, 《资源科学》 (Resources Science), 2010, 32(8): 1493-1498
An analysis of the principles underlying the sunshine duration calculation model reveals that the model contains a systematic error, which arises because the model ignores the fact that meteorological stations' sunshine percentage is computed from the astronomically possible sunshine duration. The model was tested with observed data from plains, hills, and mountainous areas, and the variation pattern of its systematic error was explored. Targeting the cause of the error, a corrected sunshine duration calculation model that removes the systematic error is proposed. Systematic and non-systematic errors of the corrected model were analyzed both analytically and against observed data, showing that the corrected model eliminates the systematic error and greatly reduces the overall error; the remaining error stems only from inconsistent sky conditions and DEM resolution, and its magnitude depends on the distance from the station to the calculation point and on the weather system. Analysis of results on observed data indicates that, at the current density of meteorological stations, sunshine duration computed with the corrected model can meet the needs of operational applications and services.

5.
Identification of autoregressive models with exogenous input (ARX) is a classical problem in system identification. This article considers the errors-in-variables (EIV) ARX model identification problem, where input measurements are also corrupted with noise. The recently proposed Dynamic Iterative Principal Components Analysis (DIPCA) technique solves the EIV identification problem but is applicable only to white measurement errors. We propose a novel identification algorithm based on a modified DIPCA approach for identifying the EIV-ARX model for single-input, single-output (SISO) systems where the output measurements are corrupted with coloured noise consistent with the ARX model. Most existing methods assume important parameters such as input-output orders, delay, or noise variances to be known. This work's novelty lies in the joint estimation of error variances, process order, delay, and model parameters. The central idea used to obtain all these parameters in a theoretically rigorous manner is to transform the lagged measurements using the appropriate error covariance matrix, which is obtained from the estimated error variances and model parameters. Simulation studies on two systems demonstrate the efficacy of the proposed algorithm.

6.
[Objective] To describe semantic information in a vector space and study an automatic summarization method based on bags of word vectors. [Methods] A summary is a condensed, precise expression of a document's content. A bag of word vectors can represent words, phrases, sentences, paragraphs, and full documents in the same vector space, where spatial distance reflects semantic similarity. We propose an automatic summarization method based on bags of word vectors: the distance between bag-of-word-vector representations measures the semantic similarity between a sentence and the whole document, and the sentences semantically most similar to the document are extracted to form the summary. [Results] Experiments on the DUC01 dataset show that the method generates high-quality summaries and clearly outperforms other methods. [Conclusions] The experiments demonstrate that the method markedly improves automatic summarization performance.

7.
Modern OCR engines incorporate some form of error correction, typically based on dictionaries. However, residual errors remain that degrade the performance of natural language processing algorithms applied to OCR text. In this paper, we present a statistical learning model for post-processing OCR errors, either fully automatically or followed by minimal user interaction to further reduce the error rate. Our model employs web-scale corpora and integrates a rich set of linguistic features. Through an interdependent learning pipeline, it produces and continuously refines error detection and the suggestion of candidate corrections. Evaluated on a historical biology book with complex error patterns, our model outperforms various baseline methods in the automatic mode and shows an even greater advantage when minimal user interaction is involved. Quantitative analysis of each computational step further suggests that the proposed model is well suited for handling volatile and complex OCR error patterns that are beyond the capabilities of the error correction incorporated in OCR engines.

8.
Auto-Regressive-Moving-Average with eXogenous input (ARMAX) models play an important role in control engineering for describing practical systems. However, ARMAX models can be unrealistic in many practical contexts because they do not consider measurement errors on the output of the process. Due to the auto-regressive nature of ARMAX processes, a measurement error may affect multiple data entries, making the estimation problem very challenging. This problem can be solved by enhancing the ARMAX model with additive error terms on the output, and this paper develops a moving horizon estimator for such an extended ARMAX model. In the proposed method, measurement errors are modeled as nuisance variables and estimated simultaneously with the states. Identifiability is achieved by regularizing the least-squares cost with the ℓ2-norm of the nuisance variables, which leads to an optimization problem with an analytical solution. Convergence results are established for the proposed estimator, and unbiasedness properties are also proved. Insights on how to select the tuning parameter in the cost function are provided. Because the output noise is modeled explicitly, the impact of a measurement error on multiple data entries can be estimated and reduced. Examples are given to demonstrate the effectiveness of the proposed estimator in dealing with additive output noise as well as outliers.
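The reason ℓ2-regularized least squares admits an analytical solution can be illustrated in isolation. The following sketch (not the paper's estimator, just the underlying ridge-regression identity) computes x = (AᵀA + λI)⁻¹Aᵀb:

```python
import numpy as np

def regularized_lsq(A, b, lam):
    """Closed-form minimizer of ||Ax - b||^2 + lam * ||x||^2."""
    n = A.shape[1]
    # Normal equations with an l2 penalty: (A^T A + lam*I) x = A^T b
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```

Setting lam = 0 recovers ordinary least squares; a positive lam keeps the system well-conditioned even when AᵀA is singular, which is what makes nuisance variables identifiable in the extended model.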

9.
Recently, using pretrained word embeddings to represent words has brought success in many natural language processing tasks. Depending on their objective functions, different word embedding models capture different aspects of linguistic properties. The Semantic Textual Similarity task, which evaluates the similarity/relation between two sentences, requires taking these linguistic aspects into account. This research therefore aims to encode the characteristics of multiple sets of word embeddings into one embedding and then learn the similarity/relation between sentences via this novel embedding. Representing each word by multiple word embeddings, the proposed MaxLSTM-CNN encoder generates a novel sentence embedding. We then learn the similarity/relation between our sentence embeddings via multi-level comparison. Our method, M-MaxLSTM-CNN, consistently shows strong performance on several tasks (measuring textual similarity, identifying paraphrase, recognizing textual entailment). Our model uses no hand-crafted features (e.g., alignment features, n-gram overlaps, dependency features) and does not require the pretrained word embeddings to have the same dimension.

10.
In contrast with their monolingual counterparts, little attention has been paid to the effects that misspelled queries have on the performance of Cross-Language Information Retrieval (CLIR) systems. The present work makes a first attempt to fill this gap by extending our previous work on monolingual retrieval to study the impact that the progressive addition of misspellings to input queries has on the output of CLIR systems. Two approaches for dealing with this problem are analyzed. The first is the use of automatic spelling correction techniques, for which we consider two algorithms: one for the correction of isolated words and one for correction based on the linguistic context of the misspelled word. The second is the use of character n-grams both as index terms and as translation units, seeking to take advantage of their inherent robustness and language independence. All these approaches have been tested on a Spanish-to-English CLIR system, that is, Spanish queries on English documents. Real, user-generated spelling errors have been used under a methodology that allows us to study the effectiveness of the different approaches and their behavior when confronted with different error rates. The results show the great sensitivity of classic word-based approaches to misspelled queries, although spelling correction techniques can mitigate these negative effects. The use of character n-grams, on the other hand, provides great robustness against misspellings.
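The robustness of character n-grams against misspellings comes from the fact that a single typo disturbs only a few of a word's n-grams while the rest still match. A minimal sketch of n-gram extraction and a Dice-coefficient overlap measure (illustrative only; the paper's indexing pipeline is more involved):

```python
def char_ngrams(word, n=3, pad="_"):
    """Set of padded character n-grams of a word, e.g. 'cat' -> {_ca, cat, at_}."""
    w = pad + word + pad
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def ngram_sim(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words (1.0 = identical)."""
    A, B = char_ngrams(a, n), char_ngrams(b, n)
    if not A and not B:
        return 1.0
    return 2 * len(A & B) / (len(A) + len(B))
```

A misspelled query term such as "retreival" still shares most trigrams with "retrieval", so an n-gram index retrieves the right documents where an exact word index fails.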

11.
Multi-Document Summarization of Scientific articles (MDSS) is a challenging task that aims to generate concise and informative summaries for multiple scientific articles on a particular topic. However, despite recent advances in abstractive models for MDSS, grammatical correctness and contextual coherence remain challenging issues. In this paper, we introduce EDITSum, a novel abstractive MDSS model that leverages sentence-level planning to guide summary generation. Our model incorporates neural topic model information as explicit guidance and sequential latent variable information as implicit guidance under a variational framework. We propose a hierarchical decoding strategy that generates the sentence-level plan with a sentence decoder and then generates the final summary conditioned on the plan with a word decoder. Experimental results show that our model outperforms previous state-of-the-art models by a significant margin on the ROUGE-1 and ROUGE-L metrics. Ablation studies demonstrate the effectiveness of the individual modules proposed in our model, and human evaluations provide strong evidence that our model generates more coherent and error-free summaries. Our work highlights the importance of high-level planning in addressing intra-sentence errors and inter-sentence incoherence in MDSS.

12.
In the past decade, news consumption has shifted from printed news media to online alternatives. Although these come with advantages, online news poses challenges as well. Notable here is the increased competition between online newspapers and other online news providers to attract readers, in which speed is often favored over quality. As a consequence, the need for new tools to monitor online news accuracy has grown. In this work, a fundamentally new and automated procedure for monitoring online news accuracy is proposed. The approach relies on the fact that online news articles are often updated after initial publication, thereby also correcting errors. Automated observation of the changes made to online articles and detection of the errors that are corrected may offer useful insights concerning news accuracy. The potential of the presented automated error correction detection model is illustrated by building supervised classification models for the detection of objective, subjective, and linguistic errors in online news updates respectively. The models are built using a large news-update data set collected over two consecutive years for six different Flemish online newspapers. A subset of 21,129 changes is then annotated using a combination of automated and human annotation via an online annotation platform. Finally, manually crafted features and text embeddings obtained by four different language models (TF-IDF, word2vec, BERTje and SBERT) are fed to three supervised machine learning algorithms (logistic regression, support vector machines and decision trees), and the performance of the obtained models is evaluated. Results indicate that small differences in performance exist between the different learning algorithms and language models. Using the best-performing models, F2-scores of 0.45, 0.25 and 0.80 are obtained for the classification of objective, subjective and linguistic errors respectively.

13.
Cross-domain sentiment classification, a research hot spot in recent years, aims to learn a reliable classifier using labeled data from a source domain and evaluate the classifier on a target domain. In this vein, most approaches use domain adaptation, which maps data from different domains into a common feature space. To further improve model performance, several methods have been proposed to mine domain-specific information, but most of them exploit only a limited part of it. In this study, we first develop a method for extracting domain-specific words based on topic information derived from topic models. Then, we propose a Topic Driven Adaptive Network (TDAN) for cross-domain sentiment classification. The network consists of two sub-networks: a semantics attention network and a domain-specific word attention network, both built on transformers. These sub-networks take different forms of input, and their outputs are fused as the feature vector. Experiments validate the effectiveness of TDAN on sentiment classification across domains. Case studies also indicate that topic models can add value to cross-domain sentiment classification by discovering interpretable and low-dimensional subspaces.

14.
Although mathematical modelling of fuzzy PID controllers has over the years been carried out extensively with two-dimensional and three-dimensional input spaces, modelling with a one-dimensional input space has rarely been attempted. In this paper, this gap is narrowed by proposing a simple approach in which each of the fuzzy P, fuzzy I, and fuzzy D components is modelled using a one-dimensional input space, and the components are merged to provide the complete PID action. Another speciality of the proposed approach is that it does not require any AND or OR operator to obtain the mathematical models of the individual PID components. To the best of the authors' knowledge, such a modelling approach is completely new. This idea is further extended to fractional-order fuzzy PID controllers. The applicability of the proposed fuzzy controllers is demonstrated with four simulation examples and one real-time experimental case study. To assess their usefulness, the performance of the new controllers is compared with results available in the literature. As the proposed controllers are model-free, they can easily be implemented for other control applications as well.

15.
This paper targets the development of an inertial navigation error-budget system for performance validation before actual field operation. It starts by studying the various errors that an inertial measurement unit (IMU) incorporates, and a systematic approach to error modeling is proposed. The error models are integrated in time and added to the true measurement of the IMU to obtain the observed measurements. Simulation results show the contribution of the errors to the final measurement of the IMU. The IMU error model is blended with a GPS measurement model, and the performance of a GPS/IMU extended Kalman filter (EKF) under IMU errors is shown. The simulated IMU errors are essential for studying the effect of IMU quality on the accuracy of an inertial navigation system's (INS) state estimate.

16.
To automate the process of emotion recognition, in this study we develop a computational approach for continuously tracking and analyzing users' emotions while they chat online. Our work has several unique features: it provides relative probabilities of possible emotions for a word, constructs a distribution for each chat message accordingly, performs a clustering procedure over the message distributions, and aggregates the emotions of consecutive chat sentences to draw a conclusion. To evaluate the proposed approach, we conducted experiments in two phases. The first phase evaluated the effectiveness of the computational approach in analyzing chat sentences: participants were asked to tag emotions for each sentence of a pre-designed dialogue. The second phase involved real-time chatting between two online users: participants were asked to choose topics and chat freely with each other; the messages were analyzed, and the results were provided to the users for evaluation. The results show that our approach is both effective and efficient in tracking the emotions of chatting users. Additional analyses and further discussion were carried out to evaluate the quantitative experimental results, and all findings confirmed the usefulness and feasibility of the presented approach.
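The step of turning per-word emotion probabilities into a per-message distribution can be sketched as a simple average over the words that appear in an emotion lexicon. The lexicon entries and probabilities below are hypothetical toy values, not the paper's resource:

```python
from collections import Counter

# Hypothetical per-word emotion lexicon: word -> {emotion: relative probability}
LEXICON = {
    "happy":    {"joy": 0.8,  "sadness": 0.1,  "anger": 0.1},
    "great":    {"joy": 0.7,  "sadness": 0.1,  "anger": 0.2},
    "terrible": {"joy": 0.05, "sadness": 0.55, "anger": 0.4},
}

def message_distribution(message):
    """Average the emotion distributions of the lexicon words in a message."""
    total = Counter()
    hits = 0
    for word in message.lower().split():
        if word in LEXICON:
            hits += 1
            for emotion, p in LEXICON[word].items():
                total[emotion] += p
    if hits == 0:
        return {}  # no known emotion words in this message
    return {e: p / hits for e, p in total.items()}
```

The resulting message distributions are what a downstream clustering and sentence-aggregation step would consume.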

17.
In the context of social media, users usually post relevant information corresponding to the content of events mentioned in a Web document. This information possesses two important properties: (i) it reflects the content of an event, and (ii) it shares hidden topics with sentences in the main document. In this paper, we present a novel model that captures the nature of the relationships between document sentences and post information (comments or tweets) in sharing hidden topics, for summarization of Web documents utilizing relevant post information. Unlike previous methods, which are usually based on hand-crafted features, our approach ranks document sentences and user posts based on their importance to the topics. The sentence-user-post relation is formulated in a shared topic matrix, which represents their mutual reinforcement. Our proposed matrix co-factorization algorithm computes the score of each document sentence and user post and extracts the top-ranked document sentences and comments (or tweets) as a summary. We apply the model to summarization on three datasets in two languages, English and Vietnamese, for social-context summarization, and also on DUC 2004 (a standard corpus for the traditional summarization task). According to the experimental results, our model significantly outperforms basic matrix factorization and achieves ROUGE scores competitive with state-of-the-art methods.

18.
This paper presents a simplified design methodology for robust event-driven tracking control of uncertain nonlinear pure-feedback systems with input quantization. All nonlinearities and quantization parameters are assumed to be completely unknown. In contrast to existing event-driven control approaches for systems with completely unknown nonlinearities, the main contribution of this paper is a simple event-based tracking scheme with preassigned performance that does not use adaptive function approximators or adaptive mirror models. It is shown in the Lyapunov sense that the proposed event-driven low-complexity tracker, consisting of nonlinearly transformed error surfaces and a triggering condition, achieves the preselected transient and steady-state performance of the control errors in the presence of input quantization.

19.
[Purpose/Significance] In multi-document automatic summarization, researchers have mainly focused on capturing the important topical content of a document collection, and many proposed methods improve the representativeness of summaries while neglecting the latent topics in the documents. [Method/Process] To address the high redundancy and incomplete topic coverage of multi-document summaries, this paper proposes a multi-document summarization method based on sentence-level topic discovery. The method converts multiple documents into a set of sentences, applies the LDA topic model for sentence clustering and topic discovery, and computes sentence similarity using word vectors trained with word2vec; within each topic, sentence importance is computed with the TextRank algorithm, and the summary of the document collection is generated by combining this with statistical features of the sentences. [Result/Conclusion] Manual evaluation shows that the proposed method achieves good results in topic coverage, conciseness, and grammaticality.
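The TextRank step used above to score sentence importance is a PageRank-style power iteration over a sentence-similarity graph. A minimal dependency-free sketch (the similarity matrix would in practice come from the word2vec sentence similarities; here it is just an input):

```python
def textrank(sim, d=0.85, iters=50):
    """Power iteration of TextRank over a symmetric sentence-similarity
    matrix with zero diagonal; returns one importance score per sentence."""
    n = len(sim)
    out = [sum(row) for row in sim]      # total outgoing weight per sentence
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            # each neighbor j passes a share of its score proportional
            # to the edge weight sim[j][i]
            rank = sum(sim[j][i] / out[j] * scores[j]
                       for j in range(n) if j != i and out[j] > 0)
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores
```

Sentences with high scores are then selected, per topic cluster, into the summary.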

20.
An efficient active-user identification method for synchronous DS-CDMA wireless ad hoc networks is proposed. Using a cross-layer design, scheduling information from the SEEDEX protocol at the MAC layer is used for frame encoding and user identification at the physical layer. The receiver first determines whether any active user exists; if so, it identifies the active users using the spreading codes of all nodes that may transmit in the current time slot, and if not, it discards the data frame directly, thereby reducing the receiver's energy consumption. Simulation results show that, compared with existing algorithms of the same kind, the method reduces the computational load, saves receiver energy, and lowers the receiver's misjudgment probability.


Copyright © Beijing Qinyun Technology Development Co., Ltd. 京ICP备09084417号