Similar Documents
20 similar documents found (search time: 15 ms)
1.
OCR errors in text harm information retrieval performance. Much research has been reported on modelling and correcting Optical Character Recognition (OCR) errors, but most prior work employs language-dependent resources or training texts to study the nature of the errors, and little research focuses on improving retrieval performance from erroneous text in the absence of training data. We propose a novel approach for detecting OCR errors and improving retrieval from an erroneous corpus when neither training samples nor clean text are available: the method automatically identifies erroneous term variants in the noisy corpus and uses them for query expansion, employing an effective combination of contextual information and string matching techniques. The approach uses no training data and no language-specific resources such as a thesaurus, and assumes nothing about the language except that the word delimiter is a blank space. We tested it on erroneous Bangla (Bengali) and Hindi FIRE collections, as well as on the TREC Legal IIT CDIP and TREC 5 Confusion track English corpora, and achieved statistically significant improvements over state-of-the-art baselines on most of the datasets.
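For illustration, here is a minimal Python sketch of the general idea, not the authors' exact method: candidate variants of a query term are filtered by string similarity and confirmed by overlapping context words, then added to the query. The toy corpus, the thresholds, and the use of `difflib.SequenceMatcher` are illustrative assumptions.

```python
# Sketch: find OCR error variants of a query term via string similarity
# plus contextual evidence, for use in query expansion.
from collections import defaultdict
from difflib import SequenceMatcher

corpus = [
    "the tribunal heard the iandlord appeal",            # OCR error: iandlord
    "the landlord filed an appeal with the tribunal",
]

def context(term, docs, window=2):
    """Bag of words co-occurring with `term` within a +/- window."""
    bag = defaultdict(int)
    for doc in docs:
        toks = doc.split()
        for i, t in enumerate(toks):
            if t == term:
                for c in toks[max(0, i - window):i + window + 1]:
                    if c != term:
                        bag[c] += 1
    return bag

def error_variants(query_term, docs, str_thresh=0.8, ctx_overlap=1):
    vocab = {t for doc in docs for t in doc.split()}
    q_ctx = context(query_term, docs)
    variants = []
    for cand in vocab - {query_term}:
        # string similarity filters unrelated words cheaply
        if SequenceMatcher(None, query_term, cand).ratio() < str_thresh:
            continue
        # contextual overlap confirms the candidate behaves like the query term
        if len(set(q_ctx) & set(context(cand, docs))) >= ctx_overlap:
            variants.append(cand)
    return variants

print(error_variants("landlord", corpus))  # ['iandlord']
```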

2.
Automatic word spacing in Korean remains a significant task in natural language processing owing to the extremely complex word spacing rules involved. Most previous models remove all spaces in input sentences and insert new spaces into the modified sentences. If an input sentence contains only a few spacing errors, such models often return a sentence with even more errors than the input, because they discard the correct spaces the user typed intentionally. To reduce this problem, we propose a neural-network-based automatic word spacing model that effectively uses the word spacing information in input sentences. The proposed model comprises a space insertion layer and a spacing error correction layer. Like previous models, the space insertion layer inserts word spaces into input sentences from which all spaces have been removed. The spacing error correction layer then post-corrects the errors of the insertion layer using the word spacing typed by users. Because the two layers are tightly connected, the backpropagation flows are not blocked, and space insertion and error correction are performed simultaneously. In experiments, the proposed model outperformed all compared models on all measures on the same test data, and it exhibited reliable performance (word-unit F1-measures of 94.17%-97.87%) regardless of how many word spacing errors were present in the input sentences.
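A minimal PyTorch sketch of the two-layer idea follows; the BiLSTM encoder, the layer sizes, and the single-linear correction head are simplifying assumptions, not the paper's actual architecture. The insertion head labels each character boundary of the de-spaced sentence, and the correction head refines those labels using the spacing the user actually typed; both heads sit in one graph, so gradients flow through both.

```python
import torch
import torch.nn as nn

class SpacingModel(nn.Module):
    def __init__(self, vocab_size, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.insert_head = nn.Linear(2 * hidden, 2)   # space / no-space per char
        # correction head sees insertion logits plus the user-typed space flag
        self.correct_head = nn.Linear(2 + 1, 2)

    def forward(self, chars, user_spaces):
        h, _ = self.encoder(self.embed(chars))
        insert_logits = self.insert_head(h)                       # (B, T, 2)
        feats = torch.cat([insert_logits, user_spaces.unsqueeze(-1)], dim=-1)
        correct_logits = self.correct_head(feats)                 # (B, T, 2)
        # trained jointly, so backpropagation is not blocked between layers
        return insert_logits, correct_logits

model = SpacingModel(vocab_size=1000)
chars = torch.randint(0, 1000, (2, 20))             # de-spaced character ids
user_spaces = torch.randint(0, 2, (2, 20)).float()  # spacing as typed by user
ins, cor = model(chars, user_spaces)
print(ins.shape, cor.shape)  # torch.Size([2, 20, 2]) twice
```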

3.
Research on a Markov Model-Based Method for Clustering Library Users   Cited by: 1 (self-citations: 0, others: 1)
吴艳玲, 孙思阳. 情报科学 (Information Science), 2021, 39(11): 167-172
【Purpose/Significance】To address the instability and high error rate of clustering library user groups, this paper proposes a Markov model-based method for clustering library users to improve clustering accuracy. 【Method/Process】A first-order Markov mixture model is used to model user action sequences; the model produces user-behavior clusters that reflect the dynamic nature of user actions. An adaptive natural-gradient algorithm adjusts its own step size according to the separation state of user behaviors, addressing automatic model selection during parameter learning and achieving optimal clustering of library users. 【Result/Conclusion】Experiments show that when the actual number of clusters is smaller than the L value, the proposed method performs automatic model selection during parameter learning. The method yields the largest number of subgroups, partitions the widest value interval, achieves a minimum clustering error rate of 0.22%, and delivers stable clustering performance and more accurate grouping, meeting the design goals. 【Innovation/Limitation】A first-order Markov mixture model is used to cluster library users. Future work will study higher-order Markov mixture models that account for dependencies between user sequences, to improve the accuracy and stability of the clustering algorithm.
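For illustration, a minimal numpy sketch of the core idea: fitting a mixture of first-order Markov chains to action sequences with EM. The toy sequences, the fixed number of clusters K, and plain EM (rather than the paper's adaptive natural-gradient learning with automatic model selection) are simplifying assumptions.

```python
import numpy as np

def fit_markov_mixture(seqs, n_states, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                    # cluster weights
    T = rng.random((K, n_states, n_states))     # per-cluster transition matrices
    T /= T.sum(axis=2, keepdims=True)
    for _ in range(iters):
        # E-step: responsibility of each cluster for each user's sequence
        logp = np.zeros((len(seqs), K))
        for i, s in enumerate(seqs):
            for k in range(K):
                logp[i, k] = np.log(pi[k]) + sum(
                    np.log(T[k, a, b]) for a, b in zip(s, s[1:]))
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed transition counts
        pi = r.mean(axis=0)
        T = np.full((K, n_states, n_states), 1e-3)
        for i, s in enumerate(seqs):
            for a, b in zip(s, s[1:]):
                T[:, a, b] += r[i]
        T /= T.sum(axis=2, keepdims=True)
    return pi, T, r.argmax(axis=1)

seqs = [[0, 1, 0, 1, 0], [0, 1, 0, 1], [2, 2, 2, 2], [2, 2, 2]]
_, _, labels = fit_markov_mixture(seqs, n_states=3, K=2)
print(labels)  # alternating users vs. repeating users, e.g. [0 0 1 1]
```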

4.
An automatic method for correcting spelling and typing errors in teletypewriter keyboard input is proposed. The computerized correction process is presented as a heuristic tree search: correct spellings are stored character-by-character in a pseudo-binary tree, and the search examines a small subset of the database (selected branches of the tree) while checking for insertion, substitution, deletion, and transposition errors. The correction procedure exploits the inherent redundancy of natural language. Multiple errors can be handled if at least two correct characters appear between errors. Test results indicate that this approach has the highest error correction accuracy reported to date.
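A minimal Python sketch of the described search follows: spellings stored character-by-character in a trie, searched with a small error budget covering the four error types. The tiny word list and the single-error budget are illustrative assumptions.

```python
END = "$"

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def candidates(root, typo, budget=1):
    out = set()

    def walk(node, i, left, prefix):
        if i == len(typo) and node.get(END):
            out.add(prefix)
        if left and i < len(typo):                       # extra char typed
            walk(node, i + 1, left - 1, prefix)
        if left and i + 1 < len(typo):                   # transposed pair
            a, b = typo[i], typo[i + 1]
            if b in node and a in node[b]:
                walk(node[b][a], i + 2, left - 1, prefix + b + a)
        for ch, child in node.items():
            if ch == END:
                continue
            if i < len(typo) and ch == typo[i]:          # exact match
                walk(child, i + 1, left, prefix + ch)
            elif left:
                walk(child, i + 1, left - 1, prefix + ch)  # substituted char
                walk(child, i, left - 1, prefix + ch)      # missing char
    walk(root, 0, budget, "")
    return out

trie = build_trie(["retrieval", "correction", "keyboard"])
print(candidates(trie, "retreival"))   # {'retrieval'} via transposition
print(candidates(trie, "keybord"))     # {'keyboard'} via deletion
```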

5.
Most Nüshu (women's script) texts in China are currently preserved in handwritten form. This paper introduces OCR technology for Nüshu and discusses the overall OCR pipeline, including binarization, character segmentation, feature extraction, and character recognition, ultimately achieving recognition and storage of handwritten Nüshu characters.
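A sketch of the front end of such a pipeline is shown below, using OpenCV as an assumed implementation choice (the paper does not name one): Otsu binarization followed by connected-component character segmentation. Feature extraction and recognition would operate on the returned glyph boxes.

```python
import cv2

def binarize_and_segment(image_path, min_area=20):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu picks a global threshold; glyphs become white on black
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    boxes = [tuple(stats[i, :4]) for i in range(1, n)   # index 0 is background
             if stats[i, cv2.CC_STAT_AREA] >= min_area]
    # crude left-to-right, top-to-bottom reading order for the sketch
    boxes.sort(key=lambda b: (b[1] // 50, b[0]))
    return binary, boxes  # each box: (x, y, width, height)
```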

6.
张超, 杨健. 科技广场, 2007, (5): 137-138
To recognize standard printed text quickly and accurately, this paper develops an OCR program using the MODI component bundled with Microsoft Office. Experiments show that its processing speed and recognition rate are both satisfactory; using MODI simplifies OCR program development, and preprocessing the image files first yields an even higher recognition rate.

7.
An error-correction code (ECC) sequencing approach has recently been reported to effectively reduce sequencing errors by interrogating a DNA fragment with three orthogonal degenerate sequencing-by-synthesis (SBS) reactions. However, as in other non-single-molecule SBS methods, the reaction gradually loses its synchronization within a molecular colony in ECC sequencing. This phenomenon, called dephasing, causes sequencing errors and, in ECC sequencing, induces distinctive dephasing patterns. To understand the characteristic dephasing patterns of the dual-base flowgram in ECC sequencing and to derive a correction algorithm, we built a virtual sequencer in silico. Starting from first principles and the underlying sequencing chemistry, we simulated ECC sequencing results, identified the key factors of dephasing in ECC sequencing chemistry, and designed an effective dephasing algorithm. The results show that our dephasing algorithm is applicable to sequencing signals with at least 500 cycles, or 1000-bp average read length, with an error rate low enough for further parity checks and ECC deduction. Our virtual sequencer with our dephasing algorithm can further be extended to a dichromatic form of ECC sequencing, allowing for a potentially much more accurate sequencing approach.
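To make the dephasing phenomenon concrete, here is a toy numpy simulation of lagging strands only (incomplete extension; leading strands are analogous), not the authors' virtual sequencer or correction algorithm. Each flow cycle, a small fraction of strands fails to extend, so the colony's population spreads over template positions and the per-cycle signal blurs.

```python
import numpy as np

def simulate_lagging(template, flows, p_lag=0.02):
    L = len(template)
    pop = np.zeros(L + 1)
    pop[0] = 1.0                                  # strand fraction per position
    signal = []
    for base in flows:
        new = np.zeros_like(pop)
        emitted = 0.0
        for pos in range(L + 1):
            frac = pop[pos]
            if frac and pos < L and template[pos] == base:
                emitted += frac * (1 - p_lag)     # extending strands emit signal
                new[pos + 1] += frac * (1 - p_lag)
                new[pos] += frac * p_lag          # laggards fall out of sync
            else:
                new[pos] += frac                  # no match: strand waits
        pop = new
        signal.append(round(emitted, 4))
    return signal

# signal decays and smears across cycles as the population dephases
print(simulate_lagging("ACGTACGTAC", "ACGT" * 3))
```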

8.
Ethnicity-targeted hate speech has been widely shown to influence on-the-ground inter-ethnic conflict and violence, especially in multi-ethnic societies such as Russia, so detecting it in user texts is becoming an important task. The task, however, faces a number of unresolved problems: the difficulty of reliable markup; informal and indirect ways of expressing negativity in user texts (such as irony, false generalization, and attribution of unfavored actions to targeted groups); users' inclination to express opposite attitudes towards different ethnic groups in the same text; and, finally, a lack of research on languages other than English. In this work we address several of these problems for ethnicity-targeted hate speech detection in Russian-language social media texts. Our approach allows us to differentiate between attitudes towards different ethnic groups mentioned in the same text, a task that has never been addressed before. We use a dataset of over 2.6M user messages mentioning ethnic groups to construct a representative sample of 12K (ethnic group, text) instances that are thoroughly annotated via a special procedure. In contrast to many previous collections, which usually comprise extreme cases of toxic speech, the representativeness of our sample secures a realistic and therefore much higher proportion of subtle negativity, which additionally complicates automatic detection. We then experiment with four types of machine learning models, from traditional classifiers such as SVM to deep learning approaches, notably the recently introduced BERT architecture, and interpret their predictions in terms of various linguistic phenomena. In addition to hate speech detection with a text-level two-class approach (hate, no hate), we also justify and implement a unique instance-based three-class approach (positive, neutral, or negative attitude, the latter implying hate speech). Our best results are achieved with fine-tuned, pre-trained RuBERT combined with linguistic features: F1-hate=0.760 and F1-macro=0.833 on the text-level two-class problem, comparable to previous studies, and F1-hate=0.813 and F1-macro=0.824 on our unique instance-based three-class hate speech detection task. Finally, error analysis reveals that further improvement could be achieved by handling complex and creative language more accurately, i.e., by detecting irony and unconventional forms of obscene lexicon.
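A minimal sketch of the RuBERT fine-tuning setup using Hugging Face Transformers follows; the checkpoint name `DeepPavlov/rubert-base-cased`, the sentence-pair encoding of (ethnic group, text) instances, and the label order are assumptions based on the abstract, and the linguistic features are omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "DeepPavlov/rubert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

# pair encoding lets the model condition on the targeted group
enc = tokenizer("<ethnonym placeholder>", "<message text placeholder>",
                return_tensors="pt", truncation=True)
labels = torch.tensor([2])      # assumed scheme: 0=positive, 1=neutral, 2=negative
out = model(**enc, labels=labels)
out.loss.backward()             # plug into a standard fine-tuning loop
```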

9.
Machine learning algorithms enable advanced decision making in contemporary intelligent systems, and research indicates a tradeoff between their model performance and explainability: models with higher performance are often based on more complex algorithms and therefore lack explainability, and vice versa. However, there is little to no empirical evidence of this tradeoff from an end-user perspective. We aim to provide such evidence by conducting two user experiments. Using two distinct datasets, we first measure the tradeoff for five common classes of machine learning algorithms. Second, we address end-user perceptions of explainable artificial intelligence augmentations aimed at increasing understanding of the decision logic of high-performing complex models. Our results diverge from the widespread assumption of a tradeoff curve and indicate that, in end users' perception, the tradeoff between model performance and explainability is much less gradual, in stark contrast to assumptions of inherent model interpretability. Further, we found the tradeoff to be situational, for example depending on data complexity. Results of our second experiment show that while explainable artificial intelligence augmentations can increase explainability, the type of explanation plays an essential role in end-user perception.

10.
Using data from 15 dual-frequency GPS stations of China's crustal deformation monitoring network over two consecutive periods (July 1-8 and July 14-18, 2000), this paper evaluates the accuracy of an ionospheric grid correction algorithm during a magnetic storm and during quiet periods. The idea of the algorithm is described, and the accuracy at user stations located at different latitudes and under different space-environment conditions is computed. From the perspective of system integrity, geostatistical methods are further used to analyze the spatial correlation of the correction errors, which provides useful information for constructing distance functions for grid-point confidence intervals. The results show that user stations at middle and high latitudes achieve higher accuracy, about 0.4 m on average, while accuracy at low latitudes is lower. During this magnetic storm the algorithm's accuracy degraded markedly, with a stronger effect at low latitudes. The correlation of the correction errors decreases with distance: during ionospherically quiet periods the errors are small and neighboring grid points are strongly correlated, whereas during the storm the errors grow and the correlation weakens.
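A minimal numpy sketch of the geostatistical step described above: bin the correction errors of station pairs by inter-station distance and examine how their correlation decays. The station coordinates and error series here are random placeholders, not the paper's data.

```python
import numpy as np

def correlation_vs_distance(coords, errors, bins):
    """coords: (N, 2) positions; errors: (N, T) per-epoch correction errors."""
    n = len(coords)
    d, c = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d.append(np.hypot(*(coords[i] - coords[j])))
            c.append(np.corrcoef(errors[i], errors[j])[0, 1])
    d, c = np.array(d), np.array(c)
    idx = np.digitize(d, bins)
    # mean pairwise correlation per distance bin
    return [c[idx == k].mean() if (idx == k).any() else np.nan
            for k in range(1, len(bins))]

rng = np.random.default_rng(1)
coords = rng.uniform(0, 30, (15, 2))            # placeholder station positions
errors = rng.normal(0, 0.4, (15, 100))          # metre-level errors per epoch
print(correlation_vs_distance(coords, errors, bins=np.arange(0, 40, 10)))
```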

11.
【Purpose/Significance】Users' interactive continuous learning behavior on online videos is important for maximizing the value of online video learning resources and for the development of educational informatization; exploring its influencing factors and mechanisms helps increase user stickiness and build a healthy online learning ecosystem. 【Method/Process】We collected 186,038 danmaku (bullet-screen) comments from interactive learning videos and, taking them as raw material, applied the qualitative grounded theory method with three-level coding to extract concepts and categories, and then built a theoretical model of the influencing factors. 【Result/Conclusion】The study finds that individual, course, and instructor factors influence user satisfaction through user interaction factors, which in turn influences users' interactive continuous learning behavior. In addition, users' individual characteristics moderate the effect of interaction factors on satisfaction, and the instructor's teaching style directly affects interactive continuous learning behavior. 【Innovation/Limitation】This paper applies grounded theory to danmaku data to build a model of the factors influencing interactive continuous learning behavior; quantitative analysis of the individual factors awaits future verification, and the conclusions could be extended through comparative analysis of other types of online video learning resources.

12.
The prevalence of Location-Based Social Network (LBSN) services has made personalized next Point-of-Interest (POI) prediction a trending research topic. However, geolocation information is often missing, owing to device failure or deliberate camouflage, which prevents existing POI-oriented research from performing advanced user preference analysis. To this end, we propose a novel model named Bi-STAN, which fuses bidirectional spatiotemporal transition patterns with personalized dynamic preferences to identify which POI a user visited at a specific past time, namely the missing check-in POI identification task. Furthermore, to relieve data sparsity, Bi-STAN explicitly exploits spatiotemporal characteristics by tracing back bilaterally through a user's mobility trace to find related items with high predictive power. Specifically, Bi-STAN introduces (1) a temporal-aware attention semantic category encoder that unveils latent semantic category transition patterns by modeling temporal periodicity and attenuation; (2) a spatial-aware attention POI encoder that captures latent POI transition patterns by modeling spatial regularity and proximity; and (3) a multitask-oriented decoder that incorporates personalized and temporally varying preferences into the learned transition patterns for missing check-in POI and category identification. Exploiting the complementarity and compatibility of multi-task learning, we further train Bi-STAN with a self-adaptive learning rate. Experimental results on two real-world datasets show the effectiveness of our proposed method; notably, Bi-STAN can also be applied to the next-POI prediction task with outstanding performance.
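A minimal sketch of the task setup only, not the Bi-STAN architecture: missing check-in identification cast as masked-position prediction over a check-in sequence, where self-attention can draw on both earlier and later visits. The model sizes and toy trace are illustrative assumptions, and the model would need training before the prediction is meaningful.

```python
import torch
import torch.nn as nn

n_pois, MASK = 500, 0
emb = nn.Embedding(n_pois + 1, 64)               # id 0 reserved for [MASK]
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)
head = nn.Linear(64, n_pois + 1)

trace = torch.tensor([[17, 42, MASK, 42, 99]])   # third check-in is missing
logits = head(encoder(emb(trace)))               # (1, 5, n_pois + 1)
missing = logits[0, 2].argmax()                  # uses bidirectional context
print(missing.item())                            # after training: recovered POI
```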

13.
To achieve personalized recommendation, a recommender system selects items a user may like by learning from collected user–item interaction data. However, the acquisition and use of such data usually form a feedback loop that causes recommender systems to suffer from popularity bias. To solve this problem, we propose a novel dual disentanglement of user–item interaction for recommendation with causal embedding (DDCE). Unlike existing work, we take into account double-ended popularity bias from both the user side and the item side. First, we perform a causal analysis of the reasons for user–item interaction and obtain causal embedding representations of each part according to the analysis results. Second, on the item side, we consider the influence of item attributes on popularity to improve the reliability of item popularity estimates. Then, on the user side, we consider the effect of time series when modeling users' interests: we formulate a contrastive learning task to disentangle users' long- and short-term interests, which avoids the bias caused by their overlap, and use an attention mechanism to integrate the two dynamically. Finally, we disentangle the reasons for user–item interaction by decoupling user interest from item popularity. Experiments on two real-world datasets (Douban Movie and KuaiRec) verify the significance of DDCE: averaged over three evaluation metrics (NDCG, HR, and Recall), DDCE improves over the state-of-the-art model by 5.1106% and 4.1277% on the two datasets with MF as the backbone, and by 3.8256% and 3.2790% with LightGCN as the backbone.
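For illustration only, a minimal PyTorch sketch of two ingredients named above, under assumed shapes and losses that are not the authors' exact formulation: an InfoNCE-style contrastive term that pushes a user's long- and short-term interest vectors apart (disentanglement), and attention weights that fuse them dynamically per target item.

```python
import torch
import torch.nn.functional as F

def contrastive_disentangle(long_i, short_i, tau=0.1):
    """A dropout view of the long-term vector is the positive; the user's
    short-term vector is the negative, pushing the two interests apart."""
    anchor = F.normalize(F.dropout(long_i, 0.1), dim=-1)
    pos = F.normalize(long_i, dim=-1)
    neg = F.normalize(short_i, dim=-1)
    logits = torch.cat([(anchor * pos).sum(-1, keepdim=True) / tau,
                        (anchor * neg).sum(-1, keepdim=True) / tau], dim=-1)
    return F.cross_entropy(logits, torch.zeros(len(logits), dtype=torch.long))

def fuse(long_i, short_i, target):
    """Attention over the two interest vectors, conditioned on the target item."""
    stack = torch.stack([long_i, short_i], dim=1)          # (B, 2, d)
    attn = torch.softmax((stack * target.unsqueeze(1)).sum(-1), dim=1)
    return (attn.unsqueeze(-1) * stack).sum(1)             # (B, d)

B, d = 8, 32
long_i, short_i, tgt = torch.randn(B, d), torch.randn(B, d), torch.randn(B, d)
print(contrastive_disentangle(long_i, short_i).item(), fuse(long_i, short_i, tgt).shape)
```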

14.
Automatic question answering systems build natural language knowledge and applications on top of search engines and can satisfy users' retrieval needs better than traditional keyword-matching search engines. This paper introduces a model of an automatic question answering system for the computer operating systems domain, describes the development process, and designs and implements such a system. Practice shows that the system can answer user questions fairly accurately.

15.
Recent research in the human–computer interaction and information retrieval areas has revealed that search response latency has a clear impact on user behavior in web search. This impact is reflected both in users' subjective perception of the usability of a search engine and in their interaction with it, in terms of the number of search results they engage with. However, a similar impact analysis has been missing so far in the context of sponsored search. Since the predominant business model for commercial search engines is advertising via sponsored search results (i.e., search advertisements), understanding how response latency influences user interaction with the advertisements displayed on search engine result pages is crucial to increasing the revenue of a commercial search engine. To this end, we conduct a large-scale analysis using query logs obtained from a commercial web search engine. We analyze the short-term and long-term impact of search response latency on the querying and clicking behaviors of users on desktop and mobile devices, as well as the corresponding impact on the search engine's revenue. This analysis demonstrates the importance of serving sponsored search results with low latency and provides insight into the ad serving policies commercial search engines need to ensure long-term user engagement and search revenue.

16.
王莉. 中国科技信息, 2005, (11): 383-384
English learners inevitably make errors of one kind or another in the learning process, and analyzing students' errors plays an important role in teaching. Error analysis lets teachers understand the problems and difficulties students face in learning, adjust teaching content and plans accordingly, and guide their teaching so that class time is used effectively to improve students' English. Linguists classify learners' errors in different ways according to different theories. This paper lists typical errors common among students and groups them into linguistic errors and pragmatic errors: violations of the structural rules of the language (grammar and vocabulary; pronunciation errors are not covered here) versus violations of the rules of language use (whether the timing and occasion of use are appropriate). By analyzing these errors, problems in teaching and learning are identified, showing that error analysis is indispensable.

17.
In this paper, we define and present a comprehensive classification of user intent for Web searching. The classification consists of three hierarchical levels: informational, navigational, and transactional intent. After deriving attributes of each, we developed a software application that automatically classified queries, using a Web search engine log of over a million and a half queries submitted by several hundred thousand users. Our findings show that more than 80% of Web queries are informational in nature, with about 10% each being navigational and transactional. To validate the accuracy of our algorithm, we manually coded 400 queries and compared the results of this manual classification to those of the automated method; the comparison showed that the automatic classification has an accuracy of 74%. For the remaining quarter of the queries, the user intent is vague or multi-faceted, pointing to the need for probabilistic classification. We discuss how search engines can use knowledge of user intent to provide more targeted and relevant results in Web searching.
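A minimal sketch of an attribute-based classifier of the kind the paper describes; the keyword list and URL pattern are illustrative assumptions, not the authors' exact attributes.

```python
import re

TRANSACTIONAL = {"download", "buy", "purchase", "order", "software", "games"}
NAV_PATTERN = re.compile(r"\.(com|org|net|edu)\b|^www\.", re.I)

def classify(query: str) -> str:
    if NAV_PATTERN.search(query):
        return "navigational"       # looks like a site or domain name
    if set(query.lower().split()) & TRANSACTIONAL:
        return "transactional"      # the user wants to obtain something
    return "informational"          # default and dominant (~80%) class

for q in ["www.nytimes.com", "download firefox", "causes of inflation"]:
    print(q, "->", classify(q))
```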

18.
Research on Constructing Mobile User Profiles   Cited by: 2 (self-citations: 0, others: 2)
Base-station communication network data contains rich information about mobile user behavior. Building behavioral profiles of mobile users from three aspects, frequent activities, regular behaviors, and movement speed, can provide more complete and richer information for personalized services. Based on analyzing and mining the base-station records of 30,000 mobile users from a telecom operator, this paper constructs mobile user behavior profiles from the geographic location information contained in users' base-station logs, using frequent pattern mining, probability matrix construction, and entropy computation. The results show that the profile model can reveal users' frequent activity patterns, periodic behaviors, and modes of travel, and can serve as a basis for analyzing mobile user group behavior and inter-user interaction.

19.
张素云. 科教文汇, 2011, (36): 109, 126
As a form of instruction, error correction plays an important role in high school mathematics teaching. By constantly attending to errors in class and asking where exactly a solution went wrong, teachers can lead students to discover and correct their own mistakes and so avoid repeating them. This remedies gaps in students' knowledge and in their logical reasoning, improves the accuracy of their problem solving, and strengthens the rigor of their thinking, while further broadening their approaches, training their thinking, developing their abilities, and activating their minds.

20.
Work performed under the SPElling Error Detection COrrection Project (SPEEDCOP), supported by the National Science Foundation (NSF) at Chemical Abstracts Service (CAS), to devise effective automatic methods of detecting and correcting misspellings in scholarly and scientific text is described. The investigation was applied to 50,000 word/misspelling pairs collected from six datasets: Chemical Industry Notes (CIN), Biological Abstracts (BA), Chemical Abstracts (CA), American Chemical Society primary journal keyboarding (ACS), Information Science Abstracts (ISA), and Distributed On-Line Editing (DOLE, a CAS internal dataset especially suited to spelling error studies). The purpose of this study was to determine the utility of trigram analysis in the automatic detection and/or correction of misspellings. Computer programs were developed to collect data on trigram distribution in each dataset and to explore the potential of trigram analysis for detecting spelling errors, verifying correctly spelled words, locating the error site within a misspelling, and distinguishing between the basic kinds of spelling errors. The results of the trigram analysis were largely independent of the dataset to which it was applied, although trigram compositions varied with the dataset. The trigram analysis technique developed located the error site within a misspelling accurately, but it did not distinguish effectively between different error types or between valid words and misspellings. However, methods for increasing its accuracy are suggested.
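A minimal sketch of the trigram technique evaluated above: build trigram frequencies from a corpus, then flag the rarest trigram in a word as the likely error site. The tiny corpus here is an illustrative assumption.

```python
from collections import Counter

def trigrams(word):
    padded = f"  {word} "            # pad so word edges form trigrams too
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(words):
    freq = Counter()
    for w in words:
        freq.update(trigrams(w))
    return freq

def error_site(word, freq):
    """Index of the trigram with the lowest corpus frequency."""
    scores = [freq[t] for t in trigrams(word)]
    return min(range(len(scores)), key=scores.__getitem__), scores

corpus = ["chemical", "chemistry", "abstract", "abstracts", "chemicals"]
freq = train(corpus)
site, scores = error_site("chenical", freq)      # 'm' mistyped as 'n'
print(site, scores)  # the zero-frequency trigrams cluster at the error position
```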
