Similar Documents
20 similar documents found (search time: 15 ms)
1.
2.
A variety of data structures such as inverted file, multi-lists, quad tree, k-d tree, range tree, polygon tree, quintary tree, multidimensional tries, segment tree, doubly chained tree, the grid file, d-fold tree, super B-tree, Multiple Attribute Tree (MAT), etc. have been studied for multidimensional searching and related problems. Physical database organization, which is an important application of multidimensional searching, is traditionally and mostly handled by employing inverted files. This study proposes the MAT data structure for bibliographic file systems, illustrating the superiority of the MAT data structure over the inverted file. Both methods are compared in terms of preprocessing, storage and query costs. A worst-case complexity analysis of both methods, for a partial match query, is carried out in two cases: (a) when the directory resides in main memory, (b) when the directory resides in secondary memory. In both cases, the MAT data structure is shown to be more efficient than the inverted file method. Arguments are given to illustrate the superiority of the MAT data structure in the average case as well. An efficient adaptation of the MAT data structure, exploiting the special features of the MAT structure and of bibliographic files, is proposed for bibliographic file systems. In this adaptation, suitable techniques for fixing and ranking the attributes of the MAT data structure are proposed. Conclusions and proposals for future research are presented.
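The inverted-file baseline that the abstract compares against can be sketched minimally as a mapping from (attribute, value) pairs to posting sets, with a partial match query answered by intersecting postings. This is a hypothetical illustration of the general technique, not the paper's implementation; the sample records and attribute names are invented:

```python
from collections import defaultdict

def build_inverted_file(records):
    """Map each (attribute, value) pair to the set of record ids holding it."""
    index = defaultdict(set)
    for rid, record in enumerate(records):
        for attr, value in record.items():
            index[(attr, value)].add(rid)
    return index

def partial_match(index, query):
    """Answer a partial match query by intersecting the posting sets
    of the attributes that the query actually specifies."""
    postings = [index.get(pair, set()) for pair in query.items()]
    if not postings:
        return set()
    return set.intersection(*postings)

records = [
    {"author": "smith", "year": "1985", "lang": "en"},
    {"author": "jones", "year": "1985", "lang": "en"},
    {"author": "smith", "year": "1990", "lang": "fr"},
]
index = build_inverted_file(records)
print(sorted(partial_match(index, {"author": "smith", "year": "1985"})))  # [0]
```

The cost profile the paper analyzes comes from exactly these posting-set intersections: an inverted file pays one posting-list scan per specified attribute, which is what the MAT structure is argued to improve upon.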

3.
Signature files and inverted files are well-known index structures. In this paper we undertake a direct comparison of the two for searching for partially-specified queries in a large lexicon stored in main memory. Using n-grams to index lexicon terms, a bit-sliced signature file can be compressed to a smaller size than an inverted file if each n-gram sets only one bit in the term signature. With a signature width less than half the number of unique n-grams in the lexicon, the signature file method is about as fast as the inverted file method, and significantly smaller. Greater flexibility in memory usage and faster index generation time make signature files appropriate for searching large lexicons or other collections in an environment where memory is at a premium.
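The "each n-gram sets one bit" idea can be sketched as follows. The signature width, the boundary padding, and the use of CRC32 as the bit-assignment hash are illustrative assumptions, not the paper's choices; the point is only that the filter admits false positives but never false negatives:

```python
from zlib import crc32

def ngrams(term, n=2):
    padded = f"#{term}#"                       # mark word boundaries
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def signature(grams, width=64):
    """Superimpose one bit per n-gram into a fixed-width term signature."""
    sig = 0
    for g in grams:
        sig |= 1 << (crc32(g.encode()) % width)
    return sig

def candidates(lexicon, query_grams, width=64):
    """Return terms whose signature covers every query bit.  Bit collisions
    can produce false positives, but matching terms are never missed."""
    qsig = signature(query_grams, width)
    return [t for t in lexicon if signature(ngrams(t), width) & qsig == qsig]

lexicon = ["wildcat", "wilderness", "dog", "inverted"]
print(candidates(lexicon, {"il", "ld"}))   # includes "wildcat" and "wilderness"
```

In a bit-sliced layout the same signatures are stored column-wise, so a query reads only the bit slices for its query n-grams rather than every term signature, which is where the speed comparison with inverted files comes from.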

4.
5.
The inverted file is the most popular indexing mechanism for document search in an information retrieval system. Compressing an inverted file can greatly improve the document search rate. Traditionally, the d-gap technique is used in inverted file compression, replacing document identifiers with usually much smaller gap values. However, fluctuating gap values cannot be efficiently compressed by some well-known prefix-free codes. To smooth and reduce the gap values, we propose a document-identifier reassignment algorithm based on a similarity factor between documents. We generate a reassignment order for all documents according to this similarity, so that documents with closer relationships are assigned closer identifiers. Simulation results show that the average gap values of sample inverted files can be reduced by 30%, and the compression rate of the d-gapped inverted file with prefix-free codes can be improved by 15%.
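The d-gap technique itself is simple enough to sketch: a sorted posting list of document identifiers is replaced by the differences between consecutive identifiers, which a prefix-free code then encodes cheaply when the gaps are small. This is a minimal illustration (with Elias gamma as an example prefix-free code); the identifier-reassignment algorithm, which is the paper's actual contribution, is not shown:

```python
def to_dgaps(doc_ids):
    """Replace each sorted document id by its gap from the previous one."""
    prev, gaps = 0, []
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def from_dgaps(gaps):
    """Recover the original identifiers by a running sum of the gaps."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

def elias_gamma(n):
    """A prefix-free code: small gaps get short codewords."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

posting = [3, 7, 8, 15]
gaps = to_dgaps(posting)                 # [3, 4, 1, 7]
print("".join(elias_gamma(g) for g in gaps))
assert from_dgaps(gaps) == posting
```

Reassigning closer identifiers to similar documents clusters them in the sorted posting lists, so the gaps above shrink and the gamma codewords get shorter, which is the compression gain the abstract reports.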

6.
Existing approaches to learning path recommendation for online learning communities mainly rely on the individual characteristics of users or the historical records of their learning processes, but pay less attention to the semantics of users' postings and the context. To facilitate knowledge understanding and personalized learning in online learning communities, it is necessary to conduct a fine-grained analysis of user data to capture users' dynamic learning characteristics and potential knowledge levels, so as to recommend appropriate learning paths. In this paper, we propose a fine-grained and multi-context-aware learning path recommendation model for online learning communities based on a knowledge graph. First, we design a multidimensional knowledge graph to solve the problem of monotonous and incomplete entity information presentation in single-layer knowledge graphs. Second, we use the topic preference features of users' postings to determine the starting point of learning paths. We then strengthen the distant relationships of knowledge in the global context using the multidimensional knowledge graph when generating and recommending learning paths. Finally, we build a user background similarity matrix to establish user connections in the local context, recommending users with similar knowledge levels and learning preferences together with their subsequent postings. Experiment results show that the proposed model can recommend appropriate learning paths for users, and that the recommended similar users and postings are effective.

7.
Rapid appraisal of damages related to hazard events is of importance to first responders, government agencies, insurance industries, and other private and public organizations. While satellite monitoring, ground-based sensor systems, inspections and other technologies provide data to inform post-disaster response, crowdsourcing through social media is an additional and novel data source. In this study, the use of social media data, principally Twitter postings, is investigated to make approximate but rapid early assessments of damages following a disaster. The goal is to explore the potential utility of using social media data for rapid damage assessment after sudden-onset hazard events and to identify insights related to potential challenges. This study defines a text-based damage assessment scale for earthquake damages, and then develops a text classification model for rapid damage assessment. Although the accuracy remains a challenge compared to ground-based instrumental readings and inspections, the proposed damage assessment model features rapidity with large amounts of data at spatial densities that exceed those of conventional sensor networks. The 2019 Ridgecrest, California earthquake sequence is investigated as a case study.

8.
In this work the applicability of magnetic bubble memories to the processing of inverted files is discussed. Four novel models of magnetic bubble memories are presented to demonstrate the storage structures and the data processing. The first model employs a major-minor loop organization; on the basis of this organization a uniform ladder is formed so that the data can be rearranged using four operations (global shift, detached shift, exchange and delta exchange). The second model makes use of the on-chip decoder (also known as the self-contained magnetic bubble-domain memory chip); this model relies on a hashing scheme to perform the required data operations. The third and fourth models are different combinations of the first two, and may provide relatively high-speed performance as well as reasonable system complexity. For each model, algorithms for data retrieval, sorting, deletion, insertion and updating are given. A comparison of the four models is also carried out in order to determine the most convenient magnetic bubble memory structure for the processing of inverted files.

9.
The vector space model (VSM) is a textual representation method that is widely used in document classification. However, it poses a space-efficiency challenge. One attempt to alleviate the space problem is to use dimensionality reduction techniques; however, such techniques have deficiencies, such as losing important information. In this paper, we propose a novel text classification method that uses neither the VSM nor dimensionality reduction techniques. The proposed method is a space-efficient method that utilizes a first-order Markov model for hierarchical Arabic text classification. For each category and sub-category, a Markov chain model is prepared based on neighboring character sequences. The prepared models are then used for scoring documents for classification purposes. For evaluation, we used a hierarchical Arabic text data collection containing 11,191 documents belonging to eight topics distributed over three levels. The experimental results show that the Markov chain based method significantly outperforms the baseline system that employs the latent semantic indexing (LSI) method; the proposed method enhances the F1-measure by 3.47%. The novelty of this work lies in the idea of decomposing words into sequences of characters, which we found to be a promising approach in terms of space and accuracy. To the best of our knowledge, this is the first attempt to conduct research on hierarchical Arabic text classification with such a relatively large data collection.
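The character-sequence idea can be sketched as a first-order Markov scorer: train one transition model per category, score a document by the log-likelihood of its character bigrams, and pick the best-scoring category. The Latin-alphabet toy data and the fixed floor probability for unseen transitions are illustrative assumptions; the paper targets Arabic and has its own smoothing details:

```python
from collections import defaultdict
from math import log

UNSEEN = log(1e-6)   # floor log-probability for unseen transitions (assumption)

def train(texts):
    """Per-category first-order model: log P(next char | current char)."""
    counts = defaultdict(lambda: defaultdict(int))
    for t in texts:
        for a, b in zip(t, t[1:]):
            counts[a][b] += 1
    return {a: {b: log(c / sum(nxt.values())) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def score(model, text):
    """Log-likelihood of the document's character sequence under the model."""
    return sum(model.get(a, {}).get(b, UNSEEN) for a, b in zip(text, text[1:]))

def classify(models, text):
    """Assign the category whose Markov chain scores the document highest."""
    return max(models, key=lambda cat: score(models[cat], text))

models = {"en": train(["the cat sat on the mat"]),
          "zz": train(["zq zq zq zq"])}
print(classify(models, "that"))   # prints "en"
```

The space claim follows from the model size: each category needs only a character-by-character transition table, independent of vocabulary size, rather than a VSM vector over all terms.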

10.
This paper presents a systematic analysis of twenty-four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. The analysis then concentrates on the types of changes to a confusion matrix that do not change a measure and therefore preserve a classifier's evaluation (measure invariance). The result is a taxonomy of measure invariance with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where the invariance properties of measures lead to a more reliable evaluation of classifiers. Several case studies in text classification supplement the discussion.
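Measure invariance can be made concrete with a toy binary confusion matrix: some measures are unchanged when true negatives are added, others are not. The counts below are invented for illustration and this is not the paper's taxonomy, only one instance of the property it studies:

```python
def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

base = dict(tp=40, fp=10, fn=5, tn=45)
shifted = dict(base, tn=base["tn"] + 900)   # add 900 true negatives

# precision and recall never read tn: invariant under this change
assert precision(**base) == precision(**shifted)
assert recall(**base) == recall(**shifted)

# accuracy reads tn, so it is not invariant: 0.85 vs 0.985
print(accuracy(**base), accuracy(**shifted))
```

This is the practical payoff of the taxonomy: on a class-imbalanced problem, a tn-sensitive measure like accuracy can be inflated by the majority class while precision and recall report the same classifier quality.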

11.
Detection at an early stage is vital for the diagnosis of the majority of critical illnesses, and the same holds for identifying people suffering from depression. A number of studies have successfully identified depressed persons based on their social media postings. However, an unexpected bias has been observed in these studies, which can be due to various factors such as unequal data distribution. In this paper, the imbalance found in participation across age groups and demographics is normalized using a one-shot decision approach. Further, we present an ensemble model combining SVM and KNN with intrinsic explainability, in conjunction with noisy label correction approaches, offering an innovative solution to the problem of distinguishing between depression symptoms and suicidal ideas. We achieved a final classification accuracy of 98.05%, with the proposed ensemble model ensuring that the data classification is not biased in any manner.

12.
余嘉 (Yu Jia), 《大众科技》 (Popular Science & Technology), 2012, (7): 291-292, 290
University school-level (second-tier) financial accounting archives are both important vouchers documenting economic transactions and an important component of university archives management. As the reform of the two-level financial management system in universities ("unified leadership, hierarchical management, centralized accounting") continues to deepen, new situations and new problems keep emerging in accounting archives work. With a view to managing second-tier financial accounting archives well, this paper proposes several countermeasures for improving their management in universities.

13.
This paper describes a computer program which converts the text of user input and system responses from an on-line search system into fixed-format records which describe the interaction. It also outlines the syntax of the query language and the format of the output record produced by the parser. The study discusses problems in constructing the parser, the logic of the parser and its performance characteristics, as well as recommendations for improving the process of logging on-line searches.

14.
高茜 (Gao Qian), 《科教文汇》, 2011, (3): 182-183
Photo archives are a special carrier form: accompanied by written captions, they can record the whole course of an event fairly intuitively and vividly reproduce the original state of affairs, the on-site environment, and the historical appearance. 2005 was the year Foshan Television stepped onto a brand-new development platform, with its self-produced programs and sponsored events increasingly showing a trend toward professionalization, diversification, and keeping up with the times. To record, step by step, the station's gradual maturation, the establishment and management of photo archives has taken on a very important role: it provides a vivid, intuitive record of the station's development and will be of great reference value in future work.

15.
Real-world datasets often present different types of data quality problems, such as the presence of outliers, missing values, inaccurate representations and duplicate entities. In order to identify duplicate entities, a task named Entity Resolution (ER), we may employ a variety of classification techniques. Rule-based classification techniques have gained increasing attention in the state of the art due to the possibility of incorporating automatic learning approaches for generating Rule-Based Entity Resolution (RbER) algorithms. However, these algorithms present a series of drawbacks: i) the generation of high-quality RbER algorithms usually requires high computational and/or manual labeling costs; ii) the impossibility of tuning RbER algorithm parameters; iii) the inability to incorporate user preferences regarding the ER results into the algorithm's functioning; and iv) the logical (binary) nature of RbER algorithms usually falls short when tackling special cases, i.e., challenging duplicate and non-duplicate pairs of entities. To overcome these drawbacks, we propose Rule Assembler, a configurable approach that classifies duplicate entities based on confidence scores produced by logical rules, taking into account tunable parameters as well as user preferences. Experiments carried out using both real-world and synthetic datasets have demonstrated the ability of the proposed approach to enhance the results produced by baseline RbER algorithms and basic assembling approaches. Furthermore, we demonstrate that the proposed approach does not entail a significant overhead over the classification step and conclude that the Rule Assembler parameters APA, WPA, TβM and Max are more suitable for use in practical scenarios.

16.
A prefix trie index (originally called trie hashing) is applied to the problem of providing fast search times, fast load times and fast update properties in a bibliographic or full text retrieval system. For all but the largest dictionaries, a single key search in the dictionary under trie hashing takes exactly one disk read. Front compression of search keys is used to enhance performance. Partial combining of the postings into the dictionary is analyzed as a method to give both faster retrieval and improved update properties for the trie hashing inverted file. Statistics are given for a test database consisting of an online catalog at the Graduate School of Library and Information Science Library of the University of Western Ontario. The effect of changing various parameters of prefix tries is tested in this application.
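The dictionary structure underlying trie hashing can be sketched as a minimal in-memory prefix trie; the paper's disk layout, front compression of keys, and combined postings are not modeled here, only the key lookup that shares common prefixes between terms:

```python
class Trie:
    """Minimal prefix trie: one node per character, keys marked explicitly."""

    def __init__(self):
        self.children = {}
        self.is_key = False

    def insert(self, key):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.is_key = True

    def search(self, key):
        node = self
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_key

trie = Trie()
for term in ["data", "database", "datum"]:
    trie.insert(term)

print(trie.search("database"), trie.search("dat"))  # True False
```

Because "data", "database" and "datum" share the prefix "dat", it is stored once; packing such shared-prefix subtrees into disk pages is what lets a single key search cost one disk read in the scheme described above.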

17.
A bidirectional delta file, a novel concept introduced in this paper, is a two-way delta file. Previous work focuses on one-way differential compression, producing so-called forwards and backwards delta files. Here we suggest combining them efficiently into a single file, so that the combined file is smaller than the two individual files together. Given the bidirectional delta file of two files S and T and the original file S, one can decode it to produce T; the same bidirectional delta file is used together with the file T to reconstruct S. This paper presents two main strategies for producing a bidirectional delta file that is efficient in terms of the memory storage it requires: a quadratic-time, optimal, dynamic programming algorithm, and a linear-time, greedy algorithm. Although the dynamic programming algorithm often produces better results than the greedy algorithm, it is impractical for large files and is used only for theoretical comparisons. Experiments comparing the implemented algorithms with the traditional way of using both forwards and backwards delta files are presented, measuring processing time and compression performance. These experiments show memory storage savings of about 25% for the bidirectional delta approach compared to the compressed delta file constructed in the traditional way, while preserving approximately the same processing time for decoding.

18.
We report on the design and construction of features of an automated query system which will assist pharmacologists who are not information specialists to access the Derwent Drug File (DDF) pharmacological database. Our approach was to first elucidate those search skills of the search intermediary which might prove tractable to automation. Modules were then produced which assist in the three important subtasks of search statement generation, namely vocabulary selection, the choice of context indicators and query reformulation. Vocabulary selection is facilitated by approximate string matching, morphological analysis, browsing and menu searching. The context of the study, such as treatment or metabolism, is determined using a system of advisory menus. The task of query reformulation is performed using user feedback on retrieved documents, thesaurus relations between document index terms and term postings data. Use is made of diverse information sources, including electronic forms of printed search aids, a thesaurus and a medical dictionary. The system will be of use both to semicasual users and experienced intermediaries. Many of the ideas developed should prove transportable to domains other than pharmacology: the techniques for thesaurus manipulation are designed for use with any hierarchical thesaurus.

19.
Objective: To explore the development and use of an interface for electronic medical record archives. Methods: An interface was designed to link the medical record statistics management system with the HIS (hospital information system), supporting the development and use of medical record archives. Results: The interface greatly saved time and improved work efficiency and accuracy. Conclusion: Transferring medical record information through the interface avoids duplicated work and achieves resource sharing. The edited, analyzed and summarized data, made available on the hospital local area network, provides convenient and fast services for hospital management, clinical diagnosis and treatment, teaching, research, and medical record archive management.

20.
Summarisation is traditionally used to produce summaries of the textual contents of documents. In this paper, it is argued that summarisation methods can also be applied to the logical structure of XML documents. Structure summarisation selects the most important elements of the logical structure and ensures that the user's attention is focused towards sections, subsections, etc. that are believed to be of particular interest. Structure summaries are shown to users as hierarchical tables of contents. This paper discusses methods for structure summarisation that use various features of XML elements in order to select document portions that a user's attention should be focused to. An evaluation methodology for structure summarisation is also introduced and summarisation results using various summariser versions are presented and compared to one another. We show that data sets used in information retrieval evaluation can be used effectively in order to produce high quality (query independent) structure summaries. We also discuss the choice and effectiveness of particular summariser features with respect to several evaluation measures.
