Similar Documents
20 similar documents found.
1.
Distributed memory information retrieval systems have been used to manage the vast volume of documents in an information retrieval system and to improve query response time. However, proper allocation of documents plays an important role in the performance of such systems. Parallelism is maximised by distributing the documents across nodes, while inter-node communication cost is minimised by avoiding distribution; unfortunately, these two objectives conflict. Finding an allocation that satisfies both objectives is referred to as the distributed memory document allocation problem (DDAP), which is NP-complete. Heuristic algorithms are usually employed to find a near-optimal solution, and the genetic algorithm is one such heuristic. In this paper, a genetic algorithm is developed to find an optimal document allocation for DDAP. Several well-known network topologies are investigated to evaluate the performance of the algorithm. The approach relies on the documents of the information retrieval system having been clustered by some arbitrary method; the advantages of clustered documents, especially in a distributed memory information retrieval system, are well known. Since genetic algorithms work with a set of candidate solutions, parallelisation based on a Single Instruction Multiple Data (SIMD) paradigm is a natural way to obtain a speedup. Under this approach, the population of strings is distributed among the processing elements and each string is processed independently. The performance gain comes from executing the strings in parallel and hence depends heavily on the population size. The approach suits genetic-algorithm applications where the parameter set for a particular run is known in advance and a large population is required to solve the problem; DDAP fits these requirements nicely. The aim of the parallelisation is two-fold: first, to speed up the allocation process in DDAP, which usually involves thousands of documents and requires a large population; and second, to serve as an attempt to port the genetic algorithm's processes to SIMD machines.
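As a concrete illustration of the kind of genetic algorithm this abstract describes, here is a minimal Python sketch that evolves an assignment of clustered documents to nodes, trading load balance (parallelism) against splitting clusters across nodes (communication). The chromosome encoding, the fitness weighting alpha, and the operators are illustrative assumptions, not the paper's actual design.

```python
import random

def fitness(assign, clusters, n_nodes, alpha=0.5):
    """Higher is better: balanced load plus clusters kept together (toy objective)."""
    load = [0] * n_nodes
    for node in assign:
        load[node] += 1
    balance = 1.0 - (max(load) - min(load)) / max(1, len(assign))
    # Fraction of same-cluster document pairs placed on the same node.
    together, pairs = 0, 0
    for cluster in clusters:
        for i in range(len(cluster)):
            for j in range(i + 1, len(cluster)):
                pairs += 1
                together += assign[cluster[i]] == assign[cluster[j]]
    locality = together / max(1, pairs)
    return alpha * balance + (1 - alpha) * locality

def evolve(n_docs, clusters, n_nodes, pop=50, gens=200):
    """Chromosome: one node id per document; elitist truncation selection."""
    popn = [[random.randrange(n_nodes) for _ in range(n_docs)] for _ in range(pop)]
    for _ in range(gens):
        popn.sort(key=lambda a: -fitness(a, clusters, n_nodes))
        next_gen = popn[: pop // 2]                      # keep the better half
        while len(next_gen) < pop:
            p1, p2 = random.sample(popn[: pop // 2], 2)
            cut = random.randrange(n_docs)               # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < 0.1:                    # point mutation
                child[random.randrange(n_docs)] = random.randrange(n_nodes)
            next_gen.append(child)
        popn = next_gen
    return max(popn, key=lambda a: fitness(a, clusters, n_nodes))
```

In a SIMD parallelisation as sketched in the abstract, each processing element would evaluate one string of this population independently.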

2.
The document-management functions of current government office automation systems suffer from widespread problems, of which the bottleneck, and the most fundamental, is the official document workflow. Current government document workflows exhibit fragmented functionality, relative isolation, and a split between document and records management. Office automation systems built on such workflows ignore the archival requirement for long-term preservation of electronic official documents, so the preserved electronic documents will be unable to serve their proper evidentiary role as records in the future. Solving this problem requires studying and designing an electronic official document workflow at the national level, realising integrated whole-process management of documents and records, unifying the document management regime, and updating the development model for office automation systems.

3.
The WikiLeaks disclosure of classified US government documents exposed problems in the management of shared classified electronic records: weak enforcement of rules, an excessively large population of cleared personnel, inadequate supervision of those personnel, and over-broad classification of documents. The security of classified government electronic records should be addressed on several fronts at once: strengthening the technical safeguards of sharing systems, improving the enforcement of rules and regulations, managing cleared personnel throughout the whole process, and increasing the transparency of government administration.

4.
Fusion Via a Linear Combination of Scores (total citations: 9; self-citations: 2; citations by others: 7)
We present a thorough analysis of the capabilities of the linear combination (LC) model for fusion of information retrieval systems. The LC model combines the results lists of multiple IR systems by scoring each document using a weighted sum of the scores from each of the component systems. We first present both empirical and analytical justification for the hypotheses that such a model should only be used when the systems involved have high performance, a large overlap of relevant documents, and a small overlap of nonrelevant documents. The empirical approach allows us to very accurately predict the performance of a combined system. We also derive a formula for a theoretically optimal weighting scheme for combining 2 systems. We introduce d—the difference between the average score on relevant documents and the average score on nonrelevant documents—as a performance measure which not only allows mathematical reasoning about system performance, but also allows the selection of weights which generalize well to new documents. We describe a number of experiments involving large numbers of different IR systems which support these findings.  相似文献   
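A minimal sketch of the LC model and the d measure as described above; the fixed weights in the usage example are hand-picked for illustration rather than derived from the paper's optimal-weighting formula.

```python
def lc_fuse(runs, weights):
    """Fuse per-system score dictionaries {doc_id: score} by weighted sum."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, score in run.items():
            fused[doc] = fused.get(doc, 0.0) + w * score
    return sorted(fused.items(), key=lambda x: -x[1])

def d_measure(scores, relevant):
    """d = mean score on relevant docs minus mean score on nonrelevant docs."""
    rel = [s for doc, s in scores.items() if doc in relevant]
    non = [s for doc, s in scores.items() if doc not in relevant]
    return sum(rel) / len(rel) - sum(non) / len(non)

# Usage with two toy systems and illustrative weights 0.7 / 0.3.
sys_a = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
sys_b = {"d1": 0.5, "d3": 0.8}
print(lc_fuse([sys_a, sys_b], [0.7, 0.3]))
print(d_measure(sys_a, relevant={"d1"}))
```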

5.
The retrieval of documents that originate from digitized and OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been the subject of research for several years. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is, however, important that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results show that expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed carefully and in an appropriate way.
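To make the random-experiment view concrete, here is an assumption-laden toy simulation, not the paper's model: each word is independently lost to a recognition error with a fixed probability, and a document still matches if any query term survives, so longer documents tolerate more errors.

```python
import random

def simulate_ocr_retrieval(docs, query, err=0.05, trials=100):
    """Fraction of (trial, document) pairs where the document still matches
    the query after each word is independently dropped (misrecognized)
    with probability err. docs: list of strings; query: list of terms."""
    random.seed(0)
    hits = 0
    for _ in range(trials):
        for doc in docs:
            surviving = [w for w in doc.split() if random.random() > err]
            if any(q in surviving for q in query):
                hits += 1
    return hits / (trials * len(docs))
```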

6.
Many federal agencies face challenges in designing geospatial data management systems. This paper presents and documents a needs-assessment process that can be employed to prioritize agencies’ geospatial information needs; identify agencies’ capacity to manage a centralized geodatabase; determine agencies’ capacity to deliver Web-mapping services to the public; and identify barriers, such as data security and limited financial resources, that constrain agencies’ ability to design and manage a geospatial data management system. The paper details the needs-assessment process and documents its application to the National Park Service (NPS) Conservation and Outdoor Recreation (COR) Branch programs. The NPS COR Branch comprises nine disparate programs, such as the National Trails System and the Rivers, Trails, and Conservation Assistance program, each of which has specific geospatial data management and delivery needs. The needs-assessment process, tested through its application to the NPS COR Branch programs, provides a comprehensive and logical workflow for system developers and administrators to use as they create or refine geospatial data management systems.

7.
Storage of growing collections is an ongoing problem for libraries. Past attempts at using the industrial solution of automated storage and retrieval systems (AS/RS) ended in failure. However, improvements in these mechanisms, especially computer control and the ability to interface with online library catalogs, make them a viable option for libraries. Questions remain about the appropriateness of treating intellectual material like industrial parts. In addition, access is still an issue, especially in regard to government depository documents. A literature review shows that while there is a tremendous amount of research available on the design of AS/RS, little is written about its application in libraries.

8.
Community support systems (community platforms), which provide a rich communication medium for work or interest groups, are gaining increasing attention in application areas ranging from leisure and customer support to ‘knowledge management’. One of these application areas is the support of teaching and research activities in universities. In this article we first identify possibilities for community platforms in universities and present some applications at Technische Universität München (TUM). Based on the current situation at TUM, we argue that a key feature of future community platforms must be interoperability, and concentrate on how to provide interoperability in general and how we are doing so in the environment at TUM. In particular, we focus on service-independent identity management as one central aspect of interoperability.

9.
Document theory is the least explored area of study about documents. It lags significantly behind applied document research, which summarizes the document-processing practices that have accumulated over thousands of years. This problem has recently been complicated by the rise of so-called general document theories. The boundaries of the document concept have become blurred due to the development of parallel areas of study and their forced differentiation into “classic” and “library” document science. In addition, knowledge is being developed about objects referred to as documents that can be neither properly integrated nor applied in practice. This situation is mainly due to the lack of attention paid by document scientists to the theoretical and methodological issues of document science. This paper reviews the origins, nature, and social roles of documents from the perspective of a synergetic paradigm, with the goal of constructing a synergetic document theory.

10.
The most common approach to measuring the effectiveness of Information Retrieval systems is to use test collections. The Contextual Suggestion (CS) TREC track provides an evaluation framework for systems that recommend items to users given their geographical context. The specific nature of this track allows the participating teams to identify candidate documents either from the Open Web or from the ClueWeb12 collection, a static version of the web. In the judging pool, documents from the Open Web and the ClueWeb12 collection are distinguished; hence, each system submission should be based on only one resource, either the Open Web (identified by URLs) or ClueWeb12 (identified by ids). For reproducibility, ranking web pages from ClueWeb12 should be the preferred method for scientific evaluation of CS systems, but it has been found that systems that build their suggestion algorithms on input taken from the Open Web consistently achieve higher effectiveness. Because most of the systems take a rather similar approach to making contextual suggestions, this raises the question of whether systems built by researchers on top of ClueWeb12 are still representative of those that would work directly on industry-strength web search engines. Do we need to sacrifice reproducibility for the sake of representativeness? We study the difference in effectiveness between Open Web systems and ClueWeb12 systems by analyzing the relevance assessments of documents identified from both sources. We then identify documents that overlap between the relevance assessments of the Open Web and ClueWeb12, observing a dependency between relevance assessments and whether the document was taken from the Open Web or from ClueWeb12. After that, we identify documents from the Open Web relevance assessments that exist in the ClueWeb12 collection but not in the ClueWeb12 relevance assessments, and use these documents to expand the ClueWeb12 relevance assessments. Our main findings are twofold. First, our empirical analysis of the relevance assessments of two years of the CS track shows that Open Web documents receive better ratings than ClueWeb12 documents, especially among the documents in the overlap. Second, our approach for selecting candidate documents from the ClueWeb12 collection based on information obtained from the Open Web takes a step towards partially bridging the gap in effectiveness between Open Web and ClueWeb12 systems, while at the same time achieving reproducible results on a well-known representative sample of the web.

11.
This paper presents a Graph Inference retrieval model that integrates structured knowledge resources, statistical information retrieval methods, and inference in a unified framework. Key components of the model are a graph-based representation of the corpus and retrieval driven by an inference mechanism realised as a traversal over the graph. The model is proposed to tackle the semantic gap problem: the mismatch between the raw data and the way a human being interprets it. We break the semantic gap problem down into five core issues, each requiring a specific type of inference to be overcome. Our model is evaluated in the medical domain because search within this domain is particularly challenging and, as we show, often requires inference. In addition, this domain features both structured knowledge resources and unstructured text. Our evaluation shows that inference can be effective, retrieving many new relevant documents that are not retrieved by state-of-the-art information retrieval models. We show that many retrieved documents were not pooled by keyword-based search methods, prompting us to perform additional relevance assessment on these new documents. A third of the newly retrieved documents judged were found to be relevant. Our analysis provides a thorough understanding of when and how to apply inference for retrieval, including a categorisation of queries according to the effect of inference. The inference mechanism promoted recall by retrieving new relevant documents not found by previous keyword-based approaches. In addition, it promoted precision through an effective reranking of documents. When inference is used, performance gains can generally be expected on hard queries. However, inference should not be applied universally: for easy, unambiguous queries and queries with few relevant documents, inference adversely affected effectiveness. These conclusions reflect the fact that for retrieval as inference to be effective, a careful balancing act is involved. Finally, although the Graph Inference model is developed and applied to medical search, it is a general retrieval model applicable to other areas such as web search, where an emerging research trend is to utilise structured knowledge resources for more effective semantic search.
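As a toy picture of retrieval driven by graph traversal, the sketch below propagates query weight over a concept graph with exponential decay per hop; the graph representation, decay factor, and scoring are illustrative assumptions, not the actual Graph Inference model.

```python
from collections import defaultdict

def traverse_score(graph, doc_concepts, query_concepts, decay=0.5, max_hops=2):
    """Score documents by propagating query weight over concept edges.

    graph: {concept: [related concepts]}; doc_concepts: {doc: set(concepts)}.
    A concept first reached in k hops contributes decay**k, so documents
    mentioning only related concepts can still be retrieved (inference).
    """
    weight = {c: 1.0 for c in query_concepts}
    frontier = set(query_concepts)
    for hop in range(1, max_hops + 1):
        nxt = set()
        for c in frontier:
            for nb in graph.get(c, []):
                if nb not in weight:           # keep the earliest (highest) weight
                    weight[nb] = decay ** hop
                    nxt.add(nb)
        frontier = nxt
    scores = defaultdict(float)
    for doc, concepts in doc_concepts.items():
        for c in concepts & weight.keys():
            scores[doc] += weight[c]
    return sorted(scores.items(), key=lambda x: -x[1])
```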

12.
Patent prior art search is a type of search in the patent domain in which documents are sought that describe work previously carried out related to a patent application. The goal of this search is to check whether the idea in the patent application is novel. Vocabulary mismatch is one of the main problems of patent retrieval and results in low retrievability of similar documents for a given patent application. In this paper we show how the term distribution of the cited documents in an initially retrieved ranked list can be used to address the vocabulary mismatch. We propose a query-model estimation method that utilizes the citation links in a pseudo-relevance feedback set. We first build a topic-dependent citation graph, starting from the initially retrieved set of feedback documents and following the citation links of feedback documents to expand the set. We identify the important documents in the topic-dependent citation graph using a citation analysis measure. We then use the term distribution of the documents in the citation graph to estimate a query model by identifying the distinguishing terms and their respective weights, and use these terms to expand our original query. We use the CLEF-IP 2011 collection to evaluate the effectiveness of our query modeling approach for prior art search, and also study the influence of different parameters on the performance of the proposed method. The experimental results demonstrate that the proposed approach significantly improves recall over a state-of-the-art baseline which uses the link-based structure of the citation graph but not the term distribution of the cited documents.
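A schematic sketch of the pipeline just described: feedback documents are expanded with their citations, in-degree inside the topic-dependent graph stands in for the paper's citation-analysis measure, and the query is expanded from the term distribution of the important documents. All names and the top-5/top-k cutoffs are illustrative assumptions.

```python
from collections import Counter

def expand_query(query_terms, feedback_docs, citations, doc_text, top_k=10):
    """Expand a patent query using terms from important cited documents.

    feedback_docs: initially retrieved doc ids; citations: {doc: [cited docs]};
    doc_text: {doc: list of tokens}.
    """
    # 1. Build the topic-dependent citation graph from the feedback set.
    nodes = set(feedback_docs)
    for d in feedback_docs:
        nodes.update(citations.get(d, []))
    # 2. Rank documents by in-degree within the graph (toy importance measure).
    indeg = Counter(c for d in nodes for c in citations.get(d, []) if c in nodes)
    important = [d for d, _ in indeg.most_common(5)]
    # 3. Estimate a query model from their term distribution.
    dist = Counter()
    for d in important:
        dist.update(doc_text.get(d, []))
    expansion = [t for t, _ in dist.most_common(top_k) if t not in query_terms]
    return list(query_terms) + expansion
```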

13.
Given a user question, the goal of a Question Answering (QA) system is to retrieve answers rather than full documents or even best-matching passages, as most Information Retrieval systems currently do. In this paper, we present BRUJA, a QA system for the management of multilingual collections. BRUJA works with three languages (English, Spanish and French). The BRUJA architecture is not composed of three monolingual QA systems; instead it uses English as an interlingua for common QA tasks such as question classification and answer extraction. In addition, BRUJA uses Cross-Language Information Retrieval (CLIR) techniques to retrieve relevant documents from a multilingual collection. On the one hand, we have more documents from which to find answers; on the other hand, we introduce noise into the system through translations to the interlingua (English) and through the CLIR module. The question is whether the difficulty of managing three languages is worth it, or whether a monolingual QA system delivers better results. We report on in-depth experimentation and demonstrate that our multilingual QA system achieves better results than its monolingual counterpart whenever it uses good translation resources and, especially, state-of-the-art CLIR techniques.

14.
The problem of finding documents written in a language that the searcher cannot read is perhaps the most challenging application of cross-language information retrieval technology. In interactive applications, that task involves at least two steps: (1) the machine locates promising documents in a collection that is larger than the searcher could scan, and (2) the searcher recognizes documents relevant to their intended use from among those nominated by the machine. This article presents the results of experiments designed to explore three techniques for supporting interactive relevance assessment: (1) full machine translation, (2) rapid term-by-term translation, and (3) focused phrase translation. Machine translation was found to better support this task than term-by-term translation, and focused phrase translation further improved recall without an adverse effect on precision. The article concludes with an assessment of the strengths and weaknesses of the evaluation framework used in this study and some remarks on implications of these results for future evaluation campaigns.

15.
A Comparative Analysis of Thesis Submission and Publishing Systems (total citations: 4; self-citations: 0; citations by others: 4)
Addressing the problem of selecting thesis submission and publishing software for university libraries in a networked environment, this paper introduces four thesis submission and publishing systems that have passed CALIS certification and compares them with respect to thesis submission, thesis review and cataloguing, standardized document production, thesis publishing and retrieval, and retrospective conversion, in order to provide a reference for university libraries purchasing such software.

16.
In this paper we look at some of the problems in interacting with best-match retrieval systems. In particular, we examine the areas of interaction, some investigations of the complexity and breadth of interaction, and attempts to categorise users' information-seeking behaviour. We suggest that one of the difficulties traditional IR systems have in supporting information seeking is the way the information content of documents is represented. We discuss an alternative representation, based on how information is used within documents.

17.
In recent years, electronic records in China have developed and flourished, but they must also change as the broader environment changes. The emergence and application of cloud computing bring new opportunities and challenges to electronic records and make their development dynamic. Within this process, the data continuity of electronic records, the transition to single-set (electronic-only) recordkeeping, and security are the focus of this paper. The changes in electronic records themselves are also very evident; how to adapt to the direction of electronic records development in the cloud computing environment and to make corresponding improvements and innovations is a question that the future development of electronic records must consider.

18.
Egghe and Proot [Egghe, L., & Proot, G. (2007). The estimation of the number of lost multi-copy documents: A new type of informetrics theory. Journal of Informetrics] introduce a simple probabilistic model for estimating the number of lost multi-copy documents from the numbers of retrieved ones. We show that in practice their model can essentially be described by the well-known Poisson approximation to the binomial. This enables us to adopt a traditional maximum likelihood estimation (MLE) approach, which allows the construction of (approximate) confidence intervals for the parameters of interest, thereby resolving an open problem left by the authors. We further show that the general estimation problem is a variant of a well-known unseen species problem. This work should be viewed as supplementing that of Egghe and Proot; it turns out that their results are broadly in line with those produced by this rather more robust statistical analysis.
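For a concrete picture under standard assumptions (not necessarily the authors' exact notation): if the number of surviving copies per edition is approximately Poisson and only editions with at least one surviving copy are observed, the problem becomes zero-truncated Poisson estimation.

```latex
% X = surviving copies of an edition, X ~ Poisson(\lambda); only X >= 1 observed.
\[
P(X = k \mid X \ge 1) \;=\; \frac{e^{-\lambda}\,\lambda^{k}/k!}{1 - e^{-\lambda}},
\qquad k = 1, 2, \dots
\]
% With n observed editions carrying k_1, ..., k_n copies, the MLE solves
\[
\frac{\hat\lambda}{1 - e^{-\hat\lambda}} \;=\; \frac{1}{n}\sum_{i=1}^{n} k_i,
\qquad\text{and}\qquad
\hat N_{\mathrm{lost}} \;=\; n \cdot \frac{e^{-\hat\lambda}}{1 - e^{-\hat\lambda}}.
\]
```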

19.
Most recent document standards like XML rely on structured representations. Current information retrieval systems, on the other hand, have been developed for flat document representations and cannot easily be extended to cope with more complex document types; the design of such systems is still an open problem. We present a new model for structured document retrieval which allows computing scores of document parts. This model is based on Bayesian networks whose conditional probabilities are learnt from a labelled collection of structured documents, composed of documents, queries and their associated assessments. Training these models is a complex, non-standard machine learning task, and it is the focus of the paper: we propose to train the structured Bayesian network model using a cross-entropy training criterion. Results are presented on the INEX corpus of XML documents.
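As a loose illustration of cross-entropy training for a document-part scorer, the toy below fits a logistic model to labelled (features, relevance) pairs; the logistic form is a stand-in assumption and far simpler than the paper's Bayesian-network conditional probabilities.

```python
import math
import random

def train_element_scorer(examples, dim, lr=0.1, epochs=100):
    """Toy cross-entropy training of a relevance scorer for document parts.

    examples: list of (feature_vector, label) pairs with label in {0, 1},
    e.g. features of an XML element for a given query.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(examples)
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            # Gradient step on cross-entropy  -[y log p + (1 - y) log(1 - p)].
            for i in range(dim):
                w[i] -= lr * (p - y) * x[i]
    return w
```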

20.
Entity extraction is critical to intelligent applications across diverse domains. Nevertheless, a challenge to its effectiveness arises from data imbalance: certain entities are common while others are scarce. To address this issue, this study proposes a novel text generation approach that harnesses Zipf's law, a powerful tool from informetrics for studying human language. Using characteristics of Zipf's law, words within the documents are classified as common or rare. Sentences are then classified as common or rare accordingly and processed by text generation models. Rare entities within the generated sentences are labeled using human-designed rules and serve as a supplement to the raw dataset, thereby mitigating the imbalance problem. The study presents a case of extracting entities from technical documents, and extensive experimental results on two datasets demonstrate the effectiveness of the proposed method. Furthermore, the significance and potential of Zipf's law in driving the progress of artificial intelligence (AI) is discussed, broadening the scope and coverage of informetrics. By incorporating the foundational principles of informetrics into text generation, this study showcases the pivotal role of informetrics in shaping the design and development of AI systems.
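An illustrative sketch of a Zipf-based split: rank words by frequency and treat the head of the rank-frequency curve as common and the tail as rare. The head_fraction cutoff and the sentence rule are placeholder assumptions, not the paper's criteria.

```python
from collections import Counter

def split_by_zipf(documents, head_fraction=0.2):
    """Classify words as common/rare from their rank in the frequency table.

    Under Zipf's law, frequency is roughly proportional to 1/rank, so a small
    head of the ranking covers most tokens; head_fraction is an illustrative
    cutoff. documents: list of whitespace-tokenizable strings.
    """
    freq = Counter(w for doc in documents for w in doc.split())
    ranked = [w for w, _ in freq.most_common()]
    head = set(ranked[: max(1, int(len(ranked) * head_fraction))])
    return head, set(ranked) - head          # (common words, rare words)

def classify_sentence(sentence, rare_words):
    """A sentence counts as rare if it contains any rare word."""
    return "rare" if any(w in rare_words for w in sentence.split()) else "common"
```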
