20 similar documents found (search time: 31 ms)
Generalized Hamming Distance   Total citations: 4, self-citations: 0, citations by others: 4
Many problems in information retrieval and related fields depend on a reliable measure of the distance or similarity between objects that, most frequently, are represented as vectors. This paper considers vectors of bits. Such data structures implement entities as diverse as bitmaps that indicate the occurrences of terms and bitstrings indicating the presence of edges in images. For such applications, a popular distance measure is the Hamming distance. The value of the Hamming distance for information retrieval applications is limited by the fact that it counts only exact matches, whereas in information retrieval, corresponding bits that are close by can still be considered to be almost identical. We define a Generalized Hamming distance that extends the Hamming concept to give partial credit for near misses, and suggest a dynamic programming algorithm that permits it to be computed efficiently. We envision many uses for such a measure. In this paper we define and prove some basic properties of the Generalized Hamming distance, and illustrate its use in the area of object recognition. We evaluate our implementation in a series of experiments, using autonomous robots to test the measure's effectiveness in relating similar bitstrings.
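
The contrast between exact-match counting and partial credit for near misses can be illustrated with a small sketch. The abstract does not give the paper's definition, so the second function below is only one plausible formulation, assumed for illustration: a dynamic program over the positions of the 1-bits in which a 1-bit may be deleted, inserted, or shifted at a cost proportional to the distance moved. The function names and the cost parameters `c_ins`, `c_del` and `c_shift` are hypothetical.

```python
def hamming(x: str, y: str) -> int:
    """Classic Hamming distance: number of positions where the bits differ."""
    if len(x) != len(y):
        raise ValueError("bitstrings must have equal length")
    return sum(a != b for a, b in zip(x, y))


def generalized_hamming(x: str, y: str,
                        c_ins: float = 1.0,
                        c_del: float = 1.0,
                        c_shift: float = 0.25) -> float:
    """Illustrative near-miss distance (an assumed formulation, not the paper's).

    Transforms the 1-bits of x into the 1-bits of y by deleting a 1-bit,
    inserting a 1-bit, or shifting a 1-bit by k positions at cost c_shift * k.
    """
    p = [i for i, bit in enumerate(x) if bit == "1"]  # positions of 1-bits in x
    q = [j for j, bit in enumerate(y) if bit == "1"]  # positions of 1-bits in y
    m, n = len(p), len(q)

    # g[i][j] = cheapest way to turn the first i ones of x into the first j ones of y
    g = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        g[i][0] = i * c_del
    for j in range(1, n + 1):
        g[0][j] = j * c_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            g[i][j] = min(
                g[i - 1][j] + c_del,                                    # delete a 1-bit of x
                g[i][j - 1] + c_ins,                                    # insert a 1-bit of y
                g[i - 1][j - 1] + c_shift * abs(p[i - 1] - q[j - 1]),   # shift a 1-bit
            )
    return g[m][n]
```

With these assumed costs, the bitstrings `0100` and `0010` are at Hamming distance 2 but at generalized distance 0.25, because the single 1-bit only needs to move one position.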

Text Categorization (TC) is the automated assignment of text documents to predefined categories based on document contents. TC has been an application for many learning approaches, which have proved effective. Nevertheless, TC poses many challenges to machine learning. In this paper, we suggest, for text categorization, the integration of external WordNet lexical information to supplement training data for a semi-supervised clustering algorithm which can learn from both training and test documents to classify new unseen documents. This algorithm is the Semi-Supervised Fuzzy c-Means (ssFCM). Our experiments use the Reuters 21578 database and consist of binary classifications for categories selected from the 115 TOPICS classes of the Reuters collection. Using the Vector Space Model, each document is represented by its original feature vector augmented with an external feature vector generated using WordNet. We verify experimentally that the integration of WordNet helps ssFCM improve its performance, effectively addresses the classification of documents into categories with few training documents and does not interfere with the use of training data.

With a central focus on the cultural contexts of Pacific island societies, this essay examines the entanglement of colonial power relations in local recordkeeping practices. These cultural contexts include the on-going exchange between oral and literate cultures, the aftermath of colonial disempowerment and reassertion of indigenous rights and identities, the difficulty of maintaining full archival systems in isolated, resource-poor micro-states, and the driving influence of development theory. The essay opens with a discussion of concepts of exploration and evangelism in cross-cultural analysis as metaphors for archival endeavour. It then explores the cultural exchanges between oral memory and written records, orality, and literacy, as means of keeping evidence and remembering. After discussing the relation of records to processes of political and economic disempowerment, and the reclaiming of rights and identities, it returns to the patterns of archival development in the Pacific region to consider how archives can better integrate into their cultural and political contexts, with the aim of becoming more valued parts of their communities.

Abstract. The system for interactive, automatic timetabling was developed as part of the research of the Planning Technology and Declarative Programming department at Fraunhofer FIRST on extending constraint-based programming. The system has been used for timetabling at the Charité medical faculty since the summer semester of 1998, and it has been developed further continuously since then. The successful use of the system showed that the chosen methods and techniques are very well suited to problems of this kind. The advantages of combining interactive and automatic timetable generation were clearly demonstrated. CR Subject Classification: I.2.8, I.2.3, J.1, K.3.2, D.3.3, D.1.6. Received 15 March 2003 / Accepted 9 March 2004, Published online: 1 July 2004

Detection As Multi-Topic Tracking   Total citations: 1, self-citations: 0, citations by others: 1
The topic tracking task from TDT is a variant of information filtering tasks that focuses on event-based topics in streams of broadcast news. In this study, we compare tracking to another TDT task, detection, which has the goal of partitioning all arriving news into topics, regardless of whether the topics are of interest to anyone, and even when a new topic appears that had not been previously anticipated. There are clear relationships between the two tasks (under some assumptions, a perfect tracking system could solve the detection problem), but they are evaluated quite differently. We describe the two tasks and discuss their similarities. We show how viewing detection as a form of multi-topic parallel tracking can illuminate the performance tradeoffs of detection over tracking.

Kleinberg's HITS algorithm (Kleinberg 1999), which was originally developed in a Web context, tries to infer the authoritativeness of a Web page in relation to a specific query using the structure of a subgraph of the Web graph, which is obtained considering this specific query. Recent applications of this algorithm in contexts far removed from that of Web searching (Bacchin, Ferro and Melucci 2002, Ng et al. 2001) inspired us to study the algorithm in the abstract, independently of its particular applications, trying to mathematically illuminate its behaviour. In the present paper we detail this theoretical analysis. The original work starts from the definition of a revised and more general version of the algorithm, which includes the classic one as a particular case. We perform an analysis of the structure of two particular matrices, essential to studying the behaviour of the algorithm, and we prove the convergence of the algorithm in the most general case, finding the analytic expression of the vectors to which it converges. Then we study the symmetry of the algorithm and prove the equivalence between the existence of symmetry and the independence from the order of execution of some basic operations on initial vectors. Finally, we expound some interesting consequences of our theoretical results. Supported in part by a grant from the Italian National Research Council (CNR) research project Technologies and Services for Enhanced Content Delivery.
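
For orientation, the classic form of HITS that the revised version is said to generalize can be sketched as a power iteration over authority and hub scores. This is a minimal sketch of the classic algorithm only; the generalized version analysed in the paper is not described in the abstract and is not reproduced here. The edge-list representation, the function names and the fixed iteration count are assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def hits(edges, n_nodes, iterations=50):
    """Classic HITS on a directed graph given as (source, target) index pairs.

    Returns (authority, hub) score lists, each normalized to unit length.
    """
    auth = [1.0] * n_nodes
    hub = [1.0] * n_nodes
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking to a page.
        new_auth = [0.0] * n_nodes
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum of the updated authority scores of the pages linked to.
        new_hub = [0.0] * n_nodes
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = _normalize(new_auth), _normalize(new_hub)
    return auth, hub
```

On a three-node graph where nodes 0 and 1 both link to node 2, `hits([(0, 2), (1, 2)], 3)` gives node 2 all of the authority weight and splits the hub weight equally between nodes 0 and 1.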

Information Retrieval systems typically sort the result with respect to document retrieval status values (RSV). According to the Probability Ranking Principle, this ranking ensures optimum retrieval quality if the RSVs are monotonically increasing with the probabilities of relevance (as e.g. for probabilistic IR models). However, advanced applications like filtering or distributed retrieval require estimates of the actual probability of relevance. The relationship between the RSV of a document and its probability of relevance can be described by a normalisation function which maps the retrieval status value onto the probability of relevance (mapping functions). In this paper, we explore the use of linear and logistic mapping functions for different retrieval methods. In a series of upper-bound experiments, we compare the approximation quality of the different mapping functions. We also investigate the effect on the resulting retrieval quality in distributed retrieval (only merging, without resource selection). These experiments show that good estimates of the actual probability of relevance can be achieved, and that the logistic model outperforms the linear one. Retrieval quality for distributed retrieval is only slightly improved by using the logistic function.
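
The two families of mapping functions can be sketched as follows. The abstract does not state the exact parameterisation, so the forms below are assumptions: a straight line clipped to the interval [0, 1] and the standard logistic function. The parameter names `c0`, `c1`, `b0` and `b1` are hypothetical, and in practice they would be fitted to relevance judgements.

```python
import math


def linear_mapping(rsv: float, c0: float, c1: float) -> float:
    """Linear map from an RSV to a probability estimate, clipped to [0, 1].

    The clipping is an assumption made here so that the result stays a
    valid probability.
    """
    return min(1.0, max(0.0, c0 + c1 * rsv))


def logistic_mapping(rsv: float, b0: float, b1: float) -> float:
    """Standard logistic map from an RSV to a probability estimate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * rsv)))
```

For a positive slope both functions preserve the ranking induced by the RSVs, so they change the interpretation of the scores rather than the order of a single result list. The estimates matter when scores from different sources must be compared, as in the merging step of distributed retrieval.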

Abstract. As bibliographic records and publications are increasingly offered in electronic and networked form, the number and size of the databases offered by academic libraries have grown considerably. In the widespread metasearches across several databases, searches with natural-language search terms are today the lowest common denominator. Because of the well-known shortcomings of Boolean retrieval, however, they often produce result sets that are either too specific or too long and too unspecific. The Faculty of Technology of Bielefeld University and the Bielefeld University Library have developed a search assistant based on fuzzy search logic that decomposes users' queries into sub-queries to the external databases and accumulates the partial results in a list sorted by relevance. Search terms can be weighted and combined with fuzzy aggregation operators, which are represented in the user interface by natural-language fuzzy quantifiers such as "as many as possible", "some" and others. In the intuitive simple search, the search parameters are determined automatically by heuristic rules, but they can also be set explicitly in an advanced search. The search options are complemented by searches for similar documents and suggestion lists for further search terms. We describe the initial situation, the theoretical approach and the user interface, and report on an evaluation of usage and a comparative test of the efficiency of the retrieval method. CR Subject Classification: H.3.3, H.3.5. Received 3 March 2004 / Accepted 19 August 2004, Published online 18 October 2004

Exploiting Hierarchy in Text Categorization   Total citations: 4, self-citations: 3, citations by others: 1
With the recent dramatic increase in electronic access to documents, text categorization, the task of assigning topics to a given document, has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into meta-topics, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.
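
A two-level probabilistic architecture of this kind can be illustrated with a small sketch. The abstract does not say how the two levels are combined, so the product rule below, P(topic) = P(meta-topic) × P(topic | meta-topic), is an assumption chosen as the simplest probabilistic reading. The function name, the dictionary layout and all numbers in the usage note are hypothetical.

```python
def combine_levels(p_meta, p_topic_given_meta, hierarchy):
    """Combine first-level and second-level predictions for one document.

    p_meta:             meta-topic -> probability from the first-level model
    p_topic_given_meta: topic -> probability from the second-level model,
                        conditional on the document belonging to the group
    hierarchy:          meta-topic -> list of topics in that group
    Returns topic -> combined probability (assumed product rule).
    """
    combined = {}
    for meta, topics in hierarchy.items():
        for topic in topics:
            combined[topic] = p_meta[meta] * p_topic_given_meta[topic]
    return combined
```

With a hypothetical hierarchy `{"metals": ["gold", "silver"], "grains": ["wheat"]}`, first-level outputs `{"metals": 0.5, "grains": 0.25}` and second-level outputs `{"gold": 0.5, "silver": 0.25, "wheat": 1.0}`, the combined score for gold is 0.25. Because each topic is scored independently, a document can receive several topics or none by thresholding the combined values.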

For the definition of electronic records, the use of new terms, like literary warrant, is not necessary, and from the European perspective not even understandable. If this expression simply means best practice and professional culture in recordkeeping, we only need to know what creators did for centuries and still do today and probably will do also in the future, by referring to archival science, diplomatics and archival practice for clarifying definitions in the recordkeeping environment. A multi-disciplinary approach is still required for the electronic recordkeeping system as it was in the past for traditional records, but the theory and the terminology should be consistent and based on a deep understanding of the essential characteristics of records and the essential requirements of good recordkeeping, in order to produce in the first place and maintain reliable and authentic records. Of course, a record is more than recorded information created in the course of business activity: a record is the recorded representation of an act produced in a specific form – the form prescribed by the legal system – by a creator in the course of its activity.

This paper presents an experimental evaluation of several text-based methods for detecting duplication in scanned document databases using uncorrected OCR output. This task is made challenging both by the wide range of degradations printed documents can suffer, and by conflicting interpretations of what it means to be a duplicate. We report results for four sets of experiments exploring various aspects of the problem space. While the techniques studied are generally robust in the face of most types of OCR errors, there are nonetheless important differences which we identify and discuss in detail.

TIJAH: Embracing IR Methods in XML Databases   Total citations: 1, self-citations: 0, citations by others: 1
This paper discusses our participation in INEX (the Initiative for the Evaluation of XML Retrieval) using the TIJAH XML-IR system. TIJAH's system design follows a standard layered database architecture, carefully separating the conceptual, logical and physical levels. At the conceptual level, we classify the INEX XPath-based query expressions into three different query patterns. For each pattern, we present its mapping into a query execution strategy. The logical layer exploits score region algebra (SRA) as the basis for query processing. We discuss the region operators used to select and manipulate XML document components. The logical algebra expressions are mapped into efficient relational algebra expressions over a physical representation of the XML document collection using the pre-post numbering scheme. The paper concludes with an analysis of experiments performed with the INEX test collection.
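
The idea behind a pre-post numbering scheme can be sketched in a few lines: each element receives its rank in a preorder traversal and its rank in a postorder traversal, after which the ancestor relationship reduces to two numeric comparisons that a relational engine can evaluate as a range condition. The sketch below is a generic illustration using Python's standard `xml.etree.ElementTree`. It is not TIJAH's physical representation, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def pre_post(root):
    """Assign (pre, post) ranks to every element below and including root."""
    numbers = {}
    counters = {"pre": 0, "post": 0}

    def visit(node):
        pre = counters["pre"]          # rank when the element is first entered
        counters["pre"] += 1
        for child in node:
            visit(child)
        numbers[node] = (pre, counters["post"])  # rank when the element is left
        counters["post"] += 1

    visit(root)
    return numbers


def is_ancestor(a, d, numbers):
    """a is a proper ancestor of d iff a starts before d and ends after d."""
    return numbers[a][0] < numbers[d][0] and numbers[a][1] > numbers[d][1]
```

For the document `<a><b><c/></b><d/></a>` the ranks are a = (0, 3), b = (1, 1), c = (2, 0) and d = (3, 2), so `b` is an ancestor of `c` but not of `d`.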

In this paper the problem of indexing heterogeneous structured documents and of retrieving semi-structured documents is considered. We propose a flexible paradigm for both indexing such documents and formulating user queries specifying soft constraints on both document structure and content. At the indexing level we propose a model that achieves flexibility by constructing personalised document representations based on users' views of the documents. This is obtained by allowing users to specify their preferences on the document sections that they estimate to bear the most interesting information, as well as to linguistically quantify the number of sections which determine the global potential interest of the documents. At the query language level, a flexible query language for expressing soft selection conditions on both the document structure and content is proposed.

In the final scene of Raiders of the Lost Ark, a crate containing the object of Indiana Jones' quest is wheeled into an immense warehouse for indefinite storage and questionable research access. Unfortunately, this fate is not all that far from reality. Collections of archaeological and ethnographic materials ranging from stone axes, broken potsherds, and carved monuments to baskets, ceremonial masks, and skin canoes have been held in museum collections since the Renaissance. However, their inestimable value and unique conservation and curatorial requirements often conspire to remove them from the reach of all but the most diligent scholars. The potential of the Web to enhance the quality of research on archaeological and ethnographic collections is enormous. This paper will examine ways that one can use the Web to enhance research and improve access to a variety of materials; while there are many other resources for archaeology available on the Web, this paper focusses on museum-related sites. It will also explore the potential of the Web for innovative research strategies. Digitization of catalogs, associated documents, and images to help one locate and study collections and specific artifacts is just one approach. Others include the connection of devices to the Web, such as cameras and microscopes, the creation of virtual reference collections, and the establishment of research networks that will enhance the identification and analysis of material culture. This paper will also consider the role the Web could play in issues of cultural property, contributing to and in many ways intensifying ongoing debates of ownership, curation, conservation, and repatriation of sensitive materials.

The Archival Bond   Total citations: 1, self-citations: 0, citations by others: 1
This paper presents the concept of archival bond as formulated by archival science and used in a research project carried out at the University of British Columbia, entitled The Preservation of Electronic Records. Being one of the essential components of the record, the concept of archival bond is discussed in the context of the traditional diplomatic and archival definitions of records, and its function in demonstrating the reliability and authenticity of records is shown. The most serious challenge with which we are confronted is to make explicit and preserve intact over the long term the archival bond between electronic and non-electronic records belonging to the same aggregations.

New Mexico State University's Computing Research Lab has participated in research in all three phases of the US Government's Tipster program. Our work on information retrieval has focused on research and development of multilingual and cross-language approaches to automatic retrieval. The work on automatic systems has been supplemented by additional research into the role of the IR system user in interactive retrieval scenarios: monolingual, multilingual and cross-language. The combined efforts suggest that universal text retrieval, in which a user can find, access and use documents in the face of language differences and information overload, may be possible.

This article examines the claim that, through its overt symbolic messaging, the Gatineau Preservation Centre, opened by the National Archives of Canada in 1997, embodies a perfect transparency between function and form, with the shape of the place being derived seamlessly from the needs of the archival work done there, and the proof being in the exposure of all the elements to view. It reveals the undercurrents of contending oppositions to this claim, both in the subversive, Mannerist, or impure architectural eccentricities designed into the structure, and in the embodiment of archival narratives whose symbolism is challenged by unacknowledged resistances. While the building is clearly inspired by Modernist and Enlightenment orientations, such as the ambition to preserve unchanged a universal, transcendent historical authenticity, these diverse resistances buried in it are manifested, for example, in the contest of male versus female structural elements, and in the authority of the monumental and exposed set against the seduction of the varied and secret. Most importantly, the absorption of the body both metaphorically and physically into the many disciplines of the place unconsciously calls into question the building's self-image as the epitome of a liberal-humanist and objective-scientific activity; it reflects instead the destabilizing plays and displays of power which are increasingly seen to form the indeterminate field of the archival pursuit.

The Museum is a perspicuous site for analysing the complex interplay between social, organisational, cultural and political factors which have relevance to the design and use of virtual technologies. Specifically, the introduction of virtual technologies in museums runs up against the issue of the situated character of information use. Across a number of disciplines (anthropology, sociology, psychology, cognitive science) there is growing recognition of the situatedness of knowledge and its importance for the design and use of technology. This awareness is fostered by the fact that technological developments are often associated with disappointing gains for users. The effective use of technology relies on the degree to which it can be embedded in or congruent with the local practices of museum users. Drawing upon field research in two museums of science and technology, both of which are in the process of introducing virtual technologies and exploring the possibilities of on-line access, findings are presented which suggest that the success of such developments will depend on the extent to which they are informed by detailed understanding of practice: practices that are essentially socially constituted in the activities of museum visitors and the daily work of museum professionals.

Abstract. The software market for data warehouse systems has grown strongly in recent years. Because standards are lacking, software systems each bring their own multidimensional data models and (physical) design tools, so that the design of data warehouse databases is tailored prematurely to the systems in use. In this article we present a tool-supported design procedure that, starting from a three-level design, enables target-platform-independent conceptual modelling of multidimensional data warehouse schemas and, subsequently, a transformation and optimization of these schemas for concrete target platforms. Conceptual modelling ensures that implementation details do not enter the design process too early and that domain requirements take precedence instead. CR Subject Classification: D.2.2, H.2.1. Arne Harren: The work described here was carried out during our joint employment at the Oldenburg research institute OFFIS (http://www.offis.de). Received 25 July 2003 / Accepted 1 June 2004, Published online 6 September 2004

This article presents the Δ-distance, a family of distances between images recursively decomposed into segments and represented by multi-level feature vectors. Such a structure is a quad-, quin- or nona-tree resulting from a fixed and arbitrary image partition or from an image segmentation process. It handles positional information of image features (e.g. color, texture or shape). Δ-distance is the generalized form of dissimilarity measures between multi-level feature vectors. Using different weights on tree nodes and different distances between nodes, distances between trees or visual similarity between images can be computed based on the general definition of Δ. In this article, we present three Δ-based distance families: two families of distances between tree structures, called T-distance (T for Tree) and S-distance (S for Segment), and a family of visual distances between images, called V-distance (V for Visual). The V-distance visually compares two images using their tree representation and the other two distances compare the tree structures resulting from image segmentation. Moreover, we show how existing distances between multi-level feature vectors appear to be particular cases of the Δ-distance.
