Similar literature
Found 20 similar documents; search took 125 ms
1.
In this paper, we propose a document reranking method for Chinese information retrieval. The method is based on a term weighting scheme that integrates the local and global distribution of terms as well as document frequency, document position and term length. The weighting scheme allows a larger portion of the retrieved documents to be randomly set as relevance feedback, and removes the concern that very few relevant documents appear among the top retrieved documents. It also helps to improve the performance of maximal marginal relevance (MMR) in document reranking. The method was evaluated by MAP (mean average precision), a recall-oriented measure. Significance tests showed that our method achieves a significant improvement over standard baselines and consistently outperforms related methods.
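The abstract above mentions maximal marginal relevance (MMR) without describing it. As a rough illustration only, the following sketch shows the standard greedy MMR selection, assuming documents and the query are sparse term-weight dictionaries compared by cosine similarity. The function names, the `lam` trade-off parameter and the toy vectors are assumptions made for this example; the paper's own term weighting scheme is not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_rerank(query, docs, lam=0.7):
    """Greedy MMR: repeatedly pick the document that maximizes
    lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected).

    `docs` maps a document name to its term-weight dict (hypothetical layout).
    """
    remaining = list(docs)
    selected = []
    while remaining:
        def score(name):
            relevance = cosine(docs[name], query)
            redundancy = max(
                (cosine(docs[name], docs[s]) for s in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)  # ties go to the earliest document
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    toy_docs = {
        "d1": {"x": 1.0, "y": 1.0},
        "d2": {"x": 1.0, "y": 1.0},  # exact duplicate of d1
        "d3": {"x": 1.0, "z": 1.0},
    }
    print(mmr_rerank({"x": 1.0}, toy_docs, lam=0.5))
```

With `lam=1.0` the procedure reduces to plain relevance ranking; lower values push near-duplicates of already selected documents further down the list.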

2.
3.
In information retrieval, cluster-based retrieval is a well-known attempt to resolve the problem of term mismatch. Clustering requires similarity information between documents, which is difficult to compute in a feasible time. The adaptive document clustering scheme has been investigated by researchers to resolve this problem. However, its theoretical basis has not been fully explored. In this regard, we provide a conceptual view of adaptive document clustering based on query-based similarities, regarding the user’s query as a concept. As a result, the adaptive document clustering scheme can be viewed as an approximation of this similarity. Based on this idea, we derive three new query-based similarity measures in the language modeling framework and evaluate them in the context of cluster-based retrieval, comparing them with K-means clustering and full document expansion. The evaluation results show that retrieval based on query-based similarities significantly improves on the baseline, while being comparable to the other methods. This implies that the newly developed query-based similarities are feasible criteria for adaptive document clustering.

4.
With the increase of information on the Web, it is difficult to quickly find the desired information among the documents retrieved by a search engine. One way to solve this problem is to classify web documents according to various criteria. Most document classification has focused on the subject or topic of a document. Genre, or style, is another view of a document, distinct from its subject or topic, and it is also a criterion for classifying documents. In this paper, we suggest multiple sets of features for classifying the genres of web documents. The basic set of features, which has been proposed in previous studies, is acquired from the textual properties of documents, such as the number of sentences, the number of occurrences of a certain word, etc. However, web documents differ from textual documents in that they contain URLs and HTML tags within the pages. We introduce new sets of features specific to web documents, which are extracted from URLs and HTML tags. The present work is an attempt to evaluate the performance of the proposed sets of features and to discuss their characteristics. Finally, we conclude which set of features is appropriate for automatic genre classification of web documents.

5.
Decisions in thesaurus construction and use   Total citations: 1, self-citations: 0, citations by others: 1
A thesaurus and an ontology provide a set of structured terms, phrases, and metadata, often in a hierarchical arrangement, that may be used to index, search, and mine documents. We describe the decisions that should be made when including a term, deciding whether a term should be subdivided into its subclasses, or determining which of more than one set of possible subclasses should be used. Based on retrospective measurements or estimates of future performance when using thesaurus terms in document ordering, decisions are made so as to maximize performance. These decisions may be used in the automatic construction of a thesaurus. The evaluation of an existing thesaurus is described, consistent with the decision criteria developed here. These kinds of user-focused decision-theoretic techniques may be applied to other hierarchical applications, such as faceted classification systems used in information architecture or the use of hierarchical terms in “breadcrumb navigation”.

6.
Structured document retrieval makes use of document components as the basis of the retrieval process, rather than complete documents. The inherent relationships between these components make it vital to support users’ natural browsing behaviour in order to offer effective and efficient access to structured documents. This paper examines the concept of best entry points, which are document components from which the user can browse to obtain optimal access to relevant document components. It investigates the types of best entry points in structured document retrieval, and their usage and effectiveness in real information search tasks.

7.
In this paper, the scalability and quality of the contextual document clustering (CDC) approach are demonstrated for large data sets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering, which automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The pre-defined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated based on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time-ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new, unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scalable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach.

8.
The number of patent documents is currently rising rapidly worldwide, creating the need for an automatic categorization system to replace time-consuming and labor-intensive manual categorization. Because accurate patent classification is crucial when searching for relevant existing patents in a given field, patent categorization is a very important and useful field. As patent documents are structured documents with characteristics that distinguish them from general documents, these unique traits should be considered in the patent categorization process. In this paper, we categorize Japanese patent documents automatically, focusing on their characteristics: patents are structured by claims, purposes, effects, embodiments of the invention, and so on. We propose a patent document categorization method that uses the k-NN (k-Nearest Neighbour) approach. In order to retrieve similar documents from a training document set, specific components that denote the so-called semantic elements, such as claim, purpose, and application field, are compared instead of the whole texts. Because those specific components are identified by various user-defined tags, all of the components are first clustered into several semantic elements. Such semantically clustered structural components are the basic features of patent categorization. We achieve a 74% improvement in categorization performance over a baseline system that does not use the structural information of the patent.
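To make the component-wise k-NN idea above more concrete, here is a minimal sketch under stated assumptions: each patent is a dictionary mapping a semantic element (for example `claim` or `purpose`) to a list of terms, components are compared with Jaccard overlap, and the per-component weights are hand-picked. The similarity measure, the weights, the category labels and all names are illustrative choices, not the paper's actual method.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap between two term lists (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def component_similarity(doc_a, doc_b, weights):
    """Weighted sum of per-component similarities instead of whole-text similarity."""
    return sum(
        w * jaccard(doc_a.get(component, []), doc_b.get(component, []))
        for component, w in weights.items()
    )


def knn_categorize(query_doc, training, weights, k=3):
    """Return the majority category among the k most similar training patents.

    `training` is a list of (document, category) pairs (hypothetical layout).
    """
    ranked = sorted(
        training,
        key=lambda item: component_similarity(query_doc, item[0], weights),
        reverse=True,
    )
    votes = Counter(category for _, category in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        ({"claim": ["engine", "fuel"], "purpose": ["efficiency"]}, "engines"),
        ({"claim": ["engine", "piston"], "purpose": ["efficiency"]}, "engines"),
        ({"claim": ["antenna", "signal"], "purpose": ["range"]}, "antennas"),
    ]
    weights = {"claim": 0.5, "purpose": 0.5}  # assumed component weights
    query = {"claim": ["engine", "valve"], "purpose": ["efficiency"]}
    print(knn_categorize(query, training, weights, k=3))
```

Comparing like with like (claims against claims, purposes against purposes) is what distinguishes this from k-NN over the full text; a baseline can be obtained by merging all components into a single field.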

9.
Lately there has been intensive research into the possibilities of using additional information about documents (such as hyperlinks) to improve retrieval effectiveness. This is called data fusion, and it is based on the intuitive principle that combining different document and query representations or different methods leads to a better estimation of the documents' relevance scores. In this paper we propose a new method of document re-ranking that enables us to improve document scores using inter-document relationships. These relationships are expressed by distances and can be obtained from the text, hyperlinks or other information. The method formalizes the intuition that strongly related documents should not be assigned very different weights.
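The intuition stated above, that strongly related documents should not receive very different scores, can be illustrated with a simple smoothing pass. The sketch below is not the method from the paper: it assumes an iterative update in which each document's score is pulled toward the closeness-weighted average of its neighbours, with closeness taken as `exp(-distance)`. The parameter `alpha`, the update rule and all names are assumptions for illustration.

```python
import math


def smooth_scores(scores, distances, alpha=0.5, iterations=10):
    """Pull each score toward the closeness-weighted mean of its neighbours.

    `scores` maps document name -> initial retrieval score.
    `distances` maps an unordered pair (a, b) -> distance between the two
    documents; pairs that are absent are treated as unrelated.
    """
    current = dict(scores)
    for _ in range(iterations):
        updated = {}
        for doc in scores:
            total_weight = 0.0
            weighted_sum = 0.0
            for (a, b), dist in distances.items():
                if doc == a:
                    other = b
                elif doc == b:
                    other = a
                else:
                    continue
                weight = math.exp(-dist)  # small distance -> strong influence
                total_weight += weight
                weighted_sum += weight * current[other]
            neighbour_mean = (
                weighted_sum / total_weight if total_weight else current[doc]
            )
            # Blend the original score with the neighbourhood estimate.
            updated[doc] = (1 - alpha) * scores[doc] + alpha * neighbour_mean
        current = updated
    return current


if __name__ == "__main__":
    initial = {"a": 1.0, "b": 0.0, "c": 0.5}
    print(smooth_scores(initial, {("a", "b"): 0.1}))
```

In the toy run, documents `a` and `b` are close, so their scores move toward each other, while `c`, which has no recorded neighbour, keeps its original score.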

10.
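Liu Yanping (刘艳平). 《情报杂志》, 1991, 10(3): 45-49
Based on an analysis of the current state of agricultural literature resources and their use in Shaanxi Province, this paper points out that literature utilization is low, identifies the obstacles currently hindering the development and use of the literature, and offers the following suggestions for developing and using literature resources: (1) strengthen the information awareness of leaders and researchers; (2) establish a provincial agricultural information resource center and form a three-level information network at the provincial, prefecture (city) and county levels, so that agricultural literature resources are shared across the province; (3) develop the literature on research achievements to accelerate their application; (4) strengthen agricultural information work; (5) build specialized agricultural literature databases with local characteristics; (6) strengthen the training of existing information personnel.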

11.
Online healthcare communities (OHCs) have become producers of medical information. Solving the issue of how to effectively reuse such a large amount of medical data and discover its potential value is of the utmost importance for alleviating the shortage of medical resources. Online consultation has received widespread attention and popularity since its first appearance in 1999, and as a result, many diagnostic multi-turn question and answer (Q&A) documents have become available. This type of document is formed by multiple rounds of patients’ questions and doctors’ diagnostic answers, and contains extensive medical knowledge and doctors’ diagnostic experience. Few studies concentrate on the modeling and recommendation of this type of document, yet making these documents convenient to reuse reduces the cost of medical consultation for patients and saves doctors the time spent addressing common diseases. In this paper, we focus on the modeling and understanding of diagnostic multi-turn Q&A records and propose a deep-learning recommendation framework based on patients’ medical information needs, the contents of Q&A records and doctors’ background information. In an evaluation based on a real dataset containing pediatric consultation dialogues collected from DingXiangYuan, a well-known online consultation application in China, we found that the proposed model performed well on the recommendation of diagnostic multi-turn Q&A records and outperformed baseline models. In addition, we discuss a potential application scenario of the recommendation model, suggesting that the proposed model can help reduce patient costs and doctors’ work pressure in countries or regions with insufficient medical resources.

12.
13.
Structured document retrieval makes use of document components as the basis of the retrieval process, rather than complete documents. The inherent relationships between these components make it vital to support users’ natural browsing behaviour in order to offer effective and efficient access to structured documents. This paper examines the concept of best entry points, which are document components from which the user can browse to obtain optimal access to relevant document components. In particular, this paper investigates the basic characteristics of best entry points.

14.
This paper describes our novel retrieval model, which is based on the contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes the document contexts into account instead of implicitly using them to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as the log-odds and uses smoothing techniques, as found in language models, to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., the aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of the parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show whether the model obeys the probability ranking principle. Our model is promising, as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed (a) that the operators that are consistent with the aggregate relevance principle were effective in combining the estimated preferences, and (b) that estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.
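The pipeline described above (extract contexts around query terms, score each context as a smoothed log-odds, then aggregate) can be sketched roughly as follows. This is a simplified illustration, not the paper's model: it assumes fixed-size token windows as contexts, unigram term probabilities with additive smoothing (swapped in here for whichever smoothing the authors use), and `max` as the aggregation operator. All function names, parameters and toy counts are hypothetical.

```python
import math
from collections import Counter


def smoothed_prob(term, counts, vocab_size, mu=1.0):
    """Additive smoothing so that unseen terms never get zero probability."""
    return (counts[term] + mu) / (sum(counts.values()) + mu * vocab_size)


def context_log_odds(context, rel_counts, nonrel_counts, vocab_size):
    """Log-odds that a context comes from relevant rather than non-relevant text."""
    return sum(
        math.log(
            smoothed_prob(t, rel_counts, vocab_size)
            / smoothed_prob(t, nonrel_counts, vocab_size)
        )
        for t in context
    )


def extract_contexts(tokens, query_terms, half_window=2):
    """One window of tokens around every occurrence of a query term."""
    return [
        tokens[max(0, i - half_window): i + half_window + 1]
        for i, t in enumerate(tokens)
        if t in query_terms
    ]


def document_score(tokens, query_terms, rel_counts, nonrel_counts,
                   vocab_size, aggregate=max):
    """Combine per-context preferences with an aggregation operator."""
    preferences = [
        context_log_odds(c, rel_counts, nonrel_counts, vocab_size)
        for c in extract_contexts(tokens, query_terms)
    ]
    return aggregate(preferences) if preferences else float("-inf")


if __name__ == "__main__":
    rel = Counter({"jaguar": 3, "car": 3})      # counts from relevant contexts
    nonrel = Counter({"jaguar": 3, "cat": 3})   # counts from other contexts
    print(document_score(["fast", "jaguar", "car"], {"jaguar"}, rel, nonrel, 4))
    print(document_score(["jaguar", "cat"], {"jaguar"}, rel, nonrel, 4))
```

Passing a different `aggregate` callable (for example `min`, or a mean) corresponds to choosing a different relevance decision principle for combining the contexts of one document.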

15.
Nowadays, new ways of managing and accessing health-care information are continuously appearing. Web-based Personal Health Records (web PHRs) have the potential to make health-care data available to clinicians, researchers and students in different medical contexts and applications. Consequently, the number of web PHRs accessible through the Internet has grown enormously, and as a result health-care professionals are currently burdened with more and more data. Unfortunately, these data probably do not always have adequate levels of quality, so the professionals' work cannot always be as successful as expected. To alleviate this problem, the present work focuses on improving document filtering results in the context of web PHR management. To achieve this goal, a new kind of document filtering model is proposed. This model is based on fuzzy prototypes, which are defined by means of conceptual prototypes. These prototypes are obtained by performing a data quality analysis of documents. This analysis guarantees that the filtered information will be relevant enough for the information user. The complete model provides an efficient document filtering strategy that can be very useful when it is necessary to deal with a constant flow of new information.

16.
Documents circulating in paper form are increasingly being replaced by their electronic equivalents in the modern office, so that any stored document can be retrieved whenever it is needed later. The office worker is already burdened with information overload, so effective and efficient retrieval facilities become an important factor affecting worker productivity. This paper first reviews the features of current document management systems, which offer varying facilities to manage, store and retrieve either references to documents or whole documents. Information retrieval databases, groupware products and workflow management systems are presented as developments that handle different needs, together with the underlying concepts of knowledge management. The two problems of worker finiteness and worker ignorance remain outstanding, as they are only partially addressed by the above-mentioned systems. The solution lies in a shift away from pull technology, where the user has to actively initiate the request for information, towards push technology, where available information is automatically delivered without user intervention. Intelligent information retrieval agents are presented as a solution, together with a marketing scenario of how they can be introduced.

17.
Topicality is an operationally necessary but insufficient condition for requestor-judged relevance. Documents are independent of one another as to any judgement of their topicality, but not independent as to any judgement of their relevance, which is a function of their informativeness to a requestor. Recall depends solely upon topicality, but precision depends upon informativeness as well. A retrieval system which aspires to the retrieval of relevant documents should have a second stage which orders the topical set so as to provide maximum informativeness to the requestor. Should a system be concerned only with topicality, then a two-stage system which generates a high-recall set and discards imprecise documents by measuring their distance from a seed document can be iterated to provide topicality feedback without user input.
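The second stage mentioned at the end of the abstract above, discarding documents from a high-recall set by their distance from a seed document, might look like the following minimal sketch. It assumes documents are represented as term lists and that Jaccard distance is the distance measure; the threshold value and all names are assumptions chosen for illustration, and the iteration step is not shown.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard overlap of two term lists (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def second_stage(topical_set, seed_terms, max_distance=0.6):
    """Keep documents within `max_distance` of the seed, nearest first.

    `topical_set` maps document name -> term list and stands in for the
    high-recall set produced by the first stage.
    """
    kept = []
    for name, terms in topical_set.items():
        dist = jaccard_distance(terms, seed_terms)
        if dist <= max_distance:
            kept.append((dist, name))
    return [name for _, name in sorted(kept)]


if __name__ == "__main__":
    high_recall = {
        "d1": ["fuzzy", "retrieval", "index"],
        "d2": ["fuzzy", "retrieval", "model"],
        "d3": ["cooking", "recipes"],
    }
    print(second_stage(high_recall, ["fuzzy", "retrieval", "index"]))
```

The output ordering doubles as a ranking of the retained set by closeness to the seed.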

18.
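Interlibrary loan and document delivery are new types of reader-oriented service offered by library and information institutions, and are one of the most effective forms of resource sharing. All available means should be used to publicize these services widely, so that readers understand interlibrary loan and document delivery, can obtain the documents they need as quickly as possible, and have their needs met to the greatest possible extent.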

19.
Based on a mixed circulation management model in which documents can be borrowed from and returned to either the new library or the old library, and with a view to reducing cost and development complexity, this paper designs a Radio Frequency Identification (RFID) intelligent library management system that is compatible with the old document management system and can also coexist and operate alongside it. The system makes full use of the central database of the original library document information management system and, by means of middleware, integrates seamlessly with the old library document information management system. This both exhibits the technical advantages of the RFID intelligent library management system and achieves the management goal of borrowing and returning documents across the new and old libraries.

20.
The object of this paper is to present a new kind of approach, based on the theory of fuzzy sets, to the problem of evaluating information system effectiveness. On the basis of this theory, the concepts of relevance and pertinence, which are the basic concepts used in determining the indices of information system effectiveness evaluation, have been defined. It is assumed that, in evaluating the effectiveness of information systems, one should consider separately the problem of evaluating the quality of the transformation of the contents of documents and information requests into their search patterns, and the problem of evaluating the quality of the process of profile control of a document set of the information system. Under this assumption, definitions have been given of parameters for evaluating the quality of the transformation of the contents of documents and information requests into their search patterns, both with regard to a given information request and with regard to the whole set of information requests under examination. In addition, parameters for evaluating the quality of the process of profile control of a document set of the information system have been defined. The parameters of effectiveness evaluation of information systems put forward in this paper take account of the fact that both the evaluation of the relevance and the evaluation of the pertinence of documents are of a continuous character.
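The closing remark above, that relevance is continuous rather than binary, can be illustrated with one common fuzzy-set generalization of precision and recall. This is a sketch under stated assumptions, not the parameters defined in the paper: relevance and retrieval are treated as membership degrees in [0, 1], set intersection is taken as the element-wise minimum, and set size as the sum of memberships. The function name and the toy data are hypothetical.

```python
def fuzzy_precision_recall(retrieved, relevance):
    """Precision and recall with graded (fuzzy) memberships.

    `retrieved` maps document -> degree to which it was retrieved (1.0 for a
    crisp result list); `relevance` maps document -> degree of relevance.
    """
    # Fuzzy intersection: element-wise minimum, summed to a fuzzy cardinality.
    overlap = sum(
        min(mu, relevance.get(doc, 0.0)) for doc, mu in retrieved.items()
    )
    retrieved_size = sum(retrieved.values())
    relevant_size = sum(relevance.values())
    precision = overlap / retrieved_size if retrieved_size else 0.0
    recall = overlap / relevant_size if relevant_size else 0.0
    return precision, recall


if __name__ == "__main__":
    retrieved = {"a": 1.0, "b": 1.0, "c": 1.0}
    relevance = {"a": 1.0, "b": 0.5, "c": 0.0, "d": 0.5}
    print(fuzzy_precision_recall(retrieved, relevance))
```

When every membership degree is 0 or 1, these formulas reduce to the usual set-based precision and recall.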
