Similar literature
Found 20 similar documents; search took 46 ms
1.
2.
This paper describes a technique for automatic book indexing. The technique requires a dictionary of terms that are to appear in the index, along with all text strings that count as instances of the term. It also requires that the text be in a form suitable for processing by a text formatter. A program searches the text for each occurrence of a term or its associated strings and creates an entry to the index when either is found. The results of the experimental application to a portion of a book text are presented, including measures of precision and recall, with precision giving the ratio of terms correctly assigned in the automatic process to the total assigned, and recall giving the ratio of correct terms automatically assigned to the total number of term assignments according to a human standard. Results indicate that the technique can be applied successfully, especially for texts that employ a technical vocabulary and where there is a premium on indexing exhaustivity.
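The dictionary lookup and the two measures defined in this abstract can be sketched as follows. This is a minimal illustration, not the paper's program: the dictionary, the sample sentence, the human reference set, and the function names `auto_index` and `precision_recall` are all assumptions made for the example.

```python
import re

def auto_index(text, term_dictionary):
    """Return the index terms whose own string or associated strings occur in the text."""
    assigned = set()
    lowered = text.lower()
    for term, variants in term_dictionary.items():
        # A term is assigned as soon as the term or any associated string is found.
        for candidate in [term, *variants]:
            if re.search(r"\b" + re.escape(candidate.lower()) + r"\b", lowered):
                assigned.add(term)
                break
    return assigned

def precision_recall(assigned, reference):
    """Precision: correct / all assigned. Recall: correct / all in the human standard."""
    correct = assigned & reference
    precision = len(correct) / len(assigned) if assigned else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical dictionary, text, and human index, for illustration only.
term_dictionary = {
    "text formatter": ["formatter", "formatting program"],
    "index": ["indexes", "indices"],
    "recall": [],
}
page = "The formatter reads the marked-up text and emits the indices."
human_terms = {"text formatter", "index", "markup"}

assigned = auto_index(page, term_dictionary)
precision, recall = precision_recall(assigned, human_terms)
print(sorted(assigned), precision, recall)
```

Here both automatically assigned terms are correct, so precision is 1.0, while the human term "markup" has no dictionary entry and is missed, so recall is 2/3. That is the usual trade-off for a dictionary-driven indexer: what it assigns depends entirely on what the dictionary covers.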

3.
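An ontology-based scheme for automatic full-text indexing   Total citations: 5, self-citations: 1, citations by others: 5
Wang Taisen (王泰森). Information Science (《情报科学》), 2003, 21(9): 950-952
To help improve the precision of full-text retrieval in digital libraries, this paper proposes an ontology-based scheme for automatic full-text indexing. The scheme applies ontological methods and emphasizes the intrinsic conceptual relationships between words. It targets the shortcomings of traditional manual indexing: it cannot summarize the full text comprehensively; the index terms lack conceptual links to one another, so they rarely reflect the whole of a document's subject; and polysemy, synonymy, and similar phenomena cause either missed documents or so many returned results that the search loses its value. The scheme also allows subject indexing in digital libraries to be automated.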

4.
5.
This paper examines the feasibility of discovering “title-like” terms from a document using a decision tree classifier. The premise is that title terms and title-like terms should behave similarly in the document. This behavior is characterized by a set of distributional and linguistic features. By training the classifier to observe the behavior of title terms in a balanced manner using 25,000 titles in Reuters articles, other terms with similar behavior would also be discovered. Based on 5000 unseen titles, the recall of title terms was 83%, similar to the manual identification of title terms. The precision of finding title terms is low (32%) because some non-title but title-like terms should have been identified as well. Seven subjects were asked to rate, on a scale of 1 to 5, whether each identified term is a topical/thematic/title term. If a rating of 2.5 is used as the threshold for judging a term “title-like”, the mean precision increases to 58%, or the headline/title is expanded with twice the average number of terms. Since this precision (58%) is similar to the mean precision of manually identified title terms averaged across different subjects, we conclude that the discovery of title-like terms using classifiers is a promising approach.

6.
In image retrieval, most systems lack user-centred evaluation since they are assessed by some chosen ground truth dataset. The results reported through precision and recall assessed against the ground truth are thought of as being an acceptable surrogate for the judgment of real users. Much current research focuses on automatically assigning keywords to images for enhancing retrieval effectiveness. However, evaluation methods are usually based on system-level assessment, e.g. classification accuracy based on some chosen ground truth dataset. In this paper, we present a qualitative evaluation methodology for automatic image indexing systems. The automatic indexing task is formulated as one of image annotation, or automatic metadata generation for images. The evaluation is composed of two individual methods. First, the automatic indexing annotation results are assessed by human subjects. Second, the subjects are asked to annotate some chosen images as the test set whose annotations are used as ground truth. Then, the system is tested by the test set whose annotation results are judged against the ground truth. Only one of these methods is reported for most systems on which user-centred evaluation is conducted. We believe that both methods need to be considered for full evaluation. We also provide an example evaluation of our system based on this methodology. According to this study, our proposed evaluation methodology is able to provide deeper understanding of the system’s performance.

7.
8.
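Selecting the best automatic web page indexing scheme and evaluating its indexing performance   Total citations: 2, self-citations: 0, citations by others: 2
Zhong Yunyun (仲云云), Hou Hanqing (侯汉清), Xue Pengjun (薛鹏军). Information Science (《情报科学》), 2002, 20(10): 1108-1110
This paper introduces three schemes for automatic web page indexing. By comparing manual and automatic indexing results for 50 pages from the China Economic Net (“中国经济网”) website, it selects the best scheme: weighting different parts of the page's full text and applying weighted term-frequency statistics. Finally, the scheme's automatic subject indexing and classification indexing are each evaluated in terms of the human-machine agreement rate.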
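The selected scheme in the abstract above, term-frequency statistics weighted by where on the page a term occurs, might look roughly like the sketch below. The region names, the weight values in `REGION_WEIGHTS`, and the sample page are assumptions; the abstract does not report the weights actually used, and real Chinese pages would need word segmentation first.

```python
from collections import Counter

# Hypothetical weights per page region; the abstract does not give the values used.
REGION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def weighted_term_scores(regions, weights=REGION_WEIGHTS):
    """regions maps a region name to the list of (already segmented) terms found there."""
    scores = Counter()
    for region, terms in regions.items():
        weight = weights.get(region, 1.0)  # unknown regions count like body text
        for term in terms:
            scores[term] += weight
    return scores

def top_terms(regions, k):
    """Return the k highest-scoring terms as candidate index terms."""
    scores = weighted_term_scores(regions)
    # Sort by score descending, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical page, for illustration only.
page_regions = {
    "title": ["economy", "growth"],
    "heading": ["growth", "trade"],
    "body": ["trade", "trade", "policy", "economy"],
}
scores = weighted_term_scores(page_regions)
candidates = top_terms(page_regions, 3)
print(candidates)
```

With these weights, "growth" outranks "trade" even though "trade" occurs more often, because a title occurrence counts three times as much as a body occurrence.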

9.
10.
A procedure for automated indexing of pathology diagnostic reports at the National Institutes of Health is described. Diagnostic statements in medical English are encoded by computer into the Systematized Nomenclature of Pathology (SNOP). SNOP is a structured indexing language constructed by pathologists for manual indexing. It is of interest that effective automatic encoding can be based upon an existing vocabulary and code designed for manual methods. Morphosyntactic analysis, a simple syntax analysis, matching of dictionary entries consisting of several words, and synonym substitutions are techniques utilized.
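Two of the techniques listed in this abstract, synonym substitution and matching of multi-word dictionary entries, can be illustrated with a longest-match lookup. This is only a sketch under stated assumptions: the synonym table, the dictionary, and the codes (`CODE-001` and so on) are placeholders, not real SNOP entries, and the morphosyntactic and syntax analysis steps are not modeled.

```python
# Hypothetical synonym table and dictionary; the codes are placeholders, not SNOP codes.
SYNONYMS = {"tumour": "tumor", "cancer": "carcinoma"}
DICTIONARY = {
    ("squamous", "cell", "carcinoma"): "CODE-001",
    ("carcinoma",): "CODE-002",
    ("lung",): "CODE-003",
}
MAX_ENTRY_LEN = max(len(entry) for entry in DICTIONARY)

def encode(statement):
    """Encode a diagnostic statement as a list of dictionary codes."""
    # Lowercase, turn commas into spaces, and replace synonyms with preferred words.
    words = [SYNONYMS.get(w, w) for w in statement.lower().replace(",", " ").split()]
    codes = []
    i = 0
    while i < len(words):
        # Try the longest possible entry first so multi-word entries win.
        for n in range(min(MAX_ENTRY_LEN, len(words) - i), 0, -1):
            entry = tuple(words[i:i + n])
            if entry in DICTIONARY:
                codes.append(DICTIONARY[entry])
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this word; skip it
    return codes

codes = encode("Squamous cell cancer, lung")
print(codes)
```

Because "cancer" is first replaced by "carcinoma", the three-word entry matches as a whole and the single-word "carcinoma" entry is never tried for that position.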

11.
A variety of abstract automatic indexing models have been developed in recent times in an effort to produce indexing methods that are both effective and usable in practice. Among these are the term discrimination model and the term precision system. These two indexing systems are briefly described and experimental evidence is cited showing that a combination of both theories produces better retrieval performance than either one alone. Appropriate conclusions are reached concerning viable automatic indexing procedures usable in practice.
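The abstract names the term discrimination model without defining it. As the model is commonly described, a term's discrimination value is the change in the density of the document space when the term is removed: a good index term spreads documents apart, so removing it makes them more similar. The sketch below assumes density is the average pairwise cosine similarity, which is a simplification chosen for brevity; the document vectors and function names are illustrative.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two sparse term-weight vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def density(docs):
    """Average pairwise similarity (assumed density measure); needs at least two documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(docs, term):
    """Density without the term minus density with it; positive means a good discriminator."""
    without = [{t: w for t, w in d.items() if t != term} for d in docs]
    return density(without) - density(docs)

# Hypothetical term-frequency vectors, for illustration only.
docs = [
    {"the": 1, "index": 2},
    {"the": 1, "retrieval": 2},
    {"the": 1, "vector": 2},
]
dv_the = discrimination_value(docs, "the")
dv_index = discrimination_value(docs, "index")
print(dv_the, dv_index)
```

In this toy collection the term shared by every document gets a negative value, and a term specific to one document gets a positive value.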

12.
13.
The profusion of online resources calls for tools and methods to help Internet users find precisely what they are looking for. The quality-controlled gateway CISMeF provides such services for health resources. However, the human cost of maintaining and updating the catalogue is increasingly high. This paper presents the automatic indexing system currently developed in the CISMeF team to be used as such for preliminary indexing, or after human reviewing for the final indexing. The system architecture, using the INTEX platform for MeSH term extraction, is detailed. The results of a first evaluation tend to indicate that the automatic indexing strategy is relevant, as it achieves a precision comparable to that of other existing operational systems. Moreover, the system presented in this paper retrieves keyword/qualifier pairs as opposed to single terms, therefore providing a significantly more precise indexing. Further development and tests will be carried out in order to improve the coverage of the dictionaries, and validate the efficiency of the system in the indexers’ everyday work.

14.
15.
Analyzing and extracting insights from user-generated data has become a topic of interest among businesses and research groups because such data contains valuable information, e.g., consumers’ opinions, ratings, and recommendations of products and services. However, the true value of social media data is rarely discovered due to information overload. Existing literature on analyzing online hotel reviews mainly focuses on a single data resource, lexicon, and analysis method and rarely provides marketing insights and decision-making information to improve businesses’ services and product quality. We propose an integrated framework which includes a data crawler, data preprocessing, sentiment-sensitive tree construction, convolution tree kernel classification, aspect extraction and category detection, and visual analytics to gain insights into hotel ratings and reviews. The empirical findings show that our proposed approach outperforms baseline algorithms as well as well-known sentiment classification methods, and achieves high precision (0.95) and recall (0.96). The visual analytics results reveal that Business travelers tend to give lower ratings, while Couples tend to give higher ratings. In general, users tend to rate lowest in July and highest in December. Business travelers more frequently use negative keywords, such as “rude,” “terrible,” “horrible,” “broken,” and “dirty,” to express their dissatisfaction with their hotel stays in July.

16.
Determining requirements when searching for and retrieving relevant information suited to a user’s needs has become increasingly important and difficult, partly due to the explosive growth of electronic documents. The vector space model (VSM) is a popular method in retrieval procedures. However, the weakness of the traditional VSM is that the indexing vocabulary changes whenever changes occur in the document set, the indexing vocabulary selection algorithms, or the parameters of the algorithms, or if wording evolution occurs. The major objective of this research is to design a method to solve the aforementioned problems for patent retrieval. The proposed method utilizes a special characteristic of patent documents, the International Patent Classification (IPC) codes, to generate the indexing vocabulary for representing all the patent documents. The advantage of the generated indexing vocabulary is that it remains unchanged, even if the document sets, selection algorithms, and parameters are changed, or if wording evolution occurs. Comparison of the proposed method with two traditional methods (entropy and chi-square) in manual and automatic evaluations is presented to verify the feasibility and validity. The results also indicate that the IPC-based indexing vocabulary selection method achieves a higher accuracy and is more satisfactory.

17.
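Wang Zexian (王泽贤). Journal of Modern Information (《现代情报》), 2014, 34(4): 132-136
This study examines how to choose the most suitable Lucene Chinese analyzer for a project that implements a Chinese bibliographic search system on Lucene. Through extensive experiments, the three analyzers bundled with Lucene and two actively developed third-party Chinese analyzers were analyzed and compared in terms of word segmentation quality, index build time and size, and retrieval time, recall, and average precision. Taking the experimental results together, the ik analyzer showed the best overall performance and is the recommended choice.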
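Two of the retrieval measures used in the comparison above, recall and average precision, can be computed per query as sketched below. The abstract does not say exactly how its average precision was calculated, so the rank-based definition here is an assumption, and the analyzer names, document ids, and relevance judgments are invented for illustration.

```python
def recall(ranked_ids, relevant_ids):
    """Share of the relevant documents that appear anywhere in the result list."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids) & relevant_ids) / len(relevant_ids)

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant_ids)

# Hypothetical result lists for one query under two analyzers.
relevant = {"b1", "b2", "b3"}
results = {
    "analyzer_a": ["b1", "x", "b2"],
    "analyzer_b": ["x", "b1", "b2", "b3"],
}
summary = {
    name: (recall(ranked, relevant), average_precision(ranked, relevant))
    for name, ranked in results.items()
}
print(summary)
```

Averaging these per-query values over a fixed query set gives one recall figure and one average precision figure per analyzer, which can then be weighed against index build time and index size.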

18.
19.
The Web is revolutionizing the entire scholarly communication process and changing the way that researchers exchange information. In this paper, we analyze two views of information production and use in computer-related research based on citation analysis of PDF and PostScript formatted publications on the Web using autonomous citation indexing (ACI), and a parallel citation analysis of the journal literature indexed by the Institute for Scientific Information (ISI) in SCISEARCH. Our goal is to establish a baseline profile of computer science “literature” as it appears in the published journals and as it appears on the publicly available Web. From this starting point, we hope to identify additional research areas dealing with information dissemination and citation practices in computer science and the utility of autonomous citation indexing on the Web as an adjunct to commercial indexing.

20.
“Quan Fang Bei Zu” is a compilation consisting mainly of folklore, poems, and other literary works about some common plants, with some botanical information in it. It is certainly not a purely botanical work: it covers no more than 240 species of plants and thus has little use as a reference book for indexing names, even in a primitive sense. Therefore “Quan Fang Bei Zu” cannot be considered a botanical dictionary. The argument of Xu Wen-xuan and his co-workers that “Quan Fang Bei Zu” was the most perfect ancient botanical codes and records up to that time is not convincing. In fact, “Tu Jing Ben Cao” is of higher value than the book under discussion from a botanical point of view.
