首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We present PubSearch, a hybrid heuristic scheme for re-ranking academic papers retrieved from standard digital libraries such as the ACM Portal. The scheme is based on the hierarchical combination of a custom implementation of the term frequency heuristic, a time-depreciated citation score and a graph-theoretic computed score that relates the paper’s index terms with each other. We designed and developed a meta-search engine that submits user queries to standard digital repositories of academic publications and re-ranks the repository results using the hierarchical heuristic scheme. We evaluate our proposed re-ranking scheme via user feedback against the results of ACM Portal on a total of 58 different user queries specified from 15 different users. The results show that our proposed scheme significantly outperforms ACM Portal in terms of retrieval precision as measured by most common metrics in Information Retrieval including Normalized Discounted Cumulative Gain (NDCG), Expected Reciprocal Rank (ERR) as well as a newly introduced lexicographic rule (LEX) of ranking search results. In particular, PubSearch outperforms ACM Portal by more than 77% in terms of ERR, by more than 11% in terms of NDCG, and by more than 907.5% in terms of LEX. We also re-rank the top-10 results of a subset of the original 58 user queries produced by Google Scholar, Microsoft Academic Search, and ArnetMiner; the results show that PubSearch compares very well against these search engines as well. The proposed scheme can be easily plugged in any existing search engine for retrieval of academic publications.  相似文献   

2.
How to merge and organise query results retrieved from different resources is one of the key issues in distributed information retrieval. Some previous research and experiments suggest that cluster-based document browsing is more effective than a single merged list. Cluster-based retrieval results presentation is based on the cluster hypothesis, which states that documents that cluster together have a similar relevance to a given query. However, while this hypothesis has been demonstrated to hold in classical information retrieval environments, it has never been fully tested in heterogeneous distributed information retrieval environments. Heterogeneous document representations, the presence of document duplicates, and disparate qualities of retrieval results, are major features of an heterogeneous distributed information retrieval environment that might disrupt the effectiveness of the cluster hypothesis. In this paper we report on an experimental investigation into the validity and effectiveness of the cluster hypothesis in highly heterogeneous distributed information retrieval environments. The results show that although clustering is affected by different retrieval results representations and quality, the cluster hypothesis still holds and that generating hierarchical clusters in highly heterogeneous distributed information retrieval environments is still a very effective way of presenting retrieval results to users.  相似文献   

3.
利用分类法和主题法改善搜索引擎的性能   总被引:6,自引:0,他引:6  
苏瑞竹  吴英姿 《情报科学》2001,19(11):1170-1175
本文对Internet上的检索工具搜索引擎的工作机理和性能进行了全方位的探讨,指出了常见搜索引擎信息检索缺点。同时还指出了Meta搜索引擎、智能搜索引擎和代理搜索引擎虽然提高了网络信息检索的质量,但由于分类体系不统一,类目划分标准模糊,因而仍然未能从根本上改变搜索引擎主要以关键词(自然语言)作为检索入口的现状,不能实现分类检索与主题检索的一体化。要实现搜索引擎信息检索的突破,笔者认为有必要运用情报检索语言的理论和方法来完善因特网搜索引擎的性能,实现分类、主题一体化的检索机制,克服分类检索语言单纯以学科聚类、主题语言单纯以事物聚类的局限性。  相似文献   

4.
情报检索的理论和方法在改善搜索引擎搜索性能中的应用   总被引:7,自引:0,他引:7  
夏立新 《情报科学》2001,19(7):759-762
本文认为计算机情报检索的理论与方法对改善搜索引擎的搜索性能具有重要意义,并从索引数据库和检索功能两方面探讨了改善其搜索性能的措施。  相似文献   

5.
Web searchers commonly have difficulties crafting queries to fulfill their information needs; even after they are able to craft a query, they often find it challenging to evaluate the results of their Web searches. Sources of these problems include the lack of support for constructing and refining queries, and the static nature of the list-based representations of Web search results. WordBars has been developed to assist users in their Web search and exploration tasks. This system provides a visual representation of the frequencies of the terms found in the first 100 document surrogates returned from an initial query, in the form of a histogram. Exploration of the search results is supported through term selection in the histogram, resulting in a re-sorting of the search results based on the use of the selected terms in the document surrogates. Terms from the histogram can be easily added or removed from the query, generating a new set of search results. Examples illustrate how WordBars can provide valuable support for query refinement and search results exploration, both when vague and specific initial queries are provided. User evaluations with both expert and intermediate Web searchers illustrate the benefits of the interactive exploration features of WordBars in terms of effectiveness as well as subjective measures. Although differences were found in the demographics of these two user groups, both were able to benefit from the features of WordBars.  相似文献   

6.
Google为用户提供了多种检索功能,但这些检索功能的性能如何却少见于相关文献中。作者对Google的两种较为独特的检索功能进行了评析,分别是相似检索和链接检索,并以普通检索作为评价的标准。本文采用了20个检索式和20个目标网页,对相似检索、链接检索和普通检索的检索性能差异性进行了评价。研究中主要利用单因素方差分析法和回归分析法。研究结果显示普通检索表现最好。研究成果能帮助用户对相似检索和链接检索有较为清晰的认识,并且可以了解Google在这两种检索功能上所表现出的不同的检索性能。  相似文献   

7.
The relevance feedback process uses information obtained from a user about a set of initially retrieved documents to improve subsequent search formulations and retrieval performance. In extended Boolean models, the relevance feedback implies not only that new query terms must be identified and re-weighted, but also that the terms must be connected with Boolean And/Or operators properly. Salton et al. proposed a relevance feedback method, called DNF (disjunctive normal form) method, for a well established extended Boolean model. However, this method mainly focuses on generating Boolean queries but does not concern about re-weighting query terms. Also, this method has some problems in generating reformulated Boolean queries. In this study, we investigate the problems of the DNF method and propose a relevance feedback method using hierarchical clustering techniques to solve those problems. We also propose a neural network model in which the term weights used in extended Boolean queries can be adjusted by the users’ relevance feedbacks.  相似文献   

8.
To resolve some of lexical disagreement problems between queries and FAQs, we propose a reliable FAQ retrieval system using query log clustering. On indexing time, the proposed system clusters the logs of users’ queries into predefined FAQ categories. To increase the precision and the recall rate of clustering, the proposed system adopts a new similarity measure using a machine readable dictionary. On searching time, the proposed system calculates the similarities between users’ queries and each cluster in order to smooth FAQs. By virtue of the cluster-based retrieval technique, the proposed system could partially bridge lexical chasms between queries and FAQs. In addition, the proposed system outperforms the traditional information retrieval systems in FAQ retrieval.  相似文献   

9.
Ontologies are frequently used in information retrieval being their main applications the expansion of queries, semantic indexing of documents and the organization of search results. Ontologies provide lexical items, allow conceptual normalization and provide different types of relations. However, the optimization of an ontology to perform information retrieval tasks is still unclear. In this paper, we use an ontology query model to analyze the usefulness of ontologies in effectively performing document searches. Moreover, we propose an algorithm to refine ontologies for information retrieval tasks with preliminary positive results.  相似文献   

10.
The performance and capabilities of Web search engines is an important and significant area of research. Millions of people world wide use Web search engines very day. This paper reports the results of a major study examining the overlap among results retrieved by multiple Web search engines for a large set of more than 10,000 queries. Previous smaller studies have discussed a lack of overlap in results returned by Web search engines for the same queries. The goal of the current study was to conduct a large-scale study to measure the overlap of search results on the first result page (both non-sponsored and sponsored) across the four most popular Web search engines, at specific points in time using a large number of queries. The Web search engines included in the study were MSN Search, Google, Yahoo! and Ask Jeeves. Our study then compares these results with the first page results retrieved for the same queries by the metasearch engine Dogpile.com. Two sets of randomly selected user-entered queries, one set was 10,316 queries and the other 12,570 queries, from Infospace’s Dogpile.com search engine (the first set was from Dogpile, the second was from across the Infospace Network of search properties were submitted to the four single Web search engines). Findings show that the percent of total results unique to only one of the four Web search engines was 84.9%, shared by two of the three Web search engines was 11.4%, shared by three of the Web search engines was 2.6%, and shared by all four Web search engines was 1.1%. This small degree of overlap shows the significant difference in the way major Web search engines retrieve and rank results in response to given queries. Results point to the value of metasearch engines in Web retrieval to overcome the biases of individual search engines.  相似文献   

11.
Many Web sites have begun allowing users to submit items to a collection and tag them with keywords. The folksonomies built from these tags are an interesting topic that has seen little empirical research. This study compared the search information retrieval (IR) performance of folksonomies from social bookmarking Web sites against search engines and subject directories. Thirty-four participants created 103 queries for various information needs. Results from each IR system were collected and participants judged relevance. Folksonomy search results overlapped with those from the other systems, and documents found by both search engines and folksonomies were significantly more likely to be judged relevant than those returned by any single IR system type. The search engines in the study had the highest precision and recall, but the folksonomies fared surprisingly well. Del.icio.us was statistically indistinguishable from the directories in many cases. Overall the directories were more precise than the folksonomies but they had similar recall scores. Better query handling may enhance folksonomy IR performance further. The folksonomies studied were promising, and may be able to improve Web search performance.  相似文献   

12.
The purpose of this study is to provide automatic new topic identification of search engine query logs, and estimate the effect of statistical characteristics of search engine queries on new topic identification. By applying multiple linear regression and multi-factor ANOVA on a sample data log from the Excite search engine, we demonstrated that the statistical characteristics of Web search queries, such as time interval, search pattern and position of a query in a user session, are effective on shifting to a new topic. Multiple linear regression is also a successful tool for estimating topic shifts and continuations. The findings of this study provide statistical proof for the relationship between the non-semantic characteristics of Web search queries and the occurrence of topic shifts and continuations.  相似文献   

13.
This paper explores the integration of textual and visual information for cross-language image retrieval. An approach which automatically transforms textual queries into visual representations is proposed. First, we mine the relationships between text and images and employ the mined relationships to construct visual queries from textual ones. Then, the retrieval results of textual and visual queries are combined. To evaluate the proposed approach, we conduct English monolingual and Chinese–English cross-language retrieval experiments. The selection of suitable textual query terms to construct visual queries is the major issue. Experimental results show that the proposed approach improves retrieval performance, and use of nouns is appropriate to generate visual queries.  相似文献   

14.
Topic distillation is one of the main information needs when users search the Web. Previous approaches for topic distillation treat single page as the basic searching unit, which has not fully utilized the structure information of the Web. In this paper, we propose a novel concept for topic distillation, named sub-site retrieval, in which the basic searching unit is sub-site instead of single page. A sub-site is the subset of a website, consisting of a structural collection of pages. The key of sub-site retrieval includes (1) extracting effective features for the representation of a sub-site using both the content and structure information, (2) delivering the sub-site-based retrieval results with a friendly and informative user interface. For the first point, we propose Punished Integration algorithm, which is based on the modeling of the growth of websites. For the second point, we design a user interface to better illustrate the search results of sub-site retrieval. Testing on the topic distillation task of TREC 2003 and 2004, sub-site retrieval leads to significant improvement of retrieval performance over the previous methods based on single pages. Furthermore, time complexity analysis shows that sub-site retrieval can be integrated into the index component of search engines.  相似文献   

15.
Current citation-based document retrieval systems generally offer only limited search facilities, such as author search. In order to facilitate more advanced search functions, we have developed a significantly improved system that employs two novel techniques: Context-based Cluster Analysis (CCA) and Context-based Ontology Generation frAmework (COGA). CCA aims to extract relevant information from clusters originally obtained from disparate clustering methods by building relationships between them. The built relationships are then represented as formal context using the Formal Concept Analysis (FCA) technique. COGA aims to generate ontology from clusters relationship built by CCA. By combining these two techniques, we are able to perform ontology learning from a citation database using clustering results. We have implemented the improved system and have demonstrated its use for finding research domain expertise. We have also conducted performance evaluation on the system and the results are encouraging.  相似文献   

16.
As an effective technique for improving retrieval effectiveness, relevance feedback (RF) has been widely studied in both monolingual and translingual information retrieval (TLIR). The studies of RF in TLIR have been focused on query expansion (QE), in which queries are reformulated before and/or after they are translated. However, RF in TLIR actually not only can help select better query terms, but also can enhance query translation by adjusting translation probabilities and even resolving some out-of-vocabulary terms. In this paper, we propose a novel relevance feedback method called translation enhancement (TE), which uses the extracted translation relationships from relevant documents to revise the translation probabilities of query terms and to identify extra available translation alternatives so that the translated queries are more tuned to the current search. We studied TE using pseudo-relevance feedback (PRF) and interactive relevance feedback (IRF). Our results show that TE can significantly improve TLIR with both types of relevance feedback methods, and that the improvement is comparable to that of query expansion. More importantly, the effects of translation enhancement and query expansion are complementary. Their integration can produce further improvement, and makes TLIR more robust for a variety of queries.  相似文献   

17.
Users enter queries that are short as well as long. The aim of this work is to evaluate techniques that can enable information retrieval (IR) systems to automatically adapt to perform better on such queries. By adaptation we refer to (1) modifications to the queries via user interaction, and (2) detecting that the original query is not a good candidate for modification. We show that the former has the potential to improve mean average precision (MAP) of long and short queries by 40% and 30% respectively, and that simple user interaction can help towards this goal. We observed that after inspecting the options presented to them, users frequently did not select any. We present techniques in this paper to determine beforehand the utility of user interaction to avoid this waste of time and effort. We show that our techniques can provide IR systems with the ability to detect and avoid interaction for unpromising queries without a significant drop in overall performance.  相似文献   

18.
This paper reports on the underlying IR problems encountered when dealing with the complex morphology and compound constructions found in the Hungarian language. It describes evaluations carried out on two general stemming strategies for this language, and also demonstrates that a light stemming approach could be quite effective. Based on searches done on the CLEF test collection, we find that a more aggressive suffix-stripping approach may produce better MAP. When compared to an IR scheme without stemming or one based on only a light stemmer, we find the differences to be statistically significant. When compared with probabilistic, vector-space and language models, we find that the Okapi model results in the best retrieval effectiveness. The resulting MAP is found to be about 35% better than the classical tf idf approach, particularly for very short requests. Finally, we demonstrate that applying an automatic decompounding procedure for both queries and documents significantly improves IR performance (+10%), compared to word-based indexing strategies.  相似文献   

19.
Searching for relevant material that satisfies the information need of a user, within a large document collection is a critical activity for web search engines. Query Expansion techniques are widely used by search engines for the disambiguation of user’s information need and for improving the information retrieval (IR) performance. Knowledge-based, corpus-based and relevance feedback, are the main QE techniques, that employ different approaches for expanding the user query with synonyms of the search terms (word synonymy) in order to bring more relevant documents and for filtering documents that contain search terms but with a different meaning (also known as word polysemy problem) than the user intended. This work, surveys existing query expansion techniques, highlights their strengths and limitations and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that the application of knowledge or corpus-based query expansion techniques on the results of the relevance feedback step improves the information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives in all sets.  相似文献   

20.
This paper presents a new approach to query expansion in search engines through the use of general non-topical terms (NTTs) and domain-specific semi-topical terms (STTs). NTTs and STTs can be used in conjunction with topical terms (TTs) to improve precision in retrieval results. In Phase I, 20 topical queries in two domains (Health and the Social Sciences) were carried out in Google and from the results of the queries, 800 pages were textually analysed. Of 1442 NTTs and STTs identified, 15% were shared between the two domains; 62% were NTTs and 38% were STTs; and approximately 64% occurred before while 36% occurred after their respective topical terms (TTs). Findings of Phase II showed that query expansion through NTTs (or STTs) particularly in the ‘exact title’ and URL search options resulted in more precise and manageable results. Statistically significant differences were found between Health and the Social Sciences vis-à-vis keyword and ‘exact phrase’ search results; however there were no significant differences in exact title and URL search results. The ratio of exact phrase, exact title, and URL search result frequencies to keyword search result frequencies also showed statistically significant differences between the two domains. Our findings suggest that web searching could be greatly enhanced combining NTTs (and STTs) with TTs in an initial query. Additionally, search results would improve if queries are restricted to the exact title or URL search options. Finally, we suggest the development and implementation of knowledge-based lists of NTTs (and STTs) by both general and specialized search engines to aid query expansion.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号