Similar Documents
20 similar documents found (search time: 78 ms)
1.
Faced with the data explosion, people find it hard to extract useful information. Web crawler technology has become the most important component of search engines, able to find valuable information effectively within massive data. This paper first introduces the crawl targets and crawling strategies of web crawlers, then presents the most common web page analysis algorithm, PageRank, and finally implements a web crawler through an example. The results show that the crawler can accurately extract useful information from massive data.
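The abstract names PageRank as the page-analysis algorithm. Below is a minimal sketch of the classic power-iteration form of PageRank; the toy link graph and the damping factor of 0.85 are illustrative assumptions, not details taken from the paper.

```python
# Minimal PageRank via power iteration. The toy graph and the
# damping factor (0.85) are illustrative assumptions.
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping each page to the list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, links in graph.items():
            if not links:                 # dangling page: spread rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                for target in links:
                    new_rank[target] += damping * rank[page] / len(links)
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
    for page, score in sorted(pagerank(toy_graph).items()):
        print(page, round(score, 4))
```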

2.
A web crawler is a basic component of a search engine, and the efficiency with which it fetches pages directly affects the quality of the service the search engine provides. Besides improving the crawling strategy, crawler efficiency can also be raised by optimizing specific parts of the crawler's design to remove particular bottlenecks. Based on an analysis of crawler architecture and actual runtime data, and targeting the crawler's DNS-resolution bottleneck, this paper designs an asynchronous domain-name resolver model with a cache and compares it experimentally with a conventional DNS resolver model. The results show that the model is very effective at reducing the time the program spends waiting for domain resolution, and thus improves the crawler's overall efficiency.
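The paper's own implementation is not shown in the abstract, so here is only a rough sketch of the idea: an asynchronous resolver with an in-memory cache, built on Python's asyncio. The hostnames and the unbounded cache policy are assumptions.

```python
import asyncio
import socket

# Sketch of a caching asynchronous DNS resolver in the spirit of the
# paper's design; the cache policy and API here are illustrative.
class CachingResolver:
    def __init__(self):
        self._cache = {}                     # hostname -> IP address
        self._lock = asyncio.Lock()

    async def resolve(self, hostname):
        async with self._lock:
            if hostname in self._cache:      # cache hit: no network wait
                return self._cache[hostname]
        loop = asyncio.get_running_loop()
        # getaddrinfo runs in the loop's executor, so many lookups can be
        # in flight at once instead of blocking the crawler one by one.
        infos = await loop.getaddrinfo(hostname, 80, type=socket.SOCK_STREAM)
        address = infos[0][4][0]
        async with self._lock:
            self._cache[hostname] = address
        return address

async def main():
    resolver = CachingResolver()
    hosts = ["example.com", "example.org", "example.com"]  # repeat hits cache
    addresses = await asyncio.gather(*(resolver.resolve(h) for h in hosts))
    print(dict(zip(hosts, addresses)))

asyncio.run(main())
```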

3.
This paper introduces the basic concepts of topic-oriented search engines and web crawlers and the architecture of the Heritrix system, analyzes Heritrix's workflow, and extends and optimizes it on the basis of the Heritrix framework. Through an example that crawls book information from JD.com, it provides the web page resources needed to build a vertical search engine for book information.

4.
Research and Development of Web Crawler Software (cited 1 time: 0 self-citations, 1 by others)
As a fast and efficient tool for accessing massive amounts of data on the network, the general-purpose search engine has been popular ever since its birth. Its design, however, has many limitations, and with the rapid growth of the World Wide Web it increasingly fails to meet users' needs. Against this background, topic crawlers for targeted page fetching emerged. A topic crawler aims to use minimal resources to fetch the pages users care about as quickly and accurately as possible, and is now very widely applied. This paper first reviews the historical background of topic crawlers and their current state of development at home and abroad, and analyzes the technologies involved in their design, such as the HTTP protocol, HTML parsing, and Chinese word segmentation. It then proposes using the vector space model to compute topic relevance; to fully exploit the rich heuristic information in web pages, it combines page content analysis with link analysis. Finally, based on this study of topic crawler design and implementation, a multithreaded topic crawler is developed in Java.
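The relevance computation the abstract proposes can be illustrated compactly: represent the topic and a page as term-frequency vectors and score the page by cosine similarity. The whitespace tokenization and example texts below are illustrative assumptions (the paper additionally uses Chinese word segmentation and link analysis, omitted here).

```python
import math
from collections import Counter

# Vector-space-model relevance: represent the topic and a page as
# term-frequency vectors and score the page by cosine similarity.
def cosine_similarity(text_a, text_b):
    vec_a, vec_b = Counter(text_a.split()), Counter(text_b.split())
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in common)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

topic = "web crawler search engine page fetching"
page = "a focused web crawler fetches pages for a topical search engine"
print(round(cosine_similarity(topic, page), 3))  # relevance score in [0, 1]
```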

5.
This paper introduces the basic architecture and working principles of web crawlers, designs the basic architecture of a web video crawler, and discusses in detail two key problems: how to effectively avoid re-traversing pages, and how to quickly pick up newly added site content. It also describes how the video crawler downloads videos and fetches pages.
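On the first key problem, avoiding repeated traversal, a common approach (not necessarily the one this paper adopts) is to normalize each URL and keep a set of fingerprints of everything already enqueued. A sketch using an MD5 digest of the normalized URL:

```python
import hashlib
from urllib.parse import urldefrag, urlparse, urlunparse

# Generic duplicate-avoidance technique: normalize each URL and
# remember a compact fingerprint of everything already seen.
class SeenUrls:
    def __init__(self):
        self._fingerprints = set()

    @staticmethod
    def _normalize(url):
        url, _ = urldefrag(url)             # drop #fragment
        parts = urlparse(url)
        return urlunparse((parts.scheme.lower(), parts.netloc.lower(),
                           parts.path or "/", parts.params,
                           parts.query, ""))

    def add_if_new(self, url):
        """Return True if the URL was not seen before."""
        digest = hashlib.md5(self._normalize(url).encode()).digest()
        if digest in self._fingerprints:
            return False
        self._fingerprints.add(digest)
        return True

seen = SeenUrls()
print(seen.add_if_new("http://Example.com/video?id=1#t=30"))  # True
print(seen.add_if_new("http://example.com/video?id=1"))       # False: duplicate
```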

6.
Human society has entered the era of big data. With the rapid development of the Internet, data of many kinds are being produced in enormous volumes, and search engines, the tools that help people retrieve information, have certain limitations: users from different fields and backgrounds often have different retrieval goals and needs, and the results returned by a general-purpose search engine contain many pages the user does not care about. Focused web crawler systems arose to solve this problem. As is well known, search engines filter useful information out of the Internet in a targeted way, and the web crawler is one of their basic building blocks. This paper implements a focused web crawler in Python that scans target websites using keyword matching and fetches the required data.
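A minimal Python sketch of the keyword-matching idea follows. The seed URL, keyword list, and use of requests/BeautifulSoup are illustrative assumptions, not the paper's code.

```python
from collections import deque

import requests
from bs4 import BeautifulSoup

# Focused-crawler sketch: fetch pages from a frontier and keep only
# those whose text matches the target keywords. Seed, keywords, and
# limits are illustrative assumptions.
KEYWORDS = {"crawler", "search engine"}

def relevant(text):
    text = text.lower()
    return any(kw in text for kw in KEYWORDS)

def crawl(seed, max_pages=20):
    frontier, seen, hits = deque([seed]), {seed}, []
    while frontier and len(seen) <= max_pages:
        url = frontier.popleft()
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        if relevant(soup.get_text(" ")):
            hits.append(url)                      # page matches the topic
        for link in soup.find_all("a", href=True):
            href = requests.compat.urljoin(url, link["href"])
            if href.startswith("http") and href not in seen:
                seen.add(href)
                frontier.append(href)
    return hits

print(crawl("https://example.com"))
```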

7.
A Python-based web crawler can conveniently scrape web page information. Taking the Douban website as an example, this paper implements the process of scraping Douban movie and TV information with a Python crawler.
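A short sketch in that spirit: fetch a Douban listing page and pull out movie title links with requests and BeautifulSoup. The URL, headers, and CSS selector are assumptions for illustration; Douban's actual markup and anti-scraping rules may differ.

```python
import requests
from bs4 import BeautifulSoup

# Sketch of scraping movie titles from a Douban chart page.
# URL, headers, and selector are illustrative assumptions.
URL = "https://movie.douban.com/chart"
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; demo-crawler)"}

response = requests.get(URL, headers=HEADERS, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

for anchor in soup.select("a.nbg"):       # assumed selector for title links
    print(anchor.get("title"), anchor.get("href"))
```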

8.
With the continuous development of information technology, the variety of data on the Internet keeps growing and the volume of information increases geometrically. This huge amount of data brings convenience to people's lives but also poses a great challenge for finding information. The general-purpose crawlers of search engines are increasingly unable to cope with ever larger data-fetching tasks. This paper designs a topic-oriented web crawler with a distributed architecture that can fetch information in a specific domain quickly, accurately, and stably.
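One common way to realize a distributed crawler (not necessarily this paper's architecture) is a shared URL frontier that many worker processes, possibly on different machines, consume concurrently. The sketch below uses a Redis list as the frontier; the key names and local Redis instance are assumptions.

```python
import redis        # pip install redis
import requests

# Distributed-crawler worker sketch: many copies of this process share
# one Redis-backed URL frontier. Key names and setup are illustrative.
FRONTIER_KEY = "crawler:frontier"
SEEN_KEY = "crawler:seen"

def worker(r):
    while True:
        item = r.brpop(FRONTIER_KEY, timeout=5)   # blocking pop, shared queue
        if item is None:
            break                                  # frontier drained
        url = item[1].decode()
        if not r.sadd(SEEN_KEY, url):              # 0 => already crawled
            continue
        try:
            page = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        # ... parse page, push newly discovered URLs back on the frontier:
        # r.lpush(FRONTIER_KEY, *new_urls)
        print("fetched", url, len(page), "bytes")

if __name__ == "__main__":
    client = redis.Redis()                         # assumes a local Redis
    client.lpush(FRONTIER_KEY, "https://example.com")
    worker(client)
```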

9.
With the rapid development of the Internet and the arrival of the big-data era, data and information on the network are growing explosively, and web crawler technology is becoming more and more popular. Taking the scraping of second-hand housing sale listings as an example, this paper explores web page scraping with R and finds that the approach based on R's rvest package together with the SelectorGadget tool is simpler and faster than traditional methods.

10.
Based on the open-source crawler Heritrix, this paper describes its working principles and architecture. It builds an index from a fisheries-information lexicon, proposes a Heritrix-based topic-crawling algorithm that filters pages by links and content, and constructs a fisheries-information crawler, FishInfoCrawler. Experiments show that the algorithm can fetch pages relevant to the fisheries-information domain.

11.
Search strategies for web spiders have been one of the focal points of research on specialized search engines in recent years; how to make a search engine obtain the needed resources quickly and accurately from the vast mass of web page data is a key problem. This paper focuses on the search strategies and optimization measures of a search engine's Web Spider, proposes a simple spider design based on a breadth-first algorithm, and analyzes the optimizations made during the design.
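A minimal sketch of the breadth-first strategy: a FIFO frontier visits pages level by level outward from the seed. The depth limit and parsing details are illustrative, not the paper's exact design.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Breadth-first spider sketch: a FIFO queue visits pages level by
# level out from the seed. Depth limit and parsing are illustrative.
def bfs_crawl(seed, max_depth=2):
    frontier = deque([(seed, 0)])
    visited = {seed}
    order = []
    while frontier:
        url, depth = frontier.popleft()           # FIFO => breadth-first
        order.append((depth, url))
        if depth == max_depth:
            continue
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            nxt = urljoin(url, link["href"])
            if nxt.startswith("http") and nxt not in visited:
                visited.add(nxt)
                frontier.append((nxt, depth + 1))
    return order

for depth, url in bfs_crawl("https://example.com"):
    print(depth, url)
```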

12.
With the rapid development of the network, the number of web pages has expanded dramatically, growing exponentially in recent years, and search engines face ever more severe challenges: it is hard to find pages that meet users' needs quickly and accurately among this mass of pages. Web page classification is one effective way to address this problem. Classification by topic and classification by genre are its two main streams, and both effectively improve retrieval efficiency. Web page genre classification classifies pages according to their form of presentation and their intended use. This paper introduces the definition of web page genre and the classification features commonly used in genre-classification research, and surveys several common feature-selection methods, classification models, and classifier evaluation methods, giving researchers an overview of web page genre classification.
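As an illustration of the kind of pipeline such surveys describe, here is a tiny scikit-learn sketch: TF-IDF features, chi-squared feature selection, and a Naive Bayes classifier. The four "pages" and their genre labels are fabricated stand-ins for a real genre-labeled corpus, and every parameter choice is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny genre-classification pipeline: TF-IDF features, chi-squared
# feature selection, Naive Bayes. Pages and labels are illustrative
# stand-ins for a labeled corpus.
pages = [
    "breaking news report on election results today",
    "add to cart checkout price shipping discount",
    "abstract introduction methodology references cited",
    "login username password forgot account sign in",
]
genres = ["news", "shop", "paper", "form"]

model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=10),      # keep the 10 most genre-indicative terms
    MultinomialNB(),
)
model.fit(pages, genres)
print(model.predict(["checkout and shipping options with discount price"]))
```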

13.
This paper studies the influence of question-related variables (closed/open and predictable/unpredictable source) on Web users' choices of search strategy (direct address, subject directory, and search engine) in the initial stage of a search. Subjects were 54 Finnish and American students with about 2.5 years of Web searching experience. Data were gathered via a questionnaire asking for decisions on 16 questions of four types: closed/predictable source, closed/unpredictable source, open/predictable source, and open/unpredictable source. The participants not only indicated a fairly high degree of familiarity with the initial search options and used different search strategies, but were also influenced in their choice of an initial search strategy by question-related characteristics. Of the two question characteristics in the study, the more influential is the predictable/unpredictable source of the answer. The participants mentioned 24 different types of reasons for selecting the initial search strategy, which were grouped by their focus on questions, sources, and search strategy options. The reasons varied across question types. A model of choice behavior shows relationships among the initial question, the reasons, and the choice of initial search strategies.

14.
Liu Tianjiao, Zhou Ying. Information Science (《情报科学》), 2012, (8): 1192-1195
This study traces the development of research on web search engines from 2001 to 2010 in order to point out directions for future work. Restricting the time window to those ten years, 386 articles on "web search engines" retrieved from CNKI source journals were analyzed using content analysis and SPSS, examining publication counts, journal distribution, and article content. The analysis shows that over the decade 2001-2010 the number of specialized studies of web search engines came to exceed the number of general surveys, indicating that research in this area has been moving in an increasingly specialized, in-depth direction.

15.
Search engines are the gateway for users to retrieve information from the Web. There is a crucial need for tools that allow effective analysis of search engine queries to provide a greater understanding of Web users' information-seeking behavior. The objective of this study is to develop an effective strategy for selecting samples from large-scale data sets. Millions of queries are submitted to Web search engines daily, and new sampling techniques are required to bring these databases to a manageable size while preserving the statistically representative characteristics of the entire data set. This paper reports results from a study using data logs from the Excite Web search engine. We use Poisson sampling to develop a sampling strategy, and show that sample sets selected by Poisson sampling effectively represent the statistical characteristics of the entire data set. In addition, this paper discusses the use of Poisson sampling in continuous monitoring of stochastic processes, such as Web site dynamics.
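As a rough illustration of the scheme the paper studies (its exact design parameters are not reproduced here), Poisson sampling includes each log record independently with its own inclusion probability; with a constant probability p, the resulting sample size is itself random, with mean p times the log size. The fake query log and p = 0.01 below are assumptions.

```python
import random

# Poisson sampling sketch: each record is included independently with
# probability p, so the sample size is random with mean p * len(log).
# The fake query log and p = 0.01 are illustrative assumptions.
random.seed(42)
query_log = [f"query-{i}" for i in range(100_000)]   # stand-in for Excite logs
p = 0.01

sample = [record for record in query_log if random.random() < p]
print(len(sample))         # close to 1,000 in expectation
```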

16.
Research on a Genetic-Algorithm-Based Web Information Gathering Strategy (cited 1 time: 0 self-citations, 1 by others)
To improve the adaptability of a Web gathering system (web spider) and the accuracy of its link-value predictions, this paper proposes a spider search strategy based on a genetic algorithm. Experimental results show that a spider using this algorithm achieves higher Web search efficiency than a traditional spider.
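The abstract does not detail the paper's encoding or fitness function, so the sketch below shows only the generic genetic-algorithm loop (selection, crossover, mutation) one could use to evolve, say, term weights for scoring candidate links; every specific choice here is an assumption.

```python
import random

# Generic GA loop, as one might use to evolve term weights for scoring
# candidate links. Encoding, fitness, and hyperparameters are all
# illustrative assumptions; the paper's design is not detailed above.
random.seed(0)
GENES = 5                             # e.g., weights for 5 topic terms
TARGET = [0.9, 0.1, 0.5, 0.7, 0.3]    # stand-in for "ideal" weights

def fitness(weights):
    # Higher is better: negative squared error against the stand-in target.
    return -sum((w - t) ** 2 for w, t in zip(weights, TARGET))

def evolve(pop_size=30, generations=100, mutation_rate=0.1):
    population = [[random.random() for _ in range(GENES)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]          # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, GENES)           # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:        # mutate one gene
                child[random.randrange(GENES)] = random.random()
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve()
print([round(w, 2) for w in best], round(fitness(best), 4))
```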

17.
Research on Search Engine Optimization Strategies for Enterprise Websites (cited 1 time: 0 self-citations, 1 by others)
Search engine optimization plays an important role in enterprise promotion. Based on an analysis of the factors that affect a website's search engine ranking, this paper summarizes SEO strategies that help improve the ranking of enterprise websites.

18.
Large-scale web search engines are composed of multiple data centers that are geographically distant to each other. Typically, a user query is processed in a data center that is geographically close to the origin of the query, over a replica of the entire web index. Compared to a centralized, single-center search engine, this architecture offers lower query response times as the network latencies between the users and data centers are reduced. However, it does not scale well with increasing index sizes and query traffic volumes because queries are evaluated on the entire web index, which has to be replicated and maintained in all data centers. As a remedy to this scalability problem, we propose a document replication framework in which documents are selectively replicated on data centers based on regional user interests. Within this framework, we propose three different document replication strategies, each optimizing a different objective: reducing the potential search quality loss, the average query response time, or the total query workload of the search system. For all three strategies, we consider two alternative types of capacity constraints on index sizes of data centers. Moreover, we investigate the performance impact of query forwarding and result caching. We evaluate our strategies via detailed simulations, using a large query log and a document collection obtained from the Yahoo! web search engine.
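A crude sketch of the selective-replication idea (not the paper's actual strategies, which optimize more refined objectives): greedily replicate in each data center the documents most accessed from that region, until the center's index-capacity budget is exhausted. All counts, centers, and capacities below are illustrative assumptions.

```python
# Greedy sketch of selective replication: each data center replicates
# the documents its regional users request most, up to a capacity cap.
# Access counts, centers, and capacities are illustrative assumptions.
regional_access = {
    "eu": {"doc1": 90, "doc2": 10, "doc3": 40},
    "us": {"doc1": 20, "doc2": 80, "doc3": 70},
}
capacity = {"eu": 2, "us": 2}        # max documents replicated per center

replicas = {}
for center, counts in regional_access.items():
    ranked = sorted(counts, key=counts.get, reverse=True)
    replicas[center] = ranked[: capacity[center]]

print(replicas)   # {'eu': ['doc1', 'doc3'], 'us': ['doc2', 'doc3']}
```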

19.
Search Engine for South-East Europe (SE4SEE) is a socio-cultural search engine running on the grid infrastructure. It offers a personalized, on-demand, country-specific, category-based Web search facility. The main goal of SE4SEE is to attack the page freshness problem by performing the search on the original pages residing on the Web, rather than on the previously fetched copies as done in the traditional search engines. SE4SEE also aims to obtain high download rates in Web crawling by making use of the geographically distributed nature of the grid. In this work, we present the architectural design issues and implementation details of this search engine. We conduct various experiments to illustrate performance results obtained on a grid infrastructure and justify the use of the search strategy employed in SE4SEE.

20.
This research is part of an ongoing study to better understand citation analysis on the Web. It builds on Kleinberg's research (J. Kleinberg, R. Kumar, P. Raghavan, P. Rajagopalan, A. Tomkins, Invited survey at the International Conference on Combinatorics and Computing, 1999) that hyperlinks between web pages constitute a web graph structure, and tries to classify different web graphs in a new coordinate space: out-degree, in-degree. The out-degree coordinate is defined as the number of outgoing links from a given web page; the in-degree coordinate is the number of web pages that point to a given web page. In this new coordinate space a metric is built to measure how close or far apart different web graphs are. Kleinberg's web algorithm (J. Kleinberg, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998, pp. 668–677) for discovering "hub" and "authority" web pages is applied in this new coordinate space. Some very uncommon phenomena have been discovered and interesting new results interpreted. This study does not look at enhancing web retrieval by adding context information; it only considers web hyperlinks as a source for analyzing citations on the web. The author believes that understanding the underlying web as a graph will help design better web algorithms, enhance retrieval and web performance, and recommends using graphs as part of a visual aid for search engine designers.
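The HITS algorithm referenced here iterates two mutually reinforcing scores: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to, with renormalization each round. A minimal sketch (the toy graph is an illustrative assumption):

```python
import math

# Minimal HITS (hubs and authorities): authority(p) = sum of hub scores
# of pages linking to p; hub(p) = sum of authority scores of pages p
# links to; both renormalized each iteration. Toy graph is assumed.
def hits(graph, iterations=50):
    pages = list(graph)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in pages if p in graph[q]) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values()))
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in graph[p]) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values()))
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": [], "D": ["B", "C"]}
hub, auth = hits(toy_graph)
print({p: round(v, 3) for p, v in auth.items()})   # C, B are authorities
print({p: round(v, 3) for p, v in hub.items()})    # A, D are hubs
```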

