Similar literature
 19 similar articles found
1.
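[Objective] To discuss the application of Web information extraction technology in news public-opinion analysis, providing new methods for identifying false information and guiding public opinion, and thereby avoiding adverse effects on the public's thinking and views. [Methods] The study proposes two methods for extracting the body text of news pages, one based on a line-block distribution function and one based on statistics and page structure, so that, building on the collection and storage of Web news data, body-text extraction becomes more efficient and accurate. [Results] Both Web information extraction techniques can be widely applied in scenarios such as large-scale news data analysis and public-opinion monitoring. [Conclusions] The method based on the line-block distribution function performs better on lightweight pages, and the method based on statistical information and page structure performs better on high-traffic pages.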
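
The abstract above does not spell out its line-block distribution function, so the following Python sketch only illustrates one common form of the idea, under stated assumptions: tags are stripped with regular expressions, text length is summed over a sliding window of lines, and the body is taken as the run of windows around the densest one. The function names, `block_size`, and `threshold` are hypothetical choices, not values from the paper.

```python
import re


def strip_tags(html: str) -> list[str]:
    """Remove script/style bodies and all tags, keeping the original line breaks."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"<[^>]+>", "", html)
    return [line.strip() for line in text.split("\n")]


def block_lengths(lines: list[str], block_size: int = 3) -> list[int]:
    """Character count (ignoring whitespace) of every window of `block_size` lines."""
    sizes = [len(re.sub(r"\s+", "", line)) for line in lines]
    return [sum(sizes[i:i + block_size])
            for i in range(max(len(sizes) - block_size + 1, 0))]


def extract_body(html: str, block_size: int = 3, threshold: int = 40) -> str:
    """Return the text around the densest line block (assumed to be the article body)."""
    lines = strip_tags(html)
    blocks = block_lengths(lines, block_size)
    if not blocks or max(blocks) < threshold:
        return ""
    # Start at the densest window, then grow in both directions
    # while the neighbouring windows stay above the threshold.
    start = end = blocks.index(max(blocks))
    while start > 0 and blocks[start - 1] >= threshold:
        start -= 1
    while end < len(blocks) - 1 and blocks[end + 1] >= threshold:
        end += 1
    body = lines[start:end + block_size]
    return "\n".join(line for line in body if line)
```

Short navigation and footer lines contribute little to any window, so they fall below the threshold and are left out; a real page would likely need a tuned threshold and a proper HTML parser.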

2.
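To address clickbait news on the Web, an algorithm for automatically identifying clickbait news is proposed. The news title and body are extracted by analyzing how news pages are composed; a sentence-level topic-sentence extraction algorithm is built on a sentence relation matrix; and the judgment is made from the computed sentence similarity. Experiments show that the method reaches an identification accuracy of 80% and is effective.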
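
The abstract above names a sentence relation matrix and a sentence similarity but gives no formulas. As a rough sketch only, the Python below assumes a Jaccard similarity over character bigrams, picks as topic sentence the sentence most similar to all the others, and flags a title whose similarity to that sentence falls below a threshold. The similarity measure, the helper names, and the `threshold` value are all assumptions, not the paper's method.

```python
import re


def bigrams(sentence: str) -> set[str]:
    """Character bigrams over letters and digits (works for Chinese without a segmenter)."""
    chars = [c for c in sentence if c.isalnum()]
    return {"".join(chars[i:i + 2]) for i in range(len(chars) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two sentences' bigram sets."""
    set_a, set_b = bigrams(a), bigrams(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def topic_sentence(body: str) -> str:
    """Pick the sentence with the largest total similarity to the other sentences."""
    sentences = [s.strip() for s in re.split(r"[。!?!?.]+", body) if s.strip()]
    if not sentences:
        return ""
    # Sentence relation matrix: pairwise similarity of every sentence pair.
    matrix = [[jaccard(a, b) for b in sentences] for a in sentences]
    scores = [sum(row) - row[i] for i, row in enumerate(matrix)]
    return sentences[scores.index(max(scores))]


def looks_like_clickbait(title: str, body: str, threshold: float = 0.1) -> bool:
    """Flag a title that shares almost nothing with the body's topic sentence."""
    return jaccard(title, topic_sentence(body)) < threshold
```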

3.
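Information extraction is an important way to obtain valuable information from massive numbers of Web pages, and judging the topical relevance of a target page's content is key to improving extraction efficiency and accuracy. Current relevance judgments rely mainly on manual screening and document training, which suffer from low efficiency and repeated training. This paper introduces a topic description model for the extraction task and uses it to judge the topical relevance of page content. Starting from the task's topic description model, the method computes the weighted frequency of the model's keywords based on markup information, giving a quantified representation of the page content, and then analyzes how the keywords' weighted frequencies vary with respect to the topic description model to judge topical relevance. Finally, comparative experiments on defense-product information extraction show that the method greatly improves the efficiency and accuracy of Web information extraction.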
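
To make the idea of a markup-weighted keyword frequency concrete, here is a minimal Python sketch using the standard library's `html.parser`. The weight table, the function names, and the simple sum-and-threshold relevance test are hypothetical; the abstract does not give the paper's actual weights or decision rule.

```python
from html.parser import HTMLParser

# Hypothetical weights: text inside these tags counts more than plain body text.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5}


class WeightedTextCollector(HTMLParser):
    """Collect (text, weight) pairs, weighting text by its heaviest enclosing tag."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[str] = []
        self.segments: list[tuple[str, float]] = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack), default=1.0)
        self.segments.append((data, weight))


def weighted_frequencies(html: str, keywords: list[str]) -> dict[str, float]:
    """Sum weight * occurrence count for each keyword of the topic description."""
    parser = WeightedTextCollector()
    parser.feed(html)
    parser.close()
    freqs = {k: 0.0 for k in keywords}
    for text, weight in parser.segments:
        lowered = text.lower()
        for k in keywords:
            freqs[k] += weight * lowered.count(k.lower())
    return freqs


def is_relevant(html: str, keywords: list[str], threshold: float = 3.0) -> bool:
    """Assumed decision rule: total weighted frequency must reach the threshold."""
    return sum(weighted_frequencies(html, keywords).values()) >= threshold
```

A keyword in the page title therefore counts several times as much as the same keyword in a paragraph, which is the intuition the abstract describes.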

4.
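A search engine for online public opinion differs from general Web information search: its final results require going into sites and pages to collect and extract valid data, which raises many new research questions and methods for the information science community. Based on an analysis of the two approaches to Web information extraction (template-based and page-analysis-based) and of extraction methods based on natural language processing, wrapper induction, and ontologies, the paper adopts wrapper induction with an expert mode in the rule-generation module and designs a sample-learning-based news extraction method. Extraction rules are drafted and revised by manually analyzing page source code, and information is then extracted automatically according to those rules, improving the precision and quality of the public-opinion search engine.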

5.
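Topic extraction is one of the important problems in natural language processing research. The prevailing approach is "dictionary matching", but when it is applied to dynamically changing Web content, drawbacks such as the difficulty of keeping the dictionary up to date become apparent. Based on a study of the content and structural characteristics of Chinese news pages, the authors propose a dictionary-free topic extraction algorithm that exploits Web page structure. Topic extraction experiments on a corpus of 1,000 financial news articles from Xinhuanet (新华网), compared against manually extracted topics, show an overlap rate above 93%.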

6.
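To address insufficiently precise extraction of topic information from Web pages, a new way of defining and quantifying topic information is proposed: topic information is divided into three forms, and each form is quantified with a different method. Building on this idea and combining the DOM specification with block partitioning, an IB-DOM tree is proposed on top of the DOM tree. A divide-and-conquer strategy first locates the regions that contain topic information and then filters out noise. Experiments show that the method resolves reasonably well the tension between completeness and accuracy in automatic topic information extraction.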

7.
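Zhang Yan. 《图书情报工作》 (Library and Information Service), 2010, 54(14): 107-130
An RSS-level method and system for extracting the topic content of Web pages is proposed. A topic-content template is trained from a small number of entries in an RSS feed, and the template can then be used to extract topic content from all pages under that feed. The method supports extracting a page's title, body, category, and other information separately; it also has an adaptive mechanism that detects template changes in real time. Experimental results show that the method and system achieve high recall and precision.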

8.
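This paper focuses on the application and implementation of an edit-distance-based Web page similarity algorithm in a Web extraction system. By combining URL-based and edit-distance-based measures of page structure similarity, the extraction system can detect changes in page structure during extraction and decide on its own whether to select a suitable existing rule or to extend the rule base automatically through active learning. Structural similarity computation gives the system the ability to perceive changes in page structure; by actively updating and adjusting itself, the system adapts better to acquiring heterogeneous resources in practical applications. The feasibility and efficiency of the algorithm are verified in a prototype system.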
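
One plausible reading of an edit-distance-based structure similarity, sketched here in Python as an assumption rather than the paper's exact algorithm: reduce each page to its sequence of opening tags, compute the Levenshtein distance between the two sequences, and normalize by the longer length. The URL-based component mentioned in the abstract is not modeled, and all function names are hypothetical.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record the names of opening tags in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html: str) -> list[str]:
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


def structure_similarity(html_a: str, html_b: str) -> float:
    """1.0 for identical tag sequences, approaching 0.0 as they diverge."""
    seq_a, seq_b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(seq_a, seq_b) / longest
```

An extraction system could compare each incoming page against the sample page a rule was learned from and treat a low score as a sign that the template has changed; the cut-off for "low" would have to be chosen empirically.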

9.
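Automatic extraction of Web information based on GATE semantic annotation   Total citations: 1, self-citations: 0, citations by others: 1
This paper studies how to implement automatic Web information extraction based on semantically annotated samples. Using the natural language processing framework GATE, a domain ontology is first introduced to semantically annotate the content of sample pages, precisely locating the semantic items to be extracted, and each sample page is parsed into an S DOM tree accordingly. Feature descriptions of the semantic items are extracted from the S DOM tree to form sample instances, and a machine learning algorithm induces extraction rules from them to generate a wrapper automatically. During extraction, the system compares the similarity of page structures to perceive page changes and actively learns and extends its rule base. Experimental results show that, because precise localization safeguards the quality of the learning samples, a wrapper generated from a small sample can achieve fairly good recall and precision.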

10.
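A general method for extracting topic information from HTML pages   Total citations: 9, self-citations: 0, citations by others: 9
Using the DOM specification, HTML pages are represented as tree structures. The extraction of "topic" information from HTML pages built on different templates is studied and analyzed, and a new method for judging the topical relevance of nodes is proposed. The method identifies the topic content to be extracted and deletes irrelevant content, and the output is an HTML document containing only the topic information.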

11.
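Title-based automatic classification of Chinese news Web pages   Total citations: 1, self-citations: 0, citations by others: 1
Borrowing the idea of tf-idf weighting, news titles are used as the basis for automatically classifying Chinese news pages. A title-based method for automatic classification of Chinese news is constructed, and several experiments are designed to evaluate various title-based classification methods. The results show that title-based classification of Chinese news pages greatly shortens processing time and saves storage space while maintaining fairly high accuracy; the improved category-weighting method gives the best classification results.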
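
The abstract above does not describe its classifiers in detail, so the Python sketch below substitutes a plain nearest-profile classifier to show how tf-idf weights over title tokens could drive classification. Character bigrams stand in for a real Chinese word segmenter, the smoothed idf formula is one common variant, and every name here is hypothetical; this is not the paper's improved category-weighting method.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(title: str) -> list[str]:
    """Tokenize a title into character bigrams (assumed stand-in for word segmentation)."""
    chars = [c for c in title if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)] or chars


def train(labelled_titles):
    """Build idf weights and one summed tf-idf profile per category."""
    docs = [(char_bigrams(title), label) for title, label in labelled_titles]
    doc_freq = Counter()
    for tokens, _ in docs:
        doc_freq.update(set(tokens))
    n_docs = len(docs)
    idf = {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    profiles = defaultdict(Counter)
    for tokens, label in docs:
        for tok, tf in Counter(tokens).items():
            profiles[label][tok] += tf * idf[tok]
    return idf, dict(profiles)


def classify(title: str, idf, profiles) -> str:
    """Return the category whose profile has the highest cosine similarity to the title."""
    vec = {tok: tf * idf.get(tok, 0.0) for tok, tf in Counter(char_bigrams(title)).items()}
    vec_norm = math.sqrt(sum(w * w for w in vec.values()))

    def cosine(profile) -> float:
        dot = sum(w * profile.get(tok, 0.0) for tok, w in vec.items())
        norm = vec_norm * math.sqrt(sum(w * w for w in profile.values()))
        return dot / norm if norm else 0.0

    return max(profiles, key=lambda label: cosine(profiles[label]))
```

Because only the title is tokenized and stored, each document contributes a handful of tokens, which is consistent with the time and storage savings the abstract reports. Note that a title sharing no token with any category falls back to the first category in `profiles`.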

12.
Researchers believe that the Web functions to supplement traditional news media. Little is known, however, about how traditional news media consumption influences Web use patterns. This study investigates how prior TV news exposure influences individuals' subsequent Web use by testing 3 theories that may explain individuals' information selection patterns: accessibility, instrumental utility, and personal issue importance. The results of this study reveal the strong effects of personal issue importance when selecting information on the Web, regardless of news coverage in traditional media. The findings also indicate higher levels of information selection when there is no prior exposure to news coverage.

13.
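A suffix-tree-based algorithm for identifying duplicate Chinese news Web pages   Total citations: 1, self-citations: 0, citations by others: 1
To address the shortcomings of traditional methods for identifying duplicate Chinese news pages, an identification algorithm is constructed that uses the suffix tree as its basic data structure and draws on the title and time characteristics of news pages. The algorithm builds on Ukkonen's algorithm and the Matching Statistics algorithm and optimizes their concrete implementation. Experimental results show that the algorithm is effective and also offers insight into computing string similarity.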

14.
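Design and implementation of an RSS-based Web news topic aggregation system   Total citations: 5, self-citations: 0, citations by others: 5
RSS-based aggregation of Web news topics is an emerging and practically valuable direction in information processing. The paper analyzes the basic problems of Web news topic aggregation, identifies the difficulties and corresponding solutions, and on that basis designs a Web news topic aggregation system.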

15.
Media convergence is happening around the world. This study looks at the current operation of a cable news station that produces 2 media products in 1 newsroom. It also explores the theoretical foundations of value creation in online news by examining how online news is selected, packaged, processed, and distributed. Observational results showed that media convergence still has a long way to go. More important, this study found several divides between the Web people and the news people, between the managers and the reporters, and between the news department and the advertising department. This article suggests that convergence would go more smoothly if stations would integrate Web producers into the newsroom; if reporters were given incentive to do extra work or if their daily work load were adjusted to give them time to file for the Web; and if the sales people better understood the value of the online product.

16.
As news organizations look toward social networking sites as a way to expand their audience, the present article explores how this trend might impact discussion among users of political news content. A content analysis of user comments left by readers of the Washington Post suggests that when it comes to discussing political news, there are significant differences in the deliberative quality of those who access the news directly through the news organization's Web site and those who access the same news via Facebook. In short, comments left by Web site users exhibited greater deliberative quality than those left by Facebook users.

17.
This study takes a network approach to examining international communication. Building upon the world system theory and the preferential attachment network theorem, the structure of the international network created by news media is examined. The use of external hyperlinks in 6,298 foreign stories in 20 languages from 223 news Web sites in 73 countries was examined. Findings revealed that information continues to flow from a handful of countries to the rest of the world. News media preferred linking to established information sources, typically in core countries. This study concludes that news media use new technology to replicate old practices.

18.
The Internet continues to grow as an information and entertainment medium. Internet growth has implications for the news industry. Twenty-four hour news networks such as CNN and MSNBC regularly encourage viewers of their television programs to visit their Web sites. While visiting news Web sites, visitors are invited to participate in opinion polls. Unfortunately, these online opinion polls are not scientific and have little real news value. In spite of these limitations, news Web sites' Internet polls are often treated as serious topics in broadcast news discussions. This article examines media organizations' Internet online polls and critiques them as instances of symbolic representation and pseudo-events that have arisen largely out of the integration of print, broadcast, and Internet media.

19.
Considering radio as a social system for the production of culture and communication, and based on an overview of the Greek case, this article suggests a model for studying the potential of the Web casting radio compared with the traditional radio in various media environments. The model suggested includes eight dimensions: institutional framework, market structures and business models, content diversity, audience profile, interactivity, sociability, relations with the recording industry, and relations with major news media and organizations. The analysis shows that a complex approach is needed to explore the chances for the potential of the Web casting radio to be realized.
