Welcome to the website of the editorial office of the Journal of Library Science in China (《中国图书馆学报》)!

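Article Abstract

Ouyang Jian (欧阳剑). Visual Analysis and Exploration of Ancient Texts for Digital Humanities Research [J]. Journal of Library Science in China, 2016, 42(2): 66-80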

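Chinese title: 面向数字人文研究的大规模古籍文本可视化分析与挖掘 (literally, "Visual Analysis and Mining of Large-Scale Ancient Book Texts for Digital Humanities Research")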

Visual Analysis and Exploration of Ancient Texts for Digital Humanities Research

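Submitted: 2015-10-16    Revised: 2015-11-19

DOI: 10.13530/j.cnki.jlis.160011

Chinese keywords: 数字人文 (digital humanities); 文本可视化 (text visualization); 数据挖掘 (data mining); 古籍文献 (ancient literature)

English keywords: Digital humanities; Text visualization; Data mining; Ancient literature

Funding:

Author	Affiliation	E-mail
Ouyang Jian (欧阳剑)	Computational Linguistics, Institute of Linguistics, Shanghai Normal University, Shanghai 200234	oyjjj@163.com

Abstract views: 4741

Full-text downloads: 2322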

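Chinese abstract (translated):

Traditional modes of developing and using ancient books can no longer meet the needs of humanities research, and humanities scholars look forward to a digital humanities research paradigm that couples technical logic with humanistic logic. Starting from the in-depth development and use of ancient literature, this paper applies new information technology and the interdisciplinary methods of digital humanities research to a large-scale corpus of Chinese ancient texts. Adopting a big data approach, it collates, annotates, and automatically segments the texts. With word-frequency statistics as the core of the study, it uses analysis and mining methods such as data denoising, statistical computation over time-window units, and sliding-window prediction, together with real-time big data analysis techniques, to provide real-time, online, multidimensional, visual, and quantitative analysis of the historical frequency distribution of characters and words. The result is a real-time statistical analysis platform for ancient books, aimed mainly at research in linguistics, historical philology, historical geography, and other humanities disciplines. The platform can help researchers discover new patterns, phenomena, and trends in large volumes of ancient literature, and it represents a first attempt at innovating the mode of developing and using ancient books. 11 figures. 36 references.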
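The abstract does not say how its sliding-window prediction is computed. One common reading is a moving-average forecast over a series of per-window word frequencies, and the minimal sketch below assumes that reading. The function name `sliding_window_forecast` and its parameters are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def sliding_window_forecast(series: Sequence[float], width: int) -> List[float]:
    """Forecast each point as the mean of the `width` points before it.

    The result holds one forecast for every index from `width` up to
    len(series) inclusive, so the last value is a forecast for the next,
    not yet observed, time window.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    if len(series) < width:
        return []
    window_sum = sum(series[:width])
    forecasts = [window_sum / width]
    for i in range(width, len(series)):
        # Slide the window one step: add the newest value, drop the oldest.
        window_sum += series[i] - series[i - width]
        forecasts.append(window_sum / width)
    return forecasts
```

A running sum keeps each step constant-time, which matters when the series is recomputed for every interactive query.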

English abstract:

Digital humanities, a new research paradigm, offers the traditional humanities and social sciences a new way of working, because traditional modes of developing and using ancient literature resources no longer meet the needs of humanities research. This paper pursues the in-depth development and use of ancient literature resources, applying new information technology and digital humanities methods to ancient Chinese literature in order to construct a new platform for real-time statistical text analysis in linguistics, historical philology, historical geography, and related fields.
    This study adopts a big data approach, collating and annotating Chinese ancient texts to build a corpus of more than 40,000 titles. It segments the texts by means of piecewise dictionary superposition and a Bigram model, and applies the Grubbs method to denoise the data and eliminate as much problematic data as possible. With word-frequency statistics on the ancient corpus as the research focus, we analyze word frequency by computing over time-window units, and apply in-memory real-time computing to overcome the bottleneck of reading big data. The results are displayed along a time axis as a micro-level scatter plot and a macro-level curve graph. Taking the authors of the ancient books as the organizing thread, we use geographic information system (GIS) technology to integrate and display the digitized ancient books, and use retrieval of the ancient literature as a clue to show the geographical distribution of the authors. The study improves the efficiency of real-time queries and visualizes word frequency by year as scatter plots and curve graphs. A statistical and analytical platform for ancient literature and documents in linguistics, history, and historical geography will be established on the basis of these new methods and patterns.
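As a rough illustration of the time-window statistics and Grubbs denoising named above, the sketch below buckets dated documents into fixed-width windows, computes the relative frequency of a word per window, and flags the most extreme value with a single Grubbs test. It is a sketch under stated assumptions, not the platform's implementation: the names `window_frequencies` and `grubbs_outlier`, the `(year, tokens)` input shape, and the caller-supplied critical value `g_crit` (to be looked up for the sample size and significance level in use) are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def window_frequencies(
    docs: Iterable[Tuple[int, List[str]]], word: str, window: int
) -> Dict[int, float]:
    """Relative frequency of `word` per time window.

    `docs` yields (year, tokens) pairs (an assumed input shape); each
    window is keyed by its start year.
    """
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for year, tokens in docs:
        start = (year // window) * window  # floor division also handles BCE years
        hits[start] += tokens.count(word)
        totals[start] += len(tokens)
    return {s: hits[s] / totals[s] for s in sorted(totals) if totals[s]}


def grubbs_outlier(values: List[float], g_crit: float) -> Optional[int]:
    """Return the index of the most extreme value if it fails one Grubbs test.

    `g_crit` is the critical value chosen by the caller; None means no
    value was flagged.
    """
    n = len(values)
    if n < 3:
        return None
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return None
    idx = max(range(n), key=lambda i: abs(values[i] - mean))
    g = abs(values[idx] - mean) / sd  # Grubbs statistic for the extreme point
    return idx if g > g_crit else None
```

To remove several outliers, the test would be repeated after each removal with a critical value recomputed for the smaller sample.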
    The study not only extends the research paradigms and methods of the humanities but also enriches their research tools. It broadens the dimensions along which ancient literature and texts can be used and developed, and expands the scope of humanities materials. The platform has broad application prospects in linguistics, history, and historical geography.
    This research is a new attempt at the in-depth development and use of ancient texts and documents by means of digital humanities in a big data setting. First, it builds a large-scale corpus of more than 40,000 ancient titles. Second, it uses statistical methods together with a superposition of word segmentation methods to segment the ancient texts. Finally, with the help of big data techniques, it improves the efficiency of real-time queries and visualizes word frequency by year as scatter plots and curve graphs, which gives a direct visual display of the analysis results.
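The abstract does not detail how dictionary matching and the Bigram model are combined for segmentation. The sketch below assumes one plausible arrangement: dictionary entries (plus single characters as a fallback) form the candidate words, and a bigram model with add-one smoothing picks the best-scoring path by dynamic programming. The function `segment`, its parameters, and the count tables are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple


def segment(
    text: str,
    lexicon: Set[str],
    bigram_counts: Dict[Tuple[str, str], int],
    unigram_counts: Dict[str, int],
    max_len: int = 4,
) -> List[str]:
    """Segment `text` using lexicon candidates scored by a smoothed bigram model."""
    if not text:
        return []
    vocab = len(unigram_counts) + 1

    def logp(prev: str, word: str) -> float:
        # Add-one smoothing keeps unseen bigrams at a small nonzero probability.
        num = bigram_counts.get((prev, word), 0) + 1
        den = unigram_counts.get(prev, 0) + vocab
        return math.log(num / den)

    # best[i] maps the last word of a segmentation of text[:i] to (score, path).
    best: List[Dict[str, Tuple[float, List[str]]]] = [
        {} for _ in range(len(text) + 1)
    ]
    best[0]["<s>"] = (0.0, [])  # "<s>" is an assumed sentence-start marker
    for i in range(len(text)):
        for prev, (score, path) in best[i].items():
            for j in range(i + 1, min(len(text), i + max_len) + 1):
                word = text[i:j]
                # Multi-character candidates must be in the lexicon;
                # single characters are always allowed as a fallback.
                if j - i > 1 and word not in lexicon:
                    continue
                cand = score + logp(prev, word)
                if word not in best[j] or cand > best[j][word][0]:
                    best[j][word] = (cand, path + [word])
    return max(best[len(text)].values(), key=lambda sp: sp[0])[1]
```

Several dictionaries can be superposed in this arrangement by passing the union of their entries as `lexicon`.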
    Because the vocabulary database is insufficient, the accuracy of word segmentation needs to be improved. In addition, to improve the quality of the corpus, the edition and author information of the ancient books requires verification. The extraction of entities from the corpus, such as persons, historical events, places, titles, and names, needs further development. 11 figs. 36 refs.
