Research on Word Segmentation and Entity Recognition in Chinese Electronic Medical Records

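Cite this article:	Wang Ruojia, Zhao Changyu, Wang Jimin. Research on Word Segmentation and Entity Recognition in Chinese Electronic Medical Records [J]. Library and Information Service (图书情报工作), 2019, 63(2): 34-42.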

Authors:	Wang Ruojia (王若佳), Zhao Changyu (赵常煜), Wang Jimin (王继民)

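Institution:	1. Department of Information Management, Peking University, Beijing 100871; 2. Institute of Ocean Research, Peking University, Beijing 100871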

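Abstract:	[Purpose/Significance] Healthcare big data is an important basic strategic resource for China. This study's discussion and empirical evaluation of word segmentation and entity recognition for Chinese electronic medical records accomplishes the information extraction task on medical data reasonably well, and it matters for future semantic-level applications of healthcare big data. [Method/Process] The study first builds a medical vocabulary on the order of 100,000 terms by merging authoritative thesauri, official standards, health website data, and other supplementary medical lexicons. It then segments the fields of electronic medical records and compares the segmentation performance of four models: the jieba tool, jieba with an imported dictionary, unsupervised learning, and the AC automaton. Finally, using the automatic segmentation and manual annotation results as the corpus, it implements entity recognition for electronic medical records based on conditional random fields (CRF), compares recognition performance across entity categories and text features, and selects the best feature template. [Result/Conclusion] The segmentation results show that the AC automaton performs best, with an F-measure of up to 82%. The entity recognition results show that "Test" and "Disease" entities are recognized best, while recognition of "Symptom" entities is less satisfactory.
Keywords:	electronic medical record; Chinese word segmentation; entity recognition; healthcare big data; AC automaton; conditional random field
Received:	2018-07-16
Revised:	2018-09-15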
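The dictionary-driven segmentation described in the abstract can be illustrated with a minimal Aho-Corasick (AC) automaton sketch. This is not the authors' implementation: the class name `ACAutomaton`, the function `segment`, the tiny four-word vocabulary, and the rule for resolving overlapping matches (take the longest dictionary word starting at each position, otherwise emit a single character) are all assumptions made for illustration. The abstract does not say how overlaps were resolved or how the 100,000-term vocabulary was loaded.

```python
from collections import deque


class ACAutomaton:
    """Minimal Aho-Corasick automaton over characters (illustrative only)."""

    def __init__(self, words):
        self.children = [{}]   # children[node] maps a character to a node id
        self.fail = [0]        # failure link of each node
        self.word_len = [0]    # length of the word ending at this node, else 0
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            nxt = self.children[node].get(ch)
            if nxt is None:
                nxt = len(self.children)
                self.children.append({})
                self.fail.append(0)
                self.word_len.append(0)
                self.children[node][ch] = nxt
            node = nxt
        self.word_len[node] = len(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.children[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.children[node].items():
                f = self.fail[node]
                while f and ch not in self.children[f]:
                    f = self.fail[f]
                self.fail[child] = self.children[f].get(ch, 0)
                queue.append(child)

    def find_all(self, text):
        """Return (start, end) spans of every dictionary word found in text."""
        spans = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.children[node]:
                node = self.fail[node]
            node = self.children[node].get(ch, 0)
            f = node
            while f:  # follow failure links to report shorter suffix matches
                if self.word_len[f]:
                    spans.append((i + 1 - self.word_len[f], i + 1))
                f = self.fail[f]
        return spans


def segment(text, automaton):
    """Assumed overlap rule: longest dictionary word at each position."""
    longest_end = {}
    for start, end in automaton.find_all(text):
        if end > longest_end.get(start, start):
            longest_end[start] = end
    tokens, i = [], 0
    while i < len(text):
        j = longest_end.get(i, i + 1)  # fall back to a single character
        tokens.append(text[i:j])
        i = j
    return tokens


# Hypothetical vocabulary: hypertension, blood pressure, diabetes, headache.
vocab = ["高血压", "血压", "糖尿病", "头痛"]
print(segment("患者高血压伴头痛", ACAutomaton(vocab)))
```

The automaton finds every dictionary hit in one pass over the text, which is what makes it attractive for a large vocabulary. Characters not covered by any dictionary word are emitted one at a time in this sketch, so "患者" (patient) stays split unless it is added to the vocabulary.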
Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record

Wang Ruojia, Cho Sang, Wang Jimin. Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record [J]. Library and Information Service, 2019, 63(2): 34-42.

Authors:	Wang Ruojia, Cho Sang, Wang Jimin

Institution:	1. Department of Information Management, Peking University, Beijing 100871; 2. Institute of Ocean Research, Peking University, Beijing 100871

Abstract:	[Purpose/significance] Healthcare big data is an important basic strategic resource in China. Word segmentation and entity recognition for Chinese electronic medical records (EMRs) help extract important information from large volumes of unstructured text. [Method/process] This study first builds a Chinese medical thesaurus from authoritative medical subject headings, official standards, and health website data. It then compares the performance of four segmentation methods on a corpus of automatic segmentation and manual annotation. Finally, a CRF model is used to identify five entity types: disease, symptom, test, drug, and treatment. [Result/conclusion] Results show that (i) the AC automaton model achieves the best F-measure in EMR word segmentation, at 82%; (ii) medical entities are harder to identify in traditional Chinese medicine records than in Western medical records. In addition, "Test" and "Disease" entities achieve better F-measures, while the F-measure for "Symptom" entities is less satisfactory.

Keywords:	healthcare data mining; electronic medical record; Chinese word segmentation; named entity recognition; AC automaton; conditional random field
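The F-measure figures quoted in the abstract can be made concrete with a short sketch of how segmentation output is commonly scored against a manually segmented reference. The abstract does not state which scoring variant the authors used, so this assumes the balanced F1 over exact token spans; the function names `to_spans` and `segmentation_prf` and the sample sentence are hypothetical.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for tok in tokens:
        spans.add((start, start + len(tok)))
        start += len(tok)
    return spans


def segmentation_prf(gold_tokens, pred_tokens):
    """Precision, recall and F1, counting a token as correct only if its
    span matches a reference span exactly (an assumed scoring scheme)."""
    gold, pred = to_spans(gold_tokens), to_spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical reference and system output for the same sentence.
gold = ["患者", "高血压", "伴", "头痛"]
pred = ["患", "者", "高血压", "伴", "头痛"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this toy case three of the five predicted tokens match the four reference tokens, so precision is 3/5, recall is 3/4, and F1 is 2/3. The same precision/recall/F1 arithmetic applies to entity recognition if spans are paired with their entity labels.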
This article is indexed in databases including VIP (维普).