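Research on Word Segmentation and Entity Recognition in Chinese Electronic Medical Records (中文电子病历的分词及实体识别研究)
Cite this article: Wang Ruojia (王若佳), Zhao Changyu (赵常煜), Wang Jimin (王继民). Research on Word Segmentation and Entity Recognition in Chinese Electronic Medical Records [J]. Library and Information Service (图书情报工作), 2019, 63(2): 34-42.
Authors: Wang Ruojia (王若佳)  Zhao Changyu (赵常煜)  Wang Jimin (王继民)
Affiliations: 1. Department of Information Management, Peking University, Beijing 100871; 2. Institute of Ocean Research, Peking University, Beijing 100871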
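Abstract: [Purpose/significance] Healthcare big data is an important basic strategic resource for China. This study examines and empirically tests word segmentation and entity recognition for Chinese electronic medical records, and it accomplishes the information extraction task for medical data reasonably well, which matters for future semantic-level applications of healthcare big data. [Method/process] The study first builds a medical vocabulary on the order of 100,000 terms by merging authoritative thesauri, official standards, health website data, and other supplementary medical lexicons. It then segments the fields of electronic medical records and compares the segmentation performance of four models: the jieba tool, jieba with the dictionary imported, unsupervised learning, and the AC automaton. Finally, using the automatic segmentation and manual annotation results as the corpus, it implements entity recognition for electronic medical records based on conditional random fields, compares recognition performance across entity categories and text features, and selects the best feature template. [Result/conclusion] The segmentation results show that the AC automaton performs best, with an F-measure of up to 82%. The entity recognition results show that "test" and "disease" entities are recognized best, while recognition of "symptom" entities is less satisfactory.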

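Keywords: electronic medical record  Chinese word segmentation  entity recognition  healthcare big data  AC automaton  conditional random field
Received: 2018-07-16
Revised: 2018-09-15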
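
The abstract does not say how the AC automaton's dictionary matches are turned into a segmentation, so the following is only a minimal sketch under stated assumptions: an Aho-Corasick automaton is built over a small vocabulary, the longest dictionary word ending at each character is recorded, and tokens are then chosen by backward maximum matching, with unmatched characters emitted one at a time. The function names `build_automaton` and `segment` and the four sample terms are hypothetical and stand in for the roughly 100,000-term medical vocabulary.

```python
from collections import deque


def build_automaton(words):
    """Build an Aho-Corasick automaton over the given dictionary words."""
    # Parallel lists indexed by node id: child maps, failure links, and the
    # length of the longest dictionary word that ends at each node.
    goto, fail, longest = [{}], [0], [0]
    for word in words:
        node = 0
        for ch in word:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                longest.append(0)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        longest[node] = len(word)
    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # A word ending at the failure target also ends here.
            longest[child] = max(longest[child], longest[fail[child]])
            queue.append(child)
    return goto, fail, longest


def segment(text, automaton):
    """Segment text by backward maximum matching over automaton hits."""
    goto, fail, longest = automaton
    match_len = []  # longest dictionary word ending at each character
    node = 0
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        match_len.append(longest[node])
    tokens, end = [], len(text)
    while end > 0:
        size = match_len[end - 1] or 1  # fall back to a single character
        tokens.append(text[end - size:end])
        end -= size
    return tokens[::-1]


# Hypothetical sample vocabulary, not taken from the paper.
vocab = ["高血压", "血压", "糖尿病", "头痛"]
print(segment("患者有高血压和糖尿病", build_automaton(vocab)))
# ['患', '者', '有', '高血压', '和', '糖尿病']
```

The automaton scans the text once regardless of vocabulary size, which is one plausible reason to prefer it for a large word list; other ways of resolving overlapping matches are equally possible.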

Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record
Wang Ruojia, Cho Sang, Wang Jimin. Healthcare Data Mining: Word Segmentation and Named Entity Recognition in Chinese Electronic Medical Record [J]. Library and Information Service, 2019, 63(2): 34-42.
Authors:Wang Ruojia  Cho Sang  Wang Jimin
Institution: 1. Department of Information Management, Peking University, Beijing 100871; 2. Institute of Ocean Research, Peking University, Beijing 100871
Abstract: [Purpose/significance] Healthcare big data is an important basic strategic resource in China. Word segmentation and entity recognition for Chinese electronic medical records (EMR) help extract important information from large volumes of unstructured text. [Method/process] In this study, a Chinese medical thesaurus is first built from authoritative medical subject headings, official standards, and health website data. The performance of four segmentation methods is then compared on a corpus combining automatic segmentation and manual annotation. Finally, a CRF model is used to identify five entity types: disease, symptom, test, drug, and treatment. [Result/conclusion] Results show that (i) the AC automaton model has the best F-measure in EMR word segmentation, at 82%; (ii) medical entities are harder to identify in traditional Chinese medicine records than in Western medical records. In addition, "Test" and "Disease" entities have the better F-measures, while the F-measure for the "Symptom" entity is less satisfactory.
Keywords:healthcare data mining  electronic medical record  Chinese word segmentation  named entity recognition  AC automaton  conditional random field  
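
The abstract reports segmentation quality as an F-measure without giving the formula. A common convention, assumed here and not confirmed by the source, is to count a predicted word as correct only when its character span exactly matches a word in the reference segmentation, and to take the harmonic mean of precision and recall. The helper names `spans` and `segmentation_prf` and the sample sentence are hypothetical.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    result, start = set(), 0
    for token in tokens:
        result.add((start, start + len(token)))
        start += len(token)
    return result


def segmentation_prf(gold_tokens, pred_tokens):
    """Return (precision, recall, F1) using exact span matching."""
    gold, pred = spans(gold_tokens), spans(pred_tokens)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction splits "患者" into two characters.
gold = ["患者", "有", "高血压"]
pred = ["患", "者", "有", "高血压"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")  # P=0.500 R=0.667 F=0.571
```

Under this convention a single wrong boundary costs both precision and recall, since it invalidates the reference word and introduces spurious predicted words.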