Similar literature
 Found 18 similar documents; search took 343 ms
1.
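Chinese word segmentation is an unavoidable first step in intelligent knowledge mining from geological big data. Statistical segmentation methods depend on their training corpus and adapt poorly across domains. Dictionary-based methods can segment directly with a domain dictionary but cannot recognize out-of-vocabulary (OOV) words. To improve segmentation accuracy and the OOV recognition rate for geological text when domain corpora are scarce, a statistics-based method for recognizing Chinese geological terms is proposed. The method builds a basic geological dictionary based on the idea of prime strings (质串), which improves the adaptability of statistical segmentation to geological text. Repeated-string search produces a candidate set of geological terms; the candidates are then filtered using context adjacency and a probability dictionary of position-based word formation, which yields the recognized geological terms. Experiments show that the method recognizes geological terms with 81.6% accuracy, an improvement of nearly 60% over a general-purpose statistical segmentation method. The method can recognize OOV words in geological text while maintaining segmentation accuracy, and can be applied to geological text segmentation.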

2.
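Introduces a method for recognizing out-of-vocabulary (OOV) words based on extracting word combinations. The method builds a bigram model over text that has been segmented into fragments, and combines mutual information with rule-based filtering to extract OOV words (or phrases) composed of several words. In testing, precision was 84.71% and recall was 72.13%.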

3.
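Examines in depth the dictionary-based segmentation process, common dictionary structures, and segmentation algorithms. Based on an analysis of existing systems, a new dictionary structure is designed and the classic segmentation algorithms are improved: a dictionary-loading feature improves the recognition of out-of-vocabulary words, and a bidirectional matching algorithm selects the best segmentation result, which improves ambiguity resolution.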
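
The bidirectional matching mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the abstract does not describe its dictionary structure or tie-breaking rule, so the plain `set` vocabulary and the `cost` heuristic (fewer words, then fewer single-character words, backward result on ties) are assumptions.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, vocab, max_len):
    """Greedy longest-match segmentation scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def bidirectional_match(text, vocab):
    """Run both scans and keep the result with the lower heuristic cost."""
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(seg):
        # Assumed heuristic: fewer words first, then fewer single characters.
        return (len(seg), sum(1 for w in seg if len(w) == 1))

    return fwd if cost(fwd) < cost(bwd) else bwd


vocab = {"研究", "研究生", "生命", "命", "的", "起源"}
print(bidirectional_match("研究生命的起源", vocab))
```

On this example the forward scan commits to "研究生" too early and leaves a stray single character, while the backward scan does not, so the heuristic keeps the backward result.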

4.
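Chinese word segmentation has long been the foundation of large-scale corpus processing. It must correctly identify both the known words and the out-of-vocabulary (OOV) words in a corpus, and the various rule-based and statistical methods each have strengths and weaknesses in doing so. This paper compares the systems that took part in the first ACL-SIGHAN international Chinese word segmentation bakeoff with respect to known-word and OOV-word recognition. It points out that Chinese word segmentation needs both to raise the accuracy of known-word recognition and to predict the OOV words in a corpus well, while keeping the two in balance.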

5.
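Although word segmentation for modern Chinese has made considerable progress, segmentation of ancient texts is constrained by the lexical features, semantics, and grammar of classical Chinese, and no effective method has emerged. A new-word discovery method based on mutual information and adjacency entropy is used to find out-of-vocabulary words in the Book of Han (《汉书》). These are combined with a classical Chinese vocabulary list, a list of ancient personal names, and a list of ancient place names to build a segmentation dictionary for ancient texts, and pyNLPIR is then used with this dictionary to segment the Book of Han. The results show that new-word discovery can, to some extent, make the user dictionary needed for ancient-text segmentation more complete, but that it performs poorly on words of three characters or more. The experiments show that combining new-word discovery with dictionary information effectively improves segmentation accuracy for classical Chinese.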
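
The two statistics named in this abstract, mutual information and adjacency entropy, can be sketched for two-character candidates as below. This is an illustrative sketch only: the abstract gives no formulas, so the pointwise mutual information of adjacent characters, the use of the smaller of the left and right neighbor entropies, and the function names are assumptions. Any acceptance thresholds would also have to be chosen separately.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(neighbor_counts):
    """Shannon entropy (bits) of the characters seen next to a candidate."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbor_counts.values())


def score_bigrams(text):
    """Return {bigram: (pmi, min(left_entropy, right_entropy))} for a raw string."""
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(text) - 1):
        gram = text[i:i + 2]
        if i > 0:
            left[gram][text[i - 1]] += 1
        if i + 2 < len(text):
            right[gram][text[i + 2]] += 1

    n_chars, n_bigrams = len(text), max(len(text) - 1, 1)
    scores = {}
    for gram, count in bigrams.items():
        p_xy = count / n_bigrams
        p_x, p_y = chars[gram[0]] / n_chars, chars[gram[1]] / n_chars
        pmi = math.log2(p_xy / (p_x * p_y))  # internal cohesion of the pair
        # Varied neighbors on both sides suggest the pair is a free-standing unit.
        entropy = min(branching_entropy(left[gram]), branching_entropy(right[gram]))
        scores[gram] = (pmi, entropy)
    return scores


scores = score_bigrams("1ab2ab3ab4")
print(scores["ab"], scores["b2"])
```

In the toy string, "ab" and "b2" happen to have the same mutual information, but only "ab" appears with varied neighbors on both sides, which is why the two statistics are typically used together.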

6.
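To obtain higher translation quality, a web-search-based method for translating Chinese out-of-vocabulary (OOV) words is proposed. The method first expands the OOV word using a dictionary, then submits the expanded query to a search engine, extracts translation candidates from the returned mixed Chinese-English snippets using a frequency-change information algorithm, and finally ranks the candidates with surface patterns and a frequency and right-distance model. Experiments show that the method is effective and feasible for mining translations of Chinese OOV words.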

7.
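Neural-network-based segmentation is an important direction in the development of Chinese word segmentation. This paper describes the current state of research on neural-network segmentation, gives a general model, focuses on the application of BP (backpropagation) and other algorithms to ambiguity resolution, describes the application of the BP algorithm to out-of-vocabulary word recognition, and closes with an outlook on the development of segmentation technology.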

8.
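Chinese natural language processing plays an important role in information preprocessing for public-opinion systems. A segmentation method for Chinese public-opinion corpora based on ICTCLAS is proposed. It uses a cascaded hidden Markov model to integrate Chinese word segmentation, part-of-speech tagging, ambiguity handling, and out-of-vocabulary word recognition into a single system framework. Experiments show that the method effectively recognizes the expressions used in online public opinion and improves segmentation accuracy, laying a foundation for further discovery of online public opinion in universities.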

9.
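To extract location information from natural-language Chinese address descriptions, a Chinese address segmentation method that does not rely on a dictionary is proposed. Word frequencies are first computed from the statistical regularities of string co-occurrence in an address corpus. The place-name address strings are then preprocessed with regular expressions and fully segmented. Mutual information and information entropy give the best coarse segmentation, and filtering that result by confidence gives the best final segmentation. Experiments show that, without a dictionary, the method effectively splits place-name address strings, with precision and recall of 80.03% and 89.28% respectively.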

10.
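Not all compound adjectives can be found in a dictionary; some are coined ad hoc by authors as they write. Understanding the logical relations among their parts, and between them and the word they modify, is a great help in using and understanding these words correctly. These logical relations are described below. 1. Noun + present participle: in this type of compound adjective the first word is the logical object of the second, and the relation between them is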

11.
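Using statistical language models to recast Chinese word segmentation as character sequence labeling has become the mainstream approach in recent years, but the long training time of these models has been the biggest problem with it. A character-tagging segmentation method based on three word positions is proposed and compared experimentally on the corpora provided by Bakeoff 2005. The results show that the method achieves performance close to that of four-position character tagging while training markedly faster.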
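
The conversion between a segmented sentence and per-character position tags, which underlies this family of methods, can be sketched as follows. The abstract does not say which three tags the paper uses, so the `BIS` set here (begin, inside, single) is an assumption chosen for illustration; `BMES` is the common four-position set. No sequence model is trained in this sketch.

```python
def words_to_tags(words, scheme="BMES"):
    """Map a list of words to one position tag per character."""
    if scheme not in ("BMES", "BIS"):
        raise ValueError("scheme must be 'BMES' or 'BIS'")
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        elif scheme == "BMES":
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
        else:
            # Assumed three-position set: middle and end share one tag.
            tags.extend(["B"] + ["I"] * (len(word) - 1))
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their predicted tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") or not words:
            words.append(ch)  # start a new word
        else:
            words[-1] += ch   # continue the current word
    return words


sentence = ["中文", "分词", "是", "第一道", "工序"]
for scheme in ("BMES", "BIS"):
    tags = words_to_tags(sentence, scheme)
    print(scheme, tags, tags_to_words("".join(sentence), tags))
```

Both tag sets recover the same segmentation; a smaller tag set gives the sequence model fewer labels to distinguish, which is one plausible reason for the shorter training time the abstract reports.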

12.
Concept of word (the awareness of how words differ from nonwords or other linguistic properties) is important to learning to read Chinese because words in Chinese texts are not separated by space, and most characters can be productively compounded with other characters to form new words. The current study examined the effects of reader, word, and character attributes on Chinese children’s concept of word in text. A total of 164 fifth-grade Chinese children participated in this study. Concept of word was measured by children’s lexical decisions about words and nonwords embedded in strings of characters. Cross-classified multilevel logistic models showed that reader attributes, including reading comprehension, vocabulary knowledge, and morphological awareness, interacted with certain word or character attributes in predicting children’s lexical decisions about words or nonwords. This study sheds light on the complex relationships between reader, word, and character attributes in the formation of concept of word in Chinese.

13.
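String comparison is one of the important methods in computer information processing. Existing association rule mining algorithms cannot remember or reuse the results of earlier mining runs. To address this limitation, a method is proposed that converts the transaction database into an item database and builds, for each item, an ordered sequence of the identifiers of the transactions that support it. To improve mining efficiency and reduce the cost of slow string processing, a two-sequence string comparison algorithm is given, along with a method for discovering frequent large itemsets based on string comparison.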
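
The item-database idea in this abstract can be sketched as follows. The abstract does not specify its two-sequence comparison algorithm, so the standard two-pointer intersection of sorted identifier lists used here is an assumption, as are the function names and the toy data.

```python
from collections import defaultdict


def to_item_database(transactions):
    """Turn {tid: [items]} into {item: sorted list of supporting tids}."""
    item_db = defaultdict(list)
    for tid in sorted(transactions):
        for item in set(transactions[tid]):
            item_db[item].append(tid)  # tids arrive in ascending order
    return dict(item_db)


def intersect_sorted(a, b):
    """Two-pointer intersection of two ascending tid sequences."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


transactions = {1: ["a", "b"], 2: ["b", "c"], 3: ["a", "b", "c"], 4: ["a"]}
item_db = to_item_database(transactions)
ab = intersect_sorted(item_db["a"], item_db["b"])
abc = intersect_sorted(ab, item_db["c"])
print(ab, abc)  # supports of {a, b} and {a, b, c} are the list lengths
```

Because the identifier sequence of an itemset is kept, the support of a larger itemset can be computed from earlier results without rescanning the transactions, which matches the reuse of historical mining results that the abstract describes.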

14.
The present study examined the use of statistical cues for word boundaries during Chinese reading. Participants were instructed to read sentences for comprehension with their eye movements being recorded. A two-character target word was embedded in each sentence. The contrast between the probabilities of the ending character (C2) of the target word (C12) being used as word beginning and ending in all words containing it was manipulated. In addition, by using the boundary paradigm, parafoveal overlapping ambiguity in the string C123 was manipulated with three types of preview of the character C3, which was a single-character word in the identical condition. During preview, the combination of C23′ was a legal word in the ambiguous condition and was not a word in the control condition. Significant probability and preview effects were observed. In the low-probability condition, inconsistency in the frequent within-word position (word beginning) and the present position (word ending) lengthened gaze durations and increased refixation rate on the target word. Although benefits from the identical previews were apparent, effects of overlapping ambiguity were negligible. The results suggest that the probability of within-word positions had an influence during character-to-word assignment, which was mainly verified during foveal processing. Thus, the overlapping ambiguity between parafoveal words did not interfere with reading. Further investigation is necessary to examine whether current computational models of eye movement control should incorporate statistical cues for word boundaries together with other linguistic factors in their word processing system to account for Chinese reading.

15.
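Using stratified cluster sampling, this article surveys the use of the word "伟大" ("great") in People's Daily (《人民日报》) and analyzes, for each period, the main categories and characteristics of its use and the reasons behind them. It concludes that people use "伟大" with marked subjectivity, that the frequency of its use is related to people's rationality, and that its use has very distinct characteristics of the times and serves as a bellwether of major changes of the era.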

16.
In Java, System.out.printf and String.format consume a specialised kind of string commonly known as a format string. In our study of first-year students at the Ateneo de Manila University, we discovered that format strings present a substantial challenge for novice programmers. Focusing on their first laboratory we found that 8% of all the compilation errors and 100% of the exceptional, run-time behaviour they encountered were due to the improper construction of format strings. Format strings are a language unto themselves embedded within Java, and they are difficult for novice programmers to master when learning to program. In this article, we present exemplars of students' problematic interactions with the Java compiler and run-time environment when dealing with format strings, discuss these interactions, and recommend possible instructional interventions based on our observations.

17.
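The "1+2 modifier-head form", made up of an adjective plus a disyllabic noun, is very common in Chinese. Applying the principle of lexical integrity and distinguishing "syntactic words" from "lexical words", the article concludes that Chinese words are "multidimensional": the 1+2 modifier-head form is a word created in the way phrases are generated, and it has the features of both words and phrases.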

18.
Factors such as errors during the fabrication or construction of structural components, and errors in calculation assumptions or calculation methods, are very likely to cause the actual prestressing forces of many strings to deviate seriously from their designed values during the construction or service period of a tension structure, and thus to threaten the safety and reliability of the structure. To address relatively large errors in the prestressing force of strings during the construction or service period of a tension structure, this paper proposes a new finite element method (FEM), the "tensile force correction calculation method". Based on the measured prestressing forces of the strings, the new method takes the structure from the zero prestressing force state to the measured prestressing force state in the first phase, and from the measured prestressing force state to the designed prestressing force state in the second phase. The construction tensile force correction value for each string can be obtained by repeated iteration with FEM. Using the results of the calculation, correcting the strings' tensile forces by group and in batches becomes methodical, simple, and accurate. The new calculation method can be applied to construction simulation analysis of prestress correction for tension structures.
