Similar Literature
20 similar articles found.
1.
Constructed-response (subjective) items are an important component of language testing. They compensate for the limitations of standardized objective items, but their scores depend on raters' subjective impressions, which produces instability within individual raters and disagreement between raters. Drawing on the three major measurement theories and on computer-assisted scoring can improve the quality of subjective-item scoring and raise its precision and validity.

2.
In tests of all kinds, different raters often assign different scores to the same subjective item, which increases both the measurement error of the test and the uncertainty of that error. To overcome this defect, a new scoring method can be adopted that allows different raters to arrive at the same score on a subjective item. Its main steps can be summarized as: identify the key decision points (nodes) of the item, and, following the successive scoring steps, divide the item into six distinct score levels.

3.
With the development of information technology, automated scoring of subjective items has become a research focus in testing and assessment. Deep-learning-based scoring methods still have two limitations: first, they usually need plentiful training samples to perform well, while some real marking scenarios cannot supply enough labeled samples; second, the scoring model predicts only a total score and gives no scoring detail, so it offers no basis for subsequent evaluation of the results. To address these problems, this paper proposes a Siamese-network scoring method based on domain pretraining, explores how examinees' response texts can be used to improve scoring precision, and investigates the feasibility and implementation of a scoring-point model. Experiments show that the Siamese-network method effectively improves the precision of automated subjective-item scoring under small-sample conditions.
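To make the approach in this abstract concrete, here is a minimal sketch of a Siamese scorer that encodes an examinee response and a reference answer with a shared pretrained encoder and predicts a score level from their combined representation. It is illustrative only: the encoder name (bert-base-chinese), the [u, v, |u − v|] feature combination, the mean pooling, and the six score levels are assumptions rather than details from the paper, and domain pretraining of the encoder is presumed to have been done beforehand.

```python
# Minimal sketch (not the authors' implementation): a Siamese scorer that
# encodes an examinee response and a reference answer with one shared encoder
# and predicts a score level from their similarity features.
# Encoder name, pooling, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SiameseScorer(nn.Module):
    def __init__(self, encoder_name="bert-base-chinese", num_levels=6):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared weights
        hidden = self.encoder.config.hidden_size
        # score head sees [u, v, |u - v|], a common Siamese feature combination
        self.head = nn.Linear(hidden * 3, num_levels)

    def encode(self, batch):
        out = self.encoder(**batch).last_hidden_state
        return out.mean(dim=1)  # mean pooling over tokens

    def forward(self, response_batch, reference_batch):
        u = self.encode(response_batch)
        v = self.encode(reference_batch)
        feats = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return self.head(feats)  # logits over score levels

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
model = SiameseScorer()
resp = tok(["考生作答文本"], return_tensors="pt", padding=True, truncation=True)
ref = tok(["参考答案文本"], return_tensors="pt", padding=True, truncation=True)
logits = model(resp, ref)
```

In a small-sample setting the appeal of this formulation is that training pairs (response, reference) can be built from few labeled responses, and the shared encoder transfers what it learned during domain pretraining.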

4.
When people discuss trends in today's gaokao Chinese papers, the common impression is that subjective items are increasing while objective items are decreasing. What, then, is a subjective item, and what is an objective item? According to standard textbooks, subjective items (also called subjective test items) are those on which test takers may freely compose their answers; raters cannot apply the scoring criteria in a fully objective and uniform way and must rely on subjective judgment. They mainly include discussion questions, compositions, translation, short answers, fill-in-the-blank, and error correction. Objective items (also called objective test items) present test takers with answers in a fixed format, so the scoring criteria are easy to apply and the scoring is objective…

5.
A Review of Methods for Estimating the Scoring Quality of Subjective Items
Within psychometric theory, the scoring quality of subjective items is a topic worth studying. This paper introduces the methods that the three major measurement theories (classical test theory, generalizability theory, and item response theory) offer for estimating the scoring quality of subjective items and compares their strengths and weaknesses. Generalizability theory and item response theory hold clear advantages for evaluating scoring quality; how to combine the three theories so as to extract more valuable information about scoring quality deserves further exploration.
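As one concrete illustration of how these theories quantify scoring quality, the sketch below estimates variance components and a generalizability coefficient for a fully crossed persons × raters design. The formulas are the standard single-facet G-study ones; the function name, data, and design are assumptions for illustration, not anything taken from the paper.

```python
# Minimal sketch: variance components and a relative G coefficient for a
# fully crossed persons x raters (p x r) design. Data are simulated.
import numpy as np

def g_study(scores):
    """scores: (n_persons, n_raters) matrix from a single-facet p x r design."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    rater_means = scores.mean(axis=0)
    ss_p = n_r * np.sum((person_means - grand) ** 2)
    ss_r = n_p * np.sum((rater_means - grand) ** 2)
    ss_res = np.sum((scores - grand) ** 2) - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_res) / n_r, 0.0)   # person (true-score) variance
    var_r = max((ms_r - ms_res) / n_p, 0.0)   # rater severity variance
    var_res = ms_res                          # p x r interaction + error
    g_rel = var_p / (var_p + var_res / n_r)   # relative G coefficient
    return var_p, var_r, var_res, g_rel

rng = np.random.default_rng(0)
ratings = rng.integers(1, 7, size=(50, 3)).astype(float)  # 50 examinees, 3 raters
print(g_study(ratings))
```

A classical-test-theory analysis would typically stop at an inter-rater correlation or agreement index, while an IRT (e.g., many-facet Rasch) analysis would additionally locate each rater's severity on the latent scale; the G-study shown here sits between the two in the amount of detail it yields.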

6.
The gaokao essay is worth 60 points, 8% of the total score of 750. For a large subjective item of such weight, the lack of a scientific scoring method would seriously damage the reliability and validity of the examination. The present situation is that, on the one hand, the authorities keep tightening the management and control of scoring; on the other hand, problems such as large scoring error and scores clustering toward the middle remain unresolved, and public questioning of gaokao essay scoring continues. Essay scoring involves the design of the rating scale at the item-writing stage and, at the marking stage, …

7.
Subjective examinations rely on human raters, and because rating consistency is low and reliability suffers, the measurement community has long sought ways to improve the reliability of subjective scoring. This paper applies Longford's method in an empirical check for aberrant raters among those who scored the HSK (Advanced) composition test. The results show that the method is indeed effective for detecting rater differences in large-scale standardized subjective examinations.

8.
Taking the second classical Chinese reading passage of the Shanghai gaokao (2014-2017) as its example, this study discusses how to establish scoring criteria for subjective gaokao items. Through small-scale empirical testing, combined with the author's long-term practice and observation, it analyzes the low reliability and low discrimination of the "reference answer (sample answer) + scoring notes" model actually used in marking subjective gaokao items. Drawing on measurement and evaluation theory, it distinguishes "answers (sample answers)" from "scoring criteria" in concept, in mode of thinking, and in textual representation, and then focuses on a structured analysis of "response elements". On this basis it argues that item writers, researchers, and markers urgently need to shift from "answer" thinking to "criterion" thinking, and it offers directional suggestions for building and using scoring criteria: make the scoring rules explicit, and have the rating scale specify the expected ways of thinking, the angles from which the problem is approached, and the specific content tied to the situation.

9.
Many studies abroad show that whenever anyone scores compositions, some subjective element inevitably creeps in. Bias is widespread in composition scoring; it stems from the nature of composition itself and shows up through the influence of teachers' psychological factors on the scoring process. The factors that lead to composition …

10.
Large-scale essay examinations currently rely mostly on a double-rating model. This study uses the many-facet Rasch model (MFRM) to analyze the sources of rater error, and the main factors affecting it, in large-scale English essay scoring under double rating. Analysis of 2,427 essays scored by 57 raters found that: (1) raters differ significantly in severity; (2) about 22.8% of rater pairs show poor inter-rater consistency, while about 3.5% show excessively high consistency; (3) about 90% of raters show good intra-rater consistency, but 8.8% show very poor and about 2% excessively high intra-rater consistency; (4) overall, raters differ in how strictly they apply the various criteria (or dimensions) and score levels, while the rater-by-examinee interaction and the three-way rater-by-examinee-by-criterion interaction are not significant; (5) raters show the same severity toward male and female students.
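For reference, the many-facet Rasch model underlying this kind of analysis is conventionally written as the adjacent-category logit below (standard notation, not reproduced from the paper itself):

```latex
% Standard many-facet Rasch model (adjacent-category form); notation is conventional.
\ln\!\left( \frac{P_{nijk}}{P_{nij(k-1)}} \right) = \theta_n - \delta_i - \lambda_j - \tau_k
```

Here θ_n is examinee n's ability, δ_i the difficulty of criterion (or dimension) i, λ_j the severity of rater j, and τ_k the threshold between score categories k−1 and k; rater severity differences, consistency statistics, and the interaction effects reported above are estimated and tested within this framework.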

11.
The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily estimates of the relative severity of individual raters were found to differ significantly from single, on-average estimates for the whole rating period. For 10 raters, severity estimates on the last day were significantly different from estimates on the first day. These findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures.

12.
The decision-making behaviors of 8 raters when scoring 39 persuasive and 39 narrative essays written by second language learners were examined, first using Rasch analysis and then through think-aloud protocols. Results based on Rasch analysis and think-aloud protocols recorded by raters as they were scoring holistically and analytically suggested that rater background may have contributed to rater expectations that might explain individual differences in the application of the performance criteria of the rubrics when rating essays. The results further suggested that rater ego engagement with the text and/or author may have helped mitigate rater severity and that self-monitoring behaviors by raters may have had a similar mitigating effect.

13.
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated leveraging data from the 2012 Automated Scoring Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not perform statistically significantly different from human-to-human inter-rater agreement for essays but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially.
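One way to operationalize the ranking-based comparison described here is sketched below: each automated rater's agreement with human scores is computed per item (quadratic weighted kappa), raters are ranked within items, and a Friedman test serves as a rank-based omnibus test across items. This is only one plausible procedure, run on toy data; the competition's actual statistics, multiple-comparison adjustments, and data are not reproduced here.

```python
# Minimal sketch: rank automated raters by per-item quadratic weighted kappa
# against human scores and test average-rank differences with a Friedman test.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(1)
human = rng.integers(0, 4, size=(8, 200))       # 8 items, 200 responses each (toy)
raters = rng.integers(0, 4, size=(5, 8, 200))   # 5 automated raters (toy)

# qwk[r, i]: agreement of automated rater r with human scores on item i
qwk = np.array([[cohen_kappa_score(human[i], raters[r, i], weights="quadratic")
                 for i in range(human.shape[0])]
                for r in range(raters.shape[0])])

# rank raters within each item (rank 1 = best agreement), then average across items
ranks = np.vstack([rankdata(-qwk[:, i]) for i in range(qwk.shape[1])]).T
print("average rank per rater:", ranks.mean(axis=1))
print(friedmanchisquare(*[qwk[r] for r in range(qwk.shape[0])]))
```

A significant Friedman statistic would justify follow-up pairwise comparisons, which is where an adjustment for multiple comparisons, as the abstract emphasizes, becomes necessary.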

14.
This study examined rater effects on essay scoring in an operational monitoring system from England's 2008 national curriculum English writing test for 14‐year‐olds. We fitted two multilevel models and analyzed: (1) drift in rater severity effects over time; (2) rater central tendency effects; and (3) differences in rater severity and central tendency effects by raters’ previous rating experience. We found no significant evidence of rater drift and, while raters with less experience appeared more severe than raters with more experience, this result also was not significant. However, we did find that there was a central tendency to raters’ scoring. We also found that rater severity was significantly unstable over time. We discuss the theoretical and practical questions that our findings raise.

15.
16.
The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start‐up, plodding, boredom, or fatigue. An understanding of the different types of measurement disturbances can lead to a more complete understanding of persons or items in terms of the construct being measured. Although measurement disturbances have been explored in several contexts, they have not been explicitly considered in the context of performance assessments. The purpose of this study is to illustrate the use of graphical methods to explore measurement disturbances related to raters within the context of a writing assessment. Graphical displays that illustrate the alignment between expected and empirical rater response functions are considered as they relate to indicators of rating quality based on the Rasch model. Results suggest that graphical displays can be used to identify measurement disturbances for raters related to specific ranges of student achievement that suggest potential rater bias. Further, results highlight the added diagnostic value of graphical displays for detecting measurement disturbances that are not captured using Rasch model–data fit statistics.
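The kind of display described here can be approximated as follows: an empirical rater response function (mean observed rating within ability bins) is overlaid on the expected score curve implied by a rating scale model, so that systematic departures in particular achievement ranges stand out. The severity value, thresholds, and simulated ratings below are illustrative assumptions, not the article's data or code.

```python
# Minimal sketch: overlay an empirical rater response function on the expected
# score curve of a rating scale model to eyeball measurement disturbances.
import numpy as np
import matplotlib.pyplot as plt

def expected_score(theta, severity, thresholds):
    """Rating scale model expected score over categories 0..K for one rater."""
    K = len(thresholds)
    cum = np.zeros((K + 1,) + np.shape(theta))
    for k in range(1, K + 1):
        cum[k] = cum[k - 1] + (theta - severity - thresholds[k - 1])
    probs = np.exp(cum - cum.max(axis=0))
    probs /= probs.sum(axis=0)
    return (np.arange(K + 1)[:, None] * probs).sum(axis=0)

severity, thresholds = 0.3, np.array([-1.5, -0.3, 0.6, 1.4])  # illustrative values
theta_grid = np.linspace(-3, 3, 200)

# toy "observed" ratings for this rater
rng = np.random.default_rng(2)
theta_obs = rng.normal(0, 1, 500)
obs = np.clip(np.round(expected_score(theta_obs, severity, thresholds)
                       + rng.normal(0, 0.7, 500)), 0, 4)

bins = np.linspace(-3, 3, 13)
centers = 0.5 * (bins[:-1] + bins[1:])
masks = [(theta_obs >= lo) & (theta_obs < hi) for lo, hi in zip(bins[:-1], bins[1:])]
emp = [obs[m].mean() if m.any() else np.nan for m in masks]

plt.plot(theta_grid, expected_score(theta_grid, severity, thresholds), label="expected (model)")
plt.plot(centers, emp, "o-", label="empirical (binned means)")
plt.xlabel("student achievement (logits)")
plt.ylabel("rating")
plt.legend()
plt.show()
```

A rater whose empirical curve sits below the expected curve only for high-achieving students, say, would show a localized disturbance that overall fit statistics can easily average away, which is the diagnostic point the abstract makes.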

17.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
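The four rater errors named in this abstract can also be screened descriptively before any FACETS analysis; the sketch below computes rough indicators (rater means for severity, rating spread for central tendency and restriction of range, and between-dimension correlations for halo) from a raw ratings array. The array shape and data are invented for illustration, and these heuristics are not the FACETS estimates used in the study.

```python
# Minimal descriptive sketch (not the FACETS analysis itself): rough indicators
# of rater severity, central tendency / restriction of range, and halo.
# ratings[r, p, d] = rater r's score for person p on analytic dimension d.
import numpy as np

rng = np.random.default_rng(3)
ratings = rng.integers(1, 5, size=(15, 264, 4)).astype(float)  # toy data

severity = ratings.mean(axis=(1, 2))   # lower mean => more severe rater
spread = ratings.std(axis=(1, 2))      # small spread => central tendency /
                                       #   restriction of range
# halo: high average correlation between a rater's dimension scores suggests
# the rater is scoring holistically rather than analytically
halo = np.array([np.corrcoef(r.T)[np.triu_indices(4, k=1)].mean() for r in ratings])

for name, stat in [("severity (mean)", severity),
                   ("spread (SD)", spread),
                   ("halo (avg r)", halo)]:
    print(name, np.round(stat, 2))
```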

18.
Researchers have explored a variety of topics related to identifying and distinguishing among specific types of rater effects, as well as the implications of different types of incomplete data collection designs for rater‐mediated assessments. In this study, we used simulated data to examine the sensitivity of latent trait model indicators of three rater effects (leniency, central tendency, and severity) in combination with different types of incomplete rating designs (systematic links, anchor performances, and spiral). We used the rating scale model and the partial credit model to calculate rater location estimates, standard errors of rater estimates, model–data fit statistics, and the standard deviation of rating scale category thresholds as indicators of rater effects and we explored the sensitivity of these indicators to rater effects under different conditions. Our results suggest that it is possible to detect rater effects when each of the three types of rating designs is used. However, there are differences in the sensitivity of each indicator related to type of rater effect, type of rating design, and the overall proportion of effect raters. We discuss implications for research and practice related to rater‐mediated assessments.
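The two models mentioned in this abstract differ only in whether category thresholds are shared across elements of the rater facet; conventional adjacent-category forms are shown below (standard notation, not taken from the article).

```latex
% Conventional forms for a rater facet j: rating scale model (RSM) vs. partial
% credit model (PCM); under the PCM each rater has its own thresholds.
\text{RSM:}\quad \ln\!\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \lambda_j - \tau_k
\qquad
\text{PCM:}\quad \ln\!\frac{P_{njk}}{P_{nj(k-1)}} = \theta_n - \lambda_j - \tau_{jk}
```

Because the PCM gives each rater j its own thresholds τ_jk, the spread of those thresholds can serve as the kind of indicator of category-use effects (e.g., central tendency) that the study examines.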

19.
20.