期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Development of Automated Scoring Algorithms for Complex Performance Assessments: A Comparison of Two Approaches

Brian E. Clauser Melissa J. Margolis Stephen G. Clyman Linette P. Ross 《Journal of Educational Measurement》1997,34(2):141-161

Performance assessments are typically scored by having experts rate individual performances. The cost associated with using expert raters may represent a serious limitation in many large-scale testing programs. The use of raters may also introduce an additional source of error into the assessment. These limitations have motivated development of automated scoring systems for performance assessments. Preliminary research has shown these systems to have application across a variety of tasks ranging from simple mathematics to architectural problem solving. This study extends research on automated scoring by comparing alternative automated systems for scoring a computer simulation test of physicians'patient management skills; one system uses regression-derived weights for components of the performance, the other uses complex rules to map performances into score levels. The procedures are evaluated by comparing the resulting scores to expert ratings of the same performances. 相似文献

2.

大学英语写作评分方法对评分者严厉程度的影响——整体评分法和分析评分法的对比分析 总被引：1，自引：0，他引：1

贺满足《湖南第一师范学报》2006,6(4):59-61,66

评分标准在写作测试中非常重要,使用不同的评分方法会影响评卷者的评分行为。研究显示,虽然整体法和分析法两种英语写作评分方法都可靠,但是在两种评分中,评卷者的严厉程度以及考生的写作成绩发生很大变化。总体上,整体法评分中,评卷者的严厉程度趋于一致,接近理想值;分析法评分中,考生的写作成绩更高,同时评卷者的严厉程度也存在显著差异。因而,在决定考生前途命运的重大考试中,整体评分法更受推崇。相似文献

3.

Development of a Scoring Algorithm to Replace Expert Rating for Scoring a Complex Performance-Based Assessment

《教育实用测度》2013,26(4):345-358

Performance assessments typically are scored by having experts rate individual performances. In contexts such as medical licensure, where the examinee population is large and the pool of expert raters is limited practically, this approach may be unworkable. This article describes an automated scoring algorithm for a computer simulation-based examination of physicians' patient-management skills. The algorithm is based on the policy used by clinicians in rating case performances. The results show that scores produced using this algorithm are highly correlated to actual clinician ratings. These scores also are shown to be effective in discriminating between case performance judged to be passing or failing by an independent group of clinicians. 相似文献

4.

Validating human and automated scoring of essays against “True” scores

Yoav Cohen Effi Levi Anat Ben-Simon 《教育实用测度》2018,31(3):241-250

ABSTRACT

In the current study, two pools of 250 essays, all written as a response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to the essay’s true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time, and by calculating an estimate of the true scores using the remaining raters, an independent criterion against which to judge the validity of the human raters and that of the AES system, as well as the interrater reliability was produced. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support a claim that the reliability and validity of AES diverge: although the AES scoring is, naturally, more consistent than the human ratings, it is less valid. 相似文献

5.

Statistically Comparing the Performance of Multiple Automated Raters Across Multiple Items

Vincent Kieftenbeld Michelle Boyer 《教育实用测度》2017,30(2):117-128

Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated leveraging data from the 2012 Automated Scoring Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not perform statistically significantly different from human-to-human inter-rater agreement for essays but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially. 相似文献

6.

Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment

《Studies in Educational Evaluation》2020

Machine learning has been frequently employed to automatically score constructed response assessments. However, there is a lack of evidence of how this predictive scoring approach might be compromised by construct-irrelevant variance (CIV), which is a threat to test validity. In this study, we evaluated machine scores and human scores with regard to potential CIV. We developed two assessment tasks targeting science teacher pedagogical content knowledge (PCK); each task contains three video-based constructed response questions. 187 in-service science teachers watched the videos with each had a given classroom teaching scenario and then responded to the constructed-response items. Three human experts rated the responses and the human-consent scores were used to develop machine learning algorithms to predict ratings of the responses. Including the machine as another independent rater, along with the three human raters, we employed the many-facet Rasch measurement model to examine CIV due to three sources: variability of scenarios, rater severity, and rater sensitivity of the scenarios. Results indicate that variability of scenarios impacts teachers’ performance, but the impact significantly depends on the construct of interest; for each assessment task, the machine is always the most severe rater, compared to the three human raters. However, the machine is less sensitive than the human raters to the task scenarios. This means the machine scoring is more consistent and stable across scenarios within each of the two tasks. 相似文献

7.

Appraising the scoring performance of automated essay scoring systems—Some additional considerations: Which essays? Which human raters? Which scores?

Kevin Raczynski Allan Cohen 《教育实用测度》2018,31(3):233-240

ABSTRACT

The literature on Automated Essay Scoring (AES) systems has provided useful validation frameworks for any assessment that includes AES scoring. Furthermore, evidence for the scoring fidelity of AES systems is accumulating. Yet questions remain when appraising the scoring performance of AES systems. These questions include: (a) which essays are used to calibrate and test AES systems; (b) which human raters provided the scores on these essays; and (c) given that multiple human raters are generally used for this purpose, which human scores should ultimately be used when there are score disagreements? This article provides commentary on the first two questions and an empirical investigation into the third question. The authors suggest that addressing these three questions strengthens the scoring component of the validity argument for any assessment that includes AES scoring. 相似文献

8.

Scoring a Performance-Based Assessment by Modeling the Judgments of Experts

Brian E. Clauser Raja G. Subhiyah Ronald J. Nungester Douglas R. Ripkey Stephen G. Clyman Danette McKinley 《Journal of Educational Measurement》1995,32(4):397-415

Performance assessments typically require expert judges to individually rate each performance. This results in a limitation in the use of such assessments because the rating process may be extremely time consuming. This article describes a scoring algorithm that is based on expert judgments but requires the rating of only a sample of performances. A regression-based policy capturing procedure was implemented to model the judgment policies of experts. The data set was a seven-case performance assessment of physician patient management skills. The assessment used a computer-based simulation of the patient care environment. The results showed a substantial improvement in correspondence between scores produced using the algorithm and actual ratings, when compared to raw scores. Scores based on the algorithm were also shown to be superior to raw scores and equal to expert ratings for making pass/fail decisions which agreed with those made by an independent committee of experts 相似文献

9.

Validating Automated Essay Scoring: A (Modest) Refinement of the “Gold Standard”

Donald E. Powers David S. Escoffery Matthew P. Duchnowski 《教育实用测度》2015,28(2):130-142

By far, the most frequently used method of validating (the interpretation and use of) automated essay scores has been to compare them with scores awarded by human raters. Although this practice is questionable, human-machine agreement is still often regarded as the “gold standard.” Our objective was to refine this model and apply it to data from a major testing program and one system of automated essay scoring. The refinement capitalizes on the fact that essay raters differ in numerous ways (e.g., training and experience), any of which may affect the quality of ratings. We found that automated scores exhibited different correlations with scores awarded by experienced raters (a more compelling criterion) than with those awarded by untrained raters (a less compelling criterion). The results suggest potential for a refined machine-human agreement model that differentiates raters with respect to experience, expertise, and possibly even more salient characteristics. 相似文献

10.

The effectiveness of machine score-ability ratings in predicting automated scoring performance

Susan Lottridge Scott Wood Dan Shaw 《教育实用测度》2018,31(3):215-232

ABSTRACT

This study sought to provide a framework for evaluating machine score-ability of items using a new score-ability rating scale, and to determine the extent to which ratings were predictive of observed automated scoring performance. The study listed and described a set of factors that are thought to influence machine score-ability; these factors informed the score-ability rating applied by expert raters. Five Reading items, six Science items, and 10 Math items were examined. Experts in automated scoring served as reviewers, providing independent ratings of score-ability before engine calibration. Following the rating, engines were calibrated and their performances were evaluated using common industry criteria. Three derived criteria from the engine evaluations were computed: the score-ability value in the rating scale based on the empirical results, the number of industry evaluation criteria met by the engine, the approval status of the engine based on the number of criteria met. The results indicated that the score-ability ratings were moderately correlated with Science score-ability, the ratings were weakly correlated with Math score-ability, and were not correlated with Reading score-ability. 相似文献

11.

Effects of Assigning Raters to Items

Robert C. Sykes Kyoko Ito Zhen Wang 《Educational Measurement》2008,27(1):47-55

Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers though fewer than the number of items. 相似文献

12.

Digital Module 18: Automated Scoring https://ncme.elevate.commpartners.com

Sue Lottridge Amy Burkhardt Michelle Boyer 《Educational Measurement》2020,39(3):141-142

In this digital ITEMS module, Dr. Sue Lottridge, Amy Burkhardt, and Dr. Michelle Boyer provide an overview of automated scoring. Automated scoring is the use of computer algorithms to score unconstrained open-ended test items by mimicking human scoring. The use of automated scoring is increasing in educational assessment programs because it allows scores to be returned faster at lower cost. In the module, they discuss automated scoring from a number of perspectives. First, they discuss benefits and weaknesses of automated scoring, and what psychometricians should know about automated scoring. Next, they describe the overall process of automated scoring, moving from data collection to engine training to operational scoring. Then, they describe how automated scoring systems work, including the basic functions around score prediction as well as other flagging methods. Finally, they conclude with a discussion of the specific validity demands around automated scoring and how they align with the larger validity demands around test scores. Two data activities are provided. The first is an interactive activity that allows the user to train and evaluate a simple automated scoring engine. The second is a worked example that examines the impact of rater error on test scores. The digital module contains a link to an interactive web application as well as its R-Shiny code, diagnostic quiz questions, activities, curated resources, and a glossary. 相似文献

13.

Validity Issues for Performance-Based Tests Scored With Computer-Automated Scoring Systems

《教育实用测度》2013,26(4):413-432

With the increasing use of automated scoring systems in high-stakes testing, it has become essential that test developers assess the validity of the inferences based on scores produced by these systems. In this article, we attempt to place the issues associated with computer-automated scoring within the context of current validity theory. Although it is assumed that the criteria appropriate for evaluating the validity of score interpretations are the same for tests using automated scoring procedures as for other assessments, different aspects of the validity argument may require emphasis as a function of the scoring procedure. We begin the article with a taxonomy of automated scoring procedures. The presentation of this taxonomy provides a framework for discussing threats to validity that may take on increased importance for specific approaches to automated scoring. We then present a general discussion of the process by which test-based inferences are validated, followed by a discussion of the special issues that must be considered when scoring is done by computer. 相似文献

14.

Automated essay scoring: Psychometric guidelines and practices

Chaitanya Ramineni David M. Williamson 《Assessing Writing》2013,18(1):25-39

相似文献

15.

PETS口试评分培训效果的多面Rasch分析

李英关丹丹《外语教学理论与实践》2016,153(3):43-48

本研究以PETS-1级拟聘口试教师为研究对象,对口试教师评分的培训效果进行了研究。采用多面Rasch分析对比口试教师接受培训前后的评分效果。结果发现：培训后,提升了口试教师与专家评分完全一致的比率,评分偏于严格的口试教师在评分标准上做了恰当的调整,所有口试教师评分拟合值都在可接受范围内,总体上,口试教师评分的培训比较有效,培训后提升了评分的准确性。多面Rasch分析有助于发现评分过于宽松、过于严格、评分拟合差的口试教师以及评分异常情况,为开展有针对性地培训提供了可靠的依据。相似文献

16.

Comparing Human and Automated Essay Scoring for Prospective Graduate Students With Learning Disabilities and/or ADHD

Heather Buzick Maria Elena Oliveri Yigal Attali Michael Flor 《教育实用测度》2013,26(3):161-172

ABSTRACT

Automated essay scoring is a developing technology that can provide efficient scoring of large numbers of written responses. Its use in higher education admissions testing provides an opportunity to collect validity and fairness evidence to support current uses and inform its emergence in other areas such as K–12 large-scale assessment. In this study, human and automated scores on essays written by college students with and without learning disabilities and/or attention deficit hyperactivity disorder were compared, using a nationwide (U.S.) sample of prospective graduate students taking the revised Graduate Record Examination. The findings are that, on average, human raters and the automated scoring engine assigned similar essay scores for all groups, despite average differences among groups with respect to essay length and spelling errors. 相似文献

17.

A Framework for Evaluation and Use of Automated Scoring

David M. Williamson Xiaoming Xi F. Jay Breyer 《Educational Measurement》2012,31(1):2-13

A framework for evaluation and use of automated scoring of constructed‐response tasks is provided that entails both evaluation of automated scoring as well as guidelines for implementation and maintenance in the context of constantly evolving technologies. Consideration of validity issues and challenges associated with automated scoring are discussed within the framework. The fit between the scoring capability and the assessment purpose, the agreement between human and automated scores, the consideration of associations with independent measures, the generalizability of automated scores as implemented in operational practice across different tasks and test forms, and the impact and consequences for the population and subgroups are proffered as integral evidence supporting use of automated scoring. Specific evaluation guidelines are provided for using automated scoring to complement human scoring for tests used for high‐stakes purposes. These guidelines are intended to be generalizable to new automated scoring systems and as existing systems change over time. 相似文献

18.

The Generalizability of Scores for a Performance Assessment Scored with a Computer-Automated Scoring System 总被引：1，自引：0，他引：1

Brian E. Clauser Polina Harik Stephen G. Clyman 《Journal of Educational Measurement》2000,37(3):245-261

When performance assessments are delivered and scored by computer, the costs of scoring may be substantially lower than those of scoring the same assessment based on expert review of the individual performances. Computerized scoring algorithms also ensure that the scoring rules are implemented precisely and uniformly. Such computerized algorithms represent an effort to encode the scoring policies of experts. This raises the question, would a different group of experts have produced a meaningfully different algorithm? The research reported in this paper uses generalizability theory to assess the impact of using independent, randomly equivalent groups of experts to develop the scoring algorithms for a set of computer‐simulation tasks designed to measure physicians’ patient management skills. The results suggest that the impact of this “expert group” effect may be significant but that it can be controlled with appropriate test development strategies. The appendix presents multivariate generalizability analysis to examine the stability of the assessed proficiency across scores representing the scoring policies of different groups of experts. 相似文献

19.

Classical,Generalizability, and Multifaceted Rasch Detection of Interrater Variability in Large,Sparse Data Sets

Peter D. Macmillan 《Journal of Experimental Education》2013,81(2):167-190

Classical test theory (CTT), generalizability theory (GT), and multi-faceted Rasch model (MFRM) approaches to detecting and correcting for rater variability were compared. Each of 4,930 students' responses on an English examination was graded on 9 scales by 3 raters drawn from a pool of 70. CTT and MFRM indicated substantial variation among raters; the MFRM analysis identified far more raters as different than the CTT analysis did. In contrast, the GT rater variance component and the Rasch histograms suggested little rater variation. CTT and MFRM correction procedures both produced different scores for more than 50% of the examinees, but 75% of the examinees received identical results after each correction. The demonstrated value of a correction for systems of well-trained multiple graders has implications for all systems in which subjective scoring is used. 相似文献

20.

'Mental Model' Comparison of Automated and Human Scoring

David M. Williamson Isaac I. Bejar Anne S. Hone 《Journal of Educational Measurement》1999,36(2):158-184

'Mental models' used by automated scoring for the simulation divisions of the computerized Architect Registration Examination are contrasted with those used by experienced human graders. Candidate solutions (N = 3613) received both automated and human holistic scores. Quantitative analyses suggest high correspondence between automated and human scores; thereby suggesting similar mental models are implemented. Solutions with discrepancies between automated and human scores were selected for qualitative analysis. The human graders were reconvened to review the human scores and to investigate the source of score discrepancies in light of rationales provided by the automated scoring process. After review, slightly more than half of the score discrepancies were reduced or eliminated. Six sources of discrepancy between original human scores and automated scores were identified: subjective criteria; objective criteria; tolerances/ weighting; details; examinee task interpretation; and unjustified. The tendency of the human graders to be compelled by automated score rationales varied by the nature of original score discrepancy. We determine that, while the automated scores are based on a mental model consistent with that of expert graders, there remain some important differences, both intentional and incidental, which distinguish between human and automated scoring. We conclude that automated scoring has the potential to enhance the validity evidence of scores in addition to improving efficiency. 相似文献