Similar Literature
20 similar documents found
1.
ABSTRACT

In the current study, two pools of 250 essays, all written in response to the same prompt, were rated by two groups of raters (14 or 15 raters per group), thereby providing an approximation to each essay's true score. An automated essay scoring (AES) system was trained on the datasets and then scored the essays using a cross-validation scheme. By eliminating one, two, or three raters at a time and estimating the true scores from the remaining raters, an independent criterion was produced against which to judge the validity of the human raters and of the AES system, as well as the interrater reliability. The results of the study indicated that the automated scores correlate with human scores to the same degree as human raters correlate with each other. However, the findings regarding the validity of the ratings support the claim that the reliability and validity of AES diverge: although AES scoring is, naturally, more consistent than human rating, it is less valid.
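The leave-raters-out criterion described above can be sketched as follows. This is a toy illustration with made-up scores, not the study's data; the `pearson` helper and the six-essay matrices are hypothetical:

```python
from statistics import mean

def pearson(x, y):
    # Pearson correlation between two equal-length score lists
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Toy data: 6 essays scored 1-5 by 5 human raters, plus machine scores
human = [
    [2, 3, 3, 2, 3],
    [4, 5, 4, 4, 5],
    [3, 3, 4, 3, 3],
    [5, 5, 5, 4, 5],
    [1, 2, 1, 2, 2],
    [3, 4, 4, 3, 4],
]
machine = [3, 4, 3, 5, 2, 4]

# Estimate each essay's "true score" from all raters except rater 0,
# then use it as an independent criterion for both the machine and rater 0
criterion = [mean(scores[1:]) for scores in human]
r_machine = pearson(machine, criterion)
r_human = pearson([scores[0] for scores in human], criterion)
```

Repeating the drop for every rater (and for pairs or triples of raters) yields the validity comparison the abstract describes.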

2.
This study of the reliability and validity of scales from the Child's Report of Parental Behavior (CRPBI) presents data on the utility of aggregating the ratings of multiple observers. Subjects were 680 individuals from 170 families. The participants in each family were a college freshman student, the mother, the father, and 1 sibling. The results revealed moderate internal consistency (M = .71) for all rater types on the 18 subscales of the CRPBI, but low interrater agreement (M = .30). The same factor structure was observed across the 4 rater types; however, aggregation within raters across salient scales to form estimated factor scores did not improve rater convergence appreciably (M = .36). Aggregation of factor scores across 2 raters yields much higher convergence (M = .51), and the 4-rater aggregates yielded impressive generalizability coefficients (M = .69). These and other analyses suggested that the responses of each family member contained a small proportion of true variance and a substantial proportion of factor-specific systematic error. The latter can be greatly reduced by aggregating scores across multiple raters.
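The gain from aggregating across raters reported above follows the Spearman-Brown prophecy closely; a minimal sketch using the study's reported single-rater convergence of .30:

```python
def spearman_brown(r_single, k):
    # Predicted reliability of the mean of k parallel ratings,
    # given the reliability (or convergence) of a single rating
    return k * r_single / (1 + (k - 1) * r_single)

r1 = 0.30                   # single-rater convergence reported for the CRPBI scales
r2 = spearman_brown(r1, 2)  # two-rater aggregate
r4 = spearman_brown(r1, 4)  # four-rater aggregate
print(round(r2, 2), round(r4, 2))  # 0.46 0.63 -- close to the observed .51 and .69
```

The predicted values slightly undershoot the observed .51 and .69, consistent with the raters not being strictly parallel.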

3.
Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers, though fewer than the number of items.

4.
5.
6.
7.
This study investigates how experienced and inexperienced raters score essays written by ESL students on two different prompts. The quantitative analysis using multi-faceted Rasch measurement, which provides measurements of rater severity and consistency, showed that the inexperienced raters were more severe than the experienced raters on one prompt but not on the other prompt, and that differences between the two groups of raters were eliminated following rater training. The qualitative analysis, which consisted of analysis of raters' think-aloud protocols while scoring essays, provided insights into reasons for these differences. Differences were related to the ease with which the scoring rubric could be applied to the two prompts and to differences in how the two groups of raters perceived the appropriateness of the prompts.

8.
An approach to essay grading based on signal detection theory (SDT) is presented. SDT offers a basis for understanding rater behavior with respect to the scoring of constructed responses, in that it provides a theory of the psychological processes underlying the raters' behavior. The approach also provides measures of the precision of the raters and the accuracy of classifications. An application of latent class SDT to essay grading is detailed, and similarities to and differences from item response theory (IRT) are noted. The validity and utility of classifications obtained from the SDT model and scores obtained from IRT models are compared. Validity coefficients were found to be about equal in magnitude across SDT and IRT models. Results from a simulation study of a 5-class SDT model with eight raters are also presented.

9.
A method for assessing rater reliability by means of a design of overlapping rater teams is presented. The products to be rated are split randomly into m disjoint subsamples, m equaling the number of raters. Each rater rates at least two subsamples according to a prefixed design. The covariances or correlations of the ratings can be analyzed with LISREL models, resulting in estimates of the rater reliabilities. Models in which the rater reliabilities are congeneric, tau-equivalent, or parallel can be tested. We address problems concerning the identification and the degrees of freedom of the models and present two examples based on essay ratings.

10.
The decision-making behaviors of 8 raters when scoring 39 persuasive and 39 narrative essays written by second language learners were examined, first using Rasch analysis and then through think-aloud protocols. Results based on Rasch analysis and on think-aloud protocols recorded by raters as they scored holistically and analytically suggested that rater background may have contributed to rater expectations that might explain individual differences in the application of the performance criteria of the rubrics when rating essays. The results further suggested that rater ego engagement with the text and/or author may have helped mitigate rater severity, and that self-monitoring behaviors by raters may have had a similar mitigating effect.

11.
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by forming multiple ratings given to an item response as a bundle and assigning fixed-effect parameters to describe response patterns in the bundle. Unfortunately, this model becomes difficult to manage when a polytomous item is graded by more than two raters. In this study, by adding random-effect parameters to the facets model, we propose a class of generalized rater models to account for the local dependence among multiple ratings and intrarater variation in severity. A series of simulations was conducted with the freeware WinBUGS to evaluate parameter recovery of the new models and consequences of ignoring the local dependence or intrarater variation in severity. The results revealed good parameter recovery when the data-generating models were fit, and poor estimation of parameters and test reliability when the local dependence or intrarater variation in severity was ignored. An empirical example is provided.

12.
This article presents a novel method, the Complex Dynamics Essay Scorer (CDES), for automated essay scoring using complex network features. Texts produced by college students in China were represented as scale-free networks (e.g., a word adjacency model) from which typical network features, such as the in-/out-degrees, clustering coefficient (CC), and dynamic network measures, were obtained. The CDES integrates the classical concepts of network feature representation and essay score series variation. Several experiments indicated that the network measures distinguish essays of different quality and that complex networks can be developed for autoscoring tasks. The average agreement of the CDES and human rater scores was 86.5%, and the average Pearson correlation was .77. The results indicate that the CDES produced functional complex systems and autoscored Chinese essays in a manner consistent with human raters. Our research suggests potential applications in other areas of educational assessment.
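A minimal sketch of the word-adjacency representation and the clustering coefficient (CC) mentioned above, on a toy English sentence. The CDES itself worked on Chinese texts and used further dynamic features not shown here:

```python
from collections import defaultdict

def adjacency_network(tokens):
    # Undirected word-adjacency graph: each token is linked to its successor
    adj = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def clustering_coefficient(adj, node):
    # Fraction of the node's neighbour pairs that are themselves linked
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return 2 * links / (k * (k - 1))

tokens = "the cat sat on the mat and the cat ran".split()
net = adjacency_network(tokens)
cc = clustering_coefficient(net, "the")  # 1 link among 6 neighbour pairs -> 1/6
```

Feature vectors built from such per-node statistics (degrees, CC, and so on) are what a scoring model would then be trained on.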

13.
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed response items. In this approach, raters' scores are not viewed as being direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, thus yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM-SDT model is applied to data from a large-scale assessment and is shown to provide a useful summary of various aspects of the raters' performance.

14.
Numerous studies have examined performance assessment data using generalizability theory. Typically, these studies have treated raters as randomly sampled from a population, with each rater judging a given performance on a single occasion. This paper presents two studies that focus on aspects of the rating process that are not explicitly accounted for in this typical design. The first study makes explicit the "committee" facet, acknowledging that raters often work within groups. The second study makes explicit the "rating-occasion" facet by having each rater judge each performance on two separate occasions. The results of the first study highlight the importance of clearly specifying the relevant facets of the universe of interest. Failing to include the committee facet led to an overly optimistic estimate of the precision of the measurement procedure. By contrast, failing to include the rating-occasion facet, in the second study, had minimal impact on the estimated error variance.
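The variance-component logic behind such generalizability analyses can be sketched for the simplest fully crossed persons x raters design. The scores below are toy values; the studies above used richer designs with committee and occasion facets:

```python
def variance_components(X):
    # X: persons x raters matrix of scores, fully crossed, one rating per cell
    n_p, n_r = len(X), len(X[0])
    grand = sum(sum(row) for row in X) / (n_p * n_r)
    mp = [sum(row) / n_r for row in X]                                  # person means
    mr = [sum(X[i][j] for i in range(n_p)) / n_p for j in range(n_r)]   # rater means
    ms_p = n_r * sum((m - grand) ** 2 for m in mp) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in mr) / (n_r - 1)
    ss_res = sum((X[i][j] - mp[i] - mr[j] + grand) ** 2
                 for i in range(n_p) for j in range(n_r))
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    s2_p = (ms_p - ms_res) / n_r   # person (true-score) variance
    s2_r = (ms_r - ms_res) / n_p   # rater severity variance
    return s2_p, s2_r, ms_res      # ms_res doubles as the residual component

X = [[3, 4], [5, 6], [2, 2]]        # 3 persons rated by the same 2 raters
s2_p, s2_r, s2_res = variance_components(X)
g_rel = s2_p / (s2_p + s2_res / 2)  # relative G coefficient for a 2-rater average
```

Adding a committee or occasion facet enlarges the error term in the denominator, which is exactly why omitting such facets yields overly optimistic precision estimates.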

15.
Evaluating Rater Accuracy in Performance Assessments
A new method for evaluating rater accuracy within the context of performance assessments is described. Accuracy is defined as the match between ratings obtained from operational raters and those obtained from an expert panel on a set of benchmark, exemplar, or anchor performances. An extended Rasch measurement model called the FACETS model is presented for examining rater accuracy. The FACETS model is illustrated with 373 benchmark papers rated by 20 operational raters and an expert panel. The data are from the 1993 field test of the High School Graduation Writing Test in Georgia. The data suggest that there are statistically significant differences in rater accuracy; the data also suggest that it is easier to be accurate on some benchmark papers than on others. A small example is presented to illustrate how the accuracy ordering of raters may not be invariant over different subsets of benchmarks used to evaluate accuracy.

16.
By far, the most frequently used method of validating (the interpretation and use of) automated essay scores has been to compare them with scores awarded by human raters. Although this practice is questionable, human-machine agreement is still often regarded as the "gold standard." Our objective was to refine this model and apply it to data from a major testing program and one system of automated essay scoring. The refinement capitalizes on the fact that essay raters differ in numerous ways (e.g., training and experience), any of which may affect the quality of ratings. We found that automated scores exhibited different correlations with scores awarded by experienced raters (a more compelling criterion) than with those awarded by untrained raters (a less compelling criterion). The results suggest potential for a refined machine-human agreement model that differentiates raters with respect to experience, expertise, and possibly even more salient characteristics.

17.
Automated scoring systems are typically evaluated by comparing the performance of a single automated rater item-by-item to human raters. This presents a challenge when the performance of multiple raters needs to be compared across multiple items. Rankings could depend on specifics of the ranking procedure; observed differences could be due to random sampling of items and/or responses in the validation sets. Any statistical hypothesis test of the differences in rankings needs to be appropriate for use with rater statistics and adjust for multiple comparisons. This study considered different statistical methods to evaluate differences in performance across multiple raters and items. These methods are illustrated using data from the 2012 Automated Scoring Assessment Prize competitions. Using average rankings to test for significant differences in performance between automated and human raters, findings show that most automated raters did not differ statistically significantly from human-to-human interrater agreement on essays, but they did perform differently on short-answer items. Differences in average rankings between most automated raters were not statistically significant, even when their observed performance differed substantially.
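The per-item rater statistics that feed such rankings are typically chance-corrected agreement measures; quadratic weighted kappa, commonly reported in the 2012 competitions referenced above, can be sketched as follows (toy ratings, assuming integer scores on a known scale):

```python
def quadratic_weighted_kappa(a, b, min_r, max_r):
    # Chance-corrected agreement that penalises large disagreements quadratically
    n = max_r - min_r + 1
    obs = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        obs[x - min_r][y - min_r] += 1
    ha = [sum(row) for row in obs]                             # marginal of rater a
    hb = [sum(obs[i][j] for i in range(n)) for j in range(n)]  # marginal of rater b
    total = len(a)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = (i - j) ** 2 / (n - 1) ** 2
            num += w * obs[i][j]
            den += w * ha[i] * hb[j] / total  # expected count under independence
    return 1.0 - num / den

human_scores = [1, 2, 3, 4, 4, 2, 3, 1]
machine_scores = [1, 2, 3, 4, 3, 2, 3, 2]
qwk = quadratic_weighted_kappa(human_scores, machine_scores, 1, 4)  # 0.875
```

Computing such a statistic per rater per item, ranking raters within each item, and averaging the ranks gives the kind of comparison the study then subjects to significance testing.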

18.
This study used a mixed-methods approach to analyze how CET-4 essay raters apply the scoring rubric. Twenty-six CET-4 essay raters scored 30 mock CET-4 essays and provided three reasons for each score, ranked by importance. The results showed that: (1) despite differences in severity, agreement among the 26 raters was fairly good, and most raters were also internally consistent; (2) the scoring rationales of some raters showed a tendency toward one-dimensionality; and (3) 71.91% of the reasons given reflected the five textual features specified in the CET-4 scoring rubric, indicating that most raters understood and applied the rubric fairly accurately.

19.
In this study, patterns of variation in the severities of a group of raters over time, or so-called "rater drift," were examined when raters scored an essay written under examination conditions. At the same time, feedback was given to rater leaders (called "table leaders"), who then interpreted the feedback and reported to the raters. Rater severities in five successive periods were estimated using a modified linear logistic test model (LLTM; Fischer, 1973) approach. It was found that the raters did indeed drift towards the mean, but a planned comparison of the feedback with a control condition was not successful; it was believed that this was due to contamination at the table-leader level. A series of models was also estimated to detect rater effects beyond severity: a tendency to use extreme scores and a tendency to prefer certain categories. These models showed significant improvement in fit, implying that the effects were indeed present, although they were difficult to detect in relatively short time periods.

20.
Classical test theory (CTT), generalizability theory (GT), and multi-faceted Rasch model (MFRM) approaches to detecting and correcting for rater variability were compared. Each of 4,930 students' responses on an English examination was graded on 9 scales by 3 raters drawn from a pool of 70. CTT and MFRM indicated substantial variation among raters; the MFRM analysis identified far more raters as different than the CTT analysis did. In contrast, the GT rater variance component and the Rasch histograms suggested little rater variation. CTT and MFRM correction procedures both produced different scores for more than 50% of the examinees, but 75% of the examinees received identical results after each correction. The demonstrated value of a correction for systems of well-trained multiple graders has implications for all systems in which subjective scoring is used.

