Similar Documents
20 similar documents found.
1.
In this study, patterns of variation in the severities of a group of raters over time, or so-called "rater drift," were examined when raters scored an essay written under examination conditions. At the same time, feedback was given to rater leaders (called "table leaders"), who then interpreted the feedback and reported to the raters. Rater severities in five successive periods were estimated using a modified linear logistic test model (LLTM; Fischer, 1973) approach. It was found that the raters did indeed drift towards the mean, but a planned comparison of the feedback condition with a control condition was not successful; this was believed to be due to contamination at the table-leader level. A series of models was also estimated to detect other types of rater effects beyond severity: a tendency to use extreme scores and a tendency to prefer certain categories. These models showed significant improvements in fit, implying that the effects were indeed present, although they were difficult to detect in relatively short time periods.
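As a rough sketch of the kind of model this abstract describes, and not the authors' exact parameterization, a Rasch-family formulation with a period-specific severity term lets drift show up as change in a rater's severity parameter across scoring periods:

```latex
% Illustrative only: examinee n scored by rater r on criterion i in period t.
% \theta_n is proficiency, \delta_i the criterion difficulty, and \lambda_{rt}
% the severity of rater r in period t; drift appears as change in \lambda_{rt}
% over t. An LLTM-style decomposition expresses the combined difficulty term
% as a weighted sum of such basis parameters.
\log \frac{P(X_{nrit} = 1)}{P(X_{nrit} = 0)} = \theta_n - \delta_i - \lambda_{rt}
```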

2.
The hierarchical rater model (HRM) recognizes the hierarchical structure of data that arises when raters score constructed response items. In this approach, raters’ scores are not viewed as being direct indicators of examinee proficiency but rather as indicators of essay quality; the (latent categorical) quality of an examinee's essay in turn serves as an indicator of the examinee's proficiency, thus yielding a hierarchical structure. Here it is shown that a latent class model motivated by signal detection theory (SDT) is a natural candidate for the first level of the HRM, the rater model. The latent class SDT model provides measures of rater precision and various rater effects, above and beyond simply severity or leniency. The HRM‐SDT model is applied to data from a large‐scale assessment and is shown to provide a useful summary of various aspects of the raters’ performance.
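For readers unfamiliar with the signal detection framing, a minimal sketch of a latent class SDT rater model of the general kind used at the first level of the HRM-SDT follows; the notation is illustrative rather than the article's own:

```latex
% Illustrative latent-class SDT rater model: rater r assigns ordinal rating
% Y_r to an essay whose latent quality class is \eta = j. F is a cumulative
% distribution function (e.g., logistic), d_r is rater r's discrimination
% (precision), and the criteria c_{r,k} capture severity/leniency and
% category-use effects. Category probabilities follow by differencing
% adjacent cumulative probabilities.
P(Y_r \le k \mid \eta = j) = F\!\left(c_{r,k} - d_r\, j\right), \qquad k = 1, \dots, K - 1
```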

3.
In this study, we describe a framework for monitoring rater performance over time. We present several statistical indices to identify raters whose standards drift and explain how to use those indices operationally. To illustrate the use of the framework, we analyzed rating data from the 2002 Advanced Placement English Literature and Composition examination, employing a multifaceted Rasch approach to determine whether raters exhibited evidence of two types of differential rater functioning over time (i.e., changes in levels of accuracy or scale category use). Some raters showed statistically significant changes in their levels of accuracy as the scoring progressed, while other raters displayed evidence of differential scale category use over time.

4.
Researchers have explored a variety of topics related to identifying and distinguishing among specific types of rater effects, as well as the implications of different types of incomplete data collection designs for rater‐mediated assessments. In this study, we used simulated data to examine the sensitivity of latent trait model indicators of three rater effects (leniency, central tendency, and severity) in combination with different types of incomplete rating designs (systematic links, anchor performances, and spiral). We used the rating scale model and the partial credit model to calculate rater location estimates, standard errors of rater estimates, model–data fit statistics, and the standard deviation of rating scale category thresholds as indicators of rater effects, and we explored the sensitivity of these indicators under different conditions. Our results suggest that it is possible to detect rater effects when each of the three types of rating designs is used. However, there are differences in the sensitivity of each indicator related to the type of rater effect, the type of rating design, and the overall proportion of raters exhibiting effects. We discuss implications for research and practice related to rater‐mediated assessments.

5.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
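The many-faceted Rasch (FACETS) model referred to here is commonly written in adjacent-categories form; the following is a sketch with generic notation (the study's facets may be labeled differently):

```latex
% Probability that composition n receives category k rather than k-1 from
% rater j on domain i: \theta_n is writing competence, \delta_i the domain
% difficulty, \lambda_j the severity of rater j, and \tau_k the rating-scale
% threshold for category k.
\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \lambda_j - \tau_k
```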

6.
This study evaluated rater accuracy with rater-monitoring data from high-stakes examinations in England. Rater accuracy was estimated with cross-classified multilevel modelling. The data included face-to-face training and monitoring of 567 raters in 110 teams, across 22 examinations, giving a total of 5500 data points. Two rater-monitoring systems (expert consensus scores and supervisor judgement of correct scores) were utilised for all raters. Results showed significant group training (table leader) effects upon rater accuracy, and these were greater in the expert consensus score monitoring system. When supervisor judgement methods of monitoring were used, differences between training teams (table leader effects) were underestimated. Supervisor-based judgements of raters' accuracies were more widely dispersed than in the expert consensus monitoring system. Supervisors not only influenced their teams' scoring accuracies; they also overestimated differences between raters' accuracies compared with the expert consensus system. Systems using supervisor judgements of correct scores and face-to-face rater training are, therefore, likely to underestimate table leader effects and overestimate rater effects.

7.
In the United Kingdom, the majority of national assessments involve human raters. The processes by which raters determine the scores to award are central to the assessment process and affect the extent to which valid inferences can be made from assessment outcomes. Thus, understanding rater cognition has become a growing area of research in the United Kingdom. This study investigated rater cognition in the context of the assessment of school‐based project work for high‐stakes purposes. Thirteen teachers across three subjects were asked to “think aloud” whilst scoring example projects. Teachers also completed an internal standardization exercise. Nine professional raters across the same three subjects standardized a set of project scores whilst thinking aloud. The behaviors and features attended to were coded. The data provided insights into aspects of rater cognition such as reading strategies, emotional and social influences, evaluations of features of student work (which aligned with scoring criteria), and how overall judgments are reached. The findings can be related to existing theories of judgment. Based on the evidence collected, the cognition of teacher raters did not appear to be substantially different from that of professional raters.

8.
Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments.

9.
The purpose of this study was to investigate the stability of rater severity over an extended rating period. Multifaceted Rasch analysis was applied to ratings of 16 raters on writing performances of 8,285 elementary school students. Each performance was rated by two trained raters over a period of seven rating days. Performances rated on the first day were re-rated at the end of the rating period. Statistically significant differences between raters were found within each day and in all days combined. Daily estimates of the relative severity of individual raters were found to differ significantly from single, on-average estimates for the whole rating period. For 10 raters, severity estimates on the last day were significantly different from estimates on the first day. These findings cast doubt on the practice of using a single calibration of rater severity as the basis for adjustment of person measures.

10.
The decision-making behaviors of 8 raters when scoring 39 persuasive and 39 narrative essays written by second language learners were examined, first using Rasch analysis and then through think-aloud protocols. Results based on the Rasch analysis and the think-aloud protocols recorded by raters as they scored holistically and analytically suggested that rater background may have contributed to rater expectations, which might explain individual differences in how the performance criteria of the rubrics were applied when rating essays. The results further suggested that rater ego engagement with the text and/or author may have helped mitigate rater severity, and that self-monitoring behaviors by raters may have had a similar mitigating effect.

11.
In signal detection rater models for constructed response (CR) scoring, it is assumed that raters discriminate equally well between different latent classes defined by the scoring rubric. An extended model that relaxes this assumption is introduced; the model recognizes that a rater may not discriminate equally well between some of the scoring classes. The extension recognizes a different type of rater effect and is shown to offer useful tests and diagnostic plots of the equal discrimination assumption, along with ways to assess rater accuracy and various rater effects. The approach is illustrated with an application to a large‐scale language test.

12.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters' ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
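The between-subgroup fit approach rests on standard Rasch residual summaries; the following is a sketch in conventional notation, illustrative rather than the article's exact statistics:

```latex
% Standardized residual for the rating x_{ni} a rater assigned to examinee n
% on element i, where E_{ni} and W_{ni} are the model-implied expectation and
% variance. A between-subgroup outfit statistic for rater r averages the
% squared residuals over the N_g ratings that rater assigned within subgroup g.
z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}, \qquad
\mathit{Outfit}_{r,g} = \frac{1}{N_g} \sum_{(n,i) \in g} z_{ni}^{2}
```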

13.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.

14.
Many large‐scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests, or more specifically simple structured tests such as these, rely on multiple sections of multiple‐choice and/or constructed‐response items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied to simple structured tests with constructed response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M‐HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, examine parameter recovery with a focus on latent traits, and compare the M‐HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M‐HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model.

15.
Internationally, many assessment systems rely predominantly on human raters to score examinations. Arguably, this facilitates the assessment of multiple sophisticated educational constructs, strengthening assessment validity. It can introduce subjectivity into the scoring process, however, engendering threats to accuracy. The present objectives are to examine some key qualitative data collection methods used internationally to research this potential trade‐off, and to consider some theoretical contexts within which the methods are usable. Self‐report methods such as Kelly's Repertory Grid, think aloud, stimulated recall, and the NASA task load index have yielded important insights into the competencies needed for scoring expertise, as well as the sequences of mental activity that scoring typically involves. Examples of new data and of recent studies are used to illustrate these methods' strengths and weaknesses. This investigation has significance for assessment designers, developers and administrators. It may inform decisions on the methods' applicability in American and other rater cognition research contexts.

16.
Although much attention has been given to rater effects in rater‐mediated assessment contexts, little research has examined the overall stability of leniency and severity effects over time. This study examined longitudinal scoring data collected during three consecutive administrations of a large‐scale, multi‐state summative assessment program. Multilevel models were used to assess the overall extent of rater leniency/severity during scoring and examine the extent to which leniency/severity effects were stable across the three administrations. Model results were then applied to scaled scores to estimate the impact of the stability of leniency/severity effects on students' scores. Results showed relative scoring stability across administrations in mathematics. In English language arts, short constructed response items showed evidence of slightly increasing severity across administrations, while essays showed mixed results: evidence of both slightly increasing severity and moderately increasing leniency over time, depending on trait. However, when model results were applied to scaled scores, results revealed rater effects had minimal impact on students' scores.
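A minimal sketch of how such a multilevel specification might look in code, assuming hypothetical long-format columns score, administration, and rater and a hypothetical input file; this is not the study's actual model:

```python
# Sketch only: multilevel model of rater leniency/severity across
# administrations. Column names and the file "ratings_long.csv" are
# hypothetical placeholders, not from the study.
import pandas as pd
import statsmodels.formula.api as smf

ratings = pd.read_csv("ratings_long.csv")

# Fixed effect of administration captures average score drift over time;
# rater-level random intercepts and administration slopes capture each
# rater's leniency/severity and how stable it is across administrations.
model = smf.mixedlm(
    "score ~ C(administration)",
    data=ratings,
    groups=ratings["rater"],
    re_formula="~C(administration)",
)
result = model.fit()
print(result.summary())
```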

17.
Fabricated test protocols were used to study how effectively examiners agree in scoring ambiguous WISC-R responses. Clinical and school psychologists (N = 62) and graduate students (N = 48) scored WISC-R Similarities, Comprehension, and Vocabulary responses obtained in a group administration procedure. From over 15,000 responses obtained, only unusual, atypical, or ambiguous responses were selected; 726 responses were scored, with 11 groups of 10 raters each scoring 66 responses. Considerable scoring disagreement occurred. Unanimous agreement (within each group of 10 raters) was found on only 13% of the responses, while 80% of the raters agreed on 44% of the responses. Rates of scoring agreement were not related to level of rater experience, but were related to specific subtests. Higher rates of agreement were found on the Similarities and Comprehension subtests than on the Vocabulary subtest for both experienced and inexperienced raters. The results suggest that even with the improved WISC-R manual, scoring remains a difficult and challenging task.

18.
Rater training is an important part of developing and conducting large‐scale constructed‐response assessments. As part of this process, candidate raters have to pass a certification test to confirm that they are able to score consistently and accurately before they begin scoring operationally. Moreover, many assessment programs require raters to pass a calibration test before every scoring shift. To support the high‐stakes decisions made on the basis of rater certification tests, a psychometric approach for their development, analysis, and use is proposed. The circumstances and uses of these tests suggest that they are expected to have relatively low reliability. This expectation is supported by empirical data. Implications for the development and use of these tests to ensure their quality are discussed.

19.
When good model-data fit is observed, the Many-Facet Rasch (MFR) model acts as a linking and equating model that can be used to estimate student achievement, item difficulties, and rater severity on the same linear continuum. Given sufficient connectivity among the facets, the MFR model provides estimates of student achievement that are equated to control for differences in rater severity. Although several different linking designs are used in practice to establish connectivity, the implications of design differences have not been fully explored. Research on the impact of model-data fit on the quality of MFR model-based adjustments for rater severity is also limited. This study explores the effects of linking designs and model-data fit for raters on the interpretation of student achievement estimates within the context of performance assessments in music. Results indicate that performances cannot be effectively adjusted for rater effects when inadequate linking or model-data fit is present.

20.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call on researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.
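As an illustration of the comparison an rm-PRF visualizes, the sketch below computes model-expected ratings for one examinee across raters ordered by severity so they can be set against the observed ratings; the rating-scale facets parameterization and all parameter values are invented for the example, not taken from the article:

```python
# Sketch only: expected ratings under a rating-scale facets model, to be
# compared with observed ratings rater by rater (the comparison an rm-PRF
# plots). All parameter values below are illustrative.
import numpy as np

def category_probs(theta, delta, lam, tau):
    """Category probabilities (0..K) given ability theta, task difficulty
    delta, rater severity lam, and K rating-scale step parameters tau."""
    steps = theta - delta - lam - tau              # adjacent-category log-odds
    numer = np.exp(np.concatenate(([0.0], np.cumsum(steps))))
    return numer / numer.sum()

def expected_score(theta, delta, lam, tau):
    p = category_probs(theta, delta, lam, tau)
    return float(np.dot(np.arange(len(p)), p))

theta = 0.5                                        # examinee ability
tau = np.array([-1.0, 0.0, 1.0])                   # 4-category rating scale
severity = {"R1": -0.8, "R2": 0.0, "R3": 0.9}      # rater severity estimates
observed = {"R1": 3, "R2": 2, "R3": 2}             # ratings this examinee got

# Large, patterned gaps between observed and expected ratings flag person
# misfit that may be attributable to rater effects, the examinee, or both.
for rater, lam in sorted(severity.items(), key=lambda kv: kv[1]):
    expected = expected_score(theta, 0.0, lam, tau)
    print(f"{rater}: observed={observed[rater]}, expected={expected:.2f}")
```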
