期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Conceptualizing Rater Judgments and Rating Processes for Rater‐Mediated Assessments

Jue Wang George Engelhard 《Journal of Educational Measurement》2019,56(3):582-609

Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested. 相似文献

2.

An Instructional Module on Mokken Scale Analysis

下载免费PDF全文

Stefanie A. Wind 《Educational Measurement》2017,36(2):50-66

Mokken scale analysis (MSA) is a probabilistic‐nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic‐nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple‐choice physical science assessment and a rater‐mediated writing assessment. 相似文献

3.

Pedagogical Considerations for Examining Rater Variability in Rater‐Mediated Assessments: A Three‐Model Framework

Brian C. Wesolowski Stefanie A. Wind 《Journal of Educational Measurement》2019,56(3):521-546

Rater‐mediated assessments are a common methodology for measuring persons, investigating rater behavior, and/or defining latent constructs. The purpose of this article is to provide a pedagogical framework for examining rater variability in the context of rater‐mediated assessments using three distinct models. The first model is the observation model, which includes ecological/environmental considerations for the evaluation system. The second model is the measurement model, which includes the transformation of observed, rater response data to linear measures using a measurement model with specific requirements of rater‐invariant measurement in order to examine raters’ construct‐relevant variability stemming from the evaluative system. The third model is the interaction model, which includes an interaction parameter to allow for the investigation into raters’ systematic, construct‐irrelevant variability stemming from the evaluative system. Implications for measurement outcomes and validity are discussed. 相似文献

4.

How invariant and accurate are domain ratings in writing assessment?

Stefanie A. Wind George Engelhard 《Assessing Writing》2013,18(4):278-299

The use of evidence to guide policy and practice in education (Cooper, Levin, & Campbell, 2009) has included an increased emphasis on constructed-response items, such as essays and portfolios. Because assessments that go beyond selected-response items and incorporate constructed-response items are rater-mediated (Engelhard, 2002, Engelhard, 2013), it is necessary to develop evidence-based indices of quality for the rating processes used to evaluate student performances. This study proposes a set of criteria for evaluating the quality of ratings based on the concepts of measurement invariance and accuracy within the context of a large-scale writing assessment. Two measurement models are used to explore indices of quality for raters and ratings: the first model provides evidence for the invariance of ratings, and the second model provides evidence for rater accuracy. Rating quality is examined within four writing domains from an analytic rubric. Further, this study explores the alignment between indices of rating quality based on these invariance and accuracy models within each of the four domains of writing. Major findings suggest that rating quality varies across analytic rubric domains, and that there is some correspondence between indices of rating quality based on the invariance and accuracy models. Implications for research and practice are discussed. 相似文献

5.

Examining Rater Errors in the Assessment of Written Composition With a Many-Faceted Rasch Model 总被引：2，自引：0，他引：2

George Engelhard Jr 《Journal of Educational Measurement》1994,31(2):93-112

This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used 相似文献

6.

Exploring the Impact of Rater Effects on Person Fit in Rater-Mediated Assessments

Stefanie A. Wind 《Educational Measurement》2020,39(4):76-94

Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both. 相似文献

7.

Technical Quality Criteria for Evaluating District Assessment Portfolios Used in the Nebraska STARS

Barbara S. Plake James C. Impara Chad W. Buckendahl 《Educational Measurement》2004,23(2):12-16

Nebraska districts use different strategies for measuring student performance on the state's content standards. District assessments differ in type and technical quality. Six quality criteria were endorsed by the state. These criteria cover content and curricular validity, fairness, and appropriateness of score interpretations. District assessment portfolios document how well assessments meet these criteria. Districts receive ratings on how well their assessments meet each of the quality criteria and are given a rating from Unacceptable to Exemplary. This article presents these technical quality criteria and explains how they are (a) individually rated and (b) combined for the district's overall quality rating. 相似文献

8.

Assessing the reliability of self‐ and peer rating in student group work

Bo Zhang Lucy Johnston Gulsen Bagci Kilic 《Assessment & Evaluation in Higher Education》2008,33(3):329-340

Peer and self‐ratings have been strongly recommended as the means to adjust individual contributions to group work. To evaluate the quality of student ratings, previous research has primarily explored the validity of these ratings, as indicated by the degree of agreement between student and teacher ratings. This research describes a Generalizability Theory framework to evaluate the reliability of student ratings in terms of the degree of consistency among students themselves, as well as group and rater effects. Ratings from two group projects are analyzed to illustrate how this method can be applied. The reliability of student ratings differs for the two group projects considered in this research. While a strong group effect is present in both projects, the rater effect is different. Implications of this research for classroom assessment practice are discussed. 相似文献

9.

Examining Differential Rater Functioning Using a Between‐Subgroup Outfit Approach

Stefanie A. Wind Stefanie S. Sebok‐Syer 《Journal of Educational Measurement》2019,56(2):217-250

When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters’ ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups. 相似文献

10.

The Effects of Incomplete Rating Designs in Combination With Rater Effects

Stefanie A. Wind Eli Jones 《Journal of Educational Measurement》2019,56(1):76-100

Researchers have explored a variety of topics related to identifying and distinguishing among specific types of rater effects, as well as the implications of different types of incomplete data collection designs for rater‐mediated assessments. In this study, we used simulated data to examine the sensitivity of latent trait model indicators of three rater effects (leniency, central tendency, and severity) in combination with different types of incomplete rating designs (systematic links, anchor performances, and spiral). We used the rating scale model and the partial credit model to calculate rater location estimates, standard errors of rater estimates, model–data fit statistics, and the standard deviation of rating scale category thresholds as indicators of rater effects and we explored the sensitivity of these indicators to rater effects under different conditions. Our results suggest that it is possible to detect rater effects when each of the three types of rating designs is used. However, there are differences in the sensitivity of each indicator related to type of rater effect, type of rating design, and the overall proportion of effect raters. We discuss implications for research and practice related to rater‐mediated assessments. 相似文献

11.

ESTIMATING THE RELIABILITY, VALIDITY, AND INVALIDITY OF ESSAY RATINGS

H. BLOK 《Journal of Educational Measurement》1985,22(1):41-52

In an essay rating study multiple ratings may be obtained by having different raters judge essays or by having the same rater(s) repeat the judging of essays. An important question in the analysis of essay ratings is whether multiple ratings, however obtained, may be assumed to represent the same true scores. When different raters judge the same essays only once, it is impossible to answer this question. In this study 16 raters judged 105 essays on two occasions; hence, it was possible to test assumptions about true scores within the framework of linear structural equation models. It emerged that the ratings of a given rater on the two occasions represented the same true scores. However, the ratings of different raters did not represent the same true scores. The estimated intercorrelations of the true scores of different raters ranged from .415 to .910. Parameters of the best fitting model were used to compute coefficients of reliability, validity, and invalidity. The implications of these coefficients are discussed. 相似文献

12.

Evaluating Rater Accuracy in Performance Assessments 总被引：1，自引：0，他引：1

George Engelhard Jr. 《Journal of Educational Measurement》1996,33(1):56-70

A new method for evaluating rater accuracy within the context of performance assessments is described. Accuracy is defined as the match between ratings obtained from operational raters and those obtained from an expert panel on a set of benchmark, exemplar, or anchor performances. An extended Rasch measurement model called the FACETS model is presented for examining rater accuracy. The FACETS model is illustrated with 373 benchmark papers rated by 20 operational raters and an expert panel. The data are from the 1993field test of the High School Graduation Writing Test in Georgia. The data suggest that there are statistically significant differences in rater accuracy; the data also suggest that it is easier to be accurate on some benchmark papers than on others. A small example is presented to illustrate how the accuracy ordering of raters may not be invariant over different subsets of benchmarks used to evaluate accuracy. 相似文献

13.

Modeling Rater Response Processes in Evaluating Score Meaning

Suzanne Lane 《Journal of Educational Measurement》2019,56(3):653-663

Rater‐mediated assessments require the evaluation of the accuracy and consistency of the inferences made by the raters to ensure the validity of score interpretations and uses. Modeling rater response processes allows for a better understanding of how raters map their representations of the examinee performance to their representation of the scoring criteria. Validity of score meaning is affected by the accuracy of raters' representations of examinee performance and the scoring criteria, and the accuracy of the mapping process. Methodological advances and applications that model rater response processes, rater accuracy, and rater consistency inform the design, scoring, interpretations, and uses of rater‐mediated assessments. 相似文献

14.

Comparing the Effectiveness of Self‐Paced and Collaborative Frame‐of‐Reference Training on Rater Accuracy in a Large‐Scale Writing Assessment

下载免费PDF全文

Kevin R. Raczynski Allan S. Cohen George Engelhard Jr. Zhenqiu Lu 《Journal of Educational Measurement》2015,52(3):301-318

There is a large body of research on the effectiveness of rater training methods in the industrial and organizational psychology literature. Less has been reported in the measurement literature on large‐scale writing assessments. This study compared the effectiveness of two widely used rater training methods—self‐paced and collaborative frame‐of‐reference training—in the context of a large‐scale writing assessment. Sixty‐six raters were randomly assigned to the training methods. After training, all raters scored the same 50 representative essays prescored by a group of expert raters. A series of generalized linear mixed models were then fitted to the rating data. Results suggested that the self‐paced method was equivalent in effectiveness to the more time‐intensive and expensive collaborative method. Implications for large‐scale writing assessments and suggestions for further research are discussed. 相似文献

15.

Quantitative analysis of the rubric as an assessment tool: an empirical study of student peer‐group rating

John Hafner Patti Hafner 《International Journal of Science Education》2013,35(12):1509-1528

Although the rubric has emerged as one of the most popular assessment tools in progressive educational programs, there is an unfortunate dearth of information in the literature quantifying the actual effectiveness of the rubric as an assessment tool in the hands of the students. This study focuses on the validity and reliability of the rubric as an assessment tool for student peer‐group evaluation in an effort to further explore the use and effectiveness of the rubric. A total of 1577 peer‐group ratings using a rubric for an oral presentation was used in this 3‐year study involving 107 college biology students. A quantitative analysis of the rubric used in this study shows that it is used consistently by both students and the instructor across the study years. Moreover, the rubric appears to be ‘gender neutral’ and the students' academic strength has no significant bearing on the way that they employ the rubric. A significant, one‐to‐one relationship (slope = 1.0) between the instructor's assessment and the students' rating is seen across all years using the rubric. A generalizability study yields estimates of inter‐rater reliability of moderate values across all years and allows for the estimation of variance components. Taken together, these data indicate that the general form and evaluative criteria of the rubric are clear and that the rubric is a useful assessment tool for peer‐group (and self‐) assessment by students. To our knowledge, these data provide the first statistical documentation of the validity and reliability of the rubric for student peer‐group assessment. 相似文献

16.

Application of Latent Trait Models to Identifying Substantively Interesting Raters

Edward W. Wolfe Aaron McVay 《Educational Measurement》2012,31(3):31-37

Historically, research focusing on rater characteristics and rating contexts that enable the assignment of accurate ratings and research focusing on statistical indicators of accurate ratings has been conducted by separate communities of researchers. This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings. We employ two data sources in our demonstration—simulated data and data from a large‐scale state‐wide writing assessment. We apply latent trait models to these data to identify examples of rater leniency, centrality, inaccuracy, and differential dimensionality; and we investigate the association between rater training procedures and the manifestation of rater effects in the real data. 相似文献

17.

Detecting Measurement Disturbances in Rater‐Mediated Assessments

下载免费PDF全文

Stefanie A. Wind Randall E. Schumacker 《Educational Measurement》2017,36(4):44-51

The term measurement disturbance has been used to describe systematic conditions that affect a measurement process, resulting in a compromised interpretation of person or item estimates. Measurement disturbances have been discussed in relation to systematic response patterns associated with items and persons, such as start‐up, plodding, boredom, or fatigue. An understanding of the different types of measurement disturbances can lead to a more complete understanding of persons or items in terms of the construct being measured. Although measurement disturbances have been explored in several contexts, they have not been explicitly considered in the context of performance assessments. The purpose of this study is to illustrate the use of graphical methods to explore measurement disturbances related to raters within the context of a writing assessment. Graphical displays that illustrate the alignment between expected and empirical rater response functions are considered as they relate to indicators of rating quality based on the Rasch model. Results suggest that graphical displays can be used to identify measurement disturbances for raters related to specific ranges of student achievement that suggest potential rater bias. Further, results highlight the added diagnostic value of graphical displays for detecting measurement disturbances that are not captured using Rasch model–data fit statistics. 相似文献

18.

Faculty evaluation: Reliability of peer assessments of research,teaching, and service 总被引：1，自引：0，他引：1

Lawrence S. Root 《Research in higher education》1987,26(1):71-84

In this paper, assessments of faculty performance for the determination of salary increases are analyzed to estimate interrater reliability. Using the independent ratings by six elected members of the faculty, correlations between the ratings are calculated and estimates of the reliability of the composite (group) ratings are generated. Average intercorrelations are found to range from 0.603 for teaching, to 0.850 for research. The average intercorrelation for the overall faculty ratings is 0.794. Using these correlations, the reliability of the six-person group (the composite reliability) is estimated to be over 0.900 for each of the three areas and 0.959 for the overall faculty rating. Furthermore, little correlation is found between the ratings of performance levels of individual faculty members in the three areas of research, teaching, and service. The high intercorrelations and, consequently, the high composite reliabilities suggest that a reduction in the number of raters would have relatively small effects on reliability. The findings are discussed in terms of their relationship to issues of validity as well as to other questions of faculty assessment. 相似文献

19.

Assessing student exposure to and use of computer technologies through an examination of course syllabi

Michael B. Madson Timothy P. Melchert Joan L. Whipp 《Assessment & Evaluation in Higher Education》2004,29(5):549-561

A syllabus analysis instrument was developed to assist program evaluators, administrators and faculty in the identification of skills that students use as they complete their college coursework. While this instrument can be tailored for use with a variety of learning domains, we used it to assess students' use of and exposure to computer technology skills. The reliability and validity of the instrument was examined through an analysis of 88 syllabi from courses within the teacher education program and the core curriculum at a private Midwest US university. Results indicate that the instrument has good inter‐rater reliability and ratings by and interviews with faculty and students provide evidence of construct validity. The use and limitations of the instrument in educational program evaluation are discussed. 相似文献

20.

Latent trait modelling of rater accuracy in formative peer assessment of English-Chinese consecutive interpreting

Chao Han 《Assessment & Evaluation in Higher Education》2018,43(6):979-994

Despite the increasing popularity of peer assessment in tertiary-level interpreter education, very little research has been conducted to examine the quality of peer ratings on language interpretation. While previous research on the quality of peer ratings, particularly rating accuracy, mainly relies on correlation and analysis of variance, latent trait modelling emerges as a useful approach to investigate rating accuracy in rater-mediated performance assessment. The present study demonstrates the use of multifaceted Rasch partial credit modelling to explore the accuracy of peer ratings on English-Chinese consecutive interpretation. The analysis shows that there was a relatively wide spread of rater accuracy estimates and that statistically significant differences were found between peer raters regarding rating accuracy. Additionally, it was easier for peer raters to assess some students accurately than others, to peer-assess target language quality accurately than the other rating domains, and to provide accurate ratings to English-to-Chinese interpretation than the other direction. Through these findings, latent trait modelling demonstrates its capability to produce individual-level indices, measure rater accuracy directly, and accommodate sparse data rating designs. It is therefore hoped that substantive inquiries into peer assessment of language interpretation could utilise latent trait modelling to move this line of research forward. 相似文献