Similar Literature
 A total of 20 similar documents were retrieved.
1.
Rater‐mediated assessments exhibit scoring challenges due to the involvement of human raters. The quality of human ratings largely determines the reliability, validity, and fairness of the assessment process. Our research recommends that the evaluation of ratings should be based on two aspects: a theoretical model of human judgment and an appropriate measurement model for evaluating these judgments. In rater‐mediated assessments, the underlying constructs and response processes may require the use of different rater judgment models and the application of different measurement models. We describe the use of Brunswik's lens model as an organizing theme for conceptualizing human judgments in rater‐mediated assessments. The constructs vary depending on which distal variables are identified in the lens models for the underlying rater‐mediated assessment. For example, one lens model can be developed to emphasize the measurement of student proficiency, while another lens model can stress the evaluation of rater accuracy. Next, we describe two measurement models that reflect different response processes (cumulative and unfolding) from raters: Rasch and hyperbolic cosine models. Future directions for the development and evaluation of rater‐mediated assessments are suggested.
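
The contrast between cumulative and unfolding response processes mentioned above can be made concrete with the two dichotomous response functions involved. The sketch below is a minimal illustration, not taken from the article: the person/rater locations, the item location, and the unit parameter gamma = 1 are assumed values. It compares a Rasch response function, which rises monotonically, with a hyperbolic cosine (unfolding) function of the Andrich and Luo type, which peaks where the person and item locations coincide.

```python
import numpy as np

def rasch_prob(theta, delta):
    """Cumulative (Rasch) response function: P rises monotonically in theta - delta."""
    return 1.0 / (1.0 + np.exp(-(theta - delta)))

def hcm_prob(theta, delta, gamma=1.0):
    """Unfolding (hyperbolic cosine) response function (Andrich & Luo form):
    P peaks when theta is closest to delta and falls off in both directions."""
    return np.exp(gamma) / (np.exp(gamma) + 2.0 * np.cosh(theta - delta))

theta = np.linspace(-4, 4, 9)   # hypothetical person/rater locations
delta = 0.0                     # assumed item (statement) location
print("theta  Rasch   HCM")
for t in theta:
    print(f"{t:5.1f}  {rasch_prob(t, delta):.3f}  {hcm_prob(t, delta):.3f}")
```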

2.
Historically, research focusing on rater characteristics and rating contexts that enable the assignment of accurate ratings and research focusing on statistical indicators of accurate ratings have been conducted by separate communities of researchers. This study demonstrates how existing latent trait modeling procedures can identify groups of raters who may be of substantive interest to those studying the experiential, cognitive, and contextual aspects of ratings. We employ two data sources in our demonstration—simulated data and data from a large‐scale state‐wide writing assessment. We apply latent trait models to these data to identify examples of rater leniency, centrality, inaccuracy, and differential dimensionality; and we investigate the association between rater training procedures and the manifestation of rater effects in the real data.
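
As a rough illustration of two of the rater effects named above, the following simulation (hypothetical data and parameter values, not the study's simulated design) shows how leniency appears as a shifted mean and centrality as compressed variance on a 1-6 rating scale.

```python
import numpy as np

rng = np.random.default_rng(1)
true_score = rng.normal(0, 1, size=500)          # latent essay quality (assumed)

def to_scale(x, lo=1, hi=6):
    """Map a continuous judgment onto an integer 1-6 rating scale."""
    return np.clip(np.round(x), lo, hi).astype(int)

# Hypothetical raters: accurate, lenient (+1 shift), central (variance shrunk toward midpoint)
accurate = to_scale(3.5 + true_score + rng.normal(0, 0.3, 500))
lenient  = to_scale(3.5 + true_score + 1.0 + rng.normal(0, 0.3, 500))
central  = to_scale(3.5 + 0.4 * true_score + rng.normal(0, 0.3, 500))

for name, r in [("accurate", accurate), ("lenient", lenient), ("central", central)]:
    print(f"{name:8s} mean={r.mean():.2f}  sd={r.std():.2f}  "
          f"corr_with_truth={np.corrcoef(r, true_score)[0, 1]:.2f}")
```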

3.
In an essay rating study, multiple ratings may be obtained by having different raters judge essays or by having the same rater(s) repeat the judging of essays. An important question in the analysis of essay ratings is whether multiple ratings, however obtained, may be assumed to represent the same true scores. When different raters judge the same essays only once, it is impossible to answer this question. In this study 16 raters judged 105 essays on two occasions; hence, it was possible to test assumptions about true scores within the framework of linear structural equation models. It emerged that the ratings of a given rater on the two occasions represented the same true scores. However, the ratings of different raters did not represent the same true scores. The estimated intercorrelations of the true scores of different raters ranged from .415 to .910. Parameters of the best fitting model were used to compute coefficients of reliability, validity, and invalidity. The implications of these coefficients are discussed.

4.
Peer and self‐ratings have been strongly recommended as the means to adjust individual contributions to group work. To evaluate the quality of student ratings, previous research has primarily explored the validity of these ratings, as indicated by the degree of agreement between student and teacher ratings. This research describes a Generalizability Theory framework to evaluate the reliability of student ratings in terms of the degree of consistency among students themselves, as well as group and rater effects. Ratings from two group projects are analyzed to illustrate how this method can be applied. The reliability of student ratings differs for the two group projects considered in this research. While a strong group effect is present in both projects, the rater effect is different. Implications of this research for classroom assessment practice are discussed.
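
A minimal sketch of the kind of Generalizability Theory decomposition described above, for a simple persons-by-raters design with one rating per cell: variance components are estimated from the two-way ANOVA mean squares and combined into a generalizability coefficient. The data, the number of peers, and all variance magnitudes are assumed for illustration; the article's actual design also includes a group facet, which is not modeled here.

```python
import numpy as np

def g_study_p_x_r(X):
    """Variance components for a persons x raters design (one rating per cell),
    estimated from two-way ANOVA expected mean squares."""
    n_p, n_r = X.shape
    grand = X.mean()
    p_mean, r_mean = X.mean(axis=1), X.mean(axis=0)
    ss_p = n_r * np.sum((p_mean - grand) ** 2)
    ss_r = n_p * np.sum((r_mean - grand) ** 2)
    ss_res = np.sum((X - grand) ** 2) - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))
    var_res = ms_res                              # person-by-rater interaction plus error
    var_p = max((ms_p - ms_res) / n_r, 0.0)       # person (true-score) variance
    var_r = max((ms_r - ms_res) / n_p, 0.0)       # rater severity variance
    return var_p, var_r, var_res

rng = np.random.default_rng(7)
# Hypothetical data: 30 students each rated by the same 5 peers
X = rng.normal(3, 1, (30, 1)) + rng.normal(0, 0.5, (1, 5)) + rng.normal(0, 0.7, (30, 5))
var_p, var_r, var_res = g_study_p_x_r(X)
k = 5
g_coef = var_p / (var_p + var_res / k)            # relative G coefficient for k raters
print(f"var_p={var_p:.2f} var_r={var_r:.2f} var_res={var_res:.2f} G({k} raters)={g_coef:.2f}")
```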

5.
This study examines the agreement across informant pairs of teachers, parents, and students regarding the students’ social‐emotional learning (SEL) competencies. Two student subsamples representative of the Social Skills Improvement System (SSIS) SEL Edition rating forms national standardization sample were examined: first, 168 students (3rd to 12th grades) with ratings by three informants (a teacher, a parent, and the student him/herself), and second, a group of 164 students who had ratings by two raters in a similar role—two parents or two teachers. To assess interrater agreement, two methods were employed: calculation of q correlations among pairs of raters and effect size indices to capture the extent to which rater pairs differed in their assessments of social‐emotional skills. The empirical results indicated that pairs of different types of informants exhibited greater than chance levels of agreement as indexed by significant interrater correlations; teacher–parent informants showed higher correlations than teacher–student or parent–student pairs across all SEL competency domains assessed, and pairs of similar informants exhibited significantly higher correlations than pairs of dissimilar informants. Study limitations are identified and future research needs are outlined.
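
The following sketch illustrates the general logic of the interrater-agreement analysis described above: for each informant pair, a correlation across students and a standardized mean-difference effect size are computed. The simulated teacher, parent, and student scores are hypothetical, and the simple Pearson correlation and pooled-SD effect size used here are stand-ins for the study's specific q-correlation and effect-size procedures.

```python
import numpy as np

def pair_agreement(r1, r2):
    """Interrater agreement for one informant pair: correlation across students and a
    standardized mean-difference effect size (pooled-SD Cohen's d)."""
    r = np.corrcoef(r1, r2)[0, 1]
    pooled_sd = np.sqrt((r1.var(ddof=1) + r2.var(ddof=1)) / 2)
    d = (r1.mean() - r2.mean()) / pooled_sd
    return r, d

rng = np.random.default_rng(0)
true_sel = rng.normal(100, 15, 168)          # latent SEL competency (hypothetical)
teacher = true_sel + rng.normal(0, 8, 168)
parent  = true_sel + rng.normal(2, 8, 168)   # assumed to be slightly more lenient on average
student = true_sel + rng.normal(0, 12, 168)  # assumed noisier self-report

for name, (a, b) in {"teacher-parent": (teacher, parent),
                     "teacher-student": (teacher, student),
                     "parent-student": (parent, student)}.items():
    r, d = pair_agreement(a, b)
    print(f"{name:16s} r={r:.2f}  d={d:+.2f}")
```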

6.
Ratings given to the same item response may have a stronger correlation than those given to different item responses, especially when raters interact with one another before giving ratings. The rater bundle model was developed to account for such local dependence by treating the multiple ratings given to an item response as a bundle and assigning fixed‐effect parameters to describe response patterns in the bundle. Unfortunately, this model becomes difficult to manage when a polytomous item is graded by more than two raters. In this study, by adding random‐effect parameters to the facets model, we propose a class of generalized rater models to account for the local dependence among multiple ratings and intrarater variation in severity. A series of simulations was conducted with the freeware WinBUGS to evaluate parameter recovery of the new models and the consequences of ignoring the local dependence or intrarater variation in severity. The results revealed good parameter recovery when the data‐generating models were fit, and poor estimation of parameters and test reliability when the local dependence or intrarater variation in severity was ignored. An empirical example is provided.
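
A minimal sketch of the facets-type category probabilities to which such random rater effects can be attached: the adjacent-category log-odds is the person location minus the item difficulty, the rater severity, and the category threshold. The thresholds, person and item locations, and the normal distribution of rater severities are assumed values; the authors' full generalized rater models and their Bayesian estimation in WinBUGS are not reproduced here.

```python
import numpy as np

def facets_category_probs(theta, delta, lam, taus):
    """Rating-scale (facets) category probabilities for one person x item x rater:
    the log-odds of category k over k-1 is theta - delta - lam - tau_k."""
    steps = np.concatenate(([0.0], taus))        # tau_0 fixed at 0
    cum = np.cumsum(theta - delta - lam - steps)
    expcum = np.exp(cum - cum.max())             # numerical stabilization
    return expcum / expcum.sum()

rng = np.random.default_rng(3)
taus = np.array([-1.0, 0.0, 1.0])                # assumed thresholds for a 4-category scale
theta, delta = 0.5, 0.0                          # assumed person and item locations
# Random rater severities: the random effect that induces intrarater variation (assumption)
rater_severity = rng.normal(0.0, 0.4, size=3)
for r, lam in enumerate(rater_severity):
    probs = facets_category_probs(theta, delta, lam, taus)
    print(f"rater {r} (severity {lam:+.2f}):", np.round(probs, 3))
```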

7.
Researchers have documented the impact of rater effects, or raters’ tendencies to give different ratings than would be expected given examinee achievement levels, in performance assessments. However, the degree to which rater effects influence person fit, or the reasonableness of test-takers’ achievement estimates given their response patterns, has not been investigated. In rater-mediated assessments, person fit reflects the reasonableness of rater judgments of individual test-takers’ achievement over components of the assessment. This study illustrates an approach to visualizing and evaluating person fit in assessments that involve rater judgment using rater-mediated person response functions (rm-PRFs). The rm-PRF approach allows analysts to consider the impact of rater effects on person fit in order to identify individual test-takers for whom the assessment results may not have a straightforward interpretation. A simulation study is used to evaluate the impact of rater effects on person fit. Results indicate that rater effects can compromise the interpretation and use of performance assessment results for individual test-takers. Recommendations are presented that call on researchers and practitioners to supplement routine psychometric analyses for performance assessments (e.g., rater reliability checks) with rm-PRFs to identify students whose ratings may have compromised interpretations as a result of rater effects, person misfit, or both.

8.
When practitioners use modern measurement models to evaluate rating quality, they commonly examine rater fit statistics that summarize how well each rater's ratings fit the expectations of the measurement model. Essentially, this approach involves examining the unexpected ratings that each misfitting rater assigned (i.e., carrying out analyses of standardized residuals). One can create plots of the standardized residuals, isolating those that resulted from raters’ ratings of particular subgroups. Practitioners can then examine the plots to identify raters who did not maintain a uniform level of severity when they assessed various subgroups (i.e., exhibited evidence of differential rater functioning). In this study, we analyzed simulated and real data to explore the utility of this between‐subgroup fit approach. We used standardized between‐subgroup outfit statistics to identify misfitting raters and the corresponding plots of their standardized residuals to determine whether there were any identifiable patterns in each rater's misfitting ratings related to subgroups.
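
A small sketch of the residual-based logic described above, under an assumed rating-scale model: each observed rating is compared with its model-expected value, standardized residuals are computed, and a between-subgroup outfit mean square is formed for each subgroup a rater has scored. All person locations, thresholds, severities, and observed ratings below are hypothetical.

```python
import numpy as np

def category_probs(theta, severity, taus):
    """Rating-scale model category probabilities for one person-rater encounter."""
    steps = np.concatenate(([0.0], taus))
    cum = np.cumsum(theta - severity - steps)
    e = np.exp(cum - cum.max())
    return e / e.sum()

def std_residual(x, theta, severity, taus):
    """Standardized residual of an observed rating x against model expectations."""
    p = category_probs(theta, severity, taus)
    k = np.arange(len(p))
    e = np.sum(k * p)              # model-expected rating
    v = np.sum(k**2 * p) - e**2    # model variance of the rating
    return (x - e) / np.sqrt(v)

# Hypothetical ratings of one rater on examinees from two subgroups: (theta, observed rating)
taus = np.array([-1.0, 0.0, 1.0])
severity = 0.2
group_A = [(1.5, 3), (0.0, 2), (-0.5, 1), (1.0, 2)]
group_B = [(1.5, 1), (0.0, 0), (-0.5, 0), (1.0, 1)]   # systematically lower ratings here

for name, obs in [("A", group_A), ("B", group_B)]:
    z = np.array([std_residual(x, th, severity, taus) for th, x in obs])
    outfit = np.mean(z**2)         # between-subgroup outfit mean square
    print(f"subgroup {name}: mean z = {z.mean():+.2f}, outfit MS = {outfit:.2f}")
```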

9.
Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging raters' accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two-stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at four competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used and that those based on the standard errors of ratings be avoided. Suggestions are provided for using these indexes in future research and practice.
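
The recommended absolute-deviation index can be sketched as follows: evaluate each item's ICC at the target (borderline) ability level and average the absolute differences between a rater's judged probabilities and those model-implied values. The 2PL item parameters, the cut score, and the two raters' judgments below are assumed for illustration.

```python
import numpy as np

def icc_2pl(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item parameters and a target (borderline) ability level
a = np.array([1.2, 0.8, 1.5, 1.0, 0.9])
b = np.array([-0.5, 0.0, 0.4, 1.0, -1.2])
theta_cut = 0.0
model_p = icc_2pl(theta_cut, a, b)            # model-implied probabilities at the cut

# Hypothetical Angoff-style ratings (judged probability correct for a borderline examinee)
rater_judgments = {
    "rater_1": np.array([0.70, 0.55, 0.30, 0.20, 0.85]),
    "rater_2": np.array([0.90, 0.80, 0.75, 0.70, 0.95]),   # systematically too optimistic
}

for rater, judged in rater_judgments.items():
    mad = np.mean(np.abs(judged - model_p))   # absolute-deviation fit index
    print(f"{rater}: mean |judgment - ICC| = {mad:.3f}")
```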

10.
Feldman (1977), reviewing research about the reliability of student evaluations, reported that while class average responses were quite reliable (.80s and .90s), single rater reliabilities were typically low (.20s). However, studies he reviewed determined single rater reliability with internal consistency measures which assumed that differences among students in the same class (within-class variance) were completely random—an assumption which Feldman seriously questioned. In the present study, this assumption was tested by collecting evaluations from the same students at the end of each class and again one year after graduation. Single rater reliability based upon an internal consistency approach (agreement among different students in the same class) was similar to that reported by Feldman. However, single rater reliability based upon a stability approach (agreement between end-of-term and follow-up ratings by the same student) was much higher (median r = .59). These results indicate that individual student evaluations were remarkably stable over time and more reliable than previously assumed. Most important, there was systematic information in individual student ratings—beyond that implied by the class average response—that internal consistency approaches have ignored or assumed to be nonexistent.
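
The gap between single-rater and class-average reliability reported above is what the Spearman-Brown relation predicts. A quick arithmetic check, using the typical single-rater value of .20 and a few hypothetical class sizes:

```python
def spearman_brown(rho_1, k):
    """Reliability of the average of k parallel ratings, given single-rater reliability."""
    return k * rho_1 / (1 + (k - 1) * rho_1)

rho_single = 0.20        # typical single-rater reliability reported by Feldman
for k in (10, 25, 50):   # hypothetical class sizes
    print(f"class of {k:3d}: reliability of class-average rating = "
          f"{spearman_brown(rho_single, k):.2f}")
```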

11.
This study describes several categories of rater errors (rater severity, halo effect, central tendency, and restriction of range). Criteria are presented for evaluating the quality of ratings based on a many-faceted Rasch measurement (FACETS) model for analyzing judgments. A random sample of 264 compositions rated by 15 raters and a validity committee from the 1990 administration of the Eighth Grade Writing Test in Georgia is used to illustrate the model. The data suggest that there are significant differences in rater severity. Evidence of a halo effect is found for two raters who appear to be rating the compositions holistically rather than analytically. Approximately 80% of the ratings are in the two middle categories of the rating scale, indicating that the error of central tendency is present. Restriction of range is evident when the unadjusted raw score distribution is examined, although this rater error is less evident when adjusted estimates of writing competence are used.
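
The rater errors listed above can be made visible even with simple descriptive indicators before any FACETS analysis: severity shows up in a rater's mean, central tendency and restriction of range in a compressed standard deviation, and a halo effect in uniformly high correlations among a rater's analytic-domain ratings for the same composition. The simulation below is purely illustrative (all rater parameters are assumed) and is not the many-faceted Rasch analysis reported in the study.

```python
import numpy as np

rng = np.random.default_rng(11)
n_essays, n_domains = 100, 4
quality = rng.normal(0, 1, (n_essays, n_domains))   # latent analytic-domain quality (assumed)

def rate(q, shift=0.0, spread=1.0, halo=0.0, noise=0.5):
    """Hypothetical rater: shift = severity/leniency, spread < 1 = central tendency,
    halo > 0 = domain ratings pulled toward the essay's overall impression."""
    overall = q.mean(axis=1, keepdims=True)
    judged = (1 - halo) * q + halo * overall
    raw = 3.5 + shift + spread * judged + rng.normal(0, noise, q.shape)
    return np.clip(np.round(raw), 1, 6)

raters = {"typical": rate(quality),
          "severe":  rate(quality, shift=-1.0),
          "central": rate(quality, spread=0.4),
          "halo":    rate(quality, halo=0.9)}

for name, R in raters.items():
    inter_domain = np.corrcoef(R.T)                  # halo: uniformly high off-diagonal values
    mean_r = inter_domain[np.triu_indices(n_domains, 1)].mean()
    print(f"{name:8s} mean={R.mean():.2f}  sd={R.std():.2f}  mean inter-domain r={mean_r:.2f}")
```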

12.
Multiple Perspectives on Family Relationships: A Latent Variables Model
Many scholars are skeptical of family member reports on their interpersonal relationships. Familial reports are assumed to be biased by social desirability as well as other factors. In this study, a latent variables modeling approach was employed to evaluate rater reliability and bias in mother, father, and child ratings of parent-child negativity. Results based on 78 clinical families demonstrate that family member ratings contain a significant "true score" component that correlates with observer ratings of parental behavior. The presence of systematic rater effects is also demonstrated. The latent variables approach, which provides statistical control for rater effects, is recommended for the analysis of this type of data.

13.
Although the use of multiple criteria and informants is one of the most universally agreed-upon practices in the identification of gifted children, few studies to date have examined the convergent validity of multiple informants and objective ability tests in gifted identification. In this study, we illustrate the use of the correlated traits–correlated methods minus one, or CT–C(M – 1), model (Eid, Lischetzke, Nussbeck, & Trierweiler, 2003) to examine the convergent validity of self, parent, and teacher ratings relative to objective cognitive ability tests in a sample of 145 4th to 6th graders. The CT–C(M – 1) analyses revealed that teacher ratings showed the highest convergence with the objective assessments, whereas self-ratings had the lowest reliabilities and insufficient validity. Parent ratings were more reliable and valid than self-reports, but were outperformed by teacher ratings for most abilities. Overall, the CT–C(M – 1) analyses showed that the convergent validity of the ratings relative to the objective test battery was highest for numerical and lowest for creative abilities. Furthermore, whereas part of the shared variance between parent and teacher ratings reflected true convergent validity, agreement between parent and self-reports was entirely due to shared rater variance. Our analyses demonstrate the usefulness and proper interpretation of the CT–C(M – 1) approach for examining convergent validity and method effects in multitrait–multimethod data.

14.
This article examines the reliability of content analyses of state student achievement tests and state content standards. We use data from two states in three grades in mathematics and English language arts and reading to explore differences by state, content area, grade level, and document type. Using a generalizability framework, we find that reliabilities for four coders are generally greater than .80. The two problematic reliabilities are partly explained by an odd rater out. We conclude that the content analysis procedures, when used with at least five raters, provide reliable information to researchers, policymakers, and practitioners about the content of assessments and standards.
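
The conclusion about four versus five raters follows the usual decision-study logic: given estimated variance components, the generalizability of coder-averaged content codes can be projected for different numbers of coders. The variance components below are assumed values, not those estimated in the article.

```python
def d_study(var_obj, var_coder, var_res, n_coders):
    """Decision-study coefficients for averaging over n_coders:
    relative (rank-ordering) and absolute (level) generalizability."""
    rel = var_obj / (var_obj + var_res / n_coders)
    abs_ = var_obj / (var_obj + (var_coder + var_res) / n_coders)
    return rel, abs_

# Hypothetical variance components from a G-study of content-analysis codes
var_obj, var_coder, var_res = 0.50, 0.05, 0.30
for n in (2, 4, 5, 8):
    rel, abs_ = d_study(var_obj, var_coder, var_res, n)
    print(f"{n} coders: relative G = {rel:.2f}, absolute (phi) = {abs_:.2f}")
```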

15.
Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized by using multiple readers, even if fewer readers than items are used.

16.
Numerous researchers have proposed methods for evaluating the quality of rater‐mediated assessments using nonparametric methods (e.g., kappa coefficients) and parametric methods (e.g., the many‐facet Rasch model). Generally speaking, popular nonparametric methods for evaluating rating quality are not based on a particular measurement theory. On the other hand, popular parametric methods for evaluating rating quality are often based on measurement theories such as invariant measurement. However, these methods are based on assumptions and transformations that may not be appropriate for ordinal ratings. In this study, I show how researchers can use Mokken scale analysis (MSA), which is a nonparametric approach to item response theory, to evaluate rating quality within the framework of invariant measurement without the use of potentially inappropriate parametric techniques. I use an illustrative analysis of data from a rater‐mediated writing assessment to demonstrate how one can use numeric and graphical indicators from MSA to gather evidence of validity, reliability, and fairness. The results from the analyses suggest that MSA provides a useful framework within which to evaluate rater‐mediated assessments for evidence of validity, reliability, and fairness that can supplement existing popular methods for evaluating ratings.
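
One core MSA indicator of rating quality is the pairwise scalability coefficient. A common form divides the observed covariance between two raters' ratings by the maximum covariance attainable given their marginal rating distributions (obtained by pairing the sorted ratings). The sketch below uses simulated ratings with assumed noise levels and is only a stand-in for the full MSA toolkit of numeric and graphical checks referenced in the abstract.

```python
import numpy as np

def scalability_H(x, y):
    """Pairwise Mokken-type scalability coefficient: observed covariance divided by the
    maximum covariance attainable given the two marginal rating distributions
    (the maximum is reached by pairing the sorted ratings)."""
    cov = np.cov(x, y)[0, 1]
    cov_max = np.cov(np.sort(x), np.sort(y))[0, 1]
    return cov / cov_max

rng = np.random.default_rng(16)
quality = rng.normal(0, 1, 200)                   # latent writing quality (hypothetical)

def rater(noise):
    """Ordinal 0-4 ratings with a given amount of rater noise (assumed)."""
    return np.clip(np.round(2 + quality + rng.normal(0, noise, 200)), 0, 4)

r1, r2, r3 = rater(0.4), rater(0.4), rater(1.5)   # r3 is a noisier, less scalable rater
for name, a, b in [("r1-r2", r1, r2), ("r1-r3", r1, r3), ("r2-r3", r2, r3)]:
    print(f"H({name}) = {scalability_H(a, b):.2f}")
```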

17.
18.
Psychometric models based on the structural equation modeling framework are commonly used in many multiple-choice test settings to assess measurement invariance of test items across examinee subpopulations. The premise of the current article is that they may also be useful in the context of performance assessment tests to test measurement invariance of raters. The modeling approach and how it can be used for performance tests with less than optimal rater designs are illustrated using a data set from a performance test designed to measure medical students’ patient management skills. The results suggest that group-specific rater statistics can help spot differences in rater performance that might be due to rater bias, identify specific weaknesses and strengths of individual raters, and enhance decisions related to future task development, rater training, and test scoring processes.

19.
Evaluating Rater Accuracy in Performance Assessments
A new method for evaluating rater accuracy within the context of performance assessments is described. Accuracy is defined as the match between ratings obtained from operational raters and those obtained from an expert panel on a set of benchmark, exemplar, or anchor performances. An extended Rasch measurement model called the FACETS model is presented for examining rater accuracy. The FACETS model is illustrated with 373 benchmark papers rated by 20 operational raters and an expert panel. The data are from the 1993 field test of the High School Graduation Writing Test in Georgia. The data suggest that there are statistically significant differences in rater accuracy; the data also suggest that it is easier to be accurate on some benchmark papers than on others. A small example is presented to illustrate how the accuracy ordering of raters may not be invariant over different subsets of benchmarks used to evaluate accuracy.
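
The accuracy definition above can be operationalized as a dichotomous indicator of whether an operational rater reproduces the expert-panel score on each benchmark; the FACETS accuracy model then analyzes these indicators with a Rasch-type formulation, which is not estimated here. The sketch below stops at computing the indicators and comparing raters' accuracy on two benchmark subsets (all scores and slip probabilities are assumed), which is enough to see why the accuracy ordering need not be invariant across subsets.

```python
import numpy as np

rng = np.random.default_rng(19)
n_bench = 40
expert = rng.integers(1, 7, n_bench)             # expert-panel benchmark scores (hypothetical)

def operational(expert, p_slip):
    """Hypothetical operational rater: reproduces the expert score except for random slips."""
    slips = rng.random(n_bench) < p_slip
    return np.where(slips, np.clip(expert + rng.choice([-1, 1], n_bench), 1, 6), expert)

raters = {f"rater_{i}": operational(expert, p) for i, p in enumerate([0.10, 0.25, 0.40])}

low = expert <= 3                                # one subset of benchmarks
high = ~low                                      # the complementary subset
for name, scores in raters.items():
    acc = (scores == expert)                     # dichotomous accuracy indicator
    print(f"{name}: overall={acc.mean():.2f}  low-score benchmarks={acc[low].mean():.2f}  "
          f"high-score benchmarks={acc[high].mean():.2f}")
```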

20.
This study describes three least squares models to control for rater effects in performance evaluation: ordinary least squares (OLS); weighted least squares (WLS); and ordinary least squares, subsequent to applying a logistic transformation to observed ratings (LOG-OLS). The models were applied to ratings obtained from four administrations of an oral examination required for certification in a medical specialty. For any single administration, there were 40 raters and approximately 115 candidates, and each candidate was rated by four raters. The results indicated that raters exhibited significant amounts of leniency error and that application of the least squares models would change the pass-fail status of approximately 7% to 9% of the candidates. Ratings adjusted by the models demonstrated higher reliability and correlated slightly more strongly with scores on a written examination than did the observed ratings.
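
A minimal sketch of the OLS variant: ratings are regressed on candidate and rater indicators, the estimated rater effects are centered to represent leniency/severity, and each candidate's mean rating is adjusted by the average leniency of the raters encountered. The design (115 candidates, 40 raters, four raters per candidate) mirrors the description above, but all generating values are assumed, and the WLS and LOG-OLS variants are not shown.

```python
import numpy as np

rng = np.random.default_rng(20)
n_cand, n_raters, per_cand = 115, 40, 4
ability = rng.normal(0, 1, n_cand)                 # latent candidate ability (assumed)
leniency = rng.normal(0, 0.5, n_raters)            # rater leniency/severity effects (assumed)

# Each candidate is rated by 4 randomly assigned raters (hypothetical design)
rows, cols, y = [], [], []
for c in range(n_cand):
    for r in rng.choice(n_raters, per_cand, replace=False):
        rows.append(c); cols.append(r)
        y.append(5 + ability[c] + leniency[r] + rng.normal(0, 0.4))
rows, cols, y = np.array(rows), np.array(cols), np.array(y)

# OLS with candidate and rater dummies; rater effects are centered afterwards
X = np.zeros((len(y), n_cand + n_raters))
X[np.arange(len(y)), rows] = 1.0
X[np.arange(len(y)), n_cand + cols] = 1.0
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rater_eff = beta[n_cand:] - beta[n_cand:].mean()

observed = np.array([y[rows == c].mean() for c in range(n_cand)])
adjusted = np.array([y[rows == c].mean() - rater_eff[cols[rows == c]].mean()
                     for c in range(n_cand)])
print("corr(observed mean, ability)  =", round(np.corrcoef(observed, ability)[0, 1], 3))
print("corr(adjusted score, ability) =", round(np.corrcoef(adjusted, ability)[0, 1], 3))
```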
