Similar literature
 20 similar articles found (search time: 718 ms)
1.
The credibility of standard‐setting cut scores depends in part on two sources of consistency evidence: intrajudge and interjudge consistency. Although intrajudge consistency feedback has often been provided to Angoff judges in practice, more evidence is needed to determine whether it achieves its intended effect. In this randomized experiment with 36 judges, non‐numeric item‐level intrajudge consistency feedback was provided to treatment‐group judges after the first and second rounds of Angoff ratings. Compared to the judges in the control condition, those receiving the feedback significantly improved their intrajudge consistency, with the effect being stronger after the first round than after the second round. To examine whether this feedback has deleterious effects on between‐judge consistency, I also examined interjudge consistency at the cut score level and the item level using generalizability theory. The results showed that without the feedback, cut score variability worsened; with the feedback, idiosyncratic item‐level variability improved. These results suggest that non‐numeric intrajudge consistency feedback achieves its intended effect and potentially improves interjudge consistency. The findings contribute to standard‐setting feedback research and provide empirical evidence for practitioners planning Angoff procedures.

2.
Minimum standards were established for the National Teacher Examinations (NTE) area examinations in mathematics and in elementary education by independent panels of teacher educators who had been instructed in the use of either the Angoff, Nedelsky, or Jaeger procedures. Of these three procedures, only the Jaeger method requires that normative data be provided to the judges when evaluating the items. However, it was of interest to study the effect such information would have upon the standards obtained using the other two methods. Therefore, the design incorporated three sequential review sessions with the level of normative information different for each. A three-factor ANOVA revealed significant main effects for methods and sessions but not for subject area. None of the interactions was significant. The anticipated failure rates, the psychometric characteristics of the ratings, and other factors suggest that the Angoff procedure, as modified during the second session of this study, yields the most defensible standards for the NTE area examinations.

3.
An Angoff standard setting study generally yields judgments on a number of items by a number of judges (who may or may not be nested in panels). Variability associated with judges (and possibly panels) contributes error to the resulting cut score. The variability associated with items plays a more complicated role. To the extent that the mean item judgments directly reflect empirical item difficulties, the variability in Angoff judgments over items would not add error to the cut score, but to the extent that the mean item judgments do not correspond to the empirical item difficulties, variability in mean judgments over items would add error to the cut score. In this article, we present two generalizability-theory–based analyses of the proportion of the item variance that contributes to error in the cut score. For one approach, variance components are estimated on the probability (or proportion-correct) scale of the Angoff judgments, and for the other, the judgments are transferred to the theta scale of an item response theory model before estimating the variance components. The two analyses yield somewhat different results but both indicate that it is not appropriate to simply ignore the item variance component in estimating the error variance.

4.
Setting performance standards is a judgmental process involving human opinions and values as well as technical and empirical considerations. Although all cut score decisions are by nature somewhat arbitrary, they should not be capricious. Judges selected for standard‐setting panels should have the proper qualifications to make the judgments asked of them; however, even qualified judges vary in expertise and in some cases, such as highly specialized areas or when members of the public are involved, it may be difficult to ensure that each member of a standard‐setting panel has the requisite expertise to make qualified judgments. Given the subjective nature of these types of judgments, and that a large part of the validity argument for an exam lies in the robustness of its passing standard, an examination of the influence of judge proficiency on the judgments is warranted. This study explores the use of the many‐facet Rasch model as a method for adjusting modified Angoff standard‐setting ratings based on judges’ proficiency levels. The results suggest differences in the severity and quality of standard‐setting judgments across levels of judge proficiency, such that judges who answered easy items incorrectly tended to perceive them as easier, but those who answered correctly tended to provide ratings within normal stochastic limits.

5.
Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.

6.
Since 1971 there have been a number of studies in which a cut score has been set using a method proposed by Angoff (1971). In this method, each member of a panel of judges estimates for each test question the proportion correct for a specific target group of examinees. Prior and contemporary research suggests that this is a difficult task for judges. Angoff also proposed that judges simply indicate whether or not an examinee from the target group will be able to answer each question correctly (the yes/no method). We report on the results of two studies that compare a yes/no estimation with a proportion correct estimation. The two studies demonstrate that both methods produce essentially equal cut scores and that judges find the yes/no method more comfortable to use than the estimated proportion correct method.
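
The two judgment tasks described in this abstract can be illustrated with a minimal sketch. It assumes the conventional Angoff computation, in which the raw-score cut is the sum over items of the mean rating across judges; the function names and the ratings below are hypothetical and are not taken from the studies reported.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Return a raw-score cut from proportion-correct ratings.

    ratings[j][i] is judge j's estimate (0..1) of the proportion of the
    target group answering item i correctly. The cut score is the sum
    over items of the mean rating across judges.
    """
    n_items = len(ratings[0])
    item_means = [mean(judge[i] for judge in ratings) for i in range(n_items)]
    return sum(item_means)


def yes_no_cut_score(judgments):
    """Return a raw-score cut from yes/no judgments.

    judgments[j][i] is True when judge j expects a target-group examinee
    to answer item i correctly. Yes/no answers are coded 1/0 and then
    aggregated exactly like proportion ratings.
    """
    coded = [[1.0 if yes else 0.0 for yes in judge] for judge in judgments]
    return angoff_cut_score(coded)


# Hypothetical panel: 2 judges, 4 items.
proportion_ratings = [
    [0.50, 0.75, 0.25, 1.00],
    [0.50, 0.25, 0.75, 1.00],
]
yes_no_ratings = [
    [True, True, False, True],
    [False, True, True, True],
]

print(angoff_cut_score(proportion_ratings))
print(yes_no_cut_score(yes_no_ratings))
```

Coding yes/no answers as 1/0 keeps both methods on the same raw-score scale, which is what makes the resulting cut scores directly comparable.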

7.
Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.

8.
The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a measure of internal consistency by previous researchers. Results indicated that judges varied in the degree to which they were able to produce internally consistent ratings; some judges produced ratings that were highly correlated with empirical conditional probabilities and other judges’ ratings had essentially no correlation with the conditional probabilities. The results also showed that weighting procedures applied to individual judgments both increased panel-level internal consistency and produced convergence across panels.
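
The judge-level consistency measure mentioned in this abstract can be sketched as a per-judge correlation. This is an illustrative assumption rather than the authors' exact procedure: it uses a plain Pearson correlation between each judge's ratings and the empirical conditional probabilities, and all names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns NaN when either sequence has zero variance, since the
    correlation is undefined in that case.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def judge_consistency(ratings, empirical_p):
    """Correlate each judge's item ratings with the empirical conditional
    probabilities of success; one coefficient per judge."""
    return [pearson(judge, empirical_p) for judge in ratings]


# Hypothetical data: 4 items, 2 judges.
empirical_p = [0.2, 0.4, 0.6, 0.8]
ratings = [
    [0.3, 0.5, 0.7, 0.9],  # tracks item difficulty closely
    [0.5, 0.6, 0.6, 0.5],  # unrelated to item difficulty
]

for label, r in zip(["judge A", "judge B"], judge_consistency(ratings, empirical_p)):
    print(label, round(r, 3))
```

A high coefficient only shows that a judge ranks and spaces items like the empirical data; it says nothing about whether that judge's ratings are centered at the right level.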

9.
Standard setting methods such as the Angoff method rely on judgments of item characteristics; item response theory empirically estimates item characteristics and displays them in item characteristic curves (ICCs). This study evaluated several indexes of rater fit to ICCs as a method for judging rater accuracy in their estimates of expected item performance for target groups of test-takers. Simulated data were used to compare adequately fitting ratings to poorly fitting ratings at various target competence levels in a simulated two-stage standard setting study. The indexes were then applied to a set of real ratings on 66 items evaluated at 4 competence thresholds to demonstrate their relative usefulness for gaining insight into rater “fit.” Based on analysis of both the simulated and real data, it is recommended that fit indexes based on the absolute deviations of ratings from the ICCs be used, and those based on the standard errors of ratings should be avoided. Suggestions are provided for using these indexes in future research and practice.

10.
The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e., probability judgments) from previously conducted standard setting studies were used first in a re-sampling study, followed by D-studies. For the re-sampling study, proportionally stratified subsets of items were extracted under various sampling and test-length conditions. The mean cut score, variance components, expected standard error (SE) around the mean cut score, and root-mean-squared deviation (RMSD) across 1,000 replications were estimated at each study condition. The SE and the RMSD decreased as the number of items increased, but this reduction tapered off after approximately 45 items. Subsequently, D-studies were performed on the same datasets. The expected SE was computed at various test lengths. Results from both studies are consistent with previous research indicating that between 40 and 50 items are sufficient to make generalizable cut score recommendations.
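
The G-study and D-study logic in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual design: it assumes a fully crossed judges-by-items layout with one rating per cell, treats both judges and items as random facets, and reports the SE on the mean-rating (proportion) scale. Function names and data are hypothetical.

```python
from math import sqrt


def variance_components(ratings):
    """Estimate G-study variance components from ratings[j][i].

    Uses the expected mean squares of a two-way crossed design with one
    observation per cell. Negative estimates are truncated at zero.
    Returns (var_judge, var_item, var_residual).
    """
    n_j, n_i = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n_j * n_i)
    judge_means = [sum(row) / n_i for row in ratings]
    item_means = [sum(row[i] for row in ratings) / n_j for i in range(n_i)]

    ss_j = n_i * sum((m - grand) ** 2 for m in judge_means)
    ss_i = n_j * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_res = ss_total - ss_j - ss_i

    ms_j = ss_j / (n_j - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_j - 1) * (n_i - 1))

    var_res = max(ms_res, 0.0)
    var_j = max((ms_j - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_j, 0.0)
    return var_j, var_i, var_res


def expected_se(var_j, var_i, var_res, n_judges, n_items):
    """D-study standard error of the mean rating when both facets are
    treated as random and sampled at the given sizes."""
    return sqrt(var_j / n_judges + var_i / n_items + var_res / (n_judges * n_items))


# Hypothetical ratings: 2 judges by 3 items.
ratings = [
    [0.4, 0.6, 0.8],
    [0.5, 0.7, 0.9],
]
var_j, var_i, var_res = variance_components(ratings)

# Project the SE for a 10-judge panel at several test lengths.
for n_items in (10, 20, 45, 90):
    print(n_items, round(expected_se(var_j, var_i, var_res, 10, n_items), 4))
```

Because the judge component is divided only by the number of judges, adding items cannot shrink the SE below sqrt(var_j / n_judges), which is one way to read the tapering-off the abstract reports.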

11.
Evidence to support the credibility of standard setting procedures is a critical part of the validity argument for decisions made based on tests that are used for classification. One area in which there has been limited empirical study is the impact of standard setting judge selection on the resulting cut score. One important issue related to judge selection is whether the extent of judges’ content knowledge impacts their perceptions of the probability that a minimally proficient examinee will answer the item correctly. The present article reports on two studies conducted in the context of Angoff‐style standard setting for medical licensing examinations. In the first study, content experts answered and subsequently provided Angoff judgments for a set of test items. After accounting for perceived item difficulty and judge stringency, answering the item correctly had a significant (and potentially important) impact on expert judgment. The second study examined whether providing the correct answer to the judges would result in a similar effect to that associated with knowing the correct answer. The results suggested that providing the correct answer did not impact judgments. These results have important implications for the validity of standard setting outcomes in general and for judge recruitment specifically.

12.
Despite being widely used and frequently studied, the Angoff standard setting procedure has received little attention with respect to an integral part of the process: how judges incorporate examinee performance data in the decision‐making process. Without performance data, subject matter experts have considerable difficulty accurately making the required judgments. Providing data introduces the very real possibility that judges will turn their content‐based judgments into norm‐referenced judgments. This article reports on three Angoff standard setting panels for which some items were randomly assigned to have incorrect performance data. Judges were informed that some of the items were accompanied by inaccurate data, but were not told which items they were. The purpose of the manipulation was to assess the extent to which changing the instructions given to the judges would impact the extent to which they relied on the performance data. The modified instructions resulted in the judges making less use of the performance data than judges participating in recent parallel studies. The relative extent of the change judges made did not appear to be substantially influenced by the accuracy of the data.

13.
One common phenomenon in Angoff standard setting is that panelists regress their ratings in toward the middle of the probability scale. This study describes two indices based on taking ratios of standard deviations that can be utilized with a scatterplot of item ratings versus expected probabilities of success to identify whether ratings are regressed in toward the middle of the probability scale. Results from a simulation study show that the standard deviation ratio indices can successfully detect ratings for hard and easy items that are regressed in toward the middle of the probability scale in Angoff standard‐setting data, where previously proposed indices often do not work as well to detect these effects. Results from a real data set show that, while virtually all raters improve from Round 1 to Round 2 as measured by previously developed indices, the standard deviation ratios in conjunction with a scatterplot of item ratings versus expected probabilities of success can identify individuals who may still be regressing their ratings in toward the middle of the probability scale even after receiving feedback. The authors suggest using the scatterplot along with the standard deviation ratio indices and other statistics for measuring the quality of Angoff standard‐setting data.

14.
The state of Pennsylvania, like many organizations interested in performance improvement, routinely engages in professional development activities. Educators in this hands‐on activity engaged in setting meaningful criterion‐referenced cut scores for career and technical education assessments using two methods. The main purposes of this study were to (a) assess if training differences had a differential impact on standard setting of the cut scores, (b) determine if there is a significant difference in cut scores between two groups of educators, and (c) examine how cut scores established by this analytical method might differ from holistic impressions cut scores. The results showed general agreement among the career and technical education judges on the cut scores established. These judgments were not influenced by the characteristics of career and technical education students. However, the judges' analytical cut scores were significantly lower than their corresponding holistic impressions cut scores.

15.
Historically, Angoff‐based methods were used to establish cut scores on the National Assessment of Educational Progress (NAEP). In 2005, the National Assessment Governing Board oversaw multiple studies aimed at evaluating the reliability and validity of Bookmark‐based methods via a comparison to Angoff‐based methods. As the Board considered adoption of Bookmark‐based methods, it considered several criteria, including reliability of the cut scores, validity of the cut scores as evidenced by comparability of results to those from Angoff, and procedural validity as evidenced by panelist understanding of the method tasks and instructions and confidence in the results. As a result of their review, a Bookmark‐based method was adopted for NAEP, and has been used since that time. This article goes beyond the Governing Board's initial evaluations to conduct a systematic review of 27 studies in NAEP research conducted over 15 years. This research is used to evaluate Bookmark‐based methods on key criteria originally considered by the Governing Board. Findings suggest that Bookmark‐based methods have comparable reliability, resulting cut scores, and panelist evaluations to Angoff. Given that Bookmark‐based methods are shorter in duration and less costly, Bookmark‐based methods may be preferable to Angoff for NAEP standard setting.

16.
This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard.

17.
A Comparison of Three Variations on a Standard-Setting Method (total citations: 1; self-citations: 0; citations by others: 1)
The purpose of this study was to determine whether two variations on the typical Angoff group standard-setting process would produce sufficiently consistent results to recommend their use. Judgments obtained from a group of experts during a meeting were compared with judgments gathered from the same group before and after the meeting. The results indicate that differences between passing scores obtained with the three variations are relatively small, but those gathered before the meeting were less consistent than ratings gathered during and after the meeting. These results imply that judgments gathered after an initial traditional group-process session can provide an efficient alternative mechanism for setting cutting scores using the Angoff method.
This research was supported by The American Board of Internal Medicine, but does not necessarily reflect its opinions or policies.

18.
This article discusses regression effects that are commonly observed in Angoff ratings where panelists tend to think that hard items are easier than they are and easy items are more difficult than they are in comparison to estimated item difficulties. Analyses of data from two credentialing exams illustrate these regression effects and the persistence of these regression effects across rounds of standard setting, even after panelists have received feedback information and have been given the opportunity to discuss their ratings. Additional analyses show that there tended to be a relationship between the average item ratings provided by panelists and the standard deviations of those item ratings and that the relationship followed a quadratic form with peak variation in average item ratings found toward the middle of the item difficulty scale. The study concludes with discussion of these findings and what they may imply for future standard settings.

19.
This article illustrates five different methods for estimating Angoff cut scores using item response theory (IRT) models. These include maximum likelihood (ML), expected a posteriori (EAP), modal a posteriori (MAP), and weighted maximum likelihood (WML) estimators, as well as the most commonly used approach based on translating ratings through the test characteristic curve (i.e., the IRT true‐score (TS) estimator). The five methods are compared using a simulation study and a real data example. Results indicated that the application of different methods can sometimes lead to different estimated cut scores, and that there can be some key differences in impact data when using the IRT TS estimator compared to other methods. It is suggested that practitioners think carefully about their choice of methods to estimate ability and cut scores because different methods have distinct features and properties. An important consideration in the application of Bayesian methods relates to the choice of the prior and the potential bias that priors may introduce into estimates.

20.
An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided.
