期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

The Bookmark Standard-Setting Method: A Literature Review 总被引：1，自引：0，他引：1

Ana Karantonis Stephen G. Sireci 《Educational Measurement》2006,25(1):4-12

The Bookmark method for setting standards on educational tests is currently one of the most popular standard-setting methods. However, research to support the method is scarce. In this report, we review the published and unpublished literature on this method as well as some seminal work in the area of evaluating standard-setting studies. Our review highlights both strengths and limitations of the method. Strengths include its wide acceptance and panelist confidence in the method. Limitations include a potential bias to produce lower-than-intended standards and problems in selecting the most appropriate response probability value for ordering the items presented to panelists. It is clear that more research on this method is needed to support its wide use. Several areas for future research to better understand the validity of the Bookmark method for setting standards on educational tests are presented. 相似文献

2.

Adopting Cut Scores: Post-Standard-Setting Panel Considerations for Decision Makers

Kurt F. Geisinger Carina M. McCormick 《Educational Measurement》2010,29(1):38-44

Standard-setting studies utilizing procedures such as the Bookmark or Angoff methods are just one component of the complete standard-setting process. Decision makers ultimately must determine what they believe to be the most appropriate standard or cut score to use, employing the input of the standard-setting panelists as one piece of information among multiple sources. However, guidance for weighing the various components is limited. The current article describes considerations about data that are used to make standard-setting decisions, as previously outlined by Geisinger (1991) . The ten points provided by Geisinger have been expanded as they relate to shifts in educational policy and practice in educational measurement. They have been amended with six new components as well. The new considerations addressed are smoothing across grades, raising standards in progression (over grades or over time), opportunity to learn or instructional validity, input from other groups, equating or linking to previous standards, and organizational vision and goals . 相似文献

3.

Use of the Rasch IRT Model in Standard Setting: An Item-Mapping Method 总被引：1，自引：0，他引：1

Ning Wang 《Journal of Educational Measurement》2003,40(3):231-253

This article provides both logical and empirical evidence to justify the use of an item-mapping method for establishing passing scores for multiple-choice licensure and certification examinations. After describing the item-mapping standard-setting process, the rationale and theoretical basis for this method are discussed, and the similarities and differences between the item-mapping and the Bookmark methods are also provided. Empirical evidence supporting use of the item-mapping method is provided by comparing results from four standard-setting studies for diverse licensure and certification examinations. The four cut score studies were conducted using both the item-mapping and the Angoff methods. Rating data from the four standard-setting studies, using each of the two methods, were analyzed using item-by-rater random effects generalizability and dependability studies to examine which method yielded higher inter-judge consistency. Results indicated that the item-mapping method produced higher inter-judge consistency and achieved greater rater agreement than the Angoff method. 相似文献

4.

Reliability and Validity of Bookmark‐Based Methods for Standard Setting: Comparisons to Angoff‐Based Methods in the National Assessment of Educational Progress

Christina Hamme Peterson E. Matthew Schulz George Engelhard Jr. 《Educational Measurement》2011,30(2):3-14

Historically, Angoff‐based methods were used to establish cut scores on the National Assessment of Educational Progress (NAEP). In 2005, the National Assessment Governing Board oversaw multiple studies aimed at evaluating the reliability and validity of Bookmark‐based methods via a comparison to Angoff‐based methods. As the Board considered adoption of Bookmark‐based methods, it considered several criteria, including reliability of the cut scores, validity of the cut scores as evidenced by comparability of results to those from Angoff, and procedural validity as evidenced by panelist understanding of the method tasks and instructions and confidence in the results. As a result of their review, a Bookmark‐based method was adopted for NAEP, and has been used since that time. This article goes beyond the Governing Board's initial evaluations to conduct a systematic review of 27 studies in NAEP research conducted over 15 years. This research is used to evaluate Bookmark‐based methods on key criteria originally considered by the Governing Board. Findings suggest that Bookmark‐based methods have comparable reliability, resulting cut scores, and panelist evaluations to Angoff. Given that Bookmark‐based methods are shorter in duration and less costly, Bookmark‐based methods may be preferable to Angoff for NAEP standard setting. 相似文献

5.

Regression Effects in Angoff Ratings: Examples from Credentialing Exams

Adam E. Wyse 《教育实用测度》2018,31(1):68-78

This article discusses regression effects that are commonly observed in Angoff ratings where panelists tend to think that hard items are easier than they are and easy items are more difficult than they are in comparison to estimated item difficulties. Analyses of data from two credentialing exams illustrate these regression effects and the persistence of these regression effects across rounds of standard setting, even after panelists have received feedback information and have been given the opportunity to discuss their ratings. Additional analyses show that there tended to be a relationship between the average item ratings provided by panelists and the standard deviations of those item ratings and that the relationship followed a quadratic form with peak variation in average item ratings found toward the middle of the item difficulty scale. The study concludes with discussion of these findings and what they may imply for future standard settings. 相似文献

6.

A Method for Detecting Regression of Hard and Easy Item Angoff Ratings

Adam E. Wyse Ben Babcock 《Journal of Educational Measurement》2019,56(1):28-50

One common phenomenon in Angoff standard setting is that panelists regress their ratings in toward the middle of the probability scale. This study describes two indices based on taking ratios of standard deviations that can be utilized with a scatterplot of item ratings versus expected probabilities of success to identify whether ratings are regressed in toward the middle of the probability scale. Results from a simulation study show that the standard deviation ratio indices can successfully detect ratings for hard and easy items that are regressed in toward the middle of the probability scale in Angoff standard‐setting data, where previously proposed indices often do not work as well to detect these effects. Results from a real data set show that, while virtually all raters improve from Round 1 to Round 2 as measured by previously developed indices, the standard deviation ratios in conjunction with a scatterplot of item ratings versus expected probabilities of success can identify individuals who may still be regressing their ratings in toward the middle of the probability scale even after receiving feedback. The authors suggest using the scatterplot along with the standard deviation ratio indices and other statistics for measuring the quality of Angoff standard‐setting data. 相似文献

7.

Examining How Professional Roles and Test Development Experiences Impact Angoff Ratings

Adam E. Wyse 《教育实用测度》2018,31(4):324-334

An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided. 相似文献

8.

A Comparison of Angoff and Bookmark Standard Setting Methods

Chad W. Buckendahl Russell W. Smith James C. Impara Barbara S. Plake 《Journal of Educational Measurement》2002,39(3):253-263

This article presents a comparison of simplified variations on two prevalent methods, Angoff and Bookmark, for setting cut scores on educational assessments. The comparison is presented through an application with a Grade 7 Mathematics Assessment in a midwestem school district. Training and operational methods and procedures for each method are described in detail along with comparative results for the application. An alternative item ordering strategy for the Bookmark method that may increase its usability is also introduced. Although the Angoff method is more widely used, the Bookmark method has some promising features, specifically in educational settings. Teachers are able to focus on the expected performance of the "barely proficient" student without the additional challenge of estimating absolute item dificulty. 相似文献

9.

Evaluating the Operational Feasibility of Using Subsets of Items to Recommend Minimal Competency Cut Scores

Priya Kannan Adrienne Sgammato Richard J. Tannenbaum 《教育实用测度》2015,28(4):292-307

Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggest that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of ?45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected at each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test. 相似文献

10.

Modeling Judgments in the Angoff and Contrasting-Groups Method of Standard Setting

Daniël Van Nijlen Rianne Janssen 《Journal of Educational Measurement》2008,45(1):45-63

Essential for the validity of the judgments in a standard-setting study is that they follow the implicit task assumptions. In the Angoff method, judgments are assumed to be inversely related to the difficulty of the items; contrasting-groups judgments are assumed to be positively related to the ability of the students. In the present study, judgments from both procedures were modeled with a random-effects probit regression model. The Angoff judgments showed a weaker link with the position of the items on the latent scale than the contrasting-groups judgments with the position of the students. Hence, in the specific context of the study, the contrasting-groups judgments were more aligned with the underlying assumptions of the method than the Angoff judgments . 相似文献

11.

An Integration and Reprise: What We Think We Have Learned

《教育实用测度》2013,26(1):85-92

This article summarizes and contrasts the three standard-setting methods described in this special issue: judgmental policy capturing, the extended Angoff method, and the dominant profile method. An integrative summary of findings is presented, followed by conclusions concerning the relative efficacy and utility of the three methods. The article concludes with recommendations for modifying the methods and for further investigations of their psychometric properties. 相似文献

12.

Rejoinder: Evaluating Standard Setting Methods Using Error Models Proposed by Schulz

Mark D. Reckase 《Educational Measurement》2006,25(3):14-17

相似文献

13.

Diagnostic Profiles: A Standard Setting Method for Use With a Cognitive Diagnostic Model

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Journal of Educational Measurement》2016,53(4):448-458

This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard. 相似文献

14.

Multivariate Generalizability Analysis of the Impact of Training and Examinee Performance Information on Judgments Made in an Angoff-Style Standard-Setting Procedure

Brian E. Clauser David B. Swanson Polina Harik 《Journal of Educational Measurement》2002,39(4):269-290

Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent. 相似文献

15.

The Issue of Range Restriction in Bookmark Standard Setting

下载免费PDF全文

Adam E. Wyse 《Educational Measurement》2015,34(2):47-54

This article uses data from a large‐scale assessment program to illustrate the potential issue of range restriction with the Bookmark method in the context of trying to set cut scores to closely align with a set of college and career readiness benchmarks. Analyses indicated that range restriction issues existed across different response probability (RP) values and item response theory (IRT) models if one were to apply the Bookmark procedure using intact test forms. Results also suggested that range restriction may still be present if one had access to additional data from an item bank. This demonstration critically highlights challenges that may exist in some practical applications of the Bookmark method due items not being designed to cover the full range of examinee abilities. 相似文献

16.

Teachers' Ability to Estimate Item Difficulty: A Test of the Assumptions in the Angoff Standard Setting Method

James C. Impara Barbara S. Plake 《Journal of Educational Measurement》1998,35(1):69-81

The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the "borderline group," but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed. 相似文献

17.

A Critical Look into the Beuk Standard-Setting Method

Adam E. Wyse 《Educational Measurement》2020,39(1):52-60

One commonly used compromise standard-setting method is the Beuk (1984) method. A key assumption of the Beuk method is that the emphasis given to the pass rate and the percent correct ratings should be proportional to the extent that the panelists agree on their ratings. However, whether the slope of Beuk line reflects the emphasis that panelists believe should be assigned to the pass rate and the percentage correct ratings has not be fully tested. In this article, I evaluate this critical assumption of the Beuk method by asking panelists to assign importance weights to their percentage correct and pass rate judgments. I show that in several cases that the emphasis suggested by the Beuk slope is noticeably different from what one would expect and is inconsistent with importance weight ratings. I also suggest two ways that the importance weights can be used to calculate alternate cut scores, and I show that one of the ways of calculating cut scores using the importance weights leads to larger potential differences in cut score estimates. I suggest that practitioners should consider collecting importance weights when the Beuk method is used for determining cut scores. 相似文献

18.

Interpreting the Results of Three Different Standard-Setting Procedures

Donald Ross Green C. Scott Trimble Daniel M. Lewis 《Educational Measurement》2003,22(1):22-32

Different standard-setting procedures usually produce different cut points even if each has a rational basis. In 2000, three standard-setting procedures were implemented to set cut scores in each of the 18 grade/content areas comprising Kentucky's state assessment system: the Contrasting Groups, Bookmark, and Jaeger-Mills procedures. Subsequently, participants from each of the three procedures worked together in each grade/content area to synthesize the results. These synthesis participants considered the results of, and examined the materials and information provided by, each of the three separate procedures. In this article the synthesis processes are described and discussed. 相似文献

19.

全港性系统评估:水平参照分数报告和水平厘定技术在香港的实践

常蕤《考试研究》2008,(2):48-70

本文以小学数学科为例,介绍了香港全港性系统评估应用项目反应理论中的Rasch模型进行测验设计和数据分析,以及用改进的Angoff方法和书签法进行水平厘定,获得水平参照分数报告的过程。相似文献

20.

A Multi-Stage Dominant Profile Method for Setting Standards on Complex Performance Assessments

《教育实用测度》2013,26(1):57-83

相似文献