
1.

Examining How Professional Roles and Test Development Experiences Impact Angoff Ratings

Adam E. Wyse 《Applied Measurement in Education》2018,31(4):324-334

An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided.

2.

Evaluating the Operational Feasibility of Using Subsets of Items to Recommend Minimal Competency Cut Scores

Priya Kannan, Adrienne Sgammato, Richard J. Tannenbaum 《Applied Measurement in Education》2015,28(4):292-307

Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected in each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel and for a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.
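As an illustration of the Angoff arithmetic described in this abstract, the following is a minimal sketch, not the authors' procedure. It assumes the common convention that the cut score is the sum, over items, of the panel's mean probability rating; the function names, the toy ratings, and the choice of subset are all hypothetical.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Raw-score cut: sum over items of the mean rating across panelists.

    ratings[p][i] is panelist p's judged probability that a minimally
    competent examinee answers item i correctly (hypothetical layout).
    """
    n_items = len(ratings[0])
    item_means = [mean(p[i] for p in ratings) for i in range(n_items)]
    return sum(item_means)


def subset_cut_proportion(ratings, item_indices):
    """Cut score on a subset of items, expressed as a proportion correct."""
    subset = [[p[i] for i in item_indices] for p in ratings]
    return angoff_cut_score(subset) / len(item_indices)


# Toy data: 3 panelists rating 4 items (made-up numbers).
ratings = [
    [0.60, 0.70, 0.50, 0.80],
    [0.50, 0.80, 0.40, 0.90],
    [0.70, 0.60, 0.60, 0.70],
]

full_proportion = angoff_cut_score(ratings) / len(ratings[0])
subset_proportion = subset_cut_proportion(ratings, [0, 2])
print(round(full_proportion, 3), round(subset_proportion, 3))
```

Expressing both cut scores as proportions correct is one simple way to put a subset-based recommendation and a full-test recommendation on a comparable footing; an operational study would also need a standard error to judge whether the two agree.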

3.

A Multi-Stage Dominant Profile Method for Setting Standards on Complex Performance Assessments

《Applied Measurement in Education》2013,26(1):57-83


4.

Diagnostic Profiles: A Standard Setting Method for Use With a Cognitive Diagnostic Model

Gary Skaggs, Serge F. Hein, Jesse L. M. Wilkins 《Journal of Educational Measurement》2016,53(4):448-458

This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard.

5.

A Critical Look into the Beuk Standard-Setting Method

Adam E. Wyse 《Educational Measurement》2020,39(1):52-60

One commonly used compromise standard-setting method is the Beuk (1984) method. A key assumption of the Beuk method is that the emphasis given to the pass rate and the percent correct ratings should be proportional to the extent that the panelists agree on their ratings. However, whether the slope of the Beuk line reflects the emphasis that panelists believe should be assigned to the pass rate and the percentage correct ratings has not been fully tested. In this article, I evaluate this critical assumption of the Beuk method by asking panelists to assign importance weights to their percentage correct and pass rate judgments. I show that in several cases the emphasis suggested by the Beuk slope is noticeably different from what one would expect and is inconsistent with importance weight ratings. I also suggest two ways that the importance weights can be used to calculate alternate cut scores, and I show that one of the ways of calculating cut scores using the importance weights leads to larger potential differences in cut score estimates. I suggest that practitioners should consider collecting importance weights when the Beuk method is used for determining cut scores.
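To make the role of the Beuk slope concrete, here is a rough sketch based on the usual textbook description of the Beuk compromise, not on this article's own computations. It assumes each panelist supplies a required percent correct and an expected pass rate, that the Beuk line passes through the two means with slope equal to the ratio of the standard deviations, and that the compromise cut is the integer percent-correct score where the observed pass-rate curve comes closest to that line. The function name and all data are hypothetical.

```python
from statistics import mean, stdev


def beuk_cut(pct_correct_ratings, pass_rate_ratings, examinee_pct_scores):
    """Return (cut, observed_pass_rate) where the Beuk line meets the data.

    Assumption: the line runs through (mean percent correct, mean pass rate)
    with slope sd(pass rate) / sd(percent correct), so the judgment on which
    panelists agree more (smaller spread) is adjusted less.
    """
    x_bar = mean(pct_correct_ratings)
    y_bar = mean(pass_rate_ratings)
    slope = stdev(pass_rate_ratings) / stdev(pct_correct_ratings)
    n = len(examinee_pct_scores)

    best = None
    for cut in range(0, 101):  # candidate cut scores in percent correct
        observed = 100.0 * sum(s >= cut for s in examinee_pct_scores) / n
        on_line = y_bar + slope * (cut - x_bar)
        gap = abs(observed - on_line)
        if best is None or gap < best[0]:
            best = (gap, cut, observed)
    return best[1], best[2]


# Made-up judgments from three panelists and scores for ten examinees.
pct_correct = [70, 65, 75]   # required percent correct
pass_rates = [75, 65, 85]    # expected percent of examinees passing
scores = [50, 55, 60, 62, 64, 66, 68, 70, 75, 80]

cut, observed_rate = beuk_cut(pct_correct, pass_rates, scores)
print(cut, observed_rate)
```

In this toy example the panel's mean judgments (70% correct, 75% passing) are mutually inconsistent with the score distribution, and the slope of 2 determines how far each of the two judgments is moved to reach a compromise. Replacing the slope with one derived from panelists' importance weights, as the article proposes, would change only the `slope` line.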

6.

Installing a System of Performance Standards for National Assessments in the Republic of Trinidad and Tobago: Issues and Challenges

Jerome De Lisle 《Applied Measurement in Education》2015,28(4):308-329

This article explores the challenge of setting performance standards in a non-Western context. The study is centered on standard-setting practice in the national learning assessments of Trinidad and Tobago. Quantitative and qualitative data from annual evaluations between 2005 and 2009 were compiled, analyzed, and deconstructed. In the mixed methods research design, data were integrated under an evaluation framework for validating performance standards. The quantitative data included panelists’ judgments across standard-setting rounds and methods. The qualitative data included both retrospective comments from open-ended surveys and real-time data from reflective diaries. Findings for procedural and internal validity were mixed, but the evidence for external validity suggested that the final outcomes were reasonable and defensible. Nevertheless, the real-time qualitative data from the reflective diaries highlighted several cognitive challenges experienced by panelists that may have impinged on procedural and internal validity. Additional unique hindrances were lack of resources and wide variation in achievement scores. Ensuring a sustainable system of performance standards requires attention to these deficits.

7.

The Choice of Response Probability in Bookmark Standard Setting: An Experimental Study

Peter Baldwin, Melissa J. Margolis, Brian E. Clauser, Janet Mee, Marcia Winward 《Educational Measurement》2020,39(1):37-44

Evidence of the internal consistency of standard-setting judgments is a critical part of the validity argument for tests used to make classification decisions. The bookmark standard-setting procedure is a popular approach to establishing performance standards, but there is relatively little research that reflects on the internal consistency of the resulting judgments. This article presents the results of an experiment in which content experts were randomly assigned to one of two response probability conditions: .67 and .80. If the standard-setting judgments collected with the bookmark procedure are internally consistent, both conditions should produce highly similar cut scores. The results showed substantially different cut scores for the two conditions; this calls into question whether content experts can produce the type of internally consistent judgments that are required using the bookmark procedure.
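The logic of the response probability (RP) manipulation can be sketched as follows. This is an illustration under an assumed Rasch model; the abstract does not state which IRT model the study used, and the function name and difficulty values are hypothetical. Under the Rasch model, the ability at which an examinee answers an item of difficulty b correctly with probability RP is b + ln(RP / (1 - RP)).

```python
import math


def rp_location(b, rp):
    """Ability (theta) at which P(correct) equals rp for a Rasch item
    of difficulty b. Assumes the Rasch model; illustrative only."""
    return b + math.log(rp / (1.0 - rp))


# Hypothetical ordered item booklet, given as Rasch difficulties.
difficulties = [-1.2, -0.5, 0.0, 0.4, 1.1]

# If a panelist bookmarks the same item (index 2) under both RP conditions,
# the implied cut scores differ by a constant shift on the theta scale.
cut_67 = rp_location(difficulties[2], 0.67)
cut_80 = rp_location(difficulties[2], 0.80)
print(round(cut_80 - cut_67, 3))
```

Because the higher RP moves every item's mapped location upward by the same amount, internally consistent panelists would have to place their bookmark earlier in the booklet under RP = .80 to arrive at the same cut score. The experiment tests whether panelists actually make that adjustment.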

8.

Standard Setting as a Participatory Process: Implications for Validation of Standards-Based Accountability Programs (cited by: 1; self-citations: 0; cited by others: 1)

Edward H. Haertel 《Educational Measurement》2002,21(1):16-22

In standards-based accountability programs, test scores are interpreted with reference to cut scores that establish categories like "proficient" or "below basic." The meaning of these cut scores is set forth in their associated "performance standards." Validity arguments for such interpretations require both a criterion-referenced score scale and a legitimate exercise of authority by those who set the standards. Stakeholder participation in a rational and coherent deliberative process is necessary to assure that these conditions are satisfied. This article sets forth a framework for the required validity argument and suggests possible ways to enable such participation. A new standard-setting method, the "briefing book" method, is suggested for possible study.

9.

Adopting Cut Scores: Post-Standard-Setting Panel Considerations for Decision Makers

Kurt F. Geisinger, Carina M. McCormick 《Educational Measurement》2010,29(1):38-44

Standard-setting studies utilizing procedures such as the Bookmark or Angoff methods are just one component of the complete standard-setting process. Decision makers ultimately must determine what they believe to be the most appropriate standard or cut score to use, employing the input of the standard-setting panelists as one piece of information among multiple sources. However, guidance for weighing the various components is limited. The current article describes considerations about data that are used to make standard-setting decisions, as previously outlined by Geisinger (1991). The ten points provided by Geisinger have been expanded as they relate to shifts in educational policy and practice in educational measurement. They have been amended with six new components as well. The new considerations addressed are smoothing across grades, raising standards in progression (over grades or over time), opportunity to learn or instructional validity, input from other groups, equating or linking to previous standards, and organizational vision and goals.

10.

Equivalent Estimates of Borderline Group Performance in Standard Setting

John J. Norcini, Judy A. Shea 《Journal of Educational Measurement》1992,29(1):19-24

The purpose of this study was to determine if a linear procedure, typically applied to an entire examination when equating scores and rescaling judges' standards, could be used with individual item data gathered through Angoff's standard-setting method (1971). Specifically, experts' estimates of borderline group performance on one form of a test were transformed to be on the same scale as experts' estimates of borderline group performance on another form of the test. The transformations were based on examinees' responses to the items and on judges' estimates of borderline group performance. The transformed values were compared to the actual estimates provided by a group of judges. The equated and rescaled values were reasonably close to those actually assigned by the experts. Bias in the estimates was also relatively small. In general, the rescaling procedure was more accurate than the equating procedure, especially when the examinee sample size for equating was small.

11.

Consistency of Angoff-Based Standard-Setting Judgments: Are Item Judgments and Passing Scores Replicable Across Different Panels of Experts?

Richard J. Tannenbaum, Priya Kannan 《Educational Assessment》2013,18(1):66-78

Angoff-based standard setting is widely used, especially for high-stakes licensure assessments. Nonetheless, some critics have claimed that the judgment task is too cognitively complex for panelists, whereas others have explicitly challenged the consistency in (replicability of) standard-setting outcomes. Evidence of consistency in item judgments and passing scores is necessary to justify using the passing scores for consequential decisions. Few studies, however, have directly evaluated consistency across different standard-setting panels. The purpose of this study was to investigate consistency of Angoff-based standard-setting judgments and passing scores across 9 different educator licensure assessments. Two independent, multistate panels of educators were formed to recommend the passing score for each assessment, with each panel engaging in 2 rounds of judgments. Multiple measures of consistency were applied to each round of judgments. The results provide positive evidence of the consistency in judgments and passing scores.

12.

An Evaluation of Conjunctive and Compensatory Standard-Setting Strategies for Test Decisions

《Educational Assessment》2013,18(2):129-153

States are increasingly using test scores as part of the requirements for high school graduation or certification. In these circumstances, a battery of tests or, with writing, analytic traits are considered that usually cover different aspects of the state's content standards. Because pass or fail decisions are made affecting students' futures, the validity of standard-setting procedures and strategies is a major concern. Policymakers and legislators must decide which of these 2 standard-setting strategies to use for making pass or fail decisions for students seeking certification or for meeting a high school graduation requirement. The compensatory strategy focuses on total performance, summing scores across all tests in the battery. The conjunctive strategy requires passing performance for each test in the battery. This article reviews and evaluates compensatory and conjunctive standard-setting strategies. The rationales for each type are presented and discussed. Results from a study comparing the compensatory and conjunctive strategies for a state high school certification writing test provide insight into the problem of choosing either strategy. This article concludes with a set of recommendations for those who must decide which type of standard-setting strategy to use.
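The two decision rules defined in this abstract can be written down directly. The sketch below follows those definitions (sum across the battery versus pass every test); the function names, scores, and cut values are hypothetical and are not taken from the study.

```python
def compensatory_pass(scores, total_cut):
    """Pass if the summed score across all tests meets the total cut."""
    return sum(scores) >= total_cut


def conjunctive_pass(scores, cuts):
    """Pass only if every test score meets its own cut."""
    return all(score >= cut for score, cut in zip(scores, cuts))


# Hypothetical examinee: strong on tests 1 and 3, weak on test 2.
scores = [8, 3, 7]
per_test_cuts = [5, 5, 5]
total_cut = sum(per_test_cuts)  # 15; one possible way to set the total cut

print(compensatory_pass(scores, total_cut))    # strengths offset the weakness
print(conjunctive_pass(scores, per_test_cuts))  # the weak test blocks a pass
```

The example shows the practical difference between the strategies: the same examinee passes under the compensatory rule and fails under the conjunctive rule, because only the former lets high scores offset a low one.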

13.

Setting Performance Standards: Contemporary Methods

Gregory J. Cizek, Michael B. Bunch, Heather Koons 《Educational Measurement》2004,23(4):31-31

This module describes some common standard-setting procedures used to derive performance levels for achievement tests in education, licensure, and certification. Upon completing the module, readers will be able to: describe what standard setting is; understand why standard setting is necessary; recognize some of the purposes of standard setting; calculate cut scores using various methods; and identify elements to be considered when evaluating standard-setting procedures. A self-test and annotated bibliography are provided at the end of the module. Teaching aids to accompany the module are available through NCME.

14.

The Effect of Various Factors on Standard Setting

John J. Norcini, Judy A. Shea, D. Theresa Kanya 《Journal of Educational Measurement》1988,25(1):57-65

This paper reports two studies of standard setting using Angoff's method. Results of the first study suggest that specialization within broad content areas does not affect an expert's estimates of the performance of the borderline group. This is reassuring because the knowledge base of many professions is so large that no individual can be considered an expert in all aspects of it. Results of the second study support the recommendation that performance data be provided during the standard-setting process. Such data are frequently used by experts but will not affect the standard unless the distribution of item difficulties is markedly skewed. Providing the data also increases the correspondence between p-values and estimates of borderline group performance, thereby reducing errors in pass/fail decisions. Overall, the results support recommendations often made in standard-setting literature, but they need to be replicated with other groups of experts.

15.

The Impact of Examinee Performance Information on Judges’ Cut Scores in Modified Angoff Standard‐Setting Exercises

Melissa J. Margolis, Brian E. Clauser 《Educational Measurement》2014,33(1):15-22

This research evaluated the impact of a common modification to Angoff standard‐setting exercises: the provision of examinee performance data. Data from 18 independent standard‐setting panels across three different medical licensing examinations were examined to investigate whether and how the provision of performance information impacted judgments and the resulting cut scores. Results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. After the review of performance data, panelist variability generally decreased. In addition, for all panels and examinations pre‐ and post‐data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail rate changes were associated with the cut score changes for a majority of standard‐setting exercises. This study is the first to provide a large‐scale, systematic evaluation of the impact of a common standard setting practice, and the results can provide practitioners with insight into how the practice influences panelist variability and resulting cut scores.

16.

Multivariate Generalizability Analysis of the Impact of Training and Examinee Performance Information on Judgments Made in an Angoff-Style Standard-Setting Procedure

Brian E. Clauser, David B. Swanson, Polina Harik 《Journal of Educational Measurement》2002,39(4):269-290

Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.

17.

Embedded Standard Setting: Aligning Standard-Setting Methodology with Contemporary Assessment Design Principles

Daniel Lewis, Robert Cook 《Educational Measurement》2020,39(1):8-21


18.

Use of the Rasch IRT Model in Standard Setting: An Item-Mapping Method (cited by: 1; self-citations: 0; cited by others: 1)

Ning Wang 《Journal of Educational Measurement》2003,40(3):231-253

This article provides both logical and empirical evidence to justify the use of an item-mapping method for establishing passing scores for multiple-choice licensure and certification examinations. After describing the item-mapping standard-setting process, the rationale and theoretical basis for this method are discussed, and the similarities and differences between the item-mapping and the Bookmark methods are also provided. Empirical evidence supporting use of the item-mapping method is provided by comparing results from four standard-setting studies for diverse licensure and certification examinations. The four cut score studies were conducted using both the item-mapping and the Angoff methods. Rating data from the four standard-setting studies, using each of the two methods, were analyzed using item-by-rater random effects generalizability and dependability studies to examine which method yielded higher inter-judge consistency. Results indicated that the item-mapping method produced higher inter-judge consistency and achieved greater rater agreement than the Angoff method.

19.

Teachers' Ability to Estimate Item Difficulty: A Test of the Assumptions in the Angoff Standard Setting Method

James C. Impara, Barbara S. Plake 《Journal of Educational Measurement》1998,35(1):69-81

The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the "borderline group," but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed.

20.

Interpreting the Results of Three Different Standard-Setting Procedures

Donald Ross Green, C. Scott Trimble, Daniel M. Lewis 《Educational Measurement》2003,22(1):22-32

Different standard-setting procedures usually produce different cut points even if each has a rational basis. In 2000, three standard-setting procedures were implemented to set cut scores in each of the 18 grade/content areas comprising Kentucky's state assessment system: the Contrasting Groups, Bookmark, and Jaeger-Mills procedures. Subsequently, participants from each of the three procedures worked together in each grade/content area to synthesize the results. These synthesis participants considered the results of, and examined the materials and information provided by, each of the three separate procedures. In this article the synthesis processes are described and discussed.