1.

An Experimental Study of the Internal Consistency of Judgments Made in Bookmark Standard Setting

Brian E. Clauser Peter Baldwin Melissa J. Margolis Janet Mee Marcia Winward 《Journal of Educational Measurement》2017,54(4):481-497

Validating performance standards is challenging and complex. Because of the difficulties associated with collecting evidence related to external criteria, validity arguments rely heavily on evidence related to internal criteria, especially evidence that expert judgments are internally consistent. Given its importance, it is somewhat surprising that evidence of this kind has rarely been published in the context of the widely used bookmark standard‐setting procedure. In this article we examined the effect of ordered item booklet difficulty on content experts’ bookmark judgments. If panelists make internally consistent judgments, their resultant cut scores should be unaffected by the difficulty of their respective booklets. This internal consistency was not observed: the results suggest that substantial systematic differences in the resultant cut scores can arise when the difficulty of the ordered item booklets varies. These findings raise questions about the ability of content experts to make the judgments required by the bookmark procedure.

2.

Multivariate Generalizability Analysis of the Impact of Training and Examinee Performance Information on Judgments Made in an Angoff-Style Standard-Setting Procedure

Brian E. Clauser David B. Swanson Polina Harik 《Journal of Educational Measurement》2002,39(4):269-290

Cut scores, estimated using the Angoff procedure, are routinely used to make high-stakes classification decisions based on examinee scores. Precision is necessary in estimation of cut scores because of the importance of these decisions. Although much has been written about how these procedures should be implemented, there is relatively little literature providing empirical support for specific approaches to providing training and feedback to standard-setting judges. This article presents a multivariate generalizability analysis designed to examine the impact of training and feedback on various sources of error in estimation of cut scores for a standard-setting procedure in which multiple independent groups completed the judgments. The results indicate that after training, there was little improvement in the ability of judges to rank order items by difficulty but there was a substantial improvement in inter-judge consistency in centering ratings. The results also show a substantial group effect. Consistent with this result, the direction of change for the estimated cut score was shown to be group dependent.

3.

Consistency of Angoff-Based Standard-Setting Judgments: Are Item Judgments and Passing Scores Replicable Across Different Panels of Experts?

Richard J. Tannenbaum Priya Kannan 《Educational Assessment》2013,18(1):66-78

Angoff-based standard setting is widely used, especially for high-stakes licensure assessments. Nonetheless, some critics have claimed that the judgment task is too cognitively complex for panelists, whereas others have explicitly challenged the consistency in (replicability of) standard-setting outcomes. Evidence of consistency in item judgments and passing scores is necessary to justify using the passing scores for consequential decisions. Few studies, however, have directly evaluated consistency across different standard-setting panels. The purpose of this study was to investigate consistency of Angoff-based standard-setting judgments and passing scores across 9 different educator licensure assessments. Two independent, multistate panels of educators were formed to recommend the passing score for each assessment, with each panel engaging in 2 rounds of judgments. Multiple measures of consistency were applied to each round of judgments. The results provide positive evidence of the consistency in judgments and passing scores.

4.

A Comparison of Three Variations on a Standard-Setting Method

John J. Norcini Rebecca S. Lipner Lynn O. Langdon Carolyn A. Strecker 《Journal of Educational Measurement》1987,24(1):56-64

The purpose of this study was to determine whether two variations on the typical Angoff group standard-setting process would produce sufficiently consistent results to recommend their use. Judgments obtained from a group of experts during a meeting were compared with judgments gathered from the same group before and after the meeting. The results indicate that differences between passing scores obtained with the three variations are relatively small, but those gathered before the meeting were less consistent than ratings gathered during and after the meeting. These results imply that judgments gathered after an initial traditional group-process session can provide an efficient alternative mechanism for setting cutting scores using the Angoff method.
This research was supported by The American Board of Internal Medicine, but does not necessarily reflect its opinions or policies.

5.

Rejoinder: Evaluating Standard Setting Methods Using Error Models Proposed by Schulz

Mark D. Reckase 《Educational Measurement》2006,25(3):14-17

6.

Examining How Professional Roles and Test Development Experiences Impact Angoff Ratings

Adam E. Wyse 《Applied Measurement in Education》2018,31(4):324-334

An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided.

7.

Interpreting the Results of Three Different Standard-Setting Procedures

Donald Ross Green C. Scott Trimble Daniel M. Lewis 《Educational Measurement》2003,22(1):22-32

Different standard-setting procedures usually produce different cut points even if each has a rational basis. In 2000, three standard-setting procedures were implemented to set cut scores in each of the 18 grade/content areas comprising Kentucky's state assessment system: the Contrasting Groups, Bookmark, and Jaeger-Mills procedures. Subsequently, participants from each of the three procedures worked together in each grade/content area to synthesize the results. These synthesis participants considered the results of, and examined the materials and information provided by, each of the three separate procedures. In this article the synthesis processes are described and discussed.

8.

Increasing the Validity of Angoff Standards Through Analysis of Judge-Level Internal Consistency

Jerome C. Clauser Brian E. Clauser Ronald K. Hambleton 《Applied Measurement in Education》2013,26(1):19-30

The purpose of the present study was to extend past work with the Angoff method for setting standards by examining judgments at the judge level rather than the panel level. The focus was on investigating the relationship between observed Angoff standard setting judgments and empirical conditional probabilities. This relationship has been used as a measure of internal consistency by previous researchers. Results indicated that judges varied in the degree to which they were able to produce internally consistent ratings; some judges produced ratings that were highly correlated with empirical conditional probabilities and other judges’ ratings had essentially no correlation with the conditional probabilities. The results also showed that weighting procedures applied to individual judgments both increased panel-level internal consistency and produced convergence across panels.
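
The judge-level consistency measure described in this abstract can be illustrated with a small sketch. It assumes the measure is a plain Pearson correlation between each judge's item ratings and the empirical conditional probabilities; the abstract does not state the exact statistic or the weighting procedure, and the function names and data below are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def judge_consistency(ratings_by_judge, empirical_probs):
    """Correlate each judge's Angoff ratings with the empirical probabilities.

    ratings_by_judge maps a judge label to one rating per item;
    empirical_probs holds one empirical conditional probability per item.
    """
    return {
        judge: pearson_r(ratings, empirical_probs)
        for judge, ratings in ratings_by_judge.items()
    }


# Hypothetical data: four items, two judges.
empirical_probs = [0.2, 0.4, 0.6, 0.8]
ratings_by_judge = {
    "A": [0.3, 0.5, 0.7, 0.9],  # tracks the empirical ordering closely
    "B": [0.5, 0.7, 0.7, 0.5],  # unrelated to the empirical ordering
}
consistency = judge_consistency(ratings_by_judge, empirical_probs)
```

In this toy data, judge A's ratings correlate strongly with the empirical values while judge B's do not, which mirrors the kind of between-judge variation the abstract reports.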

9.

Use of the Rasch IRT Model in Standard Setting: An Item-Mapping Method

Ning Wang 《Journal of Educational Measurement》2003,40(3):231-253

This article provides both logical and empirical evidence to justify the use of an item-mapping method for establishing passing scores for multiple-choice licensure and certification examinations. After describing the item-mapping standard-setting process, the rationale and theoretical basis for this method are discussed, and the similarities and differences between the item-mapping and the Bookmark methods are also provided. Empirical evidence supporting use of the item-mapping method is provided by comparing results from four standard-setting studies for diverse licensure and certification examinations. The four cut score studies were conducted using both the item-mapping and the Angoff methods. Rating data from the four standard-setting studies, using each of the two methods, were analyzed using item-by-rater random effects generalizability and dependability studies to examine which method yielded higher inter-judge consistency. Results indicated that the item-mapping method produced higher inter-judge consistency and achieved greater rater agreement than the Angoff method.

10.

Setting Passing Scores on Passage-Based Tests: A Comparison of Traditional and Single-Passage Bookmark Methods

Gary Skaggs Serge F. Hein Risper Awuor 《Applied Measurement in Education》2013,26(4):405-426

In this study, a variation of the bookmark standard setting procedure for passage-based tests is proposed in which separate ordered item booklets are created for the items associated with each passage. This variation is compared to the traditional bookmark procedure for a fifth-grade reading test. The results showed that the single-passage bookmark method produced greater consistency among the participants' cutscores, and most participants' bookmark placements did not change after the first round. In addition, participants reported greater understanding of the bookmarking task and greater confidence in their recommended cutscores. Both procedures required approximately the same amount of time to complete, but it is likely that the single-passage bookmark method could be carried out in two, or possibly even one, round of bookmarking rather than the three rounds used in traditional bookmarking. On the other hand, there are several concerns about the single-passage bookmark method that warrant further research. These include floor and ceiling effects, training issues, optimal booklet length, and multiple standards.

11.

A Multi-Stage Dominant Profile Method for Setting Standards on Complex Performance Assessments

《Applied Measurement in Education》2013,26(1):57-83

12.

Setting Standards for English Foreign Language Assessment: Methodology, Validation, and a Degree of Arbitrariness

Simon P. Tiffin‐Richards Hans Anand Pant Olaf Köller 《Educational Measurement》2013,32(2):15-25

Cut‐scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard‐setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses of a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut‐score recommendations, as well as significant cut‐score judgment revision over cut‐score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut‐score recommendations using the widely employed bookmark method.

13.

Using Diagnostic Profiles to Describe Borderline Performance in Standard Setting

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Educational Measurement》2020,39(1):45-51

In test-centered standard-setting methods, borderline performance can be represented by many different profiles of strengths and weaknesses. As a result, asking panelists to estimate item or test performance for a hypothetical group study of borderline examinees, or a typical borderline examinee, may be an extremely difficult task and one that can lead to questionable results in setting cut scores. In this study, data collected from a previous standard-setting study are used to deduce panelists’ conceptions of profiles of borderline performance. These profiles are then used to predict cut scores on a test of algebra readiness. The results indicate that these profiles can predict a very wide range of cut scores both within and between panelists. Modifications are proposed to existing training procedures for test-centered methods that can account for the variation in borderline profiles.

14.

A Critical Look into the Beuk Standard-Setting Method

Adam E. Wyse 《Educational Measurement》2020,39(1):52-60

One commonly used compromise standard-setting method is the Beuk (1984) method. A key assumption of the Beuk method is that the emphasis given to the pass rate and the percent correct ratings should be proportional to the extent that the panelists agree on their ratings. However, whether the slope of the Beuk line reflects the emphasis that panelists believe should be assigned to the pass rate and the percentage correct ratings has not been fully tested. In this article, I evaluate this critical assumption of the Beuk method by asking panelists to assign importance weights to their percentage correct and pass rate judgments. I show that in several cases the emphasis suggested by the Beuk slope is noticeably different from what one would expect and is inconsistent with importance weight ratings. I also suggest two ways that the importance weights can be used to calculate alternate cut scores, and I show that one of the ways of calculating cut scores using the importance weights leads to larger potential differences in cut score estimates. I suggest that practitioners should consider collecting importance weights when the Beuk method is used for determining cut scores.

15.

Standard Setting as a Participatory Process: Implications for Validation of Standards-Based Accountability Programs

Edward H. Haertel 《Educational Measurement》2002,21(1):16-22

In standards-based accountability programs, test scores are interpreted with reference to cut scores that establish categories like "proficient" or "below basic." The meaning of these cut scores is set forth in their associated "performance standards." Validity arguments for such interpretations require both a criterion-referenced score scale and a legitimate exercise of authority by those who set the standards. Stakeholder participation in a rational and coherent deliberative process is necessary to assure that these conditions are satisfied. This article sets forth a framework for the required validity argument and suggests possible ways to enable such participation. A new standard-setting method, the "briefing book" method, is suggested for possible study.

16.

Evaluating the Operational Feasibility of Using Subsets of Items to Recommend Minimal Competency Cut Scores

Priya Kannan Adrienne Sgammato Richard J. Tannenbaum 《Applied Measurement in Education》2015,28(4):292-307

Establishing cut scores using the Angoff method requires panelists to evaluate every item on a test and make a probability judgment. This can be time-consuming when there are large numbers of items on the test. Previous research using resampling studies suggests that it is possible to recommend stable Angoff-based cut score estimates using a content-stratified subset of approximately 45 items. Recommendations from earlier work were directly applied in this study in two operational standard-setting meetings. Angoff cut scores from two panels of raters were collected at each study, wherein one panel established the cut score based on the entire test, and another comparable panel first used a proportionally stratified subset of 45 items, and subsequently used the entire test in recommending the cut scores. The cut scores recommended for the subset of items were compared to the cut scores recommended based on the entire test for the same panel, and a comparable independent panel. Results from both studies suggest that cut scores recommended using a subset of items are comparable (i.e., within one standard error) to the cut score estimates from the full test.
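
As a rough illustration of the comparison described in this abstract, the sketch below computes an Angoff cut score on the proportion-correct scale and checks whether a second cut score falls within one standard error of it. It assumes each panelist's cut score is the mean of that panelist's item probability judgments and that the standard error is taken across panelists; the abstract does not specify either detail, and the function names and ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def angoff_cut_score(ratings):
    """Return (cut score, standard error) on the proportion-correct scale.

    ratings[p][i] is panelist p's probability judgment for item i.
    """
    # Each panelist's cut score is the mean of their item judgments.
    panelist_cuts = [mean(panelist) for panelist in ratings]
    cut = mean(panelist_cuts)
    # Standard error of the panel mean (assumed definition).
    se = stdev(panelist_cuts) / sqrt(len(panelist_cuts))
    return cut, se


def within_one_se(other_cut, reference_cut, reference_se):
    """True if other_cut lies within one standard error of reference_cut."""
    return abs(other_cut - reference_cut) <= reference_se


# Hypothetical full-test ratings: three panelists, four items.
full_test_ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.6, 0.6, 0.7],
    [0.7, 0.8, 0.4, 0.9],
]
full_cut, full_se = angoff_cut_score(full_test_ratings)
```

Working on the proportion-correct scale is a design choice made here so that a cut score from an item subset and one from the full test can be compared directly despite their different lengths.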

17.

Evaluating the Consistency of Test Content Across Two Successive Administrations of a State-Mandated Science Assessment

Timothy O'Neil Stephen G. Sireci Kristen L. Huff 《Educational Assessment》2013,18(3-4):129-151

Educational tests used for accountability purposes must represent the content domains they purport to measure. When such tests are used to monitor progress over time, the consistency of the test content across years is important for ensuring that observed changes in test scores are due to student achievement rather than to changes in what the test is measuring. In this study, expert science teachers evaluated the content and cognitive characteristics of the items from 2 consecutive annual administrations of a 10th-grade science assessment. The results indicated the content area representation was fairly consistent across years and the proportion of items measuring the different cognitive skill areas was also consistent. However, the experts identified important cognitive distinctions among the test items that were not captured in the test specifications. The implications of this research for the design of science assessments and for appraising the content validity of state-mandated assessments are discussed.

18.

Determination of clinically relevant content for a musculoskeletal anatomy curriculum for physical medicine and rehabilitation residents

Kristina Lisk John F. Flannery Eldon Y. Loh Denyse Richardson Anne M.R. Agur Nicole N. Woods 《Anatomical sciences education》2014,7(2):135-143

To address the need for more clinical anatomy training in residency education, many postgraduate programs have implemented structured anatomy courses into their curriculum. Consensus often does not exist on specific content and level of detail of the content that should be included in such curricula. This article describes the use of the Delphi method to identify clinically relevant content to incorporate in a musculoskeletal anatomy curriculum for Physical Medicine and Rehabilitation (PM&R) residents. A two‐round modified Delphi involving PM&R experts was used to establish the curricular content. The anatomical structures and clinical conditions presented to the expert group were compiled using multiple sources: clinical musculoskeletal anatomy cases from the PM&R residency program at the University of Toronto; consultation with PM&R experts; and textbooks. In each round, experts rated the importance of each curricular item to PM&R residency education using a five‐point Likert scale. Internal consistency (Cronbach's alpha) was used to determine consensus at the end of each round and agreement scores were used as an outcome measure to determine the content to include in the curriculum. The overall internal consistency in both rounds was 0.99. A total of 37 physiatrists from across Canada participated and the overall response rate over two rounds was 97%. The initial curricular list consisted of 361 items. After the second iteration, the list was reduced by 44%. By using a national consensus method we were able to objectively determine the relevant anatomical structures and clinical musculoskeletal conditions important in daily PM&R practice. Anat Sci Educ 7: 135–143. © 2013 American Association of Anatomists.
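
For readers unfamiliar with the statistic named in this abstract, Cronbach's alpha for k components is k/(k-1) multiplied by one minus the ratio of the summed component variances to the variance of the total score. The sketch below implements that standard formula. Whether the authors treated experts or curricular items as the components is not stated in the abstract, so the data layout, function name, and scores here are assumptions for illustration only.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a rectangular table of scores.

    scores[r][c] is the score in row r (one observation) for
    column c (one component whose consistency is being assessed).
    """
    k = len(scores[0])
    # Variance of each component across observations.
    component_vars = [variance([row[c] for row in scores]) for c in range(k)]
    # Variance of the per-observation total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(component_vars) / total_var)


# Hypothetical five-point Likert ratings: three observations, two components.
alpha_perfect = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
alpha_partial = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
```

The first table has two components that agree exactly, so alpha reaches its maximum of 1; the second has partial agreement and yields a lower value.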

19.

Installing a System of Performance Standards for National Assessments in the Republic of Trinidad and Tobago: Issues and Challenges

Jerome De Lisle 《Applied Measurement in Education》2015,28(4):308-329

This article explores the challenge of setting performance standards in a non-Western context. The study is centered on standard-setting practice in the national learning assessments of Trinidad and Tobago. Quantitative and qualitative data from annual evaluations between 2005 and 2009 were compiled, analyzed, and deconstructed. In the mixed methods research design, data were integrated under an evaluation framework for validating performance standards. The quantitative data included panelists’ judgments across standard-setting rounds and methods. The qualitative data included both retrospective comments from open-ended surveys and real-time data from reflective diaries. Findings for procedural and internal validity were mixed, but the evidence for external validity suggested that the final outcomes were reasonable and defensible. Nevertheless, the real-time qualitative data from the reflective diaries highlighted several cognitive challenges experienced by panelists that may have impinged on procedural and internal validity. Additional unique hindrances were lack of resources and wide variation in achievement scores. Ensuring a sustainable system of performance standards requires attention to these deficits.

20.

The Effect of Various Factors on Standard Setting

John J. Norcini Judy A. Shea D. Theresa Kanya 《Journal of Educational Measurement》1988,25(1):57-65

This paper reports two studies of standard setting using Angoff's method. Results of the first study suggest that specialization within broad content areas does not affect an expert's estimates of the performance of the borderline group. This is reassuring because the knowledge base of many professions is so large that no individual can be considered an expert in all aspects of it. Results of the second study support the recommendation that performance data be provided during the standard-setting process. They are frequently used by experts, but will not have an impact on the standard unless the distribution of item difficulties is skewed markedly. Providing performance data also increases the correspondence between p-values and estimates of borderline group performance, thereby reducing errors in pass/fail decisions. Overall, the results support recommendations often made in standard-setting literature, but they need to be replicated with other groups of experts.