期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

An Experimental Study of the Internal Consistency of Judgments Made in Bookmark Standard Setting

Brian E. Clauser Peter Baldwin Melissa J. Margolis Janet Mee Marcia Winward 《Journal of Educational Measurement》2017,54(4):481-497

Validating performance standards is challenging and complex. Because of the difficulties associated with collecting evidence related to external criteria, validity arguments rely heavily on evidence related to internal criteria—especially evidence that expert judgments are internally consistent. Given its importance, it is somewhat surprising that evidence of this kind has rarely been published in the context of the widely used bookmark standard‐setting procedure. In this article we examined the effect of ordered item booklet difficulty on content experts’ bookmark judgments. If panelists make internally consistent judgments, their resultant cut scores should be unaffected by the difficulty of their respective booklets. This internal consistency was not observed: the results suggest that substantial systematic differences in the resultant cut scores can arise when the difficulty of the ordered item booklets varies. These findings raise questions about the ability of content experts to make the judgments required by the bookmark procedure. 相似文献

2.

Teachers' Ability to Estimate Item Difficulty: A Test of the Assumptions in the Angoff Standard Setting Method

James C. Impara Barbara S. Plake 《Journal of Educational Measurement》1998,35(1):69-81

The Angoff (1971) standard setting method requires expert panelists to (a) conceptualize candidates who possess the qualifications of interest (e.g., the minimally qualified) and (b) estimate actual item performance for these candidates. Past and current research (Bejar, 1983; Shepard, 1994) suggests that estimating item performance is difficult for panelists. If panelists cannot perform this task, the validity of the standard based on these estimates is in question. This study tested the ability of 26 classroom teachers to estimate item performance for two groups of their students on a locally developed district-wide science test. Teachers were more accurate in estimating the performance of the total group than of the "borderline group," but in neither case was their accuracy level high. Implications of this finding for the validity of item performance estimates by panelists using the Angoff standard setting method are discussed. 相似文献

3.

Examining How Professional Roles and Test Development Experiences Impact Angoff Ratings

Adam E. Wyse 《教育实用测度》2018,31(4):324-334

An important consideration in standard setting is recruiting a group of panelists with different experiences and backgrounds to serve on the standard-setting panel. This study uses data from 14 different Angoff standard settings from a variety of medical imaging credentialing programs to examine whether people with different professional roles and test development experiences tended to recommend higher or lower cut scores or were more or less accurate in their standard-setting judgments. Results suggested that there were not any statistically significant differences for different types of panelists in terms of the cut scores they recommended or the accuracy of their judgments. Discussion of what these results may mean for panelist selection and recruitment is provided. 相似文献

4.

Maintaining Equivalent Cut Scores for Small Sample Test Forms

Andrew C. Dwyer 《Journal of Educational Measurement》2016,53(1):3-22

This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common‐item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common‐item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content‐equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common‐item equating (circle‐arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small‐sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard. 相似文献

5.

Installing a System of Performance Standards for National Assessments in the Republic of Trinidad and Tobago: Issues and Challenges

Jerome De Lisle 《教育实用测度》2015,28(4):308-329

This article explores the challenge of setting performance standards in a non-Western context. The study is centered on standard-setting practice in the national learning assessments of Trinidad and Tobago. Quantitative and qualitative data from annual evaluations between 2005 and 2009 were compiled, analyzed, and deconstructed. In the mixed methods research design, data were integrated under an evaluation framework for validating performance standards. The quantitative data included panelists’ judgments across standard-setting rounds and methods. The qualitative data included both retrospective comments from open-ended surveys and real-time data from reflective diaries. Findings for procedural and internal validity were mixed, but the evidence for external validity suggested that the final outcomes were reasonable and defensible. Nevertheless, the real-time qualitative data from the reflective diaries highlighted several cognitive challenges experienced by panelists that may have impinged on procedural and internal validity. Additional unique hindrances were lack of resources and wide variation in achievement scores. Ensuring a sustainable system of performance standards requires attention to these deficits. 相似文献

6.

Using Diagnostic Profiles to Describe Borderline Performance in Standard Setting

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Educational Measurement》2020,39(1):45-51

In test-centered standard-setting methods, borderline performance can be represented by many different profiles of strengths and weaknesses. As a result, asking panelists to estimate item or test performance for a hypothetical group study of borderline examinees, or a typical borderline examinee, may be an extremely difficult task and one that can lead to questionable results in setting cut scores. In this study, data collected from a previous standard-setting study are used to deduce panelists’ conceptions of profiles of borderline performance. These profiles are then used to predict cut scores on a test of algebra readiness. The results indicate that these profiles can predict a very wide range of cut scores both within and between panelists. Modifications are proposed to existing training procedures for test-centered methods that can account for the variation in borderline profiles. 相似文献

7.

Alignment and Implications for Test Takers

Catherine J. Welch Stephen B. Dunbar 《Educational Measurement》2020,39(2):8-17

The use of assessment results to inform school accountability relies on the assumption that the test design appropriately represents the content and cognitive emphasis reflected in the state's standards. Since the passage of the Every Student Succeeds Act and the certification of accountability assessments through federal peer review practices, the content validity arguments supporting accountability have relied almost exclusively on the alignment of statewide assessments to state standards. It is assumed that if alignment does not hold, the scores will not provide valid inferences regarding the degree to which test takers have performed. Although alignment results are commonly used as evidence of test appropriateness, Polikoff (this issue) would argue that given the importance of alignment in policy decisions, research related to alignment is surprisingly limited. Few studies have addressed the adequacy of alignment methodologies and results as support for the inferences to be made (i.e., proficient on state standards). This paper uses an example of test taker performance (and common performance indicators) to investigate to what extent the degree of alignment impacts inferences made about performance (i.e., classification into performance levels, estimates of student ability, and student rank order). 相似文献

8.

Rater Agreement in Test‐to‐Curriculum Alignment Reviews

下载免费PDF全文

A. Traynor H. E. Merzdorf 《Educational Measurement》2018,37(3):55-64

During the development of large‐scale curricular achievement tests, recruited panels of independent subject‐matter experts use systematic judgmental methods—often collectively labeled “alignment” methods—to rate the correspondence between a given test's items and the objective statements in a particular curricular standards document. High disagreement among the expert panelists may indicate problems with training, feedback, or other steps of the alignment procedure. Existing procedural recommendations for alignment reviews have been derived largely from single‐panel research studies; support for their use during operational large‐scale test development may be limited. Synthesizing data from more than 1,000 alignment reviews of state achievement tests, this study identifies features of test–standards alignment review procedures that impact agreement about test item content. The researchers then use their meta‐regression results to propose some practical suggestions for alignment review implementation. 相似文献

9.

Diagnostic Profiles: A Standard Setting Method for Use With a Cognitive Diagnostic Model

Gary Skaggs Serge F. Hein Jesse L. M. Wilkins 《Journal of Educational Measurement》2016,53(4):448-458

This article introduces the Diagnostic Profiles (DP) standard setting method for setting a performance standard on a test developed from a cognitive diagnostic model (CDM), the outcome of which is a profile of mastered and not‐mastered skills or attributes rather than a single test score. In the DP method, the key judgment task for panelists is a decision on whether or not individual cognitive skill profiles meet the performance standard. A randomized experiment was carried out in which secondary mathematics teachers were randomly assigned to either the DP method or the modified Angoff method. The standard setting methods were applied to a test of student readiness to enter high school algebra (Algebra I). While the DP profile judgments were perceived to be more difficult than the Angoff item judgments, there was a high degree of agreement among the panelists for most of the profiles. In order to compare the methods, cut scores were generated from the DP method. The results of the DP group were comparable to the Angoff group, with less cut score variability in the DP group. The DP method shows promise for testing situations in which diagnostic information is needed about examinees and where that information needs to be linked to a performance standard. 相似文献

10.

A Guide to Understanding and Developing Performance‐Level Descriptors

Marianne Perie 《Educational Measurement》2008,27(4):15-29

相似文献

11.

A Note on the Application of Multiple Matrix Sampling to Standard Setting

John J. Norcini Judy A. Shea James C. Ping 《Journal of Educational Measurement》1988,25(2):159-164

In many of the methods currently proposed for standard setting, all experts are asked to judge all items, and the standard is taken as the mean of their judgments. When resources are limited, gathering the judgments of all experts in a single group can become impractical. Multiple matrix sampling (MMS) provides an alternative. This paper applies MMS to a variation on Angoff's method (1971) of standard setting. A pool of 36 experts and 190 items were divided randomly into 5 groups, and estimates of borderline examinee performance were acquired. Results indicated some variability in the cutting scores produced by the individual groups, but the variance components were reasonably well estimated. The standard error of the cutting score was very small, and the width of the 90% confidence interval around it was only 1.3 items. The reliability of the final cutting score was.98 相似文献

12.

Adding Objectivity to Standard Setting: Evaluating Consequence Using the Conscious and Subconscious Weight Methods

Brian C. Leventhal Irina Grabovsky 《Educational Measurement》2020,39(1):30-36

Standard setting is arguably one of the most subjective techniques in test development and psychometrics. The decisions when scores are compared to standards, however, are arguably the most consequential outcomes of testing. Providing licensure to practice in a profession has high stake consequences for the public. Denying graduation or forcing remediation has high-impact consequences for students. Unfortunately, tests that classify individuals are subjected to false positive and false negative misclassifications. When determining a standard, standard setting panelists implicitly consider the negative consequences of the decisions made from test use. We propose the conscious weight method and subconscious weight method to bring more objectivity to the standard setting process. To do this, these methods quantify the relative harm of the negative consequences of false positive and false negative misclassification. 相似文献

13.

The Impact of Examinee Performance Information on Judges’ Cut Scores in Modified Angoff Standard‐Setting Exercises

Melissa J. Margolis Brian E. Clauser 《Educational Measurement》2014,33(1):15-22

This research evaluated the impact of a common modification to Angoff standard‐setting exercises: the provision of examinee performance data. Data from 18 independent standard‐setting panels across three different medical licensing examinations were examined to investigate whether and how the provision of performance information impacted judgments and the resulting cut scores. Results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. After the review of performance data, panelist variability generally decreased. In addition, for all panels and examinations pre‐ and post‐data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail rate changes were associated with the cut score changes for a majority of standard‐setting exercises. This study is the first to provide a large‐scale, systematic evaluation of the impact of a common standard setting practice, and the results can provide practitioners with insight into how the practice influences panelist variability and resulting cut scores. 相似文献

14.

Validating High-Stakes Testing Programs

Michael Kane 《Educational Measurement》2002,21(1):31-41

Cronbach made the point that for validity arguments to be convincing to diverse stakeholders, they need to be based on assumptions that are credible to these stakeholders. The interpretations and uses of high-stakes test scores rely on a number of policy assumptions about what should be taught in schools, and more specifically, about the content standards and performance standards that should be applied to students and schools. For example, a high-school graduation test can be developed as a measure of readiness for the world of work, for college, or for citizenship and the activities of daily life. The assumptions built into the assessment need to be subjected to scrutiny and criticism if a strong case is to be made for the validity of the proposed interpretation and use. 相似文献

15.

A Multi-Stage Dominant Profile Method for Setting Standards on Complex Performance Assessments

《教育实用测度》2013,26(1):57-83

相似文献

16.

Predicting Freshman Grade‐Point Average from Test Scores: Effects of Variation Within and Between High Schools

下载免费PDF全文

D. Koretz M. Langi 《Educational Measurement》2018,37(2):9-19

Most studies predicting college performance from high‐school grade point average (HSGPA) and college admissions test scores use single‐level regression models that conflate relationships within and between high schools. Because grading standards vary among high schools, these relationships are likely to differ within and between schools. We used two‐level regression models to predict freshman grade point average from HSGPA and scores on both college admissions and state tests. When HSGPA and scores are considered together, HSGPA predicts more strongly within high schools than between, as expected in the light of variations in grading standards. In contrast, test scores, particularly mathematics scores, predict more strongly between schools than within. Within‐school variation in mathematics scores has no net predictive value, but between‐school variation is substantially predictive. Whereas other studies have shown that adding test scores to HSGPA yields only a minor improvement in aggregate prediction, our findings suggest that a potentially more important effect of admissions tests is statistical moderation, that is, partially offsetting differences in grading standards across high schools. 相似文献

17.

A Conceptual Framework for a Psychometric Theory for Standard Setting with Examples of Its Use for Evaluating the Functioning of Two Standard Setting Methods

Mark D. Reckase 《Educational Measurement》2006,25(2):4-18

A conceptual framework is proposed for a psychometric theory of standard setting. The framework suggests that participants in a standard setting process (panelists) develop an internal, intended standard as a result of training and the participant's background. The goal of a standard setting process is to convert panelists' intended standards to points on a test's score scale. Psychometrics is involved in this process because the points on the score scale are estimated from ratings provided by participants. The conceptual framework is used to derive three criteria for evaluating standard setting processes. The use of these criteria is demonstrated by applying them to variations of bookmark and modified Angoff standard setting methods. 相似文献

18.

Standard Setting as a Participatory Process: Implications for Validation of Standards-Based Accountability Programs 总被引：1，自引：0，他引：1

Edward H. Haertel 《Educational Measurement》2002,21(1):16-22

In standards-based accountability programs, test scores are interpreted with reference to cut scores that establish categories like "proficient" or "below basic." The meaning of these cut scores is set forth in their associated "performance standards." Validity arguments for such interpretations require both a criterion-referenced score scale and a legitimate exercise of authority by those who set the standards. Stakeholder participation in a rational and coherent deliberative process is necessary to assure that these conditions are satisfied. This article sets forth a framework for the required validity argument and suggests possible ways to enable such participation. A new standard-setting method, the "briefing book" method, is suggested for possible study. 相似文献

19.

A. J. Massey 《Assessment in Education: Principles, Policy & Practice》1995,2(2):187-203

The evolving specification for a series of vertically equated overlapping Key Stage 3 national tests in science in England and Wales sets a series of test development challenges. These include the need to relate standards defined by hierarchically organised ‘level’ criteria to cut‐scores based on total test scores; and the need to allow compensation across the boundaries of sets of items targeted at different levels. A criterion‐related model for test development is described which is governed by a pattern of expectations about the performance of pupils relating to the hierarchical level criteria and builds determination of cut‐scores into the test development process. Some other relevant approaches to standard setting are also discussed. 相似文献

20.

Electronic Graphic Organizers for Learning Science Vocabulary and Concepts: The Effects of Online Synchronous Discussion

Deborah K. Reed Emily Jemison Jessica Sidler-Folsom Ashley Weber 《Journal of Experimental Education》2013,81(4):552-574

This study investigated the content learning of fourth graders (N?=?92) randomly assigned to complete electronic Frayer Models on life science vocabulary by themselves or while engaging in synchronous online discussions with a partner. The use of the graphic organizers was interspersed with other science activities. After seven weeks, all students significantly improved their science content knowledge (d ≈ 0.60), but the two treatment groups did not demonstrate significantly different performance after controlling for pretest abilities. Path analysis revealed the rubric scores on students' organizers were more strongly predictive of the posttest science benchmark in the online discussion group than the independent group, suggesting that collaboratively completing Frayer Models may help bolster the relationship between reading and science. 相似文献