1.

Investigating the Effectiveness of Equating Designs for Constructed-Response Tests in Large-Scale Assessments

Sooyeon Kim Michael E. Walker Frederick McHale 《Journal of Educational Measurement》2010,47(2):186-201

Using data from a large-scale exam, in this study we compared various designs for equating constructed-response (CR) tests to determine which design was most effective in producing equivalent scores across the two tests to be equated. In the context of classical equating methods, four linking designs were examined: (a) an anchor set containing common CR items, (b) an anchor set incorporating common CR items rescored, (c) an external multiple-choice (MC) anchor test, and (d) an equivalent groups design incorporating rescored CR items (no anchor test). The use of CR items without rescoring resulted in much larger bias than the other designs. The use of an external MC anchor resulted in the next largest bias. The use of a rescored CR anchor and the equivalent groups design led to similar levels of equating error.

2.

Multidimensional Linking for Tests with Mixed Item Types

Lihua Yao Keith Boughton 《Journal of Educational Measurement》2009,46(2):177-197

Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Thus, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well chosen anchor set, the parameters recovered well after equating across all populations, even for anchor sets composed of as few as six items.

3.

Test Score Equating Using a Mini‐Version Anchor and a Midi Anchor: A Case Study Using SAT® Data

Jinghua Liu Sandip Sinharay Paul W. Holland Edward Curley Miriam Feigenbaum 《Journal of Educational Measurement》2011,48(4):361-379

This study explores an anchor that is different from the traditional miniature anchor in test score equating. In contrast to a traditional “mini” anchor that has the same spread of item difficulties as the tests to be equated, the studied anchor, referred to as a “midi” anchor (Sinharay & Holland), has a smaller spread of item difficulties than the tests to be equated. Both anchors were administered in an operational SAT administration and the impact of anchor type on equating was evaluated with respect to systematic error or equating bias. Contradicting the popular belief that the mini anchor is best, the results showed that the mini anchor does not always produce more accurate equating functions than the midi anchor; the midi anchor was found to perform as well as or even better than the mini anchor. Because testing programs usually have more middle difficulty items and few very hard or very easy items, midi external anchors are operationally easier to build. Therefore, the results of our study provide evidence in favor of the midi anchor, the use of which will lead to cost saving with no reduction in equating quality.

4.

Equating Minimum-Competency Tests: Comparisons of Methods

John R. Hills Raja G. Subhiyah Thomas M. Hirsch 《Journal of Educational Measurement》1988,25(3):221-231

The 1986 scores from Florida's Statewide Student Assessment Test, Part II (SSAT-II), a minimum-competency test required for high school graduation in Florida, were placed on the scale of the 1984 scores from that test using five different equating procedures. For the highest scoring 84% of the students, four of the five methods yielded results within 1.5 raw-score points of each other. They would be essentially equally satisfactory in this situation, in which the tests were made parallel item by item in difficulty and content and the groups of examinees were population cohorts separated by only 2 years. Also, the results from six different lengths of anchor items were compared. Anchors of 25, 20, 15, or 10 randomly selected items provided equatings as effective as 30 items using the concurrent IRT equating method, but an anchor of 5 randomly selected items did not.

5.

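A Study of the Effect of the Variance of Anchor Test Difficulty Parameters on Test Equating

曹文娟 白俊梅 《考试研究》2013,(3):79-85,33

This study used R 2.15.2 to simulate how the variance of anchor test difficulty parameters affects test equating error, comparing different types of anchor test difficulty variance under three equating methods (chained equipercentile, Levine, and Tucker). The results show that when the anchor test's difficulty variance is smaller than that of the total test, the random and systematic equating errors are essentially the same as when the two variances are equal (that is, when the anchor test is a parallel shortened version of the total test, a minitest). Requiring an anchor test to have the same statistical specifications as the total test may therefore be too strict.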

6.

Local Observed‐Score Kernel Equating

Marie Wiberg Wim J. van der Linden Alina A. von Davier 《Journal of Educational Measurement》2014,51(1):57-74

Three local observed‐score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias (as defined by Lord's criterion of equity) and percent relative error. The local kernel item response theory observed‐score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard error of equating than their kernel method counterparts.

7.

Equating Subscores under the Nonequivalent Anchor Test (NEAT) Design

Gautam Puhan Longjuan Liang 《Educational Measurement》2011,30(1):23-35

The study examined two approaches for equating subscores. They are (1) equating subscores using internal common items as the anchor to conduct the equating, and (2) equating subscores using equated and scaled total scores as the anchor to conduct the equating. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study and results showed that when more internal common items were available (i.e., 10–12 items), then using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), then using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) is not reasonable as it resulted in a considerable amount of bias.

8.

Asymptotic Standard Errors of Observed‐Score Equating With Polytomous IRT Models

Björn Andersson 《Journal of Educational Measurement》2016,53(4):459-477

In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
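The percentile-matching idea described in this abstract can be sketched as follows. This is a minimal illustration, not the kernel equating framework used in the study: it assumes the score probabilities are already given (however they were derived), that every score has positive probability, and that discrete scores are continuized by spreading each score uniformly over plus or minus half a point. The function names are hypothetical.

```python
import numpy as np

def percentile_rank(probs, x):
    """Mid-percentile rank (on a 0-1 scale) of integer score x."""
    cdf = np.cumsum(probs)
    below = cdf[x - 1] if x > 0 else 0.0
    return below + probs[x] / 2.0

def equipercentile(probs_x, probs_y):
    """For each integer score on X, return the Y score with the same rank.

    Assumes all score probabilities are positive, so the Y CDF is
    strictly increasing and can be inverted by linear interpolation.
    """
    # Continuized CDF of Y evaluated at -0.5, 0.5, ..., K + 0.5
    cdf_y = np.concatenate(([0.0], np.cumsum(probs_y)))
    grid_y = np.arange(len(probs_y) + 1) - 0.5
    ranks = [percentile_rank(probs_x, x) for x in range(len(probs_x))]
    return np.interp(ranks, cdf_y, grid_y)

# Hypothetical score probabilities for two five-point tests.
p_x = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
p_y = np.array([0.2, 0.3, 0.3, 0.1, 0.1])
print(equipercentile(p_x, p_y))
```

When the two score distributions are identical, the function returns the identity mapping, which is a quick sanity check on the continuization.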

9.

Is It Necessary to Make Anchor Tests Mini‐Versions of the Tests Being Equated or Can Some Restrictions Be Relaxed?

Sandip Sinharay Paul W. Holland 《Journal of Educational Measurement》2007,44(3):249-275

It is a widely held belief that anchor tests should be miniature versions (i.e., minitests), with respect to content and statistical characteristics, of the tests being equated. This article examines the foundations for this belief regarding statistical characteristics. It examines the requirement of statistical representativeness of anchor tests that are content representative. The equating performance of several types of anchor tests, including those having statistical characteristics that differ from those of the tests being equated, is examined through several simulation studies and a real data example. Anchor tests with a spread of item difficulties less than that of a total test seem to perform as well as a minitest with respect to equating bias and equating standard error. Hence, the results demonstrate that requiring an anchor test to mimic the statistical characteristics of the total test may be too restrictive and need not be optimal. As a side benefit, this article also provides a comparison of the equating performance of post-stratification equating and chain equipercentile equating.

10.

NCME 2008 Presidential Address: The Impact of Anchor Test Configuration on Student Proficiency Rates

Anne R. Fitzpatrick 《Educational Measurement》2008,27(4):34-40

Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
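The mean b‐value method mentioned in this abstract can be sketched in a few lines. Under a one-parameter model the slope of the scale transformation is fixed at 1, so the linking constant is simply the difference between the mean anchor-item difficulties on the two calibrations. The function name and the difficulty values below are hypothetical, chosen only to show the arithmetic.

```python
def mean_b_shift(b_anchor_old, b_anchor_new):
    """Additive constant that places new-form parameters on the old scale.

    Both lists hold difficulty (b) estimates for the same anchor items,
    taken from the old-form and new-form calibrations respectively.
    """
    mean_old = sum(b_anchor_old) / len(b_anchor_old)
    mean_new = sum(b_anchor_new) / len(b_anchor_new)
    return mean_old - mean_new

# Hypothetical anchor-item difficulties from two separate calibrations.
shift = mean_b_shift([-0.5, 0.0, 0.7], [-0.8, -0.3, 0.4])

# Any new-form difficulty or ability estimate moves to the old scale
# by adding the constant.
theta_on_old_scale = 1.2 + shift
print(shift, theta_on_old_scale)
```

Because the constant is an average over the anchor items, a shift in the relative difficulty of even one item moves it more when the anchor is short, which is consistent with the instability the abstract reports.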

11.

Use of Adjustment by Minimum Discriminant Information in Linking Constructed‐Response Test Scores in the Absence of Common Items

Yi‐Hsuan Lee Shelby J. Haberman Neil J. Dorans 《Journal of Educational Measurement》2019,56(2):452-472

In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long‐standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

12.

Local Linear Observed‐Score Equating

Marie Wiberg Wim J. van der Linden 《Journal of Educational Measurement》2011,48(3):229-254

Two methods of local linear observed‐score equating for use with anchor‐test and single‐group designs are introduced. In an empirical study, the two methods were compared with the current traditional linear methods for observed‐score equating. As a criterion, the bias in the equated scores relative to true equating based on Lord's (1980) definition of equity was used. The local method for the anchor‐test design yielded minimum bias, even for considerable variation of the relative difficulties of the two test forms and the length of the anchor test. Among the traditional methods, the method of chain equating performed best. The local method for single‐group designs yielded equated scores with bias comparable to the traditional methods. This method, however, appears to be of theoretical interest because it forces us to rethink the relationship between score equating and regression.

13.

Experiences in the Application of Item Response Theory in Test Construction

《Applied Measurement in Education》2013,26(4):297-312

Certain potential benefits of using item response theory in test construction are discussed and evaluated using the experience and evidence accumulated during 9 years of using a three-parameter model in the construction of major achievement batteries. We also discuss several cautions and limitations in realizing these benefits as well as issues in need of further research. The potential benefits considered are those of getting "sample-free" item calibrations and "item-free" person measurement, automatically equating various tests, decreasing the standard errors of scores without increasing the number of items used by using item pattern scoring, assessing item bias (or differential item functioning) independently of difficulty in a manner consistent with item selection, being able to determine just how adequate a tryout pool of items may be, setting up computer-generated "ideal" tests drawn from pools as targets for test developers, and controlling the standard error of a selected test at any desired set of score levels.

14.

Determining the Anchor Composition for a Mixed-Format Test: Evaluation of Subpopulation Invariance of Linking Functions

Sooyeon Kim Michael Walker 《Applied Measurement in Education》2013,26(2):178-195

This study examined the appropriateness of the anchor composition in a mixed-format test, which includes both multiple-choice (MC) and constructed-response (CR) items, using subpopulation invariance indices. Linking functions were derived in the nonequivalent groups with anchor test (NEAT) design using two types of anchor sets: (a) MC only and (b) a mix of MC and CR. In each anchor condition, the linking functions were also derived separately for males and females, and those subpopulation functions were compared to the total group function. In the MC-only condition, the difference between the subpopulation functions and the total group function was not trivial in a score region that included cut scores, leading to inconsistent pass/fail decisions for low-performing examinees in particular. Overall, the mixed anchor was a better choice than the MC-only anchor to achieve subpopulation invariance between males and females. The research reinforces subpopulation invariance indices as a means of determining the adequacy of the anchor.

15.

Weighting Constructed-Response Items in IRT-Based Exams

《Applied Measurement in Education》2013,26(4):257-275

Weighting responses to Constructed-Response (CR) items has been proposed as a way to increase the contribution these items make to the test score when there is insufficient testing time to administer additional CR items. The effect of various types of weighting items of an IRT-based mixed-format writing examination was investigated. Constructed-response items were weighted by increasing their representation according to the test blueprint, by increasing their contribution to the test characteristic curve, by summing the ratings of multiple raters, and by applying optimal weights utilized in IRT pattern scoring. Total score and standard errors of the weighted composite forms of CR and Multiple-Choice (MC) items were compared against each other and against a form containing additional rather than weighted items. Weighting resulted in a slight reduction of test reliability but reduced standard error in portions of the ability scale.

16.

Observed Score Equating Using Discrete and Passage-Based Anchor Items

Jiyun Zu Jinghua Liu 《Journal of Educational Measurement》2010,47(4):395-412

Equating of tests composed of both discrete and passage-based multiple choice items using the nonequivalent groups with anchor test design is popular in practice. In this study, we compared the effect of discrete and passage-based anchor items on observed score equating via simulation. Results suggested that an anchor with a larger proportion of passage-based items, more items in each passage, and/or a larger degree of local dependence among items within one passage produces larger equating errors, especially when the groups taking the new form and the reference form differ in ability. Our findings challenge the common belief that an anchor should be a miniature version of the tests to be equated. Suggestions to practitioners regarding anchor design are also given.

17.

Construct Equivalence of Multiple-Choice and Constructed-Response Items: A Random Effects Synthesis of Correlations

Michael C. Rodriguez 《Journal of Educational Measurement》2003,40(2):163-184

A thorough search of the literature was conducted to locate empirical studies investigating the trait or construct equivalence of multiple-choice (MC) and constructed-response (CR) items. Of the 67 studies identified, 29 studies included 56 correlations between items in both formats. These 56 correlations were corrected for attenuation and synthesized to establish evidence for a common estimate of correlation (true-score correlations). The 56 disattenuated correlations were highly heterogeneous. A search for moderators to explain this variation uncovered the role of the design characteristics of test items used in the studies. When items are constructed in both formats using the same stem (stem equivalent), the mean correlation between the two formats approaches unity and is significantly higher than when using non-stem-equivalent items (particularly when using essay-type items). Construct equivalence, in part, appears to be a function of the item design method or the item writer's intent.
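The correction for attenuation referred to in this abstract is the classical formula that divides an observed correlation by the square root of the product of the two reliabilities. The sketch below shows only that formula; the function name and the numeric values are hypothetical and are not taken from the synthesis.

```python
import math

def disattenuate(r_xy, rel_x, rel_y):
    """Estimated true-score correlation from an observed correlation
    r_xy and the reliabilities of the two measures."""
    return r_xy / math.sqrt(rel_x * rel_y)

# Hypothetical values: observed MC-CR correlation and the two reliabilities.
r_true = disattenuate(0.72, 0.81, 0.64)
print(r_true)
```

With these illustrative numbers the corrected correlation is about 1, the kind of value the abstract describes as approaching unity for stem-equivalent items.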

18.

Small-Sample Equating Using a Single-Group Nearly Equivalent Test (SiGNET) Design

Gautam Puhan Timothy P. Moses Mary C. Grant Frederick McHale 《Journal of Educational Measurement》2009,46(3):344-362

A single-group (SG) equating with nearly equivalent test forms (SiGNET) design was developed by Grant to equate small-volume tests. Under this design, the scored items for the operational form are divided into testlets or mini tests. An additional testlet is created but not scored for the first form. If the scored testlets are testlets 1–6 and the unscored testlet is testlet 7, then the first form is composed of testlets 1–6 and the second form is composed of testlets 2–7. The seven testlets are administered as a single administered form, and when a sufficient number of examinees have taken the administered form, the second form (testlets 2–7) is equated to the first form (testlets 1–6) using an SG equating design. As evident, this design facilitates the use of an SG equating and allows for the accumulation of data, both of which may reduce equating error. This study compared equatings under the SiGNET and common-item equating designs and found lower equating error for the SiGNET design in very small sample size conditions (e.g., N = 10).

19.

Teachers' and Administrators' Use of Evidence of Student Learning to Take Action: Conclusions Drawn from a Special Issue on Formative Assessment

M. Christina Schneider Heidi Andrade 《Applied Measurement in Education》2013,26(3):159-162

In common-item equating the anchor block is generally built to represent a miniature form of the total test in terms of content and statistical specifications. The statistical properties frequently reflect equal mean and spread of item difficulty. Sinharay and Holland (2007) suggested that the requirement for equal spread of difficulty may be too restrictive. They suggested that an anchor test with representative content coverage and equal mean item difficulty but a smaller spread of item difficulty (miditest) may provide the same or better results for equating while decreasing the pressure to find very hard and very easy items to include in the anchor. Analyses to date have concentrated on the results of equating the scores from one form to another with findings that are supportive of the Sinharay and Holland concept (Sinharay & Holland, 2006a, 2006b, 2007; Liu, Sinharay, Holland, Feigenbaum, & Curley, 2009). These studies do not address longer chains of equating. It is important to monitor the possibility of scale drift over forms. The current research begins to address this issue.

20.

Review of Cutscores: A Manual for Setting Standards of Performance on Educational and Occupational Tests

Jenna Copella 《Applied Measurement in Education》2013,26(1):73-76

Combinations of five methods of equating test forms and two methods of selecting samples of students for equating were compared for accuracy. The two sampling methods were representative sampling from the population and matching samples on the anchor test score. The equating methods were the Tucker, Levine equally reliable, chained equipercentile, frequency estimation, and item response theory (IRT) 3PL methods. The tests were the Verbal and Mathematical sections of the Scholastic Aptitude Test. The criteria for accuracy were measures of agreement with an equivalent-groups equating based on more than 115,000 students taking each form. Much of the inaccuracy in the equatings could be attributed to overall bias. The results for all equating methods in the matched samples were similar to those for the Tucker and frequency estimation methods in the representative samples; these equatings made too small an adjustment for the difference in the difficulty of the test forms. In the representative samples, the chained equipercentile method showed a much smaller bias. The IRT (3PL) and Levine methods tended to agree with each other and were inconsistent in the direction of their bias.