期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Random-Groups Equating with Samples of 50 to 400 Test Takers

Samuel A. Livingston Sooyeon Kim 《Journal of Educational Measurement》2010,47(2):175-185

Five methods for equating in a random groups design were investigated in a series of resampling studies with samples of 400, 200, 100, and 50 test takers. Six operational test forms, each taken by 9,000 or more test takers, were used as item pools to construct pairs of forms to be equated. The criterion equating was the direct equipercentile equating in the group of all test takers. Equating accuracy was indicated by the root-mean-squared deviation, over 1,000 replications, of the sample equatings from the criterion equating. The methods investigated were equipercentile equating of smoothed distributions, linear equating, mean equating, symmetric circle-arc equating, and simplified circle-arc equating. The circle-arc methods produced the most accurate results for all sample sizes investigated, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible. 相似文献

2.

The Effect of Repeaters on Equating

HeeKyoung Kim Michael J. Kolen 《教育实用测度》2013,26(3):242-265

Test equating might be affected by including in the equating analyses examinees who have taken the test previously. This study evaluated the effect of including such repeaters on Medical College Admission Test (MCAT) equating using a population invariance approach. Three-parameter logistic (3-PL) item response theory (IRT) true score and traditional equipercentile equating methods were used under the random groups equating design. This study also examined whether or not population sensitivity of equating by repeater status varies depending on other background variables (gender and ethnicity). The results indicated that there was some evidence of repeaters' effect on equating with varying amounts of such effect by gender. 相似文献

3.

Evaluating Equating Accuracy and Assumptions for Groups That Differ in Performance

Sonya Powers Michael J. Kolen 《Journal of Educational Measurement》2014,51(1):39-56

Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method. 相似文献

4.

Equating Methods and Sampling Designs

《教育实用测度》2013,26(1):3-17

相似文献

5.

Accuracy of Random Groups Equating with Very Small Samples

Gary Skaggs 《Journal of Educational Measurement》2005,42(4):309-330

This study investigated the effectiveness of equating with very small samples using the random groups design. Of particular interest was equating accuracy at specific scores where performance standards might be set. Two sets of simulations were carried out, one in which the two forms were identical and one in which they differed by a tenth of a standard deviation in overall difficulty. These forms were equated using mean equating, linear equating, unsmoothed equipercentile equating, and equipercentile equating using two through six moments of log-linear presmoothing with samples of 25, 50, 75, 100, 150, and 200. The results indicated that identity equating was preferable to any equating method when samples were as small as 25. For samples of 50 and above, the choice of an equating method over identity equating depended on the location of the passing score relative to examinee performance. If passing scores were located below the mean, where data were sparser, mean equating produced the smallest percentage of misclassified examinees. For passing scores near the mean, all methods produced similar results with linear equating being the most accurate. For passing scores above the mean, equipercentile equating with 2- and 3-moment presmoothing were the best equating methods. Higher levels of presmoothing did not improve the results. 相似文献

6.

A Comparison Between Linear IRT Observed‐Score Equating and Levine Observed‐Score Equating Under the Generalized Kernel Equating Framework

Haiwen Chen 《Journal of Educational Measurement》2012,49(3):269-284

In this article, linear item response theory (IRT) observed‐score equating is compared under a generalized kernel equating framework with Levine observed‐score equating for nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed‐score equating is virtually identical to Levine observed‐score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as the curvilinear Levine observed‐score equating. 相似文献

7.

Comparisons among Small Sample Equating Methods in a Common-Item Design

Sooyeon Kim Samuel A. Livingston 《Journal of Educational Measurement》2010,47(3):286-298

Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible. 相似文献

8.

Situations Where It Is Appropriate to Use Frequency Estimation Equipercentile Equating

Hongwen Guo Hyeonjoo J. Oh Daniel Eignor 《Journal of Educational Measurement》2013,50(3):338-354

In operational equating situations, frequency estimation equipercentile equating is considered only when the old and new groups have similar abilities. The frequency estimation assumptions are investigated in this study under various situations from both the levels of theoretical interest and practical use. It shows that frequency estimation equating can be used under circumstances when it is not normally used. To link theoretical results with practice, statistical methods are proposed for checking frequency estimation assumptions based on available data: observed‐score distributions and item difficulty distributions of the forms. In addition to the conventional use of frequency estimation equating when the group abilities are similar, three situations are identified when the group abilities are dissimilar: (a) when the two forms and the observed conditional score distributions are similar the two forms and the observed conditional score distributions are similar (in this situation, the frequency estimation equating assumptions are likely to hold, and frequency estimation equating is appropriate); (b) when forms are similar but the observed conditional score distributions are not (in this situation, frequency estimation equating is not appropriate); and (c) when forms are not similar but the observed conditional score distributions are (frequency estimation equating is not appropriate). Statistical analysis procedures for comparing distributions are provided. Data from a large‐scale test are used to illustrate the use of frequency estimation equating when the group difference in ability is large. 相似文献

9.

Local Linear Observed‐Score Equating

Marie Wiberg Wim J. van der Linden 《Journal of Educational Measurement》2011,48(3):229-254

Two methods of local linear observed‐score equating for use with anchor‐test and single‐group designs are introduced. In an empirical study, the two methods were compared with the current traditional linear methods for observed‐score equating. As a criterion, the bias in the equated scores relative to true equating based on Lord's (1980) definition of equity was used. The local method for the anchor‐test design yielded minimum bias, even for considerable variation of the relative difficulties of the two test forms and the length of the anchor test. Among the traditional methods, the method of chain equating performed best. The local method for single‐group designs yielded equated scores with bias comparable to the traditional methods. This method, however, appears to be of theoretical interest because it forces us to rethink the relationship between score equating and regression. 相似文献

10.

Evaluating the Effects of Multidimensionality on IRT True-Score Equating

《教育实用测度》2013,26(4):383-407

The performance of the item response theory (IRT) true-score equating method is examined under conditions of test multidimensionality. It is argued that a primary concern in applying unidimensional equating methods when multidimensionality is present is the potential decrease in equity (Lord, 1980) attributable to the fact that examinees of different ability are expected to obtain the same test scores. In contrast to equating studies based on real test data, the use of simulation in equating research not only permits assessment of these effects but also enables investigation of hypothetical equating conditions in which multidimensionality can be suspected to be especially problematic for test equating. In this article, I investigate whether the IRT true-score equating method, which explicitly assumes the item response matrix is unidimensional, is more adversely affected by the presence of multidimensionality than 2 conventional equating methods-linear and equipercentile equating-using several recently proposed equity-based criteria (Thomasson, 1993). Results from 2 simulation studies suggest that the IRT method performs at least as well as the conventional methods when the correlation between dimensions is high (³ 0.7) and may be only slightly inferior to the equipercentile method when the correlation is moderate to low (£ 0.5). 相似文献

11.

Review of Cutscores: A Manual for Setting Standards of Performance on Educational and Occupational Tests

Jenna Copella 《教育实用测度》2013,26(1):73-76

Combinations of five methods of equating test forms and two methods of selecting samples of students for equating were compared for accuracy. The two sampling methods were representative sampling from the population and matching samples on the anchor test score. The equating methods were the Tucker, Levine equally reliable, chained equipercentile, frequency estimation, and item response theory (IRT) 3PL methods. The tests were the Verbal and Mathematical sections of the Scholastic Aptitude Test. The criteria for accuracy were measures of agreement with an equivalent-groups equating based on more than 115,000 students taking each form. Much of the inaccuracy in the equatings could be attributed to overall bias. The results for all equating methods in the matched samples were similar to those for the Tucker and frequency estimation methods in the representative samples; these equatings made too small an adjustment for the difference in the difficulty of the test forms. In the representative samples, the chained equipercentile method showed a much smaller bias. The IRT (3PL) and Levine methods tended to agree with each other and were inconsistent in the direction of their bias. 相似文献

12.

The Accuracy and Consistency of a Series of IRT True Score Equatings

Deping Li Yanlin Jiang Alina A. von Davier 《Journal of Educational Measurement》2012,49(2):167-189

This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma method). The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating with an exception at the low scores. 相似文献

13.

Adjoined Piecewise Linear Approximations (APLAs) for Equating: Accuracy Evaluations of a Postsmoothing Equating Method

Tim Moses 《Journal of Educational Measurement》2013,50(4):427-446

The purpose of this study was to evaluate the use of adjoined and piecewise linear approximations (APLAs) of raw equipercentile equating functions as a postsmoothing equating method. APLAs are less familiar than other postsmoothing equating methods (i.e., cubic splines), but their use has been described in historical equating practices of large‐scale testing programs. This study used simulations to evaluate APLA equating results and compare these results with those from cubic spline postsmoothing and from several presmoothing equating methods. The overall results suggested that APLAs based on four line segments have accuracy advantages similar to or better than cubic splines and can sometimes produce more accurate smoothed equating functions than those produced using presmoothing methods. 相似文献

14.

A New Approach to Comparing Several Equating Methods in the Context of the NEAT Design

Sandip Sinharay Paul W. Holland 《Journal of Educational Measurement》2010,47(3):261-285

The nonequivalent groups with anchor test (NEAT) design involves missing data that are missing by design. Three equating methods that can be used with a NEAT design are the frequency estimation equipercentile equating method, the chain equipercentile equating method, and the item-response-theory observed-score-equating method. We suggest an approach to perform a fair comparison of the three methods. The approach is then applied to compare the three equating methods using three data sets from operational tests. For each data set, we examine how the three equating methods perform when the missing data satisfy the assumptions made by only one of these equating methods. The chain equipercentile equating method is somewhat more satisfactory overall than the other methods. 相似文献

15.

Statistical Models and Inference for the True Equating Transformation in the Context of Local Equating

Jorge González B. Matthias von Davier 《Journal of Educational Measurement》2013,50(3):315-320

Based on Lord's criterion of equity of equating, van der Linden (this issue) revisits the so‐called local equating method and offers alternative as well as new thoughts on several topics including the types of transformations, symmetry, reliability, and population invariance appropriate for equating. A remarkable aspect is to define equating as a standard statistical inference problem in which the true equating transformation is the parameter of interest that has to be estimated and assessed as any standard evaluation of an estimator of an unknown parameter in statistics. We believe that putting equating methods in a general statistical model framework would be an interesting and useful next step in the area. van der Linden's conceptual article on equating is certainly an important contribution to this task. 相似文献

16.

Evaluating Equating Results: Percent Relative Error for Chained Kernel Equating

Yanlin Jiang Alina A. von Davier Haiwen Chen 《Journal of Educational Measurement》2012,49(1):39-58

This article presents a method for evaluating equating results. Within the kernel equating framework, the percent relative error (PRE) for chained equipercentile equating was computed under the nonequivalent groups with anchor test (NEAT) design. The method was applied to two data sets to obtain the PRE, which can be used to measure equating effectiveness. The study compared the PRE results for chained and poststratification equating. The results indicated that the chained method transformed the new form score distribution to the reference form scale more effectively than the poststratification method. In addition, the study found that in chained equating, the population weight had impact on score distributions over the target population but not on the equating and PRE results. 相似文献

17.

On Attempting to Do What Lord Said Was Impossible: Commentary on van der Linden's “Some Conceptual Issues in Observed‐Score Equating”

Neil J. Dorans 《Journal of Educational Measurement》2013,50(3):304-314

van der Linden (this issue) uses words differently than Holland and Dorans. This difference in language usage is a source of some confusion in van der Linden's critique of what he calls equipercentile equating. I address these differences in language. van der Linden maintains that there are only two requirements for score equating. I maintain that the requirements he discards have practical utility and are testable. The score equity requirement proposed by Lord suggests that observed score equating was either unnecessary or impossible. Strong equity serves as the fulcrum for van der Linden's thesis. His proposed solution to the equity problem takes inequitable measures and aligns conditional error score distributions, resulting in a family of linking functions, one for each level of θ. In reality, θ is never known. Use of an anchor test as a proxy poses many practical problems, including defensibility. 相似文献

18.

Local Observed‐Score Kernel Equating

Marie Wiberg Wim J. van der Linden Alina A. von Davier 《Journal of Educational Measurement》2014,51(1):57-74

Three local observed‐score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias—as defined by Lord's criterion of equity—and percent relative error. The local kernel item response theory observed‐score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard error of equating than their kernel method counterparts. 相似文献

19.

Selection Strategies for Univariate Loglinear Smoothing Models and Their Effect on Equating Function Accuracy

Tim Moses Paul W. Holland 《Journal of Educational Measurement》2009,46(2):159-176

In this study, we compared 12 statistical strategies proposed for selecting loglinear models for smoothing univariate test score distributions and for enhancing the stability of equipercentile equating functions. The major focus was on evaluating the effects of the selection strategies on equating function accuracy. Selection strategies' influence on the estimation of cumulative test score distributions was also assessed. The results of this simulation study differentiate the selection strategies and define the situations where their use has the most important implications for equating function accuracy. The recommended strategy for estimating test score distributions and for equating is AIC minimization. 相似文献

20.

Effectiveness of Equating at the Passing Score for Exams With Small Sample Sizes

Amanda A. Wolkowitz Keith D. Wright 《Journal of Educational Measurement》2019,56(2):361-390

This article explores the amount of equating error at a passing score when equating scores from exams with small samples sizes. This article focuses on equating using classical test theory methods of Tucker linear, Levine linear, frequency estimation, and chained equipercentile equating. Both simulation and real data studies were used in the investigation. The results of the study supported past findings that as the sample sizes increase, the amount of bias in the equating at the passing score decreases. The research also highlights the importance for practitioners to understand the data, to have an informed expectation of the results, and to have a documented rationale for an acceptable amount of equating error. 相似文献