
1.

Comparisons among Small Sample Equating Methods in a Common-Item Design

Sooyeon Kim Samuel A. Livingston 《Journal of Educational Measurement》2010,47(3):286-298

Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.
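The accuracy criterion described above, the root-mean-squared difference of the sample equatings from the criterion equating, can be sketched as follows. This is a minimal illustration only: the function name `conditional_rmsd` and the toy numbers are assumptions made for the example, not values or code from the study.

```python
import numpy as np

def conditional_rmsd(sample_equatings, criterion):
    """Root-mean-squared difference at each raw-score point.

    sample_equatings: array of shape (n_replications, n_score_points),
        one row of equated scores per resampling replication.
    criterion: array of shape (n_score_points,), the criterion equating.
    """
    sample_equatings = np.asarray(sample_equatings, dtype=float)
    criterion = np.asarray(criterion, dtype=float)
    diffs = sample_equatings - criterion      # broadcast over replications
    return np.sqrt(np.mean(diffs ** 2, axis=0))

# Toy data (hypothetical): two replications, three score points.
criterion = np.array([10.0, 20.0, 30.0])
replications = np.array([[11.0, 20.0, 28.0],
                         [9.0, 20.0, 32.0]])
rmsd = conditional_rmsd(replications, criterion)
```

Computing the statistic separately at each score point is what allows accuracy to be compared across regions of the score scale, such as the lower and upper halves of the distribution.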

2.

Collateral Information for Equating in Small Samples: A Preliminary Investigation

Sooyeon Kim Samuel A. Livingston Charles Lewis 《Applied Measurement in Education》2013,26(4):302-323

This article describes a preliminary investigation of an empirical Bayes (EB) procedure for using collateral information to improve equating of scores on test forms taken by small numbers of examinees. Resampling studies were done on two different forms of the same test. In each study, EB and non-EB versions of two equating methods (chained linear and chained mean) were applied to repeated small samples drawn from a large data set collected for a common-item equating. The criterion equating was the chained linear equating in the large data set. Equatings of other forms of the same test provided the collateral information. New-form sample size was varied from 10 to 200; reference-form sample size was constant at 200. One of the two new forms did not differ greatly in difficulty from its reference form, as was the case for the equatings used as collateral information. For this form, the EB procedure improved the accuracy of equating with new-form samples of 50 or fewer. The other new form was much more difficult than its reference form; for this form, the EB procedure made the equating less accurate.

3.

Review of Cutscores: A Manual for Setting Standards of Performance on Educational and Occupational Tests

Jenna Copella 《Applied Measurement in Education》2013,26(1):73-76

Combinations of five methods of equating test forms and two methods of selecting samples of students for equating were compared for accuracy. The two sampling methods were representative sampling from the population and matching samples on the anchor test score. The equating methods were the Tucker, Levine equally reliable, chained equipercentile, frequency estimation, and item response theory (IRT) 3PL methods. The tests were the Verbal and Mathematical sections of the Scholastic Aptitude Test. The criteria for accuracy were measures of agreement with an equivalent-groups equating based on more than 115,000 students taking each form. Much of the inaccuracy in the equatings could be attributed to overall bias. The results for all equating methods in the matched samples were similar to those for the Tucker and frequency estimation methods in the representative samples; these equatings made too small an adjustment for the difference in the difficulty of the test forms. In the representative samples, the chained equipercentile method showed a much smaller bias. The IRT (3PL) and Levine methods tended to agree with each other and were inconsistent in the direction of their bias.

4.

Random-Groups Equating with Samples of 50 to 400 Test Takers

Samuel A. Livingston Sooyeon Kim 《Journal of Educational Measurement》2010,47(2):175-185

Five methods for equating in a random groups design were investigated in a series of resampling studies with samples of 400, 200, 100, and 50 test takers. Six operational test forms, each taken by 9,000 or more test takers, were used as item pools to construct pairs of forms to be equated. The criterion equating was the direct equipercentile equating in the group of all test takers. Equating accuracy was indicated by the root-mean-squared deviation, over 1,000 replications, of the sample equatings from the criterion equating. The methods investigated were equipercentile equating of smoothed distributions, linear equating, mean equating, symmetric circle-arc equating, and simplified circle-arc equating. The circle-arc methods produced the most accurate results for all sample sizes investigated, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.

5.

AN ANALYTICAL PROCEDURE FOR THE EQUIPERCENTILE METHOD OF EQUATING TESTS

CARL A. LINDSAY MARK A. PRICHARD 《Journal of Educational Measurement》1971,8(3):203-207

Prior use of the equipercentile method of test equating was based on a graphic procedure which is tedious, subject to smoothing errors, and non-analytical. Recognition of the equipercentile method as a curve-fitting procedure for two cumulative percentage distributions leads to a proposed analytical solution to the problem through use of linear estimates for successive "missing" cumulative percentage points. A complete equipercentile procedure which uses the proposed method and provides linear and quadratic functions for goodness-of-fit and extrapolation is discussed and illustrated with data from a test equating project.
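The idea of treating equipercentile equating as matching two cumulative distributions, with linear interpolation between tabulated cumulative points, can be illustrated with a short sketch. This is a generic textbook-style version and not necessarily the authors' procedure. The function names, the use of real limits at score ± 0.5, and the toy frequencies are all assumptions made for the example.

```python
import numpy as np

def cdf_points(freqs):
    """Upper real limits and cumulative proportions for integer scores 0..K."""
    freqs = np.asarray(freqs, dtype=float)
    cum = np.cumsum(freqs) / freqs.sum()
    limits = np.arange(len(freqs)) + 0.5
    # Prepend the lower real limit of the lowest score, where the CDF is 0.
    return np.concatenate(([-0.5], limits)), np.concatenate(([0.0], cum))

def equipercentile(x, freqs_x, freqs_y):
    """Map form-X score(s) x to the form-Y scale by matching cumulative proportions.

    Assumes every score on form Y has a nonzero frequency, so that the
    cumulative proportions are strictly increasing and can be inverted
    by linear interpolation.
    """
    lim_x, cum_x = cdf_points(freqs_x)
    lim_y, cum_y = cdf_points(freqs_y)
    p = np.interp(x, lim_x, cum_x)       # cumulative proportion at x on form X
    return np.interp(p, cum_y, lim_y)    # Y score with the same cumulative proportion

# Toy frequencies (hypothetical) for scores 0..4 on each form.
freqs_x = [1, 2, 4, 2, 1]
freqs_y = [1, 1, 2, 3, 3]
equated = equipercentile(np.arange(5), freqs_x, freqs_y)
```

When the two forms have identical score distributions, the function returns each score unchanged, which is a quick sanity check on any implementation.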

6.

Effectiveness of Equating at the Passing Score for Exams With Small Sample Sizes

Amanda A. Wolkowitz Keith D. Wright 《Journal of Educational Measurement》2019,56(2):361-390

This article explores the amount of equating error at a passing score when equating scores from exams with small sample sizes. This article focuses on equating using classical test theory methods of Tucker linear, Levine linear, frequency estimation, and chained equipercentile equating. Both simulation and real data studies were used in the investigation. The results of the study supported past findings that as the sample sizes increase, the amount of bias in the equating at the passing score decreases. The research also highlights the importance for practitioners to understand the data, to have an informed expectation of the results, and to have a documented rationale for an acceptable amount of equating error.

7.

A New Approach to Comparing Several Equating Methods in the Context of the NEAT Design

Sandip Sinharay Paul W. Holland 《Journal of Educational Measurement》2010,47(3):261-285

The nonequivalent groups with anchor test (NEAT) design involves missing data that are missing by design. Three equating methods that can be used with a NEAT design are the frequency estimation equipercentile equating method, the chain equipercentile equating method, and the item-response-theory observed-score-equating method. We suggest an approach to perform a fair comparison of the three methods. The approach is then applied to compare the three equating methods using three data sets from operational tests. For each data set, we examine how the three equating methods perform when the missing data satisfy the assumptions made by only one of these equating methods. The chain equipercentile equating method is somewhat more satisfactory overall than the other methods.

8.

The Accuracy and Consistency of a Series of IRT True Score Equatings

Deping Li Yanlin Jiang Alina A. von Davier 《Journal of Educational Measurement》2012,49(2):167-189

This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma method). The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating with an exception at the low scores.

9.

Adjoined Piecewise Linear Approximations (APLAs) for Equating: Accuracy Evaluations of a Postsmoothing Equating Method

Tim Moses 《Journal of Educational Measurement》2013,50(4):427-446

The purpose of this study was to evaluate the use of adjoined and piecewise linear approximations (APLAs) of raw equipercentile equating functions as a postsmoothing equating method. APLAs are less familiar than other postsmoothing equating methods (i.e., cubic splines), but their use has been described in historical equating practices of large‐scale testing programs. This study used simulations to evaluate APLA equating results and compare these results with those from cubic spline postsmoothing and from several presmoothing equating methods. The overall results suggested that APLAs based on four line segments have accuracy advantages similar to or better than cubic splines and can sometimes produce more accurate smoothed equating functions than those produced using presmoothing methods.

10.

Smoothing Methods for Estimating Test Score Distributions

Michael J. Kolen 《Journal of Educational Measurement》1991,28(3):257-282

Frequency distributions of test scores may appear irregular and, as estimates of a population distribution, contain a substantial amount of sampling error. Techniques for smoothing score distributions are available that have the capacity to improve estimation. In this article, estimation/smoothing methods that are flexible enough to fit a wide variety of test score distributions are reviewed. The methods are a kernel method, a strong true-score model-based method, and a method that uses polynomial log-linear models. The use of these methods is then reviewed, and applications of the methods are presented that include describing and comparing test score distributions, estimating norms, and estimating equipercentile equivalents in test score equating. Suggestions for further research are also provided.

11.

Accuracy of Random Groups Equating with Very Small Samples

Gary Skaggs 《Journal of Educational Measurement》2005,42(4):309-330

This study investigated the effectiveness of equating with very small samples using the random groups design. Of particular interest was equating accuracy at specific scores where performance standards might be set. Two sets of simulations were carried out, one in which the two forms were identical and one in which they differed by a tenth of a standard deviation in overall difficulty. These forms were equated using mean equating, linear equating, unsmoothed equipercentile equating, and equipercentile equating using two through six moments of log-linear presmoothing with samples of 25, 50, 75, 100, 150, and 200. The results indicated that identity equating was preferable to any equating method when samples were as small as 25. For samples of 50 and above, the choice of an equating method over identity equating depended on the location of the passing score relative to examinee performance. If passing scores were located below the mean, where data were sparser, mean equating produced the smallest percentage of misclassified examinees. For passing scores near the mean, all methods produced similar results with linear equating being the most accurate. For passing scores above the mean, equipercentile equating with 2- and 3-moment presmoothing were the best equating methods. Higher levels of presmoothing did not improve the results.
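For readers unfamiliar with the simpler methods named above, the following sketch shows identity, mean, and linear equating in a random groups design, using their standard textbook definitions. The function names and the toy score vectors are assumptions made for the example and are not taken from the study.

```python
import numpy as np

def identity_equating(x):
    """No adjustment: a new-form score is treated as a reference-form score."""
    return np.asarray(x, dtype=float)

def mean_equating(x, new_scores, ref_scores):
    """Shift new-form scores by the difference between the two form means."""
    return np.asarray(x, dtype=float) - np.mean(new_scores) + np.mean(ref_scores)

def linear_equating(x, new_scores, ref_scores):
    """Match both the mean and the standard deviation of the two forms."""
    mean_new, mean_ref = np.mean(new_scores), np.mean(ref_scores)
    sd_new = np.std(new_scores, ddof=1)   # sample standard deviations
    sd_ref = np.std(ref_scores, ddof=1)
    return mean_ref + (sd_ref / sd_new) * (np.asarray(x, dtype=float) - mean_new)

# Toy samples (hypothetical) from two randomly equivalent groups.
new_scores = np.array([10.0, 12.0, 14.0])
ref_scores = np.array([11.0, 15.0, 19.0])
by_mean = mean_equating(10, new_scores, ref_scores)
by_linear = linear_equating(10, new_scores, ref_scores)
```

Identity equating estimates nothing from the sample, mean equating estimates one quantity per form, and linear equating estimates two. Fewer estimated quantities mean less sampling error, which is why the simpler methods can be preferable when samples are very small.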

12.

Achieving Form-to-Form Comparability: Fundamental Issues and Proposed Strategies for Equating Performance Assessments of Teachers

《Educational Assessment》2013,18(1):99-110

The purpose of this article is to describe some of the measurement issues encountered in the equating of performance assessments designed for use in making teacher certification decisions. As some teacher certification programs move from sole reliance on multiple-choice items to inclusion of complex performance tasks, difficult measurement issues related to equating may arise. A variety of analytic and judgmental strategies are described in this article that may provide solutions for addressing these equating issues. Analytic strategies are based on examinee data and involve the modification of existing equating procedures, such as linear and equipercentile methods, that have been used successfully in the past with test forms composed of multiple-choice items. Judgmental strategies for equating involve the use of expert judgments to determine the equivalence of scores obtained from alternate forms of an assessment instrument.

13.

A Comparison of Angoff's Design I and Design II for Vertical Equating Using Traditional and IRT Methodology

Deborah J. Harris 《Journal of Educational Measurement》1991,28(3):221-235

Practical considerations in conducting an equating study often require a trade-off between testing time and sample size. A counterbalanced design (Angoff's Design II) is often selected because, as each examinee is administered both test forms and therefore the errors are correlated, sample sizes can be dramatically reduced over those required by a spiraling design (Angoff's Design I), where each examinee is administered only one test form. However, the counterbalanced design may be subject to fatigue, practice, or context effects. This article investigated these two data collection designs (for a given sample size) with equipercentile and IRT equating methodology in the vertical equating of two mathematics achievement tests. Both designs and both methodologies were judged to adequately meet an equivalent expected score criterion; Design II was found to exhibit more stability over different samples.

14.

Evaluating the Effects of Multidimensionality on IRT True-Score Equating

《Applied Measurement in Education》2013,26(4):383-407

The performance of the item response theory (IRT) true-score equating method is examined under conditions of test multidimensionality. It is argued that a primary concern in applying unidimensional equating methods when multidimensionality is present is the potential decrease in equity (Lord, 1980) attributable to the fact that examinees of different ability are expected to obtain the same test scores. In contrast to equating studies based on real test data, the use of simulation in equating research not only permits assessment of these effects but also enables investigation of hypothetical equating conditions in which multidimensionality can be suspected to be especially problematic for test equating. In this article, I investigate whether the IRT true-score equating method, which explicitly assumes the item response matrix is unidimensional, is more adversely affected by the presence of multidimensionality than 2 conventional equating methods (linear and equipercentile equating), using several recently proposed equity-based criteria (Thomasson, 1993). Results from 2 simulation studies suggest that the IRT method performs at least as well as the conventional methods when the correlation between dimensions is high (≥ 0.7) and may be only slightly inferior to the equipercentile method when the correlation is moderate to low (≤ 0.5).

15.

A Comparison of IRT Equating and Beta 4 Equating

Dong-In Kim Robert Brennan Michael Kolen 《Journal of Educational Measurement》2005,42(1):77-99

Four equating methods (3PL true score equating, 3PL observed score equating, beta 4 true score equating, and beta 4 observed score equating) were compared using four equating criteria: first-order equity (FOE), second-order equity (SOE), conditional-mean-squared-error (CMSE) difference, and the equipercentile equating property. True score equating more closely achieved estimated FOE than observed score equating when the true score distribution was estimated using the psychometric model that was used in the equating. Observed score equating more closely achieved estimated SOE, estimated CMSE difference, and the equipercentile equating property than true score equating. Among the four equating methods, 3PL observed score equating most closely achieved estimated SOE and had the smallest estimated CMSE difference, and beta 4 observed score equating was the method that most closely met the equipercentile equating property.

16.

Evaluation of Two New Smoothing Methods in Equating: The Cubic B-Spline Presmoothing Method and the Direct Presmoothing Method

Zhongmin Cui Michael J. Kolen 《Journal of Educational Measurement》2009,46(2):135-158

This article considers two new smoothing methods in equipercentile equating, the cubic B-spline presmoothing method and the direct presmoothing method. Using a simulation study, these two methods are compared with established methods, the beta-4 method, the polynomial loglinear method, and the cubic spline postsmoothing method, under three sample sizes (300, 1,000, and 3,000) and for three test content areas (ITBS Maps and Diagrams, ITBS Reference and Materials, and ITBS Capitalization). Ten thousand random samples were simulated from population distributions, and the standard error, bias, and RMSE statistics were calculated. The cubic B-spline presmoothing method performed well in reducing total error of equating, whereas the direct presmoothing method appeared to need some modification for it to be as accurate as other smoothing methods.

17.

Comparing apples with oranges? An approach to link TIMSS and the National Educational Panel Study in Germany via equipercentile and IRT methods

《Studies in Educational Evaluation》2015


18.

Evaluating Equating Results: Percent Relative Error for Chained Kernel Equating

Yanlin Jiang Alina A. von Davier Haiwen Chen 《Journal of Educational Measurement》2012,49(1):39-58

This article presents a method for evaluating equating results. Within the kernel equating framework, the percent relative error (PRE) for chained equipercentile equating was computed under the nonequivalent groups with anchor test (NEAT) design. The method was applied to two data sets to obtain the PRE, which can be used to measure equating effectiveness. The study compared the PRE results for chained and poststratification equating. The results indicated that the chained method transformed the new form score distribution to the reference form scale more effectively than the poststratification method. In addition, the study found that in chained equating, the population weight had impact on score distributions over the target population but not on the equating and PRE results.
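As a rough illustration of the PRE statistic, the sketch below uses the definition commonly given in the kernel equating literature: for the p-th moment, PRE(p) is 100 times the difference between the p-th moment of the equated new-form scores and the p-th moment of the reference-form scores, divided by the latter. The abstract does not state the formula, so this definition, the function name, and the toy distributions are assumptions, and the sketch does not reproduce the chained-equating computation developed in the article.

```python
import numpy as np

def percent_relative_error(equated_scores, probs_x, scores_y, probs_y, max_moment=10):
    """PRE(p) for p = 1..max_moment.

    equated_scores: new-form score points after equating to the reference scale.
    probs_x: probabilities of those score points in the target population.
    scores_y, probs_y: reference-form score points and their probabilities.
    """
    e = np.asarray(equated_scores, dtype=float)
    y = np.asarray(scores_y, dtype=float)
    r = np.asarray(probs_x, dtype=float)
    s = np.asarray(probs_y, dtype=float)
    pre = []
    for p in range(1, max_moment + 1):
        moment_equated = np.sum(r * e ** p)   # p-th moment of equated scores
        moment_ref = np.sum(s * y ** p)       # p-th moment of reference scores
        pre.append(100.0 * (moment_equated - moment_ref) / moment_ref)
    return np.array(pre)

# Toy distributions (hypothetical): equal means, different second moments.
pre = percent_relative_error([2.0, 2.0], [0.5, 0.5], [1.0, 3.0], [0.5, 0.5], max_moment=2)
```

Values near zero across the moments indicate that the equated new-form distribution closely matches the reference-form distribution, which is the sense in which PRE measures equating effectiveness.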

19.

Local Linear Observed‐Score Equating

Marie Wiberg Wim J. van der Linden 《Journal of Educational Measurement》2011,48(3):229-254

Two methods of local linear observed‐score equating for use with anchor‐test and single‐group designs are introduced. In an empirical study, the two methods were compared with the current traditional linear methods for observed‐score equating. As a criterion, the bias in the equated scores relative to true equating based on Lord's (1980) definition of equity was used. The local method for the anchor‐test design yielded minimum bias, even for considerable variation of the relative difficulties of the two test forms and the length of the anchor test. Among the traditional methods, the method of chain equating performed best. The local method for single‐group designs yielded equated scores with bias comparable to the traditional methods. This method, however, appears to be of theoretical interest because it forces us to rethink the relationship between score equating and regression.

20.

Small-Sample Equating With Log-Linear Smoothing

Samuel A. Livingston 《Journal of Educational Measurement》1993,30(1):23-39

This study investigated the extent to which log-linear smoothing could improve the accuracy of common-item equating by the chained equipercentile method in small samples of examinees. Examinee response data from a 100-item test were used to create two overlapping forms of 58 items each, with 24 items in common. The criterion equating was a direct equipercentile equating of the two forms in the full population of 93,283 examinees. Anchor equatings were performed in samples of 25, 50, 100, and 200 examinees, with 50 pairs of samples at each size level. Four equatings were performed with each pair of samples: one based on unsmoothed distributions and three based on varying degrees of smoothing. Smoothing reduced, by at least half, the sample size required for a given degree of accuracy. Smoothing that preserved only two moments of the marginal distributions resulted in equatings that failed to capture the curvilinearity in the population equating.