首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common‐item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common‐item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content‐equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common‐item equating (circle‐arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small‐sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard.  相似文献   

2.
This study investigates the comparability of two item response theory based equating methods: true score equating (TSE), and estimated true equating (ETE). Additionally, six scaling methods were implemented within each equating method: mean-sigma, mean-mean, two versions of fixed common item parameter, Stocking and Lord, and Haebara. Empirical test data were examined to investigate the consistency of scores resulting from the two equating methods, as well as the consistency of the scaling methods both within equating methods and across equating methods. Results indicate that although the degree of correlation among the equated scores was quite high, regardless of equating method/scaling method combination, non-trivial differences in equated scores existed in several cases. These differences would likely accumulate across examinees making group-level differences greater. Systematic differences in the classification of examinees into performance categories were observed across the various conditions: ETE tended to place lower ability examinees into higher performance categories than TSE, while the opposite was observed for high ability examinees. Because the study was based on one set of operational data, the generalizability of the findings is limited and further study is warranted.  相似文献   

3.
本文采用数理统计和教育测量的方法,对2005年广西体育高考的篮球考试进行研究.结果表明,2005年的篮球专项考试项目设置合理,将全场比赛考试取消是有根据和较合理的;一分钟投篮和往返运球投篮的评分标准不合理,使考试的难度和区分度降低,考生比较容易得高分,评分标准不能客观和有效地评定和鉴别考生的优劣.应对这两项考试的评分标准进行修改.  相似文献   

4.
This study investigated the effectiveness of equating with very small samples using the random groups design. Of particular interest was equating accuracy at specific scores where performance standards might be set. Two sets of simulations were carried out, one in which the two forms were identical and one in which they differed by a tenth of a standard deviation in overall difficulty. These forms were equated using mean equating, linear equating, unsmoothed equipercentile equating, and equipercentile equating using two through six moments of log-linear presmoothing with samples of 25, 50, 75, 100, 150, and 200. The results indicated that identity equating was preferable to any equating method when samples were as small as 25. For samples of 50 and above, the choice of an equating method over identity equating depended on the location of the passing score relative to examinee performance. If passing scores were located below the mean, where data were sparser, mean equating produced the smallest percentage of misclassified examinees. For passing scores near the mean, all methods produced similar results with linear equating being the most accurate. For passing scores above the mean, equipercentile equating with 2- and 3-moment presmoothing were the best equating methods. Higher levels of presmoothing did not improve the results.  相似文献   

5.
Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like the bootstrap method, to obtain standard errors of equated scores. Formulas are introduced to obtain the derivatives for computing the asymptotic standard errors. The approach was validated using mean‐mean, mean‐sigma, random‐groups, or concurrent calibration equating of simulated samples, for tests modeled using the generalized partial credit model or the graded response model.  相似文献   

6.
In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.  相似文献   

7.
Through pilot studies and regular examination procedures, the National Institute for Educational Measurement (CITO) in The Netherlands has gathered experience with different methods of maintaining the standards of examinations. The present paper presents an overview of the psychometric aspects of the various approaches that can be chosen for the maintenance of standards. Generally speaking, the approaches to the problem, can be divided into two classes. In the first approach the examinations are a fixed factor, i.e. the examination is already constructed and cannot be changed, and the link between the standards of both examinations is created by some test equating design. In the second approach the items of both examinations are selected from a pre‐tested pool of items, in such a way that two equivalent examinations are constructed. In both approaches the statistical problems of simultaneously modelling possible differences in the ability level of different groups of examinees and differences in the difficulty of the items are solved within the framework of item response theory. It is shown that applying the Rasch model for dichotomous and polytomous items results in a variety of possible test‐equating designs which adequately deal with the restrictions imposed by the practical conditions related to the fact that the equating involves examinations. Especially the requirement of secrecy of the content of new examinations must be taken into account. Finally it is shown that, given a pool of pre‐tested items, optimisation techniques can be used to construct equivalent examinations.  相似文献   

8.
Score equating based on small samples of examinees is often inaccurate for the examinee populations. We conducted a series of resampling studies to investigate the accuracy of five methods of equating in a common-item design. The methods were chained equipercentile equating of smoothed distributions, chained linear equating, chained mean equating, the symmetric circle-arc method, and the simplified circle-arc method. Four operational test forms, each containing at least 110 items, were used for the equating, with new-form samples of 100, 50, 25, and 10 examinees and reference-form samples three times as large. Accuracy was described in terms of the root-mean-squared difference (over 1,000 replications) of the sample equatings from the criterion equating. Overall, chained mean equating produced the most accurate results for low scores, but the two circle-arc methods produced the most accurate results, particularly in the upper half of the score distribution. The difference in equating accuracy between the two circle-arc methods was negligible.  相似文献   

9.
基于项目反应理论中的LOGISTIC双参数模型研究共同题非等组设计下,考生能力分布与被试量对等值的影响。等值方法采用分别校准下的项目特征曲线法、Stocking-Lord法、Haebara法。等值结果采用等值分数标准误、等值系数标准误、共同题参数稳定性三种方法进行评价。研究结果表明,考生能力分布越接近,被试量越大,等值误差越小;且Stocking-Lord法较Haebara法的等值结果更稳定。  相似文献   

10.
Equating methods make use of an appropriate transformation function to map the scores of one test form into the scale of another so that scores are comparable and can be used interchangeably. The equating literature shows that the ways of judging the success of an equating (i.e., the score transformation) might differ depending on the adopted framework. Rather than targeting different parts of the equating process and aiming to evaluate the process from different aspects, this article views the equating transformation as a standard statistical estimator and discusses how this estimator should be assessed in an equating framework. For the kernel equating framework, a numerical illustration shows the potentials of viewing the equating transformation as a statistical estimator as opposed to assessing it using equating‐specific criteria. A discussion on how this approach can be used to compare other equating estimators from different frameworks is also included.  相似文献   

11.
The purpose of this study was to assess the dimensionality of two forms of a large-scale standardized test separately for 3 ethnic groups of examinees and to investigate whether differences in their latent trait composites have any impact on unidimensional item response theory true-score equating functions. Specifically, separate equating functions for African American and Hispanic examinees were compared to those of a Caucasian group as well as the total test taker population. On both forms, a 2-dimensional model adequately accounted for the item responses of Caucasian and African American examinees, whereas a more complex model was required for the Hispanic subgroup. The differences between equating functions for the 3 ethnic groups and the total test taker population were small and tended to be located at the low end of the score scale.  相似文献   

12.
A resampling study was conducted to compare the statistical bias and standard errors of nonequivalent-groups linear test equating in small samples of examinees. Sample sizes of 15, 25, 50, and 100 were examined. One thousand samples of each size were drawn with replacement from each of 5 archival data files from teacher subject area tests. For each test, data files from 2 parallel forms were used. Results suggest trivial levels of equating bias even with small samples, but substantial increases in standard errors as sample size decreases. Results were interpreted in terms of applications to testing situations in which small numbers of examinees are available.  相似文献   

13.
The selection of bandwidth in kernel equating is important because it has a direct impact on the equated test scores. The aim of this article is to examine the use of double smoothing when selecting bandwidths in kernel equating and to compare double smoothing with the commonly used penalty method. This comparison was made using both an equivalent groups design and a nonequivalent group with anchor test design. The performance of the methods was evaluated through simulation studies using both symmetric and skewed score distributions. In addition, the bandwidth selection methods were applied to real data from a college admissions test. The results show that the traditional penalty method works well although double smoothing is a viable alternative because it performs reasonably well compared to the traditional method.  相似文献   

14.
The standard error of measurement (SEM) is the standard deviation of errors of measurement that are associated with test scores from a particular group of examinees. When used to calculate confidence bands around obtained test scores, it can be helpful in expressing the unreliability of individual test scores in an understandable way. Score bands can also be used to interpret intraindividual and interindividual score differences. Interpreters should be wary of over-interpretation when using approximations for correctly calculated score bands. It is recommended that SEMs at various score levels be used in calculating score bands rather than a single SEM value.  相似文献   

15.
《教育实用测度》2013,26(4):383-407
The performance of the item response theory (IRT) true-score equating method is examined under conditions of test multidimensionality. It is argued that a primary concern in applying unidimensional equating methods when multidimensionality is present is the potential decrease in equity (Lord, 1980) attributable to the fact that examinees of different ability are expected to obtain the same test scores. In contrast to equating studies based on real test data, the use of simulation in equating research not only permits assessment of these effects but also enables investigation of hypothetical equating conditions in which multidimensionality can be suspected to be especially problematic for test equating. In this article, I investigate whether the IRT true-score equating method, which explicitly assumes the item response matrix is unidimensional, is more adversely affected by the presence of multidimensionality than 2 conventional equating methods-linear and equipercentile equating-using several recently proposed equity-based criteria (Thomasson, 1993). Results from 2 simulation studies suggest that the IRT method performs at least as well as the conventional methods when the correlation between dimensions is high (³ 0.7) and may be only slightly inferior to the equipercentile method when the correlation is moderate to low (£ 0.5).  相似文献   

16.
Cutoff scores based on absolute standards can be unacceptable in terms of the number of failures they produce. Cutoff scores based on relative standards, that is, cutoff scores set to achieve a fixed percentage of failures, can be unacceptable because an acceptable performance level for passed examinees cannot be guaranteed. In some situations one can improve upon an absolute standard using a compromise model, which draws on the information in the observed score distribution for a test to adjust the standard. Three compromise models are described and compared in this article.  相似文献   

17.
This article presents estimates of the effects of the use of formula scoring on an individual examinee's score. The results of this analysis suggest that under plausible assumptions, using test characteristics derived from several studies, some examinees would increase their scores by one half standard deviation or more if they were to answer items omitted under formula directions  相似文献   

18.
This article describes a preliminary investigation of an empirical Bayes (EB) procedure for using collateral information to improve equating of scores on test forms taken by small numbers of examinees. Resampling studies were done on two different forms of the same test. In each study, EB and non-EB versions of two equating methods—chained linear and chained mean—were applied to repeated small samples drawn from a large data set collected for a common-item equating. The criterion equating was the chained linear equating in the large data set. Equatings of other forms of the same test provided the collateral information. New-form sample size was varied from 10 to 200; reference-form sample size was constant at 200. One of the two new forms did not differ greatly in difficulty from its reference form, as was the case for the equatings used as collateral information. For this form, the EB procedure improved the accuracy of equating with new-form samples of 50 or fewer. The other new form was much more difficult than its reference form; for this form, the EB procedure made the equating less accurate.  相似文献   

19.
Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.  相似文献   

20.
This study investigated differences between two approaches to chained equipercentile (CE) equating (one‐ and bi‐direction CE equating) in nearly equal groups and relatively unequal groups. In one‐direction CE equating, the new form is linked to the anchor in one sample of examinees and the anchor is linked to the reference form in the other sample. In bi‐direction CE equating, the anchor is linked to the new form in one sample of examinees and to the reference form in the other sample. The two approaches were evaluated in comparison to a criterion equating function (i.e., equivalent groups equating) using indexes such as root expected squared difference, bias, standard error of equating, root mean squared error, and number of gaps and bumps. The overall results across the equating situations suggested that the two CE equating approaches produced very similar results, whereas the bi‐direction results were slightly less erratic, smoother (i.e., fewer gaps and bumps), usually closer to the criterion function, and also less variable.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号