Similar Literature
Found 19 similar articles; search took 484 ms.
1.
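Cao Wenjuan, Bai Junmei. 《考试研究》 (Kaoshi Yanjiu), 2013, (3): 79-85, 33
This paper uses R-2.15.2 to run a simulation study of how the variance of anchor-test difficulty parameters affects test equating error. Three equating methods (chained equipercentile, Levine, and Tucker) are compared across anchor tests with different types of difficulty variance. The results show that when the anchor test's difficulty variance is smaller than that of the total test, the random and systematic equating errors are essentially the same as when the two variances are equal (that is, when the anchor test is a parallel shortened version of the total test, a minitest). Requiring the anchor test to have the same statistical specifications as the total test may therefore be unnecessarily strict.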

2.
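This study uses the non-equivalent groups with anchor test design to examine how changes in anchor-test sample size affect equating results. The data come from the Public English Test System (PETS); item parameters were estimated under the Rasch model with in-house software built on Bigsteps. To examine how equating results affect the pass score, the study compares anchor-test parameter estimates at different sample sizes with the given anchor-item parameter values, and analyzes the differences between operational-form equating results from anchor tests of different sample sizes and those from the largest-sample anchor test. The results show that equating results are fairly stable once the anchor-test sample reaches 150, which indicates that the anchor-test sample of about 300 set by PETS is reasonable.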

3.
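By studying the linear equating formula used in test equating, an improved linear equating formula is obtained. The formula depends not only on the correlation coefficient between the two tests but also on their reliabilities. The commonly used linear equating formula is the special case in which the two tests have equal reliability.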
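As a rough illustration of the commonly used linear equating formula mentioned above, the sketch below maps a score on form X to the scale of form Y by matching means and standard deviations. The improved formula itself is not given in the abstract and is not reproduced here; the function name `linear_equate` and the toy score lists are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Map score x on form X to the form Y scale by matching means and SDs.

    This is the conventional linear equating function
    l_Y(x) = mu_Y + (sigma_Y / sigma_X) * (x - mu_X),
    not the reliability-adjusted formula the abstract describes.
    """
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Hypothetical toy data: form Y scores are twice as spread out as form X.
form_x = [10, 12, 14, 16, 18]
form_y = [20, 24, 28, 32, 36]

print(linear_equate(14, form_x, form_y))  # the form X mean maps to the form Y mean
print(linear_equate(16, form_x, form_y))
```

A score at the mean of form X always maps to the mean of form Y, and a score one standard deviation above the mean of X maps to one standard deviation above the mean of Y.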

4.
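As the new round of college entrance examination (gaokao) reform deepens, examinees will have two testing opportunities in some subjects. Scores from the two administrations can be converted to one another through test equating. Equating practice involves many steps, however, and each affects the final equating quality. This paper discusses issues to attend to in equating under the gaokao reform, covering the choice of equating design, judging whether equating is necessary, the choice of equating method, the choice of evaluation criteria, and quality control of the equating process, with the aim of substantially improving equating quality.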

5.
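To reduce students' academic burden and limit measurement error caused by chance factors, the new round of gaokao reform requires that examinees be given two opportunities to take the foreign language and academic proficiency examinations. Against this background, how to compare scores from the two administrations becomes a key issue. Test equating, an important part of psychometrics, addresses exactly this problem of comparing test scores. By discussing the concept of equating, equating designs, equating methods, and the evaluation of equating, the paper analyzes the issues that gaokao equating should attend to and the equating methods it might adopt, providing technical support for comparing gaokao scores on a sound scientific basis.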

6.
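Because of test security concerns, flawed form assembly, and similar problems, the forms of some tests cannot share anchor items or were built without them. Comparing the scores of examinees who took different forms then requires test equating. Unlike tests with anchor items, which can be equated through the items shared between forms, tests without anchor items must rely on special methods. Three approaches are currently in use. The first works through specific items in the tests, that is, by constructing common "anchor items"; examples are the constructed randomly-equivalent-tests method and methods that equate using prior information about the items. The second constructs a common examinee group, that is, the constructed randomly-equivalent-samples method. The third works through the cognitive attributes measured by the items, usually based on a cognitive diagnostic model, the rule space model.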

7.
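In the single-group design for test equating, the two tests to be equated, X and Y, are administered to the same group of examinees and the scores are then equated. Its advantage is that, with only one examinee group, differences between X and Y scores are attributable to differences between the tests and are not confounded with group differences. Its drawback is that each examinee is tested twice, so practice effects and fatigue can distort the equating results. This paper proposes a new design, the split-half single-group design: X and Y are each divided into two parallel half-tests, one half from each is combined into a new test Z, Z is administered to a single examinee group, and the equating conversion formula is derived from the results. Each examinee is tested only once, so the method keeps the advantage of the single-group design while overcoming its drawback.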

8.
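Test equating is not an unconditional score conversion; it must satisfy conditions such as equity, population invariance, symmetry, and unidimensionality of the tests. Equating can address problems such as comparing student performance across school years, comparing performance across schools and regions, and comparing teaching quality across classes. Equating requires a sound equating design; the anchor test design is a commonly used one, and different anchor test designs carry different requirements. Using a worked example, the study illustrates the application of test equating in practice.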

9.
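Equating Error Theory and Error Control in Equating China's College Entrance Examination   Total citations: 2 (self-citations: 0, citations by others: 2)
Test equating error comprises random error and systematic error. Random error arises from sampling, and its size depends mainly on sample size; there are two methods for estimating it. The causes of systematic error are more complex: some systematic errors can be estimated by suitable means, while others cannot be estimated. Earlier work on equating China's gaokao has already applied error-control thinking in scheme design, data collection, anchor-item writing, and computation of the equating relationship, with good results. For more effective control of gaokao equating error, the paper recommends estimating the required sample size in advance, replacing anchor items on a planned schedule, designing equating paths carefully, and choosing an appropriate degree for the smoothing curve.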
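The dependence of random equating error on sample size can be sketched with a bootstrap. The abstract does not name its two estimation methods, so using the bootstrap here is an assumption, as are the random-groups setup, the linear equating function, the simulated normal scores, and every name in the code.

```python
import random
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Conventional linear equating of score x from form X to form Y."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return mean(scores_y) + slope * (x - mean(scores_x))


def bootstrap_se(x, scores_x, scores_y, reps=200, seed=0):
    """Bootstrap standard error of the equated score at x.

    Assumes two independent random groups, so each group is
    resampled with replacement on its own.
    """
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < reps:
        bx = rng.choices(scores_x, k=len(scores_x))
        by = rng.choices(scores_y, k=len(scores_y))
        if pstdev(bx) == 0:  # degenerate resample, cannot be equated
            continue
        estimates.append(linear_equate(x, bx, by))
    return pstdev(estimates)


def simulate(n, seed):
    """Hypothetical score data: two groups of n simulated examinees."""
    rng = random.Random(seed)
    xs = [rng.gauss(50, 10) for _ in range(n)]
    ys = [rng.gauss(55, 12) for _ in range(n)]
    return xs, ys


se_small = bootstrap_se(60, *simulate(50, seed=1))
se_large = bootstrap_se(60, *simulate(800, seed=1))
print(se_small, se_large)  # the larger sample should give the smaller SE
```

The bootstrap captures only random error. Systematic error, such as bias from a poorly chosen anchor or equating path, does not shrink as the sample grows and is not visible in this estimate.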

10.
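Test equating makes different forms of an examination comparable and thereby keeps tests relatively stable across forms. IRT-based score equating is a parameter transformation carried out after the parameters have been estimated, so the stability of the equating results is closely tied to examinee sample size. Focusing on the reading subtest of the Hanyu Shuiping Kaoshi (HSK), this study uses real data to simulate a common-group anchor test design, establishes a reference criterion for equating, and examines how changes in examinee sample size affect the stability of IRT score equating. The results show that equating results are fairly stable under all schemes at a sample size of about 2,000. When the sample size increases further, equating error rises rather than falls.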

11.
Examined in this study were the effects of reducing anchor test length on student proficiency rates for 12 multiple‐choice tests administered in an annual, large‐scale, high‐stakes assessment. The anchor tests contained 15 items, 10 items, or five items. Five content representative samples of items were drawn at each anchor test length from a small universe of items in order to investigate the stability of equating results over anchor test samples. The operational tests were calibrated using the one‐parameter model and equated using the mean b‐value method. The findings indicated that student proficiency rates could display important variability over anchor test samples when 15 anchor items were used. Notable increases in this variability were found for some tests when shorter anchor tests were used. For these tests, some of the anchor items had parameters that changed somewhat in relative difficulty from one year to the next. It is recommended that anchor sets with more than 15 items be used to mitigate the instability in equating results due to anchor item sampling. Also, the optimal allocation method of stratified sampling should be evaluated as one means of improving the stability and precision of equating results.
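The mean b‐value method named in the abstract above can be sketched as follows, under the common reading that, with a one‐parameter model, the two scales differ only by a constant, estimated as the difference in mean anchor-item difficulty. The function names and the difficulty values are hypothetical, not taken from the study.

```python
def mean_b_shift(b_anchor_old, b_anchor_new):
    """Constant that moves new-form Rasch difficulties onto the old-form scale.

    Both lists hold difficulty estimates for the same anchor items,
    calibrated separately in the old and the new administration.
    """
    if len(b_anchor_old) != len(b_anchor_new):
        raise ValueError("anchor item lists must have the same length")
    mean_old = sum(b_anchor_old) / len(b_anchor_old)
    mean_new = sum(b_anchor_new) / len(b_anchor_new)
    return mean_old - mean_new


def rescale(b_values, shift):
    """Apply the equating constant to a list of difficulty estimates."""
    return [b + shift for b in b_values]


# Hypothetical anchor estimates: the new calibration sits 0.5 logits higher.
b_old = [-1.0, 0.0, 1.0]
b_new = [-0.5, 0.5, 1.5]

shift = mean_b_shift(b_old, b_new)
print(shift)                  # -0.5
print(rescale(b_new, shift))  # [-1.0, 0.0, 1.0]
```

Because the constant is a mean over anchor items, a short anchor set lets one item whose relative difficulty drifted between years move the constant noticeably, which is consistent with the instability the abstract reports for shorter anchors.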

12.
The choice of anchor tests is crucial in applications of the nonequivalent groups with anchor test design of equating. Sinharay and Holland (2006, 2007) suggested “miditests,” which are anchor tests that are content‐representative and have the same mean item difficulty as the total test but have a smaller spread of item difficulties. Sinharay and Holland (2006, 2007), Cho, Wall, Lee, and Harris (2010), Fitzpatrick and Skorupski (2016), Liu, Sinharay, Holland, Curley, and Feigenbaum (2011a), Liu, Sinharay, Holland, Feigenbaum, and Curley (2011b), and Yi (2009) found the miditests to lead to better equating than minitests, which are representative of the total test with respect to content and difficulty. However, these findings recently came into question as Trierweiler, Lewis, and Smith (2016) concluded, based on a comparison of correlation coefficients of miditests and minitests with the total test, that making an anchor test a miditest does not generally increase the anchor to total score correlation and recommended the continuation of the practice of using minitests over miditests. Their recommendation raises the question, “Should miditests continue to be considered in practice?” This note defends the miditests by citing literature that favors miditests and then by showing that miditests perform as well as the minitests in most realistic situations considered in Trierweiler et al. (2016), which implies that miditests should continue to be seriously considered by equating practitioners.

13.
The study examined two approaches for equating subscores: (1) using internal common items as the anchor, and (2) using equated and scaled total scores as the anchor. Since equated total scores are comparable across the new and old forms, they can be used as an anchor to equate the subscores. Both chained linear and chained equipercentile methods were used. Data from two tests were used to conduct the study, and results showed that when more internal common items were available (i.e., 10–12 items), using common items to equate the subscores is preferable. However, when the number of common items is very small (i.e., five to six items), using total scaled scores to equate the subscores is preferable. For both tests, not equating (i.e., using raw subscores) is not reasonable, as it resulted in a considerable amount of bias.
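The chained linear method mentioned above can be sketched as two linear links composed in sequence: new-form scores to the anchor in the new-form group, then anchor to old-form scores in the old-form group. The anchor could be either the internal common items or the equated total scores. All names and data below are hypothetical, and this is the textbook form of the method rather than the study's own implementation.

```python
from statistics import mean, pstdev


def linear_link(from_scores, to_scores):
    """Return a function that links one score scale to another linearly."""
    mu_from, mu_to = mean(from_scores), mean(to_scores)
    slope = pstdev(to_scores) / pstdev(from_scores)
    return lambda s: mu_to + slope * (s - mu_from)


def chained_linear(x_new, anchor_new, anchor_old, y_old):
    """Chain X -> anchor (new-form group) with anchor -> Y (old-form group)."""
    x_to_anchor = linear_link(x_new, anchor_new)
    anchor_to_y = linear_link(anchor_old, y_old)
    return lambda x: anchor_to_y(x_to_anchor(x))


# Hypothetical subscore and anchor data for the two examinee groups.
x_new = [2, 4, 6, 8]       # new-form subscores
anchor_new = [1, 2, 3, 4]  # anchor scores, new-form group
anchor_old = [2, 3, 4, 5]  # anchor scores, old-form group
y_old = [3, 5, 7, 9]       # old-form subscores

equate = chained_linear(x_new, anchor_new, anchor_old, y_old)
print(equate(6))  # new-form subscore 6 expressed on the old-form scale
```

Each link is estimated within a single group, so the method needs no assumption about a combined population; the anchor carries all the information about group differences.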

14.
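Objective: To examine the effect of psychological behavior training on the volitional qualities of college students. Methods: A self-developed volitional quality scale for college students was administered to 38 college students who took part in psychological behavior training. Results: (1) In the intervention group, pretest and posttest scores showed no significant difference on the decisiveness, self-awareness, and self-control factors or on the overall mean (P > 0.05), but differed significantly on the tenacity factor (P < 0.05); (2) in the control group, neither the factor scores nor the overall mean differed significantly between pretest and posttest (P > 0.05); (3) at the immediate posttest, the intervention and control groups differed significantly on every volitional quality factor (P < 0.05), and the difference in overall mean was highly significant (P < 0.01); (4) at the long-term posttest, the two groups differed significantly on the self-awareness factor and the overall mean (P < 0.05) but not on decisiveness, tenacity, or self-control (P > 0.05). Conclusion: Psychological behavior training effectively raises the volitional quality of college students and can be widely applied in volitional quality education and mental health education at universities.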

15.
This article discusses a particular type of concordance table and the potential for test score misuse that may result from employing such a table. The concordance that is discussed is typically created between scores on different, nonequatable versions of a test that share the same or close to the same test title. These concordance tables often appear in the context of relating scores on computerized adaptive and paper‐and‐pencil versions of the same test. When such a table is presented in a complete point‐by‐point fashion, relating each reported score on the scale of the new version of the test to a reported score on the scale of the old version of the test, test score users will typically treat the table as if it represented an equating of scores between the two versions, and directly replace scores on the new version of the test by scores on the old version. This clearly represents a misuse of the test scores. Suggestions for avoiding this misuse of test scores from concordance tables are provided.

16.
In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
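The percentile-matching idea in the first sentence of the abstract above can be sketched for integer scores without any smoothing. This is the classical discrete equipercentile procedure with linear interpolation, not the kernel or IRT-based procedure the study analyzes; the function names and the frequency tables are hypothetical.

```python
def percentile_rank(x, freqs):
    """Percentile rank of integer score x; freqs[k] counts examinees scoring k."""
    n = sum(freqs)
    return 100.0 * (sum(freqs[:x]) + 0.5 * freqs[x]) / n


def equipercentile(x, freqs_x, freqs_y):
    """Form Y score with the same percentile rank as score x on form X."""
    p = percentile_rank(x, freqs_x) / 100.0
    n_y = sum(freqs_y)
    cum = 0.0
    for y, f in enumerate(freqs_y):
        prev, cum = cum, cum + f / n_y
        if cum > p:
            # Interpolate inside the interval [y - 0.5, y + 0.5).
            return (y - 0.5) + (p - prev) / (cum - prev)
    return len(freqs_y) - 0.5  # p reached 1.0: top of the form Y scale


# Hypothetical score distributions; form Y is form X shifted up by one point.
freqs_x = [1, 2, 4, 2, 1]
freqs_y = [0, 1, 2, 4, 2, 1]

print(equipercentile(2, freqs_x, freqs_x))  # identical forms: score maps to itself
print(equipercentile(2, freqs_x, freqs_y))  # shifted form: score maps one point higher
```

In the setting the abstract describes, the observed frequency tables would be replaced by score probabilities derived from a fitted polytomous IRT model, with the discreteness handled by kernel continuization rather than linear interpolation.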

17.
In this article, linear item response theory (IRT) observed‐score equating is compared under a generalized kernel equating framework with Levine observed‐score equating for nonequivalent groups with anchor test design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed‐score equating is virtually identical to Levine observed‐score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as the curvilinear Levine observed‐score equating.

18.
Three local observed‐score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias (as defined by Lord's criterion of equity) and percent relative error. The local kernel item response theory observed‐score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test design generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had a somewhat larger standard error of equating than their kernel method counterparts.

19.
This study explores an anchor that is different from the traditional miniature anchor in test score equating. In contrast to a traditional “mini” anchor that has the same spread of item difficulties as the tests to be equated, the studied anchor, referred to as a “midi” anchor (Sinharay & Holland), has a smaller spread of item difficulties than the tests to be equated. Both anchors were administered in an operational SAT administration and the impact of anchor type on equating was evaluated with respect to systematic error or equating bias. Contradicting the popular belief that the mini anchor is best, the results showed that the mini anchor does not always produce more accurate equating functions than the midi anchor; the midi anchor was found to perform as well as or even better than the mini anchor. Because testing programs usually have more middle difficulty items and few very hard or very easy items, midi external anchors are operationally easier to build. Therefore, the results of our study provide evidence in favor of the midi anchor, the use of which will lead to cost saving with no reduction in equating quality.
