期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

《湖南师范大学教育科学学报》1993,(1)

概化理论(Generalizability Theory)是当今最重要的三大测验理论之一,与经典测验理论(Classical Test Theory,CTT)和项目反应理论(Item Response Theory,IRT)相比,不仅信度观为人乐道,而且效度观也令人耳目一新。为了更好地把握测量效度,本文将从二个方面讨论一下概化理论的效度观。相似文献

2.

概化理论的效度观

杨志明《湖南师范大学社会科学学报》1993,(1)

概化理论(Generalizability Theory)是当今最重要的三大测验理论之一,与经典测验理论(Classical Test Theory,CTT)和项目反应理论(Item Response Theow,IRT)相比,不仅信度观为人乐道。而且效度观也令人耳目一新。为了更好地把握测量效度,本文将从二个方面讨论一下概化理论的效度观。相似文献

3.

计算机自适应性语言测试的智能选题方法研究

柴省三《中国教育信息化》2014,(4):81-85

随着计算机信息技术的发展和多媒体网络教学设备的日益普及,基于项目反应理论（IRT）的计算机自适应性（CAT）语言测试由于在测验信度、测验效率和考试安全性等方面比传统的纸笔测验具有更大的优势,因此针对计算机自适应性考试的理论问题和实践问题正在成为教育考试信息化研究的热点之一。文章在对自适应性考试原理进行考察的基础上,专门就计算机自适应性语言测试,特别是对国内外计算机自适应性阅读理解考试过程中遇到的智能选题单位和方法问题进行了探讨。并对具体的解决途径进行了研究。相似文献

4.

对基于项目反映理论的计算机自适应测试方法的再思考

刘培艳王淑琴《唐山师范学院学报》2013,(2):44-46

以项目反应理论IRT（ItemResponseTheory）为基础,介绍项目反应理论IRT的特点,以及基于项目反应理论IRT的计算机自适应测试的工作原理,并在此基础上总结了起点选择的方法,提出了测试流程两步制的改进方案,通过对测试流程的改进,大大减少了与被试能力值相差较远的测试项目,缩短了测试时间和计算量,同时能准确地估计被试能力值。相似文献

5.

计算机自适应性语言测试的智能选题方法研究

《中国教育信息化》2014,(8)

随着计算机信息技术的发展和多媒体网络教学设备的日益普及,基于项目反应理论(IRT)的计算机自适应性(CAT)语言测试由于在测验信度、测验效率和考试安全性等方面比传统的纸笔测验具有更大的优势,因此针对计算机自适应性考试的理论问题和实践问题正在成为教育考试信息化研究的热点之一。文章在对自适应性考试原理进行考察的基础上,专门就计算机自适应性语言测试,特别是对国内外计算机自适应性阅读理解考试过程中遇到的智能选题单位和方法问题进行了探讨,并对具体的解决途径进行了研究。相似文献

6.

项目反应理论框架下几个等值问题的探讨

丁树良熊建华《中国考试》2003,(Z4)

测验等值无疑是测量中一个重要问题。《面向心理学家的项目反应理论》一书指出,“项目反应理论(IRT)最初吸引美国测验编制人员。是因为这种理论可以解决许多测验中的实际问题,比如将不同形式的测验形式等值。”《BILOG.3使用说明书》指出,“IRT与经典测验理论(CTT)相比,或许最大的长处是测验等值”。事实上,在IRT框架下实施等值,不仅理论完善,前提条件较容易满足,而且等值关系式也十分简洁。相似文献

7.

小学学业成就评价方法探新——项目反应理论(IRT)指导下的计算机自适应测验(CAT)

李映红《湖南第一师范学报》2007,7(3):31-33

好的评价方法能给学生学业成就科学公正的评价,可以正确引导学生找出不足,激励学习。根据小学学业成就评价方法的现状,对照传统的CTT理论和现代测量理论IRT的优缺点,提出项目反应理论(IRT)指导下的计算机自适应测验(CAT)作为传统测验的一种辅助方式,应用于小学学业成就评价中是必要的,也是可行的。相似文献

8.

论试题库建立的理论和方法

张敏强《华南师范大学学报(社会科学版)》1990,1(1)

要使考试公平、合理、准确,有必要使考试标准化.广东省高考标准化改革试验已经证明,标准化考试在我国是可行的.考试标准化的改革要深入开展,必须建立试题库.本文就试题库建立的意义和作用作了论述,并介绍了运用经典测验理论(CTT)和现代测验理论中项目反应理论(IRT)建立试题库的理论和方法. 相似文献

9.

高考数学中考试评价的研究——基于CTT与IRT的实证比较

闫成海杜文久宋乃庆张健《华东师范大学学报(教育科学版)》2014,32(3):10-18

相关研究表明,IRT在教育考试评价中比CTT具有诸多优点。本文以某地区高考数学考试数据为基础,比较CTT与IRT在项目参数、评价方式、精度估计三个方面之间的差异。研究结果证明,在IRT下参数更容易反映观测各个项目的特征属性,IRT参数比CTT参数更具精确性,项目信息函数能更好的反映试题信息;CTT与IRT的评价方式不同,IRT下的能力分数优于CTT下的测验分数,更能反映学生能力水平;CTT与IRT精度估计不同,IRT测验信息函数和能力置信区间比CTT有更好的精度。实证展示出IRT在高考数学考试评价中的优越性,具有重要的价值和应用前景。相似文献

10.

多种测量信度观与自学考试信度分析

田霖韦小满王桥影《教育与考试》2013,(2):21-25

信度是衡量测量结果稳定性与可靠性的重要指标,反映了测量过程中对误差控制能力的大小。信度分析是自学考试试题评价的重要内容,包括测量分数信度分析与及格线决策信度分析。本文简要介绍了CTT信度观、GT信度观及IRT信度观的理论内容与信度分析方法,并对三种测量信度观进行比较。本文提出,自学考试的信度分析工作应结合具体课程的考试特点、试卷结构、考试作答数据类型等因素,同时考虑CTT、GT、IRT三种信度观的优势及信度估计方法的应用条件,根据具体研究目的选择最恰当的或综合运用不同的信度分析方法。相似文献

11.

IRT‐Estimated Reliability for Tests Containing Mixed Item Formats

Lianghua Shu Richard D. Schwarz 《Journal of Educational Measurement》2014,51(2):163-177

As a global measure of precision, item response theory (IRT) estimated reliability is derived for four coefficients (Cronbach's α, Feldt‐Raju, stratified α, and marginal reliability). Models with different underlying assumptions concerning test‐part similarity are discussed. A detailed computational example is presented for the targeted coefficients. A comparison of the IRT model‐derived coefficients is made and the impact of varying ability distributions is evaluated. The advantages of IRT‐derived reliability coefficients for problems such as automated test form assembly and vertical scaling are discussed. 相似文献

12.

Skills Diagnosis Using IRT-Based Latent Class Models

Louis A. Roussos Jonathan L. Templin Robert A. Henson 《Journal of Educational Measurement》2007,44(4):293-311

相似文献

13.

Relating Unidimensional IRT Parameters to a Multidimensional Response Space: A Review of Two Alternative Projection IRT Models for Scoring Subscales

Nilufer Kahraman Tony Thompson 《Journal of Educational Measurement》2011,48(2):146-164

A practical concern for many existing tests is that subscore test lengths are too short to provide reliable and meaningful measurement. A possible method of improving the subscale reliability and validity would be to make use of collateral information provided by items from other subscales of the same test. To this end, the purpose of this article is to compare two different formulations of an alternative Item Response Theory (IRT) model developed to parameterize unidimensional projections of multidimensional test items: Analytical and Empirical formulations. Two real data applications are provided to illustrate how the projection IRT model can be used in practice, as well as to further examine how ability estimates from the projection IRT model compare to external examinee measures. The results suggest that collateral information extracted by a projection IRT model can be used to improve reliability and validity of subscale scores, which in turn can be used to provide diagnostic information about strength and weaknesses of examinees helping stakeholders to link instruction or curriculum to assessment results. 相似文献

14.

A comparison of the approaches of generalizability theory and item response theory in estimating the reliability of test scores for testlet-composed tests

Guemin Lee In-Yong Park 《Asia Pacific Education Review》2012,13(1):47-54

Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): item score approach (ISA), testlet score approach (TSA), and item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degrees of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches. However, this could not be applied in GT. Specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for methods in IRT. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is a relatively large dependence among within-testlet items, INTA should be considered for IRT due to nonnegligible loss of information. 相似文献

15.

A Nonparametric Approach to Estimate Classification Accuracy and Consistency

Quinn N. Lathrop Ying Cheng 《Journal of Educational Measurement》2014,51(3):318-334

When cut scores for classifications occur on the total score scale, popular methods for estimating classification accuracy (CA) and classification consistency (CC) require assumptions about a parametric form of the test scores or about a parametric response model, such as item response theory (IRT). This article develops an approach to estimate CA and CC nonparametrically by replacing the role of the parametric IRT model in Lee's classification indices with a modified version of Ramsay's kernel‐smoothed item response functions. The performance of the nonparametric CA and CC indices are tested in simulation studies in various conditions with different generating IRT models, test lengths, and ability distributions. The nonparametric approach to CA often outperforms Lee's method and Livingston and Lewis's method, showing robustness to nonnormality in the simulated ability. The nonparametric CC index performs similarly to Lee's method and outperforms Livingston and Lewis's method when the ability distributions are nonnormal. 相似文献

16.

基于CTT、GT、IRT的评分者信度研究——以某届奥运会女子跳水决赛为例

钟晓玲康春花陈婧《考试研究》2013,(5):41-52

本文以某届国际奥林匹克运动会女子跳水决赛为例,综合应用CTT、GT和IRT三大测量理论进行评分者信度分析,从不同角度揭示评分者之间和评分者内部的差异情况。结果表明:CTT的评分者信度分别为0.981和078;GT的概化系数和可靠性指数分别为0.8279和0.8271,比赛中所采用的7名评委分别对选手在5轮上的跳水表现进行评定的决策是比较适宜的决策;在IRT中,相对而言,评委5在7名评委中最为严厉,评委2最为宽松,但评委之间在宽严程度上的差异不显著,评委1和评委4在自身一致性上存在问题,不同评委在评定不同选手、不同难度系数动作和不同轮数上存在偏差,但未达到显著性水平。基于本文的分析,可以了解三种评分者信度分析方法的特点及各自优势,为评分者培训和提高评分信度提供有用信息。相似文献

17.

The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics

Seonghoon Kim Leonard S. Feldt 《Asia Pacific Education Review》2010,11(2):179-188

The primary purpose of this study is to investigate the mathematical characteristics of the test reliability coefficient ρ_XX′ as a function of item response theory (IRT) parameters and present the lower and upper bounds of the coefficient. Another purpose is to examine relative performances of the IRT reliability statistics and two classical test theory (CTT) reliability statistics (Cronbach’s alpha and Feldt–Gilmer congeneric coefficients) under various testing conditions that result from manipulating large-scale real data. For the first purpose, two alternative ways of exactly quantifying ρ_XX′ are compared in terms of computational efficiency and statistical usefulness. In addition, the lower and upper bounds for ρ_XX′ are presented in line with the assumptions of essential tau-equivalence and congeneric similarity, respectively. Empirical studies conducted for the second purpose showed across all testing conditions that (1) the IRT reliability coefficient was higher than the CTT reliability statistics; (2) the IRT reliability coefficient was closer to the Feldt–Gilmer coefficient than to the Cronbach’s alpha coefficient; and (3) the alpha coefficient was close to the lower bound of IRT reliability. Some advantages of the IRT approach to estimating test-score reliability over the CTT approaches are discussed in the end. 相似文献

18.

Can We Learn From Student Mistakes in a Formative,Reading Comprehension Assessment?

Bowen Liu Patrick C. Kennedy Ben Seipel Sarah E. Carlson Gina Biancarosa Mark L. Davison 《Journal of Educational Measurement》2019,56(4):815-835

This article describes an ongoing project to develop a formative, inferential reading comprehension assessment of causal story comprehension. It has three features to enhance classroom use: equated scale scores for progress monitoring within and across grades, a scale score to distinguish among low‐scoring students based on patterns of mistakes, and a reading efficiency index. Instead of two response types for each multiple‐choice item, correct and incorrect, each item has three response types: correct and two incorrect response types. Prior results on reliability, convergent and discriminant validity, and predictive utility of mistake subscores are briefly described. The three‐response‐type structure of items required rethinking the item response theory (IRT) modeling. IRT‐modeling results are presented, and implications for formative assessments and instructional use are discussed. 相似文献

19.

IRT Approaches to Modeling Scores on Mixed-Format Tests

Won-Chan Lee Stella Y. Kim Jiwon Choi Yujin Kang 《Journal of Educational Measurement》2020,57(2):230-254

This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models. 相似文献

20.

Item Response Theory Models for Performance Decline During Testing

Kuan‐Yu Jin Wen‐Chung Wang 《Journal of Educational Measurement》2014,51(2):178-200

Sometimes, test‐takers may not be able to attempt all items to the best of their ability (with full effort) due to personal factors (e.g., low motivation) or testing conditions (e.g., time limit), resulting in poor performances on certain items, especially those located toward the end of a test. Standard item response theory (IRT) models fail to consider such testing behaviors. In this study, a new class of mixture IRT models was developed to account for such testing behavior in dichotomous and polytomous items, by assuming test‐takers were composed of multiple latent classes and by adding a decrement parameter to each latent class to describe performance decline. Parameter recovery, effect of model misspecification, and robustness of the linearity assumption in performance decline were evaluated using simulations. It was found that the parameters in the new models were recovered fairly well by using the freeware WinBUGS; the failure to account for such behavior by fitting standard IRT models resulted in overestimation of difficulty parameters on items located toward the end of the test and overestimation of test reliability; and the linearity assumption in performance decline was rather robust. An empirical example is provided to illustrate the applications and the implications of the new class of models. 相似文献