Similar Literature (20 documents retrieved)
1.
The usefulness of item response theory (IRT) models depends, in large part, on the accuracy of item and person parameter estimates. For the standard 3-parameter logistic model, for example, these parameters include the item parameters of difficulty, discrimination, and pseudo-chance, as well as the person ability parameter. Several factors affect traditional marginal maximum likelihood (ML) estimation of IRT model parameters, including sample size: smaller samples are generally associated with lower parameter estimation accuracy and inflated standard errors for the estimates. Given this deleterious impact of small samples, estimation becomes difficult in precisely the low-incidence populations where IRT might prove particularly useful, especially with more complex models. Recently, a Pairwise estimation method for Rasch model parameters has been suggested for use with missing data, and it may also hold promise for parameter estimation with small samples. This simulation study compared the item difficulty parameter estimation accuracy of ML with that of the Pairwise approach to ascertain the benefits of the latter method. The results support the use of the Pairwise method with small samples, particularly for obtaining item location estimates.
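As an illustration of the pairwise idea, the sketch below implements a classic Choppin-type pairwise estimator of Rasch item difficulties: for each item pair, the log of the ratio of "i right, j wrong" to "i wrong, j right" counts estimates the difficulty difference. The function name and the simple log-ratio averaging are our own choices; the Pairwise method studied in the article may differ in detail (for example, in how it exploits missing data).

```python
import numpy as np

def pairwise_rasch_difficulty(X):
    """Pairwise estimate of Rasch item difficulties from a 0/1
    response matrix X (persons x items, no missing data).

    n[i, j] counts persons who got item i right and item j wrong;
    under the Rasch model, log(n[i, j] / n[j, i]) estimates b_j - b_i.
    """
    X = np.asarray(X, dtype=float)
    n = X.T @ (1.0 - X)                      # pairwise right/wrong counts
    with np.errstate(divide="ignore", invalid="ignore"):
        logratio = np.log(n / n.T)           # entry (i, j) ~ b_j - b_i
    logratio[~np.isfinite(logratio)] = np.nan   # drop empty comparisons
    np.fill_diagonal(logratio, np.nan)
    b = np.nanmean(logratio, axis=0)         # average each item's comparisons
    return b - np.nanmean(b)                 # center difficulties at 0
```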

2.
Item response theory (IRT) procedures have been used extensively to study normal latent trait distributions and have been shown to perform well; however, less is known concerning the performance of IRT with non-normal latent trait distributions. This study investigated the degree of latent trait estimation error under normal and non-normal conditions using four latent trait estimation procedures and also evaluated whether the test composition, in terms of item difficulty level, reduces estimation error. Most importantly, both true and estimated item parameters were examined to disentangle the effects of latent trait estimation error from item parameter estimation error. Results revealed that non-normal latent trait distributions produced a considerably larger degree of latent trait estimation error than normal data. Estimated item parameters tended to have comparable precision to true item parameters, thus suggesting that increased latent trait estimation error results from latent trait estimation rather than item parameter estimation.
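One latent trait estimator commonly compared in studies of this kind is the expected a posteriori (EAP) estimator, sketched below for a 2PL model; the prior weights are the natural place where a non-normal latent density would enter. The 2PL choice, grid settings, and function name are ours, not necessarily the study's.

```python
import numpy as np

def eap_theta(responses, a, b, grid=np.linspace(-4, 4, 81)):
    """EAP estimate of theta for one examinee under a 2PL model with a
    standard-normal prior, using a fixed quadrature grid.

    responses: 0/1 vector; a, b: item discrimination/difficulty arrays.
    Studying a non-normal latent distribution amounts to swapping the
    prior weights below for a skewed or bimodal density.
    """
    responses = np.asarray(responses)
    p = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))    # (nodes, items)
    like = np.prod(np.where(responses == 1, p, 1.0 - p), axis=1)
    prior = np.exp(-0.5 * grid**2)                        # unnormalized N(0, 1)
    post = like * prior
    post /= post.sum()                                    # normalize posterior
    return float(np.sum(grid * post))                     # posterior mean
```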

3.
Some IRT models can be equivalently modeled in alternative frameworks such as logistic regression. Logistic regression can also model time-to-event data, which concerns the probability of an event occurring over time. Using the relation between time-to-event models and logistic regression and the relation between logistic regression and IRT, this article outlines how the nonparametric Kaplan-Meier estimator for time-to-event data can be applied to IRT data. Established Kaplan-Meier computational formulas are shown to approximate “parametric-type” item difficulty better than existing nonparametric methods, particularly in the less-well-defined scenario wherein the response function is monotonic but invariant item ordering is unreasonable. Limitations and the potential use of Kaplan-Meier within differential item functioning analysis are also discussed.
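For reference, the established Kaplan-Meier computation the article builds on is the product-limit estimator S(t) = prod(1 - d_i / n_i), sketched generically below; the article's mapping of item-response data onto "times" and "events" is its own construction and is not reproduced here.

```python
import numpy as np

def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimate.

    times: event or censoring times; events: 1 = event observed,
    0 = censored. Returns the unique event times and S(t) at each.
    """
    times, events = np.asarray(times), np.asarray(events)
    uniq = np.unique(times[events == 1])
    s, surv = 1.0, []
    for t in uniq:
        at_risk = np.sum(times >= t)                 # n_i: still at risk at t
        d = np.sum((times == t) & (events == 1))     # d_i: events at t
        s *= 1.0 - d / at_risk
        surv.append(s)
    return uniq, np.array(surv)
```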

4.
The applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet-based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four-level IRT model to simultaneously account for dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three-level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The bias in both item and person parameter estimation was not affected, but the standard error (SE) was. In some simulation conditions, the difference in classification accuracy between models was as large as 11%. An illustration using real data generally supported the model performance observed in the simulation study.

5.
Item response theory (IRT) is among the most influential measurement theories in modern educational and psychological measurement, and one of its explicit goals is to extend the family of models so that response data of any form arising in real testing can be handled. Existing models for polytomously scored items consider only item discrimination and difficulty, yet in real tests such items may also involve guessing. Building on Samejima's graded response model, this study incorporates an item pseudo-guessing parameter into the polytomous scoring model and proposes a three-parameter graded response model (3PL-GRM). Because ignoring the guessing on polytomous items spuriously inflates those items' information, the study further applies the 3PL-GRM information function to the analysis of test quality.
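One plausible way to write down the proposed 3PL-GRM is to give each of Samejima's cumulative boundary curves a pseudo-guessing floor, as in the sketch below. This parameterization is our reading of the abstract, not necessarily the authors' exact specification.

```python
import numpy as np

def grm_3pl_probs(theta, a, b, c):
    """Category probabilities for a guessing-augmented graded response
    model. b holds increasing boundary difficulties b_1 < ... < b_{m-1};
    each cumulative boundary curve is lifted by a pseudo-guessing floor c:

        P*_k(theta) = c + (1 - c) / (1 + exp(-a (theta - b_k)))

    and P(category k) = P*_k - P*_{k+1}, with P*_0 = 1 and P*_m = 0.
    This is our illustrative reading of the 3PL-GRM idea.
    """
    b = np.asarray(b, dtype=float)
    p_star = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    bounds = np.concatenate(([1.0], p_star, [0.0]))   # decreasing in k
    return bounds[:-1] - bounds[1:]                   # differences of boundaries
```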

6.
Testing goodness of fit is central to validating item response theory (IRT) models, and new procedures have been proposed. These alternatives compare observed and expected response frequencies conditional on observed total scores, and they use posterior probabilities for responses across θ levels rather than cross-classifying examinees using point estimates of θ and score responses. This research compared these alternatives with regard to their methods, properties (Type I error rates and empirical power), available research, and practical issues (computational demands, treatment of missing data, effects of sample size and sparse data, and available computer programs). The different advantages and disadvantages related to these characteristics are discussed. A simulation study provided additional information about empirical power and Type I error rates.
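The score-conditional comparison these procedures make can be sketched concretely: for each observable total score, compute the model-implied proportion answering a given item correctly and compare it with the observed proportion. The sketch below is in the spirit of statistics such as Orlando and Thissen's S-X2 rather than a faithful reimplementation of any one of them; the 2PL model, normal quadrature, and Lord-Wingersky recursion are all illustrative choices.

```python
import numpy as np

def score_distribution(p):
    """Lord-Wingersky recursion: distribution of the number-correct
    score given per-item correct probabilities p at a fixed theta."""
    dist = np.array([1.0])
    for pi in p:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * (1.0 - pi)     # item answered incorrectly
        new[1:] += dist * pi              # item answered correctly
        dist = new
    return dist

def expected_prop_correct(item, a, b, grid=np.linspace(-4, 4, 61)):
    """Model-implied P(item correct | total score) for a 2PL test: the
    expected frequencies that score-conditional fit statistics compare
    against observed frequencies."""
    w = np.exp(-0.5 * grid**2)
    w /= w.sum()                                        # N(0, 1) quadrature
    P = 1.0 / (1.0 + np.exp(-a * (grid[:, None] - b)))  # (nodes, items)
    others = [j for j in range(len(a)) if j != item]
    num, den = np.zeros(len(a) + 1), np.zeros(len(a) + 1)
    for q, wq in enumerate(w):
        den += wq * score_distribution(P[q, :])         # P(total score = s)
        num[1:] += wq * P[q, item] * score_distribution(P[q, others])
    with np.errstate(invalid="ignore", divide="ignore"):
        return num / den                                # indexed by score 0..n
```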

7.
In this article, linear item response theory (IRT) observed-score equating is compared under a generalized kernel equating framework with Levine observed-score equating for the nonequivalent groups with anchor test (NEAT) design. Interestingly, these two equating methods are closely related despite being based on different methodologies. Specifically, when using data from IRT models, linear IRT observed-score equating is virtually identical to Levine observed-score equating. This leads to the conclusion that poststratification equating based on true anchor scores can be viewed as a curvilinear analogue of Levine observed-score equating.

8.
Methods of uniform differential item functioning (DIF) detection have been extensively studied in the complete data case. However, less work has been done examining the performance of these methods when missing item responses are present. The research that has been done indicates that treating missing item responses as incorrect can lead to inflated Type I error rates (false detection of DIF). The current study builds on this prior research by investigating the utility of multiple imputation methods for missing item responses, in conjunction with standard DIF detection techniques. Results of the study support the use of multiple imputation for dealing with missing item responses. The article concludes by discussing these results alongside other research findings that support multiple imputation for item parameter estimation with missing data.
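A minimal sketch of the multiple-imputation workflow is given below, assuming a model-based imputation step and a scalar DIF statistic (for example, a Mantel-Haenszel log-odds effect) computed on each completed data set; the names impute_once, rubin_pool, and p_hat are our illustrative inventions, and the study's actual imputation models may differ.

```python
import numpy as np

def impute_once(X, p_hat, rng):
    """Fill each missing (NaN) response with a Bernoulli draw from its
    model-predicted probability p_hat (persons x items)."""
    X = np.asarray(X, dtype=float).copy()
    miss = np.isnan(X)
    X[miss] = (rng.random(miss.sum()) < p_hat[miss]).astype(float)
    return X

def rubin_pool(estimates, variances):
    """Rubin's rules for pooling a scalar DIF statistic over M imputations:
    total variance = within-imputation + (1 + 1/M) * between-imputation."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    qbar = q.mean()
    total_var = u.mean() + (1.0 + 1.0 / len(q)) * q.var(ddof=1)
    return qbar, total_var
```

In use, the ordinary DIF analysis is run once per imputed data set, and the M resulting effect estimates and their variances are pooled with Rubin's rules as above.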

9.
A rapidly expanding arena for item response theory (IRT) is attitudinal and health-outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). Meeting model assumptions is necessary to realize the benefits of IRT in this setting, however. Although local item dependence has been investigated both for polytomous items in fixed-form settings and for dichotomous items in CAT settings, no publications have applied local item dependence detection methodology to polytomous items in CAT, despite its central importance to these applications. The current research uses a simulation study to investigate the extension of two widely used pairwise statistics, Yen's Q3 statistic and Pearson's X2 statistic, to this context. The simulation design and results are contextualized throughout with a real item bank of this type from the Patient-Reported Outcomes Measurement Information System (PROMIS).
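Yen's Q3 is simple to state: for every item pair, correlate across examinees the residuals between observed responses and model-implied expectations. The dichotomous 2PL sketch below shows the core computation; for the polytomous items at issue here, the model-implied probabilities would be replaced by expected item scores under a polytomous model such as the graded response model.

```python
import numpy as np

def q3_matrix(X, theta, a, b):
    """Yen's Q3 for dichotomous items under a 2PL model: correlate the
    residuals d = x - P(theta) across persons for every item pair.
    Large off-diagonal values flag local item dependence."""
    theta = np.asarray(theta, dtype=float)
    P = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))  # persons x items
    resid = np.asarray(X, dtype=float) - P
    return np.corrcoef(resid, rowvar=False)              # items x items
```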

10.
In operational testing programs using item response theory (IRT), item parameter invariance is threatened when an item appears in a different location on the live test than it did when it was field tested. This study utilizes data from a large state's assessments to model change in Rasch item difficulty (RID) as a function of item position change, test level, test content, and item format. As a follow-up to the real data analysis, a simulation study was performed to assess the effect of item position change on equating. Results from this study indicate that item position change significantly affects change in RID. In addition, although the test construction procedures used in the investigated state seem to somewhat mitigate the impact of item position change, equating results might be impacted in testing programs where other test construction practices or equating methods are utilized.

11.
An approach called generalizability in item response modeling (GIRM) is introduced in this article. The GIRM approach essentially incorporates the sampling model of generalizability theory (GT) into the scaling model of item response theory (IRT) by making distributional assumptions about the relevant measurement facets. By specifying a random effects measurement model, and taking advantage of the flexibility of Markov Chain Monte Carlo (MCMC) estimation methods, it becomes possible to estimate GT variance components simultaneously with traditional IRT parameters. It is shown how GT and IRT can be linked together, in the context of a single-facet measurement design with binary items. Using both simulated and empirical data with the software WinBUGS, the GIRM approach is shown to produce results comparable to those from a standard GT analysis, while also producing results from a random effects IRT model.

12.
School climate surveys are widely applied in school districts across the nation to collect information about teacher efficacy, principal leadership, school safety, student activities, and so forth. They enable school administrators to understand and address many issues on campus when used in conjunction with other student and staff data. However, each district currently develops its questionnaire according to its own needs and rarely provides supporting evidence for the reliability of the items in the scale, that is, whether an individual item contributes significant information to the questionnaire. Item response theory (IRT) is a useful tool for examining how much information each item and the whole scale can provide. Our study applied IRT to examine individual items in a school climate survey and assessed the efficiency of the survey after the removal of items that contributed little to the scale. The purpose of this study is to show how IRT can be applied to empirically validate school climate surveys.
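The item-level screening described here rests on Fisher information. A minimal dichotomous (2PL) sketch, with toy parameter values of our own choosing, is shown below; polytomous survey items would use the corresponding graded-model information, but the removal logic is the same: items whose information curves are flat everywhere contribute little to the scale.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 121)
a = np.array([1.6, 0.4, 1.1])            # toy discriminations
b = np.array([-0.5, 0.0, 0.8])           # toy difficulties
info = np.array([item_information_2pl(theta, ai, bi) for ai, bi in zip(a, b)])
test_info = info.sum(axis=0)             # scale information = sum over items
# The second item (a = 0.4) adds little information anywhere on the
# scale and would be a natural removal candidate.
```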

13.
Based on concerns about the item response theory (IRT) linking approach used in the Programme for International Student Assessment (PISA) until 2012, as well as the desire to include new, more complex, interactive items with the introduction of computer-based assessment, alternative IRT linking methods were implemented in the 2015 PISA round. The new linking method is a concurrent calibration using all available data, enabling item parameters that maximize fit across all groups to be found and measurement invariance across groups to be investigated. Apart from the Rasch model that has historically been used in PISA operational analyses, we compared our method against more general IRT models that can incorporate item-by-country interactions. The results suggest that the proposed method holds promise not only for providing a strong linkage across countries and cycles but also for serving as a tool for investigating measurement invariance.

14.
Computerized adaptive testing in instructional settings
Item response theory (IRT) has most often been used in research on computerized adaptive testing (CAT). Depending on the model used, IRT requires between 200 and 1,000 examinees for estimating item parameters, so it is not practical for instructional designers to develop their own IRT-based CAT. Frick improved Wald's sequential probability ratio test (SPRT) by combining it with normative expert-systems reasoning, an approach referred to as EXSPRT-based CAT. While previous studies were based on re-enactments of historical test data, the present study is the first to examine how well these adaptive methods function in a real-time testing situation. Results indicate that the EXSPRT-I significantly reduced test lengths and was highly accurate in predicting mastery. EXSPRT is apparently a viable and practical alternative to IRT for assessing mastery of instructional objectives.
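Frick's EXSPRT augments Wald's SPRT with normative expert-systems reasoning, which this sketch does not attempt to reproduce; shown below is only the underlying SPRT mastery decision it builds on. The cut probabilities and error rates are illustrative values, not those of the study.

```python
import numpy as np

def sprt_mastery(responses, p_master=0.85, p_nonmaster=0.60,
                 alpha=0.05, beta=0.05):
    """Wald's SPRT for mastery classification on a stream of 0/1 item
    responses. Returns 'master', 'nonmaster', or 'continue'.

    p_master / p_nonmaster: probability of a correct response under the
    mastery and nonmastery hypotheses; alpha / beta: tolerated error rates.
    """
    upper = np.log((1.0 - beta) / alpha)     # decide mastery above this
    lower = np.log(beta / (1.0 - alpha))     # decide nonmastery below this
    llr = 0.0
    for x in responses:
        p1 = p_master if x else 1.0 - p_master
        p0 = p_nonmaster if x else 1.0 - p_nonmaster
        llr += np.log(p1 / p0)               # accumulate log-likelihood ratio
        if llr >= upper:
            return "master"
        if llr <= lower:
            return "nonmaster"
    return "continue"                        # item pool exhausted, undecided
```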

15.
In applications of item response theory (IRT), fixed parameter calibration (FPC) has been used to estimate the item parameters of a new test form on the existing ability scale of an item pool. This paper presents an application of FPC to test data from multiple examinee groups that are linked to the item pool via anchor items, and it investigates the performance of FPC relative to an alternative approach, namely independent 0–1 calibration and scale linking. Two designs for linking to the pool are proposed that involve multiple groups and test forms, for which multiple-group FPC can be used effectively. A real-data study shows that the multiple-group FPC method performs similarly to the alternative method in estimating ability distributions and new item parameters on the scale of the item pool. In addition, a simulation study shows that the multiple-group FPC method performs as well as or better than the alternative method in recovering the underlying ability distributions and the new item parameters.

16.
This study applied the Bayesian IRT guessing models of Cao Jing et al. to analyze the guessing behavior of eighth-grade students on a Chinese vocabulary test, used the DIC3 index to evaluate model fit, and compared the parameter estimates with those from the two-parameter logistic model. The study found that (1) the guessing models fit better than the two-parameter logistic model; (2) the eighth-grade test data were best fit by the threshold guessing model (IRT-TG), with about 3.5% of students exhibiting TG-type guessing behavior; and (3) the presence of guessers noticeably affects the ability estimates of the guessers themselves and the item difficulty estimates, but has little effect on non-guessers' ability estimates or on discrimination parameter estimates.

17.
Background: Although on-demand testing is being increasingly used in many areas of assessment, it has not been adopted in high-stakes examinations like the General Certificate of Secondary Education (GCSE) and General Certificate of Education Advanced level (GCE A level) offered by awarding organisations (AOs) in the UK. One of the major issues with on-demand testing is that some of the methods used for maintaining the comparability of standards over time in conventional testing are no longer available, and the development of new methods is required.

Purpose: This paper proposes an item response theory (IRT) framework for implementing on-demand testing and maintaining the comparability of standards over time for general qualifications, including GCSEs and GCE A levels, in the UK and discusses procedures for its practical implementation.

Sources of evidence: Sources of evidence include literature from the fields of on-demand testing, the design of computer-based assessment, the development of IRT, and the application of IRT in educational measurement.

Main argument: On-demand testing presents many advantages over conventional testing. In view of the nature of general qualifications, including the use of multiple components and multiple question types, the advances made in item response modelling over the past 30 years, and the availability of complex IRT analysis software systems, coupled with increasing IRT expertise in awarding organisations, IRT models could be used to implement on-demand testing in high-stakes examinations in the UK. The proposed framework represents a coherent and complete approach to maintaining standards in on-demand testing. The procedures for implementing the framework discussed in the paper could be adapted by practitioners to suit their own needs and circumstances.

Conclusions: The use of IRT to implement on-demand testing could prove to be one of the viable approaches to maintaining standards over time or between test sessions for UK general qualifications.

18.
Using factor analysis, we conducted an assessment of multidimensionality for 6 forms of the Law School Admission Test (LSAT) and found 2 subgroups of items or factors for each of the 6 forms. The main conclusion of the factor analysis component of this study was that the LSAT appears to measure 2 different reasoning abilities: inductive and deductive. The technique of N. J. Dorans & N. M. Kingston (1985) was used to examine the effect of dimensionality on equating. We began by calibrating (with item response theory [IRT] methods) all items on a form to obtain Set I of estimated IRT item parameters. Next, the test was divided into 2 homogeneous subgroups of items, each having been determined to represent a different ability (i.e., inductive or deductive reasoning). The items within these subgroups were then recalibrated separately to obtain item parameter estimates, and then combined into Set II. The estimated item parameters and true-score equating tables for Sets I and II corresponded closely.

19.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for IRT true score equating (both direct and indirect) are quite small, except for the mean/sigma method. The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than those for the equating functions based on the moment methods. In addition, IRT true score equating is compared to chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than those for IRT true score equating, with an exception at low scores.
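The computational core of IRT true score equating is compact, as the 3PL sketch below shows: a number-correct score on form X is mapped to theta through the inverse of X's test characteristic curve, then to form Y's score scale. The scale transformation step that this study varies (mean/sigma, characteristic curve methods, or concurrent calibration) happens beforehand and is assumed complete here; parameter names are ours.

```python
import numpy as np
from scipy.optimize import brentq

def true_score(theta, a, b, c):
    """3PL test characteristic curve: expected number-correct score."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return float(p.sum())

def irt_true_score_equate(score_x, ax, bx, cx, ay, by, cy):
    """IRT true score equating: invert form X's test characteristic curve
    at score_x, then evaluate form Y's curve at that theta. Assumes both
    forms' parameters are already on a common scale."""
    lo, hi = float(np.sum(cx)), float(len(ax))   # invertible true-score range
    if not lo < score_x < hi:
        raise ValueError("score outside the invertible true-score range")
    theta = brentq(lambda t: true_score(t, ax, bx, cx) - score_x, -8.0, 8.0)
    return true_score(theta, ay, by, cy)
```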

20.
As low-stakes testing contexts increase, low test-taking effort may serve as a serious validity threat. One common solution to this problem is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated item response theory (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., the two-parameter logistic [2PL] model) in parameter estimation under simulated conditions, prior research has failed to examine its performance under violations of the model's assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (Assumption 1) and is unrelated to the underlying ability of examinees (Assumption 2). Results demonstrated that, across conditions, the EM-IRT model provided item parameter estimates that were robust to violations of Assumption 1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when Assumption 2 was violated; nonetheless, these values were still lower than those of the 2PL model. In terms of mean ability estimates, results indicated equal performance between the EM-IRT and 2PL models across conditions. For both models, mean ability estimates were biased by more than 0.25 SDs when Assumption 2 was violated. However, our accompanying empirical study suggested that this biasing occurred under extreme conditions that may not be present in some operational settings. Overall, these results suggest that, compared with the 2PL model under realistic violations, the EM-IRT model provides superior item parameter estimates and equally accurate mean ability estimates.
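The "treat as missing" scoring step of the EM-IRT model can be sketched as a 2PL likelihood computed over effortful responses only, as below. The response-time flagging rule and all names are our illustrative choices; operational implementations set per-item time thresholds in various ways.

```python
import numpy as np

def em_irt_loglik(theta, x, rt, a, b, rt_threshold):
    """Effort-moderated 2PL log-likelihood for one examinee: responses
    with response time below the item's threshold are flagged as
    noneffortful and dropped from the likelihood (treated as missing).

    x: 0/1 responses; rt: response times; rt_threshold: scalar or
    per-item array. How thresholds are chosen (e.g., a fraction of
    typical item time) is an analyst decision not specified here."""
    x, rt = np.asarray(x), np.asarray(rt)
    effortful = rt >= rt_threshold                    # keep only effortful items
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    ll = np.where(x == 1, np.log(p), np.log(1.0 - p))
    return float(ll[effortful].sum())
```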
