首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Functional form misfit is frequently a concern in item response theory (IRT), although the practical implications of misfit are often difficult to evaluate. In this article, we illustrate how seemingly negligible amounts of functional form misfit, when systematic, can be associated with significant distortions of the score metric in vertical scaling contexts. Our analysis uses two‐ and three‐parameter versions of Samejima's logistic positive exponent model (LPE) as a data generating model. Consistent with prior work, we find LPEs generally provide a better comparative fit to real item response data than traditional IRT models (2PL, 3PL). Further, our simulation results illustrate how 2PL‐ or 3PL‐based vertical scaling in the presence of LPE‐induced misspecification leads to an artificial growth deceleration across grades, consistent with that commonly seen in vertical scaling studies. The results raise further concerns about the use of standard IRT models in measuring growth, even apart from the frequently cited concerns of construct shift/multidimensionality across grades.  相似文献   

2.
Mixture Rasch models have been used to study a number of psychometric issues such as goodness of fit, response strategy differences, strategy shifts, and multidimensionality. Although these models offer the potential for improving understanding of the latent variables being measured, under some conditions overextraction of latent classes may occur, potentially leading to misinterpretation of results. In this study, a mixture Rasch model was applied to data from a statewide test that was initially calibrated to conform to a 3‐parameter logistic (3PL) model. Results suggested how latent classes could be explained and also suggested that these latent classes might be due to applying a mixture Rasch model to 3PL data. To support this latter conjecture, a simulation study was presented to demonstrate how data generated to fit a one‐class 2‐parameter logistic (2PL) model required more than one class when fit with a mixture Rasch model.  相似文献   

3.
In the presence of test speededness, the parameter estimates of item response theory models can be poorly estimated due to conditional dependencies among items, particularly for end‐of‐test items (i.e., speeded items). This article conducted a systematic comparison of five‐item calibration procedures—a two‐parameter logistic (2PL) model, a one‐dimensional mixture model, a two‐step strategy (a combination of the one‐dimensional mixture and the 2PL), a two‐dimensional mixture model, and a hybrid model‐–by examining how sample size, percentage of speeded examinees, percentage of missing responses, and way of scoring missing responses (incorrect vs. omitted) affect the item parameter estimation in speeded tests. For nonspeeded items, all five procedures showed similar results in recovering item parameters. For speeded items, the one‐dimensional mixture model, the two‐step strategy, and the two‐dimensional mixture model provided largely similar results and performed better than the 2PL model and the hybrid model in calibrating slope parameters. However, those three procedures performed similarly to the hybrid model in estimating intercept parameters. As expected, the 2PL model did not appear to be as accurate as the other models in recovering item parameters, especially when there were large numbers of examinees showing speededness and a high percentage of missing responses with incorrect scoring. Real data analysis further described the similarities and differences between the five procedures.  相似文献   

4.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.  相似文献   

5.
In a previous simulation study of methods for assessing differential item functioning (DIF) in computer-adaptive tests (Zwick, Thayer, & Wingersky, 1993, 1994), modified versions of the Mantel-Haenszel and standardization methods were found to perform well. In that study, data were generated using the 3-parameter logistic (3PL) model and this same model was assumed in obtaining item parameter estimates. In the current study, the 3PL data were used but the Rasch model was assumed in obtaining the item parameter estimates, which determined the information table used for item selection. Although the obtained DIF statistics were highly correlated with the generating DIF values, they tended to be smaller in magnitude than in the 3PL analysis, resulting in a lower probability of DIF detection. This reduced sensitivity appeared to be related to a degradation in the accuracy of matching. Expected true scores from the Rasch-based computer-adaptive test tended to be biased downward, particularly for lower-ability examinees  相似文献   

6.
In one study, parameters were estimated for constructed-response (CR) items in 8 tests from 4 operational testing programs using the l-parameter and 2- parameter partial credit (IPPC and 2PPC) models. Where multiple-choice (MC) items were present, these models were combined with the 1-parameter and 3-parameter logistic (IPL and 3PL) models, respectively. We found that item fit was better when the 2PPC model was used alone or with the 3PL model. Also, the slopes of the CR and MC items were found to differ substantially. In a second study, item parameter estimates produced using the IPL-IPPC and 3PL-2PPC model combinations were evaluated for fit to simulated data generated using true parameters known to fit one model combination or ttle other. The results suggested that the more flexible 3PL-2PPC model combination would produce better item fit than the IPL-1PPC combination.  相似文献   

7.
This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item‐level multidimensionality, and (2) whether a Projection IRT model can provide a useful remedy. A real‐data example is used to illustrate the problem and also is used as a base model for a simulation study. The results suggest that ignoring item‐level multidimensionality might lead to inflated item discrimination parameter estimates when the proportion of multidimensional test items to unidimensional test items is as low as 1:5. The Projection IRT model appears to be a useful tool for updating unidimensional item parameter estimates of multidimensional test items for a purified unidimensional interpretation.  相似文献   

8.
A Monte Carlo simulation technique for generating dichotomous item scores is presented that implements (a) a psychometric model with different explicit assumptions than traditional parametric item response theory (IRT) models, and (b) item characteristic curves without restrictive assumptions concerning mathematical form. The four-parameter beta compound-binomial (4PBCB) strong true score model (with two-term approximation to the compound binomial) is used to estimate and generate the true score distribution. The nonparametric item-true score step functions are estimated by classical item difficulties conditional on proportion-correct total score. The technique performed very well in replicating inter-item correlations, item statistics (point-biserial correlation coefficients and item proportion-correct difficulties), first four moments of total score distribution, and coefficient alpha of three real data sets consisting of educational achievement test scores. The technique replicated real data (including subsamples of differing proficiency) as well as the three-parameter logistic (3PL) IRT model (and much better than the 1PL model) and is therefore a promising alternative simulation technique. This 4PBCB technique may be particularly useful as a more neutral simulation procedure for comparing methods that use different IRT models.  相似文献   

9.
Abstract

Multilevel Rasch models are increasingly used to estimate the relationships between test scores and student and school factors. Response data were generated to follow one-, two-, and three-parameter logistic (1PL, 2PL, 3PL) models, but the Rasch model was used to estimate the latent regression parameters. When the response functions followed 2PL or 3PL models, the proportion of variance explained in test scores by the simulated student or school predictors was estimated accurately with a Rasch model. Proportion of variance within and between schools was also estimated accurately. The regression coefficients were misestimated unless they were rescaled out of logit units. However, item-level parameters, such as DIF effects, were biased when the Rasch model was violated, similar to single-level models.  相似文献   

10.
As low-stakes testing contexts increase, low test-taking effort may serve as a serious validity threat. One common solution to this problem is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated item response theory (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., two-parameter logistic [2PL]) in parameter estimation under simulated conditions, prior research has failed to examine its performance under violations to the model’s assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (Assumption 1) and is unrelated to the underlying ability of examinees (Assumption 2). Results demonstrated that, across conditions, the EM-IRT model provided robust item parameter estimates to violations of Assumption 1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when violating Assumption 2; nonetheless, these values were still lower than the 2PL model. In terms of mean ability estimates, model results indicated equal performance between the EM-IRT and 2PL models across conditions. Across both models, mean ability estimates were found to be biased by more than 0.25 SDs when violating Assumption 2. However, our accompanying empirical study suggested that this biasing occurred under extreme conditions that may not be present in some operational settings. Overall, these results suggest that the EM-IRT model provides superior item and equal mean ability parameter estimates in the presence of model violations under realistic conditions when compared with the 2PL model.  相似文献   

11.
The applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet‐based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four‐level IRT model to simultaneously account for dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov Chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three‐level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The bias in both item and person parameter estimation was not affected but the standard error (SE) was affected. In some simulation conditions, the difference in classification accuracy between models could go up to 11%. The illustration using the real data generally supported model performance observed in the simulation study.  相似文献   

12.
This study demonstrated the equivalence between the Rasch testlet model and the three‐level one‐parameter testlet model and explored the Markov Chain Monte Carlo (MCMC) method for model parameter estimation in WINBUGS. The estimation accuracy from the MCMC method was compared with those from the marginalized maximum likelihood estimation (MMLE) with the expectation‐maximization algorithm in ConQuest and the sixth‐order Laplace approximation estimation in HLM6. The results indicated that the estimation methods had significant effects on the bias of the testlet variance and ability variance estimation, the random error in the ability parameter estimation, and the bias in the item difficulty parameter estimation. The Laplace method best recovered the testlet variance while the MMLE best recovered the ability variance. The Laplace method resulted in the smallest random error in the ability parameter estimation while the MCMC method produced the smallest bias in item parameter estimates. Analyses of three real tests generally supported the findings from the simulation and indicated that the estimates for item difficulty and ability parameters were highly correlated across estimation methods.  相似文献   

13.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items and at the same time keeps a large amount of linking items.  相似文献   

14.
This article concerns the simultaneous assessment of DIF for a collection of test items. Rather than an average or sum in which positive and negative DIF may cancel, we propose an index that measures the variance of DIF on a test as an indicator of the degree to which different items show DIF in different directions. It is computed from standard Mantel-Haenszel statistics (the logodds ratio and its variance error) and may be conceptually classified as a variance component or variance effect size. Evaluated by simulation under three item response models (IPL, 2PL, and 3PL), the index is shown to be an accurate estimate of the DTF generating parameter in the case of the 1PL and 2PL models with groups of equal ability. For groups of unequal ability, the index is accurate under the I PL but not the 2PL condition; however, a weighted version of the index provides improved estimates. For the 3PL condition, the DTF generating parameter is underestimated. This latter result is due in part to a mismatch in the scales of the log-odds ratio and IRT difficulty.  相似文献   

15.
In structural equation models, outliers could result in inaccurate parameter estimates and misleading fit statistics when using traditional methods. To robustly estimate structural equation models, iteratively reweighted least squares (IRLS; Yuan & Bentler, 2000) has been proposed, but not thoroughly examined. We explore the large-sample properties of IRLS and its effect on parameter recovery, model fit, and aberrant data identification. A parametric bootstrap technique is proposed to determine the tuning parameters of IRLS, which results in improved Type I error rates in aberrant data identification, for data sets generated from homogenous populations. Scenarios concerning (a) simulated data, (b) contaminated data, and (c) a real data set are studied. Results indicate good parameter recovery, model fit, and aberrant data identification when noisy observations are drawn from a real data set, but lackluster parameter recovery and identification of aberrant data when the noise is parametrically structured. Practical implications and further research are discussed.  相似文献   

16.
In this article we present a general approach not relying on item response theory models (non‐IRT) to detect differential item functioning (DIF) in dichotomous items with presence of guessing. The proposed nonlinear regression (NLR) procedure for DIF detection is an extension of method based on logistic regression. As a non‐IRT approach, NLR can be seen as a proxy of detection based on the three‐parameter IRT model which is a standard tool in the study field. Hence, NLR fills a logical gap in DIF detection methodology and as such is important for educational purposes. Moreover, the advantages of the NLR procedure as well as comparison to other commonly used methods are demonstrated in a simulation study. A real data analysis is offered to demonstrate practical use of the method.  相似文献   

17.
This study examines how the Early Years Educators at Play (EYEPlay) professional development (PD) programme supported inclusive learning settings for all children, including English language learners and students with disabilities. The EYEPlay PD model is a year‐long programme that integrates drama strategies into literacy practices within real‐classroom contexts. Inclusive education refers to ensuring equal opportunities to access and participation in learning activities for all students. Cultural‐historical activity theory was used to understand and unpack the drama practices. Twelve semi‐structured focus group interviews were conducted with 19 preschool teachers. The data were analysed via constant‐comparative and interpretive methods. The study findings showed that EYEPlay PD practices enhanced inclusive learning settings for diverse groups of students by increasing access and expanding opportunities to learn, and supporting a positive learning environment.  相似文献   

18.
Drawing valid inferences from modern measurement models is contingent upon a good fit of the data to the model. Violations of model‐data fit have numerous consequences, limiting the usefulness and applicability of the model. As Bayesian estimation is becoming more common, understanding the Bayesian approaches for evaluating model‐data fit models is critical. In this instructional module, Allison Ames and Aaron Myers provide an overview of Posterior Predictive Model Checking (PPMC), the most common Bayesian model‐data fit approach. Specifically, they review the conceptual foundation of Bayesian inference as well as PPMC and walk through the computational steps of PPMC using real‐life data examples from simple linear regression and item response theory analysis. They provide guidance for how to interpret PPMC results and discuss how to implement PPMC for other model(s) and data. The digital module contains sample data, SAS code, diagnostic quiz questions, data‐based activities, curated resources, and a glossary.  相似文献   

19.
In this article 2 major problems of using the three‐wave quasi simplex model to obtain reliability estimates are illustrated. The 1st problem is that the sampling variance of the reliability estimates can be very large, especially if the stability through time is low. The 2nd problem is that, for the reliability parameter to be identified, the model assumes a particular change process, namely a Markov process. We show that minor violations of this assumption can lead to a large bias in the reliability estimates. The problems are evaluated using both real and Monte Carlo data. A model with repeated measurements in 1 of the waves is also discussed.  相似文献   

20.
A shared document‐based annotation tool was presented, and its usefulness in two different real‐life web‐based university‐level courses (adult learners, n= 27 and adolescent learners, n= 23) was empirically investigated. The study design embodied three data collection phases: (1) a pretest measuring self‐rated motivation, learning strategies, and social ability; (2) log file data analysis showing actual use of the system features; and (3) a posttest in a form of an email survey. For both groups, the results showed that the level of motivation has a positive effect on activity in the system and the final grade. The learners, who reported to have good time‐management strategies, were the most active users of the system. The level of social ability predicted both the number of consecutive comments in the documents and the threads in document‐related newsgroup discussions. Log file data analysis showed that user activity in the system was positively related to the final grade in both samples. Results of the posttest showed that all the respondents agreed when asked: (1) if the system brought added value to the learning process; (2) if the use of the system changed their studying habits favourably; and (3) if they would like to use the system in other courses.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号