Similar Literature
Found 20 similar documents (search time: 31 ms)
1.
Applied Measurement in Education, 2013, 26(4): 291-312
This study compares three procedures for the detection of differential item functioning (DIF) under item response theory (IRT): (a) Lord's chi-square, (b) Raju's area measures, and (c) the likelihood ratio test. Relations among the three procedures and some practical considerations, such as linking metrics and scale purification, are discussed. Data from two forms of a university mathematics placement test were analyzed to examine the congruence among the three procedures. Results indicated that there was close agreement among the three DIF detection procedures.

2.
The purpose of this study was to investigate multidimensional DIF with simple and nonsimple structure in the context of the multidimensional graded response model (MGRM). The study examined and compared the performance of the IRT-LR and Wald tests using MML-EM and MHRM estimation approaches under different test factors and test structures, both in simulation studies and in applications to real data sets. When the test structure included two dimensions, the IRT-LR (MML-EM) generally performed better than the Wald test and provided higher power rates. When the test included three dimensions, the methods performed similarly in DIF detection. In contrast, when the number of dimensions in the test was four, MML-EM estimation completely lost precision in estimating nonuniform DIF, even with large sample sizes. The Wald test with MHRM estimation outperformed the Wald test (MML-EM) and IRT-LR (MML-EM). The Wald test had higher power rates and acceptable Type I error rates for nonuniform DIF with the MHRM estimation approach. Small and/or unbalanced sample sizes, small DIF magnitudes, unequal ability distributions between groups, number of dimensions, estimation methods, and test structure were evaluated as important test factors for detecting multidimensional DIF.

3.
Student growth percentiles (SGPs; Betebenner, 2009) are used to locate a student's current score in a conditional distribution based on the student's past scores. Currently, following Betebenner (2009), quantile regression (QR) is most often used operationally to estimate SGPs. Alternatively, multidimensional item response theory (MIRT) may also be used to estimate SGPs, as proposed by Lockwood and Castellano (2015). A benefit of using MIRT to estimate SGPs is that techniques and methods already developed for MIRT may readily be applied to the specific context of SGP estimation and inference. This research adopts a MIRT framework to explore the reliability of SGPs. More specifically, we propose a straightforward method for estimating SGP reliability. In addition, we use this measure to study how SGP reliability is affected by two key factors: the correlation between prior and current latent achievement scores, and the number of prior years included in the SGP analysis. These issues are primarily explored via simulated data. In addition, the QR and MIRT approaches are compared in an empirical application.

4.
In the logistic regression (LR) procedure for differential item functioning (DIF), the parameters of LR have often been estimated using maximum likelihood (ML) estimation. However, ML estimation suffers from finite-sample bias. Furthermore, ML estimation for LR can be substantially biased in the presence of rare event data. The bias of ML estimation due to small samples and rare event data can degrade the performance of the LR procedure, especially when testing the DIF of difficult items in small samples. Penalized ML (PML) estimation was originally developed to reduce the finite-sample bias of conventional ML estimation and is also known to reduce the bias in the estimation of LR for rare event data. The goal of this study is to compare the performance of the LR procedures based on ML and PML estimation in terms of statistical power and Type I error. In a simulation study, Swaminathan and Rogers's Wald test based on PML estimation (PSR) showed the highest statistical power in most of the simulation conditions, and the LRT based on conventional PML estimation (PLRT) showed the most robust and stable Type I error. The trade-off between bias and variance is addressed in the discussion section.

5.
Analyzing examinees’ responses using cognitive diagnostic models (CDMs) has the advantage of providing diagnostic information. To ensure the validity of the results from these models, differential item functioning (DIF) in CDMs needs to be investigated. In this article, the Wald test is proposed to examine DIF in the context of CDMs. This study explored the effectiveness of the Wald test in detecting both uniform and nonuniform DIF in the DINA model through a simulation study. Results of this study suggest that for relatively discriminating items, the Wald test had Type I error rates close to the nominal level. Moreover, its viability was underscored by the medium to high power rates for most investigated DIF types when DIF size was large. Furthermore, the performance of the Wald test in detecting uniform DIF was compared to that of the traditional Mantel‐Haenszel (MH) and SIBTEST procedures. The results of the comparison study showed that the Wald test was comparable to or outperformed the MH and SIBTEST procedures. Finally, the strengths and limitations of the proposed method and suggestions for future studies are discussed.

6.
The assessment of differential item functioning (DIF) is routinely conducted to ensure test fairness and validity. Although many DIF assessment methods have been developed in the context of classical test theory and item response theory, they are not applicable for cognitive diagnosis models (CDMs), as the underlying latent attributes of CDMs are multidimensional and binary. This study proposes a very general DIF assessment method in the CDM framework which is applicable for various CDMs, more than two groups of examinees, and multiple grouping variables that are categorical, continuous, observed, or latent. The parameters can be estimated with Markov chain Monte Carlo algorithms implemented in the freeware WinBUGS. Simulation results demonstrated a good parameter recovery and advantages in DIF assessment for the new method over the Wald method.

7.
In typical differential item functioning (DIF) assessments, an item's DIF status is not influenced by its status in previous test administrations. An item that has shown DIF at multiple administrations may be treated the same way as an item that has shown DIF in only the most recent administration. Therefore, much useful information about the item's functioning is ignored. In earlier work, we developed the Bayesian updating (BU) DIF procedure for dichotomous items and showed how it could be used to formally aggregate DIF results over administrations. More recently, we extended the BU method to the case of polytomously scored items. We conducted an extensive simulation study that included four “administrations” of a test. For the single‐administration case, we compared the Bayesian approach to an existing polytomous‐DIF procedure. For the multiple‐administration case, we compared BU to two non‐Bayesian methods of aggregating the polytomous‐DIF results over administrations. We concluded that both the BU approach and a simple non‐Bayesian method show promise as methods of aggregating polytomous DIF results over administrations.

8.
To assess item dimensionality, the following two approaches are described and compared: the hierarchical generalized linear model (HGLM) and the multidimensional item response theory (MIRT) model. Two generating models are used to simulate dichotomous responses to a 17-item test: the unidimensional and compensatory two-dimensional (C2D) models. For C2D data, seven items are modeled to load on the first and second factors, θ1 and θ2, with the remaining 10 items modeled unidimensionally, emulating a mathematics test with seven items requiring an additional reading ability dimension. For both types of generated data, the multidimensionality of item responses is investigated using HGLM and MIRT. Comparison of the HGLM and MIRT results is possible through a transformation of the items' difficulty estimates into probabilities of a correct response for a hypothetical examinee at the mean on θ1 and θ2. HGLM and MIRT performed similarly. The benefits of HGLM for item dimensionality analyses are discussed.
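
As an illustration of the compensatory structure mentioned in this abstract, the following is a minimal sketch of a compensatory two-dimensional item response function. The slope-intercept parameterization and the names `c2d_probability`, `a1`, `a2`, and `d` are assumptions made for illustration; the study itself may use a different parameterization.

```python
import math

def c2d_probability(theta1, theta2, a1, a2, d):
    """Probability of a correct response under a compensatory
    two-dimensional logistic model (assumed slope-intercept form).

    a1, a2 are hypothetical discriminations on the two dimensions;
    d is a hypothetical intercept (easiness) parameter.
    """
    # Abilities combine additively, so a deficit on one dimension
    # can be offset by a surplus on the other.
    z = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-z))

# Examinee at the mean of both dimensions: the probability depends only on d.
p_mean = c2d_probability(0.0, 0.0, a1=1.2, a2=0.8, d=0.0)

# Compensation: high theta1 offsets low theta2 for the same item.
p_comp = c2d_probability(1.0, -1.5, a1=1.2, a2=0.8, d=0.0)
```

Under this assumed form, evaluating the function at θ1 = θ2 = 0 gives the kind of "probability for a hypothetical examinee at the mean" that the abstract uses to put the two approaches on a comparable footing.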

9.
Although logistic regression has become one of the well‐known methods for detecting differential item functioning (DIF), its three statistical tests, the Wald, likelihood ratio (LR), and score tests, which are readily available under maximum likelihood, do not seem to be consistently distinguished in the DIF literature. This paper provides a clarifying note on those three tests when logistic regression is applied for DIF detection.

10.
Detection of differential item functioning (DIF) on items intentionally constructed to favor one group over another was investigated on item parameter estimates obtained from two item response theory-based computer programs, LOGIST and BILOG. Signed- and unsigned-area measures based on joint maximum likelihood estimation, marginal maximum likelihood estimation, and two marginal maximum a posteriori estimation procedures were compared with each other to determine whether detection of DIF could be improved using prior distributions. Results indicated that item parameter estimates obtained using either prior condition were less deviant than when priors were not used. Differences in detection of DIF appeared to be related to item parameter estimation condition and to some extent to sample size.

11.
Applied Measurement in Education, 2013, 26(2): 175-199
This study used three different differential item functioning (DIF) detection procedures to examine the extent to which items in a mathematics performance assessment functioned differently for matched gender groups. In addition to examining the appropriateness of individual items in terms of DIF with respect to gender, an attempt was made to identify factors (e.g., content, cognitive processes, differences in ability distributions, etc.) that may be related to DIF. The QUASAR (Quantitative Understanding: Amplifying Student Achievement and Reasoning) Cognitive Assessment Instrument (QCAI) is designed to measure students' mathematical thinking and reasoning skills and consists of open-ended items that require students to show their solution processes and provide explanations for their answers. In this study, 33 polytomously scored items, which were distributed within four test forms, were evaluated with respect to gender-related DIF. The data source was sixth- and seventh-grade student responses to each of the four test forms administered in the spring of 1992 at all six school sites participating in the QUASAR project. The sample consisted of 1,782 students with approximately equal numbers of female and male students. The results indicated that DIF may not be serious for 31 of the 33 items (94%) in the QCAI. For the two items that were detected as functioning differently for male and female students, several plausible factors for DIF were discussed. The results from the secondary analyses, which removed the mutual influence of the two items, indicated that DIF in one item, PPP1, which favored female students rather than their matched male students, was of particular concern. These secondary analyses suggest that the detection of DIF in the other item in the original analysis may have been due to the influence of Item PPP1 because they were both in the same test form.

12.
Mantel-Haenszel and SIBTEST, which have known difficulty in detecting non-unidirectional differential item functioning (DIF), have been adapted with some success for computerized adaptive testing (CAT). This study adapts logistic regression (LR) and the item-response-theory-likelihood-ratio test (IRT-LRT), capable of detecting both unidirectional and non-unidirectional DIF, to the CAT environment in which pretest items are assumed to be seeded in CATs but not used for trait estimation. The proposed adaptation methods were evaluated with simulated data under different sample size ratios and impact conditions in terms of Type I error, power, and specificity in identifying the form of DIF. The adapted LR and IRT-LRT procedures are more powerful than the CAT version of SIBTEST for non-unidirectional DIF detection. The good Type I error control provided by IRT-LRT under extremely unequal sample sizes and large impact is encouraging. Implications of these and other findings are discussed.

13.
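Students' mathematical literacy has a multidimensional structure, so literacy-oriented assessments of mathematics achievement need to report examinees' performance on each dimension rather than only a single total score. Taking the PISA mathematical literacy structure as the theoretical model and multidimensional item response theory (MIRT) as the measurement model, this study used the MIRT package in R to process and analyze item data from a Grade 8 mathematical literacy assessment in one region, in order to study the multidimensional measurement of mathematical literacy. The results show that MIRT combines the advantages of unidimensional item response theory and factor analysis: it can be used to analyze a test's construct validity and item quality, and to provide a multidimensional cognitive diagnosis of examinees' abilities.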

14.
This article defines and demonstrates a framework for studying differential item functioning (DIF) and differential test functioning (DTF) for tests that are intended to be multidimensional. The procedure introduced here is an extension of unidimensional differential functioning of items and tests (DFIT) recently developed by Raju, van der Linden, and Fleer (1995). To demonstrate the usefulness of these new indexes in a multidimensional IRT setting, two-dimensional data were simulated with known item parameters and known DIF and DTF. The DIF and DTF indexes were recovered reasonably well under various distributional differences of θs after multidimensional linking was applied to put the two sets of item parameters on a common scale. Further studies are suggested in the area of DIF/DTF for intentionally multidimensional tests.

15.
Multidimensional item response theory (MIRT) provides an ideal foundation for modeling performance in complex domains, taking into account multiple basic abilities simultaneously, and representing different mixtures of the abilities required for different test items. This article provides a brief overview of different MIRT models, and the substantive implications of their differences for educational assessment. To illustrate the flexibility and benefits of MIRT, three application scenarios are described: to account for unintended multidimensionality when measuring a unidimensional construct, to model latent covariance structures between ability dimensions, and to model interactions of multiple abilities required for solving specific test items. All of these scenarios are illustrated by empirical examples. Finally, the implications of using MIRT models on educational processes are discussed.

16.
Many educational and psychological tests are inherently multidimensional, meaning these tests measure two or more dimensions or constructs. The purpose of this module is to illustrate how test practitioners and researchers can apply multidimensional item response theory (MIRT) to understand better what their tests are measuring, how accurately the different composites of ability are being assessed, and how this information can be cycled back into the test development process. Procedures for conducting MIRT analyses (from obtaining evidence that the test is multidimensional, to modeling the test as multidimensional, to illustrating the properties of multidimensional items graphically) are described from both a theoretical and a substantive basis. This module also illustrates these procedures using data from a ninth-grade mathematics achievement test. It concludes with a discussion of future directions in MIRT research.

17.
Nambury S. Raju (1937–2005) developed two model‐based indices for differential item functioning (DIF) during his prolific career in psychometrics. Both methods, Raju's area measures (Raju, 1988) and Raju's DFIT (Raju, van der Linden, & Fleer, 1995), are based on quantifying the gap between item characteristic functions (ICFs). This approach provides an intuitive and flexible methodology for assessing DIF. The purpose of this tutorial is to explain DFIT and show how this methodology can be utilized in a variety of DIF applications.
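
The idea of "quantifying the gap between item characteristic functions" can be sketched numerically. The example below is an illustrative approximation, not Raju's closed-form derivation: it assumes two-parameter logistic ICFs with scaling constant D = 1.7, and integrates the gap over a finite ability range with the midpoint rule. The names `icf` and `area_measures` and all parameter values are hypothetical.

```python
import math

def icf(theta, a, b, D=1.7):
    """Two-parameter logistic item characteristic function (assumed model)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def area_measures(ref, foc, lo=-10.0, hi=10.0, steps=20000):
    """Approximate the signed and unsigned areas between the
    reference-group and focal-group ICFs over [lo, hi].

    ref and foc are (a, b) tuples of hypothetical item parameters.
    """
    width = (hi - lo) / steps
    signed = unsigned = 0.0
    for i in range(steps):
        theta = lo + (i + 0.5) * width           # midpoint of the i-th slice
        gap = icf(theta, *ref) - icf(theta, *foc)
        signed += gap * width                    # positive and negative gaps cancel
        unsigned += abs(gap) * width             # all gaps accumulate
    return signed, unsigned

# Uniform DIF: same discrimination, focal-group difficulty shifted by 0.5.
signed_u, unsigned_u = area_measures(ref=(1.0, 0.0), foc=(1.0, 0.5))

# Nonuniform DIF: the curves cross, so the unsigned area exceeds the signed one.
signed_n, unsigned_n = area_measures(ref=(1.0, 0.0), foc=(1.5, 0.5))
```

With equal discriminations the curves never cross, so both areas approach the difficulty difference between the groups; when discriminations differ, the signed area can understate DIF because gaps of opposite sign cancel.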

18.
Trend estimation in international comparative large‐scale assessments relies on measurement invariance between countries. However, cross‐national differential item functioning (DIF) has been repeatedly documented. We ran a simulation study using national item parameters, which required trends to be computed separately for each country, to compare trend estimation performances to two linking methods employing international item parameters across several conditions. The trend estimates based on the national item parameters were more accurate than the trend estimates based on the international item parameters when cross‐national DIF was present. Moreover, the use of fixed common item parameter calibrations led to biased trend estimates. The detection and elimination of DIF can reduce this bias but is also likely to increase the total error.

19.
In this article, I address two competing conceptions of differential item functioning (DIF) in polytomously scored items. The first conception, referred to as net DIF, concerns between-group differences in the conditional expected value of the polytomous response variable. The second conception, referred to as global DIF, concerns the conditional dependence of group membership and the polytomous response variable. The distinction between net and global DIF is important because different DIF evaluation methods are appropriate for net and global DIF; no currently available method is universally the best for detecting both net and global DIF. Net and global DIF definitions are presented under two different, yet compatible, modeling frameworks: a traditional item response theory (IRT) framework, and a differential step functioning (DSF) framework. The theoretical relationship between the IRT and DSF frameworks is presented. Available methods for evaluating net and global DIF are described, and an applied example of net and global DIF is presented.

20.
Two simulation studies investigated Type I error performance of two statistical procedures for detecting differential item functioning (DIF): SIBTEST and Mantel-Haenszel (MH). Because MH and SIBTEST are based on asymptotic distributions requiring "large" numbers of examinees, the first study examined Type I error for small sample sizes. No significant Type I error inflation occurred for either procedure. Because MH has the potential for Type I error inflation for non-Rasch models, the second study used a markedly non-Rasch test and systematically varied the shape and location of the studied item. When differences in distribution across examinee group of the measured ability were present, both procedures displayed inflated Type I error for certain items; MH displayed the greater inflation. Also, both procedures displayed statistically biased estimation of the zero DIF for certain items, though SIBTEST displayed much less than MH. When no latent distributional differences were present, both procedures performed satisfactorily under all conditions.
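
For readers unfamiliar with the MH procedure evaluated here, the following is a minimal sketch of the standard Mantel-Haenszel computation on score-matched 2×2 tables. It is not taken from the study: the function name `mantel_haenszel`, the table layout, and the example counts are all hypothetical, and the sketch assumes at least one stratum has nonzero reference-incorrect and focal-correct counts.

```python
import math

def mantel_haenszel(tables):
    """Mantel-Haenszel DIF statistics from matched 2x2 tables.

    tables: one (a, b, c, d) tuple per matching-score level, where
      a = reference correct, b = reference incorrect,
      c = focal correct,     d = focal incorrect.
    Returns (common odds ratio, continuity-corrected chi-square, delta).
    """
    num = den = 0.0                     # pieces of the common odds ratio
    sum_a = sum_exp = sum_var = 0.0     # pieces of the chi-square statistic
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue                    # stratum carries no information
        num += a * d / n
        den += b * c / n
        n_ref, n_foc = a + b, c + d
        n_right, n_wrong = a + c, b + d
        sum_a += a
        sum_exp += n_ref * n_right / n  # expected a under no DIF
        sum_var += n_ref * n_foc * n_right * n_wrong / (n * n * (n - 1))
    alpha = num / den
    chi2 = (abs(sum_a - sum_exp) - 0.5) ** 2 / sum_var
    delta = -2.35 * math.log(alpha)     # common log-odds-ratio rescaling
    return alpha, chi2, delta

# Hypothetical counts at two score levels, favoring the reference group.
alpha, chi2, delta = mantel_haenszel([(15, 5, 10, 10), (18, 2, 15, 5)])
```

The chi-square statistic is referred to a chi-square distribution with one degree of freedom, which is the asymptotic approximation whose small-sample behavior the first simulation study examines.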
