Similar Articles
20 similar articles found.
1.
Logistic regression has recently been advanced as a viable procedure for detecting differential item functioning (DIF). One of the advantages of this procedure is the considerable flexibility it offers in the specification of the regression equation. This article describes incorporating two ability estimates into a single regression analysis, with the result that substantially fewer items exhibit DIF. A comparable analysis is conducted using the Mantel-Haenszel procedure, with similar results. It is argued that by simultaneously conditioning on two relevant ability estimates, more accurate matching of examinees in the reference and focal groups is obtained, and thus multidimensional item impact is not mistakenly identified as DIF.
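To make the procedure concrete, here is a minimal sketch of a likelihood-ratio DIF test that conditions on two ability estimates, in the spirit of the study above. The column names ('item', 'theta1', 'theta2', 'group') and the use of statsmodels are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: LR DIF test matching on two ability estimates.
import statsmodels.formula.api as smf

def lr_dif_two_abilities(df):
    """df columns (hypothetical): 'item' (0/1 response), 'theta1' and
    'theta2' (two ability estimates), 'group' (0 = reference, 1 = focal)."""
    # Compact model: the response depends only on the two matching variables.
    m0 = smf.logit("item ~ theta1 + theta2", data=df).fit(disp=0)
    # Augmented model: adds group (uniform DIF) and group-by-ability
    # interactions (nonuniform DIF).
    m1 = smf.logit("item ~ theta1 + theta2 + group"
                   " + group:theta1 + group:theta2", data=df).fit(disp=0)
    return 2 * (m1.llf - m0.llf)  # chi-square statistic, 3 df
```

Flagging an item when the statistic exceeds the chi-square critical value with 3 degrees of freedom mirrors the usual single-ability LR test, with one extra conditioning variable.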

2.
This article presents a study of ethnic Differential Item Functioning (DIF) for 4th-, 7th-, and 10th-grade reading items on a state criterion-referenced achievement test. The tests, administered from 1997 to 2001, were composed of multiple-choice and constructed-response items. Item performance by focal groups (i.e., students from Asian/Pacific Island, Black/African American, Native American, and Latino/Hispanic origins) was compared with the performance of White students using simultaneous item bias and Rasch procedures. Flagged multiple-choice items generally favored White students, whereas flagged constructed-response items generally favored students from Asian/Pacific Islander, Black/African American, and Latino/Hispanic origins. Content analysis of flagged reading items showed that positively and negatively flagged items typically measured inference, interpretation, or analysis of text in multiple-choice and constructed-response formats. Items that were not flagged for DIF generally measured very easy reading skills (e.g., literal comprehension) and reading skills that require higher-level thinking (e.g., developing interpretations across texts and analyzing graphic elements).

3.
Data from a large-scale performance assessment (N = 105,731) were analyzed with five differential item functioning (DIF) detection methods for polytomous items to examine the congruence among the DIF detection methods. Two different versions of the item response theory (IRT) model-based likelihood ratio test, the logistic regression likelihood ratio test, the Mantel test, and the generalized Mantel–Haenszel test were compared. Results indicated some agreement among the five DIF detection methods. Because statistical power is a function of the sample size, the DIF detection results from extremely large data sets are not practically useful. As alternatives to the DIF detection methods, four IRT model-based indices of standardized impact and four observed-score indices of standardized impact for polytomous items were obtained and compared with the R² measures of logistic regression.

4.
Monte Carlo simulations with 20,000 replications are reported to estimate the probability of rejecting the null hypothesis regarding DIF using SIBTEST when there is DIF present and/or when impact is present due to differences on the primary dimension to be measured. Sample sizes are varied from 250 to 2,000 and test lengths from 10 to 40 items. Results generally support previous findings for Type I error rates and power. Impact is inversely related to test length. The combination of DIF and impact, with the focal group having lower ability on both the primary and secondary dimensions, results in impact partially masking DIF, so that items biased toward the reference group are less likely to be detected.
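For orientation, the core SIBTEST quantity is a weighted between-group difference on the studied item across matching-score strata. The sketch below, with hypothetical array inputs, omits Shealy and Stout's regression correction for measurement error in the matching score, so it is illustrative rather than a faithful SIBTEST implementation.

```python
import numpy as np

def beta_uni(item, match, group):
    """Weighted between-group difference on the studied item, pooled over
    matching-subtest score strata (group: 0 = reference, 1 = focal).
    SIBTEST's regression correction of the stratum means is omitted,
    so this is a sketch, not SIBTEST itself."""
    item, match, group = map(np.asarray, (item, match, group))
    n_focal = (group == 1).sum()
    beta = 0.0
    for k in np.unique(match):
        in_k = match == k
        ref = item[in_k & (group == 0)]
        foc = item[in_k & (group == 1)]
        if len(ref) and len(foc):
            # Weight each stratum by its share of focal-group examinees.
            beta += (len(foc) / n_focal) * (ref.mean() - foc.mean())
    return beta  # positive values favor the reference group
```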

5.
A computer simulation study was conducted to determine the feasibility of using logistic regression procedures to detect differential item functioning (DIF) in polytomous items. One item in a simulated test of 25 items contained DIF; parameters for that item were varied to create three conditions of nonuniform DIF and one of uniform DIF. Item scores were generated using a generalized partial credit model, and the data were recoded into multiple dichotomies in order to use logistic regression procedures. Results indicate that logistic regression is powerful in detecting most forms of DIF; however, it required large amounts of data manipulation, and interpretation of the results was sometimes difficult. Some logistic regression procedures may be useful in the post hoc analysis of DIF for polytomous items.
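The recoding step described above can be sketched as follows: each polytomous score is split into cumulative dichotomies, and the standard logistic regression DIF test is run on each split. The column names and model form are assumptions for illustration, not the study's exact setup.

```python
import statsmodels.formula.api as smf

def cumulative_dif_tests(df, max_score):
    """Recode a polytomous item score (0..max_score) into cumulative 0/1
    variables and run the standard LR DIF test on each dichotomy.
    df columns (hypothetical): 'score' (polytomous item score),
    'total' (matching score), 'group' (0 = reference, 1 = focal)."""
    results = {}
    for c in range(1, max_score + 1):
        d = df.assign(y=(df["score"] >= c).astype(int))
        m0 = smf.logit("y ~ total", data=d).fit(disp=0)
        m1 = smf.logit("y ~ total + group + group:total", data=d).fit(disp=0)
        results[f"score >= {c}"] = 2 * (m1.llf - m0.llf)  # chi-square, 2 df
    return results
```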

6.
Gender fairness in testing can be impeded by the presence of differential item functioning (DIF), which potentially causes test bias. In this study, the presence and causes of gender-related DIF were investigated with real data from 800 items answered by 250,000 test takers. DIF was examined using the Mantel–Haenszel and logistic regression procedures. Little DIF was found in the quantitative items and a moderate amount was found in the verbal items. Vocabulary items favored women if sampled from traditionally female domains but generally not vice versa if sampled from male domains. The sentence completion item format in the English reading comprehension subtest favored men regardless of content. The findings, if supported in a cross-validation study, can potentially lead to changes in how vocabulary items are sampled and in the use of the sentence completion format in English reading comprehension, thereby increasing gender fairness in the examined test.  相似文献   
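As a reference point for the screening described above, the Mantel–Haenszel procedure rests on a common odds ratio pooled over matching-score strata, often reported on the ETS delta scale. A minimal sketch, assuming 0/1 item responses and integer matching scores:

```python
import numpy as np

def mantel_haenszel_ddif(item, match, group):
    """Mantel-Haenszel common odds ratio pooled over matching-score
    strata, plus the ETS delta metric D-DIF = -2.35 * ln(alpha).
    group: 0 = reference, 1 = focal; item: 0/1 responses."""
    item, match, group = map(np.asarray, (item, match, group))
    num = den = 0.0
    for k in np.unique(match):
        s = match == k
        a = np.sum(s & (group == 0) & (item == 1))  # reference right
        b = np.sum(s & (group == 0) & (item == 0))  # reference wrong
        c = np.sum(s & (group == 1) & (item == 1))  # focal right
        d = np.sum(s & (group == 1) & (item == 0))  # focal wrong
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    alpha = num / den  # assumes den > 0
    return alpha, -2.35 * np.log(alpha)
```

By convention, |D-DIF| values of roughly 1.5 or more are treated as large DIF (ETS category C), which is the sort of threshold a study like this would use when classifying flagged items.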

7.
Differential item functioning (DIF) analyses have been used as the primary method in large-scale assessments to examine fairness for subgroups. Currently, DIF analyses are conducted using manifest methods that rely on observed characteristics (gender and race/ethnicity) for grouping examinees. Homogeneity of item responses is assumed, denoting that all examinees respond to test items using a similar approach. This assumption may not hold for all groups. In this study, we demonstrate the first application of the latent class (LC) approach to investigate DIF and its sources in heterogeneous populations (linguistic minority groups). We found at least three LCs within each linguistic group, suggesting the need to empirically evaluate this assumption in DIF analysis. We obtained larger proportions of DIF items with larger effect sizes when LCs within language groups, rather than the overall (majority/minority) language groups, were examined. The illustrated approach could be used to improve the ways in which DIF analyses are typically conducted, enhancing DIF detection accuracy and score-based inferences when analyzing DIF with heterogeneous populations.

8.
The purpose of this article is to present logistic discriminant function analysis as a means of differential item functioning (DIF) identification for items that are polytomously scored. The procedure is presented with examples of a DIF analysis using items from a 27-item mathematics test that includes six open-ended response items scored polytomously. The results show that the logistic discriminant function procedure is ideally suited for DIF identification on nondichotomously scored test items. It is simpler and more practical than polytomous extensions of the logistic regression DIF procedure and appears to be more powerful than a generalized Mantel-Haenszel procedure.
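The logic of logistic discriminant function analysis can be sketched by reversing the usual regression: group membership is the outcome, and DIF is indicated when the item score (or its interaction with the total score) improves prediction of group beyond the total score alone. A hedged sketch with hypothetical column names:

```python
import statsmodels.formula.api as smf

def ldfa_dif(df):
    """Logistic discriminant function analysis for a polytomous item.
    df columns (hypothetical): 'group' (0/1), 'total' (matching score),
    'item' (polytomous item score)."""
    m0 = smf.logit("group ~ total", data=df).fit(disp=0)
    m1 = smf.logit("group ~ total + item", data=df).fit(disp=0)
    m2 = smf.logit("group ~ total + item + item:total", data=df).fit(disp=0)
    uniform = 2 * (m1.llf - m0.llf)     # 1-df test of uniform DIF
    nonuniform = 2 * (m2.llf - m1.llf)  # 1-df test of nonuniform DIF
    return uniform, nonuniform
```

Because the polytomous item score enters directly as a predictor, no recoding into dichotomies is needed, which is the practical advantage the abstract highlights.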

9.
This was a study of differential item functioning (DIF) for grades 4, 7, and 10 reading and mathematics items from state criterion-referenced tests. The tests were composed of multiple-choice and constructed-response items. Gender DIF was investigated using POLYSIBTEST and a Rasch procedure. The Rasch procedure flagged more items for DIF than did the simultaneous item bias procedure, particularly multiple-choice items. For both reading and mathematics tests, multiple-choice items generally favored males while constructed-response items generally favored females. Content analyses showed that flagged reading items typically measured text interpretations or implied meanings; males tended to benefit from items that asked them to identify reasonable interpretations and analyses of informational text. Most items that favored females asked students to make their own interpretations and analyses of both literary and informational text, supported by text-based evidence. Content analysis of mathematics items showed that items favoring males measured geometry, probability, and algebra. Mathematics items favoring females measured statistical interpretations, multistep problem solving, and mathematical reasoning.

10.
The purpose of this study is to evaluate the performance of CATSIB (Computer Adaptive Testing-Simultaneous Item Bias Test) for detecting differential item functioning (DIF) when items in the matching and studied subtests are administered adaptively in the context of a realistic multi-stage adaptive test (MST). The MST was simulated using a 4-item module in a 7-panel administration. Three independent variables expected to affect DIF detection rates were manipulated: item difficulty, sample size, and balanced/unbalanced design. CATSIB met the acceptable criteria, with Type I error rates at or below 5% and power of at least 80%, for the large-reference/moderate-focal and large-reference/large-focal sample conditions. These results indicate that CATSIB can be used to consistently and accurately detect DIF on an MST, but only with moderate to large samples.

11.
In this article we present a general approach not relying on item response theory models (non-IRT) for detecting differential item functioning (DIF) in dichotomous items in the presence of guessing. The proposed nonlinear regression (NLR) procedure for DIF detection is an extension of the method based on logistic regression. As a non-IRT approach, NLR can be seen as a proxy for detection based on the three-parameter IRT model, which is a standard tool in the field. Hence, NLR fills a logical gap in DIF detection methodology and as such is important for educational purposes. Moreover, the advantages of the NLR procedure, as well as comparisons to other commonly used methods, are demonstrated in a simulation study. A real data analysis is offered to demonstrate practical use of the method.
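A rough sketch of the kind of nonlinear regression model involved: a 3PL-like curve with a lower asymptote for guessing and group-specific shifts in slope and difficulty. The exact parameterization and the use of scipy's curve_fit are illustrative assumptions, not the authors' estimator.

```python
import numpy as np
from scipy.optimize import curve_fit

def nlr_curve(X, a, b, c, da, db):
    """3PL-like response curve: lower asymptote c captures guessing;
    da and db shift slope and difficulty for the focal group (g = 1)."""
    x, g = X
    return c + (1 - c) / (1 + np.exp(-(a + da * g) * (x - (b + db * g))))

def fit_nlr(score, group, y):
    """Fit the curve to 0/1 responses, with a matching score standing in
    for ability; da and db near zero indicate no DIF."""
    score = np.asarray(score, dtype=float)
    p0 = [1.0, score.mean(), 0.15, 0.0, 0.0]  # starting values (assumed)
    bounds = ([0.0, -np.inf, 0.0, -np.inf, -np.inf],
              [np.inf, np.inf, 1.0, np.inf, np.inf])
    est, cov = curve_fit(nlr_curve, (score, np.asarray(group)),
                         np.asarray(y), p0=p0, bounds=bounds)
    return est, np.sqrt(np.diag(cov))  # estimates and standard errors
```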

12.
Applied Measurement in Education, 2013, 26(4): 329-349
The logistic regression (LR) procedure for differential item functioning (DIF) detection is a model-based approach designed to identify both uniform and nonuniform DIF. However, this procedure tends to produce inflated Type I errors. This outcome is problematic because it can result in the inefficient use of testing resources, and it may interfere with the study of the underlying causes of DIF. Recently, an effect size measure was developed for the LR DIF procedure and a classification method was proposed. However, the effect size measure and classification method have not been systematically investigated. In this study, we developed a new classification method based on those established for the Simultaneous Item Bias Test. A simulation study also was conducted to determine if the effect size measure affects the Type I error and power rates for the LR DIF procedure across sample sizes, ability distributions, and percentage of DIF items included on a test. The results indicate that the inclusion of the effect size measure can substantially reduce Type I error rates when large sample sizes are used, although there is also a reduction in power.  相似文献   
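To illustrate pairing the LR test with an effect size, the sketch below adds a pseudo R-squared change to the usual chi-square. McFadden's R² from statsmodels stands in here for the Nagelkerke-type measures usually reported; the commonly cited cutoffs (around 0.035 and 0.070 for negligible/moderate/large DIF) are conventions from this literature, not derived in the code.

```python
import statsmodels.formula.api as smf

def lr_dif_with_effect_size(df):
    """LR DIF chi-square plus a pseudo R-squared change as effect size.
    df columns (hypothetical): 'item' (0/1), 'total', 'group' (0/1).
    McFadden's pseudo R^2 is used for convenience; published
    classifications typically use Nagelkerke-type measures."""
    m0 = smf.logit("item ~ total", data=df).fit(disp=0)
    m1 = smf.logit("item ~ total + group + group:total", data=df).fit(disp=0)
    chi2 = 2 * (m1.llf - m0.llf)              # 2-df chi-square
    delta_r2 = m1.prsquared - m0.prsquared    # effect size
    return chi2, delta_r2
```

Requiring both a significant chi-square and a sufficiently large delta_r2 before flagging an item is exactly the mechanism the study credits with reducing Type I error rates in large samples.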

13.
This article proposes two multidimensional IRT model-based methods of selecting item bundles (clusters of not necessarily adjacent items chosen according to some organizational principle) suspected of displaying DIF amplification. The approach embodied in these two methods is inspired by Shealy and Stout's (1993a, 1993b) multidimensional model for DIF. Each bundle selected by these methods constitutes a DIF amplification hypothesis. When SIBTEST (Shealy & Stout, 1993b) confirms DIF amplification in selected bundles, differential bundle functioning (DBF) is said to occur. Three real data examples illustrate the two methods for suspect bundle selection. The effectiveness of the methods is argued on statistical grounds. A distinction between benign and adverse DIF is made. The decision whether flagged DIF items or DBF bundles display benign or adverse DIF/DBF must depend in part on nonstatistical construct validity arguments. Conducting DBF analyses using these methods should help in the identification of the causes of DIF/DBF.

14.
Although logistic regression has become one of the well-known methods for detecting differential item functioning (DIF), its three statistical tests, the Wald, likelihood ratio (LR), and score tests, which are readily available under maximum likelihood estimation, do not seem to be consistently distinguished in the DIF literature. This paper provides a clarifying note on those three tests when logistic regression is applied for DIF detection.
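The distinction at issue can be shown in a few lines: for the same group coefficient, the Wald test uses the full model's estimate and standard error, while the LR test compares the two models' log-likelihoods (the score test, not shown, uses the restricted model's gradient). A sketch with hypothetical column names:

```python
import statsmodels.formula.api as smf

def wald_vs_lr(df):
    """Wald and LR tests of the uniform-DIF (group) coefficient; both are
    1-df chi-squares asymptotically but can disagree in finite samples.
    df columns (hypothetical): 'item' (0/1), 'total', 'group' (0/1)."""
    restricted = smf.logit("item ~ total", data=df).fit(disp=0)
    full = smf.logit("item ~ total + group", data=df).fit(disp=0)
    wald = (full.params["group"] / full.bse["group"]) ** 2
    lr = 2 * (full.llf - restricted.llf)
    return wald, lr
```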

15.
There are numerous statistical procedures for detecting items that function differently across subgroups of examinees who take a test or survey. However, in endeavouring to detect items that may function differentially, selection of the statistical method is only one of many important decisions. In this article, we discuss the important decisions that affect investigations of differential item functioning (DIF), such as choice of method, sample size, effect size criteria, conditioning variable, purification, DIF amplification, DIF cancellation, and research designs for evaluating DIF. Our review highlights the necessity of matching the DIF procedure to the nature of the data analysed, the need to include effect size criteria, the need to consider the direction and balance of items flagged for DIF, and the need to use replication to reduce Type I errors whenever possible. Directions for future research and practice in using DIF to enhance the validity of test scores are provided.

16.
Although the root-mean squared deviation (RMSD) is a popular statistical measure for evaluating country-specific item-level misfit (i.e., differential item functioning [DIF]) in international large-scale assessment, this paper shows that its sensitivity in detecting misfit may depend strongly on the proficiency distribution of the considered countries. Specifically, items for which most respondents in a country have a very low (or high) probability of providing a correct answer will rarely be flagged by the RMSD as showing misfit, even if very strong DIF is present. With many international large-scale assessment initiatives moving toward covering a more heterogeneous group of countries, this raises issues for the ability of the RMSD to detect item-level misfit, especially in low-performing countries that are not well aligned with the overall difficulty level of the test. This may put one at risk of incorrectly assuming measurement invariance to hold, and may also inflate estimated between-country differences in proficiency. The degree to which the RMSD is able to detect DIF in low-performing countries is studied using both an empirical example from PISA 2015 and a simulation study.
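The mechanism the authors describe follows from the definition of the statistic: RMSD weights squared deviations between the observed and model-implied item characteristic curves by the country's proficiency density, so misfit located where that density is near zero contributes almost nothing. A minimal sketch, assuming curves already evaluated on a common theta grid:

```python
import numpy as np

def rmsd(p_obs, p_model, density):
    """Country-level RMSD for one item: deviations between observed and
    model-implied probabilities, weighted by the country's proficiency
    density on the same theta grid."""
    w = density / density.sum()
    return float(np.sqrt(np.sum(w * (p_obs - p_model) ** 2)))

# Demo of the masking effect: a large difficulty shift above theta = 0
# yields a small RMSD when a low-performing country's proficiency
# density sits far below the item's operating range.
theta = np.linspace(-4, 4, 161)
p_model = 1 / (1 + np.exp(-(theta - 1.5)))
p_obs = 1 / (1 + np.exp(-(theta - 0.5)))       # strong DIF in difficulty
low_density = np.exp(-0.5 * (theta + 2) ** 2)  # proficiency centered at -2
print(rmsd(p_obs, p_model, low_density))       # small despite strong DIF
```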

17.
The aim of this study is to assess the efficiency of using multiple-group categorical confirmatory factor analysis (MCCFA) and the robust chi-square difference test for differential item functioning (DIF) detection in polytomous items under the minimum free baseline strategy. When testing items for DIF, MCCFA is commonly used in the literature with a constrained baseline, despite the strong assumption this entails that all items other than the examined one are DIF-free. The present study relaxes this assumption and adopts the minimum free baseline approach, in which, aside from those parameters constrained for identification purposes, the parameters of all items but the examined one are allowed to differ among groups. Based on the simulation results, the robust chi-square difference test statistic with the mean and variance adjustment is shown to be efficient in detecting DIF for polytomous items in terms of empirical power and Type I error rates. In sum, MCCFA under the minimum free baseline strategy is useful for DIF detection in polytomous items.

18.
This study examined the extent to which log-linear smoothing could improve the accuracy of differential item functioning (DIF) estimates in small samples of examinees. Examinee responses from a certification test were analyzed using White examinees in the reference group and African American examinees in the focal group. Using a simulation approach, separate DIF estimates for seven small-sample-size conditions were obtained using unsmoothed (U) and smoothed (S) score distributions. These small sample U and S DIF estimates were compared to a criterion (i.e., DIF estimates obtained using the unsmoothed total data) to assess their degree of variability (random error) and accuracy (bias). Results indicate that for most studied items smoothing the raw score distributions reduced random error and bias of the DIF estimates, especially in the small-sample-size conditions. Implications of these results for operational testing programs are discussed.
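Log-linear presmoothing of the kind examined here can be sketched as a Poisson regression of score frequencies on polynomial score terms, which preserves the first few moments of the raw distribution while damping sampling noise. The polynomial degree and the statsmodels GLM interface are illustrative choices, not the study's exact procedure.

```python
import numpy as np
import statsmodels.api as sm

def loglinear_smooth(freqs, degree=4):
    """Fit a log-linear model to raw-score frequencies (one count per
    score point 0..K) via a Poisson GLM with polynomial score terms.
    The fit preserves the first `degree` moments of the observed
    distribution; scores are standardized for numerical stability."""
    freqs = np.asarray(freqs, dtype=float)
    scores = np.arange(len(freqs))
    z = (scores - scores.mean()) / scores.std()
    X = sm.add_constant(
        np.column_stack([z ** d for d in range(1, degree + 1)]))
    fit = sm.GLM(freqs, X, family=sm.families.Poisson()).fit()
    return fit.fittedvalues  # smoothed frequencies, same total count
```

The smoothed frequencies would then replace the raw score distributions when matching reference and focal examinees, which is where the reduction in random error and bias reported above comes from.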

19.
Comparisons of countries based on their PISA results can be considered cross-cultural studies. An important issue in such studies is that the measurement tools must be culturally and linguistically equivalent; cultural or linguistic differences in measurement tools may threaten validity. Aberrant response behavior is another important factor that affects validity, and person-fit analysis is a statistical method for detecting aberrant individual behavior. This study investigated the effects of students flagged as aberrant on a comparison of Turkish and Dutch students' non-cognitive outcomes toward science in PISA 2015. Whether the items contained differential item functioning (DIF) was determined with Poly-SIBTEST for each scale. Misfitting students were detected with the nonparametric GNp and U3p statistics. The flagged students did not affect comparisons of the means of the non-cognitive outcomes toward science, but they were found to have an effect on DIF.

20.
Differential Item Functioning (DIF) is traditionally used to identify different item performance patterns between intact groups, most commonly involving race or sex comparisons. This study advocates expanding the utility of DIF as a step in construct validation. Rather than grouping examinees based on cultural differences, the reference and focal groups are chosen from two extremes along a distinct cognitive dimension that is hypothesized to supplement the dominant latent trait being measured. Specifically, this study investigates DIF between proficient and non-proficient fourth- and seventh-grade writers on open-ended mathematics test items that require students to communicate about mathematics. It is suggested that the occurrence of DIF in this situation actually enhances, rather than detracts from, the construct validity of the test because, according to the National Council of Teachers of Mathematics (NCTM), mathematical communication is an important component of mathematical ability, the dominant construct being assessed. However, the presence of DIF influences the validity of inferences that can be made from test scores and suggests that two scores should be reported, one for general mathematical ability and one for mathematical communication. The fact that currently only one test score is reported, a simple composite of scores on multiple-choice and open-ended items, may lead to incorrect decisions being made about examinees.
