1.

Assessing Differential Step Functioning in Polytomous Items Using a Common Odds Ratio Estimator

Randall D. Penfield 《Journal of Educational Measurement》2007,44(3):187-210

Many statistics used in the assessment of differential item functioning (DIF) in polytomous items yield a single item-level index of measurement invariance that collapses information across all response options of the polytomous item. Utilizing a single item-level index of DIF can, however, be misleading if the magnitude or direction of the DIF changes across the steps underlying the polytomous response process. A more comprehensive approach to examining measurement invariance in polytomous item formats is to examine invariance at the level of each step of the polytomous item, a framework described in this article as differential step functioning (DSF). This article proposes a nonparametric DSF estimator that is based on the Mantel-Haenszel common odds ratio estimator (Mantel & Haenszel, 1959), which is frequently implemented in the detection of DIF in dichotomous items. A simulation study demonstrated that when the level of DSF varied in magnitude or sign across the steps underlying the polytomous response options, the DSF-based approach typically provided a more powerful and accurate test of measurement invariance than did corresponding item-level DIF estimators.
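As an illustration of the idea in this abstract, the following is a minimal sketch of the Mantel-Haenszel common odds ratio applied to one step of a polytomous item. The cumulative dichotomization (score at or above the step counts as passing), the record layout, and the function names `step_tables` and `mh_common_odds_ratio` are assumptions made for the example; the article's own step definition and estimator details may differ.

```python
from collections import defaultdict


def mh_common_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio over stratified 2x2 tables.

    Each table is (a, b, c, d):
      a = reference group passing, b = reference group failing,
      c = focal group passing,     d = focal group failing.
    """
    num = 0.0
    den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n == 0:
            continue  # an empty stratum contributes nothing
        num += a * d / n
        den += b * c / n
    if den == 0:
        raise ValueError("denominator is zero; odds ratio is undefined")
    return num / den


def step_tables(records, step):
    """Build one 2x2 table per stratum for a single step.

    records: iterable of (stratum, group, score), group is "ref" or "focal".
    Assumption: a response passes the step when score >= step.
    """
    cells = defaultdict(lambda: [0, 0, 0, 0])
    for stratum, group, score in records:
        passed = score >= step
        if group == "ref":
            idx = 0 if passed else 1
        else:
            idx = 2 if passed else 3
        cells[stratum][idx] += 1
    return [tuple(cells[s]) for s in sorted(cells)]


if __name__ == "__main__":
    # Hypothetical toy data: one matching stratum, item scored 0..2.
    data = [(0, "ref", s) for s in (2, 2, 1, 0)]
    data += [(0, "focal", s) for s in (2, 1, 0, 0)]
    for step in (1, 2):
        print(step, mh_common_odds_ratio(step_tables(data, step)))
```

Computing the estimate separately for each step is what allows the step-level values to differ in size or sign, which a single item-level index would average away.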

2.

Explaining Variability in Response Style Traits: A Covariate-Adjusted IRTree

Allison J. Ames Aaron J. Myers 《Educational and psychological measurement》2021,81(4):756

Contamination of responses due to extreme and midpoint response style can confound the interpretation of scores, threatening the validity of inferences made from survey responses. This study incorporated person-level covariates in the multidimensional item response tree model to explain heterogeneity in response style. We include an empirical example and two simulation studies to support the use and interpretation of the model: parameter recovery using Markov chain Monte Carlo (MCMC) estimation and performance of the model under conditions with and without response styles present. Item intercepts mean bias and root mean square error were small at all sample sizes. Item discrimination mean bias and root mean square error were also small but tended to be smaller when covariates were unrelated to, or had a weak relationship with, the latent traits. Item and regression parameters are estimated with sufficient accuracy when sample sizes are greater than approximately 1,000 and MCMC estimation with the Gibbs sampler is used. The empirical example uses the National Longitudinal Study of Adolescent to Adult Health’s sexual knowledge scale. Meaningful predictors associated with high levels of extreme response latent trait included being non-White, being male, and having high levels of parental support and relationships. Meaningful predictors associated with high levels of the midpoint response latent trait included having low levels of parental support and relationships. Item-level covariates indicate the response style pseudo-items were less easy to endorse for self-oriented items, whereas the trait of interest pseudo-items were easier to endorse for self-oriented items.

3.

An Empirical Examination of the IRT Information of Polytomously Scored Reading Items Under the Generalized Partial Credit Model

John R. Donoghue 《Journal of Educational Measurement》1994,31(4):295-311

Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. Expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items for computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at 1.0 to 1.5, and the population distribution mean was 0. When scored dichotomously, information in polytomous items sharply decreased, but they still provided more expected information than did the other response formats. For reference, a derivation of the information function for the generalized partial credit model is included.
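To make the quantity discussed in this abstract concrete, here is a minimal sketch of category probabilities and item information under a generalized partial credit model. It assumes the common parameterization with one slope `a` and step parameters `b_1..b_m`, omits any scaling constant (so information is `a**2` times the variance of the category score), and uses hypothetical function names; it is not the article's derivation or its NAEP calibration.

```python
from math import exp


def gpcm_probs(theta, a, steps):
    """Category probabilities for scores 0..len(steps) under the GPCM.

    The exponent for category k is the sum of a * (theta - b_v) over the
    first k steps; category 0 has exponent 0.
    """
    exponents = [0.0]
    for b in steps:
        exponents.append(exponents[-1] + a * (theta - b))
    m = max(exponents)  # subtract the maximum for numerical stability
    weights = [exp(z - m) for z in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def gpcm_information(theta, a, steps):
    """Item information: a^2 times the variance of the category score."""
    probs = gpcm_probs(theta, a, steps)
    mean = sum(k * p for k, p in enumerate(probs))
    second = sum(k * k * p for k, p in enumerate(probs))
    return a * a * (second - mean * mean)


if __name__ == "__main__":
    # Hypothetical four-category item versus a dichotomous item, same slope.
    for theta in (-1.0, 0.0, 1.0):
        poly = gpcm_information(theta, 1.0, [-1.0, 0.0, 1.0])
        dich = gpcm_information(theta, 1.0, [0.0])
        print(theta, round(poly, 3), round(dich, 3))
```

With a single step the model reduces to a two-parameter logistic item, which gives a quick sanity check on the information function.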

4.

An Instructional Module on Mokken Scale Analysis

Stefanie A. Wind 《Educational Measurement》2017,36(2):50-66

Mokken scale analysis (MSA) is a probabilistic‐nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic‐nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple‐choice physical science assessment and a rater‐mediated writing assessment.

5.

An NCME Instructional Module on Polytomous Item Response Theory Models

Randall David Penfield 《Educational Measurement》2014,33(1):36-48

A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytomous IRT models. The module presents commonly encountered polytomous IRT models, describes their properties, and contrasts their defining principles and assumptions. After completing this module, the reader should have a sound understanding of what a polytomous IRT model is, the manner in which the equations of the models are generated from the model's underlying step functions, how widely used polytomous IRT models differ with respect to their definitional properties, and how to interpret the parameters of polytomous IRT models.

6.

Classification Consistency and Accuracy for Complex Assessments Using Item Response Theory

Won-Chan Lee 《Journal of Educational Measurement》2010,47(1):1-17

In this article, procedures are described for estimating single-administration classification consistency and accuracy indices for complex assessments using item response theory (IRT). This IRT approach was applied to real test data comprising dichotomous and polytomous items. Several different IRT model combinations were considered. Comparisons were also made between the IRT approach and two non-IRT approaches including the Livingston-Lewis and compound multinomial procedures. Results for various IRT model combinations were not substantially different. The estimated classification consistency and accuracy indices for the non-IRT procedures were almost always lower than those for the IRT procedures.

7.

Mokken Scale Analysis: Theoretical Considerations and an Application to Transitivity Tasks

《Applied Measurement in Education》2013,26(4):355-373

This study provides a discussion and an application of Mokken scale analysis. Mokken scale analysis can be characterized as a nonparametric item response theory approach. The Mokken approach to scaling consists of two different item response models, the model of monotone homogeneity and the more restrictive model of double monotonicity. Methods for empirical data analysis using the two Mokken model versions are discussed. Both dichotomous and polytomous item scores can be analyzed by means of Mokken scale analysis. Three empirical data sets pertaining to transitive inference items were analyzed using the Mokken approach. The results are compared with the results obtained from a Rasch analysis.

8.

Detecting Local Item Dependence in Polytomous Adaptive Data

Jessica L. Mislevy André A. Rupp Jeffrey R. Harring 《Journal of Educational Measurement》2012,49(2):127-147

A rapidly expanding arena for item response theory (IRT) is in attitudinal and health‐outcomes survey applications, often with polytomous items. In particular, there is interest in computer adaptive testing (CAT). Meeting model assumptions is necessary to realize the benefits of IRT in this setting, however. Although local item dependence has been investigated both for polytomous items in fixed‐form settings and for dichotomous items in CAT settings, there have been no publications applying local item dependence detection methodology to polytomous items in CAT despite its central importance to these applications. The current research uses a simulation study to investigate the extension of widely used pairwise statistics, Yen's Q₃ Statistic and Pearson's Statistic X², in this context. The simulation design and results are contextualized throughout with a real item bank of this type from the Patient‐Reported Outcomes Measurement Information System (PROMIS).
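For readers unfamiliar with the first of the pairwise statistics named here, the sketch below computes Yen's Q₃ for one item pair as the Pearson correlation of the two items' residuals (observed score minus model-expected score). It assumes the expected scores have already been obtained from a fitted IRT model and that the lists contain only examinees who were administered both items, which matters in adaptive data; the function names are hypothetical and this is not the study's simulation code.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("zero variance; correlation is undefined")
    return sxy / sqrt(sxx * syy)


def q3(observed_i, expected_i, observed_j, expected_j):
    """Yen's Q3 for items i and j: correlation of their residuals.

    For a polytomous item the expected score is the sum over categories
    of (category score * model probability) at the examinee's estimated
    trait level.
    """
    d_i = [o - e for o, e in zip(observed_i, expected_i)]
    d_j = [o - e for o, e in zip(observed_j, expected_j)]
    return pearson(d_i, d_j)


if __name__ == "__main__":
    # Hypothetical scores and expected scores for five examinees.
    obs_i, exp_i = [2, 0, 1, 2, 0], [1.4, 0.6, 1.0, 1.5, 0.5]
    obs_j, exp_j = [2, 1, 1, 2, 0], [1.3, 0.7, 1.1, 1.6, 0.4]
    print(round(q3(obs_i, exp_i, obs_j, exp_j), 3))
```

Values near zero are what local independence would lead one to expect; large positive values for a pair flag possible dependence.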

9.

Children’s use of number line estimation strategies

Dominique Peeters Tine Degrande Mirjam Ebersbach Lieven Verschaffel Koen Luwel 《European Journal of Psychology of Education - EJPE》2016,31(2):117-134

This study tested whether second graders use benchmark-based strategies when solving a number line estimation (NLE) task. Participants were assigned to one of three conditions based on the availability of benchmarks provided on the number line. In the bounded condition, number lines were only bounded at both sides by 0 and 200, while the midpoint condition included an additional benchmark at the midpoint and children in the quartile condition were provided with a benchmark at every quartile. First, the inclusion of a midpoint resulted in more accurate estimates around the middle of the number line in the midpoint condition compared to the bounded and, surprisingly, also the quartile condition. Furthermore, the two additional benchmarks in the quartile condition did not yield better estimations around the first and third quartile, because children frequently relied on an erroneous representation of these benchmarks, leading to systematic estimation errors. Second, verbal strategy reports revealed that children in the midpoint condition relied more frequently on the benchmark at the midpoint of the number line compared to the bounded condition, confirming the accuracy data. Finally, the frequency of use of benchmark-based strategies correlated positively with mathematics achievement and tended to correlate positively also with estimation accuracy. In sum, this study is one of the first to provide systematic evidence for children’s use of benchmark-based estimation strategies in NLE with natural numbers and its relationship with children’s NLE performance.

10.

A Polytomous Scoring Approach to Handle Not-Reached Items in Low-Stakes Assessments

Guher Gorgun Okan Bulut 《Educational and psychological measurement》2021,81(5):847

In low-stakes assessments, some students may not reach the end of the test and leave some items unanswered due to various reasons (e.g., lack of test-taking motivation, poor time management, and test speededness). Not-reached items are often treated as incorrect or not-administered in the scoring process. However, when the proportion of not-reached items is high, these traditional approaches may yield biased scores and thereby threaten the validity of test results. In this study, we propose a polytomous scoring approach for handling not-reached items and compare its performance with those of the traditional scoring approaches. Real data from a low-stakes math assessment administered to second and third graders were used. The assessment consisted of 40 short-answer items focusing on addition and subtraction. The students were instructed to answer as many items as possible within 5 minutes. Using the traditional scoring approaches, students’ responses for not-reached items were treated as either not-administered or incorrect in the scoring process. With the proposed scoring approach, students’ nonmissing responses were scored polytomously based on how accurately and rapidly they responded to the items to reduce the impact of not-reached items on ability estimation. The traditional and polytomous scoring approaches were compared based on several evaluation criteria, such as model fit indices, test information function, and bias. The results indicated that the polytomous scoring approaches outperformed the traditional approaches. The complete case simulation corroborated our empirical findings that the scoring approach in which nonmissing items were scored polytomously and not-reached items were considered not-administered performed the best. Implications of the polytomous scoring approach for low-stakes assessments were discussed.

11.

Dimensional analyses of complex data

Kathy E. Green 《Structural equation modeling》2013,20(1):50-61

In this article, scales constructed using principal components and Rasch measurement methods are compared. The context of the comparison is scale definition under difficult circumstances, when constructs are unclear and sample sizes marginal. Three data sets of increasing complexity and decreasing stability were used. Responses for the least complex data set were dichotomous; the remaining two were polytomous. Results of Rasch and principal components analyses were identical when data were stable and the structure unidimensional. With less stability and more complexity, the defined scales were still similar for the two analytic approaches. Effects of item positions on the scales were noted and are discussed.

12.

Performance of the Generalized S‐X² Item Fit Index for Polytomous IRT Models

Taehoon Kang Troy T. Chen 《Journal of Educational Measurement》2008,45(4):391-406

Orlando and Thissen's S‐X² item fit index has performed better than traditional item fit statistics such as Yen's Q₁ and McKinley and Mills's G² for dichotomous item response theory (IRT) models. This study extends the utility of S‐X² to polytomous IRT models, including the generalized partial credit model, partial credit model, and rating scale model. The performance of the generalized S‐X² in assessing item model fit was studied in terms of empirical Type I error rates and power and compared to G². The results suggest that the generalized S‐X² is promising for polytomous items in educational and psychological testing programs.

13.

Aggregating Polytomous DIF Results Over Multiple Test Administrations

Rebecca Zwick Lei Ye Steven Isham 《Journal of Educational Measurement》2018,55(1):132-151

In typical differential item functioning (DIF) assessments, an item's DIF status is not influenced by its status in previous test administrations. An item that has shown DIF at multiple administrations may be treated the same way as an item that has shown DIF in only the most recent administration. Therefore, much useful information about the item's functioning is ignored. In earlier work, we developed the Bayesian updating (BU) DIF procedure for dichotomous items and showed how it could be used to formally aggregate DIF results over administrations. More recently, we extended the BU method to the case of polytomously scored items. We conducted an extensive simulation study that included four “administrations” of a test. For the single‐administration case, we compared the Bayesian approach to an existing polytomous‐DIF procedure. For the multiple‐administration case, we compared BU to two non‐Bayesian methods of aggregating the polytomous‐DIF results over administrations. We concluded that both the BU approach and a simple non‐Bayesian method show promise as methods of aggregating polytomous DIF results over administrations.

14.

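A Study of Gender DIF in a Mathematics Test Based on the RCMLM Model

Song Jixiang, Li Fupeng, Du Haiyan 《考试研究》 (Examinations Research) 2021,(1):51-57

The RCMLM is a general extended model based on Rasch measurement theory. The RCMLM was used to carry out a gender DIF analysis of a regular senior high school mathematics test. The results show that the model can analyze DIF for dichotomously and polytomously scored items at the same time. This avoids the drawback of earlier practice, in which items with the two scoring formats were analyzed separately, preserves the integrity of the test, and makes the DIF results more valid.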

15.

Assessing Fit of Unidimensional Graded Response Models Using Bayesian Methods

Xiaowen Zhu Clement A. Stone 《Journal of Educational Measurement》2011,48(1):81-97

The posterior predictive model checking method is a flexible Bayesian model‐checking tool and has recently been used to assess fit of dichotomous IRT models. This paper extended previous research to polytomous IRT models. A simulation study was conducted to explore the performance of posterior predictive model checking in evaluating different aspects of fit for unidimensional graded response models. A variety of discrepancy measures (test‐level, item‐level, and pair‐wise measures) that reflected different threats to applications of graded IRT models to performance assessments were considered. Results showed that posterior predictive model checking exhibited adequate power in detecting different aspects of misfit for graded IRT models when appropriate discrepancy measures were used. Pair‐wise measures were found more powerful in detecting violations of the unidimensionality and local independence assumptions.

16.

Asymptotic Standard Errors for Item Response Theory True Score Equating of Polytomous Items

Cheow Cher Wong 《Journal of Educational Measurement》2015,52(1):106-120

Building on previous works by Lord and Ogasawara for dichotomous items, this article proposes an approach to derive the asymptotic standard errors of item response theory true score equating involving polytomous items, for equivalent and nonequivalent groups of examinees. This analytical approach could be used in place of empirical methods like the bootstrap method, to obtain standard errors of equated scores. Formulas are introduced to obtain the derivatives for computing the asymptotic standard errors. The approach was validated using mean‐mean, mean‐sigma, random‐groups, or concurrent calibration equating of simulated samples, for tests modeled using the generalized partial credit model or the graded response model.

17.

Metacognitive scaffolds improve self-judgments of accuracy in a medical intelligent tutoring system

Reza Feyzi-Behnagh Roger Azevedo Elizabeth Legowski Kayse Reitmeyer Eugene Tseytlin Rebecca S. Crowley 《Instructional Science》2014,42(2):159-181

In this study, we examined the effect of two metacognitive scaffolds on the accuracy of confidence judgments made while diagnosing dermatopathology slides in SlideTutor. Thirty-one (N = 31) first- to fourth-year pathology and dermatology residents were randomly assigned to one of the two scaffolding conditions. The cases used in this study were selected from the domain of nodular and diffuse dermatitides. Both groups worked with a version of SlideTutor that provided immediate feedback on their actions for 2 h before proceeding to solve cases in either the Considering Alternatives or Playback condition. No immediate feedback was provided on actions performed by participants in the scaffolding mode. Measurements included learning gains (pre-test and post-test), as well as metacognitive performance, including Goodman–Kruskal Gamma correlation, bias, and discrimination. Results showed that participants in both conditions improved significantly in terms of their diagnostic scores from pre-test to post-test. More importantly, participants in the Considering Alternatives condition outperformed those in the Playback condition in the accuracy of their confidence judgments and the discrimination of the correctness of their assertions while solving cases. The results suggested that presenting participants with their diagnostic decision paths and highlighting correct and incorrect paths helps them to become more metacognitively accurate in their confidence judgments.
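The Goodman–Kruskal gamma mentioned among the metacognitive measures can be sketched as follows: it is the difference between concordant and discordant pairs divided by their sum, with tied pairs ignored. The pairing of a confidence rating with a 0/1 correctness score per judgment, and the function name, are assumptions for illustration rather than the study's scoring procedure.

```python
def goodman_kruskal_gamma(x, y):
    """Goodman-Kruskal gamma: (C - D) / (C + D), ignoring tied pairs.

    x: ordinal ratings (e.g. confidence), y: ordinal outcomes
    (e.g. 1 for a correct assertion, 0 for an incorrect one).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = 0
    discordant = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
            # s == 0 means a tie on x or y; the pair is skipped
    if concordant + discordant == 0:
        raise ValueError("all pairs are tied; gamma is undefined")
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    # Hypothetical confidence ratings and correctness of six judgments.
    confidence = [1, 2, 2, 3, 4, 4]
    correct = [0, 0, 1, 1, 1, 0]
    print(goodman_kruskal_gamma(confidence, correct))
```

A gamma near +1 means higher confidence tends to accompany correct judgments; a value near 0 means confidence carries little information about correctness.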

18.

Elementary Forms of the Quick Word Test

Edgar F. Borgatta George W. Bohrnstedt 《Journal of Experimental Education》2013,81(2):57-61

To assess whether different response patterns were associated with differences in the naming and placement of response categories, 1,000 undergraduate students in educational administration completed a 10-item personal-values questionnaire. Five different forms, each answered by 200 students, were employed, differing only in the response categories which could be selected. Different distributions were obtained, depending upon whether “Undecided” was placed in the midpoint of an agreement-disagreement scale, or separated from that scale. Naming of the midpoint by “Undecided” and “Neutral” also produced different response patterns. The results indicate a need for further investigation of the effects of given scales upon responses before advanced statistical techniques are applied.

19.

Detecting DIF for Polytomously Scored Items: An Adaptation of the SIBTEST Procedure

Hua-Hua Chang John Mazzeo Louis Roussos 《Journal of Educational Measurement》1996,33(3):333-353

Shealy and Stout (1993) proposed a DIF detection procedure called SIBTEST and demonstrated its utility with both simulated and real data sets. Current versions of SIBTEST can be used only for dichotomous items. In this article, an extension to handle polytomous items is developed. Two simulation studies are presented which compare the modified SIBTEST procedure with the Mantel and standardized mean difference (SMD) procedures. The first study compares the procedures under conditions in which the Mantel and SMD procedures have been shown to perform well (Zwick, Donoghue, & Grima, 1993). Results of Study 1 suggest that SIBTEST performed reasonably well, but that the Mantel and SMD procedures performed slightly better. The second study uses data simulated under conditions in which observed-score DIF methods for dichotomous items have not performed well. The results of Study 2 indicate that under these conditions the modified SIBTEST procedure provides better control of impact-induced Type I error inflation than the other procedures.

20.

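Effects of Test Length and Sample Size on Parameter Recovery in IRT Simulations

Project Team for the "Basic Education Teaching Quality Monitoring System" 《中国考试》 (China Examinations) 2009,(6)

When building an item bank under item response theory, how many examinees and how many items does a paper-and-pencil test need for the item parameters to be estimated with reasonable precision? This article uses Monte Carlo simulation of testing conditions to examine the question, analyzing how changes in test length and in sample size affect recovery of the a and b parameters. (1) Regarding sample size: under both the dichotomously and the polytomously scored simulation conditions, the mean squared error used as the recovery index for the item parameter estimates decreases as sample size increases. (2) Regarding test length: under the dichotomously scored condition, the mean squared error curve has a turning point at about 25 items. Below 25 items, RMSE drops substantially as items are added; above 25 items, adding further items reduces RMSE very little. Under the polytomously scored condition, the mean squared error curve has a turning point at about 15 items. Below 15 items, RMSE decreases as items are added; above 15 items, RMSE increases as items are added.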