Similar literature
 20 similar documents found; search took 312 ms
1.
Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Thus, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well-chosen anchor set, the parameters recovered well after equating across all populations, even for anchor sets composed of as few as six items.

2.
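Students' mathematical literacy has a multidimensional structure, so a literacy-oriented assessment of mathematics achievement needs to report examinees' performance on each dimension rather than only a single total score. Taking the PISA mathematical literacy framework as the theoretical model and multidimensional item response theory (MIRT) as the measurement model, this study uses the MIRT package in R to process and analyze item-level data from a grade 8 mathematical literacy assessment in one region, and examines methods for the multidimensional measurement of mathematical literacy. The results show that MIRT combines the advantages of unidimensional item response theory and factor analysis; it can be used to analyze a test's construct validity and item quality, and to provide a multidimensional cognitive diagnosis of examinees' abilities.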

3.
In computerized adaptive testing (CAT), ensuring the security of test items is a crucial practical consideration. A common approach to reducing item theft is to define maximum item exposure rates, i.e., to limit the proportion of examinees to whom a given item can be administered. Numerous methods for controlling exposure rates have been proposed for tests employing the unidimensional 3-PL model. The present article explores the issues associated with controlling exposure rates when a multidimensional item response theory (MIRT) model is utilized and exposure rates must be controlled conditional upon ability. This situation is complicated by the exponentially increasing number of possible ability values in multiple dimensions. The article introduces a new procedure, called the generalized Stocking-Lewis method, that controls the exposure rate for students of comparable ability as well as with respect to the overall population. A realistic simulation set compares the new method with three other approaches: Kullback-Leibler information with no exposure control, Kullback-Leibler information with unconditional Sympson-Hetter exposure control, and random item selection.

4.
Many educational and psychological tests are inherently multidimensional, meaning these tests measure two or more dimensions or constructs. The purpose of this module is to illustrate how test practitioners and researchers can apply multidimensional item response theory (MIRT) to understand better what their tests are measuring, how accurately the different composites of ability are being assessed, and how this information can be cycled back into the test development process. Procedures for conducting MIRT analyses (from obtaining evidence that the test is multidimensional, to modeling the test as multidimensional, to illustrating the properties of multidimensional items graphically) are described from both a theoretical and a substantive basis. This module also illustrates these procedures using data from a ninth-grade mathematics achievement test. It concludes with a discussion of future directions in MIRT research.

5.
Many researchers have suggested that the main cause of item bias is the misspecification of the latent ability space, where items that measure multiple abilities are scored as though they are measuring a single ability. If two different groups of examinees have different underlying multidimensional ability distributions and the test items are capable of discriminating among levels of abilities on these multiple dimensions, then any unidimensional scoring scheme has the potential to produce item bias. It is the purpose of this article to provide the testing practitioner with insight about the difference between item bias and item impact and how they relate to item validity. These concepts will be explained from a multidimensional item response theory (MIRT) perspective. Two detection procedures, the Mantel-Haenszel (as modified by Holland and Thayer, 1988) and Shealy and Stout's Simultaneous Item Bias (SIB; 1991) strategies, will be used to illustrate how practitioners can detect item bias.

6.
Student growth percentiles (SGPs; Betebenner, 2009) are used to locate a student's current score in a conditional distribution based on the student's past scores. Currently, following Betebenner (2009), quantile regression (QR) is most often used operationally to estimate the SGPs. Alternatively, multidimensional item response theory (MIRT) may also be used to estimate SGPs, as proposed by Lockwood and Castellano (2015). A benefit of using MIRT to estimate SGPs is that techniques and methods already developed for MIRT may readily be applied to the specific context of SGP estimation and inference. This research adopts a MIRT framework to explore the reliability of SGPs. More specifically, we propose a straightforward method for estimating SGP reliability. In addition, we use this measure to study how SGP reliability is affected by two key factors: the correlation between prior and current latent achievement scores, and the number of prior years included in the SGP analysis. These issues are primarily explored via simulated data. In addition, the QR and MIRT approaches are compared in an empirical application.

7.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.

8.
To assess item dimensionality, the following two approaches are described and compared: hierarchical generalized linear model (HGLM) and multidimensional item response theory (MIRT) model. Two generating models are used to simulate dichotomous responses to a 17-item test: the unidimensional and compensatory two-dimensional (C2D) models. For C2D data, seven items are modeled to load on the first and second factors, θ1 and θ2, with the remaining 10 items modeled unidimensionally, emulating a mathematics test with seven items requiring an additional reading ability dimension. For both types of generated data, the multidimensionality of item responses is investigated using HGLM and MIRT. Comparison of the HGLM and MIRT results is possible through a transformation of items' difficulty estimates into probabilities of a correct response for a hypothetical examinee at the mean on θ1 and θ2. HGLM and MIRT performed similarly. The benefits of HGLM for item dimensionality analyses are discussed.
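
As a rough illustration of the compensatory two-dimensional model named in this abstract, the sketch below computes the probability of a correct response under a logistic slope-intercept form, P = 1 / (1 + exp(-(a1·θ1 + a2·θ2 + d))). The function name, the parameter values, and the choice of parameterization are assumptions made for illustration; the abstract does not state the exact form used in the study.

```python
import math

def c2d_probability(theta1, theta2, a1, a2, d):
    """Probability of a correct response under a compensatory
    two-dimensional logistic model (assumed slope-intercept form)."""
    logit = a1 * theta1 + a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical item that loads on both dimensions (values are made up).
a1, a2, d = 1.2, 0.8, -0.5

# Hypothetical examinee at the mean on both dimensions.
p_mean = c2d_probability(0.0, 0.0, a1, a2, d)

# Compensation: a deficit on theta1 is offset by a surplus on theta2,
# because only the weighted sum a1*theta1 + a2*theta2 enters the model.
p_comp = c2d_probability(-1.0, 1.5, a1, a2, d)
```

Here `p_mean` is the kind of quantity the abstract describes when it converts difficulty estimates into a probability for an examinee at the mean on both dimensions, and `p_comp` matches it because the two ability terms cancel in the weighted sum.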

9.
Lord's Wald test for differential item functioning (DIF) has not been studied extensively in the context of the multidimensional item response theory (MIRT) framework. In this article, Lord's Wald test was implemented using two estimation approaches, marginal maximum likelihood estimation and Bayesian Markov chain Monte Carlo estimation, to detect uniform and nonuniform DIF under MIRT models. The Type I error and power rates for Lord's Wald test were investigated under various simulation conditions, including different DIF types and magnitudes, different means and correlations of two ability parameters, and different sample sizes. Furthermore, English usage data were analyzed to illustrate the use of Lord's Wald test with the two estimation approaches.

10.
《教育实用测度》2013,26(3):193-211
A procedure for interpreting multiple-discrimination indices obtained from a multidimensional item-response theory (MIRT) analysis is described and demonstrated. The procedure consists of converting discrimination parameter estimates to direction cosines and cluster analyzing the angular distances between item vectors, grouping together items with similar orientations in the theta space. The procedure is suggested as an alternative to conventional item factor analysis for investigating issues related to test dimensionality within a single test form and between alternate forms of a test.
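
The first two steps described in this abstract can be sketched as follows: each item's discrimination vector is normalized to direction cosines, and the angle between two items is the arccosine of the dot product of their cosine vectors. The function names and the discrimination values are hypothetical, and the clustering step itself is left out; the resulting pairwise angles would be the input to whatever cluster analysis routine is chosen.

```python
import math

def direction_cosines(a):
    """Normalize a discrimination vector to unit length."""
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a]

def angular_distance_deg(a_i, a_j):
    """Angle in degrees between two item vectors in the theta space."""
    ci, cj = direction_cosines(a_i), direction_cosines(a_j)
    dot = sum(x * y for x, y in zip(ci, cj))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

# Hypothetical two-dimensional discrimination estimates (made-up values).
items = {
    "item1": [1.0, 0.0],  # measures only the first dimension
    "item2": [0.0, 1.0],  # measures only the second dimension
    "item3": [1.0, 1.0],  # measures both dimensions equally
}

angle_12 = angular_distance_deg(items["item1"], items["item2"])
angle_13 = angular_distance_deg(items["item1"], items["item3"])
```

Items with small pairwise angles point in similar directions and would be grouped together; in this made-up example, item3 sits halfway between item1 and item2.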

11.
Multidimensional item response theory (MIRT) provides an ideal foundation for modeling performance in complex domains, taking into account multiple basic abilities simultaneously, and representing different mixtures of the abilities required for different test items. This article provides a brief overview of different MIRT models, and the substantive implications of their differences for educational assessment. To illustrate the flexibility and benefits of MIRT, three application scenarios are described: to account for unintended multidimensionality when measuring a unidimensional construct, to model latent covariance structures between ability dimensions, and to model interactions of multiple abilities required for solving specific test items. All of these scenarios are illustrated by empirical examples. Finally, the implications of using MIRT models on educational processes are discussed.

12.
This study investigates the relationships among factor correlations, inter-item correlations, and the reliability estimates of subscores, providing a guideline with respect to psychometric properties of useful subscores. In addition, it compares subscore estimation methods with respect to reliability and distinctness. The subscore estimation methods explored in the current study include augmentation based on classical test theory and multidimensional item response theory (MIRT). The study shows that there is no estimation method that is optimal according to both criteria. Augmented subscores show the most improvement in reliability compared to observed subscores but are the least distinct.

13.
This research examined the effect of scoring items thought to be multidimensional using a unidimensional model and demonstrated the use of multidimensional item response theory (MIRT) as a diagnostic tool. Using real data from a large-scale mathematics test, previously shown to function differentially in favor of proficient writers, the difference in proficiency classifications was explored when a two- versus one-dimensional confirmatory model was fit. The estimate of ability obtained when using the unidimensional model was considered to represent general mathematical ability. Under the two-dimensional model, one of the two dimensions was also considered to represent general mathematical ability. The second dimension was considered to represent the ability to communicate in mathematics. The resulting pattern of mismatched proficiency classifications suggested that examinees found to have less mathematics communication ability were more likely to be placed in a lower general mathematics proficiency classification under the unidimensional than multidimensional model. Results and implications are discussed.

14.
This article defines and demonstrates a framework for studying differential item functioning (DIF) and differential test functioning (DTF) for tests that are intended to be multidimensional. The procedure introduced here is an extension of unidimensional differential functioning of items and tests (DFIT) recently developed by Raju, van der Linden, & Fleer (1995). To demonstrate the usefulness of these new indexes in a multidimensional IRT setting, two-dimensional data were simulated with known item parameters and known DIF and DTF. The DIF and DTF indexes were recovered reasonably well under various distributional differences of θs after multidimensional linking was applied to put the two sets of item parameters on a common scale. Further studies are suggested in the area of DIF/DTF for intentionally multidimensional tests.

15.
In applications of item response theory (IRT), fixed parameter calibration (FPC) has been used to estimate the item parameters of a new test form on the existing ability scale of an item pool. The present paper presents an application of FPC to multiple examinee groups test data that are linked to the item pool via anchor items, and investigates the performance of FPC relative to an alternative approach, namely independent 0–1 calibration and scale linking. Two designs for linking to the pool are proposed that involve multiple groups and test forms, for which multiple-group FPC can be effectively used. A real-data study shows that the multiple-group FPC method performs similarly to the alternative method in estimating ability distributions and new item parameters on the scale of the item pool. In addition, a simulation study shows that the multiple-group FPC method performs nearly equally to or better than the alternative method in recovering the underlying ability distributions and the new item parameters.

16.
17.
Many large-scale assessments are designed to yield two or more scores for an individual by administering multiple sections measuring different but related skills. Multidimensional tests such as these, or more specifically simple structured tests, rely on multiple sections of multiple-choice and/or constructed response items to generate multiple scores. In the current article, we propose an extension of the hierarchical rater model (HRM) to be applied with simple structured tests with constructed response items. In addition to modeling the appropriate trait structure, the multidimensional HRM (M-HRM) presented here also accounts for rater severity bias and rater variability or inconsistency. We introduce the model formulation, test parameter recovery with a focus on latent traits, and compare the M-HRM to other scoring approaches (unidimensional HRMs and a traditional multidimensional item response theory model) using simulated and empirical data. Results show more precise scores under the M-HRM, with a major improvement in scores when incorporating rater effects versus ignoring them in the traditional multidimensional item response theory model.

18.
Linking item parameters to a base scale (total citations: 1; self-citations: 0; citations by others: 1)
This paper compares three methods of item calibration (concurrent calibration, separate calibration with linking, and fixed item parameter calibration) that are frequently used for linking item parameters to a base scale. Concurrent and separate calibrations were implemented using BILOG-MG. The Stocking and Lord (Appl Psychol Measure 7:201-210, 1983) characteristic curve method of parameter linking was used in conjunction with separate calibration. The fixed item parameter calibration (FIPC) method was implemented using both BILOG-MG and PARSCALE because the method is carried out differently by the two programs. Both programs use multiple EM cycles, but BILOG-MG does not update the prior ability distribution during FIPC calibration, whereas PARSCALE updates the prior ability distribution multiple times. The methods were compared using simulations based on actual testing program data, and results were evaluated in terms of recovery of the underlying ability distributions, the item characteristic curves, and the test characteristic curves. Factors manipulated in the simulations were sample size, ability distributions, and numbers of common (or fixed) items. The results for concurrent calibration and separate calibration with linking were comparable, and both methods showed good recovery results for all conditions. Between the two fixed item parameter calibration procedures, only the appropriate use of PARSCALE consistently provided item parameter linking results similar to those of the other two methods.
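
To make the characteristic curve idea concrete, the sketch below evaluates a Stocking-Lord style criterion: the new-form parameters of the common items are rescaled with a slope A and intercept B (a/A and A·b + B), and the squared difference between the base-scale and rescaled test characteristic curves is summed over a grid of ability values. Everything here is an assumption for illustration: the 2PL model without a scaling constant, the made-up item parameters, the function names, and in particular the coarse grid search, which stands in for the numerical optimizer a real implementation would use. It is not a reproduction of what BILOG-MG or the study did.

```python
import math

def p_2pl(theta, a, b):
    """2PL logistic item characteristic curve (assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: sum of item probabilities at theta."""
    return sum(p_2pl(theta, a, b) for a, b in items)

def stocking_lord_criterion(A, B, base_items, new_items, thetas):
    """Sum of squared TCC differences after rescaling new-form parameters."""
    rescaled = [(a / A, A * b + B) for a, b in new_items]
    return sum((tcc(t, base_items) - tcc(t, rescaled)) ** 2 for t in thetas)

# Hypothetical common items on the base scale: (discrimination, difficulty).
base_items = [(1.0, -1.0), (1.5, 0.0), (0.8, 0.5), (1.2, 1.0)]

# Simulate the same items expressed on a new scale with known constants.
true_A, true_B = 1.2, 0.5
new_items = [(a * true_A, (b - true_B) / true_A) for a, b in base_items]

thetas = [q / 10 for q in range(-40, 41)]  # ability grid from -4.0 to 4.0

# Coarse grid search over candidate (A, B) pairs, used only for transparency.
grid = [(k / 10, m / 10) for k in range(8, 17) for m in range(-10, 11)]
best_A, best_B = min(
    grid,
    key=lambda ab: stocking_lord_criterion(ab[0], ab[1], base_items, new_items, thetas),
)
```

Because the simulated new-form parameters are an exact rescaling of the base parameters, the criterion is essentially zero at the generating constants, and the grid search returns them. With estimated parameters the minimum would be positive and would generally fall between grid points, which is why a continuous optimizer is the usual choice.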

19.
Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. Three scaling procedures are considered: (a) concurrent calibration, (b) separate calibration with one linking, and (c) separate calibration with three sequential linkings. Evaluation across varying sample sizes and item pool sizes suggests that calibrating an item pool simultaneously results in the most stable scaling. The separate calibration with linking procedures produced larger scaling errors as the number of linking steps increased. Haebara's item characteristic curve linking method performed better than the test characteristic curve (TCC) linking method. The present article provides an analytic illustration that the test characteristic curve method may fail to find global solutions for polytomous items. Finally, comparison of the single- and mixed-format item pools suggests that the use of polytomous items as the anchor can improve the overall scaling accuracy of the item pools.

20.
In educational assessment, overall scores obtained by simply averaging a number of domain scores are sometimes reported. However, simply averaging the domain scores ignores the fact that different domains have different score points, that scores from those domains are related, and that at different score points the relationship between overall score and domain score may be different. To report reliable and valid overall scores and domain scores, I investigated the performance of four methods using both real and simulation data: (a) the unidimensional IRT model; (b) the higher-order IRT model, which simultaneously estimates the overall ability and domain abilities; (c) the multidimensional IRT (MIRT) model, which estimates domain abilities and uses the maximum information method to obtain the overall ability; and (d) the bifactor general model. My findings suggest that the MIRT model not only provides reliable domain scores, but also produces reliable overall scores. The overall score from the MIRT maximum information method has the smallest standard error of measurement. In addition, unlike the other models, there is no linear relationship assumed between overall score and domain scores. Recommendations for sizes of correlations between domains and the number of items needed for reporting purposes are provided.
