Similar Articles
1.
This study explores classification consistency and accuracy for mixed-format tests using real and simulated data. In particular, the current study compares six methods of estimating classification consistency and accuracy for seven mixed-format tests. The relative performance of the estimation methods is evaluated using simulated data. Study results from real data analysis showed that the procedures exhibited similar patterns across various exams, but some tended to produce lower estimates of classification consistency and accuracy than others. As data became more multidimensional, unidimensional and multidimensional item response theory (IRT) methods tended to produce different results, with the unidimensional approach yielding lower estimates than the multidimensional approach. Results from simulated data analysis demonstrated smaller estimation error for the multidimensional IRT methods than for the unidimensional IRT method. The unidimensional approach yielded larger error as tests became more multidimensional, whereas a reverse relationship was observed for the multidimensional IRT approach. Among the non-IRT approaches, the normal approximation and Livingston-Lewis methods performed well, whereas the compound multinomial method tended to produce relatively larger error.

2.
Multiple measures, such as multiple content domains or multiple types of performance, are used in various testing programs to classify examinees for screening or selection. Despite the popular use of multiple measures, there is little research on the classification consistency and accuracy of multiple measures. Accordingly, this study introduces an approach to estimate classification consistency and accuracy indices for multiple measures under four possible decision rules: (1) complementary, (2) conjunctive, (3) compensatory, and (4) pairwise combinations of the three. The current study uses the IRT-recursive-based approach with the simple-structure multidimensional IRT model (SS-MIRT) to estimate the classification consistency and accuracy for multiple measures. Theoretical formulations of the four decision rules with a binary decision (Pass/Fail) are presented. The estimation procedures are illustrated using an empirical data example based on SS-MIRT. In addition, this study applies the estimation procedures to the unidimensional IRT (UIRT) context, given that UIRT is more widely used in practice. This application shows that the proposed classification consistency and accuracy procedure can be used with a UIRT model for individual measures as an alternative to SS-MIRT.
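For a binary Pass/Fail decision, the first three rules reduce to simple operations on the per-measure decisions, and single-administration consistency at a fixed ability is p² + (1 − p)² for overall pass probability p. Below is a minimal Monte Carlo sketch of the three basic rules for one examinee; the item probabilities, cuts, and function names are illustrative, not the paper's recursive SS-MIRT procedure.

```python
import numpy as np

rng = np.random.default_rng(7)

def pass_probability(item_probs, cuts, rule, n_rep=20_000):
    """Monte Carlo estimate of P(overall Pass | theta) for one examinee.

    item_probs : list with one array per measure, giving the conditional
                 P(correct | theta) for that measure's dichotomous items.
    cuts       : raw-score cut for each measure; the compensatory rule
                 compares the total score with the summed cuts instead.
    """
    totals = np.column_stack([
        rng.binomial(1, np.broadcast_to(np.asarray(p), (n_rep, len(p)))).sum(axis=1)
        for p in item_probs
    ])
    passed = totals >= np.asarray(cuts)           # per-measure Pass/Fail
    if rule == "complementary":                   # pass at least one measure
        return passed.any(axis=1).mean()
    if rule == "conjunctive":                     # pass every measure
        return passed.all(axis=1).mean()
    if rule == "compensatory":                    # overall total clears one cut
        return (totals.sum(axis=1) >= sum(cuts)).mean()
    raise ValueError(rule)

# Two 10-item measures at one ability level; for a binary decision the
# single-administration consistency is p^2 + (1 - p)^2
item_probs = [np.full(10, 0.75), np.full(10, 0.65)]
for rule in ("complementary", "conjunctive", "compensatory"):
    p = pass_probability(item_probs, cuts=(6, 6), rule=rule)
    print(f"{rule:13s} P(pass)={p:.3f} consistency={p**2 + (1 - p)**2:.3f}")
```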

3.
An Extension of Four IRT Linking Methods for Mixed-Format Tests
Under item response theory (IRT), linking proficiency scales from separate calibrations of multiple forms of a test to achieve a common scale is required in many applications. Four IRT linking methods including the mean/mean, mean/sigma, Haebara, and Stocking-Lord methods have been presented for use with single-format tests. This study extends the four linking methods to a mixture of unidimensional IRT models for mixed-format tests. Each linking method extended is intended to handle mixed-format tests using any mixture of the following five IRT models: the three-parameter logistic, graded response, generalized partial credit, nominal response (NR), and multiple-choice (MC) models. A simulation study is conducted to investigate the performance of the four linking methods extended to mixed-format tests. Overall, the Haebara and Stocking-Lord methods yield more accurate linking results than the mean/mean and mean/sigma methods. When the NR model or the MC model is used to analyze data from mixed-format tests, limitations of the mean/mean, mean/sigma, and Stocking-Lord methods are described.
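For intuition, here is a compact sketch of the Stocking-Lord criterion for dichotomous 3PL common items: find the slope A and intercept B of θ* = Aθ + B that minimize the squared distance between the two forms' test characteristic curves over a quadrature grid. The paper's extension adds the expected-score functions of the polytomous models to the same sum; the parameters below are synthetic, and uniform quadrature weights are an assumption.

```python
import numpy as np
from scipy.optimize import minimize

def p3pl(theta, a, b, c):
    """3PL response probabilities on a theta grid (grid x items)."""
    return c + (1 - c) / (1 + np.exp(-1.7 * a * (theta[:, None] - b)))

def stocking_lord_loss(AB, theta, old, new):
    """Distance between the test characteristic curves of the common
    items after placing the new form on the old scale via
    theta* = A*theta + B, i.e. a -> a/A and b -> A*b + B."""
    A, B = AB
    tcc_old = p3pl(theta, *old).sum(axis=1)
    tcc_new = p3pl(theta, new[0] / A, A * new[1] + B, new[2]).sum(axis=1)
    return np.mean((tcc_old - tcc_new) ** 2)     # uniform quadrature weights

rng = np.random.default_rng(0)
n = 20
a = rng.uniform(0.8, 2.0, n); b = rng.normal(0, 1, n); c = rng.uniform(0.1, 0.25, n)
A_true, B_true = 1.2, 0.5                        # scale difference to recover
old = (a, b, c)
new = (a * A_true, (b - B_true) / A_true, c)     # same items on the other scale

grid = np.linspace(-4, 4, 41)
res = minimize(stocking_lord_loss, x0=[1.0, 0.0], args=(grid, old, new))
print(res.x)                                     # approximately [1.2, 0.5]
```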

4.
The main purposes of this study were to develop bi-factor multidimensional item response theory (BF-MIRT) observed-score equating procedures for mixed-format tests and to investigate the relative appropriateness of the proposed procedures. Using data from a large-scale testing program, three types of pseudo data sets were formulated: matched samples, pseudo forms, and simulated data sets. Very minor within-format residual dependence was found in mixed-format tests after controlling for the influence of the primary general factor. The unidimensional IRT and BF-MIRT equating methods produced similar equating results for the data used in this study. When a BF-MIRT model is implemented, we recommend the use of observed-score equating instead of true-score equating, because the latter requires an arbitrary approximation or reduction process to relate true scores on test forms.

5.
In educational assessment, overall scores obtained by simply averaging a number of domain scores are sometimes reported. However, simply averaging the domain scores ignores the fact that different domains have different score points, that scores from those domains are related, and that at different score points the relationship between overall score and domain score may be different. To report reliable and valid overall scores and domain scores, I investigated the performance of four methods using both real and simulated data: (a) the unidimensional IRT model; (b) the higher-order IRT model, which simultaneously estimates the overall ability and domain abilities; (c) the multidimensional IRT (MIRT) model, which estimates domain abilities and uses the maximum information method to obtain the overall ability; and (d) the bifactor general model. My findings suggest that the MIRT model not only provides reliable domain scores, but also produces reliable overall scores. The overall score from the MIRT maximum information method has the smallest standard error of measurement. In addition, unlike the other models, no linear relationship is assumed between the overall score and the domain scores. Recommendations for sizes of correlations between domains and the number of items needed for reporting purposes are provided.

6.
With a focus on performance assessments, this paper describes procedures for calculating conditional standard error of measurement (CSEM) and reliability of scale scores and classification consistency of performance levels. Scale scores that are transformations of total raw scores are the focus of these procedures, although other types of raw scores are considered as well. Polytomous IRT models provide the psychometric foundation for the procedures that are described. The procedures are applied using test data from ACT's Work Keys Writing Assessment to demonstrate their usefulness. Two polytomous IRT models were compared, as were two different procedures for calculating scores. One simulation study was done using one of the models to evaluate the accuracy of the proposed procedures. The results suggest that the procedures provide quite stable estimates and have the potential to be useful in a variety of performance assessment situations.

7.
Numerous assessments contain a mixture of multiple choice (MC) and constructed response (CR) item types and many have been found to measure more than one trait. Thus, there is a need for multidimensional dichotomous and polytomous item response theory (IRT) modeling solutions, including multidimensional linking software. For example, multidimensional item response theory (MIRT) may have a promising future in subscale score proficiency estimation, leading toward a more diagnostic orientation, which requires the linking of these subscale scores across different forms and populations. Several multidimensional linking studies can be found in the literature; however, none have used a combination of MC and CR item types. Thus, this research explores multidimensional linking accuracy for tests composed of both MC and CR items using a matching test characteristic/response function approach. The two-dimensional simulation study presented here used real data-derived parameters from a large-scale statewide assessment with two subscale scores for diagnostic profiling purposes, under varying conditions of anchor set lengths (6, 8, 16, 32, 60), across 10 population distributions, with a mixture of simple versus complex structured items, using a sample size of 3,000. It was found that for a well-chosen anchor set, the parameters were recovered well after equating across all populations, even for anchor sets composed of as few as six items.

8.
Student growth percentiles (SGPs, Betebenner, 2009) are used to locate a student's current score in a conditional distribution based on the student's past scores. Currently, following Betebenner (2009), quantile regression (QR) is most often used operationally to estimate the SGPs. Alternatively, multidimensional item response theory (MIRT) may also be used to estimate SGPs, as proposed by Lockwood and Castellano (2015). A benefit of using MIRT to estimate SGPs is that techniques and methods already developed for MIRT may readily be applied to the specific context of SGP estimation and inference. This research adopts a MIRT framework to explore the reliability of SGPs. More specifically, we propose a straightforward method for estimating SGP reliability. In addition, we use this measure to study how SGP reliability is affected by two key factors: the correlation between prior and current latent achievement scores, and the number of prior years included in the SGP analysis. These issues are primarily explored via simulated data. In addition, the QR and MIRT approaches are compared in an empirical application.
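A bare-bones sketch of the QR approach to SGPs, assuming a single prior-year score and a linear conditional quantile function (operational SGPs use B-spline bases of one or more prior scores); statsmodels' QuantReg and the simulated scores stand in for the operational setup.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
prior = rng.normal(size=n)                       # prior-year scale score
current = 0.7 * prior + rng.normal(scale=0.7, size=n)

X = sm.add_constant(prior)
taus = np.arange(0.01, 1.00, 0.01)               # 1st..99th percentile curves
fitted = np.column_stack([
    sm.QuantReg(current, X).fit(q=t).predict(X) for t in taus
])

# SGP = highest percentile whose fitted conditional quantile lies at or
# below the student's observed current score
sgp = (fitted <= current[:, None]).sum(axis=1)
print(sgp[:5])
```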

9.
When cut scores for classifications occur on the total score scale, popular methods for estimating classification accuracy (CA) and classification consistency (CC) require assumptions about a parametric form of the test scores or about a parametric response model, such as item response theory (IRT). This article develops an approach to estimate CA and CC nonparametrically by replacing the role of the parametric IRT model in Lee's classification indices with a modified version of Ramsay's kernel‐smoothed item response functions. The performance of the nonparametric CA and CC indices is tested in simulation studies under various conditions with different generating IRT models, test lengths, and ability distributions. The nonparametric approach to CA often outperforms Lee's method and Livingston and Lewis's method, showing robustness to nonnormality in the simulated ability. The nonparametric CC index performs similarly to Lee's method and outperforms Livingston and Lewis's method when the ability distributions are nonnormal.
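The building block of the nonparametric approach is the kernel-smoothed item response function, a Nadaraya-Watson regression of the item score on an ability estimate; the smoothed curves then replace the parametric item response functions inside Lee's recursive indices. A minimal sketch with a Gaussian kernel follows; the bandwidth and simulated data are illustrative, and the article's modified version of Ramsay's smoother differs in detail.

```python
import numpy as np

def kernel_irf(theta_grid, theta_hat, item_scores, h=0.3):
    """Ramsay-style kernel-smoothed item response function: a
    Nadaraya-Watson regression of the 0/1 item score on the examinees'
    ability estimates, evaluated on a grid of theta values."""
    z = (theta_grid[:, None] - theta_hat[None, :]) / h
    w = np.exp(-0.5 * z**2)                       # Gaussian kernel weights
    return (w * item_scores).sum(axis=1) / w.sum(axis=1)

# Toy check: the smoother recovers a 2PL-like curve from simulated data
rng = np.random.default_rng(2)
theta = rng.normal(size=5000)
x = rng.binomial(1, 1 / (1 + np.exp(-1.2 * (theta - 0.3))))
print(kernel_irf(np.linspace(-2, 2, 9), theta, x).round(2))
```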

10.
Investigating the fit of a parametric model plays a vital role in validating an item response theory (IRT) model. An area that has received little attention is the assessment of multiple IRT models used in a mixed-format test. The present study extends the nonparametric approach, proposed by Douglas and Cohen (2001), to assess model fit of three IRT models (the three- and two-parameter logistic models and the generalized partial credit model) used in a mixed-format test. The statistical properties of the proposed fit statistic were examined and compared to S-X² and PARSCALE's G². Overall, RISE (root integrated squared error) outperformed the other two fit statistics under the studied conditions, in that the Type I error rate was not inflated and the power was acceptable. A further advantage of the nonparametric approach is that it provides a convenient graphical inspection of the misfit.
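RISE has the general form of a density-weighted root integrated squared difference between the parametric item response function and its kernel-smoothed counterpart; a short sketch under that reading (the grid, weights, and curves are synthetic, and the resampling scheme used to obtain the statistic's null distribution is omitted):

```python
import numpy as np

def rise(p_model, p_smooth, density):
    """Density-weighted root integrated squared error between a fitted
    parametric IRF and its kernel-smoothed counterpart on a grid."""
    w = density / density.sum()
    return np.sqrt(np.sum(w * (p_model - p_smooth) ** 2))

grid = np.linspace(-3, 3, 61)
dens = np.exp(-0.5 * grid**2)                    # standard normal weights
p_fit = 1 / (1 + np.exp(-(grid - 0.2)))          # parametric curve
p_np = p_fit + 0.02 * np.sin(grid)               # stand-in smoothed curve
print(round(float(rise(p_fit, p_np, dens)), 4))
```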

11.
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests.

Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories.

Main argument and conclusion: Classification accuracy is defined in terms of a ‘true score’ specified in a psychometric model. Three things must be known or hypothesised in order to derive a value for classification accuracy: (1) a psychometric model relating observed scores to true scores; (2) the location of the cut-scores on the score scale; and (3) the distribution of true scores in the group of test-takers.
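The two-category ordering is easy to confirm by simulation under a simple normal true-score, normal measurement-error model (an illustration of the article's analytic argument, not its method; the cuts and error SD are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

def acc_and_cons(cuts, n=200_000, sem=0.5):
    """Monte Carlo classification accuracy (one administration vs. the
    true category) and consistency (agreement of two administrations),
    with normal true scores and normally distributed measurement error."""
    true = rng.normal(size=n)
    obs1 = true + rng.normal(scale=sem, size=n)
    obs2 = true + rng.normal(scale=sem, size=n)
    cat = lambda s: np.searchsorted(cuts, s)     # category index per score
    return (cat(obs1) == cat(true)).mean(), (cat(obs1) == cat(obs2)).mean()

print("pass/fail (acc, cons):", acc_and_cons(cuts=[0.0]))
print("four categories      :", acc_and_cons(cuts=[-1.0, 0.0, 1.0]))
```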

12.
Many educational and psychological tests are inherently multidimensional, meaning these tests measure two or more dimensions or constructs. The purpose of this module is to illustrate how test practitioners and researchers can apply multidimensional item response theory (MIRT) to better understand what their tests are measuring, how accurately the different composites of ability are being assessed, and how this information can be cycled back into the test development process. Procedures for conducting MIRT analyses are described from both a theoretical and a substantive basis: obtaining evidence that the test is multidimensional, modeling the test as multidimensional, and illustrating the properties of multidimensional items graphically. This module also illustrates these procedures using data from a ninth-grade mathematics achievement test. It concludes with a discussion of future directions in MIRT research.

13.
Although the reliability of subscale scores may be suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared and found to improve the precision of the subscale scores. However, the augmented subscale scores were found to be more highly correlated and less variable than unaugmented scores. The meaningfulness of reporting such augmented scores, as well as the implications for validity and test development, are discussed.
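For orientation, Wainer et al.'s augmentation is essentially a multivariate Kelley regression: each observed subscore is shrunk toward the mean, borrowing strength from correlated subscales through a true-score covariance matrix whose diagonal is deflated by the subscale reliabilities. The sketch below follows that standard formulation with synthetic data; it is not the study's code, and the reliabilities and covariances are illustrative.

```python
import numpy as np

def augmented_subscores(X, reliabilities):
    """Multivariate Kelley-type augmentation: regress true subscores on
    all observed subscores, using an estimated true-score covariance
    whose diagonal is deflated by each subscale's reliability."""
    mu = X.mean(axis=0)
    S_obs = np.cov(X, rowvar=False)
    S_true = S_obs.copy()
    np.fill_diagonal(S_true, np.asarray(reliabilities) * np.diag(S_obs))
    B = S_true @ np.linalg.inv(S_obs)             # regression weight matrix
    return mu + (X - mu) @ B.T

rng = np.random.default_rng(4)
true = rng.multivariate_normal([10, 10, 10], 4 + 2 * np.eye(3), size=500)
X = true + rng.normal(scale=1.5, size=true.shape)
X_aug = augmented_subscores(X, reliabilities=[0.7, 0.7, 0.7])
print(np.corrcoef(X, rowvar=False).round(2))      # observed correlations
print(np.corrcoef(X_aug, rowvar=False).round(2))  # augmented: noticeably higher
```

The second correlation matrix reproduces, in miniature, the study's caveat: augmentation buys precision at the cost of more homogeneous subscores.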

14.
In this article, procedures are described for estimating single-administration classification consistency and accuracy indices for complex assessments using item response theory (IRT). This IRT approach was applied to real test data comprising dichotomous and polytomous items. Several different IRT model combinations were considered. Comparisons were also made between the IRT approach and two non-IRT approaches including the Livingston-Lewis and compound multinomial procedures. Results for various IRT model combinations were not substantially different. The estimated classification consistency and accuracy indices for the non-IRT procedures were almost always lower than those for the IRT procedures.
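The core of the single-administration IRT approach is the Lord-Wingersky recursion, which yields each examinee's conditional raw-score distribution; consistency and accuracy follow by squaring and matching category probabilities, then averaging over the ability distribution. A dichotomous-item sketch for a single cut score (polytomous items would convolve full category-probability vectors; the item parameters are made up):

```python
import numpy as np

def score_dist(p_correct):
    """Lord-Wingersky recursion: distribution of the number-correct
    score for one examinee, given conditional probabilities of a
    correct response on each dichotomous item."""
    dist = np.array([1.0])
    for p in p_correct:
        dist = np.convolve(dist, [1 - p, p])
    return dist                                   # length n_items + 1

def cc_ca_at_theta(p_correct, cut, true_score):
    """Conditional indices for a Pass/Fail cut on the raw score:
    consistency is the chance two administrations land on the same
    side; accuracy the chance one administration matches the true
    category.  Marginal indices average these over the theta density."""
    p_pass = score_dist(p_correct)[cut:].sum()
    consistency = p_pass**2 + (1 - p_pass) ** 2
    accuracy = p_pass if true_score >= cut else 1 - p_pass
    return consistency, accuracy

# Ten items at theta = 0.2 under a 2PL-like model (made-up parameters)
a = np.linspace(0.8, 1.6, 10); b = np.linspace(-1.5, 1.5, 10)
p = 1 / (1 + np.exp(-a * (0.2 - b)))
print(cc_ca_at_theta(p, cut=6, true_score=p.sum()))
```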

15.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets, and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): the item score approach (ISA), the testlet score approach (TSA), and the item-nested-testlet approach (INTA). The magnitude of overestimation when applying item-based methods ranged from 0.02 to 0.06 and was related to the degree of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA because of the loss of information with IRT approaches; this did not hold in GT. For the same approach, the IRT methods produced higher reliability estimates than the GT methods. Relatively smaller errors in reliability estimates were observed for ISA and for the IRT methods. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is relatively large dependence among within-testlet items, INTA should be preferred for IRT because of the nonnegligible loss of information.
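The item-based overestimation is easy to reproduce: compute coefficient alpha once over item scores and once over testlet scores on data that contain a shared within-testlet effect. A small simulation sketch (the test layout and the testlet effect size of 0.8 are arbitrary choices, not the study's design):

```python
import numpy as np

def cronbach_alpha(parts):
    """Coefficient alpha over the columns (parts) of a score matrix."""
    k = parts.shape[1]
    return k / (k - 1) * (1 - parts.var(axis=0, ddof=1).sum()
                          / parts.sum(axis=1).var(ddof=1))

# 1000 examinees, 5 testlets x 6 items, with a shared testlet effect
# inducing local dependence among within-testlet items
rng = np.random.default_rng(5)
n, t, j = 1000, 5, 6
theta = rng.normal(size=(n, 1, 1))
testlet = rng.normal(scale=0.8, size=(n, t, 1))
items = (theta + testlet + rng.normal(size=(n, t, j)) > 0).astype(float)

alpha_items = cronbach_alpha(items.reshape(n, t * j))   # item score approach
alpha_testlets = cronbach_alpha(items.sum(axis=2))      # testlet score approach
print(round(alpha_items, 3), round(alpha_testlets, 3))  # item-based is higher
```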

16.
Most researchers agree that psychological/educational tests are sensitive to multiple traits, implying the need for a multidimensional item response theory (MIRT). One limitation of applying a MIRT in practice is the difficulty in establishing equivalent scales of multiple traits. In this study, a new MIRT linking method was proposed and evaluated by comparison with two existing methods. The results showed that the new method was more acceptable in transforming item parameters and maintaining dimensional structures. Limitations and cautions in using multidimensional linking techniques were also discussed.

17.
Standard 3.9 of the Standards for Educational and Psychological Testing (1999) demands evidence of model fit when item response theory (IRT) models are applied to test data. Hambleton and Han (2005) and Sinharay (2005) recommended assessing the practical significance of misfit of IRT models, but few examples of such assessment can be found in the literature on IRT model fit. In this article, the practical significance of misfit of IRT models was assessed using data from several tests that employ IRT models to report scores. The IRT model did not fit any data set considered in this article; however, the extent of practical significance of the misfit varied over the data sets.

18.
A mixed-effects item response theory (IRT) model is presented as a logical extension of the generalized linear mixed-effects modeling approach to formulating explanatory IRT models. Fixed and random coefficients in the extended model are estimated using a Metropolis-Hastings Robbins-Monro (MH-RM) stochastic imputation algorithm to accommodate the increased dimensionality due to modeling multiple design- and trait-based random effects. As a consequence of using this algorithm, more flexible explanatory IRT models, such as the multidimensional four-parameter logistic model, are easily organized and efficiently estimated for unidimensional and multidimensional tests. Rasch versions of the linear latent trait and latent regression models, along with their extensions, are presented and discussed; Monte Carlo simulations are conducted to determine the efficiency of parameter recovery of the MH-RM algorithm; and an empirical example using the extended mixed-effects IRT model is presented.

19.
Reliability and validity are the key indicators of the quality of a measurement instrument, and research on reliability and validity in educational cognitive diagnostic assessment has drawn growing attention in recent years. Reliability coefficients for diagnostic tests derive mainly from attribute reliability coefficients based on coefficient α, empirical attribute reliability coefficients, tetrachoric correlation coefficients, simulated test-retest consistency, and classification consistency indices; validity coefficients mainly include the simulated classification hit rate, classification accuracy, and theoretical construct validity. Research on the reliability and validity of educational cognitive diagnostic assessment is still young: limitations remain, comprehensive comparative studies are scarce, and a systematic evaluation framework is lacking.

20.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small, except for the mean/sigma method. The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is compared to chained equipercentile equating; the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating, with the exception of low scores.
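For reference, the equipercentile equating function maps a form-X score to the form-Y score with the same percentile rank, e_Y(x) = Q_Y(F_X(x)); a chained equating composes two such links through the anchor test (X to anchor, anchor to Y). A bare-bones sketch without the presmoothing used operationally, on synthetic number-correct scores:

```python
import numpy as np

def equipercentile(x_scores, y_scores, x):
    """Equipercentile equating of score x from form X onto the form-Y
    scale: the form-Y score with the same percentile rank."""
    p = (x_scores <= x).mean()                   # percentile rank on form X
    return np.quantile(y_scores, p)

rng = np.random.default_rng(6)
form_x = rng.binomial(40, 0.55, size=5000)       # synthetic number-correct scores
form_y = rng.binomial(40, 0.60, size=5000)       # form Y runs slightly easier
print([float(np.round(equipercentile(form_x, form_y, s), 1)) for s in (15, 20, 25)])
```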
