Similar Documents
20 similar documents found.
1.
School climate surveys are widely applied in school districts across the nation to collect information about teacher efficacy, principal leadership, school safety, students' activities, and so forth. They enable school administrators to understand and address many issues on campus when used in conjunction with other student and staff data. However, each district currently develops its questionnaire according to its own needs and rarely provides supporting evidence for the reliability of the items in the scale, that is, whether an individual item contributes significant information to the questionnaire. Item Response Theory (IRT) is a useful tool for examining how much information each item and the whole scale can provide. Our study applied IRT to examine individual items in a school climate survey and assessed the efficiency of the survey after the removal of items that contributed little to the scale. The purpose of this study is to show how IRT can be applied to empirically validate school climate surveys.
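As a concrete illustration of the kind of analysis described above, the sketch below computes the Fisher information of a single survey item under a 2PL IRT model. The item parameters are hypothetical, chosen only to contrast a highly discriminating item with one that contributes little information and would therefore be a candidate for removal.

```python
# A minimal sketch, assuming a 2PL model; parameter values are illustrative.
import numpy as np

def p_2pl(theta, a, b):
    """Probability of endorsing a 2PL item at ability theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = p_2pl(theta, a, b)
    return a**2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 61)
info_strong = item_information_2pl(theta, a=1.8, b=0.0)  # discriminating item
info_weak = item_information_2pl(theta, a=0.4, b=0.0)    # weakly informative item
print(info_strong.max(), info_weak.max())
```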

2.
Using Muraki's (1992) generalized partial credit IRT model, polytomous items (responses to which can be scored as ordered categories) from the 1991 field test of the NAEP Reading Assessment were calibrated simultaneously with multiple-choice and short open-ended items. Expected information of each type of item was computed. On average, four-category polytomous items yielded 2.1 to 3.1 times as much IRT information as dichotomous items. These results provide limited support for the ad hoc rule of weighting k-category polytomous items the same as k - 1 dichotomous items for computing total scores. Polytomous items provided the most information about examinees of moderately high proficiency; the information function peaked at 1.0 to 1.5, and the population distribution mean was 0. When scored dichotomously, information in polytomous items sharply decreased, but they still provided more expected information than did the other response formats. For reference, a derivation of the information function for the generalized partial credit model is included.
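For orientation, the information function mentioned in the last sentence has a standard closed form, I(θ) = a²·Var(K | θ), where K is the scored category; the sketch below evaluates it for a hypothetical four-category item. This is the textbook expression, not the paper's derivation, and the parameter values are made up.

```python
# A hedged sketch of the generalized partial credit model (GPCM);
# the discrimination a and step difficulties b are illustrative values.
import numpy as np

def gpcm_probs(theta, a, b):
    """Category probabilities for steps b_1..b_m (category 0 is the base)."""
    steps = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b)))))
    z = np.exp(steps - steps.max())   # numerically stable softmax
    return z / z.sum()

def gpcm_information(theta, a, b):
    """I(theta) = a^2 * Var(K | theta) for the scored category K."""
    p = gpcm_probs(theta, a, b)
    k = np.arange(len(p))
    return a**2 * (np.sum(k**2 * p) - np.sum(k * p)**2)

for th in (-1.0, 0.0, 1.25):   # evaluate information at a few ability levels
    print(th, gpcm_information(th, a=1.2, b=[-0.5, 0.2, 0.9]))
```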

3.
Pan Hao (潘浩). Examination Research (考试研究), 2014(2): 59-63
Early unidimensional IRT models ignored the possibility that a test is multidimensional, while multidimensional IRT models do not partition the dimensions clearly enough to capture what the ability on each dimension means. Higher-order IRT models acknowledge the multidimensionality of a test and define dimensions by subtest, while at the same time unifying the abilities on the separate dimensions into a single higher-order ability. They can therefore describe an examinee's ability on each dimension while also providing an overall ability estimate, which better reflects reality and meets the needs of large-scale testing.
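In schematic form (notation assumed here, not taken from the article), the higher-order structure can be written as follows: each subtest ability loads on a single general ability, while items are modeled within their own dimension.

```latex
% Assumed notation: \theta_{jd} is examinee j's ability on dimension d,
% \theta_j^{(g)} the higher-order general ability, \lambda_d the loading.
\theta_{jd} = \lambda_d\,\theta_j^{(g)} + \varepsilon_{jd},
\qquad \varepsilon_{jd} \sim N\!\left(0,\; 1-\lambda_d^{2}\right),
\qquad P\!\left(X_{ij}=1 \mid \theta_{jd}\right)
      = \frac{1}{1+\exp\{-a_i(\theta_{jd}-b_i)\}}
```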

4.
Although reliability of subscale scores may be suspect, subscale scores are the most common type of diagnostic information included in student score reports. This research compared methods for augmenting the reliability of subscale scores for an 8th-grade mathematics assessment. Yen's Objective Performance Index, Wainer et al.'s augmented scores, and scores based on multidimensional item response theory (IRT) models were compared and found to improve the precision of the subscale scores. However, the augmented subscale scores were found to be more highly correlated and less variable than unaugmented scores. The meaningfulness of reporting such augmented scores as well as the implications for validity and test development are discussed.
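The simplest version of the augmentation idea is Kelley's formula, which regresses each observed subscore toward the group mean in proportion to its reliability; Wainer et al.'s method generalizes this to borrow strength from all subscales through their covariance matrix. A minimal univariate sketch with hypothetical numbers:

```python
# Kelley-type augmentation; the reliability and means are illustrative only,
# and the multivariate Wainer et al. estimator is not shown here.
def kelley_augmented(observed, reliability, group_mean):
    """Shrink an observed subscore toward the group mean by its reliability."""
    return reliability * observed + (1.0 - reliability) * group_mean

print(kelley_augmented(observed=18.0, reliability=0.55, group_mean=14.0))  # 16.2
```

The shrinkage also explains the finding above: augmented scores move toward the mean, so they are less variable and more intercorrelated than the raw subscores.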

5.
Componential IRT models for polytomous items are of particular interest in two contexts: componential research and test development. We assume that there are basic components, such as processes and knowledge structures, involved in solving cognitive tasks. In componential research, the subtask paradigm may be used to isolate such components in subtasks. In test development, items may be composed such that their response alternatives correspond with specific combinations of such components. In both cases the data may be modeled as polytomous items. With Bock's (1972) nominal model as a general framework, transformation matrices can be used to constrain the parameters of the response categories so as to reflect the componential design of the response categories. In this way, both main effects and interaction effects of components can be studied. An application to a spelling task demonstrates this approach.
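A minimal sketch of the constraint mechanism (notation assumed, not the authors' code): under Bock's nominal model, category probabilities are a softmax of a_kθ + c_k, and a transformation matrix T maps component effects onto the category parameters, a = Tα and c = Tβ.

```python
# Hypothetical 2x2 componential design: four response categories coded by
# two binary components, so the main effect of each component is estimable.
import numpy as np

def nominal_probs(theta, T, alpha, beta):
    a = T @ alpha               # category slopes from component effects
    c = T @ beta                # category intercepts from component effects
    z = a * theta + c
    z = np.exp(z - z.max())     # numerically stable softmax
    return z / z.sum()

T = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
print(nominal_probs(theta=0.5, T=T, alpha=np.array([1.0, 0.6]),
                    beta=np.array([-0.2, 0.3])))
```

Adding a column to T for the product of the two components would capture their interaction effect.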

6.
Many innovative item formats have been proposed over the past decade, but little empirical research has been conducted on their measurement properties. This study examines the reliability, efficiency, and construct validity of two innovative item formats—the figural response (FR) and constructed response (CR) formats used in a K–12 computerized science test. The item response theory (IRT) information function and confirmatory factor analysis (CFA) were employed to address the research questions. It was found that the FR items were similar to the multiple-choice (MC) items in providing information and efficiency, whereas the CR items provided noticeably more information than the MC items but tended to provide less information per minute. The CFA suggested that the innovative formats and the MC format measure similar constructs. Innovations in computerized item formats are reviewed, and the merits as well as challenges of implementing the innovative formats are discussed.
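The efficiency comparison reported above reduces to dividing expected information by expected response time; a toy sketch with made-up numbers:

```python
# All values are hypothetical; they only illustrate how an item format can
# provide more total information yet less information per minute.
def info_per_minute(expected_information, expected_minutes):
    return expected_information / expected_minutes

print(info_per_minute(0.6, 0.8))   # e.g., a multiple-choice item: 0.75/min
print(info_per_minute(1.5, 3.0))   # e.g., a constructed-response item: 0.50/min
```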

7.
This article considers psychometric properties of composite raw scores and transformed scale scores on mixed-format tests that consist of a mixture of multiple-choice and free-response items. Test scores on several mixed-format tests are evaluated with respect to conditional and overall standard errors of measurement, score reliability, and classification consistency and accuracy under three item response theory (IRT) frameworks: unidimensional IRT (UIRT), simple structure multidimensional IRT (SS-MIRT), and bifactor multidimensional IRT (BF-MIRT) models. Illustrative examples are presented using data from three mixed-format exams with various levels of format effects. In general, the two MIRT models produced similar results, while the UIRT model resulted in consistently lower estimates of reliability and classification consistency/accuracy indices compared to the MIRT models.
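One ingredient of these comparisons is the conditional standard error of measurement, which under an IRT model is the reciprocal square root of the test information. A minimal unidimensional 2PL sketch with illustrative parameters (the article's MIRT versions are not shown):

```python
import numpy as np

def csem(theta, a, b):
    """CSEM(theta) = 1 / sqrt(I(theta)) under a unidimensional 2PL model."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))   # vectorized over items
    info = np.sum(a**2 * p * (1.0 - p))
    return 1.0 / np.sqrt(info)

a = np.array([1.2, 0.8, 1.5, 1.0])    # hypothetical discriminations
b = np.array([-1.0, 0.0, 0.5, 1.2])   # hypothetical difficulties
print(csem(0.0, a, b))
```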

8.
In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications. If the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating for uni- and multidimensional models that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment.
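A minimal sketch of the IRT domain-score estimator just described: average the model-implied probabilities of a correct response over the calibrated bank at the examinee's scale score. The 3PL parameterization and all parameter values are assumptions for illustration.

```python
import numpy as np

def domain_score(theta_hat, a, b, c):
    """Expected percent correct over the item domain under a 3PL model."""
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (theta_hat - b)))
    return 100.0 * p.mean()

rng = np.random.default_rng(0)
n = 500                                   # a large calibrated item bank
a = rng.lognormal(0.0, 0.3, n)            # hypothetical discriminations
b = rng.normal(0.0, 1.0, n)               # hypothetical difficulties
c = np.full(n, 0.2)                       # hypothetical guessing parameters
print(domain_score(theta_hat=0.4, a=a, b=b, c=c))
```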

9.
The Remote Associates Test (RAT) developed by Mednick and Mednick (1967) is known as a valid measure of creative convergent thinking. We developed a 30-item version of the RAT in Dutch with high internal consistency (Cronbach's alpha = 0.85) and applied both Classical Test Theory and Item Response Theory (IRT) to provide measures of item difficulty and discriminability, construct validity, and reliability. IRT was further used to construct a shorter version of the RAT, which comprises 22 items but still shows good reliability and validity, as revealed by its relation to Raven's Advanced Progressive Matrices test, another insight-problem test, and Guilford's Alternative Uses Test.

10.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): item score approach (ISA), testlet score approach (TSA), and item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degrees of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches. However, this could not be applied in GT. Specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for methods in IRT. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is a relatively large dependence among within-testlet items, INTA should be considered for IRT due to nonnegligible loss of information.

11.
In observed‐score equipercentile equating, the goal is to make scores on two scales or tests measuring the same construct comparable by matching the percentiles of the respective score distributions. If the tests consist of different items with multiple categories for each item, a suitable model for the responses is a polytomous item response theory (IRT) model. The parameters from such a model can be utilized to derive the score probabilities for the tests and these score probabilities may then be used in observed‐score equating. In this study, the asymptotic standard errors of observed‐score equating using score probability vectors from polytomous IRT models are derived using the delta method. The results are applied to the equivalent groups design and the nonequivalent groups design with either chain equating or poststratification equating within the framework of kernel equating. The derivations are presented in a general form and specific formulas for the graded response model and the generalized partial credit model are provided. The asymptotic standard errors are accurate under several simulation conditions relating to sample size, distributional misspecification and, for the nonequivalent groups design, anchor test length.
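Stripped of kernel continuization, the equating step matches cumulative distributions: a score x on form X maps to e_Y(x) = G⁻¹(F(x)). The discrete sketch below uses hypothetical score probability vectors standing in for those derived from a polytomous IRT model; real kernel equating would first continuize F and G.

```python
import numpy as np

def equipercentile(x_probs, y_probs):
    """Map each X score to the Y score with the same percentile rank."""
    F = np.cumsum(x_probs)
    G = np.cumsum(y_probs)
    idx = np.searchsorted(G, F)
    return np.minimum(idx, len(y_probs) - 1)   # guard against float overshoot

x_probs = np.array([0.05, 0.15, 0.30, 0.30, 0.15, 0.05])  # illustrative
y_probs = np.array([0.10, 0.20, 0.30, 0.25, 0.10, 0.05])  # illustrative
print(equipercentile(x_probs, y_probs))
```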

12.
This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item‐level multidimensionality, and (2) whether a Projection IRT model can provide a useful remedy. A real‐data example is used to illustrate the problem and also is used as a base model for a simulation study. The results suggest that ignoring item‐level multidimensionality might lead to inflated item discrimination parameter estimates when the proportion of multidimensional test items to unidimensional test items is as low as 1:5. The Projection IRT model appears to be a useful tool for updating unidimensional item parameter estimates of multidimensional test items for a purified unidimensional interpretation.

13.
The Survey of Young Adult Literacy conducted in 1985 by the National Assessment of Educational Progress included 63 items that elicited skills in acquiring and using information from written documents. These items were analyzed using two different models: (1) a qualitative cognitive model, which characterized items in terms of the processing tasks they required, and (2) an item response theory (IRT) model, which characterized item difficulties and respondents' proficiencies simply by tendencies toward correct response. This paper demonstrates how a generalization of Fischer and Scheiblechner's Linear Logistic Test Model can be used to integrate information from the cognitive analysis into the IRT analysis, providing a foundation for subsequent item construction, test development, and diagnosis of individuals' skill deficiencies.
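The core of the model referenced above is the LLTM decomposition of item difficulty into a weighted sum of processing-task parameters, b_i = Σ_k q_ik η_k. A hedged sketch, recovering hypothetical task parameters by least squares from a made-up task-by-item incidence matrix:

```python
import numpy as np

# Hypothetical Q-matrix (items x processing tasks) and calibrated difficulties.
Q = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
b = np.array([0.3, 0.9, 0.5, 1.2])

eta, *_ = np.linalg.lstsq(Q, b, rcond=None)
print(eta)        # estimated difficulty contribution of each processing task
print(Q @ eta)    # model-implied item difficulties, for comparison with b
```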

14.
The primary purpose of this study is to investigate the mathematical characteristics of the test reliability coefficient ρ_XX′ as a function of item response theory (IRT) parameters and present the lower and upper bounds of the coefficient. Another purpose is to examine relative performances of the IRT reliability statistics and two classical test theory (CTT) reliability statistics (Cronbach's alpha and Feldt–Gilmer congeneric coefficients) under various testing conditions that result from manipulating large-scale real data. For the first purpose, two alternative ways of exactly quantifying ρ_XX′ are compared in terms of computational efficiency and statistical usefulness. In addition, the lower and upper bounds for ρ_XX′ are presented in line with the assumptions of essential tau-equivalence and congeneric similarity, respectively. Empirical studies conducted for the second purpose showed across all testing conditions that (1) the IRT reliability coefficient was higher than the CTT reliability statistics; (2) the IRT reliability coefficient was closer to the Feldt–Gilmer coefficient than to the Cronbach's alpha coefficient; and (3) the alpha coefficient was close to the lower bound of IRT reliability. Some advantages of the IRT approach to estimating test-score reliability over the CTT approaches are discussed in the end.
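One common way to express reliability as a function of IRT parameters (a standard marginal-reliability form, not necessarily the paper's exact ρ_XX′) takes the average error variance E[1/I(θ)] over a N(0, 1) ability distribution as the noise term:

```python
# A sketch with illustrative 2PL parameters and Gauss-Hermite quadrature.
import numpy as np

a = np.array([1.4, 1.0, 0.8, 1.6, 1.1])
b = np.array([-1.2, -0.4, 0.0, 0.6, 1.3])

nodes, weights = np.polynomial.hermite_e.hermegauss(41)  # probabilists' rule
weights = weights / weights.sum()                        # N(0,1) weights

p = 1.0 / (1.0 + np.exp(-a * (nodes[:, None] - b)))      # nodes x items
info = (a**2 * p * (1.0 - p)).sum(axis=1)                # I(theta) per node
avg_error_var = np.sum(weights / info)

rho = 1.0 / (1.0 + avg_error_var)                        # var(theta) = 1
print(rho)
```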

15.

Objectives: This study aims to test the dimensionality, reliability, and item quality of the revised UCLA loneliness scale as well as to investigate the differential item functioning (DIF) of the three dimensions of the revised UCLA loneliness scale in community-dwelling Chinese and Korean elderly individuals.

Method: Data from 493 elderly individuals (287 Chinese and 206 Korean) were used to examine the revised UCLA loneliness scale. The Rasch model based on item response theory (IRT) was used to test dimensionality, reliability, and item fit. The hybrid ordinal logistic regression-IRT test was used to evaluate DIF.

Results: Item separation reliability, person reliability, and Cronbach’s alpha met the benchmarks. The quality of the items in the three-dimension model met the benchmark. Eight items were detected as significant DIF items (at α < .01). The loneliness level of Chinese elderly individuals was significantly higher than that of Koreans in Dimensions 1 and 2, while Korean elderly participants showed significantly higher loneliness levels than Chinese participants in Dimension 3. Several collected demographic characteristics and loneliness levels were more highly correlated in Korean elderly individuals than in Chinese elderly individuals.

Conclusion: Analysis using the three dimensions is reasonable for the revised UCLA loneliness scale. The good quality of the items suggests that the revised UCLA loneliness scale can be used to assess the intended latent traits. Finally, the differences between the levels of loneliness in Chinese and Korean elderly individuals are associated with the factors of loneliness.

16.
Cognitive diagnosis models (CDMs) continue to generate interest among researchers and practitioners because they can provide diagnostic information relevant to classroom instruction and student learning. However, the field's modeling component has outpaced its complementary component, test construction. Thus, most applications of cognitive diagnosis modeling involve retrofitting of CDMs to assessments constructed using classical test theory (CTT) or item response theory (IRT). This study explores the relationship between item statistics used in the CTT, IRT, and CDM frameworks using such an assessment, specifically a large-scale mathematics assessment. Furthermore, by highlighting differences between tests with varying levels of diagnosticity using a measure of item discrimination from a CDM approach, this study empirically uncovers some important CTT and IRT item characteristics. These results can be used to formulate practical guidelines in using IRT- or CTT-constructed assessments for cognitive diagnosis purposes.

17.
Investigating the fit of a parametric model plays a vital role in validating an item response theory (IRT) model. An area that has received little attention is the assessment of multiple IRT models used in a mixed-format test. The present study extends the nonparametric approach, proposed by Douglas and Cohen (2001), to assess model fit of three IRT models (the three- and two-parameter logistic models and the generalized partial credit model) used in a mixed-format test. The statistical properties of the proposed fit statistic were examined and compared to S-X² and PARSCALE's G². Overall, RISE (Root Integrated Square Error) outperformed the other two fit statistics under the studied conditions in that the Type I error rate was not inflated and the power was acceptable. A further advantage of the nonparametric approach is that it provides a convenient graphical inspection of the misfit.
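In schematic form (inferred from the statistic's name and the Douglas and Cohen approach, not from the authors' code), RISE is the square root of the density-weighted integrated squared difference between the parametric item characteristic curve and a nonparametric estimate of it:

```python
import numpy as np

def rise(icc_parametric, icc_nonparametric, density):
    """Root integrated squared error between two ICCs on a theta grid."""
    w = density / density.sum()
    return np.sqrt(np.sum(w * (icc_nonparametric - icc_parametric) ** 2))

theta = np.linspace(-4, 4, 81)
dens = np.exp(-theta**2 / 2)                      # unnormalized N(0,1) density
par = 1.0 / (1.0 + np.exp(-1.2 * (theta - 0.3)))  # fitted 2PL ICC
nonpar = par + 0.02 * np.sin(theta)               # stand-in smoothed ICC
print(rise(par, nonpar, dens))
```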

18.
This research derived information functions and proposed new scalar information indices to examine the quality of multidimensional forced choice (MFC) items based on the RANK model. We also explored how GGUM‐RANK information, latent trait recovery, and reliability varied across three MFC formats: pairs (two response alternatives), triplets (three alternatives), and tetrads (four alternatives). As expected, tetrad and triplet measures provided substantially more information than pairs, and MFC items composed of statements with high discrimination parameters were most informative. The methods and findings of this study will help practitioners to construct better MFC items, make informed projections about reliability with different MFC formats, and facilitate the development of MFC triplet‐ and tetrad‐based computerized adaptive tests.

19.
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity.
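The key mechanism is a mixture keyed to response time: answers given faster than an item's rapid-guessing threshold are modeled at a chance level instead of by the 3PL curve. A hedged sketch; the threshold, parameters, and four-option chance rate are all illustrative.

```python
import numpy as np

def p_effort_moderated(theta, rt, a, b, c, rt_threshold, n_options=4):
    """3PL probability under solution behavior; chance level under a rapid guess."""
    p_3pl = c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
    return np.where(rt >= rt_threshold, p_3pl, 1.0 / n_options)

# A rapid guess (rt = 2.1 s) versus an effortful response (rt = 14 s).
print(p_effort_moderated(theta=0.5, rt=2.1, a=1.1, b=0.0, c=0.2, rt_threshold=5.0))
print(p_effort_moderated(theta=0.5, rt=14.0, a=1.1, b=0.0, c=0.2, rt_threshold=5.0))
```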

20.
In judgmental standard setting procedures (e.g., the Angoff procedure), expert raters establish minimum pass levels (MPLs) for test items, and these MPLs are then combined to generate a passing score for the test. As suggested by Van der Linden (1982), item response theory (IRT) models may be useful in analyzing the results of judgmental standard setting studies. This paper examines three issues relevant to the use of IRT models in analyzing the results of such studies. First, a statistic for examining the fit of MPLs, based on judges' ratings, to an IRT model is suggested. Second, three methods for setting the passing score on a test based on item MPLs are analyzed; these analyses, based on theoretical models rather than empirical comparisons among the three methods, suggest that the traditional approach (i.e., setting the passing score on the test equal to the sum of the item MPLs) does not provide the best results. Third, a simple procedure, based on generalizability theory, for examining the sources of error in estimates of the passing score is discussed.
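One IRT-based way to use judges' MPLs (an illustration in the spirit of Van der Linden's suggestion, not necessarily the method this paper favors) is to invert the test characteristic curve: find the ability at which the expected number-correct score equals the sum of the item MPLs, and treat it as the cut point.

```python
import numpy as np

# Hypothetical 2PL item parameters and judges' minimum pass levels.
a = np.array([1.0, 1.3, 0.7, 1.1])
b = np.array([-0.5, 0.2, 0.8, 1.5])
mpl = np.array([0.80, 0.60, 0.55, 0.40])

def tcc(theta):
    """Test characteristic curve: expected number-correct score at theta."""
    return np.sum(1.0 / (1.0 + np.exp(-a * (theta - b))))

grid = np.linspace(-4, 4, 2001)
tcc_vals = np.array([tcc(t) for t in grid])
theta_cut = grid[np.argmin(np.abs(tcc_vals - mpl.sum()))]  # TCC^{-1}(sum of MPLs)
print(theta_cut)
```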
