Similar literature
1.
Four item response theory (IRT) models were compared using data from tests where multiple items were grouped into testlets focused on a common stimulus. In the bi-factor model each item was treated as a function of a primary trait plus a nuisance trait due to the testlet; in the testlet-effects model the slopes in the direction of the testlet traits were constrained within each testlet to be proportional to the slope in the direction of the primary trait; in the polytomous model the item scores were summed into a single score for each testlet; and in the independent-items model the testlet structure was ignored. Using the simulated data, reliability was overestimated somewhat by the independent-items model when the items were not independent within testlets. Under these nonindependent conditions, the independent-items model also yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes. When the items within testlets were instead generated to be independent, the bi-factor model yielded somewhat higher RMSE in difficulty and slope. Similar differences between the models were illustrated with real data.

2.
A series of computer simulations was run to measure the relationship between testlet validity and the factors of item pool size and testlet length for both adaptive and linearly constructed testlets. We confirmed the generality of earlier empirical findings (Wainer, Lewis, Kaplan, & Braswell, 1991) that making a testlet adaptive yields only modest increases in aggregate validity because of the peakedness of the typical proficiency distribution.

3.
A single-group (SG) equating with nearly equivalent test forms (SiGNET) design was developed by Grant to equate small-volume tests. Under this design, the scored items for the operational form are divided into testlets or mini tests. An additional testlet is created but not scored for the first form. If the scored testlets are testlets 1–6 and the unscored testlet is testlet 7, then the first form is composed of testlets 1–6 and the second form is composed of testlets 2–7. The seven testlets are administered as a single form, and when a sufficient number of examinees have taken it, the second form (testlets 2–7) is equated to the first form (testlets 1–6) using an SG equating design. This design thus facilitates the use of an SG equating and allows for the accumulation of data, both of which may reduce equating error. This study compared equatings under the SiGNET and common-item equating designs and found lower equating error for the SiGNET design in very small sample size conditions (e.g., N = 10).
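The form layout described in this abstract can be sketched in a few lines. This is only an illustration of the overlapping-testlet structure, not the authors' equating procedure; the function names `signet_forms` and `form_scores` and the example scores are hypothetical.

```python
def signet_forms(n_scored=6):
    """Return (administered, form1, form2) testlet numbers for a SiGNET-style layout.

    Assumption for illustration: one extra, initially unscored testlet is appended,
    as in the abstract's example with testlets 1-7.
    """
    administered = list(range(1, n_scored + 2))  # testlets 1..7 are all administered
    form1 = administered[:-1]                    # first form: testlets 1..6
    form2 = administered[1:]                     # second form: testlets 2..7
    return administered, form1, form2


def form_scores(testlet_scores, form1, form2):
    """Score one examinee on both forms from a single administration.

    testlet_scores maps testlet number -> number-correct score on that testlet.
    """
    x = sum(testlet_scores[t] for t in form1)
    y = sum(testlet_scores[t] for t in form2)
    return x, y


administered, form1, form2 = signet_forms()
shared = sorted(set(form1) & set(form2))  # testlets contributing to both forms

# Hypothetical examinee: number-correct score on each of the seven testlets.
example = {1: 4, 2: 3, 3: 5, 4: 2, 5: 4, 6: 3, 7: 5}
x, y = form_scores(example, form1, form2)
```

Because every examinee takes all seven testlets, each one yields a score on both forms, which is what allows a single-group equating of the second form to the first.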

4.
Using a New Statistical Model for Testlets to Score TOEFL   Total citations: 1, self-citations: 0, citations by others: 1
Standard item response theory (IRT) models fit to examination responses ignore the fact that sets of items (testlets) often are matched with a single common stimulus (e.g., a reading comprehension passage). In this setting, all items given to an examinee are unlikely to be conditionally independent (given examinee proficiency). Models that assume conditional independence will overestimate the precision with which examinee proficiency is measured. Overstatement of precision may lead to inaccurate inferences as well as prematurely ended examinations in which the stopping rule is based on the estimated standard error of examinee proficiency (e.g., an adaptive test). The standard three-parameter IRT model was modified to include an additional random effect for items nested within the same testlet (Wainer, Bradlow, & Du, 2000). This parameter, γ, characterizes the amount of local dependence in a testlet.
We fit 86 TOEFL testlets (50 reading comprehension and 36 listening comprehension) with the new model, and obtained a value for the variance of γ for each testlet. We compared the standard parameters (discrimination (a), difficulty (b), and guessing (c)) with those obtained through traditional modeling. We found that difficulties were well estimated either way, but estimates of both a and c were biased if conditional independence was incorrectly assumed. Of greater import, we found that test information was substantially overestimated when conditional independence was incorrectly assumed.
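A minimal sketch of the kind of model this abstract describes: a three-parameter logistic item with a person-by-testlet random effect γ subtracted inside the logit. The abstract does not give the formula, so the parameterization a(θ − b − γ) is an assumption based on how this model is commonly written, and the item parameters, `gamma_sd`, and function names are hypothetical.

```python
import math
import random


def p_correct(theta, a, b, c, gamma):
    """Probability of a correct response under a 3PL item with a testlet effect.

    Assumed form: c + (1 - c) * logistic(a * (theta - b - gamma)),
    where gamma is the examinee's random effect for the item's testlet.
    """
    z = a * (theta - b - gamma)
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def simulate_testlet(theta, items, gamma_sd, rng):
    """Simulate one examinee's 0/1 responses to all items in one testlet.

    items is a list of (a, b, c) tuples; one gamma is drawn per examinee-testlet
    pair and shared by every item in the testlet, which induces local dependence.
    """
    gamma = rng.gauss(0.0, gamma_sd)
    return [1 if rng.random() < p_correct(theta, a, b, c, gamma) else 0
            for (a, b, c) in items]


rng = random.Random(0)
# Hypothetical five-item testlet (a, b, c); values chosen only for illustration.
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.3, 0.25), (1.5, 0.5, 0.2), (1.0, 1.0, 0.2)]
responses = simulate_testlet(theta=0.0, items=items, gamma_sd=1.0, rng=rng)
```

Setting `gamma_sd` to zero recovers an ordinary 3PL model with conditionally independent items; larger values make responses within the testlet more strongly dependent, which is the quantity the variance of γ is meant to capture.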

5.
How should we think about the concept of the testlet? How can testlets be better incorporated into test score analysis? Can there be a one‐item testlet?

6.
It is not always convenient or appropriate to construct tests in which individual items are fungible. There are situations in which small clusters of items (testlets) are the units that are assembled to create a test. Using data from a test of reading comprehension constructed of four passages with several questions following each passage, we show that local independence fails at the level of the individual questions. The questions following each passage, however, constitute a testlet. We discuss the application to testlet scoring of some multiple-category models originally developed for individual items. In the example examined, the concurrent validity of the testlet scoring equaled or exceeded that of individual-item-level scoring.

7.
The presence of nuisance dimensionality is a potential threat to the accuracy of results for tests calibrated using a measurement model such as a factor analytic model or an item response theory model. This article describes a mixture group bifactor model to account for the nuisance dimensionality due to a testlet structure as well as the dimensionality due to differences in patterns of responses. The model can be used for testing whether or not an item functions differently across latent groups in addition to investigating the differential effect of local dependency among items within a testlet. An example is presented comparing test speededness results from a conventional factor mixture model, which ignores the testlet structure, with results from the mixture group bifactor model. Results suggested the 2 models treated the data somewhat differently. Analysis of the item response patterns indicated that the 2-class mixture bifactor model tended to categorize omissions as indicating speededness. With the mixture group bifactor model, more local dependency was present in the speeded than in the nonspeeded class. Evidence from a simulation study indicated the Bayesian estimation method used in this study for the mixture group bifactor model can successfully recover generated model parameters for 1- to 3-group models for tests containing testlets.

8.
Testlets, or groups of related items, are commonly included in educational assessments due to their many logistical and conceptual advantages. Despite their advantages, testlets introduce complications into the theory and practice of educational measurement. Responses to items within a testlet tend to be correlated even after controlling for latent ability, which violates the assumption of conditional independence made by traditional item response theory models. The present study used Monte Carlo simulation methods to evaluate the effects of testlet dependency on item and person parameter recovery and classification accuracy. Three calibration models were examined, including the traditional 2PL model with marginal maximum likelihood estimation, a testlet model with Bayesian estimation, and a bi-factor model with limited-information weighted least squares mean and variance adjusted estimation. Across testlet conditions, parameter types, and outcome criteria, the Bayesian testlet model outperformed, or performed equivalently to, the other approaches.

9.
This study investigated differential item functioning (DIF), differential bundle functioning (DBF), and differential test functioning (DTF) across gender of the reading comprehension section of the Graduate School Entrance English Exam in China. The datasets included 10,000 test-takers’ item-level responses to 6 five-item testlets. Both DIF and DBF were examined by using poly-simultaneous item bias test and item-response-theory-likelihood-ratio test, and DTF was investigated with multi-group confirmatory factor analyses (MG-CFA). The results indicated that although none of the 30 items exhibited statistically and practically significant DIF across gender at the item level, 2 testlets were consistently identified as having significant DBF at the testlet level by the two procedures. Nonetheless, DBF does not manifest itself at the overall test score level to produce DTF based on MG-CFA. This suggests that the relationship between item-level DIF and test-level DTF is a complicated issue with the mediating effect of testlets in testlet-based language assessment.

10.
In cognitive diagnostic models (CDMs), a set of fine-grained attributes is required to characterize complex problem solving and provide detailed diagnostic information about an examinee. However, it is challenging to ensure reliable estimation and control computational complexity when the test aims to identify the examinee's attribute profile in a large-scale map of attributes. To address this problem, this study proposes cognitive diagnostic multistage testing by partitioning hierarchically structured attributes (CD-MST-PH) as a multistage testing approach for CDMs. In CD-MST-PH, multiple testlets can be constructed based on separate attribute groups before testing occurs, which retains the advantages of multistage testing over fully adaptive testing or the on-the-fly approach. Moreover, testlets are offered sequentially and adaptively, thus improving test accuracy and efficiency. An item information measure is proposed to compute the discrimination power of an item for each attribute, and a module assembly method is presented to construct modules anchored at each separate attribute group. Several module selection indices for CD-MST-PH are also proposed by modifying the item selection indices used in cognitive diagnostic computerized adaptive testing. The results of the simulation study show that CD-MST-PH can improve test accuracy and efficiency relative to the conventional test without adaptive stages.

11.
The use of accommodations has been widely proposed as a means of including English language learners (ELLs) or limited English proficient (LEP) students in state and districtwide assessments. However, very little experimental research has been done on specific accommodations to determine whether these pose a threat to score comparability. This study examined the effects of linguistic simplification of 4th- and 6th-grade science test items on a state assessment. At each grade level, 4 experimental 10-item testlets were included on operational forms of a statewide science assessment. Two testlets contained regular field-test items, but in a linguistically simplified condition. The testlets were randomly assigned to LEP and non-LEP students through the spiraling of test booklets. For non-LEP students, in 4 t-test analyses of the differences in means for each corresponding testlet, 3 of the mean score comparisons were not significantly different, and the 4th showed the regular version to be slightly easier than the simplified version. Analysis of variance (ANOVA), followed by pairwise comparisons of the testlets, showed no significant differences in the scores of non-LEP students across the 2 item types. Among the 40 items administered in both regular and simplified format, item difficulty did not vary consistently in favor of either format. Qualitative analyses of items that displayed significant differences in p values were not informative, because the differences were typically very small. For LEP students, there was 1 significant difference in student means, and it favored the regular version. However, because the study was conducted in a state with a small number of LEP students, the analyses of LEP student responses lacked statistical power. The results of this study show that linguistic simplification is not helpful to monolingual English-speaking students who receive the accommodation. 
Therefore, the results provide evidence that linguistic simplification is not a threat to the comparability of scores of LEP and monolingual English-speaking students when offered as an accommodation to LEP students. The study findings may also have implications for the use of linguistic simplification accommodations in science assessments in other states and in content areas other than science.

12.
C‐tests are a specific variant of cloze tests that are considered time‐efficient, valid indicators of general language proficiency. They are commonly analyzed with models of item response theory assuming local item independence. In this article we estimated local interdependencies for 12 C‐tests and compared the changes in item difficulties, reliability estimates, and person parameter estimates for different modeling approaches: (a) Rasch, (b) testlet, (c) partial credit, and (d) copula models. The results are complemented with findings of a simulation study in which sample size, number of testlets, and strength of residual correlations between items were systematically manipulated. Results are discussed with regard to the pivotal question of whether residual dependencies between items are an artifact or part of the construct.

13.
The primary purpose of this study was to investigate the appropriateness and implications of incorporating a testlet definition into the estimation procedures for the conditional standard error of measurement (SEM) for tests composed of testlets. Another purpose was to investigate the bias in estimates of the conditional SEM when using item-based methods instead of testlet-based methods. Several item-based and testlet-based estimation methods were proposed and compared. In general, item-based estimation methods underestimated the conditional SEM for tests composed of testlets, and the magnitude of this negative bias increased as the degree of conditional dependence among items within testlets increased. However, an item-based method using a generalizability theory model provided good estimates of the conditional SEM under mild violation of the assumptions for measurement modeling. Under moderate or somewhat severe violation, testlet-based methods with item response models provided good estimates.

14.
In teaching, representations are used as ways to illustrate the concepts underlying a specific topic. For example, symbols (e.g., 1 + 2 = 3) can be used to express the concept of addition. To compare students’ abilities to interpret different representations in mathematics, the symbolic representation (SR) test and the pictorial representation (PR) test were designed, and then administered to 681 sixth graders in Taipei, Taiwan. This study adopts two different modeling perspectives, the testlet perspective and the multi-ability perspective, to analyze the SR and PR test data in the context of item response theory. The main results show that:
  1. Students scored on average significantly higher on the SR test than the PR test.
  2. The effects of the item stem testlets could be large, but they are statistically non-significant; however, the influence of the number of items in the testlet should also be considered.
  3. The nature of the option representations, SR and PR, represents two different mathematics abilities.
  4. The main factor that influences students’ item responses is students’ abilities to interpret SR and PR, and the testlet effects generated from the shared item stem can be ignored.
  5. Regarding the parameter estimates of the best-fitting model: (a) the person ability variance estimates show that the ability distributions on the SR and PR dimension may not be the same, (b) the correlation estimate between the SR and PR dimension indicates that these two abilities are moderately correlated, and (c) the item difficulty estimates for different models are similar.
Suggestions for teaching practice and future studies are provided in the Conclusion.

15.
It is observed that many sorts of difficulties may preclude the uneventful construction of tests by a computerized algorithm, such as those currently in favor in Computerized Adaptive Testing (CAT). In this essay we discuss a number of these problems, as well as some possible avenues of solution. We conclude with the development of the "testlet," a bundle of items that can be arranged either hierarchically or linearly, thus maintaining the efficiency of an adaptive test while keeping the quality control of test construction that is possible currently only with careful expert scrutiny. Performance on the separate testlets is aggregated to yield ability estimates.

16.
In this study, the effectiveness of detection of differential item functioning (DIF) and testlet DIF using SIBTEST and Poly-SIBTEST was examined in tests composed of testlets. An example using data from a reading comprehension test showed that results from SIBTEST and Poly-SIBTEST were not completely consistent in the detection of DIF and testlet DIF. Results from a simulation study indicated that SIBTEST appeared to maintain type I error control for most conditions, except in some instances in which the magnitude of simulated DIF tended to increase. This same pattern was present for the Poly-SIBTEST results, although Poly-SIBTEST demonstrated markedly less control of type I errors. Type I error control with Poly-SIBTEST was lower for those conditions for which the ability was unmatched to test difficulty. The power results for SIBTEST were not adversely affected when the size and percent of simulated DIF increased. Although Poly-SIBTEST failed to control type I errors in over 85% of the conditions simulated, in those conditions for which type I error control was maintained, Poly-SIBTEST demonstrated higher power than SIBTEST.

17.
Applied Measurement in Education, 2013, 26(3): 241-261
This simulation study compared two procedures to enable an adaptive test to select items in correspondence with a content blueprint. Trait level estimates obtained from testlet-based and constrained adaptive tests administered to 10,000 simulated examinees under two trait distributions and three item pool sizes were compared to the trait level estimates obtained from traditional adaptive tests in terms of mean absolute error, bias, and information. Results indicate that using constrained adaptive testing requires an increase of 5% to 11% in test length over the traditional adaptive test to reach the same error level, and using testlets requires an increase of 43% to 104% in test length over the traditional adaptive test. Given these results, the use of constrained computerized adaptive testing is recommended for situations in which an adaptive test must adhere to particular content specifications.

18.
According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing.

19.
The applications of item response theory (IRT) models assume local item independence and that examinees are independent of each other. When a representative sample for psychometric analysis is selected using a cluster sampling method in a testlet‐based assessment, both local item dependence and local person dependence are likely to be induced. This study proposed a four‐level IRT model to simultaneously account for dual local dependence due to item clustering and person clustering. Model parameter estimation was explored using the Markov Chain Monte Carlo method. Model parameter recovery was evaluated in a simulation study in comparison with three other related models: the Rasch model, the Rasch testlet model, and the three‐level Rasch model for person clustering. In general, the proposed model recovered the item difficulty and person ability parameters with the least total error. The bias in both item and person parameter estimation was not affected, but the standard error (SE) was. In some simulation conditions, the difference in classification accuracy between models could go up to 11%. The illustration using the real data generally supported model performance observed in the simulation study.

20.
Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): item score approach (ISA), testlet score approach (TSA), and item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degrees of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches. However, this could not be applied in GT. Specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for methods in IRT. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is a relatively large dependence among within-testlet items, INTA should be considered for IRT due to nonnegligible loss of information.
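The overestimation this abstract refers to can be illustrated with a small simulation. The sketch below is not the study's method: it substitutes Cronbach's alpha for the GT and IRT estimators, uses continuous item scores, and sets a deliberately strong testlet effect, so the size of the gap it produces is not comparable to the 0.02 to 0.06 range reported above. All names and settings are hypothetical.

```python
import random
from statistics import pvariance


def cronbach_alpha(parts):
    """Cronbach's alpha; parts is a list of score lists, one per item or testlet,
    all in the same examinee order."""
    k = len(parts)
    totals = [sum(vals) for vals in zip(*parts)]
    part_var = sum(pvariance(col) for col in parts)
    return k / (k - 1) * (1 - part_var / pvariance(totals))


def simulate_items(n_persons=2000, n_testlets=4, items_per_testlet=5,
                   gamma_sd=1.0, seed=0):
    """Continuous item scores: ability + shared testlet effect + item noise."""
    rng = random.Random(seed)
    items = [[] for _ in range(n_testlets * items_per_testlet)]
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        for t in range(n_testlets):
            gamma = rng.gauss(0.0, gamma_sd)  # shared by all items in testlet t
            for i in range(items_per_testlet):
                score = theta + gamma + rng.gauss(0.0, 1.0)
                items[t * items_per_testlet + i].append(score)
    return items


items = simulate_items()
# Testlet score approach: sum the five item scores within each of the four testlets.
testlets = [[sum(v) for v in zip(*items[t * 5:(t + 1) * 5])] for t in range(4)]

alpha_items = cronbach_alpha(items)        # treats all 20 items as independent parts
alpha_testlets = cronbach_alpha(testlets)  # treats the 4 testlet scores as parts
```

In this setup the item-based coefficient comes out higher than the testlet-based one because the shared testlet effect is counted as consistent variance across items, even though it does not reflect the ability being measured.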
