Similar Articles
20 similar articles found (search time: 47 ms)
2.
This paper proposes two new item selection methods for cognitive diagnostic computerized adaptive testing: the restrictive progressive method and the restrictive threshold method. They are built upon the posterior weighted Kullback-Leibler (KL) information index but include additional stochastic components either in the item selection index or in the item selection procedure. Simulation studies show that both methods are successful at simultaneously suppressing overexposed items and increasing the usage of underexposed items. Compared to item selection based upon (1) pure KL information and (2) the Sympson-Hetter method, the two new methods strike a better balance between item exposure control and measurement accuracy. The two new methods are also compared with Barrada et al.'s (2008) progressive method and proportional method.
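A minimal sketch of the posterior-weighted KL index that both new methods build on, assuming a dichotomous 2PL pool and a quadrature-grid posterior; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def pwkl_index(a, b, theta_hat, grid, posterior):
    """Posterior-weighted KL information for one item: KL(theta_hat || theta)
    averaged over the current posterior for theta on a quadrature grid."""
    p0 = p2pl(theta_hat, a, b)
    p = p2pl(grid, a, b)
    kl = p0 * np.log(p0 / p) + (1 - p0) * np.log((1 - p0) / (1 - p))
    return np.sum(kl * posterior)

# Toy usage with a uniform posterior over a coarse grid.
grid = np.linspace(-4, 4, 81)
posterior = np.ones_like(grid) / grid.size
print(pwkl_index(a=1.2, b=0.3, theta_hat=0.0, grid=grid, posterior=posterior))
```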

3.
During computerized adaptive testing (CAT), items are selected continuously according to the test-taker's estimated ability. The traditional method of attaining the highest efficiency in ability estimation is to select items of maximum Fisher information at the currently estimated ability. Test security has become a problem because high-discrimination items are more likely to be selected and become overexposed. So, there seems to be a tradeoff between high efficiency in ability estimation and balanced usage of items. This series of four studies with simulated data addressed the dilemma by focusing on the notion of whether more or less discriminating items should be used first in CAT. The first study demonstrated that the common maximum information method with Sympson and Hetter (1985) control resulted in the use of more discriminating items first. The remaining studies showed that using items in the reverse order (i.e., less discriminating items first), as described in Chang and Ying's (1999) stratified method, had potential advantages: (a) a more balanced item usage and (b) a relatively stable resultant item pool structure with easy and inexpensive management. This stratified method may have ability-estimation efficiency better than or close to that of other methods, particularly for operational item pools when retired items cannot be totally replenished with similar highly discriminating items. It is argued that the judicious selection of items, as in the stratified method, is a more active control of item exposure, which can successfully even out the usage of all items.
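A compact sketch contrasting the two selection rules discussed here: maximum Fisher information at the current estimate versus a Chang-Ying style a-stratified rule that spends low-discrimination items first. The 3PL information formula is standard; the pool and the stratification details are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.4, 2.0, 300)   # discrimination
b = rng.normal(0.0, 1.0, 300)    # difficulty
c = np.full(300, 0.2)            # pseudo-guessing

def fisher_info_3pl(theta, a, b, c):
    p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    return a**2 * ((1 - p) / p) * ((p - c) / (1 - c))**2

def max_info_item(theta_hat, available):
    """Classic rule: most informative available item at the current estimate."""
    idx = np.flatnonzero(available)
    return idx[np.argmax(fisher_info_3pl(theta_hat, a[idx], b[idx], c[idx]))]

def a_stratified_item(theta_hat, available, stage, n_strata=4):
    """Chang-Ying style: restrict stage k to the k-th a-stratum (low-a strata
    first), then pick the item whose difficulty is closest to theta_hat."""
    strata = np.array_split(np.argsort(a), n_strata)
    pool = np.intersect1d(strata[stage], np.flatnonzero(available))
    return pool[np.argmin(np.abs(b[pool] - theta_hat))]

available = np.ones(300, dtype=bool)
print(max_info_item(0.0, available), a_stratified_item(0.0, available, stage=0))
```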

4.
In test assembly, a fundamental difference exists between algorithms that select a test sequentially or simultaneously. Sequential assembly allows us to optimize an objective function at the examinee's ability estimate, such as the test information function in computerized adaptive testing. But it leads to the non-trivial problem of how to realize a set of content constraints on the test—a problem more naturally solved by a simultaneous item-selection method. Three main item-selection methods in adaptive testing offer solutions to this dilemma. The spiraling method moves item selection across categories of items in the pool proportionally to the numbers needed from them. Item selection by the weighted-deviations method (WDM) and the shadow test approach (STA) is based on projections of the future consequences of selecting an item. These two methods differ in that the former calculates a projection of a weighted sum of the attributes of the eventual test and the latter a projection of the test itself. The pros and cons of these methods are analyzed. An empirical comparison between the WDM and STA was conducted for an adaptive version of the Law School Admission Test (LSAT), which showed equally good item-exposure rates but violations of some of the constraints and larger bias and inaccuracy of the ability estimator for the WDM.
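A deliberately stripped-down sketch of the weighted-deviations idea: score each candidate item by its information minus a weighted penalty for content targets that could no longer be met if the item were taken. The real WDM projects a weighted sum of attributes of the full eventual test; the names and the crude capacity check below are illustrative assumptions:

```python
import numpy as np

def wdm_score(info, item_cat, counts, targets, slots_left, weight):
    """Simplified weighted-deviations score for one candidate item.

    counts[k]  - category-k items already administered
    targets[k] - desired number of category-k items in the final test
    slots_left - test positions remaining *after* this item
    Penalizes still-unmet targets that exceed the remaining capacity."""
    projected = counts.copy()
    projected[item_cat] += 1
    need = np.maximum(targets - projected, 0)           # still-unmet targets
    deviation = np.maximum(need.sum() - slots_left, 0)  # needs beyond capacity
    return info - weight * deviation

# Toy call: 3 content categories, candidate item from category 0.
print(wdm_score(info=0.8, item_cat=0, counts=np.array([1, 2, 0]),
                targets=np.array([4, 4, 2]), slots_left=6, weight=0.5))
```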

5.
The goal of the current study was to introduce a new stopping rule for computerized adaptive testing. The predicted standard error reduction stopping rule (PSER) uses the predictive posterior variance to determine the reduction in standard error that would result from the administration of additional items. The performance of the PSER was compared to that of the minimum standard error stopping rule and a modified version of the minimum information stopping rule in a series of simulated adaptive tests, drawn from a number of item pools. Results indicate that the PSER makes efficient use of CAT item pools, administering fewer items when predictive gains in information are small and increasing measurement precision when information is abundant.
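A sketch of the predictive idea behind PSER, assuming a 2PL item and a grid-based posterior: average the posterior standard error over the two possible responses to a candidate item and stop when the predicted reduction falls below a threshold. The threshold value and function names are assumptions, not the paper's:

```python
import numpy as np

def posterior_var(grid, post):
    m = np.sum(grid * post)
    return np.sum((grid - m)**2 * post)

def predicted_se_reduction(grid, post, a, b):
    """Expected drop in posterior SE if one more 2PL item (a, b) were given,
    averaging over the two possible responses."""
    p = 1.0 / (1.0 + np.exp(-a * (grid - b)))
    m_correct = np.sum(p * post)                       # marginal P(correct)
    post_correct = p * post / m_correct
    post_wrong = (1 - p) * post / (1 - m_correct)
    se_now = np.sqrt(posterior_var(grid, post))
    se_next = (m_correct * np.sqrt(posterior_var(grid, post_correct))
               + (1 - m_correct) * np.sqrt(posterior_var(grid, post_wrong)))
    return se_now - se_next

grid = np.linspace(-4, 4, 81)
post = np.exp(-grid**2 / 2); post /= post.sum()        # N(0,1) as current posterior
stop = predicted_se_reduction(grid, post, a=1.5, b=0.0) < 0.02  # assumed threshold
print(stop)
```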

6.
Successful administration of computerized adaptive testing (CAT) programs in educational settings requires that test security and item exposure control issues be taken seriously. Developing an item selection algorithm that strikes the right balance between test precision and level of item pool utilization is the key to successful implementation and long‐term quality control of CAT. This study proposed a new item selection method using the “efficiency balanced information” criterion to address issues with the maximum Fisher information method and stratification methods. According to the simulation results, the new efficiency balanced information method had desirable advantages over the other studied item selection methods in terms of improving the optimality of CAT assembly and utilizing items with low a‐values while eliminating the need for item pool stratification.
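The exact efficiency-balanced criterion is defined in the paper; the sketch below only illustrates the underlying "item efficiency" idea it leverages: information relative to the item's own maximum, which stops well-targeted low-a items from being crowded out. For the 2PL, an item's information peaks at theta = b with value a^2/4. This is an assumption-laden illustration, not the paper's formula:

```python
import numpy as np

def info_2pl(theta, a, b):
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

def efficiency(theta_hat, a, b):
    """Item information relative to the item's own maximum (a^2/4 for 2PL),
    so a well-targeted low-a item can outrank an off-target high-a item."""
    return info_2pl(theta_hat, a, b) / (a**2 / 4)

a = np.array([0.5, 1.8]); b = np.array([0.0, 1.2])
print(info_2pl(0.0, a, b))    # raw information favors the high-a item
print(efficiency(0.0, a, b))  # efficiency favors the well-targeted low-a item
```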

7.
The use of computerized adaptive testing algorithms for ranking items (e.g., college preferences, career choices) involves two major challenges: unacceptably high computation times (selecting from a large item pool with many dimensions) and biased results (enhanced preferences or intensified examinee responses because of repeated statements across items). To address these issues, we introduce subpool partition strategies for item selection and within-person statement exposure control procedures. Simulations showed that the multinomial method reduces computation time while maintaining measurement precision. Both the freeze and revised Sympson-Hetter online (RSHO) methods controlled the statement exposure rate; RSHO sacrificed some measurement precision but increased pool use. Furthermore, preventing a statement's repetition on consecutive items neither hindered the effectiveness of the freeze or RSHO method nor reduced measurement precision.
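For reference, a sketch of the classic Sympson-Hetter filter that RSHO revises: administer the selected item only with probability k[i], otherwise fall through to the next-best item. In the real procedure the k parameters are tuned by iterative simulation so exposure stays below a target; here they are simply assumed given:

```python
import numpy as np

rng = np.random.default_rng(1)

def sympson_hetter_select(ranked_items, k):
    """Walk an information-ranked candidate list; administer item i with
    probability k[i], otherwise skip it for this examinee and try the next."""
    for i in ranked_items:
        if rng.random() < k[i]:
            return i
    return ranked_items[-1]   # fallback if every candidate was skipped

k = np.array([0.3, 0.7, 1.0, 1.0])   # assumed, pre-tuned control parameters
print(sympson_hetter_select([0, 1, 2, 3], k))
```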

8.
APPLICATION OF COMPUTERIZED ADAPTIVE TESTING TO EDUCATIONAL PROBLEMS
Three applications of computerized adaptive testing (CAT) to help solve problems encountered in educational settings are described and discussed. Each of these applications makes use of item response theory to select test questions from an item pool to estimate a student's achievement level and its precision. These estimates may then be used in conjunction with certain testing strategies to facilitate certain educational decisions. The three applications considered are (a) adaptive mastery testing for determining whether or not a student has mastered a particular content area, (b) adaptive grading for assigning grades to students, and (c) adaptive self-referenced testing for estimating change in a student's achievement level. Differences between currently used classroom procedures and these CAT procedures are discussed. For the adaptive mastery testing procedure, evidence from a series of studies comparing conventional and adaptive testing procedures is presented showing that the adaptive procedure results in more accurate mastery classifications than do conventional mastery tests, while using fewer test questions.

9.
One of the methods of controlling test security in adaptive testing is imposing random item-ineligibility constraints on the selection of the items, with probabilities automatically updated to maintain a predetermined upper bound on the exposure rates. Three major improvements of the method are presented. First, a few modifications to improve the initialization of the method and accelerate the impact of its feedback mechanism on the observed item-exposure rates are introduced. Second, the case of conditional item-exposure control given the uncertainty of the examinee's ability parameter is addressed. Third, although rare for a well-designed item pool, when applied in combination with the shadow-test approach to adaptive testing, the method may occasionally render the shadow-test model infeasible. A big-M method is proposed that resolves the issue. The practical advantages of the improvements are illustrated using simulated adaptive testing from a real-world item pool under a variety of conditions.
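One simplified version of the feedback mechanism described here, with made-up names: before each test every item is declared eligible with probability p[i], and after each test p[i] is shrunk when the item's observed exposure rate exceeds the bound r_max and relaxed back toward 1 otherwise. The exact update in the improved method differs; this is a sketch of the principle:

```python
import numpy as np

def update_eligibility(p, exposures, n_examinees, r_max):
    """Feedback update for random item-ineligibility control: scale each
    eligibility probability by r_max / observed exposure rate, capped at 1."""
    rate = exposures / max(n_examinees, 1)
    with np.errstate(divide="ignore"):
        factor = np.where(rate > 0, r_max / rate, np.inf)
    return np.minimum(1.0, p * factor)

p = np.array([1.0, 1.0, 0.8])
print(update_eligibility(p, exposures=np.array([40, 10, 0]),
                         n_examinees=100, r_max=0.25))
```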

10.
When a computerized adaptive testing (CAT) version of a test co-exists with its paper-and-pencil (P&P) version, it is important for scores from the CAT version to be comparable to scores from its P&P version. The CAT version may require multiple item pools for test security reasons, and CAT scores based on alternate pools also need to be comparable to each other. In this paper, we review research literature on CAT comparability issues and synthesize issues specific to these two settings. A framework of criteria for evaluating comparability was developed that contains the following three categories of criteria: validity criterion, psychometric property/reliability criterion, and statistical assumption/test administration condition criterion. Methods for evaluating comparability under these criteria as well as various algorithms for improving comparability are described and discussed. Focusing on the psychometric property/reliability criterion, an example using an item pool of ACT Assessment Mathematics items is provided to demonstrate a process for developing comparable CAT versions and for evaluating comparability. This example illustrates how simulations can be used to improve comparability at the early stages of the development of a CAT. The effects of different specifications of practical constraints, such as content balancing and item exposure rate control, and the effects of using alternate item pools are examined. One interesting finding from this study is that a large part of incomparability may be due to the change from number-correct score-based scoring to IRT ability estimation-based scoring. In addition, changes in components of a CAT, such as exposure rate control, content balancing, test length, and item pool size were found to result in different levels of comparability in test scores.

11.
Many computerized testing algorithms require the fitting of some item response theory (IRT) model to examinees' responses to facilitate item selection, the determination of test stopping rules, and classification decisions. Some IRT models are thought to be particularly useful for small volume certification programs that wish to make the transition to computerized adaptive testing (CAT). The one-parameter logistic model (1-PLM) is usually assumed to require a smaller sample size than the three-parameter logistic model (3-PLM) for item parameter calibrations. This study examined the effects of model misspecification on the precision of the decisions made using the sequential probability ratio test (SPRT). For this comparison, the 1-PLM was used to estimate item parameters, even though the items' characteristics were represented by a 3-PLM. Results demonstrated that the 1-PLM produced considerably more decision errors under simulation conditions similar to a real testing environment, compared to the true model and to a fixed-form standard reference set of items.
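A sketch of the SPRT classification rule at issue, assuming 3PL items and illustrative cut points theta0/theta1: the log likelihood ratio accumulates per response and is compared against Wald's bounds. It also shows why fitting a 1-PLM to 3-PLM items distorts decisions: a misspecified model changes p0 and p1 and hence every increment of the ratio:

```python
import numpy as np

def sprt_decision(responses, items, theta0, theta1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test for pass/fail classification.
    H0: theta = theta0 (non-master) vs H1: theta = theta1 (master).
    items is a list of (a, b, c) 3PL parameter tuples."""
    def p3pl(theta, a, b, c):
        return c + (1 - c) / (1 + np.exp(-a * (theta - b)))
    log_lr = 0.0
    for x, (a, b, c) in zip(responses, items):
        p0, p1 = p3pl(theta0, a, b, c), p3pl(theta1, a, b, c)
        log_lr += x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
    if log_lr >= np.log((1 - beta) / alpha):
        return "master"
    if log_lr <= np.log(beta / (1 - alpha)):
        return "non-master"
    return "continue testing"

items = [(1.2, 0.0, 0.2), (0.9, 0.4, 0.2), (1.5, -0.2, 0.2)]
print(sprt_decision([1, 1, 0], items, theta0=-0.5, theta1=0.5))
```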

12.
The purpose of this study was to compare and evaluate three on-line pretest item calibration-scaling methods (the marginal maximum likelihood estimate with one expectation maximization [EM] cycle [OEM] method, the marginal maximum likelihood estimate with multiple EM cycles [MEM] method, and Stocking's Method B) in terms of item parameter recovery when the item responses to the pretest items in the pool are sparse. Simulations of computerized adaptive tests were used to evaluate the results yielded by the three methods. The MEM method produced the smallest average total error in parameter estimation, and the OEM method yielded the largest total error.
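OEM and MEM marginalize over each examinee's posterior ability distribution; the sketch below does the much simpler fixed-abilities version of online pretest calibration, treating the CAT's theta estimates as known and fitting one 2PL pretest item by maximum likelihood. It assumes scipy is available, and the names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def calibrate_2pl_item(theta_hat, x):
    """ML calibration of one pretest 2PL item, treating examinees' CAT
    ability estimates theta_hat as fixed and known."""
    def neg_loglik(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(-a * (theta_hat - b)))
        p = np.clip(p, 1e-9, 1 - 1e-9)
        return -np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return minimize(neg_loglik, x0=[1.0, 0.0], method="Nelder-Mead").x

# Recover roughly (a, b) = (1.3, 0.5) from simulated sparse responses.
rng = np.random.default_rng(2)
theta = rng.normal(size=500)
x = (rng.random(500) < 1.0 / (1.0 + np.exp(-1.3 * (theta - 0.5)))).astype(int)
print(calibrate_2pl_item(theta, x))
```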

13.
The current study compares the progressive-restricted standard error (PR-SE) exposure control method with the Sympson-Hetter, randomesque, and no exposure control (maximum information) procedures using the generalized partial credit model with fixed- and variable-length CATs and two item pools. The PR-SE method administered the entire item pool for all conditions, whereas the Sympson-Hetter and randomesque procedures left 27%–28% and 14% of item pool 1, and about 45%–50% and 27%–29% of item pool 2, respectively, unadministered. PR-SE also resulted in the smallest amount of mean item overlap averaged across replications. These results were obtained with similar measurement precision compared to the other methods while improving on the utilization of the item pools, except for very low theta levels (less than −2) for item pool 2, where a mismatch with the trait distribution occurs.
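A sketch of the "progressive" component that PR-SE builds on (in the spirit of Barrada et al.'s progressive method): early in the test a random component dominates the selection score, and the weight shifts toward information as the test proceeds, while items at the exposure bound are excluded outright (the "restricted" part). The actual PR-SE weighting involves the running standard error, which is simplified away here:

```python
import numpy as np

rng = np.random.default_rng(3)

def progressive_restricted_select(info, n_given, test_len, exposures,
                                  n_examinees, r_max=0.25):
    """Progressive selection with a hard exposure cap: score is a moving
    blend of noise and information; over-exposed items are ineligible."""
    w = n_given / test_len                        # 0 -> all random, 1 -> all info
    eligible = exposures / max(n_examinees, 1) < r_max
    noise = rng.random(info.size) * info.max()    # noise on the info scale
    score = np.where(eligible, (1 - w) * noise + w * info, -np.inf)
    return int(np.argmax(score))

info = np.array([0.9, 0.5, 0.7, 0.2])
print(progressive_restricted_select(info, n_given=2, test_len=20,
                                    exposures=np.array([30, 5, 10, 0]),
                                    n_examinees=100))
```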

14.
According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing.

15.
《教育实用测度》2013,26(3):241-261
This simulation study compared two procedures to enable an adaptive test to select items in correspondence with a content blueprint. Trait level estimates obtained from testlet-based and constrained adaptive tests administered to 10,000 simulated examinees under two trait distributions and three item pool sizes were compared to the trait level estimates obtained from traditional adaptive tests in terms of mean absolute error, bias, and information. Results indicate that using constrained adaptive testing requires an increase of 5% to 11% in test length over the traditional adaptive test to reach the same error level, and that using testlets requires an increase of 43% to 104% in test length over the traditional adaptive test. Given these results, the use of constrained computerized adaptive testing is recommended for situations in which an adaptive test must adhere to particular content specifications.

16.
Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. Three scaling procedures are considered: (a) concurrent calibration, (b) separate calibration with one linking step, and (c) separate calibration with three sequential linking steps. Evaluation across varying sample sizes and item pool sizes suggests that calibrating an item pool simultaneously results in the most stable scaling. The separate calibration procedures produced larger scaling errors as the number of linking steps increased. Haebara's item characteristic curve linking performed better than the test characteristic curve (TCC) linking method. The present article provides an analytic illustration that the test characteristic curve method may fail to find global solutions for polytomous items. Finally, comparison of the single- and mixed-format item pools suggests that the use of polytomous items as the anchor can improve the overall scaling accuracy of the item pools.
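A sketch of Haebara's item characteristic curve linking for dichotomous 2PL anchor items: find scale constants (A, B) minimizing the summed squared ICC differences over a theta grid, with new-form parameters transforming as a/A and A*b + B. The grid, starting values, and toy anchor set are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def icc_2pl(grid, a, b):
    """2PL ICCs on a theta grid; shape (n_grid, n_items)."""
    return 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))

def haebara_link(a_new, b_new, a_base, b_base, grid=np.linspace(-4, 4, 41)):
    """Scale constants (A, B) putting the new form on the base metric by
    minimizing Haebara's squared-ICC-difference criterion over anchors."""
    def loss(params):
        A, B = params
        diff = icc_2pl(grid, a_base, b_base) - icc_2pl(grid, a_new / A, A * b_new + B)
        return np.sum(diff**2)
    return minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead").x

# Toy anchors whose true transformation is A = 1.2, B = -0.5.
a_base = np.array([1.0, 1.4, 0.8]); b_base = np.array([-0.5, 0.2, 1.0])
A, B = 1.2, -0.5
a_new, b_new = a_base * A, (b_base - B) / A
print(haebara_link(a_new, b_new, a_base, b_base))   # ~ [1.2, -0.5]
```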

17.
We evaluated the efficiency, precision, and concurrent validity of results obtained from adaptive and fixed-item music listening tests in three studies: (a) a computer simulation study in which each of 2,200 simulees completed a computerized adaptive tonal memory test, a computerized fixed-item tonal memory test constructed from items in the adaptive test pool, and two standardized group-administered tonal memory tests; (b) a live testing study in which each of 204 examinees took the computerized adaptive test and the standardized tests; and (c) a live testing study in which randomly equivalent groups took either the computerized adaptive test (n = 86) or the computerized fixed-item test (n = 86). The adaptive music test required 50% to 93% fewer items to match the reliability and concurrent validity of the fixed-item tests, and it yielded higher levels of reliability and concurrent validity than the fixed-item tests when test length was held constant. These findings suggest that computerized adaptive tests, which typically have been limited to visually produced items, may also be well suited for measuring skills that require aurally produced items.

18.
《教育实用测度》2013,26(4):359-375
Many procedures have been developed for selecting the "best" items for a computerized adaptive test. There is a trend toward the use of adaptive testing in applied settings such as licensure tests, program entrance tests, and educational tests. It is useful to consider procedures for item selection and the special needs of applied testing settings to facilitate test design. The current study reviews several classical approaches and alternative approaches to item selection and discusses their relative merit. This study also describes procedures for constrained computerized adaptive testing (C-CAT) that may be added to classical item selection approaches to allow them to be used for applied testing, while maintaining the high measurement precision and short test length that made adaptive testing attractive to practitioners initially.

19.
Computerized adaptive testing (CAT) has gained deserved popularity in the administration of educational and professional assessments, but continues to face test security challenges. To ensure sustained quality assurance and testing integrity, it is imperative to establish and maintain multiple stable item pools that are consistent in terms of psychometric characteristics and content specifications. This study introduces the Honeycomb Pool Assembly (HPA) framework, an innovative solution for the construction of multiple parallel item pools for CAT that maximizes item utilization in the item bank. The HPA framework comprises two stages—cell assembly and pool assembly—and uses a mixed integer programming modeling approach. An empirical study demonstrated HPA's effectiveness in creating a large number of parallel pools using a real-world high-stakes CAT assessment item bank. The HPA framework offers several advantages, including (a) simultaneous creation of multiple parallel pools, (b) simplification of item pool maintenance, and (c) flexibility in establishing statistical and operational constraints. Moreover, it can help testing organizations efficiently manage and monitor the health of their item banks. Thus, the HPA framework is expected to be a valuable tool for testing professionals and organizations to address test security challenges and maintain the integrity of high-stakes CAT assessments.
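A toy mixed integer program in the spirit of simultaneous parallel-pool assembly, not HPA's actual two-stage model: split a small bank into two pools of equal size, each item used at most once, total information maximized, and the pools kept statistically close. Assumes the pulp package; all numbers are illustrative:

```python
import pulp

bank_info = [0.9, 0.8, 0.75, 0.7, 0.6, 0.5, 0.4, 0.3]
items, pools = range(len(bank_info)), range(2)

prob = pulp.LpProblem("pool_assembly", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", (items, pools), cat="Binary")

# Objective: carry as much information as possible into the pools.
prob += pulp.lpSum(bank_info[i] * x[i][p] for i in items for p in pools)
for i in items:                         # each item in at most one pool
    prob += pulp.lpSum(x[i][p] for p in pools) <= 1
for p in pools:                         # fixed pool size
    prob += pulp.lpSum(x[i][p] for i in items) == 3
# Crude parallelism: pools' total information within a small tolerance.
diff = (pulp.lpSum(bank_info[i] * x[i][0] for i in items)
        - pulp.lpSum(bank_info[i] * x[i][1] for i in items))
prob += diff <= 0.1
prob += diff >= -0.1

prob.solve(pulp.PULP_CBC_CMD(msg=0))
for p in pools:
    print(p, [i for i in items if x[i][p].value() == 1])
```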

20.
Several techniques exist to automatically put together a test meeting a number of specifications. In an item bank, the items are stored with their characteristics. A test is constructed by selecting a set of items that fulfills the specifications set by the test assembler. Test assembly problems are often formulated in terms of a model consisting of restrictions and an objective to be maximized or minimized. A problem arises when it is impossible to construct a test from the item pool that meets all specifications, that is, when the model is not feasible. Several methods exist to handle these infeasibility problems. In this article, test assembly models resulting from two practical testing programs were reconstructed to be infeasible. These models were analyzed using methods that forced a solution (Goal Programming, Multiple-Goal Programming, Greedy Heuristic), that analyzed the causes (Relaxed and Ordered Deletion Algorithm (RODA), Integer Randomized Deletion Algorithm (IRDA), Set Covering (SC), and Item Sampling), or that analyzed the causes and used this information to force a solution (Irreducible Infeasible Set-Solver). Specialized methods such as the IRDA and the Irreducible Infeasible Set-Solver performed best. Recommendations about the use of different methods are given.
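A small goal-programming sketch of "forcing a solution": replace a hard content constraint with a target plus a penalized slack variable, so an infeasible model still returns a best-compromise test. Assumes the pulp package; the item bank and penalty weight are illustrative:

```python
import pulp

n_items = 6
info = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_algebra = [1, 1, 1, 0, 0, 0]       # only 3 algebra items exist

prob = pulp.LpProblem("goal_assembly", pulp.LpMaximize)
x = pulp.LpVariable.dicts("x", range(n_items), cat="Binary")
slack = pulp.LpVariable("slack", lowBound=0)

# Infeasible as hard constraints: a 4-item test with >= 5 algebra items.
prob += pulp.lpSum(x[i] for i in range(n_items)) == 4
prob += pulp.lpSum(is_algebra[i] * x[i] for i in range(n_items)) + slack >= 5

# Maximize information, penalizing every unit of violated content target.
prob += pulp.lpSum(info[i] * x[i] for i in range(n_items)) - 10 * slack

prob.solve(pulp.PULP_CBC_CMD(msg=0))
print([i for i in range(n_items) if x[i].value() == 1], slack.value())
```

The solver selects all three algebra items plus the best remaining item and reports slack = 2, exposing exactly how far the blueprint had to be relaxed, which is the diagnostic information the cause-analyzing methods in the abstract formalize.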
