Similar Documents (20 results)
1.
Computerized adaptive testing (CAT) has gained deserved popularity in the administration of educational and professional assessments, but continues to face test security challenges. To ensure sustained quality assurance and testing integrity, it is imperative to establish and maintain multiple stable item pools that are consistent in terms of psychometric characteristics and content specifications. This study introduces the Honeycomb Pool Assembly (HPA) framework, an innovative solution for the construction of multiple parallel item pools for CAT that maximizes item utilization in the item bank. The HPA framework comprises two stages—cell assembly and pool assembly—and uses a mixed integer programming modeling approach. An empirical study demonstrated HPA's effectiveness in creating a large number of parallel pools using a real-world high-stakes CAT assessment item bank. The HPA framework offers several advantages, including (a) simultaneous creation of multiple parallel pools, (b) simplification of item pool maintenance, and (c) flexibility in establishing statistical and operational constraints. Moreover, it can help testing organizations efficiently manage and monitor the health of their item banks. Thus, the HPA framework is expected to be a valuable tool for testing professionals and organizations to address test security challenges and maintain the integrity of high-stakes CAT assessments.

2.
This article develops a conceptual framework that addresses score comparability. The intent of the framework is to help identify and organize threats to comparability in a particular assessment situation. Aspects of the testing situations that might threaten score comparability are delineated, procedures for evaluating the degree of score comparability are described, and suggestions are made about how to minimize the effects of potential threats. The situations considered are restricted to those in which test developers intend to (a) be able to use scores on 2 or more tests interchangeably, (b) collect data that allow for the conversion of scores on each of the tests to a common scale, and (c) use the scores to make decisions about individuals. Comparability of scores on alternate forms of performance assessments, adaptive and paper-and-pencil tests, and alternate pools used for computerized adaptive tests is considered within the framework.

3.
Successful administration of computerized adaptive testing (CAT) programs in educational settings requires that test security and item exposure control issues be taken seriously. Developing an item selection algorithm that strikes the right balance between test precision and level of item pool utilization is the key to successful implementation and long‐term quality control of CAT. This study proposed a new item selection method using the “efficiency balanced information” criterion to address issues with the maximum Fisher information method and stratification methods. According to the simulation results, the new efficiency balanced information method had desirable advantages over the other studied item selection methods in terms of improving the optimality of CAT assembly and utilizing items with low a‐values while eliminating the need for item pool stratification.
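For context, the maximum Fisher information (MFI) criterion that the proposed method refines can be sketched as follows. This is a minimal illustration, not code from the study; the 3PL parameters in the mini-pool are made up for demonstration.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def fisher_info(theta, a, b, c):
    """Fisher information of a 3PL item at ability level theta."""
    p = p3pl(theta, a, b, c)
    return (1.7 * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

def select_mfi(theta, pool, administered):
    """Maximum Fisher information selection: pick the unused item
    that is most informative at the current ability estimate."""
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: fisher_info(theta, *pool[i]))

# Hypothetical mini-pool of (a, b, c) parameter triples.
pool = [(0.8, -1.0, 0.2), (1.5, 0.0, 0.2), (2.0, 1.5, 0.2)]
next_item = select_mfi(0.0, pool, administered={1})
```

Because MFI greedily favors high-a items whose difficulty sits near the current theta estimate, low-a items are rarely chosen, which is exactly the pool-utilization problem the efficiency balanced information criterion is designed to mitigate.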

4.
Applied Measurement in Education, 2013, 26(4): 297-312
Certain potential benefits of using item response theory in test construction are discussed and evaluated using the experience and evidence accumulated during 9 years of using a three-parameter model in the construction of major achievement batteries. We also discuss several cautions and limitations in realizing these benefits as well as issues in need of further research. The potential benefits considered are those of getting "sample-free" item calibrations and "item-free" person measurement, automatically equating various tests, decreasing the standard errors of scores without increasing the number of items used by using item pattern scoring, assessing item bias (or differential item functioning) independently of difficulty in a manner consistent with item selection, being able to determine just how adequate a tryout pool of items may be, setting up computer-generated "ideal" tests drawn from pools as targets for test developers, and controlling the standard error of a selected test at any desired set of score levels.
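Item pattern scoring, mentioned above, scores the full response pattern under the model rather than the number-correct score. A minimal sketch of maximum likelihood ability estimation under a 3PL-type model follows; the item parameters are illustrative, and a grid search stands in for the Newton-Raphson iteration production code would use.

```python
import math

def prob_correct(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

def pattern_loglik(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    ll = 0.0
    for (a, b, c), u in zip(items, responses):
        p = prob_correct(theta, a, b, c)
        ll += math.log(p) if u else math.log(1.0 - p)
    return ll

def mle_theta(items, responses):
    """Grid-search maximum likelihood estimate of theta."""
    grid = [g / 100 for g in range(-400, 401)]
    return max(grid, key=lambda t: pattern_loglik(t, items, responses))

# Two hypothetical items, equal discrimination, difficulties -1 and +1.
items = [(1.0, -1.0, 0.0), (1.0, 1.0, 0.0)]
theta_hat = mle_theta(items, [1, 0])  # correct on easy item, wrong on hard
```

Two examinees with the same number-correct score but different patterns can receive different theta estimates, which is how pattern scoring extracts more information, and hence smaller standard errors, from the same items.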

5.
With the proliferation of computers in test delivery today, adaptive testing has become quite popular, especially when examinees must be classified into two categories (pass/fail, master/nonmaster). Several well‐established organisations have provided standards and guidelines for the design and evaluation of educational and psychological testing. The purpose of this paper was not to repeat the guidelines and standards that exist in the literature but to identify and discuss the main evaluation parameters for a computer‐adaptive test (CAT). A number of parameters should be taken into account when evaluating CAT. Key parameters include utility, validity, reliability, satisfaction, usability, reporting, administration, security, and those associated with adaptivity, item pool, and psychometric theory. These parameters are presented and discussed and form a proposed evaluation model, the Evaluation Model of Computer‐Adaptive Testing.

6.
Guidelines are proposed for evaluating a computerized adaptive test. Topics include dimensionality, measurement error, validity, estimation of item parameters, item pool characteristics and human factors. Equating CAT and conventional tests is considered and matters of equity are addressed.

7.
Large-scale assessments often use a computer adaptive test (CAT) for selection of items and for scoring respondents. Such tests often assume a parametric form for the relationship between item responses and the underlying construct. Although semi- and nonparametric response functions could be used, there is scant research on their performance in a CAT. In this work, we compare parametric response functions versus those estimated using kernel smoothing and a logistic function of a monotonic polynomial. Monotonic polynomial items can be used with traditional CAT item selection algorithms that use analytical derivatives. We compared these approaches in CAT simulations with a variety of item selection algorithms. Our simulations also varied the features of the calibration and item pool: sample size, the presence of missing data, and the percentage of nonstandard items. In general, the results support the use of semi- and nonparametric item response functions in a CAT.

8.
9.
The comparison of scores from linguistically different tests is a twofold matter: the adaptation of tests and the comparison of scores. These 2 aspects of measurement invariance intersect at the need to guarantee the psychometric equivalence between the original and adapted versions. In this study, the authors examined comparability in 2 stages. First, they conducted a thorough study of progressive factorial variance through which they defined an anchor test. Second, they defined an observed-score equating function to establish equivalences between the original test and the adapted test; they used a common-item nonequivalent groups design for this purpose.

10.
Applied Measurement in Education, 2013, 26(4): 381-405
In recent years, there has been a large increase in the number of university applicants requesting special accommodations for university entrance exams. The Israeli National Institute for Testing and Evaluation (NITE) administers a Psychometric Entrance Test (comparable to the Scholastic Assessment Test in the United States) to assist universities in Israel in selecting undergraduates. Because universities in Israel do not permit flagging of candidates receiving special testing accommodations, such scores are treated as identical to scores attained under regular testing conditions. The increase in the number of students receiving testing accommodations and the prohibition of flagging have brought into focus certain psychometric issues pertaining to the fairness of testing students with disabilities and the comparability of special and standard testing conditions. To address these issues, NITE has developed a computerized adaptive psychometric test for administration to examinees with disabilities. This article discusses the process of developing the computerized test and ensuring its comparability to the paper-and-pencil test. This article also presents data on the operational computerized test.

11.
This paper illustrates that the psychometric properties of scores and scales that are used with mixed‐format educational tests can impact the use and interpretation of the scores that are reported to examinees. Psychometric properties that include reliability and conditional standard errors of measurement are considered in this paper. The focus is on mixed‐format tests in situations for which raw scores are integer‐weighted sums of item scores. Four associated real‐data examples include (a) effects of weights associated with each item type on reliability, (b) comparison of psychometric properties of different scale scores, (c) evaluation of the equity property of equating, and (d) comparison of the use of unidimensional and multidimensional procedures for evaluating psychometric properties. Throughout the paper, and especially in the conclusion section, the examples are related to issues associated with test interpretation and test use.

12.
As access and reliance on technology continue to increase, so does the use of computerized testing for admissions, licensure/certification, and accountability exams. Nonetheless, full computer‐based test (CBT) implementation can be difficult due to limited resources. As a result, some testing programs offer both CBT and paper‐based test (PBT) administration formats. In such situations, evidence that scores obtained from different formats are comparable must be gathered. In this study, we illustrate how contemporary statistical methods can be used to provide evidence regarding the comparability of CBT and PBT scores at the total test score and item levels. Specifically, we looked at the invariance of test structure and item functioning across test administration mode across subgroups of students defined by SES and sex. Multiple replications of both confirmatory factor analysis and Rasch differential item functioning analyses were used to assess invariance at the factorial and item levels. Results revealed a unidimensional construct with moderate statistical support for strong factorial‐level invariance across SES subgroups, and moderate support of invariance across sex. Issues involved in applying these analyses to future evaluations of the comparability of scores from different versions of a test are discussed.

13.
Modifications of administration and item arrangement of a conventional test can force a match between item difficulty levels and the ability level of the examinee. Although different examinees take different sets of items, the scoring method provides comparable scores for all. Furthermore, the test is self-scoring. These advantages are obtained without some of the usual disadvantages of tailored testing.

14.
Increasing use of item pools in large-scale educational assessments calls for an appropriate scaling procedure to achieve a common metric among field-tested items. The present study examines scaling procedures for developing a new item pool under a spiraled block linking design. Three scaling procedures are considered: (a) concurrent calibration, (b) separate calibration with one linking step, and (c) separate calibration with three sequential linking steps. Evaluation across varying sample sizes and item pool sizes suggests that calibrating an item pool simultaneously results in the most stable scaling. The separate calibration with linking procedures produced larger scaling errors as the number of linking steps increased. Haebara's item characteristic curve linking method performed better than the test characteristic curve (TCC) linking method. The present article provides an analytic illustration that the TCC method may fail to find global solutions with polytomous items. Finally, comparison of the single- and mixed-format item pools suggests that the use of polytomous items as the anchor can improve the overall scaling accuracy of the item pools.
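Haebara's characteristic curve method, referenced above, chooses the scale transformation (A, B) that makes the anchor items' response curves agree across the two calibrations. The sketch below uses made-up 2PL anchor parameters and a crude grid search in place of the gradient-based optimizer a real linking program would use.

```python
import math

def p2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def haebara_loss(A, B, old, new, grid):
    """Sum of squared ICC differences, over anchor items and a theta
    grid, after rescaling new-form parameters onto the old scale:
    a* = a_new / A, b* = A * b_new + B."""
    loss = 0.0
    for (a_o, b_o), (a_n, b_n) in zip(old, new):
        for t in grid:
            loss += (p2pl(t, a_o, b_o) - p2pl(t, a_n / A, A * b_n + B)) ** 2
    return loss

def haebara_link(old, new, grid):
    """Grid search for the (A, B) minimizing the Haebara criterion."""
    return min(((A / 100, B / 100)
                for A in range(50, 201, 2)
                for B in range(-100, 101, 2)),
               key=lambda ab: haebara_loss(ab[0], ab[1], old, new, grid))

# Hypothetical anchors: the new calibration is shifted by 0.5 (A=1, B=0.5).
old = [(1.0, 0.0), (1.5, -0.5), (0.8, 1.0)]
new = [(a, b - 0.5) for a, b in old]
grid = [g / 2 for g in range(-8, 9)]  # theta quadrature from -4 to 4
A, B = haebara_link(old, new, grid)
```

Because the criterion matches the item characteristic curves themselves rather than their sum (as the TCC method does), mismatches on individual anchor items cannot cancel out, which is one reason the curve-by-curve criterion tends to behave better.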

15.
Preventing items in adaptive testing from being over- or underexposed is one of the main problems in computerized adaptive testing. Though the problem of overexposed items can be solved using a probabilistic item-exposure control method, such methods are unable to deal with the problem of underexposed items. Using a system of rotating item pools, on the other hand, is a method that potentially solves both problems. In this method, a master pool is divided into (possibly overlapping) smaller item pools, which are required to have similar distributions of content and statistical attributes. These pools are rotated among the testing sites to realize desirable exposure rates for the items. A test assembly model, motivated by Gulliksen's matched random subtests method, was explored to help solve the problem of dividing a master pool into a set of smaller pools. Different methods to solve the model are proposed. An item pool from the Law School Admission Test was used to evaluate the performances of computerized adaptive tests from systems of rotating item pools constructed using these methods.

16.
Unlike the Meier Art Test of Aesthetic Judgment, published data about the Meier Art Test of Aesthetic Perception (MATAP) are rare. Hence the goals of this study on the MATAP were to provide reliability and validity data and to test an alternate scoring system. With data from three samples, the MATAP exhibited unsatisfactory internal consistency and test-retest reliability. Significant, but low, positive correlations were found with the Illinois Art Ability Test, the Kuder Artistic scale, and biographical items concerning art training and interests. The MATAP correlated negatively with the Child Test of Esthetic Sensitivity. Most art related course grades and all American College Testing Program scores were not significantly related to the MATAP. The psychometric benefits of the alternate scoring system appeared marginal.

17.
A New Test Equating Design: The ETP Design
Under the framework of item response theory, a new item-bank-based equating design for large-scale tests, the equating-to-pool (ETP) design, avoids several shortcomings of the traditional common-group and common-item equating designs and can equate tests while maintaining equating precision. Now that many large-scale examinations already maintain item banks, the ETP design has considerable room for development.

18.
The intent of this research was to find an item selection procedure in the multidimensional computer adaptive testing (CAT) framework that yielded higher precision for both the domain and composite abilities, had a higher usage of the item pool, and controlled the exposure rate. Five multidimensional CAT item selection procedures (minimum angle; volume; minimum error variance of the linear combination; minimum error variance of the composite score with optimized weight; and Kullback‐Leibler information) were studied and compared with two methods for item exposure control (the Sympson‐Hetter procedure and the fixed‐rate procedure, where the latter simply puts a limit on the item exposure rate) using simulated data. The maximum priority index method was used for the content constraints. Results showed that the Sympson‐Hetter procedure yielded better precision than the fixed‐rate procedure but had much lower item pool usage and took more time. The five item selection procedures performed similarly under Sympson‐Hetter. For the fixed‐rate procedure, there was a trade‐off between the precision of the ability estimates and the item pool usage: the five procedures had different patterns. It was found that (1) Kullback‐Leibler had better precision but lower item pool usage; (2) minimum angle and volume had balanced precision and item pool usage; and (3) the two methods minimizing the error variance had the best item pool usage and comparable overall score recovery but less precision for certain domains. The priority index for content constraints and item exposure was implemented successfully.
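The Sympson‐Hetter procedure compared above can be sketched as a probabilistic filter applied after item selection: each item i carries an exposure-control parameter k[i], calibrated in advance so realized exposure rates stay below a target, and a selected item is actually administered only with probability k[i]. A minimal unidimensional illustration with made-up 2PL parameters (the study itself is multidimensional):

```python
import math
import random

def info2pl(theta, a, b):
    """Fisher information of a 2PL item at theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def sh_select(theta, pool, k, administered, rng):
    """Select by maximum information, then apply the Sympson-Hetter
    filter: administer item i with probability k[i]; on failure, set
    the item aside for this examinee and try the next-best item."""
    set_aside = set()
    while True:
        candidates = [i for i in range(len(pool))
                      if i not in administered and i not in set_aside]
        if not candidates:
            raise RuntimeError("item pool exhausted")
        best = max(candidates, key=lambda i: info2pl(theta, *pool[i]))
        if rng.random() < k[best]:
            return best          # passed the exposure lottery
        set_aside.add(best)      # suppressed; try the next item

# Hypothetical pool; item 1 is most informative at theta = 0 but is
# fully suppressed here (k = 0) to show the filter redirecting selection.
pool = [(1.0, 0.0), (2.0, 0.0), (1.2, 0.0)]
k = [1.0, 0.0, 1.0]
chosen = sh_select(0.0, pool, k, administered=set(), rng=random.Random(7))
```

The fixed-rate alternative studied in the paper skips the lottery and instead makes an item ineligible once its running administration rate reaches the cap, which is cheaper to operate but changes the precision/usage trade-off reported above.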

19.
The development of alternate assessments for students with disabilities plays a pivotal role in state and national accountability systems. An important assumption in the use of alternate assessments in these accountability systems is that scores are comparable on different test forms across diverse groups of students over time. The use of test equating is a common way that states attempt to establish score comparability on different test forms. However, equating presents many unique, practical, and technical challenges for alternate assessments. This article provides case studies of equating for two alternate assessments in Michigan and an approach to determine whether or not equating would be preferred to not equating on these assessments. This approach is based on examining equated score and performance-level differences and investigating population invariance across subgroups of students with disabilities. Results suggest that using an equating method with these data appeared to have a minimal impact on proficiency classifications. The population invariance assumption was suspect for some subgroups and equating methods, with some large potential differences observed.

20.
The psychometric literature provides little empirical evaluation of examinee test data to assess essential psychometric properties of innovative items. In this study, examinee responses to conventional (e.g., multiple choice) and innovative item formats in a computer-based testing program were analyzed for IRT information with the three-parameter and graded response models. The innovative item types considered in this study provided more information across all levels of ability than multiple-choice items. In addition, accurate timing data captured via computer administration were analyzed to consider the relative efficiency of the multiple choice and innovative item types. As with previous research, multiple-choice items provide more information per unit time. Implications for balancing policy, psychometric, and pragmatic factors in selecting item formats are also discussed.
