Similar Documents
20 similar documents were found.
1.
In applications of item response theory (IRT), fixed parameter calibration (FPC) has been used to estimate the item parameters of a new test form on the existing ability scale of an item pool. The present paper presents an application of FPC to test data from multiple examinee groups that are linked to the item pool via anchor items, and investigates the performance of FPC relative to an alternative approach, namely independent calibration followed by scale linking. Two designs for linking to the pool are proposed that involve multiple groups and test forms, for which multiple-group FPC can be used effectively. A real-data study shows that the multiple-group FPC method performs similarly to the alternative method in estimating ability distributions and new item parameters on the scale of the item pool. In addition, a simulation study shows that the multiple-group FPC method performs nearly as well as or better than the alternative method in recovering the underlying ability distributions and the new item parameters.
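As background for this and several of the abstracts below, a minimal sketch of the idea behind FPC, written here under the three-parameter logistic (3PL) model; the choice of the 3PL model is an assumption of this sketch, since the abstract does not state which model was used:

```latex
% 3PL item response function for item i
P_i(\theta) \;=\; c_i + (1 - c_i)\,\frac{1}{1 + \exp\!\bigl[-a_i(\theta - b_i)\bigr]}
```

In FPC, the parameters $(a_i, b_i, c_i)$ of the anchor items are held fixed at their item-pool values throughout estimation, so the newly estimated item parameters and group ability distributions come out expressed on the pool's existing ability scale without a separate linking step.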

2.
Linking item parameters to a base scale
This paper compares three methods of item calibration that are frequently used for linking item parameters to a base scale: concurrent calibration, separate calibration with linking, and fixed item parameter calibration. Concurrent and separate calibrations were implemented using BILOG-MG. The Stocking and Lord (1983, Applied Psychological Measurement, 7:201–210) characteristic curve method of parameter linking was used in conjunction with separate calibration. The fixed item parameter calibration (FIPC) method was implemented using both BILOG-MG and PARSCALE because the method is carried out differently by the two programs. Both programs use multiple EM cycles, but BILOG-MG does not update the prior ability distribution during FIPC calibration, whereas PARSCALE updates the prior ability distribution multiple times. The methods were compared using simulations based on actual testing program data, and results were evaluated in terms of recovery of the underlying ability distributions, the item characteristic curves, and the test characteristic curves. Factors manipulated in the simulations were sample size, ability distributions, and numbers of common (or fixed) items. The results for concurrent calibration and separate calibration with linking were comparable, and both methods showed good recovery results for all conditions. Between the two fixed item parameter calibration procedures, only the appropriate use of PARSCALE consistently provided item parameter linking results similar to those of the other two methods.
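For reference, a sketch of the Stocking and Lord characteristic curve criterion mentioned above, written for the 3PL model; the direction of the transformation and the use of a grid of ability points are conventions assumed here rather than details taken from the abstract:

```latex
% Transform the new-form estimates of the common items onto the base scale
a_i^{*} = \hat{a}_i / A, \qquad b_i^{*} = A\,\hat{b}_i + B, \qquad c_i^{*} = \hat{c}_i

% Choose A and B to minimize the squared difference between the two
% test characteristic curves of the common items over ability points \theta_q
F(A, B) \;=\; \sum_{q} \Bigl[ \sum_{i \in \text{common}} P_i\bigl(\theta_q; \tilde{a}_i, \tilde{b}_i, \tilde{c}_i\bigr)
\;-\; \sum_{i \in \text{common}} P_i\bigl(\theta_q; a_i^{*}, b_i^{*}, c_i^{*}\bigr) \Bigr]^{2}
```

Here $\tilde{a}_i, \tilde{b}_i, \tilde{c}_i$ are the base-scale estimates of the common items, and the minimizing $A$ and $B$ define the linear scale transformation.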

3.
Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies, and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test, and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may affect the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.
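The abstract does not name the specific LID detection methods reviewed. As an illustration only, the sketch below computes Yen's Q3 statistic, one widely used LID index, under an assumed two-parameter logistic (2PL) model with item parameters and ability estimates treated as known; all data here are simulated placeholders.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL response probabilities for abilities theta and item parameters a, b."""
    return 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))

def q3_matrix(responses, theta_hat, a, b):
    """Yen's Q3: correlations of item-level model residuals across examinees.

    responses : (n_examinees, n_items) 0/1 matrix
    theta_hat : (n_examinees,) ability estimates
    a, b      : (n_items,) 2PL item parameters
    """
    expected = p_2pl(theta_hat, a, b)
    residuals = responses - expected              # residual per person-item cell
    return np.corrcoef(residuals, rowvar=False)   # (n_items, n_items) Q3 matrix

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_persons, n_items = 2000, 10
    a = rng.uniform(0.8, 2.0, n_items)
    b = rng.normal(0.0, 1.0, n_items)
    theta = rng.normal(0.0, 1.0, n_persons)
    u = (rng.random((n_persons, n_items)) < p_2pl(theta, a, b)).astype(float)

    q3 = q3_matrix(u, theta, a, b)
    # Flag pairs whose Q3 is unusually large relative to the mean off-diagonal value;
    # the 0.2 cutoff is a common rule of thumb, not a formal critical value.
    off_diag = q3[~np.eye(n_items, dtype=bool)]
    flagged = np.argwhere(np.triu(q3 - off_diag.mean(), k=1) > 0.2)
    print("Flagged item pairs:", flagged)
```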

4.
One of the major assumptions of item response theory (IRT) models is that performance on a set of items is unidimensional, that is, the probability of successful performance by examinees on a set of items can be modeled by a mathematical model that has only one ability parameter. In practice, this strong assumption is likely to be violated. An important pragmatic question to consider is: What are the consequences of these violations? In this research, evidence is provided of violations of unidimensionality on the verbal scale of the GRE Aptitude Test, and the impact of these violations on IRT equating is examined. Previous factor analytic research on the GRE Aptitude Test suggested that two verbal dimensions, discrete verbal (analogies, antonyms, and sentence completions) and reading comprehension, existed. Consequently, the present research involved two separate calibrations (homogeneous) of discrete verbal items and reading comprehension items as well as a single calibration (heterogeneous) of all verbal item types. Thus, each verbal item was calibrated twice and each examinee obtained three ability estimates: reading comprehension, discrete verbal, and all verbal. The comparability of ability estimates based on homogeneous calibrations (reading comprehension or discrete verbal) to each other and to the all-verbal ability estimates was examined. The effects of homogeneity of the item calibration pool on estimates of item discrimination were also examined. Then the comparability of IRT equatings based on homogeneous and heterogeneous calibrations was assessed. The effects of calibration homogeneity on ability parameter estimates and discrimination parameter estimates are consistent with the existence of two highly correlated verbal dimensions. IRT equating results indicate that although violations of unidimensionality may have an impact on equating, the effect may not be substantial.

5.
A simulation study was performed to determine whether a group's average percent correct in a content domain could be accurately estimated for groups taking a single test form rather than the entire domain of items. Six item response theory (IRT)-based domain score estimation methods were evaluated under conditions of few items per content area per form taken, small domains, and small group sizes. The methods used item responses to a single form taken to estimate examinee or group ability; domain scores were then computed using the ability estimates and domain item characteristics. The IRT-based domain score estimates typically showed greater accuracy and greater consistency across forms taken than observed performance on the form taken. For the smallest group size and least number of items taken, the accuracy of most IRT-based estimates was questionable; however, a procedure that operates on an estimated distribution of group ability showed promise under most conditions.
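The abstract does not spell out the estimators, so the expressions below are a generic sketch of how IRT-based domain scores are commonly formed: the first averages the model-implied item probabilities over all N items in the domain at an examinee's ability estimate; the second is a group-level analogue that integrates over an estimated group ability distribution, in the spirit of the distribution-based procedure mentioned above.

```latex
\hat{\pi} \;=\; \frac{1}{N}\sum_{i=1}^{N} P_i(\hat{\theta}),
\qquad
\hat{\pi}_{\text{group}} \;=\; \frac{1}{N}\sum_{i=1}^{N} \int P_i(\theta)\,\hat{g}(\theta)\,d\theta .
```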

6.
This study provides an empirical comparison of the accuracy of item sampling and examinee sampling in estimating norm statistics. Item samples were composed of 3, 6, or 12 items selected from a total test of 50 multiple-choice vocabulary questions. Overall, the study findings provided empirical evidence that item sampling is approximately as effective as examinee sampling for estimating the population mean and standard deviation. Contradictory trends occurred for lower-ability and higher-ability student populations in the accuracy of estimated means and standard deviations when the number of items administered increased from 3 to 6 to 12. The findings from this study indicate that the variation of sequences of items occurring in item sampling need not have a significant effect on test performance.

7.
In this article, it is shown how item text can be represented by (a) 113 features quantifying the text's linguistic characteristics, (b) 16 measures of the extent to which an information-retrieval-based automatic question-answering system finds an item challenging, and (c) dense word representations (word embeddings). Using a random forests algorithm, these data are then used to train a prediction model for item response times, and predicted response times are then used to assemble test forms. Using empirical data from the United States Medical Licensing Examination, we show that timing demands are more consistent across these specially assembled forms than across forms comprising randomly selected items. Because an exam's timing conditions affect examinee performance, this result has implications for exam fairness whenever examinees are compared with each other or against a common standard.
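A minimal sketch of the modeling step described above: a random forests regressor trained on item features to predict log response times. The feature columns, simulated data, and hyperparameters here are placeholders, not the features or settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_items = 500

# Placeholder item features standing in for the linguistic, question-answering,
# and word-embedding features described in the abstract.
X = np.column_stack([
    rng.integers(20, 200, n_items),      # e.g., word count of the item text
    rng.random(n_items),                 # e.g., a QA-system difficulty score
    rng.normal(size=(n_items, 8)),       # e.g., a few embedding dimensions
])
log_rt = 3.0 + 0.004 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, n_items)

model = RandomForestRegressor(n_estimators=500, random_state=0)
scores = cross_val_score(model, X, log_rt, cv=5, scoring="r2")
print("Cross-validated R^2:", scores.mean())

# Predicted response times for new items could then feed a form-assembly step
# that balances total expected testing time across forms.
model.fit(X, log_rt)
predicted_seconds = np.exp(model.predict(X[:5]))
print("Predicted response times (s):", predicted_seconds.round(1))
```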

8.
According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing.

9.
10.
In some tests, examinees are required to choose a fixed number of items from a set of given items to answer. This practice creates a challenge to standard item response models, because more capable examinees may have an advantage by making wiser choices. In this study, we developed a new class of item response models to account for the choice effect of examinee-selected items. The results of a series of simulation studies showed (1) that the parameters of the new models were recovered well, (2) that the parameter estimates were almost unbiased when the new models were fit to data that were simulated from standard item response models, (3) that failing to consider the choice effect yielded shrunken parameter estimates for examinee-selected items, and (4) that even when the missingness mechanism in examinee-selected items did not follow the item response functions specified in the new models, the new models still yielded a better fit than did standard item response models. An empirical example of a college entrance examination supported the use of the new models: in general, the higher the examinee's ability, the better his or her choice of items.

11.
Studies that have investigated differences in examinee performance on items administered in paper-and-pencil form or on a computer screen have produced equivocal results. Certain item administration procedures were hypothesized to be among the most important variables causing differences in item performance and ultimately in test scores obtained from these different administration media. A study where these item administration procedures were made as identical as possible for each presentation medium is described. In addition, a methodology is presented for studying the difficulty and discrimination of items under each presentation medium as a post hoc procedure.

12.
In test-centered standard-setting methods, borderline performance can be represented by many different profiles of strengths and weaknesses. As a result, asking panelists to estimate item or test performance for a hypothetical group of borderline examinees, or for a typical borderline examinee, may be an extremely difficult task and one that can lead to questionable results in setting cut scores. In this study, data collected from a previous standard-setting study are used to deduce panelists' conceptions of profiles of borderline performance. These profiles are then used to predict cut scores on a test of algebra readiness. The results indicate that these profiles can predict a very wide range of cut scores both within and between panelists. Modifications are proposed to existing training procedures for test-centered methods that can account for the variation in borderline profiles.

13.
14.
Performance assessments, scenario-based tasks, and other groups of items carry a risk of violating the local item independence assumption made by unidimensional item response theory (IRT) models. Previous studies have identified negative impacts of ignoring such violations, most notably inflated reliability estimates. Still, the influence of this violation on examinee ability estimates has been comparatively neglected. It is known that such item dependencies cause low-ability examinees to have their scores overestimated and high-ability examinees' scores underestimated. However, the impact of these biases on examinee classification decisions has been little examined. In addition, because the influence of these dependencies varies along the underlying ability continuum, whether or not the location of the cut-point is important for correct classifications remains unanswered. This simulation study demonstrates that the strength of item dependencies and the location of an examination system's cut-points both influence the accuracy (i.e., the sensitivity and specificity) of examinee classifications. Practical implications of these results are discussed in terms of false positive and false negative classifications of test takers.

15.
An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drift in individual items, which may not be adequate for test characteristic curve-based linking or equating. One example is item response theory-based true score equating, whose goal is to generate a conversion table relating number-correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves, without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three-parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, while at the same time retaining a large number of linking items.
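For reference, the test characteristic curve (TCC) underlying this family of methods is simply the sum of the item response functions of the linking items, and drift detection of this kind works with the discrepancy between the old- and new-administration TCCs. The specific discrepancy measure and stepwise decision rule used in the article are not given in the abstract, so the second expression below is only one illustrative way such a discrepancy could be quantified.

```latex
T(\theta) \;=\; \sum_{i \in \text{linking}} P_i(\theta),
\qquad
D \;=\; \int \bigl|\, T_{\text{old}}(\theta) - T_{\text{new}}(\theta) \,\bigr|\, g(\theta)\, d\theta .
```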

16.
Previous research has shown that rapid-guessing behavior can degrade the validity of test scores from low-stakes proficiency tests. This study examined, using hierarchical generalized linear modeling, examinee and item characteristics for predicting rapid-guessing behavior. Several item characteristics were found to be significant: items with more text or those occurring later in the test were related to increased rapid guessing, while the inclusion of a graphic in an item was related to decreased rapid guessing. The sole significant examinee predictor was SAT total score. Implications of these results for measurement professionals developing low-stakes tests are discussed.

17.
In competency testing, it is sometimes difficult to properly equate scores of different forms of a test and thereby assure equivalent cutting scores. Under such circumstances, it is possible to set standards separately for each test form and then scale the judgments of the standard setters to achieve equivalent pass/fail decisions. Data from standard setters and examinees for a medical certifying examination were reanalyzed. Cutting score equivalents were derived by applying a linear procedure to the standard-setting results. These were compared against criteria along with the cutting score equivalents derived from typical examination equating procedures. Results indicated that the cutting score equivalents produced by the experts were closer to the criteria than standards derived from examinee performance, especially when the number of examinees used in equating was small. The root mean square error estimate was about 1 item on a 189-item test.

18.
We describe the development and administration of a recently introduced computer-based test of writing skills. This test asks the examinee to edit a writing passage presented on a computer screen. To do this, the examinee moves a cursor to a suspect section of the passage and chooses from a list of alternative ways of rewriting that section. Any or all parts of the passage can be changed, as often as the examinee likes. An able examinee identifies and fixes errors in grammar, organization, and style, whereas a less able examinee may leave errors untouched, replace an error with another error, or even introduce errors where none existed previously. All these response alternatives contrive to present both obvious and subtle scoring difficulties. These difficulties were attacked through the combined use of option weighting and the sequential probability ratio test, the result of which is to classify examinees into several discrete ability groups. Item calibration was enabled by augmenting sparse pretest samples through data meiosis, in which response vectors were randomly recombined to produce offspring that retained much of the character of their parents. These procedures are described, and operational examples are offered.
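A minimal sketch of the sequential probability ratio test (SPRT) used for classification, reduced here to deciding between two ability levels under an assumed 3PL model. The error rates, ability points, and item parameters are illustrative placeholders, and the option-weighting and data-meiosis steps described above are not reproduced.

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a keyed (correct) response."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def sprt_classify(responses, a, b, c, theta0=-0.5, theta1=0.5, alpha=0.05, beta=0.05):
    """Wald's SPRT: classify an examinee as 'low' (near theta0) or 'high' (near theta1).

    responses : iterable of 0/1 scored responses, in administration order
    a, b, c   : item parameter arrays aligned with the responses
    Returns 'high', 'low', or 'undecided' if the item supply ends without a decision.
    """
    upper = np.log((1.0 - beta) / alpha)   # decision bound for the high group
    lower = np.log(beta / (1.0 - alpha))   # decision bound for the low group
    llr = 0.0
    for u, ai, bi, ci in zip(responses, a, b, c):
        p1, p0 = p_3pl(theta1, ai, bi, ci), p_3pl(theta0, ai, bi, ci)
        llr += u * np.log(p1 / p0) + (1 - u) * np.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "high"
        if llr <= lower:
            return "low"
    return "undecided"

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_items = 40
    a = rng.uniform(0.8, 1.8, n_items)
    b = rng.normal(0.0, 1.0, n_items)
    c = np.full(n_items, 0.2)
    true_theta = 0.8
    u = (rng.random(n_items) < p_3pl(true_theta, a, b, c)).astype(int)
    print(sprt_classify(u, a, b, c))
```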

19.
For the purpose of obtaining data to use in test development, multiple matrix sampling (MMS) plans were compared to examinee sampling plans. Data were simulated for examinees, sampled from a population with a normal distribution of ability, responding to items selected from an item universe. Three item universes were considered: one that would produce a normal distribution of test scores, one a moderately platykurtic distribution, and one a very platykurtic distribution. When comparing sampling plans, total numbers of observations were held constant. No differences were found among plans in estimating item difficulty. Examinee sampling produced better estimates of item discrimination, test reliability, and test validity. As total number of observations increased, estimates improved considerably, especially for those MMS plans with larger subtest sizes. Larger numbers of observations were needed for tests designed to produce a normal distribution of test scores. With an adequate number of observations, MMS is seen as an alternative to examinee sampling in test development.

20.
The trustworthiness of low-stakes assessment results largely depends on examinee effort, which can be measured by the amount of time examinees devote to items using solution behavior (SB) indices. Because SB indices are calculated for each item, they can be used to understand how examinee motivation changes across items within a test. Latent class analysis (LCA) was used with the SB indices from three low-stakes assessments to explore patterns of solution behavior across items. Across tests, the favored models consisted of two classes, with Class 1 characterized by high and consistent solution behavior (>90% of examinees) and Class 2 by lower and less consistent solution behavior (<10% of examinees). Additional analyses provided supportive validity evidence for the two-class solution with notable differences between classes in self-reported effort, test scores, gender composition, and testing context. Although results were generally similar across the three assessments, striking differences were found in the nature of the solution behavior pattern for Class 2 and the ability of item characteristics to explain the pattern. The variability in the results suggests motivational changes across items may be unique to aspects of the testing situation (e.g., content of the assessment) for less motivated examinees.
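A minimal sketch of how item-level solution behavior (SB) indices of this kind are commonly computed from response times: each response is flagged as solution behavior if its time exceeds an item-specific rapid-guessing threshold. The threshold rule below (a fraction of the item's median response time) is a common heuristic, not necessarily the rule used in these studies, and all data are simulated placeholders.

```python
import numpy as np

def solution_behavior_flags(rt, threshold_frac=0.10):
    """Per-response solution-behavior (SB) indices from a response-time matrix.

    rt : (n_examinees, n_items) response times in seconds
    A response counts as solution behavior (1) if its time exceeds an
    item-specific threshold, here a fraction of that item's median time.
    """
    thresholds = threshold_frac * np.median(rt, axis=0)   # one threshold per item
    return (rt > thresholds).astype(int)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    rt = rng.lognormal(mean=3.5, sigma=0.5, size=(1000, 30))   # simulated times
    rt[rng.random(rt.shape) < 0.05] = rng.uniform(1, 3)        # inject rapid guesses
    sb = solution_behavior_flags(rt)
    # Item-level SB rates, or examinee-level response-time effort (RTE) scores,
    # could then be fed to a latent class analysis as in the abstract above.
    print("Mean SB rate per item:", sb.mean(axis=0).round(2)[:5])
    print("Examinee RTE scores:", sb.mean(axis=1).round(2)[:5])
```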
