1.

A comparison of difficulty and discrimination values of selected true-false item types

Douglas Barker Robert L Ebel 《Contemporary educational psychology》1982,7(1):35-40

Thirty-eight undergraduate students were randomly assigned one of two alternate forms of a 144-item true-false midterm examination. Whenever a statement appeared on one form as true and positively stated, it appeared on the alternate form as false and negatively stated. Similarly, a false and positively stated item on one form was true and negatively stated on the other. The subject matter of the two forms was identical and the four kinds of true-false items were equally represented on each form. Difficulty and discrimination indices were computed for each of the four item types. The statistical results showed negatively stated items were more difficult, but no more discriminating, than positively stated items. Also, false items were not statistically more difficult than true items, but were significantly more discriminating. It was concluded that test constructors should include more false items than true items in their instruments and that all items should be stated positively.
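The abstract does not say which difficulty and discrimination indices were used. As a minimal sketch, assuming the common classical choices (proportion correct for difficulty, and the correlation between an item and the total score on the remaining items for discrimination), the two indices could be computed as follows. The function names `pearson` and `item_statistics` are illustrative, not taken from the article.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)

def item_statistics(responses):
    """Return one (difficulty, discrimination) pair per item.

    responses: list of examinee rows, each a list of 0/1 item scores.
    difficulty: proportion of examinees answering the item correctly
                (a lower value means a harder item).
    discrimination: correlation between the item score and the total
                    score on all other items (assumed index, see lead-in).
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - x for t, x in zip(totals, item)]  # total excluding item j
        stats.append((sum(item) / len(item), pearson(item, rest)))
    return stats
```

Excluding the item from the total avoids inflating the correlation with the item's own score, which matters most on short tests.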

2.

A Stepwise Test Characteristic Curve Method to Detect Item Parameter Drift

Rui Guo Yi Zheng Hua‐Hua Chang 《Journal of Educational Measurement》2015,52(3):280-300

An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, and at the same time retains a large number of linking items.
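For readers unfamiliar with the quantity being compared, the sketch below computes a test characteristic curve under the three-parameter logistic model and the largest gap between two administrations over a grid of ability values. It is not the stepwise detection procedure from the article; the function names, the plain logistic metric (no 1.7 scaling constant), and the grid-based gap measure are assumptions made for illustration.

```python
from math import exp

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response at ability theta.

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta.

    items: list of (a, b, c) parameter triples.
    """
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)

def max_tcc_gap(items_1, items_2, grid):
    """Largest absolute TCC difference between two sets of item
    parameter estimates, evaluated on a grid of theta values."""
    return max(abs(tcc(t, items_1) - tcc(t, items_2)) for t in grid)
```

A gap near zero across the grid suggests that the linking items imply nearly the same expected scores in both administrations, even if individual items have shifted.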

3.

AN ITERATIVE ITEM BIAS DETECTION METHOD (total citations: 1; self-citations: 0; citations by others: 1)

HENK VAN DER FLIER GIDEON J. MELLENBERGH HERMAN J. ADÈR MARINA WIJN 《Journal of Educational Measurement》1984,21(2):131-145

Two strategies for assessing item bias are discussed: methods that compare (transformed) item difficulties unconditional on ability level and methods that compare the probabilities of correct response conditional on ability level. In the present study, the logit model was used to compare the probabilities of correct response to an item by members of two groups, these probabilities being conditional on the observed score. Here the observed score serves as an indicator of ability level. The logit model was iteratively applied: In the Tth iteration, the T items with the highest value of the bias statistic are excluded from the test, and the observed score indicator of ability for the (T + 1)th iteration is computed from the remaining items. This method was applied to simulated data. The results suggest that the iterative logit method is a substantial improvement on the noniterative one, and that the iterative method is very efficient in detecting biased and unbiased items.

4.

IRT Estimation of Domain Scores

R. Darrell Bock David Thissen Michele F. Zimowski 《Journal of Educational Measurement》1997,34(3):197-211

In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications. If the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating for uni- and multidimensional models that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment.

5.

Identifying Promising Items: The Use of Crowdsourcing in the Development of Assessment Instruments

Philip M. Sadler Gerhard Sonnert Harold P. Coyle Kelly A. Miller 《Educational Assessment》2016,21(3):196-214

The psychometrically sound development of assessment instruments requires pilot testing of candidate items as a first step in gauging their quality, typically a time-consuming and costly effort. Crowdsourcing offers the opportunity for gathering data much more quickly and inexpensively than from most targeted populations. In a simulation of a pilot testing protocol, item parameters for 110 life science questions are estimated from 4,043 crowdsourced adult subjects and then compared with those from 20,937 middle school science students. In terms of item discrimination classification (high vs. low), classical test theory yields an acceptable level of agreement (C-statistic = 0.755); item response theory produces excellent results (C-statistic = 0.848). Item response theory also identifies potential anchor items without including any false positives (items with low discrimination in the targeted population). We conclude that the use of crowdsourcing subjects is a reasonable, efficient method for the identification of high-quality items for field testing and for the selection of anchor items to be used for test equating.

6.

Building a Unidimensional Test Using Multidimensional Items

Mark D. Reckase Terry A. Ackerman James E. Carlson 《Journal of Educational Measurement》1988,25(3):193-203

This paper demonstrates, both theoretically and empirically, using both simulated and real test data, that sets of items can be selected that meet the unidimensionality assumption of most item response theory models even though they require more than one ability for a correct response. Sets of items that measure the same composite of abilities as defined by multidimensional item response theory are shown to meet the unidimensionality assumption. A method for identifying such item sets is also presented.

7.

A Method for Maintaining Scale Stability in the Presence of Test Speededness

James A. Wollack Allan S. Cohen Craig S. Wells 《Journal of Educational Measurement》2003,40(4):307-330

Administering tests under time constraints may result in poorly estimated item parameters, particularly for items at the end of the test (Douglas, Kim, Habing, & Gao, 1998; Oshima, 1994). Bolt, Cohen, and Wollack (2002) developed an item response theory mixture model to identify a latent group of examinees for whom a test is overly speeded, and found that item parameter estimates for end-of-test items in the nonspeeded group were similar to estimates for those same items when administered earlier in the test. In this study, we used the Bolt et al. (2002) method to study the effect of removing speeded examinees on the stability of a score scale over an 11-year period. Results indicated that using only the nonspeeded examinees for equating and estimating item parameters provided a more unidimensional scale, smaller effects of item parameter drift (including fewer drifting items), and less scale drift (i.e., bias) and variability (i.e., root mean squared errors) when compared to the total group of examinees.

8.

Generalization of the Lord‐Wingersky Algorithm to Computing the Distribution of Summed Test Scores Based on Real‐Number Item Scores

Seonghoon Kim 《Journal of Educational Measurement》2013,50(4):381-389

With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number‐correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real‐number item scores. The generalized algorithm is distinct from the Lord‐Wingersky algorithm in that it explicitly incorporates the task of figuring out all possible unique real‐number test scores in each recursion. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test by varying scoring schemes for its items.
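The recursion described above can be sketched as follows, assuming item responses are conditionally independent given proficiency (the usual IRT assumption). This is an illustrative reading of the idea, not the article's own implementation: the function name, the dictionary representation, and the rounding used to merge floating-point sums that represent the same score are all assumptions.

```python
from collections import defaultdict

def summed_score_distribution(item_score_probs, ndigits=10):
    """Conditional distribution of the summed score at a fixed proficiency.

    item_score_probs: list with one dict per item, mapping each possible
        item score (any real number) to its probability at that proficiency.
    ndigits: rounding applied to summed scores so that equal totals reached
        by different paths are merged (an implementation choice).
    Returns a dict mapping each unique summed score to its probability.
    """
    dist = {0.0: 1.0}  # before any item is added, the score is 0 with certainty
    for score_probs in item_score_probs:
        new_dist = defaultdict(float)
        # Add one item: every existing total combines with every item score.
        for total, p_total in dist.items():
            for score, p_score in score_probs.items():
                new_dist[round(total + score, ndigits)] += p_total * p_score
        dist = dict(new_dist)
    return dist
```

With every item scored 0 or 1, the keys are the number-correct scores and the result matches the original Lord-Wingersky recursion; with weights such as 1.5, the set of unique totals is discovered as the recursion proceeds.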

9.

Dimensionality in Compensatory MIRT When Complex Structure Exists: Evaluation of DETECT and NOHARM

Dubravka Svetina Roy Levy 《Journal of Experimental Education》2016,84(2):398-420

This study investigated the effect of complex structure on dimensionality assessment in compensatory multidimensional item response models using DETECT- and NOHARM-based methods. The performance was evaluated via the accuracy of identifying the correct number of dimensions and the ability to accurately recover item groupings using a simple matching similarity (SM) coefficient. The DETECT-based methods yielded a higher proportion correct than the NOHARM-based methods in two- and three-dimensional conditions, especially when correlations were ≤.60, data exhibited ≤30% complexity, and sample size was 1,000. As the complexity increased and the sample size decreased, the performance of the methods typically diminished. The NOHARM-based methods were either equally successful or better in recovering item groupings than the DETECT-based methods and were mostly affected by complexity levels. The DETECT-based methods were affected largely by the test length, such that with the increase of the number of items, SM coefficients would decrease substantially.

10.

Reliability Estimation Methods for Test Papers Containing a Single High-Point-Value Subjective Item

杨志明 丁港 王雯 《教育测量与评价(理论版)》2021,(1):44-48

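Test reliability is one of the core indicators of examination quality, but conventional reliability estimation methods are not appropriate for a test paper that contains a single high-point-value subjective item, because such an item has too large an influence on the variance of the total test score. One way to address this problem is to first estimate the reliability of the single high-point-value subjective item and then apply the stratified α coefficient formula to estimate the reliability of the whole test paper. The reliability of the single high-point-value subjective item can be estimated in two ways: with the test-retest reliability method, or with a method based on the fact that the correlation coefficient between two random variables is attenuated by the presence of random error.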
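As a minimal sketch of the second step, the stratified α coefficient in its standard form is 1 minus the sum over strata of (stratum score variance × (1 − stratum reliability)), divided by the total score variance. The code below assumes that standard form; the function name and the example numbers are hypothetical and do not come from the article.

```python
def stratified_alpha(strata, total_variance):
    """Stratified alpha for a composite score.

    strata: iterable of (variance, reliability) pairs, one per stratum,
        e.g. the objective section and the single high-point-value item.
    total_variance: variance of the total test score.
    """
    error_variance = sum(var * (1.0 - rel) for var, rel in strata)
    return 1.0 - error_variance / total_variance

# Hypothetical example: an objective section (variance 40, reliability 0.85)
# and one essay item (variance 30, separately estimated reliability 0.60),
# with a total-score variance of 100.
example = stratified_alpha([(40.0, 0.85), (30.0, 0.60)], 100.0)
```

Treating the single item as its own stratum lets its separately estimated reliability enter the composite directly instead of being averaged in with the many low-point-value items.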

11.

Written feedback: Response certitude and durability

Raymond W. Kulhavy William A. Stock Thomas E. Hancock Linda K. Swindell Penny L. Hammrich 《Contemporary educational psychology》1990,15(4)

This study tested assumptions of a servocontrol model of test item feedback. High school students responded to multiple-choice items and rated their certainty of correctness in each response. Next, learners either received feedback on the items or responded again to the same test. The same items were tested again after 1 and 8 days, with the order of alternatives randomized for half of the subjects in each feedback group. The results generally supported the control model and suggest that response certitude estimates can be treated as an index of comprehension.

12.

The Effect of Item Response Changes on Scores on an Elementary Reading Achievement Test

《The Journal of educational research》2012,105(3):153-156

The effect of changing item responses on scores of elementary school children on a standardized achievement test was studied. Previous research, primarily involving non-standardized instruments and adult samples, indicates that changed responses are more likely to be correct than not. Subjects were 165 third grade students using the Metropolitan Reading Tests. Students received no special instructions regarding changing responses. Changes were identified visually and were independently verified. While frequency of response changes was low, such changes generally improved scores. Sex differences in number and success of changes were non-significant. The relationship between frequency of response change and test score was minimal. Responses to difficult items were changed more frequently with less success than changes on easy items. High scorers made more successful changes than did low scorers. Within the limits of the methodology, results clearly indicated that response changes of elementary students on multiple-choice items tend to improve test scores.

13.

Affordances of Item Formats and Their Effects on Test‐Taker Cognition under Uncertainty

Jung Aa Moon Madeleine Keehner Irvin R. Katz 《Educational Measurement》2019,38(1):54-62

The current study investigated how item formats and their inherent affordances influence test‐takers’ cognition under uncertainty. Adult participants solved content‐equivalent math items in multiple‐selection multiple‐choice and four alternative grid formats. The results indicated that participants’ affirmative response tendency (i.e., judge the given information as True) was affected by the presence of a grid, type of grid options, and their visual layouts. The item formats further affected the test scores obtained from the alternatives keyed True and the alternatives keyed False, and their psychometric properties. The current results suggest that the affordances rendered by item design can lead to markedly different test‐taker behaviors and can potentially influence test outcomes. They emphasize that a better understanding of the cognitive implications of item formats could potentially facilitate item design decisions for large‐scale educational assessments.

14.

The Scaling of Mixed-Item-Format Tests With the One-Parameter and Two-Parameter Partial Credit Models

Robert C. Sykes Wendy M. Yen 《Journal of Educational Measurement》2000,37(3):221-244

Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed response (c.r.) and multiple choice (m.c.) items and in overall difficulty. The scalings included those based on scores for the c.r. items that had maintained the same number of levels as the item rubrics, either produced from single ratings or multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form of c.r. items obtained by summing multiple ratings. A one-parameter (1PPC) or two-parameter (2PPC) partial credit model was used for the c.r. items and the one-parameter logistic (1PL) or three-parameter logistic (3PL) model for the m.c. items. Item fit was substantially worse with the combination 1PL/1PPC model than the 3PL/2PPC model due to the former's restrictive assumptions that there would be no guessing on the m.c. items and equal item discrimination across items and item types. The presence of varying item discriminations resulted in the 1PL/1PPC model producing estimates of item information that could be spuriously inflated for c.r. items that had three or more score levels. Information for some items with summed ratings was usually overestimated by 300% or more for the 1PL/1PPC model. These inflated information values resulted in underestimated standard errors of ability estimates. The constraints posed by the restricted model suggest limitations on the testing contexts in which the 1PL/1PPC model can be accurately applied.

15.

Item analysis of the Wechsler Intelligence Scale for Children-Revised

Hubert Booney Vance Patricia Gaynor Margaret Coleman 《Psychology in the schools》1977,14(2):132-139

Indices of item difficulty and item discrimination were analyzed for the items comprising the Wechsler Intelligence Scale for Children - Revised as obtained from a group of 142 subjects with Full Scale IQs below 96. Item validities were estimated by computing the biserial correlation between dichotomized item responses and the total weight score. Kendall's tau was computed for each item. The item difficulties for each subtest except Information and Vocabulary are roughly in the same rank order as those obtained by the standardization group. Evidence from the study indicates that the increase in the number of items on the WISC-R helped to increase its internal validity. Analysis of the data regarding the internal consistency of the test indicates that the majority of the items operate as significant discriminators. Changes in the order of that administration and/or revision of the record form would not seem warranted on the basis of the present study.

16.

Using Response Time to Detect Item Preknowledge in Computer‐Based Licensure Examinations

Hong Qian Dorota Staniewska Mark Reckase Ada Woo 《Educational Measurement》2016,35(1):38-47

This article addresses the issue of how to detect item preknowledge using item response time data in two computer‐based large‐scale licensure examinations. Item preknowledge is indicated by an unexpectedly short response time and a correct response. Two samples were used for detecting item preknowledge for each examination. The first sample was from the early stage of the operational test and was used for item calibration. The second sample was from the late stage of the operational test, which may feature item preknowledge. The purpose of this research was to explore whether there was evidence of item preknowledge and compromised items in the second sample using the parameters estimated from the first sample. The results showed that for one nonadaptive operational examination, two items (of 111) were potentially exposed, and two candidates (of 1,172) showed some indications of preknowledge on multiple items. For another licensure examination that featured computerized adaptive testing, there was no indication of item preknowledge or compromised items. Implications for detected aberrant examinees and compromised items are discussed in the article.

17.

An Exploration of Item-Writing Techniques Based on Difficult Items in the Gaokao English Test

程晓堂 王瑶 《中国考试》2021,(5):63-71

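Difficulty is not an inherent property of a test item but the result of the interaction between test-taker factors and item characteristics. Many item analysts tend to attribute high item difficulty solely to students' failure to master the relevant knowledge or skills, and overlook the characteristics of the items themselves. This study analyzes 60 Gaokao English items with difficulty values below 0.6 to investigate the sources of their difficulty. The results show that, apart from test-taker factors, the difficulty of hard or relatively hard items is also related to item-writing technique, for example problems with the uniqueness and acceptability of answers, tested content that goes beyond the syllabus, and inappropriate testing points or scoring criteria. The authors therefore propose that examination agencies raise the standard of item writing and strengthen item quality monitoring, so that large-scale examinations select talent on a sound scientific basis.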

18.

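Data-Analysis-Based Quality Evaluation Standards for Gaokao Items Under the CTT Framework: An Empirical Study of the 2004-2008 Beijing Gaokao Papers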

赵海燕 臧铁军 《中国考试》2009,(8)

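Evaluating examination quality is an important topic in current educational and testing research, and the quantitative evaluation of test papers and items is an important cornerstone of that evaluation. Based on a statistical analysis of the full examinee population for the Beijing Gaokao papers, this article takes an empirical approach and proposes data-based quality evaluation standards for Gaokao items. The standards mainly cover item difficulty, discrimination, option analysis, effective score range, and score utilization rate. The analysis shows that quality evaluation of papers and items in large-scale educational examinations must take into account the examination type, subject, item type, and score weight. Different difficulty standards should be set according to the subject and item type; evaluation of discrimination should consider score weight, and for multiple-choice items the options can be analyzed further; for polytomously scored items, the effective score range and score utilization rate can be examined further.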

19.

Identification and Evaluation of Local Item Dependencies in the Medical College Admissions Test

April L. Zenisky Ronald K. Hambleton Stephen G. Sired 《Journal of Educational Measurement》2002,39(4):291-309

Measurement specialists routinely assume examinee responses to test items are independent of one another. However, previous research has shown that many contemporary tests contain item dependencies and not accounting for these dependencies leads to misleading estimates of item, test, and ability parameters. The goals of the study were (a) to review methods for detecting local item dependence (LID), (b) to discuss the use of testlets to account for LID in context-dependent item sets, (c) to apply LID detection methods and testlet-based item calibrations to data from a large-scale, high-stakes admissions test, and (d) to evaluate the results with respect to test score reliability and examinee proficiency estimation. Item dependencies were found in the test and these were due to test speededness or context dependence (related to passage structure). Also, the results highlight that steps taken to correct for the presence of LID and obtain less biased reliability estimates may impact on the estimation of examinee proficiency. The practical effects of the presence of LID on passage-based tests are discussed, as are issues regarding how to calibrate context-dependent item sets using item response theory.

20.

Estimating Average Domain Scores

Mary Pommerich W. Alan Nicewander Bradley A. Hanson 《Journal of Educational Measurement》1999,36(3):199-216

A simulation study was performed to determine whether a group's average percent correct in a content domain could be accurately estimated for groups taking a single test form and not the entire domain of items. Six Item Response Theory based domain score estimation methods were evaluated, under conditions of few items per content area per form taken, small domains, and small group sizes. The methods used item responses to a single form taken to estimate examinee or group ability; domain scores were then computed using the ability estimates and domain item characteristics. The IRT-based domain score estimates typically showed greater accuracy and greater consistency across forms taken than observed performance on the form taken. For the smallest group size and least number of items taken, the accuracy of most IRT-based estimates was questionable; however, a procedure that operates on an estimated distribution of group ability showed promise under most conditions.