期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Modeling Partial Knowledge on Multiple‐Choice Items Using Elimination Testing

Qian Wu Tinne De Laet Rianne Janssen 《Journal of Educational Measurement》2019,56(2):391-414

Single‐best answers to multiple‐choice items are commonly dichotomized into correct and incorrect responses, and modeled using either a dichotomous item response theory (IRT) model or a polytomous one if differences among all response options are to be retained. The current study presents an alternative IRT‐based modeling approach to multiple‐choice items administered with the procedure of elimination testing, which asks test‐takers to eliminate all the response options they consider to be incorrect. The partial credit model is derived for the obtained responses. By extracting more information pertaining to test‐takers’ partial knowledge on the items, the proposed approach has the advantage of providing more accurate estimation of the latent ability. In addition, it may shed some light on the possible answering processes of test‐takers on the items. As an illustration, the proposed approach is applied to a classroom examination of an undergraduate course in engineering science. 相似文献

2.

Item Parameter Estimation Under Conditions of Test Speededness: Application of a Mixture Rasch Model With Ordinal Constraints

Daniel M. Bolt Allan S. Cohen James A. Wollack 《Journal of Educational Measurement》2002,39(4):331-348

When tests are administered under fixed time constraints, test performances can be affected by speededness. Among other consequences, speededness can result in inaccurate parameter estimates in item response theory (IRT) models, especially for items located near the end of tests (Oshima, 1994). This article presents an IRT strategy for reducing contamination in item difficulty estimates due to speededness. Ordinal constraints are applied to a mixture Rasch model (Rost, 1990) so as to distinguish two latent classes of examinees: (a) a "speeded" class, comprised of examinees that had insufficient time to adequately answer end-of-test items, and (b) a "nonspeeded" class, comprised of examinees that had sufficient time to answer all items. The parameter estimates obtained for end-of-test items in the nonspeeded class are shown to more accurately approximate their difficulties when the items are administered at earlier locations on a different form of the test. A mixture model can also be used to estimate the class memberships of individual examinees. In this way, it can be determined whether membership in the speeded class is associated with other student characteristics. Results are reported for gender and ethnicity. 相似文献

3.

Modeling Skipped and Not‐Reached Items Using IRTrees

下载免费PDF全文

Dries Debeer Rianne Janssen Paul De Boeck 《Journal of Educational Measurement》2017,54(3):333-363

When dealing with missing responses, two types of omissions can be discerned: items can be skipped or not reached by the test taker. When the occurrence of these omissions is related to the proficiency process the missingness is nonignorable. The purpose of this article is to present a tree‐based IRT framework for modeling responses and omissions jointly, taking into account that test takers as well as items can contribute to the two types of omissions. The proposed framework covers several existing models for missing responses, and many IRTree models can be estimated using standard statistical software. Further, simulated data is used to show that ignoring missing responses is less robust than often considered. Finally, as an illustration of its applicability, the IRTree approach is applied to data from the 2009 PISA reading assessment. 相似文献

4.

对“选项式英语交际试题”交际性的质疑

孔德惠李潇君《通化师范学院学报》2006,(2):33-37,64

外语测试理论的发展进入交际语用阶段。在此背景下,我国一些大规模考试中出现了“选项式交际试题”。本文对此类试题的交际性提出质疑,认为它只反映了情景真实性,却忽视了交际过程的真实性,测试过程仅停留在语用分析层面,无法测量受试者实际运用语言的能力。本文还指出此类试题与其初衷相悖的先天局限性。考查交际知识固然有意义, 但若将涉及交际知识的选项式客观试题等同于交际语言测试则反映出对交际测试理论的认识存有片面性。在外语作为交际工具而被广泛使用的今天,认清交际语言测试的本质及其意义尤为重要,有助于我们在外语测试实践中针对不同的情况采取合适的测试手段,使外语测试活动更加具备目的性和合理性。相似文献

5.

Impact of Both Local Item Dependencies and Cut‐Point Locations on Examinee Classifications

下载免费PDF全文

Jonathan D. Rubright 《Educational Measurement》2018,37(3):40-45

Performance assessments, scenario‐based tasks, and other groups of items carry a risk of violating the local item independence assumption made by unidimensional item response theory (IRT) models. Previous studies have identified negative impacts of ignoring such violations, most notably inflated reliability estimates. Still, the influence of this violation on examinee ability estimates has been comparatively neglected. It is known that such item dependencies cause low‐ability examinees to have their scores overestimated and high‐ability examinees' scores underestimated. However, the impact of these biases on examinee classification decisions has been little examined. In addition, because the influence of these dependencies varies along the underlying ability continuum, whether or not the location of the cut‐point is important in regard to correct classifications remains unanswered. This simulation study demonstrates that the strength of item dependencies and the location of an examination systems’ cut‐points both influence the accuracy (i.e., the sensitivity and specificity) of examinee classifications. Practical implications of these results are discussed in terms of false positive and false negative classifications of test takers. 相似文献

6.

A Comparison of Self-Adapted and Computerized Adaptive Tests

Steven L. Wise Barbara S. Plake Phillip L. Johnson Linda L. Roos 《Journal of Educational Measurement》1992,29(4):329-339

According to item response theory (IRT), examinee ability estimation is independent of the particular set of test items administered from a calibrated pool. Although the most popular application of this feature of IRT is computerized adaptive (CA) testing, a recently proposed alternative is self-adapted (SA) testing, in which examinees choose the difficulty level of each of their test items. This study compared examinee performance under SA and CA tests, finding that examinees taking the SA test (a) obtained significantly higher ability scores and (b) reported significantly lower posttest state anxiety. The results of this study suggest that SA testing is a desirable format for computer-based testing. 相似文献

7.

A comparison of the approaches of generalizability theory and item response theory in estimating the reliability of test scores for testlet-composed tests

Guemin Lee In-Yong Park 《Asia Pacific Education Review》2012,13(1):47-54

Previous assessments of the reliability of test scores for testlet-composed tests have indicated that item-based estimation methods overestimate reliability. This study was designed to address issues related to the extent to which item-based estimation methods overestimate the reliability of test scores composed of testlets and to compare several estimation methods for different measurement models using simulation techniques. Three types of estimation approach were conceptualized for generalizability theory (GT) and item response theory (IRT): item score approach (ISA), testlet score approach (TSA), and item-nested-testlet approach (INTA). The magnitudes of overestimation when applying item-based methods ranged from 0.02 to 0.06 and were related to the degrees of dependence among within-testlet items. Reliability estimates from TSA were lower than those from INTA due to the loss of information with IRT approaches. However, this could not be applied in GT. Specified methods in IRT produced higher reliability estimates than those in GT using the same approach. Relatively smaller magnitudes of error in reliability estimates were observed for ISA and for methods in IRT. Thus, it seems reasonable to use TSA as well as INTA for both GT and IRT. However, if there is a relatively large dependence among within-testlet items, INTA should be considered for IRT due to nonnegligible loss of information. 相似文献

8.

An NCME Instructional Module on Polytomous Item Response Theory Models

Randall David Penfield 《Educational Measurement》2014,33(1):36-48

A polytomous item is one for which the responses are scored according to three or more categories. Given the increasing use of polytomous items in assessment practices, item response theory (IRT) models specialized for polytomous items are becoming increasingly common. The purpose of this ITEMS module is to provide an accessible overview of polytomous IRT models. The module presents commonly encountered polytomous IRT models, describes their properties, and contrasts their defining principles and assumptions. After completing this module, the reader should have a sound understating of what a polytomous IRT model is, the manner in which the equations of the models are generated from the model's underlying step functions, how widely used polytomous IRT models differ with respect to their definitional properties, and how to interpret the parameters of polytomous IRT models. 相似文献

9.

心理测量理论的新进展:潜在分类模型

甘媛源余嘉元《中国考试》2009,(3)

认知诊断模型是新一代心理测量理论——认知诊断理论的核心。它可分为潜在特质模型和潜在分类模型两大类。其中,潜在分类模型主要用于分析被试的作答过程从而探讨被试的潜在知识结构,克服了CCT和IRT的缺陷,开创了教育与心理测量领域新的里程碑。本文首先介绍作为该类模型基础的规则空间模型,然后集中探讨在此基础上发展起来的较新的潜在分类模型,最后对这类模型进行了评价和展望。相似文献

10.

Minimizing the Influence of Item Parameter Estimation Errors in Test Development: A Comparison of Three Selection Procedures

Mark J. Gierl Dianne Henderson Michael Jodoin Don Klinger 《Journal of Experimental Education》2013,81(3):261-279

In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods—maximum no target, maximum target, and theta maximum—using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed. 相似文献

11.

Sind Modelle der Item-Response-Theorie (IRT) das „Mittel der Wahl“ für die Modellierung von Kompetenzen?

Johannes Hartig Andreas Frey 《Zeitschrift für Erziehungswissenschaft》2013,16(1):47-51

Item response theory (IRT) models can be subsumed under the larger class of statistical models with latent variables. IRT models are increasingly used for the scaling of the responses derived from standardized assessments of competencies. The paper summarizes the strengths of IRT in contrast to more traditional techniques as well as in contrast to alternative models with latent variables (e. g. structural equation modeling). Subsequently, specific limitations of IRT and cases where other methods might be preferable are lined out. 相似文献

12.

Trace Lines for Classification Decisions

《教育实用测度》2013,26(4):311-330

Referral, placement, and retention decisions were analyzed using item response theory (IRT) to investigate whether classification decisions could be placed on the latent continuum of ability normally associated with test items. A second question pertained to the existence of classification differential item functioning (DIF) for the various decisions. When the decisions were calibrated, the resulting "item" parameters were similar to those that might be expected from conventional test items. For classification DIF analyses, referral decisions for ethnicity were found to be functioning differently for Whites versus non-Whites. Analyzing decisions represents a new unit of analysis for IRT and represents a powerful methodology that could be applied to a variety of new problem types. 相似文献

13.

Skilled but unaware of it: CAT undermines a test taker's metacognitive competence

Tuulia M. Ortner Eva Weißkopf Friederike X. R. Gerstenberg 《European Journal of Psychology of Education - EJPE》2013,28(1):37-51

We investigated students' metacognitive experiences with regard to feelings of difficulty (FD), feelings of satisfaction (FS), and estimate of effort (EE), employing either computerized adaptive testing (CAT) or computerized fixed item testing (FIT). In an experimental approach, 174 students in grades 10 to 13 were tested either with a CAT or a FIT version of a matrices test. Data revealed that metacognitive experiences were not related to the resulting test scores for CAT: test takers who took the matrices test in an adaptive mode were paradoxically more satisfied with their performance the worse they had performed in terms of the resulting ability parameter. They also rated the test as easier the lower they had performed, but their estimates of effort were higher the better they had performed. For test takers who took the FIT version, completely different results were revealed. In line with previous results, test takers were supposed to base these experiences on the subjectively estimated percentage of items solved. This moderated mediation hypothesis was in parts confirmed, as the relation between the percentage of items solved and FD, FS, and EE was revealed to be mediated by the estimated percentage of items solved. Results are discussed with reference to feedback acceptance, errant self-estimations, and test fairness with regard to a possible false regulation of effort in lower ability groups when using CAT. 相似文献

14.

IRT Estimation of Domain Scores

R. Darrell Bock David Thissen Michele F. Zimowski 《Journal of Educational Measurement》1997,34(3):197-211

In classical test theory, a test is regarded as a sample of items from a domain defined by generating rules or by content, process, and format specifications, l f the items are a random sample of the domain, then the percent-correct score on the test estimates the domain score, that is, the expected percent correct for all items in the domain. When the domain is represented by a large set of calibrated items, as in item banking applications, item response theory (IRT) provides an alternative estimator of the domain score by transformation of the IRT scale score on the test. This estimator has the advantage of not requiring the test items to be a random sample of the domain, and of having a simple standard error. We present here resampling results in real data demonstrating for uni- and multidimensional models that the IRT estimator is also a more accurate predictor of the domain score than is the classical percent-correct score. These results have implications for reporting outcomes of educational qualification testing and assessment. 相似文献

15.

Stepwise Analysis of Differential Item Functioning Based on Multiple-Group Partial Credit Model

Eiji Muraki 《Journal of Educational Measurement》1999,36(3):217-232

Bock, Muraki, and Pfeiffenberger (1988) proposed a dichotomous item response theory (IRT) model for the detection of differential item functioning (DIF), and they estimated the IRT parameters and the means and standard deviations of the multiple latent trait distributions. This IRT DIF detection method is extended to the partial credit model (Masters, 1982; Muraki, 1993) and presented as one of the multiple-group IRT models. Uniform and non-uniform DIF items and heterogeneous latent trait distributions were used to generate polytomous responses of multiple groups. The DIF method was applied to this simulated data using a stepwise procedure. The standardized DIF measures for slope and item location parameters successfully detected the non-uniform and uniform DIF items as well as recovered the means and standard deviations of the latent trait distributions.This stepwise DIF analysis based on the multiple-group partial credit model was then applied to the National Assessment of Educational Progress (NAEP) writing trend data. 相似文献

16.

Integrating Cognitive and Psychometric Models to Measure Document Literacy

Kathleen Sheehan Robert J. Mislevy 《Journal of Educational Measurement》1990,27(3):255-272

The Survey of Young Adult Literacy conducted in 1985 by the National Assessment of Educational Progress included 63 items that elicited skills in acquiring and using information from written documents. These items were analyzed using two different models: (1) a qualitative cognitive model, which characterized items in terms of the processing tasks they required, and (2) an item response theory (IRT) model, which characterized items difficulties and respondents' proficiencies simply by tendencies toward correct response. This paper demonstrates how a generalization of Fischer and Seheibleehner's Linear Logistic Test Model can be used to integrate information from the cognitive analysis into the IRT analysis, providing a foundation for subsequent item construction, test development, and diagnosis of individuals skill deficiencies. 相似文献

17.

Examinee Non-Effort on Contextualized and Non-Contextualized Mathematics Items in Large-Scale Assessments

Daniel Van Nijlen Rianne Janssen 《教育实用测度》2015,28(1):68-84

In this study it is investigated to what extent contextualized and non-contextualized mathematics test items have a differential impact on examinee effort. Mixture item response theory (IRT) models are applied to two subsets of items from a national assessment on mathematics in the second grade of the pre-vocational track in secondary education in Flanders. One subset focused on elementary arithmetic and consisted of non-contextualized items. Another subset of contextualized items focused on the application of arithmetic in authentic problem-solving situations. Results indicate that differential performance on the subsets is to a large extent due to test effort. The non-contextualized items appear to be much more susceptible to low examinee effort in low-stakes testing situations. However, subgroups of students can be found with regard to the extent to which they show low effort. One can distinguish a compliant, an underachieving, and a dropout group. Group membership is also linked to relevant background characteristics. 相似文献

18.

我国大规模英语水平考试偏重选择题的倾向亟待纠正（三） 总被引：1，自引：0，他引：1

孔德惠《通化师范学院学报》2004,(4):24-27

过分依赖和使用选择题的效果颇显荒诞：在施考眼里。单纯通过客观测试所获得的分数并不能准确反映学习的实际外语水平；而对于参试来说，专门用于对付客观测试的应试能力于他们日后使用外语的实际情形也没有实际的帮助。意识到了问题的严重性，我们应该立即采取措施扭转应试教育的被动局面。笔认为，我们应该根据自身国情和发展需求建立有自己特色的外语评价体系，这是我国外语教育事业健康发展的需要。本第三部分就这一问题提出了几点具体建议。相似文献

19.

Latent Class Detection and Class Assignment: A Comparison of the MAXEIG Taxometric Procedure and Factor Mixture Modeling Approaches

Gitta Lubke Stephen Tueller 《Structural equation modeling》2013,20(4):605-628

Taxometric procedures such as MAXEIG and factor mixture modeling (FMM) are used in latent class clustering, but they have very different sets of strengths and weaknesses. Taxometric procedures, popular in psychiatric and psychopathology applications, do not rely on distributional assumptions. Their sole purpose is to detect the presence of latent classes. The procedures capitalize on the assumption that, due to mean differences between two classes, item covariances within class are smaller than item covariances between the classes. FMM goes beyond class detection and permits the specification of hypothesis-based within-class covariance structures ranging from local independence to multidimensional within-class factor models. In principle, FMM permits the comparison of alternative models using likelihood-based indexes. These advantages come at the price of distributional assumptions. In addition, models are often highly parameterized and susceptible to misspecifications of the within-class covariance structure.

Following an illustration with an empirical data set of binary depression items, the MAXEIG procedure and FMM are compared in a simulation study focusing on class detection and the assignment of subjects to the latent classes. FMM generally outperformed MAXEIG in terms of class detection and class assignment. Substantially different class sizes negatively impacted the performance of both approaches, whereas low class separation was much more problematic for MAXEIG than for the FMM. 相似文献

20.

计算机化自适应测验(CAT)的发展和前景展望(续)

张华华程莹《考试研究》2005,(2)

三、CAT中对的估计(一)MLE(极大似然估计法)假设一个能力水平为θ的被试对n道项目X_1,X_2,…,X_n作答。θ的估计可以通过使(8)式所示的似然函数最大化的方式来得到。令(?)_n为此时所得的θ估计。显然(?)_n也是(9)式的极大似然估计。已知在一定的条件下,(?)_n符合渐进正态,其均值为θ,方差近似为I~(-1)_n((?)_n)。目前的CAT设计大多通过递归方式在被试回答一个新的项目之后得到最新的θ估计,并根据信息最大化法抽取下一个项目。相似文献