Similar Articles (20 results)
1.
Using factor analysis, we conducted an assessment of multidimensionality for 6 forms of the Law School Admission Test (LSAT) and found 2 subgroups of items or factors for each of the 6 forms. The main conclusion of the factor analysis component of this study was that the LSAT appears to measure 2 different reasoning abilities: inductive and deductive. The technique of N. J. Dorans & N. M. Kingston (1985) was used to examine the effect of dimensionality on equating. We began by calibrating (with item response theory [IRT] methods) all items on a form to obtain Set I of estimated IRT item parameters. Next, the test was divided into 2 homogeneous subgroups of items, each having been determined to represent a different ability (i.e., inductive or deductive reasoning). The items within these subgroups were then recalibrated separately to obtain item parameter estimates, and then combined into Set II. The estimated item parameters and true-score equating tables for Sets I and II corresponded closely.

2.
IRT Equating Methods
The purpose of this instructional module is to provide the basis for understanding the process of score equating through the use of item response theory (IRT). A context is provided for addressing the merits of IRT equating methods. The mechanics of IRT equating and the need to place parameter estimates from separate calibration runs on the same scale are discussed. Some procedures for placing parameter estimates on a common scale are presented. In addition, IRT true-score equating is discussed in some detail. A discussion of the practical advantages derived from IRT equating is offered at the end of the module.
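To make the two steps the module discusses concrete, here is a minimal sketch assuming a 3PL model and invented item parameters: common-item difficulties place one calibration on the other's scale with the mean/sigma method, and IRT true-score equating then maps a Form X true score to a Form Y true score. The function names and the mean/sigma choice are illustrative assumptions, not the module's prescribed procedure.

```python
# Illustrative sketch only: mean/sigma scale linking on common-item b-values,
# followed by 3PL IRT true-score equating. All parameter values are invented.
import numpy as np
from scipy.optimize import brentq

def p3pl(theta, a, b, c):
    """3PL probability of a correct response (D = 1.7)."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def mean_sigma(b_common_x, b_common_y):
    """Slope/intercept placing Form X parameters on the Form Y scale."""
    A = np.std(b_common_y, ddof=1) / np.std(b_common_x, ddof=1)
    B = np.mean(b_common_y) - A * np.mean(b_common_x)
    return A, B

def rescale(a, b, c, A, B):
    return a / A, A * b + B, c   # c is unchanged by a linear transformation

def irt_true_score_equate(tau_x, ax, bx, cx, ay, by, cy):
    """Find theta whose Form X true score equals tau_x; return Form Y true score.
    tau_x must lie between sum(cx) and the number of Form X items."""
    f = lambda th: p3pl(th, ax, bx, cx).sum() - tau_x
    theta = brentq(f, -6.0, 6.0)
    return p3pl(theta, ay, by, cy).sum()
```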

3.
Test preparation activities were determined for a large representative sample of Graduate Record Examination (GRE) Aptitude Test takers. About 3% of these examinees had attended formal coaching programs for one or more sections of the test.
After adjusting for differences in the background characteristics of coached and uncoached students, effects on test scores were related to the length and the type of programs offered. The effects on GRE verbal ability scores were not significantly related to the amount of coaching examinees received, and quantitative coaching effects increased slightly but not significantly with additional coaching. Effects on analytical ability scores, on the other hand, were related significantly to the length of coaching programs, through improved performance on two analytical item types, which have since been deleted from the test.
Overall, the data suggest that, when compared with the two highly susceptible item types that have been removed from the GRE Aptitude Test, the test item types in the current version of the test (now called the GRE General Test) appear to show relatively little susceptibility to formal coaching experiences of the kinds considered here.

4.
Because parameter estimates from different calibration runs under the IRT model are linearly related, a linear equation can convert IRT parameter estimates onto another scale metric without changing the probability of a correct response (Kolen & Brennan, 1995, 2004). This study was designed to explore a new approach to finding a linear equation by fixing c-parameters for anchor items in IRT equating. A rationale for fixing c-parameters for anchor items in IRT equating can be established from the fact that c-parameters are not affected by any linear transformation. This new approach avoids the difficulty of obtaining accurate c-parameter estimates for anchor items embedded in the application of the IRT model. Based upon our findings in this study, we would recommend using the new approach of fixing c-parameters for anchor items in IRT equating.
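A small numerical check (with made-up values) illustrates the invariance property the abstract invokes: under the linear rescaling theta* = A*theta + B with a* = a/A and b* = A*b + B, the 3PL response probability is unchanged, and the c-parameter plays no part in the transformation. This is a sketch of the rationale, not the study's estimation procedure.

```python
# Invariance check: (a/A) * ((A*theta + B) - (A*b + B)) = a * (theta - b),
# so P is identical on either scale and c can be held fixed. Values are arbitrary.
import math

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + math.exp(-1.7 * a * (theta - b)))

theta, a, b, c = 0.5, 1.2, -0.3, 0.2   # illustrative item and examinee values
A, B = 1.4, 0.6                        # illustrative linear transformation

assert abs(p3pl(theta, a, b, c) - p3pl(A * theta + B, a / A, A * b + B, c)) < 1e-12
```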

5.
In this study, the authors explored the importance of item difficulty (equated delta) as a predictor of differential item functioning (DIF) of Black versus matched White examinees for four verbal item types (analogies, antonyms, sentence completions, reading comprehension) using 13 GRE-disclosed forms (988 verbal items) and 11 SAT-disclosed forms (935 verbal items). The average correlation across test forms for each item type (and often the correlation for each individual test form as well) revealed a significant relationship between item difficulty and DIF value for both the GRE and the SAT. The most important finding indicates that for hard items, Black examinees perform differentially better than matched-ability White examinees for each of the four item types and for both the GRE and SAT tests. The results further suggest that the amount of verbal context is an important determinant of the magnitude of the relationship between item difficulty and differential performance of Black versus matched White examinees. Several hypotheses accounting for this result were explored.

6.
A formal analysis of the effects of item deletion on equating/scaling functions and reported score distributions is presented. There are two components of the present analysis: analytical and empirical. The analytical decomposition demonstrates how the effects of item characteristics, test properties, individual examinee responses, and rounding rules combine to produce the item deletion effect on the equating/scaling function and candidate scores. In addition to demonstrating how the deleted item's psychometric characteristics can affect the equating function, the analytical component of the report examines the effects of not scoring versus scoring all options correct, the effects of re-equating versus not re-equating, and the interaction between the decision to re-equate or not re-equate and the scoring option chosen for the flawed item. The empirical portion of the report uses data from the May 1982 administration of the SAT, which contained the circles item, to illustrate the effects of item deletion on reported score distributions and equating functions. The empirical data verify what the analytical decomposition predicts.

7.
Various applications of item response theory often require linking to achieve a common scale for item parameter estimates obtained from different groups. This article used a simulation to examine the relative performance of four different item response theory (IRT) linking procedures in a random groups equating design: concurrent calibration with multiple groups, separate calibration with the Stocking-Lord method, separate calibration with the Haebara method, and proficiency transformation. The simulation conditions used in this article included three sampling designs, two levels of sample size, and two levels of the number of items. In general, the separate calibration procedures performed better than the concurrent calibration and proficiency transformation procedures, even though some inconsistent results were observed across different simulation conditions. Some advantages and disadvantages of the linking procedures are discussed.
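As one concrete point of reference for the separate-calibration procedures compared above, a minimal Stocking-Lord sketch is shown below: the slope and intercept are chosen to minimize the squared distance between the common-item test characteristic curves from the two calibrations (here under a 2PL, with placeholder quadrature points). This is an assumption-laden illustration, not the simulation code used in the study.

```python
# Stocking-Lord linking sketch (2PL): choose A, B so the common-item test
# characteristic curve from the new calibration, transformed onto the old
# scale, matches the old calibration's curve. Inputs are placeholders.
import numpy as np
from scipy.optimize import minimize

def p2pl(theta, a, b):
    # theta: (K,), a, b: (J,) -> (K, J) matrix of response probabilities
    return 1.0 / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b[None, :])))

def stocking_lord(a_old, b_old, a_new, b_new):
    theta = np.linspace(-4.0, 4.0, 41)          # evaluation grid
    tcc_old = p2pl(theta, a_old, b_old).sum(axis=1)
    def loss(x):
        A, B = x
        tcc_new = p2pl(theta, a_new / A, A * b_new + B).sum(axis=1)
        return np.sum((tcc_old - tcc_new) ** 2)
    res = minimize(loss, x0=np.array([1.0, 0.0]), method="Nelder-Mead")
    return res.x                                # (A, B) onto the old scale
```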

8.
The equating performance of two internal anchor test structures—miditests and minitests—is studied for four IRT equating methods using simulated data. Originally proposed by Sinharay and Holland, miditests are anchors that have the same mean difficulty as the overall test but less variance in item difficulties. Four popular IRT equating methods were tested, and both the means and SDs of the true ability of the group to be equated were varied. We evaluate equating accuracy marginally and conditional on true ability. Our results suggest miditests perform about as well as traditional minitests for most conditions. Findings are discussed in terms of comparability to the typical minitest design and the trade-off between accuracy and flexibility in test construction.
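A rough sketch of the distinction between the two anchor designs, under the assumption (ours, for illustration) that anchors are picked from the full item pool by difficulty alone: the miditest takes items nearest the pool's mean difficulty, while the minitest spreads its picks across the full difficulty range, so both match the mean but differ in difficulty variance.

```python
# Hypothetical anchor construction by difficulty alone; not the procedure of
# Sinharay and Holland or of the simulation study summarized above.
import numpy as np

def build_anchor(b_pool, n_anchor, design="miditest"):
    """Return indices of anchor items chosen from difficulties b_pool."""
    if design == "miditest":
        # items closest to the overall mean difficulty -> small difficulty SD
        return np.argsort(np.abs(b_pool - b_pool.mean()))[:n_anchor]
    # "minitest": evenly spaced picks across the sorted difficulty range
    order = np.argsort(b_pool)
    picks = np.linspace(0, len(b_pool) - 1, n_anchor).astype(int)
    return order[picks]
```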

9.
This article considers potential problems that can arise in estimating a unidimensional item response theory (IRT) model when some test items are multidimensional (i.e., show a complex factorial structure). More specifically, this study examines (1) the consequences of model misfit on IRT item parameter estimates due to unintended minor item-level multidimensionality, and (2) whether a Projection IRT model can provide a useful remedy. A real-data example is used to illustrate the problem and also is used as a base model for a simulation study. The results suggest that ignoring item-level multidimensionality might lead to inflated item discrimination parameter estimates when the proportion of multidimensional test items to unidimensional test items is as low as 1:5. The Projection IRT model appears to be a useful tool for updating unidimensional item parameter estimates of multidimensional test items for a purified unidimensional interpretation.

10.
The analytically derived asymptotic standard errors (SEs) of maximum likelihood (ML) item estimates can be approximated by a mathematical function without examinees' responses to test items, and the empirically determined SEs of marginal maximum likelihood estimation (MMLE)/Bayesian item estimates can be obtained when the same set of items is repeatedly estimated from the simulation (or resampling) test data. The latter method will result in rather stable and accurate SE estimates as the number of replications increases, but requires cumbersome and time-consuming calculations. Instead of using the empirically determined method, the adequacy of using the analytical-based method in predicting the SEs for item parameter estimates was examined by comparing results produced from both approaches. The results indicated that the SEs yielded from both approaches were, in most cases, very similar, especially when they were applied to a generalized partial credit model. This finding encourages test practitioners and researchers to apply the analytically derived asymptotic SEs of item estimates to the context of item-linking studies, as well as to the method of quantifying the SEs of equating scores for the item response theory (IRT) true-score method. A three-dimensional graphical presentation of the analytical SEs of item estimates as a bivariate function of item difficulty and item discrimination was also provided for a better understanding of several frequently used IRT models.
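One way to read "approximated by a mathematical function without examinees' responses" is via the expected Fisher information of the item parameters, integrated over an assumed ability distribution; the sketch below does this for a 2PL item. The normal-ability assumption, quadrature grid, sample size, and parameter values are all illustrative, and this is not necessarily the exact derivation the study used.

```python
# Hedged sketch: analytic (response-free) asymptotic SEs for a 2PL item's
# (a, b) estimates from the expected Fisher information under N(0, 1) ability.
import numpy as np

def analytic_se_2pl(a, b, n_examinees):
    nodes = np.linspace(-4.0, 4.0, 81)
    w = np.exp(-0.5 * nodes ** 2)
    w /= w.sum()                                   # discretized N(0,1) weights
    P = 1.0 / (1.0 + np.exp(-1.7 * a * (nodes - b)))
    Q = 1.0 - P
    grads = np.vstack([1.7 * (nodes - b) * P * Q,  # dP/da
                       -1.7 * a * P * Q])          # dP/db
    info = n_examinees * (grads * (w / (P * Q))) @ grads.T
    return np.sqrt(np.diag(np.linalg.inv(info)))   # [SE(a-hat), SE(b-hat)]

print(analytic_se_2pl(a=1.0, b=0.0, n_examinees=1000))
```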

11.
This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma method). The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating with an exception at the low scores.

12.
Using a New Statistical Model for Testlets to Score TOEFL
Standard item response theory (IRT) models fit to examination responses ignore the fact that sets of items (testlets) often are matched with a single common stimulus (e.g., a reading comprehension passage). In this setting, all items given to an examinee are unlikely to be conditionally independent (given examinee proficiency). Models that assume conditional independence will overestimate the precision with which examinee proficiency is measured. Overstatement of precision may lead to inaccurate inferences as well as prematurely ended examinations in which the stopping rule is based on the estimated standard error of examinee proficiency (e.g., an adaptive test). The standard three-parameter IRT model was modified to include an additional random effect for items nested within the same testlet (Wainer, Bradlow, & Du, 2000). This parameter, γ, characterizes the amount of local dependence in a testlet.
We fit 86 TOEFL testlets (50 reading comprehension and 36 listening comprehension) with the new model and obtained a value for the variance of γ for each testlet. We compared the standard parameters (discrimination (a), difficulty (b), and guessing (c)) with what is obtained through traditional modeling. We found that difficulties were well estimated either way, but estimates of both a and c were biased if conditional independence was incorrectly assumed. Of greater import, we found that test information was substantially overestimated when conditional independence was incorrectly assumed.
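The model described above (the standard 3PL with an added testlet effect γ shared by items nested in the same testlet) can be written down directly; the sketch below just evaluates that response function for one examinee on one three-item testlet. All numeric values are invented, not TOEFL estimates.

```python
# Testlet 3PL sketch: P(correct) = c + (1 - c) * logistic(a * (theta - b - gamma_d)),
# where gamma_d is the random effect shared by all items in testlet d.
import numpy as np

def p_testlet_3pl(theta, a, b, c, gamma_d):
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b - gamma_d)))

rng = np.random.default_rng(0)
theta = 0.4                                  # examinee proficiency
gamma_d = rng.normal(0.0, 0.5)               # shared effect for this testlet
a = np.array([1.1, 0.8, 1.4])                # illustrative item parameters
b = np.array([-0.2, 0.5, 1.0])
c = np.array([0.20, 0.25, 0.20])
print(p_testlet_3pl(theta, a, b, c, gamma_d))
```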

13.
Educational Assessment, 2013, 18(4), 295-308
Performance on the reading comprehension (RC) tasks of the Scholastic Assessment Test-I (SAT-I or the "new" SAT), the Enhanced American College Testing Assessment (ACT), and the Graduate Record Examination (GRE) when passages were missing was examined. For the SAT-I and ACT, scores were well above chance and correlated substantially with verbal score on the earlier version of the SAT, indicating that examinees perform similarly with or without passages. Comparable but weaker results were found for the GRE. The findings raise doubts about the construct validity of the RC task. We argue that performance is influenced by the plausibility of item choices with or without the passages and that this, in turn, is the result of the construction of test items with little knowledge of the underlying reading process.

14.
The standardization method for assessing unexpected differential item performance or differential item functioning (DIF) is introduced. The principal findings of the first five studies that have used this approach on the Scholastic Aptitude Test are presented.
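The index behind the standardization method can be sketched as a weighted average, over matched score levels, of the focal-minus-reference difference in proportion correct, with focal-group counts as weights; the implementation below is a minimal illustration under that reading, not the operational procedure used in the studies cited.

```python
# Standardized p-difference sketch: sum_m w_m * (P_fm - P_rm) / sum_m w_m,
# where m indexes matched score levels and w_m is the focal-group count at m.
import numpy as np

def std_p_dif(p_focal, p_ref, n_focal):
    """p_focal, p_ref: proportion correct at each matched score level;
    n_focal: focal-group counts at those levels (used as weights)."""
    w = np.asarray(n_focal, dtype=float)
    diff = np.asarray(p_focal, dtype=float) - np.asarray(p_ref, dtype=float)
    return float(np.sum(w * diff) / np.sum(w))
```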

15.
As low-stakes testing contexts increase, low test-taking effort may serve as a serious validity threat. One common solution to this problem is to identify noneffortful responses and treat them as missing during parameter estimation via the effort-moderated item response theory (EM-IRT) model. Although this model has been shown to outperform traditional IRT models (e.g., two-parameter logistic [2PL]) in parameter estimation under simulated conditions, prior research has failed to examine its performance under violations to the model’s assumptions. Therefore, the objective of this simulation study was to examine item and mean ability parameter recovery when violating the assumptions that noneffortful responding occurs randomly (Assumption 1) and is unrelated to the underlying ability of examinees (Assumption 2). Results demonstrated that, across conditions, the EM-IRT model provided robust item parameter estimates to violations of Assumption 1. However, bias values greater than 0.20 SDs were observed for the EM-IRT model when violating Assumption 2; nonetheless, these values were still lower than the 2PL model. In terms of mean ability estimates, model results indicated equal performance between the EM-IRT and 2PL models across conditions. Across both models, mean ability estimates were found to be biased by more than 0.25 SDs when violating Assumption 2. However, our accompanying empirical study suggested that this biasing occurred under extreme conditions that may not be present in some operational settings. Overall, these results suggest that the EM-IRT model provides superior item and equal mean ability parameter estimates in the presence of model violations under realistic conditions when compared with the 2PL model.
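The preprocessing step the abstract attributes to the effort-moderated approach (flagging rapid responses as noneffortful and treating them as missing before calibration) can be sketched as below. The response-time threshold rule, 10% of the median item response time, is our illustrative assumption rather than a requirement of the EM-IRT model.

```python
# Sketch of noneffortful-response filtering: responses faster than an
# item-specific time threshold are set to NaN so they are excluded from
# item calibration. The threshold rule is an assumption for illustration.
import numpy as np

def flag_noneffortful(responses, response_times, threshold_frac=0.10):
    """responses, response_times: examinee-by-item arrays of equal shape."""
    thresholds = threshold_frac * np.nanmedian(response_times, axis=0)
    noneffortful = response_times < thresholds       # broadcast across rows
    cleaned = responses.astype(float).copy()
    cleaned[noneffortful] = np.nan                   # treat as missing
    return cleaned, noneffortful
```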

16.
17.
Item response theory (IRT) methods are generally used to create score scales for large-scale tests. Research has shown that IRT scales are stable across groups and over time. Most studies have focused on items that are dichotomously scored. Now Rasch and other IRT models are used to create scales for tests that include polytomously scored items. When tests are equated across forms, researchers check for the stability of common items before including them in equating procedures. Stability is usually examined in relation to polytomous items' central “location” on the scale without taking into account the stability of the different item scores (step difficulties). We examined the stability of score scales over a 3–5-year period, considering both stability of location values and stability of step difficulties for common item equating. We also investigated possible changes in the scale measured by the tests and systematic scale drift that might not be evident in year-to-year equating. Results across grades and content areas suggest that equating results are comparable whether or not the stability of step difficulties is taken into account. Results also suggest that there may be systematic scale drift that is not visible using year-to-year common item equating.

18.
This article describes an ongoing project to develop a formative, inferential reading comprehension assessment of causal story comprehension. It has three features to enhance classroom use: equated scale scores for progress monitoring within and across grades, a scale score to distinguish among low-scoring students based on patterns of mistakes, and a reading efficiency index. Instead of two response types for each multiple-choice item, correct and incorrect, each item has three response types: correct and two incorrect response types. Prior results on reliability, convergent and discriminant validity, and predictive utility of mistake subscores are briefly described. The three-response-type structure of items required rethinking the item response theory (IRT) modeling. IRT-modeling results are presented, and implications for formative assessments and instructional use are discussed.

19.
In test development, item response theory (IRT) is a method to determine the amount of information that each item (i.e., item information function) and combination of items (i.e., test information function) provide in the estimation of an examinee's ability. Studies investigating the effects of item parameter estimation errors over a range of ability have demonstrated an overestimation of information when the most discriminating items are selected (i.e., item selection based on maximum information). In the present study, the authors examined the influence of item parameter estimation errors across 3 item selection methods—maximum no target, maximum target, and theta maximum—using the 2- and 3-parameter logistic IRT models. Tests created with the maximum no target and maximum target item selection procedures consistently overestimated the test information function. Conversely, tests created using the theta maximum item selection procedure yielded more consistent estimates of the test information function and, at times, underestimated the test information function. Implications for test development are discussed.
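For reference, the quantities the abstract turns on (item information, test information as its sum, and maximum-information item selection at a target ability) are sketched below for the 3PL, using the standard information formula and invented item parameters; the study's specific selection-method variants are not reproduced here.

```python
# 3PL item information I(theta) = (D*a)^2 * (Q/P) * ((P - c)/(1 - c))^2;
# test information is the sum over selected items. Parameters are invented.
import numpy as np

D = 1.7

def p3pl(theta, a, b, c):
    return c + (1 - c) / (1 + np.exp(-D * a * (theta - b)))

def item_info_3pl(theta, a, b, c):
    P = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1 - P) / P) * ((P - c) / (1 - c)) ** 2

def select_max_info(theta_target, a, b, c, n_items):
    """'Maximum information' selection: take the n_items most informative
    items at theta_target (the practice shown above to capitalize on
    item parameter estimation error)."""
    return np.argsort(item_info_3pl(theta_target, a, b, c))[::-1][:n_items]

a = np.array([0.8, 1.2, 1.6, 0.9, 1.4])
b = np.array([-1.0, -0.3, 0.0, 0.6, 1.2])
c = np.array([0.20, 0.15, 0.20, 0.25, 0.20])
chosen = select_max_info(0.0, a, b, c, n_items=3)
test_info = item_info_3pl(0.0, a[chosen], b[chosen], c[chosen]).sum()
print(chosen, test_info)
```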

20.
This study investigates whether different test methods (multiple choice versus information transfer) produce a test-method effect on reading comprehension examinations. In addition to analyzing examinees' test scores, the study also analyzed item difficulty values, with item difficulty computed using item response theory (IRT). The results show that the test method does affect item difficulty and examinee performance; with respect to item difficulty, information-transfer items were more difficult than multiple-choice items.
