1.

Michael J. Kolen 《Educational Measurement》1988,7(4):29-37

This instructional module is intended to promote a conceptual understanding of test form equating using traditional methods. The purpose of equating and the context in which equating occurs are described. The process of equating is distinguished from the related process of scaling to achieve comparability. Three equating designs are considered, and three equating methods (mean, linear, and equipercentile) are described and illustrated. Special attention is given to equating with nonequivalent groups, and to sources of equating error.
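As a rough illustration of two of the methods named in this abstract, the sketch below implements mean and linear equating for a random-groups setting. The function names and the score samples are hypothetical and are not taken from the module; the formulas are the standard textbook ones, in which mean equating shifts a score by the difference in form means and linear equating additionally matches the standard deviations.

```python
from statistics import mean, pstdev


def mean_equate(x, scores_x, scores_y):
    """Mean equating: shift a Form X score by the difference in form means."""
    return x - mean(scores_x) + mean(scores_y)


def linear_equate(x, scores_x, scores_y):
    """Linear equating: match both the mean and the standard deviation of Form Y."""
    slope = pstdev(scores_y) / pstdev(scores_x)
    return mean(scores_y) + slope * (x - mean(scores_x))


# Hypothetical raw scores from two randomly equivalent groups (illustration only).
form_x = [10, 12, 14, 16, 18]
form_y = [13, 16, 19, 22, 25]

mean_result = mean_equate(16, form_x, form_y)
linear_result = linear_equate(16, form_x, form_y)
```

Equipercentile equating is omitted here because it needs presmoothed or continuized score distributions, which would make the sketch considerably longer.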

2.

Relationships of Measurement Error and Prediction Error in Observed‐Score Regression

Tim Moses 《Journal of Educational Measurement》2012,49(4):380-398

The focus of this paper is assessing the impact of measurement errors on the prediction error of an observed‐score regression. Measures are presented and described for decomposing the linear regression's prediction error variance into parts attributable to the true score variance and the error variances of the dependent variable and the predictor variable(s). These measures are demonstrated for regression situations reflecting a range of true score correlations and reliabilities and using one and two predictors. Simulation results also are presented which show that the measures of prediction error variance and its parts are generally well estimated for the considered ranges of true score correlations and reliabilities and for homoscedastic and heteroscedastic data. The final discussion considers how the decomposition might be useful for addressing additional questions about regression functions’ prediction error variances.

3.

Local Observed‐Score Kernel Equating

Marie Wiberg Wim J. van der Linden Alina A. von Davier 《Journal of Educational Measurement》2014,51(1):57-74

Three local observed‐score kernel equating methods that integrate methods from the local equating and kernel equating frameworks are proposed. The new methods were compared with their earlier counterparts with respect to such measures as bias (as defined by Lord's criterion of equity) and percent relative error. The local kernel item response theory observed‐score equating method, which can be used for any of the common equating designs, had a small amount of bias, a low percent relative error, and a relatively low kernel standard error of equating, even when the accuracy of the test was reduced. The local kernel equating methods for the nonequivalent groups with anchor test generally had low bias and were quite stable against changes in the accuracy or length of the anchor test. Although all proposed methods showed small percent relative errors, the local kernel equating methods for the nonequivalent groups with anchor test design had somewhat larger standard error of equating than their kernel method counterparts.

4.

The Accuracy and Consistency of a Series of IRT True Score Equatings

Deping Li Yanlin Jiang Alina A. von Davier 《Journal of Educational Measurement》2012,49(2):167-189

This study investigates a sequence of item response theory (IRT) true score equatings based on various scale transformation approaches and evaluates equating accuracy and consistency over time. The results show that the biases and sample variances for the IRT true score equating (both direct and indirect) are quite small (except for the mean/sigma method). The biases and sample variances for the equating functions based on the characteristic curve methods and concurrent calibrations for adjacent forms are smaller than the biases and variances for the equating functions based on the moment methods. In addition, the IRT true score equating is also compared to the chained equipercentile equating, and we observe that the sample variances for the chained equipercentile equating are much smaller than the variances for the IRT true score equating with an exception at the low scores.

5.

Section Preequating Under the Equivalent Groups Design Without IRT

Hongwen Guo Gautam Puhan 《Journal of Educational Measurement》2014,51(3):301-317

In this article, we introduce a section preequating (SPE) method (linear and nonlinear) under the randomly equivalent groups design. In this equating design, sections of Test X (a future new form) and another existing Test Y (an old form already on scale) are administered. The sections of Test X are equated to Test Y, after adjusting for the imperfect correlation between sections of Test X, to obtain the equated score on the complete form of X. Simulations and a real‐data application show that the proposed SPE method is fairly simple and accurate.

6.

Small‐Sample Equating Using a Synthetic Linking Function

Sooyeon Kim Alina A. von Davier Shelby Haberman 《Journal of Educational Measurement》2008,45(4):325-342

This study addressed the sampling error and linking bias that occur with small samples in a nonequivalent groups anchor test design. We proposed a linking method called the synthetic function, which is a weighted average of the identity function and a traditional equating function (in this case, the chained linear equating function). Specifically, we compared the synthetic, identity, and chained linear functions for various‐sized samples from two types of national assessments. One design used a highly reliable test and an external anchor, and the other used a relatively low‐reliability test and an internal anchor. The results from each of these methods were compared to the criterion equating function derived from the total samples with respect to linking bias and error. The study indicated that the synthetic functions might be a better choice than the chained linear equating method when samples are not large and, as a result, unrepresentative.
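The synthetic function described in this abstract is a weighted average of the identity function and an equating function, which can be sketched in a few lines. The names `synthetic_link` and `example_linear`, the weight `w`, and the slope and intercept values are assumptions made for illustration; the abstract does not say how the weight is chosen, and the linear conversion below merely stands in for a chained linear function estimated from anchor-test data.

```python
def synthetic_link(x, equate, w):
    """Weighted average of an equating function and the identity function.

    w = 1 returns the equated score; w = 0 returns the raw score unchanged.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * equate(x) + (1.0 - w) * x


def example_linear(x):
    # Hypothetical linear conversion standing in for a chained linear equating.
    return 1.1 * x + 2.0


halfway = synthetic_link(20, example_linear, 0.5)
```

With a small sample, a weight closer to 0 leans on the identity function and so trades some potential bias for lower sampling error; a weight closer to 1 does the opposite.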

7.

Examining the prediction of reading comprehension on different multiple‐choice tests

Rune Andreassen Ivar Bråten 《Journal of Research in Reading》2010,33(3):263-283

In this study, 180 Norwegian fifth‐grade students with a mean age of 10.5 years were administered measures of word recognition skills, strategic text processing, reading motivation and working memory. Six months later, the same students were given three different multiple‐choice reading comprehension measures. Based on three forced‐order hierarchical multiple regression analyses, results indicated that the unique contribution of measured skills and processes to performance varied across comprehension tests. In particular, when the test consisted of a longer passage, contained a larger proportion of inferential questions and was answered without access to relevant text passages, the relative importance of word recognition skills seemed to be reduced while working memory emerged as a relatively strong, unique positive predictor of comprehension performance. These findings have important practical implications for the assessment of reading comprehension.

8.

The Long‐Term Sustainability of IRT Scaling Methods in Mixed‐Format Tests

Lisa A. Keller Ronald K. Hambleton 《Journal of Educational Measurement》2013,50(4):390-407

Due to recent research in equating methodologies indicating that some methods may be more susceptible to the accumulation of equating error over multiple administrations, the sustainability of several item response theory methods of equating over time was investigated. In particular, the paper is focused on two equating methodologies: fixed common item parameter scaling (with two variations, FCIP‐1 and FCIP‐2) and the Stocking and Lord characteristic curve scaling technique in the presence of nonequivalent groups. Results indicated that the improvements made to fixed common item parameter scaling in the FCIP‐2 method were sustained over time. FCIP‐2 and Stocking and Lord characteristic curve scaling performed similarly in many instances and produced more accurate results than FCIP‐1. The relative performance of FCIP‐2 and Stocking and Lord characteristic curve scaling depended on the nature of the change in the ability distribution: Stocking and Lord characteristic curve scaling captured the change in the distribution more accurately than FCIP‐2 when the change was different across the ability distribution; FCIP‐2 captured the changes more accurately when the change was consistent across the ability distribution.

9.

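A Brief Discussion of Vertical Scale Construction (纵向量表制作浅谈)

Liu Zhiming (刘志明) 《考试研究》2005,(2)

Equating and vertical scaling both serve to establish relationships between scores from different tests. Equating is applied to test forms of the same grade level and the same nature, whereas vertical scaling is applied to test forms of a similar nature from different grade levels. Vertical scaling places results from different grades on a single growth score scale. A vertical scale is an extended score scale whose metric spans and links different grades, and it is used to assess students' continuous growth in achievement (Nitko, 2004). In teaching, students' progress can be monitored and evaluated with a vertical scale. In educational research, a vertical scale can be a powerful tool for longitudinal studies. This article discusses the methodology of vertical scaling, including the definition of growth, data collection methods, test design, and methods that use item response theory (IRT), and offers some practical advice on constructing vertical scales.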

10.

Evaluating Equating Accuracy and Assumptions for Groups That Differ in Performance

Sonya Powers Michael J. Kolen 《Journal of Educational Measurement》2014,51(1):39-56

Accurate equating results are essential when comparing examinee scores across exam forms. Previous research indicates that equating results may not be accurate when group differences are large. This study compared the equating results of frequency estimation, chained equipercentile, item response theory (IRT) true‐score, and IRT observed‐score equating methods. Using mixed‐format test data, equating results were evaluated for group differences ranging from 0 to .75 standard deviations. As group differences increased, equating results became increasingly biased and dissimilar across equating methods. Results suggest that the size of group differences, the likelihood that equating assumptions are violated, and the equating error associated with an equating method should be taken into consideration when choosing an equating method.

11.

Score Comparability of a State Reading Assessment Across Selected Groups of Students With Disabilities

《Structural equation modeling》2013,20(2):257-274

This study investigated the factorial invariance of scores from a 7th-grade state reading assessment across general education students and selected groups of students with disabilities. Confirmatory factor analysis was used to assess the fit of a 2-factor model to each of the 4 groups. In addition to overall fit of this model, 5 levels of constraint, including equal factor loadings, intercepts, error variances, factor variances, and factor covariances, were investigated. Invariance across the factor loadings and intercepts was supported across the groups of students with disabilities and general education students. Invariance for these groups was not supported for the error variances. For the students with mental retardation, the lack of fit of the 2-factor model and the observed score results suggested a mismatch between the difficulty level of this test and the ability level of these students. Although the results generally supported the score comparability of the reading assessment across these groups, further research is needed into the nature of the larger error variances for the student with disabilities groups and into accommodations and modifications for the students with mental retardation.

12.

Measurement, Sampling, and Equating Errors in Large-Scale Assessments

Margaret Wu 《Educational Measurement》2010,29(4):15-27

In large-scale assessments, such as state-wide testing programs, national sample-based assessments, and international comparative studies, there are many steps involved in the measurement and reporting of student achievement. There are always sources of inaccuracies in each of the steps. It is of interest to identify the source and magnitude of the errors in the measurement process that may threaten the validity of the final results. Assessment designers can then improve the assessment quality by focusing on areas that pose the highest threats to the results. This paper discusses the relative magnitudes of three main sources of error with reference to the objectives of assessment programs: measurement error, sampling error, and equating error. A number of examples from large-scale assessments are used to illustrate these errors and their impact on the results. The paper concludes by making a number of recommendations that could lead to an improvement of the accuracies of large-scale assessment results.

13.

Use of Adjustment by Minimum Discriminant Information in Linking Constructed‐Response Test Scores in the Absence of Common Items

Yi‐Hsuan Lee Shelby J. Haberman Neil J. Dorans 《Journal of Educational Measurement》2019,56(2):452-472

In many educational tests, both multiple‐choice (MC) and constructed‐response (CR) sections are used to measure different constructs. In many common cases, security concerns lead to the use of form‐specific CR items that cannot be used for equating test scores, along with MC sections that can be linked to previous test forms via common items. In such cases, adjustment by minimum discriminant information may be used to link CR section scores and composite scores based on both MC and CR sections. This approach is an innovative extension that addresses the long‐standing issue of linking CR test scores across test forms in the absence of common items in educational measurement. It is applied to a series of administrations from an international language assessment with MC sections for receptive skills and CR sections for productive skills. To assess the linking results, harmonic regression is applied to examine the effects of the proposed linking method on score stability, among several analyses for evaluation.

14.

Scaling: An ITEMS Module

Ye Tong Michael J. Kolen 《Educational Measurement》2010,29(4):39-48

Scaling is the process of constructing a score scale that associates numbers or other ordered indicators with the performance of examinees. Scaling typically is conducted to aid users in interpreting test results. This module describes different types of raw scores and scale scores, illustrates how to incorporate various sources of information into a score scale, and introduces vertical scaling and its related designs and methodologies as a special type of scaling. After completion of this module, the reader should be able to understand the relationship between various types of raw scores, understand the relationship between raw scores and scale scores, construct a scale with desired properties, evaluate an existing score scale, understand how content and standards information are built into a scale, and understand how vertical scales are developed and used in practice.

15.

On Attempting to Do What Lord Said Was Impossible: Commentary on van der Linden's “Some Conceptual Issues in Observed‐Score Equating”

Neil J. Dorans 《Journal of Educational Measurement》2013,50(3):304-314

van der Linden (this issue) uses words differently than Holland and Dorans. This difference in language usage is a source of some confusion in van der Linden's critique of what he calls equipercentile equating. I address these differences in language. van der Linden maintains that there are only two requirements for score equating. I maintain that the requirements he discards have practical utility and are testable. The score equity requirement proposed by Lord suggests that observed score equating was either unnecessary or impossible. Strong equity serves as the fulcrum for van der Linden's thesis. His proposed solution to the equity problem takes inequitable measures and aligns conditional error score distributions, resulting in a family of linking functions, one for each level of θ. In reality, θ is never known. Use of an anchor test as a proxy poses many practical problems, including defensibility.

16.

The role of inhibitory functioning in children’s reading skills

Josephine N. Booth James M.E. Boyle 《Educational Psychology in Practice》2009,25(4):339-350

Executive functions, including inhibition, have been implicated in children’s reading ability. This study investigates whether children’s performance on an inhibition task is more indicative of reading ability than a measure of another executive function, that is, planning. Fifty‐three male participants were administered a reading test and tests of inhibition and planning not requiring a verbal response. Regression analyses revealed that only inhibition significantly predicted reading. Previous inconsistencies may reflect the modality of the tasks used to measure inhibition. Therefore non‐verbal measures may have highest utility for educational psychologists.

17.

THE IMPACT OF ITEM DELETION ON EQUATING CONVERSIONS AND REPORTED SCORE DISTRIBUTIONS

NEIL J. DORANS 《Journal of Educational Measurement》1986,23(3):245-264

A formal analysis of the effects of item deletion on equating/scaling functions and reported score distributions is presented. There are two components of the present analysis: analytical and empirical. The analytical decomposition demonstrates how the effects of item characteristics, test properties, individual examinee responses, and rounding rules combine to produce the item deletion effect on the equating/scaling function and candidate scores. In addition to demonstrating how the deleted item's psychometric characteristics can affect the equating function, the analytical component of the report examines the effects of not scoring versus scoring all options correct, the effects of re-equating versus not re-equating, and the interaction between the decision to re-equate or to not re-equate and the scoring option chosen for the flawed item. The empirical portion of the report uses data from the May 1982 administration of the SAT, which contained the circles item, to illustrate the effects of item deletion on reported score distributions and equating functions. The empirical data verify what the analytical decomposition predicts.

18.

Measuring college students' reading comprehension ability using cloze tests

Rihana Shiri Williams Omer Ari Carmen Nicole Santamaria 《Journal of Research in Reading》2011,34(2):215-231

Recent investigations challenge the construct validity of sustained silent reading tests. Performance of two groups of post‐secondary students (e.g. struggling and non‐struggling) on a sustained silent reading test and two types of cloze test (i.e. maze and open‐ended) was compared in order to identify the test format that contributes greater variance in reading comprehension. One hundred participants were recruited from students enrolled in a preparatory course for a high‐stakes statewide reading examination. Our results suggest that all three measures have good concurrent validity. There was no evidence that open‐ended cloze performance was more related to verbal ability than any other reading measure. Maze performance did the best job at discriminating between our struggling and non‐struggling readers. Implications for reading comprehension assessment in post secondary‐aged adults are discussed.

19.

An Approach to Evaluating the Missing Data Assumptions of the Chain and Post-stratification Equating Methods for the NEAT Design

Paul W. Holland Sandip Sinharay Alina A. von Davier Ning Han 《Journal of Educational Measurement》2008,45(1):17-43

Two important types of observed score equating (OSE) methods for the non-equivalent groups with anchor test (NEAT) design are chain equating (CE) and post-stratification equating (PSE). CE and PSE reflect two distinctly different ways of using the information provided by the anchor test for computing OSE functions. Both types of methods include linear and nonlinear equating functions. In practical situations, it is known that the PSE and CE methods will give different results when the two groups of examinees differ on the anchor test. However, given that both types of methods are justified as OSE methods by making different assumptions about the missing data in the NEAT design, it is difficult to conclude which, if either, of the two is more correct in a particular situation. This study compares the predictions of the PSE and CE assumptions for the missing data using a special data set for which the usually missing data are available. Our results indicate that in an equating setting where the linking function is decidedly non-linear and CE and PSE ought to be different, both sets of predictions are quite similar but those for CE are slightly more accurate.

20.

Checking the Statistical Equivalence of Nearly Identical Test Editions

《教育实用测度》 (Applied Measurement in Education) 2013,26(3):245-254

A procedure for checking the score equivalence of nearly identical editions of a test is described. This procedure is used early in the score equating process to help determine whether it is necessary to conduct separate equating analyses (using a variety of equating methods) for the two nearly identical versions of the test. The procedure employs the standard error of equating and utilizes graphical representation of score conversion deviation from the identity function in standard error units. Two illustrations of the procedure involving Scholastic Aptitude Test (SAT) data are presented. Advice about what to do if statistical equivalence does not obtain is given in the discussion section. Alternative strategies for assessing score equivalence are also discussed.
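The core quantity in this abstract, the deviation of a score conversion from the identity function expressed in standard error units, can be sketched as follows. This is only an illustration of that one computation, not the authors' procedure: the function names, the dictionary layout, the sample values, and the threshold of 2 standard errors are all assumptions.

```python
def deviations_in_se_units(conversion, se):
    """Return (e(x) - x) / SE(x) for each raw score x.

    conversion maps raw score x to its equated score e(x);
    se maps raw score x to the standard error of equating at x.
    """
    return {x: (conversion[x] - x) / se[x] for x in conversion}


def flag_scores(conversion, se, threshold=2.0):
    """List raw scores whose conversion departs from identity by more than
    `threshold` standard errors (the threshold is an assumed cutoff)."""
    z = deviations_in_se_units(conversion, se)
    return sorted(x for x, value in z.items() if abs(value) > threshold)


# Hypothetical conversion table and standard errors (illustration only).
conversion = {10: 10.2, 11: 11.9, 12: 12.1}
se = {10: 0.4, 11: 0.3, 12: 0.5}

flagged = flag_scores(conversion, se)
```

Plotting the values returned by `deviations_in_se_units` against raw score would give a graphical check of the kind the abstract mentions.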