1.

A Comparison of Three Scoring Methods for Tests With Selected-Response and Constructed-Response Items

《Educational Assessment》2013,18(4):317-340

A number of methods for scoring tests with selected-response (SR) and constructed-response (CR) items are available. The selection of a method depends on the requirements of the program, the particular psychometric model and assumptions employed in the analysis of item and score data, and how scores are to be used. This article compares 3 methods: unweighted raw scores, Item Response Theory pattern scores, and weighted raw scores. Student score data from large-scale end-of-course high school tests in Biology and English were used in the comparisons. In the weighted raw score method evaluated in this study, the CR items were weighted so that SR and CR items contributed the same number of points toward the total score. The scoring methods were compared for the total group and for subgroups of students in terms of the resultant scaled score distributions, standard errors of measurement, and proficiency-level classifications. For most of the student ability distribution, the three scoring methods yielded similar results. Some differences in results are noted. Issues to be considered when selecting a scoring method are discussed.
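The weighting rule described in this abstract (CR items weighted so that SR and CR items contribute the same number of points) can be illustrated with a minimal sketch. The function name, the one-point-per-SR-item convention, and the single common weight for all CR items are assumptions made for illustration; the abstract does not give the study's exact weighting formula.

```python
def weighted_raw_score(sr_scores, cr_scores, cr_max_points):
    """Weighted total in which the SR and CR sections have equal maximum points.

    sr_scores: 0/1 scores on selected-response items (assumed 1 point each).
    cr_scores: rubric scores earned on constructed-response items.
    cr_max_points: maximum rubric score of each constructed-response item.
    """
    sr_max = len(sr_scores)
    cr_max = sum(cr_max_points)
    # One common weight (an assumption) that scales the CR maximum to the SR maximum.
    weight = sr_max / cr_max
    return sum(sr_scores) + weight * sum(cr_scores)


# Four SR items (3 correct) and two 4-point CR items (scores 2 and 1):
# the CR weight is 4 / 8 = 0.5, so the total is 3 + 0.5 * 3 = 4.5.
total = weighted_raw_score([1, 1, 0, 1], [2, 1], [4, 4])
```

With this weight, a perfect CR section is worth exactly as many points as a perfect SR section, which is the property the abstract describes.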

2.

Kindergarten Predictors of Math Learning Disability

Michèle M. M. Mazzocco Richard E. Thompson 《Learning disabilities research & practice》2005,20(3):142-155

The aim of the present study was to address how to effectively predict mathematics learning disability (MLD). Specifically, we addressed whether cognitive data obtained during kindergarten can effectively predict which children will have MLD in third grade, whether an abbreviated test battery could be as effective as a standard psychoeducational assessment at predicting MLD, and whether the abbreviated battery corresponded to the literature on MLD characteristics. Participants were 226 children who enrolled in a 4‐year prospective longitudinal study during kindergarten. We administered measures of mathematics achievement, formal and informal mathematics ability, visual‐spatial reasoning, and rapid automatized naming and examined which test scores and test items from kindergarten best predicted MLD at grades 2 and 3. Statistical models using standardized scores from the entire test battery correctly classified approximately 80–83 percent of the participants as having, or not having, MLD. Regression models using scores from only individual test items were less predictive than models containing the standard scores, except for models using a specific subset of test items that dealt with reading numerals, number constancy, magnitude judgments of one‐digit numbers, or mental addition of one‐digit numbers. These models were as accurate in predicting MLD as was the model including the entire set of standard scores from the battery of tests examined. Our findings indicate that it is possible to effectively predict which kindergartners are at risk for MLD, and thus the findings have implications for early screening of MLD.

3.

Differentials of a State Reading Assessment: Item Functioning, Distractor Functioning, and Omission Frequency for Disability Categories

Kentaro Kato Ross E. Moen Martha L. Thurlow 《Educational Measurement》2009,28(2):28-40

Large data sets from a state reading assessment for third and fifth graders were analyzed to examine differential item functioning (DIF), differential distractor functioning (DDF), and differential omission frequency (DOF) between students with particular categories of disabilities (speech/language impairments, learning disabilities, and emotional behavior disorders) and students without disabilities. Multinomial logistic regression was employed to compare response characteristic curves (RCCs) of individual test items. Although no evidence for serious test bias was found for the state assessment examined in this study, the results indicated that students in different disability categories showed different patterns of DIF, DDF, and DOF, and that the use of RCCs helps clarify the implications of DIF and DDF.

4.

Generalization of the Lord‐Wingersky Algorithm to Computing the Distribution of Summed Test Scores Based on Real‐Number Item Scores

Seonghoon Kim 《Journal of Educational Measurement》2013,50(4):381-389

With known item response theory (IRT) item parameters, Lord and Wingersky provided a recursive algorithm for computing the conditional frequency distribution of number‐correct test scores, given proficiency. This article presents a generalized algorithm for computing the conditional distribution of summed test scores involving real‐number item scores. The generalized algorithm is distinct from the Lord‐Wingersky algorithm in that it explicitly incorporates the task of figuring out all possible unique real‐number test scores in each recursion. Some applications of the generalized recursive algorithm, such as IRT test score reliability estimation and IRT proficiency estimation based on summed test scores, are illustrated with a short test by varying scoring schemes for its items.
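The recursion sketched below illustrates the general idea in this abstract: items are added one at a time, and the set of attainable summed scores is rebuilt at each step because real-number item scores do not fall on an integer grid. It is a minimal sketch, not the article's published algorithm; the function name, the dictionary representation, and the rounding used to merge numerically equal totals are all assumptions.

```python
from collections import defaultdict


def summed_score_distribution(item_probs, item_scores):
    """Conditional distribution of the summed score at one fixed proficiency.

    item_probs[i][k]: probability of response category k on item i.
    item_scores[i][k]: real-number score assigned to category k of item i.
    Returns a dict mapping each attainable summed score to its probability.
    """
    dist = {0.0: 1.0}  # before any item is added, the sum is 0 with certainty
    for probs, scores in zip(item_probs, item_scores):
        new_dist = defaultdict(float)
        for total, p_total in dist.items():
            for p_cat, score in zip(probs, scores):
                # Rounding (an assumption) merges totals that differ only
                # by floating-point noise into one unique score.
                new_dist[round(total + score, 10)] += p_total * p_cat
        dist = dict(new_dist)
    return dist


# A binary item scored 0/1 and a three-category item scored 0, 0.5, 1.5.
dist = summed_score_distribution(
    [[0.4, 0.6], [0.2, 0.3, 0.5]],
    [[0.0, 1.0], [0.0, 0.5, 1.5]],
)
```

In this example the score 1.5 is reachable two ways (0 + 1.5 and 1 + 0.5), so its probability is 0.4 × 0.5 + 0.6 × 0.3 = 0.38, which shows why unique scores have to be identified at every step.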

5.

Physical Disability, Stigma, and Physical Activity in Children

Carolyn J. Barg Brittany D. Armstrong Samuel P. Hetz 《International Journal of Disability, Development & Education》2010,57(4):371-382

Using the stereotype content model as a guiding framework, this study explored whether the stigma that able‐bodied adults have towards children with a physical disability is reduced when the child is portrayed as being active. In a 2 (physical activity status) x 2 (ability status) study design, 178 university students rated a child described in one of four vignettes on 12 dimensions of perceived warmth and competence. Results revealed a main effect of ability status on warmth (p < 0.001) such that children with a physical disability were rated significantly higher in perceived warmth than able‐bodied children, regardless of activity status (d = 0.86). Also, there was a significant interaction (p = 0.02) of ability and activity status on perceived competence, indicating that ratings of perceived competence were significantly higher for active children with a physical disability than for all other children (d = 0.54–0.64). Results suggest that physical activity should be explored as a way to mitigate the stigmatisation of children with a physical disability.

6.

THE SELF-SCORING FLEXILEVEL TEST

FREDERIC M. LORD 《Journal of Educational Measurement》1971,8(3):147-151

Modifications of administration and item arrangement of a conventional test can force a match between item difficulty levels and the ability level of the examinee. Although different examinees take different sets of items, the scoring method provides comparable scores for all. Furthermore, the test is self-scoring. These advantages are obtained without some of the usual disadvantages of tailored testing.

7.

The Impact of Anonymization for Automated Essay Scoring

Mark D. Shermis Sue Lottridge Elijah Mayfield 《Journal of Educational Measurement》2015,52(4):419-436

This study investigated the impact of anonymizing text on predicted scores made by two kinds of automated scoring engines: one that incorporates elements of natural language processing (NLP) and one that does not. Eight data sets (N = 22,029) were used to form both training and test sets in which the scoring engines had access to both text and human rater scores for training, but only the text for the test set. Machine ratings were applied under three conditions: (a) both the training and test were conducted with the original data, (b) the training was modeled on the anonymized data, but the predictions were made on the original data, and (c) both the training and test were conducted on the anonymized text. The first condition served as the baseline for subsequent comparisons on the mean, standard deviation, and quadratic weighted kappa. With one exception, results on scoring scales in the range of 1–6 were not significantly different. The results on scales that were much wider did show significant differences. The conclusion was that anonymizing text for operational use may have a differential impact on machine score predictions for both NLP and non‐NLP applications.
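Quadratic weighted kappa, one of the comparison statistics named in this abstract, can be computed as in the sketch below. This is the standard textbook definition written in plain Python, not code from the study; the function and parameter names are assumptions, and the sketch assumes integer scores on a scale with at least two points and ratings that are not all in a single category (otherwise the denominator is zero).

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two integer score vectors, penalising large gaps more."""
    n_cat = max_score - min_score + 1
    n = len(rater_a)

    # Observed cross-tabulation of the two sets of scores.
    observed = [[0.0] * n_cat for _ in range(n_cat)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_cat)) for j in range(n_cat)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_cat):
        for j in range(n_cat):
            weight = (i - j) ** 2 / (n_cat - 1) ** 2  # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / n      # count expected by chance
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


# Identical ratings give perfect agreement (kappa = 1.0).
kappa_same = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
# Fully reversed ratings on a 1-3 scale give kappa = -1.0.
kappa_reversed = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
```

Because the weights grow with the squared distance between scores, the same number of disagreements costs more on a wide scale than on a 1–6 scale, which is worth keeping in mind when reading the abstract's contrast between narrow and wide scales.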

8.

Precision of age norms in tests used to assess preschool children

Janet E. Spector 《Psychology in the schools》1999,36(6):459-471

This study investigated normative precision in 14 preschool tests representing four domains: cognitive, language, adaptive behavior, and early academic skills. The purpose was to explore the consequences of using tests with more‐ vs. less‐precise age norms to identify disabilities in preschool children. As expected, on tests with more precise norms, standard scores associated with the same raw score shifted gradually across age groups. On the other hand, tests with less precise norms showed more dramatic standard score shifts across age groups. Examination of the degree of shift found in each test indicated that many preschool tests have norm tables that are potentially problematic for diagnosing disabilities, particularly for children near norm group cut‐off ages. On high stakes tests, an optimal span is one to three months. This standard can be achieved by using interpolation and/or increasing the size of norming samples at the preschool level. © 1999 John Wiley & Sons, Inc.

9.

The Scaling of Mixed-Item-Format Tests With the One-Parameter and Two-Parameter Partial Credit Models

Robert C. Sykes Wendy M. Yen 《Journal of Educational Measurement》2000,37(3):221-244

Item response theory scalings were conducted for six tests with mixed item formats. These tests differed in their proportions of constructed response (c.r.) and multiple choice (m.c.) items and in overall difficulty. The scalings included those based on scores for the c.r. items that maintained the same number of levels as the item rubrics, either produced from single ratings or multiple ratings that were averaged and rounded to the nearest integer, as well as scalings for a single form of c.r. items obtained by summing multiple ratings. A one-parameter (1PPC) or two-parameter (2PPC) partial credit model was used for the c.r. items and the one-parameter logistic (1PL) or three-parameter logistic (3PL) model for the m.c. items. Item fit was substantially worse with the combination 1PL/1PPC model than the 3PL/2PPC model due to the former's restrictive assumptions that there would be no guessing on the m.c. items and equal item discrimination across items and item types. The presence of varying item discriminations resulted in the 1PL/1PPC model producing estimates of item information that could be spuriously inflated for c.r. items that had three or more score levels. Information for some items with summed ratings was usually overestimated by 300% or more for the 1PL/1PPC model. These inflated information values resulted in underestimated standard errors of ability estimates. The constraints posed by the restricted model suggest limitations on the testing contexts in which the 1PL/1PPC model can be accurately applied.

10.

Effects of Assigning Raters to Items

Robert C. Sykes Kyoko Ito Zhen Wang 《Educational Measurement》2008,27(1):47-55

Student responses to a large number of constructed response items in three Math and three Reading tests were scored on two occasions using three ways of assigning raters: single reader scoring, a different reader for each response (item-specific), and three readers each scoring a rater item block (RIB) containing approximately one-third of a student's responses. Multiple group confirmatory factor analyses indicated that the three types of total scores were most frequently tau-equivalent. Factor models fitted on the item responses attributed differences in scores to correlated ratings incurred by the same reader scoring multiple responses. These halo effects contributed to significantly increased single reader mean total scores for three of the tests. The similarity of scores for item-specific and RIB scoring suggests that the effect of rater bias on an examinee's set of responses may be minimized with the use of multiple readers though fewer than the number of items.

11.

Relationships between psychometric dimensions of item quality and student ratings of item relevancy and ambiguity

Samuel B. Green Gerald Halpin 《Research in higher education》1977,7(3):281-286

Students rated the quality of the items on a classroom test that had been taken previously. On the same test, psychometric item indices were calculated. The results showed that the student ratings were related to the item difficulty, but not to the item-test correlation. In addition, the better-achieving students tended to rate the items as less ambiguous. Finally, the ambiguity ratings were more highly related to the item-test correlations for the better achieving students. These findings support opinions held by many instructors of students' judgments of item quality.

12.

Singaporean Parents’ Curriculum Priorities for Their Children with Disabilities

Levan Lim Tan Ai Girl Marilyn M. Quah 《International Journal of Disability, Development & Education》2000,47(1):77-87

This study examined Singaporean parents’ perspectives on how much they valued major curriculum skill areas for their children with disabilities. Parents were also asked to indicate whether they expected priority skill items within the curriculum areas to be performed with assistance or independently. The results showed that the parents of children with moderate and severe disabilities indicated the highest priority for self-help functional life skills, followed by community-based functional life skills, social relationship skills, and functional academics. Parents of children with mild disabilities indicated the highest priority for self-help functional life skills, followed by community-based functional skills, functional academics, and social relationship skills. The results also showed that parents’ relative ratings of the other skill areas besides self-help skills were influenced by the level of disability of their children. The milder the disability, the higher the relative parental ratings and expectations for independent performance of social relationship skills, functional academics, and community-based life skills. Conversely, the more severe the disability, the lower the relative ratings of these skills and expectations for independent performance compared with self-help functional life skills.

13.

Maintaining Equivalent Cut Scores for Small Sample Test Forms

Andrew C. Dwyer 《Journal of Educational Measurement》2016,53(1):3-22

This study examines the effectiveness of three approaches for maintaining equivalent performance standards across test forms with small samples: (1) common‐item equating, (2) resetting the standard, and (3) rescaling the standard. Rescaling the standard (i.e., applying common‐item equating methodology to standard setting ratings to account for systematic differences between standard setting panels) has received almost no attention in the literature. Identity equating was also examined to provide context. Data from a standard setting form of a large national certification test (N examinees = 4,397; N panelists = 13) were split into content‐equivalent subforms with common items, and resampling methodology was used to investigate the error introduced by each approach. Common‐item equating (circle‐arc and nominal weights mean) was evaluated at samples of size 10, 25, 50, and 100. The standard setting approaches (resetting and rescaling the standard) were evaluated by resampling (N = 8) and by simulating panelists (N = 8, 13, and 20). Results were inconclusive regarding the relative effectiveness of resetting and rescaling the standard. Small‐sample equating, however, consistently produced new form cut scores that were less biased and less prone to random error than new form cut scores based on resetting or rescaling the standard.

14.

Psychometric Equivalence of Ratings for Repeat Examinees on a Performance Assessment for Physician Licensure

Mark R. Raymond Kimberly A. Swygert Nilufer Kahraman 《Journal of Educational Measurement》2012,49(4):339-361

Although a few studies report sizable score gains for examinees who repeat performance‐based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single‐take examinees and 4,030 repeat examinees who completed a 6‐hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication‐interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single‐take and multiple‐take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low‐scoring examinees regardless of retest status. In addition, on their first attempt multiple‐take examinees exhibited less score consistency across the skill domains, but on their second attempt their scores became more consistent. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple‐take examinees on their first attempt but increased to .27 for their second attempt, a value comparable to the median correlation of .26 for single‐take examinees. The findings support the validity of inferences based on scores from the second attempt.

15.

Agreement on Childhood Disability between Parents and Teachers in Vietnam

Jin Y. Shin Nguyen Viet Nhan Kathleen Crittenden S. Stavros Valenti Hoang Thi Dieu Hong 《International Journal of Disability, Development & Education》2008,55(3):239-249

The purpose of the present study was to examine agreement on childhood disability among the teachers and parents of children with cognitive delays in Vietnam. The participants were 57 teachers in kindergarten programmes (for children 2 to 6 years of age), and 106 mothers and 93 fathers of the children attending these kindergarten programmes. The data were collected using the ABILITIES Index and a demographic information form. The results indicated that teachers rated the children’s level of functioning more severely, especially in the areas of intellectual disabilities and behaviour problems, than mothers and fathers. Logistic regression that examined the factors that predicted the agreement and disagreement among parents and teachers revealed that teachers and parents were more likely to agree when the child’s disability was genetically related or physical. Screening, diagnosis and treatment issues can become more challenging for children with intellectual disabilities who do not have such physical and genetic conditions, especially when the agreement between parents and professionals on the conditions of the children is low.

16.

Using Retest Data to Evaluate and Improve Effort‐Moderated Scoring

Steven L. Wise Megan R. Kuhfeld 《Journal of Educational Measurement》2021,58(1):130-149

There has been a growing research interest in the identification and management of disengaged test taking, which poses a validity threat that is particularly prevalent with low‐stakes tests. This study investigated effort‐moderated (E‐M) scoring, in which item responses classified as rapid guesses are identified and excluded from scoring. Using achievement test data composed of test takers who were quickly retested and showed differential degrees of disengagement, three basic findings emerged. First, standard E‐M scoring accounted for roughly one‐third of the score distortion due to differential disengagement. Second, a modified E‐M scoring method that used more liberal time thresholds performed better, accounting for two‐thirds or more of the distortion. Finally, the inability of E‐M scoring to account for all of the score distortion suggests the additional presence of nonrapid item responses that reflect less‐than‐full engagement by some test takers.
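The core step of effort-moderated scoring as this abstract describes it, dropping responses classified as rapid guesses before scoring, can be sketched as follows. The function name, the per-item time thresholds, and the use of proportion correct as the score are illustrative assumptions; the abstract does not state which scoring model or threshold rule the study applied.

```python
def effort_moderated_score(correct, response_times, thresholds):
    """Proportion correct over responses not flagged as rapid guesses.

    correct: 0/1 outcome for each item.
    response_times: seconds spent on each item.
    thresholds: per-item time threshold (hypothetical values); a response
        faster than its threshold is treated as a rapid guess and excluded.
    Returns None when every response was flagged, since no score is defined.
    """
    kept = [
        c
        for c, rt, th in zip(correct, response_times, thresholds)
        if rt >= th
    ]
    if not kept:
        return None
    return sum(kept) / len(kept)


# Two of four responses are faster than the 3-second threshold and are dropped,
# so the score is based on the remaining two responses (both correct).
score = effort_moderated_score([1, 0, 1, 0], [12.0, 1.5, 8.0, 2.0], [3.0] * 4)
```

Raising the thresholds flags more responses as rapid guesses, which corresponds to the "more liberal time thresholds" variant mentioned in the abstract.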

17.

Evaluating the Consistency of Angoff-Based Cut Scores Using Subsets of Items Within a Generalizability Theory Framework

Priya Kannan Adrienne Sgammato Richard J. Tannenbaum Irvin R. Katz 《Applied Measurement in Education》2015,28(3):169-186

The Angoff method requires experts to view every item on the test and make a probability judgment. This can be time consuming when there are large numbers of items on the test. In this study, a G-theory framework was used to determine if a subset of items can be used to make generalizable cut-score recommendations. Angoff ratings (i.e., probability judgments) from previously conducted standard setting studies were used first in a re-sampling study, followed by D-studies. For the re-sampling study, proportionally stratified subsets of items were extracted under various sampling and test-length conditions. The mean cut score, variance components, expected standard error (SE) around the mean cut score, and root-mean-squared deviation (RMSD) across 1,000 replications were estimated at each study condition. The SE and the RMSD decreased as the number of items increased, but this reduction tapered off after approximately 45 items. Subsequently, D-studies were performed on the same datasets. The expected SE was computed at various test lengths. Results from both studies are consistent with previous research indicating that between 40 and 50 items are sufficient to make generalizable cut score recommendations.
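As a rough illustration of how a cut score follows from Angoff probability judgments, and how a subset of items might stand in for the full test, consider the sketch below. The function name and the rule of projecting a subset to the full test length by multiplying the mean item rating by the number of items are assumptions made here; the abstract does not describe how the study scaled subset-based cut scores.

```python
def angoff_cut_score(ratings, item_subset=None):
    """Mean recommended cut score across panelists, on the full-test scale.

    ratings[p][i]: panelist p's judged probability that a minimally
        qualified examinee answers item i correctly.
    item_subset: optional list of item indices; when given, each panelist's
        mean rating over the subset is projected to the full test length
        (a proportional-scaling assumption).
    """
    test_length = len(ratings[0])
    items = list(range(test_length)) if item_subset is None else item_subset
    panelist_cuts = [
        test_length * sum(panelist[i] for i in items) / len(items)
        for panelist in ratings
    ]
    return sum(panelist_cuts) / len(panelist_cuts)


ratings = [
    [0.5, 0.7, 0.9, 0.3],  # panelist 1
    [0.6, 0.8, 0.7, 0.5],  # panelist 2
]
full_cut = angoff_cut_score(ratings)            # (2.4 + 2.6) / 2 = 2.5
subset_cut = angoff_cut_score(ratings, [0, 1])  # (2.4 + 2.8) / 2 = 2.6
```

The gap between `full_cut` and `subset_cut` is the kind of deviation that a re-sampling study would summarise over many randomly drawn subsets.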

18.

Weighting Constructed-Response Items in IRT-Based Exams

《Applied Measurement in Education》2013,26(4):257-275

Weighting responses to Constructed-Response (CR) items has been proposed as a way to increase the contribution these items make to the test score when there is insufficient testing time to administer additional CR items. The effect of various types of weighting items of an IRT-based mixed-format writing examination was investigated. Constructed-response items were weighted by increasing their representation according to the test blueprint, by increasing their contribution to the test characteristic curve, by summing the ratings of multiple raters, and by applying optimal weights utilized in IRT pattern scoring. Total score and standard errors of the weighted composite forms of CR and Multiple-Choice (MC) items were compared against each other and against a form containing additional rather than weighted items. Weighting resulted in a slight reduction of test reliability but reduced standard error in portions of the ability scale.

19.

Methodological issues and learning disabilities diagnosis in clinical populations.

R W Kamphaus P J Frick B B Lahey 《Journal of learning disabilities》1991,24(10):613-618

Previous research suggests that the diagnosis of a comorbid learning disability is dependent on the method used for making the LD diagnosis. This study investigated that proposition by studying the effects of using three approaches to the assessment of learning disabilities in a sample of 177 six- to thirteen-year-old boys referred to outpatient mental health clinics for behavior problems. The use of these three procedures to diagnose comorbid learning problems produced significantly different results. All methods identified significant numbers of children in the clinical population as learning disabled; however, each method identified children with differing characteristics. Consistent with predictions from measurement theory, the commonly used simple standard score discrepancy method was more likely to identify children with above-average IQs as learning disabled, whereas a regression approach identified learning disabilities more consistently across the ability range. These results were interpreted as supporting the use of regression approaches to diagnose co-occurring learning disabilities, as that method is less likely to be biased by the child's intelligence test score. The implications of the use of each method in research investigations are also discussed.

20.

A Stepwise Test Characteristic Curve Method to Detect Item Parameter Drift

Rui Guo Yi Zheng Hua‐Hua Chang 《Journal of Educational Measurement》2015,52(3):280-300

An important assumption of item response theory is item parameter invariance. Sometimes, however, item parameters are not invariant across different test administrations due to factors other than sampling error; this phenomenon is termed item parameter drift. Several methods have been developed to detect drifted items. However, most of the existing methods were designed to detect drifts in individual items, which may not be adequate for test characteristic curve–based linking or equating. One example is the item response theory–based true score equating, whose goal is to generate a conversion table to relate number‐correct scores on two forms based on their test characteristic curves. This article introduces a stepwise test characteristic curve method to detect item parameter drift iteratively based on test characteristic curves without needing to set any predetermined critical values. Comparisons are made between the proposed method and two existing methods under the three‐parameter logistic item response model through simulation and real data analysis. Results show that the proposed method produces a small difference in test characteristic curves between administrations, an accurate conversion table, and a good classification of drifted and nondrifted items, while retaining a large number of linking items.
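The quantity this abstract is built around, the test characteristic curve (TCC) under the three-parameter logistic model, can be sketched as below, together with a simple average gap between the TCCs of two administrations. This illustrates the TCC only and is not the article's stepwise detection procedure; the function names, the omission of the 1.7 scaling constant, and the mean-absolute-difference summary are assumptions.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (no 1.7 scaling constant, by assumption).

    a: discrimination, b: difficulty, c: lower asymptote (guessing).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def tcc(theta, items):
    """Expected number-correct score at proficiency theta.

    items: list of (a, b, c) parameter tuples.
    """
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)


def mean_tcc_difference(items_old, items_new, thetas):
    """Mean absolute TCC gap between two administrations over a theta grid."""
    gaps = [abs(tcc(t, items_old) - tcc(t, items_new)) for t in thetas]
    return sum(gaps) / len(gaps)


old_items = [(1.0, 0.0, 0.2), (1.0, 0.0, 0.0)]
new_items = [(1.0, 0.5, 0.2), (1.0, 0.0, 0.0)]  # first item has drifted in difficulty
grid = [-2.0, -1.0, 0.0, 1.0, 2.0]

expected_at_zero = tcc(0.0, old_items)  # 0.6 + 0.5 = 1.1
drift_gap = mean_tcc_difference(old_items, new_items, grid)
```

A drifted item changes the TCC only slightly when other items are stable, which is why item-level drift statistics and TCC-level differences can lead to different decisions about which linking items to keep.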