
1.

Evaluating the Predictive Value of Growth Prediction Models

Daniel L. Murphy Matthew N. Gaertner 《Educational Measurement》2014,33(2):5-13

This study evaluates four growth prediction models that are commonly used to forecast (and give schools credit for) middle school students' future proficiency: projection, student growth percentile, trajectory, and transition table. Analyses focused on vertically scaled summative mathematics assessments, and two performance standards conditions (high rigor and low rigor) were examined. Results suggest that, when “status plus growth” is the accountability metric a state uses to reward or sanction schools, growth prediction models offer value above and beyond status‐only accountability systems in most, but not all, circumstances. Predictive growth models offer little value beyond status‐only systems if the future target proficiency cut score is rigorous. Conversely, certain models (e.g., projection) provide substantial additional value when the future target cut score is relatively low. In general, growth prediction models' predictive value is limited by a lack of power to detect students who are truly on‐track. Limitations and policy implications are discussed, including the utility of growth projection models in assessment and accountability systems organized around ambitious college‐readiness goals.

2.

A Guide to Understanding and Developing Performance‐Level Descriptors

Marianne Perie 《Educational Measurement》2008,27(4):15-29


3.

Norm‐ and Criterion‐Referenced Student Growth

D. Betebenner 《Educational Measurement》2009,28(4):42-51


4.

Holding Schools Accountable for the Growth of Nonproficient Students: Coordinating Measurement and Accountability (total citations: 1; self-citations: 0; citations by others: 1)

Jennifer L. Dunn Jessica Allen 《Educational Measurement》2009,28(4):27-41

A key intent of the NCLB growth pilot is to reward low‐status schools that are closing the gap to proficiency. In this article, we demonstrate that the capability of proposed models to identify those schools depends on how the growth model is incorporated into accountability decisions. Six pilot‐approved growth models were applied to vertically scaled mathematics assessment data from a single state collected over 2 years. Student and school classifications were compared across models. Accountability classifications using status and growth to proficiency as defined by each model were considered from two perspectives. The first involved adding the number of students moving toward proficiency to the count of proficient students, while the second involved a multitier accountability system where each school was first held accountable for status and then held accountable for the growth of its nonproficient students. Our findings emphasize the importance of evaluating status and growth independently when attempting to identify low‐status schools with insufficient growth among nonproficient students.

5.

Reporting the Percentage of Students above a Cut Score: The Effect of Group Size

Lynne Hollingshead Ruth A. Childs 《Educational Measurement》2011,30(1):36-43

Large‐scale assessment results for schools, school boards/districts, and entire provinces or states are commonly reported as the percentage of students achieving a standard, that is, the percentage of students scoring above the cut score that defines the standard on the assessment scale. Recent research has shown that this method of reporting is sensitive to small changes in the cut score, especially when comparing results across years or between groups. This study builds on that work, investigating the effects of reporting group size on the stability of results. In Part 1 of this study, Grade 6 students’ results on Ontario's 2008 and 2009 Junior Assessments of Reading, Writing and Mathematics were compared, by school, for different sizes of schools. In Part 2, samples of students’ results on the 2009 assessment were randomly drawn and compared, for 10 group sizes, to estimate the variability in results due to sampling error. The results showed that the percentage of students above a cut score (PAC) was unstable for small schools and small randomly drawn groups.
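The Part 2 design described above (draw random groups of different sizes and see how much the PAC moves) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the score population is synthetic rather than the Ontario data, the function names `percent_above_cut` and `pac_spread` are hypothetical, and "above" is treated as "at or above" the cut score.

```python
import random
import statistics

def percent_above_cut(scores, cut):
    """Percentage of scores at or above the cut (assumed convention)."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

def pac_spread(population, cut, group_size, draws=500, seed=0):
    """Standard deviation of PAC across randomly drawn groups of one size."""
    rng = random.Random(seed)
    pacs = [
        percent_above_cut(rng.sample(population, group_size), cut)
        for _ in range(draws)
    ]
    return statistics.pstdev(pacs)

# Synthetic scale scores; the mean, spread, and cut are illustrative only.
_rng = random.Random(42)
population = [_rng.gauss(300, 40) for _ in range(20000)]
cut = 300

if __name__ == "__main__":
    for n in (10, 50, 250):
        print(n, round(pac_spread(population, cut, n), 1))
```

Under these assumptions the spread shrinks as group size grows, which is the sampling-error pattern the abstract reports for small schools and small random groups.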

6.

Using Rating Augmentation to Expand the Scale of an Analytic Rubric

Jim Penny Robert L. Johnson Belita Gordon 《Journal of Experimental Education》2013,81(3):269-287

A method of expanding a rating scale 3-fold without the expense of defining additional benchmarks was studied. The authors used an analytic rubric representing 4 domains of writing and composed of 4-point scales to score 120 writing samples from Georgia's 11th-grade Writing Assessment. The raters augmented the scores of papers on which the proficiency levels appeared slightly higher or lower than the benchmark papers at the selected proficiency level by adding a “+” or a “−” to the score. The results of the study indicate that the use of this method of rating augmentation tends to improve most indices of interrater reliability, although the percentage of exact and adjacent agreement decreases because of the increased number of rating possibilities. In addition, there was evidence to suggest that the use of augmentation may produce domain-level scores with sufficient reliability for use with diagnostic feedback to teachers about the performance of students.

7.

Schools for the deaf and the No Child Left Behind Act

Cawthon SW 《American annals of the deaf》2004,149(4):314-323

The No Child Left Behind Act of 2001 (NCLB) emphasizes educational accountability for all students. Twenty-eight states have policies to aggregate student participation and proficiency data for schools for the deaf in NCLB reports. The remaining states account for these students in other ways: referring student data to "sending" schools and aggregating data to the district or state level are most prominent. In reports of student assessment results for academic year 2002-2003, three schools for the deaf made "Adequate Yearly Progress" under NCLB: These schools demonstrated at least a 95% participation rate in assessments, and at least 95% of their students met or surpassed state proficiency benchmarks in reading and mathematics. Proficiency levels for other schools varied by report, but were often comparable to those of students with disabilities. Challenges and strategies for capturing the impact of NCLB accountability policies on deaf students are discussed.

8.

Setting Standards for English Foreign Language Assessment: Methodology, Validation, and a Degree of Arbitrariness

Simon P. Tiffin‐Richards Hans Anand Pant Olaf Köller 《Educational Measurement》2013,32(2):15-25

Cut‐scores were set by expert judges on assessments of reading and listening comprehension of English as a foreign language (EFL), using the bookmark standard‐setting method to differentiate proficiency levels defined by the Common European Framework of Reference (CEFR). Assessments contained stratified item samples drawn from extensive item pools, calibrated using Rasch models on the basis of examinee responses of a German nationwide assessment of secondary school language performance. The results suggest significant effects of item sampling strategies for the bookmark method on cut‐score recommendations, as well as significant cut‐score judgment revision over cut‐score placement rounds. Results are discussed within a framework of establishing validity evidence supporting cut‐score recommendations using the widely employed bookmark method.

9.

Test Development with Performance Standards and Achievement Growth in Mind

Steve Ferrara Dubravka Svetina Sylvia Skucha Anne H. Davidson 《Educational Measurement》2011,30(4):3-15


10.

The Impact of Vertical Scaling Decisions on Growth Interpretations

Derek C. Briggs Jonathan P. Weeks 《Educational Measurement》2009,28(4):3-14

Most growth models implicitly assume that test scores have been vertically scaled. What may not be widely appreciated are the different choices that must be made when creating a vertical score scale. In this paper empirical patterns of growth in student achievement are compared as a function of different approaches to creating a vertical scale. Longitudinal item‐level data from a standardized reading test are analyzed for two cohorts of students between Grades 3 and 6 and Grades 4 and 7 for the entire state of Colorado from 2003 to 2006. Eight different vertical scales were established on the basis of choices made for three key variables: Item Response Theory modeling approach, linking approach, and ability estimation approach. It is shown that interpretations of empirical growth patterns appear to depend upon the extent to which a vertical scale has been effectively “stretched” or “compressed” by the psychometric decisions made to establish it. While all of the vertical scales considered show patterns of decelerating growth across grade levels, there is little evidence of scale shrinkage.

11.

Estimating High School GPA Weighting Parameters With a Graded Response Model

John Hansen Philip Sadler Gerhard Sonnert 《Educational Measurement》2019,38(1):16-24

The high school grade point average (GPA) is often adjusted to account for nominal indicators of course rigor, such as “honors” or “advanced placement.” Adjusted GPAs, also known as weighted GPAs, are frequently used for computing students’ rank in class and in the college admission process. Despite the high stakes attached to GPA, weighting policies vary considerably across states and high schools. Previous methods of estimating weighting parameters have used regression models with college course performance as the dependent variable. We discuss and demonstrate the suitability of the graded response model for estimating GPA weighting parameters and evaluating traditional weighting schemes. In our sample, which was limited to self‐reported performance in high school mathematics courses, we found that commonly used policies award more than twice the bonus points necessary to create parity for standard and advanced courses.

12.

Placism in NCLB—How Rural Children are Left Behind

Lorna Jimerson 《Equity & Excellence in Education》2013,46(3):211-219

No Child Left Behind (NCLB) has been proclaimed by some as a reform that will improve education for students from all backgrounds, in all locations. The main components of NCLB, however, are biased against students in small and rural schools. This bias, called “placism,” discriminates against people based on where they live. This rural incompatibility is evident in NCLB's accountability provisions, sanctions, and highly qualified teacher provisions. Problems in these areas are the result of ignoring, or distorting, the realities of rural schooling. The accountability provisions are constructed so that small schools will frequently be incorrectly labeled as failing. The sanctions, inappropriate for rural areas, fail to provide solutions to existing rural challenges. The “highly qualified” teacher provisions make it more difficult, not easier, for rural districts to attract and retain competent teachers. Unless these injustices are corrected, NCLB will serve to decrease educational quality for rural students.

13.

Formative Information Using Student Growth Percentiles for the Quantification of English Language Learners’ Progress in Language Acquisition

Husein Taherbhai Kimberly O’Malley 《Applied Measurement in Education》2013,26(3):196-213

English language learners (ELLs) are the fastest growing subgroup in American schools. These students, by a provision in the reauthorization of the Elementary and Secondary Education Act, are to be supported in their quest for language proficiency through the creation of systems that more effectively measure ELLs’ progress across years. In the past, ELLs’ progress has been based on students’ prior scores measuring the same construct. To disentangle effectiveness from achievement, the reporting has generally targeted mean-group activity. In contrast, student growth percentiles (SGPs) provide a comparison of students’ growth with others who have the same achievement score history. By examining the construct measured by an English language proficiency test as manifested in student scores in Speaking, Listening, Reading and Writing, this article outlines the use of SGPs in providing information on how much each student needs to grow, which will allow educators to more effectively apply differential formative instructional strategies.

14.

Psychometric Equivalence of Ratings for Repeat Examinees on a Performance Assessment for Physician Licensure

Mark R. Raymond Kimberly A. Swygert Nilufer Kahraman 《Journal of Educational Measurement》2012,49(4):339-361

Although a few studies report sizable score gains for examinees who repeat performance‐based assessments, research has not yet addressed the reliability and validity of inferences based on ratings of repeat examinees on such tests. This study analyzed scores for 8,457 single‐take examinees and 4,030 repeat examinees who completed a 6‐hour clinical skills assessment required for physician licensure. Each examinee was rated in four skill domains: data gathering, communication‐interpersonal skills, spoken English proficiency, and documentation proficiency. Conditional standard errors of measurement computed for single‐take and multiple‐take examinees indicated that ratings were of comparable precision for the two groups within each of the four skill domains; however, conditional errors were larger for low‐scoring examinees regardless of retest status. In addition, multiple‐take examinees exhibited less score consistency across the skill domains on their first attempt, but their scores became more consistent on their second attempt. Further, the median correlation between scores on the four clinical skill domains and three external measures was .15 for multiple‐take examinees on their first attempt but increased to .27 for their second attempt, a value comparable to the median correlation of .26 for single‐take examinees. The findings support the validity of inferences based on scores from the second attempt.

15.

Population Invariance and the Equatability of Tests: Basic Theory and The Linear Case (total citations: 1; self-citations: 0; citations by others: 1)

Neil J. Dorans Paul W. Holland 《Journal of Educational Measurement》2000,37(4):281-306

How does the fact that two tests should not be equated manifest itself? This paper addresses this question through the study of the degree to which equating functions fail to exhibit population invariance across subpopulations. Equating functions are supposed to be population invariant by definition. But, when two tests are not equatable, it is possible that the linking functions, used to connect the scores of one to the scores of the other, are not invariant across different populations of examinees. While no acceptable equating function is ever completely population invariant, in the situations where equating is usually performed we believe that the dependence of the equating function on the population used to compute it is usually small enough to be ignored. We introduce two root‐mean‐square difference measures of the degree to which the functions used to link two tests computed on different subpopulations differ from the linking function computed for the whole population. We also introduce the system of “parallel‐linear” linking functions for multiple subpopulations and show that, for this system, our measure of population invariance can be computed easily from the standardized mean differences between the scores of the subpopulations on the two tests. For the parallel‐linear case, we develop a correlation‐based upper bound on our measure that holds for all systems of subpopulations. We illustrate these ideas using data from the SAT I and from a concordance study of several combinations of ACT and SAT I scores. In the appendices, we give some theoretical results bearing on the other equating “requirements” of “same construct,” “same reliability” and one aspect of Lord's concept of equity.

16.

Increasing Content Knowledge and Self‐Efficacy of High School Educators through an Online Course in Food Science

Andrea M. Liceaga Tameshia S. Ballard Levon T. Esters 《Journal of Food Science Education》2014,13(2):28-32

Purdue Univ.'s College of Agriculture developed an Advanced Life Sciences (ALS) program in partnership with several high schools across Indiana. As part of ALS, secondary educators take an introductory food science (FS) course (ALS‐Foods) and teach it at their high school. High school students taking ALS‐Foods receive dual credit for an introductory course required for all FS majors at Purdue. The goal of this project was to develop an online course to improve content knowledge and self‐efficacy of secondary educators in the field of FS. The course was offered over a 3‐wk period and consisted of 3 learning modules focused on food chemistry, food microbiology, and food processing. Modules included class activities, videos, study questions, and teaching tools. Participants were assessed on content knowledge through written assignments, quizzes, and a final examination. Twenty secondary educators from several states were enrolled. Overall, content knowledge increased significantly (P < 0.05) across all 3 modules after completing the course. Highest scores were in food microbiology/safety (84%), followed by food processing (76%) and food chemistry (70%). A precourse survey indicated that the majority (>80%) of participants felt they had “no‐confidence” to “little‐confidence” in teaching FS concepts related to the 3 modules. Upon completing the course, the confidence level of all participants increased to “some‐confidence” or “complete confidence.” Strengthening the knowledge level of secondary educators should leave them better prepared to teach FS; subsequently, more high school students could be exposed to FS and consider it as a career.

17.

How Should Colleges Treat Multiple Admissions Test Scores?


Krista Mattern Justine Radunzel Maria Bertling Andrew D. Ho 《Educational Measurement》2018,37(3):11-23

The percentage of students retaking college admissions tests is rising. Researchers and college admissions offices currently use a variety of methods for summarizing these multiple scores. Testing organizations such as ACT and the College Board, interested in validity evidence like correlations with first‐year grade point average (FYGPA), often use the most recent test score available. In contrast, institutions report using a variety of composite scoring methods for applicants with multiple test records, including averaging and taking the maximum subtest score across test occasions (“superscoring”). We compare four scoring methods on two criteria. First, we compare correlations between scores and FYGPA by scoring method. We find them similar (). Second, we compare the extent to which test scores differentially predict FYGPA by scoring method and number of retakes. We find that retakes account for additional variance beyond standardized achievement and positively predict FYGPA across all scoring methods. Superscoring minimizes this differential prediction. Although it may seem that superscoring should inflate scores across retakes, this inflation is “true” in that it accounts for the positive effects of retaking for predicting FYGPA. Future research should identify factors related to retesting and consider how they should be used in college admissions.
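To make the scoring rules named in this abstract concrete, the sketch below computes three of them (most recent, average, and superscore) for one applicant. It is illustrative only: the test record is invented, the subtest names are placeholders, and forming the composite as the mean of subtest scores is an assumption rather than something the abstract states. The abstract's fourth method is not named, so it is not shown.

```python
from statistics import mean

# Hypothetical record: one dict of subtest scores per test occasion,
# in chronological order. Names and values are illustrative only.
record = [
    {"math": 22, "reading": 27},
    {"math": 25, "reading": 24},
]

def composite(subscores):
    """Composite as the mean of subtest scores (an assumed rule)."""
    return mean(subscores.values())

def most_recent(record):
    """Composite from the latest test occasion only."""
    return composite(record[-1])

def average(record):
    """Mean of the composites across all test occasions."""
    return mean(composite(occasion) for occasion in record)

def superscore(record):
    """Composite built from the maximum of each subtest across occasions."""
    best = {name: max(occ[name] for occ in record) for name in record[0]}
    return composite(best)

if __name__ == "__main__":
    print(most_recent(record), average(record), superscore(record))
```

Because the superscore takes each subtest's maximum, it can never fall below any single occasion's composite, which is why it might seem to inflate scores for applicants who retake.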

18.

Developing local oral reading fluency cut scores for predicting high‐stakes test performance


Sally L. Grapin John H. Kranzler Nancy Waldron Diana Joyce‐Beaulieu James Algina 《Psychology in the schools》2017,54(9):932-946

This study evaluated the classification accuracy of a second grade oral reading fluency curriculum‐based measure (R‐CBM) in predicting third grade state test performance. It also compared the long‐term classification accuracy of local and publisher‐recommended R‐CBM cut scores. Participants were 266 students who were divided into a calibration sample (n = 170) and two cross‐validation samples (n = 46; n = 50), respectively. Using calibration sample data, local fall, winter, and spring R‐CBM cut scores for predicting students’ state test performance were developed using three methods: discriminant analysis (DA), logistic regression (LR), and receiver operating characteristic curve analysis (ROC). The classification accuracy of local and publisher‐recommended cut scores was evaluated across subsamples. Only DA and ROC produced cut scores that maintained adequate sensitivity (≥.70) across cohorts; however, LR and publisher‐recommended scores had higher levels of specificity and overall correct classification. Implications for developing local cut scores are discussed.
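The three indices this abstract reports (sensitivity, specificity, and overall correct classification) can be computed for any candidate cut score as sketched below. The data are invented, the function name is hypothetical, and the sketch assumes that students scoring below the cut are flagged as at risk and that the "positive" case is not passing the state test. It also assumes the sample contains both passing and non-passing students.

```python
def classification_accuracy(fluency, passed, cut):
    """Return (sensitivity, specificity, overall accuracy) for one cut score.

    A student is flagged as at risk when the fluency score is below `cut`;
    the condition being predicted is NOT passing the state test.
    """
    tp = fp = tn = fn = 0
    for score, did_pass in zip(fluency, passed):
        flagged = score < cut
        if flagged and not did_pass:
            tp += 1  # flagged, and did not pass
        elif flagged and did_pass:
            fp += 1  # flagged, but passed
        elif not flagged and did_pass:
            tn += 1  # not flagged, and passed
        else:
            fn += 1  # not flagged, but did not pass
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    overall = (tp + tn) / len(fluency)
    return sensitivity, specificity, overall

# Invented words-correct-per-minute scores and state test outcomes.
fluency = [40, 55, 62, 70, 85, 90, 95, 110]
passed = [False, False, True, False, True, True, True, True]

if __name__ == "__main__":
    for cut in (65, 75):
        print(cut, classification_accuracy(fluency, passed, cut))
```

Comparing the two cuts on this toy data shows how moving the cut changes which at-risk students are caught, the trade-off behind the abstract's contrast between sensitivity and specificity.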

19.

Examining the association between empathising, systemising, degree subject and gender

Christopher Manson 《Educational studies》2012,38(1):73-88

Systemising is the drive to analyse or construct systems, and can be assessed by a systemising quotient (SQ). Empathising is the drive to identify mental states and respond with an appropriate emotion, and can be assessed by an empathising quotient (EQ). Previous evidence suggests that: (1) males are more drawn to systemise than females, and females are more drawn to empathise than males; and (2) males are more likely to work in science and engineering, or to study science subjects at university. This study found: (1) males score more highly on the SQ, and females score more highly on the EQ; (2) controlling for age and gender, there is an association between degree subject and SQ and EQ scores, with “scientists” scoring higher on the SQ and “artists” scoring more highly on the EQ; and (3) individuals’ scores on EQ and SQ were better predictors of degree subject than gender.

20.

The Impact of Examinee Performance Information on Judges’ Cut Scores in Modified Angoff Standard‐Setting Exercises

Melissa J. Margolis Brian E. Clauser 《Educational Measurement》2014,33(1):15-22

This research evaluated the impact of a common modification to Angoff standard‐setting exercises: the provision of examinee performance data. Data from 18 independent standard‐setting panels across three different medical licensing examinations were examined to investigate whether and how the provision of performance information impacted judgments and the resulting cut scores. Results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. After the review of performance data, panelist variability generally decreased. In addition, for all panels and examinations pre‐ and post‐data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail rate changes were associated with the cut score changes for a majority of standard‐setting exercises. This study is the first to provide a large‐scale, systematic evaluation of the impact of a common standard setting practice, and the results can provide practitioners with insight into how the practice influences panelist variability and resulting cut scores.